CN112270377B - Target image extraction method, neural network training method and device - Google Patents


Info

Publication number
CN112270377B
Authority
CN
China
Prior art keywords
objects
image
target
appearance
neural network
Prior art date
Legal status
Active
Application number
CN202011251871.2A
Other languages
Chinese (zh)
Other versions
CN112270377A
Inventor
董青 (Dong Qing)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011251871.2A priority Critical patent/CN112270377B/en
Publication of CN112270377A publication Critical patent/CN112270377A/en
Application granted granted Critical
Publication of CN112270377B publication Critical patent/CN112270377B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a target image extraction method, a neural network training method, an apparatus, a computing device and a medium, and relates to the technical field of artificial intelligence, in particular to computer vision. The target image extraction method comprises: acquiring an input image; identifying whether the input image contains an object belonging to a target category; outputting a target image in response to the input image containing an object belonging to the target category and the object having a first appearance form; and not outputting the target image in response to the input image containing an object belonging to the target category and the object having a second appearance form different from the first appearance form, wherein the target image is a local area of the input image containing the object.

Description

Target image extraction method, neural network training method and device
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to computer vision, and more specifically to a target image extraction method, a neural network training method, an apparatus, a computing device and a medium.
Background
Artificial intelligence is the discipline of making a computer mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning, etc.), and encompasses both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Target recognition techniques, also referred to as object recognition techniques, which extract a target image from an image, have been widely used. However, when the number of original images is large, a specific target appears in multiple forms, or the image quality is poor, the extracted target images are often unsatisfactory.
Disclosure of Invention
According to one aspect of the present disclosure, a target image extraction method is disclosed, comprising: acquiring an input image; identifying whether the input image contains an object belonging to a target class; outputting a target image in response to the input image including an object belonging to the target category and the object having a first appearance; and not outputting the target image in response to the input image including an object belonging to the target category and the object having a second appearance form different from the first appearance form, wherein the target image is a local area of the input image including the object.
According to another aspect of the present disclosure, a training method of a neural network is disclosed, in which the neural network is trained using images in which an object belonging to a target class exists as a positive sample set, and images in which no object belonging to the target class exists as a negative sample set; wherein the positive sample set comprises a first positive sample subset and a second positive sample subset, the images in the first positive sample subset containing objects having a first appearance form and the images in the second positive sample subset containing objects having a second appearance form, and wherein training the neural network further comprises calculating appearance attribute values of the samples, and adjusting parameters of the neural network so that the appearance attribute values of the samples in the first positive sample subset fall into a first appearance attribute value interval and the appearance attribute values of the samples in the second positive sample subset fall into a second appearance attribute value interval, wherein the second appearance attribute value interval does not overlap the first appearance attribute value interval.
According to another aspect of the present disclosure, an apparatus is disclosed, comprising: an image input unit for acquiring an input image; an object recognition unit, configured to recognize whether an object belonging to a target class is included in the input image; and an image output unit configured to: outputting a target image in response to the input image including an object belonging to the target category and the object having a first appearance; and not outputting the target image in response to the input image including an object belonging to the target category and the object having a second appearance form different from the first appearance form, wherein the target image is a local area of the input image including the object.
According to yet another aspect of the present disclosure, an apparatus is disclosed that includes a neural network trained according to the above-described training method.
According to another aspect of the disclosure, a computing device is disclosed that may include: a processor; and a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the target image extraction method or training method described above.
According to yet another aspect of the present disclosure, a computer-readable storage medium storing a program is disclosed, the program may include instructions that when executed by a processor of a server, cause the server to perform the above-described target image extraction method or training method.
According to yet another aspect of the present disclosure, a computer program product is disclosed, comprising a computer program which, when executed by a processor, implements the above-described target image extraction method or training method.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 is a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a target image extraction method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method for training a neural network according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for training a neural network according to another embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for training a neural network according to another embodiment of the present disclosure;
FIGS. 6 (a) -6 (b) are example sample images that may be used with the method of FIG. 5;
FIG. 7 is a network architecture diagram of an example neural network that may be trained using the method of FIG. 5;
FIG. 8 shows a block diagram of an apparatus according to an embodiment of the disclosure;
fig. 9 illustrates a block diagram of an exemplary server and client that can be used to implement embodiments of the present disclosure.
Detailed Description
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable methods of extracting target images, training neural networks, or generating sign images.
In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to extract target images, train a neural network, or generate a sign image. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computing systems, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computing devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing system in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. The databases 130 may reside in a variety of locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
A flowchart of a target image extraction method according to an embodiment of the present disclosure is described below with reference to fig. 2.
At step S201, an input image is acquired.
At step S202, it is recognized whether an object belonging to the target class is contained in the input image.
At step S203, a target image is output in response to an object belonging to the target category being contained in the input image and the object having a first appearance form. The target image is a local area in the input image containing the object.
At step S204, in response to the input image including the object belonging to the target category and having the second appearance form different from the first appearance form, the target image is not output.
According to the method of FIG. 2, objects of the same category can be distinguished by appearance form at the same time as target extraction is performed, thereby obtaining a more desirable target image.
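As a concrete illustration of this flow, the sketch below filters detected objects by appearance form. It is a minimal sketch only; `detect_objects` and `appearance_form` are hypothetical stand-ins for the trained detection and classification components described later in this disclosure, not APIs defined here.

```python
# Minimal sketch of the FIG. 2 flow. `detect_objects` and `appearance_form` are
# hypothetical placeholders for the trained networks described later on.
def extract_target_images(input_image, target_class, detect_objects, appearance_form):
    """Return crops (target images) only for objects of the first appearance form."""
    target_images = []
    for (x1, y1, x2, y2), label in detect_objects(input_image):   # step S202
        if label != target_class:
            continue
        crop = input_image[y1:y2, x1:x2]                          # local area containing the object
        if appearance_form(crop) == "first":                      # step S203: output the target image
            target_images.append(crop)
        # step S204: objects with the second appearance form are skipped, not output
    return target_images
```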
For example, objects of a first appearance form and objects of a second appearance form that belong to the same target class may differ in at least one of: sharpness, integrity, texture features, text boundary continuity.
According to some embodiments, a sharpness index may be used to distinguish the first appearance form from the second appearance form; specifically, an object having the first appearance form is an object that satisfies the sharpness index and an object having the second appearance form is an object that does not satisfy the sharpness index. In this way, only target images whose sharpness meets the requirement are extracted, which can improve the quality or accuracy of the generated target images without pre-screening the input images.
Sharpness may be calculated in a variety of ways. The sharpness of only the extracted target object portion may be calculated, or the sharpness of the whole image may be calculated as the sharpness index. For example, a method based on gradient and phase consistency of the image may be employed. Edge information is often extracted by means of gradient functions: a well focused image has sharper edges and should therefore have larger gradient function values. Thus, the sharpness of an image can be calculated by examining its neighborhood contrast, i.e., the gradient difference of gray-level features between adjacent pixels. The sharpness of the entire image may be used, or the sharpness of the object region may be taken as the sharpness score of the target object. The sharpness index may be determined according to the desired application. As another example, a method based on the Laplacian and variance may be used: the sharpness of the image may be calculated by applying the Laplacian operator to the image and then taking the variance of the response. For example, in this case the sharpness index may be set as follows: if the computed value is less than 100, the object is considered blurred, i.e., to have the second appearance form. Alternatively, the neural network can be trained directly with positive and negative samples, so that the neural network learns which features characterize clear images and blurred images respectively. Thus, the sharpness index may be set with positive and negative samples that are distinguished manually or in other ways, for example, drawn from different data sources.
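As one possible realization of the Laplacian-and-variance approach above, the following sketch uses OpenCV; the threshold of 100 follows the example given in the text, while the BGR input convention and function names are assumptions for illustration.

```python
import cv2

def laplacian_sharpness(image_bgr) -> float:
    """Variance of the Laplacian response; larger values indicate a sharper image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def satisfies_sharpness_index(image_bgr, threshold: float = 100.0) -> bool:
    # Below the threshold the object is treated as blurred, i.e. as having the
    # second appearance form (threshold value taken from the example above).
    return laplacian_sharpness(image_bgr) >= threshold
```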
According to some embodiments, the integrity of the object may be used to distinguish the first appearance form from the second appearance form; specifically, an object having the first appearance form is a complete object and an object having the second appearance form is a partially occluded object. In this way, only target images whose integrity meets the requirement are extracted, which can improve the quality or accuracy of the generated target images without pre-screening the input images.
The integrity or occlusion of an object may be calculated in a number of ways. For example, keypoints may be selected according to the particular object category, the number of keypoints of the identified object that are visible in the image is detected, and the integrity of the object is calculated from the number of visible keypoints, optionally in combination with geometric analysis. As another example, where a library of extracted object images already exists, the object may be compared to images in that library, including shape comparison and OCR-based comparison. Where no object image library is available but the object contains text, the integrity of the object can be judged by analyzing the OCR result with natural language processing (NLP). Alternatively, the neural network can be trained using complete images and occluded images as positive and negative samples respectively, so that the neural network learns which features characterize complete objects and occluded objects.
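A minimal sketch of the keypoint-counting idea follows. It assumes a hypothetical keypoint detector that returns (x, y, visibility) triples for a fixed set of category-specific keypoints; the visibility threshold and the 0.9 completeness ratio are assumed values, not taken from this disclosure.

```python
import numpy as np

def integrity_score(keypoints: np.ndarray, vis_threshold: float = 0.5) -> float:
    """keypoints: (N, 3) array of (x, y, visibility); returns the visible-keypoint ratio."""
    visible = keypoints[:, 2] > vis_threshold
    return float(visible.sum()) / len(keypoints)

def is_complete_object(keypoints: np.ndarray, min_ratio: float = 0.9) -> bool:
    # Treat the object as complete (first appearance form) only when nearly all
    # predefined keypoints are visible; otherwise it counts as partially occluded.
    return integrity_score(keypoints) >= min_ratio
```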
According to some embodiments, the texture features of the object may be used to distinguish the first appearance form from the second appearance form; specifically, an object having the first appearance form is an object having a first texture feature and an object having the second appearance form is an object having a second texture feature different from the first texture feature. In this way, only target images whose texture features meet the requirement are extracted, which can improve the accuracy of the generated target images without pre-screening the input images.
The texture features of an object may be calculated in a number of ways. Texture features here may also include geometric features of the object. For example, a gradient-based edge detection method may be used; for instance, with the edge detection methods implemented in the OpenCV software library, edges can be extracted from gradient changes at textured regions. As another example, statistics-based tools such as the gray-level co-occurrence matrix and the autocorrelation function may be used to calculate the texture features of an object. For example, objects with the two texture features may be distinguished based on the gray-level properties of a pixel and its neighborhood, by studying statistical properties over the texture region, or first-, second- or higher-order statistics of the gray levels within a pixel and its neighborhood. Alternatively, the neural network may be trained using object images with two visually different texture features, selected manually or in other ways, as positive and negative samples respectively, so that the network learns parameters corresponding to the different texture features.
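As a concrete example of the gray-level co-occurrence matrix approach mentioned above, the sketch below computes a small statistical texture descriptor with scikit-image; the quantization to 32 levels and the choice of distances, angles, and properties are illustrative assumptions. Objects with the two texture features could then be separated by a simple classifier over these descriptors.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # spelled greycomatrix/greycoprops in older scikit-image

def glcm_texture_features(gray_crop: np.ndarray) -> np.ndarray:
    """Statistical texture descriptor from a gray-level co-occurrence matrix."""
    quantized = (gray_crop // 8).astype(np.uint8)        # quantize 0-255 gray levels down to 32
    glcm = graycomatrix(quantized,
                        distances=[1, 2],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=32, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])
```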
According to some embodiments, the target class characterizes objects containing text, and the text boundary continuity of the object may be used to distinguish the first appearance form from the second appearance form. Specifically, an object having the first appearance form is an object whose text boundaries are continuous, and an object having the second appearance form is an object whose text boundaries are discontinuous. The target quality can thus be evaluated and classified through the text content, and by extracting only target images with continuous text boundaries, the accuracy and quality of the generated target images can be improved without pre-screening the input images.
If the object to be identified contains text, visual characteristics of the object such as integrity, occlusion, blurring, and legibility to the human eye can be analyzed by judging the continuity of the text boundaries. For example, if there is no problem with the text boundaries, the likelihood that the object image is completely visible is high. The text boundary continuity of an object may be calculated in a number of ways. For example, text edges may be extracted by a gradient method. Alternatively, deep learning models such as DeepLab or Mask R-CNN may be used to determine text boundary continuity based on text instance segmentation results.
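One gradient-based heuristic along these lines is sketched below: given a binary text mask (for example from an instance segmentation model), it scores how much of the mask's boundary is supported by strong image gradients. The 80th-percentile edge threshold is an assumption; a low score would suggest broken or occluded text boundaries.

```python
import cv2
import numpy as np

def text_boundary_continuity(crop_bgr: np.ndarray, text_mask: np.ndarray) -> float:
    """Fraction (0..1) of the text-mask boundary that coincides with strong image gradients."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = cv2.magnitude(gx, gy)
    strong_edges = magnitude > np.percentile(magnitude, 80)   # assumed edge threshold
    contours, _ = cv2.findContours(text_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return 0.0
    boundary = np.zeros_like(text_mask, dtype=np.uint8)
    cv2.drawContours(boundary, contours, -1, 1, thickness=1)  # 1-pixel-wide text boundary
    on_boundary = boundary.astype(bool)
    return float(np.logical_and(on_boundary, strong_edges).sum()) / max(int(on_boundary.sum()), 1)
```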
It is readily understood that the first appearance form and the second appearance form are not limited to the above features. Each of the above features may be used alone, and any combination thereof is also applicable. An object may be identified as having the first appearance form or the second appearance form based on one of these features, or even on a combination of more than one of them. For example, the first appearance form may characterize an object as clear with intact boundaries, while the second appearance form may characterize an object as containing a specific texture feature with discontinuous text boundaries. The first appearance form may characterize an object as clear, complete, having a first texture feature and continuous text boundaries, while the second appearance form may characterize an object as blurred, occluded, containing a different second texture feature and having discontinuous text boundaries.
Furthermore, for different application scenarios, the first appearance form and the second appearance form may be distinguished by object appearance attributes other than the above features. For example, the first appearance form and the second appearance form may characterize different scales, shapes, orientations, shooting angles, life stages, colors, etc. of the object, and the present invention is not limited thereto.
According to some embodiments, the above method may be performed by a neural network. Identifying whether the input image contains an object belonging to the target class may include: outputting, through a first sub-network of the neural network, one or more candidate regions in the input image in which an object belonging to the target category is located. Outputting the target image in response to the input image containing an object belonging to the target class and the object having the first appearance form may include, for each of the one or more candidate regions, through a second sub-network of the neural network: calculating the appearance attribute value of the object in the candidate region, and outputting the candidate region as the target image when the calculated appearance attribute value falls within the first appearance attribute value interval. Calculating the appearance attribute value over the candidate region, a local area of the input image, is simpler than computing over the whole input image and increases computational efficiency and accuracy. According to some embodiments, the neural network described above may be trained end-to-end. For example, parameters for the image recognition task and the image classification task may be trained simultaneously using only one loss function, which simplifies the network design and system architecture. Furthermore, an end-to-end network architecture may also save computation compared to a cascaded network architecture.
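A structural sketch of such a two-sub-network arrangement in PyTorch is shown below. The layer sizes, the use of simple feature-map slicing instead of RoI pooling, and the sigmoid attribute score are illustrative assumptions, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class TargetExtractorNet(nn.Module):
    """First sub-network proposes candidate regions; second sub-network scores their appearance attribute."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                       # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.objectness_head = nn.Conv2d(64, 1, 1)           # first sub-network: where targets may be
        self.appearance_head = nn.Sequential(                # second sub-network: per-region attribute value
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, image, candidate_boxes):
        feats = self.backbone(image)
        objectness = torch.sigmoid(self.objectness_head(feats))
        attr_values = [self.appearance_head(feats[:, :, y1:y2, x1:x2])
                       for x1, y1, x2, y2 in candidate_boxes]  # integer boxes in feature-map coordinates
        return objectness, attr_values

# A candidate region would then be output as a target image only if its appearance
# attribute value falls inside the first appearance-attribute-value interval, e.g. value >= 0.5.
```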
A flowchart of a method for training a neural network according to an embodiment of the present disclosure is described below with reference to fig. 3.
At step S301, images in which an object belonging to a target class exists are input to the neural network as a positive sample set, and images in which no object belonging to the target class exists are input as a negative sample set. The positive sample set comprises a first positive sample subset and a second positive sample subset; the images in the first positive sample subset contain objects having a first appearance form and the images in the second positive sample subset contain objects having a second appearance form.
At step S302, appearance attribute values of the samples are calculated.
At step S303, parameters of the neural network are adjusted based on the appearance attribute values of the samples such that the appearance attribute values of the samples in the first positive subset of samples fall into a first appearance attribute value interval and the appearance attribute values of the samples in the second positive subset of samples fall into a second appearance attribute value interval, the second appearance attribute value interval not overlapping the first appearance attribute value interval.
The method of FIG. 3 yields a neural network that can distinguish the appearance forms of objects of the same category while performing target extraction, thereby obtaining more desirable target images.
Fig. 4 is a flow chart of a method for training a neural network according to another embodiment of the present disclosure.
At step S401, images in which an object belonging to the target class exists are input to the neural network as a positive sample set, and images in which no object belonging to the target class exists are input as a negative sample set. The positive sample set comprises a first positive sample subset and a second positive sample subset; the images in the first positive sample subset contain objects having a first appearance form and the images in the second positive sample subset contain objects having a second appearance form.
At step S402, appearance attribute values of the sample are calculated from one or more of the following: sharpness, integrity, texture features, text boundary continuity.
At step S403, parameters of the neural network are adjusted based on the appearance attribute values of the samples such that the appearance attribute values of the samples in the first positive subset of samples fall into a first appearance attribute value interval and the appearance attribute values of the samples in the second positive subset of samples fall into a second appearance attribute value interval, the second appearance attribute value interval not overlapping the first appearance attribute value interval.
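A minimal sketch of one way such a non-overlapping-interval constraint could be imposed as a training loss is given below. It assumes the appearance attribute value is a scalar in [0, 1], with a boundary of 0.5 and a margin of 0.1 as assumed hyper-parameters, and a batch that contains samples from both positive subsets.

```python
import torch

def appearance_interval_loss(attr_values: torch.Tensor,
                             subset_labels: torch.Tensor,
                             boundary: float = 0.5,
                             margin: float = 0.1) -> torch.Tensor:
    """Hinge-style loss pushing first-subset values above (boundary + margin) and
    second-subset values below (boundary - margin), so the two intervals do not overlap.
    subset_labels: 1 for the first positive subset, 0 for the second; the batch is
    assumed to contain samples from both subsets."""
    first = attr_values[subset_labels == 1]
    second = attr_values[subset_labels == 0]
    loss_first = torch.relu(boundary + margin - first).mean()    # zero once value > boundary + margin
    loss_second = torch.relu(second - (boundary - margin)).mean()
    return loss_first + loss_second
```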
According to some embodiments, the images in the first positive sample subset are distinguished from the images in the second positive sample subset by sharpness. Specifically, the objects in the first positive sample subset are objects that satisfy the sharpness index, and the objects in the second positive sample subset are objects that do not satisfy the sharpness index. The sharpness index may be calibrated manually or calculated by any of the methods described above. The neural network trained in this way can identify the target object and distinguish the sharpness of the target image, so that the quality or accuracy of the generated target image can be improved without pre-screening input image quality.
According to some embodiments, the images in the first positive sample subset are distinguished from the images in the second positive sample subset by integrity. Specifically, the objects in the first positive sample subset are complete objects and the objects in the second positive sample subset are partially occluded objects. The neural network trained in this way can identify the target object and distinguish the integrity of the target image, so that the quality or accuracy of the generated target image can be improved without pre-screening the input images.
According to some embodiments, the images in the first positive sample subset are distinguished from the images in the second positive sample subset by texture features. Specifically, the objects in the first positive sample subset are objects having a first texture feature, and the objects in the second positive sample subset are objects having a second texture feature different from the first texture feature. The neural network trained in this way can distinguish the texture features of the target image while identifying the target object, so a single network suffices and the accuracy of the generated target image can be improved without pre-screening the input images.
According to some embodiments, the target class characterizes objects containing text, and the images in the first positive sample subset and the second positive sample subset are distinguished by text boundary continuity. Specifically, the objects in the first positive sample subset are objects with continuous text boundaries, and the objects in the second positive sample subset are objects with discontinuous text boundaries. The neural network trained in this way can recognize the text boundaries of the target image while recognizing the target object, so a single network suffices and the accuracy of the generated target image can be improved without pre-screening the input images.
It is readily understood that the images in the first positive sample subset and the second positive sample subset are not limited to the features described above. Each of the above features may be used alone, and any combination thereof is also applicable. For example, the positive sample set may be divided into the first positive sample subset and the second positive sample subset based on only one of these features, or on a combination of several or even all of them. The images in the first positive sample subset may be images containing clear and complete objects, while the images in the second positive sample subset may be images containing objects with a specific texture feature and discontinuous text boundaries. Alternatively, the images in the first positive sample subset may be images containing objects that are clear, complete, have a first texture feature, and have continuous text boundaries, while the images in the second positive sample subset may be images containing objects that are blurred, occluded, contain a different second texture feature, and have discontinuous text boundaries.
Furthermore, for different application scenarios, the images in the first positive sample subset and the second positive sample subset may be distinguished by object appearance attributes other than the above features, e.g. the scale, shape, orientation, shooting angle, life stage, color, etc. of the object, as long as these features can be characterized by appearance attribute values using any method in the art, and the invention is not limited thereto.
The appearance attribute value of a sample may be calculated in a number of different ways. According to some embodiments, calculating the appearance attribute value of the sample includes: extracting low-level semantics of the sample image by convolving the sample image; and calculating the appearance attribute value of the sample based on those low-level semantics. The low-level semantics characterize the appearance of the image as a whole, so the image quality can be preliminarily screened from the low-level semantics of the whole image before the region of the target object is identified. According to some embodiments, calculating the appearance attribute value of the sample includes: outputting, through a first sub-network of the neural network, candidate regions in which objects belonging to the target class are located in the sample image; and, for each of the one or more candidate regions, through a second sub-network of the neural network: calculating an image quality score of the candidate region by convolving the candidate region, and calculating the appearance attribute value of the sample image based on the image quality score of the candidate region. Identifying candidate regions with the already-trained part of the network and then computing only over those regions reduces the amount of computation. It will be appreciated that these features may each be used alone or in combination. For example, the sample image may first be convolved to extract low-level semantics, then convolved again to extract candidate regions of the target object, and appearance attribute values may then be calculated over the candidate regions or the local portions containing the target object.
According to some embodiments, the neural network is end-to-end trained. Thus, parameters for image recognition tasks and image classification tasks can be trained simultaneously to simplify the complexity of network design and system architecture. Furthermore, the end-to-end network architecture thus trained may also save computational effort compared to a cascaded network architecture.
Fig. 5 is a flow chart of a method of training a neural network according to another embodiment of the present disclosure.
In application scenarios where information is extracted from acquired images, using vehicle-mounted images as the data source is attractive because it reduces cost and allows real-time updating of data. However, because some current vehicle-mounted acquisition equipment is of poor quality, the acquired image data suffer from blurring, occlusion, distortion, and the like, so the extracted object information is often of poor quality, which is very unfavorable for subsequent processing.
Here, a signboard image is used as an example of an object to be detected. The sign board contains abundant text information, and therefore plays an important role in data mining and automatic production. For example, sign detection may be the first step in a point of interest automated production process in the map field, and thus, the quality of sign extraction affects the performance of subsequent automated production processes.
Object detection is a basic computer vision task. Its main purpose is to detect the position of a target in an image and assign a class label to the target at that position. With breakthroughs in deep learning theory, target detection has improved greatly; mainstream target detection networks include anchor-based detection methods and anchor-free detection methods. Anchor-based methods include, for example, Faster R-CNN and RetinaNet. Anchor-free methods include, for example, CenterNet and CornerNet. Current target detection techniques are generally suited to detecting the required target object from high-quality images in which the target object's form is relatively consistent, and often do not perform well when the image quality is poor or the target appears in a variety of forms.
Fine-grained classification features are therefore used to evaluate the target object. Fine-grained classification aims at assigning class labels to images using their detail features. Main approaches include fine-tuning conventional image classification networks, fine-grained feature learning, attention-based methods, and the like.
At step S501, fine-grained features suitable for identifying different appearance forms of an object are acquired for a category to be detected.
In this example, for the sign category, and in particular for a business scenario in which sign images are extracted for point of interest generation, fine-grained features may include sign blurriness, sign occlusion, sign texture, character boundary continuity, and so on. It is to be appreciated that the fine-grained features of an object class are not so limited. For example, the degree of aging, orientation, placement position, etc. of the sign may also be extracted. The fine-grained features described above can be used to characterize whether the quality of an object (sign) in an image meets the business requirements.
In step S502, the training samples are classified and the corresponding features are computed. Specifically, objects of the first appearance form and objects of the second appearance form are distinguished or modeled according to the fine-grained features acquired for the specific object class.
FIGS. 6 (a) -6 (b) are example sample images containing objects of the second appearance form. For example, FIG. 6 (a) shows a blurred signboard object, which can be identified via blur-degree features, texture features, text features, or the like. FIG. 6 (b) shows an occluded signboard object, which can be identified via occlusion features, text boundary continuity, or the like. For example, images containing an object of the first appearance form, i.e. one that meets the business requirements, may be used as positive samples, while images containing an object of the second appearance form, i.e. one that does not meet the requirements, and images containing no object may both be used as negative samples. Alternatively, images containing an object of the first appearance form may be used as a first type of positive sample, images containing an object of the second appearance form as a second type of positive sample, and images containing no object as negative samples. In this way, both detection of the object and classification of the object can be achieved, as illustrated by the labeling sketch below.
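The two sampling schemes just described can be captured by a small labeling convention; the label values below are arbitrary assumptions used only for illustration.

```python
# Scheme 1: binary detection labels, where second-form objects are treated as negatives.
# Scheme 2: three-way labels, so the network both detects and classifies appearance form.
def label_sample(contains_object: bool, meets_requirements: bool, three_way: bool = True) -> int:
    if not contains_object:
        return 0                                 # negative: no object of the target class
    if three_way:
        return 2 if meets_requirements else 1    # first-form vs. second-form positive
    return 1 if meets_requirements else 0        # scheme 1: second-form images count as negatives
```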
A sign blur-degree feature may be used to determine how blurred the sign is. For signs, a larger activation of this feature indicates a more blurred sign image and a greater probability that subsequent processing will fail. Thus, a large number of unusable samples can be screened out by learning the blur degree of sign images. As described with reference to FIG. 2, methods for judging the degree of blurring may include judgment based on gradient and phase consistency of the image, a method based on the Laplacian and variance, a deep-learning-based sign quality regression algorithm, and the like, and the present disclosure is not limited thereto.
A sign occlusion feature can be used to evaluate the occlusion of the sign, or conversely its integrity or visibility. For the sign extraction task, samples that cannot be processed further because their text is occluded negatively affect the subsequent automated processing flow, whereas occlusion of non-text areas does not. As described with reference to FIG. 2, the main methods for evaluating sign occlusion include judgment based on the number of visible sign keypoints, occlusion judgment based on text detection results, a deep-learning-based regression algorithm for the occlusion degree of the sign region, and the like, and the present disclosure is not limited thereto.
A sign texture feature may be used to describe the texture details of the sign image, and may also be used to indirectly judge the degree of sign blurring and sign occlusion. As described with reference to FIG. 2, methods for judging sign texture may include gradient-based edge detection, statistics-based methods such as the gray-level co-occurrence matrix and autocorrelation function, deep-learning-based texture detection, and the like, and the present disclosure is not limited thereto.
A character boundary continuity feature can also be used to judge the occlusion and blurring of the sign by analyzing the character boundaries within it. As described with reference to FIG. 2, methods for judging text boundary continuity may include discriminant analysis based on text texture, boundary continuity judgment based on text instance segmentation results, and the like, and the present disclosure is not limited thereto.
In step S503, the neural network is trained using the classified samples. This step enables joint training of fine-grained classification of the object category (e.g., signs) and the target detection network. It mainly comprises designing a suitable network model, performing gradient backpropagation and optimization jointly over the fine-grained classification task and the target detection task, and using the fine-grained classification information as a supervision signal for the detection network. A schematic diagram of an example network architecture that may be trained using the method of FIG. 5 is described with reference to FIG. 7.
On the far left side of the neural network, an input image is acquired, which is a sample image during the training phase. Subsequently, the input image is convolved to obtain an intermediate feature map.
The intermediate feature map contains low-level semantic information that can be used to detect edges, text boundaries, and the like. Low-level semantic information refers to low-level cues such as contours, edges, colors, textures, and shape features. The loss value calculated here may be referred to, for example, as an edge/text loss, and is fed into the loss function. The gradient computation here may correspond to the two features "sign texture" and "character boundary continuity" described above. This part of the network implements the appearance-form recognition task of the target object, referred to here as the target classification task; specifically, it constitutes the first part of the training of the target classification task.
Further convolution then yields a feature map; by convolving the extracted features, high-level semantic information can be obtained.
The three parallel branches that follow correspond to the training of the target detection task. The three branches may respectively output an offset value (offset) characterizing the offset of the target's center point, e.g. the pixel-level offset and its loss after upsampling; a heatmap value (heatmap) characterizing the locations where a target may occur; and a scale value (scale) characterizing the scale (length, width, etc.) of the target at that center point. A loss may be calculated for each branch and fed into the loss function. Note that the description here uses a network architecture based on an anchor-free object detection model, but the present disclosure is not limited thereto, and any model or layer structure capable of realizing object recognition may be applied to the network of the present disclosure.
Next, the results generated by the object recognition task are each compared with a threshold value, yielding one or more candidate regions of the image in which an object may exist. The candidate regions are convolved to calculate a quality loss, whose gradient is fed back into the loss function. The quality loss here may correspond to the fine-grained features of sign blurriness and sign occlusion described above. This part of the network constitutes the second part of the training of the target classification task. It will be appreciated that the first and second parts may also exist separately, and training of the target classification task can still be achieved.
A fused structure of the detection network and the classification network is thus realized. With such a network, the accuracy of the target detection task can be improved.
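As a rough sketch of how the individual terms described above could be fused into a single end-to-end objective, the function below sums the edge/text loss, the three detection-branch losses, and the region quality loss; the weights are assumptions for illustration and are not specified in this disclosure.

```python
import torch

def joint_detection_classification_loss(edge_text_loss: torch.Tensor,
                                         heatmap_loss: torch.Tensor,
                                         offset_loss: torch.Tensor,
                                         scale_loss: torch.Tensor,
                                         quality_loss: torch.Tensor,
                                         weights=(1.0, 1.0, 1.0, 0.1, 1.0)) -> torch.Tensor:
    """Weighted sum of the target-classification terms (edge/text, quality) and the
    target-detection terms (heatmap, offset, scale); one backward pass trains both tasks."""
    w_et, w_hm, w_off, w_sc, w_q = weights
    return (w_et * edge_text_loss + w_hm * heatmap_loss +
            w_off * offset_loss + w_sc * scale_loss + w_q * quality_loss)
```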
According to the above method, or using the neural network trained as above, a sign image of high quality with clear text and complete content can be obtained, particularly where the category to be detected is a sign. Such a sign image is particularly suitable for generating points of interest in the map or navigation field. According to some embodiments, a sign image generation method may also be provided, comprising generating a sign image using the target image extraction method described above or using a neural network trained by the neural network training method described above, the generated sign image being suitable for extraction of points of interest. The sign image generation method is particularly suitable for scenarios in which sign images are generated from vehicle-mounted images or from a large number of images of mixed quality or sharpness, and can greatly reduce production costs.
According to some embodiments, there may also be provided a point of interest generating method comprising extracting a point of interest name using a signboard image generated according to the above method. Thus, the input image screening cost can be reduced, and the high-precision interest point data can be generated at a low production cost.
Fig. 8 illustrates a block diagram of an apparatus 800 according to some embodiments of the present disclosure. The apparatus 800 includes an image input unit 801, an object recognition unit 802, and an image output unit 803. The image input unit 801 may be configured to acquire an input image. The object recognition unit 802 may be configured to recognize whether an object belonging to a target class is contained in the input image. The image output unit 803 may be configured to: outputting a target image in response to the input image including an object belonging to the target category and the object having a first appearance form; and not outputting the target image in response to the input image including the object belonging to the target category and the object having a second appearance form different from the first appearance form. The target image is a local area in the input image containing the object.
According to another aspect of the present disclosure, there is also provided an apparatus that may include a neural network trained according to the neural network training method described herein.
According to another aspect of the disclosure, there is also provided a computing device, which may include: a processor; and a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the target image extraction method or training method described above.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium storing a program, which may include instructions that when executed by a processor of a server, cause the server to perform the above-described target image extraction method or training method.
With reference to fig. 9, a block diagram of a computing device 900 that may be a server or client of the present disclosure will now be described, which is an example of a hardware device that may be applied to aspects of the present disclosure.
Computing device 900 may include elements that are connected to bus 902 (possibly via one or more interfaces) or communicate with bus 902. For example, computing device 900 can include a bus 902, one or more processors 904, one or more input devices 906, and one or more output devices 908. The one or more processors 904 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (e.g., special processing chips). The processor 904 can process instructions executing within the computing device 900, including instructions stored in or on memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computing devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 904 is illustrated in FIG. 9.
Input device 906 may be any type of device capable of inputting information to computing device 900. The input device 906 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of a computing device used to extract target images, train a neural network, or generate a sign image, and may include, but is not limited to, a mouse, keyboard, touch screen, track pad, track ball, joystick, microphone, and/or remote control. Output device 908 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers.
Computing device 900 may also include a non-transitory storage device 910 or be connected to non-transitory storage device 910, which may be any storage device that is non-transitory and that may enable data storage, and may include, but is not limited to, magnetic disk drives, optical storage devices, solid-state memory, floppy diskettes, flexible disks, hard disks, magnetic tape, or any other magnetic medium, optical disks or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer may read data, instructions, and/or code. The non-transitory storage device 910 may be detachable from the interface. The non-transitory storage device 910 may have data/programs (including instructions)/code/modules (e.g., image input module 801, object recognition module 802, and image output module 803 shown in fig. 8) for implementing the methods and steps described above.
Computing device 900 may also include a communication device 912. The communication device 912 may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication devices, and/or chipsets, such as Bluetooth (TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing device 900 may also include a working memory 914, which may be any type of working memory that may store programs (including instructions) and/or data useful for the operation of processor 904, and may include, but is not limited to, random access memory and/or read-only memory devices.
Software elements (programs) may reside in working memory 914 including, but not limited to, an operating system 916, one or more application programs 918, drivers, and/or other data and code. Instructions for performing the above-described methods and steps may be included in one or more applications 918, and the above-described methods may be implemented by instructions of one or more applications 918 being read and executed by processor 904. Executable code or source code for instructions of software elements (programs) may also be downloaded from a remote location.
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may also be used, and/or particular elements may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented by programming hardware (e.g., programmable logic circuits including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in an assembly language or a hardware programming language such as VERILOG, VHDL, or C++ using logic and algorithms according to the present disclosure.
It should also be appreciated that the foregoing methods may be implemented in a server-client mode. For example, a client may receive data entered by a user and send the data to a server. Alternatively, the client may receive data entered by the user, perform part of the foregoing processing, and send the partially processed data to the server. The server may receive the data from the client, perform the foregoing method or the remaining part of it, and return the execution result to the client. The client may receive the execution result from the server and may present it to the user, for example, through an output device. The client and the server are generally remote from each other and typically interact through a communication network; the client-server relationship arises from computer programs that run on the respective computing devices and have a client-server relationship with each other. The server may be a server of a distributed system or a server combined with a blockchain; it may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence capabilities.
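As a non-authoritative illustration of this server-client mode, the sketch below (assuming Python with Flask on the server side; the endpoint name, payload field, and port are hypothetical) shows a server exposing the extraction method over HTTP:

```python
# server.py -- hypothetical HTTP front end for the target image extraction method
from flask import Flask, request, jsonify

app = Flask(__name__)

def extract_target_regions(image_bytes: bytes) -> list:
    # Placeholder for the neural-network-based extraction described above;
    # returns the candidate regions that are output as target images.
    return []

@app.route("/extract", methods=["POST"])
def extract():
    image_bytes = request.files["image"].read()
    regions = extract_target_regions(image_bytes)
    return jsonify({"target_regions": regions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client could then upload an image with, for example, requests.post("http://<server>:8000/extract", files={"image": open("input.jpg", "rb")}) and present the returned regions to the user through an output device.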
It should also be appreciated that components of computing device 900 may be distributed over a network. For example, some processes may be performed using one processor while other processes may be performed by another processor remote from the one processor. Other components of computing device 900 may also be similarly distributed. As such, computing device 900 may be interpreted as a distributed computing system that performs processing in multiple locations.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is limited not by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents. Furthermore, the steps may be performed in an order different from that described in the present disclosure, and various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after this disclosure.

Claims (18)

1. A target image extraction method comprising:
acquiring an input image;
identifying whether the input image contains an object belonging to a target class;
outputting a target image in response to the input image containing an object belonging to the target class and the object having a first appearance form; and
not outputting the target image in response to the input image containing an object belonging to the target class and the object having a second appearance form different from the first appearance form, wherein
the target image is a local area in the input image containing the object,
wherein the method is performed by a neural network,
wherein identifying whether the input image contains an object belonging to the target class includes:
outputting, through a first sub-network in the neural network, one or more candidate regions in which objects belonging to the target class are located in the input image; and
wherein outputting the target image in response to the input image containing an object belonging to the target class and the object having the first appearance form includes, for each of the one or more candidate regions, through a second sub-network of the neural network:
calculating an appearance attribute value of the object in the candidate region,
outputting the candidate region as the target image based on the calculated appearance attribute value falling within a first appearance attribute value interval, and
wherein the objects of the first appearance form and the objects of the second appearance form belonging to the same target class differ in at least one of: sharpness, integrity, texture features, text boundary continuity.
2. The method of claim 1, wherein the object having the first appearance form is an object that meets a sharpness index and the object having the second appearance form is an object that does not meet a sharpness index.
3. The method of claim 1 or 2, wherein the object having the first appearance form is a complete object and the object having the second appearance form is a partially occluded object.
4. The method of claim 1 or 2, wherein the object having the first appearance form is an object having a first texture feature and the object having the second appearance form is an object having a second texture feature different from the first texture feature.
5. The method of claim 1 or 2, wherein the target class characterizes text-containing objects, and wherein objects having the first appearance form are text boundary-continuous objects and objects having the second appearance form are text boundary-discontinuous objects.
6. The method of claim 1, wherein the neural network is end-to-end trained.
7. A method for training a neural network, comprising:
training the neural network using an image in which an object belonging to a target class exists as a positive sample set, and using an image in which an object belonging to the target class does not exist as a negative sample set;
wherein the positive sample set comprises a first positive sample subset and a second positive sample subset, the images in the first positive sample subset comprising objects having a first appearance form and the images in the second positive sample subset comprising objects having a second appearance form, and
wherein training the neural network further comprises,
calculating appearance attribute values of the samples; and
adjusting parameters of the neural network so that the appearance attribute values of the samples in the first positive sample subset fall within a first appearance attribute value interval and the appearance attribute values of the samples in the second positive sample subset fall within a second appearance attribute value interval, wherein the second appearance attribute value interval does not overlap the first appearance attribute value interval.
8. The training method of claim 7, wherein the objects in the first positive subset of samples are objects that meet a sharpness index and the objects in the second positive subset of samples are objects that do not meet the sharpness index.
9. Training method according to claim 7 or 8, wherein the objects in the first positive subset of samples are complete objects and the objects in the second positive subset of samples are partially occluded objects.
10. Training method according to claim 7 or 8, wherein the objects in the first positive subset of samples are objects having a first texture feature and the objects in the second positive subset of samples are objects having a second texture feature different from the first texture feature.
11. Training method according to claim 7 or 8, wherein the target class characterizes text-containing objects, and wherein the objects in the first positive subset of samples are objects with continuous text boundaries and the objects in the second positive subset of samples are objects with discontinuous text boundaries.
12. The training method of claim 7 or 8, wherein calculating the appearance attribute value of the sample comprises:
extracting low-level semantic features from the sample image by convolving the sample image; and
calculating the appearance attribute value of the sample based on the low-level semantic features.
13. The training method of claim 7 or 8, wherein calculating the appearance attribute value of the sample comprises:
outputting, through a first sub-network in the neural network, one or more candidate regions in which objects belonging to the target class are located in a sample image; and
for each of the one or more candidate regions, through a second sub-network of the neural network:
calculating an image quality score of the candidate region by convolving the candidate region; and
calculating the appearance attribute value of the sample image based on the image quality scores of the candidate regions.
14. The training method of claim 7 or 8, wherein the neural network is end-to-end trained.
15. A target image extraction apparatus comprising:
an image input unit for acquiring an input image;
an object recognition unit, configured to recognize whether an object belonging to a target class is included in the input image; and
an image output unit configured to:
outputting a target image in response to the input image containing an object belonging to the target class and the object having a first appearance form; and
not outputting the target image in response to the input image containing an object belonging to the target class and the object having a second appearance form different from the first appearance form, wherein
the target image is a local area in the input image containing the object,
wherein the operations of the object recognition unit and the image output unit are performed through a neural network,
wherein recognizing whether the input image contains an object belonging to the target class includes:
outputting, through a first sub-network in the neural network, one or more candidate regions in which objects belonging to the target class are located in the input image; and
wherein outputting the target image in response to the input image containing an object belonging to the target class and the object having the first appearance form includes, for each of the one or more candidate regions, through a second sub-network of the neural network:
calculating an appearance attribute value of the object in the candidate region,
outputting the candidate region as the target image based on the calculated appearance attribute value falling within a first appearance attribute value interval, and
Wherein the objects of the first appearance form and the objects of the second appearance form belonging to the same target class differ in at least one of: sharpness, integrity, texture features, text boundary continuity.
16. A target image extraction device comprising a neural network trained in accordance with the training method of any one of claims 7-14.
17. A computing device, comprising:
a processor; and
a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the target image extraction method according to any one of claims 1-6 or the training method according to any one of claims 7-14.
18. A computer readable storage medium storing a program, the program comprising instructions that, when executed by a processor of an electronic device, instruct the electronic device to perform the target image extraction method according to any one of claims 1-6 or the training method according to any one of claims 7-14.
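For illustration only, the following minimal Python sketch mirrors the two-stage inference recited in claim 1: a first sub-network proposes candidate regions for the target class, and a second sub-network computes an appearance attribute value per region, the region being output as the target image only when that value falls within the first appearance attribute value interval. The callable interfaces, the region layout, and the interval bounds are illustrative assumptions, not part of the claims.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical region layout: (x, y, width, height) in pixels.
Region = Tuple[int, int, int, int]

@dataclass
class TargetImageExtractor:
    # Stand-ins for the two sub-networks of claim 1; the callables and their
    # signatures are illustrative assumptions, not the patented architecture.
    propose_regions: Callable[[object], List[Region]]      # first sub-network
    appearance_value: Callable[[object, Region], float]    # second sub-network
    first_interval: Tuple[float, float] = (0.5, 1.0)       # assumed interval

    def extract(self, image: object) -> List[Region]:
        low, high = self.first_interval
        targets: List[Region] = []
        for region in self.propose_regions(image):
            value = self.appearance_value(image, region)
            # A candidate region is output as the target image only when its
            # appearance attribute value falls within the first interval.
            if low <= value <= high:
                targets.append(region)
        return targets
```

With propose_regions wrapping, for example, a region proposal sub-network and appearance_value a small convolutional head that scores sharpness or completeness, blurred or occluded instances of the target class would fall outside the assumed interval and therefore would not be output.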
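Along the same lines, a hedged sketch of the interval constraint in the training method of claim 7: one possible (assumed, not prescribed) way to drive appearance attribute values of the first positive sample subset into a first interval and those of the second subset into a non-overlapping second interval is an interval hinge penalty, written here with PyTorch:

```python
import torch

def interval_hinge_loss(values: torch.Tensor, low: float, high: float) -> torch.Tensor:
    """Zero when every value lies inside [low, high]; grows linearly outside it."""
    return (torch.relu(low - values) + torch.relu(values - high)).mean()

def appearance_interval_loss(first_subset_values: torch.Tensor,
                             second_subset_values: torch.Tensor) -> torch.Tensor:
    # Illustrative, non-overlapping intervals; the concrete bounds are a
    # training-time design choice, not something fixed by this sketch.
    return (interval_hinge_loss(first_subset_values, 0.5, 1.0)
            + interval_hinge_loss(second_subset_values, 0.0, 0.4))
```

Minimizing such a term together with the usual detection losses adjusts the network parameters in the direction claim 7 describes; the interval bounds (0.5, 1.0) and (0.0, 0.4) are placeholders for whatever non-overlapping intervals the training procedure chooses.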
CN202011251871.2A 2020-11-11 2020-11-11 Target image extraction method, neural network training method and device Active CN112270377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011251871.2A CN112270377B (en) 2020-11-11 2020-11-11 Target image extraction method, neural network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011251871.2A CN112270377B (en) 2020-11-11 2020-11-11 Target image extraction method, neural network training method and device

Publications (2)

Publication Number Publication Date
CN112270377A (en) 2021-01-26
CN112270377B (en) 2024-03-15

Family

ID=74339898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011251871.2A Active CN112270377B (en) 2020-11-11 2020-11-11 Target image extraction method, neural network training method and device

Country Status (1)

Country Link
CN (1) CN112270377B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229347B2 (en) * 2017-05-14 2019-03-12 International Business Machines Corporation Systems and methods for identifying a target object in an image
CN110020578A (en) * 2018-01-10 2019-07-16 广东欧珀移动通信有限公司 Image processing method, device, storage medium and electronic equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013114596A (en) * 2011-11-30 2013-06-10 Kddi Corp Image recognition device and method
CN103810739A (en) * 2014-02-20 2014-05-21 南京师范大学 Image character morphing animation generating method
WO2018088794A2 (en) * 2016-11-08 2018-05-17 Samsung Electronics Co., Ltd. Method for correcting image by device and device therefor
CN108898579A (en) * 2018-05-30 2018-11-27 腾讯科技(深圳)有限公司 A kind of image definition recognition methods, device and storage medium
CN108984790A (en) * 2018-07-31 2018-12-11 蜜小蜂智慧(北京)科技有限公司 A kind of data binning method and device
CN111199228A (en) * 2019-12-26 2020-05-26 深圳市芊熠智能硬件有限公司 License plate positioning method and device
CN111161246A (en) * 2019-12-30 2020-05-15 歌尔股份有限公司 Product defect detection method, device and system
CN111368342A (en) * 2020-03-13 2020-07-03 众安信息技术服务有限公司 Image tampering identification model training method, image tampering identification method and device
CN111340140A (en) * 2020-03-30 2020-06-26 北京金山云网络技术有限公司 Image data set acquisition method and device, electronic equipment and storage medium
CN111881944A (en) * 2020-07-08 2020-11-03 贵州无忧天空科技有限公司 Method, electronic device and computer readable medium for image authentication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-parameter palm vein image quality assessment method based on a BP-AdaBoost neural network; Li Xianlan; Zhang Ding; Huang Xi; 计算机系统应用 (Computer Systems & Applications), Issue 03; full text *

Also Published As

Publication number Publication date
CN112270377A (en) 2021-01-26

Similar Documents

Publication Publication Date Title
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
KR20220113829A (en) Vehicle tracking methods, devices and electronic devices
CN112052186B (en) Target detection method, device, equipment and storage medium
US20230005257A1 (en) Illegal building identification method and apparatus, device, and storage medium
CN112381104B (en) Image recognition method, device, computer equipment and storage medium
CN112930537B (en) Text detection, inserted symbol tracking, and active element detection
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN115422389B (en) Method and device for processing text image and training method of neural network
CN111027450A (en) Bank card information identification method and device, computer equipment and storage medium
EP3933674A1 (en) Method, apparatus, device, storage medium and program for processing image
CN111832658B (en) Point-of-interest information processing method and device, electronic equipment and storage medium
CN111967490A (en) Model training method for map detection and map detection method
CN112930538A (en) Text detection, inserted symbol tracking, and active element detection
CN113723305A (en) Image and video detection method, device, electronic equipment and medium
CN111768007B (en) Method and device for mining data
CN112270377B (en) Target image extraction method, neural network training method and device
CN114140852B (en) Image detection method and device
CN116894317A (en) Data processing method, device, electronic equipment and medium
CN113139542B (en) Object detection method, device, equipment and computer readable storage medium
CN114842476A (en) Watermark detection method and device and model training method and device
CN114677691B (en) Text recognition method, device, electronic equipment and storage medium
CN113095347A (en) Deep learning-based mark recognition method and training method, system and electronic equipment thereof
CN115331077B (en) Training method of feature extraction model, target classification method, device and equipment
CN115170536B (en) Image detection method, training method and device of model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant