WO2019106095A1 - Hierarchical image interpretation system - Google Patents

Hierarchical image interpretation system Download PDF

Info

Publication number
WO2019106095A1
WO2019106095A1 · PCT/EP2018/083023 · EP2018083023W
Authority
WO
WIPO (PCT)
Prior art keywords
sub
image
region
node
hierarchy
Prior art date
Application number
PCT/EP2018/083023
Other languages
French (fr)
Inventor
Daniel Hubert
Ben BOUTCHER-WEST
Gabriel J. Brostow
Original Assignee
Yellow Line Parking Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yellow Line Parking Ltd. filed Critical Yellow Line Parking Ltd.
Publication of WO2019106095A1 publication Critical patent/WO2019106095A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/18143Extracting features based on salient regional features, e.g. scale invariant feature transform [SIFT] keypoints
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/196Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V30/1983Syntactic or structural pattern recognition, e.g. symbolic string recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/416Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • the present disclosure relates to a system and method for parsing parking signs.
  • OCR Optical Character Recognition
  • An aspect of the invention provides a method for parsing parking signs, comprising receiving image data representing an image, processing the image data to determine a first information region in the image and associating the first information region with a parent node of a hierarchy, processing the image data to determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node, and outputting data indicative of the hierarchy.
  • the method may further comprise iteratively determining one or more further sub-regions wholly contained within one or more previously determined sub-regions and associating each further determined sub-region with a further sub-node of the hierarchy, wherein each further sub-node is a child to the corresponding previously determined parent sub-node.
  • images comprising multiple regions of interest and therefore multiple levels of association may be parsed.
  • Determining a first information region or sub-region of the image may comprise using one or more feature detection algorithms.
  • the feature detection algorithm comprises one or more of Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded up Robust Feature (SURF), Haar-like features, or a neural network.
  • HOG Histogram of Oriented Gradients
  • SIFT Scale-Invariant Feature Transform
  • SURF Speeded up Robust Feature
  • Haar-like features or a neural network.
  • information regions of interest may be efficiently determined.
  • the method may further comprise determining a semantic classification of each information region or sub-region and associating the semantic classification with the corresponding node or sub-node.
  • determining a semantic classification of each information region may comprise using a classification algorithm.
  • the classification algorithm comprises using one or more of a neural network, decision forest, or logistic regression algorithm.
  • the hierarchical associations may be preserved through to semantic understanding.
  • the method may further comprise using a non-maximal suppression method to prevent overlap of determined sub-regions.
  • the method may comprise co-training the classification algorithm with the feature detection algorithm.
  • the method may comprise training one or more of the feature detection and classification algorithms using data indicative of a predicted hierarchy.
  • the method may comprise training one or more of the feature detection and classification algorithms using data indicative of a position of one or more information regions.
  • this increases the accuracy of the feature detection and classification algorithms for determining and classifying spatial information regions.
  • the determined information regions are of different sizes.
  • this allows for increased diversification of input images to be parsed.
  • An aspect of the invention provides a system for parsing parking signs, comprising input means arranged to receive image data representing an image; processing means arranged to: determine a first information region in the image and associate the first information region with a parent node of a hierarchy, determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node; and output means arranged to output data indicative of the hierarchy.
  • Figure 1 shows an example of an image to be parsed
  • Figure 2 shows a system for parsing images
  • Figure 3 shows a method for parsing images
  • Figure 4 shows an example of a parsed image
  • Figure 5 shows an output hierarchy
  • Figure 6 shows an example of a semantic classification method.
  • FIG. 1 shows an image of a parking sign 100.
  • the parking sign consists of multiple information regions 110, 120, 130, wherein each region comprises one or more rules 140 (e.g. No loading, pay at machine, etc.) and times 150 (7-10am, Mon-Fri, etc.).
  • each rule is associated with one or more respective times, and governs the state of a parking space.
  • other rules are possible, such as the type/purpose of the parking space or other instructions on use.
  • a hierarchy therefore exists for the rules and times in the image of the parking sign, and a hierarchy may therefore be defined as a representation of the associations between regions of the image containing semantic information. A hierarchy may also be defined for other types of data, such as the associations between characters or words in a text string, or subjects in a video.
  • the information represented in the parking sign 100 may thus be represented hierarchically by representing the specific associations between individual rules and times.
  • FIG. 2 illustrates a system 200 for image parsing. Parsing in this context refers to extracting or interpreting information or semantic understanding from an input, such as an image. However, other forms of input may be envisaged, such as text or video.
  • the system 200 comprises an input means 210 arranged to receive image data 205 representing an image containing one or more information regions, a processing means 220 to process the image data to determine a first information region in the image and associate the first information region with a parent node of a hierarchy and determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node, and an output means 230 to output data indicative of the hierarchy.
  • the hierarchy may comprise any suitable data structure operable to the processing means 220, such as a tree, linked list, or relational database.
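  • The tree variant of the hierarchy described above could be sketched as follows. This is an illustrative data structure, not taken from the patent; the class and field names are assumptions.

```python
# Hypothetical sketch of the hierarchy: a tree whose nodes record a
# region's bounding box and an optional semantic classification.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RegionNode:
    box: tuple                      # (x, y, width, height) of the region
    label: Optional[str] = None     # semantic classification, if determined
    children: list = field(default_factory=list)

    def add_child(self, node: "RegionNode") -> "RegionNode":
        self.children.append(node)
        return node

    def to_dict(self) -> dict:
        """Serialise the hierarchy for output."""
        return {"box": self.box, "label": self.label,
                "children": [c.to_dict() for c in self.children]}

# Parent node for the whole sign, with one section and one rule sub-node.
root = RegionNode((0, 0, 400, 600), "sign")
section = root.add_child(RegionNode((10, 40, 380, 180), "section"))
section.add_child(RegionNode((20, 50, 200, 40), "rule"))
print(root.to_dict())
```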
  • the input means 210 is arranged to receive image data 205 representing an image containing one or more information regions, each comprising a physical region of the image in which inter-associated information is contained, as will be explained later.
  • the image data 205 may be received from any suitable image capture means or data storage means and through any suitable wired or wireless connection to the input means 210.
  • the image capture means may comprise a sensor, such as a LIDAR sensor, or any other suitable image capture device. Typically such sensors will operate based on visual information, however as parking signs develop and communicate using other media, including electronic, it is envisaged that the image data 205 may be received from other suitable sensors and/or devices.
  • the input means 210 is digitally coupled to the processing means 220, which may comprise one or more processing devices.
  • the processing means 220 may be coupled to a database 225.
  • the processing means 220 may be arranged to perform one or more computer vision methods on the received image data 205 to determine one or more information regions or sub-regions and designate each region as a node in a hierarchy. In this way, each region may be associated with a node in a hierarchy, as will be explained later.
  • the processing means 220 is digitally coupled to the output means 230.
  • the output means 230 may comprise any suitable means for outputting data indicative of the hierarchy 235, for example a connected device, e.g. a computing device, data storage means, distributed platform, or display means.
  • Figure 3 illustrates a method 300 for image parsing.
  • the method 300 may be performed using the system 200.
  • the method 300 comprises the step 310 of receiving image data representing an image, the image containing one or more information regions.
  • the step 310 may be performed using the input means 210 as shown in Figure 2.
  • Information regions may comprise an area containing text, objects, or other image features of interest. Examples of information regions include the rules and times shown in the parking sign in Figure 1. Other examples may include images comprising information regions showing faces, clothing, brands, buildings, vehicles, or other image features or semantic objects to be determined.
  • the method 300 further comprises the step 320 of determining a first information region in the image and associating the first information region with a parent node (i.e. designating the first information region as the parent node) of a hierarchy.
  • the step 320 may be performed using the processing means 220 as shown in Figure 2.
  • Determining the first information region may be performed using any suitable object detection algorithm, such as a feature detection algorithm.
  • feature detection algorithms include, amongst others, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded up Robust Feature (SURF), Haar-like features, or a neural network, which provide indications of relevant points of the image.
  • the object detection algorithm may further comprise any suitable machine-learning algorithm arranged to determine a spatial area of the image, such as Fast R-CNN.
  • the determined information region may be any suitably shaped subset of pixels of the image comprising the object or objects of interest.
  • the information region may be rectangular or square, or comprise any other non-regular shape.
  • the feature detection algorithm may have been trained on previous datasets indicative of the feature to be detected, such as images of parking signs.
  • Upon processing the image data to determine the first information region in the image, the first information region is associated with a parent node of a hierarchy.
  • the parent node may be the root node of the hierarchy, and may represent the main region of interest determined in the image.
  • the hierarchy may be represented by any suitable data format operable to the processing means 220, and may be stored in the database 225.
  • the parent node may be associated with information relating to one or more of the size, position, or contents of the information region determined.
  • image processing may then be applied to the image data corresponding to the determined first information region to prepare for determining an information sub-region.
  • the image processing may comprise any suitable image processing operation such as image rectification, cropping, rotation, warping, hue/saturation/contrast alterations, de-noising, sharpening, blurring etc.
  • the image processing operations may be applied to the image data manually by a user, or may be applied automatically by the processing means 220 based on determined parameter values of the image data.
  • the determined parameter values may comprise one or more of an image skewness, distortion, hue, saturation, contrast, brightness, noise level, sharpness, and blur level, however other image parameter values will be envisaged.
  • an image processing operation may be applied to the image data to increase the brightness.
  • the parameter values may be determined algorithmically. For example, if an image processing algorithm determines the subject-matter of an image is rotated beyond a predetermined axis, a rotation operation may be applied to the image data to align the axis of its subject-matter with the predetermined axis.
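  • The automatic, parameter-driven processing above could be sketched as follows: an illustrative (not patent-specified) brightness check on a greyscale image, where a determined parameter value triggers an image-processing operation. The threshold and gain values are assumptions.

```python
# Brighten a greyscale image (list of pixel rows) only when its determined
# mean-brightness parameter falls below a threshold.
def mean_brightness(pixels):
    flat = [v for row in pixels for v in row]
    return sum(flat) / len(flat)

def auto_brighten(pixels, threshold=100, gain=1.5):
    """If the image is darker than `threshold`, scale pixel values up."""
    if mean_brightness(pixels) >= threshold:
        return pixels
    return [[min(255, int(v * gain)) for v in row] for row in pixels]

dark = [[40, 60], [50, 70]]          # mean 55 < 100, so it is brightened
print(mean_brightness(auto_brighten(dark)))
```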
  • the method 300 further comprises the step 330 of processing the image data to determine one or more information sub-regions wholly contained within the first information region and associating (i.e. designating) each determined sub-region with/as a sub-node, wherein each sub-node is a child to the parent node, i.e. the node associated with the first information region.
  • the step of determining the information sub-regions may again be performed using any suitable object detection algorithm, such as a feature detection algorithm.
  • the feature detection algorithm for determining the information sub-region may be trained to determine a different feature from the first feature detection algorithm arranged to determine a first information region, such as sections of a parking sign. Each determined information sub-region may be any suitably shaped subset of pixels of the first information region.
  • Each sub-region may be of different sizes or shapes.
  • a non-maximal suppression method may be applied when determining information sub- regions to prevent overlap of the determined sub-regions.
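  • The non-maximal suppression step above might look like the following minimal sketch, assuming each candidate sub-region is an `(x1, y1, x2, y2, score)` tuple; the greedy strategy and overlap threshold are illustrative, since the patent does not fix a particular method.

```python
# Greedy non-maximal suppression over candidate sub-regions.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def nms(boxes, thresh=0.5):
    """Keep the highest-scoring boxes, suppressing heavy overlaps."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box[:4], k[:4]) < thresh for k in kept):
            kept.append(box)
    return kept

candidates = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(candidates)))  # the two heavily overlapping boxes collapse to one
```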
  • Each sub-node may be associated with information relating to one or more of the size, position, or contents of the corresponding information sub-region determined. In this way, a hierarchy of nodes may be formed, comprising a representation of associations between information sub-regions.
  • the method 300 may optionally also comprise the step 340 of determining a semantic classification for each determined information region or sub-region.
  • Any suitable classifier may be used to determine the semantic classification of each region or sub-region.
  • each sub-region of the first information region may have a classifier applied to determine the class of objects present within each region.
  • the classifier may comprise any suitable classification algorithm, such as neural network, decision forest, and logistic regression algorithms, however other algorithms will be envisaged.
  • the determined semantic classification may be compared to an existing dataset of semantic classifications to identify inconsistencies or anomalies in the determined semantic classification. In this way, the determined semantic classification may be validated based on the historical context of previous classifications.
  • the semantic classification of a parking sign rule may be compared against a dataset of existing semantic classifications for parking signs in order to determine a valid classification. This prevents errors arising in the semantic classification due to noise (such as poor quality images, objects blocking the view of a sign, graffiti on signs, etc.) or algorithmic error.
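  • The validation against existing classifications described above could be sketched as below. The vocabulary of known rules and the flagging behaviour are assumptions for illustration only.

```python
# Validate a determined classification against historically seen ones.
KNOWN_RULES = {"no loading", "no parking", "pay at machine", "display ticket"}

def validate(classification, known=KNOWN_RULES):
    """Return the classification and whether it is historically plausible."""
    cleaned = classification.strip().lower()
    if cleaned in known:
        return cleaned, True
    return cleaned, False   # anomaly: e.g. noise, occlusion, or graffiti

print(validate("No Loading"))   # a valid classification
print(validate("N0 L0ad1ng"))   # flagged for review
```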
  • a confidence score may also be produced by the classifier which is indicative of the degree of certainty or error by the classifier. This confidence score may reflect the quality of the input. For example, when applied to parking signs, the confidence score may reflect the physical quality of the parking sign, which may be useful for physical sign maintenance purposes.
  • the classifier may have been co-trained with the feature detection algorithms.
  • the determined semantic classification may also be associated with the corresponding node, to provide semantic meaning to each node in the hierarchy. Further image processing may be applied to each determined information region or sub-region before classification, such as image rectification, cropping, rotation, warping, hue/saturation alterations, etc.
  • the semantic classification may be computer-parsable.
  • the method 300 may also comprise the step 335 of iteratively determining one or more further sub-regions wholly contained within one or more previously determined sub-regions, and associating each further determined sub-region with a further sub-node of the hierarchy, wherein each further sub-node is a child to the corresponding previously determined parent sub-node.
  • a hierarchy comprising multiple levels of nodes may be provided.
  • the number of iterations to determine sub-regions may be predetermined, or may be algorithmically determined.
  • semantic classifiers may be applied to each determined sub-region and associated with the corresponding node in the hierarchy.
  • Each iteration of the step 335 may apply a different feature detection algorithm. In this way, each layer of nodes in the hierarchy may correspond to a different class of detected feature.
  • Each iteration of the step 335 may also apply a different semantic classifier.
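  • The per-level detection idea above could be sketched recursively, with a different detector applied at each depth of the hierarchy. The detector functions here are stubs standing in for trained feature detectors; their names and outputs are assumptions.

```python
# Recursively build {region: children} using one stub detector per level.
def detect_sections(region):        # level-1 detector stub
    return ["section-a", "section-b"] if region == "sign" else []

def detect_rules(region):           # level-2 detector stub
    return [region + "/rule"] if region.startswith("section") else []

DETECTORS = [detect_sections, detect_rules]

def parse(region, depth=0):
    """Apply the detector for this depth and recurse into its results."""
    if depth >= len(DETECTORS):
        return {}
    return {sub: parse(sub, depth + 1) for sub in DETECTORS[depth](region)}

hierarchy = {"sign": parse("sign")}
print(hierarchy)
```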
  • Training of feature detection algorithms may be performed by specifying a structure of hierarchy to be output from the method 300, i.e. in a supervised manner. Training of the feature detection algorithms may further involve segmenting each image to explicitly indicate the positions and sizes of each information region. Training the feature detection algorithms may comprise training from a collective dataset indicative of all the features to be determined, or each feature detection algorithm may be trained on a different dataset respectively.
  • the method 300 comprises the step 350 of outputting data indicative of the hierarchy.
  • the data indicative of the hierarchy may comprise data relating to one or more of the size, shape, contents, and semantic classification associated with each node, as well as the specific parent-child relationships between each node.
  • the data indicative of the hierarchy may be formatted in any suitable data structure operable to the processing means 220, such as a tree, linked list, relational table, or otherwise.
  • data indicative of a hierarchy may be stored in a data storage means such that the data may be retrieved for further processing in future.
  • the data indicative of the hierarchy may comprise any suitable data structure operable to the processing means 220.
  • the data structure may be arranged such that individual nodes and associated content may be edited, removed, or added to the hierarchy whilst preserving the remaining hierarchy structure.
  • an individual node of the hierarchy may be updated upon request from the processing means 220 to a different semantic classification.
  • an explicit instruction may be received by the processing means 220 to update, remove, overwrite, overrule, or add a node to an existing hierarchy.
  • the processing means 220 may receive an electronic instruction to modify the parking times associated with a parking sign that has been parsed as a hierarchy.
  • the electronic instruction may be received over a wired or wireless network from an external device, such as a computing server or computing device.
  • the semantic classification contained within the node corresponding to the parking time associated with that parking sign will be updated.
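  • The in-place node update described above could be sketched as follows, assuming the dict-shaped hierarchy used earlier in this page's examples; the matching rule (by label) is an illustrative simplification.

```python
# Update one node's semantic classification, preserving the rest of the tree.
def update_label(node, old, new):
    """Replace `old` classification with `new` wherever it appears."""
    changed = 0
    if node.get("label") == old:
        node["label"] = new
        changed += 1
    for child in node.get("children", []):
        changed += update_label(child, old, new)
    return changed

h = {"label": "sign", "children": [
        {"label": "Applies times: 7-10am", "children": []}]}
n = update_label(h, "Applies times: 7-10am", "Applies times: 8-11am")
print(h["children"][0]["label"])
```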
  • the method 300 may be applied to a second input, and may output data indicative of a hierarchy that is different to a previously determined hierarchy.
  • the method 300 may be re-applied to an image of a parking sign that has already been parsed.
  • the differences between semantic classifications contained in each of the nodes of the previously determined hierarchy and the newly determined hierarchy may be identified, and the previous hierarchy may be updated to include the new semantic classifications.
  • the second input may comprise an image, however other forms of input may be envisaged.
  • multiple related images comprising identically located information regions having differing semantic information may be received by the method 300.
  • multiple images of varying states of a digital parking sign may be received, wherein each image contains information regions having different semantic information, reflecting the changing nature of the digital parking sign.
  • a digital parking sign may show “No parking” on a Monday between 9am and 5pm. However, outside of these times, the digital parking sign may show “1 hour parking - no return within 2 hours”.
  • multiple parent nodes may be created for the hierarchy with respect to each set of semantic information.
  • Figure 4 shows an exemplary application 400 of the method 300 applied to the image of the parking sign 100 shown in Figure 1.
  • Figure 5 similarly shows the hierarchy 500 determined from 400.
  • the method 300 is particularly useful for processing parking signs, as they are naturally arranged as a hierarchy of rules, and therefore by explicitly analysing the grouping of components of the sign and retaining the associated hierarchy, the semantic understanding of the sign is not lost.
  • other forms of input may be used, such as road markings, signalling equipment, etc.
  • Examples of traffic-related inputs for which the present invention may be applied to include images of parking signs, warning signs, regulatory signs, speed limit signs, low bridge signs, level crossing signs and signals, train signs, signals and road markings, bus and cycle signs and road markings, pedestrian zone signs, traffic calming signs, motorway signs, signals, and road markings, directional signs, information signs, traffic signals, road work signs, and others.
  • Applying the step 320 to the parking sign 100 provides the first information region 410.
  • the first information region 410 determined corresponds to the entire area of the sign itself, which therefore comprises the parent node 510 of the hierarchy 500.
  • a feature detection algorithm trained to detect signs was used to determine the first information region 410.
  • the information region 410 comprises multiple semantic objects of interest.
  • the first information region may be warped such that the determined sign is fronto-parallel.
  • the warping is performed by regressing a two-dimensional offset applied to each corner of the determined image region corresponding to the sign.
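  • The corner-offset step could be sketched as below: a regressor (stubbed here with fixed values) predicts a 2D offset for each corner of the detected box, and applying the offsets yields the source quadrilateral for a perspective warp. The offset values and box coordinates are invented for illustration.

```python
# Apply regressed 2D offsets to the corners of a detected region.
def apply_offsets(corners, offsets):
    """Shift each (x, y) corner by its regressed (dx, dy) offset."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(corners, offsets)]

# Axis-aligned box around the sign, clockwise from top-left.
box = [(50, 80), (350, 80), (350, 580), (50, 580)]
# Offsets a regressor might predict for a slightly skewed sign (stub values).
predicted = [(4, -2), (-6, 3), (-3, 5), (2, -4)]

quad = apply_offsets(box, predicted)
print(quad)  # the source quadrilateral for a perspective warp
```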
  • One or more information sub-regions 422, 424, 426 wholly contained within the processed first information region 420 are determined, in accordance with step 330.
  • feature detection algorithms trained to detect individual sections of a sign were used to determine the information sub-regions 422, 424, 426.
  • each of the determined sub-regions is associated with a sub-node of the hierarchy, as shown by sub-nodes 520, 530, 540 in Figure 5.
  • Each of the sub-nodes 520, 530 and 540 is a child to the parent node 510, and the parent-child relationships between each node can clearly be seen.
  • One or more further information sub-regions wholly contained within each of the previously determined sub-regions is then determined, as per step 335.
  • Further sub-regions 432, 434, 436, 438 for example have been determined from sub-region 422.
  • feature detection algorithms trained to detect individual rules and times were used to determine the further sub-regions 432, 434, 436, 438.
  • the region 440 corresponding to the instructional message ‘Display ticket’ has not been determined as a sub-region, due to the feature detection algorithm being selected to ignore this type of region.
  • each of the determined further sub-regions is associated with a sub-node of the hierarchy, as shown for example by sub-nodes 522, 524, 526, 528 in Figure 5.
  • Each of the sub-nodes 522, 524, 526, 528 is a child to the parent sub-node 520, and the parent-child relationships between each node can clearly be seen.
  • In Figure 5 it can be seen that a semantic classification of each of the further sub-regions 432, 434, 436, 438 has been performed and the resulting classification has been associated with the corresponding sub-nodes 522, 524, 526, 528, such that each of the sub-nodes has been categorised into ‘Rule’, ‘Applies weekdays’, ‘Applies times’, and ‘Return period’ categories.
  • the semantic classification may therefore comprise a determined rule or time period relating to the state of the parking space, and the determined hierarchy is indicative of the times the rules are applicable.
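  • A consumer of the output hierarchy could use the rule/time associations as sketched below; the time encoding (weekday index, hour) and the rule records are illustrative assumptions, not a format from the patent.

```python
# Decide which parsed rules apply at a given moment.
RULES = [
    {"rule": "No loading", "days": range(0, 5), "hours": range(7, 10)},
    {"rule": "Pay at machine", "days": range(0, 5), "hours": range(10, 18)},
]

def active_rules(weekday, hour, rules=RULES):
    """Return rules whose day/time conditions cover the given moment."""
    return [r["rule"] for r in rules
            if weekday in r["days"] and hour in r["hours"]]

print(active_rules(weekday=1, hour=8))    # a weekday morning
print(active_rules(weekday=6, hour=8))    # a Sunday morning
```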
  • Figure 5 therefore shows a resulting hierarchy, wherein each node corresponds to a region or object of interest.
  • the spatial/positional associations of each rule and time has advantageously been retained in the hierarchical structure, such that the semantic understanding of the parking sign 100 has not been lost.
  • Figure 6 shows an example of the semantic classifiers applied to some of the determined sub-regions, wherein each semantic classification is performed using a classification algorithm trained to return a relevant rule or time category.
  • 610, 620, and 630 show an example application of ‘Day’, ‘Rule’, and ‘Permit code’ semantic classifiers using Convolutional Neural Networks and OCR parsers.
  • the resulting semantic classifications 640 are output in a computer-readable format.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

There is provided a system and method for parsing parking signs, comprising receiving image data representing an image, processing the image data to determine a first information region in the image and associating the first information region with a parent node of a hierarchy, processing the image data to determine one or more information sub-regions wholly contained within the first information region, and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node, and outputting data indicative of the hierarchy.

Description

HIERARCHICAL IMAGE INTERPRETATION SYSTEM
TECHNICAL FIELD
The present disclosure relates to a system and method for parsing parking signs.
BACKGROUND
It is generally desired to provide image parsing systems capable of extracting useful information from digital images, for example using computer vision techniques. Traditionally, computer vision involves the acquisition, extraction and analysis of information present in one or more images through algorithmic or analytical methods to achieve a visual understanding of the image. The applications of computer vision are numerous, and examples include Optical Character Recognition (OCR), object detection and recognition, biometrics, and others.
It is an aim of the invention to mitigate one or more problems of the prior art.
SUMMARY OF THE INVENTION
An aspect of the invention provides a method for parsing parking signs, comprising receiving image data representing an image, processing the image data to determine a first information region in the image and associating the first information region with a parent node of a hierarchy, processing the image data to determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node, and outputting data indicative of the hierarchy.
The method may further comprise iteratively determining one or more further sub-regions wholly contained within one or more previously determined sub-regions and associating each further determined sub-region with a further sub-node of the hierarchy, wherein each further sub-node is a child to the corresponding previously determined parent sub-node. Advantageously, images comprising multiple regions of interest and therefore multiple levels of association may be parsed.
Determining a first information region or sub-region of the image may comprise using one or more feature detection algorithms. Optionally, the feature detection algorithm comprises one or more of Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded up Robust Feature (SURF), Haar-like features, or a neural network. Advantageously, information regions of interest may be efficiently determined.
The method may further comprise determining a semantic classification of each information region or sub-region and associating the semantic classification with the corresponding node or sub-node. Optionally, determining a semantic classification of each information region may comprise using a classification algorithm. Optionally, the classification algorithm comprises using one or more of a neural network, decision forest, or logistic regression algorithm. Advantageously, the hierarchical associations may be preserved through to semantic understanding.
The method may further comprise using a non-maximal suppression method to prevent overlap of determined sub-regions. Advantageously, this prevents information from being parsed repeatedly.
The method may comprise co-training the classification algorithm with the feature detection algorithm.
The method may comprise training one or more of the feature detection and classification algorithms using data indicative of a predicted hierarchy.
The method may comprise training one or more of the feature detection and classification algorithms using data indicative of a position of one or more information regions. Advantageously, this increases the accuracy of the feature detection and classification algorithms for determining and classifying spatial information regions.
Optionally, the determined information regions are of different sizes. Advantageously, this allows for increased diversification of input images to be parsed.
An aspect of the invention provides a system for parsing parking signs, comprising input means arranged to receive image data representing an image; processing means arranged to: determine a first information region in the image and associate the first information region with a parent node of a hierarchy, determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node; and output means arranged to output data indicative of the hierarchy.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 shows an example of an image to be parsed;
Figure 2 shows a system for parsing images;
Figure 3 shows a method for parsing images;
Figure 4 shows an example of a parsed image;
Figure 5 shows an output hierarchy;
Figure 6 shows an example of a semantic classification method.
DETAILED DESCRIPTION
In the field of computer vision, the extraction and analysis of information present in images is particularly important for gaining an understanding of the image. However, semantic information can often be lost with traditional computer vision techniques.
Many images contain information whose natural representation is a hierarchy, i.e. an image may contain multiple regions of information that are inherently associated with each other. Figure 1 for example shows an image of a parking sign 100. The parking sign consists of multiple information regions 110, 120, 130, wherein each region comprises one or more rules 140 (e.g. No loading, pay at machine, etc.) and times 150 (7-10am, Mon-Fri, etc.). It will be appreciated that each rule is associated with one or more respective times, and governs the state of a parking space. It will be appreciated that other rules are possible, such as the type/purpose of the parking space or other instructions on use. A hierarchy therefore exists for the rules and times in the image of the parking sign, and a hierarchy may therefore be defined as a representation of the associations between regions of the image containing semantic information. A hierarchy may also be defined for other types of data, such as the associations between characters or words in a text string, or subjects in a video. The information represented in the parking sign 100 may thus be represented hierarchically by representing the specific associations between individual rules and times.
When using traditional image parsing techniques such as OCR or neural networks, it becomes difficult to retain the hierarchical associations between information. For example, a standard OCR method may perfectly convert the image of the parking sign in Figure 1 into text, but will lose understanding of the required positional or spatial associations between elements, and therefore will lose context of the associations between individual rules and times. In this case, further semantic processing is required in order to provide a full understanding of the purpose of the sign. Without a hierarchical representation of the information, it becomes difficult for an image parsing system to understand specifically what times are associated with 'No loading' or 'Pay at machine', or even to distinguish that 'No loading' and 'Pay at machine' are not associated together. It will be appreciated therefore that an image parsing system capable of extracting and retaining hierarchically represented information is required.
Figure 2 illustrates a system 200 for image parsing. Parsing in this context refers to extracting or interpreting information or semantic understanding from an input, such as an image. However, other forms of input may be envisaged, such as text or video. The system 200 comprises an input means 210 arranged to receive image data 205 representing an image containing one or more information regions, a processing means 220 to process the image data to determine a first information region in the image and associate the first information region with a parent node of a hierarchy and determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node, and an output means 230 to output data indicative of the hierarchy. The hierarchy may comprise any suitable data structure operable to the processing means 220, such as a tree, linked list, or relational database. The input means 210 is arranged to receive image data 205 representing an image containing one or more information regions comprising a physical region of the image in which inter-associated information is contained, as will be explained later. The image data 205 may be received from any suitable image capture means or data storage means and through any suitable wired or wireless connection to the input means 210. The image capture means may comprise a sensor, such as a LIDAR sensor, or any other suitable image capture device. Typically such sensors will operate based on visual information, however as parking signs develop and communicate using other media, including electronic, it is envisaged that the image data 205 may be received from other suitable sensors and/or devices.
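By way of illustration only, the hierarchy data structure referred to above may be sketched as a simple tree of nodes. The field names (`label`, `bbox`, `children`) and the hand-built example are assumptions for illustration and do not form part of the disclosed system; any suitable tree, linked list, or relational structure may be used, as the description states.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchyNode:
    """One node of the hierarchy: an information region and its children.

    Field names are illustrative, not taken from the source document.
    """
    label: str                       # semantic classification, if determined
    bbox: tuple                      # (x, y, width, height) of the region in pixels
    children: list = field(default_factory=list)

    def add_child(self, node):
        """Attach a sub-node (child) to this parent node."""
        self.children.append(node)
        return node

# Build, by hand, a hierarchy of the kind Figure 1's parking sign suggests:
root = HierarchyNode("sign", (0, 0, 400, 900))          # parent (root) node
section = root.add_child(HierarchyNode("section", (10, 10, 380, 280)))
section.add_child(HierarchyNode("rule", (20, 20, 360, 60)))
section.add_child(HierarchyNode("applies_times", (20, 90, 360, 60)))
```

In use, the parent-child links preserve exactly the rule-to-time associations that flat OCR output loses.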
The input means 210 is digitally coupled to the processing means 220, which may comprise one or more processing devices. The processing means 220 may be coupled to a database 225. The processing means 220 may be arranged to perform one or more computer vision methods on the received image data 205 to determine one or more information regions or sub-regions and designate each region as a node in a hierarchy. In this way, each region may be associated with a node in a hierarchy, as will be explained later.
The processing means 220 is digitally coupled to the output means 230. The output means 230 may comprise any suitable means for outputting data indicative of the hierarchy 235, for example a connected device, e.g. a computing device, data storage means, distributed platform, or display means.
Figure 3 illustrates a method 300 for image parsing. The method 300 may be performed using the system 200.
The method 300 comprises the step 310 of receiving image data representing an image, the image containing one or more information regions. The step 310 may be performed using the input means 210 as shown in Figure 2. Information regions may comprise an area containing text, objects, or other image features of interest. Examples of information regions include the rules and times shown in the parking sign in Figure 1. Other examples may include images comprising information regions showing faces, clothing, brands, buildings, vehicles, or other image features or semantic objects to be determined.
The method 300 further comprises the step 320 of determining a first information region in the image and associating the first information region with a parent node (i.e. designating the first information region as the parent node) of a hierarchy. The step 320 may be performed using the processing means 220 as shown in Figure 2. Determining the first information region may be performed using any suitable object detection algorithm, such as a feature detection algorithm. Examples of feature detection algorithms include, amongst others, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded up Robust Feature (SURF), Haar-like features, or a neural network, which provide indications of relevant points of the image. The object detection algorithm may further comprise any suitable machine-learning algorithm arranged to determine a spatial area of the image, such as Fast R-CNN. The determined information region may be any suitably shaped subset of pixels of the image comprising the object or objects of interest. For example, the information region may be rectangular or square, or comprise any other non-regular shape. The feature detection algorithm may have been trained on previous datasets indicative of the feature to be detected, such as images of parking signs.
Upon processing the image data to determine the first information region in the image, the first information region is associated with a parent node of a hierarchy. The parent node may be the root node of the hierarchy, and may represent the main region of interest determined in the image. The hierarchy may be represented by any suitable data format operable to the processing means 220, and may be stored in the database 225. The parent node may be associated with infonnation relating to one or more of the size, position, or contents of the infonnation region determined.
Optionally, image processing may then be applied to the image data corresponding to the determined first information region to prepare for determining an information sub-region. The image processing may comprise any suitable image processing operation such as image rectification, cropping, rotation, warping, hue/saturation/contrast alterations, de-noising, sharpening, blurring etc. The image processing operations may be applied to the image data manually by a user, or may be applied automatically by the processing means 220 based on determined parameter values of the image data. The determined parameter values may comprise one or more of an image skewness, distortion, hue, saturation, contrast, brightness, noise level, sharpness, and blur level, however other image parameter values will be envisaged. For example, if a value of a brightness parameter of the image data is determined to be below a predetermined threshold, an image processing operation may be applied to the image data to increase the brightness. In some embodiments, the parameter values may be determined algorithmically. For example, if an image processing algorithm determines the subject-matter of an image is rotated beyond a predetermined axis, a rotation operation may be applied to the image data to align the axis of its subject-matter with the predetermined axis.
The method 300 further comprises the step 330 of processing the image data to determine one or more information sub-regions wholly contained within the first information region and associating (i.e. designating) each determined sub-region with/as a sub-node, wherein each sub-node is a child to the parent node, i.e. the node associated with the first information region. The step of determining the information sub-regions may again be performed using any suitable object detection algorithm, such as a feature detection algorithm. The feature detection algorithm for determining the information sub-region may be trained to determine a different feature from the first feature detection algorithm arranged to determine a first information region, such as sections of a parking sign. Each determined information sub-region may be any suitably shaped subset of pixels of the first information region. Each sub-region may be of different sizes or shapes. A non-maximal suppression method may be applied when determining information sub-regions to prevent overlap of the determined sub-regions. Each sub-node may be associated with information relating to one or more of the size, position, or contents of the corresponding information sub-region determined. In this way, a hierarchy of nodes may be formed, comprising a representation of associations between information sub-regions.
The method 300 may optionally also comprise the step 340 of determining a semantic classification for each determined information region or sub-region. Any suitable classifier may be used to determine the semantic classification of each region or sub-region. For example, in a first information region comprising multiple objects of interest, each sub-region of the first information region may have a classifier applied to determine the class of objects present within each region. The classifier may comprise any suitable classification algorithm, such as neural network, decision forest, and logistic regression algorithms, however other algorithms will be envisaged. The determined semantic classification may be compared to an existing dataset of semantic classifications to identify inconsistencies or anomalies in the determined semantic classification. In this way, the determined semantic classification may be validated based on the historical context of previous classifications. For example, the semantic classification of a parking sign rule may be compared against a dataset of existing semantic classifications for parking signs in order to determine a valid classification. This prevents errors arising in the semantic classification due to noise (such as poor quality images, objects blocking the view of a sign, graffiti on signs, etc.) or algorithmic error. In some embodiments, a confidence score may also be produced by the classifier which is indicative of the degree of certainty or error by the classifier. This confidence score may reflect the quality of the input. For example, when applied to parking signs, the confidence score may reflect the physical quality of the parking sign, which may be useful for physical sign maintenance purposes. The classifier may have been co-trained with the feature detection algorithms. The determined semantic classification may also be associated with the corresponding node, to provide semantic meaning to each node in the hierarchy.
Further image processing may be applied to each determined information region or sub-region before classification, such as image rectification, cropping, rotation, warping, hue/saturation alterations, etc. The semantic classification may be computer-parsable.
The method 300 may also comprise the step 335 of iteratively determining one or more further sub-regions wholly contained within one or more previously determined sub-regions, and associating each further determined sub-region with a further sub-node of the hierarchy, wherein each further sub-node is a child to the corresponding previously determined parent sub-node. In this way, a hierarchy comprising multiple levels of nodes may be provided. The number of iterations to determine sub-regions may be predetermined, or may be algorithmically determined. As before, semantic classifiers may be applied to each determined sub-region and associated with the corresponding node in the hierarchy. Each iteration of the step 335 may apply a different feature detection algorithm. In this way, each layer of nodes in the hierarchy may correspond to a different class of detected feature. Each iteration of the step 335 may also apply a different semantic classifier.
Training of feature detection algorithms may be performed by specifying a structure of hierarchy to be output from the method 300, i.e. in a supervised manner. Training of the feature detection algorithms may further involve segmenting each image to explicitly indicate the position and/or sizes of each information region. Training the feature detection algorithms may comprise training from a collective dataset indicative of all the features to be determined, or each feature detection algorithm may be trained on a different dataset respectively.
One particular issue with current methods of parsing images and text is that when an image processing algorithm is applied incorrectly, for example due to poor quality training data, the resulting output is often uninterpretable. This makes it difficult to understand why an image processing algorithm made a particular output decision. In contrast, the invention as disclosed advantageously allows for clearer debugging of the image processing method, as each node of the predicted hierarchy (and associated semantic classification) may be compared directly with the respective node of the correct hierarchy, allowing for more granular validation of the image processing algorithm. Other advantages include the ability to deduce errors in the training stage by directly comparing predicted hierarchies with correct hierarchies.
Finally, the method 300 comprises the step 350 of outputting data indicative of the hierarchy. The data indicative of the hierarchy may comprise data relating to one or more of the size, shape, contents, and semantic classification associated with each node, as well as the specific parent-child relationships between each node. The data indicative of the hierarchy may be formatted in any suitable data structure operable to the processing means 220, such as a tree, linked list, relational table, or otherwise.
Once data indicative of a hierarchy is output, it may be stored in a data storage means such that the data may be retrieved for further processing in future. As noted, the data indicative of the hierarchy may comprise any suitable data structure operable to the processing means 220. The data structure may be arranged such that individual nodes and associated content may be edited, removed, or added to the hierarchy whilst preserving the remaining hierarchy structure.
For example, an individual node of the hierarchy may be updated upon request from the processing means 220 to a different semantic classification. In some instances, an explicit instruction may be received by the processing means 220 to update, remove, overwrite, overrule, or add a node to an existing hierarchy. For example, the processing means 220 may receive an electronic instruction to modify the parking times associated with a parking sign that has been parsed as a hierarchy. The electronic instruction may be received over a wired or wireless network from an external device, such as a computing server or computing device. In these instances, the semantic classification contained within the node corresponding to the parking time associated with that parking sign will be updated. In some instances, the method 300 may be applied to a second input, and may output data indicative of a hierarchy that is different to a previously determined hierarchy. For example, the method 300 may be re-applied to an image of a parking sign that has already been parsed. The differences between semantic classifications contained in each of the nodes of the previously determined hierarchy and the newly determined hierarchy may be identified, and the previous hierarchy may be updated to include the new semantic classifications. In some embodiments, the second input may comprise an image, however other forms of input may be envisaged.
In some embodiments, multiple related images comprising identically located information regions having differing semantic information may be received by the method 300. For example, multiple images of varying states of a digital parking sign may be received, wherein each image contains information regions having different semantic information, reflecting the changing nature of the digital parking sign. For example, a digital parking sign may show 'No parking' on a Monday between 9am and 5pm. However, outside of these times, the digital parking sign may show '1 hour parking - no return within 2 hours'. In example scenarios such as these, multiple parent nodes may be created for the hierarchy with respect to each set of semantic information.
Figure 4 shows an exemplary application 400 of the method 300 applied to the image of the parking sign 100 shown in Figure 1. Figure 5 similarly shows the hierarchy 500 determined from 400. The method 300 is particularly useful for processing parking signs, as they are naturally arranged as a hierarchy of rules, and therefore by explicitly analysing the grouping of components of the sign and retaining the associated hierarchy, the semantic understanding of the sign is not lost. However, it will be appreciated that other forms of input may be used, such as road markings, signalling equipment, etc. Examples of traffic-related inputs for which the present invention may be applied to include images of parking signs, warning signs, regulatory signs, speed limit signs, low bridge signs, level crossing signs and signals, train signs, signals and road markings, bus and cycle signs and road markings, pedestrian zone signs, traffic calming signs, motorway signs, signals, and road markings, directional signs, information signs, traffic signals, road work signs, and others.
Applying the step 320 to the parking sign 100 provides the first information region 410. The first information region 410 determined corresponds to the entire area of the sign itself, which therefore comprises the parent node 510 of the hierarchy 500. In this example, a feature detection algorithm trained to detect signs was used to determine the first information region 410. As can be seen, the information region 410 comprises multiple semantic objects of interest.
Further image processing is then applied to the first information region 410 to provide the image 420. In this example, warping is applied to the image corresponding to the determined region to ensure the determined sign is fronto-parallel. The warping is performed by regressing a two-dimensional offset applied to each corner of the determined image region corresponding to the sign.
One or more information sub-regions 422, 424, 426 wholly contained within the processed first information region 420 are determined, in accordance with step 330. In this example, feature detection algorithms trained to detect individual sections of a sign were used to determine the information sub-regions 422, 424, 426. As per step 330, each of the determined sub-regions is associated with a sub-node of the hierarchy, as shown by sub-nodes 520, 530, 540 in Figure 5. Each of the sub-nodes 520, 530 and 540 is a child to the parent node 510, and the parent-child relationships between each node can clearly be seen.
One or more further information sub-regions wholly contained within each of the previously determined sub-regions are then determined, as per step 335. Further sub-regions 432, 434, 436, 438 for example have been determined from sub-region 422. In this example, feature detection algorithms trained to detect individual rules and times were used to determine the further sub-regions 432, 434, 436, 438. It should also be noted that the region 440 corresponding to the instructional message 'Display ticket' has not been determined as a sub-region, due to the feature detection algorithm being selected to ignore this type of region. As per step 335, each of the determined further sub-regions is associated with a sub-node of the hierarchy, as shown for example by sub-nodes 522, 524, 526, 528 in Figure 5. Each of the sub-nodes 522, 524, 526, 528 is a child to the parent sub-node 520, and the parent-child relationships between each node can clearly be seen.
In Figure 5 it can be seen that a semantic classification of each of the further sub-regions 432, 434, 436 has been performed and the resulting classification has been associated with the corresponding sub-nodes 522, 524, 528, such that each of the sub-nodes has been categorised into 'Rule', 'Applies weekdays', 'Applies times', and 'Return period' categories. The semantic classification may therefore comprise a determined rule or time period relating to the state of the parking space, and the determined hierarchy is indicative of the times the rules are applicable. Figure 5 therefore shows a resulting hierarchy, wherein each node corresponds to a region or object of interest. As can be seen, the spatial/positional associations of each rule and time have advantageously been retained in the hierarchical structure, such that the semantic understanding of the parking sign 100 has not been lost.
Figure 6 shows an example of the semantic classifiers applied to some of the determined sub-regions, wherein each semantic classification is performed using a classification algorithm trained to return a relevant rule or time category. 610, 620, and 630 show an example application of 'Day', 'Rule', and 'Permit code' semantic classifiers using Convolutional Neural Networks and OCR parsers. The resulting semantic classifications 640 are output in a computer-readable format. Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term 'comprising' does not exclude the presence of other elements or steps.
Furthermore, the order of features in the claims does not imply any specific order in which the features must be performed and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus, references to 'a', 'an', 'first', 'second', etc. do not preclude a plurality. In the claims, the term 'comprising' or 'including' does not exclude the presence of other elements.

Claims

CLAIMS
What is claimed is:
1. A method for parsing parking signs, comprising:
receiving image data representing an image;
processing the image data to determine a first information region in the image and associating the first information region with a parent node of a hierarchy;
processing the image data to determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node; and
outputting data indicative of the hierarchy.
2. The method of claim 1, comprising iteratively determining one or more further sub-regions wholly contained within one or more previously determined sub-regions and associating each further determined sub-region with a further sub-node of the hierarchy, wherein each further sub-node is a child to the corresponding previously determined parent sub-node.
3. The method of any preceding claim, wherein determining a first information region or sub-region of the image comprises using one or more feature detection algorithms.
4. The method of claim 3, wherein using the feature detection algorithm comprises using one or more of Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded up Robust Feature (SURF), Haar-like features, or a neural network.
5. The method of any preceding claim, comprising determining a semantic classification of each information region or sub-region and associating the semantic classification with the corresponding node or sub-node.
6. The method of claim 5, wherein determining a semantic classification of each information region or sub-region comprises using a classification algorithm.
7. The method of claim 6, wherein using the classification algorithm comprises using one or more of a neural network, decision forest, or logistic regression algorithm.
8. The method of any preceding claim, comprising using a non-maximal suppression method to prevent overlap of determined sub-regions.
9. The method of any of claims 5 to 8, comprising co-training the classification algorithm with the feature detection algorithm.
10. The method of any of claims 5 to 9, comprising training one or more of the feature detection and classification algorithms using data indicative of a predicted hierarchy.
11. The method of any of claims 5 to 10, comprising training one or more of the feature detection and classification algorithms using data indicative of a position of one or more information regions.
12. The method of any preceding claim, wherein determined information regions are of different sizes.
13. The method of any preceding claim, comprising applying one or more image processing operations to the image data.
14. The method of claim 13, wherein the image processing operation comprises one or more of image rectification, cropping, rotation, warping, hue alterations, saturation alterations, contrast alterations, de-noising, sharpening, and blurring.
15. The method of claim 13 or 14, wherein the one or more image processing operations are based on one or more algorithmically determined parameter values of the image data.
16. The method of claim 15, wherein the determined parameter values comprise one or more of an image skewness, distortion, hue, saturation, contrast, brightness, noise level, sharpness, and blur level.
17. The method of any preceding claim, wherein the data indicative of the hierarchy comprises a tree, linked list, or relational table data structure.
18. The method of any of claims 5 to 17, wherein the semantic classification is compared against a dataset of existing semantic classifications to determine a valid classification.
19. The method of claim 5, further comprising determining a confidence score indicative of the degree of certainty or error of the semantic classification.
20. A system for parsing parking signs, comprising:
input means arranged to receive image data representing an image;
processing means arranged to: determine a first information region in the image and associate the first information region with a parent node of a hierarchy,
determine one or more information sub-regions wholly contained within the first information region and associating each determined sub-region with a sub-node of the hierarchy, wherein each sub-node is a child to the parent node; and
output means arranged to output data indicative of the hierarchy.
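The processing step of claim 20 can be illustrated with a short sketch. This is not the patented implementation; it merely shows one way, under assumed tuple-based bounding boxes, to nest each detected region under the smallest previously placed region that wholly contains it, yielding the parent-node/sub-node hierarchy the claim describes.

```python
# Hypothetical sketch of claim 20's processing means: given bounding
# boxes detected in a parking-sign image, nest each box under the
# smallest already-placed box that wholly contains it.

def contains(outer, inner):
    """True if `inner` is wholly contained within `outer` (x, y, w, h boxes)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_hierarchy(boxes):
    """Return nested dicts {'bbox': box, 'children': [...]}; largest box is the root."""
    boxes = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)
    root = {"bbox": boxes[0], "children": []}
    placed = [root]
    for box in boxes[1:]:
        node = {"bbox": box, "children": []}
        # attach to the smallest node that wholly contains this box
        containers = [n for n in placed if contains(n["bbox"], box)]
        parent = min(containers, key=lambda n: n["bbox"][2] * n["bbox"][3])
        parent["children"].append(node)
        placed.append(node)
    return root

sign = build_hierarchy([
    (0, 0, 100, 200),    # whole sign (parent node of the hierarchy)
    (10, 10, 80, 40),    # first information sub-region
    (10, 60, 80, 40),    # second information sub-region
    (15, 65, 20, 10),    # region wholly contained in the second sub-region
])
```

The resulting nested structure is the "data indicative of the hierarchy" that the output means would emit.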
PCT/EP2018/083023 2017-11-29 2018-11-29 Hierarchical image interpretation system WO2019106095A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1719862.3A GB201719862D0 (en) 2017-11-29 2017-11-29 Hierarchical image interpretation system
GB1719862.3 2017-11-29

Publications (1)

Publication Number Publication Date
WO2019106095A1 true WO2019106095A1 (en) 2019-06-06

Family

ID=60950761

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/083023 WO2019106095A1 (en) 2017-11-29 2018-11-29 Hierarchical image interpretation system

Country Status (2)

Country Link
GB (2) GB201719862D0 (en)
WO (1) WO2019106095A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110052063A1 (en) * 2009-08-25 2011-03-03 Xerox Corporation Consistent hierarchical labeling of image and image regions

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5680479A (en) * 1992-04-24 1997-10-21 Canon Kabushiki Kaisha Method and apparatus for character recognition
US5841900A (en) * 1996-01-11 1998-11-24 Xerox Corporation Method for graph-based table recognition
US8374390B2 (en) * 2009-06-24 2013-02-12 Navteq B.V. Generating a graphic model of a geographic object and systems thereof
DE102010020330A1 (en) * 2010-05-14 2011-11-17 Conti Temic Microelectronic Gmbh Method for detecting traffic signs
US20140132767A1 (en) * 2010-07-31 2014-05-15 Eric Sonnabend Parking Information Collection System and Method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IASONAS KOKKINOS ET AL: "HOP: Hierarchical object parsing", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. PROCEEDINGS, 1 June 2009 (2009-06-01), US, pages 802 - 809, XP055557199, ISSN: 1063-6919, DOI: 10.1109/CVPR.2009.5206639 *
MÍRIAM BELLVER BUENO ET AL: "Hierarchical Object Detection with Deep Reinforcement Learning", 25 November 2016 (2016-11-25), XP055557203, Retrieved from the Internet <URL:https://arxiv.org/pdf/1611.03718.pdf> [retrieved on 20190214] *
XU YONGCHAO ET AL: "Hierarchical Segmentation Using Tree-Based Shape Spaces", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE COMPUTER SOCIETY, USA, vol. 39, no. 3, 1 March 2017 (2017-03-01), pages 457 - 469, XP011640257, ISSN: 0162-8828, [retrieved on 20170203], DOI: 10.1109/TPAMI.2016.2554550 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021046153A1 (en) * 2019-09-04 2021-03-11 Material Technologies Corporation Object feature visualization apparatus and methods
US11503256B2 (en) 2019-09-04 2022-11-15 Material Technologies Corporation Object feature visualization apparatus and methods
US11622096B2 (en) 2019-09-04 2023-04-04 Material Technologies Corporation Object feature visualization apparatus and methods
US11683459B2 (en) 2019-09-04 2023-06-20 Material Technologies Corporation Object feature visualization apparatus and methods
US11681751B2 (en) 2019-09-04 2023-06-20 Material Technologies Corporation Object feature visualization apparatus and methods
US12020353B2 (en) 2019-09-04 2024-06-25 Material Technologies Corporation Object feature visualization apparatus and methods
FR3113432A1 (en) 2020-08-12 2022-02-18 Thibault Autheman AUTOMATIC IMAGE CLASSIFICATION PROCESS
CN115269107A (en) * 2022-09-30 2022-11-01 北京弘玑信息技术有限公司 Method, medium and electronic device for processing interface image

Also Published As

Publication number Publication date
GB2570762A (en) 2019-08-07
GB201719862D0 (en) 2018-01-10
GB201819450D0 (en) 2019-01-16

Similar Documents

Publication Publication Date Title
WO2019106095A1 (en) Hierarchical image interpretation system
US10572725B1 (en) Form image field extraction
US10755149B2 (en) Zero shot machine vision system via joint sparse representations
US10423827B1 (en) Image text recognition
Siriborvornratanakul An automatic road distress visual inspection system using an onboard in‐car camera
US20190026550A1 (en) Semantic page segmentation of vector graphics documents
EP3806064A1 (en) Method and apparatus for detecting parking space usage condition, electronic device, and storage medium
CN112200081A (en) Abnormal behavior identification method and device, electronic equipment and storage medium
CN111797886B (en) Generating training data for OCR for neural networks by parsing PDL files
WO2019055114A1 (en) Attribute aware zero shot machine vision system via joint sparse representations
US10699751B1 (en) Method, system and device for fitting target object in video frame
CN115034200A (en) Drawing information extraction method and device, electronic equipment and storage medium
US20220358747A1 (en) Method and Generator for Generating Disturbed Input Data for a Neural Network
US20220266854A1 (en) Method for Operating a Driver Assistance System of a Vehicle and Driver Assistance System for a Vehicle
CN112699711B (en) Lane line detection method and device, storage medium and electronic equipment
US20230386221A1 (en) Method for detecting road conditions and electronic device
CN110909674A (en) Traffic sign identification method, device, equipment and storage medium
CN116704542A (en) Layer classification method, device, equipment and storage medium
CN111783812A (en) Method and device for identifying forbidden images and computer readable storage medium
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN117392577A (en) Behavior recognition method for judicial video scene, storage medium and electronic device
CN112232335A (en) Determination of distribution and/or sorting information for the automated distribution and/or sorting of mailpieces
CN111340139A (en) Method and device for judging complexity of image content
EP3709666A1 (en) Method for fitting target object in video frame, system, and device
CN116246161A (en) Method and device for identifying target fine type of remote sensing image under guidance of domain knowledge

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18814824

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18814824

Country of ref document: EP

Kind code of ref document: A1