CN114202761B - Information batch extraction method based on picture information clustering

Info

Publication number: CN114202761B
Application number: CN202210140562.0A
Authority: CN
Inventors: 纪俊光; 黎慧燕; 陈学言
Original assignee: Guangdong Shuyuan Zhihui Technology Co ltd
Current assignee: Guangdong Shuyuan Zhihui Technology Co ltd
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication date: 2022-06-21
Anticipated expiration: 2042-02-16
Also published as: CN114202761A

Abstract

The invention discloses a method, a system, and a computer-readable storage medium for batch information extraction based on picture information clustering. The method comprises the following steps: extracting commodity objects and character objects from the image to be recognized, classifying and numbering them, and determining a coordinate system for each object; dotting the edges of the extracted objects and determining the coordinates of the resulting edge points; performing collision calculation on adjacent objects of different types by using the edge points, and taking the two objects as a combined object if the distance between their edge points is smaller than a preset value; and continuing to perform collision calculation between the combined object and other objects, wherein, if the distance between the edge points is greater than a preset multiple of the distance between the edge points of the current combined object, the other object is judged not to belong to the same combination, and collision calculation continues with other objects of different types until all objects are combined, after which the combined objects are output. The invention enables associated objects in a complex background to be recognized as combinations and their information to be extracted.

Description

Information batch extraction method based on picture information clustering

Technical Field

The invention relates to the technical field of intelligent processing of internet big data, in particular to a method and a system for extracting information in batch based on picture information clustering and a computer-readable storage medium.

Background

OCR is a technology that Internet companies commonly use to recognize image and text information. It acquires the characters and pictures on paper through an optical input device such as a scanner or a camera, analyzes the morphological structure of the characters with pattern recognition algorithms to form character feature descriptions, and converts the characters in the image into text by a suitable character matching method.

OCR is a practical and efficient way to analyze large numbers of pictures in big-data applications. However, conventional recognition usually scans for a single kind of information: the recognized characters are treated as isolated individuals, there is no function for recognizing combined content, and the characters are processed block by block. As a result, the isolated characters that are recognized often fail to convey the real semantics of the object they describe.

The prior art discloses a method and a device for identifying an object in an image, wherein the method comprises the following steps: preprocessing an image to be recognized to obtain a binary image of the image to be recognized; cutting the binary image into a plurality of sub-regions, and selecting a first sub-region from the plurality of sub-regions, wherein the first sub-region is a sub-region containing preset pixels; combining the first sub-regions to obtain at least one second sub-region based on the distances of different first sub-regions in the binary image; identifying a target object in the second sub-region. The scheme aims at object recognition in a complex background, and does not solve the problem of recognition of associated objects or combined objects.

Disclosure of Invention

The invention provides a method and a system for extracting information in batches based on picture information clustering and a computer-readable storage medium, aiming at overcoming the defect that the existing picture information extraction method does not solve the identification and extraction of associated objects or combined objects.

The primary objective of the present invention is to solve the above technical problems, and the technical solution of the present invention is as follows:

In a first aspect, the invention provides a method for extracting information in batches based on picture information clustering, which comprises the following steps:

S1: extracting commodity objects and character objects from the images to be recognized by using an OCR recognition method, classifying and numbering the commodity objects and the character objects, taking the objects in each image as independent objects, and determining a coordinate system of each object;

S2: dotting the edges of all the commodity objects and the character objects in each image, marking the dotted points as edge points, and determining the coordinates of the edge points according to the coordinate system of each object;

S3: performing collision calculation on adjacent objects of different types by using the edge points, and taking the two objects as a combined object if the distance between the edge points of the two adjacent objects of different types is smaller than a preset value;

S4: continuing to perform collision calculation between the two combined objects and other objects of different types; if the distance between the edge points is greater than a preset multiple of the distance between the edge points of the currently combined objects, judging that the other object does not belong to the same combination, and continuing to search for other objects of different types for collision calculation until all the objects are combined; and outputting the combined objects.

Further, in step S1, the OCR recognition method is used to extract the commodity object and the character object from the image to be recognized from left to right and from top to bottom.
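The classification and numbering of step S1 can be illustrated with a minimal Python sketch. It assumes that detection has already produced a class label and a bounding box for each object; the DetectedObject type, the number_objects function, the row_tolerance parameter and the "commodity-001" number format are hypothetical names chosen for illustration and do not come from the invention.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    kind: str          # "commodity" or "text" (assumed class labels)
    box: tuple         # (left, top, right, bottom) in image pixels
    number: str = ""   # filled in by number_objects, e.g. "commodity-001"


def number_objects(objects, row_tolerance=20):
    """Number objects per class in top-to-bottom, left-to-right order.

    Objects whose top edges fall in the same band of row_tolerance pixels
    are treated as one row; this simple banding is an assumption.
    """
    ordered = sorted(objects, key=lambda o: (o.box[1] // row_tolerance, o.box[0]))
    counters = {}
    for obj in ordered:
        counters[obj.kind] = counters.get(obj.kind, 0) + 1
        obj.number = f"{obj.kind}-{counters[obj.kind]:03d}"
    return ordered


detected = [
    DetectedObject("text", (200, 110, 300, 140)),
    DetectedObject("commodity", (200, 0, 300, 100)),
    DetectedObject("commodity", (0, 5, 100, 100)),
    DetectedObject("text", (0, 112, 100, 140)),
]
numbered = number_objects(detected)
```

Each class keeps its own counter, so the first commodity object and the first character object in scanning order both receive the index 001.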

Further, the specific process of dotting the edges of all the commodity objects and the character objects in each image is as follows:

determining the object to be dotted; first, taking the 4 points that lie farthest out toward the upper left corner, the upper right corner, the lower left corner and the lower right corner of the object, and connecting the four points to form an irregular rectangle;

then taking the midpoint between the upper left and upper right points, between the lower left and lower right points, between the upper left and lower left points, and between the upper right and lower right points, thereby determining 4 further points, namely the upper, lower, left and right points.
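The dotting procedure above yields eight edge points per object: four corner points and four side midpoints. A minimal Python sketch follows, assuming the four corner points have already been located; the function names and the dictionary keys are hypothetical.

```python
def midpoint(a, b):
    """Return the point halfway between points a and b."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def edge_points(top_left, top_right, bottom_left, bottom_right):
    """Return the eight edge points of one object as a dictionary."""
    return {
        "top_left": top_left,
        "top_right": top_right,
        "bottom_left": bottom_left,
        "bottom_right": bottom_right,
        "top": midpoint(top_left, top_right),
        "bottom": midpoint(bottom_left, bottom_right),
        "left": midpoint(top_left, bottom_left),
        "right": midpoint(top_right, bottom_right),
    }


# Coordinates are in the object's own coordinate system (assumed pixel units).
points = edge_points((0, 0), (100, 0), (0, 50), (100, 50))
```

Because the midpoints are computed from the corner points rather than from an axis-aligned box, the sketch also works when the four corners form an irregular quadrilateral.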

Further, the collision calculation process is as follows:

the adjacent edge points of the two objects are denoted P1 and P2, with coordinates (x1, y1) and (x2, y2) respectively; the distance between the two objects along the x-axis is then |x2 - x1|.
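As a worked example of the distance expression, the following Python sketch computes the x-axis distance between two edge points. The y-axis counterpart is an assumption added by analogy for vertically stacked objects; it is not stated in the description, and the sample coordinates are invented.

```python
def x_distance(p1, p2):
    """Distance between two edge points along the x-axis: |x2 - x1|."""
    return abs(p2[0] - p1[0])


def y_distance(p1, p2):
    """Assumed y-axis analogue: |y2 - y1| (not stated in the description)."""
    return abs(p2[1] - p1[1])


p1 = (120, 300)   # edge point P1 = (x1, y1) of the first object
p2 = (135, 310)   # edge point P2 = (x2, y2) of the second object
dx = x_distance(p1, p2)   # |135 - 120| = 15
dy = y_distance(p1, p2)   # |310 - 300| = 10
```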

Further, the preset multiple of step S4 is greater than or equal to 2.

Further, in step S4, when the collision calculation is continued to search for other objects of different types, if no valid data is identified, the current process is ended and the combined object is output.

Further, the collision calculation is only performed between different types of objects.
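The restriction to objects of different types can be sketched in Python as follows. The sketch assumes each object is given as a (number, kind, edge points) tuple and, as a further assumption, measures the gap between two objects as the smallest Euclidean distance between any of their edge points; the description itself only gives the x-axis distance. The names min_edge_distance and pair_objects are hypothetical.

```python
import math
from itertools import product


def min_edge_distance(points_a, points_b):
    """Smallest Euclidean distance between any edge point of a and of b."""
    return min(math.dist(p, q) for p, q in product(points_a, points_b))


def pair_objects(objects, preset):
    """Return (number_a, number_b) for objects of different kinds whose
    edge-point distance is below the preset value."""
    pairs = []
    for i, (num_a, kind_a, pts_a) in enumerate(objects):
        for num_b, kind_b, pts_b in objects[i + 1:]:
            if kind_a == kind_b:
                continue  # same type: no collision calculation
            if min_edge_distance(pts_a, pts_b) < preset:
                pairs.append((num_a, num_b))
    return pairs


objects = [
    ("commodity-001", "commodity", [(0, 100), (50, 100), (100, 100)]),
    ("commodity-002", "commodity", [(300, 100), (350, 100), (400, 100)]),
    ("text-001", "text", [(0, 110), (50, 110), (100, 110)]),
    ("text-002", "text", [(300, 112), (350, 112), (400, 112)]),
]
pairs = pair_objects(objects, preset=20)
```

In this invented layout each commodity picture is paired with the text directly beneath it, and the two commodity pictures are never compared with each other.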

In a second aspect, the invention provides a system for extracting information in batches based on picture information clustering. The system comprises a memory and a processor, wherein the memory stores a program of the information batch extraction method based on picture information clustering, and when the program is executed by the processor, the following steps are implemented:

S1: extracting commodity objects and character objects from the images to be recognized by using an OCR recognition method, classifying and numbering the commodity objects and the character objects, taking the objects in each image as independent objects, and determining a coordinate system of each object;

The third aspect of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a program of a batch information extraction method based on picture information clustering, and when the program of the batch information extraction method based on picture information clustering is executed by a processor, the steps of the batch information extraction method based on picture information clustering are implemented.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

according to the method, different objects in the picture are firstly identified and classified, then distance calculation is carried out on different independent objects, and then the different objects are combined.

Drawings

FIG. 1 is a flow chart of an information batch extraction method based on image information clustering according to the present invention.

FIG. 2 is a diagram of the recognition effect according to the embodiment of the present invention.

FIG. 3 is a diagram illustrating neighboring points of different objects according to an embodiment of the present invention.

FIG. 4 is a schematic diagram illustrating matching between neighboring points of different objects according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of the collision calculation of neighboring points of the combined object according to the embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

Example 1

As shown in fig. 1, a first aspect of the present invention provides a batch information extraction method based on image information clustering, including the following steps:

In a specific embodiment, consider a detail picture from a sales promotion advertisement. The picture contains several mobile phone images and the corresponding commodity prices, with each image stacked above its text and the image-plus-text groups arranged from top to bottom and from left to right. As shown in FIG. 2, the commodity name and the commodity price are placed below each mobile phone image, so each mobile phone image must be recognized together with its name and price as one combination.

First, commodity objects and character objects are extracted from the image to be recognized. An OCR recognition method can scan the image from left to right and from top to bottom and extract the commodity objects and the character objects separately, for example commodity object 001, character object 001, and so on. The objects in each image are taken as independent objects, and a coordinate system is determined for each object.

In a specific embodiment, after the classified objects are obtained and the coordinate system of each object is determined, the edge of each object must be dotted; for example, eight points may be marked, as follows. The object to be dotted is determined; first, the 4 points that lie farthest out toward the upper left corner, the upper right corner, the lower left corner and the lower right corner of the object are taken, and the four points are connected to form an irregular rectangle.

Then the midpoint between the upper left and upper right points, between the lower left and lower right points, between the upper left and lower left points, and between the upper right and lower right points is taken, which determines 4 further points, namely the upper, lower, left and right points.

It should be noted that each of the marked points can be used to determine a coordinate position according to the coordinate system of the object for the subsequent calculation of the distance between the objects.

In one specific embodiment, the collision calculation proceeds as follows: the adjacent edge points of the two objects are denoted P1 and P2, with coordinates (x1, y1) and (x2, y2) respectively, and the distance between the two objects along the x-axis is |x2 - x1|. It should be noted that the collision calculation is performed only between objects of different types. For example, commodity object 1 and character object 1 are compared point to point, but no collision calculation is performed between commodity object 1 and commodity object 2. As shown in FIG. 3, the three lower edge points of the commodity picture (namely the commodity object), that is, lower left edge point 1, lower edge point 2 and lower right edge point 3, are compared with the three upper edge points of the character object, that is, upper left edge point 4, upper edge point 5 and upper right edge point 6. When the distance between commodity picture 1 and the character object is smaller than the preset value, the two are set as a combined object.

It should be noted that, in the above processing method, the collision calculation is carried out only when two or more edge points are close to each other. There is no fixed matching partner between edge points; the edge points are used only for the calculation, and the edge points that are closest to each other are taken for each calculation. As shown in FIG. 4, edge points 2 and 3 of the commodity object (namely the commodity picture) are matched in the calculation with edge points 6 and 7 of the character object below it.
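The nearest-point matching described above can be sketched in Python as follows. The sketch reads "two or more edge points are close" as "at least two nearest-point pairs lie closer than the preset value", which is an interpretation, and it uses Euclidean distance as an assumption. The names close_point_pairs, should_combine and min_pairs are hypothetical.

```python
import math


def close_point_pairs(points_a, points_b, preset):
    """For each edge point of a, take the nearest edge point of b and keep
    the pair if the two points are closer than the preset value."""
    pairs = []
    for p in points_a:
        q = min(points_b, key=lambda c: math.dist(p, c))
        if math.dist(p, q) < preset:
            pairs.append((p, q))
    return pairs


def should_combine(points_a, points_b, preset, min_pairs=2):
    """Combine two objects only if enough edge-point pairs are close."""
    return len(close_point_pairs(points_a, points_b, preset)) >= min_pairs


commodity_bottom = [(0, 100), (50, 100), (100, 100)]
text_below = [(0, 110), (50, 110), (100, 110)]      # aligned under the picture
text_offset = [(95, 110), (145, 110), (195, 110)]   # touches only one corner

aligned = should_combine(commodity_bottom, text_below, preset=20)
offset = should_combine(commodity_bottom, text_offset, preset=20)
```

Requiring at least two close pairs keeps a character object that merely touches one corner of a commodity picture from being combined with it.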

In a specific embodiment, as shown in FIG. 5, after a combined object is obtained, collision calculation continues between the two combined objects and other objects of different types. If, in this collision calculation, the distance between the edge points of the two objects is greater than a preset multiple of the distance between the edge points of the currently combined objects, for example a multiple of 2 or more, the other object is judged not to belong to the same combination. The search for other objects of different types continues until all the objects are combined, and the combined objects are then output, that is, the recognized text information is output. It should be noted that, if no valid data is identified while searching for other objects of different types, the current process ends and the combined object is output.
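The preset-multiple test of step S4 can be sketched in Python as follows. The sketch assumes that the combined object is represented by the pooled edge points of its members and that group_gap holds the edge-point distance measured when the current pair was combined; both representations, the Euclidean distance, and the names extend_group, group_gap and multiple are assumptions for illustration.

```python
import math
from itertools import product


def min_edge_distance(points_a, points_b):
    """Smallest Euclidean distance between any edge point of a and of b."""
    return min(math.dist(p, q) for p, q in product(points_a, points_b))


def extend_group(group_points, group_gap, candidates, multiple=2.0):
    """Test each candidate (number, edge_points) against the combined object.

    A candidate farther away than multiple * group_gap is rejected;
    otherwise it joins the combination and contributes its edge points.
    """
    members, rejected = [], []
    for number, pts in candidates:
        if min_edge_distance(group_points, pts) > multiple * group_gap:
            rejected.append(number)
        else:
            members.append(number)
            group_points = group_points + pts
    return members, rejected


# Picture bottom edge at y=100, name text spanning y=110 to y=140 (invented).
combined = [(0, 100), (100, 100), (0, 110), (100, 110), (0, 140), (100, 140)]
candidates = [
    ("text-002", [(0, 150), (100, 150)]),   # price line just below the name
    ("text-003", [(0, 300), (100, 300)]),   # text of another product
]
members, rejected = extend_group(combined, group_gap=10, candidates=candidates)
```

With a gap of 10 and a multiple of 2, the price line at distance 10 joins the combination, while the text at distance 150 is left for another combination.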

Example 2

In a specific embodiment, consider a detail picture from a sales promotion advertisement for a product. The picture contains several mobile phone images and the corresponding product prices, with each image stacked above its text and the image-plus-text groups arranged from left to right. As shown in FIG. 2, the product name and the product price are placed below each mobile phone image, so each mobile phone image must be recognized together with its name and price as one combination.

In a specific embodiment, after the classified objects are obtained and the coordinate system of each object is determined, the edge of each object must be dotted; for example, eight points may be marked, as follows: the object to be dotted is determined; first, the 4 points that lie farthest out toward the upper left corner, the upper right corner, the lower left corner and the lower right corner of the object are taken, and the four points are connected to form an irregular rectangle;

In one specific embodiment, the collision calculation proceeds as follows: the adjacent edge points of the two objects are denoted P1 and P2, with coordinates (x1, y1) and (x2, y2) respectively, and the distance between the two objects along the x-axis is |x2 - x1|. It should be noted that the collision calculation is performed only between objects of different types. For example, commodity object 1 and character object 1 are compared point to point, but no collision calculation is performed between commodity object 1 and commodity object 2. As shown in FIG. 3, the three lower edge points of the commodity object (the commodity picture), that is, lower left edge point 1, lower edge point 2 and lower right edge point 3, are compared with the three upper edge points of the character object, that is, upper left edge point 4, upper edge point 5 and upper right edge point 6. When the distance between commodity picture 1 and the character object is smaller than the preset value, the two are set as a combined object.

It should be noted that, in the above processing method, the collision calculation is carried out only when two or more edge points are close to each other. There is no fixed matching partner between edge points; the edge points are used only for the calculation, and the edge points that are closest to each other are taken for each calculation. As shown in FIG. 4, edge points 2 and 3 of the commodity object (the commodity picture) are matched in the calculation with edge points 6 and 7 of the character object below it.

In a specific embodiment, after the combined object is obtained, collision calculation continues between the two combined objects and other objects of different types. If, in this collision calculation, the distance between the edge points of the two objects is greater than a preset multiple of the distance between the edge points of the currently combined objects, for example a multiple of 2 or more, the other object is judged not to belong to the same combination. The search for other objects of different types continues until all the objects are combined, and the combined objects are then output, that is, the recognized text information is output. It should be noted that, if no valid data is identified while searching for other objects of different types, the current process ends and the combined object is output.

Example 3

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. An information batch extraction method based on picture information clustering is characterized by comprising the following steps:

the specific process of dotting the edges of all the commodity objects and the character objects in each image is as follows: determining the object to be dotted; first, taking the 4 points that lie farthest out toward the upper left corner, the upper right corner, the lower left corner and the lower right corner of the object, and connecting the four points to form an irregular rectangle;

then taking the midpoint between the upper left and upper right points, between the lower left and lower right points, between the upper left and lower left points, and between the upper right and lower right points, thereby determining 4 further points, namely the upper, lower, left and right points;

the collision calculation process comprises the following steps:

the adjacent edge points of the two objects are denoted P1 and P2, with coordinates (x1, y1) and (x2, y2) respectively, and the distance between the two objects along the x-axis is then |x2 - x1|;

S4: continuing to perform collision calculation between the two combined objects and other objects of different types; if the distance between the edge points is greater than a preset multiple of the distance between the edge points of the currently combined objects, judging that the other object subjected to the collision calculation does not belong to the same combination, and continuing to search for other objects of different types for collision calculation until all the objects are combined; and outputting the combined objects.

2. The batch information extraction method based on picture information clustering of claim 1, wherein in step S1, the OCR recognition method is used to extract commodity objects and character objects from the image to be recognized from left to right and from top to bottom.

3. The batch extraction method of information based on picture information clustering of claim 1, wherein the preset multiple of step S4 is greater than or equal to 2.

4. The batch extraction method of information based on image information clustering of claim 1, wherein in step S4, when continuing to search for other objects of different classes for collision calculation, if no valid data is identified, the current process is also ended and the combined object is output.

5. The batch extraction method for information based on picture information clustering according to claim 1, characterized in that the collision calculation is only performed between different types of objects.

6. An information batch extraction system based on picture information clustering, characterized in that the system comprises a memory and a processor, wherein the memory stores a program of the information batch extraction method based on picture information clustering, and when the program is executed by the processor, the following steps are implemented:

the collision calculation process comprises the following steps:

the adjacent edge points of the two objects are denoted P1 and P2, with coordinates (x1, y1) and (x2, y2) respectively, and the distance between the two objects along the x-axis is then |x2 - x1|;

7. The system of claim 6, wherein in step S1, the image to be recognized is scanned from top to bottom to extract the commodity objects and the text objects from the image to be recognized by using an OCR recognition method.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a program of batch information extraction method based on picture information clustering, and when the program of batch information extraction method based on picture information clustering is executed by a processor, the steps of a batch information extraction method based on picture information clustering according to any one of claims 1 to 5 are implemented.