CN111353458A - Text box marking method and device and storage medium - Google Patents

Text box marking method and device and storage medium

Info

Publication number
CN111353458A
Authority
CN
China
Prior art keywords
text
text box
image
boxes
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010161011.3A
Other languages
Chinese (zh)
Other versions
CN111353458B (en)
Inventor
彭梅英 (Peng Meiying)
鲁四喜 (Lu Sixi)
农高明 (Nong Gaoming)
唐嘉龙 (Tang Jialong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010161011.3A
Publication of CN111353458A
Application granted
Publication of CN111353458B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Input (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text box annotation method and apparatus, and a storage medium, belonging to the technical field of information. The method comprises the following steps: acquiring position information of a plurality of text boxes in an image; determining attribute names of the plurality of text boxes according to the position information of the plurality of text boxes; and using the position information and the attribute names of the plurality of text boxes as text box annotation information of the image. Because the text box annotation information of the image is acquired automatically, annotation efficiency can be improved, annotation time can be reduced, and subjective errors caused by manual annotation can be avoided, so that the yield and quality of text box annotation work can be improved.

Description

Text box marking method and device and storage medium
Technical Field
The present application relates to the field of information technology, and in particular, to a method and an apparatus for annotating a text box, and a storage medium.
Background
OCR (Optical Character Recognition) refers to a technique of converting the characters of bills, newspapers, books, documents, certificates, and other printed matter into image information by an optical input method such as scanning, and then converting the image information into usable computer input by using a character recognition technique.
OCR may be implemented by means of deep learning. Specifically, an image to be recognized is provided to a deep-learning-based text box detection model to obtain the position information of the text boxes in the image; the image is then cropped according to this position information to obtain image blocks to be recognized, and each image block is provided to a text content recognition model to obtain the text content in that image block.
The performance of the text box detection model greatly affects the recognition accuracy of OCR, so parameter tuning and verification testing of the text box detection model are very important, and a large amount of text box annotation information is needed in this process. At present, text boxes are annotated manually by technicians, which involves a large workload and is very time-consuming, hindering the parameter tuning and verification testing of the text box detection model.
Disclosure of Invention
The application provides a text box labeling method, a text box labeling device and a storage medium, which can improve the yield and quality of text box labeling work.
In one aspect, a method for annotating a text box is provided, and the method includes:
acquiring position information of a plurality of text boxes in an image;
determining attribute names of the text boxes according to the position information of the text boxes;
and using the position information and the attribute names of the plurality of text boxes as text box annotation information of the image.
In one aspect, an apparatus for labeling a text box is provided, the apparatus comprising:
the acquisition module is used for acquiring the position information of a plurality of text boxes in the image;
the determining module is used for determining the attribute names of the text boxes according to the position information of the text boxes;
and the annotation module is used for using the position information and the attribute names of the plurality of text boxes as the text box annotation information of the image.
In one aspect, a text box annotation apparatus is provided. The apparatus includes a processor and a memory, the memory being used to store a program that enables the apparatus to execute the text box annotation method described above, as well as the data involved in implementing that method. The processor is configured to execute the programs stored in the memory. The apparatus may further include a communication bus for establishing a connection between the processor and the memory.
In one aspect, a computer-readable storage medium is provided, which stores instructions that, when executed by a processor, implement the steps of the text box annotation method described above.
In one aspect, a computer program product containing instructions is provided which, when run on a computer, causes the computer to perform the above-described text box annotation method.
The technical scheme provided by the application can at least bring the following beneficial effects:
after the position information of the plurality of text boxes in the image is acquired, the attribute names of the plurality of text boxes are determined according to that position information, which improves the speed and accuracy of acquiring the attribute names. Finally, the position information and the attribute names of the plurality of text boxes are used as the text box annotation information of the image. Because the text box annotation information of the image is acquired automatically, annotation efficiency can be improved, annotation time can be reduced, and subjective errors caused by manual annotation can be avoided, so that the yield and quality of text box annotation work can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an ID card image provided by an embodiment of the present application;
FIG. 2 is a flowchart of a text box annotation method provided in an embodiment of the present application;
FIG. 3 is a flowchart of another method for annotating a text box according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a method for annotating a text box according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text box annotation apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that reference to "a plurality" in this application means two or more. In the description of the present application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, for clarity in describing the technical solutions of the present application, the terms "first", "second", and the like are used to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that these terms do not limit quantity or execution order, nor do they denote importance.
Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application will be described.
OCR is a technique of converting the text of bills, newspapers, books, documents, certificates, and other printed matter into image information by means of optical input such as scanning, and then converting the image information into usable computer input by means of a text recognition technique. In the field of OCR, text detection is a precondition for scene text recognition, and the problem it must solve is how to accurately locate the text boxes matching the characters in cluttered, unfamiliar, and complex scenes.
OCR generally includes steps such as image preprocessing, text box detection, and character recognition. Text box detection detects the position range and layout of the characters, including layout analysis, text line detection, and the like; the detected image blocks corresponding to the text boxes containing characters are then sent to a character recognition model for character recognition. The accuracy of text box detection therefore greatly influences the recognition accuracy of OCR, and parameter tuning and verification testing of the text box detection algorithm are particularly important.
However, in the process of parameter tuning and verification test of the text box detection algorithm, a large amount of text box marking information is needed to continuously optimize the text box detection algorithm, and the accuracy of the text box detection algorithm is promoted to be improved. Therefore, the text box labeling work in the text box detection algorithm is very important.
The following describes the related contents labeled in the text box:
the text box label mainly comprises a text box position label and a text box attribute name label. The position information of the text box is used to indicate the position of the text box in the image. The attribute name of the text box is used to indicate the meaning of the text content in the text box.
For example, the identity card image shown in fig. 1 includes a plurality of text contents. The text box marking information of the identity card image comprises position information and attribute names of text boxes corresponding to each text content in the plurality of text contents.
Fig. 2 is a flowchart of a text box annotation method according to an embodiment of the present application. Referring to fig. 2, the method includes:
step 201: position information of each of a plurality of text boxes in an image is acquired.
The image is an image to which text box annotation is to be applied. The image may be an image taken of an object in need of text recognition. For example, the image may be an image of an object such as a certificate (including but not limited to an identity card, an exit-entry permit for Hong Kong and Macau, a driver's license, etc.) or a ticket.
Additionally, the text box label may include a text box location information label and a text box attribute name label. That is, the position information and the attribute name of the text box in the image may be used as the text box label information of the image. Wherein the position information of the text box is used for indicating the position of the text box in the image. The attribute name of the text box is used to indicate the meaning of the text content in the text box.
Further, the position information of the text box is determined in the image coordinate system. That is, the image coordinate system is a coordinate system for determining the position information of the text box. The text box is generally a quadrilateral box and includes four corner points, specifically, two upper corner points and two lower corner points, where the two upper corner points are an upper left corner point and an upper right corner point, and the two lower corner points are a lower left corner point and a lower right corner point. The position information of the text box may include positions of four corner points of the text box, and the positions of the four corner points of the text box may be expressed by coordinates of the four corner points of the text box in an image coordinate system.
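As a concrete illustration (not part of the original disclosure), such position information can be represented in Python as four corner coordinates; this list-of-tuples layout is an assumption reused in the later sketches:

```python
# Position information of one text box: the coordinates of its four corner
# points in the image coordinate system, in the order upper left, upper
# right, lower left, lower right (assumed ordering).
text_box = [(120, 40), (380, 40), (120, 80), (380, 80)]

upper_left, upper_right, lower_left, lower_right = text_box
```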
Specifically, the image may be input into a text box detection model to obtain the position information output by the model for a plurality of text boxes, and the position information of the plurality of text boxes in the image is then obtained from this output.
The text box detection model is used to identify a text region containing text content in the image, and output an edge position of the text region as text box position information. The textbox detection model can be a neural network model such as a convolutional neural network, a cyclic neural network, a deep neural network, and the like.
The text box detection model may be trained using a large number of image samples that include different text box annotation information. For example, a plurality of training samples may be determined in advance, for any one of the plurality of training samples, the sample data of the training sample is an image, and the sample label of the training sample is text box annotation information of the image. Then, the plurality of training samples may be used for model training, specifically, the sample data in the plurality of training samples may be used as input, the sample labels of the plurality of training samples may be used as expected output, and model training is performed to obtain the textbox detection model.
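A minimal sketch of the inference step, assuming a trained model callable on an image; the `load_model` helper and the calling convention are illustrative placeholders, not an API defined by the patent:

```python
def detect_text_boxes(model, image):
    """Run the text box detection model on an image and return one box per
    detected text region, each as the four corner points described above."""
    return model(image)  # e.g. a CNN-, RNN-, or DNN-based detector

# Hypothetical usage:
# boxes = detect_text_boxes(load_model("detector.weights"), image)
```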
The operation of acquiring the position information of the plurality of text boxes in the image according to the model output may be: directly using the output position information as the position information of the plurality of text boxes in the image. Alternatively, the text box indicated by each piece of output position information may be displayed in the image; the position of a displayed text box is adjusted in response to an adjustment operation on that text box; in response to a frame selection operation performed in the image, the selection frame corresponding to that operation is displayed in the image as a text box; and the position information of all text boxes displayed in the image is acquired.
It should be noted that the adjusting operation may be an operation triggered by a technician to adjust the position of the text box, such as an operation of lengthening or compressing the text box. The technician can adjust the positions of the four corner points of the text box through the adjusting operation to adjust the overall position of the text box.
Additionally, the box operation may be a technician-triggered operation to add a new text box. The technician may directly frame out a new text box in the image through a frame selection operation.
In addition, in the embodiment of the application, after the text box detected by the text box detection model is displayed in the image, a technician may adjust the position of the displayed text box through an adjustment operation, and may add a new text box in the image through a frame selection operation, so that the technician may review and correct the text box detected by the text box detection model.
In this embodiment of the present application, in one mode, the position information of the text boxes output by the text box detection model may be used directly as the position information of the text boxes in the image, which increases the speed of determining the text box positions. In another mode, a technician reviews and corrects the position information output by the model, and the reviewed and corrected position information is used as the position information of the text boxes in the image, which improves the accuracy of the determined positions.
Step 202: and determining attribute names of the text boxes according to the position information of the text boxes.
Specifically, the text boxes may be sorted according to the position information of the text boxes and a specified sorting rule to obtain the serial numbers of the text boxes; and acquiring the attribute name corresponding to the sequence number of the first text box from the corresponding relation between the sequence number and the attribute name as the attribute name of the first text box, wherein the first text box is one text box in the plurality of text boxes.
It should be noted that after the plurality of text boxes are sorted, the plurality of text boxes have an order, and thus each text box in the plurality of text boxes also has a sequence number representing the order. The ordering of the plurality of text boxes may represent the positional relationship of the plurality of text boxes in the object to which they pertain. For example, the ordering of the plurality of text boxes is from top to bottom and from left to right in the object.
In addition, the specified sorting rule can be preset, which means that sorting is performed according to a certain rule. For example, the specified sort rule may be a rule that sorts by positional relationship from top to bottom and from left to right.
In general, the object contained in an image may be at 0 degrees or at a non-zero angle. Being at 0 degrees means that the pose of the object in the image is relatively upright, i.e., an edge line of the object is roughly parallel to the horizontal or vertical axis of the image coordinate system. Being at a non-zero angle means that the pose of the object in the image is relatively skewed, i.e., the edge line of the object is parallel to neither the horizontal nor the vertical axis of the image coordinate system. In the embodiment of the present application, whether the plurality of text boxes can be sorted directly according to their position information may be determined according to whether the object contained in the image is at 0 degrees.
The operation of sorting the plurality of text boxes according to the specified sorting rule according to the position information of the plurality of text boxes may include the following steps (1) to (3):
(1) Take the text box with the largest long-side length among the plurality of text boxes as the second text box.
It should be noted that the long side of a text box is the side between its two upper corner points or its two lower corner points. The text box with the largest long-side length is simply the longest of the plurality of text boxes. It is taken as the second text box so that the posture of the object to which it belongs can subsequently be estimated from it.
In addition, the length of the long side of a text box is the difference between the abscissas of its two upper corner points or its two lower corner points. For example, if the long side of the text box is the side between the two upper corner points, and the positions of the upper left, upper right, lower left, and lower right corner points of a certain text box are represented by (x1, y1), (x2, y2), (x3, y3), and (x4, y4) respectively, then the length of the long side of the text box is the difference between x2 and x1 (i.e., x2 - x1).
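For illustration only, step (1) can be sketched in Python as follows, reusing the corner layout assumed earlier (this is a sketch of the patent's definition, not a prescribed implementation):

```python
def long_side_length(box):
    # box = [(x1, y1), (x2, y2), (x3, y3), (x4, y4)], corners in the order
    # upper left, upper right, lower left, lower right (assumed layout).
    # Following the patent's definition, the long-side length is the
    # difference of the abscissas of the two upper corner points: x2 - x1.
    (x1, _y1), (x2, _y2) = box[0], box[1]
    return x2 - x1

def second_text_box(boxes):
    # Step (1): the text box with the largest long-side length.
    return max(boxes, key=long_side_length)
```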
(2) Acquire the angle between the straight line on which the long side of the second text box lies and the horizontal axis of the image coordinate system as the target angle.
It should be noted that, since the second text box is the longest text box among all text boxes of the object to which it belongs, its long side tends to be consistent with an edge line of that object. In this case, the angle between the straight line on which the long side of the second text box lies and the horizontal axis of the image coordinate system (i.e., the target angle) may represent the angle between the edge line of the object and the horizontal axis of the image coordinate system, so the target angle can reflect the posture of the object in the image. That is, when the target angle is small, the pose of the object in the image is relatively upright; when the target angle is large, the object is relatively skewed in the image.
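A minimal sketch of step (2), reusing the box layout assumed above (the use of atan2 and degrees is an implementation choice, not mandated by the patent):

```python
import math

def target_angle(second_box):
    # Step (2): angle between the long side of the second text box and the
    # horizontal axis of the image coordinate system, in degrees.
    (x1, y1), (x2, y2) = second_box[0], second_box[1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```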
(3) Rotate the plurality of text boxes by the target angle about one corner point of the second text box to obtain a plurality of corresponding target boxes, and sort the plurality of text boxes according to the specified sorting rule based on the position information of the plurality of target boxes.
It should be noted that the target angle may represent the angle between an edge line of the object to which the second text box belongs and the horizontal axis of the image coordinate system. Therefore, after each of the plurality of text boxes is rotated by the target angle about a corner point of the second text box to obtain the corresponding target box, the postures of the target boxes in the image are relatively upright. In this case, the plurality of text boxes corresponding one-to-one to the plurality of target boxes can be sorted according to the specified sorting rule based on the position information of the target boxes, which improves the sorting accuracy of the text boxes.
The operation of sorting the plurality of text boxes according to the specified sorting rule based on the position information of the plurality of target boxes may be: sorting the text boxes corresponding one-to-one to the target boxes in ascending order of the ordinate of each target box's target corner point; if the ordinates of the target corner points of at least two target boxes are the same, sorting the corresponding text boxes in ascending order of the abscissa of those target corner points.
It should be noted that the target corner point may be a fixed corner point in the quadrilateral frame. For example, one of the upper left corner point, the upper right corner point, the lower left corner point, and the lower right corner point may be defined in advance as the target corner point.
In this embodiment of the present application, the target boxes may be sorted in ascending order of ordinate first and ascending order of abscissa second, and the resulting order of the target boxes is then used as the order of the text boxes corresponding to them one-to-one. In this way, the plurality of text boxes are sorted from top to bottom and from left to right.
It is noted that step (3) may be performed directly after the target angle is obtained in step (2). Alternatively, after the target angle is obtained in step (2), it may be determined whether the target angle is greater than or equal to a specified angle. If the target angle is greater than or equal to the specified angle, step (3) is executed. If the target angle is smaller than the specified angle, the plurality of text boxes may be sorted in ascending order of the ordinate of each text box's target corner point; if the ordinates of the target corner points of at least two text boxes are the same, those text boxes are sorted in ascending order of the abscissa of their target corner points.
It should be noted that the specified angle may be set in advance, and may be set to a small value. When the target angle is greater than or equal to the specified angle, the object is at a non-zero angle, i.e., its posture in the image is relatively skewed, so step (3) needs to be performed: the text boxes are rotated to obtain the target boxes, and the text boxes are then sorted according to the target boxes, which improves sorting accuracy. When the target angle is smaller than the specified angle, the object is at 0 degrees, i.e., its posture in the image is relatively upright, so the text boxes can be sorted directly according to their own position information, which improves sorting speed.
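Putting steps (1) to (3) together, a hedged Python sketch of the sorting procedure might look as follows. The 2-degree threshold, the choice of the upper-left corner (box[0]) as both the rotation point and the target corner, and the sign convention for undoing the skew are all assumptions for illustration; the patent leaves them open:

```python
import math

def rotate_point(point, pivot, angle_deg):
    # Rotate one corner point about the pivot by -angle_deg, i.e. counter-
    # rotate the skew measured by the target angle (image coordinates:
    # x to the right, y downward).
    theta = math.radians(-angle_deg)
    px, py = pivot
    x, y = point
    return (px + (x - px) * math.cos(theta) - (y - py) * math.sin(theta),
            py + (x - px) * math.sin(theta) + (y - py) * math.cos(theta))

def sort_text_boxes(boxes, threshold_deg=2.0):
    # Steps (1)-(3): find the second text box, measure the target angle,
    # rotate every box about one of the second box's corner points if the
    # skew reaches the specified angle, then sort top-to-bottom and
    # left-to-right by the target corner point (assumed to be box[0]).
    second = second_text_box(boxes)   # from the earlier sketch
    angle = target_angle(second)      # from the earlier sketch
    if abs(angle) >= threshold_deg:
        pivot = second[0]
        targets = [[rotate_point(p, pivot, angle) for p in b] for b in boxes]
    else:
        targets = boxes               # pose is nearly upright; sort directly
    order = sorted(range(len(boxes)),
                   key=lambda i: (targets[i][0][1], targets[i][0][0]))
    return [boxes[i] for i in order]
```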
Generally, the meaning of the text content at each position in an object (such as an identity card, a Hong Kong and Macau pass, or a driver's license) is relatively fixed. For example, for the identity card shown in fig. 1, the attribute names of the text boxes from top to bottom and from left to right may be: first text box: resident identification card; second text box: Chinese name; third text box: English name; fourth text box: number; fifth text box: date of birth; sixth text box: date 1; seventh text box: gender; eighth text box: issue date; ninth text box: date 2.
In this way, the correspondence between the sequence numbers representing the order of the text boxes and the corresponding attribute names can be set in advance for each object. For example, for the identity card shown in fig. 1, the correspondence between the set sequence number and the attribute name may be as shown in table 1 below:
TABLE 1
Serial number    Attribute name
1                Resident identification card
2                Chinese name
3                English name
4                Number
5                Date of birth
6                Date 1
7                Gender
8                Issue date
9                Date 2
Note that, in the embodiments of the present application, the correspondence between the serial numbers and the attribute names is described only by taking table 1 as an example, and table 1 does not limit the embodiments of the present application.
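For illustration, the correspondence of Table 1 can be held in a simple Python dict; the English attribute-name strings follow the table above, and the 1-based sequence numbers come from the sorting step:

```python
# Correspondence between sequence numbers and attribute names (Table 1).
SEQ_TO_ATTR = {
    1: "Resident identification card",
    2: "Chinese name",
    3: "English name",
    4: "Number",
    5: "Date of birth",
    6: "Date 1",
    7: "Gender",
    8: "Issue date",
    9: "Date 2",
}

def attribute_names(sorted_boxes):
    # Look up each text box's attribute name by its 1-based sequence number;
    # assumes the image contains exactly the text boxes listed in Table 1.
    return [SEQ_TO_ATTR[i + 1] for i in range(len(sorted_boxes))]
```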
In the embodiment of the application, the attribute name of each text box in the plurality of text boxes can be accurately obtained according to the sequence number obtained after the plurality of text boxes are sequenced, so that the speed and the accuracy of obtaining the attribute name of the text box are improved.
Step 203: use the position information and the attribute names of the plurality of text boxes as the text box annotation information of the image.
It should be noted that, for any one image in a large number of images, the text box annotation information of the image may be obtained according to the text box annotation method provided in the embodiment of the present application.
In addition, after the text box annotation information of an image is obtained, the image and its text box annotation information can be used as a training sample for parameter tuning of the text box detection algorithm, or as a test sample for verification testing of the text box detection algorithm.
In the embodiment of the application, after the position information of the plurality of text boxes in the image is acquired, the attribute names of the plurality of text boxes are determined according to that position information, which improves the speed and accuracy of acquiring the attribute names. Finally, the position information and the attribute names of the plurality of text boxes are used as the text box annotation information of the image. Because the text box annotation information of the image is acquired automatically, annotation efficiency can be improved, annotation time can be reduced, and subjective errors caused by manual annotation can be avoided, so that the yield and quality of text box annotation work can be improved.
For ease of understanding, the text box labeling method provided in the embodiment of the present application is illustrated in the following with reference to fig. 3 and 4.
Referring to fig. 3, the text box annotation method may include steps 301 to 303 as follows.
Step 301: automatically generate a base annotation file containing text box position information and attribute names.
Each image (such as a certificate photo) among the plurality of images to be annotated is input into a text box detection model, and the position information of the text boxes in each image output by the model is obtained. In addition, a uniform attribute name (which may be a value customized in advance) may be set for all output text boxes. In this way, a base annotation file containing the text box position information and text box attribute names is generated for each image. The format of the base annotation file must support loading by the annotation software, and may be defined in advance, such as JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
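One plausible layout for such a base annotation file, written as JSON from Python, is sketched below; the field names, the placeholder attribute value, and the corner-point order are assumptions for illustration (the patent only requires a format the annotation software can load):

```python
import json

base_annotation = {
    "image": "id_card_001.jpg",
    "text_boxes": [
        {
            # Corner points in the order upper left, upper right,
            # lower left, lower right (assumed layout).
            "points": [[120, 40], [380, 40], [120, 80], [380, 80]],
            "attribute_name": "TBD",  # uniform placeholder for every box
        },
        # ... one entry per detected text box
    ],
}

with open("id_card_001.json", "w", encoding="utf-8") as f:
    json.dump(base_annotation, f, ensure_ascii=False, indent=2)
```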
Step 302: manually review and correct the text box position information.
All the base annotation files output in step 301 and the plurality of images are loaded using the annotation software so as to display the images, and the text box indicated by each piece of position information is displayed in each image. The technician reviews and corrects the text box position information (position information output by the text box detection model that is incorrect needs correcting; position information that is correct does not), for example by adjusting the displayed text boxes. The technician may also add a new text box, and the attribute name of the added text box may be kept as the uniform attribute name set in step 301. A new annotation file containing the text box position information and text box attribute names is then regenerated for each image.
Step 303: automatically add and modify the text box attribute names.
For the annotation file of each image output in step 302, the text boxes in the image are sorted according to a specified sorting rule, such as from top to bottom and from left to right. For each of the plurality of text boxes in an image, the attribute name corresponding to the text box's sequence number is acquired as its attribute name, and the originally set uniform attribute name is modified into the acquired attribute name. In this way, the position information and attribute name of each text box in each image are obtained as the text box annotation information. With this annotation information, parameter tuning and verification testing can be performed on the text box detection algorithm.
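A sketch of step 303 end to end, reusing the helper functions and the SEQ_TO_ATTR table from the earlier sketches (the annotation-file schema is the assumed one shown after step 301):

```python
def relabel(annotation):
    # Sort the boxes of one annotation file and replace the uniform
    # placeholder with the attribute name looked up by sequence number;
    # assumes the boxes match the Table 1 layout.
    boxes = [tb["points"] for tb in annotation["text_boxes"]]
    ordered = sort_text_boxes(boxes)  # steps (1)-(3) from the earlier sketch
    annotation["text_boxes"] = [
        {"points": box, "attribute_name": SEQ_TO_ATTR[i + 1]}
        for i, box in enumerate(ordered)
    ]
    return annotation
```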
The embodiment of fig. 3 will be described in detail with reference to fig. 4. Referring to fig. 4, the following three parts may be included:
First part: a software module for automatically generating the base annotation file
Python is used to load an open-source or existing text box detection model, realizing the function of automatically generating text box position information for large batches of input images. After the text box position information of each image is generated, an attribute name is automatically added to each text box; this attribute name can be uniformly set to a single value. The annotation file containing the text box position information and text box attribute names generated by this software module is called the base annotation file. One base annotation file is generated for each image, and its format meets the loading requirements of the annotation software.
Second part: manually reviewing and correcting the text box positions through the annotation software
And loading all the basic annotation files output by the first part and the plurality of images by using annotation software to display the plurality of images, and displaying the text box indicated by the position information of each text box in each image in the plurality of images.
The technician reviews and corrects the text box position information (position information output by the text box detection model that is incorrect needs correcting; position information that is correct does not), for example by adjusting the displayed text boxes.
The technician can also add a new text box, and may keep the attribute name of the new text box as the uniform attribute name set in the first part.
After the loaded images have been reviewed and corrected, a new annotation file containing the text box position information and text box attribute names is regenerated for each image. Finally, the annotation files of all images are exported again.
Third part: a software module for automatically adding and modifying text box attribute names
(1) Text box attribute name definition
The meaning represented by each text box in an image containing a specific object can be known in advance; for example, the meaning of each text box on an identity card can be known when the text boxes are read from top to bottom and from left to right. Accordingly, a set of attribute names can be preset according to the order of the text boxes, that is, the correspondence between sequence numbers and attribute names is set.
(2) Text box sorting algorithm implementation
For the plurality of text boxes in each image, the text box with the largest long-side length is taken as the second text box; the angle between the straight line on which its long side lies and the horizontal axis of the image coordinate system is acquired as the target angle; each text box is rotated by the target angle about one corner point of the second text box to obtain the corresponding target box; and the plurality of text boxes are sorted according to the specified sorting rule based on the position information of the plurality of target boxes.
(3) Implementation of automatically adding and modifying text box attribute names
The attribute name corresponding to each text box's sequence number is acquired from the correspondence between sequence numbers and attribute names, and the originally set uniform attribute name is modified into the acquired attribute name. In this way, the position information and attribute name of each of the plurality of text boxes in each image are obtained as the text box annotation information of that image.
Fig. 5 is a schematic structural diagram of a text box annotation apparatus according to an embodiment of the present application. Referring to fig. 5, the apparatus includes:
an obtaining module 501, configured to obtain position information of multiple text boxes in an image;
a determining module 502, configured to determine attribute names of the text boxes according to the location information of the text boxes;
and the annotation module 503 is configured to use the position information and the attribute names of the multiple text boxes as the text box annotation information of the image.
Optionally, the obtaining module 501 is configured to:
inputting the image into a text box detection model to obtain position information of a plurality of text boxes in the image;
and acquiring the position information of the plurality of text boxes in the image according to the position information of the plurality of text boxes.
Optionally, the obtaining module 501 is configured to:
displaying a text box indicated by the position information of the plurality of text boxes in the image;
adjusting the position of a text box in response to an adjusting operation of the text box displayed in the image;
responding to the frame selection operation executed in the image, and displaying a frame selection frame corresponding to the frame selection operation in the image as a text frame;
position information of all text boxes displayed in the image is acquired.
Optionally, the determining module 502 is configured to:
sequencing the plurality of text boxes according to the position information of the plurality of text boxes and a specified sequencing rule to obtain the sequence numbers of the plurality of text boxes;
and acquiring the attribute name corresponding to the sequence number of the first text box from the corresponding relation between the sequence number and the attribute name as the attribute name of the first text box, wherein the first text box is one text box in the plurality of text boxes.
Optionally, the determining module 502 is configured to:
taking one text box with the largest length of the long sides of the plurality of text boxes as a second text box, wherein the long sides of the text boxes are the sides between two upper corner points or two lower corner points;
acquiring an angle of an included angle between a straight line where a long edge of the second text box is located and a transverse axis of an image coordinate system as a target angle, wherein the image coordinate system is used for determining position information of the text box;
rotating the plurality of text boxes by a target angle by taking one corner point of the second text box as a rotating point to obtain a plurality of corresponding target boxes;
and sequencing the plurality of text boxes according to the specified sequencing rule according to the position information of the plurality of target boxes.
Optionally, the determining module 502 is configured to:
sorting the plurality of text boxes corresponding one-to-one to the plurality of target boxes in ascending order of the ordinate of each target box's target corner point; and if the ordinates of the target corner points of at least two target boxes are the same, sorting the corresponding text boxes in ascending order of the abscissa of those target corner points.
In the embodiment of the application, after the position information of the plurality of text boxes in the image is acquired, the attribute names of the plurality of text boxes are determined according to that position information, which improves the speed and accuracy of acquiring the attribute names. Finally, the position information and the attribute names of the plurality of text boxes are used as the text box annotation information of the image. Because the text box annotation information of the image is acquired automatically, annotation efficiency can be improved, annotation time can be reduced, and subjective errors caused by manual annotation can be avoided, so that the yield and quality of text box annotation work can be improved.
It should be noted that: in the text box labeling apparatus provided in the above embodiment, when displaying a page, only the division of the above functional modules is used for illustration, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the text box labeling device provided by the above embodiment and the text box labeling method embodiment belong to the same concept, and the specific implementation process thereof is detailed in the method embodiment and will not be described herein again.
Fig. 6 is a schematic structural diagram of a server 600 according to an embodiment of the present application. Server 600 may be a server in a background server cluster. Specifically, the method comprises the following steps:
the server 600 includes a CPU (Central Processing Unit) 601, a system Memory 604 including a RAM (Random Access Memory) 602 and a ROM (Read-Only Memory) 603, and a system bus 605 connecting the system Memory 604 and the Central Processing Unit 601. The server 600 also includes a basic I/O (Input/Output) system 606, which facilitates the transfer of information between devices within the computer, and a mass storage device 607, which stores an operating system 613, application programs 614, and other program modules 615.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609 such as a mouse, keyboard, etc. for user input of information. Wherein a display 608 and an input device 609 are connected to the central processing unit 601 through an input/output controller 610 connected to the system bus 605. The basic input/output system 606 may also include an input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, an input/output controller 610 may also provide output to a display screen, a printer, or other type of output device.
The mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the server 600. That is, the mass storage device 607 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory technology, CD-ROM, DVD (Digital Versatile Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 604 and the mass storage device 607 may collectively be referred to as memory.
According to various embodiments of the present application, the server 600 may also be operated through a remote computer connected to a network such as the Internet. That is, the server 600 may be connected to the network 612 through the network interface unit 611 connected to the system bus 605, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 611.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the operations performed in the text box annotation methods provided by the method embodiments herein.
Fig. 7 is a schematic structural diagram of a terminal 700 according to an embodiment of the present application. The terminal 700 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 700 includes: a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit) which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to perform operations performed in a text box annotation method provided by method embodiments herein.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch screen display 705, camera 706, audio circuitry 707, positioning components 708, and power source 709.
The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or both of processor 701, memory 702, and peripherals interface 703 may be implemented on separate chips or circuit boards, which is not limited in this application.
The Radio Frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 704 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 704 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The radio frequency circuitry 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over the surface of the display screen 705. The touch signal may be input to the processor 701 as a control signal for processing. At this point, the display 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 705 may be one, disposed on a front panel of the terminal 700; in other embodiments, the display 705 can be at least two, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display 705 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 700. Even more, the display 705 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The Display 705 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 706 is used to capture images or video. Optionally, camera assembly 706 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 706 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or inputting the electric signals to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 700. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic position of the terminal 700 to implement navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 709 is provided to supply power to various components of terminal 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When power source 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 712 may detect a body direction and a rotation angle of the terminal 700, and the gyro sensor 712 may cooperate with the acceleration sensor 711 to acquire a 3D motion of the terminal 700 by the user. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 713 may be disposed on a side bezel of terminal 700 and/or an underlying layer of touch display 705. When the pressure sensor 713 is disposed on a side frame of the terminal 700, a user's grip signal on the terminal 700 may be detected, and the processor 701 performs right-left hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at a lower layer of the touch display 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 705. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 714 is used to collect a user's fingerprint, and the processor 701 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 itself identifies the user's identity from the collected fingerprint. When the user's identity is identified as trusted, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal 700. When a physical button or a vendor logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical button or the vendor logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715: when the ambient light intensity is high, the display brightness of the touch display 705 is increased; when the ambient light intensity is low, the display brightness of the touch display 705 is decreased. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
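For illustration only, a minimal Python sketch of such a brightness mapping; the linear curve, the 10000-lux ceiling, and the 5% floor are assumptions made for this example, not values from this application:

    def adjust_backlight(ambient_lux: float, max_lux: float = 10000.0) -> float:
        # Clamp the ambient reading and map it linearly onto [0.05, 1.0],
        # keeping a small floor so the screen stays readable in darkness.
        level = min(max(ambient_lux, 0.0), max_lux) / max_lux
        return max(level, 0.05)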
The proximity sensor 716, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front surface of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually decreases, the processor 701 controls the touch display 705 to switch from the bright screen state to the dark screen state; when the proximity sensor 716 detects that the distance gradually increases, the processor 701 controls the touch display 705 to switch from the dark screen state to the bright screen state.
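For illustration only, a minimal sketch of this switching logic over consecutive distance readings (the function and state names are hypothetical; the embodiment specifies only the trend-based behavior):

    def next_screen_state(prev_distance: float, distance: float, state: str) -> str:
        # Approaching the front surface dims the screen; moving away
        # restores it. Equal readings leave the state unchanged.
        if distance < prev_distance:
            return "dark"
        if distance > prev_distance:
            return "bright"
        return state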
Those skilled in the art will appreciate that the structure shown in fig. 7 does not constitute a limitation of the terminal 700, which may include more or fewer components than those shown, combine certain components, or adopt a different arrangement of components.
In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which when executed by a processor implements the steps of the text box annotation method provided in the embodiment of fig. 2 above. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is noted that the computer-readable storage medium referred to in the embodiments of the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.
In some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the text box annotation method provided in the embodiment of fig. 2 above.
The above description is only an exemplary embodiment of the present application and is not intended to be limiting; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for annotating a text box, the method comprising:
acquiring position information of a plurality of text boxes in an image;
determining attribute names of the text boxes according to the position information of the text boxes;
and using the position information and the attribute names of the plurality of text boxes as text box mark information of the image.
2. The method of claim 1, wherein the obtaining position information of a plurality of text boxes in an image comprises:
inputting the image into a text box detection model to obtain position information of a plurality of text boxes in the image;
and acquiring the position information of the plurality of text boxes in the image according to the position information of the plurality of text boxes.
3. The method of claim 2, wherein the obtaining the position information of the plurality of text boxes in the image according to the position information of the plurality of text boxes comprises:
displaying, in the image, the text boxes indicated by the position information of the plurality of text boxes;
in response to an adjustment operation on one text box displayed in the image, adjusting the position of the text box;
in response to a frame selection operation performed in the image, displaying a frame selection box corresponding to the frame selection operation in the image as a text box;
and acquiring the position information of all the text boxes displayed in the image.
4. The method of any of claims 1-3, wherein said determining attribute names of the plurality of text boxes based on the position information of the plurality of text boxes comprises:
sorting the plurality of text boxes according to the position information of the plurality of text boxes and a specified sorting rule to obtain sequence numbers of the plurality of text boxes;
and acquiring, from the correspondence between sequence numbers and attribute names, an attribute name corresponding to the sequence number of a first text box as the attribute name of the first text box, wherein the first text box is one of the plurality of text boxes.
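For illustration only, a minimal Python sketch of this sequence-number-to-attribute-name lookup; the SEQ_TO_ATTR table and its field names are hypothetical, since the actual correspondence is configured per document template:

    # Hypothetical correspondence between sequence numbers and attribute
    # names for an identity-card-like layout.
    SEQ_TO_ATTR = {0: "name", 1: "gender", 2: "id_number", 3: "address"}

    def label_boxes(sorted_boxes):
        # Pair each text box, in its sorted order, with the attribute
        # name registered for that sequence number; unregistered ranks
        # fall back to a generic field name.
        return [
            {"position": box, "attribute": SEQ_TO_ATTR.get(i, f"field_{i}")}
            for i, box in enumerate(sorted_boxes)
        ]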
5. The method of claim 4, wherein said sorting the plurality of text boxes according to a specified sorting rule based on the position information of the plurality of text boxes comprises:
taking the text box having the longest long side among the plurality of text boxes as a second text box, wherein the long side of a text box is the side between its two upper corner points or its two lower corner points;
acquiring, as a target angle, the angle between the straight line on which the long side of the second text box lies and the horizontal axis of an image coordinate system, wherein the image coordinate system is used for determining the position information of the text boxes;
rotating the plurality of text boxes by the target angle about one corner point of the second text box as the rotation point to obtain a plurality of corresponding target boxes;
and sorting the plurality of text boxes according to the specified sorting rule based on the position information of the plurality of target boxes.
6. The method of claim 5, wherein said sorting the plurality of text boxes according to a specified sorting rule based on the position information of the plurality of target boxes comprises:
sorting the plurality of text boxes corresponding one by one to the plurality of target boxes in ascending order of the vertical coordinates of the target corner points of the plurality of target boxes; and if the vertical coordinates of the target corner points of at least two target boxes are the same, sorting the at least two text boxes corresponding one by one to the at least two target boxes in ascending order of the horizontal coordinates of the target corner points of the at least two target boxes.
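For illustration only, a minimal Python sketch combining claims 5 and 6, assuming each box is given as four (x, y) corner points ordered top-left, top-right, bottom-right, bottom-left, and taking the top-left corner as the target corner point (an assumption made here; the claims do not fix which corner is used):

    import math

    def sort_text_boxes(boxes):
        # Second text box: the one whose long (top) side is longest.
        ref = max(boxes, key=lambda b: math.dist(b[0], b[1]))
        (x0, y0), (x1, y1) = ref[0], ref[1]
        # Target angle: the tilt of that long side against the image
        # x axis; rotating every box by its negative levels them out.
        theta = -math.atan2(y1 - y0, x1 - x0)
        cos_t, sin_t = math.cos(theta), math.sin(theta)

        def rotate(pt):
            # Rotate a point about the reference corner (x0, y0).
            dx, dy = pt[0] - x0, pt[1] - y0
            return (x0 + dx * cos_t - dy * sin_t, y0 + dx * sin_t + dy * cos_t)

        # Sort by the rotated target corner: ascending y, ties broken
        # by ascending x, as specified in claim 6.
        return sorted(boxes, key=lambda b: (rotate(b[0])[1], rotate(b[0])[0]))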
7. A text box annotation apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring the position information of a plurality of text boxes in the image;
the determining module is used for determining the attribute names of the text boxes according to the position information of the text boxes;
and the marking module is used for taking the position information and the attribute names of the text boxes as the text box marking information of the image.
8. The apparatus of claim 7, wherein the acquisition module is to:
inputting the image into a text box detection model to obtain position information of a plurality of text boxes in the image;
and acquiring the position information of the plurality of text boxes in the image according to the position information of the plurality of text boxes.
9. The apparatus of claim 7 or 8, wherein the determination module is to:
sorting the plurality of text boxes according to the position information of the plurality of text boxes and a specified sorting rule to obtain sequence numbers of the plurality of text boxes;
and acquiring, from the correspondence between sequence numbers and attribute names, an attribute name corresponding to the sequence number of a first text box as the attribute name of the first text box, wherein the first text box is one of the plurality of text boxes.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method of any of claims 1-6.
CN202010161011.3A 2020-03-10 2020-03-10 Text box labeling method, device and storage medium Active CN111353458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010161011.3A CN111353458B (en) 2020-03-10 2020-03-10 Text box labeling method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111353458A true CN111353458A (en) 2020-06-30
CN111353458B CN111353458B (en) 2023-08-18

Family

ID=71197546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010161011.3A Active CN111353458B (en) 2020-03-10 2020-03-10 Text box labeling method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111353458B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08221506A (en) * 1995-02-16 1996-08-30 Toshiba Corp Device and method for recognizing business document
CN108549843A (en) * 2018-03-22 2018-09-18 南京邮电大学 A kind of VAT invoice recognition methods based on image procossing
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN109308476A (en) * 2018-09-06 2019-02-05 邬国锐 Billing information processing method, system and computer readable storage medium
CN109492635A (en) * 2018-09-20 2019-03-19 第四范式(北京)技术有限公司 Obtain method, apparatus, equipment and the storage medium of labeled data
CN109492643A (en) * 2018-10-11 2019-03-19 平安科技(深圳)有限公司 Certificate recognition methods, device, computer equipment and storage medium based on OCR
CN109635627A (en) * 2018-10-23 2019-04-16 中国平安财产保险股份有限公司 Pictorial information extracting method, device, computer equipment and storage medium
CN109948533A (en) * 2019-03-19 2019-06-28 讯飞智元信息科技有限公司 A kind of Method for text detection, device, equipment and readable storage medium storing program for executing
CN110533079A (en) * 2019-08-05 2019-12-03 贝壳技术有限公司 Form method, apparatus, medium and the electronic equipment of image pattern
CN110689010A (en) * 2019-09-27 2020-01-14 支付宝(杭州)信息技术有限公司 Certificate identification method and device
CN111950397A (en) * 2020-07-27 2020-11-17 腾讯科技(深圳)有限公司 Text labeling method, device and equipment for image and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Yingying; Zhang Zheng; Zhang Chengquan; Zhang Zhaoxiang; Bai Xiang; Liu Wenyu: "A candidate box extraction algorithm for text detection" (适用于文字检测的候选框提取算法), Journal of Data Acquisition and Processing (数据采集与处理), no. 06 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985465A (en) * 2020-08-17 2020-11-24 中移(杭州)信息技术有限公司 Text recognition method, device, equipment and storage medium
CN112016438A (en) * 2020-08-26 2020-12-01 北京嘀嘀无限科技发展有限公司 Method and system for identifying certificate based on graph neural network
CN112200107A (en) * 2020-10-16 2021-01-08 深圳市华付信息技术有限公司 Invoice text detection method
CN112580735A (en) * 2020-12-25 2021-03-30 南方电网深圳数字电网研究院有限公司 Picture online labeling method and device and computer readable storage medium
CN113239227A (en) * 2021-06-02 2021-08-10 泰康保险集团股份有限公司 Image data structuring method and device, electronic equipment and computer readable medium
CN113239227B (en) * 2021-06-02 2023-11-17 泰康保险集团股份有限公司 Image data structuring method, device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
CN111353458B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN108415705B (en) Webpage generation method and device, storage medium and equipment
CN111353458B (en) Text box labeling method, device and storage medium
CN109684980B (en) Automatic scoring method and device
WO2022042425A1 (en) Video data processing method and apparatus, and computer device and storage medium
CN111192005A (en) Government affair service processing method and device, computer equipment and readable storage medium
CN110321126B (en) Method and device for generating page code
CN111078521A (en) Abnormal event analysis method, device, equipment, system and storage medium
CN110647881A (en) Method, device, equipment and storage medium for determining card type corresponding to image
CN108491748B (en) Graphic code identification and generation method and device and computer readable storage medium
CN112230908A (en) Method and device for aligning components, electronic equipment and storage medium
CN110738185B (en) Form object identification method, form object identification device and storage medium
CN113627413A (en) Data labeling method, image comparison method and device
CN111753606A (en) Intelligent model upgrading method and device
CN112396076A (en) License plate image generation method and device and computer storage medium
CN112115748B (en) Certificate image recognition method, device, terminal and storage medium
CN110163192B (en) Character recognition method, device and readable medium
CN111898535A (en) Target identification method, device and storage medium
CN111354378A (en) Voice endpoint detection method, device, equipment and computer storage medium
CN110990728A (en) Method, device and equipment for managing point of interest information and storage medium
CN113535039B (en) Method and device for updating page, electronic equipment and computer readable storage medium
CN113051485B (en) Group searching method, device, terminal and storage medium
CN114238859A (en) Data processing system, method, electronic device, and storage medium
CN109816047B (en) Method, device and equipment for providing label and readable storage medium
CN111859549A (en) Method for determining weight and gravity center information of single-configuration whole vehicle and related equipment
CN111444945A (en) Sample information filtering method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40024806)
SE01 Entry into force of request for substantive examination
GR01 Patent grant