CN112764642B

CN112764642B - Canvas technology-based universal document labeling method and system

Info

Publication number: CN112764642B
Application number: CN202011634774.1A
Authority: CN
Inventors: 王力国; 徐浪; 李宏亮; 纪达麒; 陈运文
Original assignee: Daguan Data Chengdu Co ltd
Current assignee: Daguan Data Chengdu Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2022-11-29
Anticipated expiration: 2040-12-31
Also published as: CN112764642A

Abstract

The invention relates to the technical field of document labeling and discloses a Canvas technology-based universal document labeling method and system. The method comprises the following steps: step 1: parsing the original PDF file and embedding a canvas in the parsed file structure; step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file; and step 3: aggregating the canvas layers of the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single-layer canvas. By integrating HTML5 Canvas technology into the labeling tool, the invention greatly improves the applicability, efficiency, and performance of labeling.

Description

Canvas technology-based universal document labeling method and system

Technical Field

The invention relates to the technical field of document labeling, and in particular to a Canvas technology-based universal document labeling method and system.

Background

With the continuing progress of paperless office technology, people need to process more and more electronic documents, and the share of paper documents is gradually decreasing. When document processing in an enterprise relies on NLP-related technology, a large amount of text labeling work is often required for model training, and completing this work on electronic documents requires a labeling system that is convenient to operate.

The currently popular labeling approach mainly parses the PDF document and then completes labeling by dragging to select the characters. However, such methods have many disadvantages: text cannot be selected on a single-layer PDF, content such as stamps and watermarks cannot be labeled, and tables in a document cannot be labeled.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the problems described above, the invention provides a Canvas technology-based universal document labeling method and system.

The technical scheme adopted by the invention is as follows:

A Canvas technology-based universal document labeling method comprises the following steps:

Step 1: parsing the original PDF file and embedding a canvas in the parsed file structure;

Step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file;

Step 3: aggregating the canvas layers of the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single-layer canvas.

Step 2 specifically comprises the following steps:

Step 21: selecting a drawing area on the canvas, and using a scripting language to control the canvas pixels to perform graphic editing in the drawing area to obtain a label;

Step 22: mapping the drawing area coordinates to the PDF text, obtaining the PDF text selection area corresponding to the coordinates, and obtaining the text content of that selection area, thereby labeling the PDF document.

The method therefore supports rectangle labels (for example, text), ellipse labels (for example, seals and watermarks), polygon labels (for example, objects), and complex label types such as tables.

Further, the drawing area in step 21 is obtained through mouse events. The mouse events comprise a mouse press event, a mouse move event, and a mouse release event: the mouse press event draws an editable graphic object on the canvas; the mouse move event continuously resets the editable graphic object and draws the drawing area in real time; and the mouse release event finally generates the drawing area.

Further, in the coordinate mapping process, the coordinates of a label on the canvas are mapped to the PDF file as follows: the vertical coordinate is the offset of the label from the top of the canvas plus the top scroll distance of the canvas, and the horizontal coordinate is the offset of the label from the side of the canvas plus the side scroll distance.

The invention also provides a Canvas technology-based universal document labeling system, which comprises a text selection marking module, a text box selection marking module, a data stream control module, and a graphic editing and scheduling module;

the text selection marking module is responsible for selecting and marking text;

the text box selection marking module is responsible for box-selecting and marking text;

the graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates to obtain the content of the PDF text selection area;

the data stream control module is responsible for abstracting each graphic editing event and data object into a data stream.

Compared with the prior art, the technical scheme has the following beneficial effects. The labeling system makes it very convenient to label text on a document. Compared with traditional selection on a word-segmented text layer, this scheme uses Canvas technology to select and filter text through coordinate mapping and conversion. Using multiple canvases for the display area improves the user's browsing experience, while using a canvas of fixed viewport size for the user labeling area greatly improves drawing performance. Using a cache to control the number of graphics drawn further improves Canvas performance.

Drawings

FIG. 1 is a schematic flow diagram of the method of the present invention.

FIG. 2 is a schematic diagram of computing annotation coordinates from a captured mouse position.

FIG. 3 is a schematic diagram of generating an annotation by capturing mouse events.

FIG. 4 is a layer view of the present system.

FIG. 5 is a schematic diagram of text selection in the non-page-crossing case.

FIG. 6 is a schematic diagram of text selection in the page-crossing case.

FIG. 7 is a schematic diagram of box selection in the non-page-crossing case.

FIG. 8 is a schematic diagram of box selection in the page-crossing case.

FIG. 9 is a schematic diagram of data flow control.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, an embodiment of the present invention provides a Canvas technology-based universal document labeling method, which includes the following specific steps:

Step 1: parsing the original PDF file and embedding a Canvas in the parsed file structure. Canvas provides a way to draw graphics using JavaScript and the HTML <canvas> element; it can be used for animation, game graphics, data visualization, image editing, and real-time video processing.

Step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file. The Canvas serves as a drawing board, and JavaScript controls the drawing of each pixel on it. This makes it possible to label rectangles (for example, text), ellipses (for example, stamps and watermarks), polygons (for example, objects), and complex types such as tables, which traditional document labeling cannot achieve.

Specifically, in this embodiment, step 2 comprises:

Step 21: selecting a drawing area on the canvas, and using a scripting language to control the canvas pixels to perform graphic editing in the drawing area to obtain a label.

However, graphics created on the canvas do not directly capture the selected text content; a series of coordinate mappings is required. As shown in FIG. 2, the key quantities are the top scroll distance of the canvas, scrollTop$; the side scroll distance, scrollLeft$; the offset of the graphic object from the top of the canvas, top; and the offset of the graphic object from the side of the canvas, left.

Therefore, in the coordinate mapping process, the top coordinate of a label on the canvas, mapped to the PDF file, is the offset of the label from the top of the canvas plus the top scroll distance, that is, top + scrollTop$. Likewise, the side coordinate is left + scrollLeft$.
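The mapping described above can be sketched in TypeScript as follows. This is a minimal illustration, not the patented implementation: the type and function names (`CanvasOffset`, `ScrollState`, `canvasToDocument`, `locateOnPage`) are hypothetical, and the page lookup additionally assumes that all pages have the same height and are stacked without gaps, which the description does not state.

```typescript
// Hypothetical names; only the "offset + scroll distance" rule comes from the text.
interface CanvasOffset {
  top: number;  // offset of the graphic object from the top of the canvas
  left: number; // offset of the graphic object from the side of the canvas
}

interface ScrollState {
  scrollTop: number;  // top scroll distance (scrollTop$ in the description)
  scrollLeft: number; // side scroll distance (scrollLeft$ in the description)
}

interface DocumentPoint {
  x: number;
  y: number;
}

// Map a position on the fixed-size canvas to a position in the scrolled document.
function canvasToDocument(offset: CanvasOffset, scroll: ScrollState): DocumentPoint {
  return {
    x: offset.left + scroll.scrollLeft,
    y: offset.top + scroll.scrollTop,
  };
}

// Assumption: uniform page height, pages stacked vertically with no gaps.
function locateOnPage(y: number, pageHeight: number): { pageIndex: number; offsetY: number } {
  const pageIndex = Math.floor(y / pageHeight);
  return { pageIndex, offsetY: y - pageIndex * pageHeight };
}

const point = canvasToDocument({ top: 120, left: 40 }, { scrollTop: 900, scrollLeft: 0 });
const located = locateOnPage(point.y, 800);
console.log(point, located);
```

With a label drawn 120 pixels below the top of the canvas and the document scrolled by 900 pixels, the label lands at a document y coordinate of 1020, which under the uniform 800-pixel page assumption falls on the second page.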

Meanwhile, the drawing area in step 21 is obtained through mouse events. Specifically, the mouse events include a mouse press event mousedown, a mouse move event mousemove, and a mouse release event mouseup: pressing the mouse triggers mousedown, moving the mouse triggers mousemove, and releasing the pressed mouse triggers mouseup. As shown in FIG. 3, the mousedown event draws an editable graphic object on the drawing board; during mousemove the graphic object is continuously reset and a rectangular selection area is drawn in real time; and on mouseup the object is finally generated, the content of the selection area is fixed, and the drawing area is produced.
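The press, move, and release sequence can be modeled as a small state reducer. The sketch below is one possible reading under stated assumptions: it handles only rectangular areas, it works on plain coordinate records rather than real DOM `MouseEvent` objects so that it runs outside a browser, and the names `DrawState`, `MouseInput`, and `reduceDrawState` are hypothetical.

```typescript
type MouseKind = "mousedown" | "mousemove" | "mouseup";

interface MouseInput {
  kind: MouseKind;
  x: number;
  y: number;
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface DrawState {
  anchor: { x: number; y: number } | null; // press position; null when not drawing
  preview: Rect | null;                    // rectangle redrawn on every move
  committed: Rect[];                       // drawing areas fixed on release
}

const initialDrawState: DrawState = { anchor: null, preview: null, committed: [] };

// Normalize two corner points so that dragging in any direction yields a valid rectangle.
function rectFromPoints(a: { x: number; y: number }, b: { x: number; y: number }): Rect {
  return {
    x: Math.min(a.x, b.x),
    y: Math.min(a.y, b.y),
    width: Math.abs(a.x - b.x),
    height: Math.abs(a.y - b.y),
  };
}

function reduceDrawState(state: DrawState, e: MouseInput): DrawState {
  if (e.kind === "mousedown") {
    // Create the editable graphic object at the press position.
    const anchor = { x: e.x, y: e.y };
    return { ...state, anchor, preview: rectFromPoints(anchor, anchor) };
  }
  if (state.anchor === null) {
    return state; // move or release without a preceding press: nothing to do
  }
  if (e.kind === "mousemove") {
    // Continuously reset the object so the area follows the pointer.
    return { ...state, preview: rectFromPoints(state.anchor, e) };
  }
  // mouseup: fix the area and leave drawing mode.
  return {
    anchor: null,
    preview: null,
    committed: [...state.committed, rectFromPoints(state.anchor, e)],
  };
}

const events: MouseInput[] = [
  { kind: "mousedown", x: 50, y: 60 },
  { kind: "mousemove", x: 30, y: 100 },
  { kind: "mouseup", x: 10, y: 20 },
];
console.log(events.reduce(reduceDrawState, initialDrawState).committed);
```

In a browser, the same reducer could be fed from listeners registered on the canvas element; keeping it pure makes the drawing logic testable without a DOM.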

The embodiment also provides a Canvas technology-based universal document labeling system, which is divided into three layers, from top to bottom: a viewport layer, a Canvas layer, and a document layer, as shown in FIG. 4. The upper "viewport layer" is the region of the user interface that the user can see; the middle "Canvas layer", which covers the viewport area, is used for drawing graphics and capturing coordinates; and the "document layer" at the bottom is the display region of the original document.

Specifically, the document labeling system comprises a text selection marking module, a text box selection marking module, a data stream control module, and a graphic editing and scheduling module.

The text selection marking module is responsible for selecting and marking text by dragging across it. Computing the text selection area involves three parts: the top selection area, the middle selection area, and the bottom selection area. The top and bottom selection areas are used to simulate the appearance of a text selection, while the middle selection area must distinguish two cases, depending on whether the current mouse selection crosses a page:

As shown in FIG. 5, when the middle selection area does not cross a page, its width is the width of the whole page, and its height is the difference between the bottom selection area and the top selection area.

As shown in FIG. 6, when the middle selection area crosses pages, the selection must be divided into a previous-page selection area, intermediate-page selection areas (if more than two pages are spanned), and a next-page selection area. The maximum height of the previous-page selection area is the page height of the previous page, and the minimum height of the next-page selection area is 0.
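The non-page-crossing case can be illustrated with the following TypeScript sketch. Only the rule for the middle area (full page width, height equal to the gap between the top and bottom areas) comes from the description. The rest is assumed for illustration: a fixed `lineHeight`, a top area running from the start point to the right page edge, a bottom area running from the left page edge to the end point, and the hypothetical function name `textSelectionRects`.

```typescript
interface Point {
  x: number;
  y: number; // top of the text line the point lies on (assumption)
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumes `start` precedes `end` in reading order and both lie on the same page.
function textSelectionRects(
  start: Point,
  end: Point,
  pageWidth: number,
  lineHeight: number
): Rect[] {
  // Selection within a single line: one rectangle is enough.
  if (end.y - start.y < lineHeight) {
    return [{ x: start.x, y: start.y, width: end.x - start.x, height: lineHeight }];
  }

  // Top area: from the start point to the right edge of the page (assumed shape).
  const top: Rect = { x: start.x, y: start.y, width: pageWidth - start.x, height: lineHeight };

  // Bottom area: from the left edge of the page to the end point (assumed shape).
  const bottom: Rect = { x: 0, y: end.y, width: end.x, height: lineHeight };

  // Middle area: full page width; height is the gap between the top and bottom areas.
  const middleTop = start.y + lineHeight;
  const middle: Rect = { x: 0, y: middleTop, width: pageWidth, height: end.y - middleTop };

  return middle.height > 0 ? [top, middle, bottom] : [top, bottom];
}

console.log(textSelectionRects({ x: 100, y: 200 }, { x: 300, y: 260 }, 600, 20));
```

For the page-crossing case, each page would receive its own set of rectangles, with the previous-page part capped at that page's height and the next-page part starting at 0, as stated above.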

The text box selection marking module is responsible for box-selecting and marking text. Box selection makes it convenient to select the content of an entire block in a document. In some scenarios a whole block of text must be labeled, for example when labeling a table; selecting it line by line with text selection is very cumbersome, and box selection is more appropriate in such cases. As with the text selection module, box selection must distinguish whether the selection crosses pages.

As shown in FIG. 7, when the selection does not cross a page, the width of the selected text area is the width of the mouse selection area, and its height is the height of the mouse selection area.

As shown in FIG. 8, when the selection crosses a page, the selection area on the previous page extends vertically from the starting height of the mouse box selection to the height of the previous page, and the selection area on the next page extends vertically from 0 to the ending height of the mouse box selection.
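The two box-selection cases can be handled by one splitting routine, sketched below in TypeScript. It is an illustration under assumptions the description does not state: all pages share the same height, pages are stacked vertically without gaps, and the box has positive height. The names `PageRect` and `splitBoxByPage` are hypothetical.

```typescript
interface Rect {
  x: number;
  y: number; // document coordinate, measured from the top of the first page
  width: number;
  height: number;
}

interface PageRect extends Rect {
  pageIndex: number; // 0-based page number; `y` is relative to the top of this page
}

// Split a box drawn in document coordinates into one rectangle per page it touches.
function splitBoxByPage(box: Rect, pageHeight: number): PageRect[] {
  const parts: PageRect[] = [];
  const bottom = box.y + box.height;

  for (let page = Math.floor(box.y / pageHeight); page * pageHeight < bottom; page++) {
    const pageTop = page * pageHeight;
    // Previous page: from the box start to the page height.
    // Next page: from 0 to the box end. Intermediate pages: the full page.
    const start = Math.max(box.y, pageTop) - pageTop;
    const end = Math.min(bottom, pageTop + pageHeight) - pageTop;
    parts.push({ pageIndex: page, x: box.x, y: start, width: box.width, height: end - start });
  }
  return parts;
}

console.log(splitBoxByPage({ x: 50, y: 700, width: 200, height: 300 }, 800));
```

A box that stays within one page produces a single rectangle whose width and height equal those of the mouse selection, which matches the non-page-crossing case.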

The graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates and obtaining the content of the PDF text selection area.

The data stream control module is responsible for abstracting each graphic editing event and data object into a data stream. With a reactive programming model, every user event and data object can be abstracted into a data stream, and each data stream can notify all of its consumers by multicast, so that the annotation data and the timing of event triggers can be managed clearly and effectively. FIG. 9 is a simplified schematic of the data flow control of the system.

In the figure, mousedown$, mousemove$, and mouseup$ are the most basic mouse event streams. mousedown$ triggers the isDrawing$ event stream to indicate that drawing of a Canvas object has started. When the mouseup$ stream is triggered, the label is saved, and createLabels$ and labelsChange$ are triggered. Because the Canvas object just drawn now exists on the canvas, as the mouse continues to move and mousemove$ keeps firing, hoverInLabels$ is triggered when the pointer is over the label, and hoverOutLabels$ is triggered when the pointer moves out of it. If the user presses the mouse on an existing Canvas object, mousedown$ triggers selectedLabelsChange$. In this way the complete data flow is essentially chained together.
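The drawing path of this data flow can be sketched with a small hand-written multicast stream. The description does not name a reactive library, so the `Stream` class below is a stand-in written for this example rather than a real library API, and `createAnnotationStreams`, `Label`, and the wiring between streams are assumptions. Hover and selection streams are omitted for brevity.

```typescript
type Listener<T> = (value: T) => void;

// Minimal multicast stream: every subscriber receives every emitted value.
class Stream<T> {
  private listeners: Listener<T>[] = [];

  subscribe(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    // Return an unsubscribe function.
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  next(value: T): void {
    // Copy the list so subscribers may unsubscribe while being notified.
    for (const listener of [...this.listeners]) {
      listener(value);
    }
  }
}

interface Point {
  x: number;
  y: number;
}

interface Label {
  from: Point;
  to: Point;
}

function createAnnotationStreams() {
  const mousedown$ = new Stream<Point>();
  const mouseup$ = new Stream<Point>();
  const isDrawing$ = new Stream<boolean>();
  const createLabels$ = new Stream<Label>();
  const labelsChange$ = new Stream<Label[]>();

  let anchor: Point | null = null;
  const labels: Label[] = [];

  // Pressing the mouse signals that drawing has started.
  mousedown$.subscribe((p) => {
    anchor = p;
    isDrawing$.next(true);
  });

  // Releasing the mouse ends drawing and emits the new label.
  mouseup$.subscribe((p) => {
    if (anchor === null) return;
    const label: Label = { from: anchor, to: p };
    anchor = null;
    isDrawing$.next(false);
    createLabels$.next(label);
  });

  // Saving a label notifies every consumer of the label list.
  createLabels$.subscribe((label) => {
    labels.push(label);
    labelsChange$.next([...labels]);
  });

  return { mousedown$, mouseup$, isDrawing$, createLabels$, labelsChange$ };
}

const streams = createAnnotationStreams();
streams.labelsChange$.subscribe((all) => console.log("labels:", all.length));
streams.mousedown$.next({ x: 0, y: 0 });
streams.mouseup$.next({ x: 10, y: 10 });
```

Because each stream is multicast, a renderer, a side panel, and a persistence layer could all subscribe to `labelsChange$` independently without knowing about one another.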

The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.

Claims

1. A Canvas technology-based universal document labeling method is characterized by comprising the following steps:

step 3: aggregating the canvas layers of the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single canvas;

the step 2 specifically comprises:

step 22: mapping the drawing area coordinates to the PDF text, obtaining the PDF text selection area corresponding to the coordinates, and obtaining the text content of the PDF text selection area, thereby labeling the PDF document;

in the coordinate mapping process, the coordinates of the label on the canvas are mapped to the PDF file such that the vertical coordinate is the offset of the label from the top of the canvas plus the top scroll distance of the canvas, and the horizontal coordinate is the offset of the label from the side of the canvas plus the side scroll distance.

2. The Canvas technology-based universal document labeling method according to claim 1, wherein the drawing area in step 21 is obtained through mouse events.

3. The Canvas technology-based universal document labeling method according to claim 2, wherein the mouse events include a mouse press event, a mouse move event, and a mouse release event; the mouse press event draws an editable graphic object on the canvas; the mouse move event continuously resets the editable graphic object and draws the drawing area in real time; and the mouse release event finally generates the drawing area.

4. A Canvas technology-based universal document labeling system applied to the Canvas technology-based universal document labeling method according to any one of claims 1 to 3, comprising: a text selection marking module, a text box selection marking module, a data stream control module, and a graphic editing and scheduling module;

the text selection marking module is responsible for selecting and marking text;

the graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates to obtain the content of the PDF text selection area;