CN112764642B - Canvas technology-based universal document labeling method and system - Google Patents

Info

Publication number
CN112764642B
Authority
CN
China
Prior art keywords
canvas
pdf
text
mouse
labeling
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011634774.1A
Other languages
Chinese (zh)
Other versions
CN112764642A (en)
Inventor
王力国
徐浪
李宏亮
纪达麒
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-31
Filing date
2020-12-31
Publication date
2022-11-29
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202011634774.1A
Publication of CN112764642A
Application granted
Publication of CN112764642B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842 Selection of displayed objects or displayed text elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to the technical field of document labeling and discloses a Canvas technology-based universal document labeling method and system. The method comprises the following steps: step 1: parsing the original PDF file and embedding a canvas in the parsed file structure; step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file; and step 3: aggregating canvas layers across the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single-layer canvas. By integrating HTML5 Canvas technology into the labeling tool, the invention greatly improves the applicability, efficiency and performance of labeling.

Description

Canvas technology-based universal document labeling method and system
Technical Field
The invention relates to the technical field of document marking, in particular to a Canvas technology-based universal document marking method and system.
Background
With the continuing progress of paperless office technology, people handle more and more electronic documents, and the share of paper documents is steadily shrinking. When enterprise document processing relies on NLP-related technology, it usually requires a large amount of text labeling for model training, and completing that work on electronic documents calls for a labeling system that is convenient to operate.
The labeling approach in common use today parses the PDF document and then completes labeling by swiping over and selecting characters. This approach has several disadvantages: text cannot be selected on a single-layer PDF, content such as stamp watermarks cannot be labeled, and tables in the document cannot be labeled.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the problems described above, the invention provides a Canvas technology-based universal document labeling method and system.
The technical solution adopted by the invention is as follows:
A Canvas technology-based universal document labeling method comprises the following steps:
Step 1: parsing the original PDF file and embedding a canvas in the parsed file structure;
Step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file;
Step 3: aggregating canvas layers across the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single-layer canvas.
Step 2 specifically comprises:
Step 21: selecting a drawing area on the canvas and using the scripting language to control canvas pixels to perform graphic editing within the drawing area, obtaining a label;
Step 22: mapping the coordinates of the drawing area onto the PDF text, obtaining the PDF text selection area corresponding to those coordinates, and obtaining the text content of that selection area, thereby labeling the PDF document.
The method therefore supports rectangular labels (for example, text), elliptical labels (for example, seal watermarks), polygonal labels (for example, objects) and complex types such as tables.
Further, the drawing area in step 21 is obtained through mouse events; the mouse events comprise a mouse press event, a mouse move event and a mouse release event; the mouse press event draws an editable graphic object on the canvas; the mouse move event continuously resets the editable graphic object and draws the drawing area in real time; the mouse release event finally generates the drawing area.
Further, in the coordinate mapping process, the top coordinate of a label on the canvas mapped to the PDF file is the label's offset from the top of the canvas plus the top scroll distance of the canvas, and the side coordinate is the label's offset from the side of the canvas plus the side scroll distance.
The invention also provides a Canvas technology-based universal document labeling system, comprising a text selection marking module, a text box selection marking module, a data stream control module and a graphic editing and scheduling module;
the text selection marking module is responsible for swipe-selecting and labeling text;
the text box selection marking module is responsible for box-selecting and labeling text;
the graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates and obtaining the content of the PDF text selection area;
the data stream control module is responsible for abstracting each graphic editing event and data object into a data stream.
Compared with the prior art, the technical solution has the following beneficial effects: the labeling system makes it convenient to label text on a document. Compared with traditional selection based on word segmentation of a text layer, the solution uses Canvas technology to select and filter text through coordinate mapping and conversion. Using multiple canvases for the display area improves the browsing experience, using a canvas of fixed viewport size for the user's labeling area greatly improves drawing performance, and using a cache to control the number of graphics drawn further improves Canvas performance.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
FIG. 2 is a schematic diagram of computing label coordinates from the captured mouse position.
FIG. 3 is a schematic diagram of generating a label by capturing mouse events.
FIG. 4 is a layer diagram of the present system.
FIG. 5 is a schematic diagram of swipe selection when the selection does not cross pages.
FIG. 6 is a schematic diagram of swipe selection when the selection crosses pages.
FIG. 7 is a schematic diagram of box selection when the selection does not cross pages.
FIG. 8 is a schematic diagram of box selection when the selection crosses pages.
FIG. 9 is a schematic diagram of data stream control.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, an embodiment of the present invention provides a Canvas technology-based universal document labeling method, which comprises the following specific steps:
Step 1: parsing the original PDF file and embedding a Canvas in the parsed file structure. Canvas provides a way to draw graphics through JavaScript and the HTML <canvas> element, and can be used for animation, game graphics, data visualization, picture editing and real-time video processing.
Step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file. The Canvas serves as a drawing board, and JavaScript controls the drawing of each pixel on it. This makes it possible to label rectangles such as text, ellipses such as stamp watermarks, polygons such as objects, and complex types such as tables, something traditional document labeling cannot achieve.
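As an illustration of step 2, the following TypeScript sketch draws the three basic label shapes named above. It is only a sketch under stated assumptions: the `LabelShape` type and the `drawLabelShape` function are hypothetical names that do not appear in the patent, and `DrawingContext` is a minimal interface that lists only the path methods used here, with the same names and argument orders as the standard `CanvasRenderingContext2D`, so a real 2D context can be passed in.

```typescript
// Minimal subset of the 2D context used by this sketch. The method names and
// argument orders match the standard CanvasRenderingContext2D.
interface DrawingContext {
  beginPath(): void;
  rect(x: number, y: number, w: number, h: number): void;
  ellipse(x: number, y: number, radiusX: number, radiusY: number,
          rotation: number, startAngle: number, endAngle: number): void;
  moveTo(x: number, y: number): void;
  lineTo(x: number, y: number): void;
  closePath(): void;
  stroke(): void;
}

// Hypothetical label shapes: rectangle (text), ellipse (stamp), polygon (object).
type LabelShape =
  | { kind: "rect"; x: number; y: number; width: number; height: number }
  | { kind: "ellipse"; cx: number; cy: number; rx: number; ry: number }
  | { kind: "polygon"; points: Array<{ x: number; y: number }> };

// Traces the outline of one label shape and strokes it.
function drawLabelShape(ctx: DrawingContext, shape: LabelShape): void {
  ctx.beginPath();
  switch (shape.kind) {
    case "rect":
      ctx.rect(shape.x, shape.y, shape.width, shape.height);
      break;
    case "ellipse":
      // Full ellipse: no rotation, angles from 0 to 2π.
      ctx.ellipse(shape.cx, shape.cy, shape.rx, shape.ry, 0, 0, 2 * Math.PI);
      break;
    case "polygon":
      shape.points.forEach((p, i) => {
        if (i === 0) {
          ctx.moveTo(p.x, p.y);
        } else {
          ctx.lineTo(p.x, p.y);
        }
      });
      ctx.closePath();
      break;
  }
  ctx.stroke();
}
```

In a browser, the context would typically come from `canvas.getContext("2d")`. A table label could plausibly be composed from several rectangles, though the patent does not specify how tables are represented.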
Step 3: aggregating canvas layers across the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single-layer canvas.
Specifically, in this embodiment, step 2 comprises:
Step 21: selecting a drawing area on the canvas and using the scripting language to control canvas pixels to perform graphic editing within the drawing area, obtaining a label;
Step 22: mapping the coordinates of the drawing area onto the PDF text, obtaining the PDF text selection area corresponding to those coordinates, and obtaining the text content of that selection area, thereby labeling the PDF document.
However, creating a graphic on the canvas does not by itself capture the selected text content; a series of coordinate mappings is required. As shown in FIG. 2, the important quantities are the top scroll distance of the canvas scrollTop$, the side scroll distance scrollLeft$, the offset top of the graphic object from the top of the canvas, and the offset left of the graphic object from the side of the canvas.
Therefore, in the coordinate mapping process, the top coordinate of a label on the canvas mapped to the PDF file is the label's offset top from the top of the canvas plus the top scroll distance scrollTop$; likewise, the side coordinate is the offset left from the side of the canvas plus the side scroll distance scrollLeft$.
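The mapping described above amounts to two additions. A minimal TypeScript sketch follows; the names `CanvasOffset`, `ScrollState` and `toDocumentCoords` are assumptions introduced for illustration and are not taken from the patent.

```typescript
// Offset of the graphic object inside the fixed, viewport-sized canvas.
interface CanvasOffset { top: number; left: number }

// Current scroll distances of the document view (scrollTop$ and scrollLeft$ in FIG. 2).
interface ScrollState { scrollTop: number; scrollLeft: number }

// Maps a position on the canvas to a position in the scrolled PDF document:
// document top = top + scrollTop, document left = left + scrollLeft.
function toDocumentCoords(offset: CanvasOffset, scroll: ScrollState): CanvasOffset {
  return {
    top: offset.top + scroll.scrollTop,
    left: offset.left + scroll.scrollLeft,
  };
}
```

For example, a label drawn 120 px below the top of the canvas while the document is scrolled down by 1500 px maps to a document top coordinate of 1620 px.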
The drawing area in step 21 is obtained through mouse events, namely the mouse press event mousedown, the mouse move event mousemove and the mouse release event mouseup. Pressing the mouse button triggers mousedown, moving the mouse triggers mousemove, and releasing the pressed button triggers mouseup. The mousedown event draws an editable graphic object on the drawing board; during mousemove the graphic object is continuously reset and a rectangular selection area is drawn in real time; on mouseup the object is finalized, the content of the selection area is fixed, and the drawing area is generated, as shown in FIG. 3.
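The press, move and release sequence can be sketched as a small state holder that is independent of the DOM. This is an illustrative sketch, not the patent's implementation: `RectDrawingSession`, `rectFromCorners`, `PointerPos` and `DrawArea` are hypothetical names, and in a browser the three methods would be called from the `mousedown`, `mousemove` and `mouseup` listeners of the canvas element.

```typescript
interface PointerPos { x: number; y: number }
interface DrawArea { x: number; y: number; width: number; height: number }

// Builds a rectangle with non-negative size from two opposite corners,
// so dragging up or to the left works as well as dragging down and right.
function rectFromCorners(a: PointerPos, b: PointerPos): DrawArea {
  return {
    x: Math.min(a.x, b.x),
    y: Math.min(a.y, b.y),
    width: Math.abs(a.x - b.x),
    height: Math.abs(a.y - b.y),
  };
}

class RectDrawingSession {
  private anchor: PointerPos | null = null;
  private current: DrawArea | null = null;

  // mousedown: create the editable graphic object at the press position.
  onMouseDown(p: PointerPos): void {
    this.anchor = p;
    this.current = { x: p.x, y: p.y, width: 0, height: 0 };
  }

  // mousemove: reset the object so the selection follows the pointer.
  // Returns null when no button press started a drawing.
  onMouseMove(p: PointerPos): DrawArea | null {
    if (this.anchor === null) return null;
    this.current = rectFromCorners(this.anchor, p);
    return this.current;
  }

  // mouseup: fix the selection and return the final drawing area.
  onMouseUp(p: PointerPos): DrawArea | null {
    if (this.anchor === null) return null;
    const finished = rectFromCorners(this.anchor, p);
    this.anchor = null;
    this.current = null;
    return finished;
  }
}
```

Keeping the geometry separate from the event wiring means the same logic could back ellipse or polygon tools, although the patent only describes the rectangular case in this passage.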
The embodiment also provides a Canvas technology-based universal document labeling system, which is divided from top to bottom into three layers: a viewport layer, a Canvas layer and a document layer, as shown in FIG. 4. The upper viewport layer is the region visible in the user interface; the middle Canvas layer, which covers the viewport layer, is used for drawing graphics and capturing coordinates; and the document layer at the bottom is the display area of the original document.
Specifically, the document labeling system comprises a text selection marking module, a text box selection marking module, a data stream control module and a graphic editing and scheduling module.
The text selection marking module is responsible for swipe-selecting and labeling text. Computing the text selection area involves three parts: the top selection area, the middle selection area and the bottom selection area. The top and bottom selection areas are computed to simulate the appearance of a text selection, while the middle selection area depends on whether the current mouse selection crosses pages:
As shown in FIG. 5, when the middle selection area does not cross pages, its width is the width of the whole page and its height is the difference between the bottom selection area and the top selection area.
As shown in FIG. 6, when the middle selection area crosses pages, it is divided into a previous-page selection area, middle-page selection areas (if multiple pages are crossed), and a next-page selection area. The maximum height of the previous-page selection area is the page height of the previous page, and the minimum height of the next-page selection area is 0.
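The non-page-crossing case of FIG. 5 can be sketched as follows. The patent does not give exact formulas, so this is one plausible reading under stated assumptions: the selection covers at least two text lines of equal height, the top area runs from the start position to the right edge of the page, the bottom area runs from the left edge of the page to the end position, and the middle area fills the full page width between them. The names `textSelectionAreas`, `SelBox` and `SelPoint` are hypothetical.

```typescript
interface SelBox { x: number; y: number; width: number; height: number }

// A selection endpoint; y is the top edge of the text line it falls on.
interface SelPoint { x: number; y: number }

// Computes the top, middle and bottom selection areas for a selection
// that stays on one page (assumed to span at least two lines).
function textSelectionAreas(
  start: SelPoint,
  end: SelPoint,
  lineHeight: number,
  pageWidth: number,
): { topArea: SelBox; middleArea: SelBox; bottomArea: SelBox } {
  // First line: from the start position to the right edge of the page.
  const topArea: SelBox = {
    x: start.x, y: start.y, width: pageWidth - start.x, height: lineHeight,
  };
  // Last line: from the left edge of the page to the end position.
  const bottomArea: SelBox = {
    x: 0, y: end.y, width: end.x, height: lineHeight,
  };
  // Middle: full page width; height is the gap between the bottom area's
  // top edge and the top area's bottom edge (0 when the lines are adjacent).
  const middleTop = start.y + lineHeight;
  const middleArea: SelBox = {
    x: 0, y: middleTop, width: pageWidth, height: Math.max(0, end.y - middleTop),
  };
  return { topArea, middleArea, bottomArea };
}
```

For a selection that crosses pages, the middle area would additionally be cut at each page boundary, in the same way a box selection is split across pages.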
The text box selection marking module is responsible for box-selecting and labeling text. Box selection makes it convenient to select a whole region of content in a document. In some scenarios a whole block of text needs to be labeled; when labeling text in a table, for example, selecting it line by line with swipe selection is very cumbersome, and box selection is more appropriate. As with the text selection marking module, box selection must distinguish whether the selection crosses pages.
As shown in FIG. 7, when the selection does not cross pages, the text width is the width of the mouse selection area and the text height is the height of the mouse selection area.
As shown in FIG. 8, when the selection crosses pages, the previous-page selection area extends from the starting height of the mouse box selection to the height of the previous page, and the next-page selection area extends from 0 to the ending height of the mouse box selection.
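The page-crossing rule for box selection can be sketched as a function that cuts one rectangle in document coordinates into per-page pieces. The sketch assumes that all pages share the same height and are stacked vertically without gaps, which the patent does not state; `splitBoxByPage`, `BoxArea` and `PageBox` are hypothetical names.

```typescript
interface BoxArea { x: number; y: number; width: number; height: number }

// One piece of the selection, with y measured from the top of its own page.
interface PageBox extends BoxArea { pageIndex: number }

// Splits a box selection (document coordinates) at page boundaries.
// Assumes equal-height pages stacked vertically with no gaps.
function splitBoxByPage(box: BoxArea, pageHeight: number): PageBox[] {
  const parts: PageBox[] = [];
  const bottom = box.y + box.height;
  for (let page = Math.floor(box.y / pageHeight); page * pageHeight < bottom; page++) {
    const pageTop = page * pageHeight;
    // First page: starts at the box's starting height. Later pages: start at 0.
    const startY = Math.max(box.y, pageTop) - pageTop;
    // Last page: ends at the box's ending height. Earlier pages: end at the page height.
    const endY = Math.min(bottom, pageTop + pageHeight) - pageTop;
    parts.push({
      pageIndex: page, x: box.x, y: startY, width: box.width, height: endY - startY,
    });
  }
  return parts;
}
```

With a page height of 1000, a box starting at y = 900 with height 300 yields a piece on page 0 from 900 to the page height and a piece on page 1 from 0 to 200, matching the rule described for FIG. 8. A box that stays inside one page comes back as a single piece with its original width and height.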
The graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates and obtaining the content of the PDF text selection area.
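The patent does not describe how the text content is read out once the coordinates have been mapped. One common approach, shown here purely as an assumption, is to keep the bounding box of every text run produced by the PDF parser and return the runs that intersect the mapped region. `TextRun`, `RegionBox` and `textInRegion` are hypothetical names.

```typescript
// A piece of parsed PDF text with its bounding box in document coordinates.
interface TextRun { text: string; x: number; y: number; width: number; height: number }
interface RegionBox { x: number; y: number; width: number; height: number }

// True when the two axis-aligned rectangles overlap with non-zero area.
function boxesIntersect(a: RegionBox, b: RegionBox): boolean {
  return a.x < b.x + b.width && a.x + a.width > b.x &&
         a.y < b.y + b.height && a.y + a.height > b.y;
}

// Collects the text of all runs that intersect the region, ordered
// top to bottom and then left to right, joined with single spaces.
function textInRegion(runs: TextRun[], region: RegionBox): string {
  return runs
    .filter((run) => boxesIntersect(run, region))
    .sort((a, b) => a.y - b.y || a.x - b.x)
    .map((run) => run.text)
    .join(" ");
}
```

Whether partially covered runs should count as selected is a design choice; this sketch includes them, while a stricter variant could require the center of the run to lie inside the region.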
The data stream control module is responsible for abstracting each graphic editing event and data object into a data stream. With a reactive programming model, every user event and data object can be abstracted into a data stream, and each data stream can notify all of its consumers by multicast, so that the annotation data and the timing of event triggers can be managed clearly and effectively. FIG. 9 is a simplified data stream control diagram of the system.
In the figure, mousedown$, mousemove$ and mouseup$ are the most basic mouse event streams. mousedown$ continuously triggers the isDrawing$ event stream to indicate that drawing of a Canvas object has started. When the mouseup$ stream is triggered, the label is saved, and createLabels$ and labelsChange$ are triggered. Because the Canvas object just drawn now exists on the canvas, as the mouse keeps moving and mousemove$ keeps firing, hoverInLabels$ is triggered when the pointer is over the label, and hoverOutLabels$ is triggered when the pointer moves off it. If the user presses the mouse on an existing Canvas object, the mousedown$ and mousemove$ streams trigger selectedLabelsChange$. In this way the whole data flow is essentially chained together.
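Part of this flow can be sketched with a small hand-written multicast stream. The patent does not name the reactive library it uses, so this is not its implementation: `MulticastStream`, `createLabelStreams`, `XY` and `LabelRecord` are hypothetical, and the stream names are taken from the description of FIG. 9 as far as they can be read. Emitting `false` on isDrawing$ at release is also an assumption. Only the press and release path is wired; hover and selection streams would follow the same pattern.

```typescript
type Listener<T> = (value: T) => void;

// A minimal multicast stream: every emitted value is delivered to all subscribers.
class MulticastStream<T> {
  private listeners: Listener<T>[] = [];

  // Registers a consumer and returns a function that unsubscribes it.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  emit(value: T): void {
    // Iterate over a copy so consumers may unsubscribe while being notified.
    for (const listener of this.listeners.slice()) listener(value);
  }
}

interface XY { x: number; y: number }
interface LabelRecord { id: number; x: number; y: number; width: number; height: number }

// Wires the press/release part of the flow:
// mousedown$ -> isDrawing$, mouseup$ -> createLabels$ and labelsChange$.
function createLabelStreams() {
  const mousedown$ = new MulticastStream<XY>();
  const mouseup$ = new MulticastStream<XY>();
  const isDrawing$ = new MulticastStream<boolean>();
  const createLabels$ = new MulticastStream<LabelRecord>();
  const labelsChange$ = new MulticastStream<LabelRecord[]>();

  const labels: LabelRecord[] = [];
  let start: XY | null = null;

  mousedown$.subscribe((p) => {
    start = p;
    isDrawing$.emit(true); // drawing of a Canvas object has started
  });

  mouseup$.subscribe((p) => {
    const begin = start;
    if (begin === null) return; // release without a preceding press
    start = null;
    const label: LabelRecord = {
      id: labels.length + 1,
      x: Math.min(begin.x, p.x),
      y: Math.min(begin.y, p.y),
      width: Math.abs(p.x - begin.x),
      height: Math.abs(p.y - begin.y),
    };
    labels.push(label); // the label is saved
    isDrawing$.emit(false); // assumption: drawing ends on release
    createLabels$.emit(label);
    labelsChange$.emit(labels.slice());
  });

  return { mousedown$, mouseup$, isDrawing$, createLabels$, labelsChange$ };
}
```

Because each stream is multicast, a renderer and a label list panel can both subscribe to labelsChange$ and stay in sync without knowing about each other.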
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed. Those skilled in the art to which the invention pertains will appreciate that insubstantial changes or modifications can be made without departing from the spirit of the invention as defined by the appended claims.

Claims (4)

1. A Canvas technology-based universal document labeling method, characterized by comprising the following steps:
Step 1: parsing the original PDF file and embedding a canvas in the parsed file structure;
Step 2: using a scripting language to control the drawing of pixels on the canvas, thereby labeling the PDF file;
Step 3: aggregating canvas layers across the multi-page document labeling viewport and the corresponding canvas drawing areas of the PDF file, so that a multi-page document can be labeled on a single canvas;
wherein step 2 specifically comprises:
Step 21: selecting a drawing area on the canvas and using the scripting language to control canvas pixels to perform graphic editing within the drawing area, obtaining a label;
Step 22: mapping the coordinates of the drawing area onto the PDF text, obtaining the PDF text selection area corresponding to those coordinates, and obtaining the text content of that selection area, thereby labeling the PDF document;
in the coordinate mapping process, the top coordinate of a label on the canvas mapped to the PDF file is the label's offset from the top of the canvas plus the top scroll distance of the canvas, and the side coordinate is the label's offset from the side of the canvas plus the side scroll distance.
2. The Canvas technology-based universal document labeling method according to claim 1, wherein the drawing area in step 21 is obtained through mouse events.
3. The Canvas technology-based universal document labeling method according to claim 2, wherein the mouse events comprise a mouse press event, a mouse move event and a mouse release event; the mouse press event draws an editable graphic object on the canvas; the mouse move event continuously resets the editable graphic object and draws the drawing area in real time; the mouse release event finally generates the drawing area.
4. A Canvas technology-based universal document labeling system applying the Canvas technology-based universal document labeling method according to any one of claims 1 to 3, comprising: a text selection marking module, a text box selection marking module, a data stream control module and a graphic editing and scheduling module;
the text selection marking module is responsible for swipe-selecting and labeling text;
the text box selection marking module is responsible for box-selecting and labeling text;
the graphic editing and scheduling module is responsible for mapping graphic editing coordinates on the canvas to PDF text selection area coordinates and obtaining the content of the PDF text selection area;
the data stream control module is responsible for abstracting each graphic editing event and data object into a data stream.
CN202011634774.1A 2020-12-31 2020-12-31 Canvas technology-based universal document labeling method and system Active CN112764642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011634774.1A CN112764642B (en) 2020-12-31 2020-12-31 Canvas technology-based universal document labeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011634774.1A CN112764642B (en) 2020-12-31 2020-12-31 Canvas technology-based universal document labeling method and system

Publications (2)

Publication Number Publication Date
CN112764642A CN112764642A (en) 2021-05-07
CN112764642B true CN112764642B (en) 2022-11-29

Family

ID=75699674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011634774.1A Active CN112764642B (en) 2020-12-31 2020-12-31 Canvas technology-based universal document labeling method and system

Country Status (1)

Country Link
CN (1) CN112764642B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836090A (en) * 2021-09-01 2021-12-24 北京来也网络科技有限公司 File labeling method, device, equipment and medium based on AI and RPA
CN116188628B (en) * 2022-12-02 2024-01-12 广东保伦电子股份有限公司 Free painting page-crossing drawing and displaying method and server
CN117591766B (en) * 2024-01-18 2024-04-30 成都怡康科技有限公司 Method for converting webpage into pageable pdf

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595402A (en) * 2018-04-28 2018-09-28 西安极数宝数据服务有限公司 A kind of system of extraction PDF form datas
CN111767702A (en) * 2020-08-14 2020-10-13 腾讯科技(深圳)有限公司 Display control method and device of online document, electronic equipment and storage medium
CN111859865A (en) * 2020-06-30 2020-10-30 深圳市中农易讯信息技术有限公司 Method, device, terminal and medium for converting PDF document

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609401A (en) * 2011-12-26 2012-07-25 北京大学 Webpage annotation method
CN106201475A (en) * 2016-06-29 2016-12-07 江苏中威科技软件系统有限公司 A kind of hand writing system based on Android device WebView and method
CN106502506A (en) * 2016-11-01 2017-03-15 上海爱数信息技术股份有限公司 The mask method of document, system and electronic equipment in webpage
CN106776939A (en) * 2016-12-01 2017-05-31 山东师范大学 A kind of image lossless mask method and system
US11010040B2 (en) * 2019-02-28 2021-05-18 Microsoft Technology Licensing, Llc Scrollable annotations associated with a subset of content in an electronic document
CN110889056B (en) * 2019-12-06 2023-08-22 北京百度网讯科技有限公司 Page marking method and device
CN111144078B (en) * 2019-12-13 2023-09-01 平安银行股份有限公司 Method, device, server and storage medium for determining positions to be marked in PDF (portable document format) file

Also Published As

Publication number Publication date
CN112764642A (en) 2021-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant