CN116992855A - Document processing method, system and related equipment - Google Patents

Info

Publication number
CN116992855A
Authority
CN
China
Prior art keywords
document
paragraph
information
directory
directory entry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310616632.XA
Other languages
Chinese (zh)
Inventor
王仁勇
尚东东
孙德旺
谢奕红
李勇
朱辉晃
张平兰
毛瑞彬
杨建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN SECURITIES INFORMATION CO Ltd
Original Assignee
SHENZHEN SECURITIES INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN SECURITIES INFORMATION CO Ltd
Priority to CN202310616632.XA
Publication of CN116992855A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/322: Trees
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/416: Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a document processing method, a system and related equipment. The method comprises: parsing an initial document to obtain its directory information, paragraph information and character coordinates; configuring a content positioning rule for a target text according to information of the forward directory entry of the paragraph where the target text is located; and executing the content positioning rule to determine the position of the target text in the initial document. An image boundary recognition algorithm can be used to parse structural information such as the paragraphs and directory of a PDF document, and concise, widely applicable content positioning rules can then be configured from that structural information, so that the coordinates of target texts in the document can be determined through the rules. This effectively reduces the workload of writing and maintaining rules and improves the parsing and positioning of document content. In addition, the intermediate information tables produced by document parsing can be reused by different downstream processing logic, and the method can display the original text with high fidelity and support front-end reading features that extend the document with content such as a directory and annotations.

Description

Document processing method, system and related equipment
Technical Field
The embodiment of the application relates to the technical field of Internet, in particular to a document processing method, a document processing system and related equipment.
Background
Document formats such as PDF have become a de facto standard for many important documents because they render with high consistency across platform environments.
In practice, users often need to find key content in a PDF document. However, PDF content is currently located mostly by hard-coding: matching code has to be written for each positioning task, which is inefficient, and a large number of positioning requirements calls for a correspondingly large amount of matching code.
In this regard, the related art does not provide an effective solution.
Disclosure of Invention
The embodiment of the application provides a document processing method, a document processing system and related equipment, which are used for meeting the positioning requirements of different contents through a universal positioning rule.
A first aspect of an embodiment of the present application provides a document processing method, including:
parsing an initial document to obtain directory information, paragraph information and character coordinates of the initial document, wherein the initial document comprises a PDF format document;
configuring a content positioning rule of a target text according to information of a forward directory entry of the paragraph where the target text is located, wherein the forward directory entry refers to a directory entry located above the paragraph where the target text is located;
and executing the content positioning rule to determine the position of the target text in the initial document.
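A minimal Python sketch of these three steps is given here for illustration only. The rule format (the keys `directory_keyword` and `pattern`), the function names and the sample data are all hypothetical and are not taken from the application. The sketch assumes that directory entries and paragraphs have already been parsed, that `top` increases down the page, and that `char_pos` holds one coordinate per character of `text`.

```python
import re
from bisect import bisect_right

def forward_entry(entries, para_pos):
    """Return the nearest directory entry at or above a paragraph position.

    entries: list of (page_num, top, text), sorted by (page_num, top).
    para_pos: (page_num, top) of the paragraph.
    """
    keys = [(page, top) for page, top, _ in entries]
    i = bisect_right(keys, para_pos)
    return entries[i - 1] if i else None

def locate(rule, entries, paragraphs):
    """Execute a (hypothetical) positioning rule and return character coordinates."""
    hits = []
    for para in paragraphs:
        entry = forward_entry(entries, (para["page_num"], para["top"]))
        # The rule is keyed on the forward directory entry, not on layout details.
        if entry is None or rule["directory_keyword"] not in entry[2]:
            continue
        for m in re.finditer(rule["pattern"], para["text"]):
            hits.append((para["page_num"], para["char_pos"][m.start():m.end()]))
    return hits

# Hypothetical parsed data for a one-page document.
entries = [(1, 100.0, "1. Related organizations"), (1, 300.0, "2. Financial data")]
paragraphs = [
    {"page_num": 1, "top": 120.0, "text": "Sponsor: ABC",
     "char_pos": [(i * 10, 120) for i in range(12)]},
    {"page_num": 1, "top": 320.0, "text": "Amount: 500",
     "char_pos": [(i * 10, 320) for i in range(11)]},
]
rule = {"directory_keyword": "Financial", "pattern": r"\d+"}
hits = locate(rule, entries, paragraphs)
```

Because the rule refers only to the forward directory entry and a text pattern, the same rule can be reused for documents whose layout differs but whose section headings are similar.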
The method according to the first aspect of the present application may be implemented in practice using the content according to the second aspect of the present application.
A second aspect of an embodiment of the present application provides a document processing system, including:
the parsing unit is used for parsing the initial document to obtain directory information, paragraph information and character coordinates of the initial document, wherein the initial document comprises a PDF format document;
the processing unit is used for configuring the content positioning rule of the target text according to the information of the forward directory entry of the paragraph where the target text is located, wherein the forward directory entry refers to a directory entry located above the paragraph where the target text is located;
the processing unit is further configured to execute the content positioning rule to determine a position of the target text in the initial document.
A third aspect of an embodiment of the present application provides an electronic device, including:
a central processing unit, a memory and an input/output interface;
the memory is a short-term memory or a persistent memory;
the central processor is configured to communicate with the memory and to execute instruction operations in the memory to perform the method described in the first aspect of the embodiments of the present application or any particular implementation of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method as described in the first aspect of the embodiments of the present application or any particular implementation of the first aspect.
A fifth aspect of the embodiments of the present application provides a computer program product comprising instructions or a computer program which, when run on a computer, causes the computer to perform the method as described in the first aspect of the embodiments of the present application or any particular implementation of the first aspect.
As can be seen from the above technical solutions, the embodiments of the present application have at least the following advantages:
according to the embodiments of the present application, document information such as the directory, paragraphs and character coordinates of the initial document can be obtained by parsing, and concise, widely applicable content positioning rules can then be configured from the directory and paragraph information. The position of the target text in the initial document can thus be conveniently determined through the content positioning rules, which effectively reduces the workload of writing and maintaining rules, improves the parsing and positioning of document content, and improves the user's experience of reviewing the document.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings for those of ordinary skill in the art.
It should be noted that, although the steps in the flowcharts (if any) of the embodiments are drawn in sequence following the arrows, the steps are not strictly limited to the order shown and may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 1 is a schematic flow chart of a document processing method according to an embodiment of the present application;
FIG. 2a is a schematic diagram of another document processing method according to an embodiment of the present application;
FIG. 2b is a schematic diagram of another document processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an analysis of a document processing method according to an embodiment of the present application;
FIG. 4 is a rendering representation of a document processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a sequential stack according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the following description, reference is made to "one embodiment" or "one specific example" and the like, which describe a subset of all possible embodiments. It is to be understood that "one embodiment" or "one specific example" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with each other where they do not conflict. In the following description, the term "plurality" means at least two. Where a value is said herein to reach a threshold (if any), this may, in some specific examples, include the value being greater than the threshold; any reference to "any" or "at least one" or the like may refer to any one of the listed examples or to any combination of them.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
The method of the present application is described in further detail below using a PDF format document as the initial document. The initial document may also be a document in another format that, like PDF, renders with high consistency across platform environments (e.g., displays consistently on different device terminals).
Referring to fig. 1 to 5, a first aspect of the present application provides a specific embodiment of a document processing method, which includes the following steps:
11. Parse the initial document to obtain its directory information, paragraph information and character coordinates.
For example, announcements disclosed by market participants in the securities industry generally use PDF documents. However, because PDF is designed for ease of rendering, it has no concept of structural information such as paragraphs or a complete directory. Interpreting chapter structure such as directory levels is therefore done either entirely by hand or with hand-written judgment rules. Manual identification of a large number of documents is error-prone and time-consuming, while judgment rules written against the indentation, punctuation, row and column spacing, font sizes, regular expressions and the like of particular documents are hard to maintain and do not carry over to other documents. Specifically, directory entry information may include directory entry level information and directory entry content. The directory entry level indicates whether the entry is a parent level or a child level (and, specifically, which child level). The directory entry content refers to the numbers, symbols and/or text that make up the current directory entry; for example, in the directory entry "1.1, related organization", these are the digit 1, the dot, the pause mark (the separator after the number) and the text "related organization".
In this regard, in some specific examples, the specific procedure for "parsing to obtain the directory entry information and character coordinates of the initial document" may include: parsing the content of the initial document with a format parser to obtain the coordinates of each character; converting the initial document into an image and inputting the image into a trained image recognition model to identify the directory entry areas in the initial document and the level to which each belongs; and taking the characters whose coordinates lie in a directory entry area as directory entry characters, and constructing from the coordinates of the directory entry characters a directory tree that reflects the directory level information.
Many PDF files contain no explicit directory information, so the directory cannot be read directly from the file. Existing methods usually identify directory entries by font or by regular expression; for example, a paragraph in bold whose text begins with a sequence number followed by a symbol is often a directory entry. However, this approach requires a large number of rules that are difficult to write and maintain. The method may therefore rely on images to identify the directory entries of a PDF file (i.e., to parse out the directory information):
1. Convert the document into an image and input it into an image recognition model (a model that applies an image boundary recognition algorithm). The model returns detection frames for the areas that belong to directory entries, i.e., each directory entry is marked by a frame. The rectangular frame shown in FIG. 3 may serve as the detection frame, and other frame styles, such as underlines or zigzag frames, may also be used. The training corpus of the image recognition model may be a large number of directory entries in common formats, for example, paragraphs set in bold or in a larger font size whose text begins with a sequence number followed by a symbol; such paragraphs may be centered or left-aligned;
2. Use a standard PDF format parser to obtain the coordinates of each character in the PDF document; step 2 may be performed at the same time as step 1 or in either order, as the user decides;
3. Combine the results of steps 1 and 2 to generate the directory information of the document, i.e., treat all characters from step 2 whose coordinates lie in the same rectangular frame from step 1 as one directory entry.
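Step 3 can be sketched in Python as follows. This is an illustrative sketch rather than the application's implementation: the function names are hypothetical, the detection frames and character boxes are assumed to have already been scaled into the same coordinate space, and testing the center point of each character box is an assumed way of deciding that a character lies "in" a frame.

```python
def center(box):
    """Center point of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def contains(region, point):
    """True if the point lies inside the region (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1

def group_chars(regions, chars):
    """Merge characters into one string per detection frame.

    regions: detection frames returned by the image recognition model.
    chars:   (character, box) pairs from the PDF format parser, in reading order.
    Characters outside every frame are ignored.
    """
    groups = [[] for _ in regions]
    for ch, box in chars:
        point = center(box)
        for i, region in enumerate(regions):
            if contains(region, point):
                groups[i].append(ch)
                break
    return ["".join(g) for g in groups]

# Hypothetical data: two frames and a few parsed characters.
regions = [(0, 0, 100, 20), (0, 30, 100, 50)]
chars = [
    ("1", (0, 2, 8, 18)), (".", (8, 2, 12, 18)), ("A", (12, 2, 20, 18)),
    ("x", (0, 32, 8, 48)),
    ("z", (200, 200, 210, 210)),  # lies outside every frame
]
entries = group_chars(regions, chars)
```

The same grouping applies unchanged to paragraph areas, since only the meaning of the frames differs.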
In addition, the position information of each directory entry (such as its rectangular frame) can be combined to construct a directory tree that is stored in a database. This makes the document content easy to store and index and allows the PDF document information to be used by different downstream processing logic, where downstream processing logic refers to business logic such as locating characters or judging whether the document is of a given type (such as a poster).
The process of constructing the directory tree may include: encoding each directory entry according to its hierarchical relationship or its top-to-bottom position in the initial document and arranging the entries by code, where the code of every child directory entry carries the code of its parent directory entry; and pushing the codes that belong to the same parent directory entry onto a sequential stack in order until all directory entries have been traversed, thereby constructing the directory tree.
The hierarchical relationship of the directory entries may be determined from the sequence number that prefixes each entry. For example, a third-level directory entry numbered "1.1.1" is subordinate to the second-level directory entry numbered "1.1" (its parent), which in turn is subordinate to the first-level directory entry numbered "1" or named "first subsection" (the parent of the second level). Accordingly, each directory entry may be encoded according to its prefix sequence number and the entries arranged by code, for example: 1, 1.1, 1.1.1, 2, 2.1, 2.2, 2.2.1, ...
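As an illustration only, the level and ordering implied by a prefix sequence number can be computed as in the following Python sketch. The function names `level_of` and `sort_key` are hypothetical, and the sketch assumes dot-separated numeric prefixes such as "1.1.1".

```python
def level_of(prefix: str) -> int:
    """Level of a directory entry: "1" -> 1, "1.1" -> 2, "1.1.1" -> 3."""
    return len(prefix.split("."))

def sort_key(prefix: str) -> tuple:
    """Compare prefixes numerically, so that "10" sorts after "2"."""
    return tuple(int(part) for part in prefix.split("."))

prefixes = ["2.1", "1", "2.2.1", "1.1.1", "2", "1.1", "2.2"]
ordered = sorted(prefixes, key=sort_key)
```

Sorting on integer tuples rather than on the raw strings keeps a parent immediately ahead of its children and avoids the string ordering in which "10" would precede "2".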
Alternatively, each directory entry may be encoded (or numbered) according to its top-to-bottom position in the initial document and the entries arranged by code. Like the prefix sequence number, the top-to-bottom position can express the subordinate relationship between directory entries. The codes of a first-level directory entry and its subordinate entries can be held temporarily in the sequential stack in order to construct the directory tree:
1. Take all directory entries of the current initial document and arrange them in ascending order of top-to-bottom position.
2. Take the first directory entry (e.g., "directory 1" in FIG. 3, or a parent directory entry with prefix number 1), set its directory code to 001 or 1 and its parent code to 0 (indicating that it belongs to the highest level), and push the directory code onto a first-in, last-out stack variable S. The stack represents the path between the current directory entry and the directory tree root, that is, which intermediate directory entries lie between the two. The stack top T at this point is the code of the first-level directory entry (prefix number 1). T changes as elements are pushed and popped, and it indicates which parent level (e.g., the first section, or prefix number "1") the entries currently recorded on the stack belong to. The directory tree root may be a virtual node (numbered 0), meaning that every highest-level directory entry (first section, second section, ...) is subordinate to this single root node; alternatively, each highest-level directory entry may itself serve as a directory tree root. Either choice helps restore the subordinate relationship between upper-level and lower-level directory entries or trace back to a parent directory entry.
3. Continue by traversing the next directory entry C:
(1) If the level of C (e.g., the directory entry with prefix number 2) is the same as that of the stack top element T, C is a same-level directory entry: set the code of C to the next code after that of T, set the parent code of C to the parent code of T, and make C the new stack top;
(2) If the level of C (e.g., the first-level directory entry with prefix number 2) is smaller than that of T (e.g., the third-level directory entry with prefix number 1.1.1) and the levels differ by N (here N is 2), C is N levels above T, i.e., the traversal has returned to a higher-level directory entry. Pop N elements from stack S, take the new stack top element as T, set the code of C to the next code after that of T, set the parent code of C to the parent code of T, and push C onto stack S;
(3) If the level of C (e.g., the second-level directory entry with prefix number 2.1) is greater than that of T (e.g., the first-level directory entry with prefix number 2), C is a child of T: set the code of C to the code of T followed by 001, set the parent code of C to the code of T, and push C onto stack S.
In summary, the purpose of step 3 is to traverse each highest-level directory entry (e.g., the first section and the second section, with prefix numbers "1" and "2") together with its subordinate directory entries, so that the stack records the directory entries of each section in turn, as shown in FIG. 5.
4. Repeat step 3 until all directory entries have been traversed, thereby constructing the directory tree.
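The traversal in steps 1 to 4 can be sketched in Python as follows. This is one possible reading rather than the application's code: the `Entry` class and function names are hypothetical, each code segment is assumed to be three digits wide, the input is assumed to start with a highest-level entry, and the stack is kept as the path from the root to the current entry, so a same-level entry replaces the stack top instead of being pushed beside it.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    text: str
    level: int              # 1 = highest level
    code: str = ""
    parent_code: str = ""

def next_sibling(code: str) -> str:
    """Increment the last 3-digit segment: "001002" -> "001003"."""
    return code[:-3] + f"{int(code[-3:]) + 1:03d}"

def assign_codes(entries):
    """Assign code and parent_code to entries given in top-to-bottom order."""
    stack = []  # path from the tree root to the most recent entry
    for c in entries:
        # Case (2): C is above T, so pop back to C's level.
        while stack and stack[-1].level > c.level:
            stack.pop()
        if stack and stack[-1].level == c.level:
            # Cases (1) and (2): C is a sibling of T and takes the next code.
            t = stack.pop()
            c.code, c.parent_code = next_sibling(t.code), t.parent_code
        elif stack:
            # Case (3): C is the first child of T.
            t = stack[-1]
            c.code, c.parent_code = t.code + "001", t.code
        else:
            # Step 2: the first entry hangs off the virtual root "0".
            c.code, c.parent_code = "001", "0"
        stack.append(c)
    return entries

levels = [1, 2, 3, 1, 2, 2, 3]  # e.g. 1, 1.1, 1.1.1, 2, 2.1, 2.2, 2.2.1
coded = assign_codes([Entry(f"entry {i}", lv) for i, lv in enumerate(levels)])
codes = [e.code for e in coded]
```

Keeping only the current path on the stack means its depth never exceeds the depth of the directory, regardless of how many entries the document contains.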
Because a PDF document has no concept of a paragraph, the text (or characters) belonging to one paragraph may be broken into several elements, so multiple lines of text that originally belong to one paragraph need to be merged according to some rule. Existing methods merge lines based on first-line indentation, the spacing between adjacent lines, punctuation marks and the like, but they require a large number of rules to be maintained, and the rules easily conflict. Therefore, in some specific examples, similar to the manner of parsing to obtain directory information described above, the paragraph layout of the PDF document may be identified based on image contours (i.e., parsing to obtain paragraph information). Specifically, the paragraph areas in the initial document may be identified by the image recognition model, and all characters whose coordinates lie in the same paragraph area are treated as the characters of one paragraph. For example:
1. Convert the PDF document into an image and input the image into the image recognition model, which identifies the coordinate areas of the paragraphs in the document; a paragraph area may be represented by a detection frame such as a rectangular frame;
2. Use a standard PDF format parser to obtain the coordinates of each character in the document; step 2 may be performed at the same time as step 1 or in either order;
3. Combine the results of steps 1 and 2 to generate the paragraph information of the document, i.e., treat all characters from step 2 whose coordinates lie in the same rectangular frame from step 1 as one paragraph, and store the paragraph information in a database.
In a specific application, the order in which the directory level information and the paragraph information are parsed is not limited; the two may also be parsed simultaneously, as the actual situation requires.
Based on the above, consider that the prior art defines no intermediate structure for the parsing result of a PDF document: the information is not stored at an intermediate stage, and the output of the parsing step is usually passed directly to one particular downstream processing logic. The parsing result is therefore suited only to that downstream processing logic (for example, extracting and locating certain characters) and cannot be reused by others (for example, judging the specific business type of the document or converting the PDF into plain text), which leads to a great deal of repeated work and lowers efficiency. Therefore, after the characters whose coordinates lie in the same paragraph area have been treated as the characters of one paragraph, the method of the embodiment of the present application may further include: constructing a paragraph intermediate information table from the parsed paragraph attributes and the coordinates of each character in the paragraph. The paragraph attributes include the coordinates of the paragraph in the initial document; the paragraph coordinates may be the coordinates of the paragraph's detection frame (such as the sequence number of the rectangular frame, or the combined coordinates of the first and last characters in the frame) or the coordinates of the first character of the paragraph.
As an intermediate structure, the paragraph intermediate information table caches as much of the initial document's information as possible, so that every downstream processing logic can quickly use the same table and no new table needs to be built for the same initial document for each downstream processing logic. In other words, the coupling between the document parsing result (such as paragraph and/or directory level information) and the downstream processing logic is reduced, the reuse and generality of the information are improved, and the workload is reduced. For example, the paragraph intermediate information table may contain the paragraph information shown in Table 1, such as the page number of a paragraph, the top coordinate of its start position (e.g., the coordinate of the first character in the paragraph), and the coordinates of each character in the paragraph (for use by the subsequent positioning rules).
Field        Description
id           Primary key
file_id      PDF file id
page_num     Number of the page on which the block is located
top          Top value within the page
text         Text of the block
char_pos     Position of each character within the block (i.e., paragraph)
text_detail  Details of the text within the block, e.g. font sizes
position     Coordinates of the block itself, consisting of several points, e.g. those of the first and last characters
Table 1. Information contained in the paragraph intermediate information table
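A record with the fields of Table 1 might be assembled as in the following Python sketch. Only the field names come from Table 1; the class name, the `build_record` helper, the value types and the sample data are assumptions made for illustration, and the characters are assumed to arrive in reading order with boxes given as (x0, y0, x1, y1).

```python
from dataclasses import dataclass, asdict

@dataclass
class ParagraphRecord:
    id: int
    file_id: str
    page_num: int
    top: float
    text: str
    char_pos: list      # one box per character of `text`
    text_detail: dict   # e.g. the font sizes used in the block
    position: list      # corner points of the block itself

def build_record(rec_id, file_id, page_num, frame, chars):
    """Build one table row from a paragraph frame and its characters.

    frame: (x0, y0, x1, y1) detection frame of the paragraph.
    chars: dicts with keys "c" (character), "box" and "size" (font size).
    """
    x0, y0, x1, y1 = frame
    return ParagraphRecord(
        id=rec_id,
        file_id=file_id,
        page_num=page_num,
        top=y0,
        text="".join(ch["c"] for ch in chars),
        char_pos=[ch["box"] for ch in chars],
        text_detail={"font_sizes": sorted({ch["size"] for ch in chars})},
        position=[(x0, y0), (x1, y1)],
    )

chars = [
    {"c": "O", "box": (10, 100, 18, 112), "size": 12.0},
    {"c": "K", "box": (18, 100, 26, 112), "size": 12.0},
]
record = build_record(1, "file-001", 3, (10, 100, 200, 112), chars)
row = asdict(record)  # plain dict, ready to be written to a database
```

Because `char_pos` stays aligned with `text`, any downstream logic that matches a substring of `text` can recover the coordinates of the matched characters by slicing `char_pos` with the same indices.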
For reading documents, existing schemes are usually built on a browser plug-in or the open-source pdf.js component and cannot accommodate personalized extensions such as online pre-annotation. Therefore, in some specific examples, in order to extend the document with custom annotation content, after step 11 the method of the embodiment of the present application may further include: displaying the contents of all directory entries of the initial document in a tree structure according to the directory level information; constructing a directory intermediate information table corresponding to the directory level information; and displaying the recorded elements of each character contained in the initial document, and obtaining and displaying the annotation content of each character through its character coordinates.
The parsed directory level information may be stored in a low-redundancy tree structure (i.e., as a directory tree), with each record being one directory entry. In addition, the parsed directory level information may be formed into a directory intermediate information table, which may include the directory entry information shown in Table 2, such as the code of the current directory entry (a layer-by-layer encoding scheme is used, which simplifies directory tree operations in the subsequent steps), the code of the parent directory entry, the text content of the directory entry, and the top value of its coordinates within the page. A directory code (code) such as 001002003 embeds the codes of its ancestors: the segments 001, 002 and 003 correspond to the first-level, second-level and third-level directory entries respectively. This makes the directory tree easy to construct, and because the parent code (parent_code) can be derived from the directory code, it may be omitted when the directory code is recorded, which reduces data redundancy.
Table 2. Information contained in the directory intermediate information table
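The way a layered directory code carries its ancestors can be illustrated with the following Python sketch. The helper names are hypothetical, and a fixed segment width of three digits is assumed from the example code 001002003.

```python
WIDTH = 3  # assumed number of digits per level

def split_code(code: str) -> list:
    """Split a layered code into one segment per level."""
    return [code[i:i + WIDTH] for i in range(0, len(code), WIDTH)]

def parent_code(code: str) -> str:
    """Derive the parent code; first-level entries hang off the virtual root "0"."""
    return code[:-WIDTH] or "0"

def ancestor_codes(code: str) -> list:
    """Codes of all ancestors, from the first level down to the parent."""
    return [code[:i] for i in range(WIDTH, len(code), WIDTH)]

segments = split_code("001002003")
parent = parent_code("001002003")
ancestors = ancestor_codes("001002003")
```

Since the parent and all ancestors follow from the code alone, storing a separate parent_code column is optional, which is the redundancy saving described above.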
The front-end visual diagram shown in fig. 4 can be divided into a left part, a middle part and a right part, the left side is the content of the parsed catalogue (tree), and clicking one item in the catalogue can quickly jump to the corresponding position of the original text; the middle is a PDF original text rendering area, the PDF original text can be restored with high fidelity in an HTML DOM mode, characters which are pre-positioned (see steps 12 and 13 for details) in the original text can be highlighted through a content positioning rule, the right side of the positioned characters can display an annotation frame display area corresponding to the positioned characters, the display area contains annotation content, and the positioned characters and the annotation content can be connected through a connecting line to represent a corresponding relation.
Specifically, regarding the rendering and presentation of the directory tree: the back end parses the directory of the current PDF document in the parsing stage and stores it in the database, so that the front-end browser can fetch the directory records and convert them into a tree data structure (such as a directory tree), which is displayed in the left area of the interface. It should be noted that the directory tree is constructed mainly to help back-office staff index the content of interest in the document, for example to quickly locate where a certain amount is recorded in the PDF of a prospectus and to extract that amount to determine whether it meets the audit standard.
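The conversion from flat directory records to a tree can be sketched as follows. This is an illustration in Python rather than the front-end code itself; the record fields `code`, `parent_code`, `text` and `top` follow the fields named for the directory intermediate information table, but their exact names and the function `build_tree` are assumptions.

```python
from typing import Dict, List


def build_tree(records: List[dict]) -> List[dict]:
    """Turn flat directory records into nested nodes with a 'children' list."""
    nodes: Dict[str, dict] = {
        r["code"]: {"code": r["code"], "text": r["text"], "top": r["top"], "children": []}
        for r in records
    }
    roots: List[dict] = []
    # Fixed-width codes sort lexicographically into document order.
    for r in sorted(records, key=lambda rec: rec["code"]):
        parent = nodes.get(r.get("parent_code") or "")
        # Entries without a known parent become top-level nodes.
        (parent["children"] if parent else roots).append(nodes[r["code"]])
    return roots


records = [
    {"code": "001002", "parent_code": "001", "text": "Related institution", "top": 310.0},
    {"code": "001", "parent_code": None, "text": "Issue profile", "top": 120.0},
    {"code": "001001", "parent_code": "001", "text": "Basic information", "top": 180.0},
    {"code": "002", "parent_code": None, "text": "Risk factors", "top": 90.0},
]
tree = build_tree(records)
```

Each node keeps its `top` value so that a click handler could scroll the rendering area to the matching position.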
Regarding the rendering and presentation of the PDF document: rendering of the original PDF in the browser is divided into a canvas layer and a text layer. When the user browses to a page, the binary document stream of that page is loaded and the interface elements in it are parsed according to the format defined by the PDF standard. The canvas layer draws the interface elements of the original document, such as text and pictures, in the browser using canvas drawing.
Highlighting the pre-annotated position and the corresponding annotation box: the browser fetches all annotations of the current document and their coordinate information through an interface. In the page scroll event, the front-end code checks whether the currently displayed page has annotation information; if so, it finds the corresponding HTML block elements in the text layer of the rendered PDF according to the annotation coordinates and changes their background color to indicate that this position is a preset annotation. At the same time, an HTML text box is generated in the annotation box display area on the right, and the description related to the annotation point, i.e., the annotation content, is displayed in it; the text box may contain rich HTML interface elements such as text, links, pictures and tables. The top value of the text box coordinates (for example, the coordinates of the first character in the box) is set to the top value of the annotation box coordinates, and finally a connecting line is drawn between the annotated text and the annotation content as an SVG image. The annotated text, the annotation box and the connecting line between them may all use relative positioning, so that when the container scrolls as a whole, the annotation box scrolls in step with the annotated text.
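The coordinate lookup in the scroll handler can be illustrated with a small geometric sketch. It is written in Python for consistency with the other examples, whereas a real front end would do this in browser script; the `Box` type, its fields and `blocks_to_highlight` are hypothetical, and the sketch assumes annotation and text-layer coordinates share one coordinate system per page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    page: int
    left: float
    top: float
    width: float
    height: float

    def overlaps(self, other: "Box") -> bool:
        """True when both boxes are on the same page and their areas intersect."""
        return (
            self.page == other.page
            and self.left < other.left + other.width
            and other.left < self.left + self.width
            and self.top < other.top + other.height
            and other.top < self.top + self.height
        )


def blocks_to_highlight(annotation: Box, text_blocks: List[Box], visible_page: int) -> List[Box]:
    """Return the text-layer blocks whose background should change."""
    if annotation.page != visible_page:
        return []  # nothing to do until the annotated page is displayed
    return [b for b in text_blocks if annotation.overlaps(b)]


annotation = Box(page=3, left=100, top=200, width=40, height=12)
blocks = [
    Box(page=3, left=90, top=198, width=30, height=14),   # intersects the annotation
    Box(page=3, left=300, top=198, width=30, height=14),  # same line, far to the right
    Box(page=4, left=100, top=200, width=40, height=12),  # same place, other page
]
```

Only the first block is selected while page 3 is displayed, and nothing is selected while another page is displayed.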
In practice, the order in which operations such as constructing the intermediate information table, displaying the directory entry content and displaying the annotations are performed relative to step 12 or step 13 is not limited; they may also be executed simultaneously, as determined by the actual situation or logical requirements.
12. Configure the content positioning rule of the target text according to the information of the forward directory entries of the paragraph in which the target text is located.
The directory entry information may include directory entry level information, where the directory entry level indicates whether the current directory entry belongs to a parent level or a child level. Specifically, therefore, the content positioning rule can be configured according to the child-level directory entries passed through between the paragraph and the adjacent parent-level forward directory entry.
Many business scenarios (such as the review of listing prospectuses) require content positioning rules to be defined in advance so that the positions of interest in a document are marked, that is, the review points are located beforehand and business staff can readily confirm whether they meet the standard. Such rules are usually given in terms of directory levels or keywords and are typically executed by hard-coded regular expressions. However, the ways of defining these rules are not uniform and the descriptions are not concise, so the rules are hard to maintain and poorly reusable. A unified content positioning rule can therefore be defined; the rule can be associated with the coordinates of each character, and its general form can be as follows:
A->B->C[key];
where A, B and C each represent a regular expression that matches a directory entry, the arrow -> denotes the next directory level, and the content in brackets [ ] is a regular expression that matches the paragraph (text). A wildcard may be used in the directory part to stand for any number of directory entries, i.e., any number of levels. For example, *->B[key] means that the last-level directory entry satisfies rule B and that key is found in the paragraph (text) under that directory entry; A->[key] means that the first-level directory entry satisfies rule A and that the keyword key is found in a paragraph (text) under that directory entry.
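A rule in this form can be split into its directory part and its paragraph part with a few lines of Python. The sketch below is an assumption about how such a parser might look, not the parser of the embodiment: the names `Rule`, `RULE_RE` and `parse_rule` are hypothetical, and it assumes that the directory expressions themselves contain neither `->` nor `[`.

```python
import re
from dataclasses import dataclass
from typing import List

# Directory part, then the bracketed paragraph expression, then an optional ';'.
RULE_RE = re.compile(r"^(?P<path>.*?)\[(?P<key>.*)\];?$")


@dataclass
class Rule:
    levels: List[str]  # one regular expression (or "*") per directory level
    key: str           # regular expression applied to the paragraph text


def parse_rule(text: str) -> Rule:
    """Parse a rule such as 'A->B->C[key];' into its components."""
    match = RULE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a content positioning rule: {text!r}")
    # Empty pieces (e.g. the trailing one in 'A->[key]') are dropped.
    levels = [part.strip() for part in match.group("path").split("->") if part.strip()]
    return Rule(levels=levels, key=match.group("key").strip())
```

Keeping the wildcard as the literal string `"*"` in `levels` leaves its interpretation to whatever component later matches the rule against the directory tree.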
13. Execute the content positioning rule to determine the position of the target text in the initial document.
Based on the description of step 12, suppose for example that the business requirement is to locate the target text "Zhang San" (i.e., the key), that the coordinates of the two characters of the text to be located (i.e., the anchor point) "Zhang San" were obtained by parsing in step 11, and that the anchor point has the forward directory entries "sponsor" (child level), "related institution" (child level) and "issue profile" (parent level). The following content positioning rule can then be written:
"issue profile->related institution->sponsor[name]";
This content positioning rule means that a first-level directory entry contains "issue profile", a second-level directory entry beneath it contains "related institution", a third-level directory entry beneath that contains "sponsor", and the text under the third-level directory entry contains the keyword "name". If that "name" is the required "Zhang San", the specific coordinates of the name "Zhang San" in that chapter of the document are located and saved, so that the pre-review point "Zhang San" can later be highlighted for the reviewer or jumped to directly at the corresponding coordinate position.
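Executing such a rule can be sketched as two steps: match the directory expressions against the chain of forward directory entries, then search the paragraph text and return the coordinates of the matched characters. The Python below is only an illustration under stated assumptions: the paragraph fields `path`, `text` and `coords`, and the functions `path_matches` and `locate`, are hypothetical; the directory expressions are assumed to cover the whole chain of forward entries, with `*` absorbing any number of levels.

```python
import re
from typing import List, Sequence, Tuple

Coord = Tuple[int, float, float]  # (page, left, top) of one character


def path_matches(levels: Sequence[str], path: Sequence[str]) -> bool:
    """Match per-level expressions against directory entry texts, top level first."""
    if not levels:
        return not path
    head, rest = levels[0], levels[1:]
    if head == "*":  # wildcard: skip zero or more directory levels
        return any(path_matches(rest, path[i:]) for i in range(len(path) + 1))
    return bool(path) and re.search(head, path[0]) is not None and path_matches(rest, path[1:])


def locate(levels: Sequence[str], key: str, paragraphs: List[dict]) -> List[List[Coord]]:
    """Return, for every hit, the coordinates of the matched characters."""
    hits: List[List[Coord]] = []
    for para in paragraphs:
        if not path_matches(levels, para["path"]):
            continue
        for m in re.finditer(key, para["text"]):
            hits.append(para["coords"][m.start():m.end()])
    return hits


paragraphs = [
    {"path": ["Issue profile", "Related institution", "Sponsor"],
     "text": "Name: Zhang San",
     "coords": [(5, 72.0 + 6 * i, 300.0) for i in range(15)]},
    {"path": ["Issue profile", "Basic information"],
     "text": "Zhang San",
     "coords": [(4, 72.0 + 6 * i, 150.0) for i in range(9)]},
]
hits = locate(["Issue profile", "Related institution", "Sponsor"], "Zhang San", paragraphs)
```

The second paragraph contains the same name but sits under a different directory chain, so it is not reported.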
In summary, as shown in Fig. 2a or Fig. 2b, the embodiment of the present application proposes a scheme that parses paragraphs and directories using an image recognition algorithm, defines a general rule for locating points of interest in a document, and ultimately implements an extensible browser-based PDF reader. Specifically, the PDF document is parsed to obtain the coordinate information of each character in the document; at the same time, each page of the PDF is captured as an image and input into an image recognition model to obtain the bounding rectangles of the paragraphs and directory entries. By combining the two kinds of position information (character coordinates and rectangular regions), the paragraph and directory tree information of the document (such as the general-format intermediate information tables and the directory tree) can be derived and stored in a database. The defined content positioning rules are then used to compute the coordinate position corresponding to each rule, which is also stored in the database, and finally the front-end browser renders the result and supports human-computer interaction.
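The step that combines character coordinates with recognized rectangles can be sketched as a point-in-rectangle test. The Python below is a simplified illustration: it assumes the rectangles have already been converted into the same coordinate system as the characters, that characters on one line share the same y value, and that regions do not overlap. The names `Rect`, `Char`, `contains` and `group_chars` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)
Char = Tuple[str, float, float]           # (character, x, y)


def contains(rect: Rect, x: float, y: float) -> bool:
    left, top, right, bottom = rect
    return left <= x <= right and top <= y <= bottom


def group_chars(chars: List[Char], regions: List[Rect]) -> List[str]:
    """Collect the characters falling inside each region, in reading order."""
    buckets: List[List[Tuple[float, float, str]]] = [[] for _ in regions]
    for ch, x, y in chars:
        for i, rect in enumerate(regions):
            if contains(rect, x, y):
                buckets[i].append((y, x, ch))
                break  # a character is assigned to at most one region
    # Sort top-to-bottom, then left-to-right, and join into text.
    return ["".join(ch for _, _, ch in sorted(bucket)) for bucket in buckets]


regions = [(50, 100, 300, 130), (50, 140, 300, 200)]
chars = [("b", 70, 110), ("a", 60, 110), ("d", 60, 170), ("c", 60, 150), ("x", 400, 110)]
texts = group_chars(chars, regions)
```

The character at x = 400 lies outside both rectangles and is therefore left out of every paragraph.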
In particular, the embodiment of the present application has the following advantages:
1. It proposes a scheme that uses an image boundary recognition algorithm to parse structural information such as paragraphs and directories in a PDF document, without the need to define complicated parsing rules.
2. It provides a general intermediate data format (such as the intermediate information tables) for storing PDF parsing results; the generality of the storage format suits different downstream processing and avoids repeated work such as rebuilding tables.
3. It provides a general syntax for writing content positioning rules based on the directory structure and paragraph keywords, which can greatly improve both the efficiency of writing positioning rules and their execution efficiency at run time.
4. It proposes a scheme for high-fidelity display of PDF in the browser, including additional information such as the directory and annotation boxes; the scheme is implemented purely in the front end and is readily extensible.
A second aspect of the present application provides a specific example of a document processing system, the system comprising:
a parsing unit, configured to parse an initial document to obtain directory entry information, paragraph information and character coordinates of the initial document, wherein the initial document comprises a PDF format document;
a processing unit, configured to configure the content positioning rule of the target text according to the information of the forward directory entries of the paragraph in which the target text is located, wherein a forward directory entry is a directory entry located above that paragraph;
and the processing unit is further configured to execute the content positioning rule to determine the position of the target text in the initial document.
Optionally, the directory entry information includes directory entry level information, where the directory entry level indicates whether the current directory entry belongs to a parent level or a child level; the processing unit is specifically configured to:
configure the content positioning rule according to the information of the child-level directory entries passed through between the paragraph and the adjacent parent-level forward directory entry.
Optionally, the parsing unit is specifically configured to:
parse the content of the initial document using a format parser to obtain the coordinates of each character;
convert the initial document into an image and input the image into a trained image recognition model to recognize the directory entry areas in the initial document and the level to which each directory entry area belongs;
and take the characters whose coordinates fall within a directory entry area as directory entry characters, and construct, from the coordinates of the directory entry characters, a directory tree that reflects the directory level information.
Optionally, the parsing unit is specifically configured to:
encode each directory entry according to its hierarchical relationship or its top-to-bottom position in the initial document, and arrange the directory entries by code, wherein the code of each child-level directory entry belonging to the same parent-level directory entry carries the code of that parent-level directory entry;
and push the codes belonging to the same parent-level directory entry onto a sequence stack in order until all directory entries have been traversed, so as to construct the directory tree.
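One plausible reading of this stack-based construction is sketched below in Python. It is an interpretation rather than the embodiment's own procedure: the stack holds the chain of ancestors of the entry being processed, and an entry is attached to the nearest stacked entry whose code is a prefix of its own. The function name `build_tree_with_stack` is hypothetical, and the sketch assumes unique, fixed-width codes in which a child's code begins with its parent's code.

```python
from typing import List


def build_tree_with_stack(codes: List[str]) -> List[dict]:
    """Build a directory tree from prefix-carrying codes using an ancestor stack."""
    roots: List[dict] = []
    stack: List[dict] = []  # ancestors of the entry currently being placed
    for code in sorted(codes):  # fixed-width codes sort into document order
        node = {"code": code, "children": []}
        # Pop entries that are not ancestors of the current code.
        while stack and not code.startswith(stack[-1]["code"]):
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


tree = build_tree_with_stack(["002", "001002", "001", "001001", "001002001"])
```

Because every entry is pushed and popped at most once, the traversal after sorting is linear in the number of directory entries.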
Optionally, the parsing unit is specifically configured to:
identify the paragraph areas in the initial document by means of the image recognition model, and treat the characters whose coordinates fall within the same paragraph area as the characters of one paragraph.
Optionally, the processing unit is further configured to:
construct a paragraph intermediate information table according to the parsed paragraph attributes and the coordinates of each character in the corresponding paragraph, wherein the paragraph attributes include the coordinates of the paragraph in the initial document.
Optionally, the directory entry information comprises directory hierarchy information; the processing unit is further configured to:
display the contents of all directory entries of the initial document in a tree structure according to the directory hierarchy information;
construct a directory intermediate information table corresponding to the directory hierarchy information;
and display the recorded elements of each character contained in the initial document, and obtain and display the annotation content of each character by means of the character coordinates.
In the embodiment of the present application, the operations performed by each unit of the document processing system are similar to those described in the foregoing first aspect or any one of the specific method embodiments of the first aspect, and are not described herein in detail. Of course, the specific implementation of the operations of the first aspect of the present application may also be implemented with reference to the related description of the second aspect.
Referring to Fig. 6, an electronic device 600 of an embodiment of the present application may include one or more central processing units (CPUs) 601 and a memory 605, where the memory 605 stores one or more application programs or data.
Wherein the memory 605 may be volatile storage or persistent storage. The program stored in the memory 605 may include one or more modules, each of which may include a series of instruction operations in the electronic device. Still further, the central processor 601 may be arranged to communicate with the memory 605 to execute a series of instruction operations in the memory 605 on the electronic device 600.
The electronic device 600 may also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input/output interfaces 604, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processing unit 601 may perform the operations of the foregoing first aspect or of any specific method embodiment of the first aspect, which are not described again here.
The application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method as described in the first aspect or any particular implementation of the first aspect.
The present application provides a computer program product comprising instructions or a computer program which, when run on a computer, causes the computer to perform a method as described above in the first aspect or any particular implementation of the first aspect.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the steps do not imply an order of execution; the order in which the steps are executed should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system (if any) and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, which are not described in detail herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system or apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied, in essence or in whole or in part, in the form of a software product (computer program product) stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a service server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Claims (10)

1. A document processing method, comprising:
parsing an initial document to obtain directory entry information, paragraph information and character coordinates of the initial document, wherein the initial document comprises a PDF format document;
configuring a content positioning rule of a target text according to information of forward directory entries of the paragraph in which the target text is located, wherein a forward directory entry is a directory entry located above that paragraph;
and executing the content positioning rule to determine the position of the target text in the initial document.
2. The document processing method according to claim 1, wherein the directory entry information includes directory entry level information, the directory entry level indicating whether a current directory entry belongs to a parent level or a child level; and the configuring of the content positioning rule of the target text according to the information of the forward directory entries of the paragraph in which the target text is located comprises:
configuring the content positioning rule according to the information of the child-level directory entries passed through between the paragraph and the adjacent parent-level forward directory entry.
3. The document processing method according to claim 1, wherein the process of parsing the directory entry information and the character coordinates of the initial document comprises:
parsing the content of the initial document using a format parser to obtain the coordinates of each character;
converting the initial document into an image and inputting the image into a trained image recognition model to recognize the directory entry areas in the initial document and the level to which each directory entry area belongs;
and taking the characters whose coordinates fall within a directory entry area as directory entry characters, and constructing, from the coordinates of the directory entry characters, a directory tree that reflects the directory level information.
4. The document processing method according to claim 3, wherein the process of constructing the directory tree comprises:
encoding each directory entry according to its hierarchical relationship or its top-to-bottom position in the initial document, and arranging the directory entries by code, wherein the code of each child-level directory entry belonging to the same parent-level directory entry carries the code of that parent-level directory entry;
and pushing the codes belonging to the same parent-level directory entry onto a sequence stack in order until all directory entries have been traversed, so as to construct the directory tree.
5. The document processing method according to claim 1, wherein the process of parsing the paragraph information of the initial document comprises:
identifying the paragraph areas in the initial document by means of an image recognition model, and treating the characters whose coordinates fall within the same paragraph area as the characters of one paragraph.
6. The document processing method according to claim 5, wherein after the characters whose coordinates fall within the same paragraph area are treated as the characters of one paragraph, the method further comprises:
constructing a paragraph intermediate information table according to the parsed paragraph attributes and the coordinates of each character in the corresponding paragraph, wherein the paragraph attributes include the coordinates of the paragraph in the initial document.
7. The document processing method according to claim 1, wherein the directory entry information includes directory hierarchy information; after parsing the initial document, the method further comprises:
displaying the contents of all directory entries of the initial document in a tree structure according to the directory hierarchy information;
constructing a directory intermediate information table corresponding to the directory hierarchy information;
and displaying the recorded elements of each character contained in the initial document, and obtaining and displaying the annotation content of each character by means of the character coordinates.
8. A document processing system, comprising:
a parsing unit, configured to parse an initial document to obtain directory entry information, paragraph information and character coordinates of the initial document, wherein the initial document comprises a PDF format document;
a processing unit, configured to configure a content positioning rule of a target text according to information of forward directory entries of the paragraph in which the target text is located, wherein a forward directory entry is a directory entry located above that paragraph;
the processing unit being further configured to execute the content positioning rule to determine the position of the target text in the initial document.
9. An electronic device, comprising:
a central processing unit, a memory and an input/output interface;
wherein the memory is a volatile memory or a persistent memory;
and the central processing unit is configured to communicate with the memory and to execute the instruction operations in the memory to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202310616632.XA 2023-05-29 2023-05-29 Document processing method, system and related equipment Pending CN116992855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310616632.XA CN116992855A (en) 2023-05-29 2023-05-29 Document processing method, system and related equipment

Publications (1)

Publication Number Publication Date
CN116992855A (en) 2023-11-03

Family

ID=88529034


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination