CN116702747A

CN116702747A - PDF online reader design method, device, computer equipment and medium

Info

Publication number: CN116702747A
Application number: CN202310628601.6A
Authority: CN
Inventors: 吴珂皓; 薛逢源; 李博岩
Original assignee: Zhuhai Yingmi Fund Sales Co ltd
Current assignee: Zhuhai Yingmi Fund Sales Co ltd
Priority date: 2023-05-30
Filing date: 2023-05-30
Publication date: 2023-09-05

Abstract

The invention relates to the technical field of text processing and discloses a PDF online reader design method, a device, computer equipment and a medium. The method comprises: acquiring a PDF document; parsing the content of the PDF document, extracting its text content, and storing the text content in a text document; performing entity recognition on the text document to obtain the catalog and the entity objects in the PDF document; creating, for the catalog and the entity objects, hyperlinks pointing to their corresponding pages; and setting identifiers for the hyperlinks and the text document and storing them, with their identifiers, in a server, so that the server, in response to the identifier corresponding to a user's document operation object, queries the corresponding text document or hyperlink and displays it to the user.

Description

PDF online reader design method, device, computer equipment and medium

Technical Field

The invention relates to the technical field of text processing, and in particular to a PDF online reader design method, a device, computer equipment and a medium.

Background

PDF (Portable Document Format), developed by Adobe in the United States, is one of the most popular digital document formats and has become an industry standard for academic articles, technical reports, manuals, newspapers, and electronic books. Each PDF file contains a complete description of a fixed-layout flat document, including text, glyphs, graphics, and other information that needs to be displayed. A common existing PDF reader is Adobe Acrobat DC, which allows a user to read PDF documents, fill in PDF forms, view PDF file information, quickly edit PDF documents, convert PDF documents into Word documents, add digital signatures, and so on, and which offers good stability and compatibility.

In the related art, existing PDF readers such as Adobe Acrobat DC cannot identify the directories and entity objects that appear in document content. When a user wants to view the content corresponding to a directory entry, the user must navigate manually to the listed page number; when a user wants to look up an entity object that appears in a PDF document, the user must leave the document to perform the query. Both situations hinder reading and make the reader inconvenient to use.

Disclosure of Invention

In view of the above, the present invention provides a method, an apparatus, a computer device and a medium for designing a PDF online reader, so as to solve the problem of inconvenient use of the existing PDF reader in the prior art.

In a first aspect, the present invention provides a PDF online reader design method, including:

acquiring a PDF document;

parsing the content of the PDF document, extracting the text content of the PDF document, and storing the text content in a text document;

performing entity recognition on the text document to obtain a catalog and an entity object in the PDF document;

creating, for the catalog and the entity object, hyperlinks pointing to the pages corresponding to the catalog and the entity object;

and setting identifiers for the hyperlinks and the text document, and storing the hyperlinks and the text document with their identifiers in a server, so that the server, in response to the identifier corresponding to a document operation object of a user, queries the corresponding text document or hyperlink and displays it to the user.

According to the invention, the PDF document is converted into a text document, and the entity objects and the catalog corresponding to the Chinese content of the original PDF document are identified. Hyperlinks are set so that a user can reach the content corresponding to an entity object or a catalog entry by clicking its hyperlink. For example, when the entity object is a term, the hyperlink can point to the encyclopedia entry for that term; when the user clicks a catalog entry, the reader jumps directly to the corresponding paragraph of the PDF document. The user can thus quickly find related content through the generated hyperlinks or encyclopedia entries, which makes the reader more convenient to use and improves the user experience.

In an alternative embodiment, after the content parsing of the PDF document, the method further includes:

extracting formulas, icons and pictures in the PDF document, and converting the formulas, icons and pictures into standard format pictures;

extracting metadata and fonts corresponding to text contents in the PDF document, and converting the fonts into Internet fonts;

converting the text content into internet text content based on the internet fonts;

and storing the internet text content and the standard format picture into an HTML document to obtain the HTML document corresponding to the PDF document.

In this manner, non-text content that cannot be recognized as text is converted into pictures of a uniform format and stored, the fonts of the original PDF document are converted into corresponding Internet fonts that can be stored in a server, and the text content is converted into corresponding Internet text content while retaining the original fonts, so that the format of the original PDF document is preserved to the greatest extent. In addition, because the original PDF document is converted into an HTML document that can be stored in the server, different users of the same server can conveniently perform document operations such as annotating and marking.

In an alternative embodiment, after storing the internet text content and the picture in the HTML document to obtain the HTML document corresponding to the PDF document, the method further includes:

setting a document operation identifier corresponding to a document operation, wherein the document operation comprises the action of marking and/or annotating the HTML document;

combining the document operation identifier with the HTML document to obtain an annotated HTML document;

and storing the annotated HTML document in a server based on the text content.

In this manner, a document operation identifier is set for each document operation, such as a mark or an annotation, and is added to the paragraph of the HTML document on which the mark or annotation is made. When different users access the same document from their local clients, marks and annotations on the document can be shared in real time according to the addition, removal, and change of document operation identifiers on the paragraphs.

In an alternative embodiment, storing the annotated HTML document in the server based on the text content comprises:

generating a document identifier corresponding to the annotated HTML document based on the text content;

and storing the annotated HTML document and the document identifier in the server, so that the server, in response to the document identifier corresponding to a document operation object of a user, queries the corresponding annotated HTML document and displays it to the user.

In this manner, a document identifier corresponding to a document is generated from text contents. When the user searches the document, the server displays the corresponding document according to the document ID, so that the user can conveniently search the corresponding document.

In an alternative embodiment, performing entity recognition on the text document to obtain a catalog and an entity object in the PDF document includes:

acquiring an initial entity recognition model and a corpus;

training the initial entity recognition model based on the corpus to obtain a target entity recognition model;

and carrying out entity recognition on the text document based on the target entity recognition model to obtain a catalog and an entity object in the PDF document.

In this manner, a target entity recognition model that meets the requirements is obtained by training the initial entity recognition model. The target entity recognition model performs entity recognition on the text document converted from the original PDF document to obtain the catalog and the entity objects, and hyperlinks are set on them, so that a user can quickly read related content through the generated hyperlinks.

In an alternative embodiment, training the initial entity recognition model to obtain the target entity recognition model includes:

preprocessing a corpus to obtain a training corpus;

inputting the training corpus into an initial entity recognition model to obtain an initial recognition result, and calculating to obtain loss between the training corpus and the initial recognition result;

and training the initial entity recognition model based on the loss to obtain a target entity recognition model.

In this manner, the corpus is preprocessed, and the initial entity recognition model is trained so as to reduce the loss between the training corpus and the recognition result of the initial entity recognition model. The resulting target entity recognition model better meets the requirements and recognizes entities more accurately.

In an alternative embodiment, after training the initial entity recognition model based on the loss to obtain the target entity recognition model, the method further includes:

setting a plurality of specific tasks;

screening the training corpus based on specific tasks to obtain task corpus;

and fine tuning the target entity recognition model based on the task corpus to obtain a fine-tuned target entity recognition model.

In this manner, to bring the trained entity recognition model closer to user requirements, the target entity recognition model is fine-tuned on specific tasks, so that the model is better adapted to the field of the documents the user reads.

In a second aspect, the present invention provides a PDF online reader design apparatus, including:

the document acquisition module is used for acquiring the PDF document;

the content analysis module is used for carrying out content analysis on the PDF document, extracting the text content of the PDF document, and storing the text content into the text document;

the entity recognition module is used for carrying out entity recognition on the text document to obtain a catalog and an entity object in the PDF document;

the hyperlink creation module is used for creating hyperlinks pointing to pages corresponding to the catalogs and the entity objects for the catalogs and the entity objects based on the catalogs and the entity objects;

and the storage display module is used for setting identifiers for the hyperlinks and the text documents, and storing the hyperlinks and the text documents with the identifiers to the server so that the server can search the corresponding text documents or hyperlinks for display to the user in response to the identifiers corresponding to the document operation objects of the user.

In a third aspect, the present invention provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions so as to perform the PDF online reader design method of the first aspect or any implementation manner corresponding to the first aspect.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the PDF online reader design method of the first aspect or any one of the embodiments corresponding thereto.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of a PDF online reader design method according to an embodiment of the present invention.

Fig. 2 is a flow chart of another PDF online reader design method according to an embodiment of the present invention.

Fig. 3 is a flowchart of PDF document storage to a server according to an embodiment of the present invention.

Fig. 4 is a flowchart of PDF document storage at the server side according to an embodiment of the present invention.

Fig. 5 is a flow chart of yet another PDF online reader design method according to an embodiment of the present invention.

Fig. 6 is a flow chart of entity identification according to an embodiment of the present invention.

Fig. 7 is a block diagram of a PDF online reader design apparatus according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

To solve the above problems, the embodiments of the present application provide a PDF online reader design method for a computer device. It should be noted that the execution body of the method may be a PDF online reader design device, which may be implemented in software, hardware, or a combination of the two as part or all of the computer device. The computer device may be a terminal, a client, or a server, and the server may be a single server or a server cluster formed by multiple servers. In the following method embodiments, the execution body is a computer device.

The computer device in this embodiment is suited to reading PDF documents online over the Internet. In the PDF online reader design method, the PDF document is converted into a text document, and the entity objects and the directory corresponding to the Chinese content of the original PDF document are identified. Hyperlinks are set so that a user can reach the content corresponding to an entity object or a directory entry by clicking its hyperlink. For example, when the entity object is a term, the hyperlink can point to the encyclopedia entry for that term; when the user clicks a directory entry, the reader jumps directly to the corresponding paragraph of the PDF document. The user can thus quickly find related content through the generated hyperlinks or encyclopedia entries, which makes the reader more convenient to use and improves the user experience.

In accordance with an embodiment of the present invention, there is provided a PDF online reader design method embodiment, it being noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown or described herein.

In an embodiment of the invention, a PDF online reader design method is provided, which can be used on the above terminal, such as a mobile phone or a tablet computer. Fig. 1 is a flowchart of the PDF online reader design method according to an embodiment of the invention. As shown in Fig. 1, the flow includes the following steps:

Step S101, a PDF document is acquired.

Step S102, content analysis is carried out on the PDF document, text content of the PDF document is extracted, and the text content is stored in the text document.

In an example, extracting the text content of the PDF document may include the following. First, the PDF file is parsed into an operable data structure, which may be done with a PDF parsing library or tool such as PyPDF2, PDFMiner, or pdfplumber. Second, each page of the PDF document is analyzed and its text content is extracted from the parsed page data. The text may be plain text or structured text, and includes different types of text blocks such as paragraphs, titles, lists, and tables.
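The two stages above can be sketched as follows. This is only an illustration under assumptions: it picks pdfplumber from the libraries named above, and the function names and the use of a form feed character to keep page boundaries in the text document are hypothetical choices, not taken from the source.

```python
from pathlib import Path
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the text of each page in page order (requires pdfplumber)."""
    import pdfplumber  # third-party; imported lazily so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page without a text layer.
        return [page.extract_text() or "" for page in pdf.pages]


def save_text_document(page_texts: List[str], out_path: str) -> str:
    """Join page texts with form feeds so page boundaries survive in the text document."""
    text = "\f".join(page_texts)
    Path(out_path).write_text(text, encoding="utf-8")
    return text


def load_pages(text_path: str) -> List[str]:
    """Recover the per-page texts from a saved text document."""
    return Path(text_path).read_text(encoding="utf-8").split("\f")
```

Keeping the page boundaries in the text document is one way to let later steps map a recognized directory entry or entity back to its page.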

Step S103, performing entity recognition on the text document to obtain a catalog and an entity object in the PDF document.

In one example, the catalog and the entity objects in the PDF document are identified by inputting the text document into a trained entity recognition model. The entity recognition model may combine a fine-tuned RoBERTa Chinese pre-trained model with a conditional random field (CRF) model.

Step S104, based on the catalog and the entity object, creating, for the catalog and the entity object, hyperlinks pointing to their corresponding pages.

In one example, the identified directory and entity objects may be highlighted, and hyperlinks to specific pages may be created for them. For example, a directory entry links to the corresponding paragraph in the PDF document, a general entity links to its encyclopedia entry, and an entity that is the name of a fund company links to the company's website.
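The link selection described in this example could look roughly like the sketch below. The `Recognized` record, the `kind` labels, the `#page-N` anchor scheme, and both URLs are hypothetical placeholders introduced for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, Optional
from urllib.parse import quote

# Hypothetical endpoints, used only for illustration.
ENCYCLOPEDIA_URL = "https://encyclopedia.example.com/item/"
COMPANY_SITES: Dict[str, str] = {"Example Fund Co.": "https://fund.example.com/"}


@dataclass
class Recognized:
    text: str                   # surface text found in the document
    kind: str                   # "directory", "term" or "fund_company"
    page: Optional[int] = None  # target page, known for directory entries


def build_hyperlink(item: Recognized) -> str:
    """Choose a link target according to the kind of recognized item."""
    if item.kind == "directory" and item.page is not None:
        return f"#page-{item.page}"             # jump inside the document
    if item.kind == "fund_company" and item.text in COMPANY_SITES:
        return COMPANY_SITES[item.text]         # company website
    return ENCYCLOPEDIA_URL + quote(item.text)  # encyclopedia entry


def highlight(item: Recognized) -> str:
    """Wrap the item in a highlighted anchor element."""
    return f'<a class="highlight" href="{build_hyperlink(item)}">{escape(item.text)}</a>'
```

Unknown company names fall back to the encyclopedia link here, which is a design choice of the sketch rather than something the source states.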

Step S105, setting identifiers for the hyperlinks and the text documents, and storing the hyperlinks and the text documents with the identifiers to a server, so that the server queries the corresponding text documents or hyperlinks to display to the user in response to the identifiers corresponding to the document operation objects of the user.

In an example, IDs are set for the parsed text document and for the hyperlinks of the entity objects and the catalog, and all of them are stored in the server. When a user queries the PDF document, the server can display, in real time, the document with the entity objects and the catalog highlighted, so that the user can quickly find related content through the generated catalog hyperlinks or the encyclopedia entries of the entity objects.
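As a minimal sketch of the identifier-based lookup, the class below stands in for the server-side store. It is an in-memory dictionary for illustration only; the class name, the `kind` prefixes, and the use of UUIDs are assumptions, and a real server would persist the items in a database.

```python
import uuid
from typing import Dict, Optional


class ReaderStore:
    """In-memory stand-in for the server-side store of documents and hyperlinks."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def put(self, kind: str, payload: dict) -> str:
        """Store a text document or hyperlink and return its new identifier."""
        item_id = f"{kind}-{uuid.uuid4().hex}"
        self._items[item_id] = {"kind": kind, **payload}
        return item_id

    def get(self, item_id: str) -> Optional[dict]:
        """Look up the item a user operated on by its identifier."""
        return self._items.get(item_id)
```

For instance, a client that clicks a highlighted directory entry would send the hyperlink's identifier, and the server would answer with the stored target.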

According to the PDF online reader design method of this embodiment, the PDF document is converted into a text document, and the entity objects and the directory corresponding to the Chinese content of the original PDF document are identified. Hyperlinks are set so that a user can reach the content corresponding to an entity object or a directory entry by clicking its hyperlink. For example, when the entity object is a term, the hyperlink can point to the encyclopedia entry for that term; when the user clicks a directory entry, the reader jumps directly to the corresponding paragraph of the PDF document. The user can thus quickly find related content through the generated hyperlinks or encyclopedia entries, which makes the reader more convenient to use and improves the user experience.

In this embodiment, a PDF online reader design method is provided, which may be used on the above terminal, such as a mobile phone or a tablet computer. Fig. 2 is a flowchart of another PDF online reader design method according to an embodiment of the present invention. As shown in Fig. 2, the flow includes the following steps:

Step S201, a PDF document is acquired. For details, refer to step S101 in the embodiment shown in Fig. 1; they are not repeated here.

Step S202, content analysis is carried out on the PDF document, text content of the PDF document is extracted, and the text content is stored in the text document.

Specifically, after content parsing is performed on the PDF document in step S202, the embodiment of the present invention further includes:

Step S2021, extracting the formulas, icons and pictures in the PDF document, and converting the formulas, icons and pictures into standard format pictures.

In one example, the parsed PDF document contains formulas, icons, and pictures in addition to text content. Each page of the PDF document is analyzed, and the formulas, icons, and pictures in the page are extracted; this may include recognizing characters in images using OCR technology. To make the PDF document easier to restore, the charts, formulas, pictures, and the like in the PDF document are converted into pictures of a uniform standard format, such as PNG, and stored in a picture layer.

Step S2022, extracting metadata and fonts corresponding to the text content in the PDF document, and converting the fonts into Internet fonts.

In an example, the parsed PDF document contains, in addition to the text content, metadata and the fonts corresponding to the text content. Metadata extraction: the metadata of the document, such as title, author, creation date, modification date, and keywords, is extracted from the parsed PDF data. This metadata is typically embedded in the attributes or tags of the PDF file. Font analysis: the font information used in the PDF document, including font name, font type, font size, and font style, is analyzed, which helps in understanding the appearance and format of the text content. The extracted metadata, text content, and font information are then collated and stored, either in a database or exported to other formats, for subsequent analysis and use.
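A rough sketch of the metadata and font analysis follows, assuming pdfplumber as the parser (the source names it only as one option for text extraction). pdfplumber exposes document metadata as a dictionary and per-character dictionaries with `fontname` and `size` keys; the function names and the summary format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize_fonts(chars: Iterable[Dict]) -> List[Tuple[str, float, int]]:
    """Count characters per (font name, size), most frequent first."""
    counts = Counter((c["fontname"], round(float(c["size"]), 1)) for c in chars)
    return [(name, size, n) for (name, size), n in counts.most_common()]


def read_metadata_and_fonts(pdf_path: str):
    """Collect document metadata and a font summary (requires pdfplumber)."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        chars = [c for page in pdf.pages for c in page.chars]
        return dict(pdf.metadata), summarize_fonts(chars)
```

The most frequent (font, size) pair is usually the body text, which gives a simple starting point for telling headings from paragraphs.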

Step S2023, converting the text content into Internet text content based on the Internet fonts.

In one example, the text content may be converted into Internet text content by using a font editor to re-encode the fonts of the text content in the PDF. Specifically, the PDF file or font file is first opened with suitable font editor software (e.g., FontForge or Glyphs), and the fonts to be edited are imported into the font editor, either by selecting a file or by importing a font from a font library. In the font editor, the attributes and character set of the font are checked, including the font name, font style (bold, italic, etc.), and character encoding. The text content to be re-encoded is then mapped to the character encoding of the Internet font. Internet fonts typically use Unicode character encoding, so the characters of the original font must be associated with the corresponding Unicode code points. Parameters such as font size, spacing, and font optimization may be adjusted in the font editor as needed. After font editing and character encoding adjustment are complete, the fonts are saved in an Internet font file format, such as TrueType (TTF) or OpenType (OTF), and the generated Internet font file is exported to a suitable location for subsequent use in a web page or application.

Step S2024, storing the Internet text content and the standard format picture into an HTML document to obtain an HTML document corresponding to the PDF document.

Fig. 3 is a flowchart of PDF document storage to a server according to an embodiment of the present invention. As shown in Fig. 3, in one example, the re-encoded and converted Internet text content is embedded in the original PDF file, replacing the fonts originally used. The converted Internet text content is then embedded or referenced in an HTML document, the standard format pictures are saved, and the picture layer is integrated to obtain the HTML document. The HTML document is stored in a server, so that a user can query it and obtain an HTML document whose format is consistent with that of the original PDF document.
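The assembly of the HTML document from the Internet text content and the standard format pictures might be sketched as below. The paragraph `id` scheme, the `picture-layer` class, and the default font name and font path are hypothetical; the source does not describe the HTML layout.

```python
from html import escape
from typing import List


def build_html(title: str, paragraphs: List[str], images: List[str],
               font_family: str = "DocFont", font_url: str = "fonts/doc.ttf") -> str:
    """Assemble one HTML page from text paragraphs and standard format pictures."""
    # Each paragraph gets an id so that links and annotations can target it later.
    body = [f'<p id="p{i}">{escape(text)}</p>' for i, text in enumerate(paragraphs)]
    body += [f'<img class="picture-layer" src="{escape(src)}" alt="">' for src in images]
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        f"<title>{escape(title)}</title>"
        # Reference the converted Internet font so the original typeface is kept.
        f"<style>@font-face{{font-family:'{font_family}';src:url('{font_url}');}}"
        f"body{{font-family:'{font_family}',serif;}}</style></head>\n"
        "<body>\n" + "\n".join(body) + "\n</body></html>"
    )
```

Escaping the paragraph text keeps characters such as `<` in formulas or tables from being interpreted as markup.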

In some optional implementations, after the step S2024, an embodiment of the present invention further includes:

Step a1, setting a document operation identifier corresponding to a document operation, wherein the document operation comprises the action of marking and/or annotating the HTML document.

In one example, a marking system for document operations is designed at the server side, and each different document operation, such as a mark or an annotation, is assigned a different document operation identifier. The marking system sets the document operation identifier by defining the document operation types, designing a document operation ID generation rule, creating an annotation data structure, designing a user interaction interface, receiving a document operation request, and generating a document operation ID.

Specifically, defining the operation types includes determining the types of document operations to be supported, such as marking, annotating, and highlighting the HTML document obtained in the preceding steps. Designing the document operation ID generation rule includes formulating a unique-identifier generation rule so that each document operation obtains its own document operation ID, which may be generated from a timestamp, an auto-increment sequence, a UUID, or the like. Creating the annotation data structure includes designing a suitable data structure to store the document annotation information; the data structure may include fields for the document operation ID, operation type, location information (e.g., page number and coordinates), content, and creator information. Designing the user interaction interface includes implementing, at the server side, a user interaction interface or API for receiving a user's document operation request, for example in the form of a Web interface or a RESTful API. Receiving the document operation request includes the server receiving a document operation request submitted by a user, including the operation type, location information, content, and so on. Generating the document operation ID includes generating a unique document operation ID for the current document operation according to the specified generation rule.
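The annotation data structure and ID generation rule described above can be illustrated with a small sketch. It assumes UUIDs as the generation rule (one of the options listed above); the class, field, and request key names are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Tuple

# Supported document operation types (assumed labels).
ALLOWED_TYPES = {"mark", "annotation", "highlight"}


def new_operation_id() -> str:
    """ID generation rule: a random UUID rendered as 32 hex characters."""
    return uuid.uuid4().hex


@dataclass
class DocumentOperation:
    op_type: str
    page: int
    position: Tuple[float, float]
    content: str
    creator: str
    op_id: str = field(default_factory=new_operation_id)
    created_at: float = field(default_factory=time.time)

    def __post_init__(self) -> None:
        if self.op_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported operation type: {self.op_type}")


def handle_request(request: dict) -> DocumentOperation:
    """Turn a user's document operation request into a record with its own ID."""
    return DocumentOperation(
        op_type=request["type"],
        page=request["page"],
        position=tuple(request["position"]),
        content=request.get("content", ""),
        creator=request["user"],
    )
```

A timestamp or auto-increment sequence would work equally well as the ID rule; UUIDs are used here only because they need no shared counter.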

Step a2, combining the document operation identifier with the HTML document to obtain the annotated HTML document.

In one example, the annotated HTML document is obtained by adding a document operation ID to the corresponding paragraph according to the marks and annotations made on the document.

Step a3, storing the annotated HTML document in a server based on the text content.

In one example, the annotated HTML document is stored in the server by storing the document operation identifier, so that the server can respond to a user's request to query annotations. Specifically, storing the document operation identifier includes storing the generated document operation ID together with related information, such as the operation type, location information, and content, in a database or other persistent storage. Responding to the user request includes the server returning, according to the user's request, the corresponding document operation ID or a message that the document operation succeeded, so that the user can retrieve and view the corresponding document operation.
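The persistence and query steps can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and function names are assumptions made for illustration; the source only says that the information is kept in a database or other persistent storage.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the operations table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS operations ("
        "op_id TEXT PRIMARY KEY, doc_id TEXT, paragraph_id TEXT, "
        "op_type TEXT, content TEXT)"
    )
    return conn


def save_operation(conn: sqlite3.Connection, op_id: str, doc_id: str,
                   paragraph_id: str, op_type: str, content: str) -> str:
    """Persist one document operation and return its ID as the acknowledgement."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO operations VALUES (?, ?, ?, ?, ?)",
                     (op_id, doc_id, paragraph_id, op_type, content))
    return op_id


def operations_for(conn: sqlite3.Connection, doc_id: str) -> List[Tuple[str, str, str, str]]:
    """Return all operations on a document in insertion order."""
    rows = conn.execute(
        "SELECT op_id, paragraph_id, op_type, content FROM operations "
        "WHERE doc_id = ? ORDER BY rowid", (doc_id,))
    return rows.fetchall()
```

Because each row carries the paragraph it belongs to, a client can re-attach the operation IDs to the paragraphs of the HTML document when it loads.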

In some alternative embodiments, step a3 includes:

Step a31, generating a document identifier corresponding to the annotated HTML document based on the text content.

Step a32, storing the annotated HTML document and the document identifier in the server, so that the server, in response to the document identifier corresponding to a document operation object of a user, queries the corresponding annotated HTML document and displays it to the user.

Fig. 4 is a flowchart of PDF document storage at the server side according to an embodiment of the present invention. As shown in Fig. 4, in one example, the document may be saved using an Elasticsearch server. After each document uploaded to the server is converted into an HTML document and a text document, corresponding document IDs are assigned according to the semantic text segments of the text content. When a user performs a document search, the server displays the corresponding document according to the document ID.
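One way to derive identifiers from the text content is sketched below. The source does not say how IDs are computed from the semantic text segments, so the content hash, the blank-line segmentation, and the record layout are all assumptions; the records are plain dictionaries that could be indexed into a search server.

```python
import hashlib
import re
from typing import Dict, List


def split_segments(text: str) -> List[str]:
    """Split on blank lines; a stand-in for real semantic segmentation."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def content_id(text: str, length: int = 16) -> str:
    """Derive a stable identifier from text, ignoring whitespace differences."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


def to_records(text: str) -> List[Dict[str, str]]:
    """Build one record per segment, tagged with its document and segment IDs."""
    doc_id = content_id(text)
    return [{"doc_id": doc_id, "segment_id": content_id(seg), "text": seg}
            for seg in split_segments(text)]
```

A content-derived ID has the side effect that uploading the same document twice yields the same identifier, which may or may not be desired.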

Step S203, performing entity recognition on the text document to obtain the catalog and the entity object in the PDF document. For details, refer to step S103 in the embodiment shown in Fig. 1; they are not repeated here.

Step S204, based on the catalog and the entity object, creating, for the catalog and the entity object, hyperlinks pointing to their corresponding pages. For details, refer to step S104 in the embodiment shown in Fig. 1; they are not repeated here.

Step S205, setting identifiers for the hyperlinks and the text document, and storing the hyperlinks and the text document with their identifiers in the server, so that the server, in response to the identifier corresponding to a document operation object of a user, queries the corresponding text document or hyperlink and displays it to the user. For details, refer to step S105 in the embodiment shown in Fig. 1; they are not repeated here.

According to the PDF online reader design method of this embodiment, non-text content that cannot be recognized as text is converted into pictures of a uniform format and stored, the fonts of the original PDF document are converted into corresponding Internet fonts that can be stored in a server, and the text content is converted into corresponding Internet text content while retaining the original fonts, so that the format of the original PDF document is preserved to the greatest extent. In addition, because the original PDF document is converted into an HTML document that can be stored in the server, different users of the same server can conveniently perform document operations such as annotating and marking.

In this embodiment, a PDF online reader design method is provided, which may be used on the above mobile terminal, such as a mobile phone or a tablet computer. Fig. 5 is a flowchart of yet another PDF online reader design method according to an embodiment of the present invention. As shown in Fig. 5, the flow includes the following steps:

Step S501, a PDF document is acquired. For details, refer to step S101 in the embodiment shown in Fig. 1; they are not repeated here.

Step S502, parsing the content of the PDF document, extracting the text content of the PDF document, and storing the text content in a text document. For details, refer to step S102 in the embodiment shown in Fig. 1; they are not repeated here.

Step S503, entity recognition is carried out on the text document, and the catalogue and the entity object in the PDF document are obtained.

Specifically, the step S503 includes:

Step S5031: an initial entity recognition model and a corpus are obtained.

In one example, the initial entity recognition model may combine a RoBERTa Chinese pre-trained model with a conditional random field (CRF) model, and the corpus may be the People's Daily 2014 corpus. The RoBERTa Chinese pre-trained model and the CRF model are trained on the People's Daily 2014 corpus to implement Chinese word segmentation, part-of-speech tagging and named entity recognition.

RoBERTa is an improved version of BERT that achieves state-of-the-art results by improving the training tasks and the data generation method, training for longer, using larger batches and using more data. RoBERTa's training samples are larger and more diverse than BERT's: it uses BOOKCORPUS together with English Wikipedia (16 GB), CC-NEWS (76 GB), OPENWEBTEXT (38 GB) and STORIES (31 GB), for a total of about 160 GB of data. The 16 GB portion is the original BERT training data, so RoBERTa uses roughly 10 times as much data. A conditional random field is a discriminative probabilistic model commonly used to label or parse sequence data such as natural language text or biological sequences. In named entity recognition, the input sentence is first preprocessed, for example by word segmentation and part-of-speech tagging, and the preprocessing results are then used as the input of the conditional random field, which outputs the named entity labels.
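To make the role of the conditional random field more concrete, the following is a minimal sketch of Viterbi decoding over a linear-chain CRF. The tag set and all emission and transition scores are made-up illustrative values and are not taken from this embodiment; in a RoBERTa plus CRF arrangement the emission scores would be produced by the encoder.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    backptr: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        backptr.append(pointers)
    # Trace back from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative scores only: moving from "O" straight to "I" is heavily penalised.
TAGS = ["B", "I", "O"]
TRANSITIONS = {
    "B": {"B": 0.0, "I": 0.5, "O": 0.0},
    "I": {"B": 0.0, "I": 0.0, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.0},
}
EMISSIONS = [
    {"B": 0.5, "I": 0.1, "O": 1.0},
    {"B": 0.1, "I": 1.5, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
```

Taking the best tag at each position independently would give O, I, O, which is not a valid entity. Because the transition scores penalise that step, the decoder returns B, I, O instead. This is the kind of sequence-level constraint a CRF layer adds on top of per-token scores.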

Step S5032, training the initial entity recognition model based on the corpus to obtain a target entity recognition model.

FIG. 6 is a flowchart of entity recognition according to an embodiment of the present invention. As shown in FIG. 6, in one example, the initial entity recognition model is trained using the corpus so that the trained target entity recognition model is capable of Chinese word segmentation, part-of-speech tagging and named entity recognition. The target entity recognition model is then checked against expectations: if it meets them, it is taken as the final model; otherwise, the training process is repeated.

Compared with conventional BERT, RoBERTa improves the training method in the following respects: 1) Dynamic masking: BERT relies on randomly masking tokens and predicting them. The original BERT implementation performs masking once during data preprocessing, which yields a static mask. RoBERTa instead uses dynamic masking: a new mask pattern is generated each time a sequence is fed to the model. As large amounts of data are fed in, the model gradually adapts to different mask patterns and learns different language representations. 2) Larger batches: RoBERTa uses larger batch sizes during training; in the experiments, batch sizes ranging from 256 to 8000 were used. 3) Text encoding: Byte-Pair Encoding (BPE) is a hybrid of character-level and word-level representations that can handle the large vocabularies common in natural language corpora. The original BERT implementation uses a character-level BPE vocabulary of size 30K, learned after preprocessing the input with heuristic tokenization rules. The Facebook researchers did not take this approach; instead, they trained with a larger byte-level BPE vocabulary containing 50K subword units, without any additional preprocessing or tokenization of the input.
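The difference between static and dynamic masking can be illustrated with a short sketch. This is a simplification for illustration only: every selected token is replaced by a `[MASK]` placeholder, whereas the published BERT recipe also sometimes keeps the token or swaps in a random one. The function names, the 15% masking probability used as a default and the seeds are assumptions of this sketch.

```python
import random
from typing import List

MASK = "[MASK]"


def mask_tokens(tokens: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Replace each token by [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def static_masking(tokens: List[str], epochs: int,
                   mask_prob: float = 0.15, seed: int = 0) -> List[List[str]]:
    """Mask once during preprocessing; every epoch sees the same pattern."""
    masked = mask_tokens(tokens, mask_prob, random.Random(seed))
    return [list(masked) for _ in range(epochs)]


def dynamic_masking(tokens: List[str], epochs: int,
                    mask_prob: float = 0.15, seed: int = 0) -> List[List[str]]:
    """Draw a fresh mask pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, mask_prob, rng) for _ in range(epochs)]


sentence = "the reader converts each pdf page into an html page".split()
static_views = static_masking(sentence, epochs=3, mask_prob=0.3)
dynamic_views = dynamic_masking(sentence, epochs=3, mask_prob=0.3)
```

With static masking the three views are identical by construction. With dynamic masking each view is drawn independently, so the model is generally asked to predict different tokens in different epochs.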

In some alternative embodiments, step S5032 includes:

Step b1: the corpus is preprocessed to obtain a training corpus.

Step b2: the training corpus is input into the initial entity recognition model to obtain an initial recognition result, and the loss between the training corpus and the initial recognition result is calculated.

Step b3: the initial entity recognition model is trained based on the loss to obtain the target entity recognition model.
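Steps b1 to b3 can be sketched end to end with a deliberately tiny stand-in model. The sketch below is not the RoBERTa plus CRF model of this embodiment: it keeps one weight per (token, tag) pair and ignores context, purely to show the loop of preprocessing, forward pass, cross-entropy loss and loss-driven update. The tag set, sample data, learning rate and function names are all assumptions.

```python
import math
import re
from typing import Dict, List, Tuple

TAGS = ["O", "ORG", "LOC"]  # hypothetical tag set for this toy example


def preprocess(raw: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Step b1: strip HTML remnants, normalise case, drop empty tokens."""
    corpus = []
    for token, tag in raw:
        token = re.sub(r"<[^>]+>", "", token).strip().lower()
        if token:
            corpus.append((token, tag))
    return corpus


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(corpus: List[Tuple[str, str]], epochs: int = 50, lr: float = 0.5):
    """Steps b2 and b3: compute the loss against the labels and update on it."""
    weights: Dict[str, List[float]] = {tok: [0.0] * len(TAGS) for tok, _ in corpus}
    history = []  # mean cross-entropy loss per epoch
    for _ in range(epochs):
        total_loss = 0.0
        for token, gold in corpus:
            probs = softmax(weights[token])           # forward pass
            gold_idx = TAGS.index(gold)
            total_loss += -math.log(probs[gold_idx])  # cross-entropy loss
            # The gradient of cross-entropy w.r.t. the scores is (probs - one_hot)
            for i in range(len(TAGS)):
                grad = probs[i] - (1.0 if i == gold_idx else 0.0)
                weights[token][i] -= lr * grad
        history.append(total_loss / len(corpus))
    return weights, history


def predict(weights: Dict[str, List[float]], token: str) -> str:
    scores = weights[token]
    return TAGS[max(range(len(TAGS)), key=lambda i: scores[i])]


raw_corpus = [("<b>Adobe</b>", "ORG"), ("Zhuhai", "LOC"), ("reader", "O"), ("  ", "O")]
training_corpus = preprocess(raw_corpus)
model_weights, loss_history = train(training_corpus)
```

The loss history falls from epoch to epoch, which is the signal step b3 relies on. A real implementation would replace the lookup table with the encoder and the CRF layer and would compute the loss over whole tag sequences.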

In an example, the process of training the initial entity recognition model may include:

1. Data set preparation: a large amount of Chinese corpus data is prepared, including news, community discussions, several encyclopedias, web books, novels, story-type literature, microblogs, and so on.

2. Data preprocessing: the data is cleaned, segmented into words, stripped of stop words, and so on. This specifically includes: 1) Data cleaning: noise, errors and junk are removed from the data, including HTML tags, special symbols, non-ASCII characters and the like. Cleaning may be performed using regular expressions or related text processing tools. 2) Text normalization: the text is standardized, for example by converting all letters to lower or upper case and by handling abbreviations and common morphological variants uniformly. This helps eliminate case and morphological differences in the text and makes subsequent processing more consistent. 3) Word segmentation: the text is divided into individual words or tokens to form a collection of words. Segmentation may be based on rules, statistical models or pre-trained models; common methods include rule-based segmentation, maximum matching, minimum matching, statistical machine learning methods and neural network models. 4) Stop word processing: common stop words are removed, such as "a", "the" and "is" in English and the Chinese equivalents of "have" and "and". These words occur frequently in text but tend to contribute little to text analysis and modeling. 5) Stemming or lemmatization: words are reduced to their stems or unified to a shared base form. This reduces the interference of different word forms with text processing and feature extraction, for example by reducing plural forms and tenses to a basic form. 6) Vocabulary construction: a vocabulary or dictionary is built from the preprocessed text data, containing every word that appears and its corresponding index. The vocabulary plays an important role in subsequent feature representation and model training. 7) Other processing: depending on the specific task and requirements, other preprocessing operations may also be performed, such as removing low-frequency words, handling special words or phrases, and correcting misspellings.

3. Model construction: the model is built with a framework such as PyTorch, following the method proposed in the RoBERTa paper. This specifically includes: 1) Data preparation: the data sets for model training and evaluation are prepared first. A data set should include the input text and the corresponding labels or targets. An appropriate data set may be selected according to the specific task, such as text classification or named entity recognition. 2) Model selection: an appropriate model architecture is selected according to the requirements of the task and the characteristics of the data set. Here, following the method proposed in the RoBERTa paper, the RoBERTa model may be selected as the base model. 3) Model configuration: the model is configured according to the requirements of the task and the specific experimental design. This includes setting the model's hyperparameters, optimizer, loss function, and so on; settings such as the learning rate, batch size and hidden layer size can be adjusted as required. 4) Model building: the structure of the model is built using a framework such as PyTorch. This includes defining the layers, connections and parameters of the model. A neural network with a Transformer structure, comprising multiple encoder layers, is built as described in the RoBERTa paper.

4. Model training: the model is trained using the prepared data set and accelerated using multiple GPUs. Here, training is performed on the People's Daily 2014 corpus to implement Chinese word segmentation, part-of-speech tagging and named entity recognition. This specifically includes: 1) Model training: the model is trained using the prepared training data. This includes inputting data into the model, computing the loss through forward and backward propagation, updating the model parameters with the optimizer, setting the number of training epochs and the batch size, and monitoring performance metrics during training. 2) Model evaluation: the trained model is evaluated using a separate validation data set. This includes calculating the accuracy, precision and recall of the model on the validation set, as well as other evaluation metrics required by the specific task. 3) Model tuning: the model is tuned according to the evaluation results. Hyperparameters, the network structure and so on can be adjusted to improve the performance and generalization ability of the model. 4) Multi-GPU acceleration: if computing resources allow, multiple GPUs may be used to speed up model training. The parameters and computations of the model are distributed across multiple GPUs for parallel processing using the parallel computing functionality provided by the framework. 5) Model saving and loading: the trained model is saved to disk for use in subsequent prediction and applications. The saved model may also be loaded for further training or use.
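The model evaluation item above mentions precision and recall. For named entity recognition these are commonly computed over whole entity spans rather than single tokens. The following is a minimal sketch under that assumption; the BIO tagging scheme, the helper names and the sample tag sequences are illustrative and are not specified by this embodiment.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    An "I-" tag that does not continue an open entity of the same type is
    ignored, which is one of several possible conventions.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 for one tagged sequence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold_tags, pred_tags)
```

In this example the prediction finds the person entity but misses the location, so precision is 1.0 and recall is 0.5. A span counts as correct only if its boundaries and type both match, which is stricter than token-level accuracy.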

Step S5033, based on the target entity recognition model, entity recognition is performed on the text document to obtain a catalog and entity objects in the PDF document.

In some optional implementations, after the step S5033, an embodiment of the present invention further includes:

Step c1: a plurality of specific tasks are set.

Step c2: the training corpus is screened based on the specific tasks to obtain a task corpus.

Step c3: the target entity recognition model is fine-tuned based on the task corpus to obtain a fine-tuned target entity recognition model.
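The embodiment does not spell out how the training corpus is screened in step c2. One plausible reading, shown below purely as an assumption, is to keep only the sentences that contain at least one entity type the task cares about and to relabel all other entity tags as "O". The task names, entity types and sample sentences are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def entity_type(tag: str) -> str:
    """Return the entity type of a BIO tag, or an empty string for other tags."""
    return tag[2:] if tag[:2] in ("B-", "I-") else ""


def screen_corpus(corpus: List[Sentence], wanted: Set[str]) -> List[Sentence]:
    """Keep sentences with a wanted entity type; relabel other entities as "O"."""
    task_corpus = []
    for sentence in corpus:
        if any(entity_type(tag) in wanted for _, tag in sentence):
            task_corpus.append(
                [(tok, tag if entity_type(tag) in wanted else "O") for tok, tag in sentence]
            )
    return task_corpus


# Step c1: hypothetical tasks, each defined by the entity types it targets.
tasks: Dict[str, Set[str]] = {
    "organization_linking": {"ORG"},
    "location_linking": {"LOC"},
}

training_corpus: List[Sentence] = [
    [("Adobe", "B-ORG"), ("in", "O"), ("Zhuhai", "B-LOC")],
    [("a", "O"), ("reader", "O")],
]

# Step c2: one task corpus per task.
task_corpora = {name: screen_corpus(training_corpus, types) for name, types in tasks.items()}
```

Each resulting task corpus can then be used as the fine-tuning data of step c3.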

In one example, fine-tuning the target entity recognition model includes fine-tuning the pre-trained model on a specific task. The loss is checked against expectations: if it meets them, the weights are saved and the model is taken as the final model; otherwise, the model parameters are adjusted and the training process is repeated. Specifically, this includes: 1) Parameter freezing: some or all of the parameters of the pre-trained model are frozen as needed. If the task is small in scale, or is well served by the feature extraction capability of the pre-trained model, the parameters of the pre-trained model may be frozen so that only the custom layers for the specific task are trained. 2) Loss function definition: a loss function suited to the specific task is defined. Depending on the task type and the label type, an appropriate loss function is selected to measure the difference between the model output and the true labels. Common loss functions include cross-entropy loss and mean squared error loss. 3) Model training: the pre-trained model is fine-tuned using the task-specific data set. The data is input into the model, the loss is computed through forward and backward propagation, and the model parameters are updated by the optimizer. The number of training epochs and the batch size can be set, and performance metrics can be monitored during training. 4) Model evaluation: the fine-tuned model is evaluated using a separate validation data set. The accuracy, precision and recall of the model on the validation set are calculated, together with other evaluation metrics required by the specific task, and the model is tuned according to the evaluation results. 5) Model saving and loading: the trained model is saved to disk for use in subsequent prediction and applications. The saved model may also be loaded for further training or use.
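The combination of parameter freezing with the "train until the loss meets expectations" check can be sketched as a small control loop. The sketch uses plain gradient descent on a two-parameter toy objective instead of a neural network; the parameter names, the target loss and the learning rate are assumptions chosen only to make the example run.

```python
from typing import Callable, Dict, Set, Tuple

Params = Dict[str, float]


def fine_tune(
    params: Params,
    frozen: Set[str],
    grad_fn: Callable[[Params], Params],
    loss_fn: Callable[[Params], float],
    target_loss: float,
    lr: float = 0.1,
    max_rounds: int = 1000,
) -> Tuple[Params, int]:
    """Update only the non-frozen parameters until the loss meets the target."""
    params = dict(params)  # leave the caller's pre-trained weights untouched
    for round_no in range(1, max_rounds + 1):
        grads = grad_fn(params)
        for name, grad in grads.items():
            if name not in frozen:  # frozen parameters keep their pre-trained values
                params[name] -= lr * grad
        if loss_fn(params) <= target_loss:
            return params, round_no  # loss meets expectations: keep these weights
    raise RuntimeError("target loss not reached; adjust the settings and retrain")


# Toy objective: the "encoder" weight is pre-trained, the "head" weight is task-specific.
def toy_loss(p: Params) -> float:
    return (p["encoder.w"] + p["head.w"] - 5.0) ** 2


def toy_grads(p: Params) -> Params:
    g = 2.0 * (p["encoder.w"] + p["head.w"] - 5.0)
    return {"encoder.w": g, "head.w": g}


pretrained = {"encoder.w": 2.0, "head.w": 0.0}
tuned, rounds_used = fine_tune(pretrained, {"encoder.w"}, toy_grads, toy_loss, target_loss=1e-6)
```

Because `encoder.w` is frozen, only `head.w` moves, and it settles near 3.0 so that the two weights sum to the target. If the loop exhausts its rounds it raises an error, mirroring the "adjust the parameters and repeat" branch described above.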

Step S504: based on the catalog and the entity objects, hyperlinks pointing to the pages corresponding to the catalog and the entity objects are created for them. For details, refer to step S104 of the embodiment shown in FIG. 1; they are not repeated here.
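As a rough illustration of step S504, the sketch below turns recognized catalog entries and entity objects into HTML hyperlinks that point at page anchors. The `Entry` structure, the `#page-N` anchor scheme and the CSS class names are assumptions of this sketch and are not prescribed by the embodiment.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Entry:
    text: str  # catalog heading or entity mention
    kind: str  # "catalog" or "entity"
    page: int  # 1-based page on which the entry appears


def make_hyperlink(entry: Entry) -> str:
    """Build an anchor element that jumps to the page holding the entry."""
    return '<a class="%s" href="#page-%d">%s</a>' % (
        escape(entry.kind, quote=True),
        entry.page,
        escape(entry.text),
    )


def make_catalog(entries: List[Entry]) -> str:
    """Render the catalog entries as a list of page links."""
    items = [
        "  <li>%s</li>" % make_hyperlink(entry)
        for entry in entries
        if entry.kind == "catalog"
    ]
    return "<ul>\n%s\n</ul>" % "\n".join(items)


entries = [
    Entry("1. Background", "catalog", 2),
    Entry("Research & Development", "entity", 5),
]
links = [make_hyperlink(entry) for entry in entries]
catalog_html = make_catalog(entries)
```

Escaping the entry text matters because text extracted from a PDF may contain characters such as `&` or `<` that would otherwise break the generated HTML.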

Step S505: identifiers are set for the hyperlinks and the text document, and the hyperlinks and the text document carrying the identifiers are stored on the server, so that the server, in response to the identifier corresponding to the user's document operation object, queries the corresponding text document or hyperlink and displays it to the user. For details, refer to step S105 of the embodiment shown in FIG. 1; they are not repeated here.
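The store-and-query behaviour of step S505 can be pictured with a small in-memory stand-in for the server. How identifiers are generated is not fixed by the embodiment; deriving them from a SHA-256 hash of the content, as below, is an assumption, as are the class and method names.

```python
import hashlib
from typing import Dict, Optional


def make_identifier(content: str) -> str:
    """Derive a short, stable identifier from the content itself."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]


class DocumentStore:
    """In-memory stand-in for the server-side storage of documents and hyperlinks."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def put(self, content: str) -> str:
        """Store a text document or hyperlink and return its identifier."""
        identifier = make_identifier(content)
        self._items[identifier] = content
        return identifier

    def get(self, identifier: str) -> Optional[str]:
        """Look up the stored item for an identifier; None if it is unknown."""
        return self._items.get(identifier)


store = DocumentStore()
doc_id = store.put("Full text of the converted document ...")
link_id = store.put('<a href="#page-2">1. Background</a>')
shown_to_user = store.get(doc_id)
```

A content-derived identifier gives identical content the same key, which avoids storing duplicates. A production system would more likely use a database and would need to handle updates to annotated documents.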

According to the PDF online reader design method provided by this embodiment, the initial entity recognition model is trained to obtain a target entity recognition model that meets the requirements. The target entity recognition model performs entity recognition on the text document converted from the original PDF document to obtain the catalog and the entity objects, and hyperlinks are set on the catalog and the entity objects, so that the user can quickly navigate to the related content through the generated hyperlinks.

This embodiment also provides a PDF online reader design apparatus, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

The present embodiment provides a PDF online reader design apparatus, as shown in fig. 7, including:

a document acquisition module 701, configured to acquire a PDF document;

the content analysis module 702 is configured to perform content analysis on the PDF document, extract text content of the PDF document, and store the text content in the text document;

an entity recognition module 703, configured to perform entity recognition on the text document to obtain the catalog and the entity objects in the PDF document;

a hyperlink creation module 704, configured to create, based on the catalog and the entity objects, hyperlinks pointing to the pages corresponding to the catalog and the entity objects;

the storage display module 705 is configured to set identifiers for hyperlinks and text documents, and store the hyperlinks and text documents with the identifiers to the server, so that the server queries the corresponding text document or hyperlink to display to the user in response to the identifier corresponding to the document operation object of the user.

In some alternative implementations, the content parsing module 702 includes:

a picture extraction unit, configured to extract the formulas, icons and pictures in the PDF document and convert them into standard-format pictures;

a font extraction unit, configured to extract the metadata in the PDF document and the fonts corresponding to the text content, and convert the fonts into Internet fonts;

a font conversion unit, configured to convert the text content into Internet text content based on the Internet fonts;

an HTML document conversion unit, configured to store the Internet text content and the standard-format pictures in an HTML document to obtain the HTML document corresponding to the PDF document.
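The work of the HTML document conversion unit can be pictured with the following sketch, which assembles extracted text, a converted web font and standard-format pictures into one HTML page. It is only an illustration: the function name, the use of a CSS `@font-face` rule with a WOFF2 file, and the sample paths are assumptions, and PDF parsing and font conversion themselves are outside its scope.

```python
from html import escape
from typing import List


def build_html(title: str, paragraphs: List[str], font_family: str,
               font_url: str, image_urls: List[str]) -> str:
    """Assemble an HTML page from extracted text, a web font and pictures."""
    css = (
        "@font-face { font-family: '%s'; src: url('%s') format('woff2'); }\n"
        "body { font-family: '%s', serif; }" % (font_family, font_url, font_family)
    )
    text_html = "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
    image_html = "\n".join(
        '<img src="%s" alt="">' % escape(url, quote=True) for url in image_urls
    )
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        "<title>%s</title>" % escape(title),
        "<style>",
        css,
        "</style>",
        "</head>",
        "<body>",
        text_html,
        image_html,
        "</body>",
        "</html>",
    ])


page = build_html(
    title="Annual report",
    paragraphs=["Net value a < b in 2022.", "See the chart below."],
    font_family="DocFont",               # hypothetical converted font name
    font_url="fonts/docfont.woff2",      # hypothetical path on the server
    image_urls=["images/page1-figure1.png"],
)
```

Keeping the original typeface as a web font and embedding non-text content as pictures is what lets the HTML rendering stay close to the layout of the source PDF.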

In some alternative embodiments, the content parsing module 702 further includes:

a document operation identifier setting unit, configured to set a document operation identifier corresponding to a document operation, the document operation including marking and/or annotating the HTML document;

a document annotation unit, configured to combine the document operation identifier with the HTML document to obtain an annotated HTML document;

a document storing unit, configured to store the annotated HTML document on the server based on the text content.

In some alternative embodiments, the document deposit unit includes:

a document identifier generating subunit, configured to generate, based on the text content, a document identifier corresponding to the annotated HTML document;

a document storing subunit, configured to store the annotated HTML document and the document identifier on the server, so that the server, in response to the document identifier corresponding to the user's document operation object, queries the corresponding annotated HTML document and displays it to the user.

In some alternative embodiments, the entity recognition module 703 includes:

an initial model acquisition unit, configured to acquire an initial entity recognition model and a corpus;

a model training unit, configured to train the initial entity recognition model based on the corpus to obtain a target entity recognition model;

an entity recognition unit, configured to perform entity recognition on the text document based on the target entity recognition model to obtain the catalog and the entity objects in the PDF document.

In some alternative embodiments, the model training unit comprises:

a corpus preprocessing subunit, configured to preprocess the corpus to obtain a training corpus;

a loss calculation subunit, configured to input the training corpus into the initial entity recognition model to obtain an initial recognition result, and to calculate the loss between the training corpus and the initial recognition result;

a model training subunit, configured to train the initial entity recognition model based on the loss to obtain a target entity recognition model.

In some alternative embodiments, the model training unit further comprises:

a task setting subunit, configured to set a plurality of specific tasks;

a task corpus screening subunit, configured to screen the training corpus based on the specific tasks to obtain a task corpus;

a model fine-tuning subunit, configured to fine-tune the target entity recognition model based on the task corpus to obtain a fine-tuned target entity recognition model.

Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.

The PDF online reader design apparatus in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functions.

An embodiment of the present invention also provides a computer device provided with the PDF online reader design apparatus shown in FIG. 7.

Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in FIG. 8, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is taken as an example in FIG. 8.

The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.

The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the methods shown in the above embodiments.

The memory 20 may include a program storage area and a data storage area, where the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created through use of the computer device, and so on. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.

The computer device further includes an input device 30 and an output device 40. The processor 10, the memory 20, the input device 30 and the output device 40 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 8.

The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 40 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays and plasma displays. In some alternative implementations, the display device may be a touch screen.

An embodiment of the present invention also provides a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network and stored on a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid-state disk, or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. A PDF online reader design method, the method comprising:

acquiring a PDF document;

performing content analysis on the PDF document, extracting to obtain text content of the PDF document, and storing the text content into a text document;

performing entity recognition on the text document to obtain a catalog and an entity object in the PDF document;

and setting identifiers for the hyperlinks and the text documents, and storing the hyperlinks and the text documents with the identifiers to a server so that the server can inquire the corresponding text documents or hyperlinks for display to a user in response to the identifiers corresponding to the document operation objects of the user.

2. The method of claim 1, wherein after said parsing the PDF document for content, the method further comprises:

extracting metadata in the PDF document and fonts corresponding to the text content, and converting the fonts into Internet fonts;

and storing the internet text content and the standard format picture into an HTML document to obtain an HTML document corresponding to the PDF document.

3. The method of claim 2, wherein after storing the internet text content and the picture in an HTML document to obtain an HTML document corresponding to the PDF document, the method further comprises:

and storing the marked HTML document into a server based on the text content.

4. The method of claim 3, wherein storing the annotated HTML document to a server based on the textual content comprises:

5. The method of claim 1, wherein the performing entity recognition on the text document to obtain the directory and the entity object in the PDF document comprises:

acquiring an initial entity recognition model and a corpus;

6. The method of claim 5, wherein training the initial entity recognition model based on the corpus to obtain a target entity recognition model comprises:

preprocessing the corpus to obtain training corpus;

inputting the training corpus into the initial entity recognition model to obtain an initial recognition result, and calculating to obtain loss between the training corpus and the initial recognition result;

and based on the loss, analyzing and training the initial entity recognition model to obtain a target entity recognition model.

7. The method of claim 6, wherein after parsing training the initial entity recognition model based on the loss to obtain a target entity recognition model, the method further comprises:

Setting a plurality of specific tasks;

screening the training corpus based on the specific task to obtain task corpus;

8. A PDF online reader design apparatus, the apparatus comprising:

the document acquisition module is used for acquiring the PDF document;

the content analysis module is used for carrying out content analysis on the PDF document, extracting the text content of the PDF document, and storing the text content into a text document;

the entity identification module is used for carrying out entity identification on the text document to obtain a catalog and an entity object in the PDF document;

the hyperlink creation module is used for creating hyperlinks pointing to pages corresponding to the catalogue and the entity object for the catalogue and the entity object based on the catalogue and the entity object;

and the storage display module is used for setting identifiers for the hyperlinks and the text documents, and storing the hyperlinks and the text documents with the identifiers to a server so that the server can respond to the identifiers corresponding to the document operation objects of the users to inquire the corresponding text documents or hyperlinks and display the text documents or hyperlinks to the users.

9. A computer device, comprising:

a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the PDF online reader design method of any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the PDF online reader design method of any one of claims 1 to 7.