CN110598189A - Document processing method, device, equipment and readable storage medium

Info

Publication number: CN110598189A
Application number: CN201910748454.XA
Authority: CN
Inventors: 邱泽斌
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2019-12-20

Abstract

The invention relates to the technical field of data processing and discloses a document processing method comprising the following steps: when a document viewing request sent by a client is received, determining a target document according to the document viewing request and the documents stored in a database; parsing the target document to obtain text elements; performing semantic processing on the text elements to obtain text with HTML tags; and returning the text with HTML tags to the client in the form of a page, so that the client can load and display the page content. The invention also discloses a device, equipment and a computer-readable storage medium. The method converts the document into text with HTML tags and returns it to the client as a page for user-friendly display. Because the page opens quickly, the method both meets users' high expectations when viewing documents and reduces the error rate of document conversion.

Description

Document processing method, device, equipment and readable storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for processing a document.

Background

With the rapid development of computer and network technologies, electronic documents have become an essential part of people's daily life and work. Electronic document formats include Word, HTML, EPUB and PDF. The PDF format reproduces the reading experience of a printed book and cannot be altered by the user at will, advantages that other electronic document formats cannot match.

Currently, most important company documents, such as contracts, clauses and insurance policies, are circulated and archived on the internet in PDF format. However, when a user views such a file on a client, it can be displayed only after it has been downloaded successfully, which consumes a large amount of network resources, opens slowly, and fails to meet users' expectations. The HTML format is comparatively simple, can be read by a wide range of applications (apps), opens quickly and is easy to operate, but its interface is unfriendly and inconvenient to read. Moreover, prior-art methods for converting PDF to HTML perform poorly, and the generated HTML often cannot be viewed because of garbled characters.

Disclosure of Invention

The invention mainly aims to provide a document processing method, a document processing device and a readable storage medium, and aims to solve the technical problem of how to convert a document into an HTML format and perform friendly display.

In order to achieve the above object, the present invention provides a document processing method, including:

when a document viewing request sent by a client is received, determining a target document according to the document viewing request and a document stored in a database;

analyzing the target document to obtain a text element;

performing semantic processing on the text elements to obtain a text with an HTML tag;

and returning the text with the HTML tag to the client in a page form so as to load and display page content by the client.

Optionally, determining the target document according to the document viewing request and the documents stored in the database, when the document viewing request sent by the client is received, includes:

when a document viewing request sent by a client is received, responding to the document viewing request, and extracting a document number and a document name from the document viewing request;

using the document number and the document name as keywords, and searching documents matched with the keywords in the documents stored in a database;

and if the documents stored in the database are found to have the documents matched with the keywords, determining the documents matched with the keywords as target documents.

Optionally, the parsing the target document to obtain a text element includes:

and analyzing the target document to obtain analyzed text elements, wherein the text elements at least comprise character strings, pages and lines, and the document format of the target document is a plain text PDF format.

Optionally, the performing semantic processing on the text element to obtain a text with an HTML tag includes:

performing semantic recognition on the text elements to obtain a plurality of sections of texts and the association relation among the sections of texts;

acquiring HTML tags corresponding to all sections of texts according to preset rules;

and determining the sequence of each section of text according to the association relation, and wrapping the corresponding text by using the HTML tag in sequence to obtain the text with the HTML tag.

Optionally, the returning the text with the HTML tag to the client in a page form for the client to load and display page content includes:

reading the text with the HTML tag by using an HTML rendering engine, and converting the text into CSS codes or HTML codes;

encapsulating the converted CSS code or HTML code into an H5 page, and returning the H5 page to the client for the client to load and display the H5 page content.

Optionally, after the returning the H5 page to the client, the method further includes:

monitoring the H5 page to determine whether the loading and display of the H5 page by the client is abnormal;

and if the loading and display of the H5 page by the client is abnormal, sending abnormality information to the client so that, according to the abnormality information, the user can choose either to resend the document viewing request or to request to download the target document.

Optionally, after performing semantic processing on the text element to obtain a text with an HTML tag, the method further includes:

setting the style of the text with the HTML tag in either of the following ways: according to the style of the text in the target document; or according to a developer's custom settings; wherein the style comprises at least one of font type, font size and font color.

Further, to achieve the above object, the present invention also provides a document processing apparatus including:

the receiving module is used for determining a target document according to the document viewing request and the document stored in the database when receiving the document viewing request sent by the client;

the analysis module is used for analyzing the target document to obtain a text element;

the semantic module is used for performing semantic processing on the text elements to obtain a text with an HTML tag;

and the return module is used for returning the text with the HTML tag to the client in a page form so as to load and display page content for the client.

Optionally, the receiving module is specifically configured to:

Optionally, the analysis module is specifically configured to:

Optionally, the semantic module is specifically configured to:

Optionally, the return module is specifically configured to:

Optionally, the document processing apparatus further includes a monitoring module, configured to:

Optionally, the document processing apparatus further includes a setting module, configured to set the style of the text with the HTML tag in either of the following ways: according to the style of the text in the target document; or according to a developer's custom settings; wherein the style comprises at least one of font type, font size and font color.

Further, to achieve the above object, the present invention also provides a document processing apparatus including: a memory, a processor and a document processing program stored on the memory and executable on the processor, the document processing program when executed by the processor implementing the steps of the document processing method as claimed in any one of the above.

Further, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a document processing program which, when executed by a processor, implements the steps of the document processing method as described in any one of the above.

According to the invention, when a document viewing request sent by a client is received, a target document is determined according to the document viewing request and the documents stored in a database; the target document is then parsed to obtain text elements, and the text elements are semantically processed to obtain text with HTML tags. This reduces the error rate of document conversion and improves the efficiency of converting documents into HTML format. The text with HTML tags is returned to the client as a page for user-friendly display, the page opens quickly on the client, and users' high expectations for viewing documents are met.

Drawings

FIG. 1 is a schematic diagram of an operating environment of a document processing device according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a document processing method according to an embodiment of the present invention;

FIG. 3 is a schematic view of a detailed flow chart of the step S10 in FIG. 2;

FIG. 4 is a functional block diagram of an embodiment of a document processing apparatus according to the invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a document processing device operating environment according to an embodiment of the present invention.

As shown in fig. 1, the document processing apparatus is a server, and may include: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the hardware configuration of the document processing device shown in FIG. 1 does not constitute a limitation of the document processing device, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 1, the memory 1005, which is a kind of computer-readable storage medium, may contain an operating system, a network communication module, a user interface module, and a document processing program. The operating system is a program that manages and controls the hardware and software resources of the document processing device and supports the operation of the document processing program as well as other software and/or programs.

In the hardware configuration of the document processing apparatus shown in fig. 1, the network interface 1004 is mainly used for accessing a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And the processor 1001 may be configured to invoke a document processing program stored in the memory 1005 and perform the steps of the following embodiments of the document processing method.

Based on the above-mentioned hardware structure of the document processing device, various embodiments of the document processing method of the present invention are proposed.

Referring to fig. 2, fig. 2 is a flowchart illustrating a document processing method according to an embodiment of the present invention.

In this embodiment, the document processing method includes:

step S10, when receiving the document viewing request sent by the client, determining the target document according to the document viewing request and the document stored in the database;

In this embodiment, since different enterprises, merchants and platforms circulate many types of documents over a network, an insurance company is taken as an example to make the solution easier to understand. The insurance company provides online insurance services covering life insurance, property insurance, automobile insurance, endowment insurance and other insurance products. Insurance is a commercial arrangement in which the policyholder pays a premium to the insurer as agreed in the contract, and the insurer assumes liability to pay compensation for property loss caused by an event covered by the contract, or to pay insurance benefits when the insured dies, is disabled, falls ill, or reaches the age or term agreed in the contract. The most critical part of the online insurance business is archiving and managing important insurance-related documents, such as insurance contracts, insurance clauses and insurance policies, in PDF format. An insurance contract is a standard-form contract: in general its terms are drawn up unilaterally by the insurance company, and its content is complex and highly specialized. The insurance policy, which sets out the rights and obligations of the insurer and the insured as well as other insurance matters, has the insurance clauses printed on it.

In this embodiment, when a user consults an insurance policy, an insurance contract or insurance clauses on a client, the client first sends a document viewing request to the document processing device (i.e. the server) according to the user's input, then receives the content the server returns in response to the request, and finally displays it so that the user can browse it online on the client. It is understood that each insurance contract, set of insurance clauses and insurance policy has its own corresponding electronic document, and the contents of these documents may be the same or different. Therefore, on receiving a document viewing request from a client, the document processing device determines the document the user wants to view (i.e. the target document) according to the request and the documents recorded and stored in the database.

Step S20, analyzing the target document to obtain text elements;

In this embodiment, documents are stored in the database, preferably in plain-text PDF format; that is, the documents stored in the database are all PDF documents whose content is plain text without pictures, such as plain-text insurance clauses and contract terms. If a PDF document were returned to the client directly, the client could open it only after downloading it successfully and might also need matching PDF reader software installed, giving a poor user experience. The PDF document is therefore converted into an HTML document before being returned to the user for viewing. Parsing the target document means parsing the plain-text PDF electronic document that the user has requested to view, to obtain text elements. A text element is data that describes and stores textual information such as character strings, pages and lines, as opposed to images and sound.
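The notion of a text element can be illustrated with a small data structure. The sketch below assumes that the text of each page has already been extracted from the PDF by some parser (not shown); it only records the page number, line number and character string of each non-empty line. The names `TextElement` and `to_text_elements` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextElement:
    page: int   # 1-based page number
    line: int   # 1-based line number within the page
    text: str   # the character string found on that line


def to_text_elements(pages: List[str]) -> List[TextElement]:
    """Turn already-extracted page texts into a flat list of text elements."""
    elements: List[TextElement] = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, raw in enumerate(page_text.splitlines(), start=1):
            text = raw.strip()
            if text:  # blank lines carry no text, so they are skipped
                elements.append(TextElement(page_no, line_no, text))
    return elements
```

Keeping the page and line of each string is what later allows the order of, and relations between, sections of text to be worked out.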

Step S30, carrying out semantic processing on the text elements to obtain a text with an HTML label;

In this embodiment, semantic processing of the text elements is the process of recognizing the text elements according to preset rules and adding HTML tags to specific text. For example, if the text "Ping An Property and Casualty Insurance Company of China, Ltd." is recognized, an h1 tag is added to it, giving <h1>Ping An Property and Casualty Insurance Company of China, Ltd.</h1>; or, if the text "the insurance contract consists of the insurance clauses, the insurance policy, the insurance certificate, endorsements, and the like" is recognized under the general rules, an h2 tag is added to it, giving <h2>the insurance contract consists of the insurance clauses, the insurance policy, the insurance certificate, endorsements, and the like</h2>. In this way the text in the PDF is converted into text with HTML tags.

Further, after step S30, the method further includes setting the style of the text with HTML tags in either of the following ways: according to the style of the text in the target document, or according to a developer's custom settings. The style includes at least one of font type, font size and font color. Setting according to the style of the text in the target document means, for example, that if the text "Ping An Property and Casualty Insurance Company of China, Ltd." in the PDF electronic document is set in Song typeface, size 5, black, then the style of the corresponding HTML-tagged text is automatically set to Song typeface, size 5, black. Alternatively, the style is set according to the developer's custom style. It should be added that the HTML tags are tags set by the developer for converting the documents stored in the database into HTML format; many different HTML tags may be used, including <h1>, <h2>, <span>, <br>, and the like, set according to the actual situation.
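One way to attach a style to tagged text is an inline `style` attribute. The following is a minimal sketch under that assumption; the function name `styled` and the use of inline CSS are illustrative choices, not something the description prescribes.

```python
from html import escape
from typing import Dict


def styled(tag: str, text: str, style: Dict[str, str]) -> str:
    """Wrap text in a tag carrying an inline style built from CSS properties."""
    css = "; ".join(f"{prop}: {value}" for prop, value in style.items())
    # Escape both the attribute value and the text so the output stays valid HTML.
    return f'<{tag} style="{escape(css, quote=True)}">{escape(text)}</{tag}>'
```

The `style` mapping could be filled either from font information read out of the source document or from values chosen by a developer, for example `styled("h1", "Title", {"font-family": "SimSun", "color": "#000"})`.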

And step S40, returning the text with the HTML tag to the client in a page form, so that the client can load and display the page content.

In this embodiment, in a network environment, a web page is a plain text file containing HTML tags. The text is the most important information carrier and communication tool on the webpage, and the main information in the webpage is mainly in the form of text. Therefore, the converted text with the HTML tag of the target document is returned to the client in a page form, the client can load and display the page content, the page opening speed is high, and the interface display is friendly.

In this embodiment, when a document viewing request sent by a client is received, a target document is determined according to the document viewing request and the documents stored in a database; the target document is then parsed to obtain text elements, and the text elements are semantically processed to obtain text with HTML tags. This reduces the error rate of document conversion and improves the quality of the conversion into HTML format. The text with HTML tags is returned to the client as a page for user-friendly display, the page opens quickly on the client, and users' high expectations for viewing documents are met.
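The four steps S10 to S40 form a simple pipeline. The sketch below shows only the control flow; each stage is passed in as a function, and the name `handle_view_request` and all parameter names are assumptions made for illustration.

```python
from typing import Callable, List, Optional, TypeVar

D = TypeVar("D")  # document type
E = TypeVar("E")  # text element type


def handle_view_request(
    request: dict,
    find_document: Callable[[dict], Optional[D]],
    parse: Callable[[D], List[E]],
    semanticize: Callable[[List[E]], str],
    to_page: Callable[[str], str],
) -> Optional[str]:
    """Run steps S10 to S40 in order; return None if no target document is found."""
    document = find_document(request)   # S10: determine the target document
    if document is None:
        return None
    elements = parse(document)          # S20: obtain text elements
    tagged = semanticize(elements)      # S30: obtain text with HTML tags
    return to_page(tagged)              # S40: page to be returned to the client
```

Passing the stages as parameters keeps the sketch independent of any particular database, PDF parser or template engine.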

Referring to fig. 3, fig. 3 is a schematic view of a detailed flow of the step S10 in fig. 2.

Based on the above embodiment, in this embodiment, in step S10, when receiving the document viewing request sent by the client, determining the target document according to the document viewing request and the document saved in the database includes:

step S11, when receiving the document viewing request sent by the client, responding to the document viewing request and extracting the document number and the document name from the document viewing request;

In this embodiment, the document viewing request sent by the client carries the document number and the document name of the document designated for viewing. On receiving the request, the device responds to it by extracting the document number and the document name from the request, and then determines the target document the user wants to view according to them.

Step S12, using the document number and the document name as key words, searching the document stored in the database for the document matching with the key words;

In this embodiment, the database stores thousands of documents, and the document number and document name are used as keywords to search them. There are two possible search results: either a document matching the keywords is found, or no matching document is found. If no document matches the keywords, the document may have been deleted or the user may have specified the document incorrectly, and no document can be provided to the user for viewing.

In step S13, if a document matching the keyword is found in the documents stored in the database, the document matching the keyword is determined to be the target document.

In this embodiment, finding a document that matches the keywords among the documents stored in the database indicates that the matching document is the one the user wants to view. The matching document is therefore determined to be the target document and processed further so that the user can view it conveniently, improving the user's experience of viewing documents.
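Steps S11 to S13 can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary stands in for the database and the request is a plain mapping with the hypothetical keys `doc_number` and `doc_name`; the record fields and sample data are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class StoredDocument:
    """A document record; the field names are illustrative assumptions."""
    number: str
    name: str
    path: str


# Hypothetical stand-in for the database, keyed by (document number, document name).
DATABASE: Dict[Tuple[str, str], StoredDocument] = {
    ("P-001", "Personal accident clauses"): StoredDocument(
        "P-001", "Personal accident clauses", "/docs/p-001.pdf"
    ),
}


def find_target_document(request: Mapping[str, str]) -> Optional[StoredDocument]:
    """S11: extract number and name; S12: search; S13: return the match, if any."""
    number = request.get("doc_number", "").strip()
    name = request.get("doc_name", "").strip()
    if not number or not name:
        return None  # the request does not identify a document
    return DATABASE.get((number, name))
```

A return value of `None` corresponds to the case where no document matches the keywords, for instance because the document was deleted or the user specified it incorrectly.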

Based on the foregoing embodiment, in this embodiment, in step S30, performing semantic processing on the text element to obtain a text with an HTML tag, including:

step S31, semantic recognition is carried out on the text elements to obtain a plurality of sections of texts and the incidence relation among the sections of texts;

In this embodiment, semantic recognition means recognizing the Chinese characters, letters, digits and special characters in the text elements; the special characters include spaces, carriage returns and page breaks, as well as symbols such as !, ., #, ￥ and %.
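The character classes named above can be told apart with a few range and type checks. The sketch below is one possible way to do this, assuming that "Chinese characters" means the basic CJK Unified Ideographs block and that letters and digits are ASCII; both are simplifying assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple


def classify_char(ch: str) -> str:
    """Classify one character as 'chinese', 'letter', 'digit' or 'special'."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (basic block only)
        return "chinese"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "special"  # spaces, line breaks, punctuation, currency signs, ...


def classify_text(text: str) -> List[Tuple[str, str]]:
    """Pair every character of the text with its class."""
    return [(ch, classify_char(ch)) for ch in text]
```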

Step S32, acquiring HTML tags corresponding to all sections of texts according to preset rules;

In this embodiment, the HTML tags include <h1>, <h2>, <span>, <br>, and the like, set according to the actual situation. The preset rules match each section of text to a tag, from which the HTML tag corresponding to each section is obtained. For example, the <h1> tag is applied to the text between the first section and the general rules; the <h2> tag is applied to headings such as the general rules, the supplementary rules, the insured subject matter and the coverage; the text of page one uses the <h3> tag; and a line break is added after text ending with a colon, a semicolon or a period, with the <br> tag applied to text that breaks onto a new line within a page. The rules are set according to the actual situation.
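Preset rules of this kind can be represented as an ordered table of patterns and tags. The sketch below assumes regular expressions are used for matching; the patterns, the English section titles and the default `span` tag are invented for illustration and are not the rules of the patent.

```python
import re
from typing import List, Tuple

# (pattern, tag) pairs checked in order; the patterns are illustrative only.
PRESET_RULES: List[Tuple["re.Pattern[str]", str]] = [
    (re.compile(r"^(General rules|Supplementary rules)$"), "h2"),
    (re.compile(r"^Article \d+\b"), "h3"),
    (re.compile(r".*(Company|Ltd\.?)$"), "h1"),
]


def tag_for(section: str, default: str = "span") -> str:
    """Return the HTML tag name of the first rule whose pattern matches."""
    for pattern, tag in PRESET_RULES:
        if pattern.match(section):
            return tag
    return default
```

Because the rules are data rather than code, they can be adjusted per document type without touching the matching logic.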

And step S33, determining the sequence of each section of text according to the association relationship, and wrapping the corresponding text with the HTML label in sequence to obtain the text with the HTML label.

In this embodiment, the order of the sections of text is determined according to the association relationship. For example, the first section, "Ping An Property and Casualty Insurance Company of China, Ltd.", and the second section, "Ping An Personal Accident Injury Insurance Clauses", are adjacent (upper and lower) paragraphs; the second section and the third section, "Filing number XXX", are adjacent paragraphs; and likewise the third section and the fourth section, "General rules", and the fourth section and the fifth section, "Article 1 This insurance contract consists of the insurance clauses, the insurance policy, the insurance certificate, endorsements, and the like. All agreements relating to this insurance contract shall be in written form. Article 2, etc.", are adjacent paragraphs. Within the fifth section, "This insurance contract consists of the insurance clauses, the insurance policy, the insurance certificate, endorsements, and the like." and "All agreements relating to this insurance contract shall be in written form." are adjacent sentences, and "Article 1" and "This insurance contract consists of the insurance clauses" are adjacent phrases. The order of the text can thus be determined.

In this embodiment, according to the determined order of the sections of text, each section is wrapped in its corresponding HTML tag in sequence to obtain the text with HTML tags. For example: <h1>Ping An Property and Casualty Insurance Company of China, Ltd.</h1><h2>Ping An Personal Accident Injury Insurance Clauses</h2><h2>General rules</h2><h3>Article 1 This insurance contract consists of the insurance clauses, the insurance policy, the insurance certificate, endorsements, and the like. All agreements relating to this insurance contract shall be in written form. Article 2, etc.</h3>, and so on, as the case may be.
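The ordering and wrapping of step S33 can be sketched as follows, assuming the association relation is available as a list of (upper, lower) pairs of section identifiers that together form a single chain without cycles. The identifiers and function names are hypothetical.

```python
from html import escape
from typing import Dict, List, Tuple


def order_sections(pairs: List[Tuple[str, str]]) -> List[str]:
    """Order section ids from (upper, lower) adjacency pairs forming one chain."""
    next_of = dict(pairs)
    lowers = set(next_of.values())
    # The first section is the only one that never appears as a "lower" section.
    current = next(s for s in next_of if s not in lowers)
    ordered = [current]
    while current in next_of:
        current = next_of[current]
        ordered.append(current)
    return ordered


def wrap_sections(texts: Dict[str, str], tags: Dict[str, str],
                  pairs: List[Tuple[str, str]]) -> str:
    """Wrap each section in its tag, in the order implied by the pairs."""
    return "".join(
        f"<{tags[s]}>{escape(texts[s])}</{tags[s]}>" for s in order_sections(pairs)
    )
```

The text is escaped before wrapping so that characters such as `&` or `<` in the source document cannot break the generated markup.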

Based on the foregoing embodiment, in this embodiment, in step S40, returning the text with the HTML tag to the client in the form of a page, so that the client loads and displays the page content, including:

step S41, reading the text with HTML label by using HTML rendering engine, and converting into CSS code or HTML code;

In this embodiment, the rendering engine renders the variables, i.e. all the text that has been converted to carry HTML tags, into a template and then converts them into CSS code or HTML code. CSS is a style sheet language and HTML is a markup language.

Step S42, encapsulate the converted CSS code or HTML code into an H5 page, and return the H5 page to the client for the client to load and display the H5 page content.

In this embodiment, H5 is an abbreviation of HTML5, an advanced web page technology. HTML5 offers more interaction and functionality than HTML 4 (H4); one of its biggest advantages is multimedia support on mobile devices. Encapsulating the text with HTML tags in an H5 page makes it possible to reproduce the style of the PDF and display it in a user-friendly way. In addition, an H5 page reads smoothly on the client and adapts to a wide range of device models, improving the user experience.
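Encapsulating tagged text in an H5 page can be as simple as filling a page template. The sketch below uses `string.Template` from the Python standard library in place of a full rendering engine; the template contents, the viewport setting and the function name `build_h5_page` are assumptions made for illustration.

```python
from html import escape
from string import Template

PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>$title</title>
<style>$css</style>
</head>
<body>
$body
</body>
</html>
""")


def build_h5_page(title: str, body_html: str, css: str = "") -> str:
    """Embed already-tagged HTML text and optional CSS in an HTML5 page."""
    # The body is inserted as-is because it is expected to be tagged text already.
    return PAGE_TEMPLATE.substitute(title=escape(title), css=css, body=body_html)
```

The viewport `meta` element is a common way to let a page adapt to different screen widths on mobile clients.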

Further, after the H5 page is returned to the client, the document processing method further includes:

1. monitoring the H5 page to judge whether the H5 page loaded and displayed by the client is abnormal or not;

2. and if the H5 page loaded and displayed by the client is abnormal, sending abnormality information to the client so that, according to the abnormality information, the user can choose either to resend the document viewing request or to request to download the target document.

In this embodiment, monitoring the H5 page essentially means monitoring whether the H5 page has been loaded and whether it has been displayed on the client, so that abnormal conditions are detected promptly. Owing to the characteristics of H5 pages, a client can usually open and display an H5 page quickly, but low-probability events such as loss during code conversion or network failures can still cause the page to load or display abnormally on the client. To ensure that the user is not left unable to view the document, abnormality information is sent to the client to inform the user of the cause, and two remedies are offered for the user to choose between: resending the document viewing request, or downloading the target document and then viewing it. This effectively improves the user experience.
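The monitoring logic can be sketched as a check on two status flags. The example assumes the server learns, by some reporting mechanism not described here, whether the page was loaded and displayed; the class `AbnormalInfo`, the option strings and the reason texts are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AbnormalInfo:
    reason: str
    # The two remedies offered to the user: resend the request or download the document.
    options: List[str] = field(
        default_factory=lambda: ["resend_request", "download_document"]
    )


def check_page_status(loaded: bool, displayed: bool) -> Optional[AbnormalInfo]:
    """Return abnormality information if the page failed to load or display."""
    if not loaded:
        return AbnormalInfo("page failed to load")
    if not displayed:
        return AbnormalInfo("page loaded but was not displayed")
    return None  # nothing abnormal, so nothing is sent to the client
```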

Referring to FIG. 4, FIG. 4 is a functional block diagram of an embodiment of a document processing apparatus according to the present invention.

In this embodiment, the document processing apparatus includes:

the receiving module 10 is configured to, when a document viewing request sent by a client is received, determine a target document according to the document viewing request and a document stored in a database;

the analysis module 20 is configured to analyze the target document to obtain a text element;

the semantic module 30 is configured to perform semantic processing on the text elements to obtain a text with an HTML tag;

and the return module 40 is configured to return the text with the HTML tag to the client in a page form, so that the client can load and display page content.

It should be noted that the embodiments of the document processing apparatus are substantially the same as the embodiments of the document processing method, and are not described in detail here.

Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a document processing program which, when executed by a processor, implements the steps of the document processing method as set forth in any one of the above.

The specific embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the document processing method described above, and will not be described in detail herein.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A document processing method, characterized by comprising the steps of:

analyzing the target document to obtain a text element;

2. The document processing method according to claim 1, wherein determining the target document according to the document viewing request and the document stored in the database when receiving the document viewing request sent by the client comprises:

3. The document processing method of claim 1, wherein the parsing the target document to obtain text elements comprises:

4. The document processing method of claim 1, wherein the semantically processing the text element to obtain a text with an HTML tag comprises:

5. The document processing method of claim 1, wherein said returning said HTML tagged text to said client in page form for loading and displaying page content by said client comprises:

6. The document processing method of claim 5, wherein after said returning said H5 page to said client, further comprising:

7. The document processing method of any one of claims 1-6, wherein after the semantically processing the text element to obtain the text with the HTML tag, further comprising:

8. A document processing apparatus, characterized by comprising:

9. A document processing apparatus, characterized by comprising: memory, a processor and a document processing program stored on the memory and executable on the processor, the document processing program when executed by the processor implementing the steps of the document processing method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a document processing program which, when executed by a processor, implements the steps of the document processing method according to any one of claims 1 to 7.