CN113204951A - Document processing method, document processing device, storage medium and computer equipment - Google Patents

Document processing method, document processing device, storage medium and computer equipment

Info

Publication number
CN113204951A
Authority
CN
China
Prior art keywords
document
title
processed
chapter
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110583801.5A
Other languages
Chinese (zh)
Inventor
廖林涛
朱增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ONYX INTERNATIONAL Inc
Original Assignee
ONYX INTERNATIONAL Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ONYX INTERNATIONAL Inc filed Critical ONYX INTERNATIONAL Inc
Priority to CN202110583801.5A priority Critical patent/CN113204951A/en
Publication of CN113204951A publication Critical patent/CN113204951A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/191 Automatic line break hyphenation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The document processing method, document processing device, storage medium and computer equipment allow byte data within a designated range of a chapter list to be loaded according to the user's reading position while the user reads a document, so the whole document does not need to be loaded. This improves the typesetting speed and page-turning speed of the document, allows larger documents to be supported, and improves the user's reading experience. In addition, the directory and the chapter list are established from the titles of the document to be processed and the chapter start and end positions corresponding to those titles, and each new chapter starts on a new page, so the document is typeset in an orderly manner, further improving the reading experience.

Description

Document processing method, document processing device, storage medium and computer equipment
Technical Field
The present invention relates to the field of document optimization technologies, and in particular, to a document processing method, an apparatus, a storage medium, and a computer device.
Background
With the development of information technology, more and more types of electronic products have become popular. One example is the ink-screen e-book reader, a device dedicated to reading electronic books that offers users a paper-like reading experience and, compared with other devices, greatly improves the user experience.
Existing ink-screen e-book readers come with reading software such as NeoReader installed; a user can log in to NeoReader, select an electronic book, and download and read it. However, no page break is inserted between chapters during reading, so the reading experience is poor. In addition, NeoReader's existing typesetting engine is designed for HTML and CSS; if a TXT document is converted directly into HTML, typesetting and page turning are very slow, and a large document in particular cannot be loaded and displayed in time, which degrades the user's reading experience.
Disclosure of Invention
The invention aims to solve at least one of the technical defects, in particular to the technical defect of poor reading experience of reading software installed in the ink screen electronic book reader in the prior art.
The invention provides a document processing method, which comprises the following steps:
scanning each line of text in a document to be processed;
screening text lines serving as titles based on the byte length contained in each line of text and a preset title rule, and determining the position of each title in the document to be processed;
determining the starting and stopping positions of the sections corresponding to each title based on the positions of the titles in the document to be processed;
and establishing a directory and a chapter list corresponding to the document to be processed according to the title and the starting and ending positions of the chapters corresponding to the title.
Optionally, the document processing method further includes:
when a user opens the document to be processed, calling the chapter list to determine a byte range to be read according to the current byte position of the user staying in the document to be processed;
a byte stream within the byte range is read.
Optionally, the step of scanning each line of text in the document to be processed includes:
detecting the document encoding of the document to be processed;
and scanning each line of text in the document to be processed according to the document encoding.
Optionally, the step of scanning each line of text in the document to be processed according to the document encoding includes:
determining the end-of-line byte position of each line of the document to be processed based on the document encoding;
and determining each line of text in the document to be processed according to the end-of-line byte position of each line.
Optionally, the step of screening the text lines as the titles based on the byte length of each line of text and a preset title rule includes:
determining the byte length corresponding to each line of text;
screening text lines with the byte length not greater than a preset title length threshold value for decoding;
judging whether each row of character strings obtained after decoding is a title or not according to a preset title rule;
and if so, taking the text line corresponding to the character string as a title.
Optionally, after the step of determining whether each line of character strings obtained after decoding is a title according to a preset title rule, the method further includes:
if a line's character string is not a title, dividing the text between the titles adjacent to it on either side into chapters according to a preset chapter length threshold;
and determining the starting and ending positions of each divided chapter.
Optionally, the step of determining the start-stop position of the chapter corresponding to each title based on the position of each title in the document to be processed includes:
for a target title:
taking the byte position, in the document to be processed, of the first character of the target title as the start position of the chapter corresponding to the target title;
and taking the byte position, in the document to be processed, of the first character of the next title adjacent to the target title as the end position of the chapter corresponding to the target title.
The present invention also provides a document processing apparatus, comprising:
the data scanning module is used for scanning each line of text in the document to be processed;
the title confirming module is used for screening text lines serving as titles based on the byte length contained in each line of text and a preset title rule, and determining the position of each title in the document to be processed;
the chapter confirming module is used for determining the starting and stopping positions of chapters corresponding to each title based on the position of each title in the document to be processed;
and the document establishing module is used for establishing a catalogue and a chapter list corresponding to the document to be processed according to the title and the starting and stopping positions of the chapters corresponding to the title.
The present invention also provides a storage medium having stored therein computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method as described in any one of the above embodiments.
The present invention also provides a computer device having computer readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method as in any one of the above embodiments.
According to the technical scheme, the embodiment of the invention has the following advantages:
When a document to be processed is processed, each line of text in the document is scanned; the text lines serving as titles are screened out according to the byte length of each line of text and a preset title rule; the start and end positions of the chapter corresponding to each title are determined based on the position of each title in the document; and finally a directory and a chapter list are established according to the titles and the start and end positions of their chapters. As a result, when a user reads the document, byte data within a specified range of the chapter list can be loaded according to the user's reading position without loading the whole document, which improves the typesetting speed and page-turning speed, allows larger documents to be supported, and improves the user's reading experience. In addition, because the directory and the chapter list are established from the titles and the corresponding chapter start and end positions, and each new chapter starts on a new page, the document is typeset in an orderly manner, further improving the reading experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flowchart illustrating a document processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of screening text lines according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the Background, reading software installed in existing ink-screen e-book readers, such as NeoReader, inserts no page break between chapters and typesets and turns pages slowly when a TXT document is converted directly into HTML, especially for large documents. The invention aims to solve this technical problem of poor reading experience and provides the following technical solution:
In one embodiment, FIG. 1 is a flowchart illustrating a document processing method according to an embodiment of the present invention. As shown in FIG. 1, the invention provides a document processing method that specifically comprises the following steps:
S110: Scan each line of text in the document to be processed.
In this step, once the document to be scanned has been determined, it is taken as the document to be processed. During scanning, bytes are read one by one from the document to be processed, and each line of text is delimited according to the document encoding, so that each line of text can then be processed accordingly.
Further, before document scanning, an MD5 value of the document to be processed may be calculated, and the MD5 value obtained after calculation is used as a key to search whether a cache scanning result exists from a disk cache; if so, no scanning is performed, and if not, scanning is performed.
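The cache lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `file_md5` and `load_or_scan`, the JSON file layout, and the cache directory are all assumptions introduced here; only the idea of keying a disk cache by the document's MD5 value comes from the text.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_or_scan(path, cache_dir, scan):
    """Return the cached scan result for `path`; call `scan` only on a cache miss.

    `scan` is a hypothetical callable that returns a JSON-serializable result,
    for example the directory and chapter list.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (file_md5(path) + ".json")  # MD5 value used as the key
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = scan(path)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the key is derived from the file content, a modified document produces a different key and is rescanned, while an unchanged document reuses the earlier result.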
It is to be understood that the document to be processed in this application refers to a plain text document, such as a TXT document.
S120: Screen the text lines serving as titles, and determine the position of each title in the document to be processed.
In this step, after each line of text of the document to be processed has been scanned in step S110, the text lines that can serve as titles are screened out based on the byte length of each line of text and a preset title rule, and the position of each title in the document to be processed is then determined, so that the start and end positions of the chapter corresponding to each title can be determined.
It can be understood that the document to be processed herein contains a plurality of bytes of data, and after determining each line of text in the document to be processed, it is equivalent to determining the byte data of each line, i.e. how many bytes are contained.
The preset title rule refers to a preset rule for identifying chapter titles, and the rule may be to use a trained machine learning model to identify titles of character strings of text lines, or to compare the text lines with a preset title library, so as to determine whether the text lines are titles according to the comparison result.
Specifically, when text lines that can be used as titles in each line of text need to be determined, each line of text may be preliminarily screened according to the byte length included in each line of text, and then whether the text lines after preliminary screening are titles or not may be identified according to a preset title rule, so that the text lines obtained through final screening are used as titles.
Further, for text lines that cannot be determined to be titles, or for chapters without a title, the scanned text can be divided into chapters of a certain byte length, which prevents typesetting from slowing down because a single chapter is too large.
After the text lines which can be used as the titles are screened out, the positions of the titles in the document to be processed can be determined according to the positions of the text lines in the document to be processed.
For example, each line of text includes a plurality of bytes, each byte has a specific position in the document to be processed, and after a text line that can be a title is screened out, the position of the first byte in the text line in the document to be processed can be used as the starting position of the title in the document to be processed corresponding to the text line, and the position of the last byte in the document to be processed can be used as the ending position of the title in the document to be processed corresponding to the text line.
S130: Determine the start and end positions of the chapter corresponding to each title based on the position of each title in the document to be processed.
In this step, after the text lines that can serve as titles have been screened out in step S120 based on the byte length of each line of text and the preset title rule, and the position of each title in the document to be processed has been determined, the start and end positions of the chapter corresponding to each title can be determined based on those positions.
For example, after the position of each title in the document to be processed is determined, the start position of the chapter corresponding to each title may be determined according to the position of the start byte of each title, and then the end position of the chapter corresponding to each title may be determined according to the position of the start byte of the next title adjacent to each title, so as to finally determine the start and end positions of the chapter corresponding to each title.
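A minimal sketch of this rule, under stated assumptions: the function name `chapter_ranges` is hypothetical, ranges are half-open `[start, end)` byte ranges, the last titled chapter is assumed to run to the end of the file, and any text before the first title is assumed to become an untitled leading chapter.

```python
def chapter_ranges(title_offsets, file_size):
    """Map sorted title start offsets to half-open chapter ranges [start, end).

    title_offsets: byte position of the first byte of each title, ascending.
    file_size: total size of the document in bytes.
    """
    ranges = []
    # Assumption: text before the first title forms an untitled leading chapter.
    if title_offsets and title_offsets[0] > 0:
        ranges.append((0, title_offsets[0]))
    for i, start in enumerate(title_offsets):
        # A chapter ends where the next title starts; the last one ends at EOF.
        end = title_offsets[i + 1] if i + 1 < len(title_offsets) else file_size
        ranges.append((start, end))
    return ranges
```

For instance, titles at byte offsets 1024, 10240 and 20480 in a 40960-byte file give the ranges [0, 1024), [1024, 10240), [10240, 20480) and [20480, 40960).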
S140: Establish a directory and a chapter list corresponding to the document to be processed.
In this step, after the start/stop position of the chapter corresponding to each title is determined based on the position of each title in the document to be processed in step S130, a directory and a chapter list corresponding to the document to be processed may be established according to the title and the start/stop position of the chapter corresponding to the title.
Specifically, after determining each title of the document to be processed and the start/stop position of the corresponding chapter, a corresponding directory may be established according to each title in the document to be processed, and a chapter list may be established according to the start/stop position of the chapter corresponding to each title.
For example, for a Chinese TXT document with a file size of 1.15 MB (1205862 bytes), the directory obtained by scanning with the document processing method of the present application may be: [Chapter One XX, Chapter Two XX, Chapter Three XX, ..., Chapter Ten XX];
the scanned chapter list may be [Start, Chapter One XX, Chapter Two XX, Chapter Three XX, ..., Chapter Ten XX, Chapter Eleven, Chapter Twelve]. Because Chapter One does not begin at the start of the document, an extra Start chapter is added to the chapter list; no titles were detected for Chapter Eleven and Chapter Twelve, which are obtained by dividing according to the preset chapter length threshold. The byte ranges corresponding to the scanned chapter list may be:
Start: [0, 1024)
Chapter One XX: [1024, 10240)
Chapter Two XX: [10240, 20480)
Chapter Three XX: [20480, 40960)
……
Chapter Ten XX: [102400, 105000)
Chapter Eleven: [105000, 1155000). The upper limit of this chapter is byte position 1152400, that is, the chapter start 105000 plus the preset chapter length threshold; when a line of text is read whose end-of-line byte position, 1155000, is greater than 1152400, that line is marked as the last line of Chapter Eleven.
Chapter Twelve: [1155000, 1205862), where 1205862 is the last byte position of the file.
In the above embodiment, when a document to be processed is processed, each line of text in the document is scanned; the text lines serving as titles are screened out according to the byte length of each line of text and a preset title rule; the start and end positions of the chapter corresponding to each title are determined based on the position of each title in the document; and finally a directory and a chapter list are established according to the titles and the start and end positions of their chapters. As a result, when a user reads the document, byte data within a specified range of the chapter list can be loaded according to the user's reading position without loading the whole document, which improves the typesetting speed and page-turning speed, allows larger documents to be supported, and improves the user's reading experience. In addition, because the directory and the chapter list are established from the titles and the corresponding chapter start and end positions, and each new chapter starts on a new page, the document is typeset in an orderly manner, further improving the reading experience.
The above embodiments mainly explain the document processing method of the present application, and the document processing method of the present application will be further explained below.
In an embodiment, the document processing method may further include:
S150: When the user opens the document to be processed, call the chapter list to determine the byte range to be read according to the byte position at which the user currently stays in the document to be processed.
S160: Read the byte stream within the byte range.
In this embodiment, when the user opens a document to be processed for which the directory and the chapter list have already been established, the client may call the chapter list according to the byte position at which the user currently stays in the document, determine from the chapter list the byte range to be read that corresponds to that position, and then read the byte stream within that byte range.
For example, the user opens reading software on the client and selects a document to read. If the user has not opened the document before, the document's start page is displayed. Once the client detects that the user's current position is the start page, it calls the chapter list to determine the byte range corresponding to the start page, reads the byte stream in that range, and displays it on the client after typesetting and rendering.
If the user has opened the document before, the client can determine, from the user's reading history, the byte position corresponding to the page the user currently has open, and call the chapter list to determine the corresponding byte range, which is the byte range of the chapter containing that page. The client then reads the byte stream in that range, typesets and renders it, and displays the page.
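The lookup and partial read in S150 and S160 might look like the sketch below. It assumes the chapter list is a sorted, contiguous list of half-open `(start, end)` byte ranges; the names `find_chapter` and `read_chapter` and the use of binary search are illustrative choices, not details given in the source.

```python
import bisect


def find_chapter(chapters, position):
    """Return the index of the chapter whose [start, end) range contains position."""
    starts = [start for start, _ in chapters]
    index = bisect.bisect_right(starts, position) - 1
    if index < 0 or position >= chapters[index][1]:
        raise ValueError("position outside the document")
    return index


def read_chapter(path, chapters, position):
    """Read only the bytes of the chapter containing `position` from the file."""
    start, end = chapters[find_chapter(chapters, position)]
    with open(path, "rb") as f:
        f.seek(start)                # jump to the chapter start
        return f.read(end - start)   # load just this chapter's byte stream
```

Only `end - start` bytes are held in memory, which is the property the method relies on to handle large TXT documents.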
When the byte stream of a specified range is loaded according to the chapter list, an HTML document and CSS styles can be created. For example, for Chapter One XX, the byte stream is first read from the file line by line according to the byte range [1024, 10240); the first line is the title, to which the title style is applied, and the second through last lines are paragraphs, to which the paragraph style is applied.
Using the created HTML document and CSS styles, each chapter is typeset when pages are turned, and a new chapter is typeset starting on a new page, so that each new chapter begins on a separate page. The title is rendered according to the title style; for example, a common title style uses a larger font size than the body text and bold type, whereas body paragraphs use a smaller font size than the title and are not bold.
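As a rough illustration of this step, the sketch below turns one chapter's bytes into an HTML fragment with a title style and a paragraph style. The function name `chapter_to_html`, the specific CSS values, the use of `page-break-before` to start the chapter on a new page, and the skipping of blank lines are all assumptions; the source only states that the first line gets a title style and the remaining lines get a paragraph style.

```python
import html

# Hypothetical styles: a larger bold title that starts on a new page,
# and normal-weight body paragraphs.
CSS = (
    "h1 { font-size: 1.5em; font-weight: bold; page-break-before: always; }\n"
    "p { font-size: 1em; font-weight: normal; }"
)


def chapter_to_html(chapter_bytes, encoding, has_title=True):
    """Convert one chapter's byte stream into a small HTML document."""
    text = chapter_bytes.decode(encoding)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    parts = []
    for i, line in enumerate(lines):
        # First line is the title (if the chapter has one); the rest are paragraphs.
        tag = "h1" if has_title and i == 0 else "p"
        parts.append("<{0}>{1}</{0}>".format(tag, html.escape(line)))
    return "<html><head><style>{}</style></head><body>\n{}\n</body></html>".format(
        CSS, "\n".join(parts)
    )
```

For chapters obtained by length-based division, which have no title line, `has_title=False` would render every line as a paragraph.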
In the above embodiment, since the documents to be processed are processed by chaptering, the typesetting speed is fast, and the text data loaded into the memory is small, thereby supporting large TXT documents, such as 500MB TXT documents.
Take a Chinese TXT document with a file size of 1.15 MB (1205862 bytes) as an example. With the existing NeoReader scheme, opening the document requires scanning the whole TXT document, which takes 1123 milliseconds, and turning a page takes 1102 milliseconds. With the present method, scanning the document takes 132 milliseconds and divides it into thirteen chapters; on first load only the Start chapter needs to be loaded, which is just 1024 bytes of data and takes 1 millisecond, and turning a page takes 109 milliseconds. Document opening and page turning are therefore both faster with the present method. In terms of memory, the existing scheme must load the whole TXT document when it is opened, occupying 1.15 MB; with the present method only the byte data of the current chapter is loaded, for example only 1024 bytes for the Start chapter, which reduces memory use.
The above embodiments further explain the document processing method of the present application, and the following describes how to scan and obtain each line of text in the document to be processed.
In one embodiment, the step of scanning each line of text in the document to be processed in step S110 may include:
S111: Detect the document encoding of the document to be processed.
S112: Scan each line of text in the document to be processed according to the document encoding.
In this embodiment, when scanning each line of text of the document to be processed, the document encoding may first be detected by a third-party algorithm, and each line of text may then be scanned according to that encoding.
For example, the application may use a Mozilla open-source algorithm to automatically detect the document encoding of the TXT document; once the encoding is detected, the document may be scanned line by line according to the format of that encoding, thereby obtaining each line of text.
The above embodiment specifically describes how to scan and acquire each line of text in the document to be processed, and the following description will explain how to scan each line of text in the document to be processed according to the document code.
In one embodiment, the step of scanning each line of text in the document to be processed according to the document code in step S112 may include:
S201: Determine the end-of-line byte position of each line of the document to be processed based on the document encoding.
S202: Determine each line of text in the document to be processed according to the end-of-line byte position of each line.
In this embodiment, after the document encoding of the document to be processed is detected, the end-of-line byte position of each line may be determined according to the document encoding, and each line of text may then be determined according to those positions.
Taking GBK as an example, when the document encoding of the document to be processed is detected to be GBK, the line break may be \r\n (hexadecimal 0x0D 0x0A) or \n (0x0A). The document to be processed is examined byte by byte for a line break; when a line break is found, the end-of-line byte position of that line is obtained. Alternatively, when the end of the file is reached, the end-of-file byte position is the end byte position of the last line of text.
Further, taking GBK as an example, bytes are read one by one from the beginning of the document to be processed. When \n (0x0A) is read at position BYTE1, the preceding byte is checked: if it is \r (0x0D), the byte range of the first line is LINE_BYTE_RANGE [0, BYTE1-1]; if it is not \r, the byte range of the first line is LINE_BYTE_RANGE [0, BYTE1]. The byte data of the first line is then loaded according to LINE_BYTE_RANGE. The second through Nth lines are read in the same way, with the last line ending at the end-of-file byte position.
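The byte-by-byte line scan can be sketched as follows. The function name `scan_line_ranges` and the range convention are assumptions made for this illustration: each line is reported as a half-open `(start, end)` range that excludes the `\n` or `\r\n` terminator. Scanning for the single byte 0x0A is safe for ASCII-compatible encodings such as GBK and UTF-8, where that value never occurs inside a multi-byte character; it would not be valid for UTF-16. For brevity the sketch operates on bytes already in memory, whereas a reader would stream them from the file.

```python
def scan_line_ranges(data):
    """Return half-open (start, end) byte ranges of each line, terminators excluded."""
    ranges = []
    line_start = 0
    for pos, byte in enumerate(data):
        if byte == 0x0A:  # found \n
            end = pos
            # If the preceding byte is \r, the terminator is \r\n.
            if pos > line_start and data[pos - 1] == 0x0D:
                end = pos - 1
            ranges.append((line_start, end))
            line_start = pos + 1  # next line starts after the \n
    if line_start < len(data):
        # The last line has no terminator and ends at the end of the file.
        ranges.append((line_start, len(data)))
    return ranges
```

The length of each line in bytes is simply `end - start`, which is what the title screening step uses before any decoding takes place.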
The above embodiment explains how to scan each line of text in the document to be processed according to the document code, and the following describes in detail how to filter the text line corresponding to the title.
In one embodiment, FIG. 2 is a schematic flow chart of screening text lines according to an embodiment of the present invention. As shown in FIG. 2, the step in S120 of screening the text lines serving as titles based on the byte length of each line of text and the preset title rule may include:
S121: Determine the byte length corresponding to each line of text.
S122: Select the text lines whose byte length is not greater than a preset title length threshold and decode them.
S123: Determine whether each decoded character string is a title; if so, execute step S124, and if a character string is not a title, execute step S125.
S124: Take the text line corresponding to the character string as a title.
In this embodiment, as shown in fig. 2, when a text line serving as a title is screened according to a byte length included in each line of text and a preset title rule, the byte length corresponding to each line of text may be determined, then a text line having a length not greater than a preset title length threshold is screened out for decoding, then each line of character strings obtained after decoding is judged according to the preset title rule, whether the line of character strings is a title is judged, and if the line of character strings is a title, the text line corresponding to the line of character strings is used as a title.
It is to be understood that the preset header length threshold herein refers to a value not greater than the maximum header byte length, and since it takes a long time to decode, the text scanning speed is increased by the preset header length threshold herein.
For example, suppose the preset Chinese title length threshold is 40 characters and the detected document encoding of the document to be processed is UTF-8, in which one character occupies at most 3 bytes. The maximum byte length equals the preset title length threshold multiplied by the maximum number of bytes per character in the text encoding, so the preset Chinese title length threshold is 40 × 3 = 120 bytes.
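The arithmetic above can be expressed as a small helper. This is a sketch under stated assumptions: the names `MAX_BYTES_PER_CHAR` and `title_byte_threshold` are hypothetical, and the value 3 for UTF-8 is the figure used in the text (code points outside the Basic Multilingual Plane take 4 bytes in UTF-8).

```python
# Assumed per-encoding maximum bytes per character (illustrative table).
# UTF-8: 3 is the figure used in the text; GBK characters take at most 2 bytes.
MAX_BYTES_PER_CHAR = {"utf-8": 3, "gbk": 2}


def title_byte_threshold(max_title_chars: int, encoding: str) -> int:
    """Convert a title length limit in characters into a limit in bytes."""
    return max_title_chars * MAX_BYTES_PER_CHAR[encoding.lower()]
```

With a 40-character limit this gives 120 bytes for UTF-8, matching the example, so lines can be rejected by byte length alone before any decoding takes place.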
After the screened text lines whose byte length is not greater than the preset title length threshold are decoded, whether each decoded line of character strings is a title is judged according to the preset title rule; if so, the text line corresponding to that line of character strings is taken as a title.
The preset title rule is a rule set in advance for identifying chapter titles. The rule may identify titles by applying a trained machine learning model to the character string of a text line, or by comparing the text line with a preset title library and judging from the comparison result whether the text line is a title.
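The screening flow of steps S121 to S124 can be sketched as follows. The text names a trained machine learning model or a preset title library as the title rule; the regular expression `TITLE_PATTERN` below is only a hypothetical stand-in for such a rule, and `find_titles` and its parameters are likewise illustrative names rather than anything defined by the source.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for the preset title rule: matches lines that
# begin with a pattern such as "第一章" or "第12回".
TITLE_PATTERN = re.compile(r"^\s*第[0-9零一二三四五六七八九十百千]+[章节回卷]")


def find_titles(
    data: bytes,
    line_ranges: List[Tuple[int, int]],
    encoding: str,
    max_title_bytes: int,
) -> List[Tuple[int, str]]:
    """Return (byte_offset, title_text) for each line judged to be a title."""
    titles: List[Tuple[int, str]] = []
    for start, end in line_ranges:
        length = end - start  # S121: byte length of the line
        if length == 0 or length > max_title_bytes:
            continue  # S122: skip lines that are too long, without decoding
        try:
            text = data[start:end].decode(encoding)
        except UnicodeDecodeError:
            continue  # undecodable lines are treated as non-titles here
        if TITLE_PATTERN.match(text):  # S123: apply the title rule
            titles.append((start, text.strip()))  # S124: record the title
    return titles
```

Because the byte-length check happens before decoding, long body lines never reach the decoder, which is the speed-up the threshold is meant to provide.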
The above embodiment describes in detail how to filter the text line corresponding to the title, and the following explains the text line that cannot be identified as the title.
In an embodiment, as shown in fig. 2, after the step of determining whether each line of character strings obtained after decoding is a title according to the preset title rule in step S123, the method may further include:
S125: Divide the text between the titles on the two adjacent sides of the character string according to a preset chapter length threshold.
S126: Determine the start and stop positions of each divided chapter.
In this embodiment, if title recognition fails for some lines of character strings, or no chapter title exists, the text between the titles on the two adjacent sides of such a line can be divided according to the preset chapter length threshold, and the start and stop positions of the divided chapters are determined.
For example, if recognition is unsuccessful, either the title in the current line failed to be recognized or the next chapter has no title. In this case, it can be determined whether the number of bytes scanned so far for the current chapter is greater than the preset chapter length threshold; if so, the current position is recorded as the end position of the current chapter and the start position of the next chapter, and the next chapter has no title.
The preset chapter length threshold is a value not greater than the maximum chapter length, and is mainly used to prevent typesetting from slowing down because a single chapter is too large. The preset chapter length threshold may be 100 KB, in which case each chapter contains at most 100 KB.
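The length-based division described above can be sketched as follows. This is an illustrative reading of the text, not a definitive implementation: `untitled_chapter_starts` and its parameters are hypothetical names, and the sketch assumes that cuts are placed at line starts, so a chapter may exceed the threshold by up to one line.

```python
from typing import Iterable, List, Set

DEFAULT_MAX_CHAPTER_BYTES = 100 * 1024  # the 100 KB example value from the text


def untitled_chapter_starts(
    line_starts: Iterable[int],
    title_starts: Set[int],
    max_chapter_bytes: int = DEFAULT_MAX_CHAPTER_BYTES,
) -> List[int]:
    """Return byte offsets at which untitled chapters begin.

    A new untitled chapter starts at the first line whose offset lies more
    than max_chapter_bytes past the start of the current chapter.
    """
    cuts: List[int] = []
    chapter_start = 0
    for pos in line_starts:
        if pos in title_starts:
            chapter_start = pos  # a recognized title opens a new chapter
        elif pos - chapter_start > max_chapter_bytes:
            cuts.append(pos)  # end of current chapter, start of the next
            chapter_start = pos
    return cuts
```

The offsets returned here can be merged with the recognized title positions to form the full chapter list, with the cut-generated chapters carrying no title.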
For example, for a Chinese TXT document with a file size of 1.15 MB (1205862 bytes), the chapter list scanned by the document processing method of the present application may be [start, Chapter One XX, Chapter Two XX, Chapter Three XX, ..., Chapter Ten XX, Chapter Eleven, Chapter Twelve].
Because the first chapter does not start at the beginning of the document, an additional start chapter is added to the chapter list. No titles were detected for the eleventh and twelfth chapters, which are obtained by division according to the preset chapter length threshold. During division, the start and stop positions of the span to be divided can be determined from the titles of the tenth chapter and the thirteenth chapter, and the span between them is then divided according to the preset chapter length threshold. After the span from the tenth chapter to the thirteenth chapter is divided into four sections, the middle two sections can be determined to be the eleventh chapter and the twelfth chapter, respectively; because they result from division by chapter length, the eleventh and twelfth chapters have no titles.
Further, for a large 500 MB TXT document, the existing scheme cannot open the document because the device memory is limited to 256 MB. With the present scheme, the size of each chapter is limited by the preset chapter length threshold, for example so that each chapter occupies at most 100 KB of memory, and opening such a large document is therefore supported.
The above embodiment explains the text line that cannot be identified as a title, and how to determine the start and stop positions of the corresponding chapters of each title will be described in detail below.
In one embodiment, the step of determining the start-stop position of each title corresponding to a chapter based on the position of each title in the document to be processed in step S130 may include:
S131: For a target title, take the byte position of the first character of the target title in the document to be processed as the start position of the chapter corresponding to the target title.
S132: Take the byte position, in the document to be processed, of the first character of the next title adjacent to the target title as the end position of the chapter corresponding to the target title.
In this embodiment, when the start and stop positions of the chapter corresponding to each title are determined based on the position of each title in the document to be processed, the title whose chapter positions are to be determined is taken as the target title. The byte position of the first character of the target title in the document to be processed is used as the start position of the corresponding chapter, and the byte position of the first character of the next adjacent title is used as the end position of that chapter, which yields the start and stop positions of the chapter corresponding to the target title.
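Steps S131 and S132 can be sketched as follows. The function name `chapter_ranges` and the tuple layout are hypothetical. Two details are assumptions of this sketch: the last chapter is taken to end at the end of the file, and an untitled opening chapter is added when the first title does not sit at offset 0, following the chapter list example given earlier in the description.

```python
from typing import List, Optional, Tuple

# (title, start_offset, end_offset), end exclusive; title is None if untitled.
Chapter = Tuple[Optional[str], int, int]


def chapter_ranges(titles: List[Tuple[int, str]], file_size: int) -> List[Chapter]:
    """Build chapter byte ranges from title offsets sorted in ascending order."""
    chapters: List[Chapter] = []
    if file_size <= 0:
        return chapters
    # Untitled opening chapter when the document does not begin with a title.
    if not titles or titles[0][0] > 0:
        first_end = titles[0][0] if titles else file_size
        chapters.append((None, 0, first_end))
    for index, (start, title) in enumerate(titles):
        # S132: the chapter ends where the next title begins.
        if index + 1 < len(titles):
            end = titles[index + 1][0]
        else:
            end = file_size  # assumption: the last chapter runs to end of file
        chapters.append((title, start, end))  # S131: the title offset is the start
    return chapters
```

The resulting ranges are contiguous and cover the whole file, so every byte position falls into exactly one chapter.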
The following describes a document processing apparatus provided in an embodiment of the present application, and the document processing apparatus described below and the document processing method described above may be referred to in correspondence with each other.
In one embodiment, as shown in fig. 3, fig. 3 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present invention; the invention also provides a document processing apparatus, which comprises a data scanning module 210, a title confirming module 220, a chapter confirming module 230 and a document establishing module 240, specifically as follows:
The data scanning module 210 is configured to scan each line of text in the document to be processed.
The title confirming module 220 is configured to filter text lines serving as titles based on the byte length of each line of text and a preset title rule, and determine the position of each title in the to-be-processed document.
The chapter confirming module 230 is configured to determine, based on the position of each title in the document to be processed, a start-stop position of a chapter corresponding to each title.
The document establishing module 240 is configured to establish a directory and a chapter list corresponding to the document to be processed according to the title and the start-stop position of the chapter corresponding to the title.
In the above embodiment, when a document to be processed is processed, each line of text in the document is scanned, the text lines serving as titles are screened out according to the byte length of each line of text and a preset title rule, the start-stop position of the chapter corresponding to each title is determined based on the position of each title in the document, and finally a directory and a chapter list are established according to the titles and the start-stop positions of their chapters. When a user reads the document, byte data in a specified range of the chapter list can therefore be loaded according to the user's reading position without loading the whole document, which improves the typesetting speed and page turning speed, allows larger documents to be supported, and improves the reading experience. In addition, because the directory and the chapter list are established from the different titles and the start-stop positions of their chapters, and each chapter starts on a new page, the document can be typeset in an orderly manner, further improving the reading experience.
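The on-demand loading described above can be sketched as follows. This is a minimal illustration under assumptions: `find_chapter` and `read_chapter` are hypothetical names, the chapter list is assumed to be sorted by start offset and to hold (title, start, end) tuples with an exclusive end, and the stream is any seekable binary file object.

```python
import bisect
from typing import BinaryIO, List, Optional, Tuple

Chapter = Tuple[Optional[str], int, int]  # (title, start, end), end exclusive


def find_chapter(chapters: List[Chapter], position: int) -> int:
    """Return the index of the chapter containing the given byte position."""
    starts = [chapter[1] for chapter in chapters]
    index = bisect.bisect_right(starts, position) - 1
    if index < 0 or position >= chapters[index][2]:
        raise ValueError("position is outside every chapter")
    return index


def read_chapter(stream: BinaryIO, chapters: List[Chapter], position: int) -> bytes:
    """Read only the bytes of the chapter that contains the reading position."""
    _, start, end = chapters[find_chapter(chapters, position)]
    stream.seek(start)
    return stream.read(end - start)
```

Only one chapter's bytes are held in memory at a time, so peak memory use depends on the chapter length threshold rather than on the size of the whole file.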
In one embodiment, the present invention also provides a storage medium having stored therein computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method as in any one of the above embodiments.
In one embodiment, the present invention also provides a computer device having computer-readable instructions stored therein, which, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method as in any one of the above embodiments.
Fig. 4 is a schematic diagram illustrating an internal structure of a computer device according to an embodiment of the present invention. The computer device 300 may be provided as a server. Referring to fig. 4, the computer device 300 includes a processing component 302 that further includes one or more processors and memory resources, represented by memory 301, for storing instructions, such as application programs, that are executable by the processing component 302. The application programs stored in memory 301 may include one or more modules that each correspond to a set of instructions. Further, the processing component 302 is configured to execute instructions to perform the document processing method of any of the embodiments described above.
The computer device 300 may also include a power component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate based on an operating system stored in memory 301, such as Windows Server, Mac OS X™, Unix, Linux, FreeBSD™, or the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of document processing, the method comprising:
scanning each line of text in a document to be processed;
screening text lines serving as titles based on the byte length contained in each line of text and a preset title rule, and determining the position of each title in the document to be processed;
determining the starting and stopping positions of the chapter corresponding to each title based on the position of each title in the document to be processed;
and establishing a directory and a chapter list corresponding to the document to be processed according to the title and the starting and ending positions of the chapters corresponding to the title.
2. The document processing method according to claim 1, further comprising:
when a user opens the document to be processed, calling the chapter list to determine a byte range to be read according to the current byte position of the user staying in the document to be processed;
a byte stream within the byte range is read.
3. The document processing method according to claim 1, wherein the step of scanning each line of text in the document to be processed comprises:
detecting a document code of a document to be processed;
and scanning each line of text in the document to be processed according to the document code.
4. The document processing method according to claim 3, wherein the step of scanning each line of text in the document to be processed according to the document code comprises:
determining the tail byte position of each line of the document to be processed based on the document code;
and determining each line of text in the document to be processed according to the tail byte position of each line.
5. The document processing method according to claim 1, wherein the step of screening the text lines serving as titles based on the byte length of each line of text and a preset title rule comprises:
determining the byte length corresponding to each line of text;
screening text lines with the byte length not greater than a preset title length threshold value for decoding;
judging whether each row of character strings obtained after decoding is a title or not according to a preset title rule;
and if so, taking the text line corresponding to the character string as a title.
6. The document processing method according to claim 5, wherein after the step of determining whether each line of character strings obtained after decoding is a title according to a preset title rule, the method further comprises:
if a line of character strings is not a title, dividing the text between the titles on the two adjacent sides of the character string according to a preset chapter length threshold;
and determining the starting and ending positions of each divided chapter.
7. The document processing method according to claim 1, wherein the step of determining the start-stop position of the chapter corresponding to each title based on the position of each title in the document to be processed comprises:
for a target title:
taking the byte position of the beginning character of the target title in the document to be processed as the starting position of the chapter corresponding to the target title;
and taking the byte position, in the document to be processed, of the first character of the next title adjacent to the target title as the end position of the chapter corresponding to the target title.
8. A document processing apparatus, comprising:
the data scanning module is used for scanning each line of text in the document to be processed;
the title confirming module is used for screening text lines serving as titles based on the byte length contained in each line of text and a preset title rule, and determining the position of each title in the document to be processed;
the chapter confirming module is used for determining the starting and stopping positions of chapters corresponding to each title based on the position of each title in the document to be processed;
and the document establishing module is used for establishing a catalogue and a chapter list corresponding to the document to be processed according to the title and the starting and stopping positions of the chapters corresponding to the title.
9. A storage medium, characterized by: the storage medium having stored therein computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method of any one of claims 1 to 7.
10. A computer device, characterized by: the computer device has stored therein computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the document processing method of any one of claims 1 to 7.
CN202110583801.5A 2021-05-27 2021-05-27 Document processing method, document processing device, storage medium and computer equipment Pending CN113204951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110583801.5A CN113204951A (en) 2021-05-27 2021-05-27 Document processing method, document processing device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110583801.5A CN113204951A (en) 2021-05-27 2021-05-27 Document processing method, document processing device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN113204951A true CN113204951A (en) 2021-08-03

Family

ID=77023691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110583801.5A Pending CN113204951A (en) 2021-05-27 2021-05-27 Document processing method, document processing device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN113204951A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101059800A (en) * 2006-04-21 2007-10-24 上海晨兴电子科技有限公司 Method and apparatus for displaying electronic book on mobile phone
CN102375806A (en) * 2010-08-23 2012-03-14 北大方正集团有限公司 Document title extraction method and device
CN105302778A (en) * 2015-10-23 2016-02-03 北京奇虎科技有限公司 Article chapter generation method and system and electronic book reader
CN109726166A (en) * 2018-12-20 2019-05-07 百度在线网络技术(北京)有限公司 Display methods, device, computer equipment and the readable storage medium storing program for executing of e-book
CN110717323A (en) * 2019-10-17 2020-01-21 北京幻想纵横网络技术有限公司 Document seal dividing method and device, terminal and computer readable storage medium
CN111382258A (en) * 2018-12-27 2020-07-07 阿里巴巴集团控股有限公司 Method and device for determining electronic reading object chapter
US10956731B1 (en) * 2019-10-09 2021-03-23 Adobe Inc. Heading identification and classification for a digital document

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023016117A1 (en) * 2021-08-13 2023-02-16 北京字节跳动网络技术有限公司 Content typesetting method and apparatus, computer device, and storage medium
CN115146608A (en) * 2022-05-13 2022-10-04 北京字节跳动网络技术有限公司 Content typesetting method, device, equipment and storage medium
CN116451683A (en) * 2022-11-08 2023-07-18 深圳市航顺芯片技术研发有限公司 Document merging method, terminal and computer readable storage medium
CN116451683B (en) * 2022-11-08 2024-01-30 深圳市航顺芯片技术研发有限公司 Document merging method, terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination