CN111695414B - Document processing method and device, electronic equipment and computer readable storage medium - Google Patents

Publication number
CN111695414B
Authority
CN
China
Prior art keywords
text block
text
space width
character
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010351716.1A
Other languages
Chinese (zh)
Other versions
CN111695414A (en
Inventor
童征宇
李俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010351716.1A
Publication of CN111695414A
Application granted
Publication of CN111695414B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Embodiments of the invention provide a document processing method and device, electronic equipment, and a computer-readable storage medium. The document processing method comprises: identifying text blocks in a target layout document; and, in the process of converting the target layout document into a streaming document, adding a space character between a first text block and a second text block according to the text information of each of the two text blocks. With this technical scheme, a space character can be added wherever one needs to be supplemented between two adjacent text blocks, which reduces the difficulty users have in parsing sentences when space characters are missing.

Description

Document processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of document processing technologies, and in particular, to a document processing method and apparatus, an electronic device, and a computer readable storage medium.
Background
With the development of intelligent terminal technology, intelligent terminals support more and more functions, including electronic reading. Electronic reading means that users no longer depend entirely on paper documents: they can read on an intelligent terminal anytime and anywhere, which better meets their reading needs.
In electronic reading, one common type of electronic document is the layout document. A layout document is a document that conforms to a layout document format specification. A layout document format is an electronic document format with a fixed layout presentation, and the presentation of a layout document is independent of the device. Because the layout is fixed, a layout document can be inconvenient to read when displayed on the screen of an intelligent terminal. To give the user a better reading experience, the layout document may be converted into a streaming document.
In a layout document, some space characters are not recorded as text elements. Instead, they are represented by positioning text blocks so as to create a separation distance between two adjacent text blocks. When the layout document is converted into a streaming document, such separation distances are generally ignored, so the space characters are lost, the separation between pieces of document content is affected, and the user has difficulty parsing sentences when reading.
Disclosure of Invention
The invention provides a document processing method and device, electronic equipment, and a computer-readable storage medium, to solve the prior-art problem that space characters are lost when a layout document is converted into a streaming document, which makes it hard for the user to parse sentences when reading.
In a first aspect of the present invention, there is provided a document processing method, including:
identifying text blocks in a target layout document;
in the process of converting the target layout document into a streaming document, adding a space character between a first text block and a second text block according to the text information of the first text block and of the second text block;
wherein the first text block and the second text block are two text blocks adjacent in the reading direction in the target layout document, and the text information includes at least one of the font, font size, and character type of the text.
Optionally, in the case that the first text block and the second text block are in the same line, adding a space character between the first text block and the second text block according to their respective text information includes:
determining a reference space width between the first text block and the second text block according to the text information of the first text block and of the second text block;
adding a space character between the first text block and the second text block in the case that the separation distance between them is greater than or equal to the reference space width.
Optionally, the determining the reference space width between the first text block and the second text block according to the text information of the first text block and the second text block respectively includes:
determining a target space width according to the fonts and/or font sizes of the text in the first text block and in the second text block;
determining a target space width ratio value according to the character types of the text in the first text block and in the second text block;
and determining the product of the target space width and the target space width ratio value as the reference space width between the first text block and the second text block.
Optionally, determining the target space width according to the fonts and/or font sizes of the text in the first text block and in the second text block includes:
determining, according to a preset correspondence between fonts and/or font sizes and space widths, a first space width corresponding to the font and/or font size of the text in the first text block and a second space width corresponding to the font and/or font size of the text in the second text block;
and determining the smaller of the first space width and the second space width as the target space width.
Optionally, determining the target space width ratio value according to the character types of the text in the first text block and in the second text block includes:
determining, according to a preset correspondence between character types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block;
and determining the smaller of the first space width ratio value and the second space width ratio value as the target space width ratio value.
Optionally, in the case that a text block includes at least two character types, determining a space width ratio value with a minimum value among space width ratio values respectively corresponding to the at least two character types as a space width ratio value corresponding to a character type of the text in the text block.
Optionally, in the case that the first text block and the second text block are located in two adjacent lines of the same paragraph, adding a space character between the first text block and the second text block according to their respective text information includes:
adding a space character between the first text block and the second text block if the character type of the last character in the first text block is a target character type and/or the character type of the first character in the second text block is a target character type;
wherein, in the reading direction, the first text block is the last text block in the earlier of the two adjacent lines, and the second text block is the first text block in the later of the two adjacent lines; the target character type includes at least one of English and numbers.
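The cross-line rule above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, "English and numbers" is approximated as ASCII letters and digits, and the "and/or" condition is read as a logical OR.

```python
def is_target_char(ch: str) -> bool:
    """True for an ASCII letter or digit (assumed stand-in for 'English or number')."""
    return ch.isascii() and ch.isalnum()


def needs_space_across_lines(prev_line_last_block: str, next_line_first_block: str) -> bool:
    """Decide whether to add a space when joining two adjacent lines of one paragraph.

    prev_line_last_block: text of the last block on the earlier line.
    next_line_first_block: text of the first block on the later line.
    """
    if not prev_line_last_block or not next_line_first_block:
        return False
    return (is_target_char(prev_line_last_block[-1])
            or is_target_char(next_line_first_block[0]))
```

Under this reading, a line ending in an English word followed by a line starting with Chinese text still receives a space, while two purely Chinese lines are joined directly.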
In a second aspect of the present invention, there is provided a document processing apparatus comprising:
an identifying module, configured to identify text blocks in a target layout document;
an adding module, configured to add a space character between a first text block and a second text block according to the text information of the first text block and of the second text block, in the process of converting the target layout document into a streaming document;
wherein the first text block and the second text block are two text blocks adjacent in the reading direction in the target layout document, and the text information includes at least one of the font, font size, and character type of the text.
Optionally, the adding module includes:
a determining submodule, configured to determine a reference space width between the first text block and the second text block according to text information of the first text block and the second text block, respectively; wherein the first text block and the second text block are in the same line;
a first adding sub-module for adding a space character between the first text block and the second text block in the case that the interval distance between the first text block and the second text block is greater than or equal to the reference space width.
Optionally, the determining submodule includes:
a first determining unit, configured to determine a target space width according to a font and/or a font size of text in the first text block and the second text block, respectively;
a second determining unit, configured to determine a target space width ratio value according to the character types of the text in the first text block and in the second text block;
and a third determining unit configured to determine a product of the target space width and the target space width ratio value as a reference space width between the first text block and the second text block.
Optionally, the first determining unit includes:
a first determining subunit, configured to determine, according to a preset correspondence between fonts and/or font sizes and space widths, a first space width corresponding to the font and/or font size of the text in the first text block and a second space width corresponding to the font and/or font size of the text in the second text block;
and a second determining subunit, configured to determine the smaller of the first space width and the second space width as the target space width.
Optionally, the second determining unit includes:
a third determining subunit, configured to determine, according to a preset correspondence between character types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block;
and a fourth determining subunit, configured to determine, as the target space width ratio value, a space width ratio value with a smaller numerical value of the first space width ratio value and the second space width ratio value.
Optionally, in the case that a text block includes at least two character types, determining a space width ratio value with a minimum value among space width ratio values respectively corresponding to the at least two character types as a space width ratio value corresponding to a character type of the text in the text block.
Optionally, the adding module includes:
a second adding sub-module, configured to add a space character between the first text block and the second text block, where a character type of a last character in the first text block is a target character type, and/or where a character type of a first character in the second text block is a target character type;
wherein the first text block and the second text block are respectively positioned in two adjacent lines in the same paragraph; in the reading direction, the first text block is the last text block in the previous line of two adjacent lines, and the second text block is the first text block in the next line of two adjacent lines; the target character type includes: at least one of English and number.
In a third aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
and the processor is used for implementing the document processing method described above when executing the program stored in the memory.
In a fourth aspect of the invention, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document processing method as described above.
In a fifth aspect of embodiments of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps in the document processing method as described above.
Compared with the prior art, the invention has the following advantages:
In the embodiment of the invention, in the process of converting a layout document into a streaming document, whether a space character needs to be supplemented between two adjacent text blocks can be determined according to the text information of the two text blocks, and the space character is added when it is needed, which reduces the difficulty users have in parsing sentences when space characters are missing.
The foregoing description is only an overview of the technical solution of the present invention, given so that the technical means of the invention can be more clearly understood and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the invention are more readily apparent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly described below.
FIG. 1 is a schematic flow chart of a document processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another document processing method according to an embodiment of the present invention;
FIG. 3 is a first schematic diagram of an example provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of yet another document processing method according to an embodiment of the present invention;
FIG. 5 is a second schematic diagram of an example provided by an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a document processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of another document processing apparatus provided in an embodiment of the present invention;
FIG. 8 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
According to an aspect of the embodiment of the present invention, there is provided a document processing method, which may be applied to a user equipment or a server.
As shown in FIG. 1, the document processing method may include:
Step 101: text blocks in the target layout document are identified.
In the target layout document, the text content of each page is recorded in the form of at least one text block. Each text block includes a rectangular box and the text content within that box. The rectangular box defines information such as the position on the page of the text content it contains and the space that content occupies.
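As a rough illustration of the text block described above, the following sketch models a block as a rectangle plus its text. The class name, field names, and coordinate convention (x grows to the right, left-to-right reading) are assumptions made for this example; the source does not prescribe a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical model of one text block: a rectangular box and its text."""
    text: str
    x0: float  # left edge of the box
    y0: float  # top edge of the box
    x1: float  # right edge of the box
    y1: float  # bottom edge of the box
    font: str = ""
    font_size: float = 0.0


def horizontal_gap(first: TextBlock, second: TextBlock) -> float:
    """Separation distance between two blocks on the same line.

    Assumes left-to-right reading, so the gap runs from the right edge of
    the first block to the left edge of the second.
    """
    return second.x0 - first.x1
```

The separation distance computed this way is the quantity that the later steps compare against a reference space width.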
According to the embodiment of the invention, the text blocks in the target layout document can be identified according to the format syntax of the target layout document. The format of the target layout document may be the Portable Document Format (PDF) or another type of layout document format.
Step 102: in the process of converting the target layout document into a streaming document, a space character is added between the first text block and the second text block according to the text information of the first text block and of the second text block.
The first text block and the second text block are two text blocks adjacent in the reading direction in the target layout document. They may be in the same line, or they may be in two adjacent lines of the same paragraph (for example, the first text block is the last text block in the earlier of the two adjacent lines, and the second text block is the first text block in the later of the two adjacent lines).
In the embodiment of the invention, whether a space character is added between the first text block and the second text block can be determined according to the text information of the first text block and of the second text block. The text information described here may include at least one of the font, font size, and character type of the text. The character types of the text may include Chinese, English, numbers, punctuation marks, and operation symbols.
The font, font size, and character type of the text often determine whether a space character needs to be added. Take character type: when the text in both the first text block and the second text block is English, a space character needs to be added between the two text blocks to separate the words. As another example, because some space characters in a layout document are represented by positioning text blocks to create a separation distance between two adjacent text blocks, whether a space character needs to be supplemented can be determined from that separation distance. The width of a space character may differ between texts with different fonts and/or font sizes, so the separation distance used to represent a space character differs as well. Whether the separation distance between two text blocks represents a space character can therefore be determined according to the fonts and/or font sizes of the text: if it does, a space character needs to be supplemented; if it does not, no space character needs to be supplemented.
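Since the decision depends on character types, a classifier along the following lines may help make the categories concrete. This is a sketch only: the source names the five categories but not how to detect them, so the operator set, the CJK range check, and the use of Unicode categories for punctuation are all assumptions.

```python
import unicodedata

# Assumed operator set, for illustration only.
OPERATORS = set("+-*/=<>")


def char_type(ch: str) -> str:
    """Classify one character into the categories named in the text."""
    if ch in OPERATORS:
        return "operator"
    if ch.isascii() and ch.isalpha():
        return "english"
    if ch.isascii() and ch.isdigit():
        return "number"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (basic block)
        return "chinese"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "other"


def block_char_types(text: str) -> set:
    """Set of character types present in a block's text, ignoring whitespace."""
    return {char_type(ch) for ch in text if not ch.isspace()}
```

Operators are checked before punctuation because several of them (such as "-" and "*") fall under Unicode punctuation categories.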
In summary, in the embodiment of the present invention, in the process of converting a layout document into a streaming document, whether a space character needs to be supplemented between two adjacent text blocks can be determined according to the text information of the two text blocks, and the space character is added when it is needed, which reduces the difficulty users have in parsing sentences when space characters are missing.
According to another aspect of the embodiment of the present invention, a document processing method is provided, which may be applied to a user device or a server.
As shown in FIG. 2, the document processing method includes:
Step 201: text blocks in the target layout document are identified.
In the target layout document, the text content of each page is recorded in the form of at least one text block. Each text block includes a rectangular box and the text content within that box. The rectangular box defines information such as the position on the page of the text content it contains and the space that content occupies.
According to the embodiment of the invention, the text blocks in the target layout document can be identified according to the format syntax of the target layout document. The format of the target layout document may be the Portable Document Format or another type of layout document format.
Step 202: in the process of converting the target layout document into a streaming document, the reference space width between the first text block and the second text block is determined according to the text information of the first text block and of the second text block.
The first text block and the second text block are two text blocks adjacent in the reading direction in the target layout document, and they are in the same line.
In the embodiment of the invention, the reference space width between the first text block and the second text block can be determined according to the text information of the first text block and of the second text block. The text information described here may include at least one of the font, font size, and character type of the text. The character types of the text may include Chinese, English, numbers, punctuation marks, and operation symbols.
When at least two reference space widths can be determined from the text information of the first text block and the text information of the second text block, the smallest of them may be taken as the reference space width between the two text blocks, so that as few space characters as possible are missed. For example, if two different reference space widths, m and n, can be determined from the text information of the first and second text blocks, the smaller of m and n is taken as the reference space width between the first text block and the second text block.
Step 203: in the case where the separation distance between the first text block and the second text block is greater than or equal to the reference space width, a space character is added between the first text block and the second text block.
After the reference space width between the first text block and the second text block is determined, the separation distance between the two text blocks is compared with the reference space width. If the separation distance is greater than or equal to the reference space width, the separation distance is considered to represent a space character, and a space character is added between the first text block and the second text block. If the separation distance is less than the reference space width, the separation distance is considered not to represent a space character, and no space character needs to be added between the first text block and the second text block.
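The comparison in step 203 can be sketched as a small joining routine. The tuple layout `(text, x_left, x_right)` and the `reference_width` callback are hypothetical conveniences for this example; how the reference width is actually derived is described separately in the text.

```python
def join_same_line(blocks, reference_width):
    """Join the text of blocks on one line, adding spaces where the gap is wide enough.

    blocks: list of (text, x_left, x_right) tuples in reading order (assumed layout).
    reference_width: callable(first_text, second_text) -> float, returning the
        reference space width for a pair of adjacent blocks.
    """
    if not blocks:
        return ""
    parts = [blocks[0][0]]
    for prev, cur in zip(blocks, blocks[1:]):
        gap = cur[1] - prev[2]  # separation distance between the two boxes
        if gap >= reference_width(prev[0], cur[0]):
            parts.append(" ")  # the gap is treated as representing a space character
        parts.append(cur[0])
    return "".join(parts)
```

Note that the comparison is inclusive: a gap exactly equal to the reference space width still yields a space, matching the "greater than or equal to" condition in step 203.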
For a better understanding of the flow of adding a space character, the text content shown in fig. 3 is exemplified below.
Diagram (a) in FIG. 3 shows a piece of text content in a target layout document. In the process of converting the target layout document into a streaming document, the text content is recognized and the text blocks corresponding to it are determined. As shown in diagram (b) in FIG. 3, the text content corresponds to six text blocks: "english master copy", "O' reily", "Media", "inc.", "publication", and "2011.", each corresponding to one text block. After recognition is complete, the text content is extracted and analyzed to determine which positions require a space character. Diagram (c) in FIG. 3 shows the positions in this piece of text content where a space character needs to be added: a black dot above the underline in diagram (c) indicates that a space character should be added at the position of the dot. Once these positions have been determined, space characters are added at them. Diagram (d) in FIG. 3 shows the typeset text content after the target layout document has been converted into a streaming document; the gray rectangles in the diagram represent the space characters between pieces of text content.
In the embodiment of the invention, in the process of converting a layout document into a streaming document, the reference space width between two adjacent text blocks can be determined according to the text information of the two text blocks. The separation distance between the two adjacent text blocks is then compared with the reference space width, whether a space character needs to be supplemented between them is determined from the comparison result, and the space character is added when it is needed, which reduces the difficulty users have in parsing sentences when space characters are missing.
Optionally, in the embodiment of the present invention, before determining from the text information whether to add a space character between two adjacent text blocks (for example, before determining the reference space width between the first text block and the second text block according to their respective text information), it may first be determined whether a space character is already present between the first text block and the second text block. If a space character is present, none needs to be added, so there is no need to judge from the text information whether to add one; if no space character is present, whether to add one is then judged from the text information. This excludes the cases that do not require the text-information-based judgment, which saves processing time and improves processing efficiency.
Whether a space character is present between two adjacent text blocks can be determined by extracting the text content: if a space character can be extracted, a space character is present; if none can be extracted, no space character is present.
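One possible form of this pre-check is sketched below. It assumes that an explicitly recorded space shows up as whitespace at the end of the first block's extracted text or at the start of the second block's; the source only says that a space character "can be extracted", so this placement is an assumption and the function name is hypothetical.

```python
def has_explicit_space(first_text: str, second_text: str) -> bool:
    """True if a recorded space already separates the two blocks' extracted text.

    Slicing with [-1:] and [:1] keeps the check safe for empty strings,
    since "".isspace() is False.
    """
    return first_text[-1:].isspace() or second_text[:1].isspace()
```

When this returns True, the text-information-based judgment can be skipped for that pair of blocks.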
Optionally, step 202: determining a reference space width between the first text block and the second text block based on text information of the first text block and the second text block, respectively, may include:
step one: and determining the target space width according to the fonts and/or the word sizes of the texts in the first text block and the second text block respectively.
Different fonts and/or font sizes (i.e., different fonts, different font sizes and different font and font size combinations) may correspond to different preset space widths, for example, the preset space width corresponding to "regular script number four" is 0.5mm, the preset space width corresponding to "regular script number one" is 0.6mm, the preset space width corresponding to "Song body number four" is 0.4mm, and so on. The preset space width described herein may also be understood as a standard space width corresponding to a font and/or a font size. Based on the above, the embodiment of the invention can determine the target space width according to the fonts and/or the word sizes of the texts in the first text block and the second text block respectively.
Wherein, the target space width is: and determining the space width with smaller value in the two space widths according to the fonts and/or the word sizes of the texts in the first text block and the second text block respectively. In the embodiment of the invention, when the space width corresponding to the font and/or the font size of the text in the first text block is different from the space width corresponding to the font and/or the font size of the text in the second text block, the rule of taking the small rule is selected. For example, according to preset corresponding relations between different fonts and/or word sizes and space widths, determining a first space width corresponding to the fonts and/or word sizes of the texts in the first text block and a second space width corresponding to the fonts and/or word sizes of the texts in the second text block respectively; and then determining the space width with smaller value of the first space width and the second space width as the target space width.
The space widths corresponding to two adjacent text blocks are determined from the same kind of text information. For example, both are determined from the font of the text in the text block, both from the font size, or both from the combination of font and font size.
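Step one can be sketched as a table lookup followed by taking the minimum. The following is a minimal illustration only: the three widths are the example values given above, while the table name, the key format, and the function name are assumptions made for the sketch and are not part of the described embodiment.

```python
# Hypothetical lookup table: (font, size) -> preset space width in mm.
# The three widths are the example values from the description; the key
# names and the table itself are assumptions made for illustration.
PRESET_SPACE_WIDTH_MM = {
    ("regular script", "four"): 0.5,
    ("regular script", "one"): 0.6,
    ("Song", "four"): 0.4,
}


def target_space_width(first_key, second_key, table=PRESET_SPACE_WIDTH_MM):
    """Return the smaller of the two preset space widths (step one)."""
    first_width = table[first_key]    # width for the first text block
    second_width = table[second_key]  # width for the second text block
    return min(first_width, second_width)


print(target_space_width(("regular script", "four"), ("Song", "four")))  # 0.4
```

Both blocks are looked up with the same kind of key, which mirrors the requirement that the two space widths be determined from the same kind of text information.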
Step two: determine the target space width ratio value according to the character types of the text in the first text block and in the second text block, respectively.
In the process of producing a layout document, the distance that represents a space character may be compressed. In addition, the boxes of English letters, numerals, and punctuation marks may be larger than the space the characters actually occupy, leaving leading or trailing blank space, so that when the layout document is produced, the distance representing a space character may be compressed by adjusting the positions of the text blocks. In both cases, the actual distance between two adjacent text blocks is smaller than the distance corresponding to a space character, which affects the judgment, based on the distance between the two adjacent text blocks, of whether to add a space character. Different characters distribute blank space differently within their boxes, and the amount of compressible space differs accordingly, so the size of the compressed space can be determined according to the character types in the text block.
In the embodiment of the present invention, to prevent the above situations, as far as possible, from affecting the determination of whether to add a space character, different space width ratio parameters (i.e., space width ratio values) are set for different character types. The target space width ratio value can therefore be determined according to the character type of the text in the first text block and the character type of the text in the second text block. The target space width ratio value is the smaller of the two space width ratio values determined according to the character types of the text in the first text block and in the second text block, respectively.
In the embodiment of the invention, when the space width ratio value corresponding to the character type of the text in the first text block differs from that corresponding to the character type of the text in the second text block, the smaller value is taken. For example, according to a preset correspondence between different text types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block are respectively determined; the smaller of the first space width ratio value and the second space width ratio value is then determined as the target space width ratio value.
Optionally, in the case that one text block includes text of at least two character types, the smallest of the space width ratio values respectively corresponding to the at least two character types is determined as the space width ratio value corresponding to the character type of the text in that text block.
Alternatively, when a text block includes text of at least two character types, the corresponding space width ratio value in each text block may be determined according to the character type of the first character or the character type of the last character in the text block. For example, in the reading direction, the first text block precedes the second text block, and the space width ratio value corresponding to the character type of the last character in the first text block may be determined as the space width ratio value corresponding to the character type of the text in the first text block; for the second text block, the space width ratio value corresponding to the character type of the first character in the second text block can be determined as the space width ratio value corresponding to the character type of the text in the second text block.
Step three: the product of the target space width and the target space width ratio value is determined as a reference space width between the first text block and the second text block.
After the target space width and the target space width ratio value are obtained, the product of the two is determined as the reference space width between the first text block and the second text block. This processing reduces the reference space width used to decide whether to add a space character, so that space characters are added to the greatest possible extent.
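Step three, together with the comparison against the actual gap between the two text blocks, can be sketched as below. The function names are hypothetical, and the ratio of 0.6 used in the example is an assumed value; the "greater than or equal to" comparison follows the condition stated for the first adding sub-module.

```python
def reference_space_width(target_width, target_ratio):
    """Step three: reference width = target space width x target ratio."""
    return target_width * target_ratio


def needs_space_same_line(gap, target_width, target_ratio):
    """Add a space when the gap between two same-line text blocks is
    greater than or equal to the reference space width."""
    return gap >= reference_space_width(target_width, target_ratio)


# Example with a target space width of 0.4 mm and an assumed ratio of 0.6,
# giving a reference space width of about 0.24 mm.
print(needs_space_same_line(0.30, 0.4, 0.6))  # True
print(needs_space_same_line(0.20, 0.4, 0.6))  # False
```

Because the ratio value is at most 1 in this sketch, the reference width never exceeds the preset width, which is why a gap that was compressed during layout can still be recognized as a space.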
In summary, in the embodiment of the present invention, in the process of converting a format document into a streaming document, for two adjacent text blocks in the same line, the distance between the two adjacent text blocks may be compared with the reference space width, and whether to supplement a space symbol between the two adjacent text blocks is determined according to the comparison result, and when the space symbol is needed to be supplemented, the space symbol is added, so as to reduce the problem that the user is difficult to break the sentence when reading due to lack of the space symbol.
According to another aspect of the embodiment of the present invention, a document processing method is provided, which may be applied to a user device or a server.
As shown in fig. 4, the document processing method includes:
Step 401: identify text blocks in the target layout document.
For the target layout document, the text content in each page is recorded in the form of at least one text block. Wherein each text block includes: a rectangular box (i.e., box) and text content within the rectangular box. The rectangular box may be used to define information such as the location of text content within the box in the page, the space occupied, etc.
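A text block as described above can be modelled as a rectangle plus its text. The sketch below is an assumption-laden illustration: the field names, the coordinate convention (x grows to the right, left-to-right reading direction), and the helper `horizontal_gap` are hypothetical and are not taken from any particular layout document format.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical model of a text block: a rectangular box plus text."""
    x: float       # left edge of the box on the page
    y: float       # top edge of the box on the page
    width: float   # horizontal extent of the box
    height: float  # vertical extent of the box
    text: str      # text content inside the box


def horizontal_gap(first, second):
    """Distance between the right edge of the first box and the left
    edge of the second, assuming a left-to-right reading direction."""
    return second.x - (first.x + first.width)


a = TextBlock(x=10.0, y=5.0, width=20.0, height=4.0, text="document")
b = TextBlock(x=32.5, y=5.0, width=18.0, height=4.0, text="processing")
print(horizontal_gap(a, b))  # 2.5
```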
According to the embodiment of the invention, the text blocks in the target layout document can be identified according to the layout syntax corresponding to the target layout document. The target layout document may be in the Portable Document Format (PDF) or in another layout document format.
Step 402: in the process of converting the target format document into the stream document, if the character type of the last character in the first text block is the target character type and/or if the character type of the first character in the second text block is the target character type, a space character is added between the first text block and the second text block.
The first text block and the second text block are two text blocks adjacent to each other in the reading direction in the target layout document, and the first text block and the second text block are respectively located in two adjacent lines in the same paragraph, for example, the first text block is the last text block of the previous line in the two adjacent lines, and the second text block is the first text block of the next line in the two adjacent lines.
For two text blocks that are located in two adjacent lines of the same paragraph and are adjacent in the reading direction, the text contents of the two blocks may end up in the same line or the same column of the streaming document after the layout document is converted. To preserve the separation that should exist between the two text contents, it is therefore also necessary to determine in advance whether a space character needs to be added between the two adjacent text blocks. The embodiment of the invention may provide that a space character is added between the first text block and the second text block in the case that the character type of the last character in the first text block is the target character type and/or in the case that the character type of the first character in the second text block is the target character type. The character types of text described here may include Chinese, English, numbers, punctuation marks, and operation symbols. The target character types described here may include at least one of English and numbers.
For a better understanding of the flow of adding a space character, the text content shown in fig. 5 is exemplified below.
As shown in fig. 5, diagram a in fig. 5 illustrates a piece of text content in a target layout document. In the process of converting the target layout document into a streaming document, the text content is recognized and the text blocks corresponding to it are determined. Diagram b in fig. 5 illustrates the last text block in the first line of the text content (i.e., the text block corresponding to the text content "O'Reilly", hereinafter text block a) and the first text block in the second line (i.e., the text block corresponding to the text content "Media", hereinafter text block b). After recognition is completed, the text content is extracted and analyzed to determine whether a space character needs to be added between text block a and text block b. Diagram c in fig. 5 illustrates that a space character needs to be added between text block a and text block b; the black dot above the lower line in diagram c marks the position at which the space character should be added. After it is determined that a space character is needed between text block a and text block b, the space character is added. Diagram d in fig. 5 illustrates the typesetting effect of the text content after the target layout document is converted into a streaming document, where the gray rectangle illustrates the space character between the text content "O'Reilly" and "Media".
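The rule of step 402 can be sketched with the two strings of the fig. 5 example. This is a minimal illustration with assumptions: `is_target_type` treats any ASCII letter or digit as "English or number", the function names are hypothetical, and the "and/or" condition is implemented in its "or" variant (a space is added if either facing character is of the target type).

```python
def is_target_type(ch):
    """Simplified check: ASCII letter or digit stands in for the
    target character types 'English' and 'number'."""
    return ch.isascii() and ch.isalnum()


def needs_space_across_lines(first_text, second_text):
    """'Or' variant of step 402: last character of the preceding block
    or first character of the following block is of the target type."""
    return is_target_type(first_text[-1]) or is_target_type(second_text[0])


def join_lines(first_text, second_text):
    """Join the end of one line with the start of the next line,
    inserting a space character only when the rule requires it."""
    separator = " " if needs_space_across_lines(first_text, second_text) else ""
    return first_text + separator + second_text


print(join_lines("O'Reilly", "Media"))  # O'Reilly Media
print(join_lines("文本", "处理"))        # 文本处理
```

Two lines of Chinese text are joined without a space, while a line ending in an English word is separated from the next line, which matches the typesetting effect described for diagram d.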
In summary, in the embodiment of the present invention, in the process of converting a format document into a streaming document, for two text blocks that are respectively located in two adjacent lines of the same paragraph and are adjacent in the reading direction, whether a space character needs to be supplemented between the two adjacent text blocks or not may be determined according to the character type of the last character in the preceding text block and/or the character type of the first character in the following text block, respectively, in the two adjacent text blocks, and the space character is added when the space character needs to be supplemented, so as to reduce the problem that the user is difficult to break a sentence when reading due to the lack of the space character.
According to another aspect of an embodiment of the present invention, there is provided a document processing apparatus. The document processing device can be applied to user equipment and a server.
As shown in fig. 6, the document processing apparatus 600 may include:
An identifying module 601, configured to identify text blocks in the target layout document.
An adding module 602, configured to add a space character between the first text block and the second text block according to the text information of the first text block and the second text block, respectively, in the process of converting the target layout document into the streaming document.
The first text block and the second text block are two adjacent text blocks in the reading direction in the target format document; the text information includes: at least one of font, font size, and character type of the text.
Optionally, the adding module 602 may be specifically configured to: in the process of converting the target layout document into the streaming document, if no space character is present between the first text block and the second text block, add a space character between the first text block and the second text block according to the text information of the first text block and the second text block, respectively.
Optionally, as shown in fig. 7, the adding module 602 includes:
a determining submodule 6021, configured to determine a reference space width between the first text block and the second text block according to the text information of the first text block and the second text block, respectively.
Wherein the first text block and the second text block are in the same line.
A first adding sub-module 6022 for adding a space character between the first text block and the second text block in case that a separation distance between the first text block and the second text block is greater than or equal to the reference space width.
Optionally, as shown in fig. 7, the determining submodule 6021 includes:
A first determining unit 60211, configured to determine a target space width according to the font and/or font size of the text in the first text block and in the second text block, respectively.
A second determining unit 60212, configured to determine a target space width ratio value according to the character types of the text in the first text block and in the second text block, respectively.
A third determining unit 60213, configured to determine the product of the target space width and the target space width ratio value as the reference space width between the first text block and the second text block.
Optionally, as shown in fig. 7, the first determining unit 60211 includes:
A first determining subunit 602111, configured to determine, according to a preset correspondence between different fonts and/or font sizes and space widths, a first space width corresponding to the font and/or font size of the text in the first text block and a second space width corresponding to the font and/or font size of the text in the second text block, respectively.
A second determining subunit 602112, configured to determine the smaller of the first space width and the second space width as the target space width.
Optionally, as shown in fig. 7, the second determining unit 60212 includes:
A third determining subunit 602121, configured to determine, according to a preset correspondence between different text types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block, respectively.
A fourth determining subunit 602122, configured to determine the smaller of the first space width ratio value and the second space width ratio value as the target space width ratio value.
Optionally, in the case that one text block includes text of at least two character types, the smallest of the space width ratio values respectively corresponding to the at least two character types is determined as the space width ratio value corresponding to the character type of the text in that text block.
Optionally, as shown in fig. 7, the adding module 602 includes:
a second adding sub-module 6022 for adding a space character between the first text block and the second text block in case the character type of the last character within the first text block is a target character type and/or in case the character type of the first character within the second text block is a target character type;
Wherein the first text block and the second text block are respectively positioned in two adjacent lines in the same paragraph; in the reading direction, the first text block is the last text block in the previous line of two adjacent lines, and the second text block is the first text block in the next line of two adjacent lines; the target character type includes: at least one of English and number.
In the embodiment of the invention, in the process of converting the format document into the streaming document, whether the space character needs to be supplemented between two adjacent text blocks or not can be determined according to the text information of the two adjacent text blocks, and the space character is added when the space character needs to be supplemented, so that the problem that a user is difficult to break sentences when reading due to the lack of the space character is solved.
For the device embodiments described above, reference is made to the description of the method embodiments for the relevant points, since they are substantially similar to the method embodiments.
The embodiment of the invention also provides electronic equipment which can be a server. As shown in fig. 8, the device comprises a processor 801, a communication interface 802, a memory 803 and a communication bus 804, wherein the processor 801, the communication interface 802 and the memory 803 communicate with each other through the communication bus 804.
A memory 803 for storing a computer program.
When the electronic device is a terminal device, the processor 801 is configured to execute the program stored in the memory 803, and implement the following steps:
identifying text blocks in the target layout document;
in the process of converting the target format document into a streaming document, adding a space character between the first text block and the second text block according to the text information of the first text block and the second text block respectively;
the first text block and the second text block are two adjacent text blocks in the reading direction in the target format document; the text information includes: at least one of font, font size, and character type of the text.
Optionally, in a case that the first text block and the second text block are in the same line, adding a space symbol between the first text block and the second text block according to the text information of the first text block and the second text block, respectively, includes:
determining a reference space width between the first text block and the second text block according to the text information of the first text block and the second text block respectively;
A space character is added between the first text block and the second text block in the case where a separation distance between the first text block and the second text block is greater than or equal to the reference space width.
Optionally, the determining the reference space width between the first text block and the second text block according to the text information of the first text block and the second text block respectively includes:
determining a target space width according to the fonts and/or the word sizes of the texts in the first text block and the second text block respectively;
determining a target space width ratio value according to the character types of texts in the first text block and the second text block respectively;
and determining the product of the target space width and the target space width ratio value as a reference space width between the first text block and the second text block.
Optionally, the determining the target space width according to the font and/or the font size of the text in the first text block and the second text block respectively includes:
according to preset corresponding relations between different fonts and/or word sizes and space widths, determining a first space width corresponding to the fonts and/or word sizes of the texts in the first text block and a second space width corresponding to the fonts and/or word sizes of the texts in the second text block respectively;
And determining the smaller space width of the first space width and the second space width as the target space width.
Optionally, the determining the target space width ratio value according to the character types of the texts in the first text block and the second text block includes:
according to preset corresponding relations between different text types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block are respectively determined;
and determining a smaller space width ratio value of the first space width ratio value and the second space width ratio value as the target space width ratio value.
Optionally, in a case that the first text block and the second text block are respectively located in two adjacent lines in the same paragraph, adding a space character between the first text block and the second text block according to the text information of the first text block and the second text block respectively includes:
if the character type of the last character in the first text block is the target character type and/or if the character type of the first character in the second text block is the target character type, adding a space character between the first text block and the second text block;
In the reading direction, the first text block is the last text block in the previous line of two adjacent lines, and the second text block is the first text block in the next line of two adjacent lines; the target character type includes: at least one of English and number.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the document processing method described in the above embodiment.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the document processing method described in the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (8)

1. A document processing method, comprising:
identifying text blocks in the target layout document;
in the process of converting the target format document into a streaming document, adding space characters between a first text block and a second text block according to the text information of the first text block and the second text block respectively;
wherein, in the case that the first text block and the second text block are in the same line, the adding a space character between the first text block and the second text block according to the text information of the first text block and the second text block respectively comprises:
determining a target space width according to the fonts and/or the word sizes of the texts in the first text block and the second text block respectively;
determining a target space width ratio value according to the character types of texts in the first text block and the second text block respectively;
determining a product of the target space width and the target space width ratio value as a reference space width between the first text block and the second text block;
Adding a space character between the first text block and the second text block if a separation distance between the first text block and the second text block is greater than or equal to the reference space width;
the first text block and the second text block are two adjacent text blocks in the reading direction in the target format document; the text information includes: font type, font size, and character type of text.
2. The document processing method of claim 1, wherein the determining the target space width based on the font and/or font size of text within the first text block and the second text block, respectively, comprises:
according to preset corresponding relations between different fonts and/or word sizes and space widths, determining a first space width corresponding to the fonts and/or word sizes of the texts in the first text block and a second space width corresponding to the fonts and/or word sizes of the texts in the second text block respectively;
and determining the smaller space width of the first space width and the second space width as the target space width.
3. The document processing method of claim 1, wherein the determining the target space width ratio value based on the character types of the text in the first text block and the second text block, respectively, comprises:
According to preset corresponding relations between different text types and space width ratio values, a first space width ratio value corresponding to the character type of the text in the first text block and a second space width ratio value corresponding to the character type of the text in the second text block are respectively determined;
and determining a smaller space width ratio value of the first space width ratio value and the second space width ratio value as the target space width ratio value.
4. A document processing method according to claim 3, wherein in the case where a text of at least two character types is included in one text block, a space width ratio value of which the numerical value is the smallest among space width ratio values respectively corresponding to the at least two character types is determined as a space width ratio value corresponding to a character type of the text in the text block.
5. The document processing method according to claim 1, wherein in a case where the first text block and the second text block are in two adjacent lines in the same paragraph, respectively, the adding a space character between the first text block and the second text block based on text information of the first text block and the second text block, respectively, comprises:
If the character type of the last character in the first text block is the target character type and/or if the character type of the first character in the second text block is the target character type, adding a space character between the first text block and the second text block;
in the reading direction, the first text block is the last text block in the previous line of two adjacent lines, and the second text block is the first text block in the next line of two adjacent lines; the target character type includes: at least one of English and number.
6. A document processing apparatus, comprising:
the identifying module is used for identifying text blocks in the target format document;
the adding module is used for adding space characters between the first text block and the second text block according to the text information of the first text block and the second text block respectively in the process of converting the target format document into the streaming document;
the adding module comprises:
a first determining unit, configured to determine a target space width according to a font and/or a font size of text in the first text block and the second text block, respectively;
a second determining unit, configured to determine a target space width ratio value according to the character types of the text in the first text block and the second text block, respectively;
a third determining unit configured to determine a product of the target space width and the target space width ratio value as a reference space width between the first text block and the second text block;
a first adding sub-module for adding a space character between the first text block and the second text block in a case that a separation distance between the first text block and the second text block is greater than or equal to the reference space width;
the first text block and the second text block are two adjacent text blocks in the reading direction in the target format document; the text information includes: font type, font size, and character type of text.
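The three determining units and the adding sub-module of claim 6 can be sketched as follows. This is only an illustration under stated assumptions: the `TextBlock` fields, the per-font factors in `FONT_SPACE_FACTOR`, the choice of the smaller of the two nominal widths, and the ratio values 0.5 and 1.0 are all hypothetical, since the claim does not specify how the width or the ratio is derived.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    font: str
    font_size: float  # in points
    x_start: float    # left edge on the page, in points
    x_end: float      # right edge on the page, in points


# Hypothetical factors: nominal space width as a fraction of the font size.
FONT_SPACE_FACTOR = {"Arial": 0.25}
DEFAULT_SPACE_FACTOR = 0.5


def is_english_or_digit(ch: str) -> bool:
    return ch.isascii() and ch.isalnum()


def target_space_width(first: TextBlock, second: TextBlock) -> float:
    """First determining unit: width from font and font size (assumed formula)."""
    widths = [
        b.font_size * FONT_SPACE_FACTOR.get(b.font, DEFAULT_SPACE_FACTOR)
        for b in (first, second)
    ]
    return min(widths)  # assumption: use the smaller of the two widths


def target_space_width_ratio(first: TextBlock, second: TextBlock) -> float:
    """Second determining unit: ratio from character types (assumed values)."""
    if (first.text and second.text
            and is_english_or_digit(first.text[-1])
            and is_english_or_digit(second.text[0])):
        return 0.5
    return 1.0


def join_blocks(first: TextBlock, second: TextBlock) -> str:
    """Third determining unit plus adding sub-module."""
    reference = target_space_width(first, second) * target_space_width_ratio(first, second)
    gap = second.x_start - first.x_end  # separation distance along the reading direction
    return first.text + (" " if gap >= reference else "") + second.text
```

With these illustrative numbers, two 12 pt Arial blocks get a reference space width of 12 × 0.25 × 0.5 = 1.5 pt, so a gap of 2 pt produces a space and a gap of 1 pt does not.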
7. An electronic device, comprising: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the document processing method according to any one of claims 1 to 5 when executing the program stored in the memory.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the document processing method according to any one of claims 1 to 5.
CN202010351716.1A 2020-04-28 2020-04-28 Document processing method and device, electronic equipment and computer readable storage medium Active CN111695414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010351716.1A CN111695414B (en) 2020-04-28 2020-04-28 Document processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010351716.1A CN111695414B (en) 2020-04-28 2020-04-28 Document processing method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111695414A (en) 2020-09-22
CN111695414B (en) 2024-03-01

Family

ID=72476702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010351716.1A Active CN111695414B (en) 2020-04-28 2020-04-28 Document processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111695414B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699634B (en) * 2020-12-28 2022-05-24 掌阅科技股份有限公司 Typesetting processing method of electronic book, electronic equipment and storage medium
CN113723048A (en) * 2021-09-06 2021-11-30 北京字跳网络技术有限公司 Method and device for setting rich text space, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104516868A (en) * 2013-09-30 2015-04-15 北大方正集团有限公司 Layout space streaming restoring method and layout space streaming restoring system
CN104536947A (en) * 2014-12-10 2015-04-22 百度在线网络技术(北京)有限公司 Layout document processing method and device
CN105335346A (en) * 2015-11-09 2016-02-17 汉王科技股份有限公司 PDF (Portable Document Format) document text extracting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001065355A1 (en) * 2000-03-01 2001-09-07 Celltrex Ltd. System and method for rapid document conversion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
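A proposed scheme for saving and restoring layout in UOF documents; Rong Mingjun et al.; Journal of Beijing Information Science and Technology University (Natural Science Edition); 2010-12-15 (S2); full text *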

Also Published As

Publication number Publication date
CN111695414A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
US9348799B2 (en) Forming a master page for an electronic document
US8225200B2 (en) Extracting a character string from a document and partitioning the character string into words by inserting space characters where appropriate
CN111695414B (en) Document processing method and device, electronic equipment and computer readable storage medium
US8538154B2 (en) Image processing method and image processing apparatus for extracting heading region from image of document
JP5412903B2 (en) Document image processing apparatus, document image processing method, and document image processing program
JP5790082B2 (en) Document recognition apparatus, document recognition method, program, and storage medium
US9218327B2 (en) Optimizing the layout of electronic documents by reducing presentation size of content within document sections so that when combined a plurality of document sections fit within a page
EP2191396A2 (en) An apparatus for preparing a display document for analysis
CN117216279A (en) Text extraction method, device and equipment of PDF (portable document format) file and storage medium
CN110377885B (en) Method, device, equipment and computer storage medium for converting PDF file
CN112699634B (en) Typesetting processing method of electronic book, electronic equipment and storage medium
JP5715172B2 (en) Document display device, document display method, and document display program
CN112686000B (en) Format conversion method of electronic book document, electronic equipment and storage medium
CN113297425B (en) Document conversion method, device, server and storage medium
CN111444456B (en) Style editing method and device and electronic equipment
CN110941972B (en) Segmentation method and device for characters in PDF document and electronic equipment
CN112434487A (en) Image-text typesetting method and device and electronic equipment
CN110874519B (en) Method and device for converting Markdown document into PDF document
CN112100978A (en) Typesetting processing method based on electronic book, electronic equipment and storage medium
CN112487759A (en) Document page number setting method and device, electronic equipment and storage medium
CN110457659B (en) Clause document generation method and terminal equipment
CN117391045B (en) Method for outputting file with portable file format capable of copying Mongolian
CN111832262B (en) Document processing method and device, electronic equipment and storage medium
US11163511B2 (en) Information processing apparatus and non-transitory computer readable medium
CN117291152A (en) Table extraction method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant