CN115618808A - Document typesetting method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115618808A
Authority
CN
China
Prior art keywords
layout
text
document
layout block
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211321232.8A
Other languages
Chinese (zh)
Inventor
于娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The disclosure provides a document typesetting method, a document typesetting device, electronic equipment and a storage medium. The document typesetting method comprises: dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information; performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to the semantic recognition result; and sorting the layout blocks and generating a text sequence according to the sorting result. The method typesets the extracted text according to the semantic information of the document, so that the resulting text sequence better matches the user's reading habits.

Description

Document typesetting method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for typesetting a document, an electronic device, and a storage medium.
Background
Resume parsing is the process of converting a text resume into a structured resume, which is easier to store and to use in later processing. Typesetting recovery is an important step in resume parsing: it adjusts the order of the extracted text so that the adjusted text sequence is more consistent with the reading order, which provides a basic guarantee for the subsequent steps of resume parsing.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The disclosure provides a document typesetting method and device, electronic equipment and a storage medium.
The present disclosure adopts the following technical solutions.
In some embodiments, the present disclosure provides a document typesetting method, including:
dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information;
performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to the semantic recognition result;
and sorting the layout blocks, and generating a text sequence according to the sorting result.
In some embodiments, the present disclosure provides a document typesetting apparatus, including:
a first processing module, configured to divide the document into layout blocks to obtain a plurality of layout blocks with different semantic information;
a second processing module, configured to perform semantic recognition on each line of text extracted from the document and to add each line of text to a corresponding layout block according to the semantic recognition result;
and a third processing module, configured to sort the layout blocks and to generate a text sequence according to the sorting result.
In some embodiments, the present disclosure provides an electronic device comprising: at least one memory and at least one processor;
the memory is used for storing program codes, and the processor is used for calling the program codes stored in the memory to execute the method.
In some embodiments, the present disclosure provides a computer-readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the above-described method.
In the document typesetting method provided by the embodiment of the disclosure, the document is first divided into layout blocks to obtain a plurality of layout blocks with different semantic information; semantic recognition is then performed on each line of text extracted from the document, and each line of text is added to a corresponding layout block according to the semantic recognition result; finally, the layout blocks are sorted and a text sequence is generated according to the sorting result. The embodiment of the disclosure thus performs a preliminary semantic layout block division on the document, groups the text lines that belong to the same semantic layout block, and then sorts the layout blocks as a whole, thereby restoring the document typesetting to a readable order.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Fig. 1 is a flowchart of a document typesetting method according to an embodiment of the present disclosure.
Fig. 2 is a first schematic diagram of document typesetting according to an embodiment of the present disclosure.
Fig. 3 is a second schematic diagram of document typesetting according to an embodiment of the present disclosure.
Fig. 4 is a third schematic diagram of document typesetting according to an embodiment of the present disclosure.
Fig. 5 is a fourth schematic diagram of document typesetting according to an embodiment of the present disclosure.
Fig. 6 is a fifth schematic diagram of document typesetting according to an embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in method embodiments of the present disclosure may be performed in a different order and/or in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. The term "responsive to" and related terms mean that one signal or event is affected to some extent, but not necessarily completely or directly, by another signal or event. If an event x occurs "in response" to an event y, x may respond directly or indirectly to y. For example, the occurrence of y may ultimately result in the occurrence of x, but other intermediate events and/or conditions may exist. In other cases, y may not necessarily result in the occurrence of x, and x may occur even though y has not already occurred. Furthermore, the term "responsive to" may also mean "at least partially responsive to".
The term "determining" broadly encompasses a wide variety of actions that can include obtaining, calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like, and can also include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like, as well as resolving, selecting, choosing, establishing and the like.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will recognize that they should be understood as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method according to an embodiment of the present disclosure, which includes the following steps.
Step S01: dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information;
In some embodiments, taking a resume document as an example, the reading order of a single-column resume can be restored by the related-art scheme, which rearranges the text according to the relative x or y coordinates of the characters. However, as shown in fig. 2, for a resume with two or even three columns (for example, a left column and a right column), the related-art scheme may produce a recovered resume in which the text lines of the left and right columns are interleaved. The original semantic structure of the resume is lost, so the recovered resume does not follow the intuitive reading order. This hinders reading and can seriously affect the subsequent layout block division and entity relationship extraction, thereby reducing the accuracy of the final structured resume. For the resume document in fig. 2, the text sequence recovered with the related-art typesetting scheme is shown in fig. 3: the information of the different layout blocks is interleaved, which easily causes ambiguity.
In some embodiments, preliminary layout block recognition is first performed on the resume document to identify the layout block semantic information it contains. For example, for the resume document of fig. 2, a conventional document layout block recognition technique may first be applied: a "basic information" layout block is obtained from the recognized "basic_info" semantic information, an "education experience" layout block from the recognized "education" semantic information, a "work experience" layout block from the recognized "career" semantic information, and a "skill" layout block from the recognized "project" semantic information. In this way, the layout blocks of the document are identified first, and a plurality of layout blocks with different semantic information are obtained through this preliminary division.
Step S02: performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to a semantic recognition result;
In some embodiments, each line of text in the document is first extracted through OCR (optical character recognition). The extracted lines are sorted in a preset order, semantic recognition is then performed on each line in turn, and each line is assigned to the corresponding layout block according to its semantic recognition result.
Step S03: sorting the layout blocks, and generating a text sequence according to the sorting result.
In some embodiments, after all text lines have been assigned to their layout blocks, the layout blocks are sorted in a preset order. The text lines in each layout block are then traversed in turn to generate a text sequence that conforms to the reading order.
The document typesetting method provided by the embodiment of the disclosure is characterized in that a plurality of different layout blocks with different semantic information are obtained by dividing the layout blocks of the document; then, performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to a semantic recognition result; and finally, sequencing all the layout blocks, and generating a text sequence according to a sequencing result. The method and the device have the advantages that the primary semantic layout block division is carried out on the document, the text lines belonging to one semantic layout block are put together, then the whole sequencing is carried out on all the layout blocks, and the document typesetting can be restored to a readable sequence.
In some embodiments, dividing the document into layout blocks to obtain a plurality of layout blocks with different semantic information includes:
performing layout block recognition on the document to obtain the semantic information contained in the document;
and dividing the document into different layout blocks according to the semantic information.
In some embodiments, as shown in fig. 2, layout block recognition is performed on the resume document to obtain the semantic information "basic_info", "education", "career", and "project" contained in the document, so that the document is divided into four layout blocks, namely "basic information", "education experience", "work experience", and "skill".
In some embodiments, said semantically identifying each line of text extracted from said document comprises:
sorting the lines of text extracted from the document, and performing semantic recognition on each line of text after sorting.
In some embodiments, all text lines extracted from the document are sorted in a preset order, for example from left to right and from top to bottom, and semantic recognition is then performed on the sorted text lines in turn.
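The pre-sort described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `TextLine` type and `sort_lines` function are hypothetical names, the y axis is assumed to grow downward as in image coordinates, and the choice of the top edge as the primary key with the left edge as the tie-breaker is an assumption, since the text only says "from left to right and from top to bottom".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One extracted line with its bounding coordinates (hypothetical type)."""
    text: str
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float


def sort_lines(lines: List[TextLine]) -> List[TextLine]:
    # Assumption: rows are compared first (smaller y_top is higher on the
    # page), and lines starting at the same height are ordered left to right.
    return sorted(lines, key=lambda ln: (ln.y_top, ln.x_left))


lines = [
    TextLine("right-top", 50, 90, 0, 10),
    TextLine("left-second", 0, 40, 12, 22),
    TextLine("left-top", 0, 40, 0, 10),
]
for ln in sort_lines(lines):
    print(ln.text)
```

With real OCR output the top coordinates of lines in the same row rarely match exactly, so a practical version would probably need to group lines into rows with a tolerance before comparing x coordinates.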
In some embodiments, adding each line of text to the corresponding layout block according to the semantic recognition result includes:
if the semantic recognition result of the current text contains layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information;
and if the semantic recognition result of the current text does not contain the layout block semantic information, adding the current text to the corresponding layout block according to the context information.
In some embodiments, as shown in fig. 6, if layout blocks currently exist and the layout block semantic information (section name) recognized for the current text line is not empty, it is determined whether a layout block matching that semantic information currently exists; if so, the current line is added to that layout block. For example, if the section name of the current line is the same as that of the previous layout block, text of the same semantic layout block is appearing consecutively, and the current line is added to that layout block. If the semantic recognition result of the current line does not contain layout block semantic information, the current line is added to the corresponding layout block according to its context information. For example, all existing layout blocks are traversed using the context information of the current text line, and if the current text line belongs to one of them, it is added to that layout block.
In some embodiments, if the semantic recognition result of the current text includes layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information includes:
if a layout block matching the layout block semantic information currently exists, adding the current text to that layout block;
and if no layout block matching the layout block semantic information currently exists, creating a new layout block, and adding the current text to the newly created layout block.
In some embodiments, if no layout block matches the layout block semantic information of the current text line, a new layout block is created, and the current text line is added to the new layout block.
In some embodiments, if the semantic recognition result of the current text does not include layout block semantic information, adding the current text to a corresponding layout block according to context information includes:
if a layout block matched with the context information of the current text exists at present, adding the current text to the layout block;
and if no layout block matched with the context information of the current text exists at present, newly building a layout block, and adding the current text to the newly built layout block.
In some embodiments, if no layout block matches the context information of the current text line, a new layout block is created, and the current text line is added to the new layout block.
In some embodiments, further comprising:
and if the layout block does not exist at present, creating a new layout block according to the semantic recognition result of the current text, and adding the current text to the new layout block.
In some embodiments, as shown in fig. 6, if no layout block has been created yet, a new layout block is created according to the semantic recognition result of the current line, and the layout block semantic information of the current line is used as the initial semantic information of the newly created layout block.
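The assignment rules above (match by layout block semantic information, otherwise match by context, otherwise create a new layout block) can be sketched as a single function. The names `Block`, `assign_line`, and `matches_context` are hypothetical, and the disclosure does not say how context matching is computed, so the predicate is passed in by the caller; the example uses a deliberately simple stand-in rule ("the line belongs to the most recently created layout block").

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Block:
    """A layout block: its semantic name plus the lines added so far."""
    name: Optional[str]
    lines: List[str] = field(default_factory=list)


def assign_line(
    blocks: List[Block],
    text: str,
    section_name: Optional[str],
    matches_context: Callable[[Block, str], bool],
) -> Block:
    """Add one text line to a layout block, creating the block if needed."""
    if section_name is not None:
        # The line carries layout block semantic information: match by name.
        target = next((b for b in blocks if b.name == section_name), None)
    else:
        # No semantic information: fall back to the context predicate.
        target = next((b for b in blocks if matches_context(b, text)), None)
    if target is None:
        # Covers both "no block exists yet" and "no existing block matches".
        target = Block(name=section_name)
        blocks.append(target)
    target.lines.append(text)
    return target


blocks: List[Block] = []
# Stand-in context rule (an assumption): attach to the newest block.
newest = lambda block, text: block is blocks[-1]
assign_line(blocks, "Education", "education", newest)
assign_line(blocks, "XYZ University", None, newest)
assign_line(blocks, "Work experience", "career", newest)
assign_line(blocks, "BSc in Physics", "education", newest)
for b in blocks:
    print(b.name, b.lines)
```

Because an empty block list simply yields no match, the "no layout block exists yet" case needs no separate branch in this sketch.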
In some embodiments, after adding each line of text to the corresponding layout block, further comprising:
and updating the boundary information of the layout block according to the character coordinate information of the text, wherein the boundary information comprises boundary coordinate values.
In some embodiments, after each line of text is added to the corresponding layout block, the boundary information of the layout block is updated according to the character coordinate information of the text, and the boundary information includes boundary coordinate values. For example, when a layout block is newly created according to the semantic recognition result of a text line, the character coordinate information (x_left, x_right, y_top, y_bottom) of the text is taken as the initial boundary information of the newly created layout block, where x_left represents the left boundary coordinate value of the layout block, x_right represents the right boundary coordinate value, y_top represents the upper boundary coordinate value, and y_bottom represents the lower boundary coordinate value. As another example, after a text line is added to a matching layout block, the boundary coordinate information of the layout block is updated according to the character coordinate information of the text.
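One plausible way to update the boundary is to grow it to the smallest rectangle that covers both the old boundary and the new line. The sketch below assumes exactly that, and also assumes image-style coordinates in which y grows downward (so y_top is not greater than y_bottom); the disclosure states only that the boundary is updated from the character coordinates. `Box` and `update_bounds` are hypothetical names.

```python
from typing import Optional, Tuple

# (x_left, x_right, y_top, y_bottom), in the order used in the description.
Box = Tuple[float, float, float, float]


def update_bounds(bounds: Optional[Box], line_box: Box) -> Box:
    """Return the block boundary after adding a line with box `line_box`."""
    if bounds is None:
        # Newly created block: the first line's box is the initial boundary.
        return line_box
    return (
        min(bounds[0], line_box[0]),  # left edge moves left if needed
        max(bounds[1], line_box[1]),  # right edge moves right if needed
        min(bounds[2], line_box[2]),  # top edge moves up if needed
        max(bounds[3], line_box[3]),  # bottom edge moves down if needed
    )


bounds = update_bounds(None, (10, 50, 5, 15))
bounds = update_bounds(bounds, (8, 40, 20, 30))
print(bounds)  # (8, 50, 5, 30)
```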
In some embodiments, sorting the layout blocks includes:
determining, according to the boundary information of the layout blocks, whether the layout blocks meet a preset merging condition;
if the preset merging condition is met, merging the layout blocks that meet it and then sorting the layout blocks; otherwise, sorting the layout blocks directly.
In some embodiments, after all text lines of the document have been added to their layout blocks, the boundary information of each layout block is used to determine whether any layout blocks have intersecting left and right boundaries. If so, those layout blocks are merged into a new layout block, and all currently existing layout blocks are then sorted as a whole; if not, all currently existing layout blocks are sorted directly. The layout blocks may be sorted from left to right and from top to bottom.
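The merge-then-sort step can be sketched as below. Several points are assumptions rather than statements from the disclosure: "intersecting left and right boundaries" is read as overlapping horizontal extents, merged lines are concatenated in top-to-bottom block order, the final sort uses the left edge first and the top edge second, and merging is done in one greedy pass (a block that bridges two earlier groups does not cause those groups to be re-merged). `Block`, `x_overlap`, and `merge_and_sort` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float
    lines: List[str] = field(default_factory=list)


def x_overlap(a: Block, b: Block) -> bool:
    """True when the horizontal extents of the two blocks overlap."""
    return a.x_left < b.x_right and b.x_left < a.x_right


def merge_and_sort(blocks: List[Block]) -> List[Block]:
    merged: List[Block] = []
    # Visit blocks top to bottom so merged lines keep their vertical order.
    for blk in sorted(blocks, key=lambda b: (b.y_top, b.x_left)):
        hit = next((m for m in merged if x_overlap(m, blk)), None)
        if hit is None:
            # Copy, so the caller's blocks are left untouched.
            merged.append(Block(blk.x_left, blk.x_right, blk.y_top,
                                blk.y_bottom, list(blk.lines)))
        else:
            hit.x_left = min(hit.x_left, blk.x_left)
            hit.x_right = max(hit.x_right, blk.x_right)
            hit.y_top = min(hit.y_top, blk.y_top)
            hit.y_bottom = max(hit.y_bottom, blk.y_bottom)
            hit.lines.extend(blk.lines)
    # Final order: left to right, then top to bottom (an assumption).
    return sorted(merged, key=lambda b: (b.x_left, b.y_top))


result = merge_and_sort([
    Block(0, 40, 0, 10, ["Basic information"]),
    Block(50, 100, 0, 30, ["Work experience"]),
    Block(0, 45, 20, 40, ["Skills"]),
])
for b in result:
    print(b.lines)
```

In this two-column example the two left-hand blocks overlap horizontally and are merged, so the left column is emitted in full before the right column.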
In some embodiments, the generating a text sequence according to the sorting result includes:
and traversing the texts in each layout block in sequence according to the sequencing result to generate a text sequence.
In some embodiments, if the old layout blocks have been merged into new layout blocks, the new layout blocks are sorted and used to generate the text sequence; otherwise, the old layout blocks are used. The text sequence is generated as follows: the sorted layout blocks (new or old) are traversed in order, the text lines in each layout block are traversed in turn, and a text sequence that conforms to the reading order is generated, as shown in fig. 4.
In some embodiments, as shown in fig. 5, an embodiment of the present disclosure provides a document typesetting method, including:
Step S11: dividing the resume into layout blocks;
Step S12: sorting the lines of text from left to right and from top to bottom;
Step S13: adding each line of text to the corresponding layout block;
Step S14: sorting all the layout blocks from left to right and from top to bottom;
Step S15: merging layout blocks, where two layout blocks are merged if their left and right boundaries intersect;
Step S16: sorting the merged layout blocks;
Step S17: generating the typeset text sequence according to the sorting result.
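To make the flow of steps S11 to S17 concrete, the following end-to-end sketch runs them on a small two-column example. It is an illustration under stated assumptions, not the disclosed implementation: the `HEADERS` keyword table stands in for the semantic recognizer, the context rule ("attach to the lowest existing layout block in the same column") is invented for the example, layout blocks are created lazily while lines are assigned, and blocks are visited top to bottom before merging so that merged lines stay in vertical order. All type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    text: str
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float


@dataclass
class Block:
    name: Optional[str]
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float
    lines: List[Line] = field(default_factory=list)

    def add(self, item) -> None:
        """Absorb a Line or a Block and grow the boundary to cover it."""
        self.lines.extend(item.lines if isinstance(item, Block) else [item])
        self.x_left = min(self.x_left, item.x_left)
        self.x_right = max(self.x_right, item.x_right)
        self.y_top = min(self.y_top, item.y_top)
        self.y_bottom = max(self.y_bottom, item.y_bottom)


# Hypothetical keyword table standing in for the semantic recognizer.
HEADERS = {"basic information": "basic_info", "education": "education",
           "work experience": "career", "skills": "project"}


def recognize(text: str) -> Optional[str]:
    return HEADERS.get(text.strip().lower())


def x_overlap(a, b) -> bool:
    return a.x_left < b.x_right and b.x_left < a.x_right


def new_block(name: Optional[str], src) -> Block:
    return Block(name, src.x_left, src.x_right, src.y_top, src.y_bottom)


def typeset(lines: List[Line]) -> List[str]:
    blocks: List[Block] = []
    # S12: pre-sort the lines (rows first, then left to right; an assumption).
    for ln in sorted(lines, key=lambda l: (l.y_top, l.x_left)):
        name = recognize(ln.text)
        if name is not None:
            target = next((b for b in blocks if b.name == name), None)
        else:
            # Context rule (assumption): lowest block in the same column.
            cands = [b for b in blocks if x_overlap(b, ln)]
            target = max(cands, key=lambda b: b.y_bottom, default=None)
        if target is None:
            target = new_block(name, ln)
            blocks.append(target)
        target.add(ln)  # S13, including the boundary update
    # S14 to S16: order the blocks, merge those that overlap in x, re-sort.
    merged: List[Block] = []
    for blk in sorted(blocks, key=lambda b: (b.y_top, b.x_left)):
        hit = next((m for m in merged if x_overlap(m, blk)), None)
        if hit is None:
            hit = new_block(None, blk)
            merged.append(hit)
        hit.add(blk)
    merged.sort(key=lambda b: (b.x_left, b.y_top))
    # S17: traverse blocks, then lines, to build the text sequence.
    return [ln.text for blk in merged for ln in blk.lines]


page = [
    Line("Basic information", 0, 30, 0, 10), Line("Education", 50, 75, 0, 10),
    Line("Jane Doe", 0, 20, 12, 22), Line("XYZ University", 50, 90, 12, 22),
    Line("Skills", 0, 15, 40, 50), Line("Work experience", 50, 85, 40, 50),
    Line("Python", 0, 18, 52, 62), Line("ACME Corp", 50, 80, 52, 62),
]
print(typeset(page))
```

A purely coordinate-based sort of `page` would interleave the two columns row by row; in this sketch the left column is emitted in full, followed by the right column.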
The document typesetting method provided by the embodiment of the disclosure makes effective use of the rich text information and semantic structure information of the document to re-typeset it, so that the final text sequence follows the reading order. This helps ensure the accuracy of the subsequent layout block division and entity relationship extraction, and in turn improves the accuracy of the final document parsing.
The embodiment of the present disclosure further provides a document typesetting apparatus, including:
the first processing module is used for carrying out layout block division on the document to obtain a plurality of layout blocks with different semantic information;
the second processing module is used for performing semantic recognition on each line of text extracted from the document and adding each line of text to a corresponding layout block according to a semantic recognition result;
and the third processing module is used for sequencing all the layout blocks and generating a text sequence according to a sequencing result.
In some embodiments, the first processing module is specifically configured to:
carrying out layout block identification on the document to obtain semantic information contained in the document;
and dividing the document into different layout blocks according to the semantic information.
In some embodiments, the second processing module is specifically configured to:
and sequencing each line of text extracted from the document, and performing semantic recognition on each line of text after sequencing.
In some embodiments, the second processing module is specifically configured to:
if the semantic recognition result of the current text contains layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information;
and if the semantic recognition result of the current text does not contain layout block semantic information, adding the current text to the corresponding layout block according to the context information.
In some embodiments, the second processing module is specifically configured to:
if a layout block matched with the layout block semantic information exists at present, adding the current text to the layout block;
and if no layout block matched with the semantic information of the layout block exists at present, newly building the layout block, and adding the current text to the newly built layout block.
In some embodiments, the second processing module is specifically configured to:
if a layout block matched with the context information of the current text exists at present, adding the current text to the layout block;
and if no layout block matched with the context information of the current text exists at present, newly building a layout block, and adding the current text to the newly built layout block.
In some embodiments, the second processing module is further specifically configured to:
and if the layout block does not exist at present, creating a new layout block according to the semantic recognition result of the current text, and adding the current text to the new layout block.
In some embodiments, the second processing module is further specifically configured to:
and updating the boundary information of the layout block according to the character coordinate information of the text, wherein the boundary information comprises boundary coordinate values.
In some embodiments, the third processing module is specifically configured to:
determining, according to the boundary information of the layout blocks, whether the layout blocks meet a preset merging condition;
if the preset merging condition is met, merging the layout blocks that meet it and then sorting the layout blocks; otherwise, sorting the layout blocks directly.
In some embodiments, the third processing module is further specifically configured to:
and traversing the texts in each layout block in sequence according to the sequencing result to generate a text sequence.
For the embodiments of the apparatus, since they correspond substantially to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described apparatus embodiments are merely illustrative, wherein the modules described as separate modules may or may not be separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The method and apparatus of the present disclosure have been described above based on the embodiments and application examples. In addition, the present disclosure also provides an electronic device and a computer-readable storage medium, which are described below.
Referring now to fig. 7, shown is a schematic block diagram of an electronic device (e.g., a terminal device or server) 800 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in the drawings is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The electronic device 800 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are also stored. The processing means 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, or the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While the figure illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods of the present disclosure as described above.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a document layout method including:
dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information;
performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to a semantic recognition result;
and sorting the layout blocks, and generating a text sequence according to a sorting result.
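The three steps above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the `LayoutBlock` class, the `recognize` heuristic (short upper-case lines are titles), the use of a single y coordinate per line, and the top-to-bottom ordering are all hypothetical stand-ins, since the disclosure does not fix a particular recognition model or ordering rule.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutBlock:
    label: str                                    # semantic label (hypothetical values)
    lines: list[str] = field(default_factory=list)
    top: float = float("inf")                     # smallest y of any line added so far


def recognize(text: str) -> str:
    """Toy stand-in for semantic recognition: short upper-case lines are titles."""
    return "title" if text.isupper() and len(text) < 40 else "body"


def divide_into_blocks(lines: list[tuple[float, str]]) -> dict[str, LayoutBlock]:
    """Step 1: create one empty layout block per semantic label found."""
    labels = {recognize(text) for _, text in lines}
    return {label: LayoutBlock(label) for label in labels}


def layout_document(lines: list[tuple[float, str]]) -> list[str]:
    """lines holds (y coordinate, text) pairs extracted from the document."""
    blocks = divide_into_blocks(lines)
    # Step 2: recognize each line and add it to the block with the matching label.
    for y, text in lines:
        block = blocks[recognize(text)]
        block.lines.append(text)
        block.top = min(block.top, y)
    # Step 3: sort blocks top to bottom, then flatten them into one text sequence.
    ordered = sorted(blocks.values(), key=lambda b: b.top)
    return [text for block in ordered for text in block.lines]
```

A real system would replace `recognize` with a trained layout or text classifier; the sketch only shows how the three steps hand data to one another.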
According to one or more embodiments of the present disclosure, there is provided a method in which dividing the document into layout blocks to obtain a plurality of layout blocks with different semantic information includes:
performing layout block identification on the document to obtain semantic information contained in the document;
and dividing the document into different layout blocks according to the semantic information.
According to one or more embodiments of the present disclosure, there is provided a method in which performing semantic recognition on each line of text extracted from the document includes:
sorting the lines of text extracted from the document, and performing semantic recognition on each line of text after sorting.
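As a rough illustration of sorting the extracted lines before recognition, the sketch below orders lines by their coordinates and then labels them in that order. The `TextLine` type, the choice of top-then-left ordering, and the `recognize` callback are assumptions made for the example; the disclosure does not state which sort key is used.

```python
from typing import Callable, NamedTuple


class TextLine(NamedTuple):
    x: float      # left coordinate of the line (assumed available from extraction)
    y: float      # top coordinate of the line
    text: str


def recognize_in_order(
    lines: list[TextLine], recognize: Callable[[str], str]
) -> list[tuple[str, str]]:
    """Sort lines top to bottom, then left to right, and label each one in turn."""
    ordered = sorted(lines, key=lambda line: (line.y, line.x))
    return [(line.text, recognize(line.text)) for line in ordered]
```

Sorting first means that whatever context the recognizer or the later block assignment relies on (for example, the previous line) follows the visual reading order rather than the extraction order.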
According to one or more embodiments of the present disclosure, there is provided a method in which adding each line of text to a corresponding layout block according to the semantic recognition result includes:
if the semantic recognition result of the current text contains layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information;
and if the semantic recognition result of the current text does not contain layout block semantic information, adding the current text to a corresponding layout block according to context information.
According to one or more embodiments of the present disclosure, there is provided a method in which, if the semantic recognition result of the current text contains layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information includes:
if a layout block matching the layout block semantic information currently exists, adding the current text to that layout block;
and if no layout block matching the layout block semantic information currently exists, creating a new layout block and adding the current text to the newly created layout block.
According to one or more embodiments of the present disclosure, there is provided a method in which, if the semantic recognition result of the current text does not contain layout block semantic information, adding the current text to a corresponding layout block according to context information includes:
if a layout block matching the context information of the current text currently exists, adding the current text to that layout block;
and if no layout block matching the context information of the current text currently exists, creating a new layout block and adding the current text to the newly created layout block.
According to one or more embodiments of the present disclosure, there is provided a method further including:
if no layout block currently exists, creating a new layout block according to the semantic recognition result of the current text, and adding the current text to the newly created layout block.
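The branching described in the preceding embodiments can be sketched as a single assignment function. The representation is hypothetical: layout block semantic information is modelled as an optional string label, context information as the label of the block that received the previous line, and `"unknown"` is a placeholder used when neither is available. None of these choices comes from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LayoutBlock:
    label: str
    lines: list[str] = field(default_factory=list)


def add_line(
    blocks: list[LayoutBlock],
    text: str,
    label: Optional[str],          # layout block semantic information, if recognized
    context_label: Optional[str],  # assumed context: label of the previous line's block
) -> LayoutBlock:
    # Prefer the recognized label; otherwise fall back to the context information.
    key = label if label is not None else context_label
    if key is None:
        key = "unknown"            # placeholder for the example only
    # Add to an existing matching block if there is one.
    for block in blocks:
        if block.label == key:
            block.lines.append(text)
            return block
    # No matching block (including the case of no blocks at all): create one.
    new_block = LayoutBlock(key, [text])
    blocks.append(new_block)
    return new_block
```

For example, calling `add_line(blocks, "Abstract", "title", None)` on an empty list creates a `title` block, and a following line with no recognized label but `context_label="title"` is appended to that same block.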
According to one or more embodiments of the present disclosure, there is provided a method that, after adding each line of text to the corresponding layout block, further includes:
updating boundary information of the layout block according to character coordinate information of the text, wherein the boundary information includes boundary coordinate values.
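One plausible way to maintain the boundary information is to grow an axis-aligned bounding box as characters are added. The sketch below assumes each character comes with a box `(x0, y0, x1, y1)` in a coordinate system where y increases downward; the `Bounds` class and its field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Bounds:
    # Start "inside out" so the first character box always widens the bounds.
    left: float = float("inf")
    top: float = float("inf")
    right: float = float("-inf")
    bottom: float = float("-inf")

    def include(self, char_boxes: Iterable[tuple[float, float, float, float]]) -> None:
        """Expand the block boundary to cover every (x0, y0, x1, y1) character box."""
        for x0, y0, x1, y1 in char_boxes:
            self.left = min(self.left, x0)
            self.top = min(self.top, y0)
            self.right = max(self.right, x1)
            self.bottom = max(self.bottom, y1)
```

Calling `include` once per added line keeps the block's boundary coordinate values current without rescanning earlier lines.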
According to one or more embodiments of the present disclosure, there is provided a method in which sorting the layout blocks includes:
determining, according to the boundary information of the layout blocks, whether the layout blocks meet a preset merging condition;
if the preset merging condition is met, merging the layout blocks that meet the preset merging condition and then sorting the layout blocks; otherwise, sorting the layout blocks directly.
According to one or more embodiments of the present disclosure, there is provided a method in which generating a text sequence according to the sorting result includes:
traversing the text in each layout block in turn according to the sorting result to generate the text sequence.
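The merge-then-sort step might look like the sketch below. The preset merging condition is not specified in this passage, so the one used here is an assumption: two blocks merge when they share a label, overlap horizontally, and are separated vertically by at most `max_gap`. The `Block` type and the top-then-left sort key are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    label: str
    left: float
    top: float
    right: float
    bottom: float
    lines: list[str]


def should_merge(a: Block, b: Block, max_gap: float = 5.0) -> bool:
    """Assumed condition: same label, horizontal overlap, small vertical gap."""
    overlap = min(a.right, b.right) - max(a.left, b.left)
    gap = max(a.top, b.top) - min(a.bottom, b.bottom)
    return a.label == b.label and overlap > 0 and gap <= max_gap


def merge(a: Block, b: Block) -> Block:
    first, second = sorted((a, b), key=lambda blk: blk.top)
    return Block(a.label, min(a.left, b.left), min(a.top, b.top),
                 max(a.right, b.right), max(a.bottom, b.bottom),
                 first.lines + second.lines)


def merge_and_sort(blocks: list[Block], max_gap: float = 5.0) -> list[Block]:
    merged: list[Block] = []
    # Visit blocks top to bottom and fold each one into the first compatible block.
    for block in sorted(blocks, key=lambda b: (b.top, b.left)):
        for i, other in enumerate(merged):
            if should_merge(other, block, max_gap):
                merged[i] = merge(other, block)
                break
        else:
            merged.append(block)
    return sorted(merged, key=lambda b: (b.top, b.left))
```

This single greedy pass does not re-check whether two already merged blocks have become mergeable with each other; a production version might repeat the pass until nothing changes.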
According to one or more embodiments of the present disclosure, there is provided a document layout apparatus including:
a first processing module configured to divide the document into layout blocks to obtain a plurality of layout blocks with different semantic information;
a second processing module configured to perform semantic recognition on each line of text extracted from the document and add each line of text to a corresponding layout block according to a semantic recognition result;
and a third processing module configured to sort the layout blocks and generate a text sequence according to a sorting result.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one memory and at least one processor;
wherein the at least one memory is configured to store program code, and the at least one processor is configured to call the program code stored in the at least one memory to perform the method of any one of the above.
According to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the above-described method.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (13)

1. A document layout method, comprising:
dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information;
performing semantic recognition on each line of text extracted from the document, and adding each line of text to a corresponding layout block according to a semantic recognition result;
and sorting the layout blocks, and generating a text sequence according to a sorting result.
2. The method of claim 1, wherein the dividing a document into layout blocks to obtain a plurality of layout blocks with different semantic information comprises:
performing layout block identification on the document to obtain semantic information contained in the document;
and dividing the document into different layout blocks according to the semantic information.
3. The method of claim 2, wherein the performing semantic recognition on each line of text extracted from the document comprises:
sorting the lines of text extracted from the document, and performing semantic recognition on each line of text after sorting.
4. The method according to claim 3, wherein the adding each line of text to a corresponding layout block according to the semantic recognition result comprises:
if the semantic recognition result of the current text contains layout block semantic information, adding the current text to a corresponding layout block according to the layout block semantic information;
and if the semantic recognition result of the current text does not contain layout block semantic information, adding the current text to the corresponding layout block according to the context information.
5. The method of claim 4, wherein, if the semantic recognition result of the current text contains layout block semantic information, the adding the current text to a corresponding layout block according to the layout block semantic information comprises:
if a layout block matching the layout block semantic information currently exists, adding the current text to that layout block;
and if no layout block matching the layout block semantic information currently exists, creating a new layout block and adding the current text to the newly created layout block.
6. The method of claim 4, wherein, if the semantic recognition result of the current text does not contain layout block semantic information, the adding the current text to a corresponding layout block according to context information comprises:
if a layout block matching the context information of the current text currently exists, adding the current text to that layout block;
and if no layout block matching the context information of the current text currently exists, creating a new layout block and adding the current text to the newly created layout block.
7. The method of claim 1, further comprising:
if no layout block currently exists, creating a new layout block according to the semantic recognition result of the current text, and adding the current text to the newly created layout block.
8. The method of claim 1, further comprising, after adding each line of text to the corresponding layout block:
updating boundary information of the layout block according to character coordinate information of the text, wherein the boundary information comprises boundary coordinate values.
9. The method of claim 8, wherein the sorting the layout blocks comprises:
determining, according to the boundary information of the layout blocks, whether the layout blocks meet a preset merging condition;
if the preset merging condition is met, merging the layout blocks that meet the preset merging condition and then sorting the layout blocks; otherwise, sorting the layout blocks directly.
10. The method of claim 9, wherein the generating a text sequence according to the sorting result comprises:
traversing the text in each layout block in turn according to the sorting result to generate the text sequence.
11. A document layout apparatus, comprising:
a first processing module configured to divide the document into layout blocks to obtain a plurality of layout blocks with different semantic information;
a second processing module configured to perform semantic recognition on each line of text extracted from the document and add each line of text to a corresponding layout block according to a semantic recognition result;
and a third processing module configured to sort the layout blocks and generate a text sequence according to a sorting result.
12. An electronic device, comprising:
at least one memory and at least one processor;
wherein the at least one memory is configured to store program code and the at least one processor is configured to invoke the program code stored in the at least one memory to perform the method of any of claims 1 to 10.
13. A computer-readable storage medium for storing program code, which when executed by a computer device, causes the computer device to perform the method of any one of claims 1 to 10.
CN202211321232.8A 2022-10-26 2022-10-26 Document typesetting method and device, electronic equipment and storage medium Pending CN115618808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211321232.8A CN115618808A (en) 2022-10-26 2022-10-26 Document typesetting method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211321232.8A CN115618808A (en) 2022-10-26 2022-10-26 Document typesetting method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115618808A true CN115618808A (en) 2023-01-17

Family

ID=84864035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211321232.8A Pending CN115618808A (en) 2022-10-26 2022-10-26 Document typesetting method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115618808A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648909A (en) * 2024-01-29 2024-03-05 国网湖北省电力有限公司信息通信公司 Electric power system document data management system and method based on artificial intelligence

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117648909A (en) * 2024-01-29 2024-03-05 国网湖北省电力有限公司信息通信公司 Electric power system document data management system and method based on artificial intelligence
CN117648909B (en) * 2024-01-29 2024-04-12 国网湖北省电力有限公司信息通信公司 Electric power system document data management system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110321958B (en) Training method of neural network model and video similarity determination method
CN110781658B (en) Resume analysis method, resume analysis device, electronic equipment and storage medium
CN113407814B (en) Text searching method and device, readable medium and electronic equipment
CN111857720B (en) User interface state information generation method and device, electronic equipment and medium
CN111460288B (en) Method and device for detecting news event
CN111680491A (en) Document information extraction method and device and electronic equipment
CN115618808A (en) Document typesetting method and device, electronic equipment and storage medium
CN114638218A (en) Symbol processing method, device, electronic equipment and storage medium
CN114995691B (en) Document processing method, device, equipment and medium
CN111382365B (en) Method and device for outputting information
CN113807056B (en) Document name sequence error correction method, device and equipment
CN112115720B (en) Method, device, terminal equipment and medium for determining association relation between entities
CN110659208A (en) Test data set updating method and device
CN111782895B (en) Retrieval processing method and device, readable medium and electronic equipment
CN111737571B (en) Searching method and device and electronic equipment
CN111339776B (en) Resume parsing method and device, electronic equipment and computer-readable storage medium
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN114492413B (en) Text proofreading method and device and electronic equipment
CN116186093B (en) Address information processing method, address information processing device, electronic equipment and computer readable medium
CN116541421B (en) Address query information generation method and device, electronic equipment and computer medium
CN115796159A (en) Method, device, medium and electronic equipment for determining questions
CN117216190A (en) Text relevance determining method, device, equipment and storage medium
CN116663529A (en) Entry generation method and device and electronic equipment
CN116822475A (en) Processing method, device, equipment and medium of form data
CN116204740A (en) Label determining method, information recommending method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination