CN117349472A - Index word extraction method, device, terminal and medium based on XML document - Google Patents

Index word extraction method, device, terminal and medium based on XML document

Info

Publication number
CN117349472A
CN117349472A CN202311384092.3A CN202311384092A CN117349472A CN 117349472 A CN117349472 A CN 117349472A CN 202311384092 A CN202311384092 A CN 202311384092A CN 117349472 A CN117349472 A CN 117349472A
Authority
CN
China
Prior art keywords
tag
index word
reading
xml document
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311384092.3A
Other languages
Chinese (zh)
Other versions
CN117349472B (en)
Inventor
肖辉
唐小兴
廖晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Artron Art Printing Co ltd
Shanghai Artron Art Printing Co ltd
Artron Art Group Co ltd
Original Assignee
Beijing Artron Art Printing Co ltd
Shanghai Artron Art Printing Co ltd
Artron Art Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Artron Art Printing Co ltd, Shanghai Artron Art Printing Co ltd, Artron Art Group Co ltd
Priority to CN202311384092.3A
Publication of CN117349472A
Application granted
Publication of CN117349472B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an index word extraction method, device, terminal and medium based on an XML document. The method comprises the following steps: acquiring at least one preconfigured tag name to be extracted and forming a configuration table; creating a blank index word list; reading tags starting from the root node of the XML document; determining whether the name of the current tag exists in the configuration table, and if so, reading the content of the current tag, and if not, ignoring the current tag and reading the next tag; reading the page on which the current tag is located, calculating the region in which the current tag is located, and generating a page number identifier; and appending the content of the current tag and the corresponding page number identifier to the end of the index word list as one line of text. Because the page on which the current tag is located is read, the region in which it is located is calculated, and a page number identifier is generated, no index entries or reference markers need to be created separately, the extraction efficiency is high, and position information for the page region in which each index word is located can be generated, which facilitates more accurate subsequent retrieval.

Description

Index word extraction method, device, terminal and medium based on XML document
[ technical field ]
The invention relates to the technical field of publication design, in particular to an index word extraction method, device, terminal and medium based on an XML document.
[ background art ]
Books that readers consult and search, such as encyclopedias, reference books and specifications, usually need an index, whether they are typeset manually or automatically. To improve typesetting efficiency, InDesign typesetting software extended with a layout function module (Baike Typesetting, the main service module for encyclopedia typesetting) can be used for automatic typesetting: XML (eXtensible Markup Language) is imported and laid out, which handles the formatting of text, pictures, layout and the like. With the built-in index function of the InDesign typesetting software, however, creating topics (index words) and marking references is cumbersome, and position information for the region of the page in which an index word is located cannot be generated, so index word extraction is inefficient and retrieval is inconvenient.
In view of the foregoing, it is desirable to provide a method, apparatus, terminal and medium for extracting index words based on XML documents to overcome the above-mentioned drawbacks.
[ summary of the invention ]
The invention aims to provide an index word extraction method, device, terminal and medium based on an XML document, so as to solve the problem that index word extraction is inefficient in existing typesetting approaches. The invention can also generate position information for the page region in which each index word is located, which facilitates more accurate retrieval.
In order to achieve the above object, a first aspect of the present invention provides an index word extraction method based on an XML document, including:
step S10: acquiring at least one preconfigured tag name to be extracted, and forming a configuration table;
step S20: creating a blank index word list;
step S30: reading tags starting from the root node of an XML document;
step S40: determining whether the name of the current tag exists in the configuration table; if so, reading the content of the current tag; if not, ignoring the current tag and reading the next tag;
step S50: when the content of the current tag is read, also reading the page on which the current tag is located, calculating the region in which the current tag is located, and generating a page number identifier, wherein the page is divided into a plurality of regions in advance;
step S60: appending the content of the current tag and the corresponding page number identifier to the end of the index word list as one line of text, and then reading the next tag.
In a preferred embodiment, the method further comprises the step of:
determining whether all tags of the XML document have been traversed; if so, saving the index word list to a file; if not, reading the next tag.
In a preferred embodiment, the step S30 includes:
reading the XML document and then reading in its XML structure tree.
In a preferred embodiment, dividing the page into a plurality of regions in advance specifically includes:
dividing the page into a plurality of grid cells by a plurality of crisscrossing separation lines, each grid cell being numbered with one letter in one-to-one correspondence.
In a preferred embodiment, the page number identifier includes the page number of the page on which the tag is located and the letter number of the grid cell in which the tag is located.
The second aspect of the present invention also provides an index word extraction device based on an XML document, including:
a tag configuration module, configured to acquire at least one preconfigured tag name to be extracted and form a configuration table;
a list creation module, configured to create a blank index word list;
a tag reading module, configured to read tags starting from the root node of the XML document;
a tag judgment module, configured to determine whether the name of the current tag exists in the configuration table, read the content of the current tag if so, and ignore the current tag and read the next tag if not;
an identifier generation module, configured to, when the content of the current tag is read, also read the page on which the current tag is located, calculate the region in which the current tag is located, and generate a page number identifier, wherein the page is divided into a plurality of regions in advance;
an index appending module, configured to append the content of the current tag and the corresponding page number identifier to the end of the index word list as one line of text, and then read the next tag.
A third aspect of the present invention provides a terminal comprising a memory, a processor and a computer program stored in the memory, which when executed by the processor, implements the steps of the method for extracting index words based on XML documents as described in any one of the above embodiments.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the XML document-based index word extraction method according to any one of the above embodiments.
A fifth aspect of the invention provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the XML document-based index word extraction method as described in any one of the embodiments above.
According to the index word extraction method, device, terminal and medium based on an XML document provided by the invention, index words are extracted from the tags of an existing XML document. When the content of the current tag is read, the page on which the tag is located is read, the region in which it is located is calculated, and a page number identifier is generated; the content of the current tag and its page number identifier are then appended to the end of the index word list as one line of text. No index entries or reference markers need to be created separately, the extraction efficiency is high, and position information for the page region in which each index word is located can be generated, which facilitates more accurate subsequent retrieval.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an index word extraction method based on XML documents provided by the invention;
FIG. 2 is a schematic diagram of page partitioning in step S50 of the method shown in FIG. 1;
FIG. 3 is a block diagram of an index word extraction device based on an XML document.
[ detailed description ]
In order to make the objects, technical solutions and advantageous technical effects of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and detailed description. It should be understood that the detailed description is intended to illustrate the invention, and not to limit the invention.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Example 1
This embodiment of the invention provides an index word extraction method based on an XML document. Index words are generated from the tags of an existing XML document, so no index entries or reference markers need to be created separately, and position information for the page region in which each index word is located can be generated at the same time.
It should be noted that the method operates on an InDesign document with an XML structure, that is, a document in which every piece of content on a page is associated with an XML tag. Such a document is generally generated automatically by the automatic typesetting system from data exported by a content platform.
As shown in FIG. 1, the index word extraction method based on the XML document includes steps S10-S60.
Step S10: At least one preconfigured tag name to be extracted is acquired to form a configuration table.
First, the tag system structure is determined, covering both the tags whose data needs to be presented and the index/reference tags. Tag content includes the entry name, qualitative description, body text, recommended reading and entry author; the tag structure includes the paragraph style, character style, object style and table style corresponding to each tag. The tag names to be extracted (one or more) are configured in advance.
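Step S10 can be pictured as turning a list of configured tag names into a lookup table. The following is a minimal sketch in Python, not the patented implementation: the function name `build_config_table` and the tag names `entry_name` and `entry_author` are hypothetical stand-ins for the entry name and entry author tags mentioned above, and the use of a frozenset is an assumption.

```python
def build_config_table(tag_names):
    """Form the configuration table from the configured tag names (step S10).

    A frozenset is an assumption made here for fast membership tests in
    step S40; the source only says that a configuration table is formed.
    """
    table = frozenset(name.strip() for name in tag_names if name.strip())
    if not table:
        # The method requires at least one tag name to extract.
        raise ValueError("at least one tag name must be configured")
    return table


# Hypothetical tag names, for illustration only.
CONFIG_TABLE = build_config_table(["entry_name", "entry_author"])
```

Any container that supports a membership test would serve equally well; the set form simply makes the check in step S40 a single lookup.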
Step S20: a blank index word list (list) is created.
Step S30: the tag is read starting from the root node of the XML document.
Specifically, step S30 includes reading the original XML document and then reading the XML structure tree.
Step S40: It is determined whether the name of the current tag exists in the configuration table. If so, the content of the current tag is read; if not, the current tag is ignored and the next tag is read.
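Steps S30 and S40 amount to a document-order walk over the XML tree with a membership test on each tag name. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for the XML structure tree of the typesetting software, which is an assumption; the function name `iter_configured_tags` and the sample tag names are hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_configured_tags(xml_text, config_table):
    """Yield (tag name, text content) for every tag named in config_table."""
    root = ET.fromstring(xml_text)      # step S30: start at the root node
    for element in root.iter():         # document order, root included
        if element.tag not in config_table:
            continue                    # step S40: ignore, read the next tag
        # Step S40: read the content of the current tag, including the
        # text of any nested elements.
        yield element.tag, "".join(element.itertext()).strip()


# Hypothetical sample document, for illustration only.
SAMPLE = (
    "<encyclopedia>"
    "<entry><entry_name>Abacus</entry_name><body>A counting frame.</body></entry>"
    "<entry><entry_name>Zither</entry_name><body>A string instrument.</body></entry>"
    "</encyclopedia>"
)

for name, content in iter_configured_tags(SAMPLE, {"entry_name"}):
    print(name, content)
```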
Step S50: when the content of the current tag is read, simultaneously reading a page where the current tag is located and calculating a region where the current tag is located, and generating a page number identifier; wherein the page is divided into a plurality of areas in advance.
Specifically, dividing the page into a plurality of regions in advance includes: dividing the page into a plurality of grid cells by a plurality of crisscrossing separation lines, each grid cell being numbered with one letter in one-to-one correspondence. The page number identifier comprises the page number of the page on which the tag is located and the letter number of the grid cell in which the tag is located. As shown in FIG. 2, the page may be divided into six grid cells arranged in three columns and two rows, and the cells are then numbered a, b, c, d, e, f in sequence from top to bottom and from left to right.
For example, if the current tag is on page 36 and lies in region d of that page, the page number identifier of the tag is 36d.
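The region calculation in step S50 can be illustrated with a small grid lookup. This is a sketch under stated assumptions rather than the method as implemented: the tag's position is taken to be a single point (x, y) measured from the top-left corner of the page, the page size is 210 by 297 units, and the letters are assigned column by column (down each column, columns from left to right). The text would also admit a row-by-row reading, which the hypothetical `column_major` flag switches to. The function names are illustrative.

```python
from string import ascii_lowercase


def region_letter(x, y, page_width, page_height,
                  cols=3, rows=2, column_major=True):
    """Return the letter of the grid cell that contains the point (x, y)."""
    if not (0 <= x <= page_width and 0 <= y <= page_height):
        raise ValueError("point lies outside the page")
    # Clamp so that points on the right or bottom edge fall in the last cell.
    col = min(int(x * cols / page_width), cols - 1)
    row = min(int(y * rows / page_height), rows - 1)
    # Assumed numbering: down each column first, then left to right.
    index = col * rows + row if column_major else row * cols + col
    return ascii_lowercase[index]


def page_identifier(page_number, x, y, page_width, page_height, **grid):
    """Combine the page number and the cell letter, e.g. 36 and 'd'."""
    return f"{page_number}{region_letter(x, y, page_width, page_height, **grid)}"


# A point in the middle column, lower row, of page 36.
print(page_identifier(36, 100, 250, 210, 297))
```

With the default three-by-two grid, the point above falls in the second column and second row, so the column-by-column numbering gives cell d and the identifier 36d.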
Step S60: The content of the current tag and the corresponding page number identifier are appended to the end of the index word list as one line of text, and the next tag is then read.
Further, the method comprises the step of: determining whether all tags of the XML document have been traversed; if so, saving the index word list to a file; if not, reading the next tag and repeating the above steps until every tag has been examined.
Because the index is extracted directly from the tags, index generation is more efficient; at the same time, a more precise position is obtained for each index word, which is of great value for query and retrieval.
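Steps S20 to S60 and the final save can be put together in one short sketch. It is illustrative only and rests on several assumptions that are not in the source: the layout information that the typesetting software would supply is represented by hypothetical `page`, `x` and `y` attributes on each XML element, the page measures 210 by 297 units with a three-column, two-row grid lettered column by column, and a tab separates the tag content from the page number identifier on each line.

```python
import xml.etree.ElementTree as ET
from string import ascii_lowercase

PAGE_W, PAGE_H = 210.0, 297.0   # assumed page size
COLS, ROWS = 3, 2               # grid of FIG. 2: three columns, two rows


def cell_letter(x, y):
    """Letter of the grid cell containing (x, y), numbered column by column."""
    col = min(int(x * COLS / PAGE_W), COLS - 1)
    row = min(int(y * ROWS / PAGE_H), ROWS - 1)
    return ascii_lowercase[col * ROWS + row]


def extract_index(xml_text, config_table):
    index_list = []                                    # step S20
    root = ET.fromstring(xml_text)                     # step S30
    for element in root.iter():
        if element.tag not in config_table:            # step S40
            continue
        content = "".join(element.itertext()).strip()
        # Step S50: the page/x/y attributes are hypothetical placeholders
        # for the position reported by the typesetting software.
        letter = cell_letter(float(element.get("x")), float(element.get("y")))
        identifier = f"{element.get('page')}{letter}"
        index_list.append(f"{content}\t{identifier}")  # step S60
    return index_list


def save_index(index_list, path):
    """Store the index word list in a file once traversal is complete."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in index_list)


# Hypothetical sample document, for illustration only.
SAMPLE = (
    '<book>'
    '<entry_name page="36" x="100" y="250">Abacus</entry_name>'
    '<body page="36" x="100" y="260">A counting frame.</body>'
    '<entry_name page="37" x="10" y="10">Zither</entry_name>'
    '</book>'
)

for line in extract_index(SAMPLE, {"entry_name"}):
    print(line)
```

Because lines are appended in traversal order, the resulting list follows the order of the document rather than alphabetical order; any sorting would be a separate step that the source does not describe.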
Example 2
The invention also provides an index word extraction device 100 based on an XML document, which generates index words from the tags of an existing XML document, requires no separately created index entries or reference markers, and can generate position information for the page region in which each index word is located. It should be noted that the implementation principle and specific implementation of the index word extraction device 100 are consistent with those of the index word extraction method described above, so they are not repeated here.
As shown in fig. 3, the index word extraction apparatus 100 based on an XML document includes:
the tag configuration module 10 is configured to obtain at least one pre-configured tag name to be extracted, and form a configuration table;
a list creation module 20 for creating a blank index word list (list);
a tag reading module 30 for reading tags from a root node of an XML document;
the tag judgment module 40 is configured to determine whether the name of the current tag exists in the configuration table, read the content of the current tag if so, and ignore the current tag and read the next tag if not;
the identifier generating module 50 is configured to, when reading the content of the current tag, read the page where the current tag is located and calculate the area where the current tag is located, and generate a page number identifier; wherein the page is divided into a plurality of areas in advance;
the index appending module 60 is configured to append the current tag content and the corresponding page number identifier to the end of the index word list as a line of text, and then read the next tag.
Example 3
The present invention provides a terminal comprising a memory, a processor and a computer program stored in the memory, which when executed by the processor, implements the steps of the method for extracting index words based on XML documents according to any one of the above embodiments.
Example 4
The present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the XML document-based index word extraction method according to any one of the above embodiments.
Example 5
The present invention provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the method for extracting index words based on an XML document as described in any one of the above embodiments.
In summary, the index word extraction method, device, terminal and medium based on an XML document extract index words from the tags of an existing XML document. When the content of the current tag is read, the page on which the tag is located is read, the region in which it is located is calculated, and a page number identifier is generated; the content of the current tag and the corresponding page number identifier are then appended to the end of the index word list as one line of text. No index entries or reference markers need to be created separately, the extraction efficiency is high, and position information for the page region in which each index word is located can be generated, which facilitates more accurate subsequent retrieval.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the system is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed system or apparatus/terminal device and method may be implemented in other manners. For example, the system or apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, systems or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The present invention is not limited to the details and embodiments described above. Additional advantages and modifications can readily be made by those skilled in the art; therefore, without departing from the spirit and scope of the general concept defined by the claims and their equivalents, the invention is not limited to the specific details, representative apparatus and illustrative examples shown and described herein.

Claims (8)

1. An index word extraction method based on an XML document, characterized by comprising the following steps:
step S10: acquiring at least one preconfigured tag name to be extracted, and forming a configuration table;
step S20: creating a blank index word list;
step S30: reading tags starting from the root node of an XML document;
step S40: determining whether the name of the current tag exists in the configuration table; if so, reading the content of the current tag; if not, ignoring the current tag and reading the next tag;
step S50: when the content of the current tag is read, also reading the page on which the current tag is located, calculating the region in which the current tag is located, and generating a page number identifier, wherein the page is divided into a plurality of regions in advance;
step S60: appending the content of the current tag and the corresponding page number identifier to the end of the index word list as one line of text, and then reading the next tag.
2. The index word extraction method based on an XML document according to claim 1, further comprising the step of:
determining whether all tags of the XML document have been traversed; if so, saving the index word list to a file; if not, reading the next tag.
3. The index word extraction method based on an XML document according to claim 1, wherein said step S30 includes:
reading the XML document and then reading in its XML structure tree.
4. The index word extraction method based on an XML document according to claim 1, wherein dividing the page into a plurality of regions in advance specifically comprises:
dividing the page into a plurality of grid cells by a plurality of crisscrossing separation lines, each grid cell being numbered with one letter in one-to-one correspondence.
5. The index word extraction method based on an XML document according to claim 4, wherein the page number identifier includes the page number of the page on which the tag is located and the letter number of the grid cell in which the tag is located.
6. An index word extraction device based on an XML document, comprising:
a tag configuration module, configured to acquire at least one preconfigured tag name to be extracted and form a configuration table;
a list creation module, configured to create a blank index word list;
a tag reading module, configured to read tags starting from the root node of the XML document;
a tag judgment module, configured to determine whether the name of the current tag exists in the configuration table, read the content of the current tag if so, and ignore the current tag and read the next tag if not;
an identifier generation module, configured to, when the content of the current tag is read, also read the page on which the current tag is located, calculate the region in which the current tag is located, and generate a page number identifier, wherein the page is divided into a plurality of regions in advance;
an index appending module, configured to append the content of the current tag and the corresponding page number identifier to the end of the index word list as one line of text, and then read the next tag.
7. A terminal comprising a memory, a processor and a computer program stored in the memory, which when executed by the processor, performs the steps of the XML document-based index word extraction method of any one of claims 1 to 5.
8. A computer readable storage medium storing a computer program which when executed by a processor performs the steps of the XML document-based index word extraction method of any one of claims 1 to 5.
CN202311384092.3A 2023-10-24 2023-10-24 Index word extraction method, device, terminal and medium based on XML document Active CN117349472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311384092.3A CN117349472B (en) 2023-10-24 2023-10-24 Index word extraction method, device, terminal and medium based on XML document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311384092.3A CN117349472B (en) 2023-10-24 2023-10-24 Index word extraction method, device, terminal and medium based on XML document

Publications (2)

Publication Number Publication Date
CN117349472A (en) 2024-01-05
CN117349472B (en) 2024-05-28

Family

ID=89367637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311384092.3A Active CN117349472B (en) 2023-10-24 2023-10-24 Index word extraction method, device, terminal and medium based on XML document

Country Status (1)

Country Link
CN (1) CN117349472B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101136033A (en) * 2006-08-28 2008-03-05 株式会社东芝 Structured document management system and method of managing indexes in the same system
CN111241096A (en) * 2020-01-07 2020-06-05 中孚安全技术有限公司 Text extraction method, system, terminal and storage medium for EXCEL document
US20230005283A1 (en) * 2021-06-30 2023-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Information extraction method and apparatus, electronic device and readable storage medium
CN115599885A (en) * 2022-10-19 2023-01-13 中国建设银行股份有限公司(Cn) Document full-text retrieval method and device, computer equipment, storage medium and product

Also Published As

Publication number Publication date
CN117349472B (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant