CN116702702B - Automatic typesetting method and system based on XML - Google Patents

Automatic typesetting method and system based on XML

Publication number
CN116702702B
Authority
CN
China
Prior art keywords
label
paragraph
tag
matching degree
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310397252.1A
Other languages
Chinese (zh)
Other versions
CN116702702A (en)
Inventor
肖辉
万捷
彭干
程成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Artron Art Group Co ltd
Beijing Artron Art Printing Co ltd
Original Assignee
Artron Art Group Co ltd
Beijing Artron Art Printing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Artron Art Group Co ltd, Beijing Artron Art Printing Co ltd
Priority to CN202310397252.1A
Publication of CN116702702A
Application granted
Publication of CN116702702B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention relates to the technical field of automatic typesetting, and in particular to an automatic typesetting method and system based on XML. The system comprises: an importing module for importing XML format data; an analysis module for analyzing the imported XML format data, comprising a classifying unit for classifying the imported XML format data by data type, a recognition unit for recognizing the labels of the classified text data, and a checking unit for checking the secondary judgment results of the labels; a reorganization module for structurally reorganizing the verified labels; a modeling module for creating a label style template; a typesetting module for importing the reorganized data into the label style template for typesetting; an adjusting module for adjusting the layout of the typeset data; and an export module for exporting the file after layout adjustment. The invention effectively improves typesetting efficiency.

Description

Automatic typesetting method and system based on XML
Technical Field
The invention relates to the technical field of automatic typesetting, in particular to an automatic typesetting method and system based on XML.
Background
Online compilation platforms and content systems have a large number of printing and publishing demands. In the traditional approach, the platform exports the relevant data, editors organize it, and it is handed to typesetting staff who typeset it and output the print files. This process has many intermediate links, is time-consuming and labor-intensive, is prone to mistakes, and is inefficient.
Chinese patent publication No. CN110032720A discloses a visual report typesetting and automatic generating method and system based on XML, comprising: designing an XML report template format, where the XML report template format maps directly to the report batch production program; automatically generating an XML report template in a visual mode, where the visual mode is realized as an online page application; automatically extracting a mappable report content template file from the XML report template; automatically backfilling the XML report template after the content is replaced; and generating reports based on the XML report templates. As can be seen, that scheme does not accurately analyze the XML data, and suffers from low typesetting precision and low typesetting efficiency.
Disclosure of Invention
Therefore, the invention provides an automatic typesetting method and system based on XML, which are used for solving the problems of inaccurate typesetting data analysis, low typesetting precision and low typesetting efficiency in the prior art.
To achieve the above object, in one aspect, the present invention provides an automatic typesetting system based on XML, including:
the importing module is used for importing XML format data;
the analysis module is used for analyzing the imported XML format data, is connected with the importing module, and comprises a classifying unit, a recognizing unit and a checking unit, wherein: the classifying unit is used for classifying the imported XML format data into text data and picture data; the recognizing unit is connected with the classifying unit and is used for carrying out tag recognition on the classified text data, in which it matches each tag keyword against the content of each paragraph of the text data and calculates the tag matching degree P of each paragraph, then adjusts the tag matching degree P according to whether the same keyword appears in the paragraph, then corrects the adjusted tag matching degree P' according to the number of same keywords appearing in the paragraph, then carries out a primary judgment on the tag of the paragraph according to the corrected tag matching degree P'', and carries out a secondary judgment, according to the paragraph word number, on each paragraph whose tag was successfully matched in the primary judgment; and the checking unit is used for checking the secondary judgment result of the tag according to the number of tags corresponding to the same paragraph;
the reorganization module is used for carrying out structural reorganization on each label after verification and is connected with the analysis module;
the modeling module is used for creating a label style template and is connected with the reorganization module;
the typesetting module is used for importing the data after the label structure is recombined into a label style template for typesetting and is connected with the recombination module;
the adjustment module is used for carrying out layout adjustment on typeset data, is connected with the typesetting module, and is also used for adjusting the dynamic header format when carrying out adjustment so that the header formats of all pages after adjustment are the same, and creating an index label and a reference label;
and the export module is used for exporting the file after layout adjustment and is connected with the adjustment module.
Further, when calculating the tag matching degree P of each paragraph, the identifying unit sets P = (P1 + P2 + … + Pn)/n, where n is the number of similar keywords in the paragraph, n ≥ 1; Pi is the matching degree of the i-th similar keyword in the paragraph, Pi = L/L0, i = 1, 2, …, n; L is the number of words of the similar keyword, L ≥ 2; and L0 is the number of words of the tag keyword.
Further, when adjusting the tag matching degree P, the identification unit adjusts it according to whether the same keyword appears in the paragraph, wherein,
when the same keyword appears in the paragraph, the identification unit selects an adjustment coefficient t to adjust the tag matching degree P so as to increase the tag matching degree, where 1 < t < 1.2; the adjusted tag matching degree is P', and P' = P × t is set;
when the same keyword does not appear in the paragraph, the recognition unit does not make an adjustment.
Further, when the identification unit corrects the adjusted tag matching degree P', the identification unit compares the number S of same keywords appearing in the paragraph with a preset number S0 of same keywords, and corrects the adjusted tag matching degree P' according to the comparison result, wherein,
when 1 < S ≤ S0, the identification unit selects a first correction coefficient g1 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, where 1 < g1 < 1.1;
when S > S0, the identification unit selects a second correction coefficient g2 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, and sets g2 = g1 + g1 × (S - S0)/S;
when the i-th correction coefficient gi is selected to correct the adjusted tag matching degree P', i = 1, 2, the corrected tag matching degree is P'', and P'' = P' × gi is set.
Further, when the identification unit judges the label of the paragraph according to the corrected label matching degree P'', the corrected label matching degree P'' is compared with the preset label matching degree P0, and the label of the paragraph is judged for the first time according to the comparison result, wherein,
when P'' ≥ P0, the identification unit judges that the label is successfully matched, and takes the successfully matched label as the label of the paragraph;
when P'' < P0, the identification unit judges that the label matching fails.
Further, when performing the label secondary judgment, the identification unit compares the word number Z of the paragraph whose label was successfully matched with each preset label paragraph word number, and performs the label secondary judgment on the paragraph whose label was successfully matched in the primary judgment according to the comparison result, wherein,
when Z < Z1 or Z > Z2, the identification unit judges that the successfully matched label cannot be used as the label of the paragraph, and carries out the label primary judgment again on the paragraph;
when Z1 ≤ Z ≤ Z2, the identification unit judges that the successfully matched label is used as the label of the paragraph;
wherein Z1 is the first preset label paragraph word number, Z2 is the second preset label paragraph word number, and Z1 < Z2.
Further, when the verification unit verifies the label secondary judgment result, the verification unit verifies the label secondary judgment result of the paragraph according to the number of labels corresponding to the same paragraph, wherein,
when a plurality of labels exist in the same paragraph, the verification unit judges that verification fails, sorts the labels of the paragraph according to the matching degree from large to small, and takes the label with the largest matching degree as the label of the paragraph;
when a single label exists in the same paragraph, the verification unit judges that verification is successful.
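The verification rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `verify` and the dictionary layout (label name mapped to its matching degree) are assumptions made for the example.

```python
def verify(label_degrees):
    """Check the labels assigned to one paragraph.

    label_degrees: dict mapping label name -> matching degree (hypothetical layout).
    Returns (passed, label): passed is True only when exactly one label exists;
    otherwise the label with the largest matching degree is kept.
    """
    if not label_degrees:
        raise ValueError("paragraph has no candidate label")
    if len(label_degrees) == 1:
        # Single label: verification succeeds and the label is kept as-is.
        return True, next(iter(label_degrees))
    # Several labels: verification fails; sort by matching degree, largest first.
    ranked = sorted(label_degrees.items(), key=lambda kv: kv[1], reverse=True)
    return False, ranked[0][0]
```

In this sketch a failed verification still yields a usable label, which mirrors the text: the highest-ranked label becomes the label of the paragraph.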
Further, when the reorganization module reorganizes the structure of each verified label, the label name which is verified successfully is matched with the label name in the preset label structure, and the label is reorganized according to the matching result,
when the tag name successfully checked is successfully matched with the tag name in the preset tag structure, the reorganization module reorganizes the tag structure according to the preset tag structure;
when the label name successfully checked is failed to be matched with the label name in the preset label structure, the reorganization module carries out label judgment again on the paragraph corresponding to the label which is failed to be matched, and when the label judgment is carried out again on the paragraph, the selected label is not used any more until the label name of the paragraph is successfully matched with the label name in the preset label structure.
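The re-judgment loop described above can be sketched as follows. It assumes that the candidate labels of a paragraph are already ranked by matching degree, and that exhausting all candidates returns `None`; the source only says the process repeats until a match succeeds, so the `None` case and the name `resolve_label` are assumptions.

```python
def resolve_label(ranked_candidates, preset_names):
    """Pick the first candidate label that exists in the preset label structure.

    ranked_candidates: candidate labels for one paragraph, best match first.
    preset_names: label names allowed by the preset label structure.
    A label that has already been tried is never used again.
    """
    tried = set()
    for label in ranked_candidates:
        if label in tried:
            continue  # already selected once; do not reuse it
        tried.add(label)
        if label in preset_names:
            return label
    return None  # assumption: no candidate fits the preset structure
```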
Further, the label style template comprises a paragraph style, a character style, an object style and a table style corresponding to the label.
On the other hand, the invention also provides an automatic typesetting method based on XML, which comprises the following steps:
Step S1, importing XML format data to be typeset through the importing module;
Step S2, analyzing the imported XML format data through the analysis module to identify the tags of the XML format data;
Step S3, carrying out structural reorganization on each identified tag through the reorganization module;
Step S4, creating a label style template through the modeling module;
Step S5, importing the data after the tag structure reorganization into the label style template through the typesetting module for typesetting;
Step S6, performing layout adjustment on the typeset data through the adjustment module;
Step S7, exporting the file after layout adjustment through the export module.
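Steps S1 and S2 start by loading the XML data and reaching its text content. The following is a minimal sketch of that entry point only, assuming Python's standard `xml.etree.ElementTree` parser; the element names `doc` and `p` and both function names are hypothetical, since the source does not specify a schema or a parser.

```python
import xml.etree.ElementTree as ET

def import_xml(xml_text):
    """Step S1 (sketch): parse XML format data from a string."""
    return ET.fromstring(xml_text)

def collect_text(root):
    """Part of step S2 (sketch): gather the non-empty text of every element."""
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]

# Hypothetical input; the patent does not define the XML structure.
root = import_xml("<doc><p>作者张三</p><p>摘要内容</p></doc>")
texts = collect_text(root)
```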
Compared with the prior art, the system has the beneficial effects that the system is applied to automatic typesetting, the analysis module analyzes the imported XML format data to identify the tags of the XML format data, so that the accuracy of data analysis is effectively ensured, the typesetting efficiency is improved, the structure reorganization of the verified tags is performed by the reorganization module, the accuracy of the structure reorganization of the tags is effectively ensured, and the typesetting efficiency is improved.
In particular, in the embodiment, the matching degree of each similar keyword in a paragraph is calculated as the ratio of the number of words of the similar keyword to the number of words of the tag keyword, and these matching degrees are averaged over the number of similar keywords in the paragraph to obtain the tag matching degree P of each paragraph, so that the accuracy of data analysis is effectively ensured and the typesetting efficiency is improved.
Especially, when the identification unit adjusts the tag matching degree P, the identification unit adjusts the tag matching degree P according to whether the same keyword appears in the paragraph, if the same keyword appears in the paragraph, the identification unit selects the adjustment coefficient t to adjust the tag matching degree P so as to increase the tag matching degree, and if the same keyword does not appear in the paragraph, the identification unit does not adjust, thereby effectively ensuring the accuracy of data analysis and further improving typesetting efficiency.
In particular, when the identification unit corrects the adjusted tag matching degree P', the identification unit compares the number S of same keywords appearing in the paragraph with the preset number S0 of same keywords. If the number of same keywords is greater than 1 and less than or equal to the preset number, the first correction coefficient g1 is selected to correct the adjusted tag matching degree P' so as to increase the tag matching degree; if the number of same keywords is greater than the preset number, the second correction coefficient g2 is selected to correct the adjusted tag matching degree P' so as to further increase the tag matching degree. This effectively ensures the accuracy of data analysis and improves the typesetting efficiency.
Particularly, when the identification unit in this embodiment determines the label of the paragraph according to the corrected label matching degree P'', the corrected label matching degree P'' is compared with the preset label matching degree P0. If P'' is greater than or equal to P0, the identification unit determines that the label matching is successful and uses the successfully matched label as the label of the paragraph; if P'' is less than P0, the identification unit determines that the label matching has failed. This effectively ensures the accuracy of data analysis and improves typesetting efficiency.
Especially, in this embodiment, different labels are provided with different preset label paragraph word numbers. When the identification unit performs the label secondary judgment, it compares the word number Z of the paragraph whose label was successfully matched with the preset label paragraph word numbers. If Z is smaller than the first preset label paragraph word number Z1 or larger than the second preset label paragraph word number Z2, where Z1 is smaller than Z2, the identification unit judges that the successfully matched label cannot be used as the label of the paragraph and performs the label primary judgment again on the paragraph; if Z lies between Z1 and Z2, the identification unit judges that the successfully matched label is used as the label of the paragraph. This effectively ensures the accuracy of data analysis and improves typesetting efficiency.
In particular, when the verification unit verifies the label judgment result, the verification unit verifies the label judgment result according to the number of labels corresponding to the same paragraph, if a plurality of labels exist in the same paragraph, the verification unit performs matching degree sequencing on the labels, the label with the highest matching degree is used as the label of the paragraph, and if a single label exists in the same paragraph, the verification is successful, so that the accuracy of data analysis is effectively ensured, and the typesetting efficiency is improved.
Especially, when the reorganization module performs structural reorganization on each verified label, the successfully verified label name is matched against the label names in the preset label structure. If the match succeeds, the reorganization module reorganizes the label according to the preset label structure. If the match fails, the reorganization module performs label judgment again on the paragraph corresponding to the label that failed to match, and a label that has already been selected is not used again during this re-judgment, until the label name of the paragraph is successfully matched with a label name in the preset label structure. This effectively ensures the precision of the label structure reorganization and improves the typesetting efficiency.
Drawings
Fig. 1 is a schematic diagram of an automatic typesetting system based on XML in an embodiment of the present invention;
Fig. 2 is a flow chart of an automatic typesetting method based on XML in an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Referring to fig. 1, an automatic typesetting system based on XML according to an embodiment of the present invention includes:
the importing module is used for importing XML format data;
the analysis module is used for analyzing the imported XML format data, is connected with the importing module, and comprises a classifying unit, a recognizing unit and a checking unit, wherein: the classifying unit is used for classifying the imported XML format data into text data and picture data; the recognizing unit is connected with the classifying unit and is used for carrying out tag recognition on the classified text data, in which it matches each tag keyword against the content of each paragraph of the text data and calculates the tag matching degree P of each paragraph, then adjusts the tag matching degree P according to whether the same keyword appears in the paragraph, then corrects the adjusted tag matching degree P' according to the number of same keywords appearing in the paragraph, then carries out a primary judgment on the tag of the paragraph according to the corrected tag matching degree P'', and carries out a secondary judgment, according to the paragraph word number, on each paragraph whose tag was successfully matched in the primary judgment; and the checking unit is used for checking the secondary judgment result of the tag according to the number of tags corresponding to the same paragraph;
the reorganization module is used for carrying out structural reorganization on each label after verification and is connected with the analysis module;
the modeling module is used for creating a label style template and is connected with the reorganization module, wherein the label style template comprises paragraph styles, character styles, object styles, table styles and the like corresponding to the labels;
the typesetting module is used for importing the data after the label structure is recombined into a label style template for typesetting and is connected with the recombination module;
the adjustment module is used for carrying out layout adjustment on typeset data, is connected with the typesetting module, and is also used for adjusting the dynamic header format when carrying out adjustment so that the header formats of all pages after adjustment are the same, and creating an index label and a reference label;
and the export module is used for exporting the file after layout adjustment and is connected with the adjustment module.
Specifically, the system is applied to automatic typesetting. The analysis module analyzes the imported XML format data and identifies its tags, which effectively ensures the accuracy of data analysis and improves typesetting efficiency; the reorganization module structurally reorganizes the verified tags, which effectively ensures the accuracy of the tag structure reorganization and improves typesetting efficiency. The Chinese-language labels in this embodiment include an author label, a text title label, an abstract label, a reference label, and the like. When data types are divided, data in JPG format is classified as picture data and all other data as text data. The label keywords are preset and are the key characters of the label text; for example, the label keyword of the author label is "author". Paragraphs are divided by the identification unit according to punctuation marks during identification: the text content between two periods in the text data is taken as one paragraph. A same keyword is a keyword whose consecutive words are identical to those of the label keyword, the number of consecutive words being greater than or equal to 2. When the typesetting module in this embodiment performs typesetting, the picture data and the text data whose label structure has been reorganized are typeset according to preset application rules and a formula processing mode, and the result is generated and stored. The preset application rules are formatting rules such as alignment rules, for example small-four (小四) Song typeface with left alignment for titles and for paragraphs. The index label and the reference label allow a label to be looked up through the index and referenced.
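The data-type division and the paragraph division described above can be sketched as follows. This is a minimal illustration under stated assumptions: JPG data is detected by file extension, both the Chinese full stop and the ASCII period count as paragraph delimiters, and the function names are hypothetical.

```python
import re

def classify(names):
    """Split imported data references into picture data (JPG) and text data."""
    pictures, texts = [], []
    for name in names:
        # Assumption: "data containing JPG format" is recognized by its extension.
        if name.lower().endswith((".jpg", ".jpeg")):
            pictures.append(name)
        else:
            texts.append(name)
    return pictures, texts

def split_paragraphs(text):
    """Treat the text content between two periods as one paragraph."""
    # Assumption: "。" and "." are both delimiters; decimals would need extra care.
    parts = re.split(r"[。.]", text)
    return [p.strip() for p in parts if p.strip()]
```

Note that a "paragraph" in this sense is closer to a sentence, which is why the later word-number check on paragraphs works with short spans of text.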
Specifically, when calculating the tag matching degree P of each paragraph, the identifying unit sets P = (P1 + P2 + … + Pn)/n, where n is the number of similar keywords in the paragraph, n ≥ 1; Pi is the matching degree of the i-th similar keyword in the paragraph, Pi = L/L0, i = 1, 2, …, n; L is the number of words of the similar keyword, L ≥ 2; and L0 is the number of words of the tag keyword.
Specifically, in this embodiment, the matching degree of each similar keyword in a paragraph is calculated as the ratio of the number of words of the similar keyword to the number of words of the tag keyword, and these matching degrees are averaged over the number of similar keywords in the paragraph to obtain the tag matching degree P of each paragraph, thereby effectively ensuring the accuracy of data analysis and improving the typesetting efficiency. The similar keywords described in this embodiment are keywords that are identical or similar to the tag keyword and have 2 or more consecutive words; for example, "title", "text mark", "text question", and "text title" are similar keywords of the tag keyword "text title".
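The calculation of P can be sketched as follows. This is a minimal illustration, assuming that the similar keywords found in a paragraph are already available as a list and that "number of words" means number of characters; how similar keywords are detected is not specified in the source. The Chinese strings are illustrative back-translations, and returning 0 when nothing qualifies is an assumption (the source requires n ≥ 1).

```python
def match_degree(similar_keywords, tag_keyword):
    """Tag matching degree P = (P1 + ... + Pn) / n with Pi = L / L0."""
    l0 = len(tag_keyword)  # L0: number of words of the tag keyword
    # Only keywords with at least 2 consecutive words qualify (L >= 2).
    valid = [k for k in similar_keywords if len(k) >= 2]
    if not valid:
        return 0.0  # assumption: no similar keyword means no match
    return sum(len(k) / l0 for k in valid) / len(valid)

# Tag keyword of 4 characters; similar keywords of 2 and 3 characters.
p = match_degree(["标题", "文章标"], "文章标题")  # (2/4 + 3/4) / 2
```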
Specifically, when the identification unit adjusts the tag matching degree P, the identification unit adjusts it according to whether the same keyword appears in the paragraph, wherein,
when the same keyword appears in the paragraph, the identification unit selects an adjustment coefficient t to adjust the tag matching degree P so as to increase the tag matching degree, where 1 < t < 1.2; the adjusted tag matching degree is P', and P' = P × t is set.
When the same keyword does not appear in the paragraph, the recognition unit does not make an adjustment.
Specifically, when the identification unit adjusts the tag matching degree P, the identification unit adjusts the tag matching degree P according to whether the same keyword appears in the paragraph, if the same keyword appears in the paragraph, the identification unit selects the adjustment coefficient t to adjust the tag matching degree P so as to increase the tag matching degree, and if the same keyword does not appear in the paragraph, the identification unit does not adjust, thereby effectively ensuring the accuracy of data analysis and further improving typesetting efficiency.
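The adjustment step can be sketched as follows. The default t = 1.1 is an assumed value chosen inside the stated range 1 < t < 1.2; the source does not say how t is selected, and the function name is hypothetical.

```python
def adjust(p, has_same_keyword, t=1.1):
    """Return the adjusted tag matching degree P' = P * t, or P if unchanged."""
    if not 1 < t < 1.2:
        raise ValueError("adjustment coefficient t must satisfy 1 < t < 1.2")
    # The matching degree is raised only when the same keyword appears.
    return p * t if has_same_keyword else p
```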
Specifically, when the identification unit corrects the adjusted tag matching degree P', the identification unit compares the number S of same keywords appearing in the paragraph with a preset number S0 of same keywords, and corrects the adjusted tag matching degree P' according to the comparison result, wherein,
when 1 < S ≤ S0, the identification unit selects a first correction coefficient g1 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, where 1 < g1 < 1.1;
when S > S0, the identification unit selects a second correction coefficient g2 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, and sets g2 = g1 + g1 × (S - S0)/S;
when the i-th correction coefficient gi is selected to correct the adjusted tag matching degree P', i = 1, 2, the corrected tag matching degree is P'', and P'' = P' × gi is set.
Specifically, when the identification unit corrects the adjusted tag matching degree P', the identification unit compares the number S of same keywords appearing in the paragraph with the preset number S0 of same keywords. If the number of same keywords is greater than 1 and less than or equal to the preset number, the first correction coefficient g1 is selected to correct the adjusted tag matching degree P' so as to increase the tag matching degree; if the number of same keywords is greater than the preset number, the second correction coefficient g2 is selected to correct the adjusted tag matching degree P' so as to further increase the tag matching degree. This effectively ensures the accuracy of data analysis and improves the typesetting efficiency.
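The choice between g1 and g2 can be sketched as follows. The default g1 = 1.05 is an assumed value inside the stated range 1 < g1 < 1.1, and leaving P' unchanged when S ≤ 1 is an assumption, since the source defines coefficients only for S > 1. Function names are hypothetical.

```python
def correction_coefficient(s, s0, g1=1.05):
    """Select the correction coefficient from the same-keyword count S."""
    if not 1 < g1 < 1.1:
        raise ValueError("g1 must satisfy 1 < g1 < 1.1")
    if s <= 1:
        return 1.0  # assumption: no correction is defined for S <= 1
    if s <= s0:
        return g1  # first correction coefficient, for 1 < S <= S0
    return g1 + g1 * (s - s0) / s  # second coefficient g2, for S > S0

def correct(p_adjusted, s, s0, g1=1.05):
    """Return the corrected tag matching degree P'' = P' * gi."""
    return p_adjusted * correction_coefficient(s, s0, g1)
```

Because (S - S0)/S stays below 1, g2 in this sketch is always larger than g1 but never reaches 2 × g1, so the boost grows with S while remaining bounded.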
Specifically, when the identification unit judges the label of the paragraph according to the corrected label matching degree P'', the corrected label matching degree P'' is compared with the preset label matching degree P0, and the label of the paragraph is judged for the first time according to the comparison result, wherein,
when P'' ≥ P0, the identification unit judges that the label is successfully matched, and takes the successfully matched label as the label of the paragraph;
when P'' < P0, the identification unit judges that the label matching fails.
Specifically, when the identification unit determines the label of the paragraph according to the corrected label matching degree P'', the identification unit compares the corrected label matching degree P'' with the preset label matching degree P0. If P'' is greater than or equal to P0, the identification unit determines that the label matching is successful and uses the successfully matched label as the label of the paragraph; if P'' is less than P0, the identification unit determines that the label matching has failed. This effectively ensures the accuracy of data analysis and improves typesetting efficiency.
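The primary judgment can be sketched as a threshold filter over the candidate labels of one paragraph. The dictionary layout (label name mapped to its corrected matching degree), the label names, and the numeric values are assumptions made for the example.

```python
def primary_judgment(corrected_degrees, p0):
    """Return the labels whose corrected matching degree reaches the preset P0."""
    # A label matches when its corrected degree is greater than or equal to P0.
    return [label for label, p in corrected_degrees.items() if p >= p0]

# Hypothetical corrected matching degrees for one paragraph.
matched = primary_judgment({"author": 0.9, "abstract": 0.4, "title": 0.6}, p0=0.6)
```

More than one label can pass this filter for the same paragraph, which is the situation the verification unit later resolves by keeping the label with the largest matching degree.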
Specifically, when the identification unit performs the secondary label judgment, the word number Z of the paragraph with successfully matched labels is compared with the word number of each preset label paragraph, and the secondary label judgment is performed on the paragraph with successfully matched labels after the primary label judgment according to the comparison result,
when Z is smaller than Z1 or Z is larger than Z2, the identification unit judges that the successfully matched label cannot be used as the label of the paragraph, and carries out label primary judgment again on the paragraph;
when Z1 ≤ Z ≤ Z2, the identification unit judges that the label successfully matched is used as the label of the paragraph;
wherein Z1 is the number of first preset label paragraph words, Z2 is the number of second preset label paragraph words, and Z1 is less than Z2.
Specifically, in this embodiment, different labels are provided with different preset label paragraph word numbers. When performing the secondary label judgment, the identification unit compares the word number Z of the paragraph whose label was successfully matched with the preset label paragraph word numbers. If Z is smaller than the first preset label paragraph word number Z1 or larger than the second preset label paragraph word number Z2, where Z1 is smaller than Z2, the identification unit judges that the successfully matched label cannot be used as the label of the paragraph and performs the primary label judgment on the paragraph again. If Z lies between Z1 and Z2 inclusive, the identification unit judges that the successfully matched label is used as the label of the paragraph, so that the accuracy of data analysis is effectively ensured and the typesetting efficiency is improved.
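The secondary judgment can be sketched as a per-label range check on the paragraph word number. The function name, the label names, and the example ranges below are hypothetical; the text only states that each label has its own Z1 and Z2 with Z1 < Z2.

```python
from typing import Dict, Tuple


def secondary_judgment(label: str, z: int, ranges: Dict[str, Tuple[int, int]]) -> bool:
    """Return True if the paragraph word number Z lies within [Z1, Z2] for the label.

    ranges maps a label name to its preset (Z1, Z2) pair.
    A False result means the label is rejected and primary judgment is repeated.
    """
    z1, z2 = ranges[label]
    if z1 >= z2:
        raise ValueError("Z1 must be smaller than Z2")
    return z1 <= z <= z2


# Hypothetical presets: titles are short, body paragraphs are long.
PRESET_RANGES = {"title": (2, 30), "body": (20, 2000)}
```

A 500-word paragraph that matched "title" on keywords alone would be rejected here, because 500 exceeds Z2 = 30 for that label.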
Specifically, when the verification unit verifies the label secondary judgment result, the verification unit verifies the label secondary judgment result of the paragraph according to the label number corresponding to the same paragraph, wherein,
when a plurality of labels exist in the same paragraph, the verification unit judges that verification fails, sorts the labels of the paragraph according to the matching degree from large to small, and takes the label with the largest matching degree as the label of the paragraph;
when a single label exists in the same paragraph, the verification unit judges that verification is successful.
Specifically, when the verification unit verifies the secondary label judgment result, it does so according to the number of labels corresponding to the same paragraph. If a plurality of labels exist for the same paragraph, the verification unit judges that verification fails, sorts the labels by matching degree, and takes the label with the highest matching degree as the label of the paragraph. If a single label exists for the paragraph, verification is successful. This effectively ensures the accuracy of data analysis and improves typesetting efficiency.
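The verification step can be sketched as follows. The function name and the return shape are assumptions; tie-breaking between equal matching degrees is not specified in the text, so this sketch simply keeps the first label encountered.

```python
from typing import Dict, Tuple


def verify_labels(labels: Dict[str, float]) -> Tuple[bool, str]:
    """Verify the labels that survived secondary judgment for one paragraph.

    labels maps each surviving label name to its matching degree.
    Returns (verification_succeeded, chosen_label). Verification succeeds
    only when exactly one label exists; otherwise the label with the
    largest matching degree is chosen.
    """
    if not labels:
        raise ValueError("paragraph has no label to verify")
    # Sort from largest to smallest matching degree (stable for ties).
    ranked = sorted(labels.items(), key=lambda item: item[1], reverse=True)
    return len(ranked) == 1, ranked[0][0]
```

For example, a paragraph carrying both "caption" (0.7) and "body" (0.9) fails verification and is assigned "body".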
Specifically, when the reorganization module reorganizes the structure of each verified label, the label name which is verified successfully is matched with the label name in the preset label structure, and the label is reorganized according to the matching result,
when the tag name successfully checked is successfully matched with the tag name in the preset tag structure, the reorganization module reorganizes the tag structure according to the preset tag structure;
when the successfully verified label name fails to match the label names in the preset label structure, the reorganization module performs label judgment again on the paragraph corresponding to the label that failed to match, and when the label judgment is performed again on the paragraph, the previously selected label is no longer used, until the label name of the paragraph is successfully matched with a label name in the preset label structure.
Specifically, when the reorganization module reorganizes the structure of each verified label, the successfully verified label name is matched against the label names in the preset label structure. If the match succeeds, the reorganization module reorganizes the label structure according to the preset label structure. If the match fails, the reorganization module performs label judgment again on the paragraph corresponding to the label that failed to match; during re-judgment, the previously selected label is no longer used, and the process repeats until the label name of the paragraph successfully matches a label name in the preset label structure. This effectively ensures the reorganization precision of the label structure and improves typesetting efficiency.
In this embodiment, when a successfully verified label name fails to match the label names in the preset label structure, the reorganization module performs label judgment again on the paragraph corresponding to the label that failed to match, and the previously selected label is no longer used during re-judgment, until the label name of the paragraph successfully matches a label name in the preset label structure. For example, suppose the successfully verified label names are a, b, c, d, e and f. If c is not a label name in the preset label structure, the match for c fails, and label judgment is performed again on the paragraph corresponding to c. During this re-judgment c is no longer used, and re-judgment continues until the label name of that paragraph successfully matches a label name in the preset label structure.
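The re-judgment loop with exclusion of already rejected labels can be sketched as below. Two points are assumptions rather than statements from the text: re-judgment is modeled as picking the remaining candidate with the highest matching degree, and the function returns None when every candidate has been rejected (the text only says the process continues until a match succeeds).

```python
from typing import Dict, Optional, Set


def resolve_label(candidates: Dict[str, float], preset_names: Set[str]) -> Optional[str]:
    """Pick a label for one paragraph that exists in the preset label structure.

    candidates maps candidate label names to matching degrees.
    A label that fails to match the preset structure is excluded and
    never selected again for this paragraph.
    """
    excluded: Set[str] = set()
    while True:
        remaining = {name: p for name, p in candidates.items() if name not in excluded}
        if not remaining:
            return None  # assumption: no usable label is left
        best = max(remaining, key=lambda name: remaining[name])
        if best in preset_names:
            return best
        excluded.add(best)  # this label is no longer used on re-judgment
```

In the example above, if the paragraph first labelled c has candidates c (0.9) and b (0.7), and the preset structure contains a, b, d, e and f, the sketch rejects c and settles on b.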
Referring to FIG. 2, an automatic typesetting method based on XML according to an embodiment of the present invention includes:
step S1, importing XML format data to be typeset through an importing module;
step S2, analyzing the imported XML format data through an analysis module to identify tags of the XML format data;
step S3, carrying out structural reorganization on each identified label through a reorganization module;
step S4, creating a label style template through a modeling module;
step S5, importing the data after the label structure reorganization into a label style template through a typesetting module for typesetting;
step S6, performing layout adjustment on typeset data through an adjustment module;
and step S7, exporting the document with the adjusted layout through an export module.
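The way the steps feed one another can be sketched as a chain of stages. The sketch below covers only S1 (import) and a simplified part of S2 (extracting paragraph text); steps S3 to S7 would be further stages appended to the same list. The stage names, the `run_pipeline` helper, and the `<p>` element name are hypothetical and are not taken from the patent.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, List


def run_pipeline(data: Any, stages: List[Callable[[Any], Any]]) -> Any:
    """Pass the data through each stage in order; each stage consumes the previous output."""
    for stage in stages:
        data = stage(data)
    return data


def import_xml(xml_text: str) -> ET.Element:
    """Step S1: import XML format data (here, from a string)."""
    return ET.fromstring(xml_text)


def extract_paragraphs(root: ET.Element) -> List[str]:
    """Part of step S2: collect paragraph text; label identification is omitted."""
    return [(p.text or "").strip() for p in root.iter("p")]


paragraphs = run_pipeline(
    "<doc><p>Hello</p><p>World</p></doc>",
    [import_xml, extract_paragraphs],
)
```

Keeping each module as a separate callable mirrors the module-per-step structure of the method and lets a stage be replaced without touching the others.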
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims (5)

1. An automatic typesetting system based on XML, the system comprising:
the importing module is used for importing XML format data;
the analysis module is used for analyzing the imported XML format data, is connected with the importing module, and comprises a classifying unit, an identification unit and a verification unit, wherein the classifying unit is used for classifying the imported XML format data into text data and picture data; the identification unit is used for identifying labels of the classified text data and is connected with the classifying unit; when identifying the labels, the identification unit matches each label keyword with the content of each paragraph of the text data and calculates the label matching degree P of each paragraph; after the calculation is completed, the identification unit further adjusts the label matching degree P according to whether the same label keyword appears in the paragraph; after the adjustment is completed, the identification unit further corrects the adjusted label matching degree P' according to the number of identical label keywords appearing in the paragraph; after the correction is completed, the identification unit further performs primary judgment on the label of the paragraph according to the corrected label matching degree P", and performs secondary judgment, according to the paragraph word number, on the label of each paragraph whose label was successfully matched in the primary judgment; the verification unit is used for verifying the secondary label judgment result, and when verifying, the verification unit verifies the secondary label judgment result of the paragraph according to the number of labels corresponding to the same paragraph;
the reorganization module is used for carrying out structural reorganization on each label after verification and is connected with the analysis module;
the modeling module is used for creating a label style template and is connected with the reorganization module;
the typesetting module is used for importing the data after the label structure is reorganized into the label style template for typesetting and is connected with the reorganization module;
the adjustment module is used for carrying out layout adjustment on typeset data, is connected with the typesetting module, and is also used for adjusting the dynamic header format when carrying out adjustment so that the header formats of all pages after adjustment are the same, and creating an index label and a reference label;
the export module is used for exporting the file after layout adjustment and is connected with the adjustment module;
when the identification unit calculates the label matching degree P of each paragraph, P = (P1 + P2 + … + Pn)/n is set, where n is the number of similar keywords in the paragraph, n ≥ 1, Pi is the matching degree of the i-th similar keyword in the paragraph, Pi = L/L0, i = 1, 2, …, n, L is the word number of the similar keyword, L ≥ 2, and L0 is the word number of the label keyword;
the similar keywords are identical or similar keywords in which the number of consecutive words is greater than or equal to 2;
when the identification unit adjusts the tag matching degree P, the identification unit adjusts the tag matching degree P according to whether the same tag keywords appear in the paragraphs, wherein,
when the same tag keywords appear in the paragraphs, the identification unit selects an adjustment coefficient t to adjust the tag matching degree P so as to increase the tag matching degree, wherein t is more than 1 and less than 1.2, the adjusted tag matching degree is P ', and P' =P×t is set;
when the same tag keyword does not appear in the paragraph, the identification unit makes no adjustment;
when the identification unit corrects the adjusted tag matching degree P ', the identification unit compares the number S of the same tag keywords appearing in the paragraph with the number S0 of preset same tag keywords, corrects the adjusted tag matching degree P' according to the comparison result, wherein,
when S is more than 1 and less than or equal to S0, the identification unit selects a first correction coefficient g1 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, wherein g1 is more than 1 and less than 1.1;
when S > S0, the identification unit selects a second correction coefficient g2 to correct the adjusted tag matching degree P' so as to increase the tag matching degree, and sets g2=g1+g1× (S-S0)/S;
when the ith correction coefficient gi is selected to correct the adjusted tag matching degree P ', i=1, 2 is set, the corrected tag matching degree is P ", and P" =p' ×gi is set;
when judging the label of the paragraph according to the corrected label matching degree P", the identification unit compares the corrected label matching degree P" with the preset label matching degree P0 and carries out primary judgment on the label of the paragraph according to the comparison result, wherein,
when P" is more than or equal to P0, the identification unit judges that the label is successfully matched, and takes the label successfully matched as the label of the paragraph;
when P" is less than P0, the identification unit judges that the label matching fails;
when the identification unit performs the secondary label judgment, the word number Z of the successfully label matched paragraph is compared with the word number of each preset label paragraph, and the secondary label judgment is performed on the successfully label matched paragraph after the primary label judgment according to the comparison result,
when Z is smaller than Z1 or Z is larger than Z2, the identification unit judges that the successfully matched label cannot be used as the label of the paragraph, and carries out label primary judgment again on the paragraph;
when Z1 ≤ Z ≤ Z2, the identification unit judges that the label successfully matched is used as the label of the paragraph;
wherein Z1 is the number of first preset label paragraph words, Z2 is the number of second preset label paragraph words, and Z1 is less than Z2.
2. The automatic typesetting system based on XML as recited in claim 1, wherein the verification unit verifies the tag secondary judgment result of the paragraph according to the number of tags corresponding to the same paragraph when verifying the tag secondary judgment result, wherein,
when a plurality of labels exist in the same paragraph, the verification unit judges that verification fails, sorts the labels of the paragraph according to the matching degree from large to small, and takes the label with the largest matching degree as the label of the paragraph;
when a single label exists in the same paragraph, the verification unit judges that verification is successful.
3. The automatic typesetting system based on XML as recited in claim 2, wherein when the reorganization module reorganizes the structure of each verified label, the label name that is verified successfully is matched with the label name in the preset label structure, and the label is reorganized according to the matching result,
when the tag name successfully checked is successfully matched with the tag name in the preset tag structure, the reorganization module reorganizes the tag structure according to the preset tag structure;
when the successfully verified label name fails to match the label names in the preset label structure, the reorganization module performs label judgment again on the paragraph corresponding to the label that failed to match, and when the label judgment is performed again on the paragraph, the previously selected label is no longer used, until the label name of the paragraph is successfully matched with a label name in the preset label structure.
4. The automatic typesetting system based on XML as recited in claim 3, wherein the label style template includes paragraph styles, character styles, object styles and table styles corresponding to the labels.
5. A typesetting method applied to the automatic typesetting system based on XML according to any one of claims 1 to 4, comprising:
step S1, importing XML format data to be typeset through an importing module;
step S2, analyzing the imported XML format data through an analysis module to identify tags of the XML format data;
step S3, carrying out structural reorganization on each identified label through a reorganization module;
step S4, creating a label style template through a modeling module;
step S5, importing the data after the label structure reorganization into a label style template through a typesetting module for typesetting;
step S6, performing layout adjustment on typeset data through an adjustment module;
and step S7, exporting the document with the adjusted layout through an export module.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310397252.1A CN116702702B (en) 2023-04-14 2023-04-14 Automatic typesetting method and system based on XML

Publications (2)

Publication Number Publication Date
CN116702702A CN116702702A (en) 2023-09-05
CN116702702B true CN116702702B (en) 2024-02-13

Family

ID=87839920

Country Status (1)

Country Link
CN (1) CN116702702B (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Xiao Hui, Wan Jie, Peng Gan, Cheng Cheng
Inventor before: Wan Jie, Peng Gan, Cheng Cheng, Xiao Hui
GR01 Patent grant