CN110196966B - Method and device for identifying group pictures in Word document - Google Patents


Info

Publication number
CN110196966B
Authority
CN
China
Prior art keywords
picture
paragraph
format
group
pictures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810161421.0A
Other languages
Chinese (zh)
Other versions
CN110196966A (en)
Inventor
黄保健
代芳
朱轩成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201810161421.0A
Publication of CN110196966A
Application granted
Publication of CN110196966B
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method and a device for identifying group pictures in a Word document. All pictures in the Word document are converted into an embedded format; each picture is traversed, and it is judged whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule; if so, the picture is determined to belong to a group picture, and the paragraphs satisfying the rule are modified to a predetermined group picture style. Group pictures in the Word document are thereby identified automatically, which reduces manual operation, achieves higher accuracy, facilitates correct conversion of the Word document into the XML document format, improves conversion efficiency, and makes data storage, efficient retrieval, online display and the like more convenient for users such as journal and magazine publishers.

Description

Method and device for identifying group pictures in Word document
Technical Field
The invention relates to the technical field of digital publishing, and in particular to a method and a device for identifying group pictures in a Word document.
Background
Microsoft Office Word is a word processor application from Microsoft Corporation. Word provides users with tools for creating professional, polished documents and is currently the most popular word processing program. With the rapid development of information technology, readers' reading habits are gradually shifting to PCs (personal computers) and mobile devices. Some large international publishers have long recommended XML as the basis for data exchange and storage, and many domestic peers also regard XML as the best choice for exchanging and storing the content of scientific and technical journals. In order to manage full-text documents from various institutions effectively, the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, established the data standard JATS (Journal Article Tag Suite) to describe document formats uniformly, and the standard has been officially approved by the National Information Standards Organization (NISO) as a national standard. As one of the most widely used standards for archiving document resources, JATS is widely applied by publishers, scientific and technical journals, libraries, and the like. The JATS tag set defines the XML document structure of scientific and technical journals and has promoted digital publishing at many journal publishers.
Contributors to journals usually write and submit manuscripts in Word, whereas journal publishers use XML as the content exchange and storage format in their digital transformation. Word documents must therefore be recognized and processed to regenerate the content in XML format. JATS defines separate XML storage formats for single pictures and group pictures, yet existing tools for converting Word documents into XML documents recognize a group picture in a Word document as several single pictures. For a group of four pictures, for example, this does not conform to the JATS specification, goes against the author's intention of combining the four pictures into one group picture, and leaves no proper place to store the main caption. Manual work is therefore needed to handle group pictures when Word documents are converted into XML documents. Most foreign journals capture each group picture in the Word document as a screenshot to produce a single picture and then convert the Word document into an XML document, whereas in China the Word document is usually converted into an XML document first and then processed manually so that the XML format of the group pictures meets the JATS specification.
In the prior art, handling group pictures during conversion of a Word document into an XML document requires manual operation, so processing efficiency is low, file errors are likely, and cost increases.
Disclosure of Invention
The invention provides a method and a device for identifying group pictures in a Word document, which identify group pictures in the Word document automatically, reduce manual operation, facilitate correct conversion of the Word document into the XML document format, and improve conversion efficiency.
One aspect of the present invention provides a method for identifying group pictures in a Word document, comprising:
converting all pictures in the Word document into an embedded format;
traversing each picture, and judging whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule;
if so, determining that the picture belongs to a group picture, and modifying the paragraphs satisfying the predetermined group picture judgment rule to a predetermined group picture style.
Further, the judging whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule comprises:
identifying the content of a second paragraph following the paragraph in which the picture is located;
if the content of the second paragraph contains a sub-caption in a first predetermined format, identifying whether the paragraph following the second paragraph contains a picture;
and repeating the above identification steps until a main caption in a second predetermined format is identified.
Further, before identifying the content of the second paragraph following the paragraph in which the picture is located, the method further comprises:
judging the number of pictures contained in the paragraph in which the picture is located;
and determining the first predetermined format of the sub-caption according to the number of pictures.
Further, the converting all pictures in the Word document into an embedded format comprises:
detecting the format of each picture in the Word document;
if the format of the picture is a non-embedded format, converting the picture into an InlineShape object by using the ConvertToInlineShape command, thereby converting the format of the picture into an embedded format.
Further, the modifying the paragraphs satisfying the predetermined group picture judgment rule to a predetermined group picture style comprises:
modifying the styles of the paragraphs containing the pictures, sub-captions and main caption of the same group picture to the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style during conversion of the Word document into an XML document, the pictures, sub-captions and main caption of the same group picture are output according to the XML storage format for group pictures.
Another aspect of the present invention provides a device for identifying group pictures in a Word document, the device comprising:
a picture format conversion module, configured to convert all pictures in the Word document into an embedded format;
a judging module, configured to traverse each picture and judge whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule, and if so, determine that the picture belongs to a group picture;
and a style modification module, configured to modify the paragraphs satisfying the predetermined group picture judgment rule to a predetermined group picture style.
Further, the judging module is configured to:
identify the content of a second paragraph following the paragraph in which the picture is located;
if the content of the second paragraph contains a sub-caption in a first predetermined format, identify whether the paragraph following the second paragraph contains a picture;
and repeat the above identification steps until a main caption in a second predetermined format is identified.
Further, the judging module is further configured to:
judge the number of pictures contained in the paragraph in which the picture is located;
and determine the first predetermined format of the sub-caption according to the number of pictures.
Further, the picture format conversion module is configured to:
detect the format of each picture in the Word document;
if the format of the picture is a non-embedded format, convert the picture into an InlineShape object by using the ConvertToInlineShape command, thereby converting the format of the picture into an embedded format.
Further, the style modification module is configured to:
modify the styles of the paragraphs containing the pictures, sub-captions and main caption of the same group picture to the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style during conversion of the Word document into an XML document, the pictures, sub-captions and main caption of the same group picture are output according to the XML storage format for group pictures.
In the method and the device for identifying group pictures in a Word document provided by the invention, all pictures in the Word document are converted into an embedded format; each picture is traversed, and it is judged whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule; if so, the picture is determined to belong to a group picture, and the paragraphs satisfying the rule are modified to a predetermined group picture style. Group pictures in the Word document are thereby identified automatically, which reduces manual operation, achieves higher accuracy, facilitates correct conversion of the Word document into the XML document format, improves conversion efficiency, and makes data storage, efficient retrieval, online display and the like more convenient for users such as journal and magazine publishers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an example of a single picture in a Word document provided by an embodiment of the present invention;
FIG. 2 is an example of a group picture in a Word document provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a method for identifying group pictures in a Word document according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for identifying group pictures in a Word document according to another embodiment of the present invention;
FIG. 5 is a flowchart of a method for identifying group pictures in a Word document according to another embodiment of the present invention;
FIG. 6 is a block diagram of a device for identifying group pictures in a Word document according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The JATS (Journal Article Tag Suite) standard defines XML formats for single pictures and for group pictures.
A single picture, such as the one shown in FIG. 1, has only one picture and one caption (FIG. 8. Deaths among patients receiving day hospital care or alternative services). Its XML format is:
<fig id="f8" orientation="portrait" position="float">
<label>FIG.8.</label>
<caption>
<title>Deaths among patients receiving day hospital care
or alternative services.</title>
</caption>
<graphic id="gc1" orientation="portrait" position="float"
xlink:href="dummy1.png"/>
</fig>
A group picture, such as the one shown in FIG. 2, consists of four pictures, four sub-captions and one main caption. The sub-captions are "(a) abc", "(b) one two three", "(c) hij" and "(d) four five six", and the main caption is "Fig. 1 (a) Energy level structure of neodymium glass; (b) ETU and CR phenomena of the energy level population of neodymium glass". Its XML format is:
<fig-group>
<label>Fig. 1</label>
<caption>
<title>(a) Energy level structure of neodymium glass; (b) ETU and CR phenomena of the energy level population of neodymium glass</title>
</caption>
<abstract abstract-type="caption" xml:lang="en">
<label>Fig.1</label>
<title>(a)Energy levels of Nd:glass;(b)Energy transfer up-conversion(ETU)and cross-relaxation(CR)in energy levels of Nd:glass</title>
</abstract>
<abstract abstract-type="note">
<p>Note: energy level structure of neodymium glass</p>
</abstract>
<fig orientation="portrait" position="float" id="F1">
<label>(a)</label>
<caption>
<title>abc</title>
</caption>
<graphic xlink:href="media/brief Tansjie Key research and development project propaganda material-virology report 20161209-repair draft full text _ image annotation _ image2.png" specific-use="print" id="gc4">
</graphic>
</fig>
<fig orientation="portrait" position="float" id="F2">
<label>(b)</label>
<caption>
<title>one two three</title>
</caption>
<graphic xlink:href="media/brief Tanken Wenji key research and development project propaganda material-virus academic newspaper 20161209-repair draft full text _ image annotation _ image3.png" specific-use="print" id="gc3">
</graphic>
</fig>
<fig orientation="portrait" position="float" id="F3">
<label>(c)</label>
<caption>
<title>hij</title>
</caption>
<graphic xlink:href="media/brief Tansjie Key research and development project propaganda material-virology report 20161209-repair draft full text _ image annotation _ image2.png" specific-use="print" id="gc2">
</graphic>
</fig>
<fig orientation="portrait" position="float" id="F4">
<label>(d)</label>
<caption>
<title>four five six</title>
</caption>
<graphic xlink:href="media/brief Tanken Wenji key research and development project propaganda material-virus academic newspaper 20161209-repair draft full text _ image annotation _ image3.png" specific-use="print" id="gc1">
</graphic>
</fig>
</fig-group>
...
Since JATS defines separate XML storage formats for single pictures and group pictures, recognizing the four pictures of the group picture in FIG. 2 as four single pictures does not conform to the JATS specification, goes against the author's intention of combining the four pictures into one group picture, and leaves no proper place to store the main caption. In the prior art, handling group pictures during conversion of a Word document into the XML document format requires manual work, so processing efficiency is low and file errors are likely.
FIG. 3 is a flowchart of a method for identifying group pictures in a Word document according to an embodiment of the present invention. In view of the above problems, this embodiment provides a method for identifying group pictures in a Word document, with the following specific steps:
S101, converting all pictures in the Word document into an embedded format.
In this embodiment, the embedded format of a picture, i.e. the InlineShape object, means that the picture is treated as a character and is laid out in the text flow in the same way as text. A Shape object represents a graphic object in the document; Shape objects and InlineShape objects belong to the Shapes collection and the InlineShapes collection of the document, respectively. A Shape object can be converted into an InlineShape object by its ConvertToInlineShape method, and an InlineShape object can be converted into a Shape object by its ConvertToShape method. In this embodiment, all pictures in the Word document are converted into the embedded format so that pictures belonging to group pictures can be identified conveniently. Specifically, the Word document may be processed with VBA (Visual Basic for Applications, the Visual Basic macro language) to achieve this. The Word document may also be processed with OOXML (Office Open XML) technology, but in that case only Word files with the ".docx" suffix can be processed; Word files with the ".doc" suffix or other suffixes must first be converted into ".docx" files. Other prior-art methods may also be used to convert all pictures in the Word document into the embedded format, and details are not repeated here.
Specifically, as shown in fig. 4, the converting of all pictures in the Word document into an embedded format in S101 includes:
S1011, detecting the format of each picture in the Word document;
S1012, if the format of the picture is a non-embedded format, converting the picture into an InlineShape object by using the ConvertToInlineShape command, thereby converting the format of the picture into an embedded format.
In this embodiment, the format of each picture in the Word document is detected; if a picture is in a non-embedded format, for example a picture that is a Shape object, it is converted into an InlineShape object by using the ConvertToInlineShape command. The pictures in the Word document are thus converted into the embedded format automatically, which improves conversion efficiency.
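The detection step S1011 can be illustrated for the OOXML route mentioned above. The following is a minimal sketch in Python, not the patented implementation: it assumes the main document part of a ".docx" file has already been read into a string, and it relies on the DrawingML convention that embedded pictures appear as `wp:inline` elements and floating pictures as `wp:anchor` elements. The function name `count_picture_layouts` and the sample markup are hypothetical, and legacy VML pictures are not covered.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and its DrawingML picture wrapper.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def count_picture_layouts(document_xml: str) -> dict:
    """Count embedded (inline) and floating (anchored) drawings.

    Hypothetical helper: it only detects the layout of each drawing and
    does not perform the conversion to the embedded format.
    """
    root = ET.fromstring(document_xml.strip())
    inline = len(root.findall(f".//{{{WP_NS}}}inline"))
    floating = len(root.findall(f".//{{{WP_NS}}}anchor"))
    return {"inline": inline, "floating": floating}


# Simplified sample: one embedded picture and two floating pictures.
SAMPLE = f"""
<w:document xmlns:w="{W_NS}" xmlns:wp="{WP_NS}">
  <w:body>
    <w:p><w:r><w:drawing><wp:inline/></w:drawing></w:r></w:p>
    <w:p><w:r><w:drawing><wp:anchor/></w:drawing></w:r></w:p>
    <w:p><w:r><w:drawing><wp:anchor/></w:drawing></w:r></w:p>
  </w:body>
</w:document>
"""

print(count_picture_layouts(SAMPLE))  # {'inline': 1, 'floating': 2}
```

Any picture counted as floating would still need to be converted, for example with the ConvertToInlineShape method when the document is processed through VBA.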
S102, traversing each picture, and judging whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule.
In this embodiment, a group picture judgment rule is defined in advance; each picture in the Word document is then traversed, and it is judged whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies the rule, thereby determining whether the picture belongs to a group picture. Specifically, for example, the content of a second paragraph following the paragraph in which the picture is located is identified; if the content of the second paragraph contains a sub-caption in a first predetermined format, it is identified whether the paragraph following the second paragraph contains a picture; and the above identification steps are repeated until a main caption in a second predetermined format is identified. The sub-caption in the first predetermined format may be "(a) xxxx", "(a) xxxx (b) xxxxx", "(1) xxxx (2) xxxx" or the like; the main caption in the second predetermined format may be "FIG. 1 xxxxx" or the like. The formats are of course not limited to those listed above. Other judgment rules may also be adopted in this embodiment. For example, it may be judged whether, between the two closest pictures that do not belong to the same paragraph, there is a sub-caption in the first predetermined format and no content other than the sub-caption, and whether a main caption in the second predetermined format is present in the paragraph following the paragraph of one of the sub-captions; if so, the pictures are determined to form a group picture.
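One way to express the first and second predetermined formats is as regular expressions. The sketch below is illustrative only: the two patterns and the function `classify_caption` are assumptions chosen to match the example formats listed above, and a real rule set would be configured for each journal's conventions.

```python
import re

# Hypothetical pattern for the first predetermined format: a paragraph that
# starts with a bracketed letter or number, e.g. "(a) xxxx" or "(1) xxxx".
SUB_CAPTION = re.compile(r"^\s*\((?:[a-z]|\d+)\)\s*\S")

# Hypothetical pattern for the second predetermined format: a paragraph that
# starts with a figure label and number, e.g. "FIG. 1 xxxxx" or "Fig.1 xxxxx".
MAIN_CAPTION = re.compile(r"^\s*fig(?:ure|\.)?\s*\d+", re.IGNORECASE)


def classify_caption(text: str) -> str:
    """Return 'sub', 'main' or 'other' for the text of one paragraph."""
    if SUB_CAPTION.match(text):
        return "sub"
    if MAIN_CAPTION.match(text):
        return "main"
    return "other"


for sample in ["(a) xxxx (b) xxxxx", "(1) xxxx (2) xxxx", "FIG. 1 xxxxx", "Body text."]:
    print(sample, "->", classify_caption(sample))
```

Keeping the patterns in one place makes it straightforward to add further formats without touching the traversal logic.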
S103, if so, determining that the picture belongs to a group picture, and modifying the paragraphs satisfying the predetermined group picture judgment rule to a predetermined group picture style.
In this embodiment, when it is judged that the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies the predetermined group picture judgment rule, the picture is determined to belong to a group picture, and the group picture is marked by modifying the relevant paragraphs to the predetermined group picture style, so that the group picture can be identified during conversion of the Word document into an XML document and output according to the XML storage format for group pictures. In this embodiment, the styles of the paragraphs containing the pictures, sub-captions and main caption of the same group picture may be modified to the predetermined group picture style. Taking Word 2010 as an example, the group picture style is created in advance with the Styles function on the Home tab, and the formats of the paragraphs containing the pictures, the sub-captions and the main caption are each defined. During the subsequent conversion of the Word document into an XML document, VBA is used to check whether the style of each paragraph is the predetermined group picture style, in order to decide whether to output the paragraph content (picture, sub-caption or main caption) into <fig-group>. Other methods may of course be used to mark an identified group picture, and they are not described here.
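The marking step can be modelled independently of Word. The sketch below uses a hypothetical in-memory `Paragraph` class and a hypothetical style name `GroupFigure`; in the embodiment the style would instead be a named paragraph style created in Word and applied through VBA.

```python
from dataclasses import dataclass

GROUP_STYLE = "GroupFigure"  # hypothetical name of the predetermined style


@dataclass
class Paragraph:
    """Simplified stand-in for a Word paragraph."""
    text: str
    style: str = "Normal"


def mark_group(paragraphs, start, end, style=GROUP_STYLE):
    """Apply the group style to paragraphs[start] through paragraphs[end]."""
    for para in paragraphs[start:end + 1]:
        para.style = style
    return paragraphs


def is_group_paragraph(para, style=GROUP_STYLE):
    """Check a later converter could use to decide on <fig-group> output."""
    return para.style == style


doc = [
    Paragraph("Body text."),
    Paragraph("[picture]"),
    Paragraph("(a) abc"),
    Paragraph("Fig. 1 Example main caption"),
]
mark_group(doc, 1, 3)
print([p.style for p in doc])
```

Using a paragraph style as the marker keeps the recognition result inside the document itself, so the later Word-to-XML conversion does not need a separate index of group pictures.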
In the method for identifying group pictures in a Word document provided by this embodiment, all pictures in the Word document are converted into an embedded format; each picture is traversed, and it is judged whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies a predetermined group picture judgment rule; if so, the picture is determined to belong to a group picture, and the paragraphs satisfying the rule are modified to a predetermined group picture style. Group pictures in the Word document are thereby identified automatically, which reduces manual operation, achieves higher accuracy, facilitates correct conversion of the Word document into the XML document format, improves conversion efficiency, and makes data storage, efficient retrieval, online display and the like more convenient for users such as journal and magazine publishers.
FIG. 5 is a flowchart of a method for identifying group pictures in a Word document according to another embodiment of the present invention. On the basis of the foregoing embodiment, as shown in FIG. 5, the judging in S102 of whether the content of the paragraphs adjacent to the paragraph in which the picture is located satisfies the predetermined group picture judgment rule may specifically include:
S1021, identifying the content of a second paragraph following the paragraph in which the picture is located.
Since the pictures have already been converted into the embedded format, in the Word document a picture and a caption each occupy their own paragraph. Furthermore, in a group picture, pictures and sub-captions appear alternately, and each group picture has one main caption. Therefore, while traversing each picture, the content of the second paragraph following the paragraph in which the picture is located is identified, and whether the picture is a single picture or part of a group picture is determined by judging whether that content is a main caption or a sub-caption.
S1022, if the content of the second paragraph contains a sub-caption in the first predetermined format, identifying whether the paragraph following the second paragraph contains a picture.
In this embodiment, the sub-caption in the first predetermined format may be "(a) xxxx", "(a) xxxx (b) xxxxx", "(1) xxxx", "(1) xxxx (2) xxxx" or the like, or another predetermined format. Specifically, the author may edit the Word document according to a prescribed sub-caption format. Alternatively, the sub-caption format is not restricted while the Word document is edited; instead, the first predetermined formats of all possible sub-captions are stored in a database in advance, and whether the content of the second paragraph contains a sub-caption is judged by matching it against the database. After the sub-caption contained in the second paragraph has been identified, it is further judged whether the next paragraph contains a picture. The paragraph following a sub-caption may, of course, be an annotation on the picture corresponding to that sub-caption; if the paragraph following the sub-caption is found not to contain a picture, it is then identified whether the paragraph after it contains a picture.
S1023, repeating the above identification steps until a main caption in the second predetermined format is identified.
In this embodiment, the adjacent paragraphs are identified in sequence as picture, sub-caption, picture, sub-caption, and so on, until a main caption in the second predetermined format is identified, whereby each of these paragraphs is determined to belong to the group picture. The main caption in the second predetermined format may be "FIG. 1 xxxxx" or the like. Further, after the main caption has been identified, it is identified whether the paragraph following the main caption contains an annotation for the group picture.
For example, for the group picture shown in FIG. 2, the pictures contained in the first paragraph (pictures (a) and (b)) are identified first. The content of the second paragraph, which follows the paragraph containing pictures (a) and (b), is then identified as "(a) abc (b) one two three"; this paragraph contains sub-captions in the first predetermined format. The paragraph following that sub-caption paragraph is then identified as containing pictures (c) and (d), and the content of the paragraph following pictures (c) and (d) is identified as "(c) hij (d) four five six", which again contains sub-captions in the first predetermined format. The content of the paragraph following those sub-captions is identified as "Fig. 1 (a) Energy level structure of neodymium glass; (b) ETU and CR phenomena of the energy level population of neodymium glass", which is a main caption in the second predetermined format. Finally, it is identified whether the paragraph following the main caption contains an annotation or a foreign-language translation for the group picture.
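The walk through FIG. 2 can be sketched as a loop over paragraphs. The code below is a simplified illustration under stated assumptions: paragraphs are modelled as hypothetical `(kind, text)` tuples, the two regular expressions are assumed stand-ins for the predetermined formats, and the optional annotation paragraphs mentioned in S1022 and S1023 are not handled.

```python
import re

# Assumed stand-ins for the first and second predetermined formats.
SUB = re.compile(r"^\s*\((?:[a-z]|\d+)\)")
MAIN = re.compile(r"^\s*fig(?:ure|\.)?\s*\d+", re.IGNORECASE)


def find_group_end(paragraphs, start):
    """Return the index of the main caption that closes the group picture
    beginning at the picture paragraph `start`, or None if the picture is
    not part of a group picture.

    `paragraphs` is a list of (kind, text) tuples, where kind is either
    "picture" or "text".
    """
    i = start
    while i < len(paragraphs) and paragraphs[i][0] == "picture":
        caption_at = i + 1
        if caption_at >= len(paragraphs):
            return None
        kind, text = paragraphs[caption_at]
        if kind != "text" or not SUB.match(text):
            return None  # no sub-caption follows, so not a group picture
        i = caption_at + 1
        if i < len(paragraphs):
            kind, text = paragraphs[i]
            if kind == "text" and MAIN.match(text):
                return i  # main caption reached: the group is complete
    return None


# Paragraph sequence corresponding to the FIG. 2 example.
FIG2 = [
    ("text", "Body text."),
    ("picture", ""),
    ("text", "(a) abc (b) one two three"),
    ("picture", ""),
    ("text", "(c) hij (d) four five six"),
    ("text", "Fig. 1 (a) Energy level structure of neodymium glass"),
]

# A single picture followed directly by its caption.
SINGLE = [("picture", ""), ("text", "FIG. 8. Deaths among patients")]

print(find_group_end(FIG2, 1))    # 5
print(find_group_end(SINGLE, 0))  # None
```

Paragraphs from `start` through the returned index are the ones that would be given the predetermined group picture style.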
Further, before the identifying in S1021 of the content of the second paragraph following the paragraph in which the picture is located, the method further includes:
judging the number of pictures contained in the paragraph in which the picture is located;
and determining the first predetermined format of the sub-caption according to the number of pictures.
In this embodiment, the first predetermined format of the sub-caption may be determined according to the number of pictures contained in the paragraph in which the picture is located. For example, when that paragraph contains only one picture, the first predetermined format is determined to be "(a) xxxx" or "(1) xxxx", i.e. a format containing only one sub-caption. Specifically, when the paragraph in which the picture is located is judged to contain one picture, it is judged whether the next paragraph contains a sub-caption in the format "(a) xxxx" or "(1) xxxx". If so, the picture may belong to a group picture, and it is then judged whether the following paragraph is a picture. If it is, and it also contains only one picture, it is further judged whether the paragraph after it contains a sub-caption in the format "(b) xxxx" or "(2) xxxx", and so on until the main caption is identified. As another example, when the paragraph contains two pictures, the first predetermined format of the sub-caption is determined to be "(a) xxxx (b) xxxxx" or "(1) xxxx (2) xxxxx"; the detailed process is not repeated here.
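Deriving the expected sub-caption format from the picture count can be sketched as building a regular expression on the fly. The function `expected_sub_caption` and its parameters are hypothetical. As a simplification, caption text is assumed not to contain parentheses, and lettered labels are limited to "a" through "z".

```python
import re
import string


def expected_sub_caption(num_pictures, first_index=0, numeric=False):
    """Build a pattern for a paragraph holding `num_pictures` sub-captions.

    `first_index` is the zero-based position of the first label, so
    first_index=2 with letters starts at "(c)". Caption text is assumed
    to contain no parentheses (a simplification of this sketch).
    """
    parts = []
    for k in range(first_index, first_index + num_pictures):
        label = str(k + 1) if numeric else string.ascii_lowercase[k]
        parts.append(r"\(" + label + r"\)\s*[^()]+")
    return re.compile(r"^\s*" + "".join(parts) + r"$")


one = expected_sub_caption(1)
two = expected_sub_caption(2)
print(bool(one.match("(a) abc")))                    # True
print(bool(two.match("(a) abc (b) one two three")))  # True
print(bool(two.match("(a) abc")))                    # False
```

For the second picture paragraph of FIG. 2, `expected_sub_caption(2, first_index=2)` would look for "(c) ... (d) ...".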
Further, the modifying in S103 of the paragraphs satisfying the predetermined group picture judgment rule to a predetermined group picture style includes:
modifying the styles of the paragraphs containing the pictures, sub-captions and main caption of the same group picture to the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style during conversion of the Word document into an XML document, the pictures, sub-captions and main caption of the same group picture are output according to the XML storage format for group pictures.
In this embodiment, after the pictures have been determined to belong to a group picture, the styles of the paragraphs containing the pictures, sub-captions and main caption of the same group picture are modified to the predetermined group picture style, thereby marking the group picture so that it can be identified during conversion of the Word document into an XML document and output according to the XML storage format for group pictures. Taking Word 2010 as an example, the group picture style is created in advance with the Styles function on the Home tab, and the formats of the paragraphs containing the pictures, the sub-captions and the main caption are each defined. During the subsequent conversion of the Word document into an XML document, VBA is used to check whether the style of each paragraph is the predetermined group picture style, in order to decide whether to output the paragraph content (picture, sub-caption or main caption) into <fig-group>.
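The output side can be illustrated with a small builder for the `<fig-group>` structure shown earlier. This is a sketch under assumptions, not the converter described in the embodiment: the function `build_fig_group` is hypothetical, the sub-caption parsing assumes captions without parentheses, and the `<graphic>` elements and attributes are omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Splits "(a) abc (b) one two three" into [("a", "abc "), ("b", "one two three")].
LABEL = re.compile(r"\(([a-z]|\d+)\)\s*([^()]+)")


def build_fig_group(main_label, main_title, sub_caption_texts):
    """Assemble a <fig-group> element from already-recognized caption text."""
    group = ET.Element("fig-group")
    ET.SubElement(group, "label").text = main_label
    caption = ET.SubElement(group, "caption")
    ET.SubElement(caption, "title").text = main_title
    for text in sub_caption_texts:
        for label, title in LABEL.findall(text):
            fig = ET.SubElement(group, "fig")
            ET.SubElement(fig, "label").text = f"({label})"
            fig_caption = ET.SubElement(fig, "caption")
            ET.SubElement(fig_caption, "title").text = title.strip()
    return group


group = build_fig_group(
    "Fig. 1",
    "Example main caption",
    ["(a) abc (b) one two three", "(c) hij (d) four five six"],
)
print(ET.tostring(group, encoding="unicode"))
```

Each sub-caption becomes its own `<fig>` child, while the main caption is stored once on the enclosing `<fig-group>`, which is the storage location that is missing when the pictures are exported as four single pictures.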
FIG. 6 is a block diagram of an apparatus for identifying group pictures in a Word document according to an embodiment of the present invention. The apparatus can execute the processing flow provided by the embodiments of the method for identifying group pictures in a Word document. As shown in FIG. 6, the apparatus of this embodiment comprises a picture format conversion module 301, a determining module 302 and a style modification module 303.
The picture format conversion module 301 is configured to convert all pictures in the Word document into an embedded format;
the determining module 302 is configured to traverse each picture and determine whether the content of a paragraph adjacent to the paragraph where the picture is located satisfies a predetermined group picture determination rule, and if so, to determine that the picture belongs to a group picture;
the style modification module 303 is configured to modify the paragraphs that satisfy the predetermined group picture determination rule into a predetermined group picture style.
Further, the determining module 302 is configured to:
identifying the content of a second paragraph that follows the paragraph where the picture is located;
if the content of the second paragraph contains a sub-figure caption in the first predetermined format, identifying whether the paragraph following the second paragraph contains a picture;
and repeating the identification steps until a main figure caption in the second predetermined format is identified.
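The alternating picture and caption check performed by the determining module can be sketched as a short loop. This is a simplified sketch, not the patented implementation: the `Paragraph` model is hypothetical, the sub-figure caption test only checks for a leading "(a)" or "(1)" style label, and the regular expression for the second predetermined format (a main figure caption beginning with "Fig.", "Figure" or "图" plus a number) is an assumption, since this passage does not spell that format out.

```python
import re
from dataclasses import dataclass

# Simplified first predetermined format: a leading "(a)" or "(1)" label.
SUB_CAPTION = re.compile(r"^\s*\(([a-z]|\d+)\)\s*\S")
# Assumed second predetermined format for the main figure caption.
MAIN_CAPTION = re.compile(r"^\s*(Fig\.|Figure|图)\s*\d+")


@dataclass
class Paragraph:
    text: str = ""
    picture_count: int = 0


def find_group_end(paragraphs, start):
    """Return the index of the main figure caption that closes a group
    picture beginning at `start`, or None if no group is found."""
    i = start
    while i < len(paragraphs) and paragraphs[i].picture_count > 0:
        # The paragraph after a picture must hold a sub-figure caption.
        if i + 1 >= len(paragraphs) or not SUB_CAPTION.match(paragraphs[i + 1].text):
            return None
        i += 2
        # After the sub-figure caption comes the main caption or another picture.
        if i < len(paragraphs) and MAIN_CAPTION.match(paragraphs[i].text):
            return i
    return None
```

The paragraphs from `start` through the returned index are the ones whose style would then be changed to the predetermined group picture style.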
Further, the determining module 302 is further configured to:
determining the number of pictures contained in the paragraph where the picture is located;
and determining the first predetermined format of the sub-figure caption according to the number of pictures.
Further, the picture format conversion module 301 is configured to:
detecting the format of each picture in the Word document;
if the format of the picture is a non-embedded format, converting the picture into an InlineShape object using the ConvertToInlineShape method, thereby converting the picture into the embedded format.
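The conversion step can be illustrated with a small Python function written against the Word object model (`Document.Shapes`, `Shapes.Count`, `Shapes.Item`, `Shape.Type` and `Shape.ConvertToInlineShape`). The function name `convert_floating_pictures` is hypothetical, and driving Word from Python (for example through a COM bridge such as pywin32) is an assumption of this sketch; the embodiment itself works in VBA.

```python
MSO_LINKED_PICTURE = 11  # MsoShapeType.msoLinkedPicture
MSO_PICTURE = 13         # MsoShapeType.msoPicture


def convert_floating_pictures(doc) -> int:
    """Convert every floating picture in `doc` to an inline shape.

    `doc` is expected to behave like a Word Document object.
    Returns the number of pictures converted.
    """
    converted = 0
    shapes = doc.Shapes
    # Walk backwards: a converted shape leaves the Shapes collection,
    # which would otherwise shift the indices of the shapes after it.
    for index in range(shapes.Count, 0, -1):
        shape = shapes.Item(index)
        if shape.Type in (MSO_PICTURE, MSO_LINKED_PICTURE):
            shape.ConvertToInlineShape()
            converted += 1
    return converted
```

Pictures that are already inline live in the document's InlineShapes collection rather than in Shapes, so they are left untouched by this loop.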
Further, the style modification module 303 is configured to:
modifying the styles of the paragraphs containing the pictures, the sub-figure captions and the main figure caption of the same group picture into the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style while the Word document is converted into an XML document, the pictures, sub-figure captions and main figure caption of that group picture are output in the XML storage format for group pictures.
The apparatus for identifying group pictures in a Word document provided by this embodiment of the present invention can be used to execute the method embodiments provided in FIGS. 3 to 5; its specific functions are not described here again.
The apparatus for identifying group pictures in a Word document provided by this embodiment converts all pictures in the Word document into an embedded format, traverses each picture, and determines whether the content of a paragraph adjacent to the paragraph where the picture is located satisfies a predetermined group picture determination rule. If the rule is satisfied, the picture is determined to belong to a group picture, and the paragraphs satisfying the rule are modified into a predetermined group picture style. Group pictures in the Word document are thus identified automatically, which reduces manual operation and achieves higher accuracy. This in turn facilitates correct conversion of the Word document into the XML document format, improves conversion efficiency, and makes data storage, efficient retrieval, online display and the like more convenient for users such as periodicals.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for identifying group pictures in a Word document is characterized by comprising the following steps:
converting all pictures in the Word document into an embedded format;
traversing each picture, and determining whether the content of a paragraph adjacent to the paragraph where the picture is located satisfies a predetermined group picture determination rule;
if so, determining that the picture belongs to a group picture, and modifying the paragraphs that satisfy the predetermined group picture determination rule into a predetermined group picture style;
wherein determining whether the content of a paragraph adjacent to the paragraph where the picture is located satisfies the predetermined group picture determination rule comprises:
identifying the content of a second paragraph that follows the paragraph where the picture is located;
if the content of the second paragraph contains a sub-figure caption in the first predetermined format, identifying whether the paragraph following the second paragraph contains a picture;
and repeating the identification steps until a main figure caption in the second predetermined format is identified.
2. The method of claim 1, wherein before identifying the content of a second paragraph following the paragraph in which the picture is located, the method further comprises:
determining the number of pictures contained in the paragraph where the picture is located;
and determining the first predetermined format of the sub-figure caption according to the number of pictures.
3. The method according to claim 1 or 2, wherein the converting all pictures in the Word document into an embedded format comprises:
detecting the format of each picture in the Word document;
if the format of the picture is a non-embedded format, converting the picture into an InlineShape object using the ConvertToInlineShape method, thereby converting the picture into the embedded format.
4. The method according to claim 1 or 2, wherein modifying the paragraphs that satisfy the predetermined group picture determination rule into a predetermined group picture style comprises:
modifying the styles of the paragraphs containing the pictures, the sub-figure captions and the main figure caption of the same group picture into the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style while the Word document is converted into an XML document, the pictures, sub-figure captions and main figure caption of that group picture are output in the XML storage format for group pictures.
5. An apparatus for identifying group pictures in a Word document, comprising:
the picture format conversion module is used for converting all pictures in the Word document into an embedded format;
a determining module, configured to traverse each picture and determine whether the content of a paragraph adjacent to the paragraph where the picture is located satisfies a predetermined group picture determination rule, and if so, to determine that the picture belongs to a group picture;
a style modification module, configured to modify the paragraphs that satisfy the predetermined group picture determination rule into a predetermined group picture style;
wherein the determining module is configured to perform:
identifying the content of a second paragraph that follows the paragraph where the picture is located;
if the content of the second paragraph contains a sub-figure caption in the first predetermined format, identifying whether the paragraph following the second paragraph contains a picture;
and repeating the identification steps until a main figure caption in the second predetermined format is identified.
6. The apparatus of claim 5, wherein the determining module is further configured to:
determining the number of pictures contained in the paragraph where the picture is located;
and determining the first predetermined format of the sub-figure caption according to the number of pictures.
7. The apparatus of claim 5 or 6, wherein the picture format conversion module is configured to:
detecting the format of each picture in the Word document;
if the format of the picture is a non-embedded format, converting the picture into an InlineShape object using the ConvertToInlineShape method, thereby converting the picture into the embedded format.
8. The apparatus of claim 5 or 6, wherein the style modification module is configured to perform:
modifying the styles of the paragraphs containing the pictures, the sub-figure captions and the main figure caption of the same group picture into the predetermined group picture style, so that when a paragraph style is identified as the predetermined group picture style while the Word document is converted into an XML document, the pictures, sub-figure captions and main figure caption of that group picture are output in the XML storage format for group pictures.
CN201810161421.0A 2018-02-27 2018-02-27 Method and device for identifying group pictures in Word document Expired - Fee Related CN110196966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810161421.0A CN110196966B (en) 2018-02-27 2018-02-27 Method and device for identifying group pictures in Word document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810161421.0A CN110196966B (en) 2018-02-27 2018-02-27 Method and device for identifying group pictures in Word document

Publications (2)

Publication Number Publication Date
CN110196966A (en) 2019-09-03
CN110196966B (en) 2020-12-29

Family

ID=67750762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810161421.0A Expired - Fee Related CN110196966B (en) 2018-02-27 2018-02-27 Method and device for identifying group pictures in Word document

Country Status (1)

Country Link
CN (1) CN110196966B (en)

Also Published As

Publication number Publication date
CN110196966A (en) 2019-09-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230614

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201229