CN117574855A - Method and device for replacing OFD file sensitive words - Google Patents

Method and device for replacing OFD file sensitive words

Info

Publication number
CN117574855A
Authority
CN
China
Prior art keywords
character
replaced
text
replacing
ofd file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311451554.9A
Other languages
Chinese (zh)
Inventor
艾佳
方俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuxin Kunpeng Beijing Information Technology Co ltd
Original Assignee
Fuxin Kunpeng Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuxin Kunpeng Beijing Information Technology Co ltd filed Critical Fuxin Kunpeng Beijing Information Technology Co ltd
Priority to CN202311451554.9A priority Critical patent/CN117574855A/en
Publication of CN117574855A publication Critical patent/CN117574855A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method and a device for replacing sensitive words of an OFD file, belonging to the technical field of document processing, wherein the method comprises the following steps: receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on the OFD file, wherein the OFD file comprises at least one text page; based on the operation instruction, analyzing the OFD file by taking pages as units to obtain a text character string set corresponding to each text page in the OFD file; matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced; and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced. The method for replacing the OFD file sensitive words realizes automatic replacement of the OFD file sensitive words.

Description

Method and device for replacing OFD file sensitive words
Technical Field
The invention relates to the technical field of document processing, in particular to a method and a device for replacing sensitive words of an OFD file.
Background
Open Fixed-layout Document (OFD) is a fixed-layout electronic document format that fixes and presents the layout of multiple kinds of digital content objects, such as text, graphics and images, according to defined rules. The format is widely used across industries, and an increasingly diverse range of content is carried in it. However, as OFD has become popular, fixed-layout documents have also been used to spread articles containing sensitive words.
Sensitive words are words or phrases that are deemed inappropriate or unsuitable in a particular environment; they may relate to negative topics or to secrets. Filtering and replacing the sensitive words in such fixed-layout documents is therefore an urgent need. However, because a fixed-layout document presents its content in a fixed (solidified) form, we have found no existing method that solves this problem.
Therefore, how to implement the detection of sensitive words in an OFD document and automatically filter and replace the sensitive words therein is a problem that needs to be solved in the art.
Disclosure of Invention
The invention provides a method and a device for replacing sensitive words in an OFD file, which are used for solving the defect that the sensitive words in the OFD file cannot be automatically replaced in the prior art and realizing the automatic replacement of the sensitive words in the OFD file.
The invention provides a method for replacing sensitive words in an Open Fixed-layout Document (OFD) file, which comprises the following steps:
receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page;
analyzing the OFD file by taking a page as a unit based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character;
matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced;
and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
According to the method for replacing the OFD file sensitive word provided by the invention, before replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced, the method comprises the following steps:
acquiring the number N of the characters to be replaced in each character set to be replaced; n is a positive integer;
and determining a replacement rule based on the number N of the characters to be replaced.
According to the method for replacing the OFD file sensitive words, which is provided by the invention, the method further comprises the following steps:
and replacing each character to be replaced with a target character based on the replacement rule and the position index corresponding to each character to be replaced.
According to the method for replacing the sensitive word of the OFD file provided by the invention, each character to be replaced is replaced by a target character based on the replacement rule and the position index corresponding to each character to be replaced, and the method comprises the following steps:
acquiring a first position index corresponding to the character to be replaced under the condition that N=1; setting the character to be replaced as a first empty character; inserting a first target character at the first position index, and replacing the first empty character with the first target character;
or, under the condition that N is greater than 1, acquiring a position index set corresponding to the characters to be replaced; determining whether the position indexes in the position index set are continuous; setting the characters to be replaced as an empty character group under the condition that the position indexes are determined to be continuous; and inserting a target character group at the position index corresponding to the empty character group, and replacing the empty character group with the target character group.
According to the method for replacing the OFD file sensitive words, which is provided by the invention, the method further comprises the following steps:
setting each character to be replaced as a second empty character in sequence under the condition that the position indexes are determined to be discontinuous;
and sequentially inserting a second target character at the position index corresponding to each second empty character, and replacing each second empty character with the second target character.
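The choice among the three replacement rules described above (N=1, N>1 with continuous indexes, N>1 with discontinuous indexes) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `is_contiguous` and `choose_rule` and the rule labels are hypothetical, and the position indexes are assumed to be plain integers into the page text.

```python
from typing import List


def is_contiguous(indexes: List[int]) -> bool:
    """Return True when the sorted indexes form an unbroken run such as 7, 8, 9."""
    ordered = sorted(indexes)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def choose_rule(indexes: List[int]) -> str:
    """Pick a replacement rule from the number N of characters to be replaced.

    The labels are hypothetical names for the three cases in the text:
    "single"     -> N = 1, replace the one character directly
    "group"      -> N > 1 and continuous, replace as one character group
    "one_by_one" -> N > 1 and discontinuous, replace each character in turn
    """
    n = len(indexes)
    if n < 1:
        raise ValueError("N must be a positive integer")
    if n == 1:
        return "single"
    return "group" if is_contiguous(indexes) else "one_by_one"
```

For example, `choose_rule([3, 4, 5])` selects the group rule, while `choose_rule([3, 5])` falls back to character-by-character replacement.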
According to the method for replacing the sensitive word of the OFD file provided by the invention, the OFD file is analyzed by taking a page as a unit based on the operation instruction to obtain a text string set corresponding to each text page in the OFD file, and the method comprises the following steps:
acquiring the object type of the character structure information of each page in the OFD file; the object type is one of a non-text object and a text object;
determining the object type, and under the condition that the object type is determined to be the text object, acquiring the text character strings corresponding to the character structure information to obtain a text character string set corresponding to each text page in the OFD file.
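The object-type check above can be illustrated with a small sketch that walks the objects of one page and keeps only the text objects. The sample page XML is invented for the example, and the namespace URI and element names (`Page`, `Content`, `Layer`, `TextObject`, `PathObject`, `ImageObject`, `TextCode`) are assumptions based on a common reading of the OFD page schema; they should be checked against the standard before being relied on. The helper name `page_text_strings` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed OFD namespace URI; verify against the standard.
OFD_NS = {"ofd": "http://www.ofdspec.org/2016"}

# Invented sample page containing two text objects and two non-text objects.
SAMPLE_PAGE = """<ofd:Page xmlns:ofd="http://www.ofdspec.org/2016">
  <ofd:Content>
    <ofd:Layer ID="1">
      <ofd:PathObject ID="2" Boundary="0 0 10 10"/>
      <ofd:TextObject ID="3" Font="5" Size="4">
        <ofd:TextCode X="0" Y="4" DeltaX="4 4">abc</ofd:TextCode>
      </ofd:TextObject>
      <ofd:ImageObject ID="4" ResourceID="9"/>
      <ofd:TextObject ID="6" Font="5" Size="4">
        <ofd:TextCode X="0" Y="9">de</ofd:TextCode>
      </ofd:TextObject>
    </ofd:Layer>
  </ofd:Content>
</ofd:Page>"""


def page_text_strings(page_xml: str) -> list:
    """Collect the text strings of one page, skipping non-text objects."""
    root = ET.fromstring(page_xml)
    strings = []
    for layer in root.iterfind("ofd:Content/ofd:Layer", OFD_NS):
        for obj in layer:
            # obj.tag looks like "{namespace}TextObject"; keep the local name.
            if obj.tag.rsplit("}", 1)[-1] != "TextObject":
                continue  # non-text object: nothing to extract
            for code in obj.iterfind("ofd:TextCode", OFD_NS):
                strings.append(code.text or "")
    return strings
```

Calling `page_text_strings(SAMPLE_PAGE)` returns the strings of the two text objects in document order and ignores the path and image objects.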
According to the method for replacing the sensitive words of the OFD file, the preset sensitive word list is determined based on the content field of the OFD file.
The invention also provides a device for replacing the sensitive words of the OFD file, which comprises:
the receiving module is used for receiving an operation instruction, wherein the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page;
the analysis module is used for analyzing the OFD file by taking pages as units based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character;
the matching module is used for matching each character in each text character string set with a preset sensitive word list and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced;
and the replacing module is used for replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for replacing the OFD file sensitive words when executing the program.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of replacing an OFD file sensitive word as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of replacing sensitive words of an OFD file as described in any of the above.
The invention provides a method and a device for replacing sensitive words in an OFD file. Based on a received operation instruction, sensitive word replacement is performed on the content of each text page in the OFD file: the OFD file is first analyzed page by page to obtain a text character string set corresponding to each text page, the text character string set comprising a plurality of characters and a position index corresponding to each character; each character is then matched with a preset sensitive word list to determine the characters to be replaced in the page; and each character to be replaced is replaced with a target character based on its position index. The method realizes automatic replacement of sensitive words in OFD files and improves the accuracy and efficiency of the replacement.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is the first schematic flow chart of the method for replacing OFD file sensitive words provided by the invention;
FIG. 2 is the second schematic flow chart of the method for replacing OFD file sensitive words provided by the invention;
FIG. 3 is the third schematic flow chart of the method for replacing OFD file sensitive words provided by the invention;
FIG. 4 is a schematic structural diagram of the device for replacing OFD file sensitive words provided by the invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to facilitate a clearer understanding of the various embodiments of the present application, some relevant background knowledge is first presented below.
An OFD file is device-independent and has a fixed layout; its format, layout, fonts, font sizes and so on are fully consistent with the paper document. These characteristics make the fixed-layout format an ideal document format for formal electronic document release, digital information dissemination and archiving, and it is mainly applied in scenarios such as finalization, version freezing and archiving. Examples include publicity materials and brochures, academic papers and research reports, product manuals, user guides, presentation files, electronic contracts and official documents. It is precisely because of these application scenarios that sensitive words in OFD files need to be searched for and replaced.
In the prior art, the sensitive words in the OFD file are usually checked and replaced by a manual processing mode, and the method is time-consuming and labor-consuming and cannot ensure the integrity and accuracy of the replacement of the sensitive words.
Before the specific technical scheme is described, the functions and scenarios of sensitive word replacement are introduced. There are generally two functional scenarios. The first is sensitive word replacement over the whole document: the user specifies a list of sensitive words to be replaced; for the target OFD document, every word that hits the sensitive word list is replaced with user-specified characters, and a new document is then generated. The replacement is persistent: it takes place at the file data level, not at the rendering level. The second is search-based sensitive word replacement: after opening an OFD document in an OFD reader, the user can use the search-and-replace function, entering the sensitive word to be replaced and the new word in the search and replace input boxes respectively; after the search-and-replace is executed, the currently found word is replaced with the new word. All operations up to the completion of the search-and-replace can be carried out in memory, and the replacement becomes persistent if the user saves the document. Note that sensitive word replacement does not involve document typesetting, so the search word and the replacement word must have the same number of characters, which keeps the overall appearance of the fixed-layout document stable.
The text in an OFD fixed-layout document is made up of a series of text objects that are separate from one another. From the user's perspective, however, the text of the whole document is continuous and so are the sensitive words. A single sensitive word may therefore be split across several text objects, and replacing it should not affect the surrounding non-sensitive text.
The solution is that sensitive word replacement cannot take the text object as its processing unit; the processing unit must be at least the document page, or even the text of the whole document. Choosing the page as the processing unit allows all text in a page to be treated as one continuous whole rather than as separate text objects, so that every sensitive word can be found in the continuous text of the page and no sensitive word is missed because it is split across text objects. At the same time, because the page is the processing unit, accurately replacing a sensitive word that has been found requires that the character structure information of each character be extracted together with the continuous text, so that each character can be located.
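The reason for searching at page level can be seen in a small sketch. The two strings below stand in for two adjacent text objects of one page (invented sample data), and `find_all` is a hypothetical helper. A sensitive word that straddles the object boundary is missed by a per-object search and found once the page text is treated as a continuous whole.

```python
def find_all(text: str, word: str) -> list:
    """Return the start index of every occurrence of word in text."""
    hits = []
    start = text.find(word)
    while start != -1:
        hits.append(start)
        start = text.find(word, start + 1)
    return hits


# Invented sample: the word "secret" is split across two text objects.
text_objects = ["This is a sec", "ret plan."]
word = "secret"

# Searching each text object on its own finds nothing.
per_object_hits = [find_all(text, word) for text in text_objects]

# Searching the concatenated page text finds the split word.
page_text = "".join(text_objects)
page_hits = find_all(page_text, word)
```

Here `per_object_hits` contains two empty lists, whereas `page_hits` contains the single start index of the word in the page text.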
In summary, in order to realize automatic replacement of sensitive words in OFD files and to improve the accuracy and efficiency of the replacement, the embodiment of the invention provides a method and a device for replacing sensitive words of an OFD file.
The following describes the method and apparatus for replacing the sensitive word of the OFD file according to the present invention with reference to fig. 1 to 5.
Fig. 1 is one of flow diagrams of a method for replacing an OFD file sensitive word provided by the present invention, and as shown in fig. 1, the method for replacing an OFD file sensitive word provided by the present invention includes:
step 11, receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on the OFD file, wherein the OFD file comprises at least one text page;
the operation instruction indicates that the sensitive words in the OFD file are to be masked;
specifically, the method for replacing sensitive words of an OFD file is applied to scenarios in which the sensitive words in the content of an OFD file need to be replaced. The execution body can be a server side or a client side. When the execution body is the server side, the user can send the operation instruction through the server side. When the execution body is the client side, that is, when a software development kit integrating the method is added to an OFD reader, the user can issue the operation instruction in the OFD reader, typically through a command such as "replace sensitive word", which is not specifically limited herein. It should be noted that sensitive word replacement performed on the server side is a persistent replacement, whereas sensitive word replacement performed in the OFD reader is a replacement at the rendering level.
Step 12, analyzing the OFD file by taking pages as units based on the operation instruction to obtain a text character string set corresponding to each page of text page in the OFD file;
specifically, in this step, each page of content in the OFD file corresponds to a text string set, and each text string set includes at least one character and a position index corresponding to each character; the embodiment of the invention processes the OFD file in the unit of page, so that the OFD file is analyzed in the unit of page based on the operation instruction. The structure of the OFD file format is generally divided into three layers: the first layer is a virtual storage system and comprises a package organization structure and an in-package directory organization structure; the second layer is a document model and comprises document, page, outline, file-level resource and other organization structures; and the third layer is the page content description in the OFD file and comprises page-level resources, graphics, images, characters and the like.
Step 13, matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set;
specifically, in this step, each character set to be replaced includes at least one character to be replaced. Each character in each text character string set is compared with the entries in the preset sensitive word list to determine which characters in the text character string set belong to sensitive words, that is, which characters require the replacement operation. The characters to be replaced and their corresponding position indexes are stored as a sensitive word character index array, which is referred to as the character set to be replaced in the embodiment of the invention.
Step 14, replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
Specifically, based on the step 13, a set of characters to be replaced corresponding to a certain page of content in the OFD file is obtained, the set of characters to be replaced includes not only the characters to be replaced but also the position indexes corresponding to the characters to be replaced, and based on the position indexes corresponding to the characters to be replaced, the characters to be replaced are replaced with target characters, wherein the target characters are user-defined and are not specifically limited herein.
According to the method for replacing sensitive words of an OFD file provided by the invention, sensitive word replacement is performed on the content of each text page in the OFD file based on the received operation instruction. The OFD file is first analyzed page by page to obtain a text character string set corresponding to each text page, the text character string set comprising a plurality of characters and a position index corresponding to each character. Each character is matched with a preset sensitive word list to determine the characters to be replaced in the page, and the characters to be replaced are replaced with target characters based on their position indexes. The method realizes automatic replacement of sensitive words in OFD files and improves the accuracy and efficiency of the replacement.
Optionally, the method for replacing the sensitive word of the OFD file provided by the embodiment of the present invention further includes the following steps before step 14. FIG. 2 is the second schematic flow chart of the method for replacing the sensitive word of the OFD file provided by the present invention; as shown in FIG. 2, before each character to be replaced is replaced with the target character based on the position index corresponding to each character to be replaced, the method further includes:
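Steps 12 to 14 can be sketched on plain strings as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each page is assumed to have already been reduced to its continuous text, the function names are hypothetical, matching uses simple substring search, and the target is a single character so that the text length, and therefore the layout, is unchanged.

```python
from typing import List, Tuple


def match_sensitive_words(page_text: str, words: List[str]) -> List[List[Tuple[int, str]]]:
    """Step 13: return one character set to be replaced per hit.

    Each entry pairs a position index in the page text with its character.
    """
    hits = []
    for word in words:
        if not word:
            continue  # an empty entry would match everywhere
        start = page_text.find(word)
        while start != -1:
            hits.append([(start + k, ch) for k, ch in enumerate(word)])
            start = page_text.find(word, start + len(word))
    return sorted(hits, key=lambda hit: hit[0][0])


def replace_sensitive_words(pages: List[str], words: List[str], target: str = "*") -> List[str]:
    """Steps 12 to 14 over a list of page texts, preserving each page's length."""
    if len(target) != 1:
        raise ValueError("target must be a single character")
    replaced_pages = []
    for page_text in pages:  # step 12: one page at a time
        chars = list(page_text)
        for hit in match_sensitive_words(page_text, words):
            for position_index, _ in hit:  # step 14: replace by position index
                chars[position_index] = target
        replaced_pages.append("".join(chars))
    return replaced_pages
```

Because every character is overwritten in place by its position index, the replaced page text always has the same number of characters as the original, which matches the requirement that replacement must not involve typesetting.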
step 21, obtaining the number N of the characters to be replaced in each character set to be replaced; n is a positive integer;
specifically, based on the above step 13, a set of characters to be replaced is obtained, and since the number of characters to be replaced included in the set of characters to be replaced is not fixed, there are different replacement rules based on the number of characters to be replaced in the set of characters to be replaced.
Step 22, determining a replacement rule based on the number N of the characters to be replaced.
According to the replacing method for the OFD file sensitive words, different replacing rules are determined based on the number of characters to be replaced in the character set to be replaced, flexibility of replacing the OFD file sensitive words is improved, and application scenes are expanded. The efficiency of the replacement of the sensitive words of the OFD file is improved to a certain extent.
Optionally, the method for replacing the sensitive word of the OFD file provided by the embodiment of the invention further includes:
And replacing each character to be replaced with the target character based on the replacement rule and the position index corresponding to each character to be replaced.
Specifically, after the replacement rule is determined based on the number N of characters to be replaced, the characters to be replaced are replaced with target characters based on the determined replacement rule and the position index corresponding to each character in the set of characters to be replaced.
Optionally, the method for replacing the sensitive word of the OFD file according to the embodiment of the present invention replaces each character to be replaced with the target character based on the replacement rule and the position index corresponding to each character to be replaced, and the specific implementation manner is as follows:
under the condition that N=1, acquiring a first position index corresponding to the character to be replaced;
setting the character to be replaced as a first empty character;
inserting a first target character at the first position index, and replacing the first empty character with the first target character;
specifically, when the character set to be replaced contains only one sensitive character, i.e., N=1, that character only needs to be replaced directly. The index of the character to be replaced, namely the first position index in the embodiment of the invention, is obtained, and the character information structure object OFD_CHAR_INFO at that position is then taken out of the cached character structure array based on the first position index. Based on the text object handle and the text fragment index, the text fragment object containing the sensitive character can be retrieved, and the character to be replaced can then be located according to its index within the text fragment. The replacement is then handled in one of two ways, depending on whether the corresponding TextCode object has a glyph transformation. It should be noted that what each entry of the text character string set actually contains is the Unicode code of the character, the text object handle, the text fragment index, and the index of the character within the text fragment. The character information structure objects for the text of a given page of the OFD file are cached in order in a character structure array, so that the array index of each cached entry is the index of that character in the complete text of the whole page. All text data in each text fragment are extracted in sequence to obtain the complete text character string of the whole page, namely the text character string set in the embodiment of the invention. In this way, both the full text character string data of a page of the OFD file and the character information structure array for each character in that string are extracted.
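The character cache described above can be sketched as follows. The four fields mirror the contents listed in the text (Unicode code, text object handle, text fragment index, index within the fragment); the Python class name `OfdCharInfo`, the field names, the integer handles and the input shape of `build_char_cache` are all assumptions made for illustration, since the source does not give the layout of OFD_CHAR_INFO.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OfdCharInfo:
    """Hypothetical stand-in for the OFD_CHAR_INFO structure."""
    unicode_code: int        # Unicode code of the character
    text_object_handle: int  # handle of the owning text object (assumed to be an int)
    fragment_index: int      # index of the text fragment inside the text object
    index_in_fragment: int   # index of the character inside that fragment


def build_char_cache(text_objects: List[Tuple[int, List[str]]]) -> Tuple[str, List[OfdCharInfo]]:
    """Extract the page text and the per-character cache in one pass.

    text_objects is a list of (handle, [fragment strings]) in page order.
    The position of an entry in the returned list is the character's
    position index in the complete page text.
    """
    cache: List[OfdCharInfo] = []
    for handle, fragments in text_objects:
        for fragment_index, fragment in enumerate(fragments):
            for index_in_fragment, ch in enumerate(fragment):
                cache.append(OfdCharInfo(ord(ch), handle, fragment_index, index_in_fragment))
    page_text = "".join(chr(info.unicode_code) for info in cache)
    return page_text, cache
```

Given a position index found in the page text, `cache[position_index]` then identifies the text object, the fragment and the offset at which the character has to be replaced.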
In the embodiment of the present invention, it should be further noted that, when the character set to be replaced contains only one sensitive character, two cases are distinguished according to whether the TextCode object corresponding to the sensitive character has a glyph transformation. A text object can be described in two ways: with or without a glyph transformation node, i.e., with or without a CGTransform node.
The case without a glyph transformation is relatively simple: the character to be replaced at the position in the TextCode of the text object given by the index is replaced with the user-defined character, that is, the sensitive character is replaced with the target character. If a glyph transformation exists, the corresponding CGTransform nodes are obtained in sequence from the current text fragment, and it is determined whether the current character falls within the range covered by the current CGTransform node; if so, the CGTransform node must first be split. The splitting is performed as follows: if the character index is equal to CodePosition, the glyph of the first character is removed from Glyphs; if the character index points to the last glyph in the current CGTransform, i.e., index = CodePosition + GlyphCount - 1, the glyph of the last character is removed from Glyphs; if the character index lies in the middle of Glyphs, the CGTransform is split into two sections, namely [CodePosition, index - 1] and [index + 1, CodePosition + GlyphCount - 1]. The character of the TextCode at the index position is then set to a null character, which removes the character at the index position; that is, the character to be replaced is set as the first empty character. A new text object is then inserted at the same position as the character to be replaced, namely the position of the first empty character, and its content is set to the new user-defined character, namely the target character.
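The three splitting cases can be sketched with a simplified model of a glyph transformation node. The class `CgTransform` and the function `split_cg_transform` are hypothetical; the sketch assumes one glyph per character, so that the node covers the character range from CodePosition to CodePosition + GlyphCount - 1, and it models Glyphs as a list of integer glyph IDs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CgTransform:
    """Simplified glyph transformation node (assumes one glyph per character)."""
    code_position: int  # index of the first character covered by the node
    glyphs: List[int]   # glyph IDs, one per covered character

    @property
    def glyph_count(self) -> int:
        return len(self.glyphs)


def split_cg_transform(node: CgTransform, index: int) -> List[CgTransform]:
    """Remove the glyph for the character at index and return what remains.

    first character -> one node starting at index + 1
    last character  -> one node with the last glyph dropped
    middle          -> two nodes, [first, index - 1] and [index + 1, last]
    outside range   -> the node is returned unchanged
    """
    first = node.code_position
    last = first + node.glyph_count - 1
    if index < first or index > last:
        return [node]
    offset = index - first
    pieces = []
    if offset > 0:
        pieces.append(CgTransform(first, node.glyphs[:offset]))
    if index < last:
        pieces.append(CgTransform(index + 1, node.glyphs[offset + 1:]))
    return pieces
```

In this model a node that covers only the replaced character yields an empty list, meaning the node would be dropped entirely; the source does not discuss that case, so this behavior is an assumption.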
More specifically, the attributes Boundary, CTM, Size, Stroke, Fill, HScale, Read Direction, Char Direction, Alpha, Weight, Italic, Fill Color, Stroke Color, and the like of the newly created text object are directly set to the corresponding attributes in the original OFD file. The Font attribute may be set to the ID of a new non-embedded SimSun font resource. More important are the positioning attributes in the Text Code of the new text object: if the index of the replaced character in the text segment of the original text object is n, then $X_{new} = X_{orig} + \mathrm{DeltaX}_0 + \mathrm{DeltaX}_1 + \dots + \mathrm{DeltaX}_{n-1}$. Y is set to the Y value of the original text object, and neither the DeltaX nor the DeltaY attribute needs to be set.
Furthermore, it should be noted that if a text object contains n sensitive characters, the CG Transform node may be split multiple times into multiple pieces. Repeated splitting still yields the correct result, because the attribute values described by a CG Transform node are global index values over all characters in the current text object.
When a glyph transformation exists, the font is typically embedded, and text replacement in this case can be difficult to handle. The solution proposed here is character-by-character replacement: in each single-character replacement, the glyph transformation at the designated position is removed by splitting the CG Transform node; to avoid affecting other characters, the CG Transform node is split into a front section and a rear section around the index of the character. Next, the character at the corresponding position in the Text Code is set to a null character. These two steps empty the sensitive character in the original text object. The text replacement effect is then achieved by inserting, at the position of the empty character, a new text object containing the user-defined character.
Or, under the condition that N is more than 1, acquiring a position index set corresponding to each character to be replaced;
determining whether the position indexes in the position index set are continuous;
setting each character to be replaced as an empty character group under the condition that the judging result is continuous;
and inserting a target character set at the position index corresponding to the empty character group, and replacing the empty character group with the target character set.
Specifically, when the character set to be replaced includes a plurality of characters to be replaced, different replacement modes apply depending on whether the characters to be replaced are continuous, so it is first necessary to judge whether the position indexes corresponding to the characters to be replaced are continuous. If they are continuous, the continuous characters to be replaced can be replaced in a combined batch mode; that is, if there are n continuous characters to be replaced in the character set to be replaced, the n characters can be replaced in one batch. Specifically, in the process of splitting the CG Transform node, assume that the indexes of the n continuous characters to be replaced are $\mathrm{Index}_1$ to $\mathrm{Index}_n$. If $\mathrm{Index}_1$ is equal to the Code Position, the glyphs of the first n characters are removed from Glyphs; if $\mathrm{Index}_n$ points to the last glyph in the current CG Transform, i.e., $\mathrm{Index}_n$ = Code Position + GlyphCount - 1, the glyphs of the last n characters are removed from Glyphs; if the character indexes are located in the middle of Glyphs, the CG Transform is split into two sections, namely a section [Code Position, $\mathrm{Index}_1$-1] and a section [$\mathrm{Index}_n$+1, Code Position+GlyphCount-1]. The characters to be replaced at positions $\mathrm{Index}_1$ to $\mathrm{Index}_n$ are then set as an empty character group. A new text object is inserted at the same position as the empty character group, and the content of the new text object is set to a new user-defined character string of length n, namely the target character set.
Specifically, a new text object is inserted at the same position as the replaced character group, and its content is set to a new user-defined character string of length n. More specifically, most attribute settings are the same as above: the attributes Boundary, CTM, Size, Stroke, Fill, HScale, Read Direction, Char Direction, Alpha, Weight, Italic, Fill Color, Stroke Color, and the like of the newly created text object are directly set to the corresponding attributes in the original OFD file. The main difference lies in the positioning attributes in the Text Code of the new text object. Assuming that the index $\mathrm{Index}_1$ of the first replaced character in the text segment of the original text object is k, then $X_{new} = X_{orig} + \mathrm{DeltaX}_0 + \mathrm{DeltaX}_1 + \dots + \mathrm{DeltaX}_{k-1}$. Y is set to the Y value of the original text object, DeltaX is set to "$\mathrm{DeltaX}_k$, $\mathrm{DeltaX}_{k+1}$, …, $\mathrm{DeltaX}_{k+n-2}$", and the DeltaY attribute need not be set.
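The positioning rule above can be checked with a short Python sketch. It assumes, purely for illustration, that the DeltaX attribute of the original Text Code has already been expanded into a plain list with one offset per gap between adjacent characters; the function name and signature are hypothetical and not part of the patent.

```python
from typing import List, Tuple


def new_text_position(x_orig: float, y_orig: float, delta_x: List[float],
                      k: int, n: int) -> Tuple[float, float, List[float]]:
    """Return (X, Y, DeltaX) for a new text object replacing characters k..k+n-1.

    delta_x[i] is the horizontal offset from character i to character i+1
    of the original text segment (assumed to be fully expanded).
    """
    if k < 0 or n < 1 or k + n - 1 > len(delta_x):
        raise ValueError("replaced range lies outside the text segment")
    # X_new = X_orig + DeltaX_0 + ... + DeltaX_(k-1)
    x_new = x_orig + sum(delta_x[:k])
    # DeltaX of the new object: DeltaX_k ... DeltaX_(k+n-2); empty when n == 1
    new_delta_x = delta_x[k:k + n - 1]
    return x_new, y_orig, new_delta_x
```

For n = 1 the returned DeltaX list is empty, which corresponds to the single-character case in which neither DeltaX nor DeltaY is set.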
Further, if there are n continuous sensitive characters in the same text object, the above steps only need to be performed once. However, if there are n non-continuous sensitive strings in the same text object, the above splitting steps need to be performed n times to complete the replacement of the n sensitive strings in the text object.
Specifically, the sensitive word list differs for different fields of OFD file content. For example, suppose that in one field the single character "one" is a sensitive word, whereas in another field only the word formed by the connected characters "one", "two" and "three" is a sensitive word; in the latter case the above method encounters continuous characters to be replaced, namely "one two three".
According to the method for replacing the OFD file sensitive words, when only one sensitive character needs to be replaced, the replacement is carried out directly and automatically; when there are a plurality of sensitive characters to be replaced, continuous characters to be replaced are replaced in batches by setting them as an empty character group and then inserting the target character set at the position of the empty character group, so that the efficiency of sensitive word replacement in the OFD file is improved.
Optionally, the method for replacing the sensitive word of the OFD file provided by the embodiment of the present invention further includes:
setting each character to be replaced as a second empty character in sequence under the condition that the judging result is discontinuous;
and sequentially inserting a second target character at the position index corresponding to each second empty character, and replacing each second empty character with the second target character.
Specifically, there is also the case where the character set to be replaced contains more than one character to be replaced but the characters are scattered; the characters to be replaced then have to be processed and replaced one by one. By way of illustration, suppose the sensitive word to be replaced is "one two three". The object with ID="39" on the OFD page is a Text Object with a CG Transform node and contains the sensitive word "one two three". The sensitive character indexes are 9, 10, 11 and 33, which form two continuous character groups, [9, 11] and [33, 33]. For the Text Object with ID="39", two rounds of sensitive word replacement are performed for the 2 character groups to be replaced; the CG Transform is split into two sections, [0, 8] and [12, 32], and the corresponding characters to be replaced in the Text Code are set as empty character groups.
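The grouping of position indexes into continuous character groups used in this example can be sketched as follows. The function name is hypothetical, and the sketch simply merges adjacent indexes regardless of which sensitive word they came from, which is an assumption rather than something the patent states.

```python
from typing import Iterable, List, Tuple


def group_consecutive(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Group position indexes into inclusive [first, last] runs."""
    runs: List[Tuple[int, int]] = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)   # extend the current run
        else:
            runs.append((i, i))           # start a new run
    return runs


# Indexes from the example above: two groups, hence two replacement rounds.
print(group_consecutive([9, 10, 11, 33]))  # [(9, 11), (33, 33)]
```

Each resulting run can then be handled by one batch replacement round, so the number of rounds equals the number of runs rather than the number of characters.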
According to the method for replacing the OFD file sensitive words provided by the embodiment of the invention, when there are a plurality of sensitive characters to be replaced and the position indexes corresponding to the characters to be replaced are judged to be discontinuous, the characters to be replaced are replaced one by one, which avoids erroneous replacement and improves the accuracy of sensitive word replacement in the OFD file.
In addition to the above replacement method, the embodiment of the invention also provides a method for replacing the sensitive words of an OFD file in an OFD reader. The OFD file is parsed and rendered in the OFD reader. The find-and-replace function is started, and the sensitive word to be replaced and the new word that replaces it are entered in the find-and-replace input boxes, for example, searching for "one two three" and replacing it with a user-defined word. Here the search word must be limited to the same length as the replacement word. All text string data of the current page and the character information structure array of each character in the text string are extracted; the detailed steps are the same as above and are not repeated. In addition, an array of the bounding rectangles (the "external matrix") of the characters needs to be constructed; the basic steps are similar to those for constructing the character information structure array, the only difference being that the bounding rectangle of each character is extracted. When a "find next" command sent by the user is obtained, the index of the first character of the sensitive word is found in the text string extracted from the current document, the bounding rectangle of the character at that index is looked up in the constructed bounding rectangle array, and the bounding rectangles of all characters of the sensitive word are found in the same way; this gives the specific position in the page of the first occurrence of the sensitive word, which is set to be highlighted. A replacement instruction sent by the user is then obtained, and the currently highlighted sensitive word is replaced according to the detailed sensitive word replacement steps described above. The rendering is refreshed, and the replaced data are displayed in real time.
The search then continues: according to the searched sensitive word, the index of the first character of the next occurrence of the sensitive word is found in the text string extracted from the current document. The bounding rectangle (the "external matrix") of the character at that index is then looked up in the constructed bounding rectangle array, the bounding rectangles of all characters of the sensitive word are found, and the specific position in the page of the second occurrence of the sensitive word is obtained and set to be highlighted. The subsequent steps proceed by analogy. In this way, all occurrences of the sensitive word can be replaced in sequence in the find scenario.
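The find-next and highlight steps can be sketched in Python as below. This is only an illustration under stated assumptions: each character's bounding rectangle is modeled as an `(x_min, y_min, x_max, y_max)` tuple stored at the character's index in the page text, all characters of one occurrence are assumed to lie on one line so that a single enclosing rectangle is meaningful, and both function names are hypothetical.

```python
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def find_occurrences(page_text: str, word: str) -> Iterator[int]:
    """Yield the index of the first character of each occurrence of `word`."""
    if not word:
        return
    start = page_text.find(word)
    while start != -1:
        yield start
        # "Find next": continue searching after the current occurrence.
        start = page_text.find(word, start + len(word))


def highlight_box(boxes: List[Box], start: int, length: int) -> Box:
    """Enclosing rectangle of the characters start..start+length-1."""
    span = boxes[start:start + length]
    return (min(b[0] for b in span), min(b[1] for b in span),
            max(b[2] for b in span), max(b[3] for b in span))
```

Because the generator is lazy, each "find next" command from the user can be served by advancing it once, and the returned start index feeds both the highlight rectangle and the replacement step.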
Optionally, in the method for replacing the sensitive words of the OFD file provided by the embodiment of the present invention, step 12 specifically includes the following steps. Fig. 3 is a third flow chart of the method for replacing the sensitive words of the OFD file provided by the present invention. As shown in fig. 3, parsing the OFD file in units of pages based on the operation instruction to obtain a text string set corresponding to each text page in the OFD file includes:
step 121, acquiring the object type of the character structure information of each page in the OFD file;
step 122, judging the object type, and, under the condition that the object type is judged to be the text type, acquiring the text character string corresponding to the character structure information, so as to obtain a text character string set corresponding to each text page in the OFD file.
Specifically, in this step the OFD file is still processed in units of pages, and the object type is one of a non-text object and a text object. The character structure information of the current page of the OFD file is extracted, and each object is traversed in turn. If the object is a non-text object such as an Image object, a Path object or a Video object, it is skipped directly. If the object is a text object, each text segment in the current text object is traversed in turn and each character in the text segment is traversed, and information such as the Unicode code of the current character, the text object handle, the text segment index and the index of the character in the text segment is stored in a character information structure OFD_CHAR_INFO object. The object is cached in a character structure array, where the array index is the index of the current character in the complete text of the whole page. If the corresponding Text Code carries a CG Transform glyph transformation, the Unicode of the corresponding character is fetched from the font based on the glyph of the glyph transformation. All text data in the text segments are extracted in sequence to obtain the complete text string of the whole page.
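The traversal described above can be sketched in Python. The dictionary layout of page objects and the `CharInfo` class are hypothetical stand-ins (the latter loosely mirrors the OFD_CHAR_INFO structure named in the text), and the glyph-to-Unicode lookup for text with a CG Transform is deliberately left out, so the sketch assumes every character is already available as Unicode.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CharInfo:
    """Hypothetical counterpart of the OFD_CHAR_INFO structure."""
    unicode: int        # Unicode code point of the character
    object_id: str      # stands in for the text object handle
    segment_index: int  # index of the text segment within the text object
    char_index: int     # index of the character within the text segment


def extract_page_text(objects: List[Dict]) -> Tuple[str, List[CharInfo]]:
    """Build the page text string and the parallel character-info array."""
    chars: List[str] = []
    infos: List[CharInfo] = []
    for obj in objects:
        if obj["type"] != "text":
            continue  # skip Image, Path, Video and other non-text objects
        for seg_idx, segment in enumerate(obj["segments"]):
            for ch_idx, ch in enumerate(segment):
                # The list position of this entry is the character's index
                # in the complete text of the whole page.
                infos.append(CharInfo(ord(ch), obj["id"], seg_idx, ch_idx))
                chars.append(ch)
    return "".join(chars), infos
```

Keeping the string and the array index-aligned is what lets a match position in the page text be mapped back to a specific text object, segment and character for replacement.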
According to the method for replacing the sensitive words of the OFD file, the OFD file is parsed in the above manner to obtain text string sets in a format that facilitates the subsequent sensitive word replacement.
Optionally, in the method for replacing the sensitive words of the OFD file provided by the embodiment of the present invention, the preset sensitive word list is determined based on the content field of the OFD file.
Specifically, the content of OFD files belongs to different fields, and the corresponding preset sensitive word lists differ accordingly; each list is determined by personnel in the relevant field.
According to the method for replacing the sensitive words of the OFD file, the corresponding preset sensitive word list is determined based on the field to which the content of the OFD file belongs. For example, for an OFD file in the news field, the list needs to be created based on sensitive words that tend to appear in news; if the content of the OFD file contains key technical terms that need to be kept secret, the corresponding preset sensitive word list consists of the main technical vocabulary of that field. With this setting, the accuracy of sensitive word replacement in the OFD file is improved.
The device for replacing the OFD file sensitive words provided by the invention is described below, and the device for replacing the OFD file sensitive words described below and the method for replacing the OFD file sensitive words described above can be correspondingly referred to each other. Fig. 4 is a schematic structural diagram of a device for replacing sensitive words of an OFD file according to the present invention, as shown in fig. 4, including:
The receiving module 41 is configured to receive an operation instruction, where the operation instruction is used to perform sensitive word replacement on an OFD file, and the OFD file includes at least one text page;
the parsing module 42 is configured to parse the OFD file in units of pages based on the operation instruction, so as to obtain a text string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character;
the matching module 43 is configured to match each character in each text string set with a preset sensitive word list, and determine a to-be-replaced character set corresponding to each text string set; each character set to be replaced comprises at least one character to be replaced;
and a replacing module 44, configured to replace each character to be replaced with a target character based on the position index corresponding to each character to be replaced.
According to the device for replacing the sensitive words of the OFD file, sensitive word replacement is carried out through cooperation among the modules. Specifically, based on a received operation instruction for performing sensitive word replacement on the content of each text page in the OFD file, the OFD file is parsed in units of pages to obtain a text string set corresponding to each text page, the text string set comprising a plurality of characters and the position index corresponding to each character; each character is matched against a preset sensitive word list to determine the characters to be replaced in each page of the OFD file; and, based on the position indexes corresponding to the characters to be replaced, the characters to be replaced are replaced with target characters. The device thereby realizes automatic replacement of the sensitive words of the OFD file and improves the accuracy and efficiency of the replacement.
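As an illustration of what the matching module computes, the following Python sketch matches a page's text string against a sensitive word list and returns the position indexes of the characters to be replaced. Plain substring matching via the standard `re` module is an assumption made here for brevity; the patent does not specify the matching algorithm, and the function name is hypothetical.

```python
import re
from typing import Iterable, List


def chars_to_replace(page_text: str, sensitive_words: Iterable[str]) -> List[int]:
    """Return the sorted position indexes of all characters to be replaced."""
    hits = set()
    for word in sensitive_words:
        if not word:
            continue  # ignore empty entries in the word list
        # re.escape treats the word literally; finditer yields
        # non-overlapping matches from left to right.
        for match in re.finditer(re.escape(word), page_text):
            hits.update(range(match.start(), match.end()))
    return sorted(hits)


print(chars_to_replace("xxABCxxDx", ["ABC", "D"]))  # [2, 3, 4, 7]
```

The returned indexes correspond to the character set to be replaced of one text page; the replacing module would then act on them through the per-character position information.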
Optionally, the matching module is further configured to:
acquiring the number N of the characters to be replaced in each character set to be replaced, N being a positive integer;
and determining a replacement rule based on the number N of the characters to be replaced.
And replacing each character to be replaced with a target character based on the replacement rule and the position index corresponding to each character to be replaced.
Optionally, the replacing module is specifically configured to:
acquiring a first position index corresponding to the character to be replaced under the condition that N=1;
setting the character to be replaced as a first empty character;
inserting a first target character at the first position index, and replacing the first null character with the first target character;
or under the condition that N is more than 1, acquiring a position index set corresponding to each character to be replaced;
determining whether each position index in the position index set is continuous;
setting each character to be replaced as an empty character group under the condition that the judging result is continuous;
and inserting a target character set at the position index corresponding to the empty character group, and replacing the empty character group with the target character set.
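The choice of replacement rule from the number N and the continuity of the position indexes, as configured for the modules above, might be sketched as follows. The rule labels `"single"`, `"batch"` and `"one_by_one"` are hypothetical names introduced for this illustration only.

```python
from typing import List


def replacement_rule(indices: List[int]) -> str:
    """Pick a replacement rule for one character set to be replaced."""
    n = len(indices)
    if n < 1:
        raise ValueError("N must be a positive integer")
    if n == 1:
        return "single"       # one empty character, one target character
    ordered = sorted(indices)
    continuous = all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
    # Continuous indexes: one empty character group, one target character set.
    # Otherwise: the characters are handled in sequence.
    return "batch" if continuous else "one_by_one"
```

For example, the index set 9, 10, 11 selects the batch rule, while 9, 10, 11, 33 is discontinuous and falls back to sequential handling.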
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the above-described method of replacing sensitive words of an OFD file, the method comprising: receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page; analyzing the OFD file by taking a page as a unit based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character; matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced; and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, where the computer program when executed by a processor can perform a method for replacing an OFD file sensitive word provided by the above methods, where the method includes: receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page; analyzing the OFD file by taking a page as a unit based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character; matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced; and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method for replacing an OFD file sensitive word provided by the above methods, the method comprising: receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page; analyzing the OFD file by taking a page as a unit based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character; matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced; and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for replacing sensitive words in an Open Fixed-layout Document (OFD) file, characterized by comprising the following steps:
receiving an operation instruction; the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page;
analyzing the OFD file by taking a page as a unit based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character;
matching each character in each text character string set with a preset sensitive word list, and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced;
and replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
2. The method for replacing the OFD file sensitive word according to claim 1, wherein before replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced, the method comprises:
acquiring the number N of the characters to be replaced in each character set to be replaced, N being a positive integer;
and determining a replacement rule based on the number N of the characters to be replaced.
3. The method for replacing the OFD file sensitive words according to claim 2, characterized in that the method further comprises:
and replacing each character to be replaced with a target character based on the replacement rule and the position index corresponding to each character to be replaced.
4. The method for replacing the OFD file sensitive word according to claim 3, wherein replacing each character to be replaced with the target character based on the replacement rule and the position index corresponding to each character to be replaced, comprises:
acquiring a first position index corresponding to the character to be replaced under the condition that N=1; setting the character to be replaced as a first empty character; inserting a first target character at the first position index, and replacing the first null character with the first target character;
or under the condition that N is more than 1, acquiring a position index set corresponding to each character to be replaced; determining whether each position index in the position index set is continuous;
setting each character to be replaced as an empty character group under the condition that the judging result is continuous; and inserting a target character set at the position index corresponding to the empty character group, and replacing the empty character group with the target character set.
5. The method for replacing OFD file sensitive words according to claim 4, characterized in that the method further comprises:
setting each character to be replaced as a second empty character in sequence under the condition that the judging result is discontinuous;
and sequentially inserting a second target character at the position index corresponding to each second empty character, and replacing each second empty character with the second target character.
6. The method for replacing sensitive words of an OFD file according to claim 1, wherein the parsing the OFD file in units of pages based on the operation instruction to obtain a text string set corresponding to each text page in the OFD file comprises:
acquiring the object type of the character structure information of each page in the OFD file; the object type is one of a non-text object and a text object;
judging the type of the object type, and under the condition that the object type is judged to be the text type, acquiring the text character strings corresponding to the character structure information to obtain a text character string set corresponding to each text page in the OFD file.
7. The method for replacing sensitive words of an OFD file according to claim 1, wherein the preset sensitive word list is determined based on the content domain of the OFD file.
8. An OFD file sensitive word replacing device, comprising:
the receiving module is used for receiving an operation instruction, wherein the operation instruction is used for carrying out sensitive word replacement on an OFD file, and the OFD file comprises at least one text page;
the analysis module is used for analyzing the OFD file by taking pages as units based on the operation instruction to obtain a text character string set corresponding to each text page in the OFD file; each text string set comprises at least one character and a position index corresponding to each character;
the matching module is used for matching each character in each text character string set with a preset sensitive word list and determining a character set to be replaced corresponding to each text character string set; each character set to be replaced comprises at least one character to be replaced;
and the replacing module is used for replacing each character to be replaced with the target character based on the position index corresponding to each character to be replaced.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of replacing sensitive words of an OFD file according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method of replacing sensitive words of an OFD file according to any one of claims 1 to 7.
CN202311451554.9A 2023-11-02 2023-11-02 Method and device for replacing OFD file sensitive words Pending CN117574855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311451554.9A CN117574855A (en) 2023-11-02 2023-11-02 Method and device for replacing OFD file sensitive words

Publications (1)

Publication Number Publication Date
CN117574855A true CN117574855A (en) 2024-02-20

Family

ID=89883412

Country Status (1)

Country Link
CN (1) CN117574855A (en)

Similar Documents

Publication Publication Date Title
US9268768B2 (en) Non-standard and standard clause detection
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
US20110258202A1 (en) Concept extraction using title and emphasized text
CN111176650B (en) Parser generation method, search method, server, and storage medium
US10936667B2 (en) Indication of search result
CN107358208A (en) A kind of PDF document structured message extracting method and device
US11663408B1 (en) OCR error correction
CN111563380A (en) Named entity identification method and device
CN110704719B (en) Enterprise search text word segmentation method and device
JP7040227B2 (en) Information processing programs, information processing methods, and information processing equipment
JPWO2009048149A1 (en) Electronic document equivalence judgment system and equivalence judgment method
CN111160445B (en) Bid file similarity calculation method and device
CN111240962B (en) Test method, test device, computer equipment and computer storage medium
US7779351B2 (en) Coloring a generated document by replacing original colors of a source document paragraph with colors to identify the paragraph and with colors to mark color boundries
CN114579796B (en) Machine reading understanding method and device
CN117574855A (en) Method and device for replacing OFD file sensitive words
CN112513863A (en) Deriving non-mesh components when exporting 3D objects to a 3D file format
US8578268B2 (en) Rendering electronic documents having linked textboxes
CN115221266A (en) Raw corpus retrieval method and device, electronic equipment and storage medium
US20110296292A1 (en) Efficient application-neutral vector documents
CN116340263B (en) Word document conversion method and device based on machine identification and storage medium
CN112395865B (en) Check method and device for customs clearance sheet
CN117891904A (en) Searching method, terminal device and computer readable storage medium
CN114116603A (en) ePub file format conversion method, device, equipment and readable storage medium
CN116384362A (en) Presentation generation method and device, electronic equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination