CN116776862A - Sensitive word shielding method, device, equipment and medium of OFD file - Google Patents
- Publication number
- CN116776862A CN202311076705.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a sensitive word shielding method, device, equipment and medium for an OFD file, belonging to the technical field of document processing. The method comprises: receiving a target instruction for an OFD file, the target instruction instructing that sensitive words in the OFD file be shielded; parsing the OFD file based on the target instruction to obtain at least one text character object in the OFD file; structuring each text character object of the at least one text character object to obtain structural data corresponding to each text character object; inputting each piece of structural data into a sensitive word detection model to obtain identification information, output by the sensitive word detection model, corresponding to each piece of structural data, the identification information indicating the positions and levels of the sensitive words in the corresponding structural data; and shielding each sensitive word in the OFD file based on the identification information and a preset level mechanism.
Description
Technical Field
The invention relates to the technical field of document processing, in particular to a sensitive word shielding method, device, equipment and medium of an OFD file.
Background
The Open Fixed-layout Document (OFD) format is an electronic document format intended to replace other electronic document formats such as the Portable Document Format (PDF). It is format-independent, has a fixed layout and a fixed rendering, and is mainly used in scenarios such as finalizing and archiving documents.
In the application scenarios of OFD files, sensitive words in the content of an OFD file need to be detected. In the prior art, the sensitive words in an OFD file are generally found by manual review, and the file is then edited so that it complies with the relevant specifications.
Existing sensitive word detection methods and the corresponding editing methods for OFD files cannot shield the sensitive words in an OFD file automatically and accurately.
Disclosure of Invention
The invention provides a sensitive word shielding method, device, equipment and medium for an OFD file, which overcome the defect of the prior art that sensitive words in an OFD file cannot be shielded automatically and accurately, and which achieve accurate, automatic shielding of the sensitive words in an OFD file.
The invention provides a sensitive word shielding method for an OFD file, comprising:
receiving a target instruction for an OFD file, the target instruction instructing that sensitive words in the OFD file be shielded;
parsing the OFD file based on the target instruction to obtain at least one text character object in the OFD file;
structuring each text character object of the at least one text character object to obtain structural data corresponding to each text character object;
inputting each piece of structural data into a sensitive word detection model to obtain identification information, output by the sensitive word detection model, corresponding to each piece of structural data, the identification information indicating the sensitive words in the corresponding structural data and the position and level of each sensitive word;
and shielding each sensitive word in the OFD file based on each piece of identification information and a preset level mechanism.
According to the sensitive word shielding method for an OFD file provided by the invention, parsing the OFD file based on the target instruction to obtain at least one text character object in the OFD file comprises:
acquiring the page objects of at least one page in the OFD file, each page object being one of a graphic object, an image object and a text character object;
and determining the object type of each page object, and acquiring the text character object of the page object when the object type is determined to be the text type.
According to the sensitive word shielding method for an OFD file provided by the invention, structuring each text character object of the at least one text character object to obtain the structural data corresponding to each text character object comprises:
splitting each text character object into characters to obtain at least one character corresponding to each text character object;
acquiring attribute information of each character, the attribute information comprising at least one of: the number of the page object in which the character is located, the chapter index of the character, and the position index of the character within its chapter;
and obtaining the structural data corresponding to each text character object based on the attribute information of each character.
According to the sensitive word shielding method for an OFD file provided by the invention, shielding each sensitive word in the OFD file based on the identification information and a preset level mechanism comprises:
matching, based on the level of each sensitive word, the shielding strategy corresponding to that sensitive word in the preset level mechanism, the preset level mechanism comprising a mapping between sensitive word levels and shielding strategies;
and shielding each sensitive word in the OFD file based on the position of the sensitive word and the shielding strategy corresponding to it.
According to the sensitive word shielding method for an OFD file provided by the invention, the shielding strategy comprises at least one of the following:
covering sensitive words;
replacing sensitive words;
deleting sensitive words.
The sensitive word shielding method for an OFD file provided by the invention further comprises:
storing the replaced sensitive word when the shielding strategy includes replacing sensitive words.
According to the sensitive word shielding method of the OFD file, the sensitive word detection model is based on artificial intelligence.
The invention also provides a sensitive word shielding device for an OFD file, comprising:
a receiving module, configured to receive a target instruction for an OFD file, the target instruction instructing that sensitive words in the OFD file be shielded;
a parsing module, configured to parse the OFD file based on the target instruction to obtain at least one text character object in the OFD file;
a processing module, configured to structure each text character object of the at least one text character object to obtain structural data corresponding to each text character object;
a detection module, configured to input each piece of structural data into a sensitive word detection model to obtain identification information, output by the sensitive word detection model, corresponding to each piece of structural data, the identification information indicating the sensitive words in the corresponding structural data and the position and level of each sensitive word;
and a shielding module, configured to shield each sensitive word in the OFD file based on each piece of identification information and a preset level mechanism.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the sensitive word shielding method of the OFD file according to any one of the above when executing the program.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a sensitive word masking method of an OFD file as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a sensitive word masking method of an OFD file as described in any one of the above.
With the sensitive word shielding method, device, equipment and medium for an OFD file provided by the invention, after a target instruction to shield the sensitive words in an OFD file is received, the OFD file is parsed to obtain all of its text character objects, and the text character objects are structured to obtain the corresponding structural data. The sensitive word detection model then detects sensitive words in the structural data and outputs each sensitive word, its position in the OFD file and its level, which improves the accuracy of sensitive word detection for OFD files. Finally, the sensitive word levels are matched against the preset level mechanism, so that the sensitive words in the OFD file are shielded automatically.
Drawings
To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is the first schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention;
FIG. 2 is the second schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention;
FIG. 3 is the third schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention;
FIG. 4 is a schematic diagram of the structural data in the sensitive word shielding method for an OFD file provided by the present invention;
FIG. 5 is the fourth schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention;
FIG. 6 is a schematic structural diagram of the sensitive word shielding device for an OFD file provided by the present invention;
FIG. 7 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
To make the embodiments of the present invention easier to understand, some relevant background knowledge is presented first.
An OFD file is device-independent and has a fixed layout. It is completely consistent with the paper document in format, layout, fonts, font sizes and so on. These characteristics make the fixed-layout document format an ideal format for the formal release of electronic documents and for the dissemination and archiving of digital information, and it is mainly used in scenarios such as finalizing, fixing the layout of, and archiving documents. Examples include promotional materials and brochures, academic papers and research reports, product manuals, user guides, presentation files, electronic contracts and official documents. Precisely because of these application scenarios, the sensitive words in an OFD file need to be checked, and the sensitive words found need to be processed accordingly.
In the prior art, the check is usually performed manually by staff, which is time-consuming and laborious and cannot guarantee completeness or accuracy, and the sensitive words found still have to be processed separately.
In summary, to shield the sensitive words in an OFD file automatically and to improve the efficiency and accuracy of sensitive word checking for OFD files, the embodiment of the invention provides a sensitive word shielding method, device, electronic equipment and storage medium for an OFD file.
The sensitive word shielding method for an OFD file can be applied in the technical field of document processing. Optionally, the method can be implemented on the server side, where a large number of OFD files are checked and shielded in a public cloud or private cloud; or it can be implemented on the client side, embedded in an OFD file reader in the form of a software development kit.
The sensitive word shielding method, device, equipment and medium of the OFD file are described below with reference to FIGS. 1-7.
FIG. 1 is the first schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention. As shown in FIG. 1, the method comprises:
Step 101, receiving a target instruction for an OFD file;
the target instruction instructs that the sensitive words in the OFD file be shielded;
Specifically, the sensitive word shielding method for an OFD file is applied in scenarios where sensitive words in the content of an OFD file need to be shielded. The execution body can be a server or a client. When the execution body is a server, the user can issue the target instruction through the server. When the execution body is a client, that is, when a software development kit integrating the method has been added to an OFD reader, the user can issue the target instruction in the OFD reader, typically through a command such as "shield sensitive words"; this is not specifically limited here. The target instruction may concern one OFD file or several: the sensitive words of a single OFD file can be shielded, or the sensitive words of several OFD files can be shielded in a batch at the same time.
Step 102, parsing the OFD file based on the target instruction to obtain at least one text character object in the OFD file;
Specifically, in this step the OFD file is parsed after the target instruction from the user is received. The structure of the OFD file format is generally divided into three layers: the first layer is a virtual storage system, comprising the package organization structure and the directory organization structure inside the package; the second layer is the document model, comprising organization structures such as documents, pages, outlines and file-level resources; and the third layer is the page content description of the OFD file, comprising page-level resources, graphics, images, text and so on.
The page content description includes three basic primitive objects:
graphic object: a region formed by a series of Bezier curves and circular arcs; a graphic object can be filled or stroked;
image object: a rectangular area in which each pixel value determines the color of a specified point;
text character object: a series of characters together with the positioning information of each character; the glyph of each character is determined by the specified font and other parameters, and a text character object can be filled or stroked.
In this embodiment of the invention, after the target instruction from the user is received, the OFD file is parsed according to the three-layer structure to obtain the page content description, and the text character objects in the page content description are then extracted as the basis for the subsequent sensitive word retrieval.
Step 103, structuring each text character object of the at least one text character object to obtain structural data corresponding to each text character object;
Specifically, in the embodiment of the present invention the OFD file is parsed page by page. After the text character objects of each page are obtained, they are processed to form structural data; concretely, every character contained in the text character objects of each page of the OFD file is structured. The structural data of each character contains its exact location on the corresponding page.
Step 104, inputting each piece of structural data into the sensitive word detection model to obtain identification information, output by the sensitive word detection model, corresponding to each piece of structural data; the identification information indicates the positions and levels of the sensitive words in the corresponding structural data;
Specifically, after the text character objects of each page in the OFD file have been structured, all the resulting structural data of the OFD file is input into the sensitive word detection model. The sensitive word detection model in the embodiment of the invention is based on artificial intelligence technology and uses semantic analysis to check the structured characters for sensitive words in sequence. Semantic analysis is a branch of artificial intelligence (AI) and a core task of natural language processing: it refers to using various methods to learn and understand the semantic content expressed by a piece of text, and any understanding of language can be classed as semantic analysis. A piece of text is typically composed of words, sentences and paragraphs, so, depending on the linguistic unit being understood, semantic analysis can be further divided into word-level, sentence-level and chapter-level semantic analysis. In practice, the sensitive words in an OFD file may be obfuscated to evade ordinary sensitive word retrieval, for example by abbreviating a name in English, inserting spaces or special symbols into the sensitive word, or mixing Chinese and English. Using its semantic analysis function, the artificial intelligence sensitive word detection model checks the text character objects in the OFD file in sequence, with the character as the smallest unit, and outputs identification information. In this embodiment the identification information gives each sensitive word in the OFD file, its specific position, and its level.
The level of a sensitive word is preset; for example, it may be preset according to the type of the OFD file, which is not specifically limited here.
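The shape of the identification information (word, position, level) can be illustrated with a deliberately simple stand-in for the detection model. The sketch below is not the artificial intelligence model described above: it is a dictionary lookup that only tolerates inserted spaces and symbols, and the lexicon, level values and function name are hypothetical.

```python
# Hypothetical lexicon mapping each sensitive word to a preset level.
LEXICON = {"secret": 1, "banned": 3}


def detect_sensitive_words(text: str, lexicon: dict[str, int] = LEXICON) -> list[dict]:
    """Return word, position and level for every lexicon hit in `text`."""
    # Keep letters and digits only, remembering where each one sat originally,
    # so that inserted spaces and symbols do not hide a word.
    kept = [(i, ch.lower()) for i, ch in enumerate(text) if ch.isalnum()]
    normalized = "".join(ch for _, ch in kept)

    hits = []
    for word, level in lexicon.items():
        start = normalized.find(word)
        while start != -1:
            first = kept[start][0]                 # original index of first character
            last = kept[start + len(word) - 1][0]  # original index of last character
            hits.append({"word": word, "start": first, "end": last + 1, "level": level})
            start = normalized.find(word, start + 1)
    return sorted(hits, key=lambda hit: hit["start"])
```

Because separators are stripped before matching, this stand-in can also match across word boundaries; a semantic model of the kind the embodiment describes would be expected to avoid such false positives.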
Step 105, shielding each sensitive word in the OFD file based on each piece of identification information and a preset level mechanism.
Specifically, the identification information output by the sensitive word detection model is obtained, namely what each sensitive word in the OFD file is, where it is located, and what its level is. In a specific implementation, a shielding level mechanism is preset. The level of each sensitive word is matched in the shielding level mechanism to determine the corresponding shielding strategy, and the corresponding shielding operation is then performed on each sensitive word according to its strategy.
With the sensitive word shielding method for an OFD file provided by the invention, after a target instruction to shield the sensitive words in an OFD file is received, the OFD file is parsed to obtain all of its text character objects, and the text character objects are structured to obtain the corresponding structural data. The sensitive word detection model detects sensitive words in the structural data and outputs each sensitive word, its position in the OFD file and its level, which improves the accuracy of sensitive word detection for OFD files. Finally, the level of each sensitive word is matched against the preset level mechanism, the shielding strategy corresponding to that level is determined and executed, and the sensitive words in the OFD file are thereby shielded automatically.
Optionally, FIG. 2 is the second schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention. As shown in FIG. 2, step 102 specifically includes:
Step 1021, acquiring the page objects of at least one page in the OFD file; each page object is one of a graphic object, an image object and a text character object;
Step 1022, determining the object type of each page object, and acquiring the text character object of the page object when the object type is determined to be the text type.
Specifically, in the embodiment of the invention, after the target instruction from the user is received, the OFD file is first parsed according to the three-layer structure to obtain the page content description, and the text character objects in the page content description are then extracted as the basis for the subsequent sensitive word retrieval.
With the sensitive word shielding method for an OFD file provided by the invention, the text-type content of the OFD file is obtained by parsing the structure of the file, which lays the foundation for the subsequent structuring and makes it convenient for the sensitive word detection model to detect sensitive words in the text content.
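As a minimal sketch of steps 1021 and 1022, the snippet below walks the objects of one page and keeps only the text objects. The namespace and the element names (`Layer`, `PathObject`, `ImageObject`, `TextObject`, `TextCode`) are assumptions taken from the public OFD standard (GB/T 33190-2016), not from this description, and the sample page is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and element names, following the public OFD standard;
# they are not quoted from this description.
OFD_NS = "http://www.ofdspec.org/2016"

# Invented sample page with one graphic, one image and one text object.
SAMPLE_PAGE = f"""<ofd:Page xmlns:ofd="{OFD_NS}">
  <ofd:Content>
    <ofd:Layer ID="1">
      <ofd:PathObject ID="2" Boundary="10 10 50 50"/>
      <ofd:ImageObject ID="3" Boundary="10 70 50 50" ResourceID="9"/>
      <ofd:TextObject ID="4" Boundary="20 130 100 10" Font="5" Size="4">
        <ofd:TextCode X="0" Y="4">Hello Beijing</ofd:TextCode>
      </ofd:TextObject>
    </ofd:Layer>
  </ofd:Content>
</ofd:Page>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text_objects(page_xml: str) -> list[dict]:
    """Return one record per text object on the page; skip other object types."""
    root = ET.fromstring(page_xml)
    records = []
    for layer in root.iter(f"{{{OFD_NS}}}Layer"):
        for obj in layer:
            if local_name(obj.tag) != "TextObject":
                continue  # graphic and image objects carry no text
            text = "".join(
                (code.text or "") for code in obj.iter(f"{{{OFD_NS}}}TextCode")
            )
            records.append({"object_id": obj.get("ID"), "text": text})
    return records
```

Calling `extract_text_objects(SAMPLE_PAGE)` yields a single record for the text object with ID 4; a real OFD package would first need to be unpacked to reach each page's content file.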
Optionally, FIG. 3 is the third schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention. As shown in FIG. 3, step 103 specifically includes:
Step 1031, splitting each text character object into characters to obtain at least one character corresponding to each text character object;
Step 1032, acquiring attribute information of each character; the attribute information includes at least one of the page index of the character, the chapter index of the character, and the code of the character;
Step 1033, obtaining the structural data corresponding to each text character object based on the attribute information of each character.
Specifically, the OFD file is parsed page by page. After the text character objects of each page are obtained, they are processed to form structural data; concretely, every character contained in the text character objects of each page of the OFD file is structured. The structural data of each character contains its exact location on the corresponding page.
FIG. 4 is a schematic diagram of the structural data in the sensitive word shielding method for an OFD file provided by the present invention. As shown in FIG. 4, after a page of the OFD file is parsed, the text character object of the page is structured to obtain the attribute information of each character in it. In the example of FIG. 4 the text character object in the page is "hello beijing". After structuring, each character has its own attribute information: the Piece Index, that is, the index of the page where the character is located; the Char Index, that is, the chapter index of the character; and the Char Code, that is, the code of the character itself. Once each character has been associated with its attribute information, all the characters of a page together with their attribute information form the structural data of that page.
With the sensitive word shielding method for an OFD file provided by the invention, structuring the text character objects of the OFD file amounts to determining, character by character, the attribute information of each character in the text character objects, namely the page index of the character, the chapter index of the character and the code of the character. Each character is associated with its attribute information to obtain the structural data of the text character object, and this structural data can be processed by the sensitive word detection model to detect the sensitive words in the OFD file.
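A minimal sketch of the per-character structuring might look as follows. The field names mirror the Piece Index, Char Index and Char Code of FIG. 4; treating the Char Index as the character's position within the page text, and the Char Code as its Unicode code point, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharRecord:
    page_index: int  # "Piece Index": index of the page the character is on
    char_index: int  # "Char Index": assumed here to be the position in the page text
    char_code: int   # "Char Code": assumed here to be the Unicode code point


def structure_page(page_index: int, text: str) -> list[CharRecord]:
    """Turn the text of one page into one attribute record per character."""
    return [CharRecord(page_index, i, ord(ch)) for i, ch in enumerate(text)]
```

For example, `structure_page(0, "Hi")` produces two records, one for each character, both tagged with page index 0.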
Optionally, FIG. 5 is the fourth schematic flow chart of the sensitive word shielding method for an OFD file provided by the present invention. As shown in FIG. 5, step 105 specifically includes:
Step 1051, matching, based on the level of each sensitive word, the shielding strategy corresponding to that sensitive word in the preset level mechanism; the preset level mechanism includes a mapping between sensitive word levels and shielding strategies;
Specifically, the identification information output by the sensitive word detection model is obtained, namely what each sensitive word in the OFD file is, where it is located, and what its level is. In a specific implementation, a shielding level mechanism is preset. The level of each sensitive word is matched in the shielding level mechanism to determine the corresponding shielding strategy, and the corresponding shielding operation is then performed on each sensitive word according to its strategy.
Step 1052, shielding each sensitive word in the OFD file based on the position of the sensitive word and the shielding strategy corresponding to it.
Specifically, the identification information in the result returned by the sensitive word detection model is received, so that the structural data of each structured character and its corresponding identification information are available. If the sensitive word detection model detects that a character belongs to a sensitive word, the identification information may include the code of the character itself, the page index of the character in the OFD file, the chapter index of the character in the page, and the level of the sensitive word. The level of the sensitive word is matched against the preset level mechanism to obtain the corresponding shielding strategy. For example, if the level of the sensitive word is the first level, the preset level mechanism is searched for the shielding strategy to be executed for first-level sensitive words, and that strategy is then executed on the sensitive word.
With the sensitive word shielding method for an OFD file provided by the invention, different shielding mechanisms are preset for different levels of sensitive words, so that the sensitive words of an OFD file are processed by level and the shielding of sensitive words becomes more flexible.
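The preset level mechanism is essentially a lookup table from level to strategy. The sketch below assumes the example assignment used in this description (first level covers, second level replaces, third level deletes); the names `Strategy`, `LEVEL_MECHANISM` and `match_strategy` are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    COVER = "cover"
    REPLACE = "replace"
    DELETE = "delete"


# Example mapping only; a deployment would configure its own levels.
LEVEL_MECHANISM = {1: Strategy.COVER, 2: Strategy.REPLACE, 3: Strategy.DELETE}


def match_strategy(level: int, mechanism: dict[int, Strategy] = LEVEL_MECHANISM) -> Strategy:
    """Look up the shielding strategy configured for a sensitive word level."""
    try:
        return mechanism[level]
    except KeyError:
        raise ValueError(f"no shielding strategy configured for level {level}") from None
```

Raising on an unknown level, rather than silently skipping the word, is a design choice of this sketch so that a misconfigured mechanism does not leave sensitive words unshielded.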
Optionally, the method for shielding the sensitive words of the OFD file provided by the embodiment of the present invention, the shielding policy includes at least one of the following:
masking sensitive words;
replacing sensitive words;
sensitive words are deleted.
Specifically, as an example, when the level of the sensitive word output by the sensitive word detection model is successfully matched with the first level in the preset level mechanism, the shielding strategy for the sensitive word is to cover the sensitive word. Under the condition that the shielding strategy is to cover the sensitive words, the content of the source document is kept unchanged, only the sensitive words are covered in the page in a black, white or mosaic fuzzy mode at the positions of the characters, and the shielding effect is achieved, and the specific covering mode is not limited.
When the secondary matching of the level of the sensitive word output by the sensitive word detection model and the summary of a preset level mechanism is successful, the shielding strategy for the sensitive word is to replace the sensitive word. In the case that the masking policy is to replace the sensitive word, the character is replaced by a star symbol or other symbols, and the method is specifically divided into two cases, wherein the first case is that the sensitive word after replacement cannot be restored, the character is permanently shown in the form of a "x" in the following OFD file, and the sensitive word after replacement is actually deleted and cannot be restored; in the second case, the content of the character itself cannot be deleted because of important information or information needing confidentiality, and when the OFD file is displayed, the character can be restored by replacing the character itself with a special symbol, that is, the replaced character itself is stored in the OFD file in a certain data format, the sensitive word can be restored when necessary according to the actual requirement of the OFD file, and the replaced character itself can be placed in a custom subscript.
When the level of the sensitive word output by the sensitive word detection model matches the third level in the preset level mechanism, the shielding policy for the sensitive word is to delete it. In this case the characters themselves are deleted directly from the OFD source file; in special cases, the whole paragraph or the whole page containing the sensitive word can be deleted.
In the sensitive word shielding method for an OFD file, a different shielding mechanism (covering, replacing or deleting) is preset for each level of sensitive word, so that the sensitive words of an OFD file are handled by level, which improves the usability of the OFD file to a certain extent.
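The level-to-policy matching described above can be illustrated with a minimal Python sketch. It operates on a plain string rather than on a real OFD page, and the names `Strategy`, `LEVEL_MECHANISM`, `Hit` and `shield` are hypothetical; the patent does not prescribe any particular data structure. Covering is modeled by returning the character spans a renderer would paint over, since the source text is left unchanged in that case.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    COVER = "cover"      # hide the glyphs at render time; source text untouched
    REPLACE = "replace"  # substitute each character with a symbol such as "*"
    DELETE = "delete"    # remove the characters from the source


# Hypothetical preset level mechanism: sensitive word level -> shielding policy.
LEVEL_MECHANISM = {1: Strategy.COVER, 2: Strategy.REPLACE, 3: Strategy.DELETE}


@dataclass
class Hit:
    word: str
    start: int  # character offset within the page text (assumed representation)
    level: int


def shield(text, hits):
    """Return the shielded text and the (start, end) spans to cover when rendering.

    Assumes the hits do not overlap. Cover spans are expressed in the
    coordinates of the returned text, not of the input text.
    """
    pieces, cover_spans, cursor, out_len = [], [], 0, 0
    for hit in sorted(hits, key=lambda h: h.start):
        end = hit.start + len(hit.word)
        pieces.append(text[cursor:hit.start])  # copy the untouched gap
        out_len += hit.start - cursor
        strategy = LEVEL_MECHANISM[hit.level]
        if strategy is Strategy.DELETE:
            kept = ""
        elif strategy is Strategy.REPLACE:
            kept = "*" * len(hit.word)
        else:  # Strategy.COVER: keep the characters, remember where they are
            kept = hit.word
            cover_spans.append((out_len, out_len + len(kept)))
        pieces.append(kept)
        out_len += len(kept)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), cover_spans
```

In a real OFD page the position would be a page coordinate rather than a string offset, so deleting one word would not shift the position of another; the offset bookkeeping here exists only because the sketch works on a string.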
Optionally, the sensitive word shielding method for the OFD file provided by the embodiment of the present invention further includes:
in the case where the shielding policy includes replacing sensitive words, storing the replaced sensitive words.
Specifically, when the shielding policy is to replace the sensitive word, two refined cases are distinguished. In the first case, the replaced sensitive word cannot be restored: in the resulting OFD file the characters are permanently shown as "*", and the original sensitive word is in effect deleted. In the second case, the characters themselves cannot be deleted because they carry important information or information that must be kept confidential. When the OFD file is displayed, the characters are shown as a special symbol, while the replaced characters are stored in the OFD file in a certain data format, for example in a custom subscript, so that the sensitive word can be restored when necessary according to the actual requirements of the OFD file. Because the sensitive word in the second case is important or confidential information, the characters placed in the custom subscript may be encrypted; the specific encryption method is not limited herein.
In the sensitive word shielding method for an OFD file, when the shielding policy matched to the level of a sensitive word is to replace it, the replaced sensitive word can be stored and encrypted, so that the method can be applied in a variety of scenarios and its usability is improved.
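The two replacement cases can be sketched as follows, again on a plain string. The functions `replace_word` and `restore_word`, the record layout, and the `protect`/`unprotect` hooks are all hypothetical; the patent leaves both the storage format and the encryption method open. The default hooks are identity functions and provide no confidentiality, so a real deployment would have to plug in a vetted cipher.

```python
def replace_word(text, start, length, recoverable, protect=lambda s: s):
    """Replace text[start:start+length] with asterisks.

    When `recoverable` is true, also return a record holding the protected
    original so that it can be restored later (second case). When false,
    no record is kept and the original is lost (first case).
    """
    original = text[start:start + length]
    masked = text[:start] + "*" * length + text[start + length:]
    record = None
    if recoverable:
        record = {"start": start, "length": length, "payload": protect(original)}
    return masked, record


def restore_word(masked, record, unprotect=lambda s: s):
    """Put the stored original back in place of the asterisks."""
    start, length = record["start"], record["length"]
    return masked[:start] + unprotect(record["payload"]) + masked[start + length:]
```

Passing the protection step in as a callable keeps the storage logic independent of whichever encryption scheme is eventually chosen.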
Optionally, in the sensitive word shielding method of the OFD file, the sensitive word detection model is an artificial intelligence based sensitive word detection model.
Specifically, after the text object of each page in the OFD file has been structured, all of the resulting structured data of the OFD file is input into the sensitive word detection model. The sensitive word detection model in the embodiment of the invention is based on artificial intelligence technology and uses semantic analysis to perform sequential sensitive word detection on each structured character. Semantic analysis is a branch of AI and a core task of natural language processing. Using the semantic analysis function of the artificial intelligence sensitive word detection model, the text character objects in the OFD file are detected in sequence with the character as the minimum unit, and identification information is output. The identification information in this embodiment comprises the sensitive word in the OFD file, the specific position of the sensitive word, and the level of the sensitive word. The levels of sensitive words are preset and can be set according to the specific content of the OFD file, which is not specifically limited herein.
In the sensitive word shielding method for an OFD file provided by the embodiment of the invention, artificial intelligence technology and its semantic analysis capability are used to identify the sensitive words of the OFD file accurately, which improves the efficiency of sensitive word detection. Files can also be processed concurrently, so that sensitive words can be detected in a large number of OFD files at the same time.
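The shape of the identification information (sensitive word, position, level) can be made concrete with a small Python sketch. The detector below is deliberately not an artificial intelligence model: it is a plain lexicon lookup that stands in for the semantic analysis model only so that the input and output can be shown. `Identification`, `LEXICON` and `detect` are hypothetical names, and the words and levels in the lexicon are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    word: str
    page: int   # index of the page the word was found on
    start: int  # index of the word's first character within the page text
    level: int  # preset level of the sensitive word


# Hypothetical lexicon standing in for the trained detection model.
LEXICON = {"password": 3, "salary": 2, "address": 1}


def detect(pages):
    """Scan each page's text and return identification information.

    `pages` maps a page index to that page's text. A real model would use
    semantic analysis; this stand-in only does exact substring matching.
    """
    results = []
    for page, text in sorted(pages.items()):
        for word, level in LEXICON.items():
            pos = text.find(word)
            while pos != -1:
                results.append(Identification(word, page, pos, level))
                pos = text.find(word, pos + 1)
    return sorted(results, key=lambda r: (r.page, r.start))
```

Each returned record carries everything the shielding step needs: what to shield, where it is, and which level to look up in the preset level mechanism.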
The sensitive word shielding device of the OFD file provided by the invention is described below, and the sensitive word shielding device of the OFD file described below and the sensitive word shielding method of the OFD file described above can be correspondingly referred to each other. Fig. 6 is a schematic structural diagram of a sensitive word shielding device for an OFD file provided by the present invention, and as shown in fig. 6, the sensitive word shielding device for an OFD file includes:
a receiving module 601, configured to receive a target instruction for an OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file;
Specifically, the sensitive word shielding method of the OFD file can be applied on a server side or on a client side. When the method is applied on the server side, a user can issue the target instruction through the server side. When the method is applied on the client side, that is, when a software development kit integrating the sensitive word shielding method is added to an OFD reader, the user can issue the target instruction in the OFD reader, typically through a command such as "shield sensitive words", which is not specifically limited herein. The target instruction may concern one OFD file or several: the sensitive words of a single OFD file can be shielded, and the sensitive words of multiple OFD files can be shielded simultaneously in batches.
The parsing module 602 is configured to parse the OFD file based on the target instruction, to obtain at least one text character object in the OFD file;
Specifically, after the target instruction from the user is received, the OFD file is parsed. The OFD file format is generally organized in three layers: the first layer is a virtual storage system, comprising the package organization structure and the in-package directory organization structure; the second layer is the document model, comprising organization structures such as documents, pages, outlines and file-level resources; and the third layer is the page content description in the OFD file, comprising page-level resources, graphics, images, text and the like.
The page content description specifically includes the three most basic primitive objects:
graphic object: a region formed by a series of Bezier curves and circular arcs; the graphic object can be filled or stroked;
image object: a rectangular area in which each pixel value determines the color value of a specified point;
text character object: a series of characters together with the positioning information corresponding to the characters; the glyph of each character is determined by the designated font and other parameters, and the text character object can be filled or stroked.
In the specific embodiment of the invention, after receiving a target instruction from a user, the OFD file is analyzed according to a three-layer structure, a page content description part in the three-layer structure is obtained, and then a text character object in the page content description part is obtained as a basis for subsequent sensitive word retrieval.
The processing module 603 is configured to perform a structuring process on each text character object in the at least one text character object, so as to obtain structural data corresponding to each text character object;
Specifically, in the embodiment of the present invention, the OFD file is parsed page by page. After the text character object of each page is obtained, it is processed to form structured data; concretely, each character contained in the text object of each page in the OFD file is structured. The structured data of each character contains its exact position on the corresponding page.
The detection module 604 is configured to input each piece of structural data to a sensitive word detection model, so as to obtain identification information corresponding to each piece of structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words;
Specifically, after the text object of each page in the OFD file is structured, all the structured data of the OFD file after being structured is input into a sensitive word detection model. Specifically, the sensitive word detection model in the embodiment of the invention is a sensitive word detection model based on an artificial intelligence technology, and sequential sensitive word detection can be performed on each character subjected to structural processing through a semantic analysis technology.
And a shielding module 605, configured to shield each sensitive word in the OFD file based on each identification information and a preset ranking mechanism.
Specifically, the identification information output by the sensitive word detection model is obtained, namely the sensitive words in the OFD file, the position of each sensitive word, and the level of each sensitive word. In a specific implementation, a shielding level mechanism is preset; the level of each sensitive word is matched in the shielding level mechanism to determine the corresponding shielding policy, and the corresponding shielding operation is then performed on each sensitive word according to its shielding policy.
According to the sensitive word shielding device for the OFD file, disclosed by the embodiment of the invention, the automatic shielding of the sensitive word of the OFD file is realized through the mutual coordination among the modules.
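The cooperation among modules 601 to 605 can be sketched as a small pipeline. This is an illustrative arrangement only: the class `ShieldingDevice` and its field names are hypothetical, and each module is passed in as a plain callable so that the sketch makes no claim about how the individual modules are implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class ShieldingDevice:
    parse: Callable[[str], Sequence[Any]]        # parsing module 602
    structure: Callable[[Any], Any]              # processing module 603
    detect: Callable[[Any], Any]                 # detection module 604
    shield: Callable[[str, Sequence[Any]], Any]  # shielding module 605

    def receive(self, ofd_paths):
        """Receiving module 601: accept a target instruction for one or more files."""
        return [self._process(path) for path in ofd_paths]

    def _process(self, path):
        text_objects = self.parse(path)
        structured = [self.structure(obj) for obj in text_objects]
        identifications = [self.detect(item) for item in structured]
        return self.shield(path, identifications)
```

Because `receive` takes a sequence of paths, the same entry point covers both the single-file case and the batch case mentioned for the receiving module.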
Optionally, the parsing module specifically includes:
the first acquisition unit is used for acquiring a page object of at least one page in the OFD file; the page object is one of a graphic object, an image object and a text object;
and the judging unit is used for judging the object type of each page object in the at least one page object, and acquiring the text character object of the page object under the condition that the object type is judged to be the text type.
According to the sensitive word shielding device for the OFD file, the structure of the OFD file is analyzed through the first acquisition unit and the judgment unit in the analysis module, text type content in the OFD file is acquired, a foundation is laid for subsequent structuring, and sensitive word detection is conveniently carried out on the text content in the OFD file by the sensitive word detection model.
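The work of the first acquisition unit and the judging unit can be sketched with the Python standard library, on the understanding that an OFD file is a ZIP package whose pages are described by XML files. The namespace URI and the element names (`Layer`, `PathObject`, `ImageObject`, `TextObject`, `TextCode`) follow my reading of GB/T 33190-2016 and should be checked against the standard; the sample page, its path inside the package, and the function names are simplified assumptions made only so the sketch can run on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace as understood from GB/T 33190-2016 (assumption; verify before use).
OFD_NS = "{http://www.ofdspec.org/2016}"

# Simplified, hypothetical page content with one object of each basic type.
SAMPLE_PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<ofd:Page xmlns:ofd="http://www.ofdspec.org/2016">
  <ofd:Content>
    <ofd:Layer ID="1">
      <ofd:PathObject ID="2" Boundary="0 0 10 10"/>
      <ofd:TextObject ID="3" Boundary="10 10 50 5" Font="4" Size="3.5">
        <ofd:TextCode X="0" Y="3" DeltaX="3.5 3.5">abc</ofd:TextCode>
      </ofd:TextObject>
      <ofd:ImageObject ID="5" Boundary="0 20 10 10" ResourceID="6"/>
    </ofd:Layer>
  </ofd:Content>
</ofd:Page>"""


def build_sample():
    """Build an in-memory ZIP package containing the single sample page."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("Doc_0/Pages/Page_0/Content.xml", SAMPLE_PAGE)
    return buffer.getvalue()


def extract_text_objects(ofd_bytes):
    """Collect the text objects of every page, skipping graphics and images."""
    found = []
    with zipfile.ZipFile(io.BytesIO(ofd_bytes)) as package:
        for name in sorted(package.namelist()):
            if not name.endswith("Content.xml"):
                continue
            root = ET.fromstring(package.read(name))
            for layer in root.iter(OFD_NS + "Layer"):
                for obj in layer:  # page objects of this layer
                    if obj.tag != OFD_NS + "TextObject":
                        continue  # judged to be a graphic or image object
                    text = "".join(
                        code.text or "" for code in obj.iter(OFD_NS + "TextCode")
                    )
                    found.append({
                        "file": name,
                        "id": obj.get("ID"),
                        "boundary": obj.get("Boundary"),
                        "text": text,
                    })
    return found
```

The `Boundary` attribute is kept alongside the text because the later shielding step needs a position for each sensitive word, not just its characters.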
Optionally, the processing module specifically includes:
the character processing unit is used for carrying out character processing on each text character object to obtain at least one character corresponding to each text character object;
a second acquisition unit configured to acquire attribute information of each character; the attribute information includes at least one of a page index of the character position, a chapter index of the character position, and a code of the character;
And the structure data acquisition unit is used for acquiring the structure data respectively corresponding to the text character objects based on the attribute information of each character.
According to the sensitive word shielding device of the OFD file, the text character object in the OFD file is subjected to structural processing, the attribute information of each character in the text character object, namely the page index of the character position, the chapter index of the character position and the code of the character, is determined by taking the characters as units, each character is associated with the corresponding attribute information, further the structural data of the text character object is obtained, and the structural data can be processed by a sensitive word detection model to realize detection of the sensitive word in the OFD file.
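The per-character structuring can be sketched as follows. `CharRecord` and `structure_text_object` are hypothetical names, and two points are assumptions rather than statements from the patent: the "code of the character" is interpreted here as its Unicode code point, and an `offset` field is added to record the character's position within its text object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharRecord:
    char: str
    code: int           # Unicode code point (assumed meaning of "code of the character")
    page_index: int     # page index of the character position
    chapter_index: int  # chapter index of the character position
    offset: int         # position within the text object (added for illustration)


def structure_text_object(text, page_index, chapter_index):
    """Split a text character object into characters and attach attribute information."""
    return [
        CharRecord(ch, ord(ch), page_index, chapter_index, i)
        for i, ch in enumerate(text)
    ]
```

A list of such records is one possible form of the structure data that is handed to the sensitive word detection model.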
Optionally, the shielding module specifically includes:
the matching unit is used for matching shielding strategies corresponding to the sensitive words in a preset level mechanism based on the level of the sensitive words; the preset level mechanism comprises a mapping relation between the level of the sensitive word and a shielding strategy;
and the shielding unit is used for shielding each sensitive word in the OFD file based on the position of each sensitive word and the shielding strategy corresponding to each sensitive word.
According to the sensitive word shielding device for the OFD file, different shielding mechanisms are preset according to different grades of the sensitive words, so that grading processing of the different sensitive words of the OFD file is realized, and flexibility of shielding the sensitive words of the OFD file is enhanced.
Fig. 7 is a schematic diagram of the physical structure of an electronic device provided by the present invention. As shown in fig. 7, the electronic device may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720 and memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform the sensitive word shielding method of the OFD file described above, which includes: receiving a target instruction aiming at an OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file; analyzing the OFD file based on the target instruction to obtain at least one text character object in the OFD file; carrying out structuring processing on each text character object in the at least one text character object to obtain structure data corresponding to each text character object; inputting each piece of structural data into a sensitive word detection model to obtain identification information corresponding to each piece of structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words; and shielding each sensitive word in the OFD file based on each identification information and a preset level mechanism.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, where the computer program when executed by a processor can perform a method for masking sensitive words of an OFD file provided by the above methods, where the method includes: receiving a target instruction aiming at an OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file; analyzing the OFD file based on the target instruction to obtain at least one text character object in the OFD file; carrying out structuring processing on each text character object in the at least one text character object to obtain structure data corresponding to each text character object; inputting each piece of structural data into a sensitive word detection model to obtain identification information corresponding to each piece of structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words; and shielding each sensitive word in the OFD file based on each identification information and a preset level mechanism.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for masking sensitive words of an OFD file provided by the above methods, the method comprising: receiving a target instruction aiming at an OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file; analyzing the OFD file based on the target instruction to obtain at least one text character object in the OFD file; carrying out structuring processing on each text character object in the at least one text character object to obtain structure data corresponding to each text character object; inputting each piece of structural data into a sensitive word detection model to obtain identification information corresponding to each piece of structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words; and shielding each sensitive word in the OFD file based on each identification information and a preset level mechanism.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A sensitive word shielding method for an Open Fixed-layout Document (OFD) file, characterized by comprising the following steps:
receiving a target instruction aiming at an OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file;
analyzing the OFD file based on the target instruction to obtain at least one text character object in the OFD file;
carrying out structuring processing on each text character object in the at least one text character object to obtain structure data corresponding to each text character object;
inputting each piece of structural data into a sensitive word detection model to obtain identification information corresponding to each piece of structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words;
and shielding each sensitive word in the OFD file based on each identification information and a preset level mechanism.
2. The method for masking sensitive words of an OFD file according to claim 1, wherein said parsing the OFD file based on the target instruction to obtain at least one text character object in the OFD file comprises:
acquiring a page object of at least one page in the OFD file; the page object is one of a graphic object, an image object and a text object;
and judging the object type of each page object in the at least one page object, and acquiring the text character object of the page object under the condition that the object type is judged to be the text type.
3. The method for masking sensitive words of an OFD file according to claim 1, wherein performing the structuring process on each text character object in the at least one text character object to obtain structural data corresponding to each text character object comprises:
carrying out character processing on each text character object to obtain at least one character corresponding to each text character object;
acquiring attribute information of each character; the attribute information comprises at least one of a page object number of the character position, a chapter index of the character position and a position index of a chapter where the character is located;
and obtaining the structure data corresponding to each text character object based on the attribute information of each character.
4. The method for masking sensitive words in an OFD file according to claim 1, wherein masking each sensitive word in the OFD file based on each of the identification information and a preset ranking mechanism comprises:
based on the level of each sensitive word, matching a shielding strategy corresponding to each sensitive word in the preset level mechanism; the preset level mechanism comprises a mapping relation between the level of the sensitive word and a shielding strategy;
and shielding each sensitive word in the OFD file based on the position of each sensitive word and the shielding strategy corresponding to each sensitive word.
5. The sensitive word masking method of an OFD file according to claim 4, wherein the masking policy comprises at least one of:
covering sensitive words;
replacing sensitive words;
deleting sensitive words.
6. The method for shielding sensitive words of an OFD file according to claim 5, characterized in that the method further comprises:
storing the replaced sensitive words in the case where the shielding policy includes replacing sensitive words.
7. The method for masking sensitive words of an OFD file according to any one of claims 1 to 6, wherein the sensitive word detection model is an artificial intelligence based sensitive word detection model.
8. A sensitive word shielding device for an OFD file, comprising:
the receiving module is used for receiving a target instruction aiming at the OFD file; the target instruction is used for indicating to shield sensitive words in the OFD file;
the analysis module is used for analyzing the OFD file based on the target instruction to obtain at least one text character object in the OFD file;
the processing module is used for carrying out structuring processing on each text character object in the at least one text character object to obtain structure data corresponding to each text character object;
the detection module is used for inputting the structural data into the sensitive word detection model to obtain identification information corresponding to the structural data output by the sensitive word detection model; the identification information is used for indicating sensitive words in the corresponding structural data and positions and grades corresponding to the sensitive words;
and the shielding module is used for shielding each sensitive word in the OFD file based on each identification information and a preset level mechanism.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the sensitive word masking method of an OFD file according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the sensitive word masking method of an OFD file according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311076705.7A CN116776862A (en) | 2023-08-25 | 2023-08-25 | Sensitive word shielding method, device, equipment and medium of OFD file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116776862A (en) | 2023-09-19 |
Family
ID=88013824
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311076705.7A Pending CN116776862A (en) | 2023-08-25 | 2023-08-25 | Sensitive word shielding method, device, equipment and medium of OFD file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116776862A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20230919 |