Data watermark identification and analysis method and system
Technical Field
The invention relates to the technical field of information security, in particular to a data watermark identification and analysis method and a data watermark identification and analysis system.
Background
Information technology, represented by big data analysis and new-generation artificial intelligence, is developing rapidly and has played an important role in national governance, lean organizational management, customer service improvement, and other areas. The full fusion and sharing of data has become a major trend and is having a profound influence on economic and social development. However, data security problems are increasingly prominent; data theft and abuse are growing more serious and are currently the primary obstacle to further data fusion and sharing.
Data watermarking technology embeds identification information (namely, a data watermark) directly into a digital carrier (including multimedia, documents, software, and the like) in a way that does not affect the use value of the original carrier, is difficult to detect or modify, and can still be identified and recognized by the producer. An additive data watermark is added into a data string to mark attributes of the data, and can be used to define ownership and to trace the data distribution process. However, a data watermark added as in the prior art changes the structure of the data, which causes the data to be misidentified and interferes with the meaning the data expresses.
Disclosure of Invention
Therefore, the data watermark identification and analysis method and system provided by the invention overcome the defects of the prior art, in which an added watermark changes the data structure and thereby causes data misidentification and interference with the meaning the data expresses.
To achieve this purpose, the invention provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a data watermark identification and analysis method, including:
acquiring data content to be identified, wherein the data content to be identified comprises at least one single piece of data;
classifying the data content to be identified to generate at least one semantic segment;
generating a semantic library from the different semantic segments;
and matching the single data in the data content to be identified against the semantic segments in the semantic library, and marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark.
In one embodiment, the data content to be identified is the full set of input data sharing the same grammatical structure.
In one embodiment, the data content to be identified is classified according to the same field and the position of the same field, and at least one semantic segment is generated.
In an embodiment, classifying according to the same fields and the positions of the same fields in the data content to be identified to generate at least one semantic segment includes:
determining fields that are repeated in the data content to be identified a number of times greater than or equal to a first preset value as same fields, and counting the same fields and their field positions;
deleting the same fields whose number is smaller than a second preset value, and retaining the same fields whose number is greater than or equal to the second preset value;
and ordering the retained same fields according to the sequence position of the single data in the data content to be identified to generate at least one semantic segment.
In an embodiment, after the step of matching the single data in the data content to be identified against the semantic segments in the semantic library and marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark, the method further includes: returning the content and location marked as the data watermark.
In one embodiment, the data content to be identified includes: at least one of a word and a character string.
In a second aspect, an embodiment of the present invention provides a data watermark identification and analysis system, including:
the data acquisition module is used for acquiring data content to be identified, and the data content to be identified comprises: at least one single piece of data;
the semantic segment generation module is used for classifying according to the data content to be identified and generating at least one semantic segment;
the semantic library generation module is used for generating a semantic library according to different semantic segments;
and the data watermark identification module is used for matching the single data in the data content to be identified with the semantic segments in the semantic library, and marking the data which cannot be matched with the semantic segments in the semantic library as the data watermark.
In one embodiment, the data watermark identification and analysis system further comprises:
and the data watermark content and position acquisition module is used for returning the content and the position marked as the data watermark.
In a third aspect, an embodiment of the present invention provides a terminal, including: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to cause the at least one processor to execute the data watermark identification and analysis method according to the first aspect of the embodiment of the invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause the computer to execute the data watermark identification and analysis method according to the first aspect of the embodiment of the present invention.
The technical scheme of the invention has the following advantages:
The data watermark identification and analysis method and system provided by the invention acquire data content to be identified, the data content to be identified comprising at least one single piece of data; classify the data content to be identified to generate at least one semantic segment; generate a semantic library from the different semantic segments; and match the single data in the data content to be identified against the semantic segments in the semantic library, marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark. By analyzing and identifying the data watermark, the invention provides accurate analysis for the subsequent processing of the data watermark, and reduces misidentification of the data and interference with the meaning of the data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a data watermark identification and analysis method provided by an embodiment of the present invention;
fig. 2 is a flowchart of another specific example of a data watermark identification and analysis method according to an embodiment of the present invention;
fig. 3 is a block diagram of a specific example of a data watermark identification and analysis system according to an embodiment of the present invention;
fig. 4 is a block diagram of another specific example of a data watermark identification and analysis system according to an embodiment of the present invention;
fig. 5 is a composition diagram of a specific example of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; as a mechanical or an electrical connection; as a direct connection or an indirect connection through an intermediate medium; as internal communication between two elements; or as a wireless or a wired connection. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific case.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Embodiment 1
The data watermark identification and analysis method provided by the embodiment of the invention, as shown in fig. 1, comprises the following steps:
Step S1: acquiring data content to be identified, wherein the data content to be identified comprises at least one single piece of data.
In the embodiment of the present invention, the data content to be identified is the full set of input data having a grammatical structure, where the grammatical structure may be regular or irregular. A regular structure is, for example, a sentence with a subject-predicate-object structure. An irregular structure is, for example, an identity card number, in which each digit carries a different meaning, or address information ordered from the largest unit to the smallest. These are examples only and not limitations; the corresponding structure is selected according to specific requirements.
In the embodiment of the present invention, the data content to be identified includes at least one of characters and character strings. This is only an example and not a limitation; the corresponding content is selected according to the corresponding requirements.
Step S2: classifying the data content to be identified to generate at least one semantic segment.
In the embodiment of the invention, the data content to be identified is classified according to the same fields and the positions of the same fields in it, and at least one semantic segment is generated. This embodiment defines what counts as a same field, extracts the effective information of the data content to be identified by counting how often the same field recurs, and generates at least one semantic segment. The specific process comprises: determining fields that are repeated in the data content to be identified a number of times greater than or equal to a first preset value as same fields, and counting the same fields and their field positions; deleting the same fields whose number is smaller than a second preset value, and retaining the same fields whose number is greater than or equal to the second preset value; and ordering the retained same fields according to the sequence position of the single data in the data content to be identified to generate at least one semantic segment.
In the embodiment of the present invention, fields repeated two or more times in the data content to be identified are determined as same fields, and the same fields and their field positions are counted; same fields whose number is less than 2 are deleted, and same fields whose number is greater than or equal to 2 are retained; the retained same fields are then ordered according to the sequence position of the single data in the data content to be identified to generate at least one semantic segment. The value 2 is only an example and not a limitation; the preset values are set according to reasonable requirements.
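The counting, filtering, and ordering described for step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the text does not define how a single piece of data is split into fields, so the sketch assumes whitespace-separated tokens, identifies a field by its token index together with its value, and treats all retained fields at one position as one semantic segment. The names `generate_semantic_segments`, `first_preset`, and `second_preset` are hypothetical.

```python
from collections import Counter


def generate_semantic_segments(records, first_preset=2, second_preset=2):
    """Group frequently repeated fields by position (illustrative only)."""
    # Assumption: a "field" is a whitespace-separated token, and its
    # position is the token index inside its single piece of data.
    counts = Counter(
        (position, field)
        for record in records
        for position, field in enumerate(record.split())
    )
    # Fields repeated at least first_preset times count as "same fields".
    same_fields = {key: n for key, n in counts.items() if n >= first_preset}
    # Delete same fields below second_preset and keep the rest.
    retained = [key for key, n in same_fields.items() if n >= second_preset]
    # Order the retained fields by position; each position yields one segment.
    segments = {}
    for position, field in sorted(retained):
        segments.setdefault(position, []).append(field)
    return segments


records = [
    "order paid beijing",
    "order paid shanghai",
    "order shipped beijing",
    "order shipped shanghai wm7f3",
]
segments = generate_semantic_segments(records)
```

With both preset values set to 2, as in the example of this embodiment, the second filter removes nothing beyond the first; the two thresholds are kept separate so that a stricter second value can be tried.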
Step S3: generating a semantic library from the different semantic segments.
In the embodiment of the invention, the semantic library comprises at least one semantic segment which is generated by classifying according to the data content to be identified.
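One way to picture the semantic library is as a lookup structure that merges the distinct semantic segments. The sketch below assumes, purely for illustration, that semantic segments are represented as a mapping from field position to a list of accepted field values; the function name `build_semantic_library` and this representation are hypothetical and are not specified by the text.

```python
def build_semantic_library(*segment_groups):
    """Merge groups of semantic segments into position -> set of fields."""
    library = {}
    for segments in segment_groups:
        for position, fields in segments.items():
            # A set removes duplicates, so only different segments remain.
            library.setdefault(position, set()).update(fields)
    return library


# Hypothetical segments, e.g. produced from two batches of input data.
segments_a = {0: ["order"], 1: ["paid", "shipped"]}
segments_b = {1: ["shipped", "refunded"], 2: ["beijing", "shanghai"]}
library = build_semantic_library(segments_a, segments_b)
```

Using sets keeps the later matching step a constant-time membership check per field.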
Step S4: matching the single data in the data content to be identified against the semantic segments in the semantic library, and marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark.
In the embodiment of the present invention, when the single data in the data content to be identified is matched against the semantic segments in the semantic library, it is ensured that the single data is compared with each semantic segment in the semantic library, and the data that cannot be matched with the semantic segments in the semantic library is marked as a data watermark. As shown in fig. 2, after the data watermark is identified, the method further includes:
Step S5: returning the position of the data watermark.
In the embodiment of the invention, analyzing and identifying the data watermark provides accurate analysis for the subsequent processing of the data watermark and reduces misidentification of the data and interference with the meaning of the data; returning the position of the data watermark additionally provides accurate positioning for that subsequent processing.
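The matching and position-return steps can be sketched as follows. This is an illustrative assumption rather than the method as claimed: the semantic library is modeled as a mapping from field position to a set of accepted field values, fields are whitespace-separated tokens, and a watermark location is reported as a record index plus a field index. The name `find_watermarks` is hypothetical.

```python
def find_watermarks(records, library):
    """Return (record_index, field_position, field) for unmatched fields."""
    watermarks = []
    for record_index, record in enumerate(records):
        for position, field in enumerate(record.split()):
            # A field with no match in the library is marked as a watermark.
            if field not in library.get(position, set()):
                watermarks.append((record_index, position, field))
    return watermarks


# Hypothetical semantic library and input data.
library = {0: {"order"}, 1: {"paid", "shipped"}, 2: {"beijing", "shanghai"}}
records = [
    "order paid beijing",
    "order shipped shanghai wm7f3",
]
hits = find_watermarks(records, library)
```

Because this sketch matches by position, a watermark inserted in the middle of a record would shift the fields that follow it and cause them to be flagged as well; matching on field values alone is one possible alternative under different assumptions.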
The data watermark identification and analysis method provided by the embodiment of the invention acquires data content to be identified, the data content to be identified comprising at least one single piece of data; classifies the data content to be identified to generate at least one semantic segment; generates a semantic library from the different semantic segments; and matches the single data in the data content to be identified against the semantic segments in the semantic library, marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark. Analyzing and identifying the data watermark in this way provides accurate analysis for the subsequent processing of the data watermark, and reduces misidentification of the data and interference with the meaning of the data.
Embodiment 2
An embodiment of the present invention provides a data watermark identification and analysis system, as shown in fig. 3, including:
The data acquisition module 1 is configured to acquire data content to be identified, where the data content to be identified includes at least one single piece of data; this module executes the method described in step S1 in embodiment 1, which is not described herein again.
The semantic segment generation module 2 is used for classifying the data content to be identified and generating at least one semantic segment; this module executes the method described in step S2 in embodiment 1, which is not described herein again.
The semantic library generation module 3 is used for generating a semantic library according to different semantic segments; this module executes the method described in step S3 in embodiment 1, which is not described herein again.
The data watermark identification module 4 is used for matching the single data in the data content to be identified against the semantic segments in the semantic library, and marking the data that cannot be matched with the semantic segments in the semantic library as the data watermark; this module executes the method described in step S4 in embodiment 1, which is not described herein again.
In this embodiment of the present invention, as shown in fig. 4, the data watermark identification and analysis system further includes:
a data watermark content and location obtaining module 5, configured to return the content and location marked as a data watermark, where this module executes the method described in step S5 in embodiment 1, and details are not described here again.
The embodiment of the invention provides a data watermark identification and analysis system. The data acquisition module acquires data content to be identified, the data content to be identified comprising at least one single piece of data; the semantic segment generation module classifies the data content to be identified and generates at least one semantic segment; the semantic library generation module generates a semantic library according to different semantic segments; and the data watermark identification module matches the single data in the data content to be identified against the semantic segments in the semantic library and marks the data that cannot be matched with the semantic segments in the semantic library as the data watermark. Analyzing and identifying the data watermark in this way provides accurate analysis for the subsequent processing of the data watermark, and reduces misidentification of the data and interference with the meaning of the data.
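The division of the system into modules can be pictured as one object whose methods correspond to modules 1 to 5. The sketch below is an assumption-laden illustration, not the system as built: it assumes one single piece of data per line of input, whitespace-separated fields, and a single hypothetical threshold `min_repeats` standing in for both preset values. The class and method names are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WatermarkAnalysisSystem:
    min_repeats: int = 2  # hypothetical stand-in for both preset values
    library: dict = field(default_factory=dict)

    def acquire(self, raw_text):
        # Module 1: one single piece of data per non-empty line (assumption).
        return [line for line in raw_text.splitlines() if line.strip()]

    def generate_segments(self, records):
        # Module 2: keep (position, token) pairs that repeat often enough.
        counts = Counter(
            (pos, tok) for rec in records for pos, tok in enumerate(rec.split())
        )
        segments = {}
        for (pos, tok), n in sorted(counts.items()):
            if n >= self.min_repeats:
                segments.setdefault(pos, []).append(tok)
        return segments

    def build_library(self, segments):
        # Module 3: store the segments as position -> set of tokens.
        self.library = {pos: set(toks) for pos, toks in segments.items()}
        return self.library

    def identify(self, records):
        # Modules 4 and 5: mark unmatched tokens and return their
        # content together with record index and field position.
        return [
            (i, pos, tok)
            for i, rec in enumerate(records)
            for pos, tok in enumerate(rec.split())
            if tok not in self.library.get(pos, set())
        ]

    def run(self, raw_text):
        records = self.acquire(raw_text)
        self.build_library(self.generate_segments(records))
        return self.identify(records)


raw = """order paid beijing
order paid shanghai
order shipped beijing
order shipped shanghai wm7f3
"""
result = WatermarkAnalysisSystem().run(raw)
```

Keeping each module as a separate method mirrors the one-to-one correspondence between modules and method steps, so any single stage can be replaced without touching the others.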
Embodiment 3
An embodiment of the present invention provides a terminal, as shown in fig. 5, including: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402. The communication bus 402 is used to enable connection and communication between these components. The communication interface 403 may include a display and a keyboard, and optionally may also include a standard wired interface and a standard wireless interface. The memory 404 may be a random-access memory (RAM) or a non-volatile memory, such as at least one disk memory.
The memory 404 may optionally be at least one memory device located remotely from the processor 401. The processor 401 may execute the data watermark identification and analysis method in embodiment 1. A set of program codes is stored in the memory 404, and the processor 401 calls the program codes stored in the memory 404 to execute the data watermark identification and analysis method in embodiment 1. The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in fig. 5, but this does not mean that there is only one bus or one type of bus. The memory 404 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also comprise a combination of memories of the kinds described above. The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
Optionally, the memory 404 is also used to store program instructions. The processor 401 may call program instructions to implement the data watermark identification and analysis method in embodiment 1 as described in this application.
The embodiment of the invention further provides a computer-readable storage medium, wherein computer-executable instructions are stored on the computer-readable storage medium, and the computer-executable instructions can execute the data watermark identification and analysis method in the embodiment 1. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above embodiments are given only for clarity of illustration and are not intended to limit the implementations. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived from the above description remain within the protection scope of the invention.