CN115455965B - Character grouping method based on word distance word chain, storage medium and electronic equipment - Google Patents
- Publication number: CN115455965B
- Application number: CN202211416946.7A
- Authority: CN (China)
- Prior art keywords: character, word, characters, frequency, grouped
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention relates to a character grouping method based on a word distance word chain, a storage medium and electronic equipment. The character grouping method comprises the following steps: traversing a corpus and counting the frequency of the N characters to be grouped; segmenting all texts in the corpus into words and calculating the probability of each word from the frequency of the words formed by the N characters; and repeating the following steps for the characters in order of character frequency, from high to low, until all characters are grouped: calculating the word chain sum of the character c to be assigned and the grouped characters c_i in the kth group; and, taking the normalized word chain sum as the weight, adding the character c to be assigned to the group with the smallest weight. The word chain sum reflects how often the character c to be assigned appears in the same word as the other characters in a group: the larger its value, the more often the characters appear together and the more they should be assigned to different groups. Converting the character grouping problem into a comparison of concrete weights makes the grouping more reasonable and more accurate.
Description
Technical Field
The invention relates to the technical field of invisible font library watermarks, and in particular to a character grouping method based on word distance and word chain, a storage medium and electronic equipment.
Background
In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become mainstream, because it improves the robustness of the watermarking algorithm against attacks such as printing and scanning, screen capture and screen photographing. Specific characters are deformed in different ways, each deformation corresponds to a different watermark information bit string, and the character deformation data are stored in a specific watermark font library; the watermark information is then embedded by font replacement when an electronic text document is printed or displayed on screen. When different character deformation data are used for different users, the specific watermark font library constitutes the secure font library for that user.
Existing secure font libraries have many defects. To address the poor universality of watermark loading, poor system stability, complex implementation and low robustness of watermarking algorithms in the prior art without changing any usage habit of the user, the patent "Universal text watermarking method and device" (publication number CN114708133A), filed by Beijing Guoyin Technology Co., Ltd. (北京国隐科技有限公司), discloses the following scheme. A universal text watermarking method comprises the following steps: grouping a certain number of characters in a selected font library according to a specific strategy; designing deformations for all characters in each group according to specific rules and generating a temporary watermark character data file; generating watermark coding data for the user terminal to identify the identity authentication information of the user terminal; dynamically generating and loading a watermark font library file in real time from the watermark coding data, the temporary watermark character data file and the grouped characters; and opening a text file in electronic format, so that the watermark font library file embeds the watermark information in real time in the document content that is printed or displayed on screen.
In this scheme, characters need to be grouped. In theory, characters with higher character frequency should be placed in different groups, and characters that often appear together should also be placed in different groups. A secure font library that meets these two requirements needs less text content when the security code is extracted, so the extraction is more effective and more accurate. The character grouping method in that scheme has several defects. First, the number of characters in each group is essentially equal, which conflicts with the above requirements. Second, only character frequency is considered during grouping, and word frequency is not fully considered; in theory, the characters of frequently occurring words should be placed in different groups, so that more groups appear in shorter content and less content is needed to extract the security code. Third, the calculation used to optimize the grouping is too complex and consumes a large amount of time and computing power.
Disclosure of Invention
The invention aims to provide a character grouping method based on word distance word chains, which can more reasonably group characters.
To achieve this purpose, the invention adopts the following technical scheme. A character grouping method based on word distance word chains comprises the following steps: traversing a corpus and counting the frequency of the N characters to be grouped; segmenting all texts in the corpus into words and calculating the probability of each word from the frequency of the words formed by the N characters; and repeating the following steps for the characters in order of character frequency, from high to low, until all characters are grouped. The word chain sum of the character c to be assigned and the grouped characters c_i in the kth group is calculated according to the following formula:
In the formula, the word set consists of all words containing both character c and character c_i. The word chain sums of all groups are normalized, and, taking the normalized word chain sum as the weight, the character c to be assigned is added to the group with the smallest weight.
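The word chain sum formula itself is not reproduced in this text. One reading that is consistent with the definitions above is sketched below. All symbols are notation assumed here for illustration: $G_k$ is the kth group, $W(c, c_i)$ is the set of words containing both characters, $p(w)$ is the probability of word $w$, and $K$ is the number of groups. Normalizing by the total over all groups is also an assumption, since the normalization method is not stated.

```latex
\[
L_k(c) = \sum_{c_i \in G_k} \; \sum_{w \in W(c,\,c_i)} p(w),
\qquad
\hat{L}_k(c) = \frac{L_k(c)}{\sum_{j=1}^{K} L_j(c)},
\qquad
k^{*} = \arg\min_{k} \hat{L}_k(c)
\]
```

Under this reading, the character c joins group $k^{*}$, the group whose members co-occur with c in words least often.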
Compared with the prior art, the invention has the following technical effects. The word chain sum reflects how often the character c to be assigned appears in the same word as the other characters in a group: the larger its value, the more often the characters appear together and the more they should be assigned to different groups. The weight calculated from the word chain sum reflects exactly this relationship. Converting the character grouping problem into a comparison of concrete weights makes the grouping more reasonable and more accurate.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the present invention;
FIG. 3 is a schematic flow chart of a second embodiment of the present invention;
FIG. 4 is a flow chart of a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 4.
Referring to fig. 1, the invention discloses a character grouping method based on word distance word chains, comprising the following steps. The corpus is traversed and the frequency of the N characters to be grouped is counted; the preferred range of N is 1000 to 3000. If the N characters to be grouped are specified, their frequencies are counted directly; if only the value of N is given, the characters are sorted by character frequency from high to low and the N characters with the highest frequency are selected. Many word segmentation models exist; a mature one is selected to segment all texts in the corpus, and the probability of each word is calculated from the frequency of occurrence of the words formed by the N characters. The character frequency and word frequency can be calculated with an existing corpus and model, or previously calculated results can be used directly. The corpus can be chosen according to the user's needs: it may be a general corpus or the internal corpus of a particular enterprise or organization, and different corpora yield different character groups.
The following steps are executed repeatedly, in order of character frequency from high to low, until all characters are grouped. Calculating the word chain sum of the character c to be assigned with respect to each group, according to the formulas below, and then adding the character c to the group with the smallest weight constitutes the complete grouping process for the character c. The grouping of the N characters is completed by executing this process for each character in order of character frequency from high to low.
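The counting step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the corpus is assumed to be already segmented into lists of words (the source does not name a segmentation model), "words formed by the N characters" is read as multi-character words made up entirely of selected characters, and a word's probability is taken as its count divided by the total count of such words. The function name `count_statistics` is hypothetical.

```python
from collections import Counter


def count_statistics(segmented_texts, n):
    """Count character frequencies and word probabilities.

    segmented_texts: list of texts, each a list of words (already segmented).
    n: number of characters to group.
    Returns (chars sorted by frequency, high to low; char_freq; word_prob).
    """
    # Character frequency over the whole corpus.
    char_freq = Counter(
        ch for text in segmented_texts for word in text for ch in word
    )
    # Only N is given here, so take the N most frequent characters.
    chars = [ch for ch, _ in char_freq.most_common(n)]
    selected = set(chars)
    # Assumption: keep multi-character words made up only of selected characters.
    word_freq = Counter(
        word
        for text in segmented_texts
        for word in text
        if len(word) > 1 and all(ch in selected for ch in word)
    )
    total = sum(word_freq.values())
    # Assumption: probability = word count / total count of kept words.
    word_prob = {w: f / total for w, f in word_freq.items()} if total else {}
    return chars, char_freq, word_prob
```

A general corpus and an enterprise-internal corpus would simply be passed as different `segmented_texts` values, which is why they yield different groupings.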
The word chain sum of the character c to be assigned and the grouped characters c_i in the kth group is calculated according to the following formula:
In the formula, the word set consists of all words containing both character c and character c_i. The characters c and c_i need not be adjacent and may appear in either order; every word containing both characters belongs to the set. The word chain sum is a weight that reflects how often two characters appear together in a word, and its purpose is to place two characters that appear together in different groups as far as possible in the subsequent grouping.
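The word chain sum described above can be sketched in Python as follows. The sketch assumes that the sum adds up the probabilities of all words containing both characters, once per grouped character c_i. The exact formula is not reproduced in this text, so this is one plausible reading. The name `word_chain_sum` and the `word_prob` mapping (word to probability) are hypothetical.

```python
def word_chain_sum(c, group, word_prob):
    """Word chain sum of character c with respect to one group.

    c: the character to be assigned.
    group: characters already placed in the group.
    word_prob: mapping from word to its probability.
    """
    total = 0.0
    for c_i in group:
        for word, prob in word_prob.items():
            # Adjacency and order do not matter: the word only has to
            # contain both characters.
            if c in word and c_i in word:
                total += prob
    return total
```

For an empty group the sum is 0, so a character that never co-occurs with any group member is equally welcome in every group.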
The word chain sums of all groups are normalized, and, taking the normalized word chain sum as the weight, the character c to be assigned is added to the group with the smallest weight. Converting the character grouping problem into a comparison of concrete weights makes the grouping more reasonable and more accurate.
The above weight is calculated only from the word chain sum. To introduce information about character frequency, the invention provides three preferred embodiments for reference.
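The basic grouping loop can be sketched in Python as follows. Several details are assumptions rather than statements from the source: the number of groups `num_groups` is a free parameter, normalization divides each group's sum by the total over all groups, and ties are broken by preferring the smaller group and then the lower group index. The function name `assign_by_word_chain` is hypothetical.

```python
def assign_by_word_chain(chars_by_freq, num_groups, word_prob):
    """Greedily group characters, using the normalized word chain sum as weight.

    chars_by_freq: characters sorted by character frequency, high to low.
    word_prob: mapping from word to its probability.
    """
    groups = [[] for _ in range(num_groups)]
    for c in chars_by_freq:
        # Word chain sum of c with respect to every group.
        sums = [
            sum(p for c_i in g for w, p in word_prob.items() if c in w and c_i in w)
            for g in groups
        ]
        total = sum(sums)
        # Assumed normalization: share of the total over all groups.
        weights = [s / total if total else 0.0 for s in sums]
        # Smallest weight wins; assumed tie-break: smaller group, then lower index.
        k = min(range(num_groups), key=lambda j: (weights[j], len(groups[j])))
        groups[k].append(c)
    return groups
```

With `chars_by_freq = ["a", "b", "c", "d"]`, two groups and `word_prob = {"ab": 0.6, "cd": 0.4}`, the characters of each word end up in different groups.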
Referring to fig. 2, in the first embodiment the word distance sum is introduced into the weight calculation formula. Specifically, the method further comprises the following steps: sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
In the formula, the distance term is the word distance between character c and character c_i. The word distance is the difference between the subscripts of two characters after sorting by character frequency. For example, the distance between the 1st and the 2nd character is 1, and the distance between the character with the highest frequency and the character with the lowest frequency is the largest, namely N-1. The word distance therefore directly reflects the character frequency.
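The word distance sum can be sketched in Python as follows. The sketch assumes that the sum adds the absolute rank differences between c and every grouped character c_i; the formula is not reproduced in this text. The names `build_ranks` and `word_distance_sum` are hypothetical.

```python
def build_ranks(chars_by_freq):
    """Map each character to its position after sorting by character frequency."""
    return {ch: i for i, ch in enumerate(chars_by_freq)}


def word_distance_sum(c, group, rank):
    """Sum of rank differences between c and the characters already in the group."""
    return sum(abs(rank[c] - rank[c_i]) for c_i in group)
```

With N characters, the largest single distance is `N - 1`, between the most frequent and the least frequent character.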
For ease of calculation, the word distance sums of all groups are normalized. The weight of the character c to be assigned to the kth group is then calculated according to the following formula:
In the formula, the weight coefficient is preset and is greater than or equal to 0. When the coefficient equals 0, only the word chain sum is used as the weight; when it is greater than 0, the word chain sum and the word distance sum together determine the weight. Adjusting the preset coefficient adjusts the proportion of the word chain sum and the word distance sum in the calculated weight. Finally, the character c to be assigned is added to the group with the smallest weight.
In the above formula, C is the set of grouped characters, and the standard deviation term is the standard deviation of the normalized frequencies of the grouped characters. It is calculated as follows. First, the frequencies of all grouped characters are normalized: for example, if the frequencies of the grouped characters are denoted Q_1, Q_2, Q_3, …, each frequency is divided by the sum of all the frequencies. Second, the standard deviation of the normalized values is calculated.
The word chain sum reflects how often the character c to be assigned appears in the same word as the other characters in a group: the larger its value, the more often the characters appear together and the more they should be assigned to different groups. The word distance reflects the character frequency relationship, so that high-frequency characters of similar frequency are more likely to be assigned to different groups. The weight calculated from these two quantities reflects exactly these relationships.
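The two-step calculation of the standard deviation can be sketched in Python as follows. The source does not say whether the population or the sample standard deviation is meant; the sketch assumes the population form (`statistics.pstdev`). The function name `normalized_frequency_std` is hypothetical.

```python
import statistics


def normalized_frequency_std(grouped_chars, char_freq):
    """Standard deviation of the normalized frequencies of the grouped characters.

    grouped_chars: characters that have already been assigned to a group.
    char_freq: mapping from character to its frequency (Q_1, Q_2, ...).
    """
    freqs = [char_freq[ch] for ch in grouped_chars]
    total = sum(freqs)
    if not freqs or total == 0:
        return 0.0
    # Step 1: divide each frequency by the sum of all frequencies.
    normalized = [q / total for q in freqs]
    # Step 2: standard deviation of the normalized values (population form assumed).
    return statistics.pstdev(normalized)
```

For two grouped characters with frequencies 1 and 3, the normalized values are 0.25 and 0.75 and the result is 0.25.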
Referring to fig. 3, in the second embodiment the frequency difference sum is introduced into the weight calculation formula. Specifically, the method further comprises the following steps: sorting the characters by character frequency, and calculating the frequency difference sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
In the formula, the difference term is the frequency difference between character c and character c_i, which directly reflects the character frequency relationship. The frequency difference sums of all groups are normalized, and the weight of the character c to be assigned to the kth group is calculated according to the following formula:
In the formula, the weight coefficient is preset and is greater than or equal to 0, C is the set of grouped characters, and the standard deviation term is the standard deviation of the normalized frequencies of the grouped characters. This coefficient plays a role similar to that of the coefficient in the first embodiment. Both have the same value range, [0, 10], but their values need not be equal.
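The frequency difference sum of the second embodiment can be sketched in Python as follows. The sketch assumes that the frequency difference is the absolute difference of the raw character frequencies; the source does not reproduce the formula or state whether raw or relative frequencies are used. The function name `frequency_difference_sum` is hypothetical.

```python
def frequency_difference_sum(c, group, char_freq):
    """Sum of frequency differences between c and the characters already in the group.

    char_freq: mapping from character to its frequency.
    """
    # Assumption: absolute difference of raw frequencies.
    return sum(abs(char_freq[c] - char_freq[c_i]) for c_i in group)
```

Unlike the word distance, which only depends on rank, this quantity grows with the actual size of the frequency gap between two characters.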
Referring to fig. 4, in the third embodiment the word distance sum and the frequency difference sum are introduced into the weight calculation formula at the same time. Specifically, the word distance sum and the frequency difference sum are calculated according to the steps above, and the weight of the character c to be assigned to the kth group is then calculated according to the following formula:
This embodiment combines three factors: the word chain sum, the word distance sum and the frequency difference sum. The calculated weight better expresses the character frequency and word frequency relationships, so the characters can be grouped more reasonably.
The invention also discloses a computer-readable storage medium and an electronic device. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the character grouping method based on word distance word chains described above. The electronic device comprises a memory, a processor and a computer program stored in the memory; when the processor executes the computer program, it implements the character grouping method based on word distance word chains described above.
Claims (7)
1. A character grouping method based on word distance word chains, characterized in that the method comprises the following steps:
traversing a corpus, counting the frequency of the N characters to be grouped, segmenting all texts in the corpus into words, and calculating the probability of each word from the frequency of the words formed by the N characters;
repeating the following steps in order of character frequency, from high to low, until all characters are grouped;
calculating, according to the following formula, the word chain sum of the character c to be assigned and the grouped characters c_i in the kth group:
in the formula, the word set consists of all words containing both character c and character c_i;
2. The character grouping method based on the word distance word chain as claimed in claim 1, wherein: also comprises the following steps:
sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
in the formula, the distance term is the word distance between character c and character c_i, that is, the difference between the subscripts of character c and character c_i after sorting by character frequency;
The weight of the character c to be assigned to the kth group is calculated according to the following formula:
3. The character grouping method based on the word distance word chain as claimed in claim 1, wherein: also comprises the following steps:
sorting the characters by character frequency, and calculating the frequency difference sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
in the formula, the difference term is the frequency difference between character c and character c_i;
The weight of the character c to be assigned to the kth group is calculated according to the following formula:
4. The character grouping method based on a word distance word chain as claimed in claim 1, wherein: also comprises the following steps:
sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
in the formula, the distance term is the word distance between character c and character c_i;
and calculating the frequency difference sum of the character c to be assigned and the grouped characters c_i in the kth group after sorting:
in the formula, the difference term is the frequency difference between character c and character c_i;
The weight of the character c to be assigned to the kth group is calculated according to the following formula:
6. A computer-readable storage medium characterized by: stored thereon a computer program which, when executed by a processor, implements the word distance word chain based character grouping method as claimed in any one of claims 1-5.
7. An electronic device, characterized in that: comprising a memory, a processor and a computer program stored on the memory, the processor, when executing the computer program, implementing the method for grouping characters based on a word distance word chain according to any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211416946.7A CN115455965B (en) | 2022-11-14 | 2022-11-14 | Character grouping method based on word distance word chain, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115455965A (en) | 2022-12-09 |
CN115455965B (en) | 2023-03-10 |
Family
ID=84295728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211416946.7A Active CN115455965B (en) | 2022-11-14 | 2022-11-14 | Character grouping method based on word distance word chain, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115455965B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1740943A (en) * | 2004-08-27 | 2006-03-01 | 北京北大方正电子有限公司 | A file enciphering method |
CN114708133A (en) * | 2022-01-27 | 2022-07-05 | 北京国隐科技有限公司 | Universal text watermarking method and device |
CN114936961A (en) * | 2022-06-07 | 2022-08-23 | 杭州电子科技大学 | Robust text watermarking method based on Chinese character characteristic modification and grouping |
Non-Patent Citations (1)
Title |
---|
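基于数字水印技术的文档追踪系统的研究和实现 (Research and implementation of a document tracking system based on digital watermarking technology); 于泳波 (Yu Yongbo); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 中国知网 (CNKI); 2018-11-15 (No. 11); pp. I136-63 * |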
Also Published As
Publication number | Publication date |
---|---|
CN115455965A (en) | 2022-12-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||