CN115455965B - Character grouping method based on word distance word chain, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115455965B
Authority
CN
China
Prior art keywords
character
word
characters
frequency
grouped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211416946.7A
Other languages
Chinese (zh)
Other versions
CN115455965A (en)
Inventor
田辉
鲁国峰
朱鹏远
郭玉刚
张志翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei High Dimensional Data Technology Co ltd
Original Assignee
Hefei High Dimensional Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-11-14
Filing date
2022-11-14
Publication date
2023-03-10
Application filed by Hefei High Dimensional Data Technology Co ltd
Priority to CN202211416946.7A
Publication of CN115455965A
Application granted
Publication of CN115455965B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Abstract

The invention relates to a character grouping method based on word distance and word chains, together with a storage medium and electronic equipment. The method comprises the following steps: traversing a corpus and counting the frequency of the N characters to be grouped; segmenting all texts in the corpus into words and calculating the probability of each word from the frequency of the words formed by the N characters; and repeating the following steps in order of character frequency from high to low until all characters are grouped: calculating the word chain sum between the character c to be assigned and the grouped characters c_i in the k-th group, and, taking the normalized word chain sum as the weight, adding the character c to the group with the smallest weight. The word chain sum reflects how often the character c to be assigned appears in the same words as the characters already in a group; the larger the value, the more often they appear together, and the more they should be assigned to different groups. Converting the character grouping problem into a comparison of concrete weights makes the grouping more reasonable and more accurate.

Description

Character grouping method based on word distance word chain, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of invisible font-library watermarks, and in particular to a character grouping method based on word distance and word chains, a storage medium, and electronic equipment.
Background
In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become the mainstream approach, because it improves the robustness of the watermarking algorithm against attacks such as printing and scanning, screen capture, and screen photography. Specific characters are deformed in different ways, each deformation corresponds to a different watermark bit string, and the character deformation data are stored in a dedicated watermark font library; watermark information is then embedded by font replacement when an electronic text document is printed or displayed on screen. When different character deformation data are used for different users, the dedicated watermark font library constitutes that user's secure font library (safe word stock).
Existing secure font libraries have many defects. To solve problems of the prior art such as poor universality of watermark loading, poor system stability, a complex implementation process, and low robustness of the watermarking algorithm, without changing any usage habits of the user, the patent "Universal text watermarking method and device" (publication number CN114708133A), filed by Beijing Guoyin Technology Co., Ltd. (北京国隐科技有限公司), discloses the following scheme. A universal text watermarking method comprises the steps of: grouping a certain number of characters in the selected font library according to a specific strategy; performing deformation design on all characters in each group according to a specific rule, and generating a temporary watermark character data file; generating watermark coding data for the user terminal to identify the identity authentication information of the user terminal; according to the watermark coding data, combining the temporary watermark character data file and the grouped characters to dynamically generate and load a watermark font library file in real time; and running the text file in electronic format, using the watermark font file to embed watermark information in real time into the document content that is printed or displayed on screen.
In this scheme, characters need to be grouped. In theory, characters with higher character frequency should be placed in different groups, and characters that often appear together should also be placed in different groups. A secure font library that meets these two requirements needs less text content when the security code is extracted, so the extraction is more effective and more accurate. The character grouping method in the above scheme has several defects. First, the number of characters in each group is substantially equal, which conflicts with the requirements above. Second, only character frequency is considered during grouping and word frequency is not fully considered; in theory, the characters of frequently occurring words should be placed in different groups, so that more groups appear in shorter content and less content is required when extracting the security code. Third, the calculation used to optimize the grouping is too complex and consumes a large amount of time and computing power.
Disclosure of Invention
The invention aims to provide a character grouping method based on word distance word chains, which can more reasonably group characters.
To achieve this purpose, the invention adopts the following technical scheme. A character grouping method based on word distance word chains comprises the following steps: traversing the corpus and counting the frequency of the N characters to be grouped; segmenting all texts in the corpus and calculating the probability $P(w)$ of each word $w$ according to the frequency of the words formed by the N characters; repeating the following steps in order of character frequency from high to low until all characters are grouped; calculating the word chain sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ according to the following formula:
$$L_k(c) = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$
where $W(c, c_i)$ is the set of all words containing both character $c$ and character $c_i$; normalizing the word chain sums of all groups to obtain $\hat{L}_k(c)$; and, taking the normalized word chain sum $\hat{L}_k(c)$ as the weight, adding the character $c$ to be assigned to the group with the smallest weight.
Compared with the prior art, the invention has the following technical effects. The word chain sum reflects how often the character c to be assigned appears in the same words as the other characters in a group; the larger the value, the more often they appear together, and the more they should be assigned to different groups. The weight calculated from the word chain sum reflects exactly this relationship. By converting the character grouping problem into a comparison of concrete weights, the grouping becomes more reasonable and more accurate.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the present invention;
FIG. 3 is a schematic flow chart of a second embodiment of the present invention;
FIG. 4 is a flow chart of a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to fig. 1 to 4.
Referring to fig. 1, the invention discloses a character grouping method based on word distance word chains, comprising the following steps. The corpus is traversed and the frequency of the N characters to be grouped is counted; the preferred range of N is 1000 to 3000. If the N characters to be grouped are specified, their frequencies are counted directly; if only the value of N is given, the characters are sorted by character frequency from high to low and the N characters with the highest frequency are selected. Many word segmentation models exist; a mature one is selected to segment all texts in the corpus, and the probability $P(w)$ of each word $w$ is calculated according to the frequency of occurrence of the words formed by the N characters. The character frequency and the word frequency can be calculated from an existing corpus and model, or previously calculated results can be adopted directly. The corpus can be chosen according to the user's needs: either a general corpus or the internal corpus of a particular enterprise or organization. Different corpora yield different character groups.
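The counting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the corpus is assumed to be already segmented by some external word segmentation model, the function names `char_frequencies` and `word_probabilities` are hypothetical, and the word probability is assumed to be the relative frequency among multi-character words made up entirely of the selected characters, which the text does not spell out.

```python
from collections import Counter


def char_frequencies(texts, n):
    """Count non-space characters and keep the n most frequent ones."""
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    return dict(counts.most_common(n))


def word_probabilities(segmented_texts, chars):
    """Relative frequency of each word formed only by the selected characters.

    Assumption: single-character words are skipped because they cannot
    link two characters together.
    """
    counts = Counter(
        word
        for words in segmented_texts
        for word in words
        if len(word) > 1 and all(ch in chars for ch in word)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

If only N is given, `char_frequencies(texts, N)` selects the N most frequent characters; if the characters are specified in advance, their set can be passed directly as `chars`.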
The following steps are repeated in order of character frequency from high to low until all characters are grouped. For each character c to be assigned, the word chain sum with each group is calculated according to the formula below and c is added to the group with the smallest weight; this constitutes the complete grouping process for c. Executing this process for every character, in order of character frequency from high to low, completes the grouping of the N characters.
The word chain sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ is calculated according to the following formula:
$$L_k(c) = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$
where $P(w)$ is the probability of word $w$ and $W(c, c_i)$ is the set of all words containing both character $c$ and character $c_i$. No restriction is placed on whether $c$ and $c_i$ are adjacent or on their order: every word containing both characters belongs to $W(c, c_i)$. The word chain sum reflects how strongly two characters appear together in words; its purpose is to place characters that appear together into different groups as far as possible in the subsequent grouping.
The word chain sums of all groups are normalized to obtain $\hat{L}_k(c)$. Taking the normalized word chain sum $\hat{L}_k(c)$ as the weight, the character $c$ to be assigned is added to the group with the smallest weight. By converting the character grouping problem into a comparison of concrete weights, the grouping becomes more reasonable and more accurate.
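A minimal Python sketch of this greedy assignment is given below. Several points are assumptions rather than details from the text: the word chain sum is taken to be the summed probability of all words containing both characters, normalization divides each group's sum by the total over all groups, ties are broken by group size and then group index, and the number of groups `num_groups` as well as the names `pair_chain` and `group_characters` are hypothetical.

```python
from itertools import combinations


def pair_chain(word_prob):
    """Map each unordered character pair to the summed probability of the
    words that contain both characters (adjacency and order are ignored)."""
    chain = {}
    for word, prob in word_prob.items():
        for a, b in combinations(sorted(set(word)), 2):
            chain[(a, b)] = chain.get((a, b), 0.0) + prob
    return chain


def group_characters(char_freq, word_prob, num_groups):
    """Assign characters to groups in descending order of character frequency."""
    chain = pair_chain(word_prob)
    groups = [[] for _ in range(num_groups)]
    # Ties in frequency are ordered by character to keep the result deterministic.
    order = sorted(char_freq, key=lambda ch: (-char_freq[ch], ch))
    for c in order:
        # Word chain sum of c with every character already in each group.
        sums = [
            sum(chain.get(tuple(sorted((c, ci))), 0.0) for ci in group)
            for group in groups
        ]
        total = sum(sums)
        # Assumed normalization: share of the total across all groups.
        weights = [s / total for s in sums] if total else sums
        # Smallest weight wins; ties go to the smaller group, then lower index.
        k = min(range(num_groups), key=lambda i: (weights[i], len(groups[i]), i))
        groups[k].append(c)
    return groups
```

With `char_freq = {"a": 5, "b": 4, "c": 3, "d": 2}`, `word_prob = {"ab": 0.5, "cd": 0.3, "ac": 0.2}` and two groups, the characters of each word end up in different groups.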
The weight above is calculated only from the word chain sum. To also introduce information about character frequency, the invention provides three preferred embodiments for reference.
Referring to fig. 2, in the first embodiment, the word distance sum is introduced into the weight calculation formula. Specifically, the method further comprises the following steps: sorting the characters by character frequency and, after sorting, calculating the word distance sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where $d(c, c_i)$ is the distance between character $c$ and character $c_i$. The word distance here is the difference between the subscripts of two characters after sorting by character frequency: for example, the distance between the 1st and 2nd characters is 1, and the distance between the character with the highest frequency and the character with the lowest frequency is the largest, namely N-1. The word distance therefore directly reflects the character frequency relationship.
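The word distance itself is simple to compute. The sketch below ranks characters by frequency and takes the absolute difference of their 1-based positions; the function names `rank_by_frequency` and `word_distance` are hypothetical, and breaking frequency ties by character is an assumption made only to keep the ranking deterministic.

```python
def rank_by_frequency(char_freq):
    """Return the 1-based position of each character after sorting by
    descending character frequency (ties ordered by character)."""
    order = sorted(char_freq, key=lambda ch: (-char_freq[ch], ch))
    return {ch: position for position, ch in enumerate(order, start=1)}


def word_distance(rank, a, b):
    """Distance between two characters: the difference of their subscripts."""
    return abs(rank[a] - rank[b])
```

For N ranked characters, the largest possible distance is N-1, between the most frequent and the least frequent character.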
For ease of calculation, the word distance sums of all groups are normalized to obtain the normalized word distance sum. The weight of assigning the character $c$ to the k-th group is then calculated according to the following formula:
[formula image not recovered]
where $\lambda_1$ is a preset weight coefficient greater than or equal to 0. When $\lambda_1 = 0$, only the word chain sum is used as the weight; when $\lambda_1 > 0$, the weight is computed jointly from the word chain sum and the word distance sum. Presetting the parameter $\lambda_1$ adjusts the relative contributions of the word chain sum and the word distance sum to the calculated weight. Finally, the character $c$ to be assigned is added to the group with the smallest weight.
In the above formula, $C$ is the set of grouped characters and $\sigma$ is the standard deviation of the normalized frequencies of the grouped characters. It is calculated as follows: first, the frequencies of all grouped characters are normalized, for example, with the frequencies denoted $Q_1$, $Q_2$, $Q_3$, …, normalization divides the frequency of each character by the sum of all the frequencies; second, the standard deviation of the normalized values is computed to obtain $\sigma$.
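The two-step calculation of the standard deviation can be sketched as follows. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form, and the function name `normalized_frequency_std` is hypothetical.

```python
from statistics import pstdev


def normalized_frequency_std(frequencies):
    """Standard deviation of the grouped characters' frequencies after each
    frequency has been divided by the sum of all frequencies."""
    total = sum(frequencies)
    if total == 0:
        # No grouped characters yet (or all-zero counts): no spread to report.
        return 0.0
    normalized = [q / total for q in frequencies]
    # Assumption: population standard deviation.
    return pstdev(normalized)
```

For example, frequencies of 1 and 3 normalize to 0.25 and 0.75, whose population standard deviation is 0.25.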
The word chain sum reflects how often the character c to be assigned appears in the same words as the other characters in a group; the larger the value, the more often they appear together, and the more they should be assigned to different groups. The word distance reflects the character frequency relationship, so that high-frequency characters of similar frequency are allocated to different groups. The weight calculated from these two relationships reflects both requirements.
Referring to fig. 3, in the second embodiment, the frequency difference sum is introduced into the weight calculation formula. Specifically, the method further comprises the following steps: sorting the characters by character frequency and, after sorting, calculating the frequency difference sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where the frequency difference between character $c$ and character $c_i$ directly reflects the character frequency relationship. The frequency difference sums of all groups are normalized to obtain the normalized frequency difference sum. The weight of assigning the character $c$ to the k-th group is calculated according to the following formula:
[formula image not recovered]
where $\lambda_2$ is a preset weight coefficient greater than or equal to 0, $C$ is the set of grouped characters, and $\sigma$ is the standard deviation of the normalized frequencies of the grouped characters. $\lambda_2$ plays a role similar to that of the weight coefficient in the first embodiment; both take values in the range [0,10], although their values need not be equal.
Referring to fig. 4, in the third embodiment, both the word distance sum and the frequency difference sum are introduced into the weight calculation formula. Specifically, the word distance sum and the frequency difference sum are calculated according to the steps above, and the weight of assigning the character $c$ to the k-th group is then calculated according to the following formula:
[formula image not recovered]
This embodiment combines three factors: the word chain sum, the word distance sum, and the frequency difference sum. The calculated weight better expresses the relationship between character frequency and word frequency, so the characters can be grouped more reasonably.
The invention also discloses a computer readable storage medium and an electronic device. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method for character grouping based on word distance word chains as set forth above. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor, when executing the computer program, implementing the character grouping method based on word distance word chains as described above.

Claims (7)

1. A character grouping method based on word distance word chains, characterized in that the method comprises the following steps:
traversing the corpus, counting the frequency of N characters to be grouped, segmenting all texts in the corpus, and calculating the probability $P(w)$ of each word $w$ according to the frequency of the words formed by the N characters;
repeating the following steps in order of character frequency from high to low until all the characters are grouped;
calculating, according to the following formula, the word chain sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
$$L_k(c) = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$
where $W(c, c_i)$ is the set of all words containing both character $c$ and character $c_i$;
normalizing the word chain sums of all groups to obtain $\hat{L}_k(c)$;
taking the normalized word chain sum $\hat{L}_k(c)$ as the weight, adding the character $c$ to be assigned to the group with the smallest weight.
2. The character grouping method based on word distance word chains as claimed in claim 1, characterized in that the method further comprises the following steps:
sorting the characters by character frequency and, after sorting, calculating the word distance sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where $d(c, c_i)$ is the distance between character $c$ and character $c_i$, that is, the difference between the subscripts of $c$ and $c_i$ after sorting by character frequency;
normalizing the word distance sums of all groups to obtain the normalized word distance sum;
calculating the weight of assigning the character $c$ to the k-th group according to the following formula:
[formula image not recovered]
where $\lambda_1$ is a preset weight coefficient greater than or equal to 0, $C$ is the set of grouped characters, and $\sigma$ is the standard deviation of the normalized frequencies of the grouped characters.
3. The character grouping method based on word distance word chains as claimed in claim 1, characterized in that the method further comprises the following steps:
sorting the characters by character frequency and, after sorting, calculating the frequency difference sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where the summed quantity is the frequency difference between character $c$ and character $c_i$;
normalizing the frequency difference sums of all groups to obtain the normalized frequency difference sum;
calculating the weight of assigning the character $c$ to the k-th group according to the following formula:
[formula image not recovered]
where $\lambda_2$ is a preset weight coefficient greater than or equal to 0, $C$ is the set of grouped characters, and $\sigma$ is the standard deviation of the normalized frequencies of the grouped characters.
4. The character grouping method based on word distance word chains as claimed in claim 1, characterized in that the method further comprises the following steps:
sorting the characters by character frequency and, after sorting, calculating the word distance sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where $d(c, c_i)$ is the distance between character $c$ and character $c_i$;
and, after sorting, calculating the frequency difference sum between the character $c$ to be assigned and the grouped characters $c_i$ in the k-th group $G_k$:
[formula image not recovered]
where the summed quantity is the frequency difference between character $c$ and character $c_i$;
normalizing the word distance sums and the frequency difference sums of all groups to obtain the normalized word distance sum and the normalized frequency difference sum;
calculating the weight of assigning the character $c$ to the k-th group according to the following formula:
[formula image not recovered]
where $\lambda_1$ and $\lambda_2$ are preset weight coefficients greater than or equal to 0, $C$ is the set of grouped characters, and $\sigma$ is the standard deviation of the normalized frequencies of the grouped characters.
5. The character grouping method based on word distance word chains as claimed in claim 4, characterized in that the values of the two weight coefficients $\lambda_1$ and $\lambda_2$ both lie in the range [0,10].
6. A computer-readable storage medium characterized by: stored thereon a computer program which, when executed by a processor, implements the word distance word chain based character grouping method as claimed in any one of claims 1-5.
7. An electronic device, characterized in that: comprising a memory, a processor and a computer program stored on the memory, the processor, when executing the computer program, implementing the method for grouping characters based on a word distance word chain according to any of claims 1-5.
CN202211416946.7A 2022-11-14 2022-11-14 Character grouping method based on word distance word chain, storage medium and electronic equipment Active CN115455965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211416946.7A CN115455965B (en) 2022-11-14 2022-11-14 Character grouping method based on word distance word chain, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211416946.7A CN115455965B (en) 2022-11-14 2022-11-14 Character grouping method based on word distance word chain, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115455965A (en) 2022-12-09
CN115455965B (en) 2023-03-10

Family

ID=84295728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211416946.7A Active CN115455965B (en) 2022-11-14 2022-11-14 Character grouping method based on word distance word chain, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115455965B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1740943A (en) * 2004-08-27 2006-03-01 北京北大方正电子有限公司 A file enciphering method
CN114708133A (en) * 2022-01-27 2022-07-05 北京国隐科技有限公司 Universal text watermarking method and device
CN114936961A (en) * 2022-06-07 2022-08-23 杭州电子科技大学 Robust text watermarking method based on Chinese character characteristic modification and grouping

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
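Research and Implementation of a Document Tracking System Based on Digital Watermarking Technology (基于数字水印技术的文档追踪系统的研究和实现); Yu Yongbo (于泳波); China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); CNKI (中国知网); 2018-11-15 (No. 11); pp. I136-63 *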

Also Published As

Publication number Publication date
CN115455965A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN108053545B (en) Certificate verification method and device, server and storage medium
CN104317891B (en) A kind of method and device that label is marked to the page
CN114708133B (en) Universal text watermarking method and device
CN111931489B (en) Text error correction method, device and equipment
CN115689853A (en) Robust text watermarking method based on Chinese character characteristic modification and grouping
CN101639828B (en) Method for hiding and extracting watermark based on XML electronic document
CN112016061A (en) Excel document data protection method based on robust watermarking technology
CN114356919A (en) Watermark embedding method, tracing method and device for structured database
CN112861844A (en) Service data processing method and device and server
Alkhafaji et al. Payload capacity scheme for quran text watermarking based on vowels with kashida
CN110770725A (en) Data processing method and device
CN115618809A (en) Character grouping method based on binary character frequency and safe word stock construction method
CN115455965B (en) Character grouping method based on word distance word chain, storage medium and electronic equipment
CN112860957B (en) Method, medium and system for checking fixed value list
Ghilan et al. Combined Markov model and zero watermarking techniques to enhance content authentication of english text documents
CN103136166B (en) Method and device for font determination
WO2024066271A1 (en) Database watermark embedding method and apparatus, database watermark tracing method and apparatus, and electronic device
CN115455966B (en) Safe word stock construction method and safe code extraction method thereof
CN115455987B (en) Character grouping method based on word frequency and word frequency, storage medium and electronic equipment
Shah et al. Query preserving relational database watermarking
CN115422125A (en) Electronic document automatic filing method and system based on intelligent algorithm
Majumder et al. A generalized model of text steganography by summary generation using frequency analysis
CN115883111A (en) Phishing website identification method and device, electronic equipment and storage medium
CN117648681B (en) OFD format electronic document hidden information extraction and embedding method
CN112732901A (en) Abstract generation method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant