CN115455965B - Character grouping method based on word distance word chain, storage medium and electronic equipment

Info

Publication number: CN115455965B
Application number: CN202211416946.7A
Authority: CN
Inventors: 田辉; 鲁国峰; 朱鹏远; 郭玉刚; 张志翔
Original assignee: Hefei High Dimensional Data Technology Co ltd
Current assignee: Hefei High Dimensional Data Technology Co ltd
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-03-10
Anticipated expiration: 2042-11-14
Also published as: CN115455965A

Abstract

The invention relates to a character grouping method based on word distance word chains, a storage medium and electronic equipment. The character grouping method comprises the following steps: traversing the corpus and counting the frequency of the N characters to be grouped; segmenting all texts in the corpus into words and calculating the probability of each word from the frequency of the words formed by the N characters; and repeating the following steps for the characters in order of character frequency from high to low until all characters are grouped: calculating the word chain sum of the character c to be assigned and the grouped characters c_i in the k-th group; normalizing the word chain sums of all groups; and, taking the normalized word chain sum as the weight, adding the character c to be assigned to the group with the minimum weight. The word chain sum reflects how often the character c to be assigned appears in the same word as the characters already in a group; the larger the value, the more often the characters appear together, and such characters should be assigned to different groups. The character grouping problem is thereby converted into a concrete comparison of weights, which makes the grouping more reasonable and more accurate.

Description

Character grouping method based on word distance word chain, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of invisible watermarks for font libraries, and in particular to a character grouping method based on word distance word chains, a storage medium and electronic equipment.

Background

In existing text watermarking technology, text digital watermarking based on modifying the topological structure of characters has become the mainstream, because it improves the robustness of the watermarking algorithm against attacks such as printing and scanning, screen capture and screen photographing. Specific characters are deformed in different ways, each deformation corresponding to a different watermark bit string, and the deformation data are stored in a dedicated watermark font library; the watermark information is embedded by font replacement when an electronic text document is printed or displayed on screen. When different character deformation data are used for different users, the watermark font library constitutes the secure font library of that user.

Existing secure font libraries have many defects. To solve the prior-art problems of poor universality of watermark loading, poor system stability, complex implementation and low robustness of the watermarking algorithm without changing any usage habit of the user, the patent "A universal text watermarking method and device" (publication number: CN 114708133A), filed by Beijing national crypto-technology Limited company, discloses the following scheme. A general text watermarking method comprises the steps of: grouping a certain number of characters in the selected font library according to a specific strategy; deforming all characters in each group according to a specific rule and generating a temporary watermark character data file; generating watermark coding data of the user terminal to identify the identity authentication information of the user terminal; according to the watermark coding data, combining the temporary watermark character data file and the grouped characters to dynamically generate and load a watermark font library file in real time; and opening the text file in electronic format and using the watermark font file to embed watermark information in real time in the document content that is printed or displayed on screen.

In this scheme, characters need to be grouped. In theory, characters with higher character frequency should be placed in different groups, and characters that often appear together should also be placed in different groups. A secure font library that meets these two requirements needs less text content when the security code is extracted, so extraction is more effective and more accurate. The character grouping method in that scheme has several defects: first, the number of characters in each group is substantially equal, which conflicts with the above requirements; second, only character frequency is considered during grouping and word frequency is not fully considered, whereas in theory the characters of frequently occurring words should be placed in different groups, so that more groups appear in shorter content and less content is needed when the security code is extracted; third, the calculation process for optimizing the grouping is too complex and consumes a large amount of time and computing power.

Disclosure of Invention

The invention aims to provide a character grouping method based on word distance word chains, which can more reasonably group characters.

To achieve this purpose, the invention adopts the following technical scheme. A character grouping method based on word distance word chains comprises the following steps: traversing the corpus, counting the frequency of the N characters to be grouped, segmenting all texts in the corpus into words, and calculating the probability $P(w)$ of each word $w$ from the frequency of the words formed by the N characters; and repeating the following steps for the characters in order of character frequency from high to low until all characters are grouped. The word chain sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ is calculated according to the following formula:

$$L_k = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$

where $W(c, c_i)$ is the set of all words containing both character c and character $c_i$. The word chain sums of all groups are normalized to obtain $\hat{L}_k$. Taking the normalized word chain sum $\hat{L}_k$ as the weight, the character c to be assigned is added to the group with the smallest weight.

Compared with the prior art, the invention has the following technical effects. The word chain sum reflects how often the character c to be assigned appears in the same word as the other characters in a group; the larger the value, the more often the characters appear together, and such characters should be assigned to different groups. The weight calculated from the word chain sum reflects exactly this relationship. By converting the character grouping problem into a concrete comparison of weights, the grouping becomes more reasonable and more accurate.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the present invention;

FIG. 3 is a schematic flow chart of a second embodiment of the present invention;

FIG. 4 is a flow chart of a third embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to fig. 1 to 4.

Referring to fig. 1, the invention discloses a character grouping method based on word distance word chains, comprising the following steps. The corpus is traversed and the frequency of the N characters to be grouped is counted; the preferred range of N is 1000 to 3000. If the N characters to be grouped are specified, they are counted directly; if only the value of N is given, all characters are sorted by character frequency from high to low and the N characters with the highest frequency are selected. Many word segmentation models exist; a mature one is selected to segment all texts in the corpus, and the probability $P(w)$ of each word $w$ is calculated from the frequency of occurrence of the words formed by the N characters. The character frequency and the word frequency can be calculated with an existing corpus and model, or previously calculated results can be used directly. The corpus can be chosen according to the needs of the user: a general corpus may be used, or the internal corpus of a particular enterprise or organization, and different corpora yield different character groups.
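The counting step can be illustrated with a minimal Python sketch. It assumes the corpus has already been segmented by some external word segmentation model, and the function name `corpus_statistics`, the restriction to multi-character words, and the definition of word probability as a word's count divided by the total count of retained words are illustrative assumptions rather than details given in the text.

```python
from collections import Counter


def corpus_statistics(segmented_texts, n_chars):
    """Count character and word frequencies in an already segmented corpus.

    segmented_texts: iterable of word lists (output of any segmentation model).
    n_chars: the number N of characters to be grouped.
    Returns (characters sorted by frequency, character counts, word probabilities).
    """
    char_counts = Counter()
    word_counts = Counter()
    for words in segmented_texts:
        for word in words:
            word_counts[word] += 1
            char_counts.update(word)  # counts every character of the word

    # The N most frequent characters, from high to low frequency.
    chars = [ch for ch, _ in char_counts.most_common(n_chars)]
    selected = set(chars)

    # Assumption: keep only multi-character words made up entirely of the
    # selected characters, since only these can link two grouped characters.
    kept = {
        w: n
        for w, n in word_counts.items()
        if len(w) > 1 and all(ch in selected for ch in w)
    }
    total = sum(kept.values())
    # Assumption: probability = count of the word / total count of kept words.
    word_prob = {w: n / total for w, n in kept.items()} if total else {}
    return chars, char_counts, word_prob
```

Passing a general corpus or an enterprise-internal corpus to such a function would yield different statistics, and therefore different groups, as the description notes.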

The following steps are repeated for the characters in order of character frequency from high to low until all characters are grouped. For each character c to be assigned, the word chain sum of c with each group is calculated by the formula given below, and c is added to the group with the minimum weight; this is the complete grouping process for character c. Executing this process for every character in order of character frequency from high to low completes the grouping of the N characters.

The word chain sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ is calculated according to the following formula:

$$L_k = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$

where $P(w)$ is the probability of word $w$ and $W(c, c_i)$ is the set of all words containing both character c and character $c_i$. No restriction is placed on whether c and $c_i$ are adjacent or on their order; every word containing the two characters belongs to $W(c, c_i)$. The word chain sum reflects how strongly the two characters appear together in words, and its purpose is to place characters that appear together in different groups as far as possible in the subsequent grouping.
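A minimal Python sketch of the word chain sum follows. The function name `word_chain_sum` and the dictionary `word_prob` (mapping each word to its probability) are hypothetical names chosen for illustration; the double loop simply sums the probability of every word that contains both the character to be assigned and a character already in the group, regardless of adjacency or order.

```python
def word_chain_sum(c, group, word_prob):
    """Sum the probabilities of all words containing both c and a grouped character.

    c: the character to be assigned.
    group: the characters already placed in one group.
    word_prob: dict mapping word -> probability (hypothetical input structure).
    """
    total = 0.0
    for c_i in group:
        for word, p in word_prob.items():
            # Adjacency and order are not restricted: any word containing
            # both characters contributes its probability.
            if c in word and c_i in word:
                total += p
    return total
```

A word that contains c together with two different grouped characters contributes once for each of them, which follows from summing over every grouped character in turn.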

The word chain sums of all groups are normalized to obtain $\hat{L}_k$. Taking the normalized word chain sum $\hat{L}_k$ as the weight, the character c to be assigned is added to the group with the smallest weight. By converting the character grouping problem into a concrete comparison of weights, the grouping becomes more reasonable and more accurate.
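The assignment loop described above can be sketched in Python as follows. Several details are assumptions, because the text does not fix them: the number of groups `num_groups` is a free parameter, normalization divides each group's word chain sum by the total over all groups, and ties are broken in favour of the smaller group and then the lower group index.

```python
def normalize(values):
    """Assumed normalization: divide each value by the total over all groups."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else [0.0] * len(values)


def group_characters(chars_by_freq, word_prob, num_groups):
    """Greedily assign characters, highest frequency first, to the lowest-weight group.

    chars_by_freq: characters sorted by character frequency, high to low.
    word_prob: dict mapping word -> probability.
    num_groups: number of groups (hypothetical parameter, not given in the text).
    """
    groups = [[] for _ in range(num_groups)]
    for c in chars_by_freq:
        # Word chain sum of c with every existing group.
        chain_sums = [
            sum(p for w, p in word_prob.items() for c_i in g if c in w and c_i in w)
            for g in groups
        ]
        weights = normalize(chain_sums)
        # Assumed tie-break: smaller group first, then lower index.
        k = min(range(num_groups), key=lambda i: (weights[i], len(groups[i])))
        groups[k].append(c)
    return groups
```

With `chars_by_freq = ["a", "b", "c", "d"]`, `word_prob = {"ab": 0.6, "cd": 0.4}` and two groups, the characters of each word end up in different groups, which is the behaviour the weight comparison is meant to produce.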

The above weight is calculated only from the word chain sum. To introduce character-frequency information as well, the invention provides three preferred embodiments for reference.

Referring to fig. 2, in the first embodiment, the word distance sum is introduced into the weight calculation formula. Specifically, the invention also includes the following steps: sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$D_k = \sum_{c_i \in G_k} d(c, c_i)$$

where $d(c, c_i)$ is the word distance between character c and character $c_i$. The word distance is the difference between the subscripts of the two characters after sorting by character frequency; for example, the distance between the 1st and the 2nd character is 1, and the distance between the character with the highest frequency and the character with the lowest frequency is the largest, namely N-1. The word distance thus directly reflects the character frequency.
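The word distance can be illustrated with a short Python sketch. The helper names `frequency_ranks` and `word_distance_sum` are hypothetical; the only behaviour taken from the text is that the distance between two characters is the difference of their positions after sorting by character frequency.

```python
def frequency_ranks(chars_by_freq):
    """Map each character to its position after sorting by frequency (0-based)."""
    return {ch: i for i, ch in enumerate(chars_by_freq)}


def word_distance_sum(c, group, rank):
    """Sum of rank differences between c and every character already in the group."""
    return sum(abs(rank[c] - rank[c_i]) for c_i in group)
```

For four characters sorted as `"abcd"`, the distance between `a` and `b` is 1 and the distance between `a` and `d` is 3, that is N-1, matching the example in the description.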

For ease of calculation, the word distance sums of all groups are normalized to obtain $\hat{D}_k$. The weight of the character c to be assigned to the k-th group is then calculated from the normalized word chain sum and the normalized word distance sum by a weight formula in which $\alpha$ is a preset weight coefficient greater than or equal to 0. When $\alpha = 0$, only the word chain sum is used as the weight; when $\alpha > 0$, the word chain sum and the word distance sum jointly determine the weight. By presetting the parameter $\alpha$, the relative contributions of the word chain sum and the word distance sum to the calculated weight can be adjusted. Finally, the character c to be assigned is added to the group with the minimum weight.

In the weight formula, C is the set of grouped characters and $\sigma_C$ is the standard deviation of the normalized frequencies of the grouped characters. It is calculated as follows: first, the frequencies of all grouped characters, denoted $Q_1, Q_2, Q_3, \ldots$, are normalized by dividing the frequency of each character by the sum of all the frequencies; second, the standard deviation of the normalized values is taken, giving $\sigma_C$.
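The two-step calculation of this standard deviation can be sketched in Python. The function name is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or the sample form is intended.

```python
import statistics


def normalized_frequency_std(frequencies):
    """Standard deviation of the grouped characters' normalized frequencies.

    frequencies: the frequencies Q1, Q2, Q3, ... of all grouped characters.
    """
    freqs = list(frequencies)
    total = sum(freqs)
    if not freqs or total == 0:
        return 0.0  # nothing grouped yet, so there is no spread to measure
    # Step 1: normalize by dividing each frequency by the sum of all frequencies.
    normalized = [q / total for q in freqs]
    # Step 2: standard deviation of the normalized values
    # (population form, an assumption).
    return statistics.pstdev(normalized)
```

Equal frequencies give a value of 0, and the value grows as the frequencies of the grouped characters become more uneven.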

The word chain sum reflects how often the character c to be assigned appears in the same word as the other characters in a group; the larger the value, the more often the characters appear together, and such characters should be assigned to different groups. The word distance reflects the character-frequency relationship, so that high-frequency characters of similar frequency are assigned to different groups. The weight calculated from these two quantities reflects exactly both relationships.

Referring to fig. 3, in the second embodiment, the frequency difference sum is introduced into the weight calculation formula. Specifically, the invention also comprises the following steps: sorting the characters by character frequency, and calculating the frequency difference sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$F_k = \sum_{c_i \in G_k} \Delta f(c, c_i)$$

where $\Delta f(c, c_i)$ is the frequency difference between character c and character $c_i$, which directly reflects the character-frequency relationship. The frequency difference sums of all groups are normalized to obtain $\hat{F}_k$. The weight of the character c to be assigned to the k-th group is then calculated from the normalized word chain sum and the normalized frequency difference sum by a weight formula in which $\beta$ is a preset weight coefficient greater than or equal to 0, C is the set of grouped characters, and $\sigma_C$ is the standard deviation of the normalized frequencies of the grouped characters.
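The frequency difference sum and its normalization over groups can be sketched in Python as follows. The function names, the `freq` dictionary, and the choice to normalize by dividing by the total over all groups are illustrative assumptions; the text only states that the sums of all groups are normalized.

```python
def frequency_difference_sum(c, group, freq):
    """Sum of frequency differences between c and every character in the group.

    freq: dict mapping character -> frequency (hypothetical input structure).
    """
    return sum(abs(freq[c] - freq[c_i]) for c_i in group)


def normalized_frequency_difference_sums(c, groups, freq):
    """Frequency difference sum of c with each group, scaled to sum to 1.

    Assumption: normalization divides by the total over all groups.
    """
    sums = [frequency_difference_sum(c, g, freq) for g in groups]
    total = sum(sums)
    return [s / total for s in sums] if total > 0 else [0.0] * len(sums)
```

Unlike the word distance, which depends only on rank, this quantity depends on the actual size of the gap between frequencies, so two adjacent characters with very different counts are treated as far apart.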

The weight coefficient of the second embodiment, $\beta$, is set in the same way as the weight coefficient of the first embodiment, $\alpha$, and has a similar effect. The two coefficients have the same value range, [0, 10], but their values need not be equal.

Referring to fig. 4, in the third embodiment, the word distance sum and the frequency difference sum are introduced into the weight calculation formula at the same time. Specifically, the word distance sum and the frequency difference sum are calculated according to the steps above, and the weight of the character c to be assigned to the k-th group is then calculated from the normalized word chain sum, word distance sum and frequency difference sum.

This embodiment combines three factors, namely the word chain sum, the word distance sum and the frequency difference sum, so the calculated weight better expresses both the character-frequency and the word-frequency relationships, and the characters can be grouped more reasonably.

The invention also discloses a computer-readable storage medium and an electronic device. The computer-readable storage medium stores a computer program which, when executed by a processor, carries out the character grouping method based on word distance word chains set forth above. The electronic device comprises a memory, a processor and a computer program stored on the memory; when executing the computer program, the processor implements the character grouping method based on word distance word chains described above.

Claims

1. A character grouping method based on word distance word chains, characterized in that the method comprises the following steps:

traversing the corpus, counting the frequency of the N characters to be grouped, segmenting all texts in the corpus into words, and calculating the probability $P(w)$ of each word $w$ from the frequency of the words formed by the N characters;

repeating the following steps in order of character frequency from high to low until all characters are grouped:

calculating the word chain sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ according to the following formula:

$$L_k = \sum_{c_i \in G_k} \sum_{w \in W(c, c_i)} P(w)$$

where $W(c, c_i)$ is the set of all words containing both character c and character $c_i$;

normalizing the word chain sums of all groups to obtain $\hat{L}_k$; and

taking the normalized word chain sum $\hat{L}_k$ as the weight, adding the character c to be assigned to the group with the minimum weight.

2. The character grouping method based on word distance word chains as claimed in claim 1, characterized by further comprising the following steps:

sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$D_k = \sum_{c_i \in G_k} d(c, c_i)$$

where $d(c, c_i)$ is the word distance between character c and character $c_i$, i.e. the difference between the subscripts of character c and character $c_i$ after sorting by character frequency;

normalizing the word distance sums of all groups to obtain $\hat{D}_k$; and

calculating the weight of the character c to be assigned to the k-th group from the normalized word chain sum and the normalized word distance sum by a weight formula in which $\alpha$ is a preset weight coefficient greater than or equal to 0, C is the set of grouped characters, and $\sigma_C$ is the standard deviation of the normalized frequencies of the grouped characters.

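3. The character grouping method based on word distance word chains as claimed in claim 1, characterized by further comprising the following steps:

sorting the characters by character frequency, and calculating the frequency difference sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$F_k = \sum_{c_i \in G_k} \Delta f(c, c_i)$$

where $\Delta f(c, c_i)$ is the frequency difference between character c and character $c_i$;

normalizing the frequency difference sums of all groups to obtain $\hat{F}_k$; and

calculating the weight of the character c to be assigned to the k-th group from the normalized word chain sum and the normalized frequency difference sum by a weight formula in which $\beta$ is a preset weight coefficient greater than or equal to 0, C is the set of grouped characters, and $\sigma_C$ is the standard deviation of the normalized frequencies of the grouped characters.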

4. The character grouping method based on word distance word chains as claimed in claim 1, characterized by further comprising the following steps:

sorting the characters by character frequency, and calculating the word distance sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$D_k = \sum_{c_i \in G_k} d(c, c_i)$$

where $d(c, c_i)$ is the word distance between character c and character $c_i$;

calculating the frequency difference sum of the character c to be assigned and the grouped characters $c_i$ in the k-th group $G_k$ after sorting:

$$F_k = \sum_{c_i \in G_k} \Delta f(c, c_i)$$

where $\Delta f(c, c_i)$ is the frequency difference between character c and character $c_i$;

normalizing the word distance sums and the frequency difference sums of all groups to obtain $\hat{D}_k$ and $\hat{F}_k$; and

calculating the weight of the character c to be assigned to the k-th group from the normalized word chain sum, word distance sum and frequency difference sum by a weight formula in which $\alpha$ and $\beta$ are preset weight coefficients.

5. The character grouping method based on word distance word chains as claimed in claim 4, characterized in that the values of said $\alpha$ and $\beta$ both lie in [0, 10].

6. A computer-readable storage medium characterized by: stored thereon a computer program which, when executed by a processor, implements the word distance word chain based character grouping method as claimed in any one of claims 1-5.

7. An electronic device, characterized in that: comprising a memory, a processor and a computer program stored on the memory, the processor, when executing the computer program, implementing the method for grouping characters based on a word distance word chain according to any of claims 1-5.