CN111144107B - Messy code identification method based on slicing algorithm

Info

Publication number: CN111144107B
Application number: CN201911357125.9A
Authority: CN
Inventors: 刘德建; 张伟泽; 陈宏展
Original assignee: Fujian Tianqing Online Interactive Technology Co Ltd
Current assignee: Fujian Tianqing Online Interactive Technology Co Ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2022-08-09
Anticipated expiration: 2039-12-25
Also published as: CN111144107A

Abstract

The invention provides a messy code identification method based on a slicing algorithm. A set of character strings to be judged is taken out and combined with a set of non-messy-code character strings, in the encoding of the language corresponding to that set, to form a new character string set. The original length of each character string is obtained, short character strings are supplemented and expanded, the number of words in each character string is calculated by the slicing algorithm, and the ratio of the number of words to the length is computed to form a set. Interval estimation is performed on the variables in the set using the normal distribution, a probabilistic judgment is made on the variables falling within the interval, and messy code text is identified when the probability of the horizontal-axis interval is above a given confidence level. The invention identifies messy code automatically and efficiently while ensuring reliable accuracy.

Description

Messy code identification method based on slicing algorithm

Technical Field

The invention relates to the field of character string filtering and processing, in particular to a messy code identification method based on a slicing algorithm.

Background

At present, one method for recognizing messy code extracts the text to be recognized and identifies its encoding format. The first encoding format of a first text to be recognized in a page is obtained; the first text is converted into a second text in a second encoding format according to the correspondence between the characters of the second encoding format and the characters of other encoding formats; and the second text is converted into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format. Whether messy code exists in the first text is then determined from the third text and the first text. (Source: "A method and a device for identifying messy codes of texts in web pages".) Alternatively, an operator checks the texts one by one to judge whether they contain messy code.

The existing method converts the initial text inefficiently. If the character set of the specified second encoding format has no corresponding byte codes, the conversion goes wrong, so the method places high demands on the character set. With identification by an operator, errors occur easily when the operator is tired or the workload is heavy.

Disclosure of Invention

In order to overcome the above problems, the invention aims to provide a messy code identification method based on a slicing algorithm, so that identification efficiency and reliability are improved.

The invention is realized by the following scheme: a messy code identification method based on a slicing algorithm comprises the following steps: step S1, taking out the character string set to be judged, and combining it with the non-messy-code character string set of the encoding of the language corresponding to the character string set to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;

step S2, obtaining the original length of the character string, and judging from the original length whether the character string belongs to the long class or the short class; if it belongs to the short class, performing single-character supplementary expansion on the character string and judging whether the supplemented character string reaches the shortest length of the long class; if not, continuing the single-character supplement; if so, obtaining the length of the character string after supplementary expansion, setting it as the processed length, and entering step S3; if it belongs to the long class, directly setting the original length as the processed length and entering step S4;

step S3, carrying out slicing word segmentation processing on the character string after the supplementary expansion to obtain the number of words, and entering step S5;

step S4, carrying out slicing word segmentation processing on the character string to obtain the number of words, and entering step S5;

step S5, calculating a variable x = number of words / processed length to obtain the set of variables x corresponding to the character string set, performing interval estimation on the set of variables x by using the normal distribution, and performing a probabilistic judgment on the variables falling within the interval to determine whether the character string to be judged is messy code.

Further, the single characters used in the single-character supplementary expansion in step S2 are randomly selected characters; the added single characters cannot form new words with each other, and must form words with the character string.

Further, the slicing word segmentation processing is as follows: a dictionary of words is predefined; when the program runs, the character string to be recognized is scanned to check whether it contains words that appear in the dictionary. Matching is prioritized by length, from long to short: the candidate length is reduced by one character at a time, and at each length the candidates are checked against the dictionary, so as to obtain the final number of words.

Further, step S5 is specifically: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the normal distribution over the horizontal-axis interval (u - a, u + a) is 68.268949%, the area over (u - 2a, u + 2a) is 95.449974%, and the area over (u - 3a, u + 3a) is 99.730020%; that is, the probabilities within the horizontal-axis intervals at the three confidence levels are 68.27%, 95.45% and 99.73% respectively. Since these are the probabilities that a normal character string falls within the interval, a character string falling outside the interval can be judged, at the same probability, to be an abnormal character string, i.e. messy code.

Furthermore, in the slicing word segmentation processing after single-character supplementary expansion, a single character that cannot be combined with its neighbouring characters into a word is itself counted as a word.

The invention has the following beneficial effects: 1. By using the slicing algorithm and estimation based on the properties of the normal distribution, messy code can be identified automatically and efficiently, with reliable accuracy.

2. Compared with methods that recognize messy code by converting the text into another text, the method reduces memory I/O, hands more of the work to the CPU as arithmetic operations, and improves running speed.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1, the messy code identification method based on a slicing algorithm of the present invention includes the following steps: step S1, taking out the character string set to be judged, and combining it with the non-messy-code character string set of the encoding of the language corresponding to the character string set to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;

step S2, obtaining the original length of the character string, and judging from the original length whether the character string belongs to the long class or the short class; if it belongs to the short class, performing single-character supplementary expansion on the character string and judging whether the supplemented character string reaches the shortest length of the long class; if not, continuing the single-character supplement; if so, obtaining the length of the character string after supplementary expansion, setting it as the processed length, and entering step S3; if it belongs to the long class, directly setting the original length as the processed length and entering step S4. The single characters used in the single-character supplementary expansion in step S2 are randomly selected characters; the added single characters cannot form new words with each other, and must form words with the character string.
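The supplementary expansion of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `pad_short_string`, its parameters, and the retry limit are hypothetical, and the sketch enforces only the rule that two added characters must not form a dictionary word with each other in either order.

```python
import random

def pad_short_string(text, min_long_len, candidates, dictionary, rng=None, max_attempts=1000):
    """Append randomly chosen single characters until text reaches min_long_len.

    Hypothetical helper: a character is rejected when it forms a dictionary
    word with any previously added character, in either order.
    """
    rng = rng or random.Random()
    added = []
    attempts = 0
    while len(text) + len(added) < min_long_len:
        attempts += 1
        if attempts > max_attempts:
            raise ValueError("no usable candidate characters found")
        c = rng.choice(candidates)
        # Reject c if "pc" or "cp" is a word for some already added character p.
        if any((p + c) in dictionary or (c + p) in dictionary for p in added):
            continue
        added.append(c)
    return text + "".join(added)

# Example: pad a 2-character string up to the shortest length of the long class (here 6).
padded = pad_short_string("ab", 6, "xqz", {"xq"}, random.Random(0))
print(len(padded))
```

A string that already belongs to the long class is returned unchanged, and its original length is used as the processed length.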

Step S3, carrying out slicing word segmentation processing on the character string after the supplementary expansion (a single character that cannot be combined with its neighbouring characters into a word is also counted as a word) to obtain the number of words, and entering step S5;

The slicing word segmentation processing is as follows: a dictionary of words is predefined; when the program runs, the character string to be recognized is scanned to check whether it contains words that appear in the dictionary. Matching is prioritized by length, from long to short: the candidate length is reduced by one character at a time, and at each length the candidates are checked against the dictionary, so as to obtain the final number of words.
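One way to read the length-priority matching is sketched below. It is an assumption-laden illustration rather than the patent's exact procedure: the function name `count_words` is hypothetical, the longest dictionary word found anywhere in the string is taken first, the pieces to its left and right are processed the same way, and any character left unmatched counts as one word.

```python
def count_words(text: str, dictionary: set[str]) -> int:
    """Count words by longest-first dictionary matching (illustrative sketch)."""
    if not text:
        return 0
    # Try candidate lengths from the full string down to two characters.
    for length in range(len(text), 1, -1):
        for start in range(len(text) - length + 1):
            if text[start:start + length] in dictionary:
                left = text[:start]
                right = text[start + length:]
                return count_words(left, dictionary) + 1 + count_words(right, dictionary)
    # No dictionary word found: every remaining character is its own word.
    return len(text)

# Ratio x = number of words / length for one string.
words = count_words("SJMY", {"SJ", "MY"})
print(words, words / len("SJMY"))  # 2 0.5
```

A string with no dictionary matches yields a ratio of 1, the largest possible value, which is what separates it from ordinary text in the later statistical step.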

In the present invention, step S5 is specifically: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the normal distribution over the horizontal-axis interval (u - a, u + a) is 68.268949%, the area over (u - 2a, u + 2a) is 95.449974%, and the area over (u - 3a, u + 3a) is 99.730020%; that is, the probabilities within the horizontal-axis intervals at the three confidence levels are 68.27%, 95.45% and 99.73% respectively. Since these are the probabilities that a normal character string falls within the interval, a character string falling outside the interval can be judged, at the same probability, to be an abnormal character string, i.e. messy code.
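For reference, the quantities in step S5 can be written out as below. The index i, the count n, the symbols w_i (number of words) and l_i (processed length), and the multiplier k are introduced here for illustration only; the population form of the standard deviation is an assumption, since the text does not say which form is used. The comment lines list the standard normal coverage for commonly used multipliers.

```latex
x_i = \frac{w_i}{l_i}, \qquad
u = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
a = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - u)^2}

P(u - k a < X < u + k a) = \operatorname{erf}\!\left(\frac{k}{\sqrt{2}}\right)
\quad \text{for } X \sim \mathcal{N}(u, a^2)

% k = 1    : 68.27%
% k = 1.96 : 95.00%
% k = 2    : 95.45%
% k = 2.58 : 99.01%
% k = 3    : 99.73%
```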

The invention is further illustrated below with reference to a specific embodiment. The invention uses the statistical normal distribution to identify messy code in character strings, and achieves an approximately normal distribution by supplementing normal samples. Supplementing normal samples means adding normal (non-messy-code) character strings in proportion to the number of character strings in the character string set to be recognized, namely 100. The character string set to be judged is taken out and combined with the non-messy-code character string set of the encoding of the language corresponding to the character string set to form a new character string set, and each character string in the set is processed. The original length of each character string is obtained, and short character strings are supplemented and expanded. The filling is done as follows: first, the number of candidate characters to be used is determined from the shortest length of the long class; common characters such as "the" and "and" are then selected at random, and the selected characters must not form new words with each other. That is, if A and B are randomly selected candidate characters, neither AB nor BA is a word; for example, "the" and "and" do not form a word in either order. The number of words in the character string is then calculated by the slicing algorithm, which works as follows: a dictionary of words is predefined; when the program runs, the character string to be recognized is scanned to check whether words from the dictionary appear in it, and matching is prioritized by length. For example, for the character string "SJMY":

1. First check whether "SJMY" is in the dictionary; if so, matching is finished and the result is 1 word.

2. If not, match "SJM" or "JMY"; a match gives two words, either "SJM" and "Y" or "S" and "JMY".

3. If not, match "SJ" and "MY", giving 2 words; or match "JM", giving the 3 words "S", "JM" and "Y".

4. If there is still no match, the worst case is the four words "S", "J", "M" and "Y".

The ratio of the number of words to the length is then computed for each character string: ratio = number of words / length. The ratios together form a set.

Set representation: { x | x = number of words / length }. Interval estimation under the normal distribution is performed on the variables in the set. The mean u and the standard deviation a of the set are calculated, and the normal distribution is used as the reference. The area under the normal distribution over the horizontal-axis interval (u - a, u + a) is 68.268949%, the area over (u - 2a, u + 2a) is 95.449974%, and the area over (u - 3a, u + 3a) is 99.730020%. That is, the probabilities within the three confidence intervals are 68.27%, 95.45% and 99.73% respectively. Since this is the probability that a normal character string falls within the interval, a character string falling outside the interval can be judged at the same probability to be an abnormal character string, i.e. messy code. A probabilistic judgment is made in this way on the variables falling within the interval, and messy code text can be identified above a given confidence probability.
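The interval judgment described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names `coverage` and `flag_messy`, the multiplier parameter `k`, and the sample ratios are hypothetical, and the population standard deviation is assumed. A ratio is flagged when it falls outside (u - k·a, u + k·a) computed over the combined set.

```python
from statistics import NormalDist, mean, pstdev

def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    std_normal = NormalDist()
    return std_normal.cdf(k) - std_normal.cdf(-k)

def flag_messy(candidate_ratios, reference_ratios, k=2.0):
    """Flag candidate ratios lying outside (u - k*a, u + k*a) of the combined set."""
    combined = list(candidate_ratios) + list(reference_ratios)
    u = mean(combined)
    a = pstdev(combined)  # assumption: population standard deviation
    if a == 0:
        # All ratios identical: nothing stands out as abnormal.
        return [False for _ in candidate_ratios]
    low, high = u - k * a, u + k * a
    return [not (low < x < high) for x in candidate_ratios]

# Made-up ratios: 100 normal reference strings plus two strings to judge.
reference = [0.2, 0.25, 0.3, 0.25, 0.2, 0.3, 0.25, 0.25, 0.3, 0.2] * 10
print(flag_messy([0.25, 1.0], reference))  # [False, True]
print(round(coverage(2), 4))               # 0.9545
```

A larger `k` widens the interval, so fewer strings are flagged and each flag carries a higher confidence.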

The invention can be applied as follows: 1. It can be provided to developers as an API. Before use, the developer prepares the list of character strings to be recognized, and then passes the list and the character string encoding format as parameters to the API recognition program. The flag in the output result then indicates whether each character string is messy code.

2. It can be packaged as a visual tool. The character strings to be recognized are first stored in a text file, one per line; the text encoding format is specified as input to the tool, the tool is started, and the absolute path of the text file is provided. After the tool finishes its calculation, it outputs the flagged results to the user line by line.

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims

1. A messy code identification method based on a slicing algorithm, characterized in that the method comprises the following steps: step S1, taking out the character string set to be judged, and combining it with the non-messy-code character string set of the encoding of the language corresponding to the character string set to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;

step S2, obtaining the original length of the character string set, judging whether the character string is long or short according to the original length, if the character string is short, performing single character supplementary expansion on the character string, judging whether the supplemented character string reaches the shortest length of the long class, if not, continuing the single character supplementary, if so, obtaining the length of the character string after supplementary expansion, setting the length as the length after processing, and entering step S3; if the length is the long type, the original length is directly set as the processed length, and the step S4 is proceeded; the single characters in the single character supplementary expansion in the step S2 are randomly selected characters, and a plurality of added single characters cannot form new words with each other and can form words with the character strings;

step S5, calculating a variable x = the number of words/the processed length to obtain a variable x set corresponding to the string set, performing interval estimation on the variable x set by using a mathematical normal distribution, and performing probabilistic judgment on the variable falling in the interval to know whether the string to be judged is a messy code.

2. The messy code identification method based on the slicing algorithm as claimed in claim 1, wherein the slicing word segmentation processing comprises: predefining a dictionary of words; when the program runs, scanning the character string to be recognized to check whether it contains words that appear in the dictionary, matching being prioritized by length, from long to short; and reducing the candidate length by one character at a time, checking at each length whether the candidates appear in the dictionary, so as to obtain the final number of words.

3. The messy code identification method based on the slicing algorithm as claimed in claim 1, wherein step S5 further includes: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the normal distribution over the horizontal-axis interval (u - a, u + a) is 68.268949%, the area over (u - 2a, u + 2a) is 95.449974%, and the area over (u - 3a, u + 3a) is 99.730020%; that is, the probabilities within the horizontal-axis intervals at the three confidence levels are 68.27%, 95.45% and 99.73% respectively; this probability being the probability of a normal character string, it is determined on that basis whether the character string is an abnormal character string, that is, messy code.

4. The messy code identification method based on the slicing algorithm as claimed in claim 2, wherein, in the slicing word segmentation processing after single-character supplementary expansion, a single character that cannot be combined with its neighbouring characters into a word is itself counted as a word.