CN117272989A - Character encoding compression-based mask word recognition method, device, equipment and medium

Publication number
CN117272989A
Authority
CN
China
Prior art keywords
sequence
character
value
word
shielding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311552860.1A
Other languages
Chinese (zh)
Other versions
CN117272989B (en
Inventor
郑明�
夏东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Wooduan Technology Co ltd
Original Assignee
Zhejiang Wooduan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Wooduan Technology Co ltd
Priority to CN202311552860.1A
Publication of CN117272989A
Application granted
Publication of CN117272989B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/157 Transformation using dictionaries or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a mask word recognition method, device, equipment and medium based on character encoding compression. The method comprises the following steps: analyzing a preset shielding word library to construct a tree-structured shielding word tree; traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress the characters in the shielding word tree into a corresponding linear coding sequence; integrating the linear coding sequence with a pre-stored initial program file to construct a target program file, and executing the target program file; and, if a text segment input by a user is received, comparing and identifying the text segment according to the encoding compression rule and the linear coding sequence to obtain a recognition result of whether a shielding word is included. Because the linear coding sequence is constructed in advance and shielding words are recognized on the basis of it, the loading time of the application program is greatly shortened, the efficiency of shielding word recognition is improved, and the overall operating efficiency of the application program is markedly improved.

Description

Character encoding compression-based mask word recognition method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a medium for recognizing a shielding word based on character coding compression.
Background
To prevent users from misusing a software program, a shielding word library is usually configured in the program so that it can quickly identify whether a sentence input by a user contains shielding words. A shielding word typically consists of at least two characters, and prior-art methods generally build a forked tree structure from the shielding words. Although such methods can detect shielding words locally, the number of forks in the tree structure grows sharply as the number of shielding words in the library increases. As a result, loading the shielding word library and generating the tree structure from it at run time takes correspondingly longer, the program occupies correspondingly more memory, and its overall operating efficiency suffers. Prior-art shielding word recognition methods therefore suffer from long loading and recognition times.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for recognizing a shielding word based on character encoding compression, aiming to solve the problem that prior-art shielding word recognition methods take a long time to load and recognize.
In a first aspect, an embodiment of the present invention provides a method for recognizing a mask word based on character encoding compression, where the method includes:
analyzing a preset shielding word library to construct a tree-structure-based shielding word tree;
traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress characters included in the shielding word tree to obtain a corresponding linear encoding sequence;
integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file;
and if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included or not.
In a second aspect, an embodiment of the present invention further provides a mask word recognition device based on character encoding compression, where the device is configured to perform the mask word recognition method based on character encoding compression according to the first aspect, and the device includes:
the shielding word tree construction unit is used for analyzing a preset shielding word library to construct a shielding word tree based on a tree structure;
The linear coding sequence acquisition unit is used for traversing the shielding word tree according to a preset coding compression rule so as to code and compress characters included in the shielding word tree to obtain a corresponding linear coding sequence;
the integration unit is used for integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file;
and the recognition result acquisition unit is used for comparing and recognizing the text segments according to the code compression rule and the linear code sequence if the text segments input by the user are received, so as to obtain a recognition result of whether the shielding words are included.
In a third aspect, an embodiment of the present invention further provides a computer device, where the device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of the mask word recognition method based on character code compression according to the first aspect when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for recognizing mask words based on character encoding compression according to the first aspect.
The embodiment of the invention provides a method, a device, equipment and a medium for recognizing a shielding word based on character encoding compression, wherein the method comprises the following steps: analyzing a preset shielding word library to construct a tree-structure-based shielding word tree; traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress characters included in the shielding word tree to obtain a corresponding linear encoding sequence; integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file; and if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included. According to the identification method, the linear coding sequence is firstly constructed, and the shielding word is identified based on the linear coding sequence, so that the overall loading time of the application program is greatly shortened, the analysis time of the linear coding sequence by the application program is greatly shortened, the memory occupation is greatly reduced, and the identification efficiency of shielding words is improved while the overall operation efficiency of the application program is remarkably improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for recognizing a mask word based on character encoding compression according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application effect of a method for recognizing a mask word based on character encoding compression according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another application effect of the method for recognizing mask words based on character encoding compression according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a mask word recognition device based on character encoding compression according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to FIG. 1, an embodiment of the present invention provides a method for recognizing a mask word based on character encoding compression, which is applied to a terminal device. The steps of constructing the linear coding sequence and constructing the target program file, which precede execution of the target program file, may also be performed on another device that has a communication connection with the terminal device. For example, a management server may construct the linear coding sequence and the target program file and then transmit the constructed target program file to the terminal device for execution. As shown in FIG. 1, the method includes steps S110 to S140.
S110, analyzing a preset shielding word library to construct a shielding word tree based on a tree structure.
Specifically, the shielding word tree includes a base node and a plurality of forks extending outward from the base node, each fork corresponding to one shielding word in the shielding word library, so the shielding word tree can be constructed from the shielding word library. When the shielding words need to be updated, the shielding words in the library are updated first, and the shielding word tree is then regenerated from the updated library. The generated shielding word tree is shown in FIG. 2.
In a specific embodiment, step S110 includes the following sub-steps: constructing a base node; creating forks around the base node according to the number of characters of each shielding word in the shielding word library, wherein each fork comprises a node chain corresponding to one shielding word; and filling the characters of each shielding word into the corresponding fork in sequence along the extending direction from the base node to the end of the node chain, thereby constructing the shielding word tree.
Specifically, a base node is created first. The base node is the root node of the shielding word tree and is labeled "root" in FIG. 2. A number of forks equal to the number of shielding words in the shielding word library is then generated, each fork comprising a node chain corresponding to one shielding word. Because a shielding word contains at least two characters, a node chain consists of at least two nodes connected in series, and the number of nodes in a node chain equals the number of characters in the corresponding shielding word.
Further, starting from the base node, the characters of each shielding word are filled in sequence into the nodes of its fork along the extending direction of the node chain, so that the character order of the shielding word follows the extending direction of the node chain. The extending direction is indicated by the arrows in FIG. 2.
For example, if the shielding word library contains the two shielding words "missile" and "destroyer", two node chains are generated from the base node. Each node chain corresponds to one fork, and the fork extends from the base node to the end of the node chain. Filling the two shielding words "missile" and "destroyer" into the generated forks yields the shielding word tree shown in FIG. 2.
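The construction in step S110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary-based node layout and the name `build_shielding_word_tree` are hypothetical, and the Latin strings "ab" and "cde" merely stand in for the two-character and three-character example shielding words.

```python
def build_shielding_word_tree(words):
    """Return a base node with one fork (node chain) per shielding word."""
    root = {"char": "root", "forks": []}  # base node, labeled "root" as in FIG. 2
    for word in words:
        if len(word) < 2:
            # The text states that a shielding word has at least two characters.
            raise ValueError("a shielding word needs at least two characters")
        # One node per character, ordered along the extending direction of the fork.
        chain = [{"char": ch} for ch in word]
        root["forks"].append(chain)
    return root


# Hypothetical stand-ins for the two example shielding words.
tree = build_shielding_word_tree(["ab", "cde"])
fork_lengths = [len(chain) for chain in tree["forks"]]
print(fork_lengths)  # [2, 3]
```

Each fork holds exactly as many nodes as its shielding word has characters, which matches the node-chain description above.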
And S120, traversing the shielding word tree according to a preset encoding compression rule, and encoding and compressing characters included in the shielding word tree to obtain a corresponding linear encoding sequence.
The generated shielding word tree is traversed according to the encoding compression rule to encode and compress it, that is, the shielding word tree recorded in character form is converted into a linear coding sequence recorded in numerical form. Because the linear coding sequence occupies less storage space, the shielding word tree is thereby compressed. The encoding compression rule is the specific rule information for compressing the shielding word tree and comprises a coding dictionary and an offset identification strategy. The coding dictionary is a mapping dictionary that converts characters into coding values: it contains characters and the coding values in one-to-one correspondence with them, and a character is mapped through this correspondence to obtain its coding value. The offset identification strategy is the specific strategy information for obtaining the offset value of a character according to the position of the character within its shielding word.
In a specific embodiment, step S120 includes the following sub-steps: creating a first sequence and a second sequence corresponding to the coding dictionary in the encoding compression rule, wherein the coding dictionary comprises characters and coding values in one-to-one correspondence with the characters; traversing each fork in the shielding word tree in turn, according to the offset identification strategy in the encoding compression rule and the fork extending direction, to add the offset values corresponding to each fork to the first sequence; traversing each fork in the shielding word tree in turn along the fork extending direction to add the jump values corresponding to each fork to the second sequence; and combining the first sequence with the added offset values and the second sequence with the added jump values to obtain the linear coding sequence.
Specifically, a first sequence and a second sequence corresponding to the coding dictionary in the encoding compression rule are created; both are two-dimensional arrays of equal length. For example, the characters in the coding dictionary may be ordered by their coding values, and a first sequence and a second sequence, each associated with a sequence of consecutive coding values, are generated accordingly. Each coding value in the first sequence and the second sequence corresponds to one character position, the sequence values of both sequences are initially default values (for example, "0"), and the two sequence values that share the same coding value in the first sequence and the second sequence correspond to each other. The length of the first sequence and of the second sequence is the number of characters in the coding dictionary plus 1; for example, if the coding dictionary contains 9 characters, the first sequence and the second sequence each have length 10, as shown in FIG. 3. In FIG. 3, the sequence values of the first sequence (and likewise of the second sequence) are shown inside the boxes, and the sequence of consecutive coding values is shown below the boxes. The position with coding value 0 corresponds to the base node of the shielding word tree, and the sequence values at coding value 0 in the first sequence and the second sequence are fixed.
Further, each fork in the shielding word tree is traversed in turn according to the offset identification strategy and the fork extending direction, so that the offset value of each character in each fork is added to the first sequence: if a character in the shielding word tree has a corresponding position in the first sequence, the sequence value initially configured at that position is changed to an offset value according to the offset identification strategy. In this way, the first sequence with the added offset values marks the characters contained in the shielding word tree.
Each fork in the shielding word tree is likewise traversed in turn along the fork extending direction, so that the jump value of each character in each fork is added to the second sequence; the second sequence with the added jump values records the jump relations between the characters of each fork in the shielding word tree. The first sequence with the added offset values and the second sequence with the added jump values are then combined into the linear coding sequence, that is, the linear coding sequence comprises the first sequence and the second sequence.
In a specific embodiment, traversing each fork in the shielding word tree in turn according to the offset identification strategy in the encoding compression rule and the fork extending direction, to add the offset values corresponding to each fork to the first sequence, includes: traversing the characters of each fork along the fork extending direction and determining, according to the coding dictionary, the position in the first sequence corresponding to each character in the fork; judging whether the character is the end character of the fork; if the character is not the end character, configuring the value at the position corresponding to the character in the first sequence as the first offset value in the offset identification strategy; and if the character is the end character, configuring the value at that position as the second offset value in the offset identification strategy.
The characters of a fork are traversed along its extending direction, and the position in the first sequence corresponding to each character is determined according to the coding dictionary. For example, if a character has the coding value "3" in the coding dictionary, the current position is the position with coding value "3" in the first sequence. The offset identification strategy comprises a first offset value and a second offset value, so it is further judged whether the character is the end character of its fork: if it is the end character, the sequence value at the corresponding position in the first sequence is changed to the second offset value; if it is not the end character, the sequence value at the corresponding position is changed to the first offset value.
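As a small illustration of the coding dictionary, the sketch below maps characters to coding values. Everything in it is hypothetical: the dictionary contents, the name `encode`, and the Latin letters, which stand in for the characters of the example shielding words. The coding values 3, 4 and 1, 5, 8 are chosen to agree with the example given for FIG. 3.

```python
# Hypothetical coding dictionary with 9 characters and coding values 1..9.
# Coding value 0 is not assigned to any character (it belongs to the base node).
CODING_DICT = {"c": 1, "f": 2, "a": 3, "b": 4, "d": 5,
               "g": 6, "h": 7, "e": 8, "i": 9}


def encode(word, coding_dict=CODING_DICT):
    """Map each character of a word to its coding value."""
    return [coding_dict[ch] for ch in word]


print(encode("ab"))   # [3, 4]
print(encode("cde"))  # [1, 5, 8]
```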
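The creation of the two sequences can be sketched as below. As a simplifying assumption, each sequence is modeled as a flat Python list whose index plays the role of the coding value shown beneath the boxes in FIG. 3; the dictionary contents and the name `create_sequences` are hypothetical.

```python
# Hypothetical coding dictionary with 9 characters (coding values 1..9).
CODING_DICT = {"c": 1, "f": 2, "a": 3, "b": 4, "d": 5,
               "g": 6, "h": 7, "e": 8, "i": 9}


def create_sequences(coding_dict, default=0):
    """Create the first and second sequence, filled with the default value."""
    # Length is the number of dictionary characters plus 1, because
    # index 0 is reserved for the base node of the shielding word tree.
    length = len(coding_dict) + 1
    first_seq = [default] * length
    second_seq = [default] * length
    return first_seq, second_seq


first_seq, second_seq = create_sequences(CODING_DICT)
print(len(first_seq), len(second_seq))  # 10 10
```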
For example, the "guide" character is not the end character of its fork in the shielding word tree shown in FIG. 2, so the sequence value corresponding to its coding value "3" in the first sequence is configured as the first offset value (for example, "1"). The "bullet" character is the end character of its fork in the shielding word tree shown in FIG. 2 and has the coding value "4" in the coding dictionary, so the sequence value corresponding to the coding value "4" in the first sequence is configured as the second offset value (for example, "-1"). Assuming that the characters of the other shielding word have the coding values "1", "5" and "8" in the coding dictionary, respectively, the first sequence obtained after adding the offset values according to the shielding word tree of FIG. 2 is as shown in FIG. 3.
In a specific embodiment, traversing each fork in the shielding word tree in turn along the fork extending direction, to add the jump values corresponding to each fork to the second sequence, includes: traversing the characters of each fork along the fork extending direction and determining, according to the coding dictionary, the position in the second sequence corresponding to each character in the fork; judging whether the character is the start character of the fork; and if the character is not the start character, configuring the value at the position corresponding to the character in the second sequence as the coding value of the previous character.
While each fork in the shielding word tree is traversed in turn, the position in the second sequence corresponding to each character is determined according to the coding dictionary, and it is judged whether the character is the start character of its fork (the base node is ignored because it does not represent a meaningful character). If the character is the start character, the value at its position in the second sequence is left unchanged: the character preceding a start character is the base node, whose coding value is 0, and the default sequence value is also 0, so the position already holds the coding value of the base node. If the character is not the start character, the coding value of the previous character is obtained and the value at the position corresponding to the current character in the second sequence is changed to that coding value. The values configured in the second sequence (the jump values) thus record the jump relation between adjacent characters in a fork.
After the jump values are added according to the shielding word tree shown in FIG. 2, the resulting second sequence is as shown in FIG. 3.
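The offset-filling pass over the first sequence might look like the following sketch. The offset values 1 and -1 and the coding values 3, 4, 1, 5, 8 come from the example in the text; the remaining dictionary entries, the Latin stand-in letters, and the name `fill_offsets` are assumptions made for illustration.

```python
FIRST_OFFSET = 1    # example value for a character that is not an end character
SECOND_OFFSET = -1  # example value for an end character

# Hypothetical coding dictionary; "a","b" stand in for the characters with
# coding values 3 and 4, and "c","d","e" for those with 1, 5 and 8.
CODING_DICT = {"c": 1, "f": 2, "a": 3, "b": 4, "d": 5,
               "g": 6, "h": 7, "e": 8, "i": 9}


def fill_offsets(words, coding_dict):
    """Build the first sequence by marking every character of every fork."""
    first_seq = [0] * (len(coding_dict) + 1)  # index 0 is the base node
    for word in words:  # one fork per shielding word
        for i, ch in enumerate(word):
            is_end = (i == len(word) - 1)
            first_seq[coding_dict[ch]] = SECOND_OFFSET if is_end else FIRST_OFFSET
    return first_seq


first_seq = fill_offsets(["ab", "cde"], CODING_DICT)
print(first_seq)  # [0, 1, 0, 1, -1, 1, 0, 0, -1, 0]
```

Because a position is determined by a character's coding value alone, this sketch assumes that no character occurs at more than one place in the tree; the text does not describe how such collisions would be handled.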
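The jump-value pass over the second sequence can be sketched in the same hedged way. The dictionary, the Latin stand-in letters, and the name `fill_jumps` are hypothetical; only the coding values 3, 4, 1, 5, 8 and the rule "store the coding value of the previous character" are taken from the text.

```python
# Hypothetical coding dictionary; "ab" and "cde" stand in for the two
# example shielding words with coding values (3, 4) and (1, 5, 8).
CODING_DICT = {"c": 1, "f": 2, "a": 3, "b": 4, "d": 5,
               "g": 6, "h": 7, "e": 8, "i": 9}


def fill_jumps(words, coding_dict):
    """Build the second sequence: each slot holds the previous character's code."""
    second_seq = [0] * (len(coding_dict) + 1)  # index 0 is the base node
    for word in words:  # one fork per shielding word
        prev_code = 0   # the base node precedes every start character
        for ch in word:
            code = coding_dict[ch]
            # For a start character this writes 0, which equals the default,
            # so it has the same effect as leaving the slot unchanged.
            second_seq[code] = prev_code
            prev_code = code
    return second_seq


second_seq = fill_jumps(["ab", "cde"], CODING_DICT)
print(second_seq)  # [0, 0, 0, 0, 3, 1, 0, 0, 5, 0]
```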
S130, integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file.
The constructed linear coding sequence is integrated with a pre-stored initial program file to build a complete target program file; this integration may be performed before the program file is released. The resulting target program file may be executed on the local terminal that built it, or transmitted to another terminal for execution.
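The text does not specify how the integration is carried out. One possible reading, sketched below purely as an assumption, is to embed the two sequences as literal constants in the program text at build time, so that the running program never has to load the word library or rebuild the tree. The names `build_target_source`, `FIRST_SEQ`, `SECOND_SEQ` and `sequence_length` are hypothetical.

```python
def build_target_source(initial_source, first_seq, second_seq):
    """Append the linear coding sequence to the program text as constants."""
    generated = (
        "\n# Generated before release: linear coding sequence\n"
        f"FIRST_SEQ = {first_seq!r}\n"
        f"SECOND_SEQ = {second_seq!r}\n"
    )
    return initial_source + generated


# Hypothetical initial program file that expects the two constants to exist.
initial_source = "def sequence_length():\n    return len(FIRST_SEQ)\n"

target_source = build_target_source(
    initial_source,
    [0, 1, 0, 1, -1, 1, 0, 0, -1, 0],  # first sequence (offset values)
    [0, 0, 0, 0, 3, 1, 0, 0, 5, 0],    # second sequence (jump values)
)

# Stand-in for executing the target program file: the sequences are
# available immediately as plain data.
namespace = {}
exec(target_source, namespace)
print(namespace["sequence_length"]())  # 10
```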
And S140, if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included or not.
And if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included or not. After the target program file is executed in the terminal, the user can use the program functions contained in the target program file, such as user input text segments, to communicate with other users online. When a user inputs a text segment to a terminal, the text segment can be compared and identified according to a coding compression rule and a linear coding sequence, namely whether the text segment contains shielding words corresponding to the linear coding sequence is identified, and accordingly an identification result of whether the text segment contains shielding words is obtained.
In a specific embodiment, step S140 includes the sub-steps of: acquiring a starting character of a phrase contained in the text field; determining a first sequence value and a second sequence value corresponding to the initial character in the first sequence and the second sequence respectively according to a coding dictionary in the coding compression rule; judging whether the first sequence value is zero or not; if the first sequence value is not zero, judging whether the second sequence value is zero or not; if the second sequence value is zero, acquiring the next character of the phrase as a target character; determining a third sequence value and a fourth sequence value corresponding to the target character in the first sequence and the second sequence respectively according to the coding dictionary; judging whether the third sequence value is zero or not; if the third sequence value is not zero, judging whether the fourth sequence value is a coding value corresponding to the previous character in the coding dictionary; if the third sequence value is a first offset value and the fourth sequence value is a code value corresponding to the previous character in the code dictionary, returning to execute the step of acquiring the next character of the phrase as a target character; if the third sequence value is a second offset value and the fourth sequence value is a code value corresponding to the previous character in the code dictionary, judging that the phrase is a shielding word; if the fourth sequence value is not the corresponding code value of the previous character in the code dictionary, judging that the phrase is not a shielding word; if the first sequence value is zero or the third sequence value is zero, judging that the phrase is not a shielding word; judging whether the text field contains at least one phrase which is a shielding word or not so as to obtain a recognition result of whether the shielding word is contained or not.
Specifically, the text segment may first be segmented into a plurality of phrases. For each phrase, the start character is acquired, its code value is determined according to the coding dictionary in the encoding compression rule, and this code value is used to read the first sequence value from the first sequence and the second sequence value from the second sequence. If the first sequence value is zero, the first sequence contains no mask word corresponding to the start character, so the phrase currently being recognized is not a mask word. If the first sequence value is not zero, the first sequence contains a mask word corresponding to the start character, and the second sequence value is checked next. If the second sequence value is zero, the start character has no preceding character recorded in the second sequence, that is, it is the first character of a word, and the next character of the phrase is acquired as the target character.
Further, a third sequence value corresponding to the target character is obtained from the first sequence according to the coding dictionary, and a fourth sequence value corresponding to the target character is obtained from the second sequence, in the same way as the first and second sequence values were obtained. If the third sequence value is zero, no mask word corresponds to the phrase currently being recognized. If the third sequence value is not zero, it is further judged whether the fourth sequence value is the code value of the previous character in the coding dictionary. If the third sequence value is the first offset value (for example, the first offset value is "1") and the fourth sequence value is the code value of the previous character, the characters judged so far match the beginning of some mask word and the phrase still contains characters that have not been judged, so the next character is acquired and judged. If the third sequence value is the second offset value (for example, the second offset value is "-1") and the fourth sequence value is the code value of the previous character, the characters judged so far match a complete mask word and all characters of the phrase have been judged, so the phrase can be judged to be a mask word.
If the fourth sequence value is not the code value of the previous character in the coding dictionary, the target character and the character preceding it in the phrase cannot be combined to form a mask word (or part of a mask word), and the phrase currently being recognized can be judged not to be a mask word. If the third sequence value is zero, the first sequence contains no mask word corresponding to the target character, so the phrase currently being recognized is not a mask word.
The recognition information of each phrase in the text segment is then collected to judge whether the text segment contains at least one phrase that is a mask word. If the text segment contains at least one such phrase, the recognition result is that the text segment contains a mask word; if it contains no such phrase, the recognition result is that the text segment does not contain a mask word.
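The comparison procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary `CODING_DICT`, the two hand-written sequences for the single mask word "bad", the use of the code value as the index into both sequences, and whitespace splitting as a stand-in for word segmentation are all hypothetical. Returning "not a mask word" when a phrase ends before an end character is reached is also an assumption, since the text does not cover that case.

```python
FIRST_OFFSET = 1     # offset value for a character that is not the end of a mask word
SECOND_OFFSET = -1   # offset value for the end character of a mask word

# Hypothetical coding dictionary and sequences for the single mask word "bad".
# Index 0 is reserved for characters that are absent from the dictionary.
CODING_DICT = {"b": 1, "a": 2, "d": 3}
FIRST_SEQ = [0, FIRST_OFFSET, FIRST_OFFSET, SECOND_OFFSET]  # offset values
SECOND_SEQ = [0, 0, 1, 2]                                   # code value of the previous character


def sequence_values(ch):
    """Return (first-sequence value, second-sequence value, code value) for a character."""
    code = CODING_DICT.get(ch, 0)
    return FIRST_SEQ[code], SECOND_SEQ[code], code


def is_mask_word(phrase):
    """Follow the sub-steps of step S140 for one phrase."""
    if not phrase:
        return False
    first_val, second_val, prev_code = sequence_values(phrase[0])
    # The start character must be known and must have no recorded predecessor.
    if first_val == 0 or second_val != 0:
        return False
    for ch in phrase[1:]:
        third_val, fourth_val, code = sequence_values(ch)
        # Unknown character, or the character does not follow the previous one.
        if third_val == 0 or fourth_val != prev_code:
            return False
        if third_val == SECOND_OFFSET:
            return True  # end character reached with a matching predecessor
        prev_code = code
    return False  # assumption: phrase ended before an end character was seen


def contains_mask_word(text_segment):
    """Whitespace splitting stands in for the word segmentation step."""
    return any(is_mask_word(p) for p in text_segment.split())
```

With these hypothetical sequences, `contains_mask_word("this is bad")` is true and `contains_mask_word("all good")` is false; a reordered phrase such as "dab" is rejected at the start character because its second sequence value is not zero.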
In a specific embodiment, the method further includes the following step after step S140: if the recognition result is that a mask word is included, replacing the phrase corresponding to the mask word in the text segment to obtain a corresponding replacement text segment.
Further, if the recognition result is that the text segment includes a mask word, the phrase corresponding to the mask word in the text segment may be replaced, that is, the characters of that phrase are replaced by other characters such as "×", "X" or another character. For example, in the text segment "XXX sailing in the east ocean", where "XXX" is the phrase corresponding to a mask word, all three characters of the phrase may be replaced by "×", and the replacement text segment obtained is "××× sailing in the east ocean".
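A minimal sketch of this replacement step is given below. The function name `replace_mask_words`, the default fill character "*", and the set `MASK_WORDS` used as a stand-in for the recognition step are assumptions made for illustration only.

```python
def replace_mask_words(phrases, is_mask_word, fill="*"):
    """Replace every character of each recognized phrase with the fill character."""
    return [fill * len(p) if is_mask_word(p) else p for p in phrases]


# Hypothetical stand-in for the recognition step: a plain set lookup.
MASK_WORDS = {"XXX"}

phrases = "XXX sailing in the east ocean".split()
replaced = " ".join(replace_mask_words(phrases, MASK_WORDS.__contains__))
print(replaced)  # *** sailing in the east ocean
```

Replacing character by character keeps the length of the text segment unchanged, which matches the three-character example above.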
In a comparison test, a target program file containing the linear coding sequence was executed. The size of the target program file was reduced from 678 kB to 300 kB; the smaller file takes correspondingly less time to load, and the file update rate is improved by more than 50%. Executing the target program file to parse the linear coding sequence took 77 ms, whereas parsing the crotch (branch) structure generated from the mask words with the conventional method took 770 ms, a performance improvement of 90%. The memory occupied by the target program file during execution fell from 23 MB to 8 MB, which greatly reduces memory usage, an improvement of about 60% over the conventional method.
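The relative reductions implied by the reported figures can be checked with a short calculation. The helper `relative_reduction` is a hypothetical name; the input numbers are the ones reported above.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return (before - after) / before


size_cut = relative_reduction(678, 300)    # file size in kB
time_cut = relative_reduction(770, 77)     # parsing time in ms
memory_cut = relative_reduction(23, 8)     # memory in MB

for name, value in [("file size", size_cut), ("parse time", time_cut), ("memory", memory_cut)]:
    print(f"{name}: {value:.1%}")
```

The file-size and parsing-time ratios come out at roughly 56% and 90%, in line with the stated "more than 50%" and "90%"; the memory ratio is roughly 65%, slightly above the stated "about 60%".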
In the method for recognizing the mask word based on character encoding compression disclosed in the above embodiment, the method includes: analyzing a preset shielding word library to construct a tree-structure-based shielding word tree; traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress characters included in the shielding word tree to obtain a corresponding linear encoding sequence; integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file; and if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included. According to the identification method, the linear coding sequence is firstly constructed, and the shielding word is identified based on the linear coding sequence, so that the overall loading time of the application program is greatly shortened, the analysis time of the linear coding sequence by the application program is greatly shortened, the memory occupation is greatly reduced, and the identification efficiency of shielding words is improved while the overall operation efficiency of the application program is remarkably improved.
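The construction of the linear coding sequence summarized above can be sketched as follows. This is a simplified illustration under assumptions: the coding dictionary is built by numbering characters in order of first appearance, each character owns exactly one slot in each sequence (so a character shared by several mask words would be overwritten, which this sketch does not handle), and the two sequences are combined by simple concatenation. None of these choices is confirmed by the text; the offset values 1 and -1 follow the examples given in the description.

```python
FIRST_OFFSET = 1     # written for a character that is not the end character of its branch
SECOND_OFFSET = -1   # written for the end character of a branch


def build_tree(words):
    """Mask word tree: one base node with one node chain (branch) per mask word."""
    return {"base": [list(word) for word in words if word]}


def build_coding_dict(tree):
    """Assumed coding dictionary: characters numbered from 1 in order of first appearance."""
    coding = {}
    for chain in tree["base"]:
        for ch in chain:
            coding.setdefault(ch, len(coding) + 1)
    return coding


def encode(tree, coding):
    """Traverse each branch from the base node outward and fill both sequences."""
    size = len(coding) + 1                 # slot 0 is reserved for unknown characters
    first_seq = [0] * size                 # offset values
    second_seq = [0] * size                # jump values (code value of the previous character)
    for chain in tree["base"]:
        for i, ch in enumerate(chain):
            pos = coding[ch]
            is_end = i == len(chain) - 1
            first_seq[pos] = SECOND_OFFSET if is_end else FIRST_OFFSET
            if i > 0:                      # start characters keep a jump value of zero
                second_seq[pos] = coding[chain[i - 1]]
    return first_seq + second_seq          # assumed way of combining the two sequences


tree = build_tree(["bad"])
coding = build_coding_dict(tree)
linear = encode(tree, coding)
print(coding)   # {'b': 1, 'a': 2, 'd': 3}
print(linear)   # [0, 1, 1, -1, 0, 0, 1, 2]
```

The result is a flat list of small integers, which is what allows it to be embedded in the program file in place of a nested tree structure.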
The embodiment of the invention also provides a shielding word recognition device based on character code compression, which can be configured in the terminal equipment, and is used for executing any embodiment of the shielding word recognition method based on character code compression. Specifically, referring to fig. 4, fig. 4 is a schematic block diagram of a mask word recognition device based on character encoding compression according to an embodiment of the present invention.
As shown in fig. 4, the mask word recognition apparatus 100 based on character encoding compression includes a mask word tree construction unit 110, a linear code sequence acquisition unit 120, an integration unit 130, and a recognition result acquisition unit 140.
The mask word tree construction unit 110 is configured to parse a preset mask word library to construct a mask word tree based on a tree structure.
The linear code sequence acquisition unit 120 is configured to traverse the mask word tree according to a preset encoding compression rule, so as to encode and compress the characters included in the mask word tree into a corresponding linear coding sequence.
The integration unit 130 is configured to integrate the linear coding sequence with a pre-stored initial program file, so as to construct a target program file and execute it.
The recognition result acquisition unit 140 is configured to, if a text segment input by the user is received, compare and recognize the text segment according to the encoding compression rule and the linear coding sequence, so as to obtain a recognition result of whether the text segment contains a mask word.
The character code compression-based shielding word recognition device provided by the embodiment of the invention applies the character code compression-based shielding word recognition method to analyze a preset shielding word library so as to construct and obtain a tree-structure-based shielding word tree; traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress characters included in the shielding word tree to obtain a corresponding linear encoding sequence; integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file; and if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included. According to the identification method, the linear coding sequence is firstly constructed, and the shielding word is identified based on the linear coding sequence, so that the overall loading time of the application program is greatly shortened, the analysis time of the linear coding sequence by the application program is greatly shortened, the memory occupation is greatly reduced, and the identification efficiency of shielding words is improved while the overall operation efficiency of the application program is remarkably improved.
The above-described mask word recognition apparatus based on character encoding compression may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device may be a terminal device for performing a mask word recognition method based on character encoding compression to achieve mask word recognition.
Referring to fig. 5, the computer device 500 includes a processor 502, a memory, and a network interface 505, which are connected by a communication bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a mask word recognition method based on character encoding compression, wherein the storage medium 503 may be a volatile storage medium or a non-volatile storage medium.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a mask word recognition method based on character encoding compression.
The network interface 505 is used for network communication, such as the transmission of data. It will be appreciated by those skilled in the art that the architecture shown in fig. 5 is merely a block diagram of the part of the architecture relevant to the present solution and does not limit the computer device 500 to which the present solution is applied; a particular computer device 500 may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
The processor 502 is configured to execute a computer program 5032 stored in a memory to implement the corresponding functions in the above-mentioned mask word recognition method based on character encoding compression.
Those skilled in the art will appreciate that the embodiment of the computer device shown in fig. 5 is not limiting of the specific construction of the computer device, and in other embodiments, the computer device may include more or less components than those shown, or certain components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 5, and will not be described again.
It should be appreciated that in an embodiment of the invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU); the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the steps included in the above-described character encoding compression-based mask word recognition method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a computer-readable storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned computer-readable storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing mask words based on character encoding compression, the method comprising:
analyzing a preset shielding word library to construct a tree-structure-based shielding word tree;
traversing the shielding word tree according to a preset encoding compression rule, so as to encode and compress characters included in the shielding word tree to obtain a corresponding linear encoding sequence;
integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file;
and if the text segment input by the user is received, comparing and identifying the text segment according to the coding compression rule and the linear coding sequence to obtain an identification result of whether the shielding word is included or not.
2. The method for recognizing mask words based on character encoding compression according to claim 1, wherein the parsing the preset mask word library to construct a tree-structure-based mask word tree comprises:
constructing a base node;
creating corresponding forks on the periphery of the basic node according to the character number of each shielding word in the shielding word library, wherein each fork comprises a node chain corresponding to one shielding word;
and sequentially filling characters contained in each shielding word into the crotch according to the extending direction from the base node to the corresponding node chain, so as to construct and obtain a shielding word tree.
3. The method for recognizing mask words based on character encoding compression according to claim 1, wherein traversing the mask word tree according to a preset encoding compression rule, thereby encoding and compressing characters included in the mask word tree to obtain a corresponding linear encoding sequence, comprises:
creating a first sequence and a second sequence corresponding to a coding dictionary in the coding compression rule, wherein the coding dictionary comprises characters and coding values corresponding to the characters one by one;
traversing each crotch in the shielding word tree in turn according to an offset identification strategy and a crotch extending direction in the coding compression rule so as to add offset values corresponding to each crotch in the first sequence;
traversing each crotch in the shielding word tree in turn according to the crotch extending direction so as to add a jump value corresponding to each crotch in the second sequence;
and combining the first sequence added with the offset value and the second sequence added with the jump value to obtain the linear coding sequence.
4. The method for recognizing mask words based on character encoding compression according to claim 3, wherein traversing each crotch in the mask word tree in turn according to the offset identification policy and the crotch extending direction in the encoding compression rule to add offset values corresponding to each crotch in the first sequence comprises:
traversing the characters contained in each crotch according to the crotch extending direction, and determining the corresponding positions of the characters in the crotch in the first sequence according to the coding dictionary;
judging whether the characters in the crotch are tail characters or not;
if the character is not the end character, configuring the numerical value of the position corresponding to the character in the first sequence as a first offset value in the offset identification strategy;
and if the character is the last character, configuring the numerical value of the position corresponding to the character in the first sequence as a second offset value in the offset identification strategy.
5. The method for recognizing mask words based on character encoding compression according to claim 3, wherein traversing each crotch in the mask word tree in sequence according to the crotch extending direction to add a jump value corresponding to each crotch in the second sequence comprises:
traversing the characters contained in each crotch according to the crotch extending direction, and determining the corresponding positions of the characters in the crotch in the second sequence according to the coding dictionary;
judging whether the characters in the crotch are initial characters or not;
and if the character is not the initial character, configuring the numerical value of the position corresponding to the character in the second sequence as the coding value of the previous character.
6. The method for recognizing mask words based on character encoding compression according to claim 3, wherein comparing and recognizing the text segments according to the encoding compression rule and the linear encoding sequence to obtain a recognition result of whether the mask words are included, comprises:
acquiring a starting character of a phrase contained in the text segment;
determining a first sequence value and a second sequence value corresponding to the initial character in the first sequence and the second sequence respectively according to a coding dictionary in the coding compression rule;
judging whether the first sequence value is zero or not;
if the first sequence value is not zero, judging whether the second sequence value is zero or not;
if the second sequence value is zero, acquiring the next character of the phrase as a target character;
determining a third sequence value and a fourth sequence value corresponding to the target character in the first sequence and the second sequence respectively according to the coding dictionary;
judging whether the third sequence value is zero or not;
if the third sequence value is not zero, judging whether the fourth sequence value is a coding value corresponding to the previous character in the coding dictionary;
if the third sequence value is a first offset value and the fourth sequence value is a code value corresponding to the previous character in the code dictionary, returning to execute the step of acquiring the next character of the phrase as a target character;
if the third sequence value is a second offset value and the fourth sequence value is a code value corresponding to the previous character in the code dictionary, judging that the phrase is a shielding word;
if the fourth sequence value is not the corresponding code value of the previous character in the code dictionary, judging that the phrase is not a shielding word;
if the first sequence value is zero or the third sequence value is zero, judging that the phrase is not a shielding word;
judging whether the text segment contains at least one phrase which is a shielding word or not so as to obtain a recognition result of whether the shielding word is contained or not.
7. The method for recognizing mask words based on character encoding compression according to claim 1 or 6, wherein the comparing and recognizing the text segment according to the encoding compression rule and the linear encoding sequence, after obtaining the recognition result of whether the mask words are included, further comprises:
and if the recognition result is that a shielding word is included, replacing the phrase corresponding to the shielding word in the text segment to obtain a corresponding replaced text segment.
8. A mask word recognition apparatus based on character encoding compression, wherein the apparatus is for performing the mask word recognition method based on character encoding compression as claimed in any one of claims 1 to 7, the apparatus comprising:
the shielding word tree construction unit is used for analyzing a preset shielding word library to construct a shielding word tree based on a tree structure;
the linear coding sequence acquisition unit is used for traversing the shielding word tree according to a preset coding compression rule so as to code and compress characters included in the shielding word tree to obtain a corresponding linear coding sequence;
the integration unit is used for integrating the linear coding sequence with a pre-stored initial program file to construct and obtain a target program file and executing the target program file;
and the recognition result acquisition unit is used for comparing and recognizing the text segments according to the code compression rule and the linear code sequence if the text segments input by the user are received, so as to obtain a recognition result of whether the shielding words are included.
9. A computer device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the character encoding compression-based mask word recognition method according to any one of claims 1 to 7 when executing a program stored on a memory.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the character encoding compression based mask word recognition method according to any one of claims 1-7.
CN202311552860.1A 2023-11-21 2023-11-21 Character encoding compression-based mask word recognition method, device, equipment and medium Active CN117272989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311552860.1A CN117272989B (en) 2023-11-21 2023-11-21 Character encoding compression-based mask word recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117272989A true CN117272989A (en) 2023-12-22
CN117272989B CN117272989B (en) 2024-02-06

Family

ID=89219970

Country Status (1)

Country Link
CN (1) CN117272989B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7102552B1 (en) * 2005-06-07 2006-09-05 Windspring, Inc. Data compression with edit-in-place capability for compressed data
CN107424461A (en) * 2017-08-01 2017-12-01 深圳市鹰硕技术有限公司 Information screen method and system
CN108536787A (en) * 2018-03-29 2018-09-14 优酷网络技术(北京)有限公司 content identification method and device
CN112835585A (en) * 2021-01-25 2021-05-25 山东师范大学 Program understanding method and system based on abstract syntax tree
CN113158663A (en) * 2020-12-01 2021-07-23 咪咕文化科技有限公司 Shielding processing method and device, electronic equipment and storage medium
CN114697672A (en) * 2020-12-30 2022-07-01 中国科学院计算技术研究所 Run-length all-zero coding-based neural network quantization compression method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
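QIU-FENG WANG et al.: "Unsupervised language model adaptation for handwritten Chinese text recognition", 《PATTERN RECOGNITION》 *
谭容: "Research on fall behavior detection in indoor surveillance video under visual privacy protection", 《China Master's Theses Full-text Database, Information Science and Technology (monthly)》 *
陈基漓 et al.: "Word-based Huffman compression method", 《Journal of Guilin Institute of Technology》, no. 04 *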

Also Published As

Publication number Publication date
CN117272989B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US7770091B2 (en) Data compression for use in communication systems
CN111249736B (en) Code processing method and device
CN104994128B (en) A kind of identification of data encoding type and code-transferring method and device
US7650040B2 (en) Method, apparatus and system for data block rearrangement for LZ data compression
DE112008002903T5 (en) Data sequence compression
CN110545106B (en) Method and device for coding time series data
CN111914559A (en) Text attribute extraction method and device based on probability graph model and computer equipment
CN113946546B (en) Abnormality detection method, computer storage medium, and program product
CN111159329A (en) Sensitive word detection method and device, terminal equipment and computer-readable storage medium
CN113630125A (en) Data compression method, data encoding method, data decompression method, data encoding device, data decompression device, electronic equipment and storage medium
US20220005229A1 (en) Point cloud attribute encoding method and device, and point cloud attribute decoding method and devcie
CN117272989B (en) Character encoding compression-based mask word recognition method, device, equipment and medium
JP3080149B2 (en) Pattern encoding method and decoding method, and encoding apparatus and decoding apparatus using the method
CN107832341B (en) AGNSS user duplicate removal statistical method
US9235610B2 (en) Short string compression
CN113821211B (en) Command parsing method and device, storage medium and computer equipment
CN108874994A (en) A kind of piecemeal reads the method, apparatus and computer storage medium of data
CN112232025B (en) Character string storage method and device and electronic equipment
CN114092577A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN109558113B (en) Data field representation method and device and electronic equipment
CN114297046A (en) Event obtaining method, device, equipment and medium based on log
US9722631B2 (en) Method and apparatus for calculating estimated data compression ratio
CN114302425B (en) Equipment network distribution method and device, storage medium and electronic equipment
CN113595557B (en) Data processing method and device
CN113438050B (en) Encoding method, decoding method, encoding device and decoding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant