CN116955720A - Data processing method, apparatus, device, storage medium and computer program product

Info

Publication number
CN116955720A
Authority
CN
China
Prior art keywords
character string
detected
sensitive
target
sensitive word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210413723.9A
Other languages
Chinese (zh)
Inventor
王关政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210413723.9A
Publication of CN116955720A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

A data processing method, apparatus, device, storage medium and computer program product, the method comprising: acquiring a detection request, wherein the detection request comprises a character string to be detected; converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following components: the first letter corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected; obtaining a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set; matching the character string to be detected and the auxiliary information with a target dictionary tree to obtain sensitive characters in the character string to be detected; and carrying out sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string. By adopting the method, the effectiveness of processing the sensitive words can be improved by improving the accuracy of the sensitive character recognition.

Description

Data processing method, apparatus, device, storage medium and computer program product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, a data processing apparatus, a computer device, a computer readable storage medium, and a computer program product.
Background
With the rapid development of information technology, people can publish their own statements through network communication platforms. These statements sometimes contain sensitive characters that do not comply with Internet usage standards or even violate national regulations, which is not conducive to maintaining a good network environment. Sensitive word processing therefore needs to be performed on statements that involve sensitive characters. However, the accuracy of current sensitive character recognition is low, so sensitive word processing cannot be carried out effectively.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment, a storage medium and a computer program product, which can improve the effectiveness of sensitive word processing by improving the accuracy of sensitive character recognition.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring a detection request, wherein the detection request comprises a character string to be detected;
converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following components: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
obtaining a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set;
matching the character string to be detected and the auxiliary information with the target dictionary tree to obtain sensitive characters in the character string to be detected;
and carrying out sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string.
In one aspect, an embodiment of the present application provides a text processing method, where the method includes:
displaying an application interface, wherein the application interface comprises a character string to be detected;
when the existence of the sensitive word in the character string to be detected is detected, displaying a first prompt message, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to sensitive words in the character string to be detected, and N is a positive integer;
and when receiving a selection operation of the target synonym, displaying a target character string on the application interface, wherein the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition unit, configured to acquire a detection request, wherein the detection request comprises a character string to be detected;
the processing unit is used for converting the character string to be detected into auxiliary information, and the auxiliary information comprises one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
the acquisition unit is further used for acquiring a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set;
the processing unit is further configured to match the character string to be detected and the auxiliary information with the target dictionary tree to obtain a sensitive character in the character string to be detected;
the processing unit is further configured to perform sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected, obtain a target character string, and output the target character string.
In one aspect, an embodiment of the present application provides a text processing apparatus, including:
the display unit is used for displaying an application interface, wherein the application interface comprises a character string to be detected;
the display unit is further used for displaying a first prompt message when the existence of the sensitive word in the character string to be detected is detected, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to sensitive words in the character string to be detected, and N is a positive integer;
and the display unit is also used for displaying a target character string on the application interface when receiving the selection operation of the target synonym, wherein the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
In one aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, and where the memory stores a computer program, and the processor is configured to invoke the computer program to perform a data processing method according to any of the possible implementations described above.
In one aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a data processing method of any one of the possible implementations.
Accordingly, the embodiment of the present application also provides a computer program product, where the computer program product includes a computer program or computer instructions, and the computer program or the computer instructions are executed by a processor to implement the steps of the data processing method provided by the embodiment of the present application.
Accordingly, an embodiment of the present application further provides a computer program, where the computer program includes computer instructions, where the computer instructions are stored in a computer readable storage medium, and a processor of a computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided by the embodiment of the present application.
In the embodiment of the application, a detection request can be acquired, the detection request comprises a character string to be detected, the character string to be detected is converted into one or more of an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected, auxiliary information is obtained, the character string to be detected and the auxiliary information are matched with a target dictionary tree to obtain sensitive characters in the character string to be detected, then sensitive word processing is carried out on the character string to be detected according to the sensitive characters in the character string to be detected to obtain a target character string, and the target character string is output. By adopting the method, the character strings to be detected can be subjected to sensitive character recognition from multiple dimensions, the accuracy of sensitive character recognition is improved, and the effectiveness of sensitive word processing is improved.
Drawings
In order to more clearly illustrate the technical method of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described, and it is apparent that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a diagram illustrating a system architecture of a data processing system according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a data processing method according to an embodiment of the present application;
FIG. 3 is a second flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a dictionary tree generated based on an AC automaton algorithm according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a word segmentation processing method according to an embodiment of the present application;
FIG. 6 is an exemplary schematic diagram of a word pinyin mapping table and a word segmentation pinyin mapping table provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of multi-pass parallel processing according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a sensitive word configuration interface according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an exemplary sensitive word prompt interface according to an embodiment of the present application;
FIG. 10 is a schematic diagram I of an exemplary sensitive word processing provided by an embodiment of the present application;
FIG. 11 is a flowchart illustrating a data processing method according to an embodiment of the present application;
FIG. 12 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 13 is a schematic flow chart of a text processing method according to an embodiment of the present application;
FIG. 14 is a second exemplary schematic diagram of a sensitive word processing provided in an embodiment of the present application;
FIG. 15 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a text processing device according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical method according to the embodiments of the present application will be clearly and completely described in the following description with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to FIG. 1, FIG. 1 is a schematic diagram illustrating a system architecture of a data processing system according to an embodiment of the present application; the system architecture shown in fig. 1 can be used to implement the data processing method according to the embodiment of the present application. As shown in fig. 1, the system architecture includes: a server 10 and a plurality of terminals 11 (3 are shown as an example).
The server 10 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The terminal 11 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The terminal 11 shown in fig. 1 is connected to the server 10 via a network.
The system architecture shown in fig. 1 may implement the data processing method provided in the embodiment of the present application, taking the method performed by the server 10 and the terminal 11 together as an example, the implementation flow approximately includes: (1) the terminal 11 sends a detection request to the server 10, the detection request including a character string to be detected; (2) the server 10 receives the detection request sent by the terminal 11, and converts the character string to be detected in the detection request into auxiliary information, where the auxiliary information includes one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected; (3) the server 10 acquires a target dictionary tree which is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set; (4) the server 10 matches the character string to be detected and the auxiliary information with the target dictionary tree to obtain sensitive characters in the character string to be detected; (5) the server 10 processes the sensitive word of the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string (namely the character string to be detected after the sensitive word is processed); (6) the server 10 publishes the output target character string to the network communication platform. By adopting the method, the effectiveness of processing the sensitive words can be improved by improving the accuracy of the sensitive character recognition.
It may be understood that the schematic diagram of the system architecture described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application, and those skilled in the art can know that, with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
Specific implementations of the data processing method are described in detail below.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data processing method according to an embodiment of the application. The data processing method described in the embodiment of the present application may be performed by the server 10 or the terminal 11 in fig. 1, and includes, but is not limited to, the following steps:
S201, acquiring a detection request, wherein the detection request comprises a character string to be detected.
The detection request is used for requesting sensitive character recognition of the character string to be detected. The character string to be detected may include Chinese characters, Chinese character strokes, punctuation marks, letters (including one or both of pinyin letters and English letters), and the like. The character string to be detected may be a statement published on a network communication platform, text extracted from an image or a video frame, text obtained by converting speech into text, a bullet-screen comment in a video, or the like.
S202, converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected.
The auxiliary information may be obtained by converting the character string to be detected. In an embodiment, when the auxiliary information includes the pinyin corresponding to the character string to be detected, the Chinese characters contained in the character string to be detected may be obtained, the pinyin corresponding to each contained Chinese character is looked up, and each Chinese character in the character string to be detected is replaced with its pinyin, so as to obtain the pinyin corresponding to the character string to be detected. For example, the pinyin corresponding to the character string to be detected "我们去KTV唱歌" ("we go to KTV to sing") is "womenquKTVchangge". When the auxiliary information includes the initial corresponding to the character string to be detected, the Chinese characters contained in the character string to be detected may be obtained, the pinyin corresponding to each contained Chinese character is looked up, the first letter of that pinyin is extracted, and each Chinese character in the character string to be detected is replaced with that first letter, so as to obtain the initial corresponding to the character string to be detected. For example, the initial corresponding to the character string to be detected "我们去KTV唱歌" is "wmqKTVcg". If the character string to be detected itself contains pinyin, the first letter of the contained pinyin may be extracted and used to replace the pinyin contained in the character string to be detected.
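The pinyin and initial conversions described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the lookup table `PINYIN` and the function names `to_pinyin` and `to_initials` are hypothetical, the table covers only the characters of the example, and polyphonic characters and pinyin already present in the input are not handled.

```python
# Hypothetical character-to-pinyin table; a real system would load a full dictionary.
PINYIN = {"我": "wo", "们": "men", "去": "qu", "唱": "chang", "歌": "ge"}


def to_pinyin(text: str) -> str:
    """Replace each Chinese character with its pinyin; keep other characters as-is."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


def to_initials(text: str) -> str:
    """Replace each Chinese character with the first letter of its pinyin."""
    return "".join(PINYIN[ch][0] if ch in PINYIN else ch for ch in text)


print(to_pinyin("我们去KTV唱歌"))    # womenquKTVchangge
print(to_initials("我们去KTV唱歌"))  # wmqKTVcg
```

Characters without a table entry, such as the Latin letters "KTV", pass through unchanged, which matches the example strings in the paragraph above.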
In a possible embodiment, when the auxiliary information includes the strokes corresponding to the character string to be detected, splitting processing may be performed on the Chinese characters contained in the character string to be detected to obtain the strokes corresponding to each contained Chinese character, and each Chinese character in the character string to be detected is replaced with its strokes, so as to obtain the strokes corresponding to the character string to be detected. In one implementation, the contained Chinese characters may be split according to stroke order; alternatively, the Chinese characters may be split according to radicals.
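A radical-level split can be sketched in the same way as a table lookup. The table `COMPONENTS` and the function `to_components` below are hypothetical names chosen for illustration, and the two entries are only sample decompositions; a stroke-order split would have the same shape with a table that maps each character to its stroke sequence.

```python
# Hypothetical decomposition table (radical-level split), for illustration only.
COMPONENTS = {"好": "女子", "明": "日月"}


def to_components(text: str) -> str:
    """Replace each Chinese character with its component characters when known."""
    return "".join(COMPONENTS.get(ch, ch) for ch in text)


print(to_components("好明"))  # 女子日月
```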
S203, acquiring a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set.
The set of sensitive words includes one or more sensitive words. The sensitive word initial set comprises one or more sensitive word initial, and the sensitive word initial can be an initial corresponding to a sensitive word in the sensitive word set or an initial corresponding to a sensitive word outside the sensitive word set. The sensitive word pinyin set comprises one or more sensitive word pinyins, and the sensitive word pinyins can be pinyins corresponding to sensitive words in the sensitive word set or pinyins corresponding to sensitive words outside the sensitive word set. The sensitive word stroke set comprises one or more sensitive word strokes, and the sensitive word strokes can be strokes corresponding to sensitive words in the sensitive word set or strokes corresponding to sensitive words outside the sensitive word set.
A sensitive character string is an item of data in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, or the sensitive word stroke set; that is, a sensitive character string may be a sensitive word in the sensitive word set, a sensitive word initial in the sensitive word initial set, a sensitive word pinyin in the sensitive word pinyin set, or a sensitive word stroke in the sensitive word stroke set. In an embodiment, the target dictionary tree may be constructed from the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set. Each character of a sensitive character string is represented by one node in the target dictionary tree, and the character string obtained by traversing from the root node to a given node of the target dictionary tree may be a sensitive character string.
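As a rough sketch of this construction, the four sets can be inserted into a single trie, one node per character. The class `TrieNode`, the functions `build_trie` and `contains`, and the placeholder entries are all assumptions made for illustration; this is a plain trie without the failure links of an AC automaton.

```python
from typing import Dict, Iterable


class TrieNode:
    """One node per character; is_end marks the end of a sensitive character string."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False


def build_trie(*string_sets: Iterable[str]) -> TrieNode:
    """Insert every string from every given set into one shared trie."""
    root = TrieNode()
    for strings in string_sets:
        for s in strings:
            node = root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True
    return root


def contains(root: TrieNode, s: str) -> bool:
    """Return True if walking s from the root ends on a node marked as an end."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


# Placeholder sets: word, initials, pinyin, and a component-level split.
root = build_trie({"坏蛋"}, {"hd"}, {"huaidan"}, {"土不蛋"})
print(contains(root, "huaidan"), contains(root, "huai"))  # True False
```

Strings from different sets that share a prefix, such as "hd" and "huaidan", share the corresponding nodes near the root.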
S204, matching the character string to be detected and the auxiliary information with the target dictionary tree to obtain sensitive characters in the character string to be detected.
In one embodiment, the character string to be detected and the auxiliary information may be matched against the target dictionary tree concurrently to obtain one or more matching character strings, each of which may be a pinyin string, a stroke string, a Chinese character string, or an initial string. Further, it may be checked whether each matching character string is directly contained in the character string to be detected. If it is, the characters of that matching character string in the character string to be detected are the sensitive characters. If it is not, the sensitive character string corresponding to the matching character string is searched for in the character string to be detected, and the characters in that sensitive character string are taken as the sensitive characters. For example, if the matching character string is "women" or "wm", then "我们" ("we") in the character string to be detected is the corresponding sensitive character string.
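One way to map a match in the auxiliary information back to characters of the original string is to record, for each converted character, the index of the source character it came from. The sketch below does this for the raw string, its pinyin, and its initials. All names (`build_trie`, `convert`, `find_sensitive_indices`, the `"$end"` marker) are hypothetical, the three views are scanned sequentially rather than concurrently, and a match that begins in the middle of a character's pinyin still flags that whole character.

```python
def build_trie(words):
    """Build a nested-dict trie; the multi-character key "$end" marks a complete entry."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def convert(text, table, first_only=False):
    """Return the converted string and, per converted character, its source index."""
    out, origin = [], []
    for i, ch in enumerate(text):
        piece = table.get(ch, ch)
        if first_only and ch in table:
            piece = piece[0]
        out.append(piece)
        origin.extend([i] * len(piece))
    return "".join(out), origin


def find_sensitive_indices(text, trie, table):
    """Match the raw text, pinyin and initials; return sorted indices of flagged characters."""
    views = [
        (text, list(range(len(text)))),
        convert(text, table),
        convert(text, table, first_only=True),
    ]
    hits = set()
    for view, origin in views:
        for start in range(len(view)):
            node = trie
            for pos in range(start, len(view)):
                node = node.get(view[pos])
                if node is None:
                    break
                if "$end" in node:
                    hits.update(origin[start:pos + 1])
    return sorted(hits)


table = {"我": "wo", "们": "men", "去": "qu"}
trie = build_trie({"women", "wm"})
print(find_sensitive_indices("我们去KTV", trie, table))  # [0, 1]
```

Here both the pinyin match "women" and the initial match "wm" resolve to indices 0 and 1, that is, the characters "我们" of the original string.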
S205, performing sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string.
In an embodiment, the sensitive character in the character string to be detected may be used to perform sensitive word processing on the character string to be detected to obtain the target character string, for example, deleting the sensitive character in the character string to be detected, replacing the sensitive character in the character string to be detected with a replacement symbol, and so on. The target string may further be output, for example, published in a network communication platform.
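The two processing options mentioned above, deletion and replacement with a symbol, can be sketched as follows, assuming the sensitive characters are given as a list of indices into the character string to be detected. The function names and the default replacement symbol `*` are illustrative assumptions.

```python
def mask_sensitive(text: str, sensitive_indices, symbol: str = "*") -> str:
    """Replace each sensitive character with a replacement symbol."""
    flagged = set(sensitive_indices)
    return "".join(symbol if i in flagged else ch for i, ch in enumerate(text))


def delete_sensitive(text: str, sensitive_indices) -> str:
    """Remove each sensitive character from the string."""
    flagged = set(sensitive_indices)
    return "".join(ch for i, ch in enumerate(text) if i not in flagged)


print(mask_sensitive("我们去KTV", [0, 1]))    # **去KTV
print(delete_sensitive("我们去KTV", [0, 1]))  # 去KTV
```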
In the embodiment of the application, a detection request can be acquired, the detection request comprises a character string to be detected, the character string to be detected is converted into one or more of an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected, auxiliary information is obtained, the character string to be detected and the auxiliary information are matched with a target dictionary tree to obtain sensitive characters in the character string to be detected, then the sensitive words of the character string to be detected are processed according to the sensitive characters in the character string to be detected to obtain a target character string, and the target character string is output. By adopting the method, the character strings to be detected can be subjected to sensitive character recognition from multiple dimensions, the accuracy of sensitive character recognition is improved, and the effectiveness of sensitive word processing is improved.
Referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. The data processing method described in the embodiment of the present application may be performed by the server 10 or the terminal 11 in fig. 1, and includes, but is not limited to, the following steps:
S301, acquiring a detection request, wherein the detection request comprises a character string to be detected.
In an embodiment, a to-be-detected string may be input through an input field of the application interface, and the input to-be-detected string is displayed on the input field of the application interface, and when a send button of the application interface is triggered, a detection request including the to-be-detected string may be generated. The application interface may be an interface in a game community for publishing dynamic content, where the dynamic content may include text, images, video, and voice, and the character string to be detected may be text extracted from the published text, the published images, or the video images, and text obtained by converting the published voice. In addition, the application interface may be a barrage playing interface or a short video comment interface, which is not limited in the present application.
S302, converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected.
In an embodiment, the detection request further includes a verification level of the character string to be detected. The verification level may be determined based on the publishing scenario of the character string to be detected; for example, character strings to be detected published in a game comment scenario, a nickname naming scenario, and a video bullet screen scenario may have different verification levels. The verification level may be used to determine the data included in the auxiliary information. When the verification level is the first level, the auxiliary information may include the initial corresponding to the character string to be detected, or the pinyin corresponding to the character string to be detected, or the strokes corresponding to the character string to be detected. When the verification level is the second level, the auxiliary information may include the initial and the pinyin, or the initial and the strokes, or the pinyin and the strokes corresponding to the character string to be detected. When the verification level is the third level, the auxiliary information may include the initial, the pinyin, and the strokes corresponding to the character string to be detected. With this embodiment, the verification level decides how many kinds of data the auxiliary information includes.
In a possible embodiment, the verification level may be divided more finely. For example, when the verification level is the first level, the auxiliary information may include the initial corresponding to the character string to be detected; when it is the second level, the pinyin corresponding to the character string to be detected; when it is the third level, the strokes corresponding to the character string to be detected; when it is the fourth level, the initial and the pinyin corresponding to the character string to be detected; when it is the fifth level, the initial and the strokes corresponding to the character string to be detected; and when it is the sixth level, the pinyin, the initial, and the strokes corresponding to the character string to be detected.
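The six-level division described above amounts to a lookup from verification level to the kinds of auxiliary information to generate. The following is a minimal sketch, assuming the levels are numbered 1 to 6; the names `AUX_BY_LEVEL` and `aux_kinds` are hypothetical and do not appear in the application.

```python
from __future__ import annotations

# Hypothetical mapping from verification level to the kinds of auxiliary
# information listed in the six-level example; names and numbering are
# assumptions made for illustration.
AUX_BY_LEVEL = {
    1: ("initial",),
    2: ("pinyin",),
    3: ("stroke",),
    4: ("initial", "pinyin"),
    5: ("initial", "stroke"),
    6: ("pinyin", "initial", "stroke"),
}


def aux_kinds(level: int) -> tuple[str, ...]:
    """Return the kinds of auxiliary information to compute for a level."""
    try:
        return AUX_BY_LEVEL[level]
    except KeyError:
        raise ValueError(f"unknown verification level: {level}") from None
```

A caller would compute only the conversions returned by `aux_kinds`, so a lower level does less conversion and matching work per request.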
In one embodiment, when the auxiliary information includes the strokes corresponding to the character string to be detected, the character string to be detected is split to obtain those strokes. Specifically, the Chinese characters contained in the character string to be detected can be split to obtain the strokes corresponding to each contained Chinese character, and each contained Chinese character in the character string to be detected is replaced with its corresponding strokes to obtain the strokes corresponding to the character string to be detected. In one implementation, the contained Chinese characters can be split according to stroke order, or they can be split according to radicals.
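As an illustration of the radical-based variant, the sketch below replaces each Chinese character with its components. It assumes a small hand-written decomposition table (`COMPONENTS`, hypothetical); a real system would need a complete stroke or radical table, and characters without an entry are left unchanged here.

```python
# Toy radical decomposition table (an assumption for illustration only).
COMPONENTS = {
    "好": "女子",
    "明": "日月",
}


def to_strokes(text: str) -> str:
    """Replace each character that has a table entry with its components."""
    return "".join(COMPONENTS.get(ch, ch) for ch in text)


print(to_strokes("明天"))  # 日月天
```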
In an embodiment, when the auxiliary information includes pinyin corresponding to the character string to be detected, the character string to be detected may be matched with the word segmentation dictionary tree to obtain M matched words and N unmatched characters, where M and N are integers.
The word segmentation dictionary tree can be a dictionary tree obtained by loading a word segmentation word stock based on a finite state automaton algorithm (the AC automaton algorithm, that is, the Aho-Corasick algorithm). The word segmentation word stock comprises a plurality of segmented words, which may be words from standard Chinese or new words (words with new content or a new form that are absent from the original vocabulary system). In an original dictionary tree, each node represents a character, and the path from the root node to a given node forms a character string. Compared with the original dictionary tree, the dictionary tree generated based on the AC automaton algorithm additionally provides a mismatch pointer for each node. If the mismatch pointer of node[i] points to node[j], the character string formed from the root node to node[j] is the longest suffix of the character string formed from the root node to node[i] that can itself be formed from the root node; if no such node[j] exists for node[i], the mismatch pointer of node[i] points to the root node. Taking fig. 4 as an example, the character string formed from the root node to node 1 is "abch" and the character string formed from the root node to node 2 is "ch"; "ch" is the longest such suffix of "abch", so the mismatch pointer of node 1 points to node 2. The character string formed from the root node to node 3 is "abd"; neither of its suffixes "d" and "bd" can be formed from the root node to any node, so the mismatch pointer of node 3 points to the root node.
If a character fails to match a node in the original dictionary tree, matching must go back to the root node and restart the traversal. In the dictionary tree generated based on the AC automaton algorithm, when a character fails to match a node, matching instead moves to the node pointed to by the mismatch pointer and continues from there, so no backtracking is needed. For example, when the character string "abchn" has been matched up to "abch" in the dictionary tree of fig. 4, node 1 has been reached; since no next node of node 1 is "n", node 2, which the mismatch pointer of node 1 points to, is obtained, and the matching of "n" continues from node 2.
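The mismatch pointers described above correspond to the failure links of an Aho-Corasick automaton and can be computed with a breadth-first pass over the dictionary tree. The sketch below shows one common construction and is not necessarily the one used in the application; the pattern list is an assumption chosen only to reproduce the strings mentioned for fig. 4 ("abch", "ch", "abd"), and the names `Node`, `build`, and `find` are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix      # string spelled from the root to this node
        self.children = {}        # character -> child Node
        self.fail = None          # mismatch pointer


def build(patterns):
    """Build the dictionary tree and set the mismatch pointer of every node."""
    root = Node()
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.children.setdefault(ch, Node(node.prefix + ch))
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root         # depth-1 nodes always fall back to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            # Follow the parent's mismatch pointers until a node with a
            # child labelled ch is found, or the root is reached.
            f = node.fail
            while f is not root and ch not in f.children:
                f = f.fail
            child.fail = f.children.get(ch, root)
            queue.append(child)
    return root


def find(root, s):
    """Return the node reached by spelling s from the root."""
    node = root
    for ch in s:
        node = node.children[ch]
    return node


root = build(["abchi", "abd", "chn", "i"])   # assumed patterns
print(find(root, "abch").fail.prefix)            # ch
print(find(root, "abd").fail.prefix or "<root>") # <root>
```

Because nodes are processed in breadth-first order, the mismatch pointer of a parent is always final before the pointers of its children are derived from it.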
In the word segmentation dictionary tree, every node has a mismatch pointer; in addition, the node representing the last character of a segmented word A (any one of the plurality of segmented words) contains an end identifier and the number of characters of word A. For example, the path from the root node (which may represent a null character) through node a and node b to node c spells the three-character word "sensitive word", with node c representing its last character ("word"); node c therefore contains an end identifier and the number of characters 3. When the current character ("word") being matched in the character string to be detected ("what the sensitive word is") is consistent with the character represented by node c, node c is found to have the end identifier, and the number of characters 3 contained in node c is obtained. The characters represented by the two nodes immediately above node c can then be obtained, or the two characters preceding the current character in the character string to be detected can be obtained, so as to determine the matching word "sensitive word".
As shown in fig. 5, a word segmentation dictionary tree may be constructed from a word segmentation word stock. The input character string to be detected is matched with the word segmentation dictionary tree, a reference node containing an end identifier is extracted from the nodes matched with the current character, the number of characters contained in the reference node is obtained, and the matching word is looked up in the word segmentation dictionary tree by using that number of characters. Finally, when the matching of the character string to be detected with the word segmentation dictionary tree is finished, M matching words are obtained, and the Chinese characters in the character string to be detected other than the matching words are taken as unmatched characters. In a feasible embodiment, new segmented words can be added to the word segmentation word stock, and the word segmentation dictionary tree can be reloaded from the updated word stock.
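The role of the end identifier and the character count can be sketched as follows. For brevity this sketch uses a plain dictionary tree with longest-match scanning from each position and omits the mismatch pointers, so it is a simplification rather than the AC automaton matching described above; the word list and the input string are hypothetical.

```python
def build_trie(words):
    """Plain trie; the node for a word's last character stores its length."""
    root = {"children": {}, "length": 0}   # length > 0 acts as end identifier
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(
                ch, {"children": {}, "length": 0})
        node["length"] = len(word)
    return root


def segment(text, root):
    """Return (matched words, unmatched characters) using longest match."""
    matched, unmatched = [], []
    i = 0
    while i < len(text):
        node, best = root, 0
        for j in range(i, len(text)):
            node = node["children"].get(text[j])
            if node is None:
                break
            if node["length"]:             # end identifier found
                best = node["length"]      # character count of the word
        if best:
            matched.append(text[i:i + best])
            i += best
        else:
            unmatched.append(text[i])
            i += 1
    return matched, unmatched


trie = build_trie(["敏感词", "什么"])        # assumed word stock
print(segment("敏感词是什么", trie))          # (['敏感词', '什么'], ['是'])
```

Here M = 2 and N = 1: the two words found in the trie are the matching words, and the remaining character is the unmatched character.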
Further, the word segmentation pinyin mapping table comprises the plurality of segmented words in the word segmentation word stock and the pinyin corresponding to each segmented word, and the word pinyin mapping table comprises a plurality of single characters and the pinyin corresponding to each character. Fig. 6 shows an example of each of the two mapping tables side by side; the word pinyin mapping table contains the Unicode codes of Chinese characters (e.g., 3416 in U+3416). The pinyin corresponding to the M matching words is looked up in the word segmentation pinyin mapping table, and the pinyin corresponding to the N unmatched characters is looked up in the word pinyin mapping table. The pinyin corresponding to the character string to be detected is then generated from these results: specifically, the M matching words in the character string to be detected are replaced with their corresponding pinyin, and the N unmatched characters in the character string to be detected are replaced with their corresponding pinyin.
In an embodiment, when the auxiliary information includes the initial corresponding to the character string to be detected, the initial is extracted by using the pinyin corresponding to the character string to be detected. The first letter of the pinyin of each character can be extracted from the pinyin corresponding to the character string to be detected, and the pinyin of each character is replaced with its first letter to obtain the initial corresponding to the character string to be detected. Alternatively, the initials corresponding to the M matching words can be determined from the pinyin corresponding to the M matching words, and the initials corresponding to the N unmatched characters can be determined from the pinyin corresponding to the N unmatched characters; the M matching words in the character string to be detected are then replaced with their initials, and the N unmatched characters are replaced with their initials, so that the initial corresponding to the character string to be detected is obtained.
As shown in fig. 7, after the character string to be detected is segmented by using the word segmentation dictionary tree to obtain its word segmentation result, the pinyin in the word pinyin mapping table and in the word segmentation pinyin mapping table can be loaded, and the pinyin and the initial corresponding to the character string to be detected are obtained in a multi-way parallel processing mode.
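The two-table lookup can be sketched as below. The tables `WORD_PINYIN` and `CHAR_PINYIN` are toy assumptions (real tables are far larger and, per the description, the single-character table is keyed by Unicode code), and `to_pinyin` is a hypothetical name. The example also suggests why a word-level table is useful: the character 行 is read "hang" inside the word 银行 but "xing" in the single-character table used here.

```python
# Toy mapping tables (assumptions for illustration only).
WORD_PINYIN = {"银行": ["yin", "hang"]}
CHAR_PINYIN = {"我": ["wo"], "去": ["qu"], "行": ["xing"]}


def to_pinyin(segments):
    """Map segmented text to pinyin syllables.

    `segments` is the text split into matched words and single unmatched
    characters, in their original order.
    """
    syllables = []
    for seg in segments:
        if seg in WORD_PINYIN:            # matched word: word-level table
            syllables.extend(WORD_PINYIN[seg])
        elif seg in CHAR_PINYIN:          # unmatched character: character table
            syllables.extend(CHAR_PINYIN[seg])
        else:                             # no entry: keep the text unchanged
            syllables.append(seg)
    return syllables


print("".join(to_pinyin(["我", "去", "银行"])))  # woquyinhang
```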
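Extracting the initial from the pinyin reduces to taking the first letter of each syllable. A minimal sketch, assuming the pinyin is available as a list of per-character syllables (the function name `initials` is hypothetical):

```python
def initials(syllables):
    """Join the first letter of each non-empty pinyin syllable."""
    return "".join(s[0] for s in syllables if s)


print(initials(["min", "gan", "ci"]))  # mgc
```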
S303, acquiring a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set.
In one embodiment, the target dictionary tree is a dictionary tree constructed by loading a sensitive word set, a sensitive word initial set, a sensitive word pinyin set, and a sensitive word stroke set based on the AC automaton algorithm. A sensitive character string is a piece of data in one of these four sets; that is, it can be a sensitive word in the sensitive word set, a sensitive word initial in the sensitive word initial set, a sensitive word pinyin in the sensitive word pinyin set, or a sensitive word stroke in the sensitive word stroke set. A node in the target dictionary tree may represent one character of a sensitive character string, and the character string formed by traversing from the root node of the target dictionary tree to a given node may be a sensitive character string. Each node has a mismatch pointer, and the node corresponding to the last character of a sensitive character string contains an end identifier and the number of characters of that sensitive character string.
In an embodiment, the server may respond to a sensitive word management instruction and display a sensitive word management interface according to the object authority corresponding to the instruction. The sensitive word management interface includes one or more of a sensitive word configuration area, a sensitive word initial configuration area, a sensitive word pinyin configuration area, and a sensitive word stroke configuration area, which are used for receiving an input sensitive word, an input sensitive word initial, an input sensitive word pinyin, and an input sensitive word stroke, respectively. The authority identifier corresponding to the object authority can be contained in the sensitive word management instruction, and different object authorities carry different configuration rights. For example, object authorities can be divided among operators, developers, and product staff: when the sensitive word management instruction is initiated by an operator, the sensitive word management interface comprises a sensitive word configuration area and a sensitive word initial configuration area; when it is initiated by a developer, the sensitive word management interface comprises a sensitive word configuration area, a sensitive word stroke configuration area, and a sensitive word initial configuration area.
Further, the target set may be updated with received data, that is, the received data may be added to the target set. The received data may be one or more of a sensitive word entered in the sensitive word configuration area, a sensitive word initial entered in the sensitive word initial configuration area, a sensitive word pinyin entered in the sensitive word pinyin configuration area, and a sensitive word stroke entered in the sensitive word stroke configuration area. The target set is whichever of the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set matches the type of the received data. A preset period for loading the target dictionary tree can be set on the sensitive word management interface; when the preset period arrives, the updated target set and the sets that were not updated are loaded based on the finite state automaton algorithm to generate a new target dictionary tree.
In an embodiment, the sensitive word management interface further includes a service scene list containing at least one service scene, for example, a general scene, a game application service scene, and a video application service scene. The server may respond to a trigger operation on a target service scene in the service scene list, the target service scene being any one of the at least one service scene; the data input in the sensitive word management interface is then configured for the target service scene. The updated target set and the sets that were not updated are subsequently loaded based on the finite state automaton algorithm, and the new target dictionary tree thus generated is applicable to the target service scene and may be stored in association with it. When the detection request contains an application identifier, the service scene to which the application corresponding to the application identifier belongs is acquired, and if that service scene matches the target service scene, the target dictionary tree is read; when the detection request does not contain an application identifier, the dictionary tree applicable to the general scene is acquired as the target dictionary tree.
In a possible embodiment, besides receiving the input sensitive words in the sensitive word configuration area, as shown in fig. 8, existing sensitive words in the sensitive word set may be deleted in the sensitive word configuration area, and a service scenario applicable to each sensitive word may be set. Similarly, the sensitive word initial configuration area, the sensitive word pinyin configuration area and the sensitive word stroke configuration area can also comprise a deleting function and a service scene configuration function. The corresponding sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set can exist for each service scene, and the corresponding sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set of each service scene can be loaded based on an AC automaton algorithm to obtain a dictionary tree applicable to each service scene. When the detection request contains the application identifier, a service scene to which the application corresponding to the application identifier belongs is acquired, a dictionary tree applicable to the service scene to which the application belongs is used as a target dictionary tree, and when the detection request does not contain the application identifier, the dictionary tree applicable to the general scene can be acquired as the target dictionary tree.
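The selection of the target dictionary tree by service scene can be sketched as a pair of lookups. The names below (`GENERAL_SCENE`, `select_tree`, the example identifiers) are hypothetical, and falling back to the general scene when an application's scene has no tree of its own is an assumption not stated in the application.

```python
GENERAL_SCENE = "general"


def select_tree(trees, app_scenes, app_id=None):
    """Pick the dictionary tree for a detection request.

    trees: service scene name -> dictionary tree (any object)
    app_scenes: application identifier -> service scene name
    """
    if app_id is None:                    # no application identifier
        return trees[GENERAL_SCENE]
    scene = app_scenes.get(app_id, GENERAL_SCENE)
    # Assumed fallback: use the general tree if the scene has none.
    return trees.get(scene, trees[GENERAL_SCENE])


trees = {"general": "general-tree", "game": "game-tree"}
app_scenes = {"app-1": "game", "app-2": "video"}
print(select_tree(trees, app_scenes, "app-1"))  # game-tree
```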
In one embodiment, the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set may be deployed in isolation from one another. Specifically, each of a plurality of data isolation models comprises a main data storage unit and a standby data storage unit. The main data storage units can be used to store the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set respectively, and the standby data storage unit in each data isolation model can automatically synchronize the data in the corresponding main data storage unit. When the four sets are loaded based on the finite state automaton algorithm to generate the target dictionary tree, data is acquired from each main data storage unit that has not crashed; if a main data storage unit has crashed, the data is acquired from the corresponding standby data storage unit. This embodiment ensures high reliability of the data.
In a feasible embodiment, the isolated deployment can instead be organized by service scene. A target data isolation model among the plurality of data isolation models can be used to store the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set corresponding to a target service scene, and each data isolation model comprises a main data storage unit and a standby data storage unit. When the sets corresponding to the target service scene are loaded based on the finite state automaton algorithm, data is acquired from the main data storage unit of the target data isolation model if that unit has not crashed, and from the standby data storage unit of the target data isolation model if the main data storage unit has crashed. This embodiment ensures high reliability of the data.
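The main/standby fallback can be sketched as follows, assuming each storage unit is exposed as a callable that returns a set or raises when the unit has crashed. `StoreUnavailable` and `load_set` are hypothetical names; how a crash is detected in practice is not specified in the application.

```python
class StoreUnavailable(Exception):
    """Raised by a storage unit that has crashed (hypothetical)."""


def load_set(primary, standby):
    """Read a sensitive-string set from the main unit, else the standby."""
    try:
        return primary()
    except StoreUnavailable:
        return standby()


def healthy_unit():
    return {"mgc", "minganci"}


def crashed_unit():
    raise StoreUnavailable("main data storage unit is down")


print(sorted(load_set(crashed_unit, healthy_unit)))  # ['mgc', 'minganci']
```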
S304, matching the character string to be detected with the target dictionary tree to obtain a first matching result.
When the target dictionary tree is obtained based on an AC automaton algorithm, the target dictionary tree comprises a plurality of nodes and mismatch pointers of each node, and the node corresponding to the last character of the sensitive character string comprises an end identifier and the character number of the sensitive character string, wherein the sensitive character string is data in a sensitive word set, a sensitive word initial set, a sensitive word pinyin set or a sensitive word stroke set.
In an embodiment, matching the character string to be detected with the target dictionary tree to obtain a first matching result includes: traversing the character string to be detected to extract a current character for current matching, and determining a target node corresponding to the current character from a target dictionary tree; and matching the current character with the target node. If the current character is successfully matched with the target node, determining a next node to be matched corresponding to the target node according to a matching success strategy, taking the next character of the current character as a new current character, taking the next node to be matched determined by the matching success strategy as a new target node, and executing the step of matching the current character with the target node. The matching success strategy specifically comprises the following steps: if the target node is a leaf node, the next node to be matched, which is determined by the matching success strategy, is the next node of the node pointed by the mismatch pointer of the target node; if the target node is not the leaf node, the next node to be matched determined by the matching success strategy is the next node of the target node. If the matching of the current character and the target node fails, determining whether the target node is the next node of the root node; if yes, taking the next character of the current character as a new current character, and executing the step of matching the current character with the target node; if not, determining the next node to be matched corresponding to the target node according to the matching failure strategy, taking the next node to be matched determined by the matching failure strategy as a new target node, and executing the step of matching the current character with the target node. 
The matching failure strategy is specifically as follows: if the target node is the root node, then, because the root node represents an empty character, matching between the target node and the current character fails by default, and the next node to be matched determined by the matching failure strategy is a next node of the root node; if the target node is not the root node, the next node to be matched determined by the matching failure strategy is a next node of the node pointed to by the mismatch pointer of the parent node of the target node. When the traversal of the character string to be detected is completed, the target nodes containing end identifiers are extracted from all successfully matched target nodes, and the first matching result is extracted from the target dictionary tree according to the number of characters contained in each extracted target node. For example, when the current character ("word") being matched in the character string to be detected is consistent with the character represented by node c, node c is found to have the end identifier; the number of characters 3 contained in node c is then used to obtain the characters represented by the two nodes immediately above node c, and the sensitive character string "sensitive word" is determined. The first matching result can comprise one or more sensitive character strings, and a sensitive character string can be a pinyin character string, a stroke character string, a Chinese character string, or an initial character string.
The matching process is illustrated below, taking fig. 4 as the target dictionary tree and "sabchi" as the character string to be detected. The initial current character is the first character "s" of the character string to be detected, and the initial target node is the root node. Matching "s" with the root node fails, so the next nodes "a", "c", and "i" of the root node are taken as new target nodes. Matching "s" with the target nodes "a", "c", and "i" also fails, so the next character "a" of the current character "s" is taken as the new current character and matched with the target nodes "a", "c", and "i". Because this match succeeds, the next character "b" of the current character "a" is taken as the new current character, the next node "b" of the target node is taken as the new target node, and matching continues. The next character "c" of the current character is then taken as the new current character, the next nodes "d" and "c" of the target node are taken as new target nodes, and matching continues. The next character "h" of the current character is taken as the new current character and, because matching succeeded, the next node "h" of the target node is taken as the new target node and matching continues. The next character "n" of the current character is taken as the new current character and, because matching succeeded, the next node "i" of the target node is taken as the new target node and matched. The next node "n" of the node pointed to by the mismatch pointer of the parent node of the target node is then taken as the new target node and matched with the current character "n". In addition, because the target node "n" is a leaf node, the next nodes "a", "c", and "i" of the node pointed to by the mismatch pointer of the target node are taken as new target nodes, and matching finally completes successfully.
The target dictionary tree constructed based on the AC automaton algorithm can improve the matching speed of the character strings to be detected.
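For reference, the sketch below shows the textbook Aho-Corasick search, in which each node stores its mismatch pointer and the lengths of the sensitive character strings ending there (a non-empty list plays the role of the end identifier). It is not a line-by-line transcription of the matching success and failure strategies described above; the pattern list is an assumption modelled on the strings mentioned for fig. 4, and `build` and `search` are hypothetical names.

```python
from collections import deque


def build(patterns):
    # Each node holds its children, its mismatch pointer ("fail"), and the
    # lengths of all sensitive strings that end at this node.
    nodes = [{"next": {}, "fail": 0, "lengths": []}]
    for p in patterns:
        cur = 0
        for ch in p:
            if ch not in nodes[cur]["next"]:
                nodes.append({"next": {}, "fail": 0, "lengths": []})
                nodes[cur]["next"][ch] = len(nodes) - 1
            cur = nodes[cur]["next"][ch]
        if len(p) not in nodes[cur]["lengths"]:
            nodes[cur]["lengths"].append(len(p))
    queue = deque(nodes[0]["next"].values())   # depth-1 nodes fail to root
    while queue:
        cur = queue.popleft()
        for ch, child in nodes[cur]["next"].items():
            f = nodes[cur]["fail"]
            while f and ch not in nodes[f]["next"]:
                f = nodes[f]["fail"]
            nodes[child]["fail"] = nodes[f]["next"].get(ch, 0)
            # Inherit strings ending at the node the mismatch pointer targets.
            nodes[child]["lengths"] += nodes[nodes[child]["fail"]]["lengths"]
            queue.append(child)
    return nodes


def search(text, nodes):
    """Return (start index, matched string) for every sensitive string."""
    hits, cur = [], 0
    for i, ch in enumerate(text):
        while cur and ch not in nodes[cur]["next"]:
            cur = nodes[cur]["fail"]           # follow mismatch pointers
        cur = nodes[cur]["next"].get(ch, 0)
        for n in nodes[cur]["lengths"]:        # end identifier + char count
            hits.append((i - n + 1, text[i - n + 1:i + 1]))
    return hits


nodes = build(["abchi", "abd", "chn", "i"])    # assumed patterns
print(search("sabchn", nodes))                 # [(3, 'chn')]
```

The text is scanned once from left to right and the current position never moves backwards, which is the source of the speed-up over restarting from the root after every mismatch.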
S305, matching the auxiliary information with the target dictionary tree to obtain a second matching result.
In an embodiment, matching the pinyin corresponding to the character string to be detected (or the first letter corresponding to the character string to be detected, or the stroke corresponding to the character string to be detected) with the target dictionary tree to obtain a second matching result, including: traversing pinyin corresponding to the character string to be detected (or initial corresponding to the character string to be detected or strokes corresponding to the character string to be detected) to extract a current character for current matching, and determining a target node corresponding to the current character from a target dictionary tree; and matching the current character with the target node. If the current character is successfully matched with the target node, determining a next node to be matched corresponding to the target node according to a matching success strategy, taking the next character of the current character as a new current character, taking the next node to be matched determined by the matching success strategy as a new target node, and executing the step of matching the current character with the target node. The matching success strategy specifically comprises the following steps: if the target node is a leaf node, the next node to be matched, which is determined by the matching success strategy, is the next node of the node pointed by the mismatch pointer of the target node; if the target node is not the leaf node, the next node to be matched determined by the matching success strategy is the next node of the target node. 
If the matching of the current character with the target node fails, it is determined whether the target node is a next node of the root node. If yes, the next character of the current character is taken as the new current character, and the step of matching the current character with the target node is executed. If not, the next node to be matched corresponding to the target node is determined according to the matching failure strategy, the next node to be matched determined by the matching failure strategy is taken as the new target node, and the step of matching the current character with the target node is executed. The matching failure strategy is specifically as follows: if the target node is the root node, then, because the root node represents an empty character, matching between the target node and the current character fails by default, and the next node to be matched determined by the matching failure strategy is a next node of the root node; if the target node is not the root node, the next node to be matched determined by the matching failure strategy is a next node of the node pointed to by the mismatch pointer of the parent node of the target node. When the traversal of the pinyin corresponding to the character string to be detected (or the initial corresponding to the character string to be detected, or the strokes corresponding to the character string to be detected) is completed, the target nodes containing end identifiers are extracted from all successfully matched target nodes, and the second matching result is extracted from the target dictionary tree according to the number of characters contained in each extracted target node.
When the pinyin corresponding to the character string to be detected is matched with the target dictionary tree, the sensitive character string included in the second matching result can be a pinyin character string. When the initial corresponding to the character string to be detected is matched with the target dictionary tree, the sensitive character string included in the second matching result can be an initial character string. When the strokes corresponding to the character string to be detected are matched with the target dictionary tree, the sensitive character string included in the second matching result can be a stroke character string.
S306, combining the first matching result and the second matching result to obtain the sensitive character in the character string to be detected.
In an embodiment, the first matching result and the second matching result may be combined by taking the intersection or the union of the sensitive character strings included in the first matching result and those included in the second matching result. Because the second matching result is expressed as initials, pinyin, or strokes, the character string in the character string to be detected that corresponds to each sensitive character string in the second matching result is found first. When the intersection is taken, the common portion of the sensitive character string included in the first matching result and the character string corresponding to the sensitive character string included in the second matching result is obtained; for example, the common portion of "sensitive" and "sensitive word" is "sensitive", and the characters included in "sensitive" are regarded as sensitive characters. When the union is taken, the characters of the character string to be detected that belong to the sensitive character strings included in the first matching result are regarded as sensitive characters, and so are the characters of the character strings in the character string to be detected that correspond to the sensitive character strings included in the second matching result; for example, if the second matching result is "word", the character string corresponding to the sensitive character string "word" in the character string to be detected is "we".
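One way to realize the combination is to express both matching results as sets of character positions in the character string to be detected. The sketch below assumes the pinyin of each character is known, so that a span matched in the joined pinyin can be mapped back to the characters it covers; treating a partially covered character as sensitive is an assumption, and the names `pinyin_span_to_chars` and `combine` are hypothetical.

```python
def pinyin_span_to_chars(char_syllables, start, end):
    """Map a [start, end) span of the joined pinyin back to character indices.

    char_syllables[k] is the pinyin of the k-th character of the original
    character string.
    """
    covered, offset = set(), 0
    for index, syllable in enumerate(char_syllables):
        lo, hi = offset, offset + len(syllable)
        if lo < end and start < hi:       # the two spans overlap
            covered.add(index)
        offset = hi
    return covered


def combine(first, second, mode="union"):
    """Combine two sets of sensitive character indices."""
    return first | second if mode == "union" else first & second


syllables = ["min", "gan", "ci"]          # joined pinyin: "minganci"
first = {0, 1}                            # characters hit by direct matching
second = pinyin_span_to_chars(syllables, 0, 8)
print(sorted(combine(first, second, "intersection")))  # [0, 1]
print(sorted(combine(first, second, "union")))         # [0, 1, 2]
```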
S307, performing sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string.
In an embodiment, a sensitive word prompt interface may be displayed. The sensitive word prompt interface includes Q synonyms, where the Q synonyms are obtained based on the sensitive characters in the character string to be detected and Q is a positive integer. Specifically, the sensitive characters in the character string to be detected may be characters included in a sensitive character string, or characters included in the character string that corresponds to a sensitive character string in the character string to be detected, so the sensitive character string can be determined from the sensitive characters. If the sensitive character string is the initials, pinyin, or strokes corresponding to a sensitive word, the sensitive character string is first converted into the sensitive word, and synonym conversion is then performed on the converted sensitive word to obtain the Q synonyms. If the sensitive character string is itself a sensitive word, synonym conversion is performed on it directly to obtain the Q synonyms. Fig. 9 is a schematic diagram of an example sensitive word prompt interface, which shows that the synonyms corresponding to the sensitive word "sales champion" included in the character string to be detected include "hot-selling", "sales high", and "hot-selling". In response to a selection operation on a target synonym in the sensitive word prompt interface, the sensitive characters in the character string to be detected may be replaced with the target synonym to obtain the target character string. When the target character string is output, it may be displayed on the application interface in the form of a text message.
In another embodiment, the number of sensitive characters in the character string to be detected may be obtained and used to determine a sensitive word processing policy. The sensitive word processing policy is one of a first processing policy indicating deletion of the character string to be detected, a second processing policy indicating deletion of the sensitive characters in the character string to be detected, and a third processing policy indicating replacement of the sensitive characters in the character string to be detected with replacement symbols. Specifically, if the number of sensitive characters is greater than or equal to a first threshold (which may be set manually), the first processing policy is executed; in this case the character string to be detected after sensitive word processing is empty, and the output is empty. If the number of sensitive characters is greater than or equal to a second threshold (which may be set manually) and smaller than the first threshold, the second processing policy is executed to obtain the character string to be detected after sensitive word processing. If the number of sensitive characters is greater than or equal to a third threshold (which may be set manually) and smaller than the second threshold, the third processing policy is executed to obtain the character string to be detected after sensitive word processing. The character string to be detected after sensitive word processing is the target character string, which may be output; in particular, the target character string may be displayed on the application interface in the form of a text message.
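The threshold rule described in this embodiment can be sketched as follows. The function name `choose_policy`, the policy labels, and the example threshold values are hypothetical; the text also does not say what happens below the third threshold, so returning `"no_action"` there is an assumption.

```python
def choose_policy(count, first_threshold, second_threshold, third_threshold):
    """Pick a sensitive word processing policy from the sensitive character count.

    Thresholds are assumed to satisfy first > second > third, which is what
    the ranges in the description imply.
    """
    if not first_threshold > second_threshold > third_threshold:
        raise ValueError("thresholds must satisfy first > second > third")
    if count >= first_threshold:
        return "delete_string"      # first policy: delete the whole string
    if count >= second_threshold:
        return "delete_sensitive"   # second policy: delete sensitive characters
    if count >= third_threshold:
        return "mask_sensitive"     # third policy: replace with symbols
    return "no_action"              # assumption: not specified in the text


# Hypothetical thresholds: 10, 5 and 1 sensitive characters.
for n in (12, 7, 2, 0):
    print(n, choose_policy(n, 10, 5, 1))
```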
In a possible embodiment, when no sensitive character exists in the character string to be detected, the character string to be detected may be output directly; in particular, it may be displayed directly in the application interface in the form of a text message. When a sensitive character exists in the character string to be detected, a processing policy list may be displayed. The processing policy list includes one or more of a first processing policy indicating deletion of the character string to be detected, a second processing policy indicating deletion of the sensitive characters in the character string to be detected, a third processing policy indicating replacement of the sensitive characters in the character string to be detected with a replacement symbol (e.g., "#", "+", etc.), and a fourth processing policy indicating replacement of the sensitive characters in the character string to be detected with a synonym. In addition, warning prompt information may be displayed first to indicate that sensitive characters exist in the character string to be detected, and the processing policy list is displayed when a confirmation operation for the warning prompt information is received.
Further, when a target processing policy in the processing policy list is selected, sensitive word processing is performed on the character string to be detected according to the target processing policy. For example, when the target processing policy is the first processing policy, the character string to be detected after sensitive word processing is empty; when the target processing policy is the second processing policy, the sensitive characters in the character string to be detected are deleted to obtain the character string to be detected after sensitive word processing; when the target processing policy is the third processing policy, the sensitive characters in the character string to be detected are replaced with a replacement symbol to obtain the character string to be detected after sensitive word processing; when the target processing policy is the fourth processing policy, the sensitive characters in the character string to be detected are replaced with a synonym to obtain the character string to be detected after sensitive word processing, where the synonym may replace a sensitive character string formed by one or more sensitive characters. Confirmation prompt information about the character string to be detected after sensitive word processing may then be displayed; when a confirmation operation for the confirmation prompt information occurs, the character string to be detected after sensitive word processing is taken as the target character string and output, and the target character string may be displayed on the application interface in the form of a text message. For example, as shown in the content indicated by 100 in fig. 10, a user may input the character string to be detected "xx sales champion" in an input field of an application interface and click the "send" control to generate a detection request including "xx sales champion". When it is determined, in response to the detection request, that a sensitive character exists in the character string "xx sales champion", a processing policy list may be displayed in the application interface, as shown in the content indicated by 101 in fig. 10. When a target processing policy in the processing policy list is selected (replacing "sales champion" in "xx sales champion" with "sales high"), sensitive word processing is performed on the character string to be detected according to the target processing policy to obtain the target character string "xx sales high", which may then be displayed in a display area, as shown in the content indicated by 102 in fig. 10.
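The four processing policies above can be illustrated with a small sketch. The function `apply_policy`, the policy labels, and the position-based representation of sensitive characters are hypothetical choices made for this example; the fourth policy here replaces each contiguous run of sensitive characters with the selected synonym.

```python
def apply_policy(text, positions, policy, mask="#", synonym=None):
    """Apply one processing policy to the string to be detected.

    positions: indices of sensitive characters in `text`.
    """
    pos = set(positions)
    if policy == "delete_string":        # first policy
        return ""
    if policy == "delete_sensitive":     # second policy
        return "".join(c for i, c in enumerate(text) if i not in pos)
    if policy == "mask_sensitive":       # third policy
        return "".join(mask if i in pos else c for i, c in enumerate(text))
    if policy == "synonym":              # fourth policy
        if synonym is None:
            raise ValueError("synonym policy needs a synonym")
        out, i = [], 0
        while i < len(text):
            if i in pos:
                out.append(synonym)      # one synonym per sensitive run
                while i < len(text) and i in pos:
                    i += 1
            else:
                out.append(text[i])
                i += 1
        return "".join(out)
    raise ValueError(f"unknown policy: {policy}")


text = "xx sales champion"
sensitive = range(3, 17)  # positions of "sales champion"
print(apply_policy(text, sensitive, "synonym", synonym="sales high"))
```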
In the embodiment of the application, the target dictionary tree built with the AC automaton algorithm can be used to perform both complete and partial matching searches on the character string to be detected and the auxiliary information. The search time does not grow as the size of the word stock grows, so the search speed for sensitive characters can be improved. Meanwhile, the sensitive character strings can be configured efficiently through a sensitive word management interface, which makes constructing the target dictionary tree more convenient. In addition, sensitive character recognition can be performed in multiple dimensions by using the character string to be detected, the initials corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected, and the strokes corresponding to the character string to be detected, which improves the accuracy of sensitive character recognition and the effectiveness of sensitive word processing.
Referring to fig. 11, fig. 11 is a flowchart illustrating a data processing method according to an embodiment of the application. The method comprises the following steps:
In one embodiment, the character string to be detected may be input into a word stock word segmentation model, which includes a word segmentation dictionary tree, to obtain M matched words and N unmatched characters. The M matched words and N unmatched characters are then input into a pinyin and initial model, which includes a word segmentation pinyin mapping table and a word pinyin mapping table, to obtain the pinyin corresponding to the character string to be detected and the initials corresponding to the character string to be detected. The character string to be detected may also be input into a splitting model to obtain the strokes corresponding to the character string to be detected. The character string to be detected, together with its corresponding strokes, pinyin, and initials, is then input into a multi-level sensitive word filtering model to obtain a detection result, where the detection result includes the sensitive characters in the character string to be detected.
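The pinyin and initial generation step can be sketched as below. This is a simplified illustration, not the application's implementation: the two mapping tables hold a few hypothetical entries, a plain dictionary with forward maximum matching stands in for the word segmentation dictionary tree, and tones are omitted. The point of the word-level table is that a polyphonic character such as "行" gets the reading that fits the matched word.

```python
# Hypothetical word segmentation pinyin mapping table (word -> syllables).
WORD_PINYIN = {"银行": ["yin", "hang"]}
# Hypothetical word pinyin mapping table (single character -> default reading).
CHAR_PINYIN = {"我": "wo", "去": "qu", "银": "yin", "行": "xing"}
MAX_WORD_LEN = max(len(word) for word in WORD_PINYIN)


def to_pinyin(text):
    """Return the syllables for `text`, preferring matched words over characters."""
    syllables = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first (forward maximum matching).
        for size in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in WORD_PINYIN:
                syllables.extend(WORD_PINYIN[word])
                i += size
                break
        else:
            # Unmatched character: fall back to the per-character table,
            # keeping the character itself if it has no entry.
            syllables.append(CHAR_PINYIN.get(text[i], text[i]))
            i += 1
    return syllables


def to_initials(syllables):
    """Extract the first letter of each syllable."""
    return "".join(s[0] for s in syllables if s)


syllables = to_pinyin("我去银行")
print(syllables, to_initials(syllables))
```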
The multi-level sensitive word filtering model includes a target dictionary tree, as shown in fig. 12, where the target dictionary tree may be configured by a sensitive word management system based on an AC automaton algorithm and loaded with a sensitive word set, a sensitive word initial set, a sensitive word pinyin set, and a sensitive word stroke set, where data in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, and the sensitive word stroke set may be configured on a sensitive word management interface provided by the sensitive word management system. In addition, data can be added or deleted in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set, and the target dictionary tree can be reloaded. In one implementation manner, a character string matched with the target dictionary tree can be selected from the character string to be detected, strokes corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and initial letters corresponding to the character string to be detected according to the verification level of the character string to be detected, and sensitive characters in the character string to be detected are obtained according to the matching result.
Through the embodiment, the sensitive character recognition can be performed in a multi-dimensional manner by utilizing the character string to be detected, the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected, so that the accuracy of the sensitive character recognition is improved.
Referring to fig. 13, fig. 13 is a flowchart of a text processing method according to an embodiment of the present application, which may be executed by the terminal 11 in fig. 1. The method comprises the following steps:
S1301, displaying an application interface, wherein the application interface comprises a character string to be detected.
The application interface may be an interface related to a game scene, a social scene, etc., for example, an interface for publishing dynamic content in a game community, a barrage playing interface, a short video comment interface, etc., which is not limited in the present application.
In an embodiment, the application interface may include an input field, and the character string to be detected may be the characters entered in the input field of the application interface. After the user enters the character string to be detected in the input field, as shown in the content indicated by 140 in fig. 14, the user may perform a trigger operation on the "send" control in the application interface to generate a request for sensitive word recognition on the character string to be detected. When the terminal receives this request, it may check whether the character string to be detected contains a sensitive word; specifically, steps S301 to S306 or steps S201 to S204 may be performed to recognize the sensitive word.
In another embodiment, the application interface may be in a sensitive word detection mode, and the character string to be detected included in the application interface may be a character string entered in the application interface or interface text in the application interface. When the application interface is in the sensitive word detection mode, steps S301 to S306 or steps S201 to S204 may be executed automatically to perform sensitive word recognition on the character string to be detected.
S1302, when the existence of the sensitive word in the character string to be detected is detected, displaying a first prompt message, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; and displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to the sensitive words in the character string to be detected, and N is a positive integer.
When a sensitive word is detected in the character string to be detected, a first prompt message and a second prompt message may be displayed. The first prompt message is used to indicate the sensitive word in the character string to be detected, and the second prompt message includes N (a positive integer) synonyms corresponding to the sensitive word in the character string to be detected. A synonym may be a word whose meaning is the same as or similar to that of the sensitive word. It may also be the phrase that takes the place of the sensitive word after semantic conversion is performed on the character string to be detected as a whole. For example, when semantic conversion is performed on "technical garbage" as a whole, "garbage" is rendered as "to be improved" in the converted result, so the synonym corresponding to the sensitive word "garbage" in "technical garbage" is "to be improved".
In an embodiment, a popup window or a floating layer may be generated on the application interface, in which the first prompt message and the second prompt message are displayed at the same time. For example, as shown in the content indicated by 142 in fig. 14, the first prompt message reads: the sensitive word "sales champion" exists in "xx sales champion"; and the second prompt message reads: "sales high", "hot sales". The popup window or floating layer also provides a "Confirm" control. When a trigger operation on the "Confirm" control occurs, it is determined that a selection operation on the target synonym has occurred, and the synonym selected by the user is taken as the target synonym. In addition, the sensitive word in the first prompt message may be highlighted, and the synonyms in the second prompt message may also be highlighted.
In another embodiment, the first prompt message and the second prompt message may be displayed one after the other. Specifically, a first popup window or a first floating layer may be generated on the application interface to display the first prompt message. For example, as shown in the content indicated by 141 in fig. 14, a first popup window is displayed on the application interface, and the first prompt message in it reads: the sensitive word in the character string to be detected "xx sales champion" is "sales champion". The first popup window also provides a "Confirm" control and a "Cancel" control; when a trigger operation on the "Confirm" control occurs, it is determined that a confirmation operation on the first prompt message has occurred. When the confirmation operation on the first prompt message is received, the first popup window or first floating layer may be dismissed, specifically with a preset dismissal effect such as sliding out or fading out. The second prompt message may then be displayed: a second popup window or a second floating layer may be generated on the application interface to display it. For example, a second floating layer may be displayed on the application interface, and the second prompt message in it may include: "sales high", "hot sales". The second popup window or second floating layer also provides a "Confirm" control. After selecting a synonym, the user may trigger the "Confirm" control; when the trigger operation occurs, it is determined that a selection operation on the target synonym has occurred, where the target synonym is the synonym selected by the user. The second popup window or second floating layer may then be dismissed, specifically with a preset dismissal effect such as sliding out or fading out.
S1303, when receiving a selection operation of the target synonym, displaying a target character string on the application interface, wherein the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
In an embodiment, when a selection operation on a target synonym is received, the sensitive word in the character string to be detected may be replaced with the target synonym to obtain the target character string. The interface then jumps to a display area of the application interface, where the target character string is displayed in the form of a text message; for example, as shown in the content indicated by 144 in fig. 14, "xx sales high" may be displayed in the user's own feed ("my dynamics").
In an embodiment, as shown in the content indicated by 143 in fig. 14, the second prompt message displayed in the popup window or floating layer may include, in addition to the N synonyms "high in sales," "hot-selling," and "hot-selling" corresponding to the sensitive word in the character string to be detected, the replacement result character strings "xx high in sales," "xx hot-selling," and "xx hot-selling" corresponding to the synonyms. The replacement result character string corresponding to a synonym is the character string obtained by replacing the sensitive word in the character string to be detected with that synonym. After selecting a replacement result character string, the user triggers the "Confirm" control in the popup window or floating layer. When the trigger operation occurs, it is determined that a confirmation operation on the target character string has occurred, and the replacement result character string selected by the user is taken as the target character string and displayed on the application interface in the form of a text message. The synonym within each replacement result character string in the second prompt message may be highlighted.
In a possible embodiment, the second prompt message may further include a first deletion result character string "xx" and a second deletion result character string "xx#", where the first deletion result character string is the character string obtained by deleting the sensitive word from the character string to be detected, and the second deletion result character string is the character string obtained by replacing the sensitive word in the character string to be detected with a replacement symbol (for example, "#"). Compared with a replacement result character string, the second deletion result character string carries less meaning and cannot fully express the text of the character string to be detected. The user may also select the first deletion result character string or the second deletion result character string, and the selected character string may then be displayed on the application interface in the form of a text message.
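The candidate strings offered in the second prompt message can be sketched as follows. The helper `build_candidates` and its dictionary keys are hypothetical, and the sketch assumes the sensitive word is already known as a substring and is replaced by a single replacement symbol, in line with the "xx#" example.

```python
def build_candidates(text, sensitive_word, synonyms, mask="#"):
    """Build the strings a user could choose from in the second prompt."""
    return {
        # One replacement result string per synonym.
        "replaced": [text.replace(sensitive_word, s) for s in synonyms],
        # First deletion result string: sensitive word removed.
        "deleted": text.replace(sensitive_word, ""),
        # Second deletion result string: sensitive word replaced by a symbol.
        "masked": text.replace(sensitive_word, mask),
    }


candidates = build_candidates(
    "xx sales champion", "sales champion", ["sales high", "hot sales"]
)
for kind, value in candidates.items():
    print(kind, value)
```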
In the embodiment of the application, when the character string to be detected has the sensitive word, the sensitive character in the character string to be detected is processed to obtain the target character string which does not contain the sensitive character, and finally the target character string is displayed on the application interface in the form of text message, so that the convenience and the effectiveness of the sensitive word processing can be improved.
It can be appreciated that the specific embodiments of the present application involve related data such as the character string to be detected. When the above embodiments are applied to a specific product or technology, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The method of the present application has been described in detail above. To facilitate better implementation of the method, an apparatus of the present application is provided below. Referring to fig. 15, fig. 15 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 150 may include:
an acquisition unit 1501 for acquiring a detection request including a character string to be detected;
a processing unit 1502, configured to convert the character string to be detected into auxiliary information, where the auxiliary information includes one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
the obtaining unit 1501 is further configured to obtain a target dictionary tree, where the target dictionary tree is a dictionary tree constructed by a set of sensitive words, a set of first letters of the sensitive words, a set of pinyin of the sensitive words, and a set of strokes of the sensitive words;
The processing unit 1502 is further configured to match the character string to be detected and the auxiliary information with the target dictionary tree to obtain a sensitive character in the character string to be detected;
the processing unit 1502 is further configured to perform sensitive word processing on the to-be-detected character string according to the sensitive character in the to-be-detected character string, obtain a target character string, and output the target character string.
In an embodiment, the processing unit 1502 is specifically configured to: in response to a sensitive word management instruction, display a sensitive word management interface according to the object authority corresponding to the sensitive word management instruction, wherein the sensitive word management interface comprises one or more of a sensitive word configuration area, a sensitive word initial configuration area, a sensitive word pinyin configuration area and a sensitive word stroke configuration area, the sensitive word configuration area is used for receiving an input sensitive word, the sensitive word initial configuration area is used for receiving an input sensitive word initial, the sensitive word pinyin configuration area is used for receiving an input sensitive word pinyin, and the sensitive word stroke configuration area is used for receiving an input sensitive word stroke; update a target set according to the received data, wherein the target set is whichever of the sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set matches the type of the received data; and when a preset period is reached, load the updated target set and the sets that were not updated based on the finite state automaton algorithm, and generate a new target dictionary tree.
In an embodiment, the sensitive word management interface further includes a service scene list, where the service scene list includes at least one service scene; the processing unit 1502 is specifically configured to: responding to a triggering operation aiming at a target service scene in the service scene list, and storing the target dictionary tree and the target service scene in an associated way; acquiring a service scene to which the application corresponding to the application identifier belongs; and if the service scene to which the application belongs is matched with the target service scene, reading the target dictionary tree.
In an embodiment, the processing unit 1502 is specifically configured to: when the auxiliary information comprises strokes corresponding to the character strings to be detected, splitting the character strings to be detected to obtain the strokes corresponding to the character strings to be detected; when the auxiliary information comprises the pinyin corresponding to the character string to be detected, matching the character string to be detected with a word segmentation dictionary tree to obtain M matched words and N unmatched characters, wherein M and N are integers, determining the pinyin corresponding to the M matched words by using a word segmentation pinyin mapping table, determining the pinyin corresponding to the N unmatched characters by using a word pinyin mapping table, and generating the pinyin corresponding to the character string to be detected by using the pinyin corresponding to the M matched words and the pinyin corresponding to the N unmatched characters; when the auxiliary information comprises the initial corresponding to the character string to be detected, extracting the initial corresponding to the character string to be detected by utilizing the pinyin corresponding to the character string to be detected.
In an embodiment, the processing unit 1502 is specifically configured to: matching the character string to be detected with the target dictionary tree to obtain a first matching result; matching the auxiliary information with the target dictionary tree to obtain a second matching result; and combining the first matching result and the second matching result to obtain the sensitive character in the character string to be detected.
In an embodiment, the target dictionary tree includes a plurality of nodes and a mismatch pointer for each node, and the node corresponding to the last character of a sensitive character string includes an end identifier and the number of characters of the sensitive character string, where a sensitive character string is data in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set or the sensitive word stroke set; the processing unit 1502 is specifically configured to: traverse the character string to be detected to extract a current character for current matching, and determine a target node corresponding to the current character from the target dictionary tree; match the current character with the target node; if the current character is successfully matched with the target node, determine the next node to be matched corresponding to the target node according to a matching success strategy, take the next character of the current character as a new current character, take the next node to be matched determined by the matching success strategy as a new target node, and execute the step of matching the current character with the target node; if the matching of the current character and the target node fails, determine whether the target node is a next node of the root node; if yes, take the next character of the current character as a new current character, and execute the step of matching the current character with the target node; if not, determine the next node to be matched corresponding to the target node according to a matching failure strategy, take the next node to be matched determined by the matching failure strategy as a new target node, and execute the step of matching the current character with the target node; and when the traversal of the character string to be detected is completed, extract the target nodes containing end identifiers from all successfully matched target nodes, and extract a first matching result from the target dictionary tree according to the number of characters contained in the extracted target nodes.
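A dictionary tree with mismatch pointers of this kind is commonly built with the Aho-Corasick algorithm. The following is a minimal generic sketch of that algorithm, not the application's implementation: the names `Node`, `build_trie` and `find_matches` are hypothetical, a non-empty `lengths` list plays the role of the end identifier plus character count, and matches are reported while scanning rather than collected after the traversal completes.

```python
from collections import deque


class Node:
    __slots__ = ("children", "fail", "lengths")

    def __init__(self):
        self.children = {}   # character -> child node
        self.fail = None     # mismatch pointer
        self.lengths = []    # lengths of sensitive strings ending at this node


def build_trie(patterns):
    """Build the trie, then set mismatch pointers breadth-first."""
    root = Node()
    for pattern in patterns:
        if not pattern:
            continue  # ignore empty sensitive strings
        node = root
        for ch in pattern:
            node = node.children.setdefault(ch, Node())
        node.lengths.append(len(pattern))
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            f = node.fail
            while f is not root and ch not in f.children:
                f = f.fail
            child.fail = f.children.get(ch, root)
            # Inherit shorter sensitive strings that end at the same position.
            child.lengths += child.fail.lengths
            queue.append(child)
    return root


def find_matches(root, text):
    """Return half-open (start, end) spans of all sensitive strings in text."""
    matches = []
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail          # matching failure: follow mismatch pointer
        node = node.children.get(ch, root)
        for length in node.lengths:   # end identifier reached
            matches.append((i - length + 1, i + 1))
    return matches


trie = build_trie(["sales champion", "champion"])
text = "xx sales champion"
for start, end in find_matches(trie, text):
    print(start, end, text[start:end])
```

The same trie can be loaded with sensitive words, initials, pinyin and stroke strings together, since matching only depends on the character sequence being scanned.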
In an embodiment, the processing unit 1502 is specifically configured to: displaying a sensitive word prompt interface, wherein the sensitive word prompt interface comprises Q synonyms, the Q synonyms are obtained based on sensitive characters in the character string to be detected, and Q is a positive integer; and responding to the selection operation of the target synonym aiming at the sensitive word prompt interface, and replacing the sensitive character in the character string to be detected by using the target synonym.
In an embodiment, the detection request further includes a verification level of the to-be-detected character string, where the verification level is used to determine data included in the auxiliary information, and when the verification level is a first level, the auxiliary information includes an initial corresponding to the to-be-detected character string, or a pinyin corresponding to the to-be-detected character string, or a stroke corresponding to the to-be-detected character string; when the verification level is a second level, the auxiliary information comprises an initial corresponding to the character string to be detected and a pinyin corresponding to the character string to be detected, or the initial corresponding to the character string to be detected and a stroke corresponding to the character string to be detected, or the pinyin corresponding to the character string to be detected and the stroke corresponding to the character string to be detected; when the verification level is a third level, the auxiliary information comprises an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected.
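The relationship between the verification level and the auxiliary information can be sketched as a choice of how many of the three kinds of auxiliary information are included. The names `AUX_KINDS` and `allowed_aux_sets` are hypothetical, and which particular combination is used at the first or second level is not specified in the text, so the sketch only enumerates the permitted combinations.

```python
from itertools import combinations

AUX_KINDS = ("initials", "pinyin", "strokes")


def allowed_aux_sets(level):
    """Return the combinations of auxiliary information allowed at a level.

    Level 1 allows any one kind, level 2 any two kinds, level 3 all three.
    """
    if level not in (1, 2, 3):
        raise ValueError("verification level must be 1, 2 or 3")
    return [set(combo) for combo in combinations(AUX_KINDS, level)]


for level in (1, 2, 3):
    print(level, allowed_aux_sets(level))
```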
In an embodiment, the obtaining unit 1501 is specifically configured to: acquiring the number of sensitive characters in the character string to be detected;
the processing unit 1502 is specifically configured to: determining a sensitive word processing strategy by using the quantity of the sensitive characters, wherein the sensitive word processing strategy comprises one of a first processing strategy for indicating to delete the character string to be detected, a second processing strategy for indicating to delete the sensitive characters in the character string to be detected and a third processing strategy for indicating to replace the sensitive characters in the character string to be detected by replacing symbols; and carrying out sensitive word processing on the character string to be detected according to the sensitive word processing strategy.
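As an illustration of selecting one of the three processing strategies from the number of sensitive characters, the sketch below uses two thresholds. The thresholds, their default values, and the assumption that a larger count leads to a stricter strategy are all hypothetical, since this passage does not state them; `process` is likewise an illustrative name.

```python
def process(text, spans, drop_all_at=6, drop_chars_at=3, mask="*"):
    """Apply one of three strategies chosen by the count of sensitive characters.

    spans holds (start, end) index pairs of sensitive characters in text.
    The thresholds are illustrative assumptions, not values from the source.
    """
    sensitive = set()
    for start, end in spans:
        sensitive.update(range(start, end))
    count = len(sensitive)
    if count == 0:
        return text
    if count >= drop_all_at:
        # First strategy: delete the whole character string.
        return ""
    if count >= drop_chars_at:
        # Second strategy: delete only the sensitive characters.
        return "".join(c for i, c in enumerate(text) if i not in sensitive)
    # Third strategy: replace each sensitive character with a replacing symbol.
    return "".join(mask if i in sensitive else c for i, c in enumerate(text))
```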
It may be understood that the functions of each functional unit of the data processing apparatus described in the embodiments of the present application may be specifically implemented according to the method in the embodiments of the method, and the specific implementation process may refer to the relevant description of the embodiments of the method and will not be repeated herein.
In the embodiment of the application, a detection request can be acquired, the detection request comprises a character string to be detected, the character string to be detected is converted into one or more of an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected, auxiliary information is obtained, the character string to be detected and the auxiliary information are matched with a target dictionary tree to obtain sensitive characters in the character string to be detected, then sensitive word processing is carried out on the character string to be detected according to the sensitive characters in the character string to be detected to obtain a target character string, and the target character string is output. By adopting the method, the character strings to be detected can be subjected to sensitive character recognition from multiple dimensions, the accuracy of sensitive character recognition is improved, and the effectiveness of sensitive word processing is improved.
Referring to fig. 16, fig. 16 is a schematic structural diagram of a text processing device according to an embodiment of the present application, where the text processing device 160 may include:
a display unit 1601, configured to display an application interface, where the application interface includes a character string to be detected;
the display unit 1601 is further configured to display a first prompting message when detecting that the to-be-detected character string has a sensitive word, where the first prompting message is used to prompt the sensitive word in the to-be-detected character string; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to sensitive words in the character string to be detected, and N is a positive integer;
the display unit 1601 is further configured to display, when receiving a selection operation of a target synonym, a target character string on the application interface, where the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
It may be understood that the functions of each functional unit of the text processing apparatus described in the embodiments of the present application may be specifically implemented according to the method in the embodiments of the method, and the specific implementation process may refer to the relevant description of the embodiments of the method and will not be repeated herein.
In the embodiment of the application, when the character string to be detected has the sensitive word, the sensitive character in the character string to be detected is processed to obtain the target character string which does not contain the sensitive character, and finally the target character string is displayed on the application interface in the form of text message, so that the convenience and the effectiveness of the sensitive word processing can be improved.
Referring to fig. 17, fig. 17 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 170 includes one or more processors 1701, a memory 1702, and a communication interface 1703. The processor 1701, the memory 1702, and the communication interface 1703 may be connected via a bus 1704 or in another manner; the embodiments of the present application take connection via the bus 1704 as an example.
The processor 1701 (or CPU, Central Processing Unit) is the computing core and control core of the computer device 170; it can parse various instructions within the computer device 170 and process various data of the computer device 170. For example, the CPU may be configured to parse a power-on instruction sent by the user to the computer device 170 and control the computer device 170 to perform a power-on operation; as another example, the CPU may transfer various types of interaction data between the internal structures of the computer device 170, and so on. The communication interface 1703 may optionally include a standard wired interface and a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and is controlled by the processor 1701 to transmit and receive data. The memory 1702 is a memory device in the computer device 170 for storing computer programs and data. It is to be appreciated that the memory 1702 herein may include both built-in memory of the computer device 170 and extended memory supported by the computer device 170. The memory 1702 provides storage space that stores an operating system of the computer device 170, which may include, but is not limited to, a Windows system, a Linux system, an Android system, an iOS system, etc.; the application is not limited in this regard. In one embodiment, the processor 1701 performs the following operations by executing a computer program stored in the memory 1702:
acquiring a detection request, wherein the detection request comprises a character string to be detected;
converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following components: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
obtaining a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set;
matching the character string to be detected and the auxiliary information with the target dictionary tree to obtain sensitive characters in the character string to be detected;
and carrying out sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string.
In one embodiment, the processor 1701 is specifically configured to: responding to a sensitive word management instruction, and displaying a sensitive word management interface according to an object authority corresponding to the sensitive word management instruction, wherein the sensitive word management interface comprises one or more of a sensitive word configuration area, a sensitive word initial configuration area, a sensitive word pinyin configuration area and a sensitive word stroke configuration area, the sensitive word configuration area is used for receiving an input sensitive word, the sensitive word initial configuration area is used for receiving an input sensitive word initial, the sensitive word pinyin configuration area is used for receiving an input sensitive word pinyin, and the sensitive word stroke configuration area is used for receiving an input sensitive word stroke; updating a target set according to received data, wherein the target set is whichever of the sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set is consistent in type with the received data; and when a preset period is reached, loading the updated target set and the non-updated sets based on a finite state automaton algorithm, and generating a new target dictionary tree.
In an embodiment, the sensitive word management interface further includes a service scene list, where the service scene list includes at least one service scene; the processor 1701 is specifically configured to: responding to a triggering operation aiming at a target service scene in the service scene list, and storing the target dictionary tree and the target service scene in an associated way; acquiring a service scene to which the application corresponding to the application identifier belongs; and if the service scene to which the application belongs is matched with the target service scene, reading the target dictionary tree.
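The association between service scenes and target dictionary trees can be pictured as two lookup tables: one from application identifier to service scene, and one from service scene to dictionary tree. The sketch below assumes in-memory dictionaries; the class `TrieRegistry` and its method names are hypothetical.

```python
class TrieRegistry:
    """Keeps scene-to-trie and application-to-scene associations in memory."""

    def __init__(self):
        self._trie_by_scene = {}
        self._scene_by_app = {}

    def bind_trie(self, scene, trie):
        # Store the target dictionary tree in association with a service scene.
        self._trie_by_scene[scene] = trie

    def register_app(self, app_id, scene):
        # Record the service scene to which an application belongs.
        self._scene_by_app[app_id] = scene

    def trie_for(self, app_id):
        # Read the trie bound to the application's scene, or None if no match.
        scene = self._scene_by_app.get(app_id)
        return self._trie_by_scene.get(scene)
```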
In one embodiment, the processor 1701 is specifically configured to: when the auxiliary information comprises strokes corresponding to the character strings to be detected, splitting the character strings to be detected to obtain the strokes corresponding to the character strings to be detected; when the auxiliary information comprises the pinyin corresponding to the character string to be detected, matching the character string to be detected with a word segmentation dictionary tree to obtain M matched words and N unmatched characters, wherein M and N are integers, determining the pinyin corresponding to the M matched words by using a word segmentation pinyin mapping table, determining the pinyin corresponding to the N unmatched characters by using a word pinyin mapping table, and generating the pinyin corresponding to the character string to be detected by using the pinyin corresponding to the M matched words and the pinyin corresponding to the N unmatched characters; when the auxiliary information comprises the initial corresponding to the character string to be detected, extracting the initial corresponding to the character string to be detected by utilizing the pinyin corresponding to the character string to be detected.
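The pinyin and initial conversion can be sketched as follows. This is a simplified illustration that uses longest-prefix lookup in a plain dictionary in place of the word segmentation dictionary tree; the tables `WORD_PINYIN` and `CHAR_PINYIN` are tiny hypothetical stand-ins for the word segmentation pinyin mapping table and the word pinyin mapping table. Resolving whole words first is what lets a polyphonic character such as 行 read "hang" inside 银行 but "xing" on its own.

```python
# Hypothetical miniature mapping tables, for illustration only.
WORD_PINYIN = {"银行": ["yin", "hang"]}               # matched words
CHAR_PINYIN = {"银": "yin", "行": "xing", "不": "bu"}   # single characters


def to_pinyin(text):
    """Convert text to a list of pinyin syllables, preferring whole words."""
    syllables = []
    max_len = max(len(word) for word in WORD_PINYIN)
    i = 0
    while i < len(text):
        # Try the longest candidate word first (at least two characters).
        for size in range(min(max_len, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in WORD_PINYIN:
                syllables.extend(WORD_PINYIN[word])
                i += size
                break
        else:
            # Unmatched character: fall back to the per-character table.
            ch = text[i]
            syllables.append(CHAR_PINYIN.get(ch, ch))
            i += 1
    return syllables


def to_initials(syllables):
    """Extract the initial letter of each pinyin syllable."""
    return "".join(s[0] for s in syllables if s)
```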
In one embodiment, the processor 1701 is specifically configured to: matching the character string to be detected with the target dictionary tree to obtain a first matching result; matching the auxiliary information with the target dictionary tree to obtain a second matching result; and combining the first matching result and the second matching result to obtain the sensitive character in the character string to be detected.
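One way to combine the two matching results is to treat each as a list of character spans and merge them. The sketch below assumes that hits found in the auxiliary information have already been mapped back to character offsets in the character string to be detected; that mapping step and the name `merge_spans` are assumptions.

```python
def merge_spans(first, second):
    """Merge two lists of (start, end) spans, joining overlapping or touching ones."""
    merged = []
    for start, end in sorted(list(first) + list(second)):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of adding an overlapping one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```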
In an embodiment, the target dictionary tree includes a plurality of nodes and a mismatch pointer for each node, and the node corresponding to the last character of a sensitive character string includes an end identifier and the number of characters of the sensitive character string, where the sensitive character string is data in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, or the sensitive word stroke set; the processor 1701 is specifically configured to: traversing the character string to be detected to extract a current character for current matching, and determining a target node corresponding to the current character from the target dictionary tree; matching the current character with the target node; if the current character is successfully matched with the target node, determining a next node to be matched corresponding to the target node according to a matching success strategy, taking the next character of the current character as a new current character, taking the next node to be matched determined by the matching success strategy as a new target node, and executing the step of matching the current character with the target node; if the matching of the current character and the target node fails, determining whether the target node is the next node of a root node; if yes, taking the next character of the current character as a new current character, and executing the step of matching the current character with the target node; if not, determining a next node to be matched corresponding to the target node according to a matching failure strategy, taking the next node to be matched determined by the matching failure strategy as a new target node, and executing the step of matching the current character with the target node; and when the traversal of the character string to be detected is completed, extracting target nodes containing end identifiers from all target nodes successfully matched, and extracting a first matching result from the target dictionary tree according to the number of characters contained in the extracted target nodes.
In one embodiment, the processor 1701 is specifically configured to: displaying a sensitive word prompt interface, wherein the sensitive word prompt interface comprises Q synonyms, the Q synonyms are obtained based on sensitive characters in the character string to be detected, and Q is a positive integer; and responding to the selection operation of the target synonym aiming at the sensitive word prompt interface, and replacing the sensitive character in the character string to be detected by using the target synonym.
In an embodiment, the detection request further includes a verification level of the to-be-detected character string, where the verification level is used to determine data included in the auxiliary information, and when the verification level is a first level, the auxiliary information includes an initial corresponding to the to-be-detected character string, or a pinyin corresponding to the to-be-detected character string, or a stroke corresponding to the to-be-detected character string; when the verification level is a second level, the auxiliary information comprises an initial corresponding to the character string to be detected and a pinyin corresponding to the character string to be detected, or the initial corresponding to the character string to be detected and a stroke corresponding to the character string to be detected, or the pinyin corresponding to the character string to be detected and the stroke corresponding to the character string to be detected; when the verification level is a third level, the auxiliary information comprises an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected.
In one embodiment, the processor 1701 is specifically configured to: acquiring the number of sensitive characters in the character string to be detected; determining a sensitive word processing strategy by using the quantity of the sensitive characters, wherein the sensitive word processing strategy comprises one of a first processing strategy for indicating to delete the character string to be detected, a second processing strategy for indicating to delete the sensitive characters in the character string to be detected and a third processing strategy for indicating to replace the sensitive characters in the character string to be detected by replacing symbols; and carrying out sensitive word processing on the character string to be detected according to the sensitive word processing strategy.
In a specific implementation, the processor 1701, the memory 1702 and the communication interface 1703 described in the embodiments of the present application may execute an implementation described in a data processing method provided in the embodiments of the present application, or may execute an implementation described in a data processing device provided in the embodiments of the present application, which is not described herein again.
In the embodiment of the application, a detection request can be acquired, the detection request comprises a character string to be detected, the character string to be detected is converted into one or more of an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected, auxiliary information is obtained, the character string to be detected and the auxiliary information are matched with a target dictionary tree to obtain sensitive characters in the character string to be detected, then sensitive word processing is carried out on the character string to be detected according to the sensitive characters in the character string to be detected to obtain a target character string, and the target character string is output. By adopting the method, the character strings to be detected can be subjected to sensitive character recognition from multiple dimensions, the accuracy of sensitive character recognition is improved, and the effectiveness of sensitive word processing is improved.
In one embodiment, the processor 1701 performs the following operations by executing a computer program stored in the memory 1702:
displaying an application interface, wherein the application interface comprises a character string to be detected;
when the existence of the sensitive word in the character string to be detected is detected, displaying a first prompt message, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to sensitive words in the character string to be detected, and N is a positive integer;
and when receiving a selection operation of the target synonym, displaying a target character string on the application interface, wherein the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
In a specific implementation, the processor 1701, the memory 1702 and the communication interface 1703 described in the embodiments of the present application may execute an implementation described in a text processing method provided in the embodiments of the present application, or may execute an implementation described in a text processing device provided in the embodiments of the present application, which is not described herein again.
In the embodiment of the application, when the character string to be detected has the sensitive word, the sensitive character in the character string to be detected is processed to obtain the target character string which does not contain the sensitive character, and finally the target character string is displayed on the application interface in the form of text message, so that the convenience and the effectiveness of the sensitive word processing can be improved.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, which when run on a computer device causes the computer device to perform a data processing method of any one of the possible implementations described above. The specific implementation manner may refer to the foregoing description, and will not be repeated here.
The embodiment of the application also provides a computer program product, which comprises a computer program or computer instructions, and the computer program or computer instructions realize the steps of the data processing method provided by the embodiment of the application when being executed by a processor. The specific implementation manner may refer to the foregoing description, and will not be repeated here.
The embodiment of the application also provides a computer program, which comprises computer instructions, wherein the computer instructions are stored in a computer readable storage medium, a processor of a computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided by the embodiment of the application. The specific implementation manner may refer to the foregoing description, and will not be repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The above disclosure is illustrative only of some embodiments of the application and is not intended to limit the scope of the application, which is defined by the claims and their equivalents.

Claims (15)

1. A method of data processing, the method comprising:
acquiring a detection request, wherein the detection request comprises a character string to be detected;
converting the character string to be detected into auxiliary information, wherein the auxiliary information comprises one or more of the following components: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
obtaining a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set;
matching the character string to be detected and the auxiliary information with the target dictionary tree to obtain sensitive characters in the character string to be detected;
and carrying out sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string, and outputting the target character string.
2. The method according to claim 1, wherein the method further comprises:
responding to a sensitive word management instruction, and displaying a sensitive word management interface according to an object authority corresponding to the sensitive word management instruction, wherein the sensitive word management interface comprises one or more of a sensitive word configuration area, a sensitive word initial configuration area, a sensitive word pinyin configuration area and a sensitive word stroke configuration area, the sensitive word configuration area is used for receiving an input sensitive word, the sensitive word initial configuration area is used for receiving an input sensitive word initial, the sensitive word pinyin configuration area is used for receiving an input sensitive word pinyin, and the sensitive word stroke configuration area is used for receiving an input sensitive word stroke;
updating a target set according to received data, wherein the target set is whichever of the sensitive word set, the sensitive word initial set, the sensitive word pinyin set and the sensitive word stroke set is consistent in type with the received data;
and when the preset period is reached, loading the updated target set and the non-updated set based on the finite state automaton algorithm, and generating a new target dictionary tree.
3. The method of claim 2, wherein the sensitive word management interface further comprises a list of business scenarios, the list of business scenarios comprising at least one business scenario; the method further comprises the steps of:
responding to a triggering operation aiming at a target service scene in the service scene list, and storing the target dictionary tree and the target service scene in an associated way;
the detection request further includes an application identifier, and the obtaining the target dictionary tree includes:
acquiring a service scene to which the application corresponding to the application identifier belongs;
and if the service scene to which the application belongs is matched with the target service scene, reading the target dictionary tree.
4. The method according to claim 1, wherein when the auxiliary information includes strokes corresponding to the character string to be detected, the converting the character string to be detected into auxiliary information includes: splitting the character string to be detected to obtain strokes corresponding to the character string to be detected;
when the auxiliary information includes pinyin corresponding to the character string to be detected, the converting the character string to be detected into the auxiliary information includes: matching the character string to be detected with a word segmentation dictionary tree to obtain M matched words and N unmatched characters, wherein M and N are integers, the pinyin corresponding to the M matched words is determined by utilizing a word segmentation pinyin mapping table, the pinyin corresponding to the N unmatched characters is determined by utilizing a word pinyin mapping table, and the pinyin corresponding to the M matched words and the pinyin corresponding to the N unmatched characters are utilized to generate the pinyin corresponding to the character string to be detected;
when the auxiliary information includes an initial corresponding to the character string to be detected, the converting the character string to be detected into the auxiliary information includes: and extracting initial letters corresponding to the character strings to be detected by utilizing pinyin corresponding to the character strings to be detected.
5. The method of claim 1, wherein the matching the character string to be detected and the auxiliary information with the target dictionary tree to obtain the sensitive character in the character string to be detected includes:
matching the character string to be detected with the target dictionary tree to obtain a first matching result;
matching the auxiliary information with the target dictionary tree to obtain a second matching result;
and combining the first matching result and the second matching result to obtain the sensitive character in the character string to be detected.
6. The method of claim 5, wherein the target dictionary tree comprises a plurality of nodes and a mismatch pointer for each node, wherein the node corresponding to the last character of a sensitive character string comprises an end identifier and the number of characters of the sensitive character string, and the sensitive character string is data in the sensitive word set, the sensitive word initial set, the sensitive word pinyin set, or the sensitive word stroke set;
the step of matching the character string to be detected with the target dictionary tree to obtain a first matching result comprises the following steps:
traversing the character string to be detected to extract a current character for current matching, and determining a target node corresponding to the current character from the target dictionary tree;
matching the current character with the target node;
if the current character is successfully matched with the target node, determining a next node to be matched corresponding to the target node according to a matching success strategy, taking the next character of the current character as a new current character, taking the next node to be matched determined by the matching success strategy as a new target node, and executing the step of matching the current character with the target node;
if the matching of the current character and the target node fails, determining whether the target node is the next node of a root node; if yes, taking the next character of the current character as a new current character, and executing the step of matching the current character with the target node; if not, determining a next node to be matched corresponding to the target node according to a matching failure strategy, taking the next node to be matched determined by the matching failure strategy as a new target node, and executing the step of matching the current character with the target node;
and when the traversal of the character string to be detected is completed, extracting target nodes containing end identifiers from all target nodes successfully matched, and extracting a first matching result from the target dictionary tree according to the number of characters contained in the extracted target nodes.
7. The method according to any one of claims 1-6, wherein the performing sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string includes:
displaying a sensitive word prompt interface, wherein the sensitive word prompt interface comprises Q synonyms, the Q synonyms are obtained based on sensitive characters in the character string to be detected, and Q is a positive integer;
and responding to the selection operation of the target synonym aiming at the sensitive word prompt interface, and replacing the sensitive character in the character string to be detected by using the target synonym to obtain a target character string.
8. The method according to any one of claims 1 to 6, wherein the detection request further includes a verification level of the character string to be detected, the verification level being used to determine data included in the auxiliary information, and when the verification level is a first level, the auxiliary information includes an initial corresponding to the character string to be detected, or a pinyin corresponding to the character string to be detected, or a stroke corresponding to the character string to be detected; when the verification level is a second level, the auxiliary information comprises an initial corresponding to the character string to be detected and a pinyin corresponding to the character string to be detected, or the initial corresponding to the character string to be detected and a stroke corresponding to the character string to be detected, or the pinyin corresponding to the character string to be detected and the stroke corresponding to the character string to be detected; when the verification level is a third level, the auxiliary information comprises an initial corresponding to the character string to be detected, pinyin corresponding to the character string to be detected and strokes corresponding to the character string to be detected.
9. The method according to any one of claims 1-6, wherein the performing sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected to obtain a target character string includes:
acquiring the number of sensitive characters in the character string to be detected;
determining a sensitive word processing strategy by using the quantity of the sensitive characters, wherein the sensitive word processing strategy comprises one of a first processing strategy for indicating to delete the character string to be detected, a second processing strategy for indicating to delete the sensitive characters in the character string to be detected and a third processing strategy for indicating to replace the sensitive characters in the character string to be detected by replacing symbols;
and performing sensitive word processing on the character string to be detected according to the sensitive word processing strategy to obtain a target character string.
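The steps of claim 9 can be sketched as follows. The claim does not state how the number of sensitive characters maps to a strategy, so the thresholds `delete_all_at` and `delete_chars_at` are purely hypothetical, as are the function names and strategy labels.

```python
def choose_strategy(sensitive_count: int,
                    delete_all_at: int = 6,
                    delete_chars_at: int = 3) -> str:
    """Pick one of the three processing strategies from the sensitive-character count.

    The thresholds are assumptions for illustration; the claim leaves them open.
    """
    if sensitive_count >= delete_all_at:
        return "delete_string"    # first strategy: delete the whole string
    if sensitive_count >= delete_chars_at:
        return "delete_chars"     # second strategy: delete the sensitive characters
    return "replace_chars"        # third strategy: replace them with a symbol


def process(text: str, sensitive_idx: set[int], symbol: str = "*") -> str:
    """Apply the chosen strategy and return the target character string."""
    strategy = choose_strategy(len(sensitive_idx))
    if strategy == "delete_string":
        return ""
    if strategy == "delete_chars":
        return "".join(ch for i, ch in enumerate(text) if i not in sensitive_idx)
    return "".join(symbol if i in sensitive_idx else ch for i, ch in enumerate(text))
```

With these assumed thresholds, a string containing two sensitive characters is masked, one containing three has those characters removed, and one containing six or more is dropped entirely.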
10. A method of text processing, the method comprising:
displaying an application interface, wherein the application interface comprises a character string to be detected;
when a sensitive word is detected in the character string to be detected, displaying a first prompt message, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to the sensitive word in the character string to be detected, and N is a positive integer;
and when receiving a selection operation on a target synonym, displaying a target character string on the application interface, wherein the target character string is the character string obtained after the sensitive word in the character string to be detected is replaced by the target synonym.
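The interaction in claim 10 reduces to two operations: look up candidate synonyms for each detected sensitive word, then substitute the one the user selects. The sketch below leaves out the display logic; the synonym table and the function names are assumptions made for illustration.

```python
def find_suggestions(text: str, synonyms: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the sensitive words present in the text with their N candidate synonyms."""
    return {word: alts for word, alts in synonyms.items() if word in text}


def apply_choice(text: str, sensitive_word: str, target_synonym: str,
                 synonyms: dict[str, list[str]]) -> str:
    """Replace the sensitive word with the synonym the user selected."""
    if target_synonym not in synonyms.get(sensitive_word, []):
        raise ValueError("selected synonym was not among the offered candidates")
    return text.replace(sensitive_word, target_synonym)


# Hypothetical synonym table; a real system would supply its own.
SYNONYMS = {"stupid": ["unwise", "ill-advised"]}
suggestions = find_suggestions("that was stupid", SYNONYMS)
target = apply_choice("that was stupid", "stupid", "unwise", SYNONYMS)
```

Here `suggestions` corresponds to the content of the second prompt message and `target` to the target character string shown on the application interface.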
11. A data processing apparatus, the apparatus comprising:
an acquisition unit, used for acquiring a detection request, wherein the detection request comprises a character string to be detected;
the processing unit is used for converting the character string to be detected into auxiliary information, and the auxiliary information comprises one or more of the following: the initial corresponding to the character string to be detected, the pinyin corresponding to the character string to be detected and the strokes corresponding to the character string to be detected;
the acquisition unit is further used for acquiring a target dictionary tree, wherein the target dictionary tree is a dictionary tree constructed by a sensitive word set, a sensitive word initial set, a sensitive word pinyin set and a sensitive word stroke set;
the processing unit is further configured to match the character string to be detected and the auxiliary information with the target dictionary tree to obtain a sensitive character in the character string to be detected;
the processing unit is further configured to perform sensitive word processing on the character string to be detected according to the sensitive character in the character string to be detected, obtain a target character string, and output the target character string.
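The matching step performed by the processing unit can be sketched with a token-level dictionary tree (trie). This is a simplified illustration rather than the application's implementation: it covers only the sensitive word set and its pinyin set, and the `PINYIN` table is a tiny hand-written stand-in for a real character-to-pinyin conversion. Because each character yields exactly one pinyin token, a hit on the pinyin sequence maps back to the same positions in the original string.

```python
END = None  # sentinel key marking the end of a sensitive entry in the trie

# Hypothetical toneless pinyin table; a real system would use a full conversion.
PINYIN = {"赌": "du", "博": "bo", "堵": "du", "搏": "bo"}


def build_trie(sequences):
    """Build one trie from token sequences (characters or pinyin syllables)."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = True
    return root


def match_positions(tokens, trie):
    """Return indices of tokens covered by any entry stored in the trie."""
    hits = set()
    for start in range(len(tokens)):
        node = trie
        for pos in range(start, len(tokens)):
            node = node.get(tokens[pos])
            if node is None:
                break
            if END in node:
                hits.update(range(start, pos + 1))
    return hits


def sensitive_positions(text, trie):
    """Match both the raw characters and their pinyin against the trie."""
    chars = list(text)
    pinyin = [PINYIN.get(ch, ch) for ch in chars]
    return match_positions(chars, trie) | match_positions(pinyin, trie)


word = "赌博"
trie = build_trie([list(word), [PINYIN[ch] for ch in word]])
```

In this sketch the homophone variant "堵搏" is caught through the pinyin branch even though none of its characters appear in the sensitive word set.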
12. A text processing apparatus, the apparatus comprising:
the display unit is used for displaying an application interface, wherein the application interface comprises a character string to be detected;
the display unit is further used for displaying a first prompt message when the existence of the sensitive word in the character string to be detected is detected, wherein the first prompt message is used for prompting the sensitive word in the character string to be detected; displaying a second prompt message, wherein the second prompt message comprises N synonyms corresponding to sensitive words in the character string to be detected, and N is a positive integer;
and the display unit is also used for displaying a target character string on the application interface when receiving the selection operation of the target synonym, wherein the target character string is a character string after the sensitive word in the character string to be detected is replaced by the target synonym.
13. A computer device comprising a memory, a communication interface, and a processor, the memory, the communication interface, and the processor being interconnected; the memory stores a computer program, and the processor invokes the computer program stored in the memory for implementing the data processing method according to any one of claims 1 to 9 or for implementing the text processing method according to claim 10.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements the data processing method according to any one of claims 1-9 or implements the text processing method according to claim 10.
15. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the data processing method according to any of claims 1-9 or the text processing method according to claim 10.
CN202210413723.9A 2022-04-19 2022-04-19 Data processing method, apparatus, device, storage medium and computer program product Pending CN116955720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210413723.9A CN116955720A (en) 2022-04-19 2022-04-19 Data processing method, apparatus, device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210413723.9A CN116955720A (en) 2022-04-19 2022-04-19 Data processing method, apparatus, device, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN116955720A true CN116955720A (en) 2023-10-27

Family

ID=88460695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210413723.9A Pending CN116955720A (en) 2022-04-19 2022-04-19 Data processing method, apparatus, device, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN116955720A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493540A (en) * 2023-12-28 2024-02-02 荣耀终端有限公司 Text matching method, terminal device and computer readable storage medium
CN117725161A (en) * 2023-12-21 2024-03-19 伟金投资有限公司 Method and system for identifying variant words in text and extracting sensitive words
CN118467741A (en) * 2024-07-09 2024-08-09 厦门众联世纪股份有限公司 Intelligent detection method and system for data violation risk

Similar Documents

Publication Publication Date Title
CN110837550B (en) Knowledge graph-based question answering method and device, electronic equipment and storage medium
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN108287858B (en) Semantic extraction method and device for natural language
US20210150142A1 (en) Method and apparatus for determining feature words and server
CN109657054B (en) Abstract generation method, device, server and storage medium
KR102025968B1 (en) Phrase-based dictionary extraction and translation quality evaluation
CN116955720A (en) Data processing method, apparatus, device, storage medium and computer program product
US8095547B2 (en) Method and apparatus for detecting spam user created content
US10002128B2 (en) System for tokenizing text in languages without inter-word separation
KR101509727B1 (en) Apparatus for creating alignment corpus based on unsupervised alignment and method thereof, and apparatus for performing morphological analysis of non-canonical text using the alignment corpus and method thereof
KR101326354B1 (en) Transliteration device, recording medium, and method
CN113961768B (en) Sensitive word detection method and device, computer equipment and storage medium
CN111241496B (en) Method and device for determining small program feature vector and electronic equipment
CN111444905B (en) Image recognition method and related device based on artificial intelligence
CN114244795A (en) Information pushing method, device, equipment and medium
CN111581344B (en) Interface information auditing method and device, computer equipment and storage medium
CN117421413A (en) Question-answer pair generation method and device and electronic equipment
CN112087473A (en) Document downloading method and device, computer readable storage medium and computer equipment
CN107220255B (en) Address information processing method and device
CN112784596A (en) Method and device for identifying sensitive words
CN113505889B (en) Processing method and device of mapping knowledge base, computer equipment and storage medium
CN114896980B (en) Military entity linking method, device, computer equipment and storage medium
CN106844783B (en) Information processing method and device
US20240152565A1 (en) Information processing system, information processing method and information processing program
CN116992831A (en) Statement processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination