CN113420549A - Abnormal character string recognition method and device - Google Patents

Abnormal character string recognition method and device

Info

Publication number
CN113420549A
Authority
CN
China
Prior art keywords
character string
processed
abnormal
character
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110753494.0A
Other languages
Chinese (zh)
Other versions
CN113420549B (en)
Inventor
黄飚
吴鹏
吕明钊
赵耿榕
吴双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Kingsoft Online Game Technology Co Ltd
Original Assignee
Zhuhai Kingsoft Online Game Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Kingsoft Online Game Technology Co Ltd
Priority to CN202110753494.0A
Publication of CN113420549A
Application granted
Publication of CN113420549B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)

Abstract

The present specification provides an abnormal character string recognition method and an abnormal character string recognition device, wherein the abnormal character string recognition method includes: acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm; extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed; and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string. According to the method, the abnormal character strings in the text to be recognized can be recognized before the abnormal character strings affect the user account, and the use risk of the user account can be reduced.

Description

Abnormal character string recognition method and device
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a method and an apparatus for identifying an abnormal character string.
Background
Currently, when users chat using software that has a chat function, the chat text may contain abnormal characters. The abnormal characters may be contact information, such as a contact phone number, an instant messaging account number, a game account number, and the like. In general, an account that sends such abnormal text may be an abnormal account, which may interfere with other users' use of the software and may even put user accounts at risk.
Disclosure of Invention
In view of this, the present specification provides an abnormal character string recognition method. The present specification also relates to an abnormal character string recognition apparatus, a computing device, and a computer-readable storage medium, which are used to solve the technical defects in the prior art.
According to a first aspect of embodiments of the present specification, there is provided an abnormal character string recognition method including:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
Optionally, based on a special character recognition algorithm, extracting a character string to be processed from the key character string, and determining attribute information corresponding to the character string to be processed, including:
comparing the characters in the key character string with special characters in a preset special character word list;
if the special character in the preset special character word list exists in the key character string, determining the position information of the special character in the key character string;
extracting a character string to be processed related to the special character from the key character string based on the special character and the position information of the special character;
and determining the attribute information of the special character as the attribute information corresponding to the character string to be processed.
Optionally, the key character string comprises a plurality of characters; extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed, includes:
identifying the type of each character in the key character string, and counting the number of characters of each type;
if the number of the specific type characters in the key character string reaches the preset number corresponding to the specific type, determining the character string formed by the specific type characters as a character string to be processed;
and determining attribute information of the specific type of character as attribute information corresponding to the character string to be processed.
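The type-counting variant above can be sketched in Python as follows. This is only an illustration under stated assumptions: the three character types, the threshold of five digits, the attribute label "numeric contact", and the names `TYPE_RULES`, `char_type`, and `extract_by_type_count` are all hypothetical and are not taken from the specification.

```python
from collections import Counter
from typing import Optional, Tuple

# Hypothetical rule table: character type -> (preset number, attribute information).
TYPE_RULES = {"digit": (5, "numeric contact")}

def char_type(ch: str) -> str:
    """Classify one character into an assumed coarse type."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "symbol"

def extract_by_type_count(key_string: str) -> Optional[Tuple[str, str]]:
    """Return (string to be processed, attribute) if some type reaches its preset number."""
    counts = Counter(char_type(ch) for ch in key_string)
    for type_name, (minimum, attribute) in TYPE_RULES.items():
        if counts[type_name] >= minimum:
            # The candidate is formed by the characters of that type, in order.
            candidate = "".join(ch for ch in key_string if char_type(ch) == type_name)
            return candidate, attribute
    return None

print(extract_by_type_count("a1b2c3d4e5"))  # ('12345', 'numeric contact')
print(extract_by_type_count("ab12"))        # None
```

Concatenating the characters of the qualifying type is one reading of "the character string formed by the specific type characters"; it lets digits that were interleaved with letters to evade detection be rejoined into one candidate.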
Optionally, before extracting the character string to be processed from the key character string based on a special character recognition algorithm, the method further includes:
comparing the key character string with a white list character string in a preset white list, and deleting the character string in the key character string, which is the same as the white list character string;
correspondingly, extracting the character string to be processed from the key character string based on a special character recognition algorithm comprises the following steps:
and extracting the character string to be processed from the key character string after the white list character string is deleted based on the special character recognition algorithm.
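A minimal sketch of the white list step, assuming the white list is a plain list of literal strings and that "deleting" means removing every occurrence of each entry from the key character string. The entries and the function name `remove_whitelisted` are hypothetical.

```python
from typing import Iterable

def remove_whitelisted(key_string: str, whitelist: Iterable[str]) -> str:
    """Delete every occurrence of each white list string from the key character string."""
    # Longer entries are removed first so that a short entry cannot
    # break up a longer one that contains it.
    for allowed in sorted(whitelist, key=len, reverse=True):
        if allowed:
            key_string = key_string.replace(allowed, "")
    return key_string

# Hypothetical white list, e.g. identifiers that are known to be harmless.
WHITELIST = ["gm001", "gm"]

print(remove_whitelisted("gm001qq5201314", WHITELIST))  # qq5201314
```

The remaining text is then passed to the special character recognition algorithm in place of the original key character string.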
Optionally, the abnormal character recognition rule includes a plurality of sub-rules, and each sub-rule represents a character format in a character string having attribute information; identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string, including:
determining a sub rule to be matched corresponding to the attribute information from the plurality of sub rules based on the attribute information corresponding to the character string to be processed;
matching the character string to be processed with the sub-rule to be matched;
and if the character string to be processed is successfully matched with the sub-rule to be matched, determining that the character string to be processed is an abnormal character string.
Optionally, after matching the character string to be processed with the sub-rule to be matched, the method further includes:
if the matching of the character string to be processed and the sub-rule to be matched fails, matching the character string to be processed with other sub-rules except the sub-rule to be matched in the plurality of sub-rules;
if the sub-rule successfully matched with the character string to be processed exists in the other sub-rules, determining that the character string to be processed is an abnormal character string;
and if the sub-rule which is successfully matched with the character string to be processed does not exist in the other sub-rules, determining that the character string to be processed is not an abnormal character string.
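The sub-rule matching with fallback described above might look like the following Python sketch. The assumption here is that each sub-rule can be expressed as a regular expression keyed by attribute information; the two patterns and the names `SUB_RULES` and `is_abnormal` are illustrative only and are not taken from the specification.

```python
import re

# Hypothetical sub-rules: attribute information -> character format of such a string.
SUB_RULES = {
    "phone number": re.compile(r"\d{11}"),
    "mailbox": re.compile(r"[A-Za-z0-9_.]+@[A-Za-z0-9_]+\.[A-Za-z]+"),
}

def is_abnormal(candidate: str, attribute: str) -> bool:
    """Try the sub-rule for the attribute first, then fall back to all other sub-rules."""
    primary = SUB_RULES.get(attribute)
    if primary is not None and primary.fullmatch(candidate):
        return True
    # Primary sub-rule failed (or none exists): try the remaining sub-rules.
    return any(
        rule.fullmatch(candidate)
        for name, rule in SUB_RULES.items()
        if name != attribute
    )

print(is_abnormal("13800000000", "phone number"))  # True  (primary sub-rule matches)
print(is_abnormal("abc@123.com", "phone number"))  # True  (fallback sub-rule matches)
print(is_abnormal("hello", "mailbox"))             # False (no sub-rule matches)
```

Trying the sub-rule selected by the attribute information first keeps the common case cheap, while the fallback covers strings whose preliminary attribute information was wrong.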
Optionally, before filtering out the key character string with the specified format from the text to be recognized by using a preset filtering algorithm, the method further includes:
inputting the text to be recognized into a text classification model, and outputting a classification result of the text to be recognized, wherein the text classification model is used for judging whether the text to be recognized comprises a key character string with a specified format;
correspondingly, filtering out the key character string with the specified format from the text to be recognized by using a preset filtering algorithm includes:
if it is determined, based on the classification result, that the text to be recognized comprises a key character string with the specified format, filtering out the key character string with the specified format from the text to be recognized by using the preset filtering algorithm.
Optionally, after determining whether the character string to be processed is an abnormal character string, the method further includes:
if the character string to be processed is determined to be an abnormal character string, determining the account that sent the abnormal character string, and managing the account by adopting an abnormal account management rule.
Optionally, the managing the account by using an abnormal account management rule includes:
marking the account as an abnormal account; and/or forbidding the account to send messages.
Optionally, before the account is managed by using the abnormal account management rule, the method further includes:
counting the number of times the account sends abnormal character strings within a preset time period;
correspondingly, managing the account by adopting an abnormal account management rule includes:
if the number of times exceeds a count threshold, managing the account by adopting the abnormal account management rule.
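One way to realize the counting step is a sliding time window per account, sketched below. The class name `AbnormalSendCounter`, the use of seconds as the time unit, and the assumption that timestamps arrive in non-decreasing order are choices made for this example only.

```python
from collections import defaultdict, deque

class AbnormalSendCounter:
    """Counts abnormal character strings sent per account within a sliding window."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds  # assumed preset time period
        self.threshold = threshold            # assumed count threshold
        self._events = defaultdict(deque)     # account -> timestamps of abnormal sends

    def record(self, account: str, timestamp: float) -> bool:
        """Record one abnormal send; return True if the account should now be managed."""
        events = self._events[account]
        events.append(timestamp)
        # Drop sends that fall outside the window ending at `timestamp`.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold

counter = AbnormalSendCounter(window_seconds=60, threshold=2)
print(counter.record("player_1", 0))    # False
print(counter.record("player_1", 10))   # False
print(counter.record("player_1", 20))   # True: three sends within 60 seconds
```

When `record` returns True, the caller would apply the abnormal account management rule, for example marking the account or forbidding it from sending messages.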
Optionally, after determining whether the character string to be processed is an abnormal character string, the method further includes:
if the character string to be processed is determined to be an abnormal character string, marking the abnormal character string with an abnormal label, and prohibiting display of the abnormal character string carrying the abnormal label.
According to a second aspect of embodiments of the present specification, there is provided an abnormal character string recognition apparatus including:
the filtering module is configured to acquire a text to be recognized and filter out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
the first determining module is configured to extract a character string to be processed from the key character string based on a special character recognition algorithm and determine attribute information corresponding to the character string to be processed;
and the second determination module is configured to identify the character string to be processed based on the attribute information and an abnormal character identification rule, and determine whether the character string to be processed is an abnormal character string.
Optionally, the first determining module is further configured to:
comparing the characters in the key character string with special characters in a preset special character word list;
if the special character in the preset special character word list exists in the key character string, determining the position information of the special character in the key character string;
extracting a character string to be processed related to the special character from the key character string based on the special character and the position information of the special character;
and determining the attribute information of the special character as the attribute information corresponding to the character string to be processed.
Optionally, the key string comprises a plurality of characters;
the first determination module is further configured to:
identifying the type of each character in the key character string, and counting the number of various types of characters;
if the number of the specific type characters in the key character string reaches the preset number corresponding to the specific type, determining the character string formed by the specific type characters as a character string to be processed;
and determining attribute information of the specific type of character as attribute information corresponding to the character string to be processed.
Optionally, the apparatus further comprises:
the comparison module is configured to compare the key character string with a white list character string in a preset white list and delete a character string in the key character string which is the same as the white list character string;
accordingly, the first determining module is further configured to:
and extracting the character string to be processed from the key character string after the white list character string is deleted based on the special character recognition algorithm.
Optionally, the abnormal character recognition rule includes a plurality of sub-rules, and each sub-rule represents a character format in a character string having attribute information;
the second determination module is further configured to:
determining a sub rule to be matched corresponding to the attribute information from the plurality of sub rules based on the attribute information corresponding to the character string to be processed;
matching the character string to be processed with the sub-rule to be matched;
and if the character string to be processed is successfully matched with the sub-rule to be matched, determining that the character string to be processed is an abnormal character string.
Optionally, the second determining module is further configured to:
if the matching of the character string to be processed and the sub-rule to be matched fails, matching the character string to be processed with other sub-rules except the sub-rule to be matched in the plurality of sub-rules;
if the sub-rule successfully matched with the character string to be processed exists in the other sub-rules, determining that the character string to be processed is an abnormal character string;
and if the sub-rule which is successfully matched with the character string to be processed does not exist in the other sub-rules, determining that the character string to be processed is not an abnormal character string.
Optionally, the apparatus further comprises:
the classification module is configured to input the text to be recognized into a text classification model and output a classification result of the text to be recognized, wherein the text classification model is used for judging whether the text to be recognized comprises a key character string with a specified format;
accordingly, the filtering module is further configured to:
and if the text to be recognized comprises the key character strings with the specified format based on the classification result, filtering the key character strings with the specified format from the text to be recognized by using a preset filtering algorithm.
Optionally, the apparatus further comprises:
and the account management module is configured to determine an account for sending the abnormal character string if the character string to be processed is determined to be the abnormal character string, and manage the account by adopting an abnormal account management rule.
Optionally, the account management module is further configured to:
marking the account as an abnormal account; and/or forbidding the account to send messages.
Optionally, the apparatus further comprises:
the counting module is configured to count the number of times the account sends abnormal character strings within a preset time period;
accordingly, the account management module is configured to:
if the number of times exceeds a count threshold, manage the account by adopting the abnormal account management rule.
Optionally, the apparatus further comprises:
and the marking module is configured to mark an abnormal label for the abnormal character string and forbid displaying the abnormal character string with the abnormal label if the character string to be processed is determined to be the abnormal character string.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the abnormal character string recognition method.
The abnormal character string recognition method provided by the present specification acquires a text to be recognized, and filters out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm; extracts a character string to be processed from the key character string based on a special character recognition algorithm, and determines attribute information corresponding to the character string to be processed; and identifies the character string to be processed based on the attribute information and an abnormal character identification rule, and determines whether the character string to be processed is an abnormal character string. The method first filters out the key character strings with the specified format from the text to be recognized, then identifies the character strings to be processed that may be abnormal among the key character strings and preliminarily determines their attribute information, and finally determines whether each character string to be processed is an abnormal character string. In this way, abnormal character strings in the text to be recognized can be recognized before they affect a user account, which reduces the risk to the user account.
Drawings
Fig. 1 is a flowchart of an abnormal character string recognition method provided in an embodiment of the present specification;
Fig. 2 is a flowchart illustrating an abnormal character string recognition method applied to a game chat scenario according to an embodiment of the present specification;
Fig. 3 is a schematic structural diagram of an abnormal character string recognition apparatus according to an embodiment of the present specification;
Fig. 4 is a block diagram of a computing device according to an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
First, the terms used in one or more embodiments of the present specification are explained.
Filtering algorithm: an algorithm used to filter character strings having a specified format out of a text.
Special character recognition algorithm: an algorithm used to extract, from the key character strings, character strings that may be abnormal character strings.
Abnormal character recognition rules: rules for identifying an abnormal character string. Different attribute information corresponds to different recognition rules.
Preset special character word list: a list in which some preset special characters are stored.
Attribute information: information indicating that the character string is a contact address, an instant messaging account, a social account, a video software account, and the like.
Next, an application scenario of the abnormal character string recognition method provided in the embodiments of the present specification is briefly described.
Currently, in game chat, many abnormal texts include contact information, such as a contact telephone number, a micro number, a Q number, a treasure number, a voice number, and the like. Much of the contact information sent consists of advertisement numbers, promotion numbers, or off-game transaction account numbers. This contact information, or the accounts sending it, may put users at risk, so it is necessary to identify the contact information in order to mask it or to block the accounts sending it. In the prior art, contact information in text is identified by rule-based extraction, and only purely numeric contact information is extracted, so the approach is limited.
Therefore, the abnormal character string recognition method provided in this specification can identify abnormal characters in any character format, is highly general, and can improve the accuracy of abnormal character recognition. Specific implementations can be found in the embodiments described below.
In the present specification, an abnormal character string recognition method is provided, and the present specification relates to an abnormal character string recognition apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 1 is a flowchart illustrating an abnormal character string identification method according to an embodiment of the present specification, which specifically includes the following steps:
Step 102: acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm.
As an example, the text to be recognized may be chat text. A preset filtering algorithm may be used to filter out key strings having a specified format from the text to be recognized.
The specified format can be a character format specified by a user according to actual requirements, or a data format that big data analysis has shown to be likely to contain abnormal characters. As one example, the specified format may include letters, numbers, special symbols, and the like. For example, special symbols may include "@", ".com", "_", and the like.
In some embodiments, taking chat text as the text to be recognized, while users are chatting the server may receive the chat text sent by each user account through a terminal, thereby obtaining the chat text. In order to identify risky abnormal character strings in the chat text, such as advertisement, promotion, and transaction accounts, the text to be recognized can be processed first; only when it is determined that the text contains no abnormal character string, that is, the text carries no risk, is it sent to the terminal of another user account for display.
As an example, the character strings having the specified format in the text to be recognized may be filtered out as key character strings; in effect, ordinary words, emoticons, and the like are deleted from the text to be recognized.
For example, assume that the specified format includes letters, numbers, "@", and ".com". If the text to be recognized is "vx 123456kjhg + 1", the key character string filtered out of it may be determined to include "1 vx123456 kjhg"; if the text to be recognized is "mail to mailbox abc@123.com", the key character string filtered out of it is "abc@123.com"; and if the text to be recognized is "qq5201314[roses]", the key character string filtered out of it is "qq5201314".
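As a rough illustration of this filtering step, the following Python sketch keeps only the runs of characters that match an assumed specified format (ASCII letters, digits, "@", "." and "_") and drops everything else. The regular expression and the function name `filter_key_strings` are assumptions made for this example; the specification does not prescribe a particular filtering algorithm. The sample inputs use Chinese chat text, on the assumption that the original messages were Chinese; with English text, ordinary words would also match the letter class.

```python
import re
from typing import List

# Assumed "specified format": ASCII letters, digits, "@", "." and "_".
KEY_PATTERN = re.compile(r"[A-Za-z0-9@._]+")

def filter_key_strings(text: str) -> List[str]:
    """Return the runs of characters in `text` that match the assumed format."""
    return KEY_PATTERN.findall(text)

print(filter_key_strings("邮件发到邮箱abc@123.com"))  # ['abc@123.com']
print(filter_key_strings("qq5201314[玫瑰]"))          # ['qq5201314']
```

Chinese words, the bracketed emoticon name, and the brackets themselves fall outside the pattern, so only the candidate key character strings remain for the later steps.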
In the embodiments of the present application, character strings in the user-specified format are filtered out of the text to be recognized, and words, emoticons, common punctuation marks, and other content that cannot be abnormal characters are deleted. This performs an initial screening of the characters in the text to be recognized, so that subsequent steps need not operate on every character, which reduces the amount of data to be processed and improves the efficiency of abnormal character string recognition.
Further, the chat text sent by a user in normal chat may not include any key character string. In that case, the chat text can be considered free of abnormal characters that could put user accounts at risk, and no subsequent operation on it is required.
Therefore, in implementation, before filtering out the key character strings with the specified format from the text to be recognized by using a preset filtering algorithm, the method further includes: inputting the text to be recognized into a text classification model, and outputting a classification result of the text to be recognized, wherein the text classification model is used for judging whether the text to be recognized comprises a key character string with a specified format.
Accordingly, a specific implementation of filtering out the key character string with the specified format from the text to be recognized by using a preset filtering algorithm may include: and if the text to be recognized comprises the key character strings with the specified format based on the classification result, filtering the key character strings with the specified format from the text to be recognized by utilizing the preset filtering algorithm.
That is to say, after the text to be recognized is obtained, the text classification model may be used to determine the classification result of the text to be recognized so as to determine whether the text to be recognized includes the key character string, and if so, the text to be recognized may include the abnormal character string.
As an example, the text classification model may be trained based on a BERT model.
In some embodiments, the text classification model may be a binary classification model. The text to be recognized is input into the text classification model, which outputs a classification result indicating whether the text to be recognized contains a key character string; the result may be "yes" or "no". If the result is "yes", the text to be recognized includes a key character string, that is, it may include an abnormal character string, and the key character string can be filtered out from it; if the result is "no", the text to be recognized does not include a key character string, that is, it is unlikely to include an abnormal character string, and no subsequent operation needs to be performed on it.
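The gating logic can be sketched as follows. The classifier here is only a stand-in: `stub_classifier` is a hypothetical placeholder heuristic, not a trained model, and `key_filter` uses an assumed regular expression for the specified format. In practice the classifier would be replaced by the trained binary text classification model.

```python
import re
from typing import Callable, List

def stub_classifier(text: str) -> bool:
    """Placeholder for a trained binary classifier: 'yes' if any ASCII letter or digit occurs."""
    return any(ch.isascii() and ch.isalnum() for ch in text)

def key_filter(text: str) -> List[str]:
    """Assumed filtering algorithm: keep runs of letters, digits, '@', '.' and '_'."""
    return re.findall(r"[A-Za-z0-9@._]+", text)

def recognize_candidates(
    text: str, classifier: Callable[[str], bool] = stub_classifier
) -> List[str]:
    """Run the filtering algorithm only when the classifier answers 'yes'."""
    if not classifier(text):
        return []  # 'no': the text needs no further processing
    return key_filter(text)

print(recognize_candidates("你好"))            # []
print(recognize_candidates("加我qq5201314"))   # ['qq5201314']
```

Passing the classifier as a parameter keeps the gate independent of how the model is implemented.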
As an example, training samples may be labeled manually, with the label of each training sample indicating whether the sample includes a key character string. The labeled training samples are input into the text classification model to be trained, which predicts whether each training sample includes a key character string. The difference between the prediction result and the real label is then determined, and the text classification model is trained based on this difference until its prediction results are sufficiently close to the real labels, at which point training stops and the trained text classification model is obtained.
In addition, sample texts can be generated periodically in a manual labeling mode, and the text classification model is verified based on the sample texts including the labels. After the abnormal character string is determined by the method, a label can be set for the text to be processed, and a text classification model is trained based on the text to be processed and the label, so that the accuracy of the text classification model is improved.
In the embodiment of the application, the text to be recognized can be screened once before the key character strings are filtered out of it, so that texts unlikely to include abnormal character strings are discarded. This reduces the amount of text that needs to be processed, avoids unnecessary operations, lowers the processing load of the device, and can improve the efficiency of recognizing abnormal character strings.
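The screening step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: `might_contain_key_string` is a hypothetical stand-in for the trained binary classifier (a trivial digit check here, not a BERT-based model), and `filter_key_strings` is an assumed regex filter that keeps runs of ASCII letters, digits and underscores.

```python
import re
from typing import Callable, List


def might_contain_key_string(text: str) -> bool:
    """Hypothetical stand-in for the binary text classification model."""
    return any(ch.isdigit() for ch in text)


def filter_key_strings(text: str) -> List[str]:
    """Assumed filter: keep runs of ASCII letters, digits and underscores."""
    return re.findall(r"[A-Za-z0-9_]+", text)


def screen_and_filter(
    text: str,
    classifier: Callable[[str], bool] = might_contain_key_string,
) -> List[str]:
    """Run the filter only when the classifier answers "yes"."""
    if not classifier(text):
        return []  # classification result "no": skip all later steps
    return filter_key_strings(text)
```

Passing the classifier in as a parameter keeps the gate independent of the model behind it, so a trained model could replace the placeholder without changing the filtering code.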
Step 104: extracting the character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed.
The key character strings which may be abnormal character strings in the text to be recognized can be recognized through the step 102, and the key character strings can be further screened in the step, so that the character strings to be processed which are more likely to be abnormal character strings are extracted, and the attribute information corresponding to the character strings to be processed is determined.
As an example, the attribute information of the character string to be processed may be used to characterize the attribute of the character string to be processed. For example, the attribute information may include a phone number, a certain treasured number, a certain Y number, a certain Q number, and so on.
In one embodiment, a specific implementation of extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed, may include: comparing the characters in the key character string with the special characters in a preset special character vocabulary; if a special character in the preset special character vocabulary exists in the key character string, determining the position information of the special character in the key character string; extracting a character string to be processed related to the special character from the key character string based on the special character and the position information of the special character; and determining the attribute information of the special character as the attribute information corresponding to the character string to be processed.
The preset special character vocabulary may include some characters designated by the user according to actual requirements. For example, the special character vocabulary may include special characters QQ, Q, vx, v, YY, Y, and the like.
That is to say, the characters in the key character string may be compared with the special characters in the preset special character vocabulary. If the key character string includes a special character, the key character string can be considered likely to include a character string to be processed; the character string to be processed may then be extracted from the key character string based on the position of the special character in the key character string, and the attribute information of the special character may be used as the attribute information of the character string to be processed.
In some embodiments, the correspondence between special characters and attribute information may be preset. For example, the special characters "QQ", "Q", "qq", and "q" each correspond to a certain Q number, the special characters "vx" and "v" each correspond to a certain micro number, and the special characters "YY", "Y", and "y" each correspond to a certain Y number.
Illustratively, assuming that the special characters include QQ, Q, qq, and v, and the key character string is "QQ1234569874", it may be determined that the key character string includes the special character QQ, which occupies the first two positions of the key character string. The consecutive digits "1234569874" following the special character may then be extracted and taken as the character string to be processed. Since the special character "QQ" corresponds to a certain Q number, that is, the attribute information of the special character is a certain Q number, it can be determined that the attribute information of the character string to be processed is a certain Q number.
For example, assuming that the special characters include QQ, Q, qq, and v, and the key character string is "1234569874Q", it may be determined that the key character string includes the special character Q, which occupies the last position of the key character string. The consecutive digits "1234569874" preceding the special character may then be extracted as the character string to be processed. Since the special character "Q" corresponds to a certain Q number, that is, the attribute information of the special character is a certain Q number, it can be determined that the attribute information of the character string to be processed is a certain Q number.
Illustratively, assuming that the special characters include QQ, Q, qq, and v, and the key character string is "123456Q9874", it may be determined that the key character string includes the special character Q, which lies in the middle of the key character string. The consecutive digits "123456" preceding the special character and the consecutive digits "9874" following it may then be extracted, and the combined digits "1234569874" are taken as the character string to be processed. Since the special character "Q" corresponds to a certain Q number, that is, the attribute information of the special character is a certain Q number, it can be determined that the attribute information of the character string to be processed is a certain Q number.
For example, assuming that the special characters include QQ, Q, qq, and v, and the key character string is "v123t_98c74", it may be determined that the key character string includes the special character v, which occupies the first position of the key character string. The characters "123t_98c74" following the special character may then be extracted as the character string to be processed. Since the special character "v" corresponds to a certain micro number, that is, the attribute information of the special character is a certain micro number, it can be determined that the attribute information of the character string to be processed is a certain micro number.
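The vocabulary-based extraction illustrated above can be sketched as follows. The sketch is a simplification under stated assumptions: the vocabulary contents (including the lower-case variants) and the attribute labels are illustrative, the leftmost and longest special character is chosen when several are present, and everything before and after the special character is concatenated rather than only the adjacent digits.

```python
from typing import Dict, Optional, Tuple

# Assumed vocabulary: special character -> attribute information.
SPECIAL_CHARS: Dict[str, str] = {
    "QQ": "Q number", "Q": "Q number", "qq": "Q number", "q": "Q number",
    "vx": "micro number", "v": "micro number",
    "YY": "Y number", "Y": "Y number",
}


def extract_candidate(
    key_string: str, vocab: Dict[str, str] = SPECIAL_CHARS
) -> Optional[Tuple[str, str, int]]:
    """Return (string to be processed, attribute, position), or None."""
    best = None  # (position, -length, special character)
    for special in vocab:
        pos = key_string.find(special)
        if pos != -1:
            found = (pos, -len(special), special)
            # Prefer the leftmost match; among ties prefer the longest.
            if best is None or found < best:
                best = found
    if best is None:
        return None
    pos, _, special = best
    # Join the text before and after the special character.
    candidate = key_string[:pos] + key_string[pos + len(special):]
    return candidate, vocab[special], pos
```

Preferring the longest match at a given position ensures that "QQ" is treated as one special character rather than as "Q" followed by a stray "Q" in the extracted string.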
In another embodiment, the key character string includes a plurality of characters, and a specific implementation of extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed, may include: identifying the type of each character in the key character string, and counting the number of characters of each type; if the number of characters of a specific type in the key character string reaches the preset number corresponding to that specific type, determining the character string formed by the characters of the specific type as the character string to be processed; and determining the attribute information of the specific type of character as the attribute information corresponding to the character string to be processed.
Wherein the specific type of character may be a number, a letter, or a special symbol. As an example, attribute information of a specific type of character may be set in advance. For example, the attribute information of the numeric type character is a telephone number.
In some embodiments, the specific type of character is a digit, and the preset number corresponding to digits is 11. Assuming that the key character string is "phone13511111111", it can be recognized that the key character string includes characters of both the digit type and the letter type, with 5 letter-type characters and 11 digit-type characters. Since the number of digit-type characters in the key character string reaches the preset number 11 corresponding to the digit type, the character string "13511111111" composed of the digit-type characters in the key character string can be determined as the character string to be processed. Moreover, since the specific type is the digit type and the attribute information of digit-type characters is a phone number, the attribute information corresponding to the character string to be processed can be considered to be a phone number.
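The type-counting variant can be sketched as follows. The preset table, the three type names and the reading of "reaches the preset number" as "greater than or equal to" are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Assumed presets: character type -> (preset number, attribute information).
PRESET_COUNTS: Dict[str, Tuple[int, str]] = {"digit": (11, "phone number")}


def char_type(ch: str) -> str:
    """Classify one character as digit, letter or symbol."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "symbol"


def count_types(key_string: str) -> Counter:
    """Count how many characters of each type the key string contains."""
    return Counter(char_type(ch) for ch in key_string)


def extract_by_type(
    key_string: str, presets: Dict[str, Tuple[int, str]] = PRESET_COUNTS
) -> Optional[Tuple[str, str]]:
    """Return (string to be processed, attribute) if a preset is reached."""
    counts = count_types(key_string)
    for ctype, (required, attribute) in presets.items():
        if counts[ctype] >= required:
            # Keep only the characters of the type that reached its preset.
            candidate = "".join(c for c in key_string if char_type(c) == ctype)
            return candidate, attribute
    return None
```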
Further, the special character recognition algorithm may also pick up some character strings that are not abnormal, and subsequent processing of such character strings is meaningless. The key character strings can therefore be screened once: the character strings in a white list are deleted, and the character strings to be processed are extracted from the remaining character strings.
In some embodiments, before extracting the character string to be processed from the key character string based on the special character recognition algorithm, the method further includes: comparing the key character string with the white list character strings in a preset white list, and deleting from the key character string any character string that is the same as a white list character string.
Accordingly, a specific implementation of extracting the character string to be processed from the key character string based on the special character recognition algorithm may include: extracting, based on the special character recognition algorithm, the character string to be processed from the key character string from which the white list character strings have been deleted.
That is, a white list may be preset, and character strings that are liable to be misjudged as abnormal character strings are added to the white list. Before a character string to be processed is extracted from a key character string, the white list character strings included in the key character string are deleted, and the character string to be processed is then extracted from the remaining key character string.
As an example, a preset white list may be stored in the device, and the preset white list may include character strings such as item names, item prices, and specific game terms. The key character string is compared with the white list character strings in the preset white list. If the key character string includes a character string that is the same as a white list character string, that character string is deleted from the key character string, the remaining character strings that differ from the white list character strings are retained, and the character string to be processed is extracted from the retained character strings.
For example, assuming that the key character string is "qqprice50" and the white list character strings include "price", the "price" in the key character string may be deleted, leaving the key character string "qq50", and the character string to be processed may then be extracted from "qq50".
In the embodiment of the application, before the character string to be processed is extracted from the key character string, the white list character strings that may cause misjudgment can be deleted from the key character string, and the character string to be processed is then extracted from the key character string after this deletion. This reduces misjudgment and subsequent operations, and can improve the efficiency of abnormal character string identification.
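White list deletion can be sketched as a plain substring removal. The function name and the longest-first ordering are assumptions of this sketch; the ordering only ensures that a longer white list entry is removed before a shorter entry contained in it.

```python
from typing import Iterable


def strip_whitelisted(key_string: str, whitelist: Iterable[str]) -> str:
    """Delete every occurrence of each white list string from the key string."""
    for entry in sorted(whitelist, key=len, reverse=True):
        key_string = key_string.replace(entry, "")
    return key_string


# The white list entry "price" is removed and the remainder is kept.
remaining = strip_whitelisted("qqprice50", ["price"])
```

Note that deleting a substring joins the text on either side of it, so a real implementation might need to decide whether fragments that become adjacent should be treated as one key character string.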
Step 106: identifying the character string to be processed based on the attribute information and the abnormal character recognition rule, and determining whether the character string to be processed is an abnormal character string.
Wherein the abnormal character recognition rule is used for recognizing the abnormal character string.
In some embodiments, the abnormal character recognition rule may include a plurality of sub-rules, each of which characterizes a character format in a character string having one of the attribute information. That is, a plurality of sub-rules included in the abnormal character recognition rule correspond to the attribute information, and one sub-rule may correspond to one or more attribute information, or one attribute information may correspond to one or more sub-rules.
As an example, the sub-rule may be that the arrangement of the character strings conforms to a preset arrangement, and the preset arrangement corresponds to the attribute information. For example, the preset arrangement mode may be an arrangement according to the order of numbers, symbols and letters, and the sub-rule may be that the character strings are arranged according to the order of numbers, symbols and letters; or, the preset arrangement mode may be that the symbol cannot be arranged at the first position and the symbol and the letter cannot be connected, and the sub-rule may be that the symbol cannot be arranged at the first position and the symbol and the letter cannot be connected in the character string.
Alternatively, the sub-rule may be that the number of characters of the specified type in the character string does not exceed a preset number. Illustratively, the specified type may be a symbol, e.g., the sub-rule may be no more than n symbols in the string. Wherein n is a positive integer greater than or equal to 1.
Alternatively, the sub-rule may be that only characters of the specified type are included in the character string and the number of characters is a specified number. Illustratively, the specified type may be a digit and the specified number may be m, e.g. the sub-rule may be that the character string includes only digits and the number of digits is m, where m is a positive integer greater than or equal to 1.
Alternatively, the sub-rule may be that the character at a specific position in the character string is a specific character, and the specific character corresponds to the attribute information. Illustratively, the specific character may be one character or a class of characters. Taking the specific position as the first position and the specific character as one character as an example, assuming that the specific character is 1, the sub-rule may be that the first character in the character string is 1. Taking the specific positions as the first and last positions and the specific character as a class of characters as an example, assuming that the specific character is a letter, the sub-rule may be that the first and last characters in the character string are letters.
It should be noted that the preset arrangement mode, the preset number and the specific characters may be set by the user according to actual needs, or obtained by analyzing big data, which is not limited in the embodiment of the present application.
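The kinds of sub-rule listed above can each be expressed as a predicate on a string. The following sketch is illustrative only: the function names are hypothetical, and a "symbol" is assumed to be any character that is neither a letter nor a digit.

```python
from typing import Callable

Rule = Callable[[str], bool]


def is_symbol(ch: str) -> bool:
    """Assumption: a symbol is any character that is not a letter or digit."""
    return not ch.isalnum()


def follows_digit_symbol_letter_order(s: str) -> bool:
    """Arrangement rule: digits first, then symbols, then letters."""
    ranks = [0 if c.isdigit() else 1 if is_symbol(c) else 2 for c in s]
    return ranks == sorted(ranks)


def symbol_placement_ok(s: str) -> bool:
    """Arrangement rule: no leading symbol, no symbol next to a letter."""
    if not s or is_symbol(s[0]):
        return False
    for a, b in zip(s, s[1:]):
        if (is_symbol(a) and b.isalpha()) or (a.isalpha() and is_symbol(b)):
            return False
    return True


def max_symbols(n: int) -> Rule:
    """Count rule: no more than n symbols in the string."""
    return lambda s: sum(is_symbol(c) for c in s) <= n


def only_digits_of_length(m: int) -> Rule:
    """Type-and-count rule: digits only, exactly m of them."""
    return lambda s: s.isdigit() and len(s) == m


def first_char_is(ch: str) -> Rule:
    """Position rule: the first character is a given character."""
    return lambda s: s.startswith(ch)


def first_and_last_are_letters(s: str) -> bool:
    """Position rule: the first and last characters are letters."""
    return bool(s) and s[0].isalpha() and s[-1].isalpha()
```

Writing the parameterized rules as factories (`max_symbols(n)`, `only_digits_of_length(m)`) lets the preset values be configured per attribute without changing the matching code.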
In a case where the abnormal character recognition rule includes a plurality of sub-rules corresponding to the attribute information one to one, the specific implementation of recognizing the character string to be processed based on the attribute information and the abnormal character recognition rule and determining whether the character string to be processed is an abnormal character string may include: determining a sub rule to be matched corresponding to the attribute information from a plurality of sub rules based on the attribute information corresponding to the character string to be processed; matching the character string to be processed with the sub-rule to be matched; and if the character string to be processed is successfully matched with the sub-rule to be matched, determining that the character string to be processed is an abnormal character string.
That is to say, the sub-rule to be matched with the character string to be processed may be determined according to the attribute information corresponding to the character string to be processed, and then it is determined whether the character string to be processed matches the sub-rule to be matched, and in case of successful matching, the character string to be processed may be considered as an abnormal character string.
For example, assuming that the character string to be processed is "z123t0_98c74014y" and the attribute information corresponding to the character string to be processed is a certain micro number, the character string to be processed may be matched with the sub-rule of the certain micro number. If the matching is successful, the character string to be processed may be considered to be a certain micro number, that is, the character string to be processed may be determined to be an abnormal character string. For example, if the sub-rule corresponding to the certain micro number is that a symbol in the character string cannot be in the first position and a symbol cannot be adjacent to a letter, the character string to be processed "z123t0_98c74014y" may be matched with the sub-rule. Since the first character of the character string to be processed is not a symbol and no symbol is adjacent to a letter, the character string to be processed is considered to match the sub-rule corresponding to the certain micro number successfully, and it can be determined that the character string to be processed is a certain micro number.
For example, assuming that the character string to be processed is "18800000000" and the attribute information corresponding to the character string to be processed is a phone number, the character string to be processed may be matched with the sub-rule of the phone number. If the matching is successful, the character string to be processed may be considered to be a phone number, that is, the character string to be processed may be determined to be an abnormal character string. For example, assuming that the sub-rule corresponding to the phone number is that the first character in the character string is 1, the character string to be processed "18800000000" may be matched with the sub-rule. Since the first character of the character string to be processed is 1, the character string to be processed is considered to match the sub-rule corresponding to the phone number successfully, and it can be determined that the character string to be processed is a phone number.
Further, after matching the character string to be processed with the sub-rule to be matched, the method further includes: if the matching of the character string to be processed and the sub-rule to be matched fails, matching the character string to be processed with other sub-rules except the sub-rule to be matched in the plurality of sub-rules; if the sub-rule successfully matched with the character string to be processed exists in other sub-rules, determining that the character string to be processed is an abnormal character string; and if the sub-rule which is successfully matched with the character string to be processed does not exist in other sub-rules, determining that the character string to be processed is not an abnormal character string.
That is to say, under the condition that the matching between the character string to be processed and the sub-rule to be matched fails, the preliminarily determined attribute information of the character string to be processed may be considered inaccurate, the character string to be processed may be matched with other sub-rules, if there is a sub-rule successfully matched with the character string to be processed in other sub-rules, the attribute information of the character string to be processed may be the attribute information corresponding to the other sub-rules, and the character string to be processed may be considered as an abnormal character string. And if the sub-rule which is successfully matched with the character string to be processed does not exist in other sub-rules, the character string to be processed is considered not to be an abnormal character string.
For example, assuming that the character string to be processed is "z123t_98c74014Y" and the attribute information corresponding to the character string to be processed is a certain micro number, the sub-rule of the certain micro number may be determined. Assuming that this sub-rule is that a symbol in the character string cannot be in the first position and a symbol cannot be adjacent to a letter, the character string to be processed is matched with the sub-rule. The first character of the character string to be processed is not a symbol, but the symbol "_" is adjacent to the letter "t", so the matching of the character string to be processed with the sub-rule of the certain micro number fails, and the character string to be processed may be considered not to be a certain micro number. The character string to be processed may then be matched with the sub-rules corresponding to other attribute information such as a certain Q number, a phone number, and a certain Y number. If the sub-rule corresponding to the certain Y number is that the first and last characters in the character string are letters, the character string to be processed is matched with this sub-rule. Since the first character "z" and the last character "Y" of the character string to be processed are both letters, the matching succeeds; the attribute information of the character string to be processed is determined to be a certain Y number, and the character string to be processed is further determined to be an abnormal character string.
For example, assuming that the character string to be processed is "123456" and the attribute information corresponding to the character string to be processed is a certain Q number, the sub-rule of the certain Q number may be determined. Assuming that this sub-rule is that the character string includes only digits and the number of digits is 10, the character string to be processed is matched with the sub-rule of the certain Q number. The character string to be processed consists entirely of digits, but the number of digits is 6, so the matching fails, and it may be determined that the character string to be processed is not a certain Q number. The character string to be processed may then be matched with the sub-rules corresponding to other attribute information such as a certain micro number, a phone number, and a certain Y number. Assuming that the character string to be processed fails to match all of these sub-rules, it may be determined that the character string to be processed is not an abnormal character string.
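The matching procedure with fallback to the other sub-rules can be sketched as follows. The rule table is an assumption assembled from the example sub-rules in the text, with one sub-rule per attribute; real sub-rules would normally be stricter.

```python
from typing import Callable, Dict, Optional, Tuple


def is_symbol(ch: str) -> bool:
    """Assumption: a symbol is any character that is not a letter or digit."""
    return not ch.isalnum()


def micro_rule(s: str) -> bool:
    """No leading symbol and no symbol adjacent to a letter."""
    if not s or is_symbol(s[0]):
        return False
    return not any(
        (is_symbol(a) and b.isalpha()) or (a.isalpha() and is_symbol(b))
        for a, b in zip(s, s[1:])
    )


# Assumed one-to-one table: attribute information -> sub-rule.
SUB_RULES: Dict[str, Callable[[str], bool]] = {
    "Q number": lambda s: s.isdigit() and len(s) == 10,
    "micro number": micro_rule,
    "phone number": lambda s: s.startswith("1"),
    "Y number": lambda s: bool(s) and s[0].isalpha() and s[-1].isalpha(),
}


def identify(
    candidate: str,
    attribute: str,
    rules: Dict[str, Callable[[str], bool]] = SUB_RULES,
) -> Tuple[bool, Optional[str]]:
    """Return (is_abnormal, attribute of the sub-rule that matched)."""
    primary = rules.get(attribute)
    if primary is not None and primary(candidate):
        return True, attribute
    # The preliminary attribute may be wrong: try the remaining sub-rules.
    for other_attribute, rule in rules.items():
        if other_attribute != attribute and rule(candidate):
            return True, other_attribute
    return False, None
```

With sub-rules this loose, any string starting with "1" matches the phone number sub-rule during fallback, so in practice a phone number sub-rule would presumably also constrain the length and character type.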
It should be noted that, by the above method, the abnormal character string in the text to be recognized can be recognized. Next, an operation after the abnormal character string is recognized will be described.
In implementation, after determining whether the character string to be processed is an abnormal character string, the method further includes: if the character string to be processed is determined to be an abnormal character string, determining the account that sent the abnormal character string, and managing the account by using an abnormal account management rule.
As an example, in the case that it is determined that the character string to be processed is an abnormal character string, the account sending the character string to be processed may be an advertisement or promotion account, and such an account may bring risks to other user accounts, so the account sending the abnormal character string may be determined and managed.
In the embodiment of the application, after the character string to be processed is determined to be an abnormal character string, the account that sent the abnormal character string can be managed by using the abnormal account management rule, so that flooding of the chat with advertisements is reduced.
In some embodiments, the specific implementation of managing the account by using the abnormal account management rule may include: marking the account as an abnormal account; and/or, forbidding the account to send the message.
That is, the account number may be marked as abnormal, or the account number may be prohibited from sending messages, or the account number may be marked as abnormal and prohibited from sending messages.
As an example, if an account is marked as an abnormal account, then whenever the account is online, the account or the messages it sends may be monitored closely. Alternatively, the account can be prohibited from sending messages, which reduces flooding of the chat with advertisements. Alternatively, the account can be both marked as abnormal and prohibited from sending messages, which reduces advertisement flooding and also lowers the possibility that other accounts are subsequently affected by the account.
Further, before the account is managed by using the abnormal account management rule, the method further includes: counting the times of sending abnormal character strings in a preset time period by the account;
correspondingly, the specific implementation of managing the account by using the abnormal account management rule may include: if the number of times exceeds the number threshold, managing the account by using the abnormal account management rule.
It should be noted that, both the preset time period and the number threshold may be set by the user according to actual needs, and the preset time period may be 1 hour, 1 day, 1 week, and the like, and the number threshold may be 10, 50, 100, and the like, which is not limited in this embodiment of the application.
As an example, before the account is managed, the number of times that the account sends the abnormal character string in a preset time period may be counted, and if the number of times that the account sends the abnormal character string in the preset time period exceeds a threshold number of times, it indicates that the account is most likely to be a spam account such as a marketing number or an advertisement number, and therefore, the account may be managed by using an abnormal account management rule.
In the embodiment of the application, after the character string to be processed is determined to be an abnormal character string, the account that sent the character string may be determined, and the number of times the account has sent abnormal character strings within the preset time period is counted. If the number of times exceeds the number threshold, the account is marked as an abnormal account and/or prohibited from sending messages. In this way, flooding of the chat with advertisements can be reduced, and the risk to user accounts can be lowered.
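The threshold check over a preset time period can be sketched with a sliding window per account. The class name, the use of caller-supplied timestamps in seconds, and the choice to both mark and mute an account once the threshold is exceeded are assumptions of this sketch.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class AbnormalAccountTracker:
    """Counts abnormal strings per account inside a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0, threshold: int = 10):
        self.window = window_seconds
        self.threshold = threshold
        self.events: Dict[str, Deque[float]] = defaultdict(deque)
        self.abnormal: Set[str] = set()  # accounts marked as abnormal
        self.muted: Set[str] = set()     # accounts prohibited from sending

    def record(self, account: str, timestamp: float) -> bool:
        """Record one abnormal string; return True if the account is muted."""
        times = self.events[account]
        times.append(timestamp)
        # Drop events that fell out of the preset time period.
        while times and times[0] <= timestamp - self.window:
            times.popleft()
        if len(times) > self.threshold:
            self.abnormal.add(account)
            self.muted.add(account)
        return account in self.muted
```

For instance, with `window_seconds=60` and `threshold=2`, the third abnormal string sent by the same account within one minute causes the account to be marked and muted.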
In other embodiments, after determining whether the character string to be processed is an abnormal character string, the method further includes: if the character string to be processed is determined to be an abnormal character string, attaching an abnormal label to the abnormal character string, and prohibiting the display of any abnormal character string that carries the abnormal label.
That is, if it is determined that the character string to be processed is an abnormal character string, the character string to be processed may be spam text that interferes with the user's experience. An abnormal label may therefore be attached to the abnormal character string, and display of the abnormal character string carrying the abnormal label is prohibited. In this way, whenever a character string in an obtained text carries the abnormal label, that abnormal character string can be set to be invisible, so that other users cannot see it, which reduces flooding of the chat with advertisements.
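Labelling and hiding can be sketched as follows. The `ChatMessage` type, the label value and the choice to hide a string by removing it from the rendered text are all hypothetical; the text only requires that labelled strings are not displayed.

```python
from dataclasses import dataclass, field
from typing import Dict

ABNORMAL_LABEL = "abnormal"  # assumed label value


@dataclass
class ChatMessage:
    text: str
    labels: Dict[str, str] = field(default_factory=dict)  # string -> label


def mark_abnormal(message: ChatMessage, abnormal_string: str) -> None:
    """Attach the abnormal label to one string found in the message."""
    message.labels[abnormal_string] = ABNORMAL_LABEL


def visible_text(message: ChatMessage) -> str:
    """Render the message without any string carrying the abnormal label."""
    text = message.text
    for labelled, label in message.labels.items():
        if label == ABNORMAL_LABEL:
            text = text.replace(labelled, "")
    return text
```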
The abnormal character string identification method provided by the specification is used for acquiring a text to be identified, and filtering out a key character string with a specified format from the text to be identified by using a preset filtering algorithm; extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed; and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string. The method can filter the key character strings with the specified format in the text to be recognized, then recognize the character strings to be processed, which may have abnormality in the key character strings, preliminarily determine the attribute information of the character strings to be processed, and further determine whether the character strings to be processed are abnormal character strings, namely, the abnormal character strings in the text to be recognized can be recognized before the abnormal character strings affect the user account, so that the use risk of the user account can be reduced.
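The three steps summarized above can be chained into one small pipeline. The sketch below is a reduced illustration under stated assumptions: the filter keeps only runs of letters, digits and underscores that contain at least one digit, the vocabulary holds only "QQ" and "Q", and the single sub-rule requires exactly 10 digits. None of these presets is taken from the application.

```python
import re
from typing import List, Optional, Tuple

MARKERS = {"QQ": "Q number", "Q": "Q number"}                 # assumed vocabulary
RULES = {"Q number": lambda s: s.isdigit() and len(s) == 10}  # assumed sub-rule


def filter_key_strings(text: str) -> List[str]:
    """Step 102 (assumed filter): runs that contain at least one digit."""
    return re.findall(r"[A-Za-z_]*\d[A-Za-z0-9_]*", text)


def extract(key_string: str) -> Optional[Tuple[str, str]]:
    """Step 104: locate a special character, return (candidate, attribute)."""
    for marker in sorted(MARKERS, key=len, reverse=True):
        pos = key_string.find(marker)
        if pos != -1:
            rest = key_string[:pos] + key_string[pos + len(marker):]
            return rest, MARKERS[marker]
    return None


def find_abnormal_strings(text: str) -> List[Tuple[str, str]]:
    """Step 106: keep candidates that match the sub-rule of their attribute."""
    found = []
    for key_string in filter_key_strings(text):
        extracted = extract(key_string)
        if extracted is None:
            continue
        candidate, attribute = extracted
        rule = RULES.get(attribute)
        if rule is not None and rule(candidate):
            found.append((candidate, attribute))
    return found
```

For the input "contact QQ1234569874 now", the filter keeps "QQ1234569874", the extraction yields "1234569874" with the attribute "Q number", and the 10-digit sub-rule matches, so the string is reported as abnormal.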
The following will further describe the abnormal character string recognition method with reference to fig. 2, by taking an application of the abnormal character string recognition method provided in this specification in a game chat scene as an example. Fig. 2 shows a processing flow chart of an abnormal character string identification method applied to a game chat scene, which is provided in an embodiment of the present specification, and specifically includes the following steps:
Step 202: chat text within the game is obtained.
Step 204: inputting the chat text into a text classification model, and outputting a classification result of the chat text.
Step 206: if the abnormal account number is determined to be possibly included in the chat text based on the classification result, key character strings comprising letters, numbers and special characters are filtered out of the chat text.
For example, if the chat text is "QQ contact 123xxxxxxx, price", the key character string filtered out of it may include "QQ123xxxxxxxprice".
Step 208: comparing the key character string with the white list character strings in a preset white list, and deleting from the key character string any character string that is the same as a white list character string.
Continuing with the above example, assuming that the white list string includes "price" and "rajy," the "price" and the "rajy" in the key string may be deleted.
Step 210: comparing the characters in the key character string from which the white list character strings have been deleted with the special characters in the preset special character vocabulary.
Step 212: if a special character in the preset special character vocabulary exists in the key character string, determining the position information of the special character in the key character string.
Continuing with the above example, assuming that the preset special character vocabulary includes "QQ", "Q", "qq", etc., the key character string "QQ123xxxxxxx" remaining after the white list character strings are deleted is compared with the special characters in the preset special character vocabulary, and it is determined that the special character "QQ" exists in the key character string and occupies its first two positions.
Step 214: based on the special character and the position information of the special character, the suspicious character string is extracted from the key character string.
Continuing with the above example, the number beginning at the third digit after the special character "QQ" in the key string is extracted as the suspect string, and the possible string is "123 xxxxxxx".
Step 216: and determining the attribute information of the special character as the attribute information corresponding to the suspicious character string.
Continuing with the above example, since the attribute information corresponding to the special character QQ is a certain Q number, it can be determined that the attribute information corresponding to the suspect character string "123 xxxxxxx" is a certain Q number.
Step 218: Determine the sub-rule to be matched from the plurality of sub-rules according to the attribute information.
Continuing with the above example, the sub-rule corresponding to a certain Q number may be determined from the plurality of sub-rules as the sub-rule to be matched.
Step 220: Match the suspicious character string against the sub-rule to be matched. If the matching succeeds, execute steps 222 and 224; if the matching fails, execute steps 226 and 228.
Step 222: Determine that the suspicious character string is an abnormal character string.
Continuing with the above example, the suspicious character string "123 xxxxxxx" is matched against the sub-rule corresponding to a certain Q number; if the matching succeeds, it is determined that the suspicious character string "123 xxxxxxx" is an abnormal character string and that it is a certain Q number.
Step 224: Determine the account that sent the suspicious character string as an abnormal account, and forbid the account from sending messages.
Step 226: Match the suspicious character string against the other sub-rules. If a sub-rule matching the suspicious character string exists, return to steps 222 and 224; if no sub-rule matches the suspicious character string, execute step 228.
Step 228: Determine that the suspicious character string is not an abnormal character string.
With the abnormal character string recognition method provided by this specification, after the chat text is obtained, the key character string with the specified format can be filtered out from the chat text, the character string to be processed that may be abnormal can then be identified in the key character string, the attribute information of the character string to be processed can be preliminarily determined, and it can further be determined whether the character string to be processed is an abnormal character string. In other words, the abnormal character string in the chat text can be identified before it affects the user account, so that the user does not receive junk account numbers such as those used for advertisements and sales promotion, and the use risk of the user account can be reduced.
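The filtering of step 206 can be sketched as follows. This is only an illustrative sketch, not the filtering algorithm of the specification: the set of special symbols, the function name `filter_key_string`, and the sample chat text are all assumptions, since the specification does not enumerate the special characters.

```python
import re

# Assumption: the "special characters" are a small configurable set of symbols.
SPECIAL_SYMBOLS = "+-_@.:"

# Runs of ASCII letters, digits, and the assumed special symbols.
KEY_RUN = re.compile(r"[A-Za-z0-9" + re.escape(SPECIAL_SYMBOLS) + r"]+")


def filter_key_string(chat_text: str) -> str:
    """Keep only runs of letters, digits, and special symbols, joined by spaces."""
    return " ".join(KEY_RUN.findall(chat_text))


# Hypothetical chat message: everything outside the assumed character set is dropped.
print(filter_key_string("加QQ 123456789，便宜"))
```

Joining the retained runs with single spaces keeps the boundaries between fragments, which the later white list comparison and position lookup can rely on.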
Corresponding to the above method embodiment, the present specification further provides an embodiment of an abnormal character string recognition apparatus. FIG. 3 shows a schematic structural diagram of an abnormal character string recognition apparatus provided in an embodiment of the present specification.
As shown in FIG. 3, the apparatus includes:
a filtering module 302 configured to obtain a text to be recognized, and filter out a key character string having a specified format from the text to be recognized by using a preset filtering algorithm;
a first determining module 304, configured to extract a character string to be processed from the key character string based on a special character recognition algorithm, and determine attribute information corresponding to the character string to be processed;
a second determining module 306, configured to identify the character string to be processed based on the attribute information and an abnormal character identification rule, and determine whether the character string to be processed is an abnormal character string.
Optionally, the first determining module 304 is further configured to:
compare the characters in the key character string with the special characters in a preset special character word list;
if a special character in the preset special character word list exists in the key character string, determine the position information of the special character in the key character string;
extract, from the key character string, a character string to be processed related to the special character, based on the special character and the position information of the special character;
and determine the attribute information of the special character as the attribute information corresponding to the character string to be processed.
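The word list lookup described above can be sketched as follows. The word list contents, the attribute labels, and the function name `extract_pending` are hypothetical, and the sketch assumes that the character string to be processed is whatever follows the earliest special character found.

```python
from typing import Optional, Tuple

# Hypothetical word list: special character -> attribute information.
SPECIAL_CHAR_ATTRS = {"QQ": "qq_number", "VX": "wechat_id"}


def extract_pending(key_string: str) -> Optional[Tuple[str, str]]:
    """Return (string to be processed, attribute) for the earliest special character."""
    best = None
    for marker, attr in SPECIAL_CHAR_ATTRS.items():
        pos = key_string.find(marker)  # position information of the special character
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, marker, attr)
    if best is None:
        return None  # no special character from the word list is present
    pos, marker, attr = best
    # Everything after the special character is taken as the string to be processed.
    pending = key_string[pos + len(marker):].strip()
    return pending, attr


print(extract_pending("QQ 123 4567890"))
```

The attribute of the special character is simply carried over to the extracted string, which is what later selects the sub-rule to be matched.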
Optionally, the key character string comprises a plurality of characters;
the first determining module 304 is further configured to:
identify the type of each character in the key character string, and count the number of characters of each type;
if the number of characters of a specific type in the key character string reaches the preset number corresponding to the specific type, determine the character string formed by the characters of the specific type as the character string to be processed;
and determine the attribute information of the characters of the specific type as the attribute information corresponding to the character string to be processed.
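The type-counting variant above can be sketched as follows. The character types, the preset number of 5, the attribute label, and the function names are assumptions chosen for illustration; the specification does not fix them.

```python
from collections import Counter
from typing import Optional, Tuple

# Hypothetical presets: character type -> (preset number, attribute information).
TYPE_PRESETS = {"digit": (5, "numeric_account")}


def char_type(ch: str) -> str:
    """Classify one character as a digit, a letter, or something else."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def extract_by_type(key_string: str) -> Optional[Tuple[str, str]]:
    """Return (string to be processed, attribute) if some type reaches its preset number."""
    counts = Counter(char_type(ch) for ch in key_string)
    for ctype, (preset, attr) in TYPE_PRESETS.items():
        if counts[ctype] >= preset:
            # The string to be processed is formed by the characters of that type.
            pending = "".join(ch for ch in key_string if char_type(ch) == ctype)
            return pending, attr
    return None


print(extract_by_type("a1b2 345 67"))
```

This path does not need a marker such as "QQ" in the text, so it can catch an account number that a sender has interleaved with other characters.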
Optionally, the apparatus further comprises:
a comparison module configured to compare the key character string with the white list character strings in a preset white list, and delete from the key character string any character string that is the same as a white list character string;
accordingly, the first determining module 304 is further configured to:
extract, based on the special character recognition algorithm, the character string to be processed from the key character string from which the white list character strings have been deleted.
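The white list deletion can be sketched as follows, assuming the key character string is a sequence of whitespace-separated fragments and that comparison is case-insensitive; both points, as well as the function name `remove_whitelisted`, are assumptions rather than details given in the specification.

```python
from typing import Iterable


def remove_whitelisted(key_string: str, whitelist: Iterable[str]) -> str:
    """Delete fragments of the key string that equal a white list entry."""
    blocked = {entry.lower() for entry in whitelist}
    kept = [token for token in key_string.split() if token.lower() not in blocked]
    return " ".join(kept)


# Hypothetical white list of harmless in-game terms.
print(remove_whitelisted("QQ 123456789 VIP ok", ["vip", "ok"]))
```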
Optionally, the abnormal character recognition rule includes a plurality of sub-rules, and each sub-rule characterizes the character format of character strings having one kind of attribute information;
the second determining module 306 is further configured to:
determine, from the plurality of sub-rules, the sub-rule to be matched that corresponds to the attribute information of the character string to be processed;
match the character string to be processed against the sub-rule to be matched;
and if the character string to be processed is successfully matched with the sub-rule to be matched, determine that the character string to be processed is an abnormal character string.
Optionally, the second determining module 306 is further configured to:
if the matching of the character string to be processed against the sub-rule to be matched fails, match the character string to be processed against the other sub-rules in the plurality of sub-rules, other than the sub-rule to be matched;
if a sub-rule successfully matched with the character string to be processed exists among the other sub-rules, determine that the character string to be processed is an abnormal character string;
and if no sub-rule successfully matched with the character string to be processed exists among the other sub-rules, determine that the character string to be processed is not an abnormal character string.
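The two-stage matching above (first the sub-rule selected by the attribute information, then the remaining sub-rules) can be sketched as follows. Expressing sub-rules as regular expressions is an assumption, and the patterns, attribute labels, and function name `matched_attribute` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical sub-rules: attribute information -> character format.
SUB_RULES = {
    "qq_number": re.compile(r"[1-9]\d{4,10}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}


def matched_attribute(pending: str, attribute: str) -> Optional[str]:
    """Return the attribute of the sub-rule that matches, or None if not abnormal."""
    candidate = re.sub(r"\s+", "", pending)  # ignore spacing inserted by the sender
    primary = SUB_RULES.get(attribute)
    if primary is not None and primary.fullmatch(candidate):
        return attribute  # matched the sub-rule to be matched
    for other_attr, rule in SUB_RULES.items():
        if other_attr != attribute and rule.fullmatch(candidate):
            return other_attr  # matched one of the other sub-rules
    return None  # no sub-rule matched: not an abnormal character string


print(matched_attribute("123 4567890", "qq_number"))
print(matched_attribute("someone@example.com", "qq_number"))
```

Trying the attribute-selected sub-rule first keeps the common case to a single match, while the fallback covers strings whose preliminary attribute was guessed wrongly.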
Optionally, the apparatus further comprises:
a classification module configured to input the text to be recognized into a text classification model and output a classification result of the text to be recognized, wherein the text classification model is used for judging whether the text to be recognized comprises a key character string with a specified format;
accordingly, the filtering module 302 is further configured to:
if it is determined, based on the classification result, that the text to be recognized comprises a key character string with the specified format, filter out the key character string with the specified format from the text to be recognized by using a preset filtering algorithm.
Optionally, the apparatus further comprises:
an account management module configured to, if the character string to be processed is determined to be an abnormal character string, determine the account that sent the abnormal character string, and manage the account by adopting an abnormal account management rule.
Optionally, the account management module is further configured to:
mark the account as an abnormal account; and/or forbid the account from sending messages.
Optionally, the apparatus further comprises:
a counting module configured to count the number of times the account sends the abnormal character string within a preset time period;
accordingly, the account management module is configured to:
if the number of times exceeds a count threshold, manage the account by adopting the abnormal account management rule.
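The counting described above can be sketched with a sliding time window per account. The class name `AbnormalSendCounter`, the in-memory storage, and the choice of a sliding window (rather than, for example, fixed calendar intervals) are assumptions; the specification only requires counting within a preset time period and comparing with a threshold.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class AbnormalSendCounter:
    """Counts abnormal character strings sent per account within a time window."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # account id -> timestamps of abnormal sends

    def record(self, account_id: str, now: Optional[float] = None) -> bool:
        """Record one abnormal send; return True if the count exceeds the threshold."""
        now = time.time() if now is None else now
        events = self._events[account_id]
        events.append(now)
        # Drop sends that fall outside the preset time period.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


counter = AbnormalSendCounter(window_seconds=60, threshold=2)
for t in (0, 10, 20):
    exceeded = counter.record("player_1", now=t)
print(exceeded)
```

Applying the abnormal account management rule only when `record` returns `True` avoids penalizing an account for a single isolated hit.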
Optionally, the apparatus further comprises:
a marking module configured to, if the character string to be processed is determined to be an abnormal character string, mark the abnormal character string with an abnormal label and forbid displaying the abnormal character string carrying the abnormal label.
The abnormal character string recognition apparatus provided by this specification acquires a text to be recognized, and filters out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm; extracts a character string to be processed from the key character string based on a special character recognition algorithm, and determines attribute information corresponding to the character string to be processed; and identifies the character string to be processed based on the attribute information and an abnormal character identification rule, and determines whether the character string to be processed is an abnormal character string. In this way, the key character string with the specified format can be filtered out from the text to be recognized, the character string to be processed that may be abnormal can then be identified in the key character string, the attribute information of the character string to be processed can be preliminarily determined, and it can further be determined whether the character string to be processed is an abnormal character string. In other words, the abnormal character string in the text to be recognized can be identified before it affects the user account, so that the use risk of the user account can be reduced.
The above is a schematic scheme of an abnormal character string recognition apparatus according to the present embodiment. It should be noted that the technical solution of the abnormal character string recognition apparatus and the technical solution of the abnormal character string recognition method belong to the same concept; for details not described in the technical solution of the apparatus, reference may be made to the description of the technical solution of the abnormal character string recognition method.
FIG. 4 illustrates a block diagram of a computing device 400 provided according to an embodiment of the present description. The components of the computing device 400 include, but are not limited to, a memory 410 and a processor 420. Processor 420 is coupled to memory 410 via bus 430 and database 450 is used to store data.
Computing device 400 also includes access device 440, which enables computing device 400 to communicate via one or more networks 460. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 440 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 400, as well as other components not shown in FIG. 4, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 4 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 400 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 400 may also be a mobile or stationary server.
Wherein processor 420 is configured to execute the following computer-executable instructions:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the abnormal character string recognition method belong to the same concept; for details not described in the technical solution of the computing device, reference may be made to the description of the technical solution of the abnormal character string recognition method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions that, when executed by a processor, are operable to:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the abnormal character string recognition method belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the abnormal character string recognition method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present specification is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present specification. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teachings, and combinations of features of the present disclosure are also within the scope of the present application. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, to thereby enable others skilled in the art to best understand the specification and its practical application. The specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. An abnormal character string recognition method, comprising:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
2. The method for identifying an abnormal character string according to claim 1, wherein extracting a character string to be processed from the key character string based on a special character identification algorithm and determining attribute information corresponding to the character string to be processed comprises:
comparing the characters in the key character string with special characters in a preset special character word list;
if the special character in the preset special character word list exists in the key character string, determining the position information of the special character in the key character string;
extracting a character string to be processed related to the special character from the key character string based on the special character and the position information of the special character;
and determining the attribute information of the special character as the attribute information corresponding to the character string to be processed.
3. The abnormal character string recognition method according to claim 1, wherein the key character string includes a plurality of characters; and extracting a character string to be processed from the key character string based on a special character recognition algorithm and determining attribute information corresponding to the character string to be processed comprises:
identifying the type of each character in the key character string, and counting the number of various types of characters;
if the number of the specific type characters in the key character string reaches the preset number corresponding to the specific type, determining the character string formed by the specific type characters as a character string to be processed;
and determining attribute information of the specific type of character as attribute information corresponding to the character string to be processed.
4. The abnormal character string recognition method according to any one of claims 1 to 3, wherein before extracting a character string to be processed from the key character string based on a special character recognition algorithm, further comprising:
comparing the key character string with a white list character string in a preset white list, and deleting the character string in the key character string, which is the same as the white list character string;
correspondingly, extracting the character string to be processed from the key character string based on a special character recognition algorithm comprises the following steps:
and extracting the character string to be processed from the key character string after the white list character string is deleted based on the special character recognition algorithm.
5. The abnormal character string recognition method according to claim 1, wherein the abnormal character recognition rule includes a plurality of sub-rules, each of which characterizes a character format in a character string having one attribute information, respectively; identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string, including:
determining a sub rule to be matched corresponding to the attribute information from the plurality of sub rules based on the attribute information corresponding to the character string to be processed;
matching the character string to be processed with the sub-rule to be matched;
and if the character string to be processed is successfully matched with the sub-rule to be matched, determining that the character string to be processed is an abnormal character string.
6. The method for identifying an abnormal character string according to claim 5, wherein after matching the character string to be processed with the sub-rule to be matched, the method further comprises:
if the matching of the character string to be processed and the sub-rule to be matched fails, matching the character string to be processed with other sub-rules except the sub-rule to be matched in the plurality of sub-rules;
if the sub-rule successfully matched with the character string to be processed exists in the other sub-rules, determining that the character string to be processed is an abnormal character string;
and if the sub-rule which is successfully matched with the character string to be processed does not exist in the other sub-rules, determining that the character string to be processed is not an abnormal character string.
7. The method for recognizing an abnormal character string according to claim 1, wherein before filtering out a key character string having a specified format from the text to be recognized by using a preset filtering algorithm, the method further comprises:
inputting the text to be recognized into a text classification model, and outputting a classification result of the text to be recognized, wherein the text classification model is used for judging whether the text to be recognized comprises a key character string with a specified format;
correspondingly, filtering out the key character string with the specified format from the text to be recognized by using a preset filtering algorithm comprises:
if it is determined, based on the classification result, that the text to be recognized comprises the key character string with the specified format, filtering out the key character string with the specified format from the text to be recognized by using the preset filtering algorithm.
8. The abnormal string recognition method according to claim 1 or 5, wherein after determining whether the character string to be processed is an abnormal character string, further comprising:
if the character string to be processed is determined to be an abnormal character string, determining an account number for sending the abnormal character string, and managing the account number by adopting an abnormal account number management rule.
9. The abnormal character string recognition method according to claim 8, wherein managing the account by adopting an abnormal account management rule comprises:
marking the account as an abnormal account; and/or forbidding the account to send messages.
10. The method for identifying an abnormal character string according to claim 8, wherein before the account is managed by using the abnormal account management rule, the method further comprises:
counting the number of times the account sends the abnormal character string within a preset time period;
correspondingly, managing the account by adopting an abnormal account management rule comprises:
if the number of times exceeds a count threshold, managing the account by adopting the abnormal account management rule.
11. The abnormal string recognition method according to claim 1 or 5, wherein after determining whether the character string to be processed is an abnormal character string, further comprising:
if the character string to be processed is determined to be an abnormal character string, an abnormal label is marked for the abnormal character string, and the abnormal character string with the abnormal label is forbidden to be displayed.
12. An abnormal character string recognition apparatus, comprising:
a filtering module configured to acquire a text to be recognized and filter out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
a first determining module configured to extract a character string to be processed from the key character string based on a special character recognition algorithm and determine attribute information corresponding to the character string to be processed;
and a second determining module configured to identify the character string to be processed based on the attribute information and an abnormal character identification rule, and determine whether the character string to be processed is an abnormal character string.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method of:
acquiring a text to be recognized, and filtering out a key character string with a specified format from the text to be recognized by using a preset filtering algorithm;
extracting a character string to be processed from the key character string based on a special character recognition algorithm, and determining attribute information corresponding to the character string to be processed;
and identifying the character string to be processed based on the attribute information and an abnormal character identification rule, and determining whether the character string to be processed is an abnormal character string.
14. A computer-readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the abnormal character string recognition method according to any one of claims 1 to 11.
CN202110753494.0A 2021-07-02 2021-07-02 Abnormal character string identification method and device Active CN113420549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110753494.0A CN113420549B (en) 2021-07-02 2021-07-02 Abnormal character string identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110753494.0A CN113420549B (en) 2021-07-02 2021-07-02 Abnormal character string identification method and device

Publications (2)

Publication Number Publication Date
CN113420549A (en) 2021-09-21
CN113420549B (en) 2023-06-13

Family

ID=77721441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110753494.0A Active CN113420549B (en) 2021-07-02 2021-07-02 Abnormal character string identification method and device

Country Status (1)

Country Link
CN (1) CN113420549B (en)

Similar Documents

Publication Publication Date Title
CN108874777B (en) Text anti-spam method and device
EP2803031B1 (en) Machine-learning based classification of user accounts based on email addresses and other account information
KR100943870B1 (en) Method and apparatus for identifying potential recipients
US20160226811A1 (en) System and method for priority email management
CN111669757B (en) Terminal fraud call identification method based on conversation text word vector
US20110258193A1 (en) Method for calculating entity similarities
EP2378475A1 (en) Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction
US20060195533A1 (en) Information processing system, storage medium and information processing method
US9111218B1 (en) Method and system for remediating topic drift in near-real-time classification of customer feedback
CN101534261A (en) A method, device and system of recognizing spam information
CN107734131B (en) Short message classification method and device
CN109614464B (en) Method and device for identifying business problems
CN103399906A (en) Method and device for providing candidate words on the basis of social relationships during input
CN103179245A (en) System, method and program product for identifying calling telephone numbers
CN112291423A (en) Intelligent response processing method and device for communication call, electronic equipment and storage medium
JP2006293573A (en) Electronic mail processor, electronic mail filtering method and electronic mail filtering program
CN113420549B (en) Abnormal character string identification method and device
US20230096474A1 (en) Identifying sensitive content in electronic files
CN104765784A (en) Key words list maintenance method and system
Prusty et al. SMS Fraud detection using machine learning
CN113449829A (en) Data transmission method based on optical character recognition technology and related device
CN111259207A (en) Short message identification method, device and equipment
CN115687754B (en) Active network information mining method based on intelligent dialogue
CN116610772A (en) Data processing method, device and server
CN111464687A (en) Strange call request processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329

Applicant after: Zhuhai Jinshan Digital Network Technology Co.,Ltd.

Address before: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329

Applicant before: ZHUHAI KINGSOFT ONLINE GAME TECHNOLOGY Co.,Ltd.

GR01 Patent grant