CN110516125B - Method, device and equipment for identifying abnormal character string and readable storage medium

Info

Publication number: CN110516125B
Application number: CN201910802851.0A
Authority: CN
Inventors: 陆青; 姜敏华
Original assignee: Rajax Network Technology Co Ltd
Current assignee: Rajax Network Technology Co Ltd
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2020-05-08
Anticipated expiration: 2039-08-28
Also published as: CN110516125A

Abstract

A method, a device and equipment for identifying an abnormal character string, and a readable storage medium, are provided. The method comprises the following steps: acquiring an original character string and converting it into a corresponding picture and a corresponding phonetic symbol string; inputting the original character string, the picture and the phonetic symbol string into a first deep learning model, a second deep learning model and a third deep learning model respectively to obtain a corresponding first deep learning feature vector, second deep learning feature vector and third deep learning feature vector; determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector; and matching the standardized character string with character strings in a preset abnormal database, identifying an abnormal character string in the standardized character string, and outputting an identification result. With this scheme, abnormal character strings are identified automatically, which improves identification efficiency as well as accuracy and precision.

Description

Method, device and equipment for identifying abnormal character string and readable storage medium

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a method, a device and equipment for identifying abnormal character strings and a readable storage medium.

Background

Nowadays people rely on the internet every day. Users generate text content in scenarios such as shopping, chatting, learning and working, and in the course of writing they often enter abnormal content, whether deliberately or unintentionally. To reduce the spread of such content, the content entered by users needs to be identified. Two methods are generally adopted at present: 1. manual identification; 2. regular expression matching.

However, with the rapid development of science and technology, users' internet usage is growing rapidly, and identifying abnormal content consumes ever more manpower and time. Relying on manual identification alone is costly and slow, and cannot meet the need to process the massive service data of the internet. Regular expression matching performs similarity matching between the acquired text content and characters designated as abnormal, and identifies abnormal characters, symbols and the like in the text content. However, this method recognizes deformed characters poorly, and has difficulty recognizing a character string that a user intentionally enters using deformed characters.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, a device and a readable storage medium for identifying an abnormal character string, which can implement automatic identification of an abnormal character string, improve efficiency of identifying an abnormal character string, and improve accuracy and precision of identification.

The embodiment of the invention provides a method for identifying an abnormal character string, which comprises the following steps:

acquiring an original character string; converting the original character string into a corresponding picture and a corresponding phonetic symbol string; inputting the original character string into a preset first deep learning model to obtain a first deep learning feature vector, inputting the picture into a preset second deep learning model to obtain a second deep learning feature vector, and inputting the phonetic symbol string into a preset third deep learning model to obtain a third deep learning feature vector; determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector; matching the standardized character string with character strings in a preset abnormal database, and identifying an abnormal character string in the standardized character string; and outputting the identification result.

Further, the determining a normalized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector includes: fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain a fused feature vector; and inputting the fusion feature vector into a preset fourth deep learning model to obtain a standardized character string corresponding to the original character string.

Further, the determining a normalized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector includes: obtaining a first standardized character string, a second standardized character string and a third standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector respectively;

the matching of the standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the standardized character string comprises the following steps: and respectively matching the first standardized character string, the second standardized character string and the third standardized character string with character strings in a preset abnormal database, and identifying abnormal character strings in the first standardized character string, the second standardized character string and the third standardized character string.

Further, the determining a normalized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector further includes: fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain a fused feature vector; inputting the fusion feature vector into a preset fourth deep learning model to obtain a fourth standardized character string corresponding to the original character string;

the matching of the standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the standardized character string further comprises: and matching the fourth standardized character string with a character string in a preset abnormal database, and identifying an abnormal character string in the fourth standardized character string.

Further, said fusing the first deep-learning feature vector, the second deep-learning feature vector, and the third deep-learning feature vector comprises: and connecting the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end.

Further, the first deep learning model comprises a first recurrent neural network model, the second deep learning model comprises a convolutional neural network model, and the third deep learning model comprises a second recurrent neural network model.

Further, the converting the original character string into a phonetic symbol string includes: and converting the original character string into a phonetic symbol string corresponding to the main language type based on the main language type of the original character string.

The embodiment of the invention also provides a device for identifying an abnormal character string, which comprises: an original character string obtaining unit, adapted to obtain an original character string; a first original character string conversion unit, adapted to convert the original character string into a corresponding picture; a second original character string conversion unit, adapted to convert the original character string into a corresponding phonetic symbol string; a first deep learning unit, adapted to input the original character string into a preset first deep learning model to obtain a first deep learning feature vector; a second deep learning unit, adapted to input the picture into a preset second deep learning model to obtain a second deep learning feature vector; a third deep learning unit, adapted to input the phonetic symbol string into a preset third deep learning model to obtain a third deep learning feature vector; a standardized character string generating unit, adapted to determine a standardized character string corresponding to the original character string according to the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector; an abnormal character string identification unit, adapted to match the standardized character string with character strings in a preset abnormal database to identify an abnormal character string in the standardized character string; and a result output unit, adapted to output the identification result.

The embodiment of the invention also provides data processing equipment, which comprises a memory and a processor; wherein the memory is adapted to store one or more computer instructions which, when executed by the processor, perform the steps of the method of any of the above embodiments.

The embodiment of the present invention further provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the method described in any of the above embodiments are performed.

According to the scheme for identifying abnormal character strings, the acquired original character string is converted into a corresponding picture and a corresponding phonetic symbol string; the original character string, the picture and the phonetic symbol string are then input into a first deep learning model, a second deep learning model and a third deep learning model respectively to obtain a corresponding first deep learning feature vector, second deep learning feature vector and third deep learning feature vector; a standardized character string corresponding to the original character string is then determined based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector; and the standardized character string is matched with character strings in a preset abnormal database, so that an abnormal character string in the standardized character string can be identified. In this identification process, the original character string is converted into a picture and a phonetic symbol string, deep learning is performed on each to obtain the corresponding feature vectors, the standardized character string corresponding to the original character string is restored from feature vectors of multiple dimensions, and abnormal character string identification is then performed. The recognition rate for deformed characters can therefore be greatly improved, along with the accuracy and precision of abnormal character string identification. Moreover, the whole identification process is automatic and requires no manual participation or adjustment, so the efficiency of identifying abnormal character strings can be improved and the labor cost greatly reduced.

Further, the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector are fused to obtain a fused feature vector, the fused feature vector is input into a fourth deep learning model to obtain a standardized character string corresponding to the original character string, and identification and output are then performed. With this scheme, the feature vectors corresponding to the original character string, the picture and the phonetic symbol string are fused and subjected to a second round of deep learning, which further strengthens the relations among the feature vectors and yields a more accurate standardized character string. This improves the breadth and accuracy of abnormal character string identification and enhances the capability of identifying abnormal character strings.
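The end-to-end connection of the three feature vectors can be illustrated with a minimal Python sketch. The function name `fuse` and the numeric values are hypothetical and chosen only for illustration; the fourth deep learning model that would consume the fused vector is not shown.

```python
def fuse(first, second, third):
    """Connect three feature vectors end to end (plain concatenation)."""
    return [*first, *second, *third]

# Hypothetical feature vectors of dimensions N1 = 2, N2 = 3 and N3 = 1.
x = [0.9, 0.1]
y = [0.8, 0.7, 0.2]
z = [0.6]

fused = fuse(x, y, z)
print(fused)       # [0.9, 0.1, 0.8, 0.7, 0.2, 0.6]
print(len(fused))  # 6, that is N1 + N2 + N3
```

The fused vector keeps every component of the three inputs in order, so its dimension is the sum of the three input dimensions.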

Further, standardized character strings corresponding to the first deep learning feature vector, the second deep learning feature vector, the third deep learning feature vector and the fused feature vector can each be determined, and abnormal character strings in the first standardized character string, the second standardized character string, the third standardized character string and the fourth standardized character string can be identified at the same time. When an abnormal character string exists in at least one of the standardized character strings, the identification result of the abnormal character string is output. This achieves multi-dimensional identification and can reduce the miss rate of abnormal character string identification.
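The "at least one" rule can be sketched as follows. This is only an illustration under stated assumptions: the abnormal database is modeled as a list of strings, matching is modeled as a plain substring test, and the names `find_hits` and `is_abnormal` as well as the sample data are hypothetical.

```python
def find_hits(candidates, abnormal_db):
    """Map each candidate name to the abnormal entries it contains."""
    hits = {}
    for name, text in candidates.items():
        matched = [entry for entry in abnormal_db if entry in text]
        if matched:
            hits[name] = matched
    return hits

def is_abnormal(candidates, abnormal_db):
    """Flag the input when at least one standardized string has a hit."""
    return bool(find_hits(candidates, abnormal_db))

# Hypothetical standardized character strings and abnormal database.
candidates = {
    "first": "contact me later",
    "second": "contact 123@qq.com",
    "third": "contact me",
    "fourth": "contact",
}
abnormal_db = ["@qq.com"]

print(find_hits(candidates, abnormal_db))    # {'second': ['@qq.com']}
print(is_abnormal(candidates, abnormal_db))  # True
```

Only one of the four candidates needs to match for the input to be reported, which is what lowers the miss rate compared with relying on a single standardized string.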

Furthermore, because the input original character string may contain various language characters, numbers and even symbols, when the original character string is converted into the phonetic symbol string, the original character string is converted into the corresponding phonetic symbol string and then recognized based on the main language type of the original character string, so that the application range of abnormal character string recognition can be expanded.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings needed to be used in the embodiments of the present specification or in the description of the prior art will be briefly described below, it is obvious that the drawings described below are only some embodiments of the present specification, and it is also possible for a person skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of a method for identifying an abnormal character string according to an embodiment of the present invention.

Fig. 2 is a flowchart of a method for determining a normalized character string corresponding to an original character string according to an embodiment of the present invention.

Fig. 3 is a flowchart of another method for identifying an abnormal character string according to an embodiment of the present invention.

Fig. 4 is a schematic structural diagram of an apparatus for recognizing an abnormal character string according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a standardized character string generating unit according to an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of an abnormal character string recognition unit according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of another standardized character string generation unit in the embodiment of the present invention.

Fig. 8 is a diagram illustrating conversion of an original character string into a picture according to an embodiment of the present invention.

Detailed Description

As mentioned above, the internet now carries a huge volume of service data, and relying on manual identification alone is costly and slow. Matching abnormal characters through regular expressions recognizes deformed characters poorly and cannot accurately identify all abnormal characters. For example, a user registers a new account on an application service platform with someone else's mobile phone number in order to obtain new-user benefits, and then, in the remarks, tells the service provider on the platform the real mobile phone number in a text that mixes miswritten characters, homophones, letters, arbitrary symbols and the like. As another example, a product review may be used to advertise a store, leaving a personal contact in a text composed of miswritten characters, letters, arbitrary symbols and the like. Therefore, neither manual identification nor regular expression matching can meet the data processing requirements of the massive services of the existing internet.

In view of the above problems, an embodiment of the present invention provides a method for identifying an abnormal character string, where an acquired original character string is first converted into a corresponding picture and a corresponding phonetic symbol string, then the original character string, the picture and the phonetic symbol string are input into a first deep learning model, a second deep learning model and a third deep learning model, respectively, to obtain a corresponding first deep learning feature vector, a corresponding second deep learning feature vector and a corresponding third deep learning feature vector, then a standardized character string corresponding to the original character string is determined based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector, and the standardized character string is matched with a character string in a preset abnormal database, so that an abnormal character string in the standardized character string can be identified.

To help those skilled in the art better understand the concept, implementation and advantages of the embodiments of the present invention, a detailed description is given below with reference to the accompanying drawings.

Referring to a flowchart of a method for identifying an abnormal character string in the embodiment of the present invention shown in fig. 1, in the embodiment of the present invention, the following steps may be adopted to identify the abnormal character string:

S11, acquiring the original character string.

In a specific implementation, the original character string may originate from any platform on the internet that needs to identify abnormal character strings. The data format of the original character string is determined by the system encoding of the platform, which may be any existing character set encoding, such as ASCII, GB2312, BIG5 or GB18030, or a custom character set encoding. Taking an e-commerce platform as an example, a user can enter text content in the remarks field of an order or in a comment interface, and the e-commerce platform can acquire the text content entered by the user as the original character string.

S12, converting the original character string into a corresponding picture and a corresponding phonetic symbol string respectively.

In particular implementations, the original character string may be converted into corresponding picture and phonetic symbol strings in a variety of ways.

In an embodiment of the present invention, the original character string can be converted into a corresponding black-and-white picture or a corresponding color picture by a conversion method of encoding and decoding. For example, the original character string is coded and decoded in Base64 format and converted into corresponding pictures.

For the phonetic symbol string, in an embodiment of the present invention, a phonetic symbol comparison table may be preset in the database, and the original character string is then converted into a phonetic symbol string by looking it up in the phonetic symbol comparison table. The phonetic symbol comparison table may include the correspondence between any main language type and the phonetic symbols of that language, such as the correspondence between English letters and English phonetic symbols, between numbers and English phonetic symbols, between Chinese characters and pinyin, and between symbols and phonetic symbols, and may be set according to the actual situation.
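A table lookup of this kind can be sketched in Python as below. This is a minimal illustration, not the conversion tool of the embodiment: the table contents, the function name `to_phonetic` and the placeholder for unknown characters are assumptions, and the readings mirror those used in the application scenario of this description.

```python
# Hypothetical phonetic symbol comparison table: one reading per character.
PHONETIC_TABLE = {
    "①": "yi", "贰": "er", "三": "san", "艾": "ai", "特": "te",
    "q": "kju:", "。": "ju hao", "c": "si:", "0": "ling", "m": "em",
}

def to_phonetic(text, table=PHONETIC_TABLE, unknown="?"):
    """Look up each character and join the readings with spaces."""
    return " ".join(table.get(ch, unknown) for ch in text)

print(to_phonetic("①贰三艾特qq。c0m"))
# yi er san ai te kju: kju: ju hao si: ling em
```

A production table would need one set of entries per main language type; characters missing from the table are mapped to the placeholder here rather than dropped, so that gaps in the table remain visible.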

In addition, in order to simplify the conversion process and shorten the conversion time, a special conversion module or tool for the picture and the phonetic symbol string can be arranged, and the existing conversion tool can be adopted for converting the picture and the phonetic symbol string.

S13, inputting the original character string into a preset first deep learning model to obtain a first deep learning feature vector, inputting the picture into a preset second deep learning model to obtain a second deep learning feature vector, and inputting the phonetic symbol string into a preset third deep learning model to obtain a third deep learning feature vector.

In specific implementation, the preset first deep learning model, the preset second deep learning model and the preset third deep learning model may include one or more neural network models that complete training, and the type of the model that is specifically adopted may be selected and set according to the characteristics of the converted data.

For example, the first deep learning model may include any of the models in the recurrent neural network (RNN) family, and is used for processing the vocabulary information and semantic information of the original character string in time sequence, so that a first deep learning feature vector containing information such as vocabulary and semantics can be obtained.

For another example, the second deep learning model may include any of the models in the convolutional neural network (CNN) family, and is used for processing the feature information of each part of the picture, so that a second deep learning feature vector containing character-related feature information can be obtained. The feature information may include shape information of words, symbols, letters, numbers and the like.

For another example, the third deep learning model may include any of the models in the recurrent neural network family, and is used for processing the pronunciation information and semantic information of the phonetic symbol string in time sequence, so that a third deep learning feature vector containing information such as pronunciation and semantics can be obtained.

S14, determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector.

In a specific implementation, the first deep learning feature vector and the second deep learning feature vector may be subjected to reverse analysis of character shapes, and the third deep learning feature vector may be subjected to reverse analysis of character pronunciations, so as to determine a normalized character string corresponding to the original character string. For example, a standard character shape comparison table and a standard character pronunciation comparison table may be preset in the database, the first deep learning feature vector and the second deep learning feature vector are matched with the standard character shape comparison table, and the third deep learning feature vector is matched with the standard character pronunciation comparison table. The standard character shape comparison table and the standard character pronunciation comparison table can be set according to actual conditions.

S15, matching the standardized character string with character strings in a preset abnormal database, and identifying an abnormal character string in the standardized character string.

In specific implementation, the preset abnormal database can be set according to actual conditions. Because the standardized character string corresponding to the original character string is obtained, the character string is more conveniently matched with the character string in the preset abnormal database.
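As a rough sketch of this matching step, the abnormal database can be modeled as a list of strings, and every occurrence of an entry in the standardized character string can be located. The function name `locate_abnormal`, the substring-based matching and the sample data are assumptions for illustration; the embodiment leaves the database and the matching rule to be set according to actual conditions.

```python
def locate_abnormal(standardized, abnormal_db):
    """Return (start, end, entry) for every occurrence of a database entry."""
    found = []
    for entry in abnormal_db:
        if not entry:
            continue  # ignore empty database entries
        start = standardized.find(entry)
        while start != -1:
            found.append((start, start + len(entry), entry))
            start = standardized.find(entry, start + 1)
    return sorted(found)

# Hypothetical standardized character string and abnormal database.
print(locate_abnormal("mail 123@qq.com now", ["@qq.com", "123"]))
# [(5, 8, '123'), (8, 15, '@qq.com')]
```

Returning positions rather than a bare yes or no makes it possible to highlight the abnormal part when the identification result is output.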

S16, outputting the identification result.

In a specific implementation, if the identification result indicates that an abnormal character string exists, the user can be reminded according to preset conditions so as to prevent abnormal character strings from being produced. The identification result can also be output to back-end monitoring personnel as an abnormality prompt, so that they can find the abnormality in time and carry out the corresponding processing operation.

By the method for identifying the abnormal character string, the original character string is converted into the picture and the phonetic symbol string, then the deep learning is respectively carried out to obtain the corresponding characteristic vectors, the standardized character string corresponding to the original character string is restored through the characteristic vectors of multiple dimensions such as the original character string, the picture and the phonetic symbol string, and then the abnormal character string is identified, so that the identification rate of the deformed character can be greatly improved, and the accuracy and precision of the abnormal character string identification can be improved. Moreover, the whole recognition process does not need manual participation and adjustment, but automatic recognition is adopted, so that the efficiency of recognizing the abnormal character strings can be improved, and the labor cost is greatly reduced.

In order to make the embodiment of the present invention better understood and implemented by those skilled in the art, how to identify the abnormal character string is described in detail below through a specific application scenario.

Assume that the character-deformed content entered by the user in a comment or remark is "①贰三艾特qq。c0m" (a circled digit one, the Chinese characters read "er", "san" and "ai te", the letters "qq", an ideographic full stop, and "c0m"). The system code of the application service platform adopts ASCII code, so the corresponding data in ASCII hexadecimal coding format can be obtained as "2460 8d30 4e09 827e 7279 0071 0071 3002 0063 0030 006d", with a space used as the separator. This hexadecimal coded data is the original character string.
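The relation between the hexadecimal data and the characters entered by the user can be checked with a short Python sketch. It regroups the hexadecimal digits given above into four-digit units and assumes that each unit is read as a Unicode code point, which is how the values line up with the characters; the function name `decode_hex_units` is hypothetical.

```python
HEX_DATA = "2460 8d30 4e09 827e 7279 0071 0071 3002 0063 0030 006d"

def decode_hex_units(data):
    """Turn space-separated hexadecimal units into the characters they encode."""
    return "".join(chr(int(unit, 16)) for unit in data.split())

original = decode_hex_units(HEX_DATA)
print(original)       # ①贰三艾特qq。c0m
print(len(original))  # 11
```

Each of the eleven units corresponds to exactly one character of the input, including the circled digit (2460) and the ideographic full stop (3002) that stand in for "1" and ".".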

Then, the original character string can be converted by decoding into a picture showing the content entered by the user. In this embodiment, Base64 encoding and decoding are adopted to convert the original character string into the corresponding picture, shown as picture 80 in Fig. 8.

In addition, the original character string can be converted into the corresponding phonetic symbol string according to the preset phonetic symbol comparison table, namely "yi er san ai te kju: kju: ju hao si: ling em".

As mentioned above, the deep learning models used in step S13 may be neural network models chosen according to the characteristics of the input data. In this embodiment, the first deep learning model may include a first recurrent neural network model, the second deep learning model may include a convolutional neural network model, and the third deep learning model may include a second recurrent neural network model.

After the above data processing, the original character string is input into the first deep learning model, and after processing by the recurrent neural network an N1-dimensional first deep learning feature vector [Xi] is output, where i = 1, 2, 3, ..., N1 and N1 is a natural number not less than 1. Xi represents the maximum probability of the i-th output predicted from the original character string, and the value of Xi lies in the interval [0, 1].

It can be understood that different training data are adopted for training according to actual use situations, and first deep learning models with different functions can be obtained. For example, the first deep learning model may be used to screen interference data in the original character string that does not meet grammar rules, and then training data of a standard grammar may be obtained to train the first deep learning model. After the training is completed, the first deep learning model may perform syntax screening processing on the input data, and then output a maximum probability array that is predicted according to the original character string and conforms to a syntax rule, thereby serving as a first deep learning feature vector.

As described above, the picture is input into the second deep learning model, and after processing by the convolutional neural network an N2-dimensional second deep learning feature vector [Yi] is output, where i = 1, 2, 3, ..., N2 and N2 is a natural number not less than 1. Yi represents the maximum probability of the i-th output predicted from the picture, and the value of Yi lies in the interval [0, 1].

It can be understood that different training data are adopted for training according to actual use situations, so that second deep learning models with different functions can be obtained, for example, when the second deep learning models are used for extracting character strings in the picture, the training data labeled with character string labels can be obtained for training. After the training is completed, the second deep learning model may perform character string extraction processing on the input picture, and then output a character string maximum probability array predicted according to the picture, thereby serving as a second deep learning feature vector.

As described above, the phonetic symbol string is input into the third deep learning model, and after processing by the recurrent neural network an N3-dimensional third deep learning feature vector [Zi] is output, where i = 1, 2, 3, ..., N3 and N3 is a natural number not less than 1. Zi represents the maximum probability of the i-th output predicted from the phonetic symbol string, and the value of Zi lies in the interval [0, 1].

It can be understood that different training data are adopted for training according to actual use situations, so that a third deep learning model with different functions can be obtained, for example, the third deep learning model is used for screening interference data which do not accord with phonetic symbol rules in the phonetic symbol string, and then the training data marked with phonetic symbol labels can be obtained for training. After the training is completed, the third deep learning model may perform phonetic symbol rule screening processing on the input phonetic symbol string, and then output a phonetic symbol string maximum probability array predicted according to the phonetic symbol string, thereby serving as a third deep learning feature vector.

Then, according to the standard character shape comparison table and the standard character pronunciation comparison table preset in the database of the system where the application service platform is located, inverse analysis can be performed on the first deep learning feature vector [Xi], the second deep learning feature vector [Yi] and the third deep learning feature vector [Zi] respectively, so that a first standardized character string, a second standardized character string and a third standardized character string corresponding to the original character string are obtained.

The standard character shape comparison table can comprise the comparison relationship between at least one character of the standard shapes of characters, symbols, letters and numbers and non-negative numbers not greater than 1, and the standard character pronunciation comparison table can comprise the comparison relationship between at least one character of the standard shapes of characters, symbols, letters and numbers and non-negative numbers not greater than 1. In addition, the standard character shape comparison table may further include a comparison relationship between the standard shapes of the radicals and non-negative numbers not greater than 1, and the standard character pronunciation comparison table may further include a comparison relationship between fuzzy pronunciations of characters, symbols, letters and numbers and non-negative numbers not greater than 1.

The specific process of inverse analysis is as follows:

1) Matching the first deep learning feature vector [Xi] with the standard character shape comparison table can identify a deformed number "①" similar in shape to the number "1" and a punctuation mark similar in shape to the punctuation mark ".". The first standardized character string obtained is "1 two three-idet qq.c0m".

2) Matching the second deep learning feature vector [Yi] with the standard character shape comparison table can identify a deformed number "①" similar in shape to the number "1", a number "0" similar in shape to the letter "o", a punctuation mark similar in shape to the punctuation mark ".", and even a deformed word "three" similar in shape to the number "3". The second standardized character string obtained is "1-two 3-eqq.com".

3) Matching the third deep learning feature vector [Zi] with the standard character pronunciation comparison table can identify a punctuation mark with the same reading (ju hao) as the punctuation mark ".", the word "two" (er) with the same reading as the number "2" (er), the word "three" (san) with the same reading as the number "3" (san), and the word "ait" (aite) with the same reading as the symbol "@" (aite). The third standardized character string obtained is "123 @ qq.0m".
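The table lookups in steps 1) to 3) can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the inverse analysis itself: it operates directly on characters and readings rather than on the feature vectors [Xi], [Yi] and [Zi], and the tables `SHAPE_TABLE` and `SOUND_TABLE`, their entries, and both function names are hypothetical.

```python
from typing import Iterable

# Hypothetical comparison tables. The tables described in the text map
# standard shapes and pronunciations to numbers in [0, 1]; this sketch
# skips that numeric encoding and maps variants to standard characters.
SHAPE_TABLE = {"①": "1", "。": ".", "０": "0"}  # deformed glyph -> standard glyph
SOUND_TABLE = {"er": "2", "san": "3", "aite": "@", "ju hao": "."}  # reading -> character


def normalize_by_shape(text: str) -> str:
    """Replace every character that has a look-alike standard form."""
    return "".join(SHAPE_TABLE.get(ch, ch) for ch in text)


def normalize_by_sound(readings: Iterable[str]) -> str:
    """Replace every reading (one token per character) that has a standard character."""
    return "".join(SOUND_TABLE.get(r, r) for r in readings)
```

A shape lookup alone leaves homophones untouched and a pronunciation lookup alone leaves look-alike glyphs untouched, which is why the three views can yield three different standardized character strings.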

And matching the first standardized character string, the second standardized character string and the third standardized character string with character strings in a preset abnormal database respectively to identify abnormal character strings in the first standardized character string, the second standardized character string and the third standardized character string, and outputting an identification result with the abnormal character strings when at least one of the first standardized character string, the second standardized character string and the third standardized character string identifies the abnormal character string.

For example, after the first standardized character string and the third standardized character string are respectively matched with the character strings in the preset abnormal database, the mailbox-related abnormal character string "qq.com" is not identified; but when the second standardized character string "1-two 3-eqq.com" is matched with the character strings in the preset abnormal database, the mailbox-related abnormal character string "qq.com" is identified.
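The "at least one match" rule can be sketched in Python as follows. This is a minimal illustration that assumes matching means substring search after removing spaces; the text does not specify the matching algorithm, and the function names and the sample strings in the usage note are hypothetical.

```python
def find_abnormal(candidates, abnormal_db):
    """Return (candidate, pattern) pairs where a database entry occurs in a candidate."""
    hits = []
    for text in candidates:
        compact = text.replace(" ", "")  # ignoring spaces is an assumption
        for pattern in abnormal_db:
            if pattern in compact:
                hits.append((text, pattern))
    return hits


def is_abnormal(candidates, abnormal_db):
    """Report an abnormal result as soon as at least one candidate matches."""
    return bool(find_abnormal(candidates, abnormal_db))
```

For instance, with the hypothetical candidates `["1two3aiteqq.c0m", "123@qq.com", "123@qq.0m"]` and the database `["qq.com"]`, only the second candidate matches, and that single match is enough for the overall result to be abnormal.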

By adopting the scheme, the standardized character string corresponding to the original character string is restored through a plurality of feature vectors, then abnormal character string recognition is carried out, and the deformed abnormal character string is recognized from three aspects of characters, pictures and phonetic symbols.

In a specific implementation, recognizing the deformed abnormal character string from three aspects of characters, pictures and phonetic symbols may still have the problems of being unable to recognize the abnormal character string, recognizing the wrong abnormal character string, and the like, for example, if the set abnormal character string related to the mailbox is "@ qq. To this end, step S14 may be further expanded and optimized to determine a standardized string. The following is a detailed description by way of specific examples.

In this embodiment of the present invention, referring to a flowchart of a method for determining a normalized character string corresponding to an original character string shown in fig. 2, specifically, the method may include the following steps:

S21, fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain a fusion feature vector.

In a specific implementation, the method for fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector may adopt at least one of the following modes:

1. Connecting the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end to obtain an (N1 + N2 + N3)-dimensional fusion feature vector [Xi, Yi, Zi].

2. Randomly combining the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain an (N1 + N2 + N3)-dimensional fusion feature vector [Ri], where Ri ∈ {Xi, Yi, Zi}.

3. Transposing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector respectively and then combining them to obtain an (N1 + N2 + N3)-dimensional fusion feature vector [Xi^T, Yi^T, Zi^T] or [Hi], where Hi ∈ {Xi^T, Yi^T, Zi^T}.

It can be understood that the actual fusion method is not limited to the above methods, and the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector may be fused according to other different dimensions.
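The first two fusion modes can be sketched in Python with plain lists. This is a minimal illustration: reading "randomly combining" as a random permutation of all N1 + N2 + N3 components is an assumption, the transposition mode is not modeled, and the function names are hypothetical.

```python
import random


def fuse_concat(x, y, z):
    """Mode 1: connect the three vectors end to end, giving [Xi, Yi, Zi]."""
    return list(x) + list(y) + list(z)


def fuse_random(x, y, z, seed=None):
    """Mode 2 (assumed reading): all components of the three vectors in random order."""
    fused = fuse_concat(x, y, z)
    random.Random(seed).shuffle(fused)  # a fixed seed makes the order reproducible
    return fused
```

In both cases the fused vector has N1 + N2 + N3 components; a fixed seed would be needed so that training and inference see the same component order.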

S22, inputting the fusion feature vector into a preset fourth deep learning model to obtain a fourth standardized character string corresponding to the original character string.

The preset fourth deep learning model may adopt one or more trained neural network models, for example, various models of the RNN family and a multilayer perceptron (MLP), where the RNN model can improve the speed of obtaining the feature vector, and the MLP can improve the output accuracy of the feature vector.

In a specific implementation, the training set of the fourth deep learning model may include training data of various deformed character shapes with their corresponding standard character shapes, and training data of various deformed character pronunciations with their corresponding standard character pronunciations. After the fourth deep learning model is trained with this training set, the fusion feature vector is input into the trained fourth deep learning model, and a corresponding fourth standardized character string "123 @ qq.com" is obtained through shape matching and pronunciation matching. The fourth standardized character string may then be matched with the character strings in the preset abnormal database, so that the abnormal character string "@ qq.com" is identified and an identification result is output.

With reference to the foregoing embodiment, as shown in fig. 3, a flowchart of another method for identifying an abnormal character string according to an embodiment of the present invention is shown, where the method includes the following steps:

S31, acquiring the original character string.

S32-1, converting the original character string into a picture.

S32-2, converting the original character string into a phonetic symbol string.

S33-1, inputting the original character string into the first deep learning model.

S33-2, inputting the picture into the second deep learning model.

S33-3, inputting the phonetic symbol string into the third deep learning model.

S34-1, obtaining a first deep learning feature vector.

S34-2, obtaining a second deep learning feature vector.

S34-3, obtaining a third deep learning feature vector.

S35, fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector.

S36, inputting the fused first to third deep learning feature vectors into a fourth deep learning model.

S37, obtaining a fourth standardized character string after processing by the fourth deep learning model.

S38, identifying an abnormal character string in the fourth standardized character string.

S39, outputting the recognition result.

By adopting the scheme, the original character string, the picture and the feature vector corresponding to the phonetic symbol string are fused and subjected to secondary deep learning, so that the relation among the feature vectors can be further deepened, a more accurate standardized character string can be obtained, the identification breadth and accuracy of the abnormal character string can be improved, and the capability of identifying the abnormal character string can be enhanced.
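The flow of steps S31 to S39 can be summarized as a Python skeleton. It is a sketch under stated assumptions: every converter and model is passed in as a placeholder callable, the final match is a plain substring test, and the name `identify` and its parameters are hypothetical rather than taken from the source.

```python
def identify(original, to_picture, to_phonetic,
             model1, model2, model3, fuse, model4, abnormal_db):
    """Run the fused pipeline and return the abnormal strings that were found."""
    picture = to_picture(original)       # S32-1
    phonetic = to_phonetic(original)     # S32-2
    x = model1(original)                 # S33-1, S34-1
    y = model2(picture)                  # S33-2, S34-2
    z = model3(phonetic)                 # S33-3, S34-3
    fused = fuse(x, y, z)                # S35
    standardized = model4(fused)         # S36, S37
    # S38: substring matching against the database is an assumption.
    return [p for p in abnormal_db if p in standardized]  # the caller outputs this (S39)
```

Passing the models in as callables keeps the skeleton independent of any particular network architecture or framework.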

In a specific implementation, step S15 may be further expanded and optimized to determine a standardized string. The following is a detailed description by way of specific examples.

In the embodiment of the present invention, the first standardized character string, the second standardized character string, the third standardized character string, and the fourth standardized character string may be respectively matched with character strings in a preset abnormal database, and as long as at least one of the first standardized character string, the second standardized character string, the third standardized character string, and the fourth standardized character string is recognized to have an abnormal character string, a recognition result of the abnormal character string is output, so that multidimensional recognition is realized, and a missing rate of abnormal character string recognition may be reduced.

In a specific implementation, because the input original character string may contain characters of various languages, numbers and even symbols, when the original character string is converted into the phonetic symbol string, it may be converted into the phonetic symbol string corresponding to its main language type and then recognized, so that the application range of abnormal character string recognition can be expanded.
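One possible way to pick a main language type is to count which writing script most letters belong to. The Python sketch below does this with the standard `unicodedata` module. The text does not say how the main language type is determined, so the majority-script rule and the function name `main_script` are assumptions, and a script is only a rough proxy for a language.

```python
import unicodedata
from collections import Counter


def main_script(text: str) -> str:
    """Return the script label shared by most letters in text, or "UNKNOWN"."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, symbols and punctuation do not vote
        name = unicodedata.name(ch, "")
        if name:
            # The first word of a Unicode character name is used as a script label,
            # e.g. "LATIN SMALL LETTER Q" -> "LATIN".
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"
```

The returned label could then select the phonetic converter, for example a pinyin converter for "CJK" text.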

The embodiment of the present invention further provides a device for identifying an abnormal character string corresponding to the above method for identifying an abnormal character string, so that a person skilled in the art can better understand and implement the embodiment of the present invention, and the following detailed description is given by using specific embodiments with reference to the accompanying drawings.

Referring to fig. 4, a schematic structural diagram of an apparatus for identifying an abnormal character string in an embodiment of the present invention, the apparatus 400 for identifying an abnormal character string may include:

an original character string obtaining unit 401 adapted to obtain an original character string;

a first original character string converting unit 402 adapted to convert the original character string into a corresponding picture;

a second original character string converting unit 403 adapted to convert the original character string into a corresponding phonetic symbol string;

a first deep learning unit 404, adapted to input the original character string into a preset first deep learning model, so as to obtain a first deep learning feature vector;

the second deep learning unit 405 is adapted to input the picture into a preset second deep learning model to obtain a second deep learning feature vector;

a third deep learning unit 406, adapted to input the phonetic symbol string into a preset third deep learning model, so as to obtain a third deep learning feature vector;

a normalized character string generating unit 407 adapted to determine a normalized character string corresponding to the original character string according to the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector;

an abnormal character string identification unit 408 adapted to match the standardized character string with a character string in a preset abnormal database, and identify an abnormal character string in the standardized character string;

a result output unit 409 adapted to output the recognition result.

By adopting the scheme, the original character string is converted into the picture and the phonetic symbol string, then deep learning is respectively carried out to obtain the corresponding characteristic vectors, the standardized character string corresponding to the original character string is restored through the characteristic vectors with multiple dimensions, and then abnormal character string recognition is carried out, so that the recognition rate of the deformed character can be greatly improved, and the accuracy and precision of abnormal character string recognition can be improved. Moreover, the whole recognition process does not need manual participation and adjustment, but automatic recognition is adopted, so that the efficiency of recognizing the abnormal character strings can be improved, and the labor cost is greatly reduced.

In an embodiment of the present invention, as shown in fig. 5, the standardized character string generating unit 407 may include:

a first normalized character string generation subunit 501, adapted to obtain, according to the first deep learning feature vector, a first normalized character string corresponding to the original character string;

a second normalized character string generating subunit 502, adapted to obtain, according to the second deep learning feature vector, a second normalized character string corresponding to the original character string;

the third normalized character string generating subunit 503 is adapted to obtain a third normalized character string corresponding to the original character string according to the third deep learning feature vector.

As shown in fig. 6, the abnormal character string recognition unit 408 may include:

a first abnormal string identification subunit 601, adapted to match the first standardized string with a string in a preset abnormal database, and identify an abnormal string in the first standardized string;

a second abnormal string identification subunit 602, configured to match the second normalized string with a string in a preset abnormal database, and identify an abnormal string in the second normalized string;

a third abnormal string identification subunit 603, adapted to match the third standardized string with a string in a preset abnormal database, and identify an abnormal string in the third standardized string.

In particular implementations, apparatus 400 may be further expanded and optimized to determine standardized strings. The following is a detailed description by way of specific examples.

In an embodiment of the present invention, feature vectors corresponding to the original character string, the picture, and the phonetic symbol string may be fused and subjected to secondary deep learning, so as to further deepen a connection between the feature vectors, which is further described with reference to fig. 4 and 7, as shown in fig. 7, the normalized character string generating unit 407 may include:

and a feature vector fusion subunit 701, adapted to fuse the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector to obtain a fusion feature vector.

And the deep learning subunit 702 is adapted to input the fusion feature vector into a preset fourth deep learning model, and determine a fourth normalized character string corresponding to the original character string.

Then, the abnormal character string identification unit 408 may match the fourth normalized character string with a character string in a preset abnormal database, identify an abnormal character string in the normalized character strings, and finally, the result output unit 409 outputs the identification result.

In another embodiment of the present invention, the first normalized character string, the second normalized character string, the third normalized character string and the fourth normalized character string can be identified separately, which is further described with reference to fig. 4, fig. 5 and fig. 6.

As shown in fig. 5, the normalized character string generating unit 407 may further include, in addition to the first normalized character string generating subunit 501, the second normalized character string generating subunit 502, and the third normalized character string generating subunit 503:

As shown in fig. 6, the abnormal string recognition unit 408 may further include, in addition to the first abnormal string recognition subunit 601, the second abnormal string recognition subunit 602, and the third abnormal string recognition subunit 603:

a fourth abnormal string identification subunit 604, adapted to match the fourth standardized string with a string in a preset abnormal database, and identify an abnormal string in the fourth standardized string.

In a specific implementation, the first standardized character string, the second standardized character string, the third standardized character string and the fourth standardized character string are respectively matched with character strings in a preset abnormal database, and as long as at least one of the first standardized character string, the second standardized character string, the third standardized character string and the fourth standardized character string is identified to have an abnormal character string, the identification result with the abnormal character string is output, so that multi-dimensional identification is realized, and the missing rate of abnormal character string identification can be reduced.

In a specific implementation, the method for fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector may include at least one of:

1. and connecting the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end.

2. And randomly combining the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector.

3. And respectively transposing and combining the first deep learning characteristic vector, the second deep learning characteristic vector and the third deep learning characteristic vector.

It can be understood that the actual fusion method is not limited to the above methods, and the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector may also be processed according to other different dimensions.

In a specific implementation, the preset first to fourth deep learning models can each be obtained by training one or more neural network models. The first deep learning model may include a first recurrent neural network model, the second deep learning model may include a convolutional neural network model, the third deep learning model may include a second recurrent neural network model, and the preset fourth deep learning model may include a recurrent neural network model and a convolutional neural network model.

In a specific implementation, since the input original character string may include various language characters, numbers, or even symbols, when the original character string is converted into a phonetic symbol string, the second original character string conversion unit converts the original character string into a phonetic symbol string corresponding to the main language type based on the main language type of the original character string, and then performs recognition, so that the application range of abnormal character string recognition can be expanded.

The embodiment of the present invention further provides a data processing device, which includes a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, may execute the steps of the method for identifying an abnormal character string according to any one of the above embodiments of the present invention. The specific implementation of the method for identifying an abnormal character string executed when the computer instruction runs may refer to the steps of the method for identifying an abnormal character string in the above embodiments, and will not be described in detail.

The data processing device can be a handheld terminal such as a mobile phone or a tablet computer, a personal desktop computer, and the like.

The embodiment of the present invention further provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the method according to any of the above embodiments of the present invention may be executed.

The computer readable storage medium may be various suitable readable storage media such as an optical disc, a mechanical hard disc, a solid state hard disc, and the like. The method for identifying an abnormal character string executed by an instruction stored in the computer-readable storage medium may specifically refer to the embodiments of the above methods for identifying an abnormal character string, and will not be described in detail again.

To sum up, the embodiment of the present invention discloses embodiment A1, a method for identifying an abnormal character string, including:

acquiring an original character string;

respectively converting the original character strings into corresponding pictures and phonetic symbol strings;

inputting the original character string into a preset first deep learning model to obtain a first deep learning characteristic vector, inputting the picture into a preset second deep learning model to obtain a second deep learning characteristic vector, and inputting the phonetic symbol string into a preset third deep learning model to obtain a third deep learning characteristic vector;

determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector;

matching the standardized character string with a character string in a preset abnormal database, and identifying an abnormal character string in the standardized character string;

and outputting the recognition result.

The embodiment of the present invention discloses embodiment A2, the method for identifying an abnormal character string as described in embodiment A1, where the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector includes:

fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain a fused feature vector;

and inputting the fusion feature vector into a preset fourth deep learning model to obtain a standardized character string corresponding to the original character string.

The embodiment of the present invention discloses embodiment A3, the method for identifying an abnormal character string as described in embodiment A1, where the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector includes:

obtaining a first standardized character string, a second standardized character string and a third standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector respectively;

the matching of the standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the standardized character string comprises the following steps:

and respectively matching the first standardized character string, the second standardized character string and the third standardized character string with character strings in a preset abnormal database, and identifying abnormal character strings in the first standardized character string, the second standardized character string and the third standardized character string.

The embodiment of the present invention discloses embodiment A4, the method for identifying an abnormal character string as described in embodiment A3, where the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector further includes:

inputting the fusion feature vector into a preset fourth deep learning model to obtain a fourth standardized character string corresponding to the original character string;

the matching of the standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the standardized character string further comprises:

and matching the fourth standardized character string with a character string in a preset abnormal database, and identifying an abnormal character string in the fourth standardized character string.

The embodiment of the present invention discloses embodiment A5, the method for identifying an abnormal character string as described in embodiment A2 or embodiment A4, where the fusing the first deep learning feature vector, the second deep learning feature vector, and the third deep learning feature vector includes:

and connecting the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end.

The embodiment of the present invention discloses embodiment A6, the method for identifying an abnormal character string as described in embodiment A1, where the first deep learning model includes a first recurrent neural network model, the second deep learning model includes a convolutional neural network model, and the third deep learning model includes a second recurrent neural network model.

The embodiment of the present invention discloses embodiment A7, the method for identifying an abnormal character string as described in any one of embodiments A1 to A4 or embodiment A6, where the converting the original character string into a phonetic symbol string includes:

and converting the original character string into a phonetic symbol string corresponding to the main language type based on the main language type of the original character string.

The embodiment of the invention discloses a B1 embodiment, a device for identifying abnormal character strings, comprising:

an original character string obtaining unit adapted to obtain an original character string;

the first original character string conversion unit is suitable for converting the original character strings into corresponding pictures;

the second original character string conversion unit is suitable for converting the original character strings into corresponding phonetic symbol strings;

the first deep learning unit is suitable for inputting the original character string into a preset first deep learning model to obtain a first deep learning characteristic vector;

the second deep learning unit is suitable for inputting the picture into a preset second deep learning model to obtain a second deep learning feature vector;

the third deep learning unit is suitable for inputting the phonetic symbol string into a preset third deep learning model to obtain a third deep learning characteristic vector;

the normalized character string generating unit is suitable for determining a normalized character string corresponding to the original character string according to the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector;

the abnormal character string identification unit is suitable for matching the standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the standardized character string;

and the result output unit is suitable for outputting the identification result.

The embodiment of the invention discloses an embodiment B2, and in particular relates to a device for identifying abnormal character strings as described in the embodiment B1, wherein the standardized character string generating unit comprises:

the feature vector fusion subunit is suitable for fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector to obtain a fused feature vector;

and the deep learning subunit is suitable for inputting the fusion feature vector into a preset fourth deep learning model and determining a standardized character string corresponding to the original character string.

The embodiment of the invention discloses an embodiment B3, and in particular relates to a device for identifying abnormal character strings as described in the embodiment B1, wherein the standardized character string generating unit comprises:

the first standardized character string generating subunit is suitable for obtaining a first standardized character string corresponding to the original character string according to the first deep learning feature vector;

the second standardized character string generating subunit is suitable for obtaining a second standardized character string corresponding to the original character string according to the second deep learning feature vector;

a third standardized character string generation subunit, adapted to obtain a third standardized character string corresponding to the original character string according to the third deep learning feature vector;

the abnormal character string recognition unit includes:

the first abnormal character string identifying subunit is suitable for matching the first standardized character string with character strings in a preset abnormal database to identify an abnormal character string in the first standardized character string;

the second abnormal character string identifying subunit is suitable for matching the second standardized character string with character strings in a preset abnormal database to identify an abnormal character string in the second standardized character string;

and the third abnormal character string identifying subunit is suitable for matching the third standardized character string with a character string in a preset abnormal database to identify an abnormal character string in the third standardized character string.

The embodiment of the present invention discloses an embodiment B4, and as described in the embodiment B3, the device for identifying an abnormal character string further includes:

the deep learning subunit is suitable for inputting the fusion feature vector into a preset fourth deep learning model and determining a fourth standardized character string corresponding to the original character string;

the abnormal character string recognition unit further includes:

and the fourth abnormal character string identifying subunit is suitable for matching the fourth standardized character string with character strings in a preset abnormal database to identify an abnormal character string in the fourth standardized character string.

An embodiment of the invention discloses embodiment B5: the device for identifying an abnormal character string as described in embodiment B2 or B4, wherein the feature vector fusion subunit is adapted to connect the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end.
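By way of non-limiting illustration, connecting the feature vectors end to end amounts to concatenation along the feature dimension. The sketch below uses plain Python lists in place of model outputs; the function name `fuse_end_to_end` and the sample values are hypothetical.

```python
from typing import List, Sequence


def fuse_end_to_end(*vectors: Sequence[float]) -> List[float]:
    """Concatenate feature vectors in the order given.

    The fused length is the sum of the input lengths, so the vectors
    do not need to share a dimension.
    """
    fused: List[float] = []
    for vector in vectors:
        fused.extend(vector)
    return fused


# Hypothetical feature vectors for the string, picture and phonetic branches.
first_vec = [0.1, 0.2]
second_vec = [0.3]
third_vec = [0.4, 0.5]
fusion_vec = fuse_end_to_end(first_vec, second_vec, third_vec)
```

Because the order of concatenation determines which positions of the fusion feature vector each branch occupies, the same order would have to be used at training time and at inference time.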

An embodiment of the invention discloses embodiment B6: the device for identifying an abnormal character string as described in embodiment B1, wherein the first deep learning model comprises a first recurrent neural network model, the second deep learning model comprises a convolutional neural network model, and the third deep learning model comprises a second recurrent neural network model.

An embodiment of the invention discloses embodiment B7: the device for identifying an abnormal character string as described in any one of embodiments B1 to B4 or embodiment B6, wherein the second original character string converting unit is adapted to convert the original character string into a phonetic symbol string corresponding to the subject language type of the original character string.
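By way of non-limiting illustration, the conversion according to the subject language type may be sketched as follows. The sketch assumes that the subject language type is the language to which most characters of the original character string belong, which the embodiment does not state. The two-entry pinyin table and the lowercase placeholder for English are hypothetical stand-ins for real pronunciation dictionaries.

```python
from collections import Counter
from typing import Callable, Dict

# Toy lookup table; a real system would use a full pronunciation dictionary.
PINYIN_TABLE: Dict[str, str] = {"你": "ni", "好": "hao"}


def char_language(ch: str) -> str:
    """Classify one character as Chinese, English or other."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "zh"
    if ch.isascii() and ch.isalpha():
        return "en"
    return "other"


def subject_language(text: str) -> str:
    """Assumed rule: the language with the most characters wins."""
    counts = Counter(char_language(ch) for ch in text)
    counts.pop("other", None)
    return counts.most_common(1)[0][0] if counts else "other"


def to_pinyin(text: str) -> str:
    """Map each character through the toy table, keeping unknown ones."""
    return " ".join(PINYIN_TABLE.get(ch, ch) for ch in text)


# Placeholder converters keyed by subject language type.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "zh": to_pinyin,
    "en": lambda s: s.lower(),
}


def to_phonetic(text: str) -> str:
    """Convert `text` using the converter for its subject language type."""
    converter = CONVERTERS.get(subject_language(text))
    return converter(text) if converter else text
```

For example, `to_phonetic("你好a")` selects the Chinese converter because two of the three characters are Chinese, and the Latin letter passes through unchanged.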

An embodiment of the invention discloses embodiment C1: a data processing device, comprising a memory and a processor, wherein the memory is adapted to store one or more computer instructions which, when executed by the processor, perform the steps of the method of any one of embodiments A1 to A7.

An embodiment of the invention discloses embodiment D1: a computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when run, perform the steps of the method of any one of embodiments A1 to A7.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for identifying an abnormal character string, comprising:

acquiring an original character string;

and outputting the recognition result.

2. The method for identifying an abnormal character string according to claim 1, wherein the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector comprises:

3. The method for identifying an abnormal character string according to claim 1, wherein the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector comprises:

4. The method for identifying an abnormal character string according to claim 3, wherein the determining a standardized character string corresponding to the original character string based on the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector further comprises:

5. The method for identifying an abnormal character string according to claim 2 or 4, wherein the fusing the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector comprises:

6. The method for identifying an abnormal character string according to claim 1, wherein the first deep learning model comprises a first recurrent neural network model, the second deep learning model comprises a convolutional neural network model, and the third deep learning model comprises a second recurrent neural network model.

7. The method for identifying an abnormal character string according to any one of claims 1 to 4 or claim 6, wherein the converting the original character string into a phonetic symbol string comprises:

8. An apparatus for identifying an abnormal character string, comprising:

9. The apparatus for identifying an abnormal character string according to claim 8, wherein the standardized character string generating unit includes:

10. The apparatus for identifying an abnormal character string according to claim 8, wherein the standardized character string generating unit includes:

the abnormal character string recognition unit includes:

11. The apparatus for identifying an abnormal character string according to claim 10, wherein the standardized character string generating unit further comprises:

the abnormal character string recognition unit further includes:

12. The apparatus for identifying an abnormal character string according to claim 9 or 11, wherein the feature vector fusion subunit is adapted to connect the first deep learning feature vector, the second deep learning feature vector and the third deep learning feature vector end to end.

13. The apparatus for identifying an abnormal character string according to claim 8, wherein the first deep learning model comprises a first recurrent neural network model, the second deep learning model comprises a convolutional neural network model, and the third deep learning model comprises a second recurrent neural network model.

14. The apparatus for identifying an abnormal character string according to any one of claims 8 to 11 or 13, wherein the second original character string converting unit is adapted to convert the original character string into a phonetic symbol string corresponding to a subject language type according to the subject language type of the original character string.

15. A data processing apparatus comprising a memory and a processor; wherein the memory is adapted to store one or more computer instructions, and wherein the processor executes the computer instructions to perform the steps of the method according to any one of claims 1 to 7, and when the identification result is that an abnormal character string exists, outputs an abnormal prompt message according to a preset setting.

16. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the method of any one of claims 1 to 7.