CN112989810A - Text information identification method and device, server and storage medium - Google Patents


Info

Publication number
CN112989810A
Authority
CN
China
Prior art keywords
text, recognized, recognition, pinyin, abnormal content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911304665.0A
Other languages
Chinese (zh)
Other versions
CN112989810B (en)
Inventor
周侃
郭庆
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911304665.0A
Publication of CN112989810A
Application granted
Publication of CN112989810B
Legal status: Active
Anticipated expiration

Abstract

The disclosure relates to a text information recognition method and apparatus, a server, and a storage medium, in the field of text processing. First, text type conversion processing is performed on a text to be recognized to obtain at least one corresponding converted text. Content recognition is then performed on the text to be recognized and on the at least one converted text, respectively, to obtain corresponding first recognition results. Finally, whether abnormal content exists in the text to be recognized is determined based on each first recognition result. Converting the text enriches the content available for recognition, so recognition of texts containing abnormal content is more accurate: even if such a text has been transformed or disguised in a social application, the text recognition model can still detect the abnormal content in the transformed text, so that samples containing abnormal content are accurately masked.

Description

Text information identification method and device, server and storage medium
Technical Field
The present disclosure relates to the field of text processing, and in particular, to a method and an apparatus for identifying text information, a server, and a storage medium.
Background
With the development of the mobile Internet, social applications installed on user terminals have advanced greatly. Most such applications include functions for editing personal profiles, publishing personal updates, posting comments, and the like, so that users can present themselves to others from different angles. However, in order to attract attention or obtain illegal benefits, some users publish profiles, updates, or comments that violate laws or ethics and adversely affect the network environment. It is therefore necessary to mask such violating descriptions in social applications.
In the related art, a violation word bank is established, and violating texts are determined and masked by matching descriptions in the social application against the contents of the word bank. However, if the word bank is not rich enough, or if a user learns its specific contents and transforms the descriptions published in the application, the matching can be bypassed, so masking violating descriptions in this way is not accurate enough.
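As a toy illustration of why exact word-bank matching is easily bypassed (the banned entry and the example texts below are made-up placeholders, not from the disclosure):

```python
# Illustrative only: "违规词" is a hypothetical entry standing in for a
# banned phrase in the violation word bank.
BANNED_WORDS = {"违规词"}

def word_bank_match(text: str) -> bool:
    """Exact substring matching against the violation word bank."""
    return any(word in text for word in BANNED_WORDS)

print(word_bank_match("这里有违规词"))            # caught by matching: True
print(word_bank_match("这里有wei gui ci"))        # bypassed via pinyin: False
```

Rewriting the phrase in pinyin defeats the substring match, which is exactly the gap the type-conversion recognition below targets.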
Disclosure of Invention
The disclosure provides a text information recognition method and apparatus, a server, and a storage medium, to at least solve the problem in the related art that masking of violating descriptions in social applications is not accurate enough. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a text information recognition method, including:
acquiring a text to be recognized;
performing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
respectively performing content recognition on the text to be recognized and the at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for indicating whether abnormal content exists in each corresponding text;
and determining whether abnormal content exists in the text to be recognized or not based on each first recognition result.
Optionally, the determining whether the text to be recognized has abnormal content based on each of the first recognition results includes:
and if at least one first recognition result in the first recognition results represents that the corresponding text has abnormal content, determining that the text to be recognized has the abnormal content.
Optionally, the performing a text type conversion process on the text to be recognized to obtain at least one corresponding converted text includes:
if the type of the text comprises a character type, converting the text of the character type into pinyin;
and if the type of the text comprises a Pinyin type, converting the text of the Pinyin type into characters.
Optionally, the performing content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results includes: recognizing the text of the character type by a character recognition model to obtain one first recognition result, and recognizing the text of the pinyin type by a pinyin recognition model to obtain another first recognition result,
wherein the character recognition model is trained in advance on a training sample set composed of historical character samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as the historical character sample it was generated from; and the pinyin recognition model is trained in advance on a training sample set composed of historical pinyin samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as its historical pinyin sample.
Optionally, the method further comprises:
if the text to be recognized is determined to have no abnormal content, processing the text of the character type into a text vector through a character embedding model and processing the text of the pinyin type into a text vector through a pinyin embedding model;
determining the similarity between each processed text vector and a plurality of historical negative-class text vectors in a preset negative-class text vector library, wherein a historical negative-class text vector is the vector of a text that was previously judged by content recognition to contain no abnormal content but that actually does contain abnormal content;
and determining a second recognition result aiming at the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized or not.
Optionally, if the second recognition result represents that there is no abnormal content in the text to be recognized and there is actually abnormal content in the text to be recognized, adding the text vector of the text to be recognized into a preset historical negative text vector library.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for recognizing text information, including:
an information acquisition unit configured to perform acquisition of a text to be recognized;
the text conversion unit is configured to execute text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
the text recognition unit is configured to perform content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results, wherein the first recognition results are used for indicating whether abnormal content exists in each corresponding text;
and the result determining unit is configured to determine whether abnormal content exists in the text to be recognized based on each first recognition result.
Optionally, the result determining unit is specifically configured to determine that abnormal content exists in the text to be recognized if at least one of the first recognition results represents that the text to be recognized or a converted text has abnormal content.
Optionally, the text conversion unit is specifically configured to perform
If the type of the text comprises a character type, converting the text of the character type into pinyin; and if the type of the text comprises a Pinyin type, converting the text of the Pinyin type into characters.
Optionally, the text recognition unit is specifically configured to recognize the text of the character type by the character recognition model to obtain one first recognition result, and to recognize the text of the pinyin type by the pinyin recognition model to obtain another first recognition result,
wherein the character recognition model is trained in advance on a training sample set composed of historical character samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as the historical character sample it was generated from; and the pinyin recognition model is trained in advance on a training sample set composed of historical pinyin samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as its historical pinyin sample.
Optionally, the apparatus further comprises:
a text vector generation unit configured to process the character-type text into a text vector via a character embedding model, and the pinyin-type text into a text vector via a pinyin embedding model, if it is determined that no abnormal content exists in the text to be recognized;
the similarity determining unit is configured to determine the similarity between each processed text vector and a plurality of historical negative-class text vectors in a preset negative-class text vector library, wherein a historical negative-class text vector is the vector of a text that was previously judged by content recognition to contain no abnormal content but that actually does contain abnormal content;
the result determining unit is configured to determine a second recognition result aiming at the text to be recognized according to the obtained plurality of similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
Optionally, the apparatus further comprises: and the text vector adding unit is configured to add the text vector of the text to be recognized into a preset historical negative text vector library if the second recognition result represents that the abnormal content does not exist in the text to be recognized and the abnormal content actually exists in the text to be recognized.
According to a third aspect of the embodiments of the present disclosure, there is provided a server, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method for recognizing text information according to the first aspect of the embodiments of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a storage medium, wherein, when instructions in the storage medium are executed by a processor of a server, the server is enabled to perform the text information recognition method according to the first aspect of the embodiments of the present disclosure.
In a fifth aspect, the embodiments of the present disclosure also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the functions performed by the server according to the third aspect of the embodiments of the present disclosure.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects. First, text type conversion processing is performed on the text to be recognized to obtain at least one corresponding converted text; content recognition is then performed on the text to be recognized and on the at least one converted text, respectively, to obtain corresponding first recognition results; and whether abnormal content exists in the text to be recognized is determined based on each first recognition result. Converting the text enriches the content available for recognition, so recognition of texts containing abnormal content is more accurate: even if such a text has been transformed or disguised in a social application, the text recognition model can still detect the abnormal content in the transformed text, so that samples containing abnormal content are accurately masked.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating interaction of a user terminal with a server in accordance with an illustrative embodiment;
FIG. 2 is a flow diagram illustrating a method of recognition of textual information in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of recognition of textual information in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating a text embedding model in accordance with an exemplary embodiment;
fig. 5 is a detailed flowchart of S21 in fig. 3;
FIG. 6 is a flow diagram illustrating a method of recognition of textual information in accordance with an exemplary embodiment;
FIG. 7 is a flow diagram illustrating a method of recognition of textual information in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating a text recognition model in accordance with an exemplary embodiment;
fig. 9 is a detailed flowchart of S82 in fig. 7;
FIG. 10 is a block diagram illustrating a text recognition apparatus in accordance with an exemplary embodiment;
FIG. 11 is a block diagram illustrating a text recognition apparatus in accordance with an exemplary embodiment;
FIG. 12 is a block diagram illustrating a text recognition apparatus in accordance with an exemplary embodiment;
FIG. 13 is a block diagram illustrating a text recognition apparatus in accordance with an exemplary embodiment;
FIG. 14 is a block diagram illustrating a text recognition apparatus in accordance with an exemplary embodiment;
FIG. 15 is a block diagram illustrating a server in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 2 is a flowchart illustrating a text information recognition method according to an exemplary embodiment. The method is applied to a server 102; as shown in fig. 1, the server 102 and a user terminal 101 installed with a target application program are communicatively connected through a network 300 for interaction. The target application program may include functions for editing personal profiles, publishing personal updates, and posting comments as text, and may be, for example, an application such as WeChat, QQ, Taobao, or Sing Bar. Specifically, as shown in fig. 2, the text information recognition method includes the following steps:
s11: and acquiring a text to be recognized.
After a user publishes text on the text publishing interface of the target application program, the application uploads the published text to the server 102 as the text to be recognized. For example, after a user posts a comment on the Moments interface of WeChat, WeChat uploads the comment to the server 102; similarly, after a user publishes a product profile on a product display page of Taobao, Taobao uploads the profile to the server 102.
S12: and executing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text. The semantic similarity between the converted text and the text to be recognized is higher than a certain threshold, and the expression forms of all the texts in at least one converted text are different. For example, the conversion process may be: for example, a text to be recognized of a character type is converted into a pinyin text, or synonym replacement is performed on key words in the text to be recognized so as to convert the text to a text with similar semantics with the text; or, similar font replacement is carried out on the keywords in the text to be recognized so as to convert the keywords into the text with similar font to the text, or the text to be recognized with Pinyin type is converted into the text of characters, and the like.
S13: and respectively carrying out content identification on the text to be identified and at least one converted text to obtain corresponding first identification results, wherein the first identification results are used for indicating whether abnormal content exists in each corresponding text.
For example, content recognition is performed on the text of the character type and the text of the pinyin type after the text of the character type is converted, so that a first recognition result corresponding to the text of the character type and a first recognition result corresponding to the text of the pinyin type are obtained. The first recognition result corresponding to the text of the character type and the first recognition result corresponding to the text of the pinyin type can be the same or different.
The first recognition result represents whether abnormal content exists in the character-type text or in the pinyin-type text. There are two cases: the first recognition result represents that abnormal content exists in the text to be recognized, or it represents that no abnormal content exists. A text containing abnormal content may be one with negative characteristics such as vulgarity, moral corruption, pornography, or violence; a text without abnormal content is one that complies with national law, public morality, and the like.
In the embodiment of the present disclosure, the method for identifying whether there is abnormal content in the text to be identified may be implemented by using a pre-trained text identification model, or may also be implemented by using a pre-created text library, or by using a combination of a text classification model and a text library, and the like, which is not limited herein.
S14: and determining whether the text to be recognized has abnormal content or not based on each first recognition result.
For example, if one of the first recognition results represents that the recognized text has abnormal content, it is determined that the text to be recognized has the abnormal content.
Optionally, S14 may include determining that there is abnormal content in the text to be recognized if at least one of the first recognition results represents that there is abnormal content in the text to be recognized or the converted text.
For example, if the first recognition result corresponding to the character-type text is that no abnormal content exists in the character-type text, while the first recognition result corresponding to the pinyin-type text is that abnormal content exists, it is determined that abnormal content exists in the text to be recognized.
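The decision rule of S14 can be sketched as follows (a minimal illustration; the function name and the boolean encoding of the first recognition results are assumptions):

```python
def has_abnormal_content(first_results: list[bool]) -> bool:
    """Flag the text to be recognized if ANY corresponding first
    recognition result reports abnormal content (the OR rule of S14)."""
    return any(first_results)

# Character-type text judged clean, its pinyin conversion judged abnormal:
print(has_abnormal_content([False, True]))   # True -> text is flagged
print(has_abnormal_content([False, False]))  # False -> text passes to the second screening
```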
The text information recognition method first performs text type conversion processing on the text to be recognized to obtain at least one corresponding converted text; then performs content recognition on the text to be recognized and on the at least one converted text, respectively, to obtain corresponding first recognition results; and determines whether abnormal content exists in the text to be recognized based on each first recognition result. Converting the text enriches the content available for recognition, so recognition of texts containing abnormal content is more accurate: even if such a text has been transformed or disguised in a social application, the text recognition model can still detect the abnormal content in the transformed text, so that samples containing abnormal content are accurately masked.
As one embodiment, S12 may be: if the type of the text comprises a character type, converting the text of the character type into pinyin, and if the type of the text comprises a pinyin type, converting the text of the pinyin type into the character.
For example, if the content of the text to be recognized is a text of a character that "i love watching live", the text is converted into a pinyin text of "wo ai kan zhi bo"; for another example, if the content of the text to be recognized is the pinyin text of "wo ai kan zhi bo", the conversion generates a text of the word "i love to see live".
In addition, the text to be recognized may be converted into a text with similar semantics, or into a text composed of visually similar characters. For example, if the text to be recognized is the character text "I love watching live streams", conversion may generate the semantically similar text "I like watching live streams", or a text in which individual characters are replaced by visually similar ones.
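The character-to-pinyin and pinyin-to-character conversions above can be sketched with a toy mapping table. A real system would use a complete conversion library; the table here covers only this example and the reverse lookup ignores pinyin ambiguity (one syllable can map to many characters), so both are illustrative assumptions:

```python
# Toy conversion table covering only the running example "我爱看直播"
# ("I love watching live streams"); a production system needs a full
# dictionary and heteronym handling.
CHAR_TO_PINYIN = {"我": "wo", "爱": "ai", "看": "kan", "直": "zhi", "播": "bo"}
PINYIN_TO_CHAR = {v: k for k, v in CHAR_TO_PINYIN.items()}  # ambiguity ignored

def chars_to_pinyin(text: str) -> str:
    """Character-type text -> space-separated pinyin text."""
    return " ".join(CHAR_TO_PINYIN.get(ch, ch) for ch in text)

def pinyin_to_chars(text: str) -> str:
    """Pinyin-type text -> character text."""
    return "".join(PINYIN_TO_CHAR.get(syl, syl) for syl in text.split())

print(chars_to_pinyin("我爱看直播"))        # wo ai kan zhi bo
print(pinyin_to_chars("wo ai kan zhi bo"))  # 我爱看直播
```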
Specifically, S13 may include: and identifying the text of the character type by the character identification model to obtain one first identification result, and identifying the text of the pinyin type by the pinyin identification model to obtain the other first identification result.
The character recognition model is trained in advance on a training sample set composed of historical character samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as the historical character sample it was generated from. The pinyin recognition model is likewise trained in advance on a training sample set composed of historical pinyin samples carrying category labels and the adversarial texts of those samples, each adversarial text carrying the same category label as its historical pinyin sample.
An adversarial sample can be generated in advance from a historical character sample or a historical pinyin sample by, but not limited to, the following ways:
1. removing invisible characters (e.g. spaces, tabs, etc.)
2. If the historical text sample is a character sample, replacing single characters in the historical text sample with pinyin
3. If the historical text sample is a character sample, the single character in the historical text sample is replaced by a homophone or harmonic character
4. Converting full angle to half angle, or converting half angle to full angle
5. Replacing words with synonyms or synonyms
6. Adjacent single characters in the exchange words
7. If the historical text sample is a character sample, the Arabic numerals in the historical text sample are converted into Chinese characters, and if the historical text sample is a pinyin sample, the Arabic numerals in the historical text sample are converted into pinyin
8. Deleting deficiency words such as tone words
9. Deleting a character in a particular word (e.g. noun)
Each of ways 1-9 above can be set to execute with a different probability when modifying a historical text sample, e.g. way 1 with probability 70%, way 2 with probability 80%, way 3 with probability 85%, and so on, and each way can be applied multiple times to the same sample. Specifically, for each historical text sample, the operations of ways 1-4 may be applied in turn with their respective probabilities, each operation selecting several characters; then, for each word in the sample, the operations of ways 5-9 may be applied in turn with their respective probabilities. Of course, adversarial sample generation is not limited to the above ways, which are merely illustrative.
Optionally, to preserve the integrity and readability of the resulting adversarial sample, a maximum number of modifications is set for each historical text sample and for each word within it; when the maximum is reached, modification of the sample stops.
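A minimal sketch of this probabilistic augmentation, implementing only ways 1 and 6 above (the specific probabilities, the modification cap, and all function names are illustrative assumptions, not the disclosure's implementation):

```python
import random

def remove_invisible(t: str) -> str:
    """Way 1: strip spaces, tabs and other invisible characters."""
    return "".join(c for c in t if not c.isspace())

def swap_adjacent(t: str) -> str:
    """Way 6: swap one randomly chosen pair of adjacent characters."""
    if len(t) < 2:
        return t
    i = random.randrange(len(t) - 1)
    return t[:i] + t[i + 1] + t[i] + t[i + 2:]

# (probability, transform) pairs; the probabilities follow the 70%/80%
# examples in the text but are otherwise arbitrary.
TRANSFORMS = [(0.7, remove_invisible), (0.8, swap_adjacent)]

def make_adversarial(sample: str, max_edits: int = 3, seed: int = 0) -> str:
    """Apply each transform with its probability, capped at max_edits
    modifications per sample, as described above."""
    random.seed(seed)
    edits = 0
    for prob, transform in TRANSFORMS:
        if edits >= max_edits:
            break  # maximum modification count reached: stop modifying
        if random.random() < prob:
            sample = transform(sample)
            edits += 1
    return sample
```

The generated sample keeps the original's category label when added to the training set.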
Optionally, a determination that no abnormal content exists in the text to be recognized does not guarantee that none exists: the character recognition model or the pinyin recognition model may have erred. The text to be recognized therefore needs further screening. Thus, as shown in fig. 3, the method further comprises:
S21: determining whether abnormal content exists in the text to be recognized; if not, executing S22, and if so, executing S25.
S22: text of a word type is processed into a text vector via a word embedding model and text of a pinyin type is processed into a text vector via a pinyin embedding model.
As shown in fig. 4, the character embedding model and the pinyin embedding model each comprise a feature vector extraction layer, at least one encoding layer (three encoding layers in fig. 4), and a fully connected layer, connected in sequence. As shown in fig. 5, the process of processing a character-type text into a text vector via the character embedding model, or a pinyin-type text into a text vector via the pinyin embedding model, includes:
S211: inputting a training sample set, composed of text samples to be trained carrying category labels and adversarial samples to be trained carrying category labels, into the feature vector extraction layer.
The text sample to be trained may be a text sample or a pinyin sample, and is not limited herein.
S212: and converting the training samples in the training sample set into text feature vectors carrying position information through a feature vector extraction layer.
S213: feature interaction is performed on the text feature vectors via at least one encoding layer.
S214: and carrying out full connection on the text feature vectors after feature interaction through a full connection layer.
A character embedding model or a pinyin embedding model may be generated through S211-S214: when the text samples to be trained are character samples, a character embedding model is generated, and when they are pinyin samples, a pinyin embedding model is generated.
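A shape-level sketch of the forward pass of such an embedding model follows. All dimensions, the random weights, the tanh activations, and the mean-pooling step are illustrative assumptions; the tanh layers merely stand in for whatever feature interaction the encoding layers of FIG. 4 actually perform:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, N_LAYERS, MAX_LEN = 100, 16, 3, 50  # illustrative sizes

emb = rng.normal(size=(VOCAB, DIM))     # feature vector extraction: token table
pos = rng.normal(size=(MAX_LEN, DIM))   # position information added in S212
enc_w = [rng.normal(size=(DIM, DIM)) for _ in range(N_LAYERS)]  # 3 encoding layers
fc_w = rng.normal(size=(DIM, DIM))      # fully connected layer

def embed(token_ids: list[int]) -> np.ndarray:
    x = emb[token_ids] + pos[: len(token_ids)]  # S212: features + position info
    for w in enc_w:                             # S213: stacked encoding layers
        x = np.tanh(x @ w)                      # stand-in for feature interaction
    return np.tanh(x.mean(axis=0) @ fc_w)       # S214: pool, then full connection

vec = embed([3, 7, 42])
print(vec.shape)  # (16,)
```

The resulting fixed-length vector is what the similarity comparison in S23 operates on.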
S23: and determining the similarity between the processed text vector and a plurality of historical negative text vectors in a preset negative text vector library.
A historical negative-class text vector is the vector, produced by the character embedding model or the pinyin embedding model, of a text that was previously judged to contain no abnormal content but that actually does. Specifically, the similarity between the processed text vector and the plurality of historical negative-class text vectors in the preset negative-class text vector library can be determined by computing cosine similarity.
S24: and determining a second recognition result of the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized or not.
Specifically, the second recognition result of the text to be recognized may be determined as follows: a similarity threshold is set; if any one of the similarities exceeds the threshold, the second recognition result is that abnormal content exists in the text to be recognized; otherwise, the second recognition result is that no abnormal content exists.
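S23-S24 can be sketched as follows. The cosine measure matches the description above, but the threshold value, example vectors, and function names are made up for illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two text vectors (S23)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def second_recognition(text_vec, negative_library, threshold=0.9) -> bool:
    """S24: abnormal (True) if the vector is similar enough to ANY
    historical negative-class vector in the library."""
    return any(cosine(text_vec, neg) > threshold for neg in negative_library)

# Hypothetical 2-D library for illustration; real vectors are much longer.
library = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(second_recognition(np.array([0.99, 0.1]), library))  # True: near first entry
print(second_recognition(np.array([1.0, -1.0]), library))  # False: far from both
```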
The second recognition result is thus used to re-screen texts that the first recognition preliminarily judged free of abnormal content, which improves the recognition accuracy for the text to be recognized and further purifies the network environment.
S25: shielding the text to be recognized.
Optionally, as shown in fig. 6, the method further includes:
S61: when the second recognition result indicates that no abnormal content exists in the text to be recognized but abnormal content actually exists, adding the text vector of the text to be recognized to the preset historical negative-class text vector library.
Understandably, because the negative-class text vectors in the preset library are finite, a second recognition result indicating no abnormal content does not guarantee that the text to be recognized is in fact free of abnormal content. Such texts therefore need further manual screening. If abnormal content is actually present, the text is labeled and its text vector is added to the preset historical negative-class text vector library, enriching the library and laying a foundation for recognizing texts with abnormal content more accurately in the future.
Optionally, before S11, as shown in fig. 7, the method further includes:
S81: acquiring character samples to be trained carrying category identifications and generating, from them, countermeasure samples to be trained carrying category identifications, the category identification of each character sample being the same as that of its countermeasure text; and acquiring pinyin samples to be trained carrying category identifications and generating, from them, countermeasure samples to be trained carrying category identifications, the category identification of each pinyin sample being the same as that of its countermeasure text.
A countermeasure sample can be generated from either a historical character sample or a historical pinyin sample; the generation procedure is the same in both cases and is not described again here.
Optionally, to ensure the integrity and readability of the generated countermeasure samples, a maximum number of modifications per historical text sample, and per word within each sample, needs to be set; once the set maximum is reached, modification of the historical text sample stops.
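A minimal sketch of countermeasure-sample generation with modification caps might look as follows. The substitution table, the cap values, and all names are hypothetical; English placeholder tokens stand in for the characters or pinyin syllables a real system would perturb.

```python
import random

# Hypothetical homophone / visual-confusion substitution table; a real
# system would use a much larger dictionary built for Chinese text.
SUBSTITUTIONS = {"bad": ["b@d", "ba d"], "word": ["w0rd"]}

def make_adversarial(tokens, max_total_edits=3, max_edits_per_word=1, seed=0):
    """Produce a countermeasure variant of a token list, stopping once the
    per-word or per-sample modification caps are reached so the sample
    stays complete and readable."""
    rng = random.Random(seed)
    out = list(tokens)
    total, edits = 0, {}
    for i, tok in enumerate(out):
        if total >= max_total_edits:
            break  # per-sample cap reached: stop modifying
        if tok in SUBSTITUTIONS and edits.get(i, 0) < max_edits_per_word:
            out[i] = rng.choice(SUBSTITUTIONS[tok])
            edits[i] = edits.get(i, 0) + 1
            total += 1
    return out
```

The generated sample keeps the same category identification as its source sample.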
S82: inputting a training sample set formed by the character samples to be trained carrying category identifications and their countermeasure samples into the training network model and training it to generate the character recognition model; and inputting a training sample set formed by the pinyin samples to be trained carrying category identifications and their countermeasure samples into the training network model and training it to generate the pinyin recognition model.
The training network model may be, but is not limited to, a Transformer network model, a logistic regression network model, or a support vector machine (SVM) network model.
The following describes the training process of the character recognition model or the pinyin recognition model by taking the training network model as a Transformer network model as an example.
Specifically, as shown in fig. 8, the training network model includes a feature vector extraction layer, at least one coding layer (3 in fig. 8), at least one fully-connected layer (2 in fig. 8), and a softmax layer, which are connected in sequence. As shown in fig. 9, S82 includes:
S101: inputting a training sample set formed by text samples to be trained carrying category identifications and countermeasure samples to be trained carrying category identifications into the feature vector extraction layer.
S102: converting the input training samples into text feature vectors carrying position information through the feature vector extraction layer.
The feature vector extraction layer comprises an embedding layer and a positional encoding layer: the embedding layer converts the input training sample into text vectors, each word in the training sample is positionally encoded, and each text vector is added to its position encoding result; the per-word results are then spliced to obtain the text feature vectors.
Specifically, the input to the Transformer network model is a sentence. Each word in the sentence is embedded to obtain a word vector, which is added to the positional encoding layer's encoding for that position, giving one vector per word; assume each vector is 512-dimensional, that is, it has 512 elements. If the sentence length is 10, the 10 resulting vectors can be spliced into a matrix of 10 rows and 512 columns.
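The embedding-plus-positional-encoding step can be illustrated as below. Sinusoidal positional encoding is assumed (the disclosure does not fix the encoding scheme), and random numbers stand in for learned word embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd
    dimensions (an assumption; any position encoding could be substituted)."""
    pos = np.arange(seq_len)[:, None]           # word positions 0..seq_len-1
    i = np.arange(d_model)[None, :]             # embedding dimensions
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# 10 words, 512-dimensional embeddings, as in the example above.
seq_len, d_model = 10, 512
embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
features = embeddings + positional_encoding(seq_len, d_model)  # 10 x 512 matrix
```

The splice of the 10 per-word vectors is simply the 10 x 512 `features` matrix.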
S103: feature interaction is performed on the text feature vectors via at least one encoding layer.
Specifically, the input to the coding layer is the spliced 10-row, 512-column matrix described above. Because sentences have different lengths, a maximum sentence length N must be specified. Assuming N is 15, the matrix finally input into the coding layer has 15 rows and 512 columns, obtained by zero-padding: the elements of the last 5 rows are all 0. In the later matrix multiplications, these all-zero rows multiply any column to 0, so the last 5 rows participate in the computation but contribute no useful result (the sentence length is 10, so useful results come from the first 10 rows). If the sentence length is greater than N, the sentence is first truncated to length N.
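The zero-padding and truncation to the maximum length N can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def pad_or_truncate(matrix, max_len):
    """Zero-pad a (seq_len, d_model) feature matrix up to max_len rows,
    or truncate it if the sentence is longer, matching the N = 15 example."""
    seq_len, d_model = matrix.shape
    if seq_len >= max_len:
        return matrix[:max_len]              # truncate long sentences
    pad = np.zeros((max_len - seq_len, d_model))
    return np.vstack([matrix, pad])          # append all-zero rows
```

For a 10-word sentence with N = 15, the last 5 rows of the result are all zeros.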
The input of the first coding layer is a matrix (15 × 512 in this embodiment), and after feature interaction its output is also a 15 × 512 matrix. This output matrix is the input of the second coding layer, which likewise outputs a 15 × 512 matrix after feature interaction, as does the third coding layer. The elements of the 15 rows of the final matrix are then spliced into a single vector of length 15 × 512 = 7680.
S104: processing the feature-interacted text feature vectors through at least one full connection layer to obtain a preliminary recognition result, and normalizing the preliminary recognition result through the softmax layer to generate the network recognition result.
In the embodiment of the disclosure, the feature-interacted text feature vectors are processed through 2 full connection layers. Specifically, the spliced vector of length 15 × 512 = 7680 is input to the first full connection layer, then passed through the second full connection layer to obtain the preliminary recognition result: a two-dimensional vector whose dimensions represent the probability that abnormal content is absent and the probability that it is present. The preliminary recognition result is normalized through the softmax layer to obtain the recognition output, namely the network recognition result.
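The flatten / two-full-connection / softmax pipeline can be sketched with randomly initialized weights. The 7680-dimensional input and 2-dimensional output come from the example above; the hidden width, ReLU activation, and all names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(features, w1, b1, w2, b2):
    """Flatten the 15x512 interacted features into a 7680-vector, pass it
    through two fully connected layers, and normalize into a two-dimensional
    probability vector (no-abnormal-content, abnormal-content)."""
    x = features.reshape(-1)            # 15 * 512 = 7680 elements
    h = np.maximum(0, x @ w1 + b1)      # first full connection layer (ReLU)
    logits = h @ w2 + b2                # second full connection layer -> 2 logits
    return softmax(logits)              # network recognition result
```

The output sums to 1, so each dimension can be read directly as a class probability.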
S106: determining a cross entropy loss function according to the category identification and the network recognition result of each training sample.
Cross entropy is a function that measures the gap between the result output by the training network model and the real result; in the embodiment of the application, the training network model is a binary classification model. The model therefore finally outputs a two-dimensional vector [p, q] with p + q = 1, where p and q are respectively the probabilities that the training sample contains no abnormal content and that it contains abnormal content. Each training sample carries a pre-labeled category identification (namely its real category), which can also be represented by a two-dimensional vector: if the real category of the training sample is the positive class, it can be represented by the vector [1, 0], meaning the probability of the positive class is 1 and that of the negative class is 0; if the real category is the negative class, it can be represented by the vector [0, 1]. If the real category is written [x, y], exactly one of x and y is 1. The cross entropy can then be expressed as L = -x·log p - y·log q. If the real category of the training sample has no abnormal content, then x = 1 and y = 0, and the cross entropy is -log p: if p is very close to 1, then -log 1 = 0 and there is no loss. Conversely, if p = 0.1, then -log 0.1 = 1 and the loss is 1 (logarithms here are taken base 10 for the example; in practice the natural logarithm, i.e. ln = log_e, is usually used).
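The worked base-10 example can be checked numerically. The function below is a minimal sketch of the binary cross entropy L = -x·log p - y·log q; the base-10 choice matches the example, not common practice.

```python
import math

def binary_cross_entropy(x, y, p, q, base=10):
    """Cross entropy for a binary classifier with true label [x, y] and
    predicted probabilities [p, q]; terms with zero label are skipped so
    log(0) is never evaluated."""
    loss = 0.0
    if x:
        loss -= x * math.log(p, base)
    if y:
        loss -= y * math.log(q, base)
    return loss
```

With true label [1, 0]: a prediction of p = 0.1 gives loss 1, and p = 1 gives loss 0, as in the text.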
In addition, it should be noted that the analysis process when the abnormal content exists in the real category of the training sample is similar to the analysis process when the abnormal content does not exist in the real category of the training sample, and is not repeated here.
If the training network model is not a binary classification model but a multi-class classification model, the cross entropy loss can be written as

L = -Σᵢ yᵢ · log aᵢ

where i denotes the i-th class, yᵢ is the real result for class i (exactly one yᵢ is 1 and the others are all 0; for example, if the recognition model classifies training samples into 3 classes, exactly one of y₁, y₂, y₃ is 1), and aᵢ is the probability the training network model assigns to the i-th class.
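The multi-class cross entropy above can be computed as follows (natural logarithm assumed here; the function name is illustrative).

```python
import math

def cross_entropy(y_true, probs):
    """Multi-class cross entropy L = -sum_i y_i * log(a_i), where y_true is
    a one-hot real-category vector and probs are the model's class
    probabilities; zero-label terms are skipped."""
    return -sum(y * math.log(a) for y, a in zip(y_true, probs) if y)
```

Only the term for the true class contributes, so the loss is simply -log of the probability assigned to the correct class.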
S107: determining the gradient of the cross entropy loss function over a plurality of training samples according to a mini-batch gradient descent algorithm.
The gradient is the direction in which a function rises fastest, so updating the network parameters along the gradient direction would increase the loss function; since the loss function should decrease, the network parameters are updated in the direction opposite to the gradient.
S108: updating the network parameters of the training network model according to the gradient.
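A one-dimensional illustration of the update rule: stepping a parameter against the gradient drives the loss down. The quadratic loss and learning rate below are illustrative, not part of the disclosure.

```python
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the toy loss

w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)             # step in the direction opposite the gradient

# w converges toward 3 and the loss toward 0
```

The same rule, applied per mini-batch to every network parameter, is mini-batch gradient descent.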
During training, the historical text samples are divided into a training sample set and a verification sample set. After each round of training on the training sample set, the accuracy on both sets is calculated; then another round of training is run and the accuracies are calculated again. In the early stages, accuracy on both sets rises. In later stages, overfitting may occur: the training-set accuracy keeps rising while the verification-set accuracy falls. If the verification-set accuracy keeps falling over subsequent rounds, overfitting is indicated and training is stopped.
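The stop-on-falling-validation-accuracy rule can be sketched as below. The patience value and the per-epoch accuracy series are illustrative; a real loop would compute the accuracies from the verification sample set each round.

```python
def train_with_early_stopping(epochs, val_accuracies, patience=3):
    """Return the epoch at which training stops: either because validation
    accuracy failed to improve for `patience` consecutive rounds (the
    overfitting signal), or because all epochs completed."""
    best, bad_rounds = -1.0, 0
    for epoch, acc in enumerate(val_accuracies[:epochs]):
        if acc > best:
            best, bad_rounds = acc, 0      # new best: reset the counter
        else:
            bad_rounds += 1                # accuracy did not improve
            if bad_rounds >= patience:
                return epoch               # stop: sustained decline
    return epochs - 1
```

A run whose validation accuracy peaks and then declines stops a few rounds after the peak.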
Fig. 10 is a block diagram illustrating an apparatus 1100 for recognizing text information according to an exemplary embodiment. It should be noted that the basic principle and the resulting technical effect of the text information recognition apparatus 1100 provided in the embodiment of the present application are the same as those of the above embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the above embodiment for the part of the embodiment of the present application that is not mentioned. The apparatus 1100 includes an information acquisition unit 1101, a text conversion unit 1102, a text recognition unit 1103, and a result determination unit 1104, wherein,
an information acquisition unit 1101 configured to perform acquisition of a text to be recognized.
A text conversion unit 1102 configured to perform a text type conversion process on the text to be recognized to obtain at least one corresponding converted text.
A text recognition unit 1103 configured to perform content recognition on the text to be recognized and the at least one converted text, respectively, to obtain corresponding first recognition results, where the first recognition results are used to indicate whether there is abnormal content in each corresponding text.
A result determining unit 1104 configured to perform determining whether or not there is abnormal content in the text to be recognized based on each of the first recognition results.
Optionally, the result determining unit 1104 is specifically configured to determine that there is abnormal content in the text to be recognized if at least one of the first recognition results represents that the text to be recognized or the converted text has abnormal content.
The text information recognition apparatus 1100 implements the following functions: performing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text; performing content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results; and determining, based on each first recognition result, whether abnormal content exists in the text to be recognized. This enriches the content available for recognition, so recognition accuracy on texts containing abnormal content is higher; even if a text containing abnormal content has been converted or altered in a social application, the text recognition model can still recognize the abnormal content in the converted or altered text, so samples containing abnormal content are accurately shielded.
Optionally, as an embodiment, the text conversion unit 1102 is specifically configured to perform, if the type of the text includes a word type, converting the text of the word type into pinyin.
Optionally, the text recognition unit 1103 is specifically configured to perform recognition on the text of the text type by using the text recognition model to obtain one first recognition result, and recognize the text of the pinyin type by using the pinyin recognition model to obtain another first recognition result.
The character recognition model is formed by training in advance according to a training sample set formed by historical character samples carrying category identifications and countermeasure texts of the historical character samples carrying the category identifications, the category identifications of the historical character samples are the same as the category identifications of the countermeasure texts of the historical character samples, the pinyin recognition model is formed by training in advance according to a training sample set formed by historical pinyin samples carrying the category identifications and the countermeasure texts of the historical pinyin samples carrying the category identifications, and the category identifications of the historical pinyin samples are the same as the category identifications of the countermeasure texts of the historical pinyin samples.
Optionally, as shown in fig. 11, the apparatus 1100 further includes:
a text vector generation unit 1201 configured to perform processing of a text of a word type into a text vector via a word embedding model and processing of a text of a pinyin type into a text vector via a pinyin embedding model if it is determined that there is no abnormal content in the text to be recognized.
A similarity determining unit 1202 configured to determine the similarity between the processed text vector and a plurality of historical negative-class text vectors in a preset negative-class text vector library, wherein a historical negative-class text vector is the text vector of a text that the character recognition model or the pinyin recognition model previously determined to contain no abnormal content but that actually does contain abnormal content.
The result determining unit 1104 is further configured to determine, according to the obtained plurality of similarities, a second recognition result of the text to be recognized, the second recognition result indicating whether abnormal content exists in the text to be recognized.
Optionally, as shown in fig. 12, the apparatus 1100 further includes: the text vector adding unit 1301 is configured to add the text vector of the text to be recognized into a preset historical negative type text vector library if the second recognition result represents that the text to be recognized does not have abnormal content and the text to be recognized actually has abnormal content.
Further, the information obtaining unit 1101 is further configured to acquire character samples to be trained carrying category identifications and generate, from them, countermeasure samples to be trained carrying category identifications, the category identification of each character sample being the same as that of its countermeasure text; and to acquire pinyin samples to be trained carrying category identifications and generate, from them, countermeasure samples to be trained carrying category identifications, the category identification of each pinyin sample being the same as that of its countermeasure text.
As shown in fig. 13, the apparatus 1100 further includes: the model training unit 1501 is configured to input a training sample set formed by a to-be-trained character sample carrying a category identifier and a to-be-trained confrontation sample carrying a category identifier into a training network model, and generate a character recognition model through training; inputting a training sample set formed by a pinyin sample to be trained carrying the category identification and a confrontation sample to be trained carrying the category identification into a training network model, and training to generate a pinyin identification model.
Specifically, the training network model comprises a feature vector extraction layer, at least one coding layer, at least one full connection layer and a softmax layer which are connected in sequence. As shown in fig. 14, the model training unit 1501 includes a text input module 1601, a feature vector generation module 1602, a feature interaction module 1603, a two-dimensional feature generation module 1604, a text recognition module 1605, a loss function determination module 1606, a gradient determination module 1607, and a parameter update module 1608, wherein,
and a text input module 1601 configured to perform input of a training sample set composed of a text sample to be trained or a pinyin sample to be trained carrying the category identifier and a confrontation sample to be trained carrying the category identifier into the feature vector extraction layer.
A feature vector generation module 1602 configured to perform the conversion of the training samples into text feature vectors carrying position information via the feature vector extraction layer.
A feature interaction module 1603 configured to perform feature interaction on the text feature vectors via the at least one encoding layer.
The text recognition module 1605 is configured to execute processing on the text feature vectors after feature interaction through at least one full connection layer to obtain a preliminary recognition result, and normalize the preliminary recognition result through the softmax layer to generate a network recognition result.
A loss function determining module 1606 configured to perform a cross entropy loss function determination according to the class identifier and the network identification result of each training sample.
A gradient determination module 1607 configured to perform determining a gradient of a cross entropy loss function of the plurality of training samples according to a mini-batch gradient descent algorithm.
A parameter update module 1608 configured to perform network parameter updating of the trained network model according to the gradient.
With respect to the apparatus 1100 in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 15 is a block diagram illustrating a server 102 for a method of recognition of textual information, according to an example embodiment. Referring to FIG. 15, the server 102 includes a processing component 1701 that further includes one or more processors and memory resources, represented by memory 1702, for storing instructions, such as application programs, that are executable by the processing component 1701. The application programs stored in memory 1702 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1701 is configured to execute instructions to perform the above-described text information recognition method.
For example, the following steps may be performed:
acquiring a text to be identified;
executing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
respectively identifying the content of the text to be identified and at least one converted text to obtain corresponding first identification results, wherein the first identification results are used for indicating whether abnormal content exists in each corresponding text;
and determining whether the text to be recognized has abnormal content or not based on each first recognition result.
The server 102 may also include a power component 1703 configured to perform power management of the server 102, a wired or wireless network interface 1704 configured to connect the server 102 to the network 300, and an input/output (I/O) interface 1705. The server 102 may operate based on an operating system stored in the memory 1702, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 1702 comprising instructions, executable by the processor of the server 102 to perform the method of recognizing text information described above, is also provided. For example, the following steps may be performed:
acquiring a text to be identified;
executing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
respectively identifying the contents of the text to be identified and at least one converted text to obtain corresponding first identification results, wherein the first identification results are used for indicating whether abnormal contents exist in each corresponding text;
and determining whether the text to be recognized has abnormal content or not based on each first recognition result.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the steps of:
acquiring a text to be identified;
executing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
respectively carrying out content identification on the text to be identified and the at least one converted text to obtain corresponding first identification results, wherein the first identification results are used for indicating whether abnormal content exists in each corresponding text;
and determining whether abnormal content exists in the text to be recognized or not based on each first recognition result.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for recognizing text information, the method comprising:
acquiring a text to be identified;
performing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
respectively performing content recognition on the text to be recognized and the at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for indicating whether abnormal content exists in each corresponding text;
and determining whether abnormal content exists in the text to be recognized or not based on each first recognition result.
2. The method according to claim 1, wherein the determining whether the text to be recognized has abnormal content based on each of the first recognition results comprises:
and if at least one first recognition result in the first recognition results represents that the corresponding text has abnormal content, determining that the text to be recognized has the abnormal content.
3. The method according to claim 1, wherein the performing a text type conversion process on the text to be recognized to obtain at least one corresponding converted text comprises:
if the type of the text comprises a character type, converting the text of the character type into pinyin;
and if the type of the text comprises a Pinyin type, converting the text of the Pinyin type into characters.
4. The method according to claim 3, wherein the performing content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results comprises: recognizing the text of the character type by a character recognition model to obtain one first recognition result, and recognizing the text of the pinyin type by a pinyin recognition model to obtain another first recognition result,
the character recognition model is formed by training in advance according to a training sample set formed by historical character samples carrying category identifications and countermeasure texts of the historical character samples carrying the category identifications, the category identifications of the historical character samples are the same as the category identifications of the countermeasure texts of the historical character samples, the pinyin recognition model is formed by training in advance according to a training sample set formed by historical pinyin samples carrying the category identifications and the countermeasure texts of the historical pinyin samples carrying the category identifications, and the category identifications of the historical pinyin samples are the same as the category identifications of the countermeasure texts of the historical pinyin samples.
5. The method of claim 1, further comprising:
if the text to be recognized is determined to have no abnormal content, processing the text of the character type into a text vector through a character embedding model and processing the text of the pinyin type into a text vector through a pinyin embedding model;
determining the similarity between the processed text vectors and a plurality of historical negative-class text vectors in a preset negative-class text vector library, wherein the historical negative-class text vectors are text vectors of texts for which content recognition previously determined that no abnormal content exists but abnormal content actually exists;
and determining a second recognition result aiming at the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized or not.
6. The method according to claim 5, wherein if the second recognition result represents that there is no abnormal content in the text to be recognized but there is actually abnormal content in the text to be recognized, the text vector of the text to be recognized is added into a preset historical negative type text vector library.
7. The method of claim 1, further comprising:
and if the text to be recognized has abnormal content, shielding the text to be recognized.
8. An apparatus for recognizing text information, the apparatus comprising:
an information acquisition unit configured to perform acquisition of a text to be recognized;
the text conversion unit is configured to execute text type conversion processing on the text to be recognized to obtain at least one corresponding converted text;
the text recognition unit is configured to perform content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results, wherein the first recognition results are used for indicating whether abnormal content exists in each corresponding text;
and the result determining unit is configured to determine whether abnormal content exists in the text to be recognized based on each first recognition result.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of recognizing text information according to any one of claims 1 to 7.
10. A storage medium in which instructions are executed by a processor of a server to enable the server to perform the method of recognizing text information according to any one of claims 1 to 7.
CN201911304665.0A 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium Active CN112989810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911304665.0A CN112989810B (en) 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium


Publications (2)

Publication Number Publication Date
CN112989810A true CN112989810A (en) 2021-06-18
CN112989810B CN112989810B (en) 2024-03-12

Family

ID=76343629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911304665.0A Active CN112989810B (en) 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112989810B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011070269A (en) * 2009-09-24 2011-04-07 Hitachi Information Systems Ltd Character conversion device and method, diagram display system and method, and program
CN105574090A (en) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system
CN107291780A (en) * 2016-04-12 2017-10-24 腾讯科技(深圳)有限公司 A kind of user comment information methods of exhibiting and device
CN109766475A (en) * 2018-12-13 2019-05-17 北京爱奇艺科技有限公司 A kind of recognition methods of rubbish text and device

Similar Documents

Publication Publication Date Title
CN110737758B (en) Method and apparatus for generating a model
US11544474B2 (en) Generation of text from structured data
CN112926327B (en) Entity identification method, device, equipment and storage medium
CN109831460B (en) Web attack detection method based on collaborative training
CN111931935B (en) Network security knowledge extraction method and device based on One-shot learning
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN111694826A (en) Data enhancement method and device based on artificial intelligence, electronic equipment and medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
CN111291195A (en) Data processing method, device, terminal and readable storage medium
CN110909531A (en) Method, device, equipment and storage medium for discriminating information security
CN113268576B (en) Deep learning-based department semantic information extraction method and device
Zhang et al. Multifeature named entity recognition in information security based on adversarial learning
CN116450796A (en) Intelligent question-answering model construction method and device
CN112464655A (en) Word vector representation method, device and medium combining Chinese characters and pinyin
CN115759071A (en) Government affair sensitive information identification system and method based on big data
CN112085091A (en) Artificial intelligence-based short text matching method, device, equipment and storage medium
CN112989829B (en) Named entity recognition method, device, equipment and storage medium
CN116402630B (en) Financial risk prediction method and system based on characterization learning
Briciu et al. AutoAt: A deep autoencoder-based classification model for supervised authorship attribution
CN112989810B (en) Text information identification method and device, server and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN115712713A (en) Text matching method, device and system and storage medium
CN115600580B (en) Text matching method, device, equipment and storage medium
Du et al. A Word Vector Representation Based Method for New Words Discovery in Massive Text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant