CN112989810B - Text information identification method and device, server and storage medium - Google Patents

Text information identification method and device, server and storage medium

Info

Publication number: CN112989810B
Authority: CN (China)
Prior art keywords: text, recognition, recognized, pinyin, abnormal content
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201911304665.0A
Other languages: Chinese (zh)
Other versions: CN112989810A
Inventors: 周侃, 郭庆
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201911304665.0A
Publication of CN112989810A
Application granted
Publication of CN112989810B

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure relates to a text information identification method and device, a server, and a storage medium, in the field of text processing. First, text type conversion processing is performed on a text to be recognized to obtain at least one corresponding converted text; content recognition is then performed on the text to be recognized and on the at least one converted text to obtain corresponding first recognition results; and whether abnormal content exists in the text to be recognized is determined based on each first recognition result. Because the conversion enriches the content of the text to be recognized, recognition accuracy for texts containing abnormal content is improved: even if a text containing abnormal content has been converted or altered in a social application, the text recognition model can still recognize the abnormal content after the conversion, so that texts containing abnormal content are accurately masked.

Description

Text information identification method and device, server and storage medium
Technical Field
The disclosure relates to the field of text processing, and in particular to a text information identification method and device, a server, and a storage medium.
Background
With the development of the mobile internet, social applications installed on user terminals have advanced considerably. Most social applications include functions for editing personal profiles, posting personal updates, posting comments, and the like, enabling users to present themselves to others from different angles. However, in order to attract attention or obtain illegal gains, some users publish personal profiles, personal updates, comments, and so on whose descriptions violate laws or public morals, adversely affecting the network environment. It is therefore necessary to mask such violating descriptions in social applications.
In the related art, a violation word library is generally established, and violating text is identified and masked by matching descriptions in a social application against the contents of that library. However, if the library is not rich enough, or if a user learns its specific contents and converts or alters the descriptions in the social application so as to bypass the matching, this approach masks violating descriptions inaccurately.
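The weakness of the related art can be seen in a minimal sketch of word-library matching (the word list and sample texts below are hypothetical placeholders, not contents of any real library):

```python
# Exact substring matching against a violation word library, as in the
# related art. A trivial character-level change bypasses the match.
VIOLATION_WORDS = {"badword"}  # hypothetical library contents

def is_violation(text):
    return any(word in text for word in VIOLATION_WORDS)

print(is_violation("post containing badword"))        # matched -> masked
print(is_violation("post containing b.a.d.w.o.r.d"))  # altered -> bypassed
```

The method of the disclosure counters this evasion by converting the text (e.g. characters to pinyin) before recognition, so that surface-level alterations no longer hide the content.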
Disclosure of Invention
The disclosure provides a text information identification method and device, a server, and a storage medium, to at least solve the problem in the related art that masking of violating descriptions in social applications is not accurate enough. The technical scheme of the present disclosure is as follows:
According to a first aspect of an embodiment of the present disclosure, there is provided a method for identifying text information, including:
acquiring a text to be identified;
performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text;
respectively performing content recognition on the text to be recognized and the at least one converted text to obtain corresponding first recognition results, wherein each first recognition result represents whether abnormal content exists in the corresponding text;
and determining whether abnormal content exists in the text to be identified or not based on each first identification result.
Optionally, the determining whether the abnormal content exists in the text to be identified based on each first identification result includes:
if at least one first recognition result in the first recognition results represents that abnormal content exists in the corresponding text, determining that abnormal content exists in the text to be recognized.
Optionally, the performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text includes:
if the text type includes a character type, converting the character-type text into pinyin; and if the text type includes a pinyin type, converting the pinyin-type text into characters.
Optionally, the performing content recognition on the text to be recognized and the at least one converted text respectively to obtain corresponding first recognition results includes: recognizing the character-type text via a character recognition model to obtain one first recognition result, and recognizing the pinyin-type text via a pinyin recognition model to obtain another first recognition result,
wherein the character recognition model is pre-trained on a training sample set composed of historical character samples carrying category identifiers and the countermeasure texts of those samples, each historical character sample carrying the same category identifier as its countermeasure text; and the pinyin recognition model is pre-trained on a training sample set composed of historical pinyin samples carrying category identifiers and the countermeasure texts of those samples, each historical pinyin sample carrying the same category identifier as its countermeasure text.
Optionally, the method further comprises:
if it is determined that no abnormal content exists in the text to be recognized, processing the character-type text into a text vector via a character embedding model and processing the pinyin-type text into a text vector via a pinyin embedding model;
determining similarities between the text vector obtained by the processing and a plurality of historical negative text vectors in a preset negative text vector library, wherein a historical negative text vector is the text vector of a text that was previously determined by content recognition to contain no abnormal content but that actually contains abnormal content;
and determining a second recognition result aiming at the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
Optionally, if the second recognition result indicates that no abnormal content exists in the text to be recognized but abnormal content actually exists, the text vector of the text to be recognized is added to the preset historical negative text vector library.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for recognizing text information, including:
an information acquisition unit configured to perform acquisition of a text to be recognized;
a text conversion unit configured to perform conversion processing of text types on the text to be recognized to obtain at least one corresponding converted text;
the text recognition unit is configured to perform content recognition on the text to be recognized and the at least one converted text respectively so as to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in the corresponding texts;
And a result determining unit configured to perform determination of whether or not abnormal content exists in the text to be recognized based on each of the first recognition results.
Optionally, the result determining unit is specifically configured to determine that abnormal content exists in the text to be recognized if at least one of the first recognition results represents that abnormal content exists in the text to be recognized or in a converted text.
Optionally, the text conversion unit is specifically configured to perform
if the text type includes a character type, converting the character-type text into pinyin; and if the text type includes a pinyin type, converting the pinyin-type text into characters.
Optionally, the text recognition unit is specifically configured to recognize the character-type text via a character recognition model to obtain one of the first recognition results, and to recognize the pinyin-type text via a pinyin recognition model to obtain another first recognition result,
wherein the character recognition model is pre-trained on a training sample set composed of historical character samples carrying category identifiers and the countermeasure texts of those samples, each historical character sample carrying the same category identifier as its countermeasure text; and the pinyin recognition model is pre-trained on a training sample set composed of historical pinyin samples carrying category identifiers and the countermeasure texts of those samples, each historical pinyin sample carrying the same category identifier as its countermeasure text.
Optionally, the apparatus further comprises:
a text vector generation unit configured to process character-type text into a text vector via a character embedding model and to process pinyin-type text into a text vector via a pinyin embedding model if it is determined that no abnormal content exists in the text to be recognized;
a similarity determining unit configured to determine similarities between the text vector obtained by the processing and a plurality of historical negative text vectors in a preset negative text vector library, wherein a historical negative text vector is the text vector of a text that was previously determined by content recognition to contain no abnormal content but that actually contains abnormal content;
the result determining unit is configured to determine a second recognition result for the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
Optionally, the apparatus further comprises: a text vector adding unit configured to add the text vector of the text to be recognized to the preset historical negative text vector library if the second recognition result indicates that no abnormal content exists in the text to be recognized but abnormal content actually exists.
According to a third aspect of embodiments of the present disclosure, there is provided a server comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method for identifying text information according to the first aspect of the embodiments of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a storage medium, where instructions in the storage medium are executed by a processor of a server, so that the server can perform the method for identifying text information according to the first aspect of the embodiments of the present disclosure.
In a fifth aspect, the disclosed embodiments also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the functions as performed by the server of the third aspect of the disclosed embodiments.
The technical scheme provided by the embodiments of the disclosure brings at least the following beneficial effects. First, text type conversion processing is performed on the text to be identified to obtain at least one corresponding converted text; content recognition is then performed on the text to be recognized and on the at least one converted text to obtain corresponding first recognition results; and whether abnormal content exists in the text to be recognized is determined based on each first recognition result. Because the conversion enriches the content of the text to be recognized, recognition accuracy for texts containing abnormal content is improved: even if a text containing abnormal content has been converted or altered in a social application, the text recognition model can still recognize the abnormal content after the conversion, so that texts containing abnormal content are accurately masked.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram illustrating interactions of a user terminal with a server according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of identifying text information according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a method of identifying text information according to an exemplary embodiment;
FIG. 4 is a block diagram of a text embedding model according to an exemplary embodiment;
FIG. 5 is a specific flowchart of S21 in FIG. 3;
FIG. 6 is a flowchart illustrating a method of identifying text information according to an exemplary embodiment;
FIG. 7 is a flowchart illustrating a method of identifying text information according to an exemplary embodiment;
FIG. 8 is a block diagram of a text recognition model according to an exemplary embodiment;
FIG. 9 is a specific flowchart of S82 in FIG. 7;
FIG. 10 is a block diagram of a text recognition device according to an exemplary embodiment;
FIG. 11 is a block diagram of a text recognition device according to an exemplary embodiment;
FIG. 12 is a block diagram of a text recognition device according to an exemplary embodiment;
FIG. 13 is a block diagram of a text recognition device according to an exemplary embodiment;
FIG. 14 is a block diagram of a text recognition device according to an exemplary embodiment;
FIG. 15 is a block diagram of a server according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
FIG. 1 is a schematic diagram illustrating, according to an exemplary embodiment, the interaction between a server 102 and a user terminal 101 in which a target application is installed; the server 102 is communicatively connected to the user terminal 101 through a network 300. The target application may include functions such as editing personal profiles, posting personal updates, and posting comments. For example, the target application may be WeChat, QQ, Taobao, Song Bar, or a similar application. Specifically, as shown in FIG. 2, the text information recognition method applied to the server 102 includes the following steps:
S11: acquiring a text to be identified.
After a user publishes text at the text publishing interface of the target application, the target application uploads the published text, as text to be identified, to the server 102. For example, after a user posts a comment on the Moments (friend circle) interface of WeChat, WeChat uploads the posted comment to the server 102; likewise, after a user posts a commodity profile on the commodity display interface of Taobao, Taobao uploads the posted profile to the server 102.
S12: performing text type conversion processing on the text to be recognized to obtain at least one corresponding converted text. The semantic similarity between each converted text and the text to be recognized is higher than a certain threshold, and the texts in the at least one converted text differ from one another in form of expression. The conversion processing may be, for example: converting a character-type text to be recognized into pinyin text; replacing keywords in the text to be recognized with synonyms, so as to convert it into a text with similar semantics; replacing keywords in the text to be recognized with visually similar characters, so as to convert it into a text with similar glyphs; or converting a pinyin-type text to be recognized into characters; and so on.
S13: and respectively carrying out content recognition on the text to be recognized and at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for indicating whether abnormal content exists in the corresponding texts.
For example, content recognition is performed both on the character-type text and on the pinyin-type text obtained by converting it, so as to obtain a first recognition result corresponding to the character-type text and a first recognition result corresponding to the pinyin-type text. These two first recognition results may be the same or different.
The first recognition result represents whether abnormal content exists in the character-type text or the pinyin-type text. The first recognition result covers two cases: in the first, it represents that abnormal content exists in the text to be recognized; in the second, it represents that no abnormal content exists. Text to be recognized containing abnormal content may be text with negative effects, such as vulgar, morally corrupting, pornographic, or violent content; text without abnormal content may be positive text conforming to moral and legal norms, such as citizens' free expression.
In the embodiment of the disclosure, whether abnormal content exists in the text to be recognized may be identified by a pre-trained text recognition model, according to a pre-created text library, or by a combination of a text classification model and a text library, among other options, which is not limited herein.
S14: and determining whether abnormal content exists in the text to be identified or not based on each first identification result.
For example, if one of the first recognition results indicates that the recognized text has abnormal content, determining that the text to be recognized has abnormal content.
Optionally, S14 may include: determining that abnormal content exists in the text to be recognized if at least one of the first recognition results represents that abnormal content exists in the text to be recognized or in a converted text.
For example, the first recognition result corresponding to the text of the character type is: no abnormal content exists in the text of the character type; abnormal content exists in a first recognition result corresponding to the pinyin type text; then it is determined that there is abnormal content in the text to be identified.
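The determination in S14 reduces to a logical OR over the first recognition results. A minimal sketch, with the function name and boolean encoding being illustrative rather than taken from the patent:

```python
def has_abnormal_content(first_results):
    # S14: abnormal content is present in the text to be recognized if at
    # least one first recognition result reports abnormal content.
    return any(first_results)

# e.g. character-type result: normal (False); pinyin-type result: abnormal (True)
print(has_abnormal_content([False, True]))   # True
print(has_abnormal_content([False, False]))  # False
```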
The text information identification method comprises the steps of firstly, executing text type conversion processing on a text to be identified to obtain at least one corresponding converted text; respectively carrying out content recognition on the text to be recognized and at least one converted text to obtain corresponding first recognition results; based on each first recognition result, whether the abnormal content exists in the text to be recognized is determined, and the content of the text to be recognized is enriched, so that the recognition accuracy of the text to be recognized with the abnormal content is higher, and even if the text to be recognized with the abnormal content is converted and changed in social application, the text recognition model can recognize that the abnormal content exists in the text to be recognized after the conversion and change, so that the sample to be recognized with the abnormal content is accurately shielded.
As one embodiment, S12 may be: if the type of the text includes a character type, converting the character-type text into pinyin; and if the type of the text includes a pinyin type, converting the pinyin-type text into characters.
For example, if the content of the text to be recognized is the character text "我爱看直播" ("I love watching livestreams"), it is converted into the pinyin text "wo ai kan zhi bo"; conversely, if the content of the text to be recognized is the pinyin text "wo ai kan zhi bo", it is converted into the character text "我爱看直播".
In addition, the text to be recognized can be converted into a text with similar semantics, or into a text with visually similar characters. For example, keywords in the text may be replaced with synonyms to produce a semantically similar text, or replaced with characters of similar glyphs to produce a visually similar text.
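The character-to-pinyin conversion can be sketched as follows. A production system would use a full pinyin library (e.g. pypinyin); the hand-written mapping below is a hypothetical stand-in covering only the example sentence from the text:

```python
# Hypothetical stand-in for a character-to-pinyin dictionary; covers only
# the example sentence. Characters outside the mapping pass through as-is.
CHAR_TO_PINYIN = {"我": "wo", "爱": "ai", "看": "kan", "直": "zhi", "播": "bo"}

def to_pinyin(text):
    return " ".join(CHAR_TO_PINYIN.get(ch, ch) for ch in text)

print(to_pinyin("我爱看直播"))  # wo ai kan zhi bo
```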
Specifically, S13 may include: recognizing the character-type text via a character recognition model to obtain one first recognition result, and recognizing the pinyin-type text via a pinyin recognition model to obtain another first recognition result.
The character recognition model is pre-trained on a training sample set composed of historical character samples carrying category identifiers and the countermeasure texts of those samples, each historical character sample carrying the same category identifier as its countermeasure text; the pinyin recognition model is pre-trained on a training sample set composed of historical pinyin samples carrying category identifiers and the countermeasure texts of those samples, each historical pinyin sample carrying the same category identifier as its countermeasure text.
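The per-type dispatch in S13 can be sketched as below; the two toy "models" are hypothetical callables standing in for the trained recognizers, and the flagged pinyin token is an invented placeholder:

```python
def first_results(char_text, pinyin_text, char_model, pinyin_model):
    # S13: each text type is recognized by its own model; the returned list
    # holds one first recognition result per text (True = abnormal).
    return [char_model(char_text), pinyin_model(pinyin_text)]

# Toy stand-in models flagging a hypothetical abnormal token.
char_model = lambda t: "badword" in t
pinyin_model = lambda t: "huaici" in t  # hypothetical abnormal pinyin token

print(first_results("clean text", "you hai huaici", char_model, pinyin_model))  # [False, True]
```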
The countermeasure samples can be generated in advance from the historical character samples or historical pinyin samples in ways including, but not limited to, the following:
1. Removing invisible characters (e.g. spaces, tabs, etc.)
2. If the historical text sample is a character sample, replacing a single character in it with its pinyin
3. If the historical text sample is a character sample, replacing a single character in it with a homophone or near-homophone
4. Converting full-width characters into half-width ones, or half-width into full-width
5. Replacing words with synonyms or near-synonyms
6. Exchanging adjacent characters within a word
7. If the historical text sample is a character sample, converting Arabic numerals in it into Chinese characters; if it is a pinyin sample, converting Arabic numerals in it into pinyin
8. Deleting function words (e.g. Chinese particles)
9. Deleting a character in a particular type of word (e.g. a noun)
Each of the above modes 1-9 can be applied with a different probability to modify a historical text sample; for example, mode 1 with a probability of 70%, mode 2 with a probability of 80%, mode 3 with a probability of 85%, and so on, and each mode can be applied multiple times to the same historical text sample. Specifically, for each historical text sample, the operations in modes 1 to 4 may be performed in sequence, each with its own probability, with several words selected for each operation; then the operations in modes 5 to 9 may be performed in sequence on each word in the sample, each with its own probability. Of course, generating countermeasure samples is not limited to the manner described above, which is merely illustrative.
Optionally, in order to ensure the integrity and readability of the resulting countermeasure samples, a maximum number of modifications per historical text sample and a maximum number of modifications per word within a sample need to be set; when the set maximum is reached, modification of the historical text sample stops.
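The probability-gated perturbation with a modification cap can be sketched as below. The two example modes, the probabilities, and the cap are illustrative placeholders, not values from the patent:

```python
import random

def perturb(text, modes, probs, max_edits, rng):
    # Apply each perturbation mode with its own probability, stopping once
    # the modification cap (which preserves readability) is reached.
    edits = 0
    for mode, p in zip(modes, probs):
        if edits >= max_edits:
            break
        if rng.random() < p:
            text = mode(text)
            edits += 1
    return text

def remove_spaces(t):   # mode 1: strip invisible characters
    return t.replace(" ", "")

def swap_adjacent(t):   # mode 6: exchange two adjacent characters
    return t[1] + t[0] + t[2:] if len(t) > 1 else t

rng = random.Random(0)
print(perturb("a b c", [remove_spaces, swap_adjacent], [1.0, 1.0], 1, rng))  # abc
```

With `max_edits=1`, only the first firing mode is applied; raising the cap to 2 lets the adjacent-swap fire as well.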
Optionally, a determination that no abnormal content exists in the text to be recognized does not guarantee that none actually exists; errors in the character recognition model or the pinyin recognition model may also lead to such a determination, so the text to be recognized requires further screening. Thus, as shown in FIG. 3, the method further comprises:
S21: it is determined whether abnormal content exists in the text to be recognized; if not, S22 is executed; if yes, S25 is executed.
S22: text of a literal type is processed into a text vector via the literal embedding model and text of a pinyin type is processed into a text vector via the pinyin embedding model.
The process of processing character-type text into a text vector via the character embedding model, or pinyin-type text into a text vector via the pinyin embedding model, as shown in FIG. 5, includes:
S211: inputting a training sample set, composed of text samples to be trained carrying category identifiers and countermeasure samples to be trained carrying category identifiers, into the feature vector extraction layer.
The text samples to be trained may be character samples or pinyin samples, which is not limited herein.
S212: and converting the training samples in the training sample set into text feature vectors carrying the position information through a feature vector extraction layer.
S213: the text feature vectors are feature interacted through at least one encoding layer.
S214: and fully connecting the text feature vectors after feature interaction through a fully connecting layer.
The character embedding model or the pinyin embedding model can be generated through S211-S214: when the sample to be trained is a character sample, a character embedding model is generated; when the sample to be trained is a pinyin sample, a pinyin embedding model is generated.
S23: and determining the similarity between the text vector obtained through processing and a plurality of historical negative text vectors in a preset negative text vector library.
A historical negative text vector is the text vector of a text that was previously determined, via the character embedding model or pinyin embedding model, to contain no abnormal content but that actually contains abnormal content. Specifically, the similarity between the text vector obtained by the processing and each of the plurality of historical negative text vectors in the preset negative text vector library can be determined by computing cosine similarity.
S24: and determining a second recognition result of the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
Specifically, the manner of determining the second recognition result of the text to be recognized may be: a similarity threshold value can be set, and when one of the plurality of similarities is larger than the set similarity threshold value, the determined second recognition result is that abnormal content exists in the text to be recognized; otherwise, the determined second recognition result is that no abnormal content exists in the text to be recognized.
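S23-S24 amount to a similarity check against the negative vector library. A self-contained sketch, in which the two-dimensional vectors and the 0.9 threshold are illustrative, not values from the patent:

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def second_result(vec, negative_library, threshold=0.9):
    # S24: abnormal content is indicated if any historical negative text
    # vector is more similar than the set threshold.
    return any(cosine(vec, neg) > threshold for neg in negative_library)

library = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical negative vectors
print(second_result([0.99, 0.01], library))  # True: close to [1.0, 0.0]
print(second_result([0.7, 0.7], library))    # False under threshold 0.9
```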
In this way, texts preliminarily recognized as containing no abnormal content are re-screened, which improves the recognition accuracy for texts to be recognized and further purifies the network environment.
S25: the text to be recognized is masked.
Optionally, as shown in fig. 6, the method further includes:
S61: when it is determined that the second recognition result represents that no abnormal content exists in the text to be recognized but abnormal content actually exists, adding the text vector of the text to be recognized to the preset historical negative text vector library.
It can be understood that, because the negative text vectors in the preset negative text vector library are finite, a second recognition result indicating that no abnormal content exists does not necessarily mean the text to be recognized is free of abnormal content. Further manual recognition is therefore required: if abnormal content actually exists in the text to be recognized, the text is marked, and the text vector of the marked text is added into the preset historical negative text vector library. This enriches the historical negative text vector library and lays a foundation for more accurately recognizing texts with abnormal content in the future.
Optionally, before S11, as shown in fig. 7, the method further includes:
S81: obtaining text samples to be trained carrying category identifiers, and generating countermeasure samples to be trained carrying category identifiers according to the text samples to be trained, wherein the category identifier of each text sample to be trained is the same as the category identifier of its countermeasure sample; and obtaining pinyin samples to be trained carrying category identifiers, and generating countermeasure samples to be trained carrying category identifiers according to the pinyin samples to be trained, wherein the category identifier of each pinyin sample to be trained is the same as the category identifier of its countermeasure sample.
A countermeasure sample may be generated from a historical text sample or a historical pinyin sample; the specific generation process is the same as that described in the foregoing embodiment and is not repeated here.
Optionally, in order to ensure the integrity and readability of the resulting countermeasure samples, a maximum number of modifications per historical text sample and a maximum number of modifications per word in the historical text sample need to be set; once a set maximum number of modifications is reached, modification of that historical text sample stops.
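The modification caps can be sketched as follows; the homophone substitution table and both limits are illustrative assumptions, since the disclosure only requires that modification stop once a set maximum is reached:

```python
import random

def generate_adversarial(sample, homophones, max_total=3, max_per_word=1, seed=0):
    """Replace words with substitute variants, capped per word and per sample.

    `homophones` maps a word to its candidate substitutes; both the mapping
    and the caps are illustrative, not the patent's exact scheme.
    """
    rng = random.Random(seed)
    words = list(sample)
    total = 0
    per_word = {}
    for i, w in enumerate(words):
        if total >= max_total:
            break  # stop once the per-sample cap is reached
        if w in homophones and per_word.get(i, 0) < max_per_word:
            words[i] = rng.choice(homophones[w])
            per_word[i] = per_word.get(i, 0) + 1
            total += 1
    return "".join(words)
```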
S82: inputting a training sample set consisting of a to-be-trained text sample carrying a category identifier and a to-be-trained countermeasure sample carrying a category identifier into a training network model, and training to generate a text recognition model; inputting a training sample set consisting of the pinyin sample to be trained with the category identification and the challenge sample to be trained with the category identification into a training network model, and training to generate a pinyin identification model.
The training network model may be, but is not limited to, a Transformer network model, a logistic regression (Logistic Regression) network model, or a support vector machine (Support Vector Machine, SVM) network model.
The process of training the character recognition model or the pinyin recognition model is described below, taking the Transformer network model as the training network model as an example.
Specifically, as shown in fig. 8, the training network model includes a feature vector extraction layer, at least one coding layer (three in fig. 8), at least one fully connected layer (two in fig. 8), and a softmax layer, which are sequentially connected. As shown in fig. 9, S82 includes:
S101: inputting a training sample set consisting of the text samples to be trained carrying category identifiers and the countermeasure samples to be trained carrying category identifiers into the feature vector extraction layer.
S102: the input training samples are converted into text feature vectors carrying position information through a feature vector extraction layer.
The feature vector extraction layer comprises an embedding layer and a positional encoding layer: the embedding layer converts the input training sample into text vectors, the positional encoding layer encodes the position of each word in the training sample, and the text vectors and the position encoding results are added and then spliced to obtain the text feature vector.
Specifically, the input to the Transformer network model is a sentence. Each word in the sentence passes through the embedding layer to obtain a word vector, and the position encoding of that word, produced by the positional encoding layer, is added to it. Assume the resulting vector is 512-dimensional, i.e. it has 512 elements. If the sentence length is 10, 10 such vectors are obtained, and these 10 vectors can be spliced into a matrix of 10 rows and 512 columns.
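A toy sketch of this step, assuming the standard sinusoidal position encoding of the original Transformer (the disclosure does not fix the encoding formula) and a small embedding dimension for readability:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding from the original Transformer: even dimensions
    # use sine, odd dimensions use cosine of a position-dependent angle.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def feature_vectors(token_embeddings):
    """Add each word vector to its position encoding, then stack the
    per-word vectors into a (sentence_len x d_model) matrix."""
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, len(emb)))]
        for pos, emb in enumerate(token_embeddings)
    ]
```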
S103: the text feature vectors are feature interacted through at least one encoding layer.
Specifically, the input of the coding layer is the 10-row, 512-column matrix spliced above. Since sentences differ in length, a maximum sentence length N must be specified. Assuming N = 15, the matrix finally input into the coding layer is a matrix of 15 rows and 512 columns, obtained by padding the matrix with zeros, i.e. the elements of the last 5 rows are all 0. In subsequent matrix multiplications, an all-zero row multiplied by any column yields 0; that is, the last 5 rows participate in the calculation but do not yield useful results (since the sentence length is 10, useful results come from the first 10 rows). If the sentence length is greater than N, the sentence is first truncated so that its length is N.
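The zero-padding and truncation described above can be sketched as:

```python
def pad_or_truncate(matrix, max_len=15, d_model=512):
    """Pad the (sentence_len x d_model) matrix with all-zero rows up to
    max_len, or cut off extra rows when the sentence is longer than N."""
    if len(matrix) >= max_len:
        return matrix[:max_len]
    padding = [[0.0] * d_model for _ in range(max_len - len(matrix))]
    return matrix + padding
```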
The input of the first coding layer is a matrix (15 × 512 in this embodiment), and after feature interaction its output is also a 15 × 512 matrix. That output matrix is the input of the second coding layer, which likewise outputs a 15 × 512 matrix after feature interaction, as does the third coding layer. The 15 rows of the final matrix are then spliced into a vector of length 15 × 512 = 7680.
S104: processing the text feature vector after feature interaction through at least one full-connection layer to obtain a primary recognition result, and normalizing the primary recognition result through a softmax output layer to generate a network recognition result.
In the embodiment of the disclosure, the feature-interacted text feature vector is processed through 2 fully connected layers. Specifically, the spliced vector of length 15 × 512 = 7680 is taken as the input of the first fully connected layer, then passed through the second fully connected layer to obtain a preliminary recognition result: a two-dimensional vector whose dimensions represent the probability that no abnormal content exists and the probability that abnormal content exists, respectively. The preliminary recognition result is normalized by the softmax layer to obtain the recognition output, i.e. the network recognition result.
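A minimal sketch of the fully connected layers and the softmax normalization, with toy dimensions in place of the 7680-dimensional input:

```python
import math

def linear(x, weights, bias):
    # One fully connected layer: y = Wx + b, with W given row by row.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]
```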
S106: and determining a cross entropy loss function according to the category identification and the network identification result of each training sample.
Cross entropy is a function that measures the gap between the output of the training network model and the true result; in this embodiment the model is a binary classification model. The training network model therefore finally outputs a two-dimensional vector [p, q] with p + q = 1, where p and q are the probabilities that the training sample has no abnormal content and has abnormal content, respectively. The category identifier (i.e. the true category) of each training sample is labeled in advance and can also be expressed as a two-dimensional vector: if the true category of the training sample is the positive category, it can be expressed as the vector [1, 0], meaning the probability of the positive category is 1 and that of the negative category is 0; if the true category is the negative category, it can be expressed as [0, 1]. Denoting the true category by [x, y], exactly one of x and y is 1. The cross entropy can then be calculated as L = -x·log(p) - y·log(q). If the true category of the training sample has no abnormal content, then x = 1 and y = 0, and the cross entropy is -log(p); if p is very close to 1, -log(1) = 0 and there is essentially no loss. Conversely, if p = 0.1, then -log(0.1) = 1, i.e. the loss is 1 (the logarithm here is base 10 for illustration; the natural constant e is usually used as the base, i.e. ln).
In addition, it should be noted that, the analysis process when the real class of the training sample has abnormal content is similar to the analysis process when the real class of the training sample does not have abnormal content, and will not be described herein.
If the training network model is not a binary classification model but a multi-class recognition model, the cross entropy loss can be written as

L = -∑_i y_i · log(a_i)

where i denotes the i-th category, y_i denotes the true result for category i (only one y_i equals 1 and all other y_i are 0; for example, if the recognition model classifies training samples into 3 categories, exactly one of y_1, y_2, y_3 is 1), and a_i denotes the probability that the training network model assigns to the i-th category.
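Both the binary formula and the multi-class formula reduce to the same sum over categories; a sketch that reproduces the base-10 numeric example from the text (zero-weight terms are skipped so log(0) is never evaluated):

```python
import math

def cross_entropy(true_vec, pred_vec, log=math.log10):
    """L = -sum_i y_i * log(a_i). Base-10 log matches the text's numeric
    example; pass log=math.log for the more usual natural-log variant."""
    return -sum(y * log(a) for y, a in zip(true_vec, pred_vec) if y)
```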
S107: the gradient of the cross entropy loss function of the plurality of training samples is determined according to a small batch gradient descent algorithm.
Here, the gradient is the direction in which a function increases most quickly. Therefore, to make the loss function decrease, the network parameters should not be updated along the gradient direction (which would cause the loss function to increase) but along the opposite direction of the gradient, which causes the loss function to decrease.
S108: and updating the network parameters of the training network model according to the gradient.
In the training process, the historical text samples are divided into a training sample set and a verification sample set. For example, after each round of training on the training sample set, the accuracy on both the training sample set and the verification sample set is calculated; the training sample set is then trained for another round, and the two accuracies are calculated again. In the initial stage, the accuracy on both the training sample set and the verification sample set increases; in later stages, overfitting may occur, i.e. the accuracy on the training sample set keeps increasing while the accuracy on the verification sample set decreases. If the accuracy on the verification sample set keeps decreasing over several subsequent rounds of training, overfitting is indicated and training is stopped.
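The early-stopping rule described above can be sketched as follows; the patience of 3 rounds is an illustrative choice:

```python
def should_stop(val_accuracies, patience=3):
    """Stop when validation accuracy has dropped for `patience`
    consecutive rounds (the overfitting signal in the text)."""
    if len(val_accuracies) <= patience:
        return False
    recent = val_accuracies[-(patience + 1):]
    # True only if each of the last `patience` rounds got strictly worse.
    return all(b < a for a, b in zip(recent, recent[1:]))
```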
Fig. 10 is a block diagram illustrating a text message recognition apparatus 1100 according to an exemplary embodiment. It should be noted that, the basic principle and the technical effects of the text information recognition device 1100 provided in the embodiment of the present application are the same as those of the above embodiment, and for brevity, reference may be made to the corresponding content in the above embodiment for the description of the embodiment of the present application. The apparatus 1100 includes an information acquisition unit 1101, a text conversion unit 1102, a text recognition unit 1103, and a result determination unit 1104, wherein,
the information acquisition unit 1101 is configured to perform acquisition of text to be recognized.
A text conversion unit 1102, configured to perform a text type conversion process on the text to be identified, so as to obtain at least one corresponding converted text.
The text recognition unit 1103 is configured to perform content recognition on the text to be recognized and the at least one converted text, so as to obtain respective first recognition results, where the first recognition results are used to represent whether abnormal content exists in each corresponding text.
A result determining unit 1104 configured to perform determination of whether or not abnormal content exists in the text to be recognized based on each of the first recognition results.
Optionally, the result determining unit 1104 is specifically configured to perform determining that the abnormal content exists in the text to be identified if at least one of the first identification results characterizes the text to be identified or the converted text has the abnormal content.
The text information recognition apparatus 1100 may perform the following functions when executed: performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text; respectively carrying out content recognition on the text to be recognized and at least one converted text to obtain corresponding first recognition results; based on each first recognition result, whether the abnormal content exists in the text to be recognized is determined, and the content of the text to be recognized is enriched, so that the recognition accuracy of the text to be recognized with the abnormal content is higher, and even if the text to be recognized with the abnormal content is converted and changed in social application, the text recognition model can recognize that the abnormal content exists in the text to be recognized after the conversion and change, so that the sample to be recognized with the abnormal content is accurately shielded.
Optionally, as one embodiment, the text conversion unit 1102 is specifically configured to perform converting the text of the text type into pinyin if the text type includes a text type.
Optionally, the text recognition unit 1103 is specifically configured to perform recognition of the text of the word type via the word recognition model to obtain one of the first recognition results, and recognize the text of the pinyin type via the pinyin recognition model to obtain the other first recognition result.
The character recognition model is obtained by training in advance on a training sample set composed of historical character samples carrying category identifiers and the countermeasure texts of those historical character samples, the category identifier of each historical character sample being the same as the category identifier of its countermeasure text; the pinyin recognition model is obtained by training in advance on a training sample set composed of historical pinyin samples carrying category identifiers and the countermeasure texts of those historical pinyin samples, the category identifier of each historical pinyin sample being the same as the category identifier of its countermeasure text.
Optionally, as shown in fig. 11, the apparatus 1100 further includes:
the text vector generation unit 1201 is configured to perform processing of text of a text type into a text vector via the text embedding model and processing of text of a pinyin type into a text vector via the pinyin embedding model if it is determined that there is no abnormal content in the text to be recognized.
And a similarity determining unit 1202 configured to perform determining the similarity between the text vector obtained through processing and each of a plurality of historical negative text vectors in a preset negative text vector library, wherein the historical negative text vectors are text vectors for which the word recognition model or the pinyin recognition model determined in advance that no abnormal content exists but in which abnormal content actually exists.
The result determining unit 1104 is configured to perform determination of a second recognition result of the text to be recognized, based on the obtained plurality of similarities, the second recognition result being used to indicate whether or not abnormal content exists in the text to be recognized.
Optionally, as shown in fig. 12, the apparatus 1100 further includes: the text vector adding unit 1301 is configured to perform adding a text vector of the text to be recognized into a preset history negative-type text vector library if the second recognition result indicates that there is no abnormal content in the text to be recognized and the text to be recognized actually has abnormal content.
Further, the information obtaining unit 1101 is further configured to obtain to-be-trained text samples carrying category identifiers, and generate to-be-trained countermeasure samples carrying category identifiers according to the to-be-trained text samples carrying the category identifiers, where the category identifier of each historical text sample is the same as the category identifier of the countermeasure text; and obtaining the pinyin samples to be trained carrying the category identifiers, generating the countermeasure samples to be trained carrying the category identifiers according to the pinyin samples to be trained carrying the category identifiers, wherein the category identifiers of each historical pinyin sample are the same as the category identifiers of the countermeasure texts.
As shown in fig. 13, the apparatus 1100 further includes: the model training unit 1501 is configured to perform training to generate a character recognition model by inputting a training sample set comprising a character sample to be trained carrying a category identifier and a challenge sample to be trained carrying a category identifier into a training network model; inputting a training sample set consisting of the pinyin sample to be trained with the category identification and the challenge sample to be trained with the category identification into a training network model, and training to generate a pinyin identification model.
Specifically, the training network model includes a feature vector extraction layer, at least one coding layer, at least one fully connected layer, and a softmax layer, which are connected in sequence. As shown in fig. 14, the model training unit 1501 includes a text input module 1601, a feature vector generation module 1602, a feature interaction module 1603, a two-dimensional feature generation module 1604, a text recognition module 1605, a loss function determination module 1606, a gradient determination module 1607, and a parameter update module 1608, wherein,
the text input module 1601 is configured to perform inputting a training sample set including a to-be-trained text sample or a to-be-trained pinyin sample carrying a category identifier, and a to-be-trained challenge sample carrying a category identifier into the feature vector extraction layer.
The feature vector generation module 1602 is configured to perform a conversion of training samples into text feature vectors carrying location information via a feature vector extraction layer.
The feature interaction module 1603 is configured to perform feature interactions on the text feature vectors via at least one encoding layer.
The text recognition module 1605 is configured to perform processing on the text feature vector after feature interaction through at least one full connection layer to obtain a preliminary recognition result, and normalize the preliminary recognition result through a softmax layer to generate a network recognition result.
The loss function determination module 1606 is configured to determine a cross entropy loss function based on the class identification, network identification result of each training sample.
A gradient determination module 1607 configured to perform determining the gradient of the cross entropy loss function of the plurality of training samples according to a mini-batch gradient descent algorithm.
A parameter update module 1608 configured to perform updating network parameters of the training network model according to the gradient.
With respect to the apparatus 1100 in the above embodiment, the specific manner in which the respective modules perform the operations has been described in detail in the embodiment regarding the method, and will not be described in detail herein.
Fig. 15 is a block diagram of a server 102 for performing the text information recognition method, according to an exemplary embodiment. Referring to fig. 15, the server 102 includes a processing component 1701, which further includes one or more processors, and memory resources represented by a memory 1702 for storing instructions executable by the processing component 1701, such as an application program. The application program stored in the memory 1702 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1701 is configured to execute the instructions to perform the text information recognition method described above.
For example, the following steps may be performed:
acquiring a text to be identified;
performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text;
respectively carrying out content recognition on the text to be recognized and at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in each corresponding text;
and determining whether abnormal content exists in the text to be identified or not based on each first identification result.
The server 102 can also include a power component 1703 configured to perform power management of the server 102, a wired or wireless network interface 1704 configured to connect the server 102 to the network 300, and an input/output (I/O) interface 1705. The server 102 may operate based on an operating system stored in the memory 1702, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a storage medium is also provided, such as a memory 804 including instructions executable by a processor of the server 102 to perform the method of identifying text information described above. For example, the following steps may be performed:
acquiring a text to be identified;
performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text;
respectively carrying out content recognition on the text to be recognized and at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in the corresponding texts;
and determining whether abnormal content exists in the text to be identified or not based on each first identification result.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the steps of:
Acquiring a text to be identified;
performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text;
respectively carrying out content recognition on the text to be recognized and the at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in the corresponding texts;
and determining whether abnormal content exists in the text to be identified or not based on each first identification result.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method for identifying text information, the method comprising:
acquiring a text to be identified;
performing text type conversion processing on the text to be identified to obtain at least one corresponding converted text;
respectively carrying out content recognition on the text to be recognized and the at least one converted text to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in each corresponding text;
determining whether abnormal content exists in the text to be identified or not based on each first identification result;
wherein the method further comprises:
if the fact that abnormal content does not exist in the text to be identified is determined, processing the text with the text type into a text vector through a text embedding model and processing the text with the pinyin type into the text vector through a pinyin embedding model;
determining the similarity between the text vector obtained through processing and a plurality of historical negative text vectors in a preset negative text vector library, wherein the historical negative text vectors are text vectors which are subjected to content recognition in advance to determine that abnormal content does not exist and the abnormal content actually exists;
And determining a second recognition result aiming at the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
2. The method of claim 1, wherein determining whether abnormal content exists in the text to be recognized based on each of the first recognition results comprises:
if at least one first recognition result in the first recognition results represents that abnormal content exists in the corresponding text, determining that abnormal content exists in the text to be recognized.
3. The method according to claim 1, wherein said performing a text-type conversion process on said text to be identified to obtain a corresponding at least one converted text comprises:
if the text type comprises a text type, converting the text of the text type into pinyin;
and if the text type comprises the pinyin type, converting the text of the pinyin type into characters.
4. A method according to claim 3, wherein the content recognition of the text to be recognized and the at least one converted text respectively to obtain respective first recognition results comprises: the text of the character type is identified by the character identification model to obtain one first identification result, the text of the pinyin type is identified by the pinyin identification model to obtain the other first identification result,
The character recognition model is obtained by training in advance on a training sample set composed of historical character samples carrying category identifiers and the countermeasure texts of those historical character samples, the category identifier of each historical character sample being the same as the category identifier of its countermeasure text; and the pinyin recognition model is obtained by training in advance on a training sample set composed of historical pinyin samples carrying category identifiers and the countermeasure texts of those historical pinyin samples, the category identifier of each historical pinyin sample being the same as the category identifier of its countermeasure text.
5. The method according to claim 1, wherein if the second recognition result indicates that no abnormal content exists in the text to be recognized and that abnormal content exists in the text to be recognized actually, adding the text vector of the text to be recognized into a preset historical negative text vector library.
6. The method according to claim 1, wherein the method further comprises:
and if the abnormal content exists in the text to be identified, shielding the text to be identified.
7. A text message recognition device, the device comprising:
An information acquisition unit configured to perform acquisition of a text to be recognized;
a text conversion unit configured to perform conversion processing of text types on the text to be recognized to obtain at least one corresponding converted text;
the text recognition unit is configured to perform content recognition on the text to be recognized and the at least one converted text respectively so as to obtain corresponding first recognition results, wherein the first recognition results are used for representing whether abnormal content exists in the corresponding texts;
a result determination unit configured to perform determination of whether or not abnormal content exists in the text to be recognized based on each of the first recognition results;
wherein the apparatus further comprises:
a text vector generation unit configured to perform processing of text of a text type into a text vector via a text embedding model and processing of text of a pinyin type into a text vector via a pinyin embedding model if it is determined that no abnormal content exists in the text to be recognized;
a similarity determining unit configured to perform determining the similarity between the text vector obtained through processing and each of a plurality of historical negative text vectors in a preset negative text vector library, wherein the historical negative text vectors are text vectors for which content recognition determined in advance that no abnormal content exists but in which abnormal content actually exists;
The result determining unit is configured to determine a second recognition result for the text to be recognized according to the obtained multiple similarities, wherein the second recognition result is used for indicating whether abnormal content exists in the text to be recognized.
8. The apparatus according to claim 7, wherein the result determining unit is specifically configured to perform determining that abnormal content exists in the text to be recognized if at least one of the first recognition results characterizes the text to be recognized or the converted text has abnormal content.
9. The apparatus of claim 7, wherein the text conversion unit is specifically configured to perform converting the text of a text type into pinyin if the text type includes a text type; and if the text type comprises the pinyin type, converting the text of the pinyin type into characters.
10. The apparatus of claim 9, wherein the text recognition unit is specifically configured to recognize text of the character type via a character recognition model to obtain one of the first recognition results, and to recognize text of the pinyin type via a pinyin recognition model to obtain another of the first recognition results,
wherein the character recognition model is trained in advance on a training sample set composed of historical character samples carrying category identifiers and the adversarial texts of those historical character samples, the category identifier of each historical character sample being identical to that of its adversarial text; and the pinyin recognition model is trained in advance on a training sample set composed of historical pinyin samples carrying category identifiers and the adversarial texts of those historical pinyin samples, the category identifier of each historical pinyin sample being identical to that of its adversarial text.
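The training-set construction described here, pairing each historical sample with an adversarial variant that keeps the same category identifier, can be sketched as follows. The `make_adversarial` callback is a placeholder: the patent does not specify how adversarial texts are generated, so any perturbation function (homophone swaps, character insertions, etc.) could be plugged in.

```python
def build_training_set(samples, make_adversarial):
    """Pair each historical sample (text, category_id) with an
    adversarial variant carrying the identical category identifier,
    doubling the training set for either recognition model."""
    training = []
    for text, category_id in samples:
        training.append((text, category_id))
        training.append((make_adversarial(text), category_id))
    return training
```

Because the adversarial copy keeps its source's label, the model learns to assign disguised variants the same category as the original text.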
11. The apparatus of claim 7, wherein the apparatus further comprises: a text vector adding unit configured to add the text vector of the text to be recognized to the preset historical negative-type text vector library if the second recognition result indicates that no abnormal content exists in the text to be recognized but abnormal content actually exists in the text to be recognized.
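The feedback step of this claim, recording the vector of any text the recognizer missed, reduces to a guarded append. This is a minimal sketch under the assumption that the library is held as an in-memory list of vectors; argument names are illustrative.

```python
def update_negative_library(library, text_vector,
                            flagged_abnormal, actually_abnormal):
    """Record a miss: if the second recognition result said 'clean'
    but the text was in fact abnormal, add its vector to the
    negative-type library so similar texts are caught next time."""
    if (not flagged_abnormal) and actually_abnormal:
        library.append(text_vector)
    return library
```

Over time this makes the similarity check of claim 7 progressively harder to evade, since every successful evasion enlarges the library.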
12. A server, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the text information identification method according to any one of claims 1 to 6.
13. A storage medium storing instructions which, when executed by a processor of a server, enable the server to perform the text information identification method according to any one of claims 1 to 6.
CN201911304665.0A 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium Active CN112989810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911304665.0A CN112989810B (en) 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium

Publications (2)

Publication Number Publication Date
CN112989810A CN112989810A (en) 2021-06-18
CN112989810B true CN112989810B (en) 2024-03-12

Family

ID=76343629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911304665.0A Active CN112989810B (en) 2019-12-17 2019-12-17 Text information identification method and device, server and storage medium

Country Status (1)

Country Link
CN (1) CN112989810B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011070269A (en) * 2009-09-24 2011-04-07 Hitachi Information Systems Ltd Character conversion device and method, diagram display system and method, and program
CN105574090A (en) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system
CN107291780A (en) * 2016-04-12 2017-10-24 腾讯科技(深圳)有限公司 A kind of user comment information methods of exhibiting and device
CN109766475A (en) * 2018-12-13 2019-05-17 北京爱奇艺科技有限公司 A kind of recognition methods of rubbish text and device

Similar Documents

Publication Publication Date Title
US20190065506A1 (en) Search method and apparatus based on artificial intelligence
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN110321553B (en) Short text topic identification method and device and computer readable storage medium
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN112926327B (en) Entity identification method, device, equipment and storage medium
CN112417885A (en) Answer generation method and device based on artificial intelligence, computer equipment and medium
CN110414004B (en) Method and system for extracting core information
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN111291195A (en) Data processing method, device, terminal and readable storage medium
CN110532381A (en) A kind of text vector acquisition methods, device, computer equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN110909531A (en) Method, device, equipment and storage medium for discriminating information security
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
CN111881398B (en) Page type determining method, device and equipment and computer storage medium
CN112085091A (en) Artificial intelligence-based short text matching method, device, equipment and storage medium
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN115292520B (en) Knowledge graph construction method for multi-source mobile application
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN115759071A (en) Government affair sensitive information identification system and method based on big data
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
CN115840808A (en) Scientific and technological project consultation method, device, server and computer-readable storage medium
CN112989829B (en) Named entity recognition method, device, equipment and storage medium
CN112989810B (en) Text information identification method and device, server and storage medium
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
CN112883703B (en) Method, device, electronic equipment and storage medium for identifying associated text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant