CN110781668A

CN110781668A - Text information type identification method and device

Info

Publication number: CN110781668A
Application number: CN201911018745.XA
Authority: CN
Inventors: 郝彦超; 康斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-10-24
Filing date: 2019-10-24
Publication date: 2020-02-11
Anticipated expiration: 2039-10-24
Also published as: CN110781668B

Abstract

The invention discloses a method and a device for identifying the type of text information. The method comprises the following steps: inputting acquired first text information into a text type identification model, wherein the text type identification model comprises a plurality of sub-models connected in series, and each of the sub-models is used for identifying whether the text information input into it belongs to the text type corresponding to that sub-model; and acquiring a first recognition result output by the text type identification model, wherein the first recognition result indicates that a first sub-model among the plurality of sub-models has determined that the first text information belongs to a target text type corresponding to the first sub-model. The invention solves the technical problem of low recognition efficiency in type recognition of text information.

Description

Text information type identification method and device

Technical Field

The invention relates to the field of computers, in particular to a text information type identification method and device.

Background

Most existing technical schemes for identifying malicious titles target them with a single matching mode. This approach suffers from a low accuracy rate while the recall rate is high, so the performance of the overall system rarely satisfies users and cannot effectively assist manual text auditing in practice.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a method and a device for identifying the type of text information, which at least solve the technical problem of low efficiency in identifying the type of the text information.

According to an aspect of the embodiments of the present invention, there is provided a method for identifying a type of text information, including: inputting the acquired first text information into a text type identification model, wherein the text type identification model comprises a plurality of sub-models which are sequentially connected in series, and each sub-model in the plurality of sub-models is used for identifying whether the text information input into each sub-model belongs to a text type corresponding to each sub-model;

and acquiring a first recognition result output by the text type recognition model, wherein the first recognition result is used for indicating a first sub-model in the plurality of sub-models to determine that the first text information belongs to a target text type corresponding to the first sub-model.

According to another aspect of the embodiments of the present invention, there is also provided a text information type recognition apparatus, including: the text type recognition module is used for inputting the acquired first text information into a text type recognition model, wherein the text type recognition model comprises a plurality of sub-models which are sequentially connected in series, and each sub-model in the plurality of sub-models is used for recognizing whether the text information input into each sub-model belongs to a text type corresponding to each sub-model;

the first obtaining module is configured to obtain a first recognition result output by the text type recognition model, where the first recognition result is used to indicate a first sub-model of the multiple sub-models to determine that the first text information belongs to a target text type corresponding to the first sub-model.

Optionally, the apparatus further comprises:

the display module is used for displaying the plurality of pieces of text information and the target text type corresponding to each piece of text information, in corresponding relation, before the target text information labeled as not belonging to the target text type is added into the second classification training data to obtain third classification training data;

a first determining module, configured to determine, as the target text information, text information on which a selection operation is performed among the plurality of text information;

and the second determining module is used for determining that the target text information does not belong to the target text type.

In the embodiment of the invention, the acquired first text information is input into a text type identification model, wherein the text type identification model comprises a plurality of sub-models connected in series, and each of the sub-models is used for identifying whether the text information input into it belongs to the text type corresponding to that sub-model; a first recognition result output by the text type identification model is then acquired, wherein the first recognition result indicates that a first sub-model among the plurality of sub-models has determined that the first text information belongs to a target text type corresponding to the first sub-model. Because the serially connected sub-models each identify whether the input first text information belongs to their own text type, and different sub-models correspond to different text types, the text type of the text information can be identified at a fine granularity, which improves recognition accuracy. Moreover, as the first text information passes through the serially connected sub-models, the text type of the first sub-model that recognizes it is output as the target text type, which improves recognition speed. The technical effect of improving the efficiency of type recognition of text information is thereby achieved, and the technical problem of low efficiency in type recognition of text information is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of an alternative method for type recognition of textual information, in accordance with an embodiment of the present invention;

FIG. 2 is a diagram illustrating a first application environment of an alternative text information type identification method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an application environment of an alternative method for type recognition of text information according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an application environment of an alternative text information type identification method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an application environment of an alternative text information type identification method according to an embodiment of the present invention;

FIG. 6 is a first diagram illustrating an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 7 is a diagram illustrating an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 8 is a third schematic diagram of an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 9 is a fourth schematic diagram of an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 10 is a fifth exemplary illustration of an alternative method of type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 11 is a sixth schematic diagram illustrating an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 12 is a seventh schematic diagram illustrating an alternative method for type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 13 is an eighth schematic diagram illustrating an alternative method of type-identifying textual information, in accordance with an alternative embodiment of the present invention;

FIG. 14 is a ninth illustration of an alternative method of type recognition of textual information, in accordance with an alternative embodiment of the present invention;

FIG. 15 is a schematic diagram of an alternative type recognition apparatus for text information according to an embodiment of the present invention;

FIG. 16 is a schematic view of an application scenario of an alternative text information type identification method according to an embodiment of the present invention; and

FIG. 17 is a schematic diagram of an alternative electronic device according to an embodiment of the invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of an embodiment of the present invention, there is provided a method for identifying a type of text information, as shown in fig. 1, the method including:

step S102, inputting the acquired first text information into a text type identification model, wherein the text type identification model comprises a plurality of sub-models which are sequentially connected in series, and each sub-model in the plurality of sub-models is used for identifying whether the text information input into each sub-model belongs to a text type corresponding to each sub-model;

step S104, obtaining a first recognition result output by the text type recognition model, wherein the first recognition result is used for indicating a first sub-model in the plurality of sub-models to determine that the first text information belongs to a target text type corresponding to the first sub-model.

Alternatively, in this embodiment, the above text information type identification method may be applied to a hardware environment formed by the device 202 shown in fig. 2. As shown in fig. 2, the device 202 inputs the obtained first text information into a text type identification model, where the text type identification model includes multiple serially connected submodels, and each submodel in the multiple submodels is used to identify whether the text information input into each submodel belongs to a text type corresponding to each submodel; and acquiring a first recognition result output by the text type recognition model, wherein the first recognition result is used for indicating a first sub-model in the multiple sub-models to determine that the first text information belongs to a target text type corresponding to the first sub-model.

Optionally, in this embodiment, the device for executing the type identification method of the text information may be, but is not limited to, a node in a data sharing system.

Referring to the data sharing system shown in fig. 3, the data sharing system 100 refers to a system for performing data sharing between nodes, the data sharing system may include a plurality of nodes 101, and the plurality of nodes 101 may refer to respective clients in the data sharing system. Each node 101 may receive input information while operating normally and maintain shared data within the data sharing system based on the received input information. In order to ensure information intercommunication in the data sharing system, information connection can exist between each node in the data sharing system, and information transmission can be carried out between the nodes through the information connection. For example, when an arbitrary node in the data sharing system receives input information, other nodes in the data sharing system acquire the input information according to a consensus algorithm, and store the input information as data in shared data, so that the data stored on all the nodes in the data sharing system are consistent.

Each node in the data sharing system has a corresponding node identifier, and each node may store the node identifiers of the other nodes in the data sharing system, so that a generated block can subsequently be broadcast to the other nodes according to their node identifiers. Each node may maintain a node identifier list as shown in the following table, in which node names and node identifiers are stored in correspondence. The node identifier may be an IP (Internet Protocol) address or any other information that can be used to identify the node; Table 1 uses only the IP address as an example.

TABLE 1

Node name	Node identification
Node 1	117.114.151.174
Node 2	117.116.189.145
…	…
Node N	119.123.789.258

Each node in the data sharing system stores an identical blockchain. Referring to fig. 4, the blockchain is composed of a plurality of blocks. The starting block includes a block header and a block body; the block header stores an input information characteristic value, a version number, a timestamp and a difficulty value, and the block body stores the input information. The next block takes the starting block as its parent block and likewise includes a block header and a block body; its block header stores the input information characteristic value of the current block, the block header characteristic value of the parent block, a version number, a timestamp and a difficulty value. Continuing in this way, the block data stored in each block of the blockchain is associated with the block data stored in its parent block, which ensures the security of the input information in the blocks.

When each block in the blockchain is generated, referring to fig. 5, the node where the blockchain is located verifies the input information upon receiving it; after verification is completed, the input information is stored in the memory pool, and the hash tree used for recording the input information is updated. The update timestamp is then set to the time at which the input information was received, and different random numbers are tried, computing the characteristic value repeatedly until the computed characteristic value satisfies the following formula:

SHA256(SHA256(version + prev_hash + merkle_root + ntime + nbits + x)) < TARGET

where SHA256 is the characteristic value algorithm used for calculating the characteristic value; version is the version information of the relevant block protocol in the blockchain; prev_hash is the block header characteristic value of the parent block of the current block; merkle_root is the characteristic value of the input information; ntime is the update time of the update timestamp; nbits is the current difficulty, which is fixed for a period of time and is determined again after a fixed time period has elapsed; x is the random number; and TARGET is the characteristic threshold, which can be determined from nbits.

Therefore, when a random number satisfying the formula is obtained through calculation, the information can be stored correspondingly, and the block header and the block body are generated to obtain the current block. The node where the blockchain is located then sends the newly generated block to the other nodes in its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after verification is completed, add it to the blockchains they store.
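The random-number search described above can be sketched as follows. This is a minimal illustration only: it assumes that "+" in the formula means concatenating the fields as text, that TARGET is supplied directly rather than derived from nbits, and that the field values are placeholders. The helper names `double_sha256` and `find_nonce` are hypothetical and do not appear in the original.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """Apply SHA256 twice, as in SHA256(SHA256(...))."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def find_nonce(version, prev_hash, merkle_root, ntime, nbits, target,
               max_tries=1_000_000):
    """Try x = 0, 1, 2, ... until the characteristic value is below target.

    Assumption: the header fields are concatenated as UTF-8 text; a real
    system would use a fixed binary layout.
    Returns (x, value) on success, or None if no x is found in max_tries.
    """
    for x in range(max_tries):
        header = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}{x}".encode("utf-8")
        value = int.from_bytes(double_sha256(header), "big")
        if value < target:
            return x, value
    return None


# Illustrative values only; an easy target keeps the search short.
TARGET = 1 << 248
result = find_nonce(1, "00ab", "beef", 1571875200, 8, TARGET)
```

A smaller TARGET makes a satisfying random number rarer, which is how the difficulty value controls the expected amount of work.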

Optionally, in this embodiment, the method for identifying the type of the text information may be, but is not limited to, applied to a scenario in which text type identification is performed on text information in an application program. The device may be, but is not limited to, a client or a server of various types of applications, such as an online education application, an instant messaging application, a community space application, a game application, a shopping application, a browser application, a financial application, a multimedia application, a live application, and the like. Specifically, the method can be applied to, but not limited to, a scene in which text type recognition is performed on a video title text in a multimedia application, or can also be applied to, but not limited to, a scene in which text type recognition is performed on article content in an instant messaging application, so as to improve recognition efficiency of text information type recognition. The above is only an example, and this is not limited in this embodiment.

Optionally, in this embodiment, after the obtained first text information is input into the text type recognition model, if none of the multiple sub-models outputs the first recognition result, it may be determined that the first text information does not belong to the text types corresponding to the multiple sub-models.

Optionally, in this embodiment, the text information on which type recognition is performed may include, but is not limited to: title text, body text, subtitle text, and the like. For example: audio titles, video titles, article titles, subtitles in a video, lyrics in audio, the body text of an article, text in a picture, and the like.

Optionally, in this embodiment, the text types may include, but are not limited to, text types obtained by classifying malicious texts. For example, as shown in fig. 6, apart from the "normal" category, malicious video titles are divided into six categories, namely "title party", "vulgar", "meaningless", "malicious promotion", "other" and "nausea". Among them, "title party", "vulgar", "meaningless" and "other" each contain several sub-categories, for a total of 26 text types. Each of the plurality of sub-models is used to identify one or more of the 26 sub-categories described above.

Optionally, in this embodiment, type recognition of text information is a highly comprehensive task, and the overall process combines multiple natural language processing methods. The main technical means currently used are shown in fig. 7, and each of the sub-models uses one or more of these methods to detect its corresponding text type. The language model models a sentence (text) in order to evaluate how reasonable and how likely the sentence is. Part-of-speech tagging determines the part of speech of the words in the text. For example, in the vocabulary guidance sub-type of "title party-malicious guidance", if "so" + verb appears, the text information is determined to belong to the text type "title party-malicious guidance" and a malicious guidance reminder can be given to the auditor; "so" + adjective does not belong to the "title party-malicious guidance" type, and no reminder is needed. To avoid repetition, text customarily refers back to previously mentioned entities using pronouns, appellations and abbreviations. Coreference resolution is the process of merging different descriptions of the same real-world entity. In malicious title detection, coreference resolution means correctly finding the noun or entity to which a pronoun in a sentence refers. For example, in the vocabulary guidance sub-type of "title party-malicious guidance", if a pronoun such as "he" or "she" appears in the title and the noun or entity it refers to cannot be found, a malicious guidance reminder should be given to the auditor.

In an optional embodiment, as shown in fig. 8, the obtained video title a is input into a text type identification model, where the text type identification model includes a plurality of submodels (submodel 1, submodel 2 … … submodel n) connected in series in sequence, and each of the plurality of submodels is used to identify whether text information input into each submodel belongs to a text type (type 1, type 2 … … type n) corresponding to each submodel; and acquiring a first recognition result output by the text type recognition model, wherein the first recognition result is used for indicating a sub-model 2 in a plurality of sub-models to determine that the video title A belongs to a type 2 corresponding to the sub-model 2.
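The serial arrangement in this example can be sketched as a short-circuiting loop. This is a simplified illustration, not the patented implementation: each sub-model is assumed to be a (text type, predicate) pair, and the three toy predicates are hypothetical stand-ins for real classifiers.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumption: a sub-model is modelled as (text_type, predicate), where the
# predicate returns True when the text belongs to that sub-model's type.
SubModel = Tuple[str, Callable[[str], bool]]


def identify_type(text: str, sub_models: Sequence[SubModel]) -> Optional[str]:
    """Pass the text through the sub-models in order.

    The first sub-model that recognizes the text determines the result;
    later sub-models are not evaluated. None means that the text belongs to
    none of the text types covered by the sub-models.
    """
    for text_type, belongs in sub_models:
        if belongs(text):
            return text_type
    return None


# Hypothetical toy sub-models, for illustration only.
sub_models = [
    ("type 1", lambda t: "free money" in t.lower()),
    ("type 2", lambda t: t.endswith("!!!")),
    ("type 3", lambda t: len(t.strip()) < 3),
]

video_title_a = "You won't believe this!!!"
recognized = identify_type(video_title_a, sub_models)
```

Because the loop returns at the first hit, the ordering of the sub-models decides which type is reported when several would match.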

It can be seen that, through the above steps, the serially connected sub-models included in the text type identification model each identify whether the input first text information belongs to their own text type. Because different sub-models correspond to different text types, the text type of the text information can be identified at a fine granularity, which improves recognition accuracy. Moreover, as the first text information passes through the serially connected sub-models, the text type of the first sub-model that recognizes it is output as the target text type, which improves recognition speed. The technical effect of improving the efficiency of type recognition of text information is thereby achieved, and the technical problem of low efficiency in type recognition of text information is solved.

As an alternative, in case the first sub-model comprises a short text classification model,

S1, inputting the acquired first text information into a text type recognition model includes: inputting the first text information into the short text classification model, wherein the short text classification model comprises a plurality of first encoders and a first excitation function layer which are sequentially connected in series, and each first encoder in the plurality of first encoders comprises a self-attention layer and a feedforward neural network layer;

S2, obtaining the first recognition result output by the text type recognition model includes: acquiring the first recognition result output by the first excitation function layer.

Optionally, in this embodiment, each of the first encoders may include, but is not limited to, a Transformer encoder.

Optionally, in this embodiment, the first excitation function layer may include, but is not limited to, a softmax layer. The softmax layer may be a fully-connected layer followed by softmax. The fully-connected layer multiplies a weight matrix by the input vector and adds an offset, mapping N real numbers in (−∞, +∞) to M real numbers in (−∞, +∞); the softmax function then maps the M unbounded real numbers into the interval (0, 1) while ensuring that they sum to 1.
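The mapping from N inputs to M probabilities can be sketched in plain Python as follows. The weight and offset values are made-up numbers for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that the description does not mention.

```python
import math
from typing import List


def fully_connected(x: List[float], weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Multiply an M x N weight matrix by an N-vector and add an M-vector offset."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z: List[float]) -> List[float]:
    """Map M unbounded real numbers into (0, 1) so that they sum to 1."""
    m = max(z)  # shift by the maximum for numerical stability (an added detail)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values: N = 3 input features, M = 2 output classes.
x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3],
     [0.0, -0.1, 0.2]]
b = [0.0, 0.5]

logits = fully_connected(x, W, b)
probs = softmax(logits)
```

The index of the largest entry in `probs` would then serve as the predicted class.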

In an alternative embodiment, the structure of the Transformer encoder is shown in fig. 9. An encoder receives a list of vectors as input, passes the vectors to a self-attention layer for processing and then to a feedforward neural network layer, and passes the output to the next encoder. As the model processes each word of the input sequence, self-attention can attend to all words of the entire input sequence, helping the model to encode the word better. The first step in calculating self-attention is to generate three vectors from each input vector of the encoder (the word vector of each word); that is, a query vector, a key vector and a value vector are created for each word. These three vectors are created by multiplying the word embedding by three weight matrices. As shown in FIG. 10, X₁ is multiplied by the weight matrix W^Q to obtain q₁, the query vector for this word; X₁ is multiplied by the weight matrix W^K to obtain k₁, the key vector for this word; and X₁ is multiplied by the weight matrix W^V to obtain v₁, the value vector for this word. Similarly, q₂, k₂ and v₂ are obtained for X₂, so that a query vector, a key vector and a value vector are created for each word of the input sequence. The second step is to calculate scores: the query vector of the word being encoded is dot-multiplied with the key vector of every word in the input sentence. The results are then passed through the softmax operation, which normalizes the scores of all words so that they are all positive and sum to 1. The value vector of each word is then multiplied by its softmax score (in preparation for summing them); the intuition is to focus on semantically related words and weaken irrelevant ones. Finally, the weighted value vectors are summed to obtain the z vector in fig. 9.

Put another way, when a word is encoded, the representations (value vectors) of all words are summed with weights, where each weight is obtained by taking the dot product of a word's representation (key vector) with the representation of the word being encoded (query vector) and applying softmax. For feature extraction in practical applications, a multi-head mechanism adds multiple representation subspaces to the self-attention layer, expanding the model's ability to attend to different positions. FIG. 11 shows the complete structure of a single-layer Transformer encoder block. In practical applications, multiple Transformer encoder blocks are stacked to extract deep text features. For classification, the extracted features are fed as input into a softmax layer, which outputs the predicted label; the whole flow is shown in fig. 12.
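The single-head self-attention steps described above (query, key and value vectors, dot-product scores, softmax, weighted sum) can be sketched in plain Python. This follows the description literally; Transformer implementations commonly also divide the scores by the square root of the key dimension, which the description does not mention and which is therefore omitted. The two-word input and identity weight matrices are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector x (length d) times matrix w (d x k) gives a length-k vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> List[Vector]:
    """Return one z vector per input word."""
    qs = [matvec(x, wq) for x in xs]  # query vectors q_i = X_i W^Q
    ks = [matvec(x, wk) for x in xs]  # key vectors   k_i = X_i W^K
    vs = [matvec(x, wv) for x in xs]  # value vectors v_i = X_i W^V
    outputs = []
    for q in qs:
        weights = softmax([dot(q, k) for k in ks])  # scores against every word
        z = [sum(w * v[j] for w, v in zip(weights, vs))
             for j in range(len(vs[0]))]            # weighted sum of value vectors
        outputs.append(z)
    return outputs


# Illustrative input: two words with 2-dimensional embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
zs = self_attention([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
```

With identity weights each word attends mostly to itself, so the first z vector leans toward the first word's value vector while still mixing in the second.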

As an optional scheme, before the first text information is input into the short text classification model, the method further includes:

S1, training a plurality of second encoders which are sequentially connected in series by using first classification training data to obtain a plurality of third encoders which are sequentially connected in series, wherein the initial short text classification model comprises the second encoders and a second excitation function layer which are sequentially connected in series;

S2, training the plurality of third encoders and the second excitation function layer which are sequentially connected in series by using the first classification training data to obtain the short text classification model.

Optionally, in this embodiment, a two-stage transfer training mode is adopted in training the short text classification model: the encoders are trained in a pre-training stage, and the pre-trained encoders and the excitation function layer are trained in a fine-tuning stage.

In the above alternative embodiment, as shown in fig. 9, it is important to first train a language model on a large-scale corpus so as to initially train the Transformer layers and the word embedding layer. This is the basis on which the later fine-tuning stage transfers semantics learned from the large-scale corpus. In the fine-tuning stage, fine-tuning is performed on the classification training data, with each parameter initialized to its final value from the pre-training stage. In general, only about 3 iterations over the classification training data are needed in the fine-tuning stage to classify the test data correctly and maintain a high level of performance.

As an alternative, in case the first sub-model comprises a sensitive word detection model,

S1, inputting the acquired first text information into a text type recognition model includes: inputting the first text information into the sensitive word detection model, wherein the sensitive word detection model is used for matching sensitive words included in a sensitive word dictionary with the first text information;

S2, obtaining a first recognition result output by the text type recognition model includes: under the condition that a first sensitive word included in the sensitive word dictionary is successfully matched with the first text information, determining that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs.

Optionally, in this embodiment, for malicious title sub-categories that can be covered by detecting individual sensitive words, such as "vulgar-illegal vocabulary" and "vulgar-personal attack vocabulary", the sensitive word method may be selected for recognition.
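Dictionary-based matching of this kind can be sketched as follows. The dictionary entries are hypothetical placeholders, and case-insensitive substring matching is an assumption; the description does not specify the matching algorithm.

```python
from typing import Dict, Optional, Tuple

# Hypothetical sensitive word dictionary: lower-case word -> text type.
SENSITIVE_WORDS: Dict[str, str] = {
    "badword": "vulgar-illegal vocabulary",
    "idiot": "vulgar-personal attack vocabulary",
}


def detect_sensitive_word(text: str,
                          dictionary: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (matched word, text type) for the first dictionary word found.

    Assumption: simple case-insensitive substring matching with lower-case
    dictionary keys. Returns None if no sensitive word matches.
    """
    lowered = text.lower()
    for word, text_type in dictionary.items():
        if word in lowered:
            return word, text_type
    return None


hit = detect_sensitive_word("He is such an IDIOT", SENSITIVE_WORDS)
```

For a large dictionary, a multi-pattern matcher such as a trie would avoid scanning the text once per word.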

As an optional scheme, after determining that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs, the method further includes:

S1, deleting the text content corresponding to the first sensitive word from the first text information to obtain a sensitive word template;

S2, matching the sensitive word template with the acquired text information set;

S3, acquiring second text information successfully matched with the sensitive word template from the text information set;

S4, extracting a second sensitive word from the second text information;

S5, adding the second sensitive word into the sensitive word dictionary.

Optionally, in this embodiment, a template-based method of expanding the sensitive word dictionary is proposed, as shown in fig. 13. A seed dictionary of sensitive words is first formed by manual labeling. The further the dictionary is expanded, however, the more manual effort each additional sensitive word requires; the labor cost of adding the same number of sensitive words grows exponentially, so manual expansion becomes too expensive and inefficient. Therefore, on the basis of the existing seed sensitive words and their corresponding title texts, short text templates (namely, the sensitive word templates) are formed by removing the sensitive word and keeping only the context text around it. The templates are applied to a large stream of video titles to extract candidate new sensitive words, which are then manually confirmed and added to the sensitive word dictionary, thereby expanding it. The expanded dictionary can in turn generate more templates, with which further new sensitive words can be found, forming an efficient closed loop.
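Steps S1 to S5 can be sketched with a regular expression serving as the template. This is a toy illustration under stated assumptions: the removed sensitive word is replaced by a slot that matches a single whitespace-free token, the whole title must match the template, and the sample titles are invented. The helper names `build_template` and `extract_candidates` are hypothetical.

```python
import re
from typing import Iterable, Pattern, Set


def build_template(text: str, sensitive_word: str) -> Pattern[str]:
    """S1: remove the sensitive word and keep only its surrounding context.

    Assumption: the gap left by the word becomes a slot for one token.
    """
    if sensitive_word not in text:
        raise ValueError("sensitive word does not occur in the text")
    before, _, after = text.partition(sensitive_word)
    return re.compile(re.escape(before) + r"(\S+)" + re.escape(after))


def extract_candidates(template: Pattern[str], texts: Iterable[str],
                       known_words: Set[str]) -> Set[str]:
    """S2-S4: match the template against a text set and extract new words."""
    found = set()
    for text in texts:
        match = template.fullmatch(text)
        if match and match.group(1) not in known_words:
            found.add(match.group(1))
    return found


# Invented example data.
dictionary = {"gross"}
template = build_template("watch this gross video now", "gross")
title_stream = [
    "watch this nasty video now",
    "watch this gross video now",
    "cooking tips for beginners",
]
candidates = extract_candidates(template, title_stream, dictionary)

# S5: in the described process, candidates are confirmed manually before
# being added; here they are added directly for brevity.
dictionary |= candidates
```

Each newly confirmed word can be passed back to `build_template` together with its title to generate further templates, which corresponds to the closed loop described above.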

As an optional scheme, the first text information includes a plurality of text information, where after obtaining the first recognition result output by the text type recognition model, the method further includes:

s1, under the condition that it is determined that target text information in the plurality of text information does not belong to the target text type, adding the target text information, marked as not belonging to the target text type, into second classification training data to obtain third classification training data, wherein the second classification training data is used for training an initial sub-model to obtain the first sub-model;

s2, training the first sub-model by using the third classification training data to obtain a second sub-model;

s3, replacing the first sub-model included in the text type recognition model with the second sub-model.

Optionally, in this embodiment, before adding the target text information labeled as not belonging to the target text type to the second classification training data to obtain third classification training data, the target text information not belonging to the target text type in the plurality of text information may be determined in, but is not limited to, the following manner: displaying the plurality of text information together with the target text type corresponding to each text information in the plurality of text information; determining text information on which a selection operation is performed among the plurality of text information as the target text information; determining that the target text information does not belong to the target text type.

Optionally, in this embodiment, a misjudged-sample backtracking procedure is proposed to strengthen the identification process step by step, as shown in fig. 14. The amount of manually labeled data is very limited (there are many classes, more than 20, and labeling efficiency is low). A model fine-tuned on such a small sample data set generally performs quite differently (in accuracy and recall) on online data than on the test set. The whole process can therefore be designed as a reinforcement learning process in which auditors participate. After online prediction, a manual verification step by an auditor is added to evaluate and control model performance. Because manual verification starts from the model's classification, the auditor can quickly judge which samples in each category are misjudged and can verify each category separately, which greatly improves labeling efficiency. Misjudged samples are pulled for each class and, after manual confirmation (one class at a time, which is fast and efficient), are fed back into the classification training data, so that the classification model is fine-tuned again on the new classification training data, forming a closed-loop virtuous cycle.
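The backtracking loop can be sketched as below. This is a toy illustration under stated assumptions: `toy_train` is a hypothetical stand-in that retrains a keyword classifier from scratch rather than fine-tuning a real model, and the sample titles and the auditor's selection are invented for the example.

```python
from typing import Callable, Iterable, List, Set, Tuple

Sample = Tuple[str, bool]         # (title, belongs to the target text type?)
SubModel = Callable[[str], bool]  # True if the title is assigned the target type


def toy_train(data: Iterable[Sample]) -> SubModel:
    """Stand-in for training: keep only words seen exclusively in positive samples."""
    positive_words: Set[str] = set()
    negative_words: Set[str] = set()
    for title, label in data:
        (positive_words if label else negative_words).update(title.lower().split())
    keywords = positive_words - negative_words
    return lambda title: any(word in keywords for word in title.lower().split())


def build_third_training_data(
    second_training_data: List[Sample],
    predicted_titles: List[str],
    rejected_by_auditor: Set[str],
) -> List[Sample]:
    """Add titles the model assigned to the target type but the auditor
    rejected, labeled as NOT belonging to the target type."""
    corrections = [(t, False) for t in predicted_titles if t in rejected_by_auditor]
    return list(second_training_data) + corrections


second_training_data = [("shocking secret revealed", True), ("weather report today", False)]
sub_models = [toy_train(second_training_data)]  # holds the first sub-model

online_titles = ["secret garden tour", "shocking news revealed"]
predicted = [t for t in online_titles if sub_models[0](t)]
rejected = {"secret garden tour"}               # auditor's selection operation

third_training_data = build_third_training_data(second_training_data, predicted, rejected)
sub_models[0] = toy_train(third_training_data)  # replace with the second sub-model
```

After the replacement, the misjudged title is no longer assigned the target type, while the correctly classified title still is.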

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

According to another aspect of the embodiments of the present invention, there is also provided a text information type recognition apparatus for implementing the above text information type recognition method, as shown in fig. 15, the apparatus including:

an input module 1502, configured to input the obtained first text information into a text type identification model, where the text type identification model includes multiple submodels connected in series in sequence, and each submodel in the multiple submodels is used to identify whether the text information input into each submodel belongs to a text type corresponding to each submodel;

a first obtaining module 1504, configured to obtain a first recognition result output by the text type recognition model, where the first recognition result indicates that a first sub-model of the multiple sub-models has determined that the first text information belongs to a target text type corresponding to the first sub-model.

Optionally, in case the first sub-model comprises a short text classification model,

the input module is used for: inputting the first text information into the short text classification model, wherein the short text classification model comprises a plurality of first encoders and a first excitation function layer which are sequentially connected in series, and each first encoder in the plurality of first encoders comprises a self-attention layer and a feedforward neural network layer;

the first obtaining module is configured to: acquire the first recognition result output by the first excitation function layer.

Optionally, the apparatus further comprises:

the first training module is used for training a plurality of second encoders which are sequentially connected in series by using first classification training data before the first text information is input into the short text classification model, so as to obtain a plurality of third encoders which are sequentially connected in series, wherein the initial short text classification model comprises the second encoders and a second excitation function layer which are sequentially connected in series;

the second training module is used for training the plurality of third encoders and the second excitation function layer which are sequentially connected in series by using the first classification training data to obtain the short text classification model.

Optionally, in case the first sub-model comprises a sensitive word detection model,

the input module is used for: inputting the first text information into the sensitive word detection model, wherein the sensitive word detection model is used for matching sensitive words included in a sensitive word dictionary with the first text information;

the first obtaining module is configured to: under the condition that a first sensitive word included in the sensitive word dictionary is successfully matched with the first text information, determine that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs.

Optionally, the apparatus further comprises:

a deleting module, configured to delete, after determining that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs, text content corresponding to the first sensitive word from the first text information, so as to obtain a sensitive word template;

the matching module is used for matching the sensitive word template with the acquired text information set;

the second obtaining module is used for obtaining second text information which is successfully matched with the sensitive word template from the text information set;

the extraction module is used for extracting a second sensitive word from the second text information;

a first adding module for adding the second sensitive word to the sensitive word dictionary.

Optionally, the first text information includes a plurality of text information, wherein the apparatus further includes:

the second adding module is used for, after a first recognition result output by the text type recognition model is obtained and under the condition that it is determined that target text information in the plurality of text information does not belong to the target text type, adding the target text information, marked as not belonging to the target text type, to second classification training data to obtain third classification training data, wherein the second classification training data is used for training an initial sub-model to obtain the first sub-model;

the third training module is used for training the first sub-model by using the third classification training data to obtain a second sub-model;

a replacing module, configured to replace the first sub-model included in the text type recognition model with the second sub-model.

Optionally, the apparatus further comprises:

As an alternative embodiment, the above text information type identification method may be applied to, but is not limited to, a scene of performing malicious title identification on video titles, as shown in fig. 16. In this scene, obvious problems are detected first, and problems that are less obvious, or for which the current model has a high misjudgment rate, are placed later, so as to reduce the traffic that the later stages must filter. Some malicious titles are complex and fall under two or more malicious title classifications at once; in this scene only the item detected first is reported (during manual review, a title is returned as non-compliant as soon as any one problem is detected). In other scenes where all malicious title types must be given, the flow architecture can be modified to change the serial operation into a parallel operation. As shown in fig. 16, the plurality of sequentially connected sub-models included in the text type recognition model are: a vulgar policy model, other policy models, a meaningless policy model, a title party (clickbait) policy model, and a vulgar association model.

In an optional implementation, the input video title is first identified by the vulgar policy model, which determines that the video title does not belong to the text type corresponding to the vulgar policy model. The video title is then identified by the other policy models, which determine that it does not belong to their corresponding text types. Next, the video title is identified by the meaningless policy model, which determines that it belongs to the text type "screen-swiping meaningless characters". The video title then does not need to be identified by the title party policy model or the vulgar association model, and the first recognition result is output directly, indicating that the video title belongs to the text type "screen-swiping meaningless characters".
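The serial flow with early exit can be sketched as follows. This is a minimal sketch, assuming each sub-model is a callable that returns a text type or `None`; the stub rules, the `called` log, and the `stop_at_first` flag are hypothetical and exist only to make the control flow visible.

```python
from typing import Callable, List, Optional, Tuple

SubModel = Tuple[str, Callable[[str], Optional[str]]]  # (name, classifier)


def recognize(title: str, sub_models: List[SubModel], stop_at_first: bool = True) -> List[str]:
    """Run the sub-models in order. With stop_at_first=True, return as soon as
    one sub-model assigns a text type; otherwise collect every assigned type."""
    found: List[str] = []
    for _name, classify in sub_models:
        text_type = classify(title)
        if text_type is not None:
            found.append(text_type)
            if stop_at_first:
                break
    return found


called: List[str] = []  # records which stub sub-models were actually invoked


def make_stub(name: str, rule: Callable[[str], Optional[str]]) -> SubModel:
    def classify(title: str) -> Optional[str]:
        called.append(name)
        return rule(title)
    return name, classify


def meaningless_rule(title: str) -> Optional[str]:
    # Toy rule: a title built from at most two distinct characters is meaningless.
    chars = title.replace(" ", "")
    if chars and len(set(chars)) <= 2:
        return "screen-swiping meaningless characters"
    return None


pipeline = [
    make_stub("vulgar policy model", lambda t: None),
    make_stub("other policy models", lambda t: None),
    make_stub("meaningless policy model", meaningless_rule),
    make_stub("title party policy model", lambda t: None),
    make_stub("vulgar association model", lambda t: None),
]

result = recognize("hahahahahaha", pipeline)
```

Passing `stop_at_first=False` approximates the variant in which all malicious title types are reported; the sketch still evaluates the sub-models one after another, so a truly parallel arrangement would need a different execution layer.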

According to still another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the above-mentioned text information type identification method. As shown in fig. 17, the electronic device includes: one or more processors 1702 (only one of which is shown), a memory 1704 in which a computer program is stored, a sensor 1706, an encoder 1708, and a transmission device 1710, the processor being arranged to perform the steps of any of the above-described method embodiments by means of the computer program.

Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

S1, inputting the acquired first text information into a text type recognition model, wherein the text type recognition model comprises a plurality of submodels which are sequentially connected in series, and each submodel in the plurality of submodels is used for recognizing whether the text information input into each submodel belongs to a text type corresponding to each submodel;

S2, acquiring a first recognition result output by the text type recognition model, wherein the first recognition result indicates that a first submodel of the multiple submodels has determined that the first text information belongs to a target text type corresponding to the first submodel.

Optionally, it can be understood by those skilled in the art that the structure shown in fig. 17 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 17 is a diagram illustrating the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 17, or have a different configuration than shown in fig. 17.

The memory 1704 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for identifying the type of text information in the embodiment of the present invention. The processor 1702 executes various functional applications and data processing by running the software programs and modules stored in the memory 1704, that is, it implements the above text information type identification method. The memory 1704 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1704 may further include memory located remotely from the processor 1702, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 1710 is used for receiving or transmitting data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 1710 includes a Network Interface Controller (NIC) that can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 1710 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

In particular, the memory 1704 is used to store an application program.

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

Optionally, the storage medium is further configured to store a computer program for executing the steps included in the method in the foregoing embodiment, which is not described in detail in this embodiment.

Optionally, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, read-only memories (ROMs), random access memories (RAMs), magnetic disks, optical disks, and the like.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying a type of text information, comprising:

inputting the acquired first text information into a text type identification model, wherein the text type identification model comprises a plurality of sub-models which are sequentially connected in series, and each sub-model in the plurality of sub-models is used for identifying whether the text information input into each sub-model belongs to a text type corresponding to each sub-model;

2. The method of claim 1, wherein, in the case that the first sub-model comprises a short-text classification model,

inputting the acquired first text information into the text type recognition model comprises: inputting the first text information into the short text classification model, wherein the short text classification model comprises a plurality of first encoders and a first excitation function layer which are sequentially connected in series, and each first encoder in the plurality of first encoders comprises a self-attention layer and a feedforward neural network layer;

acquiring the first recognition result output by the text type recognition model comprises: acquiring the first recognition result output by the first excitation function layer.

3. The method of claim 2, wherein prior to entering the first textual information into the short-text classification model, the method further comprises:

training a plurality of second encoders which are sequentially connected in series by using first classification training data to obtain a plurality of third encoders which are sequentially connected in series, wherein the initial short text classification model comprises the plurality of second encoders and a second excitation function layer which are sequentially connected in series;

and training the plurality of third encoders and the second excitation function layer which are sequentially connected in series by using the first classification training data to obtain the short text classification model.

4. The method of claim 1, wherein, in the case that the first sub-model comprises a sensitive word detection model,

inputting the acquired first text information into the text type recognition model comprises: inputting the first text information into the sensitive word detection model, wherein the sensitive word detection model is used for matching sensitive words included in a sensitive word dictionary with the first text information;

acquiring the first recognition result output by the text type recognition model comprises: under the condition that a first sensitive word included in the sensitive word dictionary is successfully matched with the first text information, determining that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs.

5. The method of claim 4, wherein after determining that the text type corresponding to the first sensitive word is the target text type to which the first text information belongs, the method further comprises:

deleting text content corresponding to the first sensitive word from the first text information to obtain a sensitive word template;

matching the sensitive word template with the acquired text information set;

acquiring second text information successfully matched with the sensitive word template from the text information set;

extracting a second sensitive word from the second text information;

adding the second sensitive word to the sensitive word dictionary.

6. The method according to claim 1, wherein the first text information comprises a plurality of text information, and wherein after obtaining the first recognition result output by the text type recognition model, the method further comprises:

under the condition that the target text information in the plurality of text information is determined not to belong to the target text type, adding the target text information which is marked not to belong to the target text type into second classification training data to obtain third classification training data, wherein the second classification training data is used for training an initial sub-model to obtain the first sub-model;

training the first sub-model by using the third classification training data to obtain a second sub-model;

replacing the first sub-model included in the text type recognition model with the second sub-model.

7. The method according to claim 6, wherein before adding the target text information labeled as not belonging to the target text type to second classification training data to obtain third classification training data, the method further comprises:

displaying the plurality of text information together with the target text type corresponding to each text information in the plurality of text information;

determining text information on which a selection operation is performed among the plurality of text information as the target text information;

determining that the target text information does not belong to the target text type.

8. An apparatus for recognizing a type of text information, comprising:

the text type recognition module is used for inputting the acquired first text information into a text type recognition model, wherein the text type recognition model comprises a plurality of sub-models which are sequentially connected in series, and each sub-model in the plurality of sub-models is used for recognizing whether the text information input into each sub-model belongs to a text type corresponding to each sub-model;

9. The apparatus of claim 8, wherein, in the case that the first sub-model comprises a short text classification model,

10. The apparatus of claim 9, further comprising:

11. The apparatus of claim 8, wherein, in the case that the first sub-model comprises a sensitive word detection model,

12. The apparatus of claim 11, further comprising:

13. The apparatus of claim 8, wherein the first text information comprises a plurality of text information, and wherein the apparatus further comprises:

14. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed.

15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.