CN113449109A

CN113449109A - Security class label detection method and device, computer equipment and storage medium

Info

Publication number: CN113449109A
Application number: CN202110762951.2A
Authority: CN
Inventors: 吴智东
Original assignee: Guangzhou Huaduo Network Technology Co Ltd
Current assignee: Guangzhou Huaduo Network Technology Co Ltd
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2021-09-28

Abstract

The application discloses a security class label detection method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring text information whose security class label is to be detected; calling a sequence labeling model to label the keywords in the text information and the security class label to which each keyword belongs, and calculating, from the keywords labeled with each security class label, a word segmentation evaluation score of the text information for that label; calling a text classification model to classify the text information, obtaining a full-text evaluation score of the text information for each security class label; and linearly fusing the two scores of each security class label to obtain a comprehensive evaluation score of the text information for that label, and taking the security class label with the maximum comprehensive evaluation score as the security class label of the text information. By fusing the two label classification models, the application detects the security class of the text information accurately along both the phrase dimension and the full-text dimension.

Description

Security class label detection method and device, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the field of information security, in particular to a security class label detection method, a security class label detection device, computer equipment and a storage medium.

Background

In the prior art, information content is mostly matched against a list of illegal keywords, and when a keyword is found in the content, a whitelist is consulted to judge whether the content is a violation. Other techniques calculate the text similarity between the information text and blacklisted content in a database, and judge the content to be a violation when the similarity exceeds a threshold.

The keyword matching method ignores the semantic information of the text, so most of the recalled data is noise unrelated to the violation categories. The similarity method depends heavily on the collected sample library, and when the information text contains samples that are absent from the database, it is very likely to miss them. How to improve the quality of violation detection for pushed information text has therefore become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

It is an object of the present application to overcome at least some of the disadvantages of the prior art and to provide a security class tag detection method, apparatus, computer device and storage medium.

In order to realize the purpose of the application, the following technical scheme is adopted:

a security class label detection method adapted to one of the objects of the present application, comprising the steps of:

acquiring text information of a security class label to be detected;

calling a sequence labeling model to label the keywords in the text information and the security class label to which each keyword belongs, and calculating, from the keywords labeled with each security class label, word segmentation evaluation scores of the text information for the respective security class labels, wherein the sequence labeling model is trained to a convergence state in advance;

calling a text classification model to perform classification evaluation on the text information to obtain full-text evaluation scores of the text information respectively hitting the safety class labels, wherein the text classification model is trained to a convergent state in advance;

and linearly fusing the word segmentation evaluation score and the full-text evaluation score corresponding to each security class label to obtain comprehensive evaluation scores of the text information for the respective security class labels, and determining the security class label with the maximum comprehensive evaluation score as the security class label of the text information.

In a further embodiment, the step of acquiring the text information whose security class label is to be detected includes: in response to a text information submission event, extracting the text information carried by the event, wherein the text information comprises the content text of an advertisement, a notice, or an article to be published;

and after the security class label with the maximum comprehensive evaluation score is determined as the security class label of the text information, the method further comprises: judging the security attribute of that security class label, prohibiting publication of the text information when the attribute is a non-security attribute, and allowing publication when it is a security attribute.

In a further embodiment, a sequence labeling model is called to label the text information with keywords in the text information and security category labels to which the keywords belong, and word segmentation evaluation scores of the text information respectively belonging to the security category labels are calculated according to the keywords labeled by the security category labels, which includes the following specific steps:

importing the text information into a sequence labeling model to perform keyword extraction based on semantic features to obtain a keyword sequence represented as a semantic vector;

the sequence labeling model carries out label prediction on the keyword sequence based on the semantic vector to obtain a label sequence describing a security class label corresponding to each keyword;

and the sequence labeling model calculates word segmentation evaluation scores of the text information respectively belonging to each safety class label according to the keywords corresponding to the safety class labels belonging to the non-safety attribute in the label sequence.

In a further embodiment, in the step of calculating the segmentation evaluation scores of the text information respectively belonging to the security category labels according to the keywords corresponding to the security category labels belonging to the non-security attribute in the label sequence, the calculation step of the segmentation evaluation score corresponding to each security category label is as follows:

determining the sum of the word numbers of all the keywords marked by the security category label;

determining a total word number of the text information;

and taking the ratio of the sum value to the total word number as a participle evaluation score corresponding to the security class label.
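The three steps above amount to a simple ratio per label. The following is a minimal sketch under stated assumptions: the function name `segmentation_scores` and the sample labels are hypothetical, and "word number" is interpreted as a character count via `len()`, which the source does not specify.

```python
from collections import defaultdict

def segmentation_scores(text, labeled_keywords):
    """Return {label: total length of keywords with that label / length of text}.

    labeled_keywords is a list of (keyword, label) pairs, as a sequence
    labeling model might produce them. Lengths are character counts (assumption).
    """
    total = len(text)
    if total == 0:
        return {}
    counts = defaultdict(int)
    for keyword, label in labeled_keywords:
        counts[label] += len(keyword)
    return {label: n / total for label, n in counts.items()}

# Hypothetical example: two keywords labeled "gambling" in a 29-character text.
scores = segmentation_scores(
    "win big at our casino tonight",
    [("win big", "gambling"), ("casino", "gambling")],
)
```

Here the "gambling" score is 13/29, since the two keywords cover 13 of the 29 characters.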

In a further embodiment, a text classification model is called to perform classification evaluation on the text information to obtain full-text evaluation scores of the text information respectively hitting the security class labels, and the method comprises the following specific steps:

importing the text information into a text classification model to perform semantic feature extraction based on the text information, and obtaining a semantic vector of the text feature;

and the text classification model classifies the semantic vectors by using a regression classifier to obtain the probability of the whole semantic vector hitting each safety class label as the corresponding full text evaluation score of each safety class label.
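The source only states that a "regression classifier" turns the semantic vector into one probability per security class label. One common way to obtain such probabilities is a softmax over per-label logits; the sketch below assumes that choice, and the names `softmax`, `full_text_scores`, and the sample logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def full_text_scores(logits, labels):
    """Map each security class label to its probability (full-text score)."""
    return dict(zip(labels, softmax(logits)))

# Hypothetical logits that a classifier head might emit for three labels.
full_scores = full_text_scores([2.0, 0.5, 0.1], ["gambling", "apparel", "sports"])
```

The probabilities sum to one, so each full-text evaluation score lies on the same 0 to 1 scale as the word segmentation evaluation score.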

In a further embodiment, in the step of linearly fusing the word segmentation evaluation scores corresponding to the security category labels with the full-text evaluation scores, the word segmentation evaluation scores and the full-text evaluation scores respectively carry respective weights, and the two weights reflect the correlation with each other by using the same preset hyper-parameter so as to realize the linear weighting of each other, so as to obtain the comprehensive evaluation scores of the text information respectively belonging to the security category labels.
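One plausible reading of "two weights related through the same preset hyper-parameter" is a convex combination with weights alpha and 1 - alpha. The sketch below assumes that reading; `fuse_scores`, `predict_label`, the default `alpha=0.5`, and the sample scores are all hypothetical.

```python
def fuse_scores(seg_scores, full_scores, alpha=0.5):
    """Linearly fuse per-label scores: alpha * seg + (1 - alpha) * full.

    A label missing from either dictionary contributes 0.0 for that score.
    """
    labels = set(seg_scores) | set(full_scores)
    return {
        label: alpha * seg_scores.get(label, 0.0)
        + (1 - alpha) * full_scores.get(label, 0.0)
        for label in labels
    }

def predict_label(seg_scores, full_scores, alpha=0.5):
    """Return the label with the maximum fused score, plus all fused scores."""
    fused = fuse_scores(seg_scores, full_scores, alpha)
    return max(fused, key=fused.get), fused

best, fused = predict_label(
    {"gambling": 0.4},
    {"gambling": 0.7, "apparel": 0.3},
    alpha=0.5,
)
```

With a single hyper-parameter, raising `alpha` shifts trust toward the phrase-level score and lowering it shifts trust toward the full-text score.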

In a preferred embodiment, the sequence labeling model and the text classification model construct semantic feature extractors thereof based on the same text pre-training model, so as to realize the extraction based on the semantic features.

A security class label detection apparatus adapted for the purpose of the present application, comprising:

the text information acquisition module is used for acquiring text information of the security class label to be detected;

the word segmentation estimation score calculation module is used for calling a sequence labeling model to label the keywords in the text information and the security class label to which each keyword belongs, and calculating, from the keywords labeled with each security class label, word segmentation evaluation scores of the text information for the respective security class labels, the sequence labeling model being trained to a convergence state in advance;

the full-text evaluation score acquisition module is used for calling a text classification model to perform classification evaluation on the text information to acquire full-text evaluation scores of the text information respectively hitting the security class labels, and the text classification model is trained to be in a convergence state in advance;

and the comprehensive evaluation value acquisition module is used for linearly fusing the word segmentation evaluation score and the full-text evaluation score corresponding to each security class label to obtain comprehensive evaluation scores of the text information for the respective security class labels, and determining the security class label with the maximum comprehensive evaluation score as the security class label of the text information.

In a further embodiment, the word segmentation estimation score calculation module includes:

the keyword sequence submodule is used for importing the text information into a sequence labeling model to extract keywords based on semantic features so as to obtain a keyword sequence represented as a semantic vector;

the label prediction submodule is used for performing label prediction on the keyword sequence by the sequence labeling model based on the semantic vector to obtain a label sequence describing a security class label corresponding to each keyword;

and the evaluation score sub-module is used for calculating the word segmentation evaluation scores of the text information respectively belonging to each safety category label according to the keywords corresponding to the safety category labels belonging to the non-safety attribute in the label sequence by the sequence labeling model.

In a further embodiment, the full-text assessment score obtaining module comprises:

the keyword sequence submodule is used for importing the text information into a text classification model to perform semantic-feature-based extraction so as to obtain a semantic vector of a text representation;

and the full-text scoring submodule is used for classifying the semantic vectors by the text classification model through a regression classifier, obtaining the probability of the whole semantic vector hitting each safety class label and using the probability as the full-text evaluation score corresponding to each safety class label.

In order to solve the above technical problem, an embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, cause the processor to execute the steps of the security class tag detection method.

In order to solve the above technical problem, an embodiment of the present invention further provides a storage medium storing computer readable instructions, which when executed by one or more processors, cause the one or more processors to execute the steps of the security class tag detection method.

The embodiment of the invention has the beneficial effects that:

the application provides a text violation content detection technology based on a label sequence, which is characterized in that a sequence labeling model and a text classification model are combined, scores of text information to be issued belonging to safety class labels are respectively predicted from the dimensionality of phrases and the dimensionality of full texts, and finally the two types of scores are subjected to linear fusion to determine the safety class labels to which the text information belongs.

Firstly, the application combines a sequence labeling model with a text classification model and calculates the score of the text information for each security class label. The traditional keyword matching method only takes keywords from a word bank and matches them against the text to judge whether illegal words are present. Detecting illegal words through sequence labeling and scoring instead improves the generalization ability of the illegal word extraction model: it can extract illegal words that are not in the training-set word bank, which strengthens illegal word recognition and effectively prevents text information from escaping detection merely because the illegal words it contains are absent from the word bank.

Secondly, the sequence labeling model and the text classification model are fused into an algorithm framework that analyzes the text information and detects its violation category from multiple dimensions. A single detection method usually attends to information in only one dimension and therefore cannot accurately determine the security class of the text information. Fusing multiple detection and classification methods broadens the range of text features the model attends to and improves the accuracy of illegal word detection of the whole scheme.

In addition, the application classifies the security class of the text information with a neural-network-based multi-classification model, so that it can automatically detect whether the text information contains illegal words and identify the violation type of the text information.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram of a typical network deployment architecture related to implementing the technical solution of the present application;

FIG. 2 is a schematic flow chart diagram of an exemplary embodiment of a security class label detection method of the present application;

FIG. 3 is a schematic flow chart illustrating a specific step of step S12 in FIG. 2;

FIG. 4 is a schematic flowchart illustrating a specific step of step S123 in FIG. 3;

FIG. 5 is a schematic flowchart illustrating a specific step of step S13 in FIG. 2;

FIG. 6 is a functional block diagram of an exemplary embodiment of a security class tag detection apparatus of the present application;

FIG. 7 is a block diagram of a basic structure of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As used herein, "client," "terminal," and "terminal device" include, as will be understood by those skilled in the art, both devices that have only a wireless signal receiver without transmit capability and devices that have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device, such as a personal computer or tablet, with a single-line display, a multi-line display, or no multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client" or "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client" or "terminal device" used herein may also be a communication terminal, a network access terminal, or a music/video playing terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, and may also be a smart TV, a set-top box, or another such device.

The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the capabilities of a personal computer. It is a hardware device having the necessary components described by the von Neumann architecture, such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, and an output device. A computer program is stored in the memory; the central processing unit loads the program from the external memory into the internal memory, executes its instructions, and interacts with the input and output devices, thereby accomplishing specific functions.

It should be noted that the concept of "server" as referred to in this application can be extended to the case of being applied to a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through interfaces, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the manner in which the network of the present application is deployed.

Referring to FIG. 1, the hardware basis required for implementing the embodiments of the present application may be deployed according to the architecture shown in the figure. The server 80 is deployed in the cloud as a business server and may further connect to related data servers and other servers providing related support, forming a logically associated server cluster that serves related terminal devices, such as the smart phone 81 and the personal computer 82 shown in the figure, or a third-party server (not shown). The smart phone and the personal computer can both access the Internet through a known network access method and establish a data communication link with the cloud server 80 so as to run a terminal application program related to the service provided by the server.

For the server, the application program is usually constructed as a service process, and a corresponding program interface is opened for remote call of the application program running on various terminal devices.

The application program refers to an application program running on a server or a terminal device, the application program implements the related technical scheme of the application in a programming mode, a program code of the application program can be saved in a nonvolatile storage medium which can be identified by a computer in a form of computer executable instructions and called into a memory by a central processing unit to run, and the related device of the application is constructed by running the application program on the computer.

The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.

Referring to FIG. 2, a security class label detection method according to the present application, in an exemplary embodiment, includes the following steps:

Step S11, acquiring the text information whose security class label is to be detected:

The server acquires the text information whose security class label is to be detected. The text information is generally text content to be released on the Internet for commercial promotion, article publication, and the like. To prevent unsafe or illegal text content from spreading on the Internet, the server acquires the text information, detects the security class label to which it belongs, and judges whether it can be published: publication is allowed when the attribute of that security class label is a security attribute and prohibited when it is a non-security attribute.

The text information is generally submitted by an Internet platform that has a data communication link with the server. When the platform or a platform user publishes text content such as an advertisement, a notice, or an article, the server responds to the text information submission event and acquires the text information, so as to detect the security class label to which it belongs and judge whether it can be published.

The security class label is a label used to indicate whether the text information may be published. Security class labels fall into two types: a security attribute type and a non-security attribute type. The non-security attribute type covers categories that violate network security or disrupt market order, such as gambling, political involvement, counterfeiting, infringement, or riot. The security attribute type covers categories such as commodity types or literary types, for example apparel, novels, poetry, sports, or electronic products. The security class labels are the labels for which the sequence labeling model and the text classification model calculate the word segmentation evaluation scores and the full-text evaluation scores of the text information.
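The two-level label taxonomy and the resulting publish decision can be sketched as a simple lookup. This is an illustrative sketch only: the set names, the English label strings, and the function `may_publish` are hypothetical, and a real platform would define its own label rules.

```python
# Hypothetical label sets mirroring the example categories in the text.
NON_SECURITY_LABELS = {"gambling", "political", "counterfeiting", "infringement", "riot"}
SECURITY_LABELS = {"apparel", "novel", "poetry", "sports", "electronics"}

def may_publish(label):
    """Return True if the label has a security attribute, False if it has a
    non-security attribute. Unknown labels raise instead of being guessed."""
    if label in NON_SECURITY_LABELS:
        return False
    if label in SECURITY_LABELS:
        return True
    raise ValueError(f"unknown security class label: {label!r}")

decision_for_ad = may_publish("apparel")      # publication allowed
decision_for_spam = may_publish("gambling")   # publication prohibited
```

Raising on unknown labels is a design choice of this sketch; it avoids silently publishing text whose label the platform has not classified.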

Step S12, calling a sequence labeling model to label the keywords in the text information and the security class label to which each keyword belongs, and calculating, from the keywords labeled with each security class label, the word segmentation evaluation scores of the text information for the respective security class labels, the sequence labeling model being trained to a convergence state in advance:

The server calls the sequence labeling model, which has been trained to a convergence state in advance, labels the security class label to which each keyword in the keyword sequence of the text information belongs, and calculates, from the keywords labeled with each security class label, the word segmentation evaluation scores of the text information for the respective security class labels.

The sequence labeling model is trained to a convergence state on a preset keyword tag library, which stores a plurality of pieces of text information together with their corresponding security class labels. The text information is collected by data capture, for example with a crawler system or by manual collection. A word segmenter is applied to each piece of text information to obtain the segmented words it contains, a corresponding security class label is configured for each segmented word, and the labeled text information is assembled into the keyword tag library. The security class labels are constructed according to rules formulated by the platform and are divided into two major types, the security attribute and the non-security attribute, each of which contains corresponding minor types. For example, the non-security attribute type may be divided into categories that violate network security or disrupt market order, such as gambling, political involvement, counterfeiting, infringement, or terrorism, and the security attribute type may be divided into commodity types or literary types, such as apparel, novels, poetry, sports, or electronic products.

Specifically, the storage architecture of the text information and the security class tags stored in the keyword tag library is as follows:

$$D_1 = \{(X_i, Y_i) \mid i \in \{1, \ldots, n\}\}$$

where $i$ denotes the $i$-th data record of the data set; $X_i$ represents the $i$-th text information, i.e. a text sentence or paragraph, consisting of $n_i$ characters, expressed as $X_i = (x_{i1}, x_{i2}, \ldots, x_{in_i})$;

$Y_i$ represents the security class label corresponding to each word in the $i$-th text information, consisting of $n_i$ security class labels, expressed as $Y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})$.
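A record of the keyword tag library pairs the characters of one text with one security class label per character. The sketch below shows one possible in-memory layout; the helper `make_record`, the span format, and the filler label `"O"` for unlabeled characters are assumptions, not part of the source.

```python
from typing import List, Tuple

Record = Tuple[List[str], List[str]]  # (characters X_i, labels Y_i)

def make_record(text: str, spans, default: str = "O") -> Record:
    """Build one (X_i, Y_i) record.

    spans is a list of (start, end, label) with end exclusive; characters
    outside every span receive the default label (assumption).
    """
    chars = list(text)
    labels = [default] * len(chars)
    for start, end, label in spans:
        for k in range(start, end):
            labels[k] = label
    return chars, labels

# Hypothetical library with a single record; "casino" occupies positions 4..9.
library: List[Record] = [make_record("buy casino chips", [(4, 10, "gambling")])]
chars, labels = library[0]
```

Keeping the two lists the same length makes each record directly usable as a training pair for a per-character sequence labeling model.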

The sequence labeling model combines a BERT model, a conditional random field (CRF), and the Viterbi algorithm, and is trained as a whole to a convergence state. The BERT model extracts keywords from the text information based on semantic features and converts the text information into a keyword sequence represented as semantic vectors; the CRF and the Viterbi algorithm then find the path with the highest probability among the paths formed by the security class labels over the keyword sequence. Provided the function of the sequence labeling model is realized, a person skilled in the art may construct and train it with other neural network models and algorithms according to the actual service scenario, which is not described further here.

Regarding the training of the sequence labeling model, generally all security class labels in the keyword tag library are imported into the conditional random field (CRF) model as random variables, and each piece of text information stored in the keyword tag library is imported into the sequence labeling model. The sequence labeling model uses a semantic feature extractor built on the text pre-training model BERT to extract keywords from the text information based on semantic features, obtaining the keyword sequence of the text information represented as semantic vectors. Using the CRF model and the Viterbi algorithm, it calculates the probability of each path formed by the keyword sequence and the security class labels in the CRF model, and takes the security class labels on the path with the highest probability as the security class labels of the text information. It then checks whether the security class labels pre-configured for the text information in the keyword tag library are the labels on that highest-probability path; if not, the probabilities corresponding to the security class labels in the CRF model are adjusted. This is repeated until most of the security class labels on the highest-probability path of the text information are the security class labels pre-configured in the keyword tag library, which indicates that the sequence labeling model has been trained to a convergence state.

Specifically, the training process of the sequence labeling model is as follows:

First, a semantic feature extractor built on the text pre-training model BERT converts the text information into the keyword sequence represented as semantic vectors:

$$V_{bert} = \mathrm{Bert}(X_i).$$

After the keyword sequence has been obtained, a conditional random field model is applied: a decoding layer, namely a CRF layer, is added on top of the BERT output layer $V_{bert}$. Decoding with the Viterbi algorithm then yields the security class labels that the keyword sequence is predicted to hit:

$$P_{seq} = \mathrm{CRF}(V_{bert}).$$
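Viterbi decoding over a CRF layer can be sketched in plain Python. This is a minimal illustration rather than the applicant's implementation: the function `viterbi`, the dictionary-based score layout, the labels `"O"` and `"G"`, and the numeric scores are all hypothetical, and scores are treated as additive (log-space) values.

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring label path.

    emissions: list over positions of {label: score}, e.g. from an encoder.
    transitions: {(prev_label, cur_label): score}; missing pairs count as 0.0.
    Returns (best_path, best_score).
    """
    if not emissions:
        return [], 0.0
    labels = list(emissions[0])
    best = {label: emissions[0][label] for label in labels}
    backpointers = []
    for emission in emissions[1:]:
        new_best, pointer = {}, {}
        for cur in labels:
            # Choose the predecessor that maximizes the score of reaching cur.
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointer[cur] = prev
        best = new_best
        backpointers.append(pointer)
    last = max(labels, key=best.get)
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, best[last]

# Hypothetical scores for a three-position input with labels "O" and "G".
emissions = [{"O": 2.0, "G": 0.0}, {"O": 0.0, "G": 2.0}, {"O": 0.2, "G": 0.0}]
transitions = {("G", "G"): 1.0}
path, score = viterbi(emissions, transitions)
```

In this example the transition bonus for staying in "G" outweighs the slightly higher emission score of "O" at the last position, so the decoded path is O, G, G.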

Following the idea of minimizing the negative log-likelihood, a loss function is constructed:

$$\mathrm{Loss}_{seq} = -\log(P_{seq}).$$
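The source does not spell out how the path probability is computed. In the standard linear-chain CRF formulation, which is assumed here, it is the exponentiated score of the reference path divided by the sum over all paths. The sketch below enumerates all paths by brute force for clarity; the names `path_score` and `crf_nll` and the sample scores are hypothetical, and practical implementations use the forward algorithm instead of enumeration.

```python
import math
from itertools import product

def path_score(emissions, transitions, path):
    """Sum of emission and transition scores along one label path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions.get((path[t - 1], path[t]), 0.0) + emissions[t][path[t]]
    return score

def crf_nll(emissions, transitions, gold_path):
    """Negative log-likelihood -log P(gold_path) under a linear-chain CRF.

    Enumerates every label path, so it is only suitable for tiny examples.
    """
    labels = list(emissions[0])
    scores = [
        path_score(emissions, transitions, candidate)
        for candidate in product(labels, repeat=len(emissions))
    ]
    peak = max(scores)
    log_partition = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_partition - path_score(emissions, transitions, gold_path)

# With all scores zero, every one of the 2**3 paths is equally likely.
uniform = [{"O": 0.0, "G": 0.0} for _ in range(3)]
uniform_loss = crf_nll(uniform, {}, ["O", "O", "O"])

# Hypothetical non-uniform scores; the loss is smaller for a well-scored path.
emissions = [{"O": 2.0, "G": 0.0}, {"O": 0.0, "G": 2.0}, {"O": 0.2, "G": 0.0}]
good_loss = crf_nll(emissions, {("G", "G"): 1.0}, ["O", "G", "G"])
bad_loss = crf_nll(emissions, {("G", "G"): 1.0}, ["G", "O", "G"])
```

Minimizing this loss pushes the score of the reference label path up relative to all competing paths, which is the behavior the negative log-likelihood objective is meant to produce.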

The keyword tag library is traversed and the model parameters are optimized and updated with the AdamW algorithm. This process is iterated until the sequence labeling model is trained to a convergence state.
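The AdamW update can be illustrated for a single scalar parameter. This is a minimal sketch of the standard rule with decoupled weight decay, not the applicant's training code; the function name `adamw_step` and all hyper-parameter values are assumptions.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Weight decay is applied directly to theta, separate from the gradient term.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# First step from theta = 1.0 with gradient 0.5 and no weight decay.
theta, m, v = adamw_step(1.0, 0.5, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.0)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by roughly one learning rate, from 1.0 to about 0.9.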

After the server acquires the text information, it imports the text information into the sequence labeling model. The sequence labeling model uses the semantic feature extractor built on the text pre-training model BERT to extract keywords from the text information based on semantic features, constructing a keyword sequence represented as semantic vectors. It then performs label prediction on the keyword sequence based on those semantic vectors to obtain a label sequence describing the security class label of each keyword, and calculates the word segmentation evaluation scores of the text information for the respective security class labels from the keywords whose security class labels have a non-security attribute.

For example, the text information X_s is input into the sequence labeling model M_seq, and the model decodes the input to obtain:

Y_s＝(y_s1，y_s2，...，y_sn)＝M_seq(X_s)

where y_sn represents the predicted tag corresponding to the nth word of the input text X_s.

Referring to fig. 3, an implementation of calculating each of the segmentation evaluation scores of the text information according to the sequence labeling model includes the following specific implementation steps:

step S121, importing the text information into a sequence labeling model to extract key words based on semantic features, and obtaining a key word sequence represented as a semantic vector:

The server imports the text information into the sequence labeling model, so that the sequence labeling model extracts keywords from the text information based on semantic features to obtain the keyword sequence represented as the semantic vector.

The sequence labeling model calls a word segmentation device to perform word segmentation on the text information, so as to obtain the characters contained in the text information and convert them into the keyword sequence.

Regarding the selection of the word segmentation device: when the text information is a Chinese text, a word segmentation device oriented to the Chinese field, such as the LTP, THULAC, jieba, or KCWS word segmentation device, is selected to segment the text information so as to preliminarily obtain all characters contained in the text information. If the text information is an English text, spaces and non-English special characters are removed to obtain the keyword group contained in the text information, or a corresponding word segmentation device oriented to the English field is selected, for example the large model of spaCy. Those skilled in the art can select an existing word segmentation device according to the actual service scenario, which is not elaborated here.
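For the English case described above, a minimal sketch of the "remove spaces and non-English special characters" approach is given below. The function name `tokenize_english` and the use of a regular expression are assumptions made for illustration; a Chinese text would instead be passed to one of the segmenters named above.

```python
import re


def tokenize_english(text: str) -> list:
    """Split English text into words, dropping spaces, digits and special characters.

    This is an illustrative fallback only; it keeps runs of ASCII letters
    and discards everything else.
    """
    return re.findall(r"[A-Za-z]+", text)


example_tokens = tokenize_english("Brand-new air cushion insoles!")
```

A dedicated English segmenter would normally handle contractions, numbers and punctuation more carefully than this pattern does.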

The keyword sequence is extracted by the sequence labeling model, based on semantic features, from the characters contained in the text information. The sequence labeling model uses a semantic feature extractor, which generally refers to the text pre-training model Bert. First, all the characters contained in the text information are converted into word vectors; then the word vectors are converted into text vectors representing the global semantic information of the text information; finally, different vectors are added to the characters at different positions of the text vectors to form position vectors, which represent the differences between the semantic information carried by characters at different positions in the text information. By performing this ordered vector conversion on the text information, the keyword sequence represented as the semantic vector of the text information is constructed.

Step S122, the sequence labeling model performs label prediction on the keyword sequence based on the semantic vector to obtain a label sequence describing the security category label corresponding to each keyword:

After the sequence labeling model completes the construction of the keyword sequence, it performs label prediction on the keyword sequence based on the semantic vector of the keyword sequence to obtain a label sequence describing the security category label corresponding to each keyword.

Specifically, the sequence tagging model inputs the keyword sequence into the conditional random field model (CRF) to perform the tag prediction, the conditional random field model (CRF) tags the corresponding security class tags according to semantic vectors of the keyword sequence, and when the keyword sequence is [ imitative, make, love, dy, da, prosperous, upper, new, empty, gas, shoes, mats, self, e.g., step, body, check ], the conditional random field model (CRF) obtains tag sequences of the security class tags corresponding to the keywords:

['B-fake'，'I-fake'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O','O'，'O'，'O'，'O','O','O','O','O'，'O'，'O'，'O']。

The security category labels typically use BIO tagging: B denotes the beginning of a keyword, I denotes the continuation of a keyword, and O denotes a non-entity word.
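To make the BIO convention concrete, the following sketch groups tagged characters back into keywords with their security category labels. The function name `extract_keywords` and the placeholder characters in the example are hypothetical; only the tag format ('B-fake', 'I-fake', 'O') follows the example above.

```python
def extract_keywords(chars, tags):
    """Group characters into (keyword, label) pairs according to BIO tags.

    A 'B-x' tag opens a keyword with label x, a following 'I-x' tag extends it,
    and any other tag (including 'O' or a mismatched 'I-' tag) closes it.
    """
    keywords = []
    current, label = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                keywords.append(("".join(current), label))
            current, label = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(ch)
        else:
            if current:
                keywords.append(("".join(current), label))
            current, label = [], None
    if current:
        keywords.append(("".join(current), label))
    return keywords


# Placeholder characters stand in for the characters of a real text.
example_keywords = extract_keywords(
    ["a", "b", "c", "d"], ["B-fake", "I-fake", "O", "O"]
)
```

In this example the first two characters form one keyword labeled "fake", and the characters tagged 'O' are ignored.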

Step S123, the sequence labeling model calculates word segmentation evaluation scores of the text information respectively belonging to each safety category label according to the keywords corresponding to the safety category labels belonging to the non-safety attribute in the label sequence:

The sequence labeling model calculates the word segmentation evaluation scores of the text information for the respective security category labels according to the keywords extracted from the label sequence whose security category labels belong to the non-security attribute.

For example, when the keyword sequence is [ imitative, make, love, dy, da, prosperous, up, new, empty, gas, shoe, mat, self, e.g., step, body, test ], the tag sequence of the security category tag is:

['B-fake'，'I-fake','O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O'，'O']

Then the keyword ω "counterfeit" is extracted, whose security category label is "fake", which is a non-security attribute.

Specifically, the expression by which the sequence labeling model calculates the word segmentation evaluation score of the text information for each security category label is as follows:

Score_seq(c_i)＝(|ω_1|+|ω_2|+...+|ω_h|)/|X_s|

wherein Score_seq(c_i) represents the word segmentation evaluation score of the text information X_s for the security category label c_i, |X_s| represents the number of words contained in the text information X_s, ω represents a keyword extracted by the model that belongs to the security category label c_i, |ω| represents the number of words of the keyword ω, and h represents the total number of keywords extracted by the model that belong to the security category label c_i of the non-security attribute.

Referring to fig. 4, the specific implementation steps of the embodiment of calculating the segmentation evaluation score of a certain security category label of the text information according to the sequence annotation model are as follows:

step S1231, determining a sum of the word numbers of all the keywords labeled by the security category label:

and the sequence labeling model determines the word number of one or more keywords corresponding to the security class label belonging to a certain non-security attribute in the keyword sequence, and sums the word numbers to obtain the sum of the word numbers of the keywords.

Step S1232, determining the total word number of the text information:

the sequence labeling model determines the total word number of the text information to which the keyword sequence belongs.

Step S1233, taking the ratio of the sum value to the total word number as the word segmentation evaluation score corresponding to the security class label:

The sequence labeling model divides the sum of the word numbers of the keywords by the total word number of the text information, and takes the ratio obtained by the division operation as the word segmentation evaluation score of the text information for that security category label.
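Steps S1231 to S1233 can be sketched as follows. The function name `segmentation_scores`, the (keyword, label) input format and the example label names are assumptions for illustration; the calculation itself is the ratio described above, namely the summed length of the keywords tagged with a label divided by the total length of the text.

```python
def segmentation_scores(keywords, total_words, labels):
    """Compute one word segmentation evaluation score per security category label.

    keywords    : list of (keyword, label) pairs extracted from the tag sequence.
    total_words : total word (character) count of the text information.
    labels      : all security category labels to be scored.
    """
    counts = {label: 0 for label in labels}
    for word, label in keywords:
        if label in counts:
            counts[label] += len(word)      # step S1231: sum keyword lengths
    if total_words <= 0:                    # guard against empty text
        return {label: 0.0 for label in labels}
    # Steps S1232 and S1233: divide by the total word count of the text.
    return {label: counts[label] / total_words for label in labels}


# Hypothetical example: one 2-character keyword labeled "fake" in a 20-character text.
example_seq_scores = segmentation_scores([("ab", "fake")], 20, ["fake", "porn"])
```

Labels for which no keyword was extracted receive a score of 0, so the result always has one entry per security category label.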

Step S13, calling a text classification model to classify and evaluate the text information, obtaining full text evaluation scores of the text information respectively hitting the safety class labels, wherein the text classification model is trained to a convergence state in advance:

The server calls the text classification model, which has been trained to a convergence state in advance, to perform classification evaluation on the text information and obtain the full-text evaluation score of the text information hitting each security category label.

The text classification model is trained to a convergence state according to a pre-configured text label library, a plurality of text messages and the corresponding security category labels are stored in the text label library, the text messages are acquired through a data capturing mode such as a crawler system or manual collection, and the corresponding security category labels are configured for the text messages according to the semantics of the text messages and are labeled.

Specifically, the storage architecture of the text information and the security class tag in the text tag library is as follows:

D₂：{(X_i，C_i)|i∈1，...，n}.

where i denotes the ith piece of text information in the text label library, X_i denotes the ith text information, and C_i denotes the security category label to which the ith text information belongs, with C_i ∈ (1, 2, ..., k), where k denotes the number of security category labels corresponding to the text information.

Specifically, the training process of the text classification model is as follows:

The text classification model is trained. This model is used to classify text information into security category labels. The text information in the text label library is converted into a semantic vector representing full-text semantics by using a semantic feature extractor constructed based on the text pre-training model Bert:

V_bert＝Bert(X_i)

After the conversion to the semantic vector is completed, a regression classifier constructed based on the softmax function is used to predict the security category label of the semantic vector:

P_cls＝Softmax(V_bert)

The cross entropy between the predicted security category label and the security category label C_i marked for the text information in the text label library is calculated and taken as the loss function:

Loss_cls＝CrossEntropy(P_cls，C_i)

And traversing the text label library, and performing optimization updating on model parameters by using an AdamW algorithm. And iterating the process until the text classification model is trained to a convergence state.
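The softmax prediction and cross-entropy loss used in this training process can be sketched in plain Python. The sketch assumes that the classifier produces one raw score (logit) per security category label from the semantic vector; the function names and the example logits are hypothetical, and the projection from V_bert to logits is omitted.

```python
import math


def softmax(logits):
    """Normalized exponential function: maps raw scores to probabilities summing to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Cross entropy between predicted probabilities and a single labeled class."""
    return -math.log(probs[gold_index])


# Hypothetical logits for three security category labels; the labeled class is index 0.
example_probs = softmax([2.0, 1.0, 0.1])
example_loss = cross_entropy(example_probs, 0)
```

The loss is small when the labeled security category label receives a high probability and grows as that probability approaches zero, which is what the optimizer minimizes over the text label library.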

In an embodiment, the sequence labeling model and the text classification model use semantic feature extractors constructed from the same text pre-training model, so as to construct the semantic vectors of the text information in each model while simplifying the network of each model. The semantic feature extractor may be constructed based on the Bert model or based on other text pre-training models such as GPT or ERNIE, and those skilled in the art may select a corresponding model to construct the semantic feature extractor according to the actual service, which is not elaborated here.

The server imports the text information into the text classification model trained to a convergence state. The text classification model uses a semantic feature extractor constructed based on the text pre-training model Bert to extract the semantic features of the full-text semantics of the text information, so as to construct a semantic vector corresponding to the text information. The regression classifier Softmax is then called to classify the semantic vector over the security category labels, the probability of hitting each security category label is obtained, and this probability is used as the full-text evaluation score of the text information hitting each security category label.

Referring to fig. 5, the specific implementation steps of the embodiment of the text classification model for predicting the full-text evaluation score of each security class label hit by the text information are as follows:

step S131, importing the text information into a text classification model for semantic feature extraction, and obtaining a semantic vector of text representation:

The server imports the text information into the text classification model, so that the text classification model performs semantic feature extraction on the text information to obtain the semantic vector representing the full-text semantics of the text information.

It should be noted that, in contrast to the keyword extraction of the sequence labeling model, the text classification model extracts semantic features from the full-text semantics of the text information and constructs the semantic vector representing the full-text semantics of the text information.

Step S132, the text classification model classifies the semantic vectors by a regression classifier to obtain the probability that the whole semantic vector hits each safety category label as the full-text evaluation score corresponding to each safety category label:

after the text classification model obtains the semantic vector corresponding to the text information, the regression classifier is called to classify the semantic vector, and the probability that the full-text semantics represented by the whole semantic vector hit each safety category label is obtained and used as the full-text evaluation score corresponding to each safety category label of the text information.

The regression classifier is generally built on a normalized exponential function (Softmax activation function) and predicts, for the semantic vector, the probability of each of the different security category labels, the sum of these probabilities being 1. The text classification model takes these probabilities as the full-text evaluation scores of the text information hitting the security category labels.

Specifically, the text classification model calculates the full-text evaluation scores of the text information for the respective security category labels as follows:

The text information X_s is input to the text classification model M_cls, and the model predicts the security category labels for the input,

where c_k denotes the kth security category label.

It can be understood that the sequence labeling model predicts the security category labels of the text information from the dimension of the keywords contained in the text information, calculating the word segmentation evaluation scores of the text information hitting the security category labels by using a conditional random field (CRF) and the Viterbi path algorithm according to those keywords. The text classification model predicts the security category labels of the text information from the dimension of the full-text semantics of the text information, calculating the full-text evaluation scores of the text information hitting the security category labels by using a normalized exponential function (Softmax activation function) according to the semantic vector representing the full-text semantics.

Step S14, performing linear fusion on the word segmentation evaluation scores and the full-text evaluation scores corresponding to the security category labels to obtain comprehensive evaluation scores of the text information for the respective security category labels, and determining the security category label with the maximum comprehensive evaluation score as the security category label of the text information:

After the server acquires the word segmentation evaluation scores and the full-text evaluation scores of the text information, it performs linear fusion on them to obtain the comprehensive evaluation scores of the text information for the respective security category labels, and determines the security category label with the maximum comprehensive evaluation score as the security category label of the text information.

It can be understood that the server calculates the same number of word segmentation evaluation scores and full-text evaluation scores for the text information, because the sequence labeling model and the text classification model each predict, from a different dimension (the word dimension and the full-text dimension), the probability of the text information hitting each of the security category labels. Therefore, the server can pair the word segmentation evaluation score and the full-text evaluation score that correspond to the same security category label, call a hyper-parameter, and linearly add the two scores to obtain the comprehensive evaluation score of that security category label, and so on until the comprehensive evaluation score of the text information for each security category label has been calculated, thereby completing the linear fusion.

The hyper-parameters are preset parameters, and are generally obtained by estimation or data learning of output or input data of the sequence labeling model and the text classification model, or by grid search and cross validation.
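One way the grid search mentioned above could look is sketched below. It assumes a held-out set of examples, each carrying the two score dictionaries and a labeled security category label, and it selects the weight that maximizes accuracy on that set; the function name `grid_search_alpha`, the accuracy criterion and the step size are assumptions, and cross validation is not shown.

```python
def grid_search_alpha(examples, n_steps=10):
    """Pick the fusion weight in [0, 1] that maximizes accuracy on labeled examples.

    examples : list of (score_cls, score_seq, gold_label), where the two scores
               are dicts mapping each security category label to a float.
    n_steps  : number of equal intervals used to sample the weight.
    """
    best_alpha, best_acc = 0.0, -1.0
    for k in range(n_steps + 1):
        alpha = k / n_steps
        correct = 0
        for score_cls, score_seq, gold in examples:
            fused = {
                c: alpha * score_cls[c] + (1 - alpha) * score_seq.get(c, 0.0)
                for c in score_cls
            }
            if max(fused, key=fused.get) == gold:
                correct += 1
        acc = correct / len(examples)
        if acc > best_acc:                  # keep the first weight reaching the best accuracy
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc


# Hypothetical validation examples with two labels.
example_set = [
    ({"safe": 0.4, "fake": 0.6}, {"safe": 0.0, "fake": 0.0}, "fake"),
    ({"safe": 0.7, "fake": 0.3}, {"safe": 0.0, "fake": 0.5}, "safe"),
]
example_alpha, example_acc = grid_search_alpha(example_set)
```

In practice the search would be run over a much larger labeled set, and the chosen weight would be checked on data not used for the search.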

Specifically, by introducing a hyper-parameter α, the server linearly adds, element by element, the full-text evaluation score (Score_cls) and the word segmentation evaluation score (Score_seq) corresponding to each security category label to obtain the comprehensive evaluation score corresponding to each security category label, and determines the security category label with the maximum comprehensive evaluation score as the security category label of the text information. The specific expression is as follows:

S＝αScore_cls+(1-α)Score_seq

c＝argmax(S)

The word segmentation evaluation score and the full-text evaluation score carry respective weights, and the two weights are based on the same preset hyper-parameter. Specifically, as shown in the above expression, the weight of the full-text evaluation score (Score_cls) is α and the weight of the word segmentation evaluation score (Score_seq) is (1-α). The weights reflect the correlation between the word segmentation evaluation score and the full-text evaluation score, so as to realize linear weighting of the two and obtain the comprehensive evaluation score of the text information for each security category label.
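The expression S＝αScore_cls+(1-α)Score_seq followed by c＝argmax(S) can be illustrated with a short sketch. The function name `fuse_scores`, the dictionary representation of the scores and the example values are assumptions made for illustration.

```python
def fuse_scores(score_cls, score_seq, alpha):
    """Linearly fuse the two scores per label and return (fused scores, best label).

    score_cls : dict of full-text evaluation scores per security category label.
    score_seq : dict of word segmentation evaluation scores per label.
    alpha     : weight of the full-text score; the segmentation score gets 1 - alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fused = {
        label: alpha * score_cls[label] + (1 - alpha) * score_seq.get(label, 0.0)
        for label in score_cls
    }
    best_label = max(fused, key=fused.get)   # c = argmax(S)
    return fused, best_label


# Hypothetical scores for two labels with an equal weighting of both models.
example_fused, example_label = fuse_scores(
    {"safe": 0.6, "fake": 0.4}, {"safe": 0.0, "fake": 0.5}, alpha=0.5
)
```

In this example the full-text model alone would favor "safe", but the keyword evidence raises the fused score of "fake" above it, which shows how the two dimensions complement each other.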

After determining the security category label with the maximum comprehensive evaluation score as the security category label of the text information, the server judges the security attribute of that security category label. If it is a non-security attribute, the server prohibits publishing the text information to the corresponding platform for output display; if it is a security attribute, the server permits publishing the text information to the corresponding platform for output display.
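The publishing decision described above amounts to a lookup of the security attribute of the detected label. A minimal sketch follows, in which the set `UNSAFE_LABELS`, its contents and the function name `may_publish` are hypothetical.

```python
# Hypothetical set of security category labels that carry the non-security attribute.
UNSAFE_LABELS = frozenset({"fake"})


def may_publish(label: str, unsafe_labels=UNSAFE_LABELS) -> bool:
    """Return True when the detected label is a security attribute, False otherwise."""
    return label not in unsafe_labels
```

A server would call such a check with the label produced by the fusion step and block or release the text information accordingly.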

Further, a security class label detection apparatus of the present application can be constructed by functionalizing the steps in the methods disclosed in the above embodiments. According to this idea, referring to fig. 6, in an exemplary embodiment, the apparatus includes a text information acquisition module 11, a word segmentation evaluation score calculation module 12, a full-text evaluation score acquisition module 13 and a comprehensive evaluation score acquisition module 14. The text information acquisition module 11 is configured to acquire text information of a security category label to be detected. The word segmentation evaluation score calculation module 12 is configured to call a sequence labeling model to label the keywords in the text information and the security category labels to which the keywords belong, and to calculate the word segmentation evaluation scores of the text information for the respective security category labels according to the keywords labeled with the security category labels, wherein the sequence labeling model is trained to a convergence state in advance. The full-text evaluation score acquisition module 13 is configured to call a text classification model to perform classification evaluation on the text information, so as to obtain the full-text evaluation scores of the text information hitting the respective security category labels, wherein the text classification model is trained to a convergence state in advance. The comprehensive evaluation score acquisition module 14 is configured to perform linear fusion on the word segmentation evaluation scores and the full-text evaluation scores corresponding to the security category labels, obtain the comprehensive evaluation scores of the text information for the respective security category labels, and determine the security category label with the maximum comprehensive evaluation score as the security category label of the text information.

In one embodiment, the word segmentation estimation score calculation module comprises: the keyword sequence submodule is used for importing the text information into a sequence labeling model to extract keywords based on semantic features so as to obtain a keyword sequence represented as a semantic vector; the label prediction submodule is used for performing label prediction on the keyword sequence by the sequence labeling model based on the semantic vector to obtain a label sequence describing a security class label corresponding to each keyword; and the evaluation score sub-module is used for calculating the word segmentation evaluation scores of the text information belonging to each safety category label respectively according to the keywords corresponding to the safety category labels belonging to the non-safety attribute in the label sequence by the sequence labeling model.

In one embodiment, the full-text assessment score acquisition module comprises: the keyword sequence submodule is used for importing the text information into a text classification model to perform semantic feature extraction based on the text information to obtain a semantic vector of text representation; and the full-text scoring submodule is used for classifying the semantic vectors by the text classification model through a regression classifier, obtaining the probability of the whole semantic vector hitting each safety class label and using the probability as the full-text evaluation score corresponding to each safety class label.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, configured to run a computer program implemented according to the security class tag detection method. Referring to fig. 7 in detail, fig. 7 is a block diagram of a basic structure of a computer device according to the embodiment.

Fig. 7 is a schematic diagram of the internal structure of the computer device. The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can enable the processor to realize the security class label detection method when being executed by the processor. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform a method of security class tag detection. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 7 is a block diagram of only a portion of the architecture associated with the disclosed aspects and is not intended to serve as a limitation on the computing devices to which the disclosed aspects may be applied, as a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In this embodiment, the processor is used to execute specific functions of each module/sub-module in the security class tag detection apparatus of the present invention, and the memory stores program codes and various types of data required to execute the modules. The network interface is used for data transmission to and from a user terminal or a server.

The memory in this embodiment stores program codes and data necessary for executing all modules/submodules in the security class tag detection device, and the server can call the program codes and data of the server to execute the functions of all the submodules.

The present application also provides a non-volatile storage medium, in which the security class label detection method is written as a computer program and stored in the storage medium in the form of computer readable instructions, which when executed by one or more processors, means execution of the program in a computer, thereby causing the one or more processors to perform the steps of the security class label detection method of any of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).

In summary, the present application constructs a text violation content detection technology that fuses two label classification models and accurately detects the security category of text information from multiple dimensions. By combining a sequence labeling model and a text classification model, the technology predicts the scores of the security category labels of the published text information from the phrase dimension and the full-text dimension respectively, and determines the security category label of the text by linearly fusing the two types of scores.

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flowchart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed sequentially, but may be performed alternately or in turns with other steps or at least a portion of the sub-steps or stages of other steps.

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, changed, rearranged, decomposed, combined, or deleted.

The foregoing is only a part of the embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A security class label detection method, comprising the steps of:

acquiring text information of a security class label to be detected;

calling a text classification model to perform classification evaluation on the text information to obtain full-text evaluation scores of the text information hitting the safety class labels respectively, wherein the text classification model is trained to be in a convergence state in advance;

and performing linear fusion on the word segmentation evaluation scores corresponding to the safety class labels and the full-text evaluation scores to obtain comprehensive evaluation scores of the text information respectively belonging to the safety class labels, and determining the safety class label of the text information with the maximum comprehensive evaluation score.

2. The method of claim 1,

the method for acquiring the text information of the security class label to be detected comprises the following steps: responding to a text information submission event, and extracting text information in the text information, wherein the text information comprises a content text of an advertisement to be published, a content text of a notice to be published or a content text of an article to be published;

and after determining that the maximum comprehensive evaluation score is the security class label of the text information, the method comprises the following steps: judging the security attribute of the security class label, and forbidding to release the text information when the security attribute is a non-security attribute; when it is a security attribute, the text information is allowed to be issued.

3. The method according to claim 1, wherein a sequence labeling model is called to label the text information with keywords in the text information and security category labels to which the keywords belong, and word segmentation evaluation scores of the text information respectively belonging to the security category labels are calculated according to the keywords labeled by the security category labels, comprising the following specific steps:

importing the text information into a sequence labeling model to perform keyword extraction based on semantic features, and obtaining a keyword sequence represented as a semantic vector;

4. The method according to claim 3, wherein in the step of calculating the segmentation evaluation scores of the text message respectively belonging to the security class labels according to the keywords corresponding to the security class labels belonging to the non-security attribute in the label sequence, the calculation step of the segmentation evaluation score corresponding to each security class label is as follows:

determining a total word number of the text information;

and taking the ratio of the sum value to the total word number as a participle evaluation score corresponding to the safety class label.

5. The method according to claim 1, wherein a text classification model is called to perform classification evaluation on the text information to obtain full-text evaluation scores of the text information respectively hitting the security class labels, and the method comprises the following specific steps:

importing the text information into a text classification model to perform semantic feature extraction based on the text information, and obtaining a semantic vector of text representation;

and the text classification model classifies the semantic vectors by using a regression classifier to obtain the probability of the whole semantic vector hitting each safety class label as the full-text evaluation score corresponding to each safety class label.

6. The method according to claim 1, wherein in the step of linearly fusing the word segmentation evaluation scores corresponding to the security class labels with the full-text evaluation scores, the word segmentation evaluation scores and the full-text evaluation scores respectively carry respective weights, and the two weights represent the correlation with each other by the same preset hyper-parameter to realize the linear weighting of each other, so as to obtain the comprehensive evaluation scores of the text information belonging to the security class labels respectively.

7. The method according to any one of claims 1 to 6, wherein the sequence labeling model and the text classification model construct a semantic feature extractor thereof based on the same text pre-training model, so as to realize the semantic feature-based extraction.

8. A security class tag detection apparatus, comprising:

9. An electronic device comprising a central processor and a memory, wherein the central processor is configured to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.

10. A non-volatile storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the method.