CN113342927A - Sensitive word recognition method, device, equipment and storage medium - Google Patents

Sensitive word recognition method, device, equipment and storage medium

Info

Publication number
CN113342927A
CN113342927A
Authority
CN
China
Prior art keywords
vector
word
user
sensitive
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110470946.4A
Other languages
Chinese (zh)
Other versions
CN113342927B (en)
Inventor
付桂振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110470946.4A priority Critical patent/CN113342927B/en
Publication of CN113342927A publication Critical patent/CN113342927A/en
Application granted granted Critical
Publication of CN113342927B publication Critical patent/CN113342927B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to artificial intelligence technology and discloses a sensitive word recognition method, device, equipment and storage medium. The method comprises the following steps: acquiring text content published by a target user on a predetermined social platform, and processing each word in the text content into a word vector by using a predetermined text processing model; acquiring the information of users who leave messages under the text content, and constructing a social relationship graph of the target user and the users to whom the information belongs; inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring the output user feature vector; fusing the word vectors and the user feature vector, and generating a matrix to be recognized based on the fused vectors; and inputting the matrix to be recognized into a pre-trained bidirectional long short-term memory neural network for recognition, and acquiring the output sensitive words. The method and device can improve the accuracy of sensitive word recognition.

Description

Sensitive word recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a sensitive word recognition method, device, equipment and storage medium.
Background
Network sensitive words are words with politically sensitive tendencies, violent tendencies, pornographic or uncivilized content, and the like. In the internet environment, massive text data is generated every day and spreads rapidly through platforms such as social media and forums, and this text data may contain network sensitive words. Currently, certain high-frequency sensitive words can be identified by various techniques, and identified sensitive words are generally masked using network techniques. However, on a social platform, given the ambiguity of a word, and the fact that the same word may carry a different meaning in a specific scene or within a specific group's communication, it is often difficult to determine whether the word is a sensitive word, so the accuracy of sensitive word recognition is low.
Disclosure of Invention
The invention aims to provide a sensitive word recognition method, device, equipment and storage medium, with the goal of improving the accuracy of sensitive word recognition.
The invention provides a sensitive word recognition method, which comprises the following steps:
acquiring text content published by a target user on a predetermined social platform, and processing each word in the text content into a word vector by using a predetermined text processing model;
acquiring the information of users who leave messages under the text content, and constructing a social relationship graph of the target user and the users to whom the information belongs;
inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring the output user feature vector;
fusing the word vectors and the user feature vector, and generating a matrix to be recognized based on the fused vectors;
and inputting the matrix to be recognized into a pre-trained bidirectional long short-term memory neural network for recognition, and acquiring the output sensitive words.
The present invention also provides a sensitive word recognition apparatus, including:
the text processing module is used for acquiring text content published by a target user on a predetermined social platform, and processing each word in the text content into a word vector by using a predetermined text processing model;
the construction module is used for acquiring the user information for leaving a message under the text content and constructing a social relationship graph of the target user and the user to which the user information belongs;
the training module is used for inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring the output user feature vector;
the fusion module is used for fusing the word vector and the user characteristic vector and generating a matrix to be identified based on the fused vector;
and the recognition module is used for inputting the matrix to be recognized into a pre-trained bidirectional long-short term memory neural network for recognition and acquiring the output sensitive words.
The invention also provides a computer device, which comprises a memory and a processor connected with the memory, wherein a computer program capable of running on the processor is stored in the memory, and the steps of the sensitive word recognition method are realized when the processor executes the computer program.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above sensitive word recognition method.
The invention has the beneficial effects that: each word in the text content published by a target user is first processed into a word vector; a social relationship graph is constructed for the target user and the users who leave messages, and the graph is trained with a graph embedding network to obtain a user feature vector; the word vectors and the user feature vector are then fused to generate a matrix to be recognized; finally, the matrix to be recognized is input into a pre-trained bidirectional long short-term memory neural network for recognition, and the sensitive words are output. A user feature vector, comprising the user's group features and group structure features, is thus added on top of the text vector, so that when the model identifies sensitive words the text vector serves as the main input and the user feature vector as auxiliary judgment information; identifying sensitive words from the text content combined with the user's group features and group structure features improves the recognition accuracy.
Drawings
FIG. 1 is a flowchart illustrating a sensitive word recognition method according to an embodiment of the present invention;
FIG. 2 is a detailed flow diagram of the embodiment shown in FIG. 1;
FIG. 3 is a schematic structural diagram of a sensitive word recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a hardware architecture of an embodiment of a computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and not within the protection scope of the present invention.
Referring to fig. 1, a flow chart of an embodiment of the sensitive word recognition method of the present invention is schematically shown, including the following steps:
step S1, acquiring text contents published by a target user in a preset social platform, and processing each word in the text contents into a word vector by using a preset text processing model;
In this embodiment, there are various scenes in which the target user publishes text content on the predetermined social platform, for example a conversation scene, a comment-posting scene, and the like.
After the text content is obtained, it is further preprocessed. The preprocessing comprises cleaning of the text content, stop-word filtering and word segmentation: cleaning comprises removing special characters, redundant blanks, hyperlinks and the like from the original text content, and stop-word filtering comprises removing function words with no practical meaning.
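A minimal sketch of this preprocessing, assuming an illustrative stop-word list and whitespace segmentation (the embodiment names neither a stop-word list nor a specific segmenter; Chinese text would need a dedicated word-segmentation tool):

```python
import re

# Illustrative stop-word list; the embodiment does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}

def preprocess(text):
    """Clean the text, filter stop words, and segment it into words."""
    text = re.sub(r"https?://\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[^\w\s]", " ", text)        # remove special characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse redundant blanks
    # Whitespace segmentation stands in for a real word segmenter here.
    return [w for w in text.split() if w.lower() not in STOP_WORDS]

print(preprocess("I eat  an apple!! see https://example.com"))
# ['I', 'eat', 'apple', 'see']
```

Hyperlinks are removed before punctuation so that a URL does not decay into spurious tokens.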
Further, the predetermined text processing model is preferably a BERT model, and the step of processing each word in the text content into a word vector by using the predetermined text processing model specifically includes the following steps:
a1, encoding each word in the text content into a coding vector by looking it up in a dictionary, and attaching a corresponding position vector to each coding vector;
a2, inputting the coding vectors and the position vectors into the BERT model for training, and outputting a word vector for each word of the text content.
In this embodiment, a dictionary records in advance the vectors corresponding to a large number of common words. Querying the dictionary yields the coding vector corresponding to each word, which constitutes the encoding, and each word's coding vector is also accompanied by a corresponding position vector.
Because the matrix formed by the coding vectors is sparse (its row count is generally on the order of tens of thousands) and a model can hardly learn effectively from a sparse matrix, the coding vectors and position vectors are input into the BERT model for training. The BERT model converts them into a dense matrix (with a row count generally on the order of hundreds), from which the model can better learn the semantic information of the text content.
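The encoding step can be sketched in miniature as follows; the toy dictionary, the unknown-word id and the scalar ids standing in for full coding vectors are all illustrative assumptions, and the BERT training itself is not reproduced:

```python
# Toy dictionary; a real one records entries for tens of thousands of
# common words, which is why the raw encoding matrix is sparse.
DICTIONARY = {"i": 1, "eat": 2, "an": 3, "apple": 4}
UNK = 0  # id used for words missing from the dictionary

def encode(words):
    """Look up each word's coding id and attach its position index."""
    ids = [DICTIONARY.get(w.lower(), UNK) for w in words]
    positions = list(range(len(words)))  # one position per word
    return ids, positions

ids, positions = encode(["I", "eat", "an", "apple"])
print(ids)        # [1, 2, 3, 4]
print(positions)  # [0, 1, 2, 3]
```

A real pipeline would feed these id/position pairs into the BERT model, which maps each word to a dense vector of a few hundred dimensions.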
Step S2, acquiring the information of users who leave messages under the text content, and constructing a social relationship graph of the target user and the users to whom the information belongs;
Each user in the social relationship graph is a node, and the step of constructing the social relationship graph of the target user and the users to whom the information belongs specifically includes: taking the target user as the initial node, and, whenever one user leaves a message on the text content or on another user's message, connecting the nodes of those two users with an edge.
In this embodiment, the social relationship graph is constructed with the target user as the initial node. For example, if a user leaves a message on the text content posted by the target user, the target user's node is connected to that user's node by an edge; and among all the message-leaving users other than the target user, if one user leaves a message on another user's message, the nodes of those two users are likewise connected by an edge.
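The construction can be sketched as an adjacency map; the `(author, recipient)` message format is a hypothetical representation chosen only for illustration:

```python
from collections import defaultdict

def build_social_graph(target_user, messages):
    """Build the undirected social-relationship graph.

    Each user is a node.  `messages` lists (author, recipient) pairs:
    a message left on the target user's text content, or on another
    commenter's message; each such pair connects two nodes by an edge.
    """
    graph = defaultdict(set)
    graph[target_user]  # the target user is the initial node
    for author, recipient in messages:
        graph[author].add(recipient)
        graph[recipient].add(author)
    return graph

g = build_social_graph("u0", [("u1", "u0"), ("u2", "u0"), ("u3", "u1")])
print(sorted(g["u0"]))  # ['u1', 'u2']
```

Here "u1" and "u2" commented on the target user "u0", and "u3" commented on "u1"'s message, so "u3" is connected to "u1" but not to "u0".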
Step S3, inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring the output user feature vector;
In this embodiment, the predetermined graph embedding network is preferably a node2vec graph embedding network, which can convert the relationships between users into vectors and effectively extract the relationship features between nodes. Further, the step of inputting the social relationship graph into the predetermined graph embedding network for training and acquiring the output user feature vector specifically includes:
c1, sampling the nodes in the social relationship graph with a predetermined sampling algorithm to obtain node sequences;
c2, fitting the node sequences with a predetermined vector model to generate and output the user feature vector.
Wherein c1 specifically includes: calculating the transition probabilities between the nodes in the social relationship graph, and performing random walks among the nodes of the social relationship graph according to those transition probabilities to obtain the node sequences. The predetermined vector model is preferably a Word2Vec model, as provided in the gensim function library; of course, other vector models may also be used, such as a BERT model, a GPT-2 model, and the like.
In the node2vec graph embedding network, the transition probability between nodes in the social relationship graph is calculated as follows. Given that the current node is v and the previous node is t, the transition probability α(t, x) from the current node v to a neighbor node x when choosing the next node is:

α(t, x) = 1/p if d_tx = 0; 1 if d_tx = 1; 1/q if d_tx = 2

where p is the return parameter, which determines the probability of revisiting a node, q is the in-out (access) parameter, and d_tx, taking a value in {0, 1, 2}, is the shortest distance from node t to node x: if node x is node t itself (d_tx = 0) the transition probability is 1/p; if node t is connected to x (d_tx = 1) it is 1; and if node t is not connected to x (d_tx = 2) it is 1/q. By properly controlling the random walk through the parameters p and q, the user nodes can be accurately divided and aggregated and the relationships between nodes effectively extracted: more similar users end up closer together in space, less similar users farther apart, and similar users generally aggregate into a group. From the obtained node sequences, a user node may belong to one group (which has corresponding group features, for example a diabetes patient group) or to several groups (whose connections carry corresponding group structure features, for example a user belonging to both a diabetes patient group and a stroke patient group).
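A sketch of the biased walk under these transition weights, using Python's standard `random` module (the graph representation and parameter values are illustrative; a production implementation would precompute alias tables and then fit Word2Vec on the resulting walks):

```python
import random

def transition_weight(p, q, d_tx):
    """Unnormalized node2vec weight alpha(t, x) for d_tx in {0, 1, 2}."""
    if d_tx == 0:       # x is the previous node t itself
        return 1.0 / p
    if d_tx == 1:       # x is also a neighbour of t
        return 1.0
    return 1.0 / q      # x moves away from t

def biased_walk(graph, start, length, p, q, rng=random):
    """Produce one node sequence by a node2vec-style random walk."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(graph[cur])
        if not neighbours:
            break
        if len(walk) == 1:          # first step: uniform choice
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = [transition_weight(p, q,
                                     0 if x == prev
                                     else 1 if x in graph[prev]
                                     else 2)
                   for x in neighbours]
        walk.append(rng.choices(neighbours, weights=weights)[0])
    return walk

print(transition_weight(2.0, 0.5, 0),
      transition_weight(2.0, 0.5, 1),
      transition_weight(2.0, 0.5, 2))  # 0.5 1.0 2.0

graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(biased_walk(graph, "a", 4, p=2.0, q=0.5))
```

With p = 2 and q = 0.5 the walk is discouraged from returning (weight 1/p = 0.5) and encouraged to explore outward (weight 1/q = 2), which is what lets similar users aggregate in the embedding space.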
The user feature vector of this embodiment includes group features and group structure features; both affect the probability that the user publishes sensitive content and thus aid the identification of sensitive words. For example, family members of some critically ill patients, whose relatives may die because the cure rate of the corresponding disease is low, are generally emotionally unstable and prone to publishing sensitive content. The user's group features and group structure features are therefore statistically correlated with the probability of the user publishing sensitive content.
Step S4, fusing the word vector and the user feature vector, and generating a matrix to be recognized based on the fused vector;
further, the step of fusing the word vector and the user feature vector and generating a matrix to be recognized based on the fused vector specifically includes:
splicing the user feature vector onto the tail of the word vector of each word, and generating the matrix to be recognized based on the word vectors with the user feature vector spliced on; or
splicing the user feature vector onto the tail of the word vector of the last word of the text content, and generating the matrix to be recognized based on the word vectors including the one with the user feature vector spliced on.
In this embodiment, the word vectors and the user feature vector may be fused in either of the two splicing manners, and the fused vectors may be regarded as a text vector that contains both the word vectors, as the main part, and the user feature vector, as the auxiliary part. In the prior art, user feature vectors are generally used for user-level operations such as user classification, and are not fused with word vectors for text recognition; here the word vectors are optimized with the user feature vector, which facilitates accurate recognition of sensitive words.
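The two splicing manners can be sketched directly (toy two-dimensional vectors; real word and user vectors would have hundreds of dimensions):

```python
def fuse_per_word(word_vectors, user_vector):
    """Splice the user feature vector onto the tail of every word vector."""
    return [list(wv) + list(user_vector) for wv in word_vectors]

def fuse_last_word(word_vectors, user_vector):
    """Splice the user feature vector only onto the last word's vector."""
    fused = [list(wv) for wv in word_vectors]
    fused[-1] += list(user_vector)
    return fused

words = [[0.1, 0.2], [0.3, 0.4]]   # toy word vectors
user = [0.9, 0.8]                  # toy user feature vector
print(fuse_per_word(words, user))
# [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.9, 0.8]]
```

The rows of the fused result form the matrix to be recognized.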
And step S5, inputting the matrix to be recognized into a pre-trained bidirectional long and short term memory neural network for recognition, and acquiring output sensitive words.
In this embodiment, the definition of the sensitive word may be different for different groups.
The input of the bidirectional long short-term memory neural network (BiLSTM) is a vector representation of each word in the text content. For example, for the text content "I eat an apple", the input of the BiLSTM is 4 vectors for the four words I, eat, an and apple, where each vector represents only the meaning information of the word itself and contains no semantic information about the whole text content. The output of the BiLSTM is likewise a vector representation of each word in the text content: for "I eat an apple", the output is 4 vectors respectively representing the 4 words, but here each vector contains not only the meaning of the word itself but also the semantic information of the whole text content. The BiLSTM thus extracts semantic information from a sequence of word vectors and distributes the extracted semantic information to each word.
For the training of the BiLSTM, a large amount of text content whose word vectors have been fused with the corresponding user feature vectors is used. Sensitive words in the text content are labeled, and the labeled text content is input to the BiLSTM as samples for training; after training, the accuracy of the BiLSTM in identifying sensitive words is verified, and once that accuracy reaches a certain value, the BiLSTM can be used as a trained model in practical applications.
The BiLSTM recognizes the matrix in which the word vectors and the user feature vector are fused: the matrix to be recognized is input into a fully connected layer, which calculates for each word the probability that it is a sensitive word. A corresponding threshold is set for each word, and if the calculated probability that a word is sensitive is greater than its threshold, the word is determined to be a sensitive word.
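The per-word threshold decision can be sketched as follows; the probability and threshold values are illustrative, and in the embodiment the probabilities come from the BiLSTM's fully connected layer:

```python
def mark_sensitive(words, probabilities, thresholds):
    """Flag each word whose sensitive-word probability exceeds its own threshold."""
    return [w for w, prob, thr in zip(words, probabilities, thresholds)
            if prob > thr]

words = ["I", "eat", "an", "apple"]
probs = [0.05, 0.10, 0.02, 0.71]     # per-word probabilities (illustrative)
thresholds = [0.5, 0.5, 0.5, 0.6]    # one threshold per word (illustrative)
print(mark_sensitive(words, probs, thresholds))  # ['apple']
```

Per-word thresholds allow stricter cut-offs for words that are sensitive only in certain contexts.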
Further, if a certain word is determined to be a sensitive word, then since the position vector is included in its word vector, the position of the sensitive word can be obtained, and the sensitive word can be marked in a highlighted manner at the corresponding position of the output result, making the identification of sensitive words more intuitive.
As shown in fig. 2, which is a complete flow chart of the sensitive word recognition of the present invention, fig. 2 includes the following process: a user with the ID 27782 publishes the text content "White Christmas are not eating but leave". Processing is first split into two paths: in one path, the text content is input into the BERT model and processed into word vectors; in the other path, after the user's social relationship graph is constructed from the user ID, the graph is input into the node2vec graph embedding network for training and learning, yielding the user feature vector. The two paths of vectors are then fused to generate the matrix to be recognized, which is input into the BiLSTM for recognition; the sensitive words are recognized and output.
The invention is illustrated below, by way of example, in the medical field:
1. suppose there are n users and m messages on a doctor-review social platform; the users can interact through messages, and a certain user publishes text content on the platform;
2. inputting text content into a text processing model to be processed into word vectors;
3. for the relationships between users, a social relationship graph can first be constructed: each user corresponds to a node v in the graph, and the node of the user who published the text content serves as the initial node; if user i and user j exchange a message, an edge is constructed between user i and user j, and after the messages between users have been combed through, the social relationship graph is obtained;
4. the feature vector representation of each user can be obtained by training on the social relationship graph with the node2vec technique. User relationships generally exhibit the following features: 1) a user may belong to one or more groups (such as a diabetes group, a gastric cancer group, etc.); 2) users in the same group communicate more, and users in different groups communicate less. Using node2vec to cluster the users of different groups in space benefits the subsequent identification of sensitive words;
5. fusing the word vectors and the user characteristic vectors, specifically splicing the user characteristic vectors at the tail of each word vector, or splicing the user characteristic vectors at the tail of the last word vector, and generating corresponding matrixes from the fused vectors according to preset rows and columns;
6. inputting the matrix into the BiLSTM; after training, the BiLSTM outputs a vector whose length equals that of the text content, in which each value represents the probability that the word at the corresponding position in the text content is a sensitive word; a threshold is set, and if the probability is higher than the threshold, the word is considered a sensitive word.
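The six steps above can be wired together in miniature; `embed`, `user_features` and `score` are stubs standing in for the BERT model, the node2vec training and the BiLSTM respectively, with fixed toy values chosen only to show the data flow:

```python
def embed(words):
    """Stub for step 2: map words to toy one-dimensional word vectors."""
    return [[0.125 * (i + 1)] for i, _ in enumerate(words)]

def user_features(user_id):
    """Stub for steps 3-4: a fixed toy user feature vector."""
    return [0.25]

def score(matrix):
    """Stub for step 6: one pseudo-probability per fused row."""
    return [sum(row) for row in matrix]

def recognize(user_id, words, threshold=0.4):
    word_vecs = embed(words)                       # step 2
    user_vec = user_features(user_id)              # steps 3-4
    matrix = [wv + user_vec for wv in word_vecs]   # step 5: fusion
    probs = score(matrix)                          # step 6
    return [w for w, prob in zip(words, probs) if prob > threshold]

print(recognize("u27782", ["a", "b", "c"]))  # ['b', 'c']
```

Swapping the stubs for the real models leaves the surrounding data flow unchanged, which is the point of the sketch.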
Compared with the prior art, each word in the text content published by the target user is processed into a word vector; a social relationship graph is constructed for the target user and the users who leave messages, and the graph is trained with a graph embedding network to obtain a user feature vector; the word vectors and the user feature vector are then fused to generate a matrix to be recognized; finally, the matrix to be recognized is input into a pre-trained bidirectional long short-term memory neural network for recognition, and the sensitive words are output. In this embodiment, the user feature vector, which includes the user's group features and group structure features, is added on top of the text vector, so that when the model identifies sensitive words the text vector serves as the main input and the user feature vector as auxiliary judgment information; identifying sensitive words from the text content combined with the user's group features and group structure features improves the recognition accuracy.
In an embodiment, the present invention provides a sensitive word recognition apparatus, which corresponds to the sensitive word recognition method in the above embodiments one to one. As shown in fig. 3, the sensitive word recognition apparatus includes:
the system comprises a text processing module 101, a social interaction platform and a social interaction platform, wherein the text processing module 101 is used for acquiring text contents published in a predetermined social interaction platform by a target user and processing each word in the text contents into a word vector by using a predetermined text processing model;
a building module 102, configured to obtain user information for leaving a message under the text content, and build a social relationship diagram between the target user and a user to which the user information belongs;
the training module 103 is configured to input the social relationship graph into a predetermined graph embedding network for training, and obtain the output user feature vector;
a fusion module 104, configured to fuse the word vector and the user feature vector, and generate a matrix to be identified based on the fused vector;
and the recognition module 105 is used for inputting the matrix to be recognized into a pre-trained bidirectional long-short term memory neural network for recognition, and acquiring the output sensitive words.
The specific limitations of the sensitive word recognition device can be referred to the limitations of the above sensitive word recognition method, and are not described herein again. The respective modules in the sensitive word recognition device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance. The computer device may be a PC (Personal Computer), a smart phone, a tablet computer, a single network server, a server group consisting of a plurality of network servers, or a cloud consisting of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
As shown in fig. 4, the computer device may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, wherein the memory 11 stores a computer program that is executable on the processor 12. It should be noted that fig. 4 only shows a computer device with components 11-13, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 11 may be a non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM). In this embodiment, the readable storage medium of the memory 11 is generally used for storing an operating system and various types of application software installed in the computer device, for example, program codes of a computer program in an embodiment of the present invention. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be, in some embodiments, a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is used for executing program code stored in the memory 11 or processing data, such as executing the computer program.
The network interface 13 may comprise a standard wireless network interface and/or a wired network interface, and is generally used for establishing a communication connection between the computer device and other electronic devices.
The computer program stored in the memory 11 comprises at least one computer-readable instruction executable by the processor 12 to implement the method of the embodiments of the present application, including:
step S201, acquiring text contents published in a preset social platform by a target user, and processing each word in the text contents into a word vector by using a preset text processing model;
In this embodiment, there are various scenarios in which the target user publishes text content on the predetermined social platform, for example a conversation scenario or a comment-posting scenario.
After the text content is obtained, it is further preprocessed. The preprocessing comprises cleaning the text content, filtering stop words, and word segmentation: cleaning removes special characters, redundant whitespace, hyperlinks, and the like from the original text content, and stop-word filtering removes function words without practical meaning.
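The cleaning and stop-word-filtering step described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: whitespace tokenization merely stands in for a real word segmenter, and the helper name and stop-word list are invented for the example.

```python
import re

def preprocess(text, stopwords):
    """Clean raw text, then filter stop words (illustrative sketch)."""
    # Remove hyperlinks
    text = re.sub(r"https?://\S+", " ", text)
    # Remove special characters (anything that is not a word character or space)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse redundant whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Whitespace tokenization stands in for a real word segmenter
    return [w for w in text.split() if w.lower() not in stopwords]

tokens = preprocess("Check this: https://example.com !!  I eat   an apple",
                    {"an", "this"})
# tokens is now the cleaned, stop-word-free token list
```

A production system would substitute a proper segmenter and a curated stop-word list for the toy choices above.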
Further, the predetermined text processing model is preferably a BERT model, and the step of processing each word in the text content into a word vector by using the predetermined text processing model specifically includes the following steps:
coding each word in the text content through a query dictionary to convert the word into a coding vector, and attaching a position vector corresponding to each coding vector;
and inputting the coding vector and the position vector into the BERT model for training, and outputting a word vector of each word of the text content.
In this embodiment, a dictionary records in advance the vectors corresponding to a large number of common words. Querying the dictionary yields the coding vector corresponding to each word, i.e., each word is encoded into a coding vector, and each coding vector is also accompanied by a corresponding position vector.
Because the matrix formed by the coding vectors is sparse (its row dimension is generally in the tens of thousands) and a model has difficulty learning effectively from a sparse matrix, the coding vectors and position vectors are input into the BERT model for training; the BERT model converts them into a dense matrix (with a row dimension generally in the hundreds), from which the model can better learn the semantic information of the text content.
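As a rough sketch of the dictionary-lookup-plus-position-vector step (not of BERT itself), each word can be mapped to a coding ID via a query dictionary and paired with a sinusoidal position vector of the kind used by Transformer models; the vocabulary and the dimensionality are toy assumptions for illustration.

```python
import math

def encode(tokens, vocab, dim=8):
    """Look up each word's code in a dictionary and attach a sinusoidal
    position vector (a toy stand-in for learned position embeddings)."""
    out = []
    for pos, tok in enumerate(tokens):
        code = vocab.get(tok, vocab["<unk>"])
        # Standard sinusoidal position encoding: sin on even indices, cos on odd
        pos_vec = [
            math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
            for i in range(dim)
        ]
        out.append((code, pos_vec))
    return out

vocab = {"<unk>": 0, "I": 1, "eat": 2, "an": 3, "apple": 4}
pairs = encode(["I", "eat", "an", "apple"], vocab)
```

In the actual scheme these code/position pairs would be fed into BERT, which outputs one dense vector per word.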
Step S202, obtaining the user information for leaving a message under the text content, and constructing a social relationship graph of the target user and the user to which the user information belongs;
Each user in the social relationship graph is a node, and the step of constructing the social relationship graph of the target user and the users to whom the user information belongs specifically comprises: taking the target user as the initial node and, for any two users, connecting their nodes with an edge if one of them has left a message on the text content or on the other's message.
In this embodiment, the social relationship graph is constructed with the target user as the initial node. For example, if a user leaves a message on text content posted by the target user, the target user's node is connected with that user's node by an edge; and among all message-leaving users other than the target user, if one user leaves a message on another user's message, the nodes of those two users are connected by an edge.
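The graph-construction rule above (connect the target user to anyone commenting on the post, and connect commenters who comment on each other) can be sketched with a plain adjacency map; the user names and the pair-list input format are invented for illustration.

```python
def build_social_graph(target, comments):
    """Build an undirected adjacency map from (commenter, commented_on) pairs,
    where commented_on is either the target (a comment on the post) or another
    commenter (a reply to that commenter's message)."""
    edges = {target: set()}
    for commenter, commented_on in comments:
        edges.setdefault(commenter, set()).add(commented_on)
        edges.setdefault(commented_on, set()).add(commenter)
    return edges

g = build_social_graph("target",
                       [("alice", "target"),   # alice commented on the post
                        ("bob", "target"),     # bob commented on the post
                        ("bob", "alice")])     # bob replied to alice
```

Each edge is stored in both directions, matching the undirected edges of the social relationship graph.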
Step S203, inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring an output user feature vector;
In this embodiment, the predetermined graph embedding network is preferably a node2vec graph embedding network, which converts the relationships between users into vectors and effectively extracts the relationship features between nodes. Further, the step of inputting the social relationship graph into the predetermined graph embedding network for training and acquiring the output user feature vector specifically comprises:
sampling nodes in the social relationship graph by adopting a preset sampling algorithm to obtain a node sequence;
and fitting the node sequence by using a preset vector model to generate and output the user characteristic vector.
The sampling step specifically comprises: calculating transition probabilities among the nodes in the social relationship graph, and performing a random walk among the nodes of the social relationship graph according to the transition probabilities to obtain the node sequence. The predetermined vector model is preferably the Word2Vec model from the gensim function library. Of course, the predetermined vector model may also be another vector model, such as a BERT model or a GPT-2 model.
In the node2vec graph embedding network, the transition probabilities among the nodes in the social relationship graph are calculated as follows. Given that the current node is v and the previous node is t, the transition probability α(t, x) from the current node v to a neighbor node x when determining the next node is:

α(t, x) = 1/p, if d_tx = 0;  1, if d_tx = 1;  1/q, if d_tx = 2

where p is the return parameter, which determines the probability of revisiting a node, q is the in-out parameter, and d_tx is the shortest distance from node t to node x, taking a value in {0, 1, 2}: if nodes t and x are the same node (d_tx = 0), the transition probability is 1/p; if node t is connected to x (d_tx = 1), the transition probability is 1; and if node t is not connected to x (d_tx = 2), the transition probability is 1/q. By properly controlling the random-walk behavior through the parameters p and q, the user nodes can be accurately partitioned and aggregated and the relationships among nodes effectively extracted, so that more similar users lie closer together in the embedding space and less similar users lie farther apart; more similar users will generally gather into a group. From the obtained node sequences, a user node may belong to one group (the group has corresponding group features, for example the user belongs to a diabetes-patient group) or to several groups (the connected groups have corresponding group-structure features, for example the user belongs to both a diabetes-patient group and a stroke-patient group).
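A minimal sketch of the node2vec-style biased random walk described above, using the piecewise weights 1/p, 1, and 1/q; this is an illustrative reimplementation, not the library's code, and the toy graph is invented.

```python
import random

def transition_weights(graph, prev, cur, p=1.0, q=2.0):
    """Unnormalized node2vec weights alpha(t, x) for each neighbor x of cur,
    given previous node t = prev."""
    weights = {}
    for x in graph[cur]:
        if x == prev:              # d_tx = 0: walk returns to t
            weights[x] = 1.0 / p
        elif x in graph[prev]:     # d_tx = 1: x is also adjacent to t
            weights[x] = 1.0
        else:                      # d_tx = 2: x is not adjacent to t
            weights[x] = 1.0 / q
    return weights

def random_walk(graph, start, length, p=1.0, q=2.0, seed=0):
    """Generate one biased walk (a node sequence) of the given length."""
    rng = random.Random(seed)
    walk = [start, rng.choice(sorted(graph[start]))]
    while len(walk) < length:
        w = transition_weights(graph, walk[-2], walk[-1], p, q)
        nodes = sorted(w)
        walk.append(rng.choices(nodes, weights=[w[n] for n in nodes])[0])
    return walk

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
walk = random_walk(graph, "a", 6)
```

The resulting node sequences are what the Word2Vec model then fits, treating each walk like a sentence of node "words".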
The user feature vector of this embodiment includes group features and group-structure features. Both features affect the probability that the user publishes sensitive content and therefore contribute to the identification of sensitive words. For example, among the family members of some critically ill patients, the low cure rate of the corresponding diseases means the patients may die, so these users' emotions are generally unstable and they are prone to publishing sensitive content. The user's group features and group-structure features therefore correlate, in a statistical sense, with the probability of the user publishing sensitive content.
Step S204, fusing the word vector and the user feature vector, and generating a matrix to be identified based on the fused vector;
further, the step of fusing the word vector and the user feature vector and generating a matrix to be recognized based on the fused vector specifically includes:
splicing the user feature vector onto the tail of the word vector of each word, and generating the matrix to be recognized based on the word vectors onto which the user feature vector has been spliced; or
splicing the user feature vector onto the tail of the word vector of the last word of the text content, and generating the matrix to be recognized based on the word vectors onto which the user feature vector has been spliced.
In this embodiment, the word vector and the user feature vector may be fused in either of these two splicing manners, and the fused vector may be regarded as a text vector comprising both, with the word vector as the main vector and the user feature vector as the auxiliary vector. In the prior art, user feature vectors are generally used only for user-level operations such as user classification and are not fused with word vectors for text recognition; here, the word vectors are optimized with the user feature vector, which facilitates accurate recognition of sensitive words.
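The two splicing manners can be sketched in plain Python: the user feature vector is appended either to every word vector, or only to the last word's vector, with zero padding elsewhere so all rows keep equal width (the zero-padding choice is an assumption for illustration; the patent does not specify how the other rows are filled).

```python
def fuse(word_vecs, user_vec, mode="every"):
    """Concatenate the user feature vector onto word vectors, either onto
    every word ("every") or only onto the last word ("last")."""
    zero = [0.0] * len(user_vec)
    fused = []
    for i, wv in enumerate(word_vecs):
        tail = user_vec if mode == "every" or i == len(word_vecs) - 1 else zero
        fused.append(list(wv) + list(tail))
    return fused

# Toy 4-dim word vectors for two words and a 2-dim user feature vector
word_vecs = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
user_vec = [0.9, 1.0]
m_every = fuse(word_vecs, user_vec, "every")  # spliced onto every word
m_last = fuse(word_vecs, user_vec, "last")    # spliced onto the last word only
```

Either matrix can then serve as the matrix to be recognized that is fed to the downstream network.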
And S205, inputting the matrix to be recognized into a pre-trained bidirectional long and short term memory neural network for recognition, and acquiring output sensitive words.
The input of the bidirectional long short-term memory neural network (BiLSTM) is a vector representation of each word in the text content. For example, for the text content "I eat an apple", the input of the BiLSTM is 4 vectors for the four words I, eat, an, and apple, where each vector represents only the meaning of the word itself and contains no semantic information about the whole text. The output of the BiLSTM is likewise a vector representation of each word: for "I eat an apple", the output is 4 vectors for the 4 words, but each output vector contains not only the meaning of the word itself but also semantic information of the whole text content. The BiLSTM thus extracts semantic information from a sequence of word vectors and distributes the extracted semantic information to each word.
To train the BiLSTM, a large number of text contents whose word vectors have been fused with the corresponding user feature vectors are used. Sensitive words in these text contents are labeled, and the labeled texts are input to the BiLSTM as training samples. After training, the accuracy of the BiLSTM at identifying sensitive words is verified; if the accuracy reaches a predetermined value, the BiLSTM can be used as a trained model in practical applications.
The BiLSTM operates on the matrix that fuses the word vectors and the user feature vector: the matrix to be recognized is input into a fully connected layer, which computes for each word the probability that it is a sensitive word. A corresponding threshold is set for each word, and if the computed probability that a word is sensitive exceeds its threshold, the word is determined to be a sensitive word.
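The per-word thresholding described above can be sketched as follows, assuming the fully connected layer has already produced a probability per word; the function name, probabilities, and threshold values are illustrative.

```python
def flag_sensitive(words, probs, thresholds, default=0.5):
    """Mark a word as sensitive when its predicted probability exceeds its
    per-word threshold (falling back to a default threshold)."""
    return [w for w, pr in zip(words, probs)
            if pr > thresholds.get(w, default)]

words = ["I", "eat", "an", "apple"]
probs = [0.1, 0.2, 0.05, 0.9]       # toy outputs of the fully connected layer
hits = flag_sensitive(words, probs, {"apple": 0.8})
```

Only words whose probability exceeds their threshold survive, so `hits` contains the detected sensitive words.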
Further, if a word is determined to be a sensitive word, its position can be obtained because the position vector is incorporated in the word vector; the sensitive word is then marked in highlighted form at the corresponding position of the output result, making the identification of sensitive words more intuitive.
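A minimal sketch of highlighting recognized sensitive words at their positions in the output; the marker style is an arbitrary choice for illustration.

```python
def highlight(tokens, sensitive):
    """Rebuild the text with each sensitive word wrapped in markers so its
    position is visually obvious in the output."""
    return " ".join(f"**{t}**" if t in sensitive else t for t in tokens)

marked = highlight(["I", "eat", "an", "apple"], {"apple"})
```

A real system might instead emit character offsets for a front end to render, but the positional idea is the same.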
In one embodiment, the present invention provides a computer-readable storage medium, which may be a non-volatile and/or volatile memory, having stored thereon a computer program, which when executed by a processor, implements the steps of the sensitive word recognition method in the above-described embodiments, such as steps S1 to S5 shown in fig. 1. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the sensitive word recognition apparatus in the above-described embodiments, such as the functions of the modules 101 to 105 shown in fig. 3. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing associated hardware; when executed, the program performs the processes of the above method embodiments.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A sensitive word recognition method is characterized by comprising the following steps:
acquiring text contents published in a preset social platform by a target user, and processing each word in the text contents into a word vector by using a preset text processing model;
acquiring user information for leaving a message under the text content, and constructing a social relationship graph of the target user and a user to which the user information belongs;
inputting the social relationship graph into a predetermined graph embedding network for training, and acquiring an output user feature vector;
fusing the word vector and the user feature vector, and generating a matrix to be identified based on the fused vector;
and inputting the matrix to be recognized into a pre-trained bidirectional long-short term memory neural network for recognition, and acquiring output sensitive words.
2. The sensitive word recognition method according to claim 1, wherein each user in the social relationship graph is a node, and the step of constructing the social relationship graph between the target user and the user to which the user information belongs includes:
and taking the target user as an initial node and, for any two users, connecting the nodes of the two users with an edge if one of them has left a message on the text content or on the other user's message.
3. The sensitive word recognition method according to claim 2, wherein the step of inputting the social relationship graph into a predetermined graph embedding network for training and obtaining the output user feature vector specifically comprises:
sampling nodes in the social relationship graph by adopting a preset sampling algorithm to obtain a node sequence;
and fitting the node sequence by using a preset vector model to generate and output the user characteristic vector.
4. The sensitive word recognition method according to claim 3, wherein the sampling of the nodes in the social relationship graph by using a predetermined sampling algorithm to obtain a node sequence specifically comprises:
calculating transition probabilities among the nodes in the social relationship graph;
and carrying out random walk among the nodes of the social relationship graph according to the transition probability to obtain the node sequence.
5. The sensitive word recognition method according to claim 3, wherein the step of fusing the word vector with the user feature vector and generating a matrix to be recognized based on the fused vector specifically comprises:
splicing the user characteristic vector at the tail part of the word vector of each word, and generating the matrix to be recognized based on the word vector spliced with the user characteristic vector; or
splicing the user characteristic vector at the tail part of the word vector of the last word of the text content, and generating the matrix to be recognized based on the word vector spliced with the user characteristic vector.
6. The sensitive word recognition method according to claim 1, wherein the predetermined text processing model is a BERT model, and the step of processing each word in the text content into a word vector by using the predetermined text processing model specifically includes:
coding each word in the text content through a query dictionary to convert the word into a coding vector, and attaching a position vector corresponding to each coding vector;
and inputting the coding vector and the position vector into the BERT model for training, and outputting a word vector of each word of the text content.
7. The sensitive word recognition method according to claim 6, wherein the step of inputting the matrix to be recognized into a pre-trained bidirectional long-short term memory neural network for recognition and obtaining the output sensitive word specifically comprises: and acquiring the position information of the sensitive word based on the position vector, and labeling the sensitive word at the position corresponding to the position information in a highlighted mode.
8. A sensitive word recognition apparatus, comprising:
the system comprises a text processing module, a social interaction platform and a social interaction platform, wherein the text processing module is used for acquiring text contents published in a preset social interaction platform by a target user and processing each word in the text contents into a word vector by utilizing a preset text processing model;
the construction module is used for acquiring the user information for leaving a message under the text content and constructing a social relationship graph of the target user and the user to which the user information belongs;
the training module is used for inputting the social relationship graph into a preset graph to be embedded into a network for training and acquiring an output user characteristic vector;
the fusion module is used for fusing the word vector and the user characteristic vector and generating a matrix to be identified based on the fused vector;
and the recognition module is used for inputting the matrix to be recognized into a pre-trained bidirectional long-short term memory neural network for recognition and acquiring the output sensitive words.
9. A computer arrangement comprising a memory and a processor connected to the memory, the memory having stored therein a computer program operable on the processor, wherein the processor, when executing the computer program, implements the steps of the sensitive word recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the sensitive word recognition method according to any one of claims 1 to 7.
CN202110470946.4A 2021-04-28 2021-04-28 Sensitive word recognition method, device, equipment and storage medium Active CN113342927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110470946.4A CN113342927B (en) 2021-04-28 2021-04-28 Sensitive word recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110470946.4A CN113342927B (en) 2021-04-28 2021-04-28 Sensitive word recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113342927A true CN113342927A (en) 2021-09-03
CN113342927B CN113342927B (en) 2023-08-18

Family

ID=77468970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110470946.4A Active CN113342927B (en) 2021-04-28 2021-04-28 Sensitive word recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113342927B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330474A (en) * 2021-10-20 2022-04-12 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115809432A (en) * 2022-11-21 2023-03-17 中南大学 Crowd social relationship extraction method, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110489552A (en) * 2019-07-17 2019-11-22 清华大学 A kind of microblog users suicide risk checking method and device
US20200117832A1 (en) * 2018-10-15 2020-04-16 International Business Machines Corporation Obfuscation and routing of sensitive actions or requests based on social connections
CN111382366A (en) * 2020-03-03 2020-07-07 重庆邮电大学 Social network user identification method and device based on language and non-language features
CN112487176A (en) * 2020-11-26 2021-03-12 北京智源人工智能研究院 Social robot detection method, system, storage medium and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200117832A1 (en) * 2018-10-15 2020-04-16 International Business Machines Corporation Obfuscation and routing of sensitive actions or requests based on social connections
CN109766431A (en) * 2018-12-24 2019-05-17 同济大学 A kind of social networks short text recommended method based on meaning of a word topic model
CN110489552A (en) * 2019-07-17 2019-11-22 清华大学 A kind of microblog users suicide risk checking method and device
CN111382366A (en) * 2020-03-03 2020-07-07 重庆邮电大学 Social network user identification method and device based on language and non-language features
CN112487176A (en) * 2020-11-26 2021-03-12 北京智源人工智能研究院 Social robot detection method, system, storage medium and electronic device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330474A (en) * 2021-10-20 2022-04-12 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN114330474B (en) * 2021-10-20 2024-04-26 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN115809432A (en) * 2022-11-21 2023-03-17 中南大学 Crowd social relationship extraction method, device and storage medium
CN115809432B (en) * 2022-11-21 2024-02-13 中南大学 Crowd social relation extraction method, equipment and storage medium

Also Published As

Publication number Publication date
CN113342927B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN110598206B (en) Text semantic recognition method and device, computer equipment and storage medium
CN109446302B (en) Question-answer data processing method and device based on machine learning and computer equipment
CN108628974B (en) Public opinion information classification method and device, computer equipment and storage medium
CN111431742B (en) Network information detection method, device, storage medium and computer equipment
WO2023134084A1 (en) Multi-label identification method and apparatus, electronic device, and storage medium
CN112380344B (en) Text classification method, topic generation method, device, equipment and medium
CN113342927A (en) Sensitive word recognition method, device, equipment and storage medium
CN110750965A (en) English text sequence labeling method and system and computer equipment
CN112015900A (en) Medical attribute knowledge graph construction method, device, equipment and medium
CN111461301A (en) Serialized data processing method and device, and text processing method and device
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN112035611A (en) Target user recommendation method and device, computer equipment and storage medium
CN111460783A (en) Data processing method and device, computer equipment and storage medium
CN113254649B (en) Training method of sensitive content recognition model, text recognition method and related device
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN113963205A (en) Classification model training method, device, equipment and medium based on feature fusion
CN113536784A (en) Text processing method and device, computer equipment and storage medium
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN114281934A (en) Text recognition method, device, equipment and storage medium
CN113705692A (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN113792163B (en) Multimedia recommendation method and device, electronic equipment and storage medium
CN110807130A (en) Method, apparatus and computer device for determining vector representation of group in network
CN115186667B (en) Named entity identification method and device based on artificial intelligence
CN116662579B (en) Data processing method, device, computer and storage medium
CN111476037B (en) Text processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant