CN110929026A - Abnormal text recognition method and device, computing equipment and medium - Google Patents


Info

Publication number
CN110929026A
CN110929026A (application CN201811093657.1A; granted as CN110929026B)
Authority
CN
China
Prior art keywords
text
entity
emotion
recognition model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811093657.1A
Other languages
Chinese (zh)
Other versions
CN110929026B (en)
Inventor
康杨杨
高喆
周笑添
孙常龙
刘晓钟
司罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811093657.1A priority Critical patent/CN110929026B/en
Publication of CN110929026A publication Critical patent/CN110929026A/en
Application granted granted Critical
Publication of CN110929026B publication Critical patent/CN110929026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an abnormal text recognition method, an abnormal text recognition device, a computing device and a medium. The method comprises the following steps: inputting a text to be recognized into a named entity recognition model to determine the candidate entities included in the text, wherein a candidate entity is a person name, an organization name or a date; inputting the text to be recognized into a topic recognition model for processing so as to recognize the topic category of the text; if the topic category of the text corresponds to a preset topic, inputting, for each candidate entity in the text, the vector set corresponding to that candidate entity into an emotion recognition model so as to obtain the emotional tendency of the text toward the candidate entity; and if the emotional tendency is negative, judging that the text is an abnormal text under the preset topic.

Description

Abnormal text recognition method and device, computing equipment and medium
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a method, an apparatus, a computing device, and a medium for recognizing an abnormal text.
Background
Nowadays, many government departments, financial institutions, merchants and the like send targeted short messages through the short message platform channels of telecom operators. For example, in severe weather such as a rainstorm or thunderstorm, the meteorological bureau of a local government may, for the sake of travel safety, use a short message platform channel to send a reminder message to users whose numbers are registered in the local area.
However, the short message platform channel, as an open platform, can also be exploited by lawbreakers to send politics-related short messages to ordinary users that attack or smear the country. Such behavior not only violates relevant national laws and regulations, but also brings great public-opinion risk to the normal operation of the platform. Therefore, text recognition needs to be performed on short messages to determine whether their text content is politics-related, so that politics-related short messages can be blocked and the numbers that send them can be handled accordingly.
When text recognition is performed on short messages, the recognition problem is generally cast as a binary classification problem: politics-related short message texts are labeled, and a model is trained by machine learning. The machine learning method is usually a traditional shallow model such as an SVM (Support Vector Machine) or LR (Logistic Regression), or a currently popular deep model such as a CNN (Convolutional Neural Network) or LSTM (Long Short-Term Memory network). However, such models tend to overfit features such as person names; for example, the name "XXX" can easily be learned as an important feature, yet not all samples in which "XXX" occurs are problematic. For instance, a government department may send the notification-type message "all party members need to study the spirit of the XXX meeting", and such an educational short message poses no problem.
Disclosure of Invention
To this end, the present invention provides an abnormal text recognition scheme in an attempt to solve or at least alleviate the above-identified problems.
According to an aspect of the present invention, there is provided an abnormal text recognition method comprising the following steps: first, inputting a text to be recognized into a named entity recognition model to determine the candidate entities included in the text, wherein a candidate entity is a person name, an organization name or a date; inputting the text to be recognized into a topic recognition model for processing so as to recognize the topic category of the text; if the topic category of the text corresponds to a preset topic, inputting, for each candidate entity in the text, the vector set corresponding to that candidate entity into an emotion recognition model so as to obtain the emotional tendency of the text toward the candidate entity; and if the emotional tendency is negative, judging that the text is an abnormal text under the preset topic.
Optionally, in the abnormal text recognition method according to the present invention, inputting a text to be recognized into a topic recognition model for processing, so as to recognize a topic category of the text, the method includes: inputting a text to be recognized into a topic recognition model to obtain a topic vector which is output by the topic recognition model and corresponds to the text, wherein the topic vector comprises probability values of the text belonging to various topic categories; and taking the topic category with the highest probability value as the topic category of the text.
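As a minimal illustration of this step (the category names and probability values below are invented for the example, not taken from the patent), selecting the topic category with the highest probability value from the topic vector can be sketched as:

```python
# Hypothetical set of topic categories, in the order of the topic vector's entries.
TOPIC_CATEGORIES = ["politics", "entertainment", "sports", "finance",
                    "daily life", "education"]

def topic_category(topic_vector):
    """Return the topic category with the highest probability value."""
    best_index = max(range(len(topic_vector)), key=lambda i: topic_vector[i])
    return TOPIC_CATEGORIES[best_index]

# Example: a topic vector whose largest entry belongs to the politics category.
vec = [0.62, 0.05, 0.03, 0.10, 0.12, 0.08]
print(topic_category(vec))  # politics
```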
Optionally, in the abnormal text recognition method according to the present invention, the vector set includes word vectors corresponding to the candidate entities, topic vectors corresponding to the text, and word vectors corresponding to all candidate entities in the text.
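One hedged way to assemble such a vector set is sketched below; the vector dimensions, the mean-pooling of the entity word vectors, and the concatenation order are all assumptions for illustration, as the patent does not fix them.

```python
import numpy as np

def build_vector_set(entity_vec, topic_vec, all_entity_vecs):
    """Concatenate: the entity's own word vector, the text's topic vector,
    and a pooled summary of the word vectors of all candidate entities."""
    pooled = np.mean(all_entity_vecs, axis=0)  # assumed pooling choice
    return np.concatenate([entity_vec, topic_vec, pooled])

# Toy vectors: 4-dim word vectors, 6-dim topic vector.
entity = np.ones(4)
topic = np.zeros(6)
others = [np.ones(4), np.full(4, 3.0)]
features = build_vector_set(entity, topic, others)
print(features.shape)  # (14,)
```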
Optionally, in the method for recognizing abnormal text according to the present invention, the emotion recognition model includes a multi-layer perceptron and a classifier connected to the multi-layer perceptron, and the inputting the vector set corresponding to the candidate entity into the emotion recognition model to obtain the emotional tendency of the text to the candidate entity includes: taking a vector set corresponding to the candidate entity as input, and inputting the vector set into the multilayer perceptron for processing; and inputting the processing result into a classifier to perform emotional tendency probability calculation, and determining whether the emotional tendency of the text to the candidate entity is positive or negative according to the result of the probability calculation.
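A minimal sketch of this structure, a multi-layer perceptron followed by a classifier that turns the processing result into positive/negative probabilities, is given below. The weights are hand-picked toy values rather than trained parameters, and the single-hidden-layer shape is an assumption.

```python
import numpy as np

def mlp_sentiment(vector_set, w1, b1, w2, b2):
    hidden = np.maximum(0.0, vector_set @ w1 + b1)   # perceptron layer (ReLU)
    logits = hidden @ w2 + b2                        # classifier layer
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                          # (P(positive), P(negative))
    return "negative" if probs[1] > probs[0] else "positive"

# Toy weights chosen so that a strongly activated input scores as negative.
w1, b1 = np.eye(3), np.zeros(3)
w2, b2 = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]), np.zeros(2)
print(mlp_sentiment(np.array([1.0, 1.0, 1.0]), w1, b1, w2, b2))  # negative
```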
Optionally, in the abnormal text recognition method according to the present invention, the named entity recognition model is a sequence tagging model.
Optionally, in the abnormal text recognition method according to the present invention, the named entity recognition model performs model training based on a pre-acquired entity training data set, so that an output of the named entity recognition model indicates candidate entities present in the input text.
Optionally, in the method for recognizing an abnormal text according to the present invention, the entity training data set includes a plurality of pieces of entity training data, each piece of entity training data includes a first training text and a second training text, the second training text being a text formed by entity-labeling the candidate entities in the first training text, and performing model training based on the pre-acquired entity training data set includes: for each piece of entity training data in the entity training data set, taking the first training text in the entity training data as input and inputting it into the named entity recognition model to obtain the labeled text, output by the named entity recognition model and corresponding to the first training text, in which the candidate entities are labeled; and adjusting the network parameters of the named entity recognition model based on that labeled text and the second training text corresponding to the first training text in the entity training data.
Optionally, in the abnormal text recognition method according to the present invention, adjusting a network parameter of the named entity recognition model includes: network parameters of the named entity recognition model are adjusted using a back propagation algorithm.
Optionally, in the abnormal text recognition method according to the present invention, the emotion recognition model performs model training based on a pre-acquired emotion training data set, so that the output of the emotion recognition model indicates the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs.
Optionally, in the method for recognizing an abnormal text according to the present invention, the emotion training data set includes a plurality of pieces of emotion training data, each piece of emotion training data includes an emotion training text and the true emotional tendency of the emotion training text toward each word included in it, and performing model training based on the pre-acquired emotion training data set includes: performing word segmentation on each piece of emotion training data in the emotion training data set to obtain a corresponding word segmentation sequence; acquiring a vector set for each word in the word segmentation sequence, wherein the vector set includes the word vector of the corresponding word, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text; for each word in the word segmentation sequence, taking the vector set of the word as input and inputting it into the emotion recognition model to obtain the emotional tendency of the emotion training text toward the word as output by the emotion recognition model; and adjusting the network parameters of the emotion recognition model based on that emotional tendency and the true emotional tendency.
Optionally, in the abnormal text recognition method according to the present invention, adjusting a network parameter of an emotion recognition model includes: network parameters of the multi-layer perceptron and/or classifier are adjusted using a back-propagation algorithm.
Optionally, in the abnormal text recognition method according to the present invention, the topic category is any one of a politics-related category, an entertainment category, a sports category, a finance category, a daily life category, and an educational category.
Optionally, in the abnormal text recognition method according to the present invention, the preset topic is a politics-related topic.
Optionally, in the abnormal text recognition method according to the present invention, the text to be recognized includes a short message text.
According to still another aspect of the present invention, an abnormal text recognition apparatus is provided, which includes an entity recognition module, a topic recognition module, an emotion recognition module, and a judging module. The entity recognition module is adapted to input the text to be recognized into the named entity recognition model so as to determine the candidate entities included in the text, wherein a candidate entity is a person name, an organization name, or a date; the topic recognition module is adapted to input the text to be recognized into the topic recognition model for processing so as to recognize the topic category of the text; the emotion recognition module is adapted to, when the topic category of the text corresponds to a preset topic, input the vector set corresponding to each candidate entity in the text into the emotion recognition model so as to obtain the emotional tendency of the text toward the candidate entity; and the judging module is adapted to judge the text to be an abnormal text under the preset topic when the emotional tendency is negative.
According to yet another aspect of the invention, there is provided a computing device comprising one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of anomalous text recognition according to the invention.
According to yet another aspect of the present invention, there is also provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of abnormal text recognition according to the present invention.
According to the abnormal text recognition scheme of the present invention, the candidate entities of a text to be recognized are determined based on a named entity recognition model, and the topic category of the text is recognized by the topic recognition model. When the topic category of the text corresponds to a preset topic, the vector set corresponding to each candidate entity in the text is input into an emotion recognition model so as to obtain the emotional tendency of the text toward the candidate entity, wherein a candidate entity is a person name, an organization name, or a date. Finally, if the emotional tendency is negative, the text is judged to be an abnormal text under the preset topic.
In other words, by introducing topic and emotion analysis, the scheme avoids overfitting to person-name and organization-name features: it analyzes both whether the topic category of the text to be recognized corresponds to the politics-related topic and the emotional tendency of the text toward the person names, organization names, and dates it mentions, comprehensively judges the probability that the text is politics-related, and thereby improves the accuracy of politics-related text recognition.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of an abnormal text recognition system 100 according to one embodiment of the present invention;
FIG. 2 illustrates a block diagram of a computing device 200, according to an embodiment of the invention;
FIG. 3 illustrates a model assembly diagram for anomalous text recognition in accordance with an embodiment of the present invention;
FIG. 4 shows a schematic diagram of entity training data according to one embodiment of the invention;
FIG. 5 shows a schematic diagram of an emotion recognition model according to an embodiment of the invention;
FIG. 6 illustrates a flow diagram of an abnormal text recognition method 600 according to one embodiment of the present invention; and
fig. 7 shows a schematic diagram of an abnormal text recognition apparatus 700 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of an abnormal text recognition system 100 according to one embodiment of the present invention. As shown in fig. 1, the system 100 includes a text sender 110, a text receiver 120, and a server 130. The server 130 has an abnormal text recognition device (not shown in the figure) resident therein. In the process of sending the text to the text receiving end 120, the text sending end 110 will first send the text to the server 130, and after receiving the text, the above-mentioned devices in the server 130 perform abnormal text recognition by using the text as the text to be recognized. If the recognition result indicates that the text is not an abnormal text, the server 130 forwards the text to the text receiving terminal 120, and if the recognition result indicates that the text is an abnormal text, the server 130 intercepts the text, thereby preventing the abnormal text from being directly sent to the text receiving terminal 120.
According to an embodiment of the present invention, the preset topic is a politics-related topic, and an abnormal text under the politics-related topic refers to a text whose topic category is the politics-related category and which carries negative emotion toward a candidate entity it includes, such as a person name, an organization name, or a date. It should be noted that the person names, organization names, or dates serving as candidate entities are predetermined and may have specific meanings under the topic of the text to which they belong. A date here does not mean arbitrary date information consisting of a year, month, and day, but a specific date related to the preset topic. For example, when the preset topic is the politics-related topic, such dates include October 1, 1949 (the founding of the People's Republic of China), July 1, 1997 (the return of Hong Kong), July 7, 1937 (the Lugou Bridge Incident), 1999 (the year of Macau's return), and the like. In addition, the numerical parts of a date, such as the year, month, and day, may be written in Arabic numerals or in Chinese characters, and the present application does not limit this.
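A hedged sketch of how such topic-specific dates might be checked is given below: a recognized date string is normalized to (year, month, day) and tested against a configured list. The normalization handles Arabic-numeral forms only; the sensitive dates listed are the examples given in the description, and the regular expressions are an assumption, not the patent's method.

```python
import re

# Dates and years taken from the examples in the description.
SENSITIVE_DATES = {(1949, 10, 1), (1997, 7, 1), (1937, 7, 7)}
SENSITIVE_YEARS = {1999}

def is_sensitive_date(text):
    """Return True if the text contains one of the configured dates/years."""
    m = re.search(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})", text)
    if m:
        return tuple(int(g) for g in m.groups()) in SENSITIVE_DATES
    m = re.search(r"(\d{4})", text)
    return bool(m) and int(m.group(1)) in SENSITIVE_YEARS

print(is_sensitive_date("1949-10-1"))  # True
print(is_sensitive_date("2018-9-18"))  # False
```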
For ease of understanding, the process by which the abnormal text recognition apparatus in the server 130 performs abnormal text recognition is illustrated with two texts to be recognized: "welcome Comrade XXX to Peking University" and "oppose the policies and guidelines of XXX" (XXX represents a name, each "X" represents a Chinese character, and the characters represented by the individual "X"s are not required to be the same or different).
The text "welcome Comrade XXX to Peking University" is input into the named entity recognition model for processing to determine the candidate entities the text includes. Since a candidate entity is a person name, an organization name, or a date, the text is found to include 2 candidate entities, namely "XXX" and "Peking University". The text is then input into the topic recognition model for processing, and its topic category is recognized as the politics-related category. Further, since the topic category of the text corresponds to the politics-related topic, the vector sets corresponding to the candidate entities "XXX" and "Peking University" are input into the emotion recognition model, and the emotional tendency of the text toward both candidate entities is obtained as positive; it is therefore determined that the text is not an abnormal text under the politics-related topic, that is, it is not a problematic politics-related text.
The text "oppose the policies and guidelines of XXX" is input into the named entity recognition model for processing to determine the candidate entities the text includes. Since a candidate entity is a person name, an organization name, or a date, the text is found to include the candidate entity "XXX". The text is then input into the topic recognition model for processing, and its topic category is recognized as the politics-related category. Further, since the topic category of the text corresponds to the politics-related topic, the vector set corresponding to the candidate entity "XXX" is input into the emotion recognition model, and the emotional tendency of the text toward "XXX" is obtained as negative; it is therefore determined that the text is an abnormal text under the politics-related topic, that is, a problematic politics-related text.
Further, the system 100 may be regarded as a short message platform system: the text sending end 110 is a client A that sends a short message, the text receiving end 120 is a client B that receives it, and the server 130 is typically a server platform deployed by a telecom operator, in which an apparatus for recognizing whether a short message text is an abnormal text under a topic is resident. When a user sends a short message to client B through client A, client A first sends the short message to the server platform; after the apparatus in the server platform receives it, the short message is treated as the text to be recognized and abnormal-message recognition is performed. If the recognition result indicates the short message is not abnormal, the server platform forwards it to client B; if the result indicates it is abnormal, the server platform intercepts it, preventing the abnormal short message from being sent directly to client B.
According to one embodiment of the invention, the server 130 in the system 100 described above may be implemented by a computing device 200 as described below. FIG. 2 shows a block diagram of a computing device 200, according to one embodiment of the invention.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more programs 222, and program data 224. In some implementations, the program 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated wired connection, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 200 may be implemented as a server, such as a file server, a database server, an application server, a WEB server, etc., or as part of a small-form factor portable (or mobile) electronic device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless WEB-browsing device, a personal headset device, an application-specific device, or a hybrid device that include any of the above functions. Computing device 200 may also be implemented as a personal computer including both desktop and notebook computer configurations.
In some embodiments, computing device 200 is implemented as server 130 and is configured to perform abnormal text recognition method 600 in accordance with the present invention. Program 222 of computing device 200 includes a plurality of program instructions for executing abnormal text recognition method 600 according to the present invention, and program data 224 may also store configuration information of abnormal text recognition system 100, etc.
FIG. 3 shows a model assembly diagram for abnormal text recognition according to one embodiment of the invention. As shown in FIG. 3, when performing abnormal text recognition, or more specifically, when recognizing whether a text is an abnormal text under a topic, a combination of three recognition models is used: a named entity recognition model, a topic recognition model, and an emotion recognition model. After the text to be recognized is input into the named entity recognition model, that model recognizes the candidate entities in the text; after the text is input into the topic recognition model, that model recognizes the topic category of the text. If the topic category of the text corresponds to the preset topic, then for each candidate entity in the text, the vector set corresponding to that candidate entity is input into the emotion recognition model, which recognizes the emotional tendency of the text toward the entity; if that tendency is negative, the text is judged to be an abnormal text under the preset topic. The named entity recognition model, the topic recognition model, and the emotion recognition model are trained in advance on an entity training data set, a topic training data set, and an emotion training data set, respectively. For ease of understanding, the structure and training process of each of the three models is described first.
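The model combination described above can be sketched as follows. Since the patent does not fix the concrete implementations, the three models are passed in as callables, and the stub models below merely reproduce the behavior of the two worked examples from the description; the function names and the simplification of the vector set to the entity string are assumptions for illustration.

```python
def is_abnormal(text, ner, topic, sentiment, preset_topic="politics"):
    """ner -> candidate entities; topic -> topic category;
    sentiment -> "positive"/"negative" for one entity of the text."""
    if topic(text) != preset_topic:
        return False                       # topic does not match the preset topic
    return any(sentiment(text, e) == "negative" for e in ner(text))

# Stub models reproducing the two worked examples from the description.
ner = lambda t: [e for e in ("XXX", "Peking University") if e in t]
topic = lambda t: "politics"
sentiment = lambda t, e: "negative" if t.startswith("oppose") else "positive"

print(is_abnormal("welcome Comrade XXX to Peking University",
                  ner, topic, sentiment))                                 # False
print(is_abnormal("oppose the policies of XXX", ner, topic, sentiment))   # True
```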
the named entity recognition model is used for recognizing candidate entities in the text, and according to one embodiment of the invention, the candidate entities comprise names of people, names of organizations or dates. The sequence labeling model may adopt a CRF (Conditional Random Field) model, a bllstm-CRF (Bi-directional Long Short-Term Memory-Conditional Random Field) model, etc., and may be appropriately adjusted according to the actual application scenario, network training condition, system configuration, performance requirement, etc., which will be easily imaginable to those skilled in the art who know the solution of the present invention and are also within the protection scope of the present invention, and will not be described herein.
According to one embodiment of the invention, model training is performed based on a pre-acquired entity training data set so that the output of the named entity recognition model indicates the candidate entities present in the input text. The entity training data set comprises a plurality of pieces of entity training data; each piece comprises a first training text and a second training text, the second training text being formed by entity-labeling the candidate entities in the first training text. Specifically, when training the named entity recognition model, for each piece of entity training data in the set, the first training text is taken as input and fed into the named entity recognition model to obtain the corresponding labeled text, output by the model, in which candidate entities are labeled; the network parameters of the model are then adjusted based on that labeled text and the second training text in the same piece of entity training data.
For example, for a piece of entity training data, the first training text is "welcome Comrade XXX to Peking University", and the second training text is formed by entity-labeling "XXX" and "Peking University" on the basis of the first training text. Here, the BMEWO labeling scheme is used to label the first training text and complete the entity labeling, where B denotes the beginning of an entity, M the middle of an entity, E the end of an entity, W a single-character entity, and O a non-entity character. Since the candidate entities involve two entity types, Person (Per) and Organization (Org), there are 8 entity labels under the BMEWO scheme: B-Per, M-Per, E-Per, B-Org, M-Org, E-Org, W, and O.
FIG. 4 shows a schematic of entity training data according to one embodiment of the invention. In the second training text, the three characters of "XXX" are labeled B-Per, M-Per and E-Per in sequence, the four characters of "Beijing University" are labeled B-Org, M-Org, M-Org and E-Org in sequence, and the remaining characters of "welcome ... to" are labeled O. On this basis, the first training text "welcome XXX to Beijing University" is input into the named entity recognition model to obtain the labeled text, output by the model, corresponding to the first training text, and the network parameters of the named entity recognition model are adjusted according to the difference between this labeled text and the second training text shown in FIG. 4. In this embodiment, the network parameters of the named entity recognition model may be adjusted using a back propagation algorithm. After model training on a large amount of entity training data in the entity training data set, a trained named entity recognition model is obtained.
It should be noted that the entity training data set used to train the named entity recognition model is built by extracting from corpus resources a large number of corpora relating to person names, organization names and dates, and forming entity training data by applying the entity labeling process to the extracted corpora.
The topic recognition model is used to recognize the topic category of the text; according to one embodiment of the invention, the topic category is any one of a politics-related category, an entertainment category, a sports category, a finance category, a daily life category, and an education-and-learning category. Preferably, in this embodiment, the topic recognition model may be constructed, learned, and trained using the LDA (Latent Dirichlet Allocation) algorithm, the PLSA (Probabilistic Latent Semantic Analysis) algorithm, or the like, and may be adjusted as appropriate according to the actual application scenario, training conditions, system configuration, performance requirements, and so on. Such adjustments are readily conceivable to those skilled in the art, fall within the protection scope of the present invention, and are not described further herein.
Taking the LDA algorithm as an example: LDA is a document-topic generative model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics, and documents. As a generative model, it regards each word of an article as obtained by a process of "selecting a topic with a certain probability, then selecting a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is likewise multinomial.
LDA is an unsupervised machine learning technique. It uses the bag-of-words method, treating each document as a word-frequency vector; each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over many words.
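The bag-of-words representation mentioned above can be sketched directly; the vocabulary and tokens below are invented for illustration:

```python
from collections import Counter

def bow_vector(doc_tokens, vocab):
    """Represent a document as the word-frequency vector LDA operates on:
    one count per vocabulary word, with word order discarded."""
    counts = Counter(doc_tokens)
    return [counts[word] for word in vocab]

vocab = ["oppose", "welcome", "match", "market"]
print(bow_vector(["oppose", "oppose", "market"], vocab))  # [2, 0, 0, 1]
```

Every topic training text in the topic training data set is reduced to such a vector before the LDA learning process runs.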
When the theme recognition model is constructed, learned and trained by adopting an LDA algorithm, the theme training data set is formed by extracting a large number of news corpora with theme categories of administration, entertainment, sports, finance, daily life and educational and learning from the corpus resources. Each news corpus can be regarded as a theme training text, and the theme training text can be a text under any one of an administrative theme, an entertainment theme, a sports theme, a financial theme, a daily life theme and an educational and learning theme.
Then, for each topic training text in the topic training data set, LDA defines the following generative process:
1. extract a topic from the topic distribution of the topic training text;
2. extract a word from the word distribution corresponding to the extracted topic;
3. repeat the above process until each word in the topic training text has been generated.
Each topic training text corresponds to a multinomial distribution over T topics (here T is 6, given in advance by trial and error or other means), and each topic corresponds to a multinomial distribution over the V words in the word list.
The core formula of LDA is as follows:
p(w|d) = Σt p(w|t)·p(t|d) (1)
Viewing the formula intuitively, with the topic as the middle layer, the probability of the word w appearing in the topic training text d is obtained from the probability p(t|d) that the current topic training text d corresponds to topic t and the probability p(w|t) that the word w appears under topic t, accumulated over the topics.
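Formula (1) can be checked numerically by marginalizing over the topic layer; the toy probabilities below are invented for illustration:

```python
def p_word_given_doc(p_w_given_t, p_t_given_d):
    """p(w|d) = sum over topics t of p(w|t) * p(t|d)."""
    return sum(pw * pt for pw, pt in zip(p_w_given_t, p_t_given_d))

# Two topics: the word is likely under topic 0 and rare under topic 1,
# and the document leans heavily toward topic 0.
p_w_t = [0.30, 0.01]   # p(w|t) for each topic
p_t_d = [0.90, 0.10]   # p(t|d) for the current document
print(p_word_given_doc(p_w_t, p_t_d))  # approximately 0.271
```

The word therefore appears in this document with high probability mainly because the document's dominant topic assigns it high probability.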
The LDA algorithm first randomly initializes p(w|t) and p(t|d) (for all topic training texts and words), then repeats the above process iteratively; the converged result is the output of LDA. For this iterative learning process, reference may be made to technical literature on the LDA algorithm, which is not repeated here.
Of course, in this embodiment the topic recognition model is mainly concerned with the probabilities p(t|d) that the topic category of a topic training text d corresponds to each topic t, determined based on the LDA algorithm. Further, after the probabilities of the topic training text d for each topic category are obtained, a topic vector is formed from these probability values; the topic vector contains the probability value of the topic training text d belonging to each topic category, and the topic category with the highest probability value is the topic category of the topic training text d.
FIG. 5 shows a schematic diagram of an emotion recognition model according to an embodiment of the present invention. As shown in FIG. 5, the emotion recognition model includes a multi-layer perceptron and a classifier connected to the multi-layer perceptron. A Multi-Layer Perceptron (MLP) is a multi-layer neural network (3 layers or more, i.e. at least 1 hidden layer), also called a Fully Connected Network (FCN). Of course, the specific structure of the multi-layer perceptron can be adjusted as appropriate according to the actual application scenario, network training conditions, system configuration and performance requirements; such adjustments are readily conceivable to those skilled in the art who understand the solution of the present invention, fall within the protection scope of the present invention, and are not described further herein.
According to an embodiment of the invention, the classifier is a softmax classifier, used to indicate the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs. Since the emotional tendency has only two types, positive and negative, the softmax classifier in fact only needs to perform a binary classification: if the probability that the emotional tendency is positive is greater than the probability that it is negative, the emotional tendency of the text toward the word is determined to be positive; otherwise it is negative.
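The two-class softmax decision described above can be sketched as follows; the class ordering (index 0 positive, index 1 negative) and the raw scores are assumptions for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def emotional_tendency(logits):
    """Binary decision: positive if p(positive) > p(negative), else negative.
    Assumes index 0 is the positive class and index 1 the negative class."""
    p_pos, p_neg = softmax(logits)
    return "positive" if p_pos > p_neg else "negative"

print(emotional_tendency([2.0, 0.5]))   # positive
print(emotional_tendency([-1.0, 1.5]))  # negative
```

In the model of FIG. 5, the logits would come from the output layer of the multi-layer perceptron.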
According to one embodiment of the invention, the emotion recognition model performs model training based on a pre-acquired emotion training data set, such that the output of the emotion recognition model indicates the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs. In this embodiment, the emotion training data set includes a plurality of pieces of emotion training data, each piece comprising an emotion training text and the true emotional tendency of the emotion training text toward each word it includes. When training the emotion recognition model, for each piece of emotion training data in the emotion training data set, the emotion training text is first segmented to obtain a corresponding word segmentation sequence. Then the vector set of each word in the segmentation sequence is obtained; the vector set comprises the word vector of the corresponding word, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text. For each word in the segmentation sequence, its vector set is fed into the emotion recognition model to obtain the emotional tendency of the emotion training text toward that word as output by the model. Finally, the network parameters of the emotion recognition model are adjusted based on this emotional tendency and the true emotional tendency of the emotion training text toward the word.
For example, one piece of emotion training data includes the emotion training text "insist against XXX" and the true emotional tendency of the emotion training text toward each of the 3 words "insist", "against", "XXX". When the model is trained, the emotion training text "insist against XXX" is first segmented, giving the segmentation sequence "insist", "against", "XXX".
Taking the word "insist" in the segmentation sequence as an example, the true emotional tendency of the emotion training text toward the word "insist" is negative. First, the vector set of the word "insist" is obtained; it includes the word vector of the word itself, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text. Here, the topic vector corresponding to the emotion training text "insist against XXX" can be obtained via the topic recognition model described above, and the only candidate entity of the emotion training text is the word "XXX". The vector set of the word "insist" therefore includes its own word vector, the topic vector corresponding to the emotion training text, and the word vector of the word "XXX".
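The composition of the vector set described above can be sketched as a simple concatenation; whether the three parts are concatenated or combined in some other way is an assumption here, and all vector values are toy numbers:

```python
def build_vector_set(word_vec, topic_vec, entity_vecs):
    """Combine the word's own vector, the topic vector of the text, and the
    word vectors of all candidate entities into a single input vector.
    Concatenation is an assumed combination scheme, not stated here."""
    combined = list(word_vec) + list(topic_vec)
    for entity_vec in entity_vecs:
        combined.extend(entity_vec)
    return combined

word_vec = [0.2, -0.1]       # toy 2-dimensional word vector of the word itself
topic_vec = [0.9, 0.1]       # toy topic vector of the emotion training text
entity_vecs = [[0.5, 0.4]]   # toy word vector of the only candidate entity
print(build_vector_set(word_vec, topic_vec, entity_vecs))
# [0.2, -0.1, 0.9, 0.1, 0.5, 0.4]
```

The resulting vector is what the multi-layer perceptron of the emotion recognition model would receive as input.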
The vector set of the word "insist" is input into the emotion recognition model for recognition, yielding the emotional tendency of the emotion training text toward the word "insist" as output by the model; based on this emotional tendency and the true emotional tendency of the emotion training text toward the word, the network parameters of the multi-layer perceptron and/or the classifier are adjusted using a back propagation algorithm. Model training over a large amount of emotion training data in the emotion training data set yields a trained emotion recognition model.
It should be noted that the emotion training data set used to train the emotion recognition model is built by extracting from corpus resources a large number of emotion corpora from each field and forming emotion training data by determining the emotional tendencies of the extracted corpora. Furthermore, wherever the above and following text refers to word vectors, the word vectors may be generated with an algorithm such as word2vec or ELMo, which is not limited by the present application.
After the named entity recognition model, the topic recognition model and the emotion recognition model are constructed and trained, abnormal texts can be recognized based on these models. FIG. 6 illustrates a flow diagram of an abnormal text recognition method 600 according to one embodiment of the present invention. In this embodiment, the preset topic is a politics-related topic. It should be noted that the category of the preset topic is not limited to the politics-related category; the present application can also be used to recognize abnormal texts under other topics.
As shown in fig. 6, the method 600 begins at step S610. In step S610, the text to be recognized is input into the named entity recognition model for processing to determine the candidate entities included in the text, where the candidate entities include a person name, an organization name, or a date. According to one embodiment of the invention, the named entity recognition model is a sequence labeling model. In this embodiment, the text to be recognized is "resist and oppose the oppression of the masses by YYY and ZZZ" (YYY and ZZZ represent names; each "Y" and each "Z" represents a Chinese character, and the characters represented by different occurrences of "Y" or "Z" are not required to be the same or different). After the text is input into the sequence labeling model, the candidate entities included in the text are determined to be "YYY" and "ZZZ".
The named entity recognition model performs model training based on a pre-acquired entity training data set, so that its output indicates the candidate entities present in the input text. The entity training data set comprises a plurality of pieces of entity training data; each piece comprises a first training text and a second training text, the second training text being formed by entity-labeling the candidate entities in the first training text. When model training is performed based on the pre-acquired entity training data set, the first training text of each piece of entity training data is input into the named entity recognition model to obtain a labeled text, output by the model, in which the candidate entities of the first training text are marked, and the network parameters of the named entity recognition model are adjusted based on this labeled text and the second training text of the same piece of entity training data. In particular, a back propagation algorithm may be used to adjust the network parameters of the named entity recognition model.
The structure and training process of the named entity recognition model have been described above and are not repeated here. In addition, the text to be recognized includes, but is not limited to, a short message text; a WeChat message text, a QQ message text, and the like may also serve as the text to be recognized in the present application for abnormal text recognition.
Subsequently, in step S620, the text to be recognized is input into the topic recognition model for processing, so as to recognize the topic category of the text. According to one embodiment of the invention, this may be done as follows: the text to be recognized is input into the topic recognition model to obtain the topic vector corresponding to the text as output by the topic recognition model, where the topic vector includes the probability value of the text belonging to each topic category; the topic category with the highest probability value is then taken as the topic category of the text. The topic category is any one of the politics-related, entertainment, sports, finance, daily life, and education-and-learning categories. The construction, learning, and training of the topic recognition model have been described above and are not repeated here.
When the text to be recognized is "resist and oppose the oppression of the masses by YYY and ZZZ", the text is input into the topic recognition model for processing, and the topic vector corresponding to the text is {0.87, 0.04, 0.02, 0.01, 0.03, 0.03}, where 0.87, 0.04, 0.02, 0.01, 0.03 and 0.03 are the probability values of the text belonging to the politics-related, entertainment, sports, finance, daily life, and education-and-learning categories respectively. The politics-related category has the highest probability value, so the topic category of the text is the politics-related category.
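Selecting the topic category from the topic vector is a simple argmax; a sketch using the six categories and the example vector above (the category names are translated labels, not identifiers from the solution):

```python
TOPIC_CATEGORIES = ["politics-related", "entertainment", "sports",
                    "finance", "daily life", "education-and-learning"]

def topic_category(topic_vector):
    """Return the topic category whose probability value is highest."""
    best = max(range(len(topic_vector)), key=lambda i: topic_vector[i])
    return TOPIC_CATEGORIES[best]

print(topic_category([0.87, 0.04, 0.02, 0.01, 0.03, 0.03]))  # politics-related
```

Here the first component dominates, so the text is assigned the politics-related category.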
Next, in step S630, if the topic type of the text corresponds to a preset topic, for each candidate entity in the text, a vector set corresponding to the candidate entity is input into the emotion recognition model to obtain an emotional tendency of the text to the candidate entity. The vector set comprises word vectors corresponding to the candidate entities, topic vectors corresponding to the text and word vectors corresponding to all the candidate entities in the text.
According to one embodiment of the invention, the emotion recognition model comprises a multi-layer perceptron and a classifier connected to the multi-layer perceptron, and the vector set corresponding to a candidate entity can be processed as follows to acquire the emotional tendency of the text toward that candidate entity: the vector set corresponding to the candidate entity is input into the multi-layer perceptron for processing, the processing result is input into the classifier for emotional-tendency probability calculation, and the emotional tendency of the text toward the candidate entity is determined to be positive or negative according to the result of the probability calculation. The classifier is preferably a softmax classifier.
As determined in step S620, the topic category of the text to be recognized corresponds to the preset topic, so the vector sets corresponding to the candidate entities "YYY" and "ZZZ" are each input into the emotion recognition model to obtain the emotional tendency of the text toward the candidate entities "YYY" and "ZZZ".
For the candidate entity "YYY", the corresponding vector set includes the word vector corresponding to the candidate entity "YYY", the topic vector corresponding to the text, and the word vectors corresponding to the candidate entities "YYY" and "ZZZ". The vector set is input into the multi-layer perceptron for processing, and the processing result is input into the softmax classifier for emotional-tendency probability calculation; the probability that the emotional tendency of the text toward the candidate entity "YYY" is positive is 0.12, and the probability that it is negative is 0.88, so the emotional tendency of the text toward the candidate entity "YYY" is determined to be negative.
For the candidate entity "ZZZ", the corresponding vector set includes the word vector corresponding to the candidate entity "ZZZ", the topic vector corresponding to the text, and the word vectors corresponding to the candidate entities "YYY" and "ZZZ". The vector set is input into the multi-layer perceptron for processing, and the processing result is input into the softmax classifier for emotional-tendency probability calculation; the probability that the emotional tendency of the text toward the candidate entity "ZZZ" is positive is 0.15, and the probability that it is negative is 0.85.
The emotion recognition model performs model training based on a pre-acquired emotion training data set, so that the output of the emotion recognition model indicates the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs.
The emotion training data set comprises a plurality of pieces of emotion training data, each piece comprising an emotion training text and the true emotional tendency of the emotion training text toward each word it includes. When model training is performed based on the pre-acquired emotion training data set, the emotion training text of each piece of emotion training data is first segmented to obtain a corresponding word segmentation sequence. Then the vector set of each word in the segmentation sequence is obtained; the vector set comprises the word vector of the corresponding word, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text. For each word in the segmentation sequence, its vector set is fed into the emotion recognition model to obtain the emotional tendency of the emotion training text toward that word as output by the model. Finally, the network parameters of the emotion recognition model are adjusted based on this emotional tendency and the true emotional tendency of the emotion training text toward the word. In particular, a back propagation algorithm may be used to adjust the network parameters of the multi-layer perceptron and/or the classifier. The structure and training process of the emotion recognition model have been described above and are not repeated here.
Finally, in step S640, if the emotional tendency is negative, the text is determined to be an abnormal text under the preset topic. According to an embodiment of the invention, when the emotional tendency of the text toward any candidate entity it includes is negative, the text can be determined to be an abnormal text under the preset topic; only when the emotional tendency of the text toward all candidate entities it includes is positive can the text be determined not to be an abnormal text under the preset topic, i.e. to be a normal text under the preset topic.
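The decision rule of step S640 can be sketched directly: the text is abnormal if its tendency toward any candidate entity is negative, and normal only if all tendencies are positive. The function name and dictionary shape are illustrative only:

```python
def is_abnormal(entity_tendencies):
    """`entity_tendencies` maps each candidate entity of the text to
    "positive" or "negative".  A single negative tendency is enough to
    mark the text abnormal under the preset topic."""
    return any(t == "negative" for t in entity_tendencies.values())

print(is_abnormal({"YYY": "negative", "ZZZ": "negative"}))  # True
print(is_abnormal({"YYY": "positive", "ZZZ": "positive"}))  # False
```

This matches the asymmetry in the text: one negative entity suffices for abnormality, while normality requires unanimity.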
As can be seen from step S630, the emotional tendency of the text toward the candidate entity "YYY" is negative, and the emotional tendency toward the candidate entity "ZZZ" is also negative, so the text can be determined to be an abnormal text under the politics-related topic, i.e. a politics-related text. Further, if the text to be recognized, "resist and oppose the oppression of the masses by YYY and ZZZ", is a short message text, the short message can be determined to be a politics-related short message based on the method 600.
Fig. 7 shows a schematic diagram of an abnormal text recognition apparatus 700 according to an embodiment of the present invention. In this embodiment, the preset topic is a politics-related topic. As shown in FIG. 7, the apparatus 700 includes an entity recognition module 710, a topic recognition module 720, an emotion recognition module 730, and a determination module 740.
The entity identification module 710 is adapted to input text to be identified into a named entity identification model for processing to determine candidate entities that the text includes, the candidate entities including a person name, an organization name, or a date. The text to be identified comprises a short message text.
According to one embodiment of the invention, the named entity recognition model is a sequence labeling model, trained based on a pre-acquired entity training data set such that its output indicates the candidate entities present in the input text. The entity training data set comprises a plurality of pieces of entity training data; each piece comprises a first training text and a second training text, the second training text being formed by entity-labeling the candidate entities in the first training text. The entity recognition module 710 is further adapted to perform model training based on the pre-acquired entity training data set; specifically, for each piece of entity training data in the entity training data set, the first training text is input into the named entity recognition model to obtain a labeled text, output by the model, in which the candidate entities of the first training text are marked, and the network parameters of the named entity recognition model are adjusted based on this labeled text and the second training text of the same piece of entity training data. In this embodiment, the network parameters of the named entity recognition model are adjusted using a back propagation algorithm.
The topic recognition module 720 is adapted to input the text to be recognized into the topic recognition model for processing to recognize the topic category of the text. According to an embodiment of the present invention, the topic recognition module 720 is further adapted to input the text to be recognized into the topic recognition model to obtain the topic vector corresponding to the text as output by the topic recognition model, where the topic vector includes the probability value of the text belonging to each topic category, and to take the topic category with the highest probability value as the topic category of the text. In this embodiment, the topic category is any one of a politics-related category, an entertainment category, a sports category, a finance category, a daily life category, and an education-and-learning category.
The emotion recognition module 730 is adapted to, when the topic type of the text corresponds to a preset topic, for each candidate entity in the text, input a vector set corresponding to the candidate entity into the emotion recognition model to obtain an emotional tendency of the text to the candidate entity. The vector set includes word vectors corresponding to the candidate entities, topic vectors corresponding to the text, and word vectors corresponding to all candidate entities in the text.
According to an embodiment of the present invention, the emotion recognition model includes a multi-layer perceptron and a classifier connected to the multi-layer perceptron, and the emotion recognition module 730 is further adapted to input a vector set corresponding to the candidate entity as input to the multi-layer perceptron for processing, input a processing result to the classifier for performing emotional tendency probability calculation, and determine whether the emotional tendency of the text to the candidate entity is positive or negative according to a result of the probability calculation.
The emotion recognition model performs model training based on a pre-acquired emotion training data set, so that the output of the emotion recognition model indicates the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs. The emotion training data set comprises a plurality of pieces of emotion training data, each piece comprising an emotion training text and the true emotional tendency of the emotion training text toward each word it includes. The emotion recognition module 730 is further adapted to perform model training based on the pre-acquired emotion training data set; specifically, for each piece of emotion training data in the emotion training data set, the emotion training text is segmented to obtain a corresponding word segmentation sequence; the vector set of each word in the segmentation sequence is acquired, the vector set comprising the word vector of the corresponding word, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text; for each word in the segmentation sequence, its vector set is input into the emotion recognition model to obtain the emotional tendency of the emotion training text toward that word as output by the model; and the network parameters of the emotion recognition model are adjusted based on this emotional tendency and the true emotional tendency of the emotion training text toward the word. In this embodiment, a back propagation algorithm is used to adjust the network parameters of the multi-layer perceptron and/or the classifier.
The determining module 740 is adapted to determine that the text is an abnormal text under a preset theme when the emotional tendency is negative.
The specific steps and embodiments of the abnormal text recognition are disclosed in detail in the description based on fig. 3 to 6, and are not described herein again.
Most existing methods for recognizing abnormal text under a topic convert the recognition problem into a binary classification problem: texts related to the topic are labeled, and a model is trained by machine learning to recognize texts. Such methods easily overfit features such as person names, so the accuracy and reliability of the recognition results are low. According to the technical solution for abnormal text recognition of the present application, overfitting of person-name and organization-name features is avoided by introducing topic and sentiment analysis: whether the topic category of the text to be recognized corresponds to the topic, and the emotional tendency of the text toward person names, organization names, and dates, are analyzed to comprehensively judge the probability that the text is a text under the topic, thereby improving the accuracy of abnormal text recognition under the topic.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or groups of devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. Modules or units or groups in embodiments may be combined into one module or unit or group and may furthermore be divided into sub-modules or sub-units or sub-groups. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or as a combination of method elements, that can be performed by a processor of a computer system or by other means of carrying out the described functions. A processor having the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Further, the elements of the apparatus embodiments described herein are examples of apparatus for implementing the functions performed by those elements for the purpose of carrying out the invention.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to execute the abnormal text recognition method of the present invention according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (17)

1. An abnormal text recognition method, comprising:
inputting a text to be recognized into a named entity recognition model to determine candidate entities included in the text, wherein the candidate entities include a person name, an organization name or a date;
inputting a text to be recognized into a topic recognition model for processing so as to recognize the topic category of the text;
if the topic category of the text corresponds to a preset topic, inputting a vector set corresponding to each candidate entity in the text into an emotion recognition model so as to acquire the emotional tendency of the text toward the candidate entity;
and if the emotional tendency is negative, judging that the text is an abnormal text under the preset topic.
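The three-stage pipeline of claim 1 can be sketched as follows. This is a minimal illustration only: the function name `recognize_abnormal_text`, the stand-in stub models, and the label strings are all assumptions for demonstration, not part of the patent; in practice each stage would be a trained named entity recognition, topic, and emotion recognition model.

```python
# Minimal sketch of the claimed pipeline, with trivial stand-in models.
def recognize_abnormal_text(text, ner_model, topic_model, emotion_model,
                            preset_topic="government"):
    # Step 1: determine candidate entities (person, organization, or date).
    entities = ner_model(text)
    # Step 2: recognize the topic category of the text.
    if topic_model(text) != preset_topic:
        return False  # not under the preset topic, so not abnormal
    # Step 3: obtain the text's emotional tendency toward each candidate
    # entity; any negative tendency marks the text as abnormal.
    return any(emotion_model(text, e) == "negative" for e in entities)

# Stub models for demonstration only (keyword triggers, not real models).
ner = lambda t: ["City Hall"] if "City Hall" in t else []
topic = lambda t: "government" if "policy" in t else "sports"
emotion = lambda t, e: "negative" if "terrible" in t else "positive"

print(recognize_abnormal_text("The new policy from City Hall is terrible.",
                              ner, topic, emotion))  # True
```

Note that texts outside the preset topic are never passed to the emotion model, mirroring the conditional in claim 1.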
2. The method of claim 1, wherein the inputting text to be recognized into a topic recognition model for processing to recognize a topic category of the text comprises:
inputting a text to be recognized into a topic recognition model to obtain a topic vector which is output by the topic recognition model and corresponds to the text, wherein the topic vector comprises probability values of the text belonging to various topic categories;
and taking the topic category with the highest probability value as the topic category of the text.
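Claim 2's selection step is an argmax over the topic vector. A minimal sketch, assuming an illustrative label list taken from the categories in claim 12 (the patent does not fix the number or order of categories):

```python
# Pick the topic category with the highest probability from a topic vector.
TOPIC_LABELS = ["politics", "entertainment", "sports",
                "finance", "daily life", "education"]

def topic_category(topic_vector):
    # topic_vector[i] is the probability that the text belongs to TOPIC_LABELS[i].
    best = max(range(len(topic_vector)), key=lambda i: topic_vector[i])
    return TOPIC_LABELS[best]

print(topic_category([0.62, 0.05, 0.10, 0.08, 0.10, 0.05]))  # politics
```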
3. The method of claim 2, wherein the set of vectors includes a word vector corresponding to the candidate entity, a subject vector corresponding to the text, and word vectors corresponding to all candidate entities in the text.
4. The method of claim 1, wherein the emotion recognition model comprises a multi-layer perceptron and a classifier connected to the multi-layer perceptron, and inputting the vector set corresponding to the candidate entity into the emotion recognition model to obtain the emotional tendency of the text toward the candidate entity comprises:
inputting the vector set corresponding to the candidate entity into the multi-layer perceptron for processing;
and inputting the processing result into the classifier to perform an emotional tendency probability calculation, and determining, according to the result of the probability calculation, whether the emotional tendency of the text toward the candidate entity is positive or negative.
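The structure in claim 4 — a multi-layer perceptron whose output feeds a classifier — can be sketched as below. The single hidden layer, tanh activation, logistic output, and 0.5 decision threshold are illustrative assumptions; the patent specifies only the perceptron-plus-classifier arrangement, not these details, and the random weights here are untrained.

```python
import math
import random

random.seed(0)

def mlp_forward(vec, w_hidden, w_out):
    # Multi-layer perceptron: one hidden layer with tanh activation.
    hidden = [math.tanh(sum(v * w for v, w in zip(vec, row)))
              for row in w_hidden]
    # Classifier: logistic output interpreted as P(negative).
    z = sum(h * w for h, w in zip(hidden, w_out))
    return 1.0 / (1.0 + math.exp(-z))

def emotional_tendency(vec, w_hidden, w_out, threshold=0.5):
    p_negative = mlp_forward(vec, w_hidden, w_out)
    return "negative" if p_negative > threshold else "positive"

# Random (untrained) weights, shown for shape illustration only.
dim, hidden_units = 6, 4
w_hidden = [[random.uniform(-1, 1) for _ in range(dim)]
            for _ in range(hidden_units)]
w_out = [random.uniform(-1, 1) for _ in range(hidden_units)]
print(emotional_tendency([0.1] * dim, w_hidden, w_out))
```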
5. The method of claim 1, wherein the named entity recognition model is a sequence annotation model.
6. The method of claim 5, wherein the named entity recognition model is trained based on a pre-acquired entity training data set such that the output of the named entity recognition model indicates the candidate entities present in the input text.
7. The method of claim 6, wherein the entity training data set comprises a plurality of pieces of entity training data, each piece of entity training data comprises a first training text and a second training text, the second training text being formed by entity-labeling the candidate entities in the first training text, and performing model training based on the pre-acquired entity training data set comprises:
for each piece of entity training data in the entity training data set, inputting the first training text of the entity training data into the named entity recognition model to obtain a labeled text, output by the named entity recognition model and corresponding to the first training text, in which the candidate entities are labeled;
and adjusting the network parameters of the named entity recognition model based on the labeled text and the second training text corresponding to the first training text in the entity training data.
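The training step of claim 7 — run the first training text through the model, compare its labeling against the gold second training text, and use the disagreement to adjust parameters — follows the usual supervised sequence-labeling pattern. A schematic sketch, where the BIO tag scheme, the toy stand-in model, and the error count used in place of a real loss are all assumptions for illustration:

```python
# Schematic sequence-labeling training step: compare predicted entity tags
# against gold tags and count disagreements as a stand-in for a real loss.
def train_step(model_tags_fn, first_text_tokens, gold_tags):
    predicted = model_tags_fn(first_text_tokens)  # model's labeled output
    # "Loss": number of token positions where prediction disagrees with gold.
    loss = sum(p != g for p, g in zip(predicted, gold_tags))
    # A real implementation would back-propagate this loss to adjust the
    # network parameters (claim 8); here we just report it.
    return loss

toy_model = lambda tokens: ["O"] * len(tokens)  # predicts no entities at all
tokens = ["Alice", "visited", "Paris", "today"]
gold = ["B-PER", "O", "B-LOC", "O"]
print(train_step(toy_model, tokens, gold))  # 2
```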
8. The method of claim 7, wherein the adjusting network parameters of the named entity recognition model comprises:
the network parameters of the named entity recognition model are adjusted using a back-propagation algorithm.
9. The method of claim 4, wherein the emotion recognition model is trained based on a pre-acquired emotion training data set, such that the output of the emotion recognition model indicates the emotional tendency, toward the word corresponding to the input vector set, of the text to which that word belongs.
10. The method of claim 9, wherein the emotion training data set comprises a plurality of pieces of emotion training data, each piece of emotion training data comprises an emotion training text together with the true emotional tendency of that text toward each word it comprises, and performing model training based on the pre-acquired emotion training data set comprises:
performing word segmentation processing on each piece of emotion training data in the emotion training data set to obtain a corresponding word segmentation sequence;
obtaining a vector set for each word in the word segmentation sequence, wherein the vector set comprises the word vector of the corresponding word, the topic vector corresponding to the emotion training text, and the word vectors of all candidate entities included in the emotion training text;
for each word in the word segmentation sequence, inputting the vector set of the word into the emotion recognition model to obtain the emotional tendency of the emotion training text toward the word, as output by the emotion recognition model;
and adjusting the network parameters of the emotion recognition model based on that emotional tendency and the true emotional tendency of the emotion training text toward the word.
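The per-word vector set of claim 10 combines three parts: the word's own vector, the text's topic vector, and the vectors of all candidate entities in the text. A sketch assembling them, assuming 2-dimensional toy vectors and simple concatenation as the combination (the patent does not specify how the three parts are combined):

```python
# Assemble the vector set for one word: word vector + topic vector +
# word vectors of all candidate entities in the text, concatenated here.
def vector_set(word, word_vectors, topic_vector, entity_words):
    parts = list(word_vectors[word]) + list(topic_vector)
    for entity in entity_words:
        parts += list(word_vectors[entity])
    return parts

# Toy 2-dimensional vectors for illustration only.
word_vectors = {"great": [0.9, 0.1], "Acme": [0.2, 0.8]}
topic_vec = [0.7, 0.3]
print(vector_set("great", word_vectors, topic_vec, ["Acme"]))
# [0.9, 0.1, 0.7, 0.3, 0.2, 0.8]
```

The resulting flat vector is what would be fed to the multi-layer perceptron of claim 4 during training.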
11. The method of claim 10, wherein said adjusting network parameters of said emotion recognition model comprises:
network parameters of the multi-layered perceptron and/or classifier are adjusted using a back-propagation algorithm.
12. The method of claim 1, wherein the topic category is any one of a politics category, an entertainment category, a sports category, a finance category, a daily-life category, and an education category.
13. The method of claim 1, wherein the preset topic is a government-affairs-related topic.
14. The method of claim 1, wherein the text to be recognized comprises short message text.
15. An abnormal text recognition apparatus comprising:
an entity recognition module adapted to input text to be recognized into a named entity recognition model to determine candidate entities comprised by the text, the candidate entities comprising a person name, an organization name, or a date;
the topic recognition module is adapted to input the text to be recognized into the topic recognition model for processing so as to recognize the topic category of the text;
the emotion recognition module is adapted to input a vector set corresponding to each candidate entity in the text into an emotion recognition model, when the topic category of the text corresponds to a preset topic, so as to acquire the emotional tendency of the text toward the candidate entity;
and the judging module is adapted to judge the text to be an abnormal text under the preset topic when the emotional tendency is negative.
16. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-14.
17. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-14.
CN201811093657.1A 2018-09-19 2018-09-19 Abnormal text recognition method, device, computing equipment and medium Active CN110929026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811093657.1A CN110929026B (en) 2018-09-19 2018-09-19 Abnormal text recognition method, device, computing equipment and medium

Publications (2)

Publication Number Publication Date
CN110929026A true CN110929026A (en) 2020-03-27
CN110929026B CN110929026B (en) 2023-04-25

Family

ID=69855159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811093657.1A Active CN110929026B (en) 2018-09-19 2018-09-19 Abnormal text recognition method, device, computing equipment and medium

Country Status (1)

Country Link
CN (1) CN110929026B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663046A (en) * 2012-03-29 2012-09-12 中国科学院自动化研究所 Sentiment analysis method oriented to micro-blog short text
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
WO2015043075A1 (en) * 2013-09-29 2015-04-02 广东工业大学 Microblog-oriented emotional entity search system
CN104881417A (en) * 2014-02-28 2015-09-02 深圳市网安计算机安全检测技术有限公司 Public opinion analyzing method and system
CN104199845A (en) * 2014-08-08 2014-12-10 杭州电子科技大学 On-line comment sentiment classification method based on agent model
CN104572616A (en) * 2014-12-23 2015-04-29 北京锐安科技有限公司 Method and device for identifying text orientation
CN107038178A (en) * 2016-08-03 2017-08-11 平安科技(深圳)有限公司 The analysis of public opinion method and apparatus
WO2018023981A1 (en) * 2016-08-03 2018-02-08 平安科技(深圳)有限公司 Public opinion analysis method, device, apparatus and computer readable storage medium
CN107807914A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 Recognition methods, object classification method and the data handling system of Sentiment orientation
CN107038154A (en) * 2016-11-25 2017-08-11 阿里巴巴集团控股有限公司 A kind of text emotion recognition methods and device
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Lina: "Research on Several Key Technologies of Microblog Data Mining Theory" *
Jiang Zhiyi; Ma Wangrong; Zou Kai; Li Li: "Research on the Emotional Evolution Characteristics of Online Public Opinion Based on Sentiment Orientation Analysis" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021217843A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Enterprise public opinion analysis method and apparatus, and electronic device and medium
CN111723856A (en) * 2020-06-11 2020-09-29 广东浪潮大数据研究有限公司 Image data processing method, device and equipment and readable storage medium
CN111723856B (en) * 2020-06-11 2023-06-09 广东浪潮大数据研究有限公司 Image data processing method, device, equipment and readable storage medium
CN115587178A (en) * 2022-09-08 2023-01-10 上海网商电子商务有限公司 Automobile comment analysis method

Also Published As

Publication number Publication date
CN110929026B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN110765244B (en) Method, device, computer equipment and storage medium for obtaining answering operation
US11249774B2 (en) Realtime bandwidth-based communication for assistant systems
Li et al. Mining opinion summarizations using convolutional neural networks in Chinese microblogging systems
CN107977347B (en) Topic duplication removing method and computing equipment
US10637826B1 (en) Policy compliance verification using semantic distance and nearest neighbor search of labeled content
EP3680850A1 (en) Method and system for determining risk score for a contract document
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
US20220113998A1 (en) Assisting Users with Personalized and Contextual Communication Content
CN110929025A (en) Junk text recognition method and device, computing equipment and readable storage medium
CN110929026A (en) Abnormal text recognition method and device, computing equipment and medium
US11494565B2 (en) Natural language processing techniques using joint sentiment-topic modeling
CN111651990B (en) Entity identification method, computing device and readable storage medium
CN110245557A (en) Image processing method, device, computer equipment and storage medium
CN110019776B (en) Article classification method and device and storage medium
US20230088182A1 (en) Machine learning of colloquial place names
CN110909157B (en) Text classification method and device, computing equipment and readable storage medium
CN107665442A (en) Obtain the method and device of targeted customer
CN112905787A (en) Text information processing method, short message processing method, electronic device and readable medium
CN112087473A (en) Document downloading method and device, computer readable storage medium and computer equipment
Pareek et al. Comparative Analysis of Social Media Hate Detection over Code Mixed Hindi-English Language
KR102098461B1 (en) Classifying method using a probability labele annotation algorithm using fuzzy category representation
CN110929530B (en) Multi-language junk text recognition method and device and computing equipment
US11922515B1 (en) Methods and apparatuses for AI digital assistants
US20220342922A1 (en) A text classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant