CN108874896B - Humor identification method based on neural network and humor characteristics - Google Patents

Humor identification method based on neural network and humor characteristics


Publication number
CN108874896B
CN108874896B (application CN201810496016.4A)
Authority
CN
China
Prior art keywords
humorous
humor
word
sentence
text
Prior art date
Legal status
Active
Application number
CN201810496016.4A
Other languages
Chinese (zh)
Other versions
CN108874896A (en
Inventor
林鸿飞
樊小超
杨亮
刁宇峰
申晨
楚永贺
任璐
张桐瑄
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201810496016.4A priority Critical patent/CN108874896B/en
Publication of CN108874896A publication Critical patent/CN108874896A/en
Application granted granted Critical
Publication of CN108874896B publication Critical patent/CN108874896B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology


Abstract

A humor identification method based on a neural network and humor features belongs to the field of data mining and natural language processing and addresses the problem of humor recognition. Its key steps are: S1, humor corpus collection and preprocessing; S2, humor feature extraction; S3, word vector representation of the text; S4, neural network model construction; S5, evaluation of humor recognition results. Its effects are as follows: humor data in a specific form are collected and preprocessed, and, drawing on established humor theory, the phonetic characteristics of humorous text are fully considered and humorous phonetic features are constructed; exploiting the fuzziness characteristic of humor, the words with the most synonyms in a sentence are extracted as feature words and vectorized; a deep learning method extracts the deep semantic features behind humorous text and fuses the phonetic and ambiguity features of humor into the neural network to perform humor recognition; experiments on a data set verify the effectiveness of the method.

Description

Humor identification method based on neural network and humor characteristics
Technical Field
The invention relates to the field of data mining and natural language processing, in particular to a humor identification method based on a neural network and humor characteristics.
Background
With the rapid development of artificial intelligence, humor recognition has become an interesting research problem in natural language processing. Humor is a special form of language expression that can liven up the atmosphere and relieve embarrassment; Wikipedia defines it as the quality or ability to make people laugh. Interaction between people would undoubtedly be incomplete without humor. In the field of human-computer interaction, question-answering and dialogue systems are already used in many household products, and interaction between people and computers is becoming increasingly common. If a computer could understand and use humor, it would be more human-like and its communication with people would be smoother, which would be a significant achievement for humanity in the era of artificial intelligence. To enable a computer to understand and use humor, it must first be given the ability to recognize humor.
The humor recognition task is to let a computer automatically determine whether a given paragraph or sentence is humorous. It remains a challenging task in the field of natural language processing. First, humor takes many forms, and it is difficult to define and categorize it precisely; second, some humor requires longer context to be understood; furthermore, understanding much humor requires uncovering a large amount of common-sense knowledge behind the textual content and interpreting the text repeatedly. In other words, humor is a latent semantic expression, a high-level abstract form of human language.
Enabling a computer to recognize all forms of humor is beyond its current computing capability, so the invention limits the scope of humor recognition research to the sentence level. A sentence containing only a few words can have a humorous effect, and such sentences usually contain special grammatical structures or semantic forms, which provide traceable clues that allow a computer to automatically discover and learn the features behind humor.
Theoretical study of humor dates back to the last century; among the most influential humor theories is the Semantic Script Theory of Humor (SSTH). Building on humor theory, many researchers have taken up humor computing: Taylor et al., for example, collected and annotated humorous texts from Twitter, constructed a series of humor features from the semantic and structural characteristics of humor, and used traditional machine learning methods to recognize humor.
At present there is little research on humor recognition in text. Most existing work builds on humor theory, manually constructs humor features, and applies traditional text representations and classification algorithms, with poor recognition results. Deep learning has been applied to humor recognition only in a simple form, without combining humor features in the recognition of humorous text.
Disclosure of Invention
The invention aims to provide a method for automatically performing humor recognition on text using only a small number of humor features, which avoids the drawback of traditional feature-engineering methods that require a large number of manually constructed humor features, and provides users with an effective humor recognition method.
The invention solves the technical problems in the prior art by adopting the following technical scheme: a humor identification method based on a neural network and humor characteristics comprises the following steps:
s1, humor corpus collection and preprocessing:
a1, humor corpus collection: obtain humorous texts and their evaluation information from a website; number each text with an ID as its unique identifier for convenient storage and later use; collect the website's humorous content as the humorous text candidate set; obtain the evaluation information of the texts' humor from the website as the standard for measuring the degree of humor; collect text such as news as the non-humorous text candidate set. Each item of the humorous corpus is a single sentence.
a2, preprocessing: clean the humorous and non-humorous text candidate sets, deleting special and unrecognizable characters; automatically label texts with higher evaluation scores as humorous texts, i.e. positive examples, according to the humor evaluation information, and manually review the automatic labels; select non-humorous texts, i.e. negative examples, from the non-humorous candidate set according to two principles: sentence lengths are similar, and the dictionaries used by positive and negative examples are consistent (the dictionary is built from the distinct words of the positive examples; a non-humorous text containing out-of-dictionary words is not selected); tokenize the humorous and non-humorous texts.
S2, humor feature extraction:
b1, humorous phonetic feature extraction: using a pronunciation dictionary, represent each English word in the sentence's word set obtained in step S1 as a sequence of phonemes; extract the number of words in the sentence whose initial phonemes match, the maximum length of the matching initial phoneme sequence, the number of words whose final phonemes match, and the maximum length of the matching final phoneme sequence, i.e. the number of alliterative words, the maximum alliteration chain length, the number of rhyming words, and the maximum rhyme chain length, as the humorous phonetic features, yielding a 4-dimensional feature vector P.
b2, humorous inconsistency feature extraction: for the sentence's word set obtained in step S1, judge whether the sentence contains an antonym pair; represent the words as low-dimensional dense vectors using the word vectors obtained in step c1, compute the semantic distance between every pair of words in the sentence, and take the antonym indicator together with the maximum and minimum semantic distances as the humorous inconsistency features, yielding a 3-dimensional feature vector Q. Semantic distance is measured by cosine similarity:

similarity(A, B) = (A · B) / (||A||_2 ||B||_2)

where similarity(A, B) is the cosine semantic distance of two word vectors, A and B are the vectors of any two words in the sentence, and ||A||_2 and ||B||_2 are their 2-norms.
S3, a word vector representation step of the text based on the neural network:
c1, word vector acquisition: obtain Wikipedia corpora and joke corpora as the corpus for training word vectors, and train word vectors with a word vector tool, obtaining a low-dimensional dense vector for each word in the humorous and non-humorous texts.
c2, word vector representation of text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as n × m × d word-embedding matrices, where n is the number of samples, m the number of words per sample, and d the word vector dimension.
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use the synonym sense resource to extract each word's synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where i indexes the i-th word in the sentence, n is the number of synonym sets, and synset_j is a synonym sense unit; from each synset_j, use the synonym resource to obtain the set of synonyms synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i; the word in the sentence with the largest synWords_i, i.e. the word with the most synonyms, is taken as the humorous fuzzy feature word.
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the fuzziness of humor. If the sentence contains only one feature word, represent it as a vector T using the word vectors obtained in c1; if it contains several feature words, use their average word vector as the fuzzy feature word vector:

T = (1/N) Σ_{n=1}^{N} T_n

where T is the average word vector of the feature words, N is the number of feature words in the sentence, and T_n is the word vector of the n-th feature word.
S4, neural network model construction:
d1, model input: take the humorous fuzzy feature word vector T ∈ R^d obtained in step c4 and concatenate it with the word vector w_t ∈ R^d of each word in the sentence to obtain the model input X_t = [w_t ; T], X_t ∈ R^{2d}, where R denotes the vector space and d the vector dimension; the input sequence can then be represented as X = {X_1, X_2, …, X_T}.
d2, humor recognition model construction: use a recurrent neural network (RNN) to extract the latent semantic features of the input X_t obtained in d1 and obtain the hidden-layer vector representation of the text. The invention adopts a bidirectional long short-term memory network (Bi-LSTM), in which each cell unit is computed as follows:

X' = [X_t ; h_{t-1}]
f_t = σ(W_f · X' + b_f)
i_t = σ(W_i · X' + b_i)
o_t = σ(W_o · X' + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · X' + b_c)
h_t = o_t ⊙ tanh(c_t)

where X' is the concatenation of the LSTM input vector X_t and the hidden-layer output vector h_{t-1} at time t-1; f_t, i_t and o_t are the forget, input and output gates of the LSTM; c_t is the LSTM cell state; W_f, W_i and W_o are the parameter matrices of the forget, input and output gates; b_f, b_i and b_o are their biases; these parameters are learned by the LSTM model; σ is the sigmoid function; tanh is the hyperbolic tangent; W_c and b_c are the cell-state parameter matrix and bias; ⊙ denotes element-wise multiplication; and h_t is the hidden-layer output.
d3, attention mechanism: the attention mechanism lets the model weight the fuzzy feature words and the words around them during humor recognition, improving recognition performance. From the word-embedding representation X_t of the sentence (which incorporates the fuzzy feature word vector T) obtained in step d1 and the hidden-layer representation h_t obtained in step d2, compute the weight α_t of each word in the sentence and the hidden representation r_t:

α'_t = W_α X_t + b_α
α_t = exp(α'_t) / Σ_{k=1}^{T} exp(α'_k)
r_t = h_t α_t

where W_α is the attention weight matrix, b_α is the attention bias, and T is the number of words in the sentence.
d4, average word vector representation of the sentence: from the hidden representations r_t of the humorous sentence obtained in step d3, compute the sentence's average word vector representation s':

s' = (1/T) Σ_{t=1}^{T} r_t
d5, humor feature fusion: concatenate the humorous phonetic feature P extracted in step b1 and the humorous inconsistency feature Q extracted in step b2 with the average word vector representation s' of the sentence obtained in step d4 to obtain the sentence's vector representation s, whose dimension is the sum of the dimensions of the three feature vectors:

s = [s' ; P ; Q]
d6, humor recognition: from the sentence representation s obtained in step d5, compute the probability p that the sentence is humorous, and thereby judge whether the given sentence is humorous or non-humorous text:

p = softmax(W_h s + b_h)
s5, humorous recognition result evaluation step: and evaluating the humorous recognition result according to a preset evaluation index.
The preset evaluation indexes are accuracy, precision, recall and the F1 value, computed as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
wherein TP represents the number of samples for which the classifier determines positive examples as positive examples, TN represents the number of samples for which the classifier determines negative examples as negative examples, FP represents the number of samples for which the classifier determines negative examples as positive examples, and FN represents the number of samples for which the classifier determines positive examples as negative examples.
Drawings
FIG. 1 is a logic schematic of the present invention
FIG. 2 is a schematic diagram of the step S4 model according to the embodiment of the present invention
FIG. 3 shows humorous recognition results according to an embodiment of the present invention
Detailed Description
The invention is described below with reference to the following figures and specific embodiments:
a humor identification method based on a neural network and humor characteristics comprises the following steps:
s1, humor corpus collection and preprocessing:
a1, collecting humorous material:
by using a web crawler technology, English humor corpus is crawled from www.punoftheday.com, and text ID, text content and text evaluation information of the humor are obtained. The humorous text on the website is in a single sentence form, the length of a sentence is usually less than 30 words, and the voting information of each sentence represents the recognition degree of the net friend whether the sentence is humorous or not. And crawling a humorous text ID from the website as a unique identification of the text, crawling text contents as a humorous text candidate set, and crawling net friend voting information as a measurement standard for measuring whether the text is humorous or not.
News corpora are statements of fact and usually not humorous; news data are crawled from news websites such as Yahoo News and The New York Times as the non-humorous text candidate set, i.e. negative examples.
a2, preprocessing: clean the humorous and non-humorous text candidate sets, deleting special and unrecognizable characters. According to the humor evaluation information, texts rated 3 stars or above are taken as humorous texts, i.e. positive examples; after manual review of the automatic labels, there are 2423 positive samples in total. Negative samples are selected from the non-humorous candidate set following two rules: their lengths match those of the positive samples, and the words they use all appear in the positive examples, i.e. the negative examples use the same dictionary as the positive examples. After manual review the number of negative samples equals the number of positive samples.
The humorous and non-humorous texts are tokenized using the tokenizer in Python's NLTK library.
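The two negative-example selection rules of step a2 can be sketched as a simple filter. This is a minimal sketch: the function name, the exact length-matching rule, and the tolerance parameter are illustrative assumptions, not from the patent.

```python
def select_negative_examples(candidates, positive_vocab, target_len, tol=5):
    """Select negative (non-humorous) samples per step a2: the sentence
    length must be close to that of the positive examples, and every
    word must already appear in the positive-example dictionary.
    `tol` (length tolerance) is an assumed parameter for illustration."""
    selected = []
    for sent in candidates:
        words = sent.lower().split()
        # Rule 1: sentence length similar to the positive examples.
        if abs(len(words) - target_len) > tol:
            continue
        # Rule 2: no out-of-dictionary words relative to the positives.
        if all(w in positive_vocab for w in words):
            selected.append(sent)
    return selected
```

A candidate such as a news sentence containing vocabulary absent from the joke corpus would be rejected by the second rule, keeping the positive/negative dictionaries consistent.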
S2, humor feature extraction:
b1, humorous phonetic feature extraction: for the sentence word sets obtained in step S1, the CMU pronunciation dictionary provides 69 stress-marked phonemes; ignoring the stress markers leaves 39 phonemes, with which each English word is expressed to obtain its phonetic form. The number of words in the sentence whose initial phonemes match, the maximum length of the matching initial phoneme sequence, the number of words whose final phonemes match, and the maximum length of the matching final phoneme sequence, i.e. the number of alliterative words, the maximum alliteration chain length, the number of rhyming words, and the maximum rhyme chain length, are extracted as the humorous phonetic features, yielding a 4-dimensional feature vector.
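The four phonetic features can be sketched as below. A toy pronunciation dictionary stands in for the CMU dictionary (stress digits already stripped, as the step describes), and the exact pairing rules for alliteration and rhyme are assumptions, since the patent does not spell them out.

```python
# Toy ARPAbet pronunciations standing in for the CMU dictionary.
PRON = {
    "peter":  ["P", "IY", "T", "ER"],
    "piper":  ["P", "AY", "P", "ER"],
    "picked": ["P", "IH", "K", "T"],
    "a":      ["AH"],
    "peck":   ["P", "EH", "K"],
}

def shared_prefix(p, q):
    """Length of the common leading phoneme sequence of two words."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n

def phonetic_features(words, pron=PRON):
    """4-dimensional vector P of step b1: (alliterative word count,
    max alliteration chain length, rhyming word count, max rhyme
    chain length), computed over all word pairs in the sentence."""
    prons = [pron[w] for w in words if w in pron]
    allit_words, rhyme_words = set(), set()
    max_allit = max_rhyme = 0
    for i in range(len(prons)):
        for j in range(i + 1, len(prons)):
            pre = shared_prefix(prons[i], prons[j])
            # Rhyme is a shared *trailing* sequence: compare reversed.
            suf = shared_prefix(prons[i][::-1], prons[j][::-1])
            if pre > 0:
                allit_words |= {i, j}
                max_allit = max(max_allit, pre)
            if suf > 0:
                rhyme_words |= {i, j}
                max_rhyme = max(max_rhyme, suf)
    return (len(allit_words), max_allit, len(rhyme_words), max_rhyme)
```

In practice the pronunciations would come from `nltk.corpus.cmudict` with the stress digits removed from each phoneme.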
b2, humorous inconsistency feature extraction: for the sentence word sets obtained in step S1, use the synonym sense resource to check whether the sentence contains an antonym pair; the feature is 1 if it does and 0 otherwise. Represent the words as 300-dimensional dense vectors using the Word2Vec tool, compute the semantic distance of the word pairs in the sentence, and take the maximum and minimum semantic distances as features. The sentence's inconsistency features thus form a 3-dimensional feature vector Q. The semantic distance is computed as:

similarity(A, B) = (A · B) / (||A||_2 ||B||_2)

where A and B are the vectors of two words in the sentence, ||A||_2 and ||B||_2 are their 2-norms, (A, B) ranges over the word pairs obtained by traversing the sentence's word set, and similarity(A, B) is the cosine semantic distance of the two word vectors.
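The 3-dimensional vector Q can be sketched as follows. The antonym check is passed in as a precomputed flag, since it depends on the synonym resource; function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine semantic distance of step b2: (A . B) / (||A||_2 ||B||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def inconsistency_features(vectors, has_antonym_pair):
    """3-dimensional feature vector Q: (antonym-pair flag, maximum
    pairwise semantic distance, minimum pairwise semantic distance)
    over all word pairs in the sentence."""
    sims = [cosine_similarity(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))]
    return (1 if has_antonym_pair else 0, max(sims), min(sims))
```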
S3, a word vector representation step of the text based on the neural network:
c1, word vector acquisition: 13.6 GB of Wikipedia corpora and 200,000 joke corpora are obtained from the web as the corpus for training word vectors, and the word vectors are trained with the gensim module of the Python library, yielding a 300-dimensional low-dimensional dense vector for each word in the humorous and non-humorous texts.
c2, word vector representation of text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as n × m × d word-embedding matrices, where n is the number of samples, m the number of words per sample, and d the word vector dimension.
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use the WordNet dictionary to extract each word's synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where i indexes the i-th word in the sentence, n is the number of synonym sets, and synset_j is a synonym sense unit; from each synset_j, use the WordNet dictionary to obtain the synonym set synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i; the word in the sentence with the largest synWords_i, i.e. the word with the most synonyms, is taken as the humorous fuzzy feature word.
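The fuzzy-feature-word selection can be sketched with a toy synonym dictionary standing in for WordNet; a real implementation would collect deduplicated lemma names over `nltk.corpus.wordnet.synsets(word)`. The dictionary contents here are illustrative assumptions.

```python
# Toy synonym resource standing in for WordNet (step c3).
SYNONYMS = {
    "bank":  {"depository", "slope", "incline", "rely", "trust"},
    "river": {"stream", "watercourse"},
}

def fuzzy_feature_words(words, synonyms=SYNONYMS):
    """Return the word(s) of the sentence with the most deduplicated
    synonyms, i.e. the humorous fuzzy feature words of step c3."""
    counts = {w: len(synonyms.get(w, ())) for w in words}
    best = max(counts.values(), default=0)
    if best == 0:
        return []          # no word has any recorded synonyms
    return [w for w, c in counts.items() if c == best]
```

A sentence may yield several feature words when counts tie, which is why step c4 averages their vectors.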
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the fuzziness of humor. If the sentence contains only one feature word, represent it as a vector T using the word vectors obtained in c1; if it contains several feature words, use their average word vector as the fuzzy feature word vector:

T = (1/N) Σ_{n=1}^{N} T_n

where T is the average word vector of the feature words, N is the number of feature words in the sentence, and T_n is the word vector of the n-th feature word.
S4, neural network model construction:
d1, model input: take the humorous fuzzy feature word vector T ∈ R^d obtained in step c4 and concatenate it with the word vector w_t ∈ R^d of each word in the sentence as the model's input word vector X_t = [w_t ; T], X_t ∈ R^{2d}, where R denotes the vector space and d the vector dimension; the input sequence can then be represented as X = {X_1, X_2, …, X_T}.
d2, humor recognition model construction: a recurrent neural network (RNN) extracts features from the input obtained in d1, capturing the deep semantic features behind the humorous text and producing the hidden-layer vector representation h_t. The invention adopts a bidirectional long short-term memory network (Bi-LSTM) as the classification model, in which each cell unit is computed as follows:

X' = [X_t ; h_{t-1}]
f_t = σ(W_f · X' + b_f)
i_t = σ(W_i · X' + b_i)
o_t = σ(W_o · X' + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · X' + b_c)
h_t = o_t ⊙ tanh(c_t)

where X' is the concatenation of the LSTM input vector X_t and the hidden-layer output vector h_{t-1} at time t-1; f_t, i_t and o_t are the forget, input and output gates of the LSTM; c_t is the LSTM cell state; W_f, W_i and W_o are the parameter matrices of the forget, input and output gates; b_f, b_i and b_o are their biases; these parameters are learned by the LSTM model; σ is the sigmoid function; tanh is the hyperbolic tangent; W_c and b_c are the cell-state parameter matrix and bias; ⊙ denotes element-wise multiplication; and h_t is the hidden-layer output.
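The cell equations above can be checked with a small NumPy implementation of a single time step (one direction of the Bi-LSTM; the backward pass runs the same cell over the reversed sequence). Parameter packaging and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward LSTM cell update following the equations of step d2.
    `p` holds the parameter matrices W_f, W_i, W_o, W_c and the biases
    b_f, b_i, b_o, b_c named in the patent."""
    x = np.concatenate([x_t, h_prev])                        # X' = [X_t ; h_{t-1}]
    f = sigmoid(p["W_f"] @ x + p["b_f"])                     # forget gate f_t
    i = sigmoid(p["W_i"] @ x + p["b_i"])                     # input gate  i_t
    o = sigmoid(p["W_o"] @ x + p["b_o"])                     # output gate o_t
    c = f * c_prev + i * np.tanh(p["W_c"] @ x + p["b_c"])    # cell state  c_t
    h = o * np.tanh(c)                                       # hidden out  h_t
    return h, c
```

With all parameters zero, every gate is sigmoid(0) = 0.5, so the cell state halves at each step, which is an easy sanity check on the wiring.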
d3, attention mechanism: the attention mechanism increases the weight the model gives to the fuzzy feature words and the words around them during humor recognition, improving recognition performance. From the word-embedding representation X_t of the sentence (which incorporates the fuzzy feature word vector T) obtained in step d1 and the hidden-layer representation h_t obtained in step d2, compute the attention weight α_t of each word in the sentence and the hidden representation r_t:

α'_t = W_α X_t + b_α
α_t = exp(α'_t) / Σ_{k=1}^{T} exp(α'_k)
r_t = h_t α_t

where W_α is the attention weight matrix, b_α is the attention bias, and T is the number of words in the sentence.
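A NumPy sketch of the attention computation above; treating α'_t as a scalar score per word is an assumption consistent with the per-word weight α_t in the formulas.

```python
import numpy as np

def attention(X, H, W_a, b_a):
    """Attention of step d3: alpha'_t = W_a X_t + b_a, a softmax over
    the T words, then r_t = h_t * alpha_t.
    X: (T, d) input word vectors; H: (T, k) Bi-LSTM hidden states;
    W_a: (d,) score weights; b_a: scalar bias (illustrative shapes)."""
    scores = X @ W_a + b_a                  # alpha'_t, shape (T,)
    scores = scores - scores.max()          # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    R = H * alpha[:, None]                  # r_t = h_t * alpha_t
    return alpha, R
```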
d4, average word vector representation of the sentence: from the hidden representations r_t of the humorous sentence obtained in step d3, compute the sentence's average word vector representation s':

s' = (1/T) Σ_{t=1}^{T} r_t
d5, humor feature fusion: concatenate the humorous phonetic feature P extracted in step b1 and the humorous inconsistency feature Q extracted in step b2 with the average word vector representation s' of the sentence obtained in step d4 to obtain the sentence's vector representation s, whose dimension is the sum of the dimensions of the three feature vectors:

s = [s' ; P ; Q]
d6, humor recognition: from the sentence representation s obtained in step d5, compute the probability p that the sentence is humorous, and thereby judge whether the given sentence is humorous or non-humorous text:

p = softmax(W_h s + b_h)
s5, humorous recognition result evaluation step: and evaluating the humorous recognition result according to a preset evaluation index.
The preset evaluation indexes are accuracy, precision, recall and the F1 value, computed as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
wherein TP represents the number of samples for which the classifier determines positive examples as positive examples, TN represents the number of samples for which the classifier determines negative examples as negative examples, FP represents the number of samples for which the classifier determines negative examples as positive examples, and FN represents the number of samples for which the classifier determines positive examples as negative examples. The comparative experiments were as follows:
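The four evaluation metrics of step S5 follow directly from the TP/TN/FP/FN counts defined above:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 value computed from the
    confusion-matrix counts defined in step S5."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```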
SVM: and taking the n-gram, the voice characteristics and the inconsistency characteristics as input, and adopting a Support Vector Machine (SVM) as a classifier model to perform humorous recognition.
CNN: and performing humorous recognition on the text by adopting a Convolutional Neural Network (CNN).
LSTM: and taking word vectors of the text as input, and carrying out humorous recognition by the long-term and short-term memory neural network.
ATF-LSTM: the method of the invention.
As shown in fig. 3, the traditional SVM classifier requires a large number of manually constructed features as input, and its experimental performance is low when those features cannot adequately capture the characteristics of humor data; because humor features are often hidden beneath surface semantics and hard to characterize, traditional machine learning methods recognize humor poorly. Deep learning methods such as the convolutional neural network (CNN) and the recurrent LSTM need no complex hand-crafted humor features; they automatically extract deep semantic features, explore the latent semantics of humor, and improve recognition performance over traditional machine learning. The method of the invention combines the features automatically extracted by the neural network with constructed phonetic and inconsistency features that word vectors cannot fully capture, further improving humor recognition performance.
The foregoing is a more detailed description of the present invention in connection with specific preferred embodiments, and the practice of the invention is not limited to these embodiments. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be construed as falling within the protection scope of the present invention.

Claims (7)

1. A humor identification method based on a neural network and humor features, characterized by comprising the following steps:
s1, humor corpus collection and preprocessing:
a1, humor corpus collection: obtain humorous texts and their rating information from a website; number each text with an ID as its unique identifier; collect the humorous content of the website as the humorous-text candidate set; obtain the humor rating information of each text from the website as the measure of its degree of humor; collect texts in other forms as the non-humorous-text candidate set, each item of the humor corpus being a single sentence;
a2, preprocessing: clean the data of the humorous-text and non-humorous-text candidate sets, deleting special and unrecognizable characters from the texts; label the humorous texts; select non-humorous texts from the non-humorous candidate set on the principle that sentence lengths are similar and the vocabularies used by positive and negative examples are consistent; perform word segmentation on the humorous and non-humorous texts;
s2, humor feature extraction:
b1, humor phonetic feature extraction: using a pronunciation dictionary, extract the humor phonetic feature vector P of each sentence from the word set of the sentence obtained in step S1;
b2, humor incongruity feature extraction: using semantic resources and a word vector tool, extract the humor incongruity feature vector Q of each sentence from the word set of the sentence obtained in step S1;
s3, neural-network-based word vector representation of the text:
c1, word vector acquisition: obtain corpora, including a Wikipedia corpus and a joke corpus, as the corpus for training word vectors, and train the word vectors with a word vector tool, thereby obtaining a low-dimensional dense vector for each word in the humorous and non-humorous texts;
c2, word vector representation of the text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as an n × m × d word-embedding matrix, where n is the number of samples, m is the number of words contained in each sample, and d is the word-vector dimension;
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use semantic resources to extract for the i-th word its set of synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where n is the number of synonym sets and each synset_j is a synonym sense unit; from each synset_j of the synonym resource, obtain the set of similar-meaning words synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i, count how many words of synWords_i occur in the sentence, and take the word with the most such synonyms in the sentence as the humor fuzzy feature word;
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the humor fuzzy feature; if the sentence contains only one feature word, that word is represented as a vector T using the word vectors obtained in c1; if the sentence contains several feature words, their average word vector is used as the fuzzy feature word vector;
s4, neural network model construction:
d1, model input: concatenate the humor fuzzy feature word vector T obtained in step c4 with the word vector w_t of each word in the sentence to form the input word vectors of the model;
d2, humor identification model construction: extract latent semantic features from the input obtained in d1 with a recurrent neural network to obtain the hidden-layer vector representation of the text;
d3, attention-based humor recognition: an attention mechanism is applied to increase the weights of the fuzzy feature words and the words around them, improving humor recognition performance;
d4, average word vector representation of the sentence: compute the average word vector representation of the sentence from the hidden-layer representation of the humorous sentence obtained in step d3;
d5, humor feature fusion: concatenate the humor phonetic features extracted in step b1 and the humor incongruity features extracted in step b2 with the average word vector representation of the sentence obtained in step d4 to obtain the vector representation of the sentence;
d6, humor identification: compute, from the sentence representation s obtained in step d5, the probability that the sentence is humorous, thereby finally judging the given sentence to be a humorous or non-humorous text;
s5, humor recognition result evaluation: evaluate the humor recognition result according to the evaluation indexes.
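The data flow of steps d1–d6 above can be sketched in numpy; the random arrays stand in for the trained LSTM hidden states and attention parameters, so this illustrates only the shapes and operations of the pipeline, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 6, 8                    # words per sentence, word-vector dimension
H = rng.normal(size=(m, d))    # stand-in for the LSTM hidden states (d1-d2)

# d3: attention weights over the hidden states (softmax of a score per word;
# the score vector here is random, standing in for trained parameters)
scores = H @ rng.normal(size=d)
weights = np.exp(scores) / np.exp(scores).sum()

# d4: attention-weighted average representation of the sentence
sent = weights @ H                       # shape (d,)

# d5: fuse the hand-crafted features -- 4-dim phonetic P, 3-dim incongruity Q
P = np.zeros(4)
Q = np.zeros(3)
s = np.concatenate([sent, P, Q])         # sentence vector, d + 7 dims

# d6: humor probability from a (random, untrained) logistic output layer
prob = 1.0 / (1.0 + np.exp(-(s @ rng.normal(size=s.size))))
```

In the actual model the attention scores, LSTM states, and output layer are learned jointly during training; only the tensor shapes are meaningful here.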
2. The humor identification method based on a neural network and humor features according to claim 1, wherein in step a2 the humorous texts are labeled as follows: according to the humor rating information, texts with higher ratings are automatically labeled as humorous texts, i.e., positive examples, and the automatic labels are then reviewed manually.
3. The humor identification method based on a neural network and humor features according to claim 1, wherein the word segmentation in step a2 uses the NLTK module of the Python language.
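A minimal word-segmentation example with NLTK, as named in claim 3; the patent does not specify which NLTK tokenizer is used, so the choice of the self-contained RegexpTokenizer (which needs no downloaded models) is our assumption.

```python
# Word segmentation with NLTK's RegexpTokenizer (no model downloads needed);
# claim 3 names the NLTK module generally, not a specific tokenizer.
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(
    "Why don't scientists trust atoms? They make up everything."
)
```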
4. The humor identification method based on a neural network and humor features according to claim 1, wherein in step b1 the pronunciation dictionary is the CMU Pronouncing Dictionary, and the humor phonetic feature extraction represents each English word as a sequence of phonemes; the number of words in the sentence sharing initial phonemes (alliteration), the maximum length of an alliteration chain, the number of words sharing final phonemes (rhyme), and the maximum length of a rhyme chain are extracted as the humor phonetic features, giving a 4-dimensional feature vector.
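A sketch of the 4-dimensional phonetic feature vector of claim 4; a tiny hand-written pronunciation table stands in for the CMU Pronouncing Dictionary, and counting alliterating/rhyming word *pairs* rather than words is a simplification of the claimed features.

```python
# Toy pronunciation table standing in for the CMU Pronouncing Dictionary.
PHONES = {
    "peter": ["P", "IY", "T", "ER"],
    "piper": ["P", "AY", "P", "ER"],
    "picked": ["P", "IH", "K", "T"],
    "peppers": ["P", "EH", "P", "ER", "Z"],
}

def common_prefix_len(a, b):
    """Length of the shared leading run of two phoneme sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def phonetic_features(words):
    phones = [PHONES[w] for w in words if w in PHONES]
    allit = rhyme = allit_len = rhyme_len = 0
    for i in range(len(phones)):
        for j in range(i + 1, len(phones)):
            p = common_prefix_len(phones[i], phones[j])              # shared onset
            r = common_prefix_len(phones[i][::-1], phones[j][::-1])  # shared ending
            if p:
                allit += 1
                allit_len = max(allit_len, p)
            if r:
                rhyme += 1
                rhyme_len = max(rhyme_len, r)
    # [alliterating pairs, longest alliteration, rhyming pairs, longest rhyme]
    return [allit, allit_len, rhyme, rhyme_len]

P = phonetic_features(["peter", "piper", "picked", "peppers"])
```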
5. The humor identification method based on a neural network and humor features according to claim 1, wherein the humor incongruity feature extraction in step b2 uses semantic resources to judge whether the word set of the sentence obtained in step S1 contains an antonym pair; the words are represented as low-dimensional dense vectors with a word vector tool, and the maximum and minimum semantic distances between word pairs in the sentence are extracted; the antonym-pair indicator, the maximum semantic distance, and the minimum semantic distance serve as the sentence incongruity features, giving a 3-dimensional feature vector Q.
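A sketch of the 3-dimensional incongruity feature vector Q of claim 5; the toy antonym table and the random word vectors stand in for WordNet and a trained embedding model.

```python
import numpy as np

# Toy stand-ins for WordNet antonym pairs and trained word embeddings.
ANTONYMS = {("hot", "cold"), ("big", "small")}
rng = np.random.default_rng(1)
VEC = {w: rng.normal(size=5) for w in ["the", "soup", "was", "hot", "cold"]}

def cosine_dist(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def incongruity_features(words):
    # Antonym-pair indicator over all word pairs in the sentence.
    has_antonym = any(
        (a, b) in ANTONYMS or (b, a) in ANTONYMS
        for i, a in enumerate(words) for b in words[i + 1:]
    )
    # Maximum and minimum pairwise semantic distances.
    dists = [cosine_dist(VEC[a], VEC[b])
             for i, a in enumerate(words) for b in words[i + 1:]]
    # [antonym-pair indicator, max distance, min distance]
    return [float(has_antonym), max(dists), min(dists)]

Q = incongruity_features(["the", "soup", "was", "hot", "cold"])
```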
6. The humor identification method based on a neural network and humor features according to claim 1, wherein in step b2 and step c3 the semantic resource is WordNet.
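Steps c3–c4 select the "fuzzy" (ambiguity) feature word by synonym overlap with the sentence; a minimal sketch with a hand-made synonym table standing in for WordNet synsets (the tie-breaking rule of keeping the first word is our assumption).

```python
# Toy synonym-set table standing in for WordNet synsets.
SYNSETS = {
    "river": [{"river", "stream"}],
    "bank": [{"bank", "depository"}, {"bank", "slope"}],
    "stream": [{"stream", "river"}, {"stream", "flow"}],
}

def fuzzy_feature_word(words):
    """Return the word whose synonyms occur most often in the sentence."""
    best, best_overlap = None, -1
    sentence = set(words)
    for w in words:
        # Union of all synonym sets of w, with duplicates and w itself removed.
        syns = set().union(*SYNSETS.get(w, [set()])) - {w}
        overlap = len(syns & sentence)   # synonyms also present in the sentence
        if overlap > best_overlap:       # ties keep the first word encountered
            best, best_overlap = w, overlap
    return best

word = fuzzy_feature_word(["river", "bank", "stream"])
```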
7. The humor identification method based on a neural network and humor features according to claim 1, wherein in step S5 the evaluation indexes are precision, accuracy, recall, and F1 value; the precision is calculated as follows:
Precision = TP / (TP + FP)
the accuracy calculation formula is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
the recall ratio formula is as follows:
Recall = TP / (TP + FN)
the F1 value is formulated as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
where TP denotes the number of samples in which the classifier labels a positive example as positive, TN the number in which it labels a negative example as negative, FP the number in which it labels a negative example as positive, and FN the number in which it labels a positive example as negative.
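The four evaluation indexes of claim 7 can be computed directly from the TP/TN/FP/FN counts defined above; the counts used in the example call are illustrative.

```python
# Precision, accuracy, recall, and F1 from the confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, accuracy, recall, f1

p, a, r, f1 = metrics(tp=40, tn=45, fp=10, fn=5)
```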
CN201810496016.4A 2018-05-22 2018-05-22 Humor identification method based on neural network and humor characteristics Active CN108874896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810496016.4A CN108874896B (en) 2018-05-22 2018-05-22 Humor identification method based on neural network and humor characteristics


Publications (2)

Publication Number Publication Date
CN108874896A CN108874896A (en) 2018-11-23
CN108874896B true CN108874896B (en) 2020-11-06

Family

ID=64333999


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101037A (en) * 2019-05-28 2020-12-18 云义科技股份有限公司 Semantic similarity calculation method
CN110188202B (en) * 2019-06-06 2021-07-20 北京百度网讯科技有限公司 Training method and device of semantic relation recognition model and terminal
CN110457471A (en) * 2019-07-15 2019-11-15 平安科技(深圳)有限公司 File classification method and device based on A-BiLSTM neural network
CN111401003B (en) * 2020-03-11 2022-05-03 四川大学 Method for generating humor text with enhanced external knowledge
CN112214602B (en) * 2020-10-23 2023-11-10 中国平安人寿保险股份有限公司 Humor-based text classification method and device, electronic equipment and storage medium
CN112818118B (en) * 2021-01-22 2024-05-21 大连民族大学 Reverse translation-based Chinese humor classification model construction method
CN113688622A (en) * 2021-09-05 2021-11-23 安徽清博大数据科技有限公司 Method for identifying situation comedy conversation humor based on NER

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090055426A (en) * 2007-11-28 2009-06-02 중앙대학교 산학협력단 Emotion recognition mothod and system based on feature fusion
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107564542A (en) * 2017-09-04 2018-01-09 大国创新智能科技(东莞)有限公司 Affective interaction method and robot system based on humour identification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Humor computing and its application research; Lin Hongfei et al.; Journal of Shandong University (Natural Science); 2016-07-31; pp. 1-10 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant