CN108874896B - Humor identification method based on neural network and humor characteristics - Google Patents

Humor identification method based on neural network and humor characteristics


Publication number
CN108874896B
CN108874896B (application CN201810496016.4A)
Authority
CN
China
Prior art keywords
humorous
humor
word
sentence
text
Prior art date
Legal status
Active
Application number
CN201810496016.4A
Other languages
Chinese (zh)
Other versions
CN108874896A (en
Inventor
林鸿飞
樊小超
杨亮
刁宇峰
申晨
楚永贺
任璐
张桐瑄
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201810496016.4A priority Critical patent/CN108874896B/en
Publication of CN108874896A publication Critical patent/CN108874896A/en
Application granted granted Critical
Publication of CN108874896B publication Critical patent/CN108874896B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology


Abstract

A humor identification method based on a neural network and humor features belongs to the field of data mining and natural language processing and addresses the problem of humor recognition. Its key steps are: S1, humor corpus collection and preprocessing; S2, humor feature extraction; S3, word vector representation of the text; S4, neural network model construction; S5, evaluation of humor recognition results. Its effects are as follows: humor data in a specific form are collected and preprocessed, and, drawing on established humor theory, the phonetic characteristics of humorous text are fully considered and humorous phonetic features are constructed; exploiting the fuzziness characteristic of humor, the words with the most synonyms in a sentence are extracted as feature words and vectorized; a deep learning method extracts the deep semantic features behind humorous text and fuses the phonetic and ambiguity features of humor into the neural network to perform humor recognition; experiments on a data set verify the effectiveness of the method.

Description

Humor identification method based on neural network and humor characteristics
Technical Field
The invention relates to the field of data mining and natural language processing, in particular to a humor identification method based on a neural network and humor characteristics.
Background
With the rapid development of artificial intelligence, humor recognition has become an interesting research problem in natural language processing. Humor is a special form of language expression that can liven up the atmosphere and relieve embarrassment; Wikipedia defines it as the quality or ability to make people laugh. Interaction between people would undoubtedly be incomplete without humor. In the field of human-computer interaction, question-answering and dialogue systems are already used in many household products, and interaction between people and computers is becoming increasingly common. If a computer could understand and use humor, it would be more human-like and its communication with people would be smoother, which would be a significant achievement for humanity in the era of artificial intelligence. To enable a computer to understand and use humor, it must first be given the ability to recognize humor.
The humor recognition task is to let a computer automatically determine whether a given paragraph or sentence is humorous. It remains a challenging task in the field of natural language processing. First, humor takes many forms, and it is difficult to define and categorize it precisely; second, some humor requires longer context to be understood; furthermore, understanding much humor requires uncovering a large amount of common-sense knowledge behind the textual content and interpreting the text repeatedly. In other words, humor is a latent semantic expression, a high-level abstract form of human language.
Enabling a computer to recognize all forms of humor is beyond its current computing capability, so the invention limits the scope of humor recognition research to the sentence level. A sentence containing only a few words can have a humorous effect, and such sentences usually contain special grammatical structures or semantic forms, which provide traceable clues that allow a computer to automatically discover and learn the features behind humor.
Theoretical study of humor dates back to the last century; among the most influential humor theories is the Semantic Script Theory of Humor (SSTH). Building on humor theory, many researchers have taken up humor computing: Taylor et al., for example, collected and annotated humorous texts from Twitter, constructed a series of humor features from the semantic and structural characteristics of humor, and used traditional machine learning methods to recognize humor.
At present there is little research on humor recognition in text. Most existing work builds on humor theory, manually constructs humor features, and applies traditional text representations and classification algorithms, with poor recognition results. Deep learning has been applied to humor recognition only in a simple form, without combining humor features in the recognition of humorous text.
Disclosure of Invention
The invention aims to provide a method for automatically performing humor recognition on text using only a small number of humor features, which avoids the drawback of traditional feature-engineering methods that require a large number of manually constructed humor features, and provides users with an effective humor recognition method.
The invention solves the technical problems in the prior art by adopting the following technical scheme: a humor identification method based on a neural network and humor characteristics comprises the following steps:
s1, humor corpus collection and preprocessing:
a1, humor corpus collection: obtain humorous texts and their evaluation information from a website; number each text with an ID as its unique identifier for convenient storage and later use; collect the website's humorous content as the humorous text candidate set; obtain the evaluation information of the texts' humor from the website as the standard for measuring the degree of humor; collect text such as news as the non-humorous text candidate set. Each item of the humorous corpus is a single sentence.
a2, preprocessing: clean the humorous and non-humorous text candidate sets, deleting special and unrecognizable characters; automatically label texts with higher evaluation scores as humorous texts, i.e. positive examples, according to the humor evaluation information, and manually review the automatic labels; select non-humorous texts, i.e. negative examples, from the non-humorous candidate set according to two principles: sentence lengths are similar, and the dictionaries used by positive and negative examples are consistent (the dictionary is built from the distinct words of the positive examples; a non-humorous text containing out-of-dictionary words is not selected); tokenize the humorous and non-humorous texts.
S2, humor feature extraction:
b1, humorous phonetic feature extraction: using a pronunciation dictionary, represent each English word in the sentence's word set obtained in step S1 as a sequence of phonemes; extract the number of words in the sentence whose initial phonemes match, the maximum length of the matching initial phoneme sequence, the number of words whose final phonemes match, and the maximum length of the matching final phoneme sequence, i.e. the number of alliterative words, the maximum alliteration chain length, the number of rhyming words, and the maximum rhyme chain length, as the humorous phonetic features, yielding a 4-dimensional feature vector P.
b2, humorous inconsistency feature extraction: for the sentence's word set obtained in step S1, judge whether the sentence contains an antonym pair; represent the words as low-dimensional dense vectors using the word vectors obtained in step c1, compute the semantic distance between every pair of words in the sentence, and take the antonym indicator together with the maximum and minimum semantic distances as the humorous inconsistency features, yielding a 3-dimensional feature vector Q. Semantic distance is measured by cosine similarity:

similarity(A, B) = (A · B) / (||A||_2 ||B||_2)

where similarity(A, B) is the cosine semantic distance of two word vectors, A and B are the vectors of any two words in the sentence, and ||A||_2 and ||B||_2 are their 2-norms.
S3, a word vector representation step of the text based on the neural network:
c1, word vector acquisition: obtain Wikipedia corpora and joke corpora as the corpus for training word vectors, and train word vectors with a word vector tool, obtaining a low-dimensional dense vector for each word in the humorous and non-humorous texts.
c2, word vector representation of text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as n × m × d word-embedding matrices, where n is the number of samples, m the number of words per sample, and d the word vector dimension.
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use the synonym sense resource to extract each word's synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where i indexes the i-th word in the sentence, n is the number of synonym sets, and synset_j is a synonym sense unit; from each synset_j, use the synonym resource to obtain the set of synonyms synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i; the word in the sentence with the largest synWords_i, i.e. the word with the most synonyms, is taken as the humorous fuzzy feature word.
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the fuzziness of humor. If the sentence contains only one feature word, represent it as a vector T using the word vectors obtained in c1; if it contains several feature words, use their average word vector as the fuzzy feature word vector:

T = (1/N) Σ_{n=1}^{N} T_n

where T is the average word vector of the feature words, N is the number of feature words in the sentence, and T_n is the word vector of the n-th feature word.
S4, neural network model construction:
d1, model input: take the humorous fuzzy feature word vector T ∈ R^d obtained in step c4 and concatenate it with the word vector w_t ∈ R^d of each word in the sentence to obtain the model input X_t = [w_t ; T], X_t ∈ R^{2d}, where R denotes the vector space and d the vector dimension; the input sequence can then be represented as X = {X_1, X_2, …, X_T}.
d2, humor recognition model construction: use a recurrent neural network (RNN) to extract the latent semantic features of the input X_t obtained in d1 and obtain the hidden-layer vector representation of the text. The invention adopts a bidirectional long short-term memory network (Bi-LSTM), in which each cell unit is computed as follows:

X' = [X_t ; h_{t-1}]
f_t = σ(W_f · X' + b_f)
i_t = σ(W_i · X' + b_i)
o_t = σ(W_o · X' + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · X' + b_c)
h_t = o_t ⊙ tanh(c_t)

where X' is the concatenation of the LSTM input vector X_t and the hidden-layer output vector h_{t-1} at time t-1; f_t, i_t and o_t are the forget, input and output gates of the LSTM; c_t is the LSTM cell state; W_f, W_i and W_o are the parameter matrices of the forget, input and output gates; b_f, b_i and b_o are their biases; these parameters are learned by the LSTM model; σ is the sigmoid function; tanh is the hyperbolic tangent; W_c and b_c are the cell-state parameter matrix and bias; ⊙ denotes element-wise multiplication; and h_t is the hidden-layer output.
d3, attention mechanism: the attention mechanism lets the model weight the fuzzy feature words and the words around them during humor recognition, improving recognition performance. From the word-embedding representation X_t of the sentence (which incorporates the fuzzy feature word vector T) obtained in step d1 and the hidden-layer representation h_t obtained in step d2, compute the weight α_t of each word in the sentence and the hidden representation r_t:

α'_t = W_α X_t + b_α
α_t = exp(α'_t) / Σ_{k=1}^{T} exp(α'_k)
r_t = h_t α_t

where W_α is the attention weight matrix, b_α is the attention bias, and T is the number of words in the sentence.
d4, average word vector representation of the sentence: from the hidden representations r_t of the humorous sentence obtained in step d3, compute the sentence's average word vector representation s':

s' = (1/T) Σ_{t=1}^{T} r_t
d5, humor feature fusion: concatenate the humorous phonetic feature P extracted in step b1 and the humorous inconsistency feature Q extracted in step b2 with the average word vector representation s' of the sentence obtained in step d4 to obtain the sentence's vector representation s, whose dimension is the sum of the dimensions of the three feature vectors:

s = [s' ; P ; Q]
d6, humor recognition: from the sentence representation s obtained in step d5, compute the probability p that the sentence is humorous, and thereby judge whether the given sentence is humorous or non-humorous text:

p = softmax(W_h s + b_h)
s5, humorous recognition result evaluation step: and evaluating the humorous recognition result according to a preset evaluation index.
The preset evaluation indexes are accuracy, precision, recall and the F1 value, computed as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
wherein TP represents the number of samples for which the classifier determines positive examples as positive examples, TN represents the number of samples for which the classifier determines negative examples as negative examples, FP represents the number of samples for which the classifier determines negative examples as positive examples, and FN represents the number of samples for which the classifier determines positive examples as negative examples.
Drawings
FIG. 1 is a logic schematic of the present invention
FIG. 2 is a schematic diagram of the step S4 model according to the embodiment of the present invention
FIG. 3 shows humorous recognition results according to an embodiment of the present invention
Detailed Description
The invention is described below with reference to the following figures and specific embodiments:
a humor identification method based on a neural network and humor characteristics comprises the following steps:
s1, humor corpus collection and preprocessing:
a1, collecting humorous material:
by using a web crawler technology, English humor corpus is crawled from www.punoftheday.com, and text ID, text content and text evaluation information of the humor are obtained. The humorous text on the website is in a single sentence form, the length of a sentence is usually less than 30 words, and the voting information of each sentence represents the recognition degree of the net friend whether the sentence is humorous or not. And crawling a humorous text ID from the website as a unique identification of the text, crawling text contents as a humorous text candidate set, and crawling net friend voting information as a measurement standard for measuring whether the text is humorous or not.
News corpora are statements of fact and usually not humorous; news data are crawled from news websites such as Yahoo News and The New York Times as the non-humorous text candidate set, i.e. negative examples.
a2, preprocessing: clean the humorous and non-humorous text candidate sets, deleting special and unrecognizable characters. According to the humor evaluation information, texts rated 3 stars or above are taken as humorous texts, i.e. positive examples; after manual review of the automatic labels, there are 2423 positive samples in total. Negative samples are selected from the non-humorous candidate set following two rules: their lengths match those of the positive samples, and the words they use all appear in the positive examples, i.e. the negative examples use the same dictionary as the positive examples. After manual review the number of negative samples equals the number of positive samples.
The humorous and non-humorous texts are tokenized using the tokenizer in Python's NLTK library.
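The two negative-example selection rules of step a2 can be sketched as a simple filter. This is a minimal sketch: the function name, the exact length-matching rule, and the tolerance parameter are illustrative assumptions, not from the patent.

```python
def select_negative_examples(candidates, positive_vocab, target_len, tol=5):
    """Select negative (non-humorous) samples per step a2: the sentence
    length must be close to that of the positive examples, and every
    word must already appear in the positive-example dictionary.
    `tol` (length tolerance) is an assumed parameter for illustration."""
    selected = []
    for sent in candidates:
        words = sent.lower().split()
        # Rule 1: sentence length similar to the positive examples.
        if abs(len(words) - target_len) > tol:
            continue
        # Rule 2: no out-of-dictionary words relative to the positives.
        if all(w in positive_vocab for w in words):
            selected.append(sent)
    return selected
```

A candidate such as a news sentence containing vocabulary absent from the joke corpus would be rejected by the second rule, keeping the positive/negative dictionaries consistent.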
S2, humor feature extraction:
b1, humorous phonetic feature extraction: for the sentence word sets obtained in step S1, the CMU pronunciation dictionary provides 69 stress-marked phonemes; ignoring the stress markers leaves 39 phonemes, with which each English word is expressed to obtain its phonetic form. The number of words in the sentence whose initial phonemes match, the maximum length of the matching initial phoneme sequence, the number of words whose final phonemes match, and the maximum length of the matching final phoneme sequence, i.e. the number of alliterative words, the maximum alliteration chain length, the number of rhyming words, and the maximum rhyme chain length, are extracted as the humorous phonetic features, yielding a 4-dimensional feature vector.
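The four phonetic features can be sketched as below. A toy pronunciation dictionary stands in for the CMU dictionary (stress digits already stripped, as the step describes), and the exact pairing rules for alliteration and rhyme are assumptions, since the patent does not spell them out.

```python
# Toy ARPAbet pronunciations standing in for the CMU dictionary.
PRON = {
    "peter":  ["P", "IY", "T", "ER"],
    "piper":  ["P", "AY", "P", "ER"],
    "picked": ["P", "IH", "K", "T"],
    "a":      ["AH"],
    "peck":   ["P", "EH", "K"],
}

def shared_prefix(p, q):
    """Length of the common leading phoneme sequence of two words."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n

def phonetic_features(words, pron=PRON):
    """4-dimensional vector P of step b1: (alliterative word count,
    max alliteration chain length, rhyming word count, max rhyme
    chain length), computed over all word pairs in the sentence."""
    prons = [pron[w] for w in words if w in pron]
    allit_words, rhyme_words = set(), set()
    max_allit = max_rhyme = 0
    for i in range(len(prons)):
        for j in range(i + 1, len(prons)):
            pre = shared_prefix(prons[i], prons[j])
            # Rhyme is a shared *trailing* sequence: compare reversed.
            suf = shared_prefix(prons[i][::-1], prons[j][::-1])
            if pre > 0:
                allit_words |= {i, j}
                max_allit = max(max_allit, pre)
            if suf > 0:
                rhyme_words |= {i, j}
                max_rhyme = max(max_rhyme, suf)
    return (len(allit_words), max_allit, len(rhyme_words), max_rhyme)
```

In practice the pronunciations would come from `nltk.corpus.cmudict` with the stress digits removed from each phoneme.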
b2, humorous inconsistency feature extraction: for the sentence word sets obtained in step S1, use the synonym sense resource to check whether the sentence contains an antonym pair; the feature is 1 if it does and 0 otherwise. Represent the words as 300-dimensional dense vectors using the Word2Vec tool, compute the semantic distance of the word pairs in the sentence, and take the maximum and minimum semantic distances as features. The sentence's inconsistency features thus form a 3-dimensional feature vector Q. The semantic distance is computed as:

similarity(A, B) = (A · B) / (||A||_2 ||B||_2)

where A and B are the vectors of two words in the sentence, ||A||_2 and ||B||_2 are their 2-norms, (A, B) ranges over the word pairs obtained by traversing the sentence's word set, and similarity(A, B) is the cosine semantic distance of the two word vectors.
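The 3-dimensional vector Q can be sketched as follows. The antonym check is passed in as a precomputed flag, since it depends on the synonym resource; function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine semantic distance of step b2: (A . B) / (||A||_2 ||B||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def inconsistency_features(vectors, has_antonym_pair):
    """3-dimensional feature vector Q: (antonym-pair flag, maximum
    pairwise semantic distance, minimum pairwise semantic distance)
    over all word pairs in the sentence."""
    sims = [cosine_similarity(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))]
    return (1 if has_antonym_pair else 0, max(sims), min(sims))
```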
S3, a word vector representation step of the text based on the neural network:
c1, word vector acquisition: 13.6 GB of Wikipedia corpora and 200,000 joke corpora are obtained from the web as the corpus for training word vectors, and the word vectors are trained with the gensim module of the Python library, yielding a 300-dimensional low-dimensional dense vector for each word in the humorous and non-humorous texts.
c2, word vector representation of text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as n × m × d word-embedding matrices, where n is the number of samples, m the number of words per sample, and d the word vector dimension.
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use the WordNet dictionary to extract each word's synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where i indexes the i-th word in the sentence, n is the number of synonym sets, and synset_j is a synonym sense unit; from each synset_j, use the WordNet dictionary to obtain the synonym set synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i; the word in the sentence with the largest synWords_i, i.e. the word with the most synonyms, is taken as the humorous fuzzy feature word.
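The fuzzy-feature-word selection can be sketched with a toy synonym dictionary standing in for WordNet; a real implementation would collect deduplicated lemma names over `nltk.corpus.wordnet.synsets(word)`. The dictionary contents here are illustrative assumptions.

```python
# Toy synonym resource standing in for WordNet (step c3).
SYNONYMS = {
    "bank":  {"depository", "slope", "incline", "rely", "trust"},
    "river": {"stream", "watercourse"},
}

def fuzzy_feature_words(words, synonyms=SYNONYMS):
    """Return the word(s) of the sentence with the most deduplicated
    synonyms, i.e. the humorous fuzzy feature words of step c3."""
    counts = {w: len(synonyms.get(w, ())) for w in words}
    best = max(counts.values(), default=0)
    if best == 0:
        return []          # no word has any recorded synonyms
    return [w for w, c in counts.items() if c == best]
```

A sentence may yield several feature words when counts tie, which is why step c4 averages their vectors.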
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the fuzziness of humor. If the sentence contains only one feature word, represent it as a vector T using the word vectors obtained in c1; if it contains several feature words, use their average word vector as the fuzzy feature word vector:

T = (1/N) Σ_{n=1}^{N} T_n

where T is the average word vector of the feature words, N is the number of feature words in the sentence, and T_n is the word vector of the n-th feature word.
S4, neural network model construction:
d1, model input: take the humorous fuzzy feature word vector T ∈ R^d obtained in step c4 and concatenate it with the word vector w_t ∈ R^d of each word in the sentence as the model's input word vector X_t = [w_t ; T], X_t ∈ R^{2d}, where R denotes the vector space and d the vector dimension; the input sequence can then be represented as X = {X_1, X_2, …, X_T}.
d2, humor recognition model construction: a recurrent neural network (RNN) extracts features from the input obtained in d1, capturing the deep semantic features behind the humorous text and producing the hidden-layer vector representation h_t. The invention adopts a bidirectional long short-term memory network (Bi-LSTM) as the classification model, in which each cell unit is computed as follows:

X' = [X_t ; h_{t-1}]
f_t = σ(W_f · X' + b_f)
i_t = σ(W_i · X' + b_i)
o_t = σ(W_o · X' + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · X' + b_c)
h_t = o_t ⊙ tanh(c_t)

where X' is the concatenation of the LSTM input vector X_t and the hidden-layer output vector h_{t-1} at time t-1; f_t, i_t and o_t are the forget, input and output gates of the LSTM; c_t is the LSTM cell state; W_f, W_i and W_o are the parameter matrices of the forget, input and output gates; b_f, b_i and b_o are their biases; these parameters are learned by the LSTM model; σ is the sigmoid function; tanh is the hyperbolic tangent; W_c and b_c are the cell-state parameter matrix and bias; ⊙ denotes element-wise multiplication; and h_t is the hidden-layer output.
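The cell equations above can be checked with a small NumPy implementation of a single time step (one direction of the Bi-LSTM; the backward pass runs the same cell over the reversed sequence). Parameter packaging and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward LSTM cell update following the equations of step d2.
    `p` holds the parameter matrices W_f, W_i, W_o, W_c and the biases
    b_f, b_i, b_o, b_c named in the patent."""
    x = np.concatenate([x_t, h_prev])                        # X' = [X_t ; h_{t-1}]
    f = sigmoid(p["W_f"] @ x + p["b_f"])                     # forget gate f_t
    i = sigmoid(p["W_i"] @ x + p["b_i"])                     # input gate  i_t
    o = sigmoid(p["W_o"] @ x + p["b_o"])                     # output gate o_t
    c = f * c_prev + i * np.tanh(p["W_c"] @ x + p["b_c"])    # cell state  c_t
    h = o * np.tanh(c)                                       # hidden out  h_t
    return h, c
```

With all parameters zero, every gate is sigmoid(0) = 0.5, so the cell state halves at each step, which is an easy sanity check on the wiring.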
d3, attention mechanism: the attention mechanism increases the weight the model gives to the fuzzy feature words and the words around them during humor recognition, improving recognition performance. From the word-embedding representation X_t of the sentence (which incorporates the fuzzy feature word vector T) obtained in step d1 and the hidden-layer representation h_t obtained in step d2, compute the attention weight α_t of each word in the sentence and the hidden representation r_t:

α'_t = W_α X_t + b_α
α_t = exp(α'_t) / Σ_{k=1}^{T} exp(α'_k)
r_t = h_t α_t

where W_α is the attention weight matrix, b_α is the attention bias, and T is the number of words in the sentence.
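A NumPy sketch of the attention computation above; treating α'_t as a scalar score per word is an assumption consistent with the per-word weight α_t in the formulas.

```python
import numpy as np

def attention(X, H, W_a, b_a):
    """Attention of step d3: alpha'_t = W_a X_t + b_a, a softmax over
    the T words, then r_t = h_t * alpha_t.
    X: (T, d) input word vectors; H: (T, k) Bi-LSTM hidden states;
    W_a: (d,) score weights; b_a: scalar bias (illustrative shapes)."""
    scores = X @ W_a + b_a                  # alpha'_t, shape (T,)
    scores = scores - scores.max()          # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    R = H * alpha[:, None]                  # r_t = h_t * alpha_t
    return alpha, R
```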
d4, average word vector representation of the sentence: from the hidden representations r_t of the humorous sentence obtained in step d3, compute the sentence's average word vector representation s':

s' = (1/T) Σ_{t=1}^{T} r_t
d5, humor feature fusion: concatenate the humorous phonetic feature P extracted in step b1 and the humorous inconsistency feature Q extracted in step b2 with the average word vector representation s' of the sentence obtained in step d4 to obtain the sentence's vector representation s, whose dimension is the sum of the dimensions of the three feature vectors:

s = [s' ; P ; Q]
d6, humor recognition: from the sentence representation s obtained in step d5, compute the probability p that the sentence is humorous, and thereby judge whether the given sentence is humorous or non-humorous text:

p = softmax(W_h s + b_h)
s5, humorous recognition result evaluation step: and evaluating the humorous recognition result according to a preset evaluation index.
The preset evaluation indexes are accuracy, precision, recall and the F1 value, computed as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
wherein TP represents the number of samples for which the classifier determines positive examples as positive examples, TN represents the number of samples for which the classifier determines negative examples as negative examples, FP represents the number of samples for which the classifier determines negative examples as positive examples, and FN represents the number of samples for which the classifier determines positive examples as negative examples. The comparative experiments were as follows:
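The four evaluation metrics of step S5 follow directly from the TP/TN/FP/FN counts defined above:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 value computed from the
    confusion-matrix counts defined in step S5."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```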
SVM: and taking the n-gram, the voice characteristics and the inconsistency characteristics as input, and adopting a Support Vector Machine (SVM) as a classifier model to perform humorous recognition.
CNN: and performing humorous recognition on the text by adopting a Convolutional Neural Network (CNN).
LSTM: and taking word vectors of the text as input, and carrying out humorous recognition by the long-term and short-term memory neural network.
ATF-LSTM: the method of the invention.
As shown in fig. 3, the traditional SVM classifier requires a large number of manually constructed features as input, and its experimental performance is low when those features cannot adequately capture the characteristics of humor data; because humor features are often hidden beneath surface semantics and hard to characterize, traditional machine learning methods recognize humor poorly. Deep learning methods such as the convolutional neural network (CNN) and the recurrent LSTM need no complex hand-crafted humor features; they automatically extract deep semantic features, explore the latent semantics of humor, and improve recognition performance over traditional machine learning. The method of the invention combines the features automatically extracted by the neural network with constructed phonetic and inconsistency features that word vectors cannot fully capture, further improving humor recognition performance.
The foregoing is a more detailed description of the present invention in connection with specific preferred embodiments, and the practice of the invention is not limited to these embodiments. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be construed as falling within the protection scope of the present invention.

Claims (7)

1. A humor identification method based on a neural network and humor features, characterized by comprising the following steps:
s1, humor corpus collection and preprocessing:
a1, humor corpus collection: obtain humorous texts and their rating information from a website; number each text with an ID as its unique identifier; collect the humorous content of the website as the humorous-text candidate set; obtain the humor rating information of each text from the website as the measure of its degree of humor; collect texts in other forms as the non-humorous-text candidate set, each item of the humor corpus being a single sentence;
a2, preprocessing: clean the data of the humorous-text and non-humorous-text candidate sets, deleting special and unrecognizable characters from the texts; label the humorous texts; select non-humorous texts from the non-humorous candidate set on the principle that sentence lengths are similar and the vocabularies used by positive and negative examples are consistent; perform word segmentation on the humorous and non-humorous texts;
s2, humor feature extraction:
b1, humor phonetic feature extraction: using a pronunciation dictionary, extract the humor phonetic feature vector P of each sentence from the word set of the sentence obtained in step S1;
b2, humor incongruity feature extraction: using semantic resources and a word vector tool, extract the humor incongruity feature vector Q of each sentence from the word set of the sentence obtained in step S1;
s3, neural-network-based word vector representation of the text:
c1, word vector acquisition: obtain corpora, including a Wikipedia corpus and a joke corpus, as the corpus for training word vectors, and train the word vectors with a word vector tool, thereby obtaining a low-dimensional dense vector for each word in the humorous and non-humorous texts;
c2, word vector representation of the text: using the word vectors obtained in c1, represent the humorous and non-humorous sentences obtained in step S1 as an n × m × d word-embedding matrix, where n is the number of samples, m is the number of words contained in each sample, and d is the word-vector dimension;
c3, fuzzy feature word extraction: for the word set of each sentence obtained in step S1, use semantic resources to extract for the i-th word its set of synonym sets Synset_i = {synset_1, synset_2, …, synset_j, …, synset_n}, where n is the number of synonym sets and each synset_j is a synonym sense unit; from each synset_j of the synonym resource, obtain the set of similar-meaning words synWords_i = {W_11, W_12, …, W_1m, …, W_n1, …, W_nm}, where m is the number of synonyms in synset_j; remove duplicate words from synWords_i, count how many words of synWords_i occur in the sentence, and take the word with the most such synonyms in the sentence as the humor fuzzy feature word;
c4, fuzzy feature word vector representation: each sentence may yield one or more words related to the humor fuzzy feature; if the sentence contains only one feature word, that word is represented as a vector T using the word vectors obtained in c1; if the sentence contains several feature words, their average word vector is used as the fuzzy feature word vector;
s4, neural network model construction:
d1, model input: concatenate the humor fuzzy feature word vector T obtained in step c4 with the word vector w_t of each word in the sentence to form the input word vectors of the model;
d2, humor identification model construction: extract latent semantic features from the input obtained in d1 with a recurrent neural network to obtain the hidden-layer vector representation of the text;
d3, attention-based humor recognition: an attention mechanism is applied to increase the weights of the fuzzy feature words and the words around them, improving humor recognition performance;
d4, average word vector representation of the sentence: compute the average word vector representation of the sentence from the hidden-layer representation of the humorous sentence obtained in step d3;
d5, humor feature fusion: concatenate the humor phonetic features extracted in step b1 and the humor incongruity features extracted in step b2 with the average word vector representation of the sentence obtained in step d4 to obtain the vector representation of the sentence;
d6, humor identification: compute, from the sentence representation s obtained in step d5, the probability that the sentence is humorous, thereby finally judging the given sentence to be a humorous or non-humorous text;
s5, humor recognition result evaluation: evaluate the humor recognition result according to the evaluation indexes.
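The data flow of steps d1–d6 above can be sketched in numpy; the random arrays stand in for the trained LSTM hidden states and attention parameters, so this illustrates only the shapes and operations of the pipeline, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 6, 8                    # words per sentence, word-vector dimension
H = rng.normal(size=(m, d))    # stand-in for the LSTM hidden states (d1-d2)

# d3: attention weights over the hidden states (softmax of a score per word;
# the score vector here is random, standing in for trained parameters)
scores = H @ rng.normal(size=d)
weights = np.exp(scores) / np.exp(scores).sum()

# d4: attention-weighted average representation of the sentence
sent = weights @ H                       # shape (d,)

# d5: fuse the hand-crafted features -- 4-dim phonetic P, 3-dim incongruity Q
P = np.zeros(4)
Q = np.zeros(3)
s = np.concatenate([sent, P, Q])         # sentence vector, d + 7 dims

# d6: humor probability from a (random, untrained) logistic output layer
prob = 1.0 / (1.0 + np.exp(-(s @ rng.normal(size=s.size))))
```

In the actual model the attention scores, LSTM states, and output layer are learned jointly during training; only the tensor shapes are meaningful here.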
2. The humor identification method based on a neural network and humor features according to claim 1, wherein in step a2 the humorous texts are labeled as follows: according to the humor rating information, texts with higher ratings are automatically labeled as humorous texts, i.e., positive examples, and the automatic labels are then reviewed manually.
3. The humor identification method based on a neural network and humor features according to claim 1, wherein the word segmentation in step a2 uses the NLTK module of the Python language.
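A minimal word-segmentation example with NLTK, as named in claim 3; the patent does not specify which NLTK tokenizer is used, so the choice of the self-contained RegexpTokenizer (which needs no downloaded models) is our assumption.

```python
# Word segmentation with NLTK's RegexpTokenizer (no model downloads needed);
# claim 3 names the NLTK module generally, not a specific tokenizer.
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(
    "Why don't scientists trust atoms? They make up everything."
)
```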
4. The humor identification method based on a neural network and humor features according to claim 1, wherein in step b1 the pronunciation dictionary is the CMU Pronouncing Dictionary, and the humor phonetic feature extraction represents each English word as a sequence of phonemes; the number of words in the sentence sharing initial phonemes (alliteration), the maximum length of an alliteration chain, the number of words sharing final phonemes (rhyme), and the maximum length of a rhyme chain are extracted as the humor phonetic features, giving a 4-dimensional feature vector.
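A sketch of the 4-dimensional phonetic feature vector of claim 4; a tiny hand-written pronunciation table stands in for the CMU Pronouncing Dictionary, and counting alliterating/rhyming word *pairs* rather than words is a simplification of the claimed features.

```python
# Toy pronunciation table standing in for the CMU Pronouncing Dictionary.
PHONES = {
    "peter": ["P", "IY", "T", "ER"],
    "piper": ["P", "AY", "P", "ER"],
    "picked": ["P", "IH", "K", "T"],
    "peppers": ["P", "EH", "P", "ER", "Z"],
}

def common_prefix_len(a, b):
    """Length of the shared leading run of two phoneme sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def phonetic_features(words):
    phones = [PHONES[w] for w in words if w in PHONES]
    allit = rhyme = allit_len = rhyme_len = 0
    for i in range(len(phones)):
        for j in range(i + 1, len(phones)):
            p = common_prefix_len(phones[i], phones[j])              # shared onset
            r = common_prefix_len(phones[i][::-1], phones[j][::-1])  # shared ending
            if p:
                allit += 1
                allit_len = max(allit_len, p)
            if r:
                rhyme += 1
                rhyme_len = max(rhyme_len, r)
    # [alliterating pairs, longest alliteration, rhyming pairs, longest rhyme]
    return [allit, allit_len, rhyme, rhyme_len]

P = phonetic_features(["peter", "piper", "picked", "peppers"])
```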
5. The humor identification method based on a neural network and humor features according to claim 1, wherein the humor incongruity feature extraction in step b2 uses semantic resources to judge whether the word set of the sentence obtained in step S1 contains an antonym pair; the words are represented as low-dimensional dense vectors with a word vector tool, and the maximum and minimum semantic distances between word pairs in the sentence are extracted; the antonym-pair indicator, the maximum semantic distance, and the minimum semantic distance serve as the sentence incongruity features, giving a 3-dimensional feature vector Q.
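A sketch of the 3-dimensional incongruity feature vector Q of claim 5; the toy antonym table and the random word vectors stand in for WordNet and a trained embedding model.

```python
import numpy as np

# Toy stand-ins for WordNet antonym pairs and trained word embeddings.
ANTONYMS = {("hot", "cold"), ("big", "small")}
rng = np.random.default_rng(1)
VEC = {w: rng.normal(size=5) for w in ["the", "soup", "was", "hot", "cold"]}

def cosine_dist(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def incongruity_features(words):
    # Antonym-pair indicator over all word pairs in the sentence.
    has_antonym = any(
        (a, b) in ANTONYMS or (b, a) in ANTONYMS
        for i, a in enumerate(words) for b in words[i + 1:]
    )
    # Maximum and minimum pairwise semantic distances.
    dists = [cosine_dist(VEC[a], VEC[b])
             for i, a in enumerate(words) for b in words[i + 1:]]
    # [antonym-pair indicator, max distance, min distance]
    return [float(has_antonym), max(dists), min(dists)]

Q = incongruity_features(["the", "soup", "was", "hot", "cold"])
```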
6. The humor identification method based on a neural network and humor features according to claim 1, wherein in step b2 and step c3 the semantic resource is WordNet.
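Steps c3–c4 select the "fuzzy" (ambiguity) feature word by synonym overlap with the sentence; a minimal sketch with a hand-made synonym table standing in for WordNet synsets (the tie-breaking rule of keeping the first word is our assumption).

```python
# Toy synonym-set table standing in for WordNet synsets.
SYNSETS = {
    "river": [{"river", "stream"}],
    "bank": [{"bank", "depository"}, {"bank", "slope"}],
    "stream": [{"stream", "river"}, {"stream", "flow"}],
}

def fuzzy_feature_word(words):
    """Return the word whose synonyms occur most often in the sentence."""
    best, best_overlap = None, -1
    sentence = set(words)
    for w in words:
        # Union of all synonym sets of w, with duplicates and w itself removed.
        syns = set().union(*SYNSETS.get(w, [set()])) - {w}
        overlap = len(syns & sentence)   # synonyms also present in the sentence
        if overlap > best_overlap:       # ties keep the first word encountered
            best, best_overlap = w, overlap
    return best

word = fuzzy_feature_word(["river", "bank", "stream"])
```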
7. The humor identification method based on a neural network and humor features according to claim 1, wherein in step S5 the evaluation indexes are precision, accuracy, recall, and F1 value; the precision is calculated as follows:
Precision = TP / (TP + FP)
the accuracy calculation formula is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
the recall ratio formula is as follows:
Recall = TP / (TP + FN)
the F1 value is formulated as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
where TP denotes the number of samples in which the classifier labels a positive example as positive, TN the number in which it labels a negative example as negative, FP the number in which it labels a negative example as positive, and FN the number in which it labels a positive example as negative.
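The four evaluation indexes of claim 7 can be computed directly from the TP/TN/FP/FN counts defined above; the counts used in the example call are illustrative.

```python
# Precision, accuracy, recall, and F1 from the confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, accuracy, recall, f1

p, a, r, f1 = metrics(tp=40, tn=45, fp=10, fn=5)
```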
CN201810496016.4A 2018-05-22 2018-05-22 Humor identification method based on neural network and humor characteristics Active CN108874896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810496016.4A CN108874896B (en) 2018-05-22 2018-05-22 Humor identification method based on neural network and humor characteristics


Publications (2)

Publication Number Publication Date
CN108874896A CN108874896A (en) 2018-11-23
CN108874896B true CN108874896B (en) 2020-11-06

Family

ID=64333999


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101037A (en) * 2019-05-28 2020-12-18 云义科技股份有限公司 Semantic similarity calculation method
CN110188202B (en) * 2019-06-06 2021-07-20 北京百度网讯科技有限公司 Training method and device of semantic relation recognition model and terminal
CN110457471A (en) * 2019-07-15 2019-11-15 平安科技(深圳)有限公司 File classification method and device based on A-BiLSTM neural network
CN111401003B (en) * 2020-03-11 2022-05-03 四川大学 Method for generating humor text with enhanced external knowledge
CN112214602B (en) * 2020-10-23 2023-11-10 中国平安人寿保险股份有限公司 Humor-based text classification method and device, electronic equipment and storage medium
CN112818118B (en) * 2021-01-22 2024-05-21 大连民族大学 Reverse translation-based Chinese humor classification model construction method
CN113688622A (en) * 2021-09-05 2021-11-23 安徽清博大数据科技有限公司 Method for identifying situation comedy conversation humor based on NER

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090055426A (en) * 2007-11-28 2009-06-02 중앙대학교 산학협력단 Emotion recognition mothod and system based on feature fusion
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107564542A (en) * 2017-09-04 2018-01-09 大国创新智能科技(东莞)有限公司 Affective interaction method and robot system based on humour identification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Humor computing and its application research; Lin Hongfei et al.; Journal of Shandong University (Natural Science); 2016-07-31; pp. 1-10 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant