CN115618003A - Literature figure relation identification method and system - Google Patents

Literature figure relation identification method and system

Info

Publication number
CN115618003A
Authority
CN
China
Prior art keywords
literary
relationship
sentence
analyzed
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211392235.0A
Other languages
Chinese (zh)
Inventor
周凤莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University
Original Assignee
Harbin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University
Priority to CN202211392235.0A
Publication of CN115618003A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a system for identifying relationships between literary characters. A text classification model first determines which sentences contain a special identity relationship, and the character relationship to be identified is then recognized from those sentences, so that character relationships of a preset type are identified accurately. The method performs sequence annotation on the literary work to be analyzed and extracts the names of people in the work and the sentences containing special identity relationships; pairs each identified name with each sentence containing a special identity relationship, splices each pair, and inputs it into a relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified; and counts the tone words that express emotion in the literary work, substitutes the counts into a category analysis model, and obtains the literary category of the work.

Description

Literature figure relation identification method and system
Technical Field
The invention relates to the technical field of natural language and text processing, and in particular to a method and a system for identifying relationships between characters in literary works.
Background
To mine useful knowledge from literary works, the relationships between their characters need to be analyzed. Character relationship extraction is an important means of knowledge acquisition: it extracts the semantic relationship that exists between two person entities from natural language text.
Existing character relationship identification methods use a PCNN (piecewise convolutional neural network) model, which modifies the pooling layer of a conventional convolutional neural network (CNN). The feature map is divided at the two entity positions into three segments (before the first entity, between the entities, and after the second entity), and each segment is pooled separately so as to better capture the structural information between the two entities. A sentence-level attention mechanism is also used to mitigate the wrong-label problem. However, these models do not fully consider sentence semantics and are not well suited to literary works. Literary works often have many characters, spread across the chapters of a book, with complicated and intertwined relationships, and current identification methods cannot fully present these relationships.
Disclosure of Invention
The invention provides a method and a system for identifying relationships between literary characters. A text classification model first determines which sentences contain a special identity relationship, and the character relationship to be identified is then recognized from those sentences, so that character relationships of a preset type are identified accurately.
A literature figure relation identification method comprises the following steps:
S1: performing sequence annotation on the literary work to be analyzed, and extracting the names of people in the literary work and the sentences containing special identity relationships;
S2: pairing each identified name with each sentence containing a special identity relationship, splicing each pair, and inputting it into a relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified;
S3: counting the tone words that express emotion in the literary work to be analyzed, substituting the counts into a category analysis model, and obtaining the literary category of the literary work to be analyzed.
Further, the specific method for performing sequence annotation on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships comprises the following steps:
S101: performing sequence annotation on the literary work to be analyzed to obtain the names of the characters contained in it;
S102: segmenting the literary work to be analyzed into sentences, and inputting each sentence into a text classification model to determine whether the sentence contains a special identity relationship;
S103: extracting, through a name recognition interface suitable for literary works, the names of people in the sentences that contain a special identity relationship.
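Step S102 can be illustrated with a minimal sketch. The patent does not say how sentences are delimited or how the classifier is called, so the punctuation set and the `contains_relationship` callable below are assumptions made only for illustration; the callable stands in for the trained text classification model.

```python
import re

# Assumed sentence delimiters (Chinese and Western); the patent does not list them.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def select_relationship_sentences(text, contains_relationship):
    """Keep only sentences that the (hypothetical) classifier flags.

    `contains_relationship` is any callable that takes a sentence and
    returns True when it contains a special identity relationship.
    """
    return [s for s in split_sentences(text) if contains_relationship(s)]
```

In use, a keyword test such as `lambda s: "父亲" in s` can stand in for the classifier while the rest of the pipeline is being built.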
Furthermore, the specific method for segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether it contains a special identity relationship is as follows:
S10201: converting each of the A words of the sentence into a B-dimensional vector, and forming an A×B matrix from the A B-dimensional vectors;
S10202: inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max-pooling operation on the feature map to obtain a feature vector;
S10203: passing the feature vector through a classifier to obtain a classification result, which indicates whether the sentence contains a special identity relationship.
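The convolution and max-pooling of step S10202 can be sketched in plain Python as follows. This is an illustration under assumptions, not the patented network: the kernel height, the ReLU activation, and the function names are not given in the source and are chosen here only to make the example concrete.

```python
def conv_feature_map(matrix, kernel, bias=0.0):
    """Slide an (h x B) kernel down an (A x B) sentence matrix.

    Returns A - h + 1 activations. ReLU is an assumed activation;
    the patent does not name one.
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for i in range(h):
            for j, weight in enumerate(kernel[i]):
                total += weight * matrix[start + i][j]
        feature_map.append(max(0.0, total))
    return feature_map


def sentence_feature_vector(matrix, kernels):
    """Max-pool each kernel's feature map into one value per kernel."""
    return [max(conv_feature_map(matrix, k)) for k in kernels]
```

With one kernel per feature, the resulting vector has as many entries as there are kernels, regardless of the sentence length A, which is why max pooling is a convenient way to feed variable-length sentences into a fixed-size classifier.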
Further, the specific method for pairing each identified name with each sentence containing a special identity relationship, splicing each pair, and inputting it into a relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified is as follows:
S201: obtaining a name list from the identified names, and forming name-sentence pairs by traversing the name list and the sentences containing special identity relationships;
S202: segmenting the sequence text of each spliced name-sentence pair character by character and feeding it to the input layer of the pre-trained BERT language model;
S203: splicing the hidden vector output by the pre-trained BERT language model with the position vector of the name pair in the sentence;
S204: passing the spliced vector through a fully connected layer and a softmax layer to obtain a category probability distribution vector, wherein the relationship category corresponding to the maximum value in that vector is the category of the spliced name-sentence pair.
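Steps S201 and S202 can be sketched as below. The source does not specify how a name and a sentence are spliced; the `[CLS] name [SEP] sentence [SEP]` layout is the usual BERT sentence-pair convention and is an assumption here, as are the function names.

```python
from itertools import product


def build_name_sentence_pairs(names, sentences):
    """S201: pair every name in the list with every flagged sentence."""
    return list(product(names, sentences))


def to_char_tokens(name, sentence):
    """S202: splice one pair and split it character by character.

    The special-token layout follows the common BERT convention and is
    assumed; the patent only says the pair is spliced and segmented.
    """
    return ["[CLS]", *name, "[SEP]", *sentence, "[SEP]"]
```

For two names and one flagged sentence this yields two pairs, each of which becomes one token sequence for the relationship classification model.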
Further, the step of counting the tone words that express emotion in the literary work to be analyzed, substituting the counts into the category analysis model, and obtaining the literary category of the literary work to be analyzed comprises the following steps:
S301: extracting the tone words that express emotion from the literary work to be analyzed to obtain the number of such tone words;
S302: substituting the number of tone words that express emotion into the literature category analysis model to obtain the importance degree parameter of those tone words in the text of the literary work;
S303: obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
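Step S301 amounts to counting occurrences of entries from a tone-word lexicon. The patent does not publish its lexicon, so the short list below is hypothetical and serves only to make the sketch runnable.

```python
from collections import Counter

# Hypothetical lexicon of tone words; the patent does not provide one.
TONE_WORDS = ("啊", "呀", "吧", "呢", "吗", "哼")


def count_tone_words(text, lexicon=TONE_WORDS):
    """Return a Counter mapping each tone word found in the text to its count."""
    counts = Counter()
    for word in lexicon:
        occurrences = text.count(word)
        if occurrences:
            counts[word] = occurrences
    return counts
```

The resulting counts are the input that step S302 substitutes into the category analysis model.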
A literature figure relationship recognition system, comprising:
an extraction module, used for performing sequence annotation on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships;
a recognition module, used for pairing each identified name with each sentence containing a special identity relationship, splicing each pair, and inputting it into a relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified;
a classification module, used for counting the tone words that express emotion in the literary work to be analyzed, substituting the counts into the category analysis model, and obtaining the literary category of the literary work to be analyzed.
Further, the extraction module comprises:
an annotation module, used for performing sequence annotation on the literary work to be analyzed to obtain the names of the characters contained in it;
a segmentation module, used for segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether the sentence contains a special identity relationship;
an interface module, used for extracting, through a name recognition interface suitable for literary works, the names of people in the sentences that contain a special identity relationship.
Further, the segmentation module comprises:
a matrix module, used for converting each of the A characters of the sentence into a B-dimensional vector and forming an A×B matrix from the A B-dimensional vectors;
a vector module, used for inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max-pooling operation on the feature map to obtain a feature vector;
a feature module, used for passing the feature vector through a classifier to obtain a classification result, which indicates whether the sentence contains a special identity relationship.
Further, the recognition module comprises:
a pairing module, used for obtaining a name list from the identified names and forming name-sentence pairs by traversing the name list and the sentences containing special identity relationships;
an embedding module, used for segmenting the sequence text of each spliced name-sentence pair character by character and feeding it to the input layer of the pre-trained BERT language model;
a splicing module, used for splicing the hidden vector output by the pre-trained BERT language model with the position vector of the name pair in the sentence;
a correspondence module, used for passing the spliced vector through a fully connected layer and a softmax layer to obtain a category probability distribution vector, wherein the relationship category corresponding to the maximum value in that vector is the category of the spliced name-sentence pair.
Further, the classification module comprises:
a data acquisition module, used for extracting the tone words that express emotion from the literary work to be analyzed to obtain the number of such tone words;
a calculation module, used for substituting the number of tone words that express emotion into the literature category analysis model to obtain the importance degree parameter of those tone words in the text of the literary work;
a definition module, used for obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
The beneficial effects of the invention are as follows:
The method is suited to analyzing literary works and fully considers the semantics of their sentences. It copes with the fact that literary works often have many characters, spread across the chapters of a book, with complicated relationships, and can fully present the complicated character relationships of a literary work.
Meanwhile, by counting the tone words that express emotion in the literary work, the literary category of the work can be obtained, so that the work is analyzed from multiple angles and its character relationships are corroborated indirectly.
Furthermore, widely used techniques such as dropout and softmax are applied in the analysis, which keeps the computation of the whole analysis process simple and effective and further improves the reliability and range of application of the method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram of the steps of the method of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a detailed view of the system of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
An embodiment of the invention provides a literary figure relation identification method, which comprises the following steps:
S1: performing sequence annotation on the literary work to be analyzed, and extracting the names of people in the literary work and the sentences containing special identity relationships;
S2: pairing each identified name with each sentence containing a special identity relationship, splicing each pair, and inputting it into a relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified;
S3: counting the tone words that express emotion in the literary work to be analyzed, substituting the counts into the category analysis model, and obtaining the literary category of the literary work to be analyzed.
The working principle of the embodiment is as follows:
First, sequence annotation is performed on the literary work to be analyzed to obtain the names of the characters it contains; the work is segmented into sentences, and each sentence is input into the text classification model to determine whether it contains a special identity relationship; and the names of people in the sentences containing special identity relationships are extracted through a name recognition interface suitable for literary works.
Then, after the names of people in the literary work are extracted through the name recognition interface, a name list is obtained. The name list is traversed, and each name is paired with each sentence containing a special identity relationship and spliced to it. The resulting name-sentence pairs can then be input into the relationship classification model to obtain the relationship category predicted by the model.
Then, in the relationship classification model, the sequence text of each name-sentence pair is segmented character by character and fed to the input layer of the model. The hidden vector output by the model is spliced with the position vector of the name pair in the sentence and passed through a fully connected layer and a softmax layer to obtain a category probability distribution vector, and the relationship category with the largest value is taken as the model's prediction.
Finally, the tone words that express emotion are extracted from the literary work to be analyzed to obtain their number, and this number is substituted into the literature category analysis model to obtain the importance degree parameter of those tone words in the text of the literary work.
The softmax layer is a multi-class generalization of logistic regression and belongs to the prior art.
The special identity relationships are relationships such as relatives, couples, and superiors and subordinates.
The text classification model may be, for example, fastText, TextCNN, TextRNN, or TextRCNN.
The beneficial effects of this embodiment are as follows:
The method is suited to analyzing literary works and fully considers the semantics of their sentences. It copes with the fact that literary works often have many characters, spread across the chapters of a book, with complicated relationships, and can fully present the complicated character relationships of a literary work.
Meanwhile, by counting the tone words that express emotion in the literary work, the literary category of the work can be obtained, so that the work is analyzed from multiple angles and its character relationships are corroborated indirectly.
In one embodiment, the specific method for performing sequence annotation on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships comprises the following steps:
S101: performing sequence annotation on the literary work to be analyzed to obtain the names of the characters contained in it;
S102: segmenting the literary work to be analyzed into sentences, and inputting each sentence into a text classification model to determine whether the sentence contains a special identity relationship;
S103: extracting, through a name recognition interface suitable for literary works, the names of people in the sentences that contain a special identity relationship.
The working principle of the embodiment is as follows:
The literary work to be analyzed is first preprocessed: the literary work in which a preset character relationship, such as the special identity relationship, is to be identified is acquired, and data cleaning is completed. Each of the A words of a sentence is then converted into a B-dimensional vector, and the A B-dimensional vectors corresponding to the A words form an A×B matrix. The A×B matrix is input into the convolutional neural network of the text classification model to obtain a feature map, and a max-pooling operation on the feature map yields a feature vector. The feature vector is passed through a classifier to obtain a classification result, which indicates whether the sentence contains a special identity relationship. For sentences that the text classification model judges to contain a special identity relationship, the names of people in the literary work can be extracted through a name recognition interface suitable for literary works.
Example: A may take the value 4 and B the value 3, in which case the A×B matrix is:
[Figure BDA0003931859550000061: example 4×3 matrix; image not reproduced in the text]
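Because the matrix image is not available, the following sketch shows how a 4×3 matrix of this kind could arise for A = 4 and B = 3. The embedding table is filled with seeded random values purely for illustration; in a trained model these vectors would be learned, and the helper names are hypothetical.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Assign each vocabulary item a dim-dimensional vector.

    Random values are placeholders for learned embeddings.
    """
    rng = random.Random(seed)
    return {item: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for item in vocab}


def sentence_to_matrix(tokens, table, dim):
    """Stack one vector per token into an A x B matrix (list of rows).

    Tokens missing from the table map to a zero vector (an assumption).
    """
    zero = [0.0] * dim
    return [list(table.get(tok, zero)) for tok in tokens]


tokens = list("他是父亲")  # A = 4 tokens
table = build_embedding_table(tokens, dim=3)  # B = 3
matrix = sentence_to_matrix(tokens, table, dim=3)
```

Here `matrix` has four rows of three values each, matching the A×B shape described in the example.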
The beneficial effects of this embodiment are as follows:
Character relationships in literary works are usually expressed in sentences containing special identity relationships. Obtaining the names of people in the literary work and the sentences containing special identity relationships is therefore the quickest way to find the character relationships in the work. Processing the literary work sentence by sentence makes the processing precise and logical and improves the efficiency of analyzing the character relationships in the work.
Further, the literary work is converted into vectors for analysis, which improves the speed and accuracy of the analysis. Furthermore, widely used techniques such as dropout and softmax are applied in the analysis, which keeps the computation of the whole analysis process simple and effective and further improves the reliability and range of application of the method.
In one embodiment, the specific method for segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether it contains a special identity relationship is as follows:
S10201: converting each of the A words of the sentence into a B-dimensional vector, and forming an A×B matrix from the A B-dimensional vectors;
S10202: inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max-pooling operation on the feature map to obtain a feature vector;
S10203: passing the feature vector through a classifier to obtain a classification result, which indicates whether the sentence contains a special identity relationship.
The working principle of the embodiment is as follows:
Each of the A words of the sentence is converted into a B-dimensional vector, and the A B-dimensional vectors corresponding to the A words form an A×B matrix. The A×B matrix is input into the convolutional neural network of the text classification model to obtain a feature map, and a max-pooling operation on the feature map yields a feature vector. The feature vector is passed through a classifier to obtain a classification result, which indicates whether the sentence contains a special identity relationship.
Specifically, during classification the feature vector is first passed through a fully connected layer, and a Dropout layer is added to prevent overfitting. For multi-class classification a Softmax layer is usually used: the Softmax function maps the outputs of the neural network into the interval (0, 1), the resulting values can be regarded as a category probability distribution vector, and the category with the highest probability is taken as the final prediction. The training data of the classification model come from sentences in literary works that have been manually labeled as to whether they contain the relationship category; that is, the sentence labels are of two types: sentences that contain a special identity relationship and sentences that do not.
The Dropout layer is a structure used to reduce overfitting in neural networks.
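The Softmax and Dropout operations described above can be sketched in plain Python. The "inverted dropout" rescaling and the function names are choices made for this illustration; the patent only states that a Dropout layer and a Softmax layer are used.

```python
import math
import random


def softmax(logits):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate` during
    training and rescale the survivors; pass through unchanged otherwise."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def predict(logits):
    """Return the index of the most probable class and the probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

For the two-class sentence classifier, `predict` applied to the two output scores returns index 0 or 1 together with the probability of each class. Dropout is active only during training and is skipped at prediction time.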
The beneficial effects of this embodiment are as follows:
Processing the literary work sentence by sentence makes the processing precise and logical and improves the efficiency of analyzing the character relationships in the work. Further, the literary work is converted into vectors for analysis, which improves the speed and accuracy of the analysis. Furthermore, widely used techniques such as dropout and softmax are applied in the analysis, which keeps the computation of the whole analysis process simple and effective and further improves the reliability and range of application of the method.
In one embodiment, the specific method for pairing each identified name with each sentence containing a special identity relationship, splicing each pair, and inputting it into the relationship classification model to determine whether the relationship in the sentence is the character relationship to be identified is as follows:
S201: obtaining a name list from the identified names, and forming name-sentence pairs by traversing the name list and the sentences containing special identity relationships;
S202: segmenting the sequence text of each spliced name-sentence pair character by character and feeding it to the input layer of the pre-trained BERT language model;
S203: splicing the hidden vector output by the pre-trained BERT language model with the position vector of the name pair in the sentence;
S204: passing the spliced vector through a fully connected layer and a softmax layer to obtain a category probability distribution vector, wherein the relationship category corresponding to the maximum value in that vector is the category of the spliced name-sentence pair.
The working principle of the embodiment is as follows:
First, after the names of people in the literary work are extracted through a name recognition interface suitable for literary works, a name list is obtained. The name list is traversed, and each name is paired with each sentence containing a special identity relationship and spliced to it. The resulting name-sentence pairs can then be input into the relationship classification model to obtain the relationship category predicted by the model.
Then, in the relationship classification model, the sequence text of each name-sentence pair is segmented character by character and fed to the input layer of the model. The hidden vector output by the model is spliced with the position vector of the name pair in the sentence and passed through a fully connected layer and a softmax layer to obtain a category probability distribution vector, and the relationship category with the largest value is taken as the model's prediction.
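The classification head described here (splice, fully connected layer, softmax, take the maximum) can be sketched as follows. The hidden vector would come from the BERT model; here it is passed in as a plain list. How the position vector is encoded is not stated in the source, so its contents, the weights, and the label names are all placeholders for illustration.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(hidden, position, weights, bias, labels):
    """Splice the hidden and position vectors, then pick the most probable label."""
    features = list(hidden) + list(position)  # splicing = concatenation
    probs = softmax(linear(features, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

The weight matrix must have one row per relationship category and as many columns as the spliced vector has entries; in a real system these weights are learned during the further training on labeled literary works.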
In the embodiment of the present application, the relationship classification model may be a pre-trained BERT language model. BERT is a pre-trained language model proposed by Google in 2018 and belongs to the prior art.
When the model is used to classify relationship categories in literary works, it is first pre-trained on a large-scale corpus from the literary domain so that it is better suited to natural language processing problems in that domain. The model is then further trained using labeled literary works.
The training process of the pre-trained BERT language model comprises the following steps:
1) pre-training the BERT language model on a large-scale corpus from the literary domain;
2) further training the pre-trained BERT language model using labeled literary works.
The beneficial effects of this embodiment are as follows:
By classifying relationship categories with the relationship classification model, it can be determined whether the relationship in a sentence containing a special identity relationship is the character relationship to be identified, such as the special identity relationship. The method is accurate and efficient and saves working time.
The BERT model has strong language representation and feature extraction capabilities. It achieved state-of-the-art results on 11 NLP benchmark tasks and demonstrated the strength of bidirectional language models, which greatly improves the working efficiency and reliability of the invention.
In one embodiment, the step of counting the mood words used to express emotion in the literary work to be analyzed, substituting the count into the category analysis model, and obtaining the literary category of the literary work to be analyzed comprises:
S301: extracting the mood words used to express emotion in the literary work to be analyzed to obtain the number of such mood words;
S302: substituting the number of mood words used to express emotion into the literary category analysis model to obtain an importance degree parameter of the mood words in the literary work;
S303: obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
The working principle of the embodiment is as follows:
the literary category analysis model is as follows:
Figure BDA0003931859550000091
In the formula, Si denotes the number of mood words used to express emotion, and Zi denotes the importance degree of the i-th mood word used to express emotion in the literary work, i = 1, 2, 3, ...
The mood words are sorted by the value of Zi to determine the first-ranked mood word used to express emotion; the literary category to which the literary work belongs is then determined from that first-ranked mood word.
The beneficial effects of this embodiment are:
a literary work may contain many kinds of mood words used to express emotion, such as words expressing joy, anger, loss, or love. When several kinds coexist, the dominant emotion has to be identified by careful screening. In this embodiment, sorting by the value of Zi determines the first-ranked mood word, from which the main category tendency of the analyzed literary work can be determined.
Compared with the prior art, the literary category analysis model is more refined and gives a more accurate and intuitive result, which is beneficial to the dissemination and popularization of the invention.
Example: when the first-ranked mood word belongs to the emotional class (love, hate, complain, recite, etc.), the literary work can be classified as a sentimental work;
when the first-ranked mood word belongs to the fright class (frighten, scare, fluster, etc.), the literary work can be classified as a horror work;
when the first-ranked mood word belongs to the reasoning class (think, worry, wait, etc.), the literary work can be classified as a reasoning work.
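The ranking step can be illustrated with a short sketch. The formula of the literary category analysis model is not reproduced in this text, so the code below substitutes a simple relative frequency, Zi = Si / (sum of all Sj), purely as an assumption; the lexicon `MOOD_LEXICON` and the function `classify_work` are likewise hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical lexicon: mood word -> literary category (illustrative only).
MOOD_LEXICON = {
    "love": "sentimental", "hate": "sentimental", "complain": "sentimental",
    "frighten": "horror", "scare": "horror", "fluster": "horror",
    "think": "reasoning", "worry": "reasoning", "wait": "reasoning",
}

def classify_work(tokens):
    """Count mood words (Si), score them (Zi), rank by Zi, and return the
    category of the first-ranked mood word plus the full ranking."""
    counts = Counter(t for t in tokens if t in MOOD_LEXICON)
    total = sum(counts.values())
    if total == 0:
        return None, []
    # ASSUMPTION: Zi = Si / sum(Sj). The application's own formula is an
    # image that is not available here, so this is only a stand-in.
    importance = {word: s / total for word, s in counts.items()}
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    top_word = ranked[0][0]
    return MOOD_LEXICON[top_word], ranked

tokens = "they love and love yet worry and scare and love".split()
category, ranked = classify_work(tokens)
print(category, ranked)
```

Ties are broken alphabetically here so the ranking is deterministic; the application does not say how ties should be handled.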
This embodiment provides a literary character relationship identification system which, as shown in FIG. 2, comprises:
the extraction module, which is used for performing sequence labeling on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships;
the recognition module, which is used for pairing and splicing each recognized name with each sentence containing a special identity relationship and inputting the result into a relation classification model to determine whether the relationship in the sentence is a character relationship to be recognized;
and the classification module, which is used for counting the mood words used to express emotion in the literary work to be analyzed, substituting the count into the category analysis model, and obtaining the literary category of the literary work to be analyzed.
The working principle of the embodiment is as follows:
First, sequence labeling is performed on the literary work to be analyzed to obtain the character names it contains; the literary work is segmented into sentences, and each sentence is input into a text classification model to determine whether it contains a special identity relationship; for the sentences containing a special identity relationship, the names of the people in the literary work are extracted through a name recognition interface suitable for literary works.
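A minimal sketch of the segmentation and filtering step follows. The application does not say how sentence boundaries are found, so splitting on sentence-ending punctuation is an assumption, and `keyword_classifier` is only a stand-in for the trained text classification model (it is not the model itself).

```python
import re

# ASSUMPTION: a sentence ends at Chinese or Western closing punctuation.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    found = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [s for s in found if s]

def select_relation_sentences(sentences, has_relation):
    """Keep the sentences for which the classifier callable returns True."""
    return [s for s in sentences if has_relation(s)]

def keyword_classifier(sentence):
    """Stand-in for the text classification model: a plain keyword test."""
    return any(k in sentence for k in ("father", "wife", "superior"))

text = "Lin is the father of Mei. It rained all night! Who is his superior?"
sentences = split_sentences(text)
selected = select_relation_sentences(sentences, keyword_classifier)
print(selected)
```

Passing the classifier in as a callable keeps the sketch independent of whichever text classification model is actually used.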
Further, after the names in the literary work are extracted through the name recognition interface suitable for literary works, a name list is obtained; the name list is traversed, and each name is paired and spliced with each sentence containing a special identity relationship; the resulting name-sentence pairs are then input into the relation classification model to obtain the relation type predicted by the model.
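The pairing and splicing step amounts to a Cartesian product of the name list and the selected sentences. The sketch below assumes a `[SEP]` separator between name and sentence, which is a common convention for BERT-style inputs but is not stated in the application; `build_pairs` is a hypothetical helper.

```python
from itertools import product

def build_pairs(names, sentences, sep="[SEP]"):
    """Pair every name with every relation-bearing sentence and splice each
    pair into a single sequence text."""
    return [(name, sentence, f"{name}{sep}{sentence}")
            for name, sentence in product(names, sentences)]

names = ["Lin", "Mei"]
sentences = ["Lin is the father of Mei.", "Mei greeted her superior."]
pairs = build_pairs(names, sentences)
for name, sentence, spliced in pairs:
    print(spliced)
```

With N names and M sentences this yields N x M inputs for the relation classification model, so a long work with many characters produces a large number of pairs.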
Further, in the relation classification model, the sequence text of each name-sentence pair is segmented into characters and fed into the model at the input layer; the hidden vector output by the model is spliced with the position vector of the name pair in the sentence and passed through a fully connected layer and a softmax layer to obtain a class distribution probability vector, and the relation class with the largest output value is taken as the prediction result of the model.
Finally, the mood words used to express emotion in the literary work to be analyzed are extracted to obtain their number; the number is substituted into the literary category analysis model to obtain the importance degree parameter of the mood words in the literary work.
The softmax layer is a multinomial logistic regression model and belongs to the prior art.
The special identity relationships include relationships between relatives, between spouses, and between superiors and subordinates.
The text classification model may be FastText, TextCNN, TextRNN, TextRCNN, or the like.
The beneficial effects of this embodiment are:
the method is suited to analyzing literary works: it takes full account of the semantics of the sentences and effectively handles the fact that literary works often have many characters, scattered throughout the chapters of a book and linked by complicated relationships, so that the complex character relationships of a literary work can be fully presented.
Meanwhile, by counting the mood words used to express emotion in the literary work, the literary category of the work to be analyzed can be obtained; the work is thus analyzed from multiple angles, and its category indirectly corroborates its character relationships.
In one embodiment, as shown in fig. 3, the extraction module comprises:
the labeling module, which is used for performing sequence labeling on the literary work to be analyzed to obtain the character names contained in it;
the segmentation module, which is used for segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether it contains a special identity relationship;
and the interface module, which is used for extracting the names of the people in the sentences containing special identity relationships through a name recognition interface suitable for literary works.
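As an illustration of how sequence labeling yields character names, the sketch below decodes a tag sequence into name strings. The BIO tagging scheme with `B-PER`/`I-PER`/`O` tags is an assumption (the application does not name a tagging scheme), the tags are written by hand rather than predicted, and the sample sentence is only an example.

```python
def decode_names(tokens, tags):
    """Collect each B-PER token and the I-PER tokens that follow it into
    one name string."""
    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-PER":
            if current:                      # close the previous name
                names.append("".join(current))
            current = [token]
        elif tag == "I-PER" and current:
            current.append(token)
        else:                                # "O" or a stray I-PER
            if current:
                names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names

# Hand-written tags for illustration; a trained labeler would predict them.
tokens = list("贾宝玉见了林黛玉")
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-PER", "I-PER", "I-PER"]
names = decode_names(tokens, tags)
print(names)
```

Tokens are joined without spaces because the example is character-segmented Chinese; word-segmented text in other languages would need a different join.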
The working principle of the embodiment is as follows:
The literary work to be analyzed is first preprocessed: the literary work in which a preset character relationship (such as the special character identity relationship) is to be identified is acquired, and data cleaning is completed. Each of the A words of a sentence is converted into a B-dimensional vector; the A B-dimensional vectors corresponding to the A words form an A×B matrix; the A×B matrix is input into the convolutional neural network of the text classification model to obtain a feature map; a max pooling operation is performed on the feature map to obtain a feature vector; and the feature vector is passed through a classifier to obtain a classification result indicating whether the sentence contains a special identity relationship. For each sentence judged by the text classification model to contain a special identity relationship, the names of the people in the literary work can be extracted through a name recognition interface suitable for literary works.
Example: A may take the value 4 and B the value 3, in which case the A×B matrix is:
Figure BDA0003931859550000111
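The A = 4, B = 3 case can be walked through in a few lines of plain Python. The embedding values and the single convolution kernel below are toy numbers chosen for illustration, not values from the application, and the function names are hypothetical.

```python
def sentence_matrix(words, embeddings):
    """Stack one B-dimensional vector per word into an A x B matrix."""
    return [embeddings[w] for w in words]

def conv_feature_map(matrix, kernel):
    """Slide a (k x B) kernel along the word axis; one value per window."""
    k = len(kernel)
    feature_map = []
    for start in range(len(matrix) - k + 1):
        window = matrix[start:start + k]
        feature_map.append(sum(
            w * x
            for kernel_row, word_vec in zip(kernel, window)
            for w, x in zip(kernel_row, word_vec)
        ))
    return feature_map

def max_pool(feature_map):
    """Max pooling over the whole feature map keeps its strongest response."""
    return max(feature_map)

EMBEDDINGS = {  # toy 3-dimensional word vectors (B = 3)
    "he": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "her": [0.0, 0.0, 1.0],
    "father": [1.0, 1.0, 1.0],
}
matrix = sentence_matrix(["he", "is", "her", "father"], EMBEDDINGS)  # 4 x 3
kernel = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # one filter spanning 2 words
feature_map = conv_feature_map(matrix, kernel)
feature = max_pool(feature_map)
print(feature_map, feature)
```

A full text CNN would use many kernels of several widths, each contributing one pooled value to the feature vector; a single kernel is shown to keep the arithmetic easy to follow.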
The beneficial effects of this embodiment are:
the relationships between characters in a literary work are often expressed in sentences containing special identity relationships. Obtaining the names of the people in the literary work together with those sentences is therefore the quickest way to find the character relationships in the work. For this reason the literary work is processed sentence by sentence, which makes the processing precise and logical and improves the efficiency of analyzing the character relationships in the work;
further, the literary work is converted into vectors for analysis, which improves the speed and accuracy of the analysis; furthermore, Dropout, softmax, and similar components are widely used in such analysis and keep the computation simple and effective, which further improves the reliability and the range of application of the method.
In one embodiment, as shown in FIG. 3, the segmentation module comprises:
the matrix module, which is used for converting each of the A words of a sentence into a B-dimensional vector and forming an A×B matrix from the B-dimensional vectors;
the vector module, which is used for inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max pooling operation on the feature map to obtain a feature vector;
and the feature module, which is used for passing the feature vector through the classifier to obtain a classification result indicating whether the sentence contains a special identity relationship.
The working principle of the embodiment is as follows:
Each of the A words of the sentence is converted into a B-dimensional vector; the A B-dimensional vectors corresponding to the A words form an A×B matrix; the A×B matrix is input into the convolutional neural network of the text classification model to obtain a feature map; a max pooling operation is performed on the feature map to obtain a feature vector; and the feature vector is passed through a classifier to obtain a classification result indicating whether the sentence contains a special identity relationship.
Specifically, during classification the feature vector is first passed through a fully connected layer, and a Dropout layer is added to prevent overfitting. For multi-class classification a softmax layer is typically used: the softmax function maps the outputs of the neural network into the interval (0, 1), so that they can be regarded as a class distribution probability vector, and the class with the highest probability is taken as the final prediction. The training data of the classification model come from manually labeled data indicating whether the sentences in the literary works contain a relation category; that is, each sentence carries one of two labels: containing a special identity relationship, or not containing one.
The Dropout layer is a structure used to reduce overfitting in neural networks.
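The roles of the Dropout layer and the softmax layer can be shown in a small sketch. It assumes the common "inverted dropout" formulation (survivors are rescaled during training and the layer is the identity at inference), which the application does not specify; the two-element output and the label names are toy values.

```python
import math
import random

def dropout(vector, rate, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and rescale the survivors; at inference, return the input as is."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def softmax(logits):
    """Map scores into (0, 1) so that they sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)          # seeded so the training-time mask is repeatable
fc_output = [2.0, -1.0]         # toy fully connected outputs for the two labels
train_out = dropout(fc_output, rate=0.5, rng=rng)
infer_out = dropout(fc_output, rate=0.5, rng=rng, training=False)
probs = softmax(infer_out)
labels = ["contains special identity relationship", "does not contain one"]
prediction = labels[probs.index(max(probs))]
print(train_out, [round(p, 3) for p in probs], prediction)
```

Because dropout is random, the training-time output varies with the seed, while the inference-time prediction depends only on the fully connected outputs.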
The beneficial effects of this embodiment are:
processing the literary work sentence by sentence makes the processing precise and logical and improves the efficiency of analyzing the character relationships in the work; further, the literary work is converted into vectors for analysis, which improves the speed and accuracy of the analysis; furthermore, Dropout, softmax, and similar components are widely used in such analysis and keep the computation simple and effective, which further improves the reliability and the range of application of the method.
In one embodiment, as shown in fig. 3, the identification module comprises:
the pairing module, which is used for obtaining a name list from the recognized names and forming name-sentence pairs by traversing the name list against each sentence containing a special identity relationship;
the embedding module, which is used for segmenting the sequence text of each spliced name-sentence pair into characters and inputting it into the input layer of the BERT language pre-training model;
the splicing module, which is used for splicing the hidden vector output by the BERT language pre-training model with the position vector of the name pair in the sentence;
and the corresponding module, which is used for passing the spliced vector through a fully connected layer and a softmax layer to obtain a class distribution probability vector, wherein the relation class corresponding to the maximum value in the class distribution probability vector is the class of the spliced name-sentence pair.
The working principle of the embodiment is as follows:
First, the names of the people in the literary work are extracted through a name recognition interface suitable for literary works to obtain a name list; the name list is traversed, and each name is paired and spliced with each sentence containing a special identity relationship; the resulting name-sentence pairs are then input into the relation classification model to obtain the relation type predicted by the model.
Then, in the relation classification model, the sequence text of each name-sentence pair is segmented into characters and fed into the model at the input layer. The hidden vector output by the model is spliced with the position vector of the name pair in the sentence and passed through a fully connected layer and a softmax layer to obtain a class distribution probability vector; the relation class with the largest output value is taken as the prediction result of the model.
In the embodiment of the present application, the relation classification model may be the BERT language pre-training model, which was proposed by Google in 2018 and belongs to the prior art.
When the model is used to classify relation categories in literary works, it is first pre-trained on a large-scale literary-domain corpus so that it is better suited to natural language processing problems in the literary field. The model is then further trained using labeled literary works.
The training process of the BERT language pre-training model comprises the following steps:
1) pre-training the BERT language pre-training model on a large-scale literary-domain corpus;
2) further training the BERT language pre-training model using labeled literary works.
The beneficial effects of this embodiment are:
by classifying relation categories with the relation classification model, it can be determined whether the relation in a sentence containing a special identity relation is a character relation to be identified, such as a special character identity relation; the method is accurate and efficient and saves working time;
the BERT model has strong language representation and feature extraction capabilities. It achieved state-of-the-art results on 11 NLP benchmark tasks and demonstrated the strength of bidirectional language models, which improves the working efficiency and reliability of the invention.
In one embodiment, as shown in fig. 3, the classification module comprises:
the data acquisition module, which is used for extracting the mood words used to express emotion in the literary work to be analyzed to obtain the number of such mood words;
the calculation module, which is used for substituting the number of mood words used to express emotion into the literary category analysis model to obtain the importance degree parameter of the mood words in the literary work;
and the definition module, which is used for obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
The working principle of the embodiment is as follows:
the literary category analysis model is as follows:
Figure BDA0003931859550000131
In the formula, Si denotes the number of mood words used to express emotion, and Zi denotes the importance degree of the i-th mood word used to express emotion in the literary work, i = 1, 2, 3, ...
The mood words are sorted by the value of Zi to determine the first-ranked mood word used to express emotion; the literary category to which the literary work belongs is then determined from that first-ranked mood word.
The beneficial effects of this embodiment are:
a literary work may contain many kinds of mood words used to express emotion, such as words expressing joy, anger, loss, or love. When several kinds coexist, the dominant emotion has to be identified by careful screening. In this embodiment, sorting by the value of Zi determines the first-ranked mood word, from which the main category tendency of the analyzed literary work can be determined.
Compared with the prior art, the literary category analysis model is more refined and gives a more accurate and intuitive result, which is beneficial to the dissemination and popularization of the invention.
Example: when the first-ranked mood word belongs to the emotional class (love, hate, complain, recite, etc.), the literary work can be classified as a sentimental work;
when the first-ranked mood word belongs to the fright class (frighten, scare, fluster, etc.), the literary work can be classified as a horror work;
when the first-ranked mood word belongs to the reasoning class (think, worry, wait, etc.), the literary work can be classified as a reasoning work.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the technical scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A literary character relationship identification method, characterized by comprising the following steps:
S1: performing sequence labeling on the literary work to be analyzed, and extracting the names of people in the literary work and the sentences containing special identity relationships;
S2: pairing and splicing each recognized name with each sentence containing a special identity relationship, and inputting the result into a relation classification model to determine whether the relationship in the sentence containing a special identity relationship is a character relationship to be recognized;
S3: counting the mood words used to express emotion in the literary work to be analyzed, substituting the count into the category analysis model, and obtaining the literary category of the literary work to be analyzed.
2. The literary character relationship identification method of claim 1, wherein the specific method of performing sequence labeling on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships comprises the following steps:
S101: performing sequence labeling on the literary work to be analyzed to obtain the character names contained in it;
S102: segmenting the literary work to be analyzed into sentences, and inputting each sentence into a text classification model to determine whether it contains a special identity relationship;
S103: extracting the names of the people in the sentences containing special identity relationships through a name recognition interface suitable for literary works.
3. The literary character relationship identification method of claim 2, wherein segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether it contains a special identity relationship comprises:
S10201: converting each of the A words of the sentence into a B-dimensional vector, and forming an A×B matrix from the B-dimensional vectors;
S10202: inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max pooling operation on the feature map to obtain a feature vector;
S10203: passing the feature vector through a classifier to obtain a classification result indicating whether the sentence contains a special identity relationship.
4. The literary character relationship identification method of claim 1, wherein the specific method of pairing and splicing each recognized name with each sentence containing a special identity relationship and inputting the result into the relation classification model to determine whether the relationship in the sentence is the character relationship to be recognized is as follows:
S201: obtaining a name list from the recognized names, and forming name-sentence pairs by traversing the name list against each sentence containing a special identity relationship;
S202: segmenting the sequence text of each spliced name-sentence pair into characters and inputting it into the input layer of the BERT language pre-training model;
S203: splicing the hidden vector output by the BERT language pre-training model with the position vector of the name pair in the sentence;
S204: passing the spliced vector through a fully connected layer and a softmax layer to obtain a class distribution probability vector, wherein the relation class corresponding to the maximum value in the class distribution probability vector is the class of the spliced name-sentence pair.
5. The literary character relationship identification method of claim 1, wherein the step of counting the mood words used to express emotion in the literary work to be analyzed, substituting the count into the category analysis model, and obtaining the literary category of the literary work to be analyzed comprises:
S301: extracting the mood words used to express emotion in the literary work to be analyzed to obtain the number of such mood words;
S302: substituting the number of mood words used to express emotion into the literary category analysis model to obtain an importance degree parameter of the mood words in the literary work;
S303: obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
6. A literary character relationship identification system, comprising:
the extraction module, which is used for performing sequence labeling on the literary work to be analyzed and extracting the names of people in the literary work and the sentences containing special identity relationships;
the recognition module, which is used for pairing and splicing each recognized name with each sentence containing a special identity relationship and inputting the result into a relation classification model to determine whether the relationship in the sentence is a character relationship to be recognized;
and the classification module, which is used for counting the mood words used to express emotion in the literary work to be analyzed, substituting the count into the category analysis model, and obtaining the literary category of the literary work to be analyzed.
7. The literary character relationship identification system of claim 6, wherein the extraction module comprises:
the labeling module, which is used for performing sequence labeling on the literary work to be analyzed to obtain the character names contained in it;
the segmentation module, which is used for segmenting the literary work to be analyzed into sentences and inputting each sentence into the text classification model to determine whether it contains a special identity relationship;
and the interface module, which is used for extracting the names of the people in the sentences containing special identity relationships through a name recognition interface suitable for literary works.
8. The literary character relationship identification system of claim 7, wherein the segmentation module comprises:
the matrix module, which is used for converting each of the A words of a sentence into a B-dimensional vector and forming an A×B matrix from the B-dimensional vectors;
the vector module, which is used for inputting the A×B matrix into the convolutional neural network of the text classification model to obtain a feature map, and performing a max pooling operation on the feature map to obtain a feature vector;
and the feature module, which is used for passing the feature vector through the classifier to obtain a classification result indicating whether the sentence contains a special identity relationship.
9. The literary character relationship identification system of claim 6, wherein the identification module comprises:
the pairing module, which is used for obtaining a name list from the recognized names and forming name-sentence pairs by traversing the name list against each sentence containing a special identity relationship;
the embedding module, which is used for segmenting the sequence text of each spliced name-sentence pair into characters and inputting it into the input layer of the BERT language pre-training model;
the splicing module, which is used for splicing the hidden vector output by the BERT language pre-training model with the position vector of the name pair in the sentence;
and the corresponding module, which is used for passing the spliced vector through a fully connected layer and a softmax layer to obtain a class distribution probability vector, wherein the relation class corresponding to the maximum value in the class distribution probability vector is the class of the spliced name-sentence pair.
10. The literary character relationship identification system of claim 6, wherein the classification module comprises:
the data acquisition module, which is used for extracting the mood words used to express emotion in the literary work to be analyzed to obtain the number of such mood words;
the calculation module, which is used for substituting the number of mood words used to express emotion into the literary category analysis model to obtain the importance degree parameter of the mood words in the literary work;
and the definition module, which is used for obtaining the literary category to which the literary work to be analyzed belongs according to the importance degree parameter.
CN202211392235.0A 2022-11-08 2022-11-08 Literature figure relation identification method and system Pending CN115618003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211392235.0A CN115618003A (en) 2022-11-08 2022-11-08 Literature figure relation identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211392235.0A CN115618003A (en) 2022-11-08 2022-11-08 Literature figure relation identification method and system

Publications (1)

Publication Number Publication Date
CN115618003A true CN115618003A (en) 2023-01-17

Family

ID=84878271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211392235.0A Pending CN115618003A (en) 2022-11-08 2022-11-08 Literature figure relation identification method and system

Country Status (1)

Country Link
CN (1) CN115618003A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination