CN115374778A - Cosmetic public opinion text entity relation extraction method based on deep learning - Google Patents

Info

Publication number
CN115374778A
Authority
CN
China
Prior art keywords
text
word
cosmetic
vector
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211010810.6A
Other languages
Chinese (zh)
Inventor
左敏
葛伟
路勇
张伟清
许鸣镝
孙磊
王海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
National Institutes for Food and Drug Control
Original Assignee
Beijing Technology and Business University
National Institutes for Food and Drug Control
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University and National Institutes for Food and Drug Control
Publication of CN115374778A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a deep-learning-based method for extracting entity relations from cosmetic public opinion text. The method comprises: preprocessing cosmetic risk public opinion text crawled from the Internet and constructing a cosmetic-domain lexicon; extracting character-level text features with an improved BERT neural network and fusing them with word-level information from word embeddings; computing multi-class relation information with a BLSTM network that incorporates a position-aware attention mechanism; integrating the multi-class information into the character-level text vectors extracted by the improved BERT network; passing the result through the position-aware attention BLSTM again; and finally computing the optimal probability with a CRF to complete the relation extraction. To a certain extent, the invention addresses the low accuracy and strong domain dependence of relation extraction from cosmetic risk public opinion text. It improves extraction accuracy by constructing a new model that uses a character-level representation enriched with Chinese radical information and adds a word-level representation as an auxiliary.

Description

Cosmetic public opinion text entity relation extraction method based on deep learning
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a deep-learning-based method for extracting entity relations from cosmetic public opinion text.
Background
With the introduction of the "Regulations on the Supervision and Administration of Cosmetics", the cosmetics industry has entered a new era and become a focus of public attention. Supporting regulatory documents have been released in succession for public comment, and the regulations establish, for the first time, the legal status of national risk monitoring work.
Safety risk substances in cosmetics are substances that are introduced by cosmetic raw materials, or generated or introduced during production, and that may pose a potential hazard to human health. On the one hand, safety risks arise objectively from the complexity of cosmetic formulas, from limited knowledge of formula ingredients and their potential threats, and from incomplete experience with cosmetic use. On the other hand, in pursuit of high profits, some lawbreakers deliberately add prohibited substances or counterfeit well-known brands, which subjectively creates safety risks. Cosmetic risks are harmful and social in nature: they cause bodily injury and economic loss to varying degrees, and an incident can produce adverse social effects once it ferments in public opinion. Online public opinion forms quickly. People often post their views on the Internet shortly after an incident occurs, and as attention to its progress grows, online public opinion develops even faster, which makes it difficult to control.
Therefore, attending to and assessing online public opinion risk, identifying its degree, and preventing its spread are the first step in managing online public opinion; only by assessing the risk effectively can the appropriate countermeasures be determined. This enables more scientific management of online public opinion. Establishing an entity relation extraction model for risk-related public opinion is therefore of great significance for cosmetic safety supervision.
Relation extraction has developed from pattern matching to statistical machine learning, and deep learning based on artificial neural networks is now dominant. Deep learning treats event extraction not only as a classification task but also as a sequence labeling task.
The two main subtasks of relation extraction are entity recognition and relation classification. Some current models work as a cascade (pipeline): they first identify trigger words and then extract arguments. In this approach, errors from the earlier stage propagate to the later stage. The invention instead uses joint extraction to extract trigger words and arguments simultaneously, improving the performance of both subtasks, and adds global features to express global information between trigger words and arguments.
The invention adopts a joint event extraction model based on BERT-BLSTM-CRF together with a new sequence labeling scheme, which turns event argument extraction into an end-to-end problem and thereby avoids the error propagation of the traditional pipeline model. It also adopts a dual-network structure. The first network takes Chinese characters as input and introduces Chinese radical features to add extra semantic information. The second network takes Chinese words as input and introduces a domain-word mechanism so that the network distinguishes different arguments better and absorbs the textual characteristics of the cosmetic public opinion domain.
Disclosure of Invention
The technical problem to be solved by the invention is to extract information about cosmetic public opinion events quickly and accurately, thereby greatly improving the efficiency of supervisors and assisting their judgment, enabling the shift from "post-event public opinion monitoring" to "pre-event risk early warning", providing a scientific basis for cosmetic safety supervision decisions, and laying the foundation for a national cosmetic safety risk control system.
The method provided by the invention is a deep-learning-based method for extracting entity relations from cosmetic public opinion text, comprising the following steps:
Step 1, according to the four main publishing channels of cosmetic risk public opinion data (official release information, social news, e-commerce platform comment data, and social media information), writing a web crawler for public opinion events in the Python programming language, and applying de-duplication and screening to the raw text crawled by the crawler to form a usable public opinion event text corpus.
Step 2, combining a public-domain word embedding resource library with the professional vocabulary of the cosmetic public opinion supervision domain obtained in step 1 to obtain a word embedding resource library for the cosmetic safety domain. Starting from the public-domain word embedding resource library, the professional vocabulary of the cosmetic public opinion domain is used for incremental training, which yields the cosmetic public opinion domain word embedding resource library.
Step 3, performing semantic role labeling in the form of (entity 1, relation, entity 2) triples on the cosmetic risk public opinion text extracted in step 1, where entity 1 is the subject of the cosmetic public opinion event, entity 2 is its object, and the relation is the connection between entity 1 and entity 2. Entity 1 includes baby cream, the "big-head baby" incident, counterfeit cosmetics and the like; entity 2 includes hormones, preservatives, expired items and the like; and there are 6 relations in total: raw material component, adverse reaction, risk substance, public opinion heat, efficacy claim, and illegal behavior. For the cosmetic risk public opinion text, a sentence is divided into different components. Within the same sentence component, the influence of a core word on neighboring words varies with distance, and the influences of all core words in the sentence on their neighbors are accumulated to model how the whole sentence is affected by position awareness. This position-aware strategy is combined with the traditional attention mechanism to construct a position-aware semantic role attention mechanism;
Step 4, for the cosmetic risk public opinion text extracted in step 1, constructing a pre-training model on the character dimension with BERT (Bidirectional Encoder Representations from Transformers), an encoder based on a bidirectional deep self-attention transformation network, and fusing a radical feature vector into each character vector; then constructing a word-dimension model using the cosmetic public opinion domain word embedding resource library obtained in step 2; passing the character vectors and the word vectors separately through a bidirectional long short-term memory (BLSTM) model and the position-aware semantic role attention mechanism constructed in step 3 to obtain text feature vectors fused with full-text semantic information; and obtaining the multi-class relations of the text through concatenation, a fully connected layer, and a sigmoid layer;
Step 5, inputting the public opinion event text corpus into the BERT-based pre-training model to obtain the character vectors of the text and fusing the Chinese radical feature vectors; adding the multi-class relation information obtained in step 4 to both ends of the text feature vectors extracted by the BERT pre-training model to obtain a text semantic vector that fuses the character and word dimensions; inputting this text semantic vector into the BLSTM model and the conditional random field (CRF) again; and computing the optimal probability with the conditional random field to obtain the final entity relation extraction result for the cosmetic public opinion text.
Further, in step 1, the web crawler built for the cosmetic public opinion domain collects: information released by authoritative research institutions at home and abroad about harm to human, animal, and plant health; adverse reaction monitoring data on cosmetics from domestic and foreign research institutions; authoritative reports from news media at home and abroad; problem and recall information from cosmetic manufacturers covering production, storage, circulation, and sale; information published by industry associations at home and abroad; and product usage posts on social networks, sales comments on e-commerce platforms, and the like. The crawled content is preprocessed to form a usable public opinion event text corpus, from which the professional vocabulary of the cosmetic public opinion domain is extracted.
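The de-duplication and screening in step 1 can be illustrated with a minimal sketch. The patent does not specify how duplicates are detected or how texts are screened, so the whitespace normalization, the SHA-1 hash, the keyword filter, the minimum length, and the names `normalize` and `deduplicate_and_screen` are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate_and_screen(raw_texts, keywords, min_chars=10):
    """Keep the first copy of each text that is long enough and on topic.

    The keyword filter and length threshold are illustrative assumptions,
    not rules taken from the patent.
    """
    seen = set()
    corpus = []
    for raw in raw_texts:
        text = normalize(raw)
        if len(text) < min_chars:
            continue  # too short to describe an event
        if not any(k in text for k in keywords):
            continue  # not related to the cosmetics domain
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        corpus.append(text)
    return corpus


docs = [
    "Infant cream found to contain hormone, regulator says.",
    "Infant cream  found to contain hormone,   regulator says.",
    "Weather is nice today in the city.",
    "Recall",
]
corpus = deduplicate_and_screen(docs, keywords=["cream", "cosmetic"])
```

Exact hashing only removes verbatim copies; near-duplicate reposts would need a fuzzier comparison, which is outside this sketch.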
Further, in step 2, starting from the public-domain word embedding resource library, the cosmetic-domain professional vocabulary obtained in step 1 is input into a skip-gram model to perform incremental training on the public-domain library. As the content crawled in step 1 keeps growing, newly accumulated content is periodically fed into the skip-gram model again once enough of it is available for incremental training. In this way the public-domain word embedding resource library is finally expanded into one suited to the cosmetic public opinion domain.
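As a rough illustration of what incremental skip-gram training consumes, the sketch below extends an existing vocabulary with unseen domain words and generates (center, context) training pairs. It shows only the data preparation; the embedding update itself is omitted. The window size and the names `extend_vocabulary` and `skipgram_pairs` are hypothetical and not taken from the patent.

```python
def extend_vocabulary(vocab, tokens):
    """Add unseen domain words to an existing word -> index map.

    Returns the list of newly added words, in order of first appearance.
    """
    added = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            added.append(tok)
    return added


def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# A public-domain vocabulary that does not yet know the domain terms.
vocab = {"cream": 0, "contains": 1}
sentence = ["cream", "contains", "hormone", "preservative"]

new_words = extend_vocabulary(vocab, sentence)
pairs = skipgram_pairs(sentence, window=1)
```

In a full implementation the new words would receive freshly initialized vectors and the pairs would drive further gradient updates on top of the existing public-domain embeddings.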
Further, in step 3, semantic role labeling in the triple form (entity 1, relation, entity 2) is performed on the cosmetic risk public opinion text extracted in step 1, where entity 1 is the subject of the cosmetic public opinion event, entity 2 is its object, and the relation is the connection between entity 1 and entity 2. Entity 1 includes baby cream, the "big-head baby" incident, counterfeit and substandard cosmetics and the like; entity 2 includes hormones, preservatives, expired items and the like; and there are 6 relations: raw material component, adverse reaction, risk substance, public opinion heat, efficacy claim, and illegal behavior. Semantic role labeling divides the sentence into different components and locates each word within its component. A position-aware influence vector is generated for each word through influence propagation, the word weights are updated using the position awareness of the context semantics, and a position-aware semantic role attention mechanism is thereby constructed.
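One possible in-memory representation of the labeled triples is sketched below. The six relation names follow the text; the class names, the English identifiers, and the multi-hot encoding helper are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The six relation types named in step 3."""
    RAW_MATERIAL_COMPONENT = "raw material component"
    ADVERSE_REACTION = "adverse reaction"
    RISK_SUBSTANCE = "risk substance"
    PUBLIC_OPINION_HEAT = "public opinion heat"
    EFFICACY_CLAIM = "efficacy claim"
    ILLEGAL_BEHAVIOR = "illegal behavior"


@dataclass(frozen=True)
class Triple:
    """An (entity 1, relation, entity 2) annotation."""
    entity1: str
    relation: Relation
    entity2: str


def relation_to_multi_hot(relations):
    """Encode a set of relations as a 6-dimensional 0/1 vector."""
    return [1 if r in relations else 0 for r in Relation]


example = Triple("baby cream", Relation.RISK_SUBSTANCE, "hormone")
labels = relation_to_multi_hot({example.relation})
```

A multi-hot vector is used here because a single text may express several relations at once, which matches the multi-class sigmoid output described for step 4.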
The specific process for constructing the semantic role attention mechanism based on the position perception comprises the following steps:
(1) The attention of the word at position $j$ of the sentence is:
$$\alpha_j = \frac{\exp\big(a(h_j, p_j)\big)}{\sum_{i=1}^{\mathrm{len}} \exp\big(a(h_i, p_i)\big)} \tag{1}$$
In formula (1), $h_j$ is the hidden-layer vector of the word at position $j$, $p_j$ is the accumulated position-aware influence vector of that word, $\mathrm{len}$ is the number of words in the sentence, $h_i$ and $p_i$ are the hidden-layer vector and the position-aware influence vector of the word at position $i$ in the sentence, and $a(\cdot)$ is a function that measures the importance of a word based on its hidden-layer vector and its position-aware influence vector;
(2) The specific form of $a(\cdot)$ is:
$$a(h_j, p_j) = v^{T}\,\phi\big(W_H h_j + W_P p_j + b_1\big) + b_2 \tag{2}$$
In formula (2), $W_H$ and $W_P$ are the weight matrices of $h_j$ and $p_j$; $b_1$ is a bias vector belonging to the first-layer parameters; $\phi$ is the ReLU function; $v$ is a global vector and $v^{T}$ is its transpose; $b_2$ is a bias belonging to the second-layer parameters. The same form is applied to every position $i = 1, \dots, \mathrm{len}$ in the sentence, where $\mathrm{len}$ is the number of words.
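The attention computation can be sketched in plain Python. The sketch assumes the form suggested by the symbol definitions above: the score is a(h, p) = v^T ReLU(W_H h + W_P p + b_1) + b_2, and the attention is a softmax of these scores over the sentence. Treating b_2 as a scalar, the toy dimensions, and all numeric values are assumptions for illustration only.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def score(h, p, W_H, W_P, b1, v, b2):
    """a(h, p) = v^T ReLU(W_H h + W_P p + b1) + b2 (b2 taken as a scalar)."""
    pre = [a + b + c for a, b, c in zip(matvec(W_H, h), matvec(W_P, p), b1)]
    hidden = [max(0.0, z) for z in pre]  # ReLU
    return sum(vi * zi for vi, zi in zip(v, hidden)) + b2


def attention_weights(H, P, W_H, W_P, b1, v, b2):
    """Softmax of the scores over all positions of the sentence."""
    scores = [score(h, p, W_H, W_P, b1, v, b2) for h, p in zip(H, P)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sentence of three words: 2-dim hidden vectors, 1-dim influence vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = [[0.5], [1.0], [0.2]]
W_H = [[1.0, 0.0], [0.0, 1.0]]
W_P = [[1.0], [1.0]]
alphas = attention_weights(H, P, W_H, W_P, b1=[0.0, 0.0], v=[1.0, 1.0], b2=0.0)
```

With these toy values the second word has the largest position-aware influence and therefore receives the highest attention weight.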
Further, in step 4, the usable public opinion event text corpus formed in step 1 is input into the BERT pre-training model to obtain a vectorized representation of the text. Specifically, the whole input text is segmented into sentences and encoded with a deep self-attention transformation network. After encoding, part of each sentence is masked, the masked content is predicted from the remaining content, and the prediction is compared with the real masked content to obtain a prediction error, according to which the model parameters are adjusted. This prediction task maps the input text into a vector space and yields a character-level vectorized representation of the text (one Chinese character per unit). For the radical of each character, 48 dimensions of additional semantic information are then added to the 768-dimensional character vector. On the word dimension, Chinese word segmentation is performed first, and the words are vectorized through the cosmetic public opinion domain word embedding resource library constructed in step 2, giving a word-level text input vector (one Chinese word per unit). The character vectors and the word vectors are input into the BLSTM model separately, and the attention distribution coefficient $r_a$ is computed through the position-aware semantic role attention mechanism constructed in step 3, as follows:
$$r_a = \sum_{j=1}^{\mathrm{len}} \alpha_j h_j \tag{3}$$
In formula (3), $h_j$ is the hidden-layer vector of the word at position $j$, $\alpha_j$ is the attention of the word at position $j$, and $\mathrm{len}$ is the number of words in the sentence;
the resulting attention distribution coefficients are applied to the hidden-layer vectors of the BLSTM, each word is weighted accordingly to obtain the text features under the influence of the attention mechanism, the character-dimension and word-dimension results are joined by concatenation, and the multi-class relations of the input text are finally obtained through a fully connected layer and a sigmoid layer.
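The weighting and classification described here can be sketched as follows, assuming that the attended text feature is the attention-weighted sum of the BLSTM hidden vectors and that the classifier is a single fully connected layer followed by an element-wise sigmoid. The toy dimensions, weights, biases, and the 0.5 decision threshold are assumptions for illustration.

```python
import math


def weighted_sum(alphas, H):
    """Attention-weighted sum of hidden vectors: sum_j alpha_j * h_j."""
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_head(char_feat, word_feat, W, b):
    """Concatenate both features, then apply a dense layer and a sigmoid."""
    x = char_feat + word_feat  # list concatenation of the two branches
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


# Character branch and word branch, each with two positions and 2-dim states.
r_char = weighted_sum([0.25, 0.75], [[1.0, 0.0], [0.0, 1.0]])
r_word = weighted_sum([0.5, 0.5], [[2.0, 0.0], [0.0, 2.0]])

# Six output units, one per relation type; zero weights keep the toy simple.
W = [[0.0] * 4 for _ in range(6)]
b = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
probs = multi_label_head(r_char, r_word, W, b)
predicted = [i for i, p in enumerate(probs) if p >= 0.5]
```

A sigmoid per output unit, rather than a softmax, lets several relation types be active for the same text.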
Further, in step 5, the public opinion event text corpus is input into the BERT pre-training model to obtain a vectorized representation of the text, and the Chinese radical feature vector (48 dimensions) is fused into each character vector (768 dimensions). The multi-class result of step 4 is then expanded into a vector of 768 + 48 dimensions and concatenated at both ends of the character vector matrix of the input text, giving a text vector fused with full-text semantic information. This text vector is input into the BLSTM model, the entity relations of the input text are determined through the semantic role attention mechanism constructed in step 3, and the final entity relation extraction result for the cosmetic public opinion text is obtained after the optimal probability is computed by the conditional random field (CRF).
Compared with the prior art, the invention has the advantages that:
The invention constructs a character-word dual-dimension model for extracting relations from event text, built from an improved BERT network (an encoder based on a bidirectional deep self-attention transformation network) and a bidirectional long short-term memory network (BLSTM) fused with a semantic role attention mechanism. The model can quickly and accurately identify key information in cosmetic public opinion events and is more comprehensive for relation extraction in the cosmetic public opinion domain: it takes two different distributed text representations, at the character level and at the word level, as input, and integrates the resulting multi-class information into the text vector carrying full-text semantic information to complete the relation extraction. The model makes full use of the characteristics of BERT and adds radical feature vectors to the pre-trained character vectors so that they carry richer semantic information. By computing a position-aware semantic role attention mechanism through influence propagation, it overcomes the dependence of word attention weights on the hidden-layer representation in the traditional attention mechanism. It also uses the word embedding vectors of the text as supplementary information to the character vectors, which further mines the text semantics, avoids the loss of classification accuracy caused by incomplete feature extraction from unstructured and non-standard text corpora, and effectively improves event relation extraction.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the character-word dual-dimension text entity relation extraction model.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, the deep-learning-based method of the present invention for extracting entity relations from cosmetic public opinion text comprises: preprocessing public opinion event data crawled from the Internet; constructing a resource library for the cosmetic public opinion domain and performing incremental training with the domain corpus; extracting character-level text features with an improved BERT neural network and fusing them with word-level information from word embeddings; computing multi-class information with a BLSTM network; integrating the multi-class information into the character-level text vectors extracted by the improved BERT network; and finally computing the optimal probability with a CRF. To a certain extent, the method addresses the low accuracy and strong domain dependence of event text relation extraction in the cosmetic public opinion domain. It improves the accuracy of event relation extraction by constructing a new model that uses a character-level representation fused with Chinese radical features as the text vectorization and adds a word-level representation as an auxiliary.
The method specifically comprises the following steps:
Step 1, writing a web crawler for public opinion events in the Python programming language according to the characteristics of the cosmetic public opinion domain. The crawled content includes: information released by authoritative research institutions at home and abroad about harm to human, animal, and plant health; adverse reaction monitoring data on cosmetics from domestic and foreign research institutions; authoritative reports from news media at home and abroad; problem and recall information from cosmetic manufacturers covering production, storage, circulation, and sale; information published by industry associations at home and abroad; and product usage posts on social networks, sales comments on e-commerce platforms, and the like. De-duplication and screening are applied to the raw text crawled by the crawler to form a usable public opinion event text corpus. An improved Jieba method is used to segment the cosmetic risk public opinion text into words, and meaningless stop words are removed from the raw text. A lexicon for the cosmetic public opinion domain is then constructed by combining pointwise mutual information (PMI) calculation with manual screening and supplementation.
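The PMI calculation mentioned in this step can be illustrated with a small sketch that scores adjacent token pairs as lexicon candidates. The patent does not state how probabilities are estimated, so the simple relative-frequency estimates, the base-2 logarithm, and the function name `pmi_scores` are assumptions; the toy documents stand in for already segmented text.

```python
import math
from collections import Counter


def pmi_scores(tokenized_docs):
    """PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ) for adjacent token pairs."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


docs = [
    ["baby", "cream", "contains", "hormone"],
    ["baby", "cream", "recalled"],
    ["cream", "contains", "preservative"],
]
scores = pmi_scores(docs)

# Pairs with high PMI are candidates for manual review before entering the lexicon.
candidates = sorted(scores, key=scores.get, reverse=True)
```

Pairs that co-occur more often than their individual frequencies predict get a high score; manual screening then decides which candidates are genuine domain terms.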
Step 2, combining a public-domain word embedding resource library with the cosmetic public opinion domain lexicon obtained in step 1 to obtain a word embedding resource library for the cosmetic public opinion domain. Starting from the public-domain word embedding resource library, the cosmetic-domain professional vocabulary obtained in step 1 is input into a skip-gram model. As the content crawled in step 1 keeps growing, incremental training is performed on the public-domain library at intervals, which finally expands it into a word embedding resource library suited to the cosmetic public opinion domain.
Step 3, performing semantic role labeling in the triple form (entity 1, relation, entity 2) on the cosmetic risk public opinion text extracted in step 1. Entity 1 is the subject of the cosmetic public opinion event and includes baby cream, the "big-head baby" incident, counterfeit cosmetics and the like. Entity 2 is the object of the cosmetic public opinion event and includes hormones, preservatives, expired items and the like. The relation is the connection between entity 1 and entity 2, of which there are 6 types: raw material component, adverse reaction, risk substance, public opinion heat, efficacy claim, and illegal behavior. Semantic role labeling divides the sentence into different components so that position-attention influence propagates only within the same sentence component. Each word is located within its component, a position-aware influence vector is generated for each word through influence propagation, the word weights are updated using the position awareness of the context semantics, and a position-aware semantic role attention mechanism is thereby constructed.
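The propagation rule can be sketched as below. The text says only that influence varies with distance and stays within one sentence component; the Gaussian decay, the `sigma` parameter, the reduction of the influence vector to a single scalar per word, and the component labels are all assumptions made for illustration.

```python
import math


def position_influence(components, core_positions, sigma=1.0):
    """Accumulate distance-decayed influence of core words on each word.

    components[j] is the sentence-component id of word j, and
    core_positions lists the indices of the core words. Influence is
    propagated only between words that share a component.
    """
    influence = [0.0] * len(components)
    for c in core_positions:
        for j, comp in enumerate(components):
            if comp != components[c]:
                continue  # no propagation across sentence components
            # Gaussian kernel: an assumed choice of distance decay.
            influence[j] += math.exp(-((j - c) ** 2) / (2 * sigma ** 2))
    return influence


# Six words: two in the subject component, one relation word, three in the object.
components = ["SUBJ", "SUBJ", "REL", "OBJ", "OBJ", "OBJ"]
influence = position_influence(components, core_positions=[1, 4])
```

In a full model each scalar would be replaced by a vector, for example by mapping the accumulated influence through a learned embedding, before it enters the attention score.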
The specific process for constructing the semantic role attention mechanism based on the position perception comprises the following steps:
(1) The attention of the word at position $j$ of the sentence is:
$$\alpha_j = \frac{\exp\big(a(h_j, p_j)\big)}{\sum_{i=1}^{\mathrm{len}} \exp\big(a(h_i, p_i)\big)} \tag{1}$$
In formula (1), $h_j$ is the hidden-layer vector of the word at position $j$, $p_j$ is the accumulated position-aware influence vector of that word, $\mathrm{len}$ is the number of words in the sentence, $h_i$ and $p_i$ are the hidden-layer vector and the position-aware influence vector of the word at position $i$ in the sentence, and $a(\cdot)$ is a function that measures the importance of a word based on its hidden-layer vector and its position-aware influence vector;
(2) The specific form of $a(\cdot)$ is:
$$a(h_j, p_j) = v^{T}\,\phi\big(W_H h_j + W_P p_j + b_1\big) + b_2 \tag{2}$$
In formula (2), $W_H$ and $W_P$ are the weight matrices of $h_j$ and $p_j$; $b_1$ is a bias vector belonging to the first-layer parameters; $\phi$ is the ReLU function; $v$ is a global vector and $v^{T}$ is its transpose; $b_2$ is a bias belonging to the second-layer parameters. The same form is applied to every position $i = 1, \dots, \mathrm{len}$ in the sentence, where $\mathrm{len}$ is the number of words.
Step 4, inputting the usable public opinion event text corpus formed in step 1 into the BERT pre-training model to obtain a vectorized representation of the text. Specifically, the whole input text is segmented into sentences and encoded with a deep self-attention transformation network. After encoding, part of each sentence is masked, the masked content is predicted from the remaining content, and the prediction is compared with the real masked content to obtain a prediction error, according to which the model parameters are adjusted. This prediction task maps the input text into a vector space and yields a character-level vectorized representation of the text (one Chinese character per unit). Then, in view of the particular role that Chinese radicals, as products of the evolution of the script, play in cosmetic public opinion text, 48 dimensions of additional semantic information for the radical of each character are added to the 768-dimensional character vector. On the word dimension, Chinese word segmentation is performed first, and the words are vectorized through the cosmetic public opinion domain word embedding resource library constructed in step 2, giving a word-level text input vector (one Chinese word per unit). The character vectors and the word vectors are input into the BLSTM model separately, and the attention distribution coefficient $r_a$ is computed through the position-aware semantic role attention mechanism constructed in step 3, as follows:
$$r_a = \sum_{j=1}^{\mathrm{len}} \alpha_j h_j \tag{3}$$
In formula (3), $h_j$ is the hidden-layer vector of the word at position $j$, $\alpha_j$ is the attention of the word at position $j$, and $\mathrm{len}$ is the number of words in the sentence. The attention distribution coefficients are applied to the hidden-layer vectors of the BLSTM, each word is weighted accordingly to obtain the text features under the influence of the attention mechanism, the character-dimension and word-dimension results are joined by concatenation, and the multi-class relations of the input text are finally obtained through a fully connected layer and a sigmoid layer.
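The fusion of radical features into the character vectors in step 4 can be sketched as a per-character concatenation of a 768-dimensional vector with a 48-dimensional radical vector. The dimensions come from the text; the lookup tables, the zero-vector fallback for unknown radicals, and the name `fuse_radical_features` are hypothetical.

```python
CHAR_DIM = 768     # dimension of each character vector, per the text
RADICAL_DIM = 48   # dimension of the radical feature vector, per the text


def fuse_radical_features(chars, char_vectors, char_to_radical, radical_table):
    """Append each character's radical vector to its character vector.

    Characters whose radical is missing from the table get a zero vector,
    which is an assumed fallback rather than a rule from the patent.
    """
    zero = [0.0] * RADICAL_DIM
    fused = []
    for ch, vec in zip(chars, char_vectors):
        radical = char_to_radical.get(ch)
        rad_vec = radical_table.get(radical, zero)
        fused.append(list(vec) + list(rad_vec))
    return fused


chars = ["油", "霜"]
char_vectors = [[0.0] * CHAR_DIM, [1.0] * CHAR_DIM]
char_to_radical = {"油": "氵", "霜": "雨"}
radical_table = {"氵": [0.1] * RADICAL_DIM}  # "雨" is deliberately absent

fused = fuse_radical_features(chars, char_vectors, char_to_radical, radical_table)
```

Each fused row has 768 + 48 = 816 dimensions, which is the width the later steps operate on.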
Step 5: the public opinion event text corpus is input into the BERT-based pre-training model to obtain the character vectors of the text, and 48 dimensions of additional semantic information for the Chinese radical of each character are added to the 768-dimensional character vector. The multi-classification relation information (6-dimensional) obtained in step 4 is expanded 136 times to 768+48 dimensions and attached to both ends of the character-dimension text feature vectors extracted by the BERT pre-training model, giving a text semantic vector fused with full-text semantic information. This vector is input into a BLSTM model and a conditional random field (CRF), and the optimal probability is calculated by the CRF to obtain the final cosmetic public opinion text entity relation extraction result.
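The dimension arithmetic in step 5 (6 × 136 = 816 = 768 + 48) can be sketched as follows. The text does not say how the 6-dimensional vector is expanded, so repeating the whole vector 136 times with `np.tile` is an assumption, as are the function names and the toy sentence length.

```python
import numpy as np

CHAR_DIM = 768 + 48                 # character vector plus radical features = 816
NUM_RELATIONS = 6                   # size of the multi-classification result
REPEAT = CHAR_DIM // NUM_RELATIONS  # 136

def expand_relations(rel):
    # Assumption: the 6-dim vector is repeated as a whole 136 times (6 -> 816).
    return np.tile(rel, REPEAT)

def fuse(char_matrix, rel):
    # Attach the expanded relation vector as an extra row at both ends
    # of the (len, 816) character vector matrix.
    row = expand_relations(rel)[None, :]
    return np.concatenate([row, char_matrix, row], axis=0)

# Hypothetical toy input: a sentence of 10 characters.
char_matrix = np.zeros((10, CHAR_DIM))
rel = np.array([0.9, 0.1, 0.0, 0.2, 0.0, 0.7])
fused = fuse(char_matrix, rel)
print(fused.shape)  # (12, 816)
```

The result has two more rows than the input sentence, so the downstream BLSTM and CRF see the relation information at the start and at the end of the sequence.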
Fig. 1 is an overall schematic diagram of the method provided by the invention. The crawled cosmetic public opinion data is preprocessed and a cosmetic public opinion field resource library is constructed; the cosmetic public opinion field word embedding resource library is built by incremental training on the public field word embedding resource library with supplementary corpora; the character-dimension text vectorization is obtained through the BERT pre-training model and the word-dimension text vectorization through word embedding, giving character-word dual-dimension text feature vectors; the multi-classification relation is extracted from these vectors; and finally the entity relations of the cosmetic public opinion event text are extracted.
In the model diagram shown in Fig. 2, the word embedding network at the lower right produces the word-dimension text vectorization, and the BERT network at the lower left produces the character-dimension text vectorization fused with Chinese radical features. Each is processed by a BLSTM network fused with the position-aware attention mechanism (the semantic role part in the middle of the diagram), and the two outputs are connected. The multi-classification result is added to the text vector of the upper BERT network and processed again by a BLSTM fused with the position-aware attention mechanism. Finally, the CRF calculates the optimal probability to obtain the optimal output label sequence, and the event text relation extraction result is obtained from the text at the positions corresponding to the sequence labeling result.
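The final CRF step, which picks the label sequence with the highest score, is commonly implemented with Viterbi decoding. The sketch below is a generic NumPy version under that assumption; the text does not describe the decoding procedure, and the emission and transition scores here are hypothetical toy values rather than outputs of the described model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions: (T, K) per-position label scores (e.g. from the BLSTM).
    # transitions: (K, K) score for moving from label i to label j.
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in label i at t-1, then moving to j.
        cand = score[:, None] + transitions
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final label.
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# Hypothetical toy example with 3 positions and 2 labels.
emissions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
free = np.zeros((2, 2))                        # no transition preference
sticky = np.array([[0.0, -5.0], [-5.0, 0.0]])  # penalize label changes
print(viterbi_decode(emissions, free))    # [0, 1, 0]
print(viterbi_decode(emissions, sticky))  # [0, 0, 0]
```

The second call shows why a CRF is used on top of per-position scores: transition scores can override a locally preferred label when the resulting sequence as a whole scores higher.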
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions that make use of the inventive concept are protected, provided they do not depart from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A cosmetic public opinion text entity relation extraction method based on deep learning is characterized by comprising the following steps:
step 1, for the four publishing channels of cosmetic risk public opinion data (official release information, social news, e-commerce platform comment data and social media related information), search engine technology and network information mining technology are used; the original text data obtained by a crawler is deduplicated and screened to form the public opinion text corpus; for Chinese text, word segmentation is performed with an improved Jieba method and meaningless stop words in the original text data are removed; a cosmetic public opinion field lexicon is then constructed based on pointwise mutual information (PMI) calculation and manual screening and correction, giving the extracted professional vocabulary of the cosmetic public opinion field;
step 2, aiming at the professional vocabularies of the cosmetic public opinion field extracted in the step 1, performing incremental training on a public field word embedding resource library to obtain a cosmetic public opinion field word embedding resource library;
step 3, semantic role labeling of (entity 1, relation, entity 2) triples is performed on the cosmetic risk public opinion text extracted in step 1, wherein entity 1 is the subject of a cosmetic public opinion event, entity 2 is the object of the event, and the relation is the relation between entity 1 and entity 2; entity 1 includes baby cream, the "big-head baby" incident and counterfeit cosmetics; entity 2 includes hormones, preservatives and expired components; there are 6 relations in total: raw material components, adverse reactions, risk substances, public opinion heat, efficacy claims and illegal behaviors; for the cosmetic risk public opinion text, a sentence is divided into different components, the degree of influence of a core word on neighbouring words within the same sentence component varies with distance, and the influences of all core words in the sentence on neighbouring words are accumulated to model the position-aware state of the whole sentence; this position-aware strategy is combined with the conventional attention mechanism to construct the position-aware semantic role attention mechanism;
step 4, for the cosmetic risk public opinion text extracted in step 1, the BERT encoder based on a bidirectional deep self-attention transformation network is used to construct character vectors fused with Chinese radical features, word vectors are constructed using the cosmetic public opinion field word embedding resource library obtained in step 2, and the character vectors and word vectors are passed through a bidirectional long short-term memory (BLSTM) model with the position-aware semantic role attention mechanism constructed in step 3 to obtain the multi-classification relation of the input text;
step 5, character vectors fused with Chinese radical features are extracted from the input text by the BERT encoder based on a bidirectional deep self-attention transformation network, the multi-classification relation information obtained in step 4 is added to the text feature vectors extracted by the BERT pre-training model to obtain a text semantic vector fusing the character and word dimensions, and the text semantic vector is input into the BLSTM model and the conditional random field (CRF) to obtain the final cosmetic public opinion text entity relation extraction result.
2. The method for extracting entity relations from cosmetic public opinion text based on deep learning as claimed in claim 1, wherein: in step 1, when the web crawler for the cosmetic public opinion field is constructed, the following are crawled: information issued by authoritative research institutions at home and abroad on harm to the health of humans, animals and plants; adverse reaction monitoring data on cosmetics from domestic and foreign research institutions; authoritative reports from domestic and foreign news media; problems and recall information of cosmetic manufacturers in the production, storage, circulation and sale links; information published by cosmetics industry associations at home and abroad; and product use sharing information on social networks and sales comment information on e-commerce platforms; these form the cosmetic public opinion text corpus, from which the cosmetic public opinion field lexicon is constructed.
3. The method for extracting entity relations from cosmetic public opinion text based on deep learning as claimed in claim 1, wherein: in step 2, on the basis of the public field word embedding resource library, the cosmetic field professional vocabulary obtained in step 1 is input into the leap model for incremental training; as the content crawled in step 1 continues to grow, it is input into the leap model at intervals to incrementally train the public field word embedding resource library, which is finally expanded into a word embedding resource library suitable for the cosmetic public opinion field.
4. The method for extracting entity relation of cosmetic public opinion text based on deep learning of claim 1, wherein the method comprises the following steps: in the step 3, a semantic role attention mechanism based on location awareness is constructed in the following specific process:
(1) The attention of the word at position j of the sentence is:
Figure FDA0003810773110000021
In formula (1), h_j is the hidden layer vector of the word at position j, p_j is the accumulated position-aware influence vector of that word, len is the number of words in the sentence, h_i is the hidden layer vector of the word at a given position i in the sentence, p_i is the accumulated position-aware influence vector of that word, and a(·) measures the importance of a word based on its hidden layer vector and its position-aware influence vector;
(2) The specific form of a(·) is:
Figure FDA0003810773110000022
In formula (2), W_H and W_P are the weight matrices of h_j and p_j; b_1 is a bias vector belonging to the first-layer parameters;
Figure FDA0003810773110000023
is the ReLU function; v is a global vector and v^T denotes its transpose; b_2 is a bias vector belonging to the second-layer parameters; len is the number of words in the sentence; and i indexes a word at a given position in the sentence.
5. The method for extracting entity relations from cosmetic public opinion text based on deep learning as claimed in claim 1, wherein: in step 4, when the public opinion event text corpus is input into the BERT pre-training model to obtain the vectorized representation of the text, the whole input text is segmented into sentences and encoded with a deep self-attention transformation network; part of each encoded sentence is masked, the masked content is predicted from the remaining content of the sentence, the prediction is compared with the real masked content to obtain a prediction error, and the model parameters are adjusted according to that error; through this prediction the input text is mapped into a vector space to obtain the character-dimension text vectorization, and 48 dimensions of additional Chinese radical semantic information are added to the 768-dimensional character vector according to the similarity of Chinese radicals in cosmetic public opinion field text; the word-dimension text input vector is obtained through the cosmetic public opinion field word embedding resource library constructed in step 2; the character vector and the word vector are each input into a BLSTM model, and the entity relation of the input text is judged through the semantic role attention mechanism constructed in step 3; the word attention distribution coefficients calculated by the position-aware semantic role attention mechanism are propagated into the hidden layer vectors of the BLSTM, and each word is weighted to obtain the text features under the influence of the attention mechanism, wherein the attention distribution coefficient r_a is calculated as follows:
Figure FDA0003810773110000031
in formula (3), h_j is the hidden layer vector of the word at position j, α_j is the attention of the word at position j, and len is the number of words in the sentence;
after the feature outputs of the character and word dimensions are obtained, the two outputs are connected, and the multi-classification relation of the input text is finally obtained through a fully connected layer and a sigmoid layer.
6. The method for extracting entity relations from cosmetic public opinion text based on deep learning as claimed in claim 1, wherein: in step 5, the public opinion event text corpus is input into the BERT pre-training model to obtain the vectorized representation of the text, giving character vectors (768+48 dimensions) that contain Chinese radical information; the multi-classification result (6 dimensions) from step 4 is expanded 136 times so that its length matches that of the character vector, and is spliced onto both ends of the character vector matrix of the input text to obtain a text vector with richer semantic features; the text vector is input into a BLSTM model for calculation, the entity relation of the input text is judged through the position-aware semantic role attention mechanism constructed in step 3, and the final cosmetic public opinion text entity relation extraction result is obtained after the optimal probability is calculated by the conditional random field (CRF).
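The position-aware attention of claim 4 can be sketched in NumPy from the symbol definitions alone. Formulas (1) and (2) are only available as images, so the forms used here are assumptions: a(h, p) = v^T ReLU(W_H h + W_P p + b_1) + b_2 for formula (2), and a softmax of a(h_i, p_i) over the len positions for formula (1). In this sketch b_2 is treated as a scalar, and all names, sizes and random inputs are hypothetical.

```python
import numpy as np

def importance(h, p, W_H, W_P, b1, v, b2):
    # Assumed form of formula (2): two-layer scoring with a ReLU in between.
    return v @ np.maximum(W_H @ h + W_P @ p + b1, 0.0) + b2

def attention_weights(H, P, W_H, W_P, b1, v, b2):
    # Assumed form of formula (1): softmax of the importance scores
    # over all positions i = 1..len of the sentence.
    s = np.array([importance(h, p, W_H, W_P, b1, v, b2) for h, p in zip(H, P)])
    s = s - s.max()  # shift for numerical stability; softmax is unchanged
    e = np.exp(s)
    return e / e.sum()

# Hypothetical toy sizes: 5 words, hidden size 8, position vector size 4.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))   # hidden layer vectors h_i
P = rng.normal(size=(5, 4))   # accumulated position-aware influence vectors p_i
W_H = rng.normal(size=(16, 8))
W_P = rng.normal(size=(16, 4))
b1 = np.zeros(16)
v = rng.normal(size=16)
b2 = 0.0

alpha = attention_weights(H, P, W_H, W_P, b1, v, b2)
print(alpha.shape)
```

How the vectors p_i are accumulated from the distance to each core word is not specified numerically in the claims, so the sketch takes them as given inputs.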
CN202211010810.6A 2022-08-08 2022-08-23 Cosmetic public opinion text entity relation extraction method based on deep learning Pending CN115374778A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210943515X 2022-08-08
CN202210943515 2022-08-08

Publications (1)

Publication Number Publication Date
CN115374778A true CN115374778A (en) 2022-11-22

Family

ID=84068183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211010810.6A Pending CN115374778A (en) 2022-08-08 2022-08-23 Cosmetic public opinion text entity relation extraction method based on deep learning

Country Status (1)

Country Link
CN (1) CN115374778A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969269A (en) * 2022-06-23 2022-08-30 济南大学 False news detection method and system based on entity identification and relation extraction
CN116227496A (en) * 2023-05-06 2023-06-06 国网智能电网研究院有限公司 Deep learning-based electric public opinion entity relation extraction method and system
CN116522165A (en) * 2023-06-27 2023-08-01 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN116522165B (en) * 2023-06-27 2024-04-02 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN117235286A (en) * 2023-11-10 2023-12-15 昆明理工大学 Attention-strengthening entity relation extraction model, construction method thereof and storage medium
CN117235286B (en) * 2023-11-10 2024-01-23 昆明理工大学 Attention-strengthening entity relation extraction model, construction method thereof and storage medium

Similar Documents

Publication Publication Date Title
CN115374778A (en) Cosmetic public opinion text entity relation extraction method based on deep learning
Zhong et al. Deep learning-based extraction of construction procedural constraints from construction regulations
CN105512687A (en) Emotion classification model training and textual emotion polarity analysis method and system
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
CN110502626A (en) A kind of aspect grade sentiment analysis method based on convolutional neural networks
Fahfouh et al. PV-DAE: A hybrid model for deceptive opinion spam based on neural network architectures
CN110889786A (en) Legal action insured advocate security use judging service method based on LSTM technology
Zhao et al. ZYJ123@ DravidianLangTech-EACL2021: Offensive language identification based on XLM-RoBERTa with DPCNN
Kleenankandy et al. An enhanced Tree-LSTM architecture for sentence semantic modeling using typed dependencies
CN113127933B (en) Intelligent contract Pompe fraudster detection method and system based on graph matching network
Mehndiratta et al. Identification of sarcasm using word embeddings and hyperparameters tuning
CN114330338A (en) Program language identification system and method fusing associated information
CN109241199A (en) A method of it is found towards financial knowledge mapping
Poria et al. Sentic Demo: A hybrid concept-level aspect-based sentiment analysis toolkit
CN114881042A (en) Chinese emotion analysis method based on graph convolution network fusion syntax dependence and part of speech
CN114610846A (en) Knowledge graph expanding and complementing method for heuristic bionic knowledge grafting strategy
CN115329085A (en) Social robot classification method and system
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
Zhang et al. Aspect-level sentiment analysis via a syntax-based neural network
Sharma et al. Various methods to classify the polarity of text based customer reviews using sentiment analysis
Ciroku et al. Automated multimodal sensemaking: Ontology-based integration of linguistic frames and visual data
Wehnert et al. Applying BERT embeddings to predict legal textual entailment
CN113468884A (en) Chinese event trigger word extraction method and device
Antia et al. Assessing and enhancing bottom-up CNL design for competency questions for ontologies
Singh et al. Deep Learning Model for Interpretability and Explainability of Aspect-Level Sentiment Analysis Based on Social Media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination