CN110188191A - Entity relationship map construction method and system for Web community text - Google Patents
- Publication number
- CN110188191A (application CN201910277242.8A)
- Authority
- CN
- China
- Prior art keywords
- entity
- relationship
- text
- model
- web community
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
This application discloses an entity relationship map construction method and system for Web community text. The method comprises: collecting text from web pages, performing entity recognition and entity relation extraction on it, and constructing a semantic model; collecting text from a Web community, performing entity recognition and entity relation extraction on it, and obtaining a network entity relationship set; classifying the network entity relationship set with classification models to obtain entity pairs; performing hierarchical classification calculation on the entity pairs and merging them into the semantic model; and visualizing the fused semantic model to obtain an entity relationship map. Generating the semantic model from the plain text of selected web pages ensures the accuracy and reliability of the entity relationships. Training the classification models with a classification algorithm on the core entity relationship set, and then evaluating them, increases the reliability of classification. Adding the network entity relationship sets that pass evaluation to the core semantic model increases its richness, stability, and ability to extend automatically.
Description
Technical field
This application relates to the field of information processing, and in particular to an entity relationship map construction method and system for Web community text.
Background
A Web community is conceptually the same as a community in the ordinary sense: it involves a certain locality, a certain population, some form of organization, the participation of its members, and shared interests and culture. A Web community provides various means of exchanging information, such as discussion, communication, and chat, so that its residents can interact. With the rapid development of the Internet, people's real lives and Web communities have become ever more closely intertwined. People like to record their daily lives in Web communities, discuss current political hot spots and everyday social affairs, offer their own thoughts and opinions on trending news, and strengthen their sense of community participation in many ways. This enthusiasm has driven the growth of Web communities, and their diversity meets people's varied needs. People discuss celebrities' lives on Weibo, celebrity gossip on Tianya, current political hot spots in news communities, and literature on Baidu Tieba. Web communities have penetrated every aspect of daily life.
As Web communities flourish, they carry more and more social attributes. The information they contain forms very typical textual features within a specific region or a specific group of people, and to some extent reflects people's needs and wishes. Analyzing the text of one or more similar Web communities for public opinion events that carry deep semantic relations helps community administrators manage the community and understand trends in residents' lives, build better communication channels for residents, and thereby raise community activity while responding to what residents think. Network management and policy-making bodies can likewise keep track of hot topics in the community in a timely manner, follow the direction of public opinion, understand public sentiment, and obtain sound decision support for community management and network economic governance.
Traditional analysis of text information relies mainly on techniques such as keyword matching and topic clustering to extract and understand the information, but these stop at shallow-semantic extraction and analysis and cannot mine and express public opinion information from the perspective of deep semantics. Moreover, most of this research has been carried out on long texts in the news and medical domains, whereas Web community text is mostly short and colloquial. Because of problems such as non-standard expression, no single technique or combination of existing techniques can be applied directly to accurately discover and identify the popular public opinion information such text contains.
In summary, there is a need for an entity relationship map construction method and system that is suited to short Web community text and can mine and express public opinion information from the perspective of deep semantics.
Summary of the invention
To solve the above problems, this application proposes an entity relationship map construction method and system for Web community text.
In one aspect, this application proposes an entity relationship map construction method for Web community text, comprising:
collecting text from web pages;
performing entity recognition and entity relation extraction on the text from the web pages, and constructing a semantic model;
collecting text from a Web community;
performing entity recognition and entity relation extraction on the text from the Web community, and obtaining a network entity relationship set;
classifying the network entity relationship set using classification models to obtain entity pairs;
performing hierarchical classification calculation on the entity pairs, and merging the entity pairs into the semantic model;
visualizing the fused semantic model to obtain an entity relationship map.
Preferably, collecting the text from web pages comprises:
collecting first text having semantic structure from the current web page, and searching it for text hyperlinks;
collecting second text having semantic structure from the web pages that the text hyperlinks point to.
Preferably, the entity recognition comprises text preprocessing, lexical analysis, and/or entity deduplication.
Preferably, the entity relation extraction comprises dependency syntactic analysis and/or syntactic analysis.
Preferably, classifying the network entity relationship set using classification models to obtain entity pairs comprises:
training an entity classification model using the first entity set in the core entity relationship set;
training a relationship classification model using the first relationship set in the core entity relationship set;
inputting the second entity set and the second relationship set in the network entity relationship set into the entity classification model and the relationship classification model respectively, and obtaining entity pairs.
Preferably, training the entity classification model using the first entity set in the core entity relationship set comprises:
classifying the first entity set in the core entity relationship set with a classification algorithm to train the entity classification model;
inputting a portion of the entities in the network entity relationship set into the model for classification, outputting the classification results, and assessing their accuracy;
if the accuracy reaches a set value, using this entity classification model to classify the other entities in the network entity relationship set.
Preferably, training the relationship classification model using the first relationship set in the core entity relationship set comprises:
classifying the first relationship set in the core entity relationship set with a classification algorithm to train the relationship classification model;
inputting a portion of the relationships in the network entity relationship set into the model for classification, outputting the classification results, and assessing the accuracy of the relationship classification results;
if the accuracy reaches a set value, using this relationship classification model to classify the other relationships in the network entity relationship set.
Preferably, the core entity relationship set is obtained from the semantic model.
Preferably, an entity relationship set comprises an entity set and a relationship set.
In a second aspect, this application proposes an entity relationship map construction system for Web community text, comprising:
an acquisition module, for automatically collecting text from web pages and text from a Web community;
a semantic model module, for performing entity recognition and entity relation extraction on the collected web page text and constructing a semantic model;
a text analysis module, for performing entity recognition and entity relation extraction on the collected Web community text and obtaining a network entity relationship set;
a fusion and display module, for classifying the network entity relationship set using classification models to obtain entity pairs; performing hierarchical classification calculation on the entity pairs and merging them into the semantic model; and visualizing the fused semantic model to obtain an entity relationship map.
The advantages of this application are as follows. Generating the semantic model from the plain text of selected web pages ensures the accuracy and reliability of the entity relationships. Preprocessing the short Web community text removes meaningless characters, text emoticons, quoted text, and similar information, and so normalizes the collected short text. Analyzing the characteristics of short Web community text and mining the deep semantic entity relationships it contains yields the entity relationships in that text. Training the classification models with a classification algorithm on the core entity relationship set, and evaluating them with the network entity relationship set, increases the reliability of classification. Adding the network entity relationship sets that pass evaluation to the core semantic model increases its richness, stability, and ability to extend automatically. Visualizing the entity relationship map makes it possible to find hot public opinion topics in the Web community promptly and accurately, and thereby gives community administrators strong decision support for managing and governing the Web community.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of this application. Throughout the drawings, identical components are denoted by the same reference symbols. In the drawings:
Fig. 1 is a schematic diagram of the steps of the entity relationship map construction method for Web community text provided by this application;
Fig. 2 is a schematic diagram of the text hyperlink search of the entity relationship map construction method for Web community text provided by this application;
Fig. 3 is a schematic diagram of the entity recognition of the entity relationship map construction method for Web community text provided by this application;
Fig. 4 is a schematic diagram of the short-text word segmentation of the entity relationship map construction method for Web community text provided by this application;
Fig. 5 is a schematic diagram of the entity relation extraction of the entity relationship map construction method for Web community text provided by this application;
Fig. 6 is a schematic diagram of a semantic model with a three-level hierarchical structure in the entity relationship map construction method for Web community text provided by this application;
Fig. 7 is a schematic diagram of the entity relationship map construction system for Web community text provided by this application.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
According to an embodiment of this application, an entity relationship map construction method for Web community text is proposed. As shown in Fig. 1, it comprises:
S101, collecting text from web pages;
S102, performing entity recognition and entity relation extraction on the text from the web pages, and constructing a semantic model;
S103, collecting text from a Web community;
S104, performing entity recognition and entity relation extraction on the text from the Web community, and obtaining a network entity relationship set;
S105, classifying the network entity relationship set using classification models to obtain entity pairs;
S106, performing hierarchical classification calculation on the entity pairs, and merging the entity pairs into the semantic model;
S107, visualizing the fused semantic model to obtain an entity relationship map.
Collecting the text from web pages comprises:
collecting first text having semantic structure from the current web page, and searching it for text hyperlinks;
collecting second text having semantic structure from the web pages that the text hyperlinks point to.
The current web page is one of the pages on the Internet that others have already organized, such as Baidu Baike and Hudong Baike.
Because Web community text is messy and unorganized, processing it directly yields results that are equally disordered and unsystematic. When constructing the semantic model, it is therefore necessary to use web pages that others have already organized, such as Baidu Baike and Hudong Baike. The text in such pages is hierarchical text that is closely related to its field and tightly organized. By following the hyperlinks found in the text into the other web pages associated with it and collecting their text, the current content can be mined level by level.
The number of levels to search and the number of hyperlinks to follow can both be set.
As shown in Fig. 2, consider searching for text hyperlinks starting from a given web page, with the number of levels set to 3. All text hyperlinks found on the current page are level 1. Following them leads to the pages of the next layer (layer 2), which are searched in turn; all text hyperlinks found on the layer-2 pages are level 2. Following those leads to the pages of layer 3, and all text hyperlinks found there are level 3. The pages of the next layer (layer 4), reached through the level-3 hyperlinks, are used for text collection only and are not searched for further hyperlinks.
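The level-limited search described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `collect_texts` and the in-memory `pages` mapping (URL to a pair of page text and outgoing links) are hypothetical stand-ins for a real crawler, which the source does not specify.

```python
from collections import deque


def collect_texts(pages, start, max_level=3):
    """Collect page text breadth-first, following links up to max_level levels.

    pages is a hypothetical mapping: url -> (text, [linked urls]).
    The start page is layer 1; links found on a layer-k page are level-k links.
    """
    collected = {}
    seen = {start}
    queue = deque([(start, 1)])
    while queue:
        url, layer = queue.popleft()
        if url not in pages:
            continue  # dangling link: nothing to collect
        text, links = pages[url]
        collected[url] = text
        # Pages beyond the last searched layer are collected but not searched.
        if layer > max_level:
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, layer + 1))
    return collected
```

With `max_level=3`, a chain of pages a, b, c, d, e linked one to the next yields text from a through d: page d sits on layer 4, so its text is collected but its links are not followed.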
As shown in Fig. 3, entity recognition comprises text preprocessing, lexical analysis, and/or entity deduplication.
Entity recognition uses a hidden Markov model.
Entity recognition is the process of identifying and labeling the noun phrases in text.
Text preprocessing includes noise removal and format conversion.
Noise removal includes removing HTML tags, quoted text, text emoticons, meaningless characters, and bump replies.
Web community text differs from a news corpus. It is collected from community forums and exists in the form of topic posts and follow-up replies. It is unstructured content that contains multimedia such as pictures, emoticons, and videos, and because it is produced and spread over the Internet it also contains useless hypertext markup tags (HTML tags) and filler replies. A series of preprocessing steps is therefore needed before formal text analysis.
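A minimal sketch of such noise removal follows. The source names the kinds of noise but not their concrete form, so the patterns here are assumptions: quoted text is taken to be lines starting with `>`, emoticons to be short bracketed codes such as `[smile]`, and bump replies to be the few words in `BUMP_WORDS`. The function name `clean_post` is likewise hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                        # HTML tags
QUOTE_RE = re.compile(r"^[ \t]*>.*$", re.MULTILINE)    # assumed quote marker: leading ">"
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")          # assumed emoticon codes like [smile]
BUMP_WORDS = {"up", "bump", "顶"}                       # assumed contentless bump replies


def clean_post(raw):
    """Strip quotes, tags, emoticon codes, and extra whitespace from one post."""
    text = QUOTE_RE.sub("", raw)
    text = TAG_RE.sub("", text)
    text = html.unescape(text)          # turn entities such as &amp; back into characters
    text = EMOTICON_RE.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    # A post that is nothing but a bump word carries no information.
    return "" if text.lower() in BUMP_WORDS else text
```

A real forum would need patterns matched to its own markup; the order of the steps (quotes first, then tags, then entities) is one reasonable choice, not something the source prescribes.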
Because the default text encodings of systems such as Windows, Linux, and Mac differ, Web community text collected on different platforms may not share one format. The default encoding for Windows text is ANSI, the system's preset standard text storage format, but word segmentation requires UTF-8 input. The collected Web community text therefore has to be converted uniformly to UTF-8 so that it can be segmented.
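The conversion step might look like the sketch below. It assumes that the "ANSI" text on a Chinese-language Windows system is GBK-encoded, which the source does not state, and it simply tries a short list of candidate encodings in order; `to_utf8` and its `candidates` parameter are hypothetical names.

```python
def to_utf8(raw, candidates=("utf-8", "gbk")):
    """Return raw bytes re-encoded as UTF-8, trying each candidate encoding in turn."""
    for encoding in candidates:
        try:
            # Strict decoding fails on any invalid byte, so a wrong guess is rejected.
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("text does not match any candidate encoding")
```

Trying UTF-8 first matters: strict UTF-8 decoding rejects most non-UTF-8 input, whereas a permissive legacy encoding would accept almost any bytes and silently produce wrong characters.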
Short-text word segmentation is the process of splitting the sentences in a text into multiple words or phrases.
Chinese word segmentation algorithms include dictionary-matching algorithms, word-frequency-statistics algorithms, and rule-based algorithms.
Dictionary-matching algorithms include forward maximum matching and reverse maximum matching.
Statistics-based segmentation algorithms include probabilistic statistical algorithms and mutual-information-based algorithms.
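Of the algorithms listed, forward maximum matching is the simplest to illustrate. The sketch below is a generic textbook version, not necessarily the variant used in this application; the function name, the `max_len` limit, and the toy dictionary in the usage note are assumptions.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Segment sentence left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

For instance, with the dictionary `{"小王", "小李", "打"}`, the sentence "小王打小李" ("Xiao Wang hits Xiao Li") is split into 小王 / 打 / 小李. Reverse maximum matching works the same way but scans from the end of the sentence.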
Lexical analysis comprises short-text word segmentation and part-of-speech tagging.
Together, word segmentation and part-of-speech tagging cut a sentence into the correct set of words.
Taking "Xiao Wang hits Xiao Li" as an example, the lexical analysis result is "Xiao Wang (noun) / hits (verb) / Xiao Li (noun)". After word segmentation and part-of-speech tagging, the named entities in the sentence (Xiao Wang, Xiao Li) are plain to see. The process must identify not only personal names but also entity nouns such as place names and organization names.
During segmentation, matching and disambiguation algorithms are also used to improve the efficiency and accuracy of segmentation.
As shown in Fig. 4, the dictionary is a custom (self-built) dictionary created according to the text characteristics of the college and university Web community public opinion domain, and its content is closely tied to Web community text. Short-text word segmentation first separates the English and Chinese parts of the Web community text. The Chinese text is then matched against the word entries in the dictionary and disambiguated, yielding the segmented set of words (the segmentation result). Once the segmentation result is available, the personal names, place names, organization names, and other entities in the result (the entity relationship set) still have to be recognized.
Because Web community text is presented mainly as posts and replies, it contains many repeated entities. To ensure accurate entity relation extraction, the entities must be deduplicated.
Entity relation extraction comprises dependency syntactic analysis and/or syntactic analysis.
Entity relation extraction means obtaining, from a corpus of natural language, the relations that exist between named entities, for example an employment relation between a person and an organization. Common relations include employment, geographic location, membership, and whole-part relations. As with entity extraction, the types of entity relation are defined in advance. Entity relation extraction is a deeper extension of named entity recognition and provides a precondition for tasks such as event content extraction, automatic question answering, machine translation, and other natural language processing.
As shown in Fig. 5, after entity recognition, dependency syntactic analysis is performed on the sentences in the Web community text to work out the components of each sentence, and the relations between entities are extracted according to the sentence features of Chinese obtained from that analysis (syntactic analysis).
Dependency syntactic analysis analyzes the dependencies among the components of a sentence to find its syntactic structure. It holds that the core verb is the governing element of the sentence: no other component governs the core verb, and every governed component depends on its governor in some form. That is, the grammatical components identified by dependency syntactic analysis, such as subject, predicate, and object, or attributive, adverbial, and complement, are independent of where those components appear in the sentence.
A confidence value is computed for each pair of arguments (entities) from the distance between the two arguments and the distance between each argument and the relation expression. The lower the confidence, the less likely it is that a semantic relation holds between the arguments, or between the arguments and the relation expression.
By obtaining the dependency syntactic relation expressions of the sentences in the Web community text and computing confidence, the grammatical relation within each entity pair can be extracted correctly, yielding the network entity relationship set.
Classifying the network entity relationship set using classification models to obtain entity pairs comprises:
training an entity classification model using the first entity set in the core entity relationship set;
training a relationship classification model using the first relationship set in the core entity relationship set;
inputting the second entity set and the second relationship set in the network entity relationship set into the entity classification model and the relationship classification model respectively, and obtaining entity pairs.
Classification establishes multiple topics (fields) for the entities and entity relations. The data to be classified (an entity relationship set) is input to a classification model, which computes and outputs the corresponding category.
Because the entities and entity relations in the semantic model come from hierarchical text that is closely related to each field and tightly organized, the core entity relationship set is used to train the classification models.
When no entity classification model and relationship classification model exist for the currently needed field, the entity classification model is trained with the first entity set in the core entity relationship set and the relationship classification model with the first relationship set in the core entity relationship set, giving the classification models for that field. If an entity classification model and a relationship classification model for the currently needed field already exist, they are used directly.
The classification algorithms that a classification model may use include naive Bayes, decision trees, support vector machines (SVM), and convolutional neural networks (CNN).
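To make the first of these algorithms concrete, here is a small multinomial naive Bayes classifier with add-one smoothing. It is a generic sketch, not the model trained in this application: the class name, the token-list input format, and the topic labels in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(samples, labels):
            self.feature_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus summed log likelihoods avoids floating-point underflow.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing out a class.
                score += math.log((self.feature_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on samples such as `["exam", "course"]` labelled "teaching" and `["dorm", "canteen"]` labelled "campus life", the model assigns a new token list to whichever topic makes its tokens most probable.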
Training the entity classification model using the first entity set in the core entity relationship set comprises:
classifying the first entity set in the core entity relationship set with a classification algorithm to train the entity classification model;
inputting a portion of the entities in the network entity relationship set into the model for classification, outputting the classification results, and assessing their accuracy;
if the accuracy reaches a set value, using this entity classification model to classify the other entities in the network entity relationship set.
The entity classification model is trained with the first entity set in the core entity relationship set (the training set), giving the classification model and classification rules for the currently needed field. A portion of the entities in the network entity relationship set serves as the test set and is input to this model for classification. The classified test set is then evaluated; if the evaluation value exceeds the set evaluation threshold, the test set is merged into the core entity relationship set, expanding it.
Training the relationship classification model using the first relationship set in the core entity relationship set comprises:
classifying the first relationship set in the core entity relationship set with a classification algorithm to train the relationship classification model;
inputting a portion of the relationships in the network entity relationship set into the model for classification, outputting the classification results, and assessing their accuracy;
if the accuracy reaches a set value, using this relationship classification model to classify the other relationships in the network entity relationship set.
The relationship classification model is trained with the first relationship set in the core entity relationship set (the training set), giving the classification model and classification rules for the currently needed field. A portion of the relationships in the network entity relationship set serves as the test set and is input to this model for classification. The classified test set is then evaluated; if the evaluation value exceeds the set evaluation threshold, the test set is merged into the core entity relationship set, expanding it.
The size of the portion used as the test set can be set; that is, the number of entities and relationships from the network entity relationship set that are input to the classification model as the test set is configurable.
There are three evaluation metrics: precision (accuracy rate) P, recall R, and the F value. The formula is:
F = R * P * (1 + A^2) / (R + P * A^2).
The terms are defined as follows:
P is the number of named entities of a given type that are correctly identified / the total number of named entities identified as that type * 100%;
R is the number of named entities of a given type that are correctly identified / the total number of named entities of that type * 100%;
A is a settable parameter; here A = 1, so the F value is also called the F1 value.
The evaluation value is generally determined by these three values together.
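The three metrics can be computed directly from the counts in these definitions. In this sketch the function and parameter names are hypothetical, the values are returned as fractions rather than percentages, and the handling of zero denominators is an added assumption that the source does not discuss.

```python
def precision_recall_f(correct, identified, actual, a=1.0):
    """Return (P, R, F) for one entity type.

    correct: entities of the type that were identified correctly
    identified: all entities the model identified as that type
    actual: all entities that really are of that type
    """
    p = correct / identified if identified else 0.0
    r = correct / actual if actual else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0  # avoid dividing by zero when nothing was correct
    f = r * p * (1 + a ** 2) / (r + p * a ** 2)
    return p, r, f
```

With 8 correct out of 10 identified and 16 actual entities, P is 0.8, R is 0.5, and with A = 1 the F value is 0.8 / 1.3, roughly 0.615.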
For the entities and/or entity relationships that were input to the corresponding classification models and classified, if the evaluation value exceeds the evaluation threshold, all of them are placed into the core entity relationship set of the corresponding category, expanding the corpus of that category.
Hierarchical classification calculation is performed on each entity pair obtained, and the entity pairs are merged into the semantic model.
Through the classification algorithm and hierarchical classification, newly identified entities and entity relationships can be positioned and then fused, entity with entity and relationship with relationship, into the core semantic model, so that the semantic model grows by itself.
Computing the semantic association degree of feature words across different topics requires first constructing a semantic tree with a hierarchical structure and then computing the association degree with the term frequency-inverse document frequency (TF-IDF) algorithm.
Association degrees are also computed during classification calculation and hierarchical classification calculation.
The main purpose of TF-IDF filtering is to filter out words that occur frequently in the text but carry little meaning and to retain the more important words; the method is also used for feature-space dimensionality reduction and feature extraction. TF-IDF measures the relationship between an entity feature word and the text collection: the more often a particular word occurs in a given document of the collection, the more important it is to that document.
The hierarchy of the semantic tree can be determined manually. Taking colleges and universities as the root node, a simple semantic model with a three-level hierarchy is established, as shown in Fig. 6. The association degree between the root node and a child node is the share of the total number of texts accounted for by the texts belonging to the child node's topic. The association degree between a child node and an adjacent leaf node is the weight of the leaf node's feature word, whose TF-IDF weight can be calculated with the following formula:
where N is the total number of texts and n(w) is the number of texts containing w.
Here IDF(w) is combined with the TF value of the entity word w.
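The formula itself is not reproduced in this text. A common form of the TF-IDF weight that is consistent with the definitions of N and n(w) is TF(w) * log(N / n(w)); the sketch below uses that form as an assumption, and the exact variant in the original (log base, smoothing) may differ. The function and parameter names are hypothetical.

```python
import math


def tfidf_weight(tf, n_texts, n_containing):
    """Assumed TF-IDF weight: tf * log(N / n(w)), using the natural logarithm."""
    if n_containing <= 0 or n_containing > n_texts:
        raise ValueError("n(w) must be between 1 and N")
    # A word found in every text gets log(1) = 0, so it carries no weight.
    return tf * math.log(n_texts / n_containing)
```

Under this form, a word that appears in every text receives a weight of zero, which matches the stated aim of filtering out frequent but uninformative words.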
The association degree between two leaf nodes under the same topic node is the product of the two leaf nodes' association degrees with their common parent node, and can be calculated with the following formula:
TF is the number of times an entity occurs in the entity set divided by the total number of occurrences of all entities. If an entity occurs frequently in the texts of a given topic, the entity is related to that topic, and its term frequency can be counted.
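The two quantities just described, the term frequency quotient and the product of two leaf nodes' association degrees with their common parent, can be sketched as follows. The formula for the product is not reproduced in this text, so the code follows the prose description only; the function names and the sample entities in the usage note are hypothetical.

```python
from collections import Counter


def term_frequencies(entity_occurrences):
    """TF of each entity: its occurrence count divided by all entity occurrences."""
    counts = Counter(entity_occurrences)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}


def leaf_association(weight_a, weight_b):
    """Association of two leaves under one topic: product of their parent weights."""
    return weight_a * weight_b
```

For the occurrences `["library", "library", "canteen", "dorm"]`, the term frequencies are 0.5, 0.25, and 0.25; if two leaves have association degrees 0.5 and 0.25 with their shared topic node, their mutual association degree is 0.125.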
Through the calculation of the degree of association, the degree of association between each topic and its feature words and the degree of association between the feature words under each topic can be obtained. Once the degree of association between each entity pair and the feature words in the semantic model has been obtained, each entity pair can be inserted at the correct position in the semantic model and fused with it.
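The fusion step can be sketched as follows. The placement rule (attach the pair under the topic with the highest degree of association) and the dictionary layout of the model are assumptions made for illustration; the application does not specify them.

```python
def place_entity_pair(model, pair, associations):
    """Attach an entity pair under the topic it is most associated with.

    model: {"root": str, "topics": {topic: {"leaves": [...], "pairs": [...]}}}
    pair: (head_entity, relation, tail_entity)
    associations: {topic: degree of association of the pair with that topic}
    Returns the chosen topic, or None if no scored topic exists in the model.
    """
    known = {t: s for t, s in associations.items() if t in model["topics"]}
    if not known:
        return None  # nothing to attach to; the model is left unchanged
    topic = max(known, key=known.get)
    node = model["topics"][topic]
    if pair not in node["pairs"]:  # avoid inserting the same pair twice
        node["pairs"].append(pair)
    return topic
```

Returning the chosen topic makes it easy to record where each pair was placed when the fused model is later visualized.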
The visualization processing is implemented using a visualization tool.
The source of the text, the original content of the text, and the semantic network can all be saved, which makes it convenient to trace the source, re-read, and retrieve them.
The source of the text mainly includes: the community name, the detailed link of the text, and other text content related to the text.
The original content of the text includes the text that has not been preprocessed.
The semantic network includes the relationships between the entities visualized in the map and the correspondence between those relationships and the text.
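One possible shape for such a stored record is sketched below. All class and field names are hypothetical, JSON is chosen only for illustration, and the example URL is a placeholder; the application does not prescribe a storage format.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TextSource:
    community_name: str
    link: str                                   # detailed link of the text
    related_texts: list = field(default_factory=list)

@dataclass
class StoredText:
    source: TextSource
    original_content: str                       # text before any preprocessing
    # [head, relation, tail] triples that this text contributed to the map
    relations: list = field(default_factory=list)

def to_json(record):
    """Serialize a record so it can be traced and re-read later."""
    return json.dumps(asdict(record), ensure_ascii=False)

def from_json(payload):
    """Rebuild a StoredText from its JSON form."""
    data = json.loads(payload)
    return StoredText(source=TextSource(**data["source"]),
                      original_content=data["original_content"],
                      relations=data["relations"])
```

Keeping the triples next to the unprocessed text is what allows a relationship shown in the map to be traced back to the post it came from.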
Data such as each semantic model, the network entity relationship set, the core entity relationship set, and the corpus can all be stored.
The core entity relationship set is obtained from the semantic model.
The entity relationship set includes an entity set and a relationship set.
The recognition includes labeling.
According to an embodiment of the present application, an entity relationship map construction system for Web Community text is also proposed, as shown in Figure 7, comprising:
an acquisition module 101, configured to automatically acquire text from web pages and text from a Web Community;
a semantic model module 102, configured to perform entity recognition and entity relationship extraction on the text acquired from the web pages and to construct a semantic model;
a text analysis module 103, configured to perform entity recognition and entity relationship extraction on the text acquired from the Web Community to obtain a network entity relationship set;
a fusion and display module 104, configured to classify the network entity relationship set using classification models to obtain entity pairs; to perform hierarchical classification calculation on the entity pairs and fuse the entity pairs into the semantic model; and to perform visualization processing on the fused semantic model to obtain an entity relationship map.
The fusion and display module further includes visualization storage, configured to store the source of the text, the original content of the text, and the semantic network.
The semantic model module is also configured to store data such as each semantic model, the network entity relationship set, the core entity relationship set, and the corpus.
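The data flow between the four modules can be sketched as follows. This only illustrates the order in which the modules hand results to each other; the class name `GraphBuilder` and the callable signatures are assumptions, and each callable stands in for a full module.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GraphBuilder:
    acquire: Callable            # module 101: () -> (web_texts, community_texts)
    build_model: Callable        # module 102: web_texts -> semantic model
    analyze: Callable            # module 103: community_texts -> network entity relationship set
    fuse_and_display: Callable   # module 104: (model, relationship set) -> entity relationship map

    def run(self):
        """Run the modules in the order described for Figure 7."""
        web_texts, community_texts = self.acquire()
        model = self.build_model(web_texts)
        network_set = self.analyze(community_texts)
        return self.fuse_and_display(model, network_set)
```

Passing the modules in as callables keeps the sketch independent of any particular crawler, parser, or visualization tool.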
In the method of the present application, the semantic model is generated from plain text in specific web pages, which ensures the accuracy and reliability of the entity relationships. Preprocessing the Web Community short texts removes information such as meaningless characters, text emoticons, and quotations, and standardizes the acquired short texts. By analyzing the characteristics of Web Community short texts, the deep semantic entity relationships they contain are mined, yielding the entity relationships in the Web Community short texts. Classification models are trained with a classification algorithm and the core entity relationship set and are assessed using the network entity relationship set, which increases the reliability of the classification. The network entity relationship sets that pass the assessment are added to the core semantic model, which increases its richness, stability, and automatic extensibility. By visualizing the entity relationship map, hot public opinion in the Web Community can be found promptly and accurately, providing strong decision support for community administrators in managing the community and for improving the Web Community.
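The train-then-assess step mentioned above can be sketched as follows. The application does not name a classification algorithm, so a toy token-overlap classifier is swapped in here purely for illustration; the function names, the threshold value, and the assumption that a labeled sample of the network set is available for assessment are all hypothetical.

```python
from collections import Counter, defaultdict

def train(labelled):
    """Build per-label token counts from (text, label) pairs of the core set."""
    model = defaultdict(Counter)
    for text, label in labelled:
        model[label].update(text.lower().split())
    return dict(model)

def predict(model, text):
    """Pick the label whose training tokens overlap most with the text."""
    tokens = text.lower().split()
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))

def accuracy(model, labelled_sample):
    """Fraction of the labeled sample that the model classifies correctly."""
    hits = sum(predict(model, text) == label for text, label in labelled_sample)
    return hits / len(labelled_sample)

def classify_if_reliable(model, labelled_sample, remaining, threshold=0.8):
    """Classify the remaining items only if accuracy reaches the set value."""
    if accuracy(model, labelled_sample) < threshold:
        return None  # model not reliable enough; retrain before using it
    return [(text, predict(model, text)) for text in remaining]
```

The same gate applies to the relationship classification model: train on the relationship set of the core set, assess on part of the network set, and classify the rest only when the set value is reached.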
The above is only a preferred specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. An entity relationship map construction method for Web Community text, characterized by comprising:
acquiring text from web pages;
performing entity recognition and entity relationship extraction on the text from the web pages to construct a semantic model;
acquiring text from a Web Community;
performing entity recognition and entity relationship extraction on the text from the Web Community to obtain a network entity relationship set;
classifying the network entity relationship set using classification models to obtain entity pairs;
performing hierarchical classification calculation on the entity pairs, and fusing the entity pairs into the semantic model;
performing visualization processing on the fused semantic model to obtain an entity relationship map.
2. The entity relationship map construction method for Web Community text according to claim 1, characterized in that acquiring the text from web pages comprises:
acquiring a first text having a semantic structure in the current web page, and searching for text hyperlinks;
acquiring a second text having a semantic structure in the web page corresponding to each text hyperlink.
3. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity recognition comprises: text preprocessing, lexical analysis, and/or entity deduplication.
4. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity relationship extraction comprises: dependency syntactic analysis and/or syntactic analysis.
5. The entity relationship map construction method for Web Community text according to claim 1, characterized in that classifying the network entity relationship set using classification models to obtain entity pairs comprises:
training an entity classification model using a first entity set in a core entity relationship set;
training a relationship classification model using a first relationship set in the core entity relationship set;
inputting a second entity set and a second relationship set in the network entity relationship set into the entity classification model and the relationship classification model, respectively, to obtain entity pairs.
6. The entity relationship map construction method for Web Community text according to claim 5, characterized in that training the entity classification model using the first entity set in the core entity relationship set comprises:
classifying the first entity set in the core entity relationship set using a classification algorithm to train the entity classification model;
inputting a portion of the entities in the network entity relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the classification results;
if the accuracy reaches a set value, classifying the other entities in the network entity relationship set using this entity classification model.
7. The entity relationship map construction method for Web Community text according to claim 5, characterized in that training the relationship classification model using the first relationship set in the core entity relationship set comprises:
classifying the first relationship set in the core entity relationship set using a classification algorithm to train the relationship classification model;
inputting a portion of the relationships in the network entity relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the relationship classification results;
if the accuracy reaches a set value, classifying the other relationships in the network entity relationship set using this relationship classification model.
8. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the core entity relationship set is obtained from the semantic model.
9. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity relationship set comprises an entity set and a relationship set.
10. An entity relationship map construction system for Web Community text, characterized by comprising:
an acquisition module, configured to automatically acquire text from web pages and text from a Web Community;
a semantic model module, configured to perform entity recognition and entity relationship extraction on the text acquired from the web pages and to construct a semantic model;
a text analysis module, configured to perform entity recognition and entity relationship extraction on the text acquired from the Web Community to obtain a network entity relationship set;
a fusion and display module, configured to classify the network entity relationship set using classification models to obtain entity pairs; to perform hierarchical classification calculation on the entity pairs and fuse the entity pairs into the semantic model; and to perform visualization processing on the fused semantic model to obtain an entity relationship map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910277242.8A CN110188191A (en) | 2019-04-08 | 2019-04-08 | A kind of entity relationship map construction method and system for Web Community's text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110188191A (en) | 2019-08-30 |
Family
ID=67713784
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201910277242.8A | Pending | CN110188191A (en) | 2019-04-08 | 2019-04-08 | A kind of entity relationship map construction method and system for Web Community's text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188191A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102708096A (en) * | 2012-05-29 | 2012-10-03 | 代松 | Network intelligence public sentiment monitoring system based on semantics and work method thereof |
CN108345647A (en) * | 2018-01-18 | 2018-07-31 | 北京邮电大学 | Domain knowledge map construction system and method based on Web |
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
CN109064318A (en) * | 2018-08-24 | 2018-12-21 | 苏宁消费金融有限公司 | A kind of internet financial risks monitoring system of knowledge based map |
Non-Patent Citations (3)
Title |
---|
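Yang Hao (杨浩): "Spatial Semantic Association Analysis of Social Network Public Opinion for the 'Belt and Road'", China Master's Theses Full-text Database, Basic Sciences * |
Wang Chao (王超): "Research and Implementation of a Chinese Microblog Person Relationship Graph Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology * |
Jia Bingjing, Ma Run (贾丙静, 马润): "Research on Knowledge Graph Construction Based on Entity Alignment", Journal of Jiamusi University (Natural Science Edition) * |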
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110727803A (en) * | 2019-10-10 | 2020-01-24 | 北京明略软件系统有限公司 | Text event extraction method and device |
CN110737845A (en) * | 2019-10-15 | 2020-01-31 | 精硕科技(北京)股份有限公司 | method, computer storage medium and system for realizing information analysis |
CN110795573B (en) * | 2019-10-31 | 2021-09-28 | 北京邮电大学 | Method and device for predicting geographic position of webpage content |
CN110795573A (en) * | 2019-10-31 | 2020-02-14 | 北京邮电大学 | Method and device for predicting geographic position of webpage content |
CN111310454A (en) * | 2020-01-17 | 2020-06-19 | 北京邮电大学 | Relation extraction method and device based on domain migration |
CN111310454B (en) * | 2020-01-17 | 2022-01-07 | 北京邮电大学 | Relation extraction method and device based on domain migration |
CN111400448A (en) * | 2020-03-12 | 2020-07-10 | 中国建设银行股份有限公司 | Method and device for analyzing incidence relation of objects |
CN112100292A (en) * | 2020-09-22 | 2020-12-18 | 山东旗帜信息有限公司 | Personnel relation map determination method and device |
CN112364173A (en) * | 2020-10-21 | 2021-02-12 | 中国电子科技网络信息安全有限公司 | IP address mechanism tracing method based on knowledge graph |
CN112364173B (en) * | 2020-10-21 | 2022-03-18 | 中国电子科技网络信息安全有限公司 | IP address mechanism tracing method based on knowledge graph |
CN113254635A (en) * | 2021-04-14 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Data processing method, device and storage medium |
CN113269271A (en) * | 2021-04-30 | 2021-08-17 | 清华大学 | Initialization method and equipment of double-dictionary model for artificial intelligence text analysis |
CN113269271B (en) * | 2021-04-30 | 2022-11-15 | 清华大学 | Initialization method and equipment of double-dictionary model for artificial intelligence text analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188191A (en) | A kind of entity relationship map construction method and system for Web Community's text | |
Wang et al. | Relevant document discovery for fact-checking articles | |
Velardi et al. | Ontolearn reloaded: A graph-based algorithm for taxonomy induction | |
CN102254014B (en) | Adaptive information extraction method for webpage characteristics | |
CN105893611B (en) | Method for constructing interest topic semantic network facing social network | |
CN106570171A (en) | Semantics-based sci-tech information processing method and system | |
Terrana et al. | Automatic unsupervised polarity detection on a twitter data stream | |
Ahmed | Detecting opinion spam and fake news using n-gram analysis and semantic similarity | |
US20160321244A1 (en) | Phrase pair collecting apparatus and computer program therefor | |
CN102890702A (en) | Internet forum-oriented opinion leader mining method | |
CN110232149A (en) | A kind of focus incident detection method and system | |
TW201115371A (en) | Systems and methods for organizing collective social intelligence information using an organic object data model | |
CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
CN109299248A (en) | A kind of business intelligence collection method based on natural language processing | |
CN108153851B (en) | General forum subject post page information extraction method based on rules and semantics | |
Schatten et al. | An introduction to social semantic web mining & big data analytics for political attitudes and mentalities research | |
Saif et al. | Mapping Arabic WordNet synsets to Wikipedia articles using monolingual and bilingual features | |
Fernandes et al. | Analysis of product Twitter data though opinion mining | |
Yıldız et al. | Acquisition of Turkish meronym based on classification of patterns | |
Cui et al. | Mining concepts from wikipedia for ontology construction | |
Bhartiya et al. | A Semantic Approach to Summarization | |
Garcia-Gorrostieta et al. | Argument component classification in academic writings | |
CN113392183A (en) | Characterization and calculation method of children domain map knowledge | |
Griazev et al. | Web mining taxonomy | |
Lim et al. | Generalized and lightweight algorithms for automated web forum content extraction |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190830 |