CN110188191A - Entity relationship map construction method and system for web community text

Info

Publication number
CN110188191A
Authority
CN
China
Prior art keywords
entity
relationship
text
model
web community
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910277242.8A
Other languages
Chinese (zh)
Inventor
吴旭
颉夏青
吴海涛
张熙
方滨兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910277242.8A
Publication of CN110188191A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

This application discloses an entity relationship map construction method and system for web community text. The method comprises: collecting text from web pages, performing entity recognition and entity relation extraction on it, and constructing a semantic model; collecting text from a web community, performing entity recognition and entity relation extraction, and obtaining a network entity-relationship set; classifying the network entity-relationship set with a classification model to obtain entity pairs; performing hierarchical classification calculation on the entity pairs and merging them into the semantic model; and visualizing the fused semantic model to obtain an entity relationship map. Generating the semantic model from the plain text of specific web pages ensures the accuracy and reliability of the entity relationships. Training the classification model with a classification algorithm on the core entity-relationship set, and then evaluating it, increases the reliability of classification. Adding the network entity-relationship sets that pass evaluation to the core semantic model increases its richness, stability, and ability to scale automatically.

Description

Entity relationship map construction method and system for web community text
Technical field
This application relates to the field of information processing, and in particular to an entity relationship map construction method and system for web community text.
Background
A web community is much like a community in the ordinary sense: it involves a certain locality, a certain group of people, certain kinds of organization, the participation of community members, and some shared interests and culture. A web community provides various means of exchanging information, such as discussion, communication, and chat, so that community residents can interact. With the rapid development of the internet, people's real lives and web communities have become ever more closely connected. People like to record their daily lives in web communities, discuss current political hot topics and everyday social affairs, offer their own thoughts and views on trending news, and strengthen their sense of community participation in a variety of ways. This enthusiasm has driven the growth of web communities, and their diversification meets people's needs: people discuss celebrities' lives in the Weibo community, celebrity gossip in the Tianya community, current political hot topics in news communities, and literature in Baidu Tieba. Web communities have penetrated every aspect of daily life.
As web communities flourish, they carry more and more social attributes. The information they contain forms very typical textual features within specific regions and specific groups, and to some extent reflects people's needs and wishes. Analyzing the public opinion events with deep semantic relations contained in the text of one or more similar web communities makes it easier for community administrators to manage the community and understand trends in residents' lives, and offers useful guidance for building better communication channels between the community and its residents, so that community activity improves while residents' thinking is accommodated. At the same time, network management and policy-making bodies can grasp hot topics in the community in time, track public opinion trends, understand what the public is asking for, and obtain sound decision support for community management and network governance.
Traditional analysis of text information relies mainly on techniques such as keyword matching and topic clustering to extract and understand the information. These techniques remain at the level of shallow semantics and cannot mine and express public opinion information from the perspective of deep semantics. Moreover, most of this research has been carried out on long texts in the news and medical fields, whereas web community text is mostly short and highly colloquial. Because of problems such as non-standard expression, one or more of these techniques cannot be applied directly to accurately find and identify the trending public opinion information the text contains.
In summary, there is a need for an entity relationship map construction method and system that is suited to web community short text and can mine and express public opinion information from the perspective of deep semantics.
Summary of the invention
To solve the above problems, this application proposes an entity relationship map construction method and system for web community text.
In a first aspect, this application proposes an entity relationship map construction method for web community text, comprising:
collecting text from web pages;
performing entity recognition and entity relation extraction on the text from the web pages, and constructing a semantic model;
collecting text from a web community;
performing entity recognition and entity relation extraction on the text from the web community, and obtaining a network entity-relationship set;
classifying the network entity-relationship set with a classification model to obtain entity pairs;
performing hierarchical classification calculation on the entity pairs, and merging the entity pairs into the semantic model;
visualizing the fused semantic model to obtain an entity relationship map.
Preferably, collecting text from web pages comprises:
collecting first text having semantic structure from the current web page, and searching for text hyperlinks;
collecting second text having semantic structure from the web pages that the text hyperlinks point to.
Preferably, the entity recognition comprises: text preprocessing, lexical analysis, and/or entity deduplication.
Preferably, the entity relation extraction comprises: dependency syntactic analysis and/or syntactic analysis.
Preferably, classifying the network entity-relationship set with a classification model to obtain entity pairs comprises:
training an entity classification model with the first entity set in the core entity-relationship set;
training a relationship classification model with the first relationship set in the core entity-relationship set;
inputting the second entity set and the second relationship set in the network entity-relationship set into the entity classification model and the relationship classification model, respectively, to obtain entity pairs.
Preferably, training an entity classification model with the first entity set in the core entity-relationship set comprises:
classifying the first entity set in the core entity-relationship set with a classification algorithm, and training the entity classification model;
inputting a portion of the entities in the network entity-relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the classification results;
if the accuracy reaches a set value, using this entity classification model to classify the other entities in the network entity-relationship set.
Preferably, training a relationship classification model with the first relationship set in the core entity-relationship set comprises:
classifying the first relationship set in the core entity-relationship set with a classification algorithm, and training the relationship classification model;
inputting a portion of the relations in the network entity-relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the relationship classification results;
if the accuracy reaches a set value, using this relationship classification model to classify the other relations in the network entity-relationship set.
Preferably, the core entity-relationship set is obtained from the semantic model.
Preferably, an entity-relationship set comprises an entity set and a relationship set.
In a second aspect, this application proposes an entity relationship map construction system for web community text, comprising:
an acquisition module, configured to automatically collect text from web pages and text from a web community;
a semantic model module, configured to perform entity recognition and entity relation extraction on the text collected from web pages and to construct a semantic model;
a text analysis module, configured to perform entity recognition and entity relation extraction on the text collected from the web community and to obtain a network entity-relationship set;
a fusion and display module, configured to classify the network entity-relationship set with a classification model to obtain entity pairs; to perform hierarchical classification calculation on the entity pairs and merge them into the semantic model; and to visualize the fused semantic model to obtain an entity relationship map.
The advantages of this application are as follows. Generating the semantic model from the plain text of specific web pages ensures the accuracy and reliability of the entity relationships. Preprocessing web community short text removes meaningless characters, text emoticons, quoted text, and similar content, standardizing the collected short text. Analyzing the characteristics of web community short text and mining the deep semantic entity relationships it contains yields the entity relationships in that text. Training the classification models with a classification algorithm and the core entity-relationship set, and evaluating them with the network entity-relationship set, increases the reliability of classification. Adding the network entity-relationship sets that pass evaluation to the core semantic model increases its richness, stability, and ability to scale automatically. Visualizing the entity relationship map allows trending public opinion in the web community to be found promptly and accurately, which provides strong decision support for community administrators managing the community and for the governance of web communities.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of this application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 is a schematic diagram of the steps of the entity relationship map construction method for web community text provided by this application;
Fig. 2 is a schematic diagram of the text hyperlink search in the entity relationship map construction method for web community text provided by this application;
Fig. 3 is a schematic diagram of the entity recognition in the entity relationship map construction method for web community text provided by this application;
Fig. 4 is a schematic diagram of the short text word segmentation in the entity relationship map construction method for web community text provided by this application;
Fig. 5 is a schematic diagram of the entity relation extraction in the entity relationship map construction method for web community text provided by this application;
Fig. 6 is a schematic diagram of the semantic model with a three-level hierarchical structure in the entity relationship map construction method for web community text provided by this application;
Fig. 7 is a schematic diagram of the entity relationship map construction system for web community text provided by this application.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thoroughly understood and its scope fully conveyed to those skilled in the art.
According to an embodiment of this application, an entity relationship map construction method for web community text is proposed, as shown in Fig. 1, comprising:
S101, collecting text from web pages;
S102, performing entity recognition and entity relation extraction on the text from the web pages, and constructing a semantic model;
S103, collecting text from a web community;
S104, performing entity recognition and entity relation extraction on the text from the web community, and obtaining a network entity-relationship set;
S105, classifying the network entity-relationship set with a classification model to obtain entity pairs;
S106, performing hierarchical classification calculation on the entity pairs, and merging the entity pairs into the semantic model;
S107, visualizing the fused semantic model to obtain an entity relationship map.
Collecting text from web pages comprises:
collecting first text having semantic structure from the current web page, and searching for text hyperlinks;
collecting second text having semantic structure from the web pages that the text hyperlinks point to.
The current web page includes pages on the internet that others have already organized, such as Baidu Baike and Hudong Baike. Because web community text is messy and unstructured, processing it directly yields results that are equally disordered and unsystematic. Therefore, when constructing the semantic model, pages that others have already organized, such as Baidu Baike and Hudong Baike, are used. The text on such pages is closely tied to its field and is hierarchical, with a rigorous organizational structure. Searching the hyperlinks in the text and collecting text from the other pages they lead to allows the current content to be mined level by level.
Both the number of search levels and the number of hyperlinks searched can be set.
As shown in Fig. 2, when searching text hyperlinks starting from a given web page, assume the number of levels is set to 3. All text hyperlinks found on the current web page are level 1. These hyperlinks lead to the pages of the next layer (layer 2), which are searched in turn; all text hyperlinks found on the layer-2 pages are level 2. These lead to the pages of layer 3, and all text hyperlinks found on the layer-3 pages are level 3. The level-3 hyperlinks lead to the pages of layer 4, where text is collected but hyperlinks are no longer searched.
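The level-limited hyperlink search can be sketched as a breadth-first traversal. This is a minimal illustration, not the application's implementation: the `links` dictionary is a hypothetical in-memory stand-in for fetching pages and extracting their text hyperlinks, and the function and parameter names are assumptions.

```python
from collections import deque

def crawl(start, links, max_level):
    """Breadth-first collection of pages up to max_level hyperlink levels.

    links: dict mapping a page to the pages its text hyperlinks point to
           (a stand-in for real page fetching and link extraction).
    Returns {page: layer}, where the start page is layer 1.
    Pages on layer max_level + 1 are collected but not searched further.
    """
    layer = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if layer[page] > max_level:
            continue  # text is collected here, but hyperlinks are not followed
        for target in links.get(page, []):
            if target not in layer:  # skip pages that were already reached
                layer[target] = layer[page] + 1
                queue.append(target)
    return layer

# Hypothetical link structure: A -> B, C; B -> D; D -> E; E -> F
links = {"A": ["B", "C"], "B": ["D"], "D": ["E"], "E": ["F"]}
result = crawl("A", links, max_level=3)
```

With three levels, page E on layer 4 is still collected, while page F, reachable only through E, is not.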
As shown in figure 3, the Entity recognition includes: Text Pretreatment, morphological analysis and/or entity duplicate removal.
The Entity recognition uses hidden Markov model.
Entity recognition is the process that the noun phrase in text is identified and marked.
The Text Pretreatment includes noise remove and format conversion.
The noise remove includes: removal html label, removal quotation, removal text expression, the meaningless character of removal, goes Except top note reply etc..
Since Web Community's text is different from news corpus, it is collected in community forum, with the shape of theme model and follow-up Formula exists.It is a kind of contents non-structured, comprising multimedia messages such as picture, expression, videos, and since it is based on mutually Networking is produced and is propagated, therefore can include some useless hypertext markup labels (html label) and reply of pouring water, institute To need to carry out series of preprocessing to it before formally carrying out text analyzing.
Since the text encoding format of the system defaults such as Windows, Linux and Mac is different, different platform acquisition will lead to The format of the Web Community's text arrived is not identical.The default code of Windows text is ANSI, it is the standard text of systemic presupposition Word storage format, but in participle, the format that need input text is UTF-8, it is therefore desirable to Web Community's text of acquisition It formats, is uniformly converted into the UTF-8 text formatting for supporting participle.
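Noise removal and format conversion can be sketched as follows. Several points here are assumptions made only for illustration: that the "ANSI" input is GBK-encoded (a common case on Chinese-language Windows), that forum emoticons appear as short bracketed codes such as `[微笑]`, and the name `clean_post`.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")   # assumed emoticon codes, e.g. [微笑]
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace

def clean_post(raw_bytes, encoding="gbk"):
    """Decode a post, strip noise, and return it as UTF-8 bytes."""
    text = raw_bytes.decode(encoding, errors="replace")
    text = TAG_RE.sub(" ", text)        # remove HTML tags
    text = html.unescape(text)          # turn entities such as &nbsp; into characters
    text = EMOTICON_RE.sub(" ", text)   # remove text emoticons
    text = SPACE_RE.sub(" ", text).strip()
    return text.encode("utf-8")

raw = "<p>小王打小李[微笑]&nbsp;</p>".encode("gbk")
cleaned = clean_post(raw)
```

Quoted text and bump replies would need forum-specific rules and are left out of this sketch.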
Short text word segmentation is the process of splitting the sentences in a text to obtain multiple words.
Chinese word segmentation algorithms include: lexicon-matching-based algorithms, word-frequency-statistics-based algorithms, and rule-based algorithms.
Lexicon-matching-based algorithms include the forward maximum matching algorithm, the reverse maximum matching algorithm, and others.
Statistics-based segmentation algorithms include probability statistics algorithms, mutual information algorithms, and others.
The lexical analysis comprises short text word segmentation and part-of-speech tagging.
Short text word segmentation and part-of-speech tagging are used to cut a sentence into the correct set of words.
Take "Xiao Wang beats Xiao Li" as an example: the lexical analysis result is "Xiao Wang (noun) / beats (verb) / Xiao Li (noun)". After segmentation and part-of-speech tagging, it is clear that the named entities in the sentence are (Xiao Wang, Xiao Li). Segmentation and part-of-speech tagging must identify not only personal names but also entity nouns such as place names and organization names.
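As a small illustration of how a tagged sentence yields candidate entities, the sketch below picks out the tokens tagged as nouns. The tag names and the function `noun_entities` are assumptions; a real tagger would use its own tag set.

```python
def noun_entities(tagged):
    """Return the tokens tagged as nouns, in sentence order."""
    return [word for word, tag in tagged if tag == "noun"]

# Lexical analysis result for "Xiao Wang beats Xiao Li"
tagged = [("Xiao Wang", "noun"), ("beats", "verb"), ("Xiao Li", "noun")]
entities = noun_entities(tagged)
```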
During word segmentation, a matching algorithm and a disambiguation algorithm are also used to improve the efficiency and accuracy of segmentation.
As shown in Fig. 4, the dictionary is a custom (self-built) dictionary created according to the text characteristics of the public opinion field in university web communities, and its content is closely related to web community text. Short text word segmentation first separates the English and Chinese in the web community text. The Chinese text is then matched against the word entries in the dictionary and disambiguated, producing the segmented set of words (the segmentation result). After the segmentation result is obtained, the personal names, place names, organization names, and other entities in the result (the entity-relationship set) still need to be identified.
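A dictionary-driven segmentation step might look like the sketch below, which separates English runs from Chinese runs and applies forward maximum matching to the Chinese part. The lexicon entries are invented for illustration, disambiguation is omitted, and the function names are assumptions.

```python
import re

def forward_max_match(text, lexicon):
    """Greedy forward maximum matching against a word lexicon."""
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first; fall back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def segment(text, lexicon):
    """Separate English/digit runs from other text, then match the rest."""
    tokens = []
    for run in re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+", text):
        if re.fullmatch(r"[A-Za-z0-9]+", run):
            tokens.append(run)  # keep English words and numbers whole
        else:
            tokens.extend(forward_max_match(run, lexicon))
    return tokens

# Hypothetical self-built lexicon
lexicon = {"北京", "邮电", "大学", "北京邮电大学", "论坛"}
tokens = segment("北京邮电大学BBS论坛", lexicon)
```

Preferring the longest match keeps "北京邮电大学" as one word instead of splitting it into "北京", "邮电", and "大学".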
Because web community text mainly takes the form of posts and replies, it can contain a large number of duplicate entities. To ensure the accuracy of entity relation extraction, the entities are deduplicated.
The entity relation extraction comprises: dependency syntactic analysis and/or syntactic analysis.
Entity relation extraction means obtaining, from a corpus described in natural language, the relations that exist between named entities; for example, an employment relation may exist between a personal name and an organization. Common relations include employment, geographic location, membership, and whole-part relations. As with entity extraction, the types of entity relations are predefined. Entity relation extraction is a deeper extension of named entity recognition and provides a precondition for problems such as event content extraction, automatic question answering, machine translation, and natural language processing.
As shown in Fig. 5, after entity recognition, dependency syntactic analysis is performed on the sentences in the web community text to determine the constituents of each sentence, and the relations between entities are extracted according to the sentence features obtained by analyzing Chinese (syntactic analysis).
Dependency syntactic analysis analyzes the dependency relations among the constituents of a sentence and thereby finds the sentence's syntactic structure. It holds that the governor of a sentence is the core verb: no other constituent governs the core verb, and every governed constituent depends on its governor in some form. In other words, the grammatical constituents that dependency syntactic analysis identifies in a sentence, such as subject, predicate, and object, or attributive, adverbial, and complement, are independent of their positions in the sentence.
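To make the idea concrete, the sketch below extracts (subject, verb, object) triples from dependency arcs that are assumed to come from a parser. The arc format and the relation labels `SBV` (subject-verb) and `VOB` (verb-object) are assumptions chosen for illustration; the application does not name a parser or label set.

```python
def extract_triples(arcs):
    """arcs: list of (head, relation, dependent) from a dependency parse."""
    subjects, objects = {}, {}
    for head, rel, dep in arcs:
        if rel == "SBV":      # dependent is the subject of the head verb
            subjects.setdefault(head, []).append(dep)
        elif rel == "VOB":    # dependent is the object of the head verb
            objects.setdefault(head, []).append(dep)
    triples = []
    for verb, subs in subjects.items():
        for subj in subs:
            for obj in objects.get(verb, []):
                triples.append((subj, verb, obj))
    return triples

# Hand-built arcs for "Xiao Wang beats Xiao Li"
arcs = [("beats", "SBV", "Xiao Wang"), ("beats", "VOB", "Xiao Li")]
triples = extract_triples(arcs)
```

Because the triple is read from the arcs rather than from word order, the result does not depend on where the constituents sit in the sentence.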
A confidence is calculated for every combination of two arguments, based on the distances between arguments (entities) and between arguments and the relation expression. The lower the confidence, the less likely it is that a semantic relation exists between the arguments or between an argument and the relation expression.
By obtaining the dependency syntactic relation expressions of the sentences in the web community text and calculating the confidence, the grammatical relation within each entity pair can be extracted correctly, yielding the network entity-relationship set.
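The application does not state how the confidence is computed from the distances. Purely as an illustrative assumption, the sketch below scores a candidate by the inverse of the summed token distances, so that arguments and relation words that sit closer together receive a higher confidence. The function name and the scoring rule are hypothetical.

```python
def confidence(pos_a, pos_b, pos_rel):
    """Illustrative score in (0, 1]; positions are token indices in the sentence."""
    spread = abs(pos_a - pos_b) + abs(pos_a - pos_rel) + abs(pos_b - pos_rel)
    return 1.0 / (1.0 + spread)

near = confidence(0, 2, 1)  # argument, relation word, argument are adjacent
far = confidence(0, 9, 1)   # second argument is far from the relation word
```

Any monotonically decreasing function of distance would serve the same illustrative purpose; candidates below a chosen confidence cutoff would be discarded.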
Classifying the network entity-relationship set with a classification model to obtain entity pairs comprises:
training an entity classification model with the first entity set in the core entity-relationship set;
training a relationship classification model with the first relationship set in the core entity-relationship set;
inputting the second entity set and the second relationship set in the network entity-relationship set into the entity classification model and the relationship classification model, respectively, to obtain entity pairs.
Classification establishes multiple topics (fields) for the entities and entity relations. The data (the entity-relationship set to be classified) is input into the classification model, and the model outputs the corresponding category.
Because the entities and entity relations in the semantic model come from hierarchical text that is closely tied to its field and rigorously organized, the core entity-relationship set is used to train the classification models.
If entity and relationship classification models for the currently needed field do not yet exist, the entity classification model is trained with the first entity set in the core entity-relationship set and the relationship classification model is trained with the first relationship set in the core entity-relationship set, producing the classification models for that field. If entity and relationship classification models for the currently needed field already exist, they are used directly.
The classification algorithms that the classification model may use include: the naive Bayes algorithm, decision tree algorithms, the support vector machine (SVM) algorithm, convolutional neural network (CNN) algorithms, and others.
Training an entity classification model with the first entity set in the core entity-relationship set comprises:
classifying the first entity set in the core entity-relationship set with a classification algorithm, and training the entity classification model;
inputting a portion of the entities in the network entity-relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the classification results;
if the accuracy reaches a set value, using this entity classification model to classify the other entities in the network entity-relationship set.
The entity classification model is trained with the first entity set in the core entity-relationship set (the training set), producing the classification model and classification rules for the currently needed field. A portion of the entities in the network entity-relationship set is used as the test set and input into this model for classification. The classified test set is evaluated; if the evaluation value exceeds the set evaluation threshold, the test set is merged into the core entity-relationship set, expanding it.
Training a relationship classification model with the first relationship set in the core entity-relationship set comprises:
classifying the first relationship set in the core entity-relationship set with a classification algorithm, and training the relationship classification model;
inputting a portion of the relations in the network entity-relationship set into the model for classification, outputting the classification results, and evaluating the accuracy of the classification results;
if the accuracy reaches a set value, using this relationship classification model to classify the other relations in the network entity-relationship set.
The relationship classification model is trained with the first relationship set in the core entity-relationship set (the training set), producing the classification model and classification rules for the currently needed field. A portion of the relations in the network entity-relationship set is used as the test set and input into this model for classification. The classified test set is evaluated; if the evaluation value exceeds the set evaluation threshold, the test set is merged into the core entity-relationship set, expanding it.
The size of the test set can be configured; that is, the number of entities and relations from the network entity-relationship set that are input into the classification models as test sets can be set.
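The evaluate-then-expand loop applies equally to the entity and relationship models and can be sketched as follows. The classifier is passed in as a plain function, plain accuracy is used as the evaluation value, and the test items are assumed to carry expected labels; all three choices, like the names used, are assumptions for illustration, since the application leaves these details open.

```python
def evaluate_and_expand(classify, test_items, core_set, threshold):
    """Evaluate a classifier on a test set and expand the core set on success.

    classify:   function mapping an item to a predicted label
    test_items: list of (item, expected_label) pairs
    Returns (accuracy, possibly expanded copy of core_set).
    """
    hits = sum(1 for item, expected in test_items if classify(item) == expected)
    accuracy = hits / len(test_items) if test_items else 0.0
    if accuracy > threshold:
        # merge the evaluated test set into the core entity-relationship set
        core_set = core_set | {item for item, _ in test_items}
    return accuracy, core_set

def toy_classify(entity):
    """Stand-in for a trained entity classification model."""
    return "campus" if entity in {"library", "dormitory", "canteen"} else "other"

core = {"classroom"}
test_items = [("library", "campus"), ("canteen", "campus"),
              ("stock", "other"), ("dormitory", "other")]
accuracy, expanded = evaluate_and_expand(toy_classify, test_items, core, threshold=0.7)
```

Returning a new set instead of mutating `core` keeps the original core set available if the expansion later needs to be rolled back.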
There are three evaluation metrics: precision P, recall R, and the F value. The F value is computed as:
$$F = \frac{(1 + A^2)\,P\,R}{R + A^2 P}$$
The terms are defined as follows:
P = (number of named entities of a class that are correctly identified) / (total number of named entities identified as that class) × 100%;
R = (number of named entities of a class that are correctly identified) / (total number of named entities of that class) × 100%;
A is a configurable parameter; here A = 1, so the F value is also called the F1 value.
The evaluation value is generally determined by the three values above together.
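The three metrics can be computed directly from the counts in their definitions. The sketch below follows the formula F = (1 + A²)·P·R / (R + A²·P); the function name and the guards against empty counts are additions for illustration.

```python
def prf(correct, identified, total, a=1.0):
    """Precision, recall, and F value for one entity class.

    correct:    entities of the class that were correctly identified
    identified: all entities identified as the class
    total:      all entities that truly belong to the class
    """
    p = correct / identified if identified else 0.0
    r = correct / total if total else 0.0
    denom = r + a * a * p
    f = (1 + a * a) * p * r / denom if denom else 0.0
    return p, r, f

p, r, f = prf(correct=8, identified=10, total=16)
```

With A = 1 the F value reduces to the harmonic mean of P and R, here 2 × 0.8 × 0.5 / 1.3.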
The entities and/or entity relations classified by the corresponding classification models are all added to the core entity-relationship set of the corresponding category if the evaluation value exceeds the evaluation threshold, expanding the corpus of that category.
To obtained each entity to hierarchical classification calculating is carried out, by entity to being integrated into semantic model.
By sorting algorithm and hierarchical classification, newly identified entity and entity relationship can be positioned, thus and core Heart semantic model carries out entity fusion and relationship fusion, realizes growing certainly for semantic model.
The Feature Words semantic association degree for calculating different themes needs to construct the semantic tree with hierarchical structure, reuses reverse Word frequency filters (Term Frequency-Inverse Document Frequency, TF-IDF) algorithm degree of being associated and calculates.
During classified calculating and hierarchical classification calculate, it is also necessary to which degree of being associated calculates.
The main purpose of reverse word frequency filtering is some appearance frequent in the text of filtering, but the word having little significance, and is protected The biggish word of important ratio is stayed, this method is also used for feature space dimensionality reduction and feature extraction.It is filtered and is weighed by reverse word frequency The relationship between substance feature word and text set is measured, the number that the specific word occurs in some file in text set is more, right This document is more important.
The hierarchical structure of the semantic tree can be determined manually: "colleges and universities" is designated as the root node, and a simple semantic model with a three-level hierarchy is established, as shown in Figure 6. The association degree between the root node and a child node is the proportion of the number of texts covering the child node's topic to the total number of texts. The association degree between a child node and an adjacent leaf node is the weight of the leaf node's feature word, whose TF-IDF weight can be calculated with the following formula:
where N is the total number of texts and n(w) is the number of texts containing w.
IDF(w) represents the TF value of the entity word w.
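The formula referred to above is not reproduced in this text. As an assumption, a conventional TF-IDF weight that uses only the quantities defined here (N, n(w), and the term frequency) has the form below. The symbol W(w), the logarithm base, and the absence of a smoothing term are illustrative choices, not taken from the source.

```latex
W(w) = \mathrm{TF}(w) \times \log\frac{N}{n(w)}
```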
The association degree between two leaf nodes under the same topic node is the product of the two leaf nodes' association degrees to their common parent node, and can be calculated with the following formula:
TF denotes the quotient of the number of times a given entity occurs in the entity set and the number of times all entities occur there. If an entity occurs with high frequency in the texts of a given topic, the entity is related to that topic, and its word frequency can be counted.
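The second formula is likewise not reproduced in this text. Read literally, the two statements above can be written as follows. The symbols are introduced here for illustration only: c(w) is the number of occurrences of entity w in the entity set E, W(w) is the association degree of leaf w to its parent topic node, and R is the leaf-to-leaf association degree.

```latex
\mathrm{TF}(w) = \frac{c(w)}{\sum_{e \in E} c(e)}, \qquad
R(w_i, w_j) = W(w_i) \cdot W(w_j)
```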
Through the association degree calculation, the association degree between each topic and its feature words, and the association degree between feature words under each topic, can be obtained. Once the association degrees between each entity pair and the feature words in the semantic model have been obtained, each entity pair can be inserted at the correct position in the semantic model and fused with it.
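The three association degrees on the semantic tree can be sketched as below. This is a hedged illustration under stated assumptions: the topic-to-leaf weight uses the conventional TF-IDF form with a natural logarithm, documents are modeled as sets of entity words, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def topic_weight(topic_text_count: int, total_text_count: int) -> float:
    """Root-to-topic association: share of all texts that cover the topic."""
    return topic_text_count / total_text_count


def tf_idf_weights(occurrences: List[str], docs: List[Set[str]]) -> Dict[str, float]:
    """Topic-to-leaf association for each entity word w, computed as
    TF(w) * log(N / n(w)), where TF(w) is the share of w among all entity
    occurrences, N is the number of documents, and n(w) is the number of
    documents containing w. Words found in no document are skipped."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    n_docs = len(docs)
    weights: Dict[str, float] = {}
    for word, count in counts.items():
        n_w = sum(1 for doc in docs if word in doc)
        if n_w == 0:
            continue
        weights[word] = (count / total) * math.log(n_docs / n_w)
    return weights


def leaf_association(weights: Dict[str, float], a: str, b: str) -> float:
    """Leaf-to-leaf association under one topic: product of both leaves'
    association degrees to their common parent node."""
    return weights[a] * weights[b]
```

With this form, a word that appears in every document gets weight zero, which matches the stated goal of filtering out frequent but uninformative words.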
The visualization processing is implemented with a visualization tool.
The source of the text, the original content of the text, and the semantic network are all saved, which makes it convenient to trace the source, read the text again, and retrieve it.
The source of the text mainly includes: the community name, the detailed link of the text, and other text content related to the text.
The original content of the text includes the text before preprocessing.
The semantic network includes the relations between the entities visualized in the map and the correspondence between those relations and the text.
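One possible shape for the stored record is sketched below. It is an assumption for illustration only: the class names, field names, and the triple representation of a relation are hypothetical, and the link in the usage example is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity), assumed shape


@dataclass(frozen=True)
class TextSource:
    community_name: str
    link: str                             # detailed link of the text
    related_texts: Tuple[str, ...] = ()   # other text content related to the text


@dataclass
class StoredText:
    source: TextSource
    raw_content: str                      # text before any preprocessing
    triples: List[Triple] = field(default_factory=list)


def texts_for_relation(store: List[StoredText], triple: Triple) -> List[StoredText]:
    """Trace a relation shown in the map back to the texts it came from."""
    return [text for text in store if triple in text.triples]
```

For example, a record built as `StoredText(TextSource("campus forum", "https://example.invalid/post/1"), "raw post text", [("Zhang", "teaches", "algebra")])` can be found again by passing the same triple to `texts_for_relation`.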
Data such as each semantic model, the network entity relationship set, the core entity relationship set, and the corpus are all stored.
The core entity relationship set is obtained from the semantic model.
The entity relationship set includes: an entity set and a relation set.
The recognition includes labeling.
According to an embodiment of the present application, an entity relationship map construction system for Web Community text is also proposed, as shown in Figure 7, comprising:
Acquisition module 101, for automatically acquiring text from web pages and text from the Web Community;
Semantic model module 102, for performing entity recognition and entity relation extraction on the text acquired from web pages to construct a semantic model;
Text analysis module 103, for performing entity recognition and entity relation extraction on the text acquired from the Web Community to obtain a network entity relationship set;
Fusion and display module 104, for classifying the network entity relationship set using a classification model to obtain entity pairs; performing hierarchical classification calculation on the entity pairs and fusing the entity pairs into the semantic model; and visualizing the fused semantic model to obtain an entity relationship map.
The fusion and display module further includes visualization storage, for storing the source of the text, the original content of the text, and the semantic network.
The semantic model module is also used to store data such as each semantic model, the network entity relationship set, the core entity relationship set, and the corpus.
In the method of the present application, the semantic model is generated from plain text in specific web pages, which ensures the accuracy and reliability of the entity relations. Web Community short texts are preprocessed to remove meaningless characters, text emoticons, quotations, and similar information, which standardizes the acquired short texts. The characteristics of Web Community short texts are analyzed to mine the deep semantic entity relations they contain, yielding the entity relations in the short texts. Classification models are trained with a classification algorithm on the core entity relationship set and evaluated with the network entity relationship set, which increases the reliability of the classification. The network entity relationship sets that pass the evaluation are added to the core semantic model, which increases its richness, stability, and automatic extensibility. By visualizing the entity relationship map, hot public-opinion topics in the Web Community can be found promptly and accurately, which gives community administrators strong decision support for managing and governing the Web Community.
The above are only preferred specific embodiments of the present application, but the protection scope of the application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present application shall fall within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the protection scope of the claims.

Claims (10)

1. An entity relationship map construction method for Web Community text, characterized by comprising:
acquiring text from web pages;
performing entity recognition and entity relation extraction on the text from the web pages to construct a semantic model;
acquiring text from a Web Community;
performing entity recognition and entity relation extraction on the text from the Web Community to obtain a network entity relationship set;
classifying the network entity relationship set using a classification model to obtain entity pairs;
performing hierarchical classification calculation on the entity pairs and fusing the entity pairs into the semantic model;
visualizing the fused semantic model to obtain an entity relationship map.
2. The entity relationship map construction method for Web Community text according to claim 1, characterized in that acquiring text from web pages comprises:
acquiring first text having a semantic structure in the current web page, and searching for text hyperlinks;
acquiring second text having a semantic structure in the web pages corresponding to the text hyperlinks.
3. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity recognition comprises: text preprocessing, lexical analysis, and/or entity deduplication.
4. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity relation extraction comprises: dependency syntactic analysis and/or syntactic analysis.
5. The entity relationship map construction method for Web Community text according to claim 1, characterized in that classifying the network entity relationship set using a classification model to obtain entity pairs comprises:
training an entity classification model using a first entity set in a core entity relationship set;
training a relation classification model using a first relation set in the core entity relationship set;
inputting a second entity set and a second relation set in the network entity relationship set into the entity classification model and the relation classification model, respectively, to obtain entity pairs.
6. The entity relationship map construction method for Web Community text according to claim 5, characterized in that training an entity classification model using the first entity set in the core entity relationship set comprises:
classifying the first entity set in the core entity relationship set using a classification algorithm to train the entity classification model;
inputting a portion of the entities in the network entity relationship set into the model for classification, outputting classification results, and evaluating the accuracy of the classification results;
if the accuracy reaches a set value, classifying the other entities in the network entity relationship set using this entity classification model.
7. The entity relationship map construction method for Web Community text according to claim 5, characterized in that training a relation classification model using the first relation set in the core entity relationship set comprises:
classifying the first relation set in the core entity relationship set using a classification algorithm to train the relation classification model;
inputting a portion of the relations in the network entity relationship set into the model for classification, outputting classification results, and evaluating the accuracy of the relation classification results;
if the accuracy reaches a set value, classifying the other relations in the network entity relationship set using this relation classification model.
8. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the core entity relationship set is obtained from the semantic model.
9. The entity relationship map construction method for Web Community text according to claim 1, characterized in that the entity relationship set comprises: an entity set and a relation set.
10. An entity relationship map construction system for Web Community text, characterized by comprising:
an acquisition module, for automatically acquiring text from web pages and text from a Web Community;
a semantic model module, for performing entity recognition and entity relation extraction on the text acquired from web pages to construct a semantic model;
a text analysis module, for performing entity recognition and entity relation extraction on the text acquired from the Web Community to obtain a network entity relationship set;
a fusion and display module, for classifying the network entity relationship set using a classification model to obtain entity pairs; performing hierarchical classification calculation on the entity pairs and fusing the entity pairs into the semantic model; and visualizing the fused semantic model to obtain an entity relationship map.
CN201910277242.8A 2019-04-08 2019-04-08 A kind of entity relationship map construction method and system for Web Community's text Pending CN110188191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910277242.8A CN110188191A (en) 2019-04-08 2019-04-08 A kind of entity relationship map construction method and system for Web Community's text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910277242.8A CN110188191A (en) 2019-04-08 2019-04-08 A kind of entity relationship map construction method and system for Web Community's text

Publications (1)

Publication Number Publication Date
CN110188191A true CN110188191A (en) 2019-08-30

Family

ID=67713784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910277242.8A Pending CN110188191A (en) 2019-04-08 2019-04-08 A kind of entity relationship map construction method and system for Web Community's text

Country Status (1)

Country Link
CN (1) CN110188191A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727803A (en) * 2019-10-10 2020-01-24 北京明略软件系统有限公司 Text event extraction method and device
CN110737845A (en) * 2019-10-15 2020-01-31 精硕科技(北京)股份有限公司 method, computer storage medium and system for realizing information analysis
CN110795573A (en) * 2019-10-31 2020-02-14 北京邮电大学 Method and device for predicting geographic position of webpage content
CN111310454A (en) * 2020-01-17 2020-06-19 北京邮电大学 Relation extraction method and device based on domain migration
CN111400448A (en) * 2020-03-12 2020-07-10 中国建设银行股份有限公司 Method and device for analyzing incidence relation of objects
CN112100292A (en) * 2020-09-22 2020-12-18 山东旗帜信息有限公司 Personnel relation map determination method and device
CN112364173A (en) * 2020-10-21 2021-02-12 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN113254635A (en) * 2021-04-14 2021-08-13 腾讯科技(深圳)有限公司 Data processing method, device and storage medium
CN113269271A (en) * 2021-04-30 2021-08-17 清华大学 Initialization method and equipment of double-dictionary model for artificial intelligence text analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN108345647A (en) * 2018-01-18 2018-07-31 北京邮电大学 Domain knowledge map construction system and method based on Web
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109064318A (en) * 2018-08-24 2018-12-21 苏宁消费金融有限公司 A kind of internet financial risks monitoring system of knowledge based map

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
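杨浩: "Spatial semantic association analysis of social network public opinion for the 'Belt and Road'", China Master's Theses Full-text Database, Basic Sciences *
王超: "Research and implementation of a person relationship graph for Chinese microblogs based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
贾丙静, 马润: "Research on knowledge graph construction based on entity alignment", Journal of Jiamusi University (Natural Science Edition) *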

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727803A (en) * 2019-10-10 2020-01-24 北京明略软件系统有限公司 Text event extraction method and device
CN110737845A (en) * 2019-10-15 2020-01-31 精硕科技(北京)股份有限公司 method, computer storage medium and system for realizing information analysis
CN110795573B (en) * 2019-10-31 2021-09-28 北京邮电大学 Method and device for predicting geographic position of webpage content
CN110795573A (en) * 2019-10-31 2020-02-14 北京邮电大学 Method and device for predicting geographic position of webpage content
CN111310454A (en) * 2020-01-17 2020-06-19 北京邮电大学 Relation extraction method and device based on domain migration
CN111310454B (en) * 2020-01-17 2022-01-07 北京邮电大学 Relation extraction method and device based on domain migration
CN111400448A (en) * 2020-03-12 2020-07-10 中国建设银行股份有限公司 Method and device for analyzing incidence relation of objects
CN112100292A (en) * 2020-09-22 2020-12-18 山东旗帜信息有限公司 Personnel relation map determination method and device
CN112364173A (en) * 2020-10-21 2021-02-12 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN112364173B (en) * 2020-10-21 2022-03-18 中国电子科技网络信息安全有限公司 IP address mechanism tracing method based on knowledge graph
CN113254635A (en) * 2021-04-14 2021-08-13 腾讯科技(深圳)有限公司 Data processing method, device and storage medium
CN113269271A (en) * 2021-04-30 2021-08-17 清华大学 Initialization method and equipment of double-dictionary model for artificial intelligence text analysis
CN113269271B (en) * 2021-04-30 2022-11-15 清华大学 Initialization method and equipment of double-dictionary model for artificial intelligence text analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190830