WO2018036239A1 - Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database - Google Patents

Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database

Info

Publication number
WO2018036239A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
industry
data
event
entities
Prior art date
Application number
PCT/CN2017/087000
Other languages
French (fr)
Chinese (zh)
Inventor
何超
梁颖琪
车慧诗
Original Assignee
慧科讯业有限公司
Priority date
Filing date
Publication date
Application filed by 慧科讯业有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the invention relates to the field of internet media monitoring, in particular to a technology for constructing an industry knowledge map database and a technology for monitoring internet media events based on the constructed industry knowledge map database.
  • the existing Internet media monitoring technology has the following defects: 1) Internet media monitoring is provided to users through interest matching, which requires users to define in advance the content topics, related entities, etc. that interest them. As a result, monitoring can identify only events that are directly related to user-defined entities; events that are indirectly related to entities of interest to the user, but not defined by the user, are not recognized. 2) The attributes of the monitored objects are limited: monitoring can be provided only for a single media category and data source (for example, a specific social media platform, news media, forum, or blog), a single data type (generally text), and a single language.
  • An object of the present invention is to provide a technology for constructing an industry knowledge map database, which extracts relevant data for a specific industry or field and stores it in a knowledge map database. The constructed industry knowledge map database can be applied to Internet media monitoring to achieve automated, in-depth monitoring of relevant Internet media events.
  • Another object of the present invention is to provide a technique for monitoring Internet media events based on the constructed industry knowledge map database, which can identify indirectly related entities corresponding to specific media events and can monitor multiple types of Internet media data.
  • the present invention provides a method for constructing an industry knowledge map database, comprising the steps of: obtaining industry data from a data source; performing data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; and constructing the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the step of obtaining industry data is implemented by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the step of performing data processing on the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data; and the step of constructing the industry knowledge map database is implemented by generating the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the step of obtaining industry data is implemented by using a web crawler technology to obtain industry-related data from an internet data source, the internet data source comprising an unstructured or semi-structured data source;
  • the step of performing data processing on the industry data is implemented by using an information extraction technique in natural language processing to perform entity identification and relationship extraction on the industry-related data to extract the entity, the entity attribute, and/or the entity relationship;
  • the step of constructing an industry knowledge map database is accomplished by supplementing or updating the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined cycle.
  • the step of obtaining industry data is implemented by using an application program interface (API) to query industry-related data from an internet data source, the internet data source including an open data source;
  • the step of performing data processing on the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
  • the step of constructing an industry knowledge map database is implemented by supplementing or updating the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined cycle.
  • the step of obtaining industry data is implemented by obtaining industry-related Internet media data from an Internet data source by using an application program interface (API) or a web crawler technology; the step of performing data processing on the industry data is implemented by performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data;
  • the step of constructing an industry knowledge map database is implemented by supplementing the industry knowledge map database based on the specific media event and the corresponding directly related entities, wherein the specific media event is added to the industry knowledge map database as an abstract entity.
  • the directly related entity corresponding to the specific media event is identified by at least one of: identifying an entity from text data based on entity recognition in natural language processing; identifying an entity from image or video data based on image or video recognition processing; or identifying an entity from audio or video data based on speech recognition processing.
  • the specific media event comprises a negative event, an emergency, a crisis event, a mass event, a public opinion event or other event of industry significance. Further preferably, the above steps are performed in real time without interruption.
  • the step of constructing an industry knowledge map database comprises performing semantic disambiguation and entity linking on the extracted entities.
  • the step of performing semantic disambiguation and entity linking on the extracted entities is further implemented by at least one of the following methods: performing semantic disambiguation and entity linking on each extracted entity reference one by one based on entity knowledge; or, based on the topic consistency hypothesis, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entities jointly.
  • the invention also provides a method for monitoring specific media events related to an industry based on the industry knowledge map database constructed in the invention, comprising the steps of: acquiring internet media data; performing event detection, event evaluation, and screening based on the acquired internet media data to obtain the specific media event related to the industry; identifying a directly related entity corresponding to the specific media event; accessing the industry knowledge map database based on the directly related entity to determine an indirectly related entity corresponding to the specific media event; and sending an alert message to the directly related entity and/or the indirectly related entity.
  • the event detection in the step of performing event detection, event evaluation, and screening comprises the steps of: classifying the content in the acquired internet media data to obtain content for a specific topic; identifying the entities involved from the obtained content; performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis; and performing event discovery based on the filtered content to cluster media events and discover new media events.
  • the event detection further comprises the steps of: analyzing the authenticity of the event based on the attributes of the media event, and sorting and/or filtering the media events according to the analysis result.
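The event detection steps above (topic classification, entity identification, sentiment filtering, event discovery by clustering) can be sketched as a small pipeline. This is only an illustrative sketch: the keyword sets, the entity list, and the rule of clustering by shared entities and trigger words are assumptions made for the example, not the classifiers or clustering method of the invention.

```python
from collections import defaultdict

# All three lookup tables are assumed, toy stand-ins for trained models
TOPIC_KEYWORDS = {"finance": {"shares", "profit", "recall", "lawsuit"}}
KNOWN_ENTITIES = {"Company A", "Company B"}
NEGATIVE_WORDS = {"recall", "lawsuit", "scandal", "loss"}


def detect_events(posts, topic="finance"):
    """Group posts into candidate media events."""
    clusters = defaultdict(list)
    for post in posts:
        words = {w.strip(".,").lower() for w in post.split()}
        # 1) topic classification: keep content on the target topic
        if not words & TOPIC_KEYWORDS[topic]:
            continue
        # 2) entity identification
        entities = sorted(e for e in KNOWN_ENTITIES if e in post)
        if not entities:
            continue
        # 3) sentiment analysis and filtering: keep negative content only
        triggers = words & NEGATIVE_WORDS
        if not triggers:
            continue
        # 4) event discovery: cluster posts sharing entities and trigger words
        key = (tuple(entities), tuple(sorted(triggers)))
        clusters[key].append(post)
    return dict(clusters)


posts = [
    "Company A announces product recall.",
    "Regulator confirms Company A recall.",
    "Company B profit rises.",
    "Nice weather today.",
]
events = detect_events(posts)
```

Here the two recall posts about Company A fall into one cluster, while the neutral and off-topic posts are filtered out before clustering.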
  • the directly related entity corresponding to the specific media event is identified by at least one of: identifying the entity from text data based on entity recognition in natural language processing; identifying the entity from image or video data based on image or video recognition processing; or identifying the entity from audio or video data based on speech recognition processing.
  • the step of accessing the industry knowledge map database is implemented by querying in the industry knowledge map database to determine the indirectly related entity based on the directly related entity.
  • the step of accessing the industry knowledge map database is implemented by using data mining techniques in the industry knowledge map database to determine the indirectly related entities based on the directly related entities.
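Determining indirectly related entities from directly related ones can be illustrated as a bounded graph traversal. The sketch below assumes the knowledge map is available as an in-memory adjacency dictionary and that "indirectly related" means reachable within a fixed number of hops; both are assumptions for illustration, and the entity names (including "Supplier C") are hypothetical.

```python
from collections import deque


def indirectly_related(graph, direct_entities, max_hops=2):
    """Breadth-first search outward from the directly related entities.

    graph: dict mapping an entity to the set of its neighbours.
    Returns {entity: hop_count} for entities reachable within max_hops
    that are not themselves directly related.
    """
    direct = set(direct_entities)
    seen = set(direct)
    queue = deque((e, 0) for e in direct)
    found = {}
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(entity, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                found[neighbour] = hops + 1
                queue.append((neighbour, hops + 1))
    return found


# Hypothetical knowledge map fragment
graph = {
    "Company A": {"Person X", "Supplier C"},
    "Person X": {"Company A", "Company B"},
    "Supplier C": {"Company A"},
    "Company B": {"Person X"},
}
result = indirectly_related(graph, ["Company A"])
```

With a graph database the same idea would typically be expressed as a variable-length path query rather than an in-memory search.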
  • the present invention also provides an apparatus for constructing an industry knowledge map database, comprising: a data acquisition module for acquiring industry data from a data source; a data processing module for performing data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; and a database building module for constructing the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the data acquisition module obtains industry data by obtaining structured industry data from a third-party industry database, the structured industry data including a plurality of fields; the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database building module constructs the industry knowledge map database by generating it based on the extracted entities, entity attributes, and/or entity relationships.
  • the data acquisition module obtains industry data by using web crawler technology to obtain industry-related data from an Internet data source, the Internet data source comprising an unstructured or semi-structured data source;
  • the data processing module performs data processing by using an information extraction technique in natural language processing to perform entity identification and relationship extraction on the industry-related data to extract the entity, the entity attribute, and/or the entity relationship;
  • the database building module constructs the industry knowledge map database by supplementing or updating the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the data acquisition module obtains industry data by using an application program interface (API) to query industry-related data from an Internet data source, the Internet data source including an open data source; the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database building module constructs the industry knowledge map database by supplementing or updating it based on the extracted entities, entity attributes, and/or entity relationships.
  • the data acquisition module acquires industry data by acquiring industry-related Internet media data from an Internet data source by using an application program interface (API) or a web crawler technology;
  • the data processing module performs data processing by performing event detection, event evaluation, and screening on the internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the internet media data;
  • the database building module constructs the industry knowledge map database by supplementing the industry knowledge map database based on the specific media event and the corresponding directly related entities, wherein the specific media event is added to the industry knowledge map database as an abstract entity.
  • said database building module further identifies a directly related entity corresponding to said particular media event by at least one of: identifying an entity from text data based on entity recognition in natural language processing; based on image or video recognition Processing identifies an entity from image or video data; or identifies an entity from audio or video data based on a speech recognition process.
  • the database construction module comprises: a module for semantic disambiguation and entity linking of the extracted entities.
  • the module for performing semantic disambiguation and entity linking on the extracted entities further performs semantic disambiguation and entity linking by at least one of: performing semantic disambiguation and entity linking on each extracted entity independently, based on entity knowledge; or performing semantic disambiguation and entity linking on the extracted entities jointly, by using the associations among candidate entities in the knowledge base based on the topic consistency hypothesis.
  • the specific media event comprises a negative event, an emergency, a crisis event, a mass event, a public opinion event or other event of industry significance.
  • the present invention also provides a system for monitoring specific media events related to the industry, comprising: a data acquisition unit for obtaining industry data from a data source; a data processing unit for performing data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; a database building unit for building the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships; a database storage unit for storing the built industry knowledge map database; a media event monitoring unit for acquiring internet media data, performing event detection, event evaluation, and screening based on the acquired internet media data to obtain the specific media event related to the industry, and identifying a directly related entity corresponding to the specific media event; a database access unit for accessing the industry knowledge map database based on the directly related entity to determine an indirectly related entity corresponding to the specific media event; and a message sending unit for sending an alert message to the directly related entity and/or the indirectly related entity.
  • the data obtaining unit comprises: a structured data obtaining unit, configured to obtain structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the data processing unit comprises: a structured data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting entities related to the industry and the corresponding entity attributes and/or entity relationships;
  • the database building unit includes: a database generating unit configured to generate the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the data acquisition unit comprises: an industry-related data acquisition unit, configured to obtain industry-related data from an Internet data source, including an unstructured or semi-structured data source, by using a web crawler technology;
  • the data processing unit includes: an industry-related data processing unit, configured to perform entity identification and relationship extraction on the industry-related data by using an information extraction technique in natural language processing to extract the entity, entity attributes, and/or Entity relationship;
  • the database construction unit includes: a database supplement/update unit for supplementing or updating the industry knowledge map database based on the extracted entity, entity attribute, and/or entity relationship.
  • the data obtaining unit includes: an industry-related data acquiring unit, configured to acquire industry-related data from an Internet data source by using an application program interface (API), where the Internet data source includes an open data source;
  • the data processing unit includes: an industry-related data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting entities related to the industry and the corresponding entity attributes and/or entity relationships;
  • the database building unit includes a database supplement/update unit for supplementing or updating the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • the data obtaining unit comprises: a media data acquiring unit, configured to acquire industry-related Internet media data from an Internet data source by using an application program interface (API) or a web crawler technology;
  • the data processing unit comprises: a media data processing unit, configured to perform event detection, event evaluation, and screening on the internet media data to extract a specific media event related to the industry, and to identify the corresponding directly related entity from the internet media data;
  • the database construction unit comprises: a database supplement/update unit for supplementing the industry knowledge map database based on the specific media event and the corresponding directly related entity, wherein the specific media event is added to the industry knowledge map database as an abstract entity.
  • the database supplementing/updating unit is further configured to perform semantic disambiguation and entity linking on the extracted entity.
  • the media event monitoring unit is further configured to: perform topic classification on the content in the acquired internet media data to obtain content for a specific topic; identify the entities involved from the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the result of the sentiment analysis; and perform event discovery based on the filtered content to cluster media events and discover new media events. Further preferably, the media event monitoring unit is further configured to: analyze the authenticity of the event based on the attributes of the media event, and sort and/or filter the media events according to the analysis result.
  • the database access unit is further configured to query the industry knowledge map database to determine the indirectly related entity based on the directly related entity.
  • the database access unit is further configured to: use the data mining technology to determine the indirectly related entity in the industry knowledge map database based on the directly related entity.
  • the specific media event comprises a negative event, an emergency, a crisis event, a mass event, a public opinion event or other event of industry significance.
  • the following technical effects can be obtained: 1) automated, in-depth monitoring of related Internet media events for one or more target fields or industries, with the ability to identify indirectly related entities corresponding to specific media events; 2) automated processing of Internet media data from multiple data sources, of multiple data types, and in multiple languages during monitoring.
  • FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge map database provided by the present invention
  • FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention.
  • FIG. 4 is an exemplary flowchart of another method for constructing an industry knowledge map database provided by the present invention.
  • FIG. 5 is an exemplary flowchart of another method for constructing an industry knowledge map database provided by the present invention.
  • FIG. 6 is an exemplary block diagram of a system for monitoring media events provided by the present invention.
  • the present invention provides a technique for constructing an industry knowledge map database and a technique for monitoring Internet media events based on the constructed industry knowledge map database to achieve the objectives of the present invention.
  • the invention relates to the application of knowledge graph database technology.
  • the knowledge map database is a special database for knowledge management that facilitates the collection, collation, and extraction of knowledge in related fields.
  • Entities, entity attributes, and entity relationships are defined in the Knowledge Graph database.
  • the entity corresponds to things in the real world (for example, a company A, a character X), and each entity can be identified by a globally unique ID.
  • Entity attributes are used to describe the intrinsic properties of an entity (for example, the Chinese and English names of company A or person X).
  • Entity relationships are used to connect entities to describe the connections between entities (for example, the relationship between person X and company A).
  • knowledge of entities, entity attributes, and entity relationships can be utilized more efficiently and in depth to discover complex connections between things.
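The three notions defined above (entities with globally unique IDs, entity attributes, and entity relationships) can be sketched as a minimal in-memory data model. The class names, the use of UUIDs as IDs, and the relation label "chairman_of" are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Entity:
    """A real-world thing, identified by a globally unique ID."""
    name: str
    attributes: dict = field(default_factory=dict)
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass(frozen=True)
class Relationship:
    """A directed, labelled connection between two entities."""
    source_id: str
    relation: str
    target_id: str


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}          # entity_id -> Entity
        self.relationships = set()  # set of Relationship

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity
        return entity.entity_id

    def relate(self, source_id, relation, target_id):
        self.relationships.add(Relationship(source_id, relation, target_id))

    def neighbours(self, entity_id):
        """Return (relation, other_entity_id) pairs touching entity_id."""
        out = []
        for r in self.relationships:
            if r.source_id == entity_id:
                out.append((r.relation, r.target_id))
            elif r.target_id == entity_id:
                out.append((r.relation, r.source_id))
        return out


kg = KnowledgeGraph()
company_a = kg.add_entity(Entity("Company A", {"name_en": "Company A"}))
person_x = kg.add_entity(Entity("Person X"))
kg.relate(person_x, "chairman_of", company_a)  # hypothetical relation label
```

Following relationships from one entity to the next is what allows connections that are not stated in any single record to be discovered.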
  • the knowledge map database can be stored in a variety of forms.
  • the knowledge map database can be stored as semantic-network RDF (Resource Description Framework) triples, in a traditional relational database, or in a newer non-relational database.
  • the knowledge map database can be stored using a graph database, such as Neo4j, OrientDB, Titan-BerkeleyDB, HyperGraphDB, and the like.
  • the data sources used to build the knowledge map database can be varied.
  • the data source can be an open encyclopedia data source (e.g., Baidu Encyclopedia, Wikipedia, etc.), a structured database (e.g., Wikidata, DBpedia, vertical websites, or specialized databases for specific industries, etc.), or any related third-party semi-structured or unstructured data source (for example, professional websites and content published on Internet media, including news, company annual reports, corporate announcements, etc.).
  • the knowledge map database constructed in the present invention is oriented to a particular field or industry during the construction process, but is not limited to a single industry.
  • the constructed knowledge map database integrates the entities, attributes, and events related to one or more industries, together with the relationships between entities, between entities and events, and between events, into a single knowledge map.
  • FIG. 1 is an exemplary flow chart of a method for constructing an industry knowledge map database provided by the present invention, which may include steps S11-S15.
  • step S11 industry data is obtained from an industry data source, and entities and corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge map database.
  • An industry data source is a source of basic data for one or more specific areas or industries that are targeted for monitoring.
  • the industry data source can be a structured industry database to obtain high quality industry basic data as much as possible.
  • the structured database can be accessed through an application programming interface (API) to obtain data by query (for example, through a query command).
  • through extract-transform-load (ETL) operations, the obtained industry data can be converted, and the entities, entity attributes, and entity relationships are then extracted from the converted data and loaded into the industry knowledge map database proposed by the present invention. The specific execution steps of the ETL operation can be implemented by existing data integration means. For example, in an ontology-based data integration method, the mapping relationships between the fields in different databases and the various kinds of entity information are defined in advance, so that entities, entity attributes, and entity relationships can be extracted according to the fields and their contents, completing the construction of the basic industry knowledge map database.
  • data cleaning operations may be required in the process of data processing of industry data. Data cleaning operations can be implemented in conjunction with ETL processing using techniques known in the art.
  • FIG. 2 illustrates exemplary structured industry data that, as described above, may be obtained from a structured industry database.
  • Table 1 is an example of structured data for listed companies. It includes two data items, company A and company B, each of which includes fields such as the company's Chinese and English names, registered address, stock code, and chairman of the board.
  • entities (i.e., company A, company B, person X, person Y)
  • entity attributes (i.e., specific information about company A and company B)
  • entity relationships (i.e., companies …)
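The field-mapping idea behind the ETL step can be sketched as follows: each field of a structured record is declared in advance to be either an entity attribute or an entity relationship, and records are transformed accordingly. The field names, the relation label "has_chairman", and the sample values (including the stock code) are placeholders assumed for illustration; they are not taken from Table 1.

```python
# Hypothetical field-to-ontology mapping; the field names are assumptions
FIELD_MAPPING = {
    "name_zh": ("attribute", "name_zh"),
    "name_en": ("attribute", "name_en"),
    "registered_address": ("attribute", "registered_address"),
    "stock_code": ("attribute", "stock_code"),
    "chairman": ("relationship", "has_chairman"),
}


def transform(records, key_field="name_en"):
    """Turn flat records into entities, attribute triples, and relationships."""
    entities, attributes, relationships = set(), [], []
    for record in records:
        company = record[key_field].strip()
        entities.add(company)
        for field_name, value in record.items():
            kind, label = FIELD_MAPPING.get(field_name, (None, None))
            if kind is None or value is None or not str(value).strip():
                continue  # data cleaning: skip unmapped or empty fields
            value = str(value).strip()
            if kind == "attribute":
                attributes.append((company, label, value))
            else:
                entities.add(value)  # the related person becomes an entity
                relationships.append((company, label, value))
    return entities, attributes, relationships


# Placeholder records standing in for rows of a structured industry database
records = [
    {"name_en": "Company A", "stock_code": "0001", "chairman": "Person X"},
    {"name_en": " Company B ", "stock_code": "", "chairman": "Person Y"},
]
entities, attributes, relationships = transform(records)
```

Keeping the mapping in a declarative table means that a new source database only needs a new mapping, not new extraction code.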
  • the industry data source can also be a semi-structured or unstructured data source from the Internet; industry data can be crawled from the data source through web crawler technology, and entities, entity attributes, and entity relationships can be extracted using information extraction operations based on natural language processing technology.
  • in step S12, data related to the industry is obtained from an internet data source, and the entities related to the industry and the corresponding entity attributes and entity relationships are extracted from the data.
  • Internet data sources can be structured, semi-structured, or unstructured data sources. Therefore, for different structural characteristics of Internet data sources, industry-related data can be obtained in different ways. The entity and corresponding entity attributes and entity relationships are then extracted from the industry-related data.
  • for a structured data source, the corresponding data content can be queried through the API, and the entities, entity attributes, and entity relationships can be obtained from it.
  • for a semi-structured data source, after the data content is captured, the content is analyzed by the information extraction operation in natural language processing technology, thereby extracting the entities, entity attributes, and entity relationships related to the industry.
  • a semi-structured data source is a data source that contains partially structured, partially unstructured data, so that corresponding portions of the semi-structured data can be processed in a manner that handles structured and unstructured data, respectively.
  • HTML and XML files are the most common semi-structured data. In the process of processing HTML and XML files, on the one hand, the tag-based structured information can be used, and on the other hand, information extraction technology and machine learning technology can be combined to extract the required information.
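As a small illustration of using tag-based structure in semi-structured data, the sketch below reads table rows from an HTML fragment with Python's standard `html.parser` module and treats each three-cell row as a candidate (entity, relation, entity) triple. The HTML content and the three-column convention are assumptions for the example; the free-text paragraph in the same fragment would be left for information extraction.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment: a structured table plus unstructured text
html = """
<table>
  <tr><td>Company A</td><td>Chairman</td><td>Person X</td></tr>
  <tr><td>Company B</td><td>Chairman</td><td>Person Y</td></tr>
</table>
<p>Company A announced a new product today.</p>
"""
parser = TableRowParser()
parser.feed(html)
triples = [tuple(row) for row in parser.rows if len(row) == 3]
```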
  • the information extraction operation includes an entity identification operation and a relationship extraction operation.
  • Entity recognition operations may employ existing natural language processing tools (e.g., part-of-speech tagging or named entity recognition tools), or may use machine learning methods to train entity recognition models on specific annotated data.
  • natural language processing tasks and processing tools are language-dependent (for example, Chinese data requires word segmentation, whereas English data does not).
  • the machine learning method represents data in different languages and formats numerically, and then uses general, language-independent algorithms (e.g., conditional random fields and hidden Markov models) for model training.
  • Relationship extraction operations can be implemented through a variety of existing statistical learning or machine learning methods.
  • a template learning method may be adopted: taking entities that satisfy a certain relationship in the knowledge map database as instances, the sentence patterns and contexts in which those instances occur are extracted and counted over a large amount of text to form relationship extraction templates, and the resulting templates are then applied to text data to extract new instances. If an extracted instance does not yet exist in the knowledge map database, it can be added to the knowledge map database.
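The template learning loop described above can be sketched with regular expressions: known entity pairs act as seeds, the text between them is counted to form templates, and the templates are then matched against text to propose new pairs. The corpus, the seed pairs, the frequency threshold, and the crude capitalised-name matcher are all assumptions for illustration; a real system would rely on proper entity recognition.

```python
import re
from collections import Counter


def learn_templates(corpus, seeds, min_count=2):
    """Collect the text between known (subject, object) pairs as templates."""
    counts = Counter()
    for sentence in corpus:
        for subj, obj in seeds:
            match = re.search(
                re.escape(subj) + r"(.{1,40}?)" + re.escape(obj), sentence
            )
            if match:
                counts[match.group(1)] += 1
    return [t for t, n in counts.items() if n >= min_count]


def apply_templates(corpus, templates, known):
    """Use learned templates to find pairs not already in the knowledge base."""
    name = r"((?:[A-Z]\w*)(?: [A-Z]\w*)*)"  # crude capitalised-name matcher
    new_pairs = set()
    for template in templates:
        pattern = re.compile(name + re.escape(template) + name)
        for sentence in corpus:
            for subj, obj in pattern.findall(sentence):
                if (subj, obj) not in known:
                    new_pairs.add((subj, obj))
    return new_pairs


# Hypothetical seed instances already present in the knowledge map database
seeds = {("Person X", "Company A"), ("Person Y", "Company B")}
corpus = [
    "Person X is the chairman of Company A.",
    "Person Y is the chairman of Company B.",
    "Person Z is the chairman of Company C.",
]
templates = learn_templates(corpus, seeds)
new_instances = apply_templates(corpus, templates, seeds)
```

The pair found for the third sentence is an instance not yet in the seed set, which is the kind of result that would be added to the knowledge map database.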
  • In step S13, the industry knowledge map database is supplemented or updated based on the industry-related entities and the corresponding entity attributes and entity relationships.
  • The industry knowledge map database proposed by the present invention can be implemented with a traditional relational database, an RDF triple store, or a newer non-relational database (for example, a graph database).
  • The specific operations of supplementing or updating the knowledge map database can be implemented in a customized manner using the database's query language, for example SQL for a relational database, SPARQL for an RDF triple store, or Cypher for the Neo4j graph database.
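  • For the relational option, a minimal sketch using Python's built-in `sqlite3` module is shown below. The single `triples` table, the function names `supplement` and `update`, and the sample entities are assumptions made for illustration; the patent does not prescribe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE triples (
           subject   TEXT NOT NULL,
           predicate TEXT NOT NULL,
           object    TEXT NOT NULL,
           PRIMARY KEY (subject, predicate, object)
       )"""
)


def supplement(subject, predicate, obj):
    """Add a fact if it is not already stored (duplicates are ignored)."""
    conn.execute(
        "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", (subject, predicate, obj)
    )


def update(subject, predicate, obj):
    """Replace every stored value of (subject, predicate) with a new value."""
    conn.execute(
        "DELETE FROM triples WHERE subject = ? AND predicate = ?",
        (subject, predicate),
    )
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))


# Hypothetical entities and relations.
supplement("Company A", "supplier", "Company B")
supplement("Company A", "supplier", "Company B")  # duplicate, ignored
supplement("Company A", "ceo", "Person X")
update("Company A", "ceo", "Person Y")            # attribute value changed

rows = conn.execute(
    "SELECT subject, predicate, object FROM triples ORDER BY predicate"
).fetchall()
```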
  • Entity linking and semantic disambiguation are required in the process of supplementing or updating the industry knowledge map database.
  • Entity linking maps an entity reference (or entity mention) appearing in the data content to the corresponding entity concept in the knowledge map database.
  • For example, two different surface forms that both refer to Steve Jobs should correspond to the same person entity concept “Steve Jobs (ex-CEO of Apple)” in the knowledge map database, so the two mentions need to be associated with the same entity through entity linking.
  • Semantic disambiguation resolves mentions that are ambiguous between several entities.
  • For example, the mention “Apple” may refer to several different entities, such as “Apple (fruit)”, “Apple Inc.”, “Apple Daily”, or “Apple (movie)”.
  • In the example above, “Apple” should correspond to the corporate entity concept “Apple Inc.” in the knowledge map database rather than “Apple (fruit)”, “Apple (movie)”, or “Apple Daily”.
  • Entity linking and semantic disambiguation are usually performed together: semantic disambiguation is the means of entity linking, and entity linking is the purpose of semantic disambiguation, so the two terms are often used interchangeably.
  • One type of approach uses entity knowledge to disambiguate and link each entity mention independently. Entity knowledge includes, but is not limited to, the probability of occurrence of an entity, the distribution of the entity's names (full name, aliases, abbreviations, etc.), the context of the entity (such as word co-occurrence information and word distribution), and the entity's category information in the knowledge base (such as company entity, person entity, or location entity).
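  • A minimal sketch of linking a single mention from entity knowledge is given below. It combines an occurrence prior with context-word overlap; the candidate table, the prior values, the context vocabularies, and the scoring formula `prior * (1 + overlap)` are all hypothetical choices made for illustration.

```python
# Hypothetical entity knowledge: occurrence prior and typical context words.
CANDIDATES = {
    "Apple": [
        {"id": "Apple Inc.", "prior": 0.6,
         "context": {"iphone", "company", "ceo", "shares"}},
        {"id": "Apple (fruit)", "prior": 0.3,
         "context": {"fruit", "tree", "juice", "eat"}},
        {"id": "Apple Daily", "prior": 0.1,
         "context": {"newspaper", "daily", "editor"}},
    ]
}


def link_mention(mention, sentence):
    """Return the candidate entity id that best fits the mention's context."""
    words = set(sentence.lower().replace(".", "").replace(",", "").split())
    best_id, best_score = None, -1.0
    for cand in CANDIDATES.get(mention, []):
        overlap = len(words & cand["context"])   # shared context words
        score = cand["prior"] * (1 + overlap)    # prior boosted by context fit
        if score > best_score:
            best_id, best_score = cand["id"], score
    return best_id


company = link_mention("Apple", "The CEO of Apple unveiled a new iPhone.")
fruit = link_mention("Apple", "She likes to eat an apple with juice from the tree")
unknown = link_mention("Banana", "No candidates are known for this mention.")
```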
  • Another type of approach is based on the topic-consistency assumption (i.e., the entities in an article are usually related to the topic of the text, so these entities are also semantically related to one another). It uses the associations, in a knowledge base (e.g., Wikipedia or the knowledge map constructed by the present invention), among the candidate entities of all entity mentions in the text to disambiguate and link all entity mentions in an article jointly and consistently.
  • This kind of method usually uses collaborative inference over a graph data structure.
  • The candidate entities of all entity mentions in the article, together with their relationships in the knowledge base, are used to construct a candidate entity graph. The density of the graph reflects the degree of semantic association between the different candidate entity nodes.
  • Entity linking then proceeds by iteratively propagating evidence (the degree of association between different entities) along the dependency structure of the candidate entity graph, so that the evidence is mutually reinforced, until convergence.
  • The above two types of methods can also be combined to improve the performance of disambiguation and linking.
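  • The collective approach can be sketched as a simple iterative score propagation. This is one possible reading of "passing evidence until convergence", not the patent's algorithm: the function name `collective_link`, the damping factor, the normalisation step, and the sample scores are assumptions.

```python
def collective_link(candidates, related, damping=0.5, max_iter=50, tol=1e-6):
    """Jointly choose one candidate entity per mention.

    candidates: {mention: {candidate_entity: local_score}}
    related:    {frozenset({entity_a, entity_b}): relatedness in the knowledge base}
    """
    scores = {m: dict(c) for m, c in candidates.items()}
    for _ in range(max_iter):
        new_scores = {}
        for mention, local in candidates.items():
            updated = {}
            for entity, local_score in local.items():
                # Evidence passed from the candidates of all other mentions.
                support = sum(
                    score * related.get(frozenset((entity, other)), 0.0)
                    for other_mention, others in scores.items()
                    if other_mention != mention
                    for other, score in others.items()
                )
                updated[entity] = (1 - damping) * local_score + damping * support
            total = sum(updated.values()) or 1.0
            new_scores[mention] = {e: v / total for e, v in updated.items()}
        change = max(
            abs(new_scores[m][e] - scores[m][e]) for m in scores for e in scores[m]
        )
        scores = new_scores
        if change < tol:  # scores stopped moving: treat as converged
            break
    return {m: max(c, key=c.get) for m, c in scores.items()}


# Hypothetical input: taken alone, "Apple" slightly favours the fruit sense.
CANDIDATES = {
    "Apple": {"Apple Inc.": 0.4, "Apple (fruit)": 0.6},
    "Steve Jobs": {"Steve Jobs (ex-CEO of Apple)": 1.0},
}
RELATED = {frozenset(("Apple Inc.", "Steve Jobs (ex-CEO of Apple)")): 1.0}

links = collective_link(CANDIDATES, RELATED)
```

  • In this example the relatedness between the company and the person outweighs the local preference for the fruit sense, which is the effect the topic-consistency assumption is meant to capture.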
  • In step S14, Internet media data related to the industry is obtained from an Internet data source, and specific media events related to the industry and the corresponding directly related entities are extracted from the Internet media data.
  • Internet media data can be obtained from Internet data sources in a variety of ways.
  • For example, data from some social media sites (e.g., Sina Weibo, Facebook, Twitter) can be obtained through an application program interface (API).
  • Web crawler technology and content extraction technology can also be used to capture data from news sites or industry media sites.
  • The Internet media data is first processed to detect the content of media events in the particular domain or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (e.g., negativity, significance, suddenness, the speed and scope of the event, credibility) to screen out the media events that meet the requirements.
  • Entity recognition techniques based on natural language processing can be used to identify entities from text data; image or video recognition techniques can be used to identify entities from image or video data; and speech recognition techniques can be used to identify entities from audio or video data.
  • Those skilled in the art will appreciate that the present invention does not limit the media types or languages of the Internet media data.
  • In step S15, the industry knowledge map database is supplemented based on the specific media event and the corresponding directly related entities, wherein the specific media event is added to the industry knowledge map database as an abstract entity.
  • First, the event is added to the industry knowledge map database as an abstract entity.
  • Then, entity linking and semantic disambiguation are performed on the directly related entities involved in the event; that is, the corresponding entities in the industry knowledge map database are found and associated with the abstract entity representing the event. If an entity involved in the event does not exist in the industry knowledge map database, it may be added in the manner described in step S13 above. Once the industry knowledge map database has been supplemented, other entities in the database that are indirectly related to the abstract entity representing the media event can be found based on the relationships between the event's directly related entities and the other entities in the knowledge map database.
  • After the industry knowledge map database has been constructed by the above method, automated and in-depth monitoring of Internet media events can be performed based on it.
  • The industry knowledge map database may also be updated; for example, steps S12 and S13 may be performed periodically at a predetermined cycle, and steps S14 and S15 may be performed continuously in real time.
  • FIG. 3 is an exemplary flow chart of a method for monitoring media events provided by the present invention, which can monitor industry-specific media events based on an industry knowledge map database constructed in the present invention.
  • The method can include steps S31-S35.
  • In step S31, Internet media data is acquired.
  • Internet media data can be obtained from Internet data sources in a variety of ways.
  • For example, data from some social media sites (e.g., Sina Weibo, Facebook, Twitter) can be obtained through an application program interface (API).
  • Web crawler technology and content extraction technology can also be used to capture data from news sites or industry media sites.
  • In step S32, event detection, event evaluation, and screening are performed based on the acquired Internet media data to obtain the specific media events related to the industry.
  • The Internet media data is first processed to detect the content of media events in the particular domain or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (e.g., negativity, significance, suddenness, the speed and scope of the event, credibility) to screen out the media events that meet the requirements.
  • The technical steps involved in event detection may include topic classification, entity recognition, sentiment analysis, and event discovery.
  • Topic classification is a text mining technique.
  • Generally, a machine learning or deep learning method is used to train a classification model on annotated data, and the model is then applied to text to determine its topic category.
  • Any existing classification model (e.g., a naive Bayes model, decision tree, support vector machine, or artificial neural network) can be used in the present invention.
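  • As an illustration of one of the listed models, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing using only the Python standard library. The training sentences, labels, and whitespace tokenisation are assumptions; real systems would use a much larger annotated corpus and, for Chinese text, a word segmenter.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (text, label). Returns the counts needed to classify."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical annotated data.
TRAINING = [
    ("bank raises interest rates", "finance"),
    ("stock market falls sharply", "finance"),
    ("team wins football match", "sports"),
    ("player scores winning goal", "sports"),
]

model = train_naive_bayes(TRAINING)
topic_a = classify(model, "interest rates and stock prices")
topic_b = classify(model, "football goal")
```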
  • The entities involved are then identified from the obtained content.
  • The purpose of entity recognition is to find out which entities are involved in an article for further analysis.
  • Entity recognition may include extracting entities from text using information extraction techniques in natural language processing, identifying entities from images (including video) using image recognition, and identifying entities from speech using speech recognition. Entities identified from text, images, and speech can also be combined.
  • Sentiment analysis is then performed on the obtained content and the identified entities, and the obtained content is filtered based on the result of the sentiment analysis.
  • Sentiment analysis determines the sentiment polarity of the content as a whole and the polarity expressed toward each entity, in order to find content that meets the monitoring criteria.
  • The prior art generally implements sentiment analysis as text classification (e.g., classifying sentiment as positive, neutral, or negative) or as regression (e.g., expressing sentiment as a score between -5 and +5). To judge the sentiment toward an entity in the content, one can use the context of the entity in the text, or use a dependency parsing tool to find the parts of the text that relate to the entity and run sentiment analysis on them.
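  • The context-based variant of entity-level sentiment can be sketched with a small word lexicon and a token window around each mention. The lexicon values, the window size, the company names, and the clamping to the -5 to +5 range are assumptions for illustration; a trained classifier or regressor would normally replace the lexicon.

```python
# Hypothetical sentiment lexicon: word -> score.
LEXICON = {
    "fraud": -5, "lawsuit": -3, "recall": -3,
    "praised": 3, "record": 2, "growth": 2,
}


def entity_sentiment(tokens, entity, window=3):
    """Sum lexicon scores within `window` tokens of each mention, clamped to [-5, 5]."""
    score = 0
    for i, token in enumerate(tokens):
        if token == entity:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            score += sum(LEXICON.get(t.lower(), 0) for t in tokens[lo:hi])
    return max(-5, min(5, score))


def polarity(score):
    """Map a numeric score to the three-way classification labels."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


TEXT = ("Regulators opened a fraud lawsuit against AcmeCorp "
        "while analysts praised BetaCorp for record growth")
tokens = TEXT.split()

acme_score = entity_sentiment(tokens, "AcmeCorp")
beta_score = entity_sentiment(tokens, "BetaCorp")
```

  • Note that a fixed window can pick up words that belong to a neighbouring entity, which is why the text also mentions dependency parsing as a more precise way to select the relevant words.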
  • Event discovery is performed on the filtered content to cluster media events and discover new media events.
  • The purpose of event discovery is to extract event information (for example, the time and place of the event) from different texts, cluster and merge the related information into abstract “events”, compare them with existing events to identify newly occurring events, and cluster events based on their similarity or relevance.
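  • A minimal single-pass clustering sketch for event discovery follows. Each report is reduced to a keyword set standing in for the extracted event information, and Jaccard similarity with a fixed threshold decides whether a report joins an existing event or opens a new one; the data, the threshold, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def discover_events(reports, known_events, threshold=0.5):
    """Assign each report to its most similar event, or open a new event."""
    events = [
        {"keywords": set(e["keywords"]), "reports": list(e["reports"])}
        for e in known_events
    ]
    new_events = []
    for report in reports:
        keywords = set(report["keywords"])
        best = max(events, key=lambda e: jaccard(keywords, e["keywords"]), default=None)
        if best is not None and jaccard(keywords, best["keywords"]) >= threshold:
            best["keywords"] |= keywords          # merge into the existing event
            best["reports"].append(report["id"])
        else:
            event = {"keywords": keywords, "reports": [report["id"]]}
            events.append(event)
            new_events.append(event)              # a newly occurring event
    return events, new_events


KNOWN = [{"keywords": {"factory", "fire", "shenzhen"}, "reports": ["r0"]}]
REPORTS = [
    {"id": "r1", "keywords": {"factory", "fire", "shenzhen", "injured"}},
    {"id": "r2", "keywords": {"ceo", "resigns", "acme"}},
    {"id": "r3", "keywords": {"acme", "ceo", "resigns", "board"}},
]

events, new_events = discover_events(REPORTS, KNOWN)
```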
  • The authenticity of an event may also be analyzed based on the attributes of the media event (for example, the time and place of the event, the publisher of the media event and the publisher's attributes), and media events may be sorted and/or filtered according to the result of the analysis.
  • In step S33, the directly related entities corresponding to the specific media event are identified.
  • The directly related entities in each media event can be obtained by the entity recognition and event discovery operations in event detection.
  • Through entity linking and semantic disambiguation, each directly related entity can be associated with the corresponding entity concept in the industry knowledge map database, or added to the industry knowledge map database.
  • In step S34, the industry knowledge map database is accessed based on the directly related entities to determine the indirectly related entities corresponding to the specific media event.
  • Other entities that are indirectly related to the event through its directly related entities may be queried directly from the industry knowledge map database using various preset conditions.
  • The preset conditions may be: 1) entities that are associated with the event's directly related entities within N layers (N may be 1, 2, 3, ...); 2) entities whose degree of association with the event's directly related entities satisfies a certain condition (for example, is greater than a specified threshold); 3) entities that have a specific relationship (e.g., a supply relationship or investment relationship) with the event's directly related entities; 4) entities that have certain attributes (e.g., belong to a specified industry, are located at a certain place, or hold a certain position).
  • These preset conditions can be used individually or in combination.
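  • Conditions 1) to 3) can be sketched as a bounded breadth-first search over the knowledge graph. The in-memory adjacency list, the relation names, the strength values, and the function name `indirectly_related` are assumptions for illustration; in practice the same query would be written in SQL, SPARQL, or Cypher against the stored database.

```python
from collections import deque

# Hypothetical graph: entity -> list of (relation, neighbour, association strength).
GRAPH = {
    "Company A": [("supplies", "Company B", 0.9), ("invested_in", "Company C", 0.4)],
    "Company B": [("supplies", "Company D", 0.8)],
    "Company C": [],
    "Company D": [],
}


def indirectly_related(direct, max_hops=2, relations=None, min_strength=0.0):
    """Return {entity: hop distance} for entities reachable from `direct`.

    max_hops     -- condition 1: stay within N layers
    min_strength -- condition 2: association strength threshold
    relations    -- condition 3: allowed relation types (None means any)
    """
    seen = set(direct)
    queue = deque((entity, 0) for entity in direct)
    result = {}
    while queue:
        entity, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond N layers
        for relation, neighbour, strength in GRAPH.get(entity, []):
            if relations is not None and relation not in relations:
                continue
            if strength < min_strength or neighbour in seen:
                continue
            seen.add(neighbour)
            result[neighbour] = depth + 1
            queue.append((neighbour, depth + 1))
    return result


within_two = indirectly_related(["Company A"])
within_one = indirectly_related(["Company A"], max_hops=1)
strong_supply = indirectly_related(
    ["Company A"], relations={"supplies"}, min_strength=0.85
)
```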
  • Alternatively, a data mining method may be employed that draws on a variety of conditions to mine the indirectly related entities of an event from the industry knowledge map database.
  • A specific implementation may adopt a link prediction method for graph data; that is, the problem of detecting the indirectly related entities of an event is expressed as the problem of predicting whether an edge exists between the node representing the event in the industry knowledge map database and entity nodes other than the directly related entity nodes.
  • Conditions that can be used for link prediction include, but are not limited to, the characteristics of the event itself (e.g., event type, time and location attributes, negativity), the relationships between the event and historical events (including relationship types and strengths), the relationships between the event's directly related entities and other entities (including relationship types and strengths), and entity types and attributes. All of these can be mined from the knowledge map database to reach a comprehensive judgment about the indirectly related entities of a specific media event.
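  • As one concrete link prediction heuristic, the sketch below ranks candidate entities by the Adamic-Adar score between the event node and each entity not yet connected to it. This uses graph topology only and is a stand-in chosen for illustration; the patent does not name a specific algorithm, and the event features and relationship strengths listed above would be additional inputs to a learned model. The graph and node names are hypothetical.

```python
import math


def adamic_adar(adj, u, v):
    """Adamic-Adar score: common neighbours weighted by 1 / log(degree)."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def predict_links(adj, event, top_k=3):
    """Rank entities not yet linked to the event node by predicted link score."""
    candidates = [n for n in adj if n != event and n not in adj[event]]
    scored = [(n, adamic_adar(adj, event, n)) for n in candidates]
    scored = [(n, s) for n, s in scored if s > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical undirected graph stored as symmetric adjacency sets.
# "Event E" is the abstract event node, linked to its directly related entities.
ADJ = {
    "Event E": {"Company A", "Company B"},
    "Company A": {"Event E", "Supplier S", "Investor I"},
    "Company B": {"Event E", "Supplier S"},
    "Supplier S": {"Company A", "Company B"},
    "Investor I": {"Company A"},
}

ranking = predict_links(ADJ, "Event E")
```

  • Here the supplier shared by both directly related companies ranks above the investor linked to only one of them.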
  • In step S35, an alert message is sent to the directly related entities and/or the indirectly related entities.
  • After the directly and indirectly related entities corresponding to the specific media event have been identified, an alert message can be sent to the users of the corresponding entities in multiple ways (e.g., email, SMS, instant messaging tools, social network platforms).
  • The alert message may contain a textual description of the event, pictures, statistics on its dissemination, event evaluation indicators, and how the related entity may be affected by the event.
  • The specific media events described in the present invention may be any type of event that meets the conditions set by the user and can be obtained from Internet media, for example negative events, emergencies, crisis events, mass events, or public opinion events.
  • The invention does not impose any limitation on this.
  • FIG. 4 illustrates an exemplary flow chart of another method of constructing an industry knowledge map database provided by the present invention.
  • The method may include steps S41, S421/S422, and S43-S45.
  • In step S41, industry data is obtained from an industry data source, and entities and the corresponding entity attributes and entity relationships are extracted from the industry data to generate an industry knowledge map database.
  • In step S421, based on a structured data source, the entities, entity attributes, and entity relationships related to the industry are obtained by querying an application program interface.
  • The structured data source can be a structured open data platform such as Wikidata or DBpedia, and industry-related data can be obtained through its API.
  • In step S422, based on a semi-structured or unstructured data source, entity recognition and relationship extraction are performed on the data using natural language processing techniques to extract the entities, entity attributes, and entity relationships related to the industry.
  • The semi-structured or unstructured data source may be an open data platform such as Wikipedia or Baidu Encyclopedia, or any related third-party data source (e.g., professional websites or content published in Internet media), and industry-related data can be obtained through web crawling or content extraction technology.
  • Steps S421 and/or S422, together with step S43, may be performed periodically at a predetermined cycle.
  • In step S43, the industry knowledge map database is supplemented or updated based on the industry-related entities and the corresponding entity attributes and entity relationships.
  • In step S44, Internet media data is obtained from an Internet data source, and specific media events related to the industry and the corresponding directly related entities are extracted from the Internet media data.
  • In step S45, the industry knowledge map database is supplemented based on the specific media event and the corresponding directly related entities, wherein the specific media event is added to the industry knowledge map database as an abstract entity.
  • Steps S44 and S45 can be performed continuously in real time.
  • FIG. 5 is an exemplary flowchart of another method for constructing an industry knowledge map database provided by the present invention.
  • The method can include steps S51-S53:
  • in step S51, industry data is obtained from a data source;
  • in step S52, data processing is performed on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
  • in step S53, the industry knowledge map database is constructed based on the extracted entities, entity attributes, and/or entity relationships.
  • The data sources for the industry knowledge map database can vary, including but not limited to open encyclopedia data sources, structured databases, and any related third-party semi-structured or unstructured Internet data sources.
  • The data source of the industry knowledge map database can also be an Internet media data source.
  • The data source may be a structured industry database, and the method may be implemented as follows: in step S51(1), structured industry data including a plurality of fields is obtained from a third-party industry database; in step S52(1), data cleaning and extract-transform-load (ETL) processing are performed on the structured industry data before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted; in step S53(1), the industry knowledge map database is generated based on the extracted entities, entity attributes, and/or entity relationships.
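  • Steps S51(1)-S53(1) can be sketched as a small cleaning and ETL pass that turns multi-field records into entities, attributes, and relationships. The CSV layout, the field names (`company_name`, `industry`, `ceo`, `supplier`), the relation names, and the cleaning rules are assumptions for illustration only.

```python
import csv
import io

# Hypothetical export from a third-party industry database.
RAW = """company_name,industry,ceo,supplier
 Company A ,Electronics,Person X,Company B
Company B,electronics,,
Company A,Electronics,Person X,Company B
"""


def etl(raw_csv):
    """Clean the records and convert them to entities, attributes and relations."""
    entities, attributes, relations = set(), set(), set()
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Cleaning: trim whitespace and treat missing fields as empty strings.
        row = {key: (value or "").strip() for key, value in row.items()}
        name = row["company_name"]
        if not name:
            continue  # skip records without an entity name
        entities.add(name)
        if row["industry"]:
            # Normalise letter case so "electronics" and "Electronics" agree.
            attributes.add((name, "industry", row["industry"].title()))
        if row["ceo"]:
            entities.add(row["ceo"])
            relations.add((row["ceo"], "ceo_of", name))
        if row["supplier"]:
            entities.add(row["supplier"])
            relations.add((row["supplier"], "supplies", name))
    return entities, attributes, relations


entities, attributes, relations = etl(RAW)
```

  • Using sets removes the duplicated record automatically; the resulting tuples can then be loaded into whichever database backs the knowledge map.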
  • Alternatively, the data source may be an unstructured or semi-structured Internet data source, and the method may be implemented as follows: in step S51(2), industry-related data is acquired using web crawler technology from the Internet data source, which includes an unstructured or semi-structured data source; in step S52(2), entity recognition and relationship extraction are performed on the industry-related data using information extraction techniques in natural language processing to extract the entities, entity attributes, and/or entity relationships; in step S53(2), the industry knowledge map database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.
  • Steps S51(2)-S53(2) may be performed periodically at a predetermined cycle.
  • Alternatively, the data source may be an open Internet data source, and the method may be implemented as follows: in step S51(3), industry-related data is acquired from the Internet data source by querying an application program interface (API); in step S52(3), data cleaning and extract-transform-load (ETL) processing are performed on the industry-related data before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted; in step S53(3), the industry knowledge map database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.
  • Steps S51(3)-S53(3) may be performed periodically at a predetermined cycle.
  • Alternatively, the data source may be an Internet media data source, and the method may be implemented as follows: in step S51(4), the Internet media data is obtained from the Internet data source using an application program interface (API) or web crawler technology; in step S52(4), event detection, event evaluation, and screening are performed on the Internet media data to extract specific media events related to the industry, and the corresponding directly related entities are identified from the Internet media data; in step S53(4), the industry knowledge map database is supplemented based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge map database as an abstract entity.
  • The directly related entities corresponding to a specific media event may be identified by at least one of the following: identifying entities from text data using entity recognition in natural language processing; identifying entities from image or video data using image or video recognition; or identifying entities from audio or video data using speech recognition.
  • The specific media event may include a negative event, an emergency, a crisis event, a mass event, a public opinion event, or another event of industry significance.
  • Steps S51(4)-S53(4) may be performed continuously in real time.
  • The step of supplementing or updating the industry knowledge map database in steps S53(2), S53(3), and S53(4) may include performing semantic disambiguation and entity linking on the extracted entities. For example, the semantic disambiguation and entity linking may be performed in at least one of the following ways: independently for each extracted entity, based on entity knowledge; or jointly and consistently for the extracted entities, based on the topic-consistency assumption and using the associations among candidate entities in the knowledge base.
  • The system includes a data acquisition unit, a data processing unit, a database construction unit, a database storage unit, a media event monitoring unit, a database access unit, and a message sending unit:
  • a data acquisition unit, configured to obtain industry data from a data source;
  • a data processing unit, configured to perform data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
  • a database construction unit, configured to construct the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships;
  • a database storage unit, configured to store the constructed industry knowledge map database;
  • a media event monitoring unit, configured to acquire Internet media data, perform event detection, event evaluation, and screening based on the acquired Internet media data to obtain the specific media events related to the industry, and identify the directly related entities corresponding to each specific media event;
  • a database access unit, configured to access the industry knowledge map database based on the directly related entities to determine the indirectly related entities corresponding to the specific media event;
  • a message sending unit, configured to send an alert message to the directly related entities and/or the indirectly related entities.
  • The data acquisition unit may include a structured data acquisition unit, configured to obtain structured data including a plurality of fields from a third-party industry database; the data processing unit may include a structured data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the structured data; and the database construction unit may include a database generation unit, configured to generate the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • Alternatively, the data acquisition unit may include an industry-related data acquisition unit, configured to obtain industry-related data using web crawler technology from an Internet data source that includes an unstructured or semi-structured data source; the data processing unit may include an industry-related data processing unit, configured to perform entity recognition and relationship extraction on the industry-related data using information extraction techniques in natural language processing to extract the entities, entity attributes, and/or entity relationships; and the database construction unit may include a database supplement/update unit, configured to supplement or update the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • Alternatively, the data acquisition unit may include an industry-related data acquisition unit, configured to acquire industry-related data from an Internet data source, which includes an open data source, by querying an application program interface (API); the data processing unit may include an industry-related data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the industry-related data before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted; and the database construction unit may include a database supplement/update unit, configured to supplement or update the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.
  • Alternatively, the data acquisition unit may include a media data acquisition unit, configured to acquire industry-related Internet media data from an Internet data source using an application program interface (API) or web crawler technology; the data processing unit may include a media data processing unit, configured to perform event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and to identify the corresponding directly related entities from the Internet media data; and the database construction unit may include a database supplement/update unit, configured to supplement the industry knowledge map database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge map database as an abstract entity.
  • the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.
  • the media event monitoring unit is further configured to: perform topic classification on the content in the acquired internet media data to obtain content for a specific topic; identify the involved entity from the obtained content; The obtained content and the identified entity perform sentiment analysis, and filter the obtained content based on the result of the sentiment analysis; perform event discovery based on the filtered content to cluster media events and discover new media events.
  • the media event monitoring unit is further configured to: analyze the authenticity of the event based on the attribute of the media event, and sort and/or filter the media event according to the analysis result.
  • the database access unit is further configured to query the industry knowledge map database to determine the indirectly related entity based on the directly related entity. In another embodiment, the database access unit is further configured to: use the data mining technique to determine the indirectly related entity in the industry knowledge map database based on the directly related entity.
  • The description of the present invention is provided to enable a person skilled in the art to make or use the invention. Many modifications and variations of the present invention will be apparent to those skilled in the art. Therefore, the present invention is not limited to the specific embodiments shown above, but should be consistent with the broadest scope of the inventive concepts disclosed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for constructing an industry knowledge mapping database, comprising the following steps: acquiring industry data from a data source (S51); performing data processing on the industry data to extract an entity related to the industry and a corresponding entity attribute and/or an entity relationship (S52); and constructing the industry knowledge mapping database based on the extracted entity, entity attribute and/or entity relationship (S53). A method for monitoring a specific media event related to an industry based on the constructed industry knowledge mapping database, comprising the following steps: acquiring Internet media data (S31); performing event detection, event evaluation and screening based on the acquired Internet media data to acquire the specific media event related to an industry (S32); recognizing a directly related entity corresponding to the specific media event (S33); based on the directly related entity, accessing the industry knowledge mapping database, to determine an indirectly related entity corresponding to the specific media event (S34); and sending an early warning message to the directly related entity and/or indirectly related entity (S35).

Description

基于行业知识图谱数据库对互联网媒体事件进行监测的方法、装置和系统Method, device and system for monitoring internet media events based on industry knowledge map database 技术领域Technical field
本发明涉及互联网媒体监测领域,具体而言,涉及一种构建行业知识图谱数据库的技术以及一种基于所构建的行业知识图谱数据库对互联网媒体事件进行监测的技术。The invention relates to the field of internet media monitoring, in particular to a technology for constructing an industry knowledge map database and a technology for monitoring internet media events based on the constructed industry knowledge map database.
背景技术Background technique
计算机、通信以及网络技术的迅速发展使包括PC、平板电脑、智能手机、网络电视等在内的终端设备的性能不断提高。相应地,互联网媒体,特别是互联网社交媒体,凭借其多元性、迅捷性、交互性、易复制性、多媒体化等特点,已逐渐成为大众获取新闻资讯的主要途径之一。The rapid development of computers, communications, and network technologies has led to an increase in the performance of terminal devices including PCs, tablets, smartphones, and Internet TVs. Correspondingly, Internet media, especially Internet social media, has gradually become one of the main ways for the public to obtain news information by virtue of its diversity, speed, interactivity, reproducibility and multimedia.
然而,互联网媒体信息在具有时效性强、获取方式灵活便捷等优势的同时,其信息源和传播方式的开放性特点也导致了以下问题的存在:在未经授权或证实的情况下,一些敏感消息(例如,商业秘密)甚至虚假消息在互联网媒体平台上被大量用户快速传播,从而演变为对相关的个人、企业/机构、行业乃至社会造成不良影响的媒体事件。因此,需要对互联网媒体中的媒体事件进行监测,并在监测到满足一定条件的媒体事件后采取相应的措施,以降低或消除其潜在的影响。However, while the Internet media information has the advantages of timeliness and flexible accessibility, the openness of its information source and mode of communication also leads to the following problems: some sensitive or unauthorized Messages (eg, trade secrets) and even false news are rapidly spread by a large number of users on the Internet media platform, and thus evolve into media events that adversely affect related individuals, businesses/institutions, industries, and society. Therefore, it is necessary to monitor media events in the Internet media and take corresponding measures after monitoring media events that meet certain conditions to reduce or eliminate their potential impacts.
Existing internet media monitoring techniques have the following shortcomings: 1) they provide internet media monitoring to users by interest matching, which requires users to define the content topics, related entities, and so on that interest them; monitoring can therefore identify only events directly related to entities the user has defined, and cannot identify events that are indirectly related to entities of interest to the user but that the user has not defined; 2) the monitored objects have a single set of attributes, so monitoring can be provided only for a single media category and data source (for example, a particular social media platform, news outlet, forum, or blog), a single data type (generally text), and a single language.
Summary of the Invention
One object of the present invention is to provide a technique for constructing an industry knowledge graph database, in which data relevant to a particular industry or field is extracted and stored in a knowledge graph database. The constructed industry knowledge graph database can be applied to internet media monitoring to achieve automated, in-depth monitoring of relevant internet media events.
Another object of the present invention is to provide a technique for monitoring internet media events based on the constructed industry knowledge graph database, which can identify, during monitoring, the indirectly related entities corresponding to a specific media event, and which can monitor multiple types of internet media data.
To achieve the above objects, the present invention provides the following specific technical solutions.
The present invention provides a method for constructing an industry knowledge graph database, comprising the following steps: acquiring industry data from a data source; performing data processing on the industry data to extract entities related to the industry and corresponding entity attributes and/or entity relationships; and constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the step of acquiring industry data is implemented by acquiring structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the step of performing data processing on the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data; and the step of constructing the industry knowledge graph database is implemented by generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the step of acquiring industry data is implemented by using web crawler technology to acquire industry-related data from internet data sources, the internet data sources including unstructured or semi-structured data sources; the step of performing data processing on the industry data is implemented by using information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.
Preferably, the step of acquiring industry data is implemented by using an application programming interface (API) to acquire industry-related data from internet data sources by query, the internet data sources including open data sources; the step of performing data processing on the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.
Preferably, the step of acquiring industry data is implemented by using an application programming interface (API) or web crawler technology to acquire industry-related internet media data from internet data sources; the step of performing data processing on the industry data is implemented by performing event detection, event evaluation, and screening on the internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the internet media data; and the step of constructing the industry knowledge graph database is implemented by supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities. Further preferably, in the step of performing data processing on the industry data, the directly related entities corresponding to the specific media event are identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition. Further preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry. Further preferably, the above steps are performed continuously in real time.
Preferably, the step of constructing the industry knowledge graph database comprises performing semantic disambiguation and entity linking on the extracted entities. Further preferably, the semantic disambiguation and entity linking are implemented in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one at a time; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.
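The two linking strategies named above can be contrasted in a small sketch. It is only an illustration under stated assumptions: the candidate lists, keyword sets, and knowledge-base links are toy data, the keyword-overlap score and the brute-force search are stand-ins chosen for brevity, and the function names `link_independently` and `link_collectively` are hypothetical rather than taken from the source.

```python
from itertools import product

# Toy knowledge base (illustrative assumptions, not data from the source).
CANDIDATES = {
    "Jaguar": ["Jaguar Cars", "jaguar (animal)"],
    "Land Rover": ["Land Rover"],
}
KEYWORDS = {
    "Jaguar Cars": {"car", "engine", "factory"},
    "jaguar (animal)": {"cat", "jungle", "prey"},
    "Land Rover": {"car", "suv", "factory"},
}
# Pairs of entities that are associated with each other in the knowledge base.
LINKS = {frozenset({"Jaguar Cars", "Land Rover"})}


def link_independently(mention, context_words):
    """Resolve one mention on its own, by overlap between the words around
    the mention and the keywords known for each candidate entity."""
    return max(CANDIDATES[mention],
               key=lambda cand: len(KEYWORDS[cand] & context_words))


def link_collectively(mentions):
    """Resolve all mentions together: pick the combination of candidates
    with the most knowledge-base links among them (topic coherence)."""
    best, best_score = None, -1
    for combo in product(*(CANDIDATES[m] for m in mentions)):
        score = sum(frozenset({a, b}) in LINKS
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best))
```

The independent variant needs only local context, while the collective variant lets an unambiguous mention ("Land Rover") pull an ambiguous one ("Jaguar") toward the coherent reading. The exhaustive search over combinations is exponential and would need replacing for anything beyond a handful of mentions.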
The present invention also provides a method for monitoring specific media events related to an industry based on the industry knowledge graph database constructed according to the present invention, comprising the following steps: acquiring internet media data; performing event detection, event evaluation, and screening based on the acquired internet media data to obtain the specific media event related to the industry; identifying the directly related entities corresponding to the specific media event; based on the directly related entities, accessing the industry knowledge graph database to determine the indirectly related entities corresponding to the specific media event; and sending an early-warning message to the directly related entities and/or the indirectly related entities.
Preferably, the event detection in the step of performing event detection, event evaluation, and screening comprises the following steps: performing topic classification on the content of the acquired internet media data to obtain content on a specific topic; identifying the entities involved in the obtained content; performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis; and performing event discovery based on the filtered content, so as to cluster media events and discover new media events. Further preferably, the event detection further comprises the following step: analyzing the authenticity of events based on the attributes of the media events, and ranking and/or filtering the media events according to the analysis results.
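The order of the four event detection stages can be sketched as follows. This is a minimal illustration, not the patented implementation: keyword sets stand in for the topic classifier and the sentiment analyzer, substring matching stands in for entity recognition, and grouping by (entities, negative terms) stands in for clustering. All names and sample posts are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for trained models and lexicons.
TOPIC_KEYWORDS = {"finance": {"stock", "shares", "bank"}}
KNOWN_ENTITIES = {"Company A", "Company B"}
NEGATIVE_WORDS = {"fraud", "scandal", "default"}


def detect_events(posts, topic):
    """Run topic classification, entity recognition, sentiment filtering,
    and event discovery (grouping) over a list of text posts."""
    clusters = defaultdict(list)
    for text in posts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        # 1) Topic classification: keep only content on the target topic.
        if not words & TOPIC_KEYWORDS[topic]:
            continue
        # 2) Entity recognition: find the entities mentioned in the content.
        entities = frozenset(e for e in KNOWN_ENTITIES if e in text)
        if not entities:
            continue
        # 3) Sentiment analysis: keep only content with negative sentiment.
        negative = frozenset(words & NEGATIVE_WORDS)
        if not negative:
            continue
        # 4) Event discovery: group content about the same entities and issue.
        clusters[(entities, negative)].append(text)
    return dict(clusters)


POSTS = [
    "Company A shares fall after fraud report",
    "Fraud claims hit Company A stock",
    "Company B opens a new bank branch",
    "Local team wins the match",
]
events = detect_events(POSTS, "finance")
```

Here the first two posts end up in one event cluster for Company A, the third is dropped by the sentiment filter, and the fourth by the topic filter. In practice each stage would be a trained model, and clustering would compare content similarity rather than exact keys.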
Preferably, in the step of identifying the directly related entities corresponding to the specific media event, the directly related entities are identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.
Preferably, the step of accessing the industry knowledge graph database is implemented by querying the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.
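One plausible form of such a query is a bounded breadth-first traversal that starts from the directly related entities and collects everything reachable within a few relationship hops. The sketch below assumes the graph is available as a plain adjacency mapping; the function name, the `max_hops` parameter, the relation labels, and "Company C" are illustrative assumptions, not details from the source.

```python
from collections import deque

# Hypothetical adjacency view of the knowledge graph:
# entity -> list of (relation, neighboring entity).
GRAPH = {
    "Company A": [("chaired_by", "Person X"), ("supplied_by", "Company C")],
    "Person X": [("director_of", "Company B")],
}


def indirectly_related(graph, direct, max_hops=2):
    """Return {entity: hop distance} for entities reachable from the
    directly related entities within max_hops relationship edges."""
    seen = set(direct)
    queue = deque((entity, 0) for entity in direct)
    found = {}
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found[neighbor] = dist + 1
                queue.append((neighbor, dist + 1))
    return found
```

For an event that directly involves Company A, this surfaces Person X and Company C at one hop and Company B at two hops, which is the kind of entity a purely interest-matching monitor would miss. The hop limit is a design choice that trades recall against the number of warnings sent.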
Preferably, the step of accessing the industry knowledge graph database is implemented by applying data mining techniques to the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.
The present invention also provides an apparatus for constructing an industry knowledge graph database, comprising: a data acquisition module for acquiring industry data from a data source; a data processing module for performing data processing on the industry data to extract entities related to the industry and corresponding entity attributes and/or entity relationships; and a database construction module for constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by generating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using web crawler technology to obtain industry-related data from internet data sources, the internet data sources including unstructured or semi-structured data sources; the data processing module performs data processing by using information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using an application programming interface (API) to acquire industry-related data from internet data sources by query, the internet data sources including open data sources; the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition module acquires industry data by using an application programming interface (API) or web crawler technology to acquire industry-related internet media data from internet data sources; the data processing module performs data processing by performing event detection, event evaluation, and screening on the internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the internet media data; and the database construction module constructs the industry knowledge graph database by supplementing it based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
Preferably, the database construction module further identifies the directly related entities corresponding to the specific media event in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.
Preferably, the database construction module comprises a module for performing semantic disambiguation and entity linking on the extracted entities. Further preferably, that module performs semantic disambiguation and entity linking in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one at a time; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.
Preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.
The present invention also provides a system for monitoring specific media events related to an industry, comprising: a data acquisition unit for acquiring industry data from a data source; a data processing unit for performing data processing on the industry data to extract entities related to the industry and corresponding entity attributes and/or entity relationships; a database construction unit for constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships; a database storage unit for storing the constructed industry knowledge graph database; a media event monitoring unit for acquiring internet media data, performing event detection, event evaluation, and screening based on the acquired internet media data to obtain the specific media event related to the industry, and identifying the directly related entities corresponding to the specific media event; a database access unit for accessing the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities corresponding to the specific media event; and a message sending unit for sending an early-warning message to the directly related entities and/or the indirectly related entities.
Preferably, the data acquisition unit comprises a structured data acquisition unit for obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the data processing unit comprises a structured data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit comprises a database generation unit for generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit comprises an industry-related data acquisition unit for using web crawler technology to obtain industry-related data from internet data sources, the internet data sources including unstructured or semi-structured data sources; the data processing unit comprises an industry-related data processing unit for using information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the database construction unit comprises a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit comprises an industry-related data acquisition unit for using an application programming interface (API) to acquire industry-related data from internet data sources by query, the internet data sources including open data sources; the data processing unit comprises an industry-related data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit comprises a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
Preferably, the data acquisition unit comprises a media data acquisition unit for using an application programming interface (API) or web crawler technology to acquire industry-related internet media data from internet data sources; the data processing unit comprises a media data processing unit for performing event detection, event evaluation, and screening on the internet media data to extract specific media events related to the industry, and for identifying the corresponding directly related entities from the internet media data; and the database construction unit comprises a database supplement/update unit for supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
Preferably, the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.
Preferably, the media event monitoring unit is further configured to: perform topic classification on the content of the acquired internet media data to obtain content on a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis; and perform event discovery based on the filtered content, so as to cluster media events and discover new media events. Further preferably, the media event monitoring unit is further configured to analyze the authenticity of events based on the attributes of the media events, and to rank and/or filter the media events according to the analysis results.
Preferably, the database access unit is further configured to query the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.
Preferably, the database access unit is further configured to apply data mining techniques to the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.
Preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.
Implementing the technical solutions provided by the present invention achieves the following technical effects: 1) for one or more target fields or industries, relevant internet media events are monitored automatically and in depth, and the indirectly related entities corresponding to a specific media event can be identified; 2) during monitoring, internet media data from multiple data sources, of multiple data types, and in multiple languages is processed automatically.
Brief Description of the Drawings
FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention;
FIG. 2 shows exemplary structured industry data provided by the present invention;
FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention;
FIG. 4 is an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention;
FIG. 5 is an exemplary flowchart of yet another method for constructing an industry knowledge graph database provided by the present invention;
FIG. 6 is an exemplary block diagram of a system for monitoring media events provided by the present invention.
Detailed Description
Specific embodiments of the present invention are described below by way of examples with reference to the accompanying drawings, so that those skilled in the art can understand the objects, technical solutions, and advantages of the present invention. Those skilled in the art will appreciate that the specific embodiments described in the form of examples are merely illustrative, and that the concept of the present invention can also be realized without these specific details.
To achieve its objects, the present invention provides a technique for constructing an industry knowledge graph database and a technique for monitoring internet media events based on the constructed industry knowledge graph database.
The present invention involves the application of knowledge graph database technology. A knowledge graph database is a special kind of database used for knowledge management, which facilitates the collection, organization, and extraction of knowledge in relevant fields. A knowledge graph database defines entities, entity attributes, and entity relationships. An entity corresponds to a thing in the real world (for example, a company A or a person X), and each entity can be identified by a globally unique ID. Entity attributes describe the intrinsic characteristics of an entity (for example, the Chinese and English names of company A or person X). Entity relationships connect entities and describe the associations between them (for example, the position that person X holds at company A). By constructing a knowledge graph database, knowledge composed of entities, entity attributes, and entity relationships can be used more efficiently and in greater depth to discover complex connections between things.
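The three building blocks described above (entities with a globally unique ID, entity attributes, and entity relationships) can be modeled in a few lines. The following is a minimal in-memory sketch for illustration only; the class names, the use of UUIDs as IDs, and the relation label `holds_position_at` are assumptions, not details from the source.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    kind: str                      # e.g. "Company" or "Person"
    attributes: dict               # intrinsic properties such as names
    # Assumption: a random UUID serves as the globally unique ID.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}         # entity ID -> Entity
        self.relations = []        # (subject ID, relation label, object ID)

    def add_entity(self, kind, **attributes):
        entity = Entity(kind, attributes)
        self.entities[entity.id] = entity
        return entity

    def relate(self, subject, relation, obj):
        self.relations.append((subject.id, relation, obj.id))

    def related_to(self, entity):
        """Return (relation, other entity) pairs touching the given entity."""
        result = []
        for subj, relation, obj in self.relations:
            if subj == entity.id:
                result.append((relation, self.entities[obj]))
            elif obj == entity.id:
                result.append((relation, self.entities[subj]))
        return result


kg = KnowledgeGraph()
company_a = kg.add_entity("Company", name_en="Company A")
person_x = kg.add_entity("Person", name_en="Person X")
kg.relate(person_x, "holds_position_at", company_a)
```

Calling `kg.related_to(company_a)` then yields the position relationship together with the entity for person X, which is the basic lookup on which later traversal of indirect relationships builds.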
As a database, a knowledge graph database can be stored in various forms. For example, it can use a traditional relational database and store data as Semantic Web RDF (Resource Description Framework) triples, or it can use a newer non-relational database. Preferably, the knowledge graph database is stored in a graph database, such as Neo4j, OrientDB, Titan-BerkeleyDB, or HyperGraphDB.
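To make the relational option concrete, the sketch below stores subject-predicate-object triples in a single table using Python's built-in `sqlite3` module. It illustrates the triple layout only: the identifiers (`company:A`, `person:X`), the predicate names, and the helper `objects_of` are hypothetical, and a real RDF store would use IRIs and a dedicated triple store or graph database.

```python
import sqlite3

# In-memory relational database with one three-column table of triples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
)

# Illustrative triples; identifiers and predicates are assumptions.
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("company:A", "type", "Company"),
        ("person:X", "type", "Person"),
        ("person:X", "chairman_of", "company:A"),
    ],
)


def objects_of(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    rows = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
        (subject, predicate),
    )
    return [row[0] for row in rows]
```

Each hop along a relationship becomes another query or self-join against the same table, which is one reason a native graph database is the preferred option when multi-hop traversal is frequent.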
Depending on the scale and purpose of the knowledge graph database, the data used to construct it can come from a wide variety of sources. For example, the data source can be an open encyclopedic source (such as Baidu Baike or Wikipedia), a structured database (such as Wikidata, DBpedia, a vertical website, or a specialized database for a particular industry), or any relevant third-party semi-structured or unstructured data source (such as specialist websites and content published in internet media, including news, company annual reports, and corporate announcements).
Those skilled in the art should understand that the knowledge graph database constructed in the present invention is oriented toward particular fields or industries during construction, but is not limited to a single industry. The constructed knowledge graph database integrates and connects the entities and events related to one or more industries, the attributes of those entities and events, and the relationships between entities, between entities and events, and between events, into a single graph of knowledge.
FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention; the method may include steps S11-S15.
In step S11, industry data is obtained from an industry data source, and entities and the corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge graph database.
An industry data source is a source of basic data for one or more specific fields or industries, where those fields or industries are the targets of monitoring. In one embodiment, the industry data source may be a structured industry database, so that basic industry data of the highest possible quality is obtained. The structured database can be accessed through an application programming interface (API), and data can be obtained by query (for example, through query commands).
通过“抽取-转换-加载(Extraction-Transform-Load,ETL)”处理,可以对所获得的行业数据进行转换,然后从转换后的数据中提取实体、实体属性和实体关系并将其加载至本发明提出的行业知识图谱数据库中。ETL操作的具体执行步骤可以通过现有的数据整合手段来实现。举例而言,在基于本体的数据整合方法中,以预定的方式定义不同数据库中的各个字段与各种实体信息之间的映射关系,从而根据所述字段及其内容提取实体、实体属性及实体关系,完成构建基本行业知识图谱数据库。另外,由于行业数据库在结构上存在差异,并可能存在数据噪声、数据缺失或数据错误等问题,所以在对行业数据进行数据处理的过程中可能还需要对其进行数据清洗操作。可以采用本领域已知的技术手段,与ETL处理相结合来实现数据清洗操作。Through the "Extraction-Transform-Load (ETL)" process, the obtained industry data can be converted, and then the entity, entity attributes and entity relationships are extracted from the converted data and loaded into the present The industry knowledge map database proposed by the invention. The specific execution steps of the ETL operation can be implemented by existing data integration means. For example, in an ontology-based data integration method, mapping relationships between various fields in different databases and various entity information are defined in a predetermined manner, thereby extracting entities, entity attributes, and entities according to the fields and their contents. Relationship, complete the construction of the basic industry knowledge map database. In addition, due to the differences in the structure of the industry database, and there may be problems such as data noise, data loss or data errors, data cleaning operations may be required in the process of data processing of industry data. Data cleaning operations can be implemented in conjunction with ETL processing using techniques known in the art.
作为一个实例,图2示出了示例性的结构化行业数据,如上文所述,该数据可以是从结构化的行业数据库获得的。在图2中,表1是上市公司结构化数据的示例,其包括公司A和公司B两个数据条目,每个数据条目又包括公司中英文名称、注册地址、股票代码、董事会主席等多个字段。通过对该结构化数据进行ETL操作,可以提取其中的实体(即公司A、公司B、人物X、人物Y)、实体属性(即公司A和公司的B的具体信息)以及实体关系(即公司A与人物X以及公司B与人物Y的任职关系),从而生成了针对所属行业的知识图谱数据库。As an example, FIG. 2 illustrates exemplary structured industry data that, as described above, may be obtained from a structured industry database. In Figure 2, Table 1 is an example of listed company structured data, which includes two data items, company A and company B, each of which includes the company's Chinese and English name, registered address, stock code, chairman of the board, etc. Field. By performing ETL operations on the structured data, entities (ie, company A, company B, person X, person Y), entity attributes (ie, specific information of company A and company B), and entity relationships (ie, companies) can be extracted. A and the character X and the relationship between the company B and the character Y), thereby generating a knowledge map database for the industry.
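The field-to-entity mapping described above can be sketched as follows. This is only an illustrative sketch: the field names (`name_cn`, `stock_code`, `chairman`), the relation label `chairman_of`, and the sample values are assumptions, not taken from Table 1 itself.

```python
# Hypothetical mapping from record fields to knowledge-graph elements.
ATTRIBUTE_FIELDS = {"name_en", "registered_address", "stock_code"}
RELATION_FIELDS = {"chairman": ("person", "chairman_of")}  # field -> (entity type, relation)


def etl(records):
    """Turn flat company records into entities, attributes and relations."""
    entities, attributes, relations = {}, {}, []
    for rec in records:
        company = rec["name_cn"]
        entities[company] = "company"
        # Attribute fields are copied over; empty values are dropped (basic cleaning).
        attributes[company] = {
            k: v for k, v in rec.items() if k in ATTRIBUTE_FIELDS and v
        }
        # Relation fields create a second entity plus an edge to the company.
        for field, (entity_type, relation) in RELATION_FIELDS.items():
            value = rec.get(field)
            if value:
                entities[value] = entity_type
                relations.append((value, relation, company))
    return entities, attributes, relations


records = [
    {"name_cn": "Company A", "name_en": "A Ltd.", "stock_code": "0001", "chairman": "Person X"},
    {"name_cn": "Company B", "name_en": "B Ltd.", "stock_code": "0002", "chairman": "Person Y"},
]
entities, attributes, relations = etl(records)
```

Keeping the mapping in plain dictionaries means a differently structured industry database would only need a different mapping, not different extraction code.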
In another embodiment, the industry data source may also be a semi-structured or unstructured data source on the Internet. Industry data can be crawled from the data source using web crawler technology, and entities, entity attributes, and entity relationships can be extracted with information extraction operations based on natural language processing.
In step S12, data related to the industry is obtained from an Internet data source, and entities related to the industry, together with their corresponding entity attributes and entity relationships, are extracted from the data.
In this step, data related to the above specific field or industry is first obtained from an Internet data source. An Internet data source may be structured, semi-structured, or unstructured, so industry-related data can be obtained in different ways depending on the structural characteristics of the source. Entities and their corresponding entity attributes and entity relationships are then extracted from the industry-related data.
For a structured Internet data source, the corresponding data content can be queried through an API to obtain entities, entity attributes, and entity relationships. For a semi-structured data source, the data content can be crawled and then analyzed with information extraction operations from natural language processing to extract the industry-related entities, entity attributes, and entity relationships. A semi-structured data source is one that contains partly structured and partly unstructured data, so the corresponding parts of the semi-structured data can be handled in the ways appropriate to structured and unstructured data respectively. For example, HTML and XML files are the most common kinds of semi-structured data. When processing HTML and XML files, the structured information carried by the markup tags can be used on the one hand, and information extraction techniques can be combined with machine learning techniques on the other to extract the required information.
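A minimal sketch of this two-track handling of semi-structured data is shown below. The XML layout, the tag names, and the toy regular expression that stands in for a real information extraction model are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semi-structured document: tagged fields plus free text.
DOC = """<company>
  <name>Company A</name>
  <stock_code>0001</stock_code>
  <description>Company A was founded in 1998 and supplies parts to Company B.</description>
</company>"""


def parse_company(xml_text):
    root = ET.fromstring(xml_text)
    # Structured part: values are read directly from the markup tags.
    record = {
        "name": root.findtext("name"),
        "stock_code": root.findtext("stock_code"),
    }
    # Unstructured part: a toy pattern stands in for an extraction model.
    text = root.findtext("description") or ""
    match = re.search(r"founded in (\d{4})", text)
    if match:
        record["founded"] = int(match.group(1))
    return record


company_record = parse_company(DOC)
```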
In one embodiment, the information extraction operation includes an entity recognition operation and a relation extraction operation.
The entity recognition operation may use existing natural language processing tools (e.g., part-of-speech tagging or named entity recognition tools), or an entity recognition model may be trained on specific annotated data using machine learning. It should be noted that some natural language processing tasks and tools are language-dependent (e.g., Chinese data requires word segmentation, whereas English data does not). Machine learning methods represent data in different languages and formats numerically and then train models with general, language-independent algorithms (e.g., conditional random fields and hidden Markov models).
The relation extraction operation can be implemented with a variety of existing statistical learning or machine learning methods. For example, a template learning method may be used: entities in the knowledge graph database that satisfy a given relationship serve as instances; the sentence patterns and contexts in which these existing instances occur are extracted from a large amount of text and counted to form relation extraction templates; and the templates are then applied to text data to extract new instances. Any extracted instance that does not yet exist in the knowledge graph database can be added to it.
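The template learning idea can be illustrated with a deliberately small sketch. It assumes that a template is simply the literal text between two known entities and that entity mentions are runs of capitalized words; a real system would use much richer patterns and statistics.

```python
import re
from collections import Counter

ENTITY = r"([A-Z]\w*(?: [A-Z]\w*)*)"  # toy assumption: entities are capitalized word runs


def learn_templates(sentences, known_pairs, min_count=2):
    """Count the text found between known (head, tail) pairs; keep frequent patterns."""
    counts = Counter()
    for sentence in sentences:
        for head, tail in known_pairs:
            m = re.search(re.escape(head) + r"(.+?)" + re.escape(tail), sentence)
            if m:
                counts[m.group(1)] += 1
    return [pattern for pattern, n in counts.items() if n >= min_count]


def apply_templates(sentences, templates):
    """Apply learned patterns to text and return all (head, tail) pairs they match."""
    found = set()
    for sentence in sentences:
        for template in templates:
            for m in re.finditer(ENTITY + re.escape(template) + ENTITY, sentence):
                found.add((m.group(1), m.group(2)))
    return found


sentences = [
    "Person X is the chairman of Company A.",
    "Person Y is the chairman of Company B.",
    "Person Z is the chairman of Company C.",
]
known_pairs = {("Person X", "Company A"), ("Person Y", "Company B")}
templates = learn_templates(sentences, known_pairs)
new_pairs = apply_templates(sentences, templates) - known_pairs
```

Here `new_pairs` holds only the instances that are not already known, which are the ones that would be added to the knowledge graph database.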
In step S13, the industry knowledge graph database is supplemented or updated based on the industry-related entities and their corresponding entity attributes and entity relationships.
After the industry-related entities and their corresponding entity attributes and entity relationships have been extracted, they can be linked to and compared against the corresponding information in the knowledge graph database. New entities, entity attributes, and entity relationships are added to the knowledge graph database as needed, and existing entity attributes and entity relationships can be updated.
As described above, the industry knowledge graph database proposed by the present invention may use a traditional relational database, an RDF triple store, or a newer non-relational database (e.g., a graph database). Correspondingly, the specific operations for supplementing or updating the knowledge graph database can be implemented in a customized manner using a database query language, such as SQL for relational databases, the RDF triple query language SPARQL, or Cypher for the Neo4j graph database.
The description continues with the example in FIG. 2. Suppose the structured data on listed-company executives in Table 2 has been obtained from a structured Internet data source through API queries. The industry knowledge graph database can then be supplemented and updated as follows: 1) Person Z, the entity attributes of Person Z, and the employment relationship between Person Z and Company B are added to the knowledge graph database; 2) the entity attributes of Person X and Person Y are supplemented; 3) the employment relationship between Person Y and Company B is updated (from "currently serving" to "formerly served").
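For the relational-database case, the supplement and update operations above might look like the following sketch using an in-memory SQLite database. The table layout, the `works_at` relation label, and the `current`/`former` status values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE relation (
           subject TEXT, predicate TEXT, object TEXT, status TEXT,
           PRIMARY KEY (subject, predicate, object))"""
)


def upsert_relation(subject, predicate, obj, status):
    # INSERT OR REPLACE adds a new row or overwrites the row with the same key.
    conn.execute(
        "INSERT OR REPLACE INTO relation VALUES (?, ?, ?, ?)",
        (subject, predicate, obj, status),
    )


# Relationships already present after step S11.
upsert_relation("Person X", "works_at", "Company A", "current")
upsert_relation("Person Y", "works_at", "Company B", "current")

# Supplement: a new person and relationship found in the newly obtained data.
upsert_relation("Person Z", "works_at", "Company B", "current")
# Update: an existing relationship changes from current to former.
upsert_relation("Person Y", "works_at", "Company B", "former")

rows = conn.execute(
    "SELECT subject, status FROM relation WHERE object = ? ORDER BY subject",
    ("Company B",),
).fetchall()
```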
In one embodiment, entity linking and semantic disambiguation operations are required when supplementing or updating the industry knowledge graph database.
The entity linking operation aims to map an entity mention appearing in the data content to the relevant entity concept in the knowledge graph database. For example, in the two sentences "Jobs is one of the founders of Apple" and "Steve Jobs founded NeXT in the United States in 1985", the two entity mentions "Jobs" and "Steve Jobs" should both correspond to the same person entity concept in the knowledge graph database, "Steve Jobs (ex-CEO of Apple)", so an entity linking operation is needed to associate the two mentions with the same entity. Semantic disambiguation aims to disambiguate ambiguous entity mentions. For example, the mention "Apple" may correspond to several ambiguous entities, such as "apple (fruit)", "Apple Inc.", "Apple Daily", and "Apple (film)", whereas "Apple" in the first sentence above should correspond to the company entity concept "Apple Inc." in the knowledge graph database rather than to "apple (fruit)", "Apple (film)", or "Apple Daily". Entity linking and semantic disambiguation are usually performed together. Because semantic disambiguation is the means of entity linking and entity linking is the goal of semantic disambiguation, the two terms are often used interchangeably.
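As a rough illustration of linking with disambiguation, the sketch below picks, among the candidates whose aliases match a mention, the one whose context words overlap most with the sentence. The tiny knowledge base, its alias sets, and its context words are hypothetical, and simple word overlap is only a stand-in for a trained model.

```python
import re

# Hypothetical knowledge base: aliases and typical context words per entity.
KB = {
    "Steve Jobs": {"aliases": {"jobs", "steve jobs"},
                   "context": {"apple", "next", "founder", "founders"}},
    "Apple Inc.": {"aliases": {"apple"},
                   "context": {"jobs", "founder", "founders", "company"}},
    "Apple (fruit)": {"aliases": {"apple"},
                      "context": {"fruit", "eat", "tree"}},
}


def link(mention, sentence):
    """Return the KB entity whose context overlaps most with the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    candidates = [e for e, info in KB.items() if mention.lower() in info["aliases"]]
    if not candidates:
        return None  # unknown mention: a candidate for adding to the graph
    return max(candidates, key=lambda e: len(words & KB[e]["context"]))


s1 = "Jobs is one of the founders of Apple"
s2 = "Steve Jobs founded NeXT in the United States in 1985"
linked = [link("Jobs", s1), link("Apple", s1), link("Steve Jobs", s2)]
```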
Any existing entity linking and semantic disambiguation technique can be used in the present invention. For example, one class of methods disambiguates and links each entity mention independently, one at a time, based on entity knowledge. Entity knowledge includes, but is not limited to, the occurrence probability of an entity, the distribution of its names (full name, aliases, abbreviations, etc.), its context (e.g., word co-occurrence information and word distributions), and its category information in the knowledge base (e.g., company entity, person entity, location entity). Probability-based techniques (e.g., linear regression or logistic regression) or machine learning techniques (e.g., support vector machines or random forests) can be used to learn and train semantic disambiguation and entity linking models based on entity knowledge. Another class of methods rests on the assumption of topical coherence (i.e., the entities in an article are usually related to the topic of the text, so the entities are also semantically related to one another). These methods use the associations, within a knowledge base (such as Wikipedia or the knowledge graph constructed by the present invention), among the candidate entities of all entity mentions in the text to disambiguate and link all mentions in an article jointly. Such methods typically use collaborative inference over a graph data structure: the candidate entities of all mentions in the article are assembled into a candidate entity graph using their relationships in the knowledge base, and the density of the graph reflects the degree of semantic association between the candidate entity nodes. Entity linking then proceeds by iteratively propagating evidence (the possible degree of association between different entities) along the dependency structure of the candidate entity graph, so that the evidence is reinforced collaboratively, until convergence. The two classes of methods can also be combined flexibly to improve disambiguation and linking performance.
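The second, graph-based class of methods can be illustrated with a heavily simplified sketch. It assumes uniform initial scores, a binary "related in the knowledge base" relation, and a fixed number of iterations in place of a convergence test; it is not the specific algorithm of any published system.

```python
def collective_link(mention_candidates, related, iterations=10):
    """Jointly score candidates: a candidate gains weight when candidates
    of the other mentions are related to it in the knowledge base."""
    # Start from a uniform score over each mention's candidates.
    scores = {
        m: {c: 1.0 / len(cs) for c in cs} for m, cs in mention_candidates.items()
    }
    for _ in range(iterations):
        updated = {}
        for m, cs in mention_candidates.items():
            raw = {}
            for c in cs:
                # Evidence passed in from related candidates of other mentions.
                support = sum(
                    s
                    for other, other_scores in scores.items()
                    if other != m
                    for oc, s in other_scores.items()
                    if (c, oc) in related or (oc, c) in related
                )
                raw[c] = scores[m][c] * (1.0 + support)
            total = sum(raw.values())
            updated[m] = {c: v / total for c, v in raw.items()}
        scores = updated
    return {m: max(cs, key=cs.get) for m, cs in scores.items()}


mention_candidates = {
    "Apple": ["Apple Inc.", "Apple (fruit)"],
    "Jobs": ["Steve Jobs"],
}
related = {("Steve Jobs", "Apple Inc.")}  # hypothetical knowledge-base edge
result = collective_link(mention_candidates, related)
```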
In step S14, Internet media data related to the industry is obtained from an Internet data source, and specific media events related to the industry, together with their corresponding directly related entities, are extracted from the Internet media data.
Internet media data can be obtained from Internet data sources in a variety of ways. For example, some social media sites (e.g., Sina Weibo, Facebook, Twitter) provide open APIs for retrieving their data. Web crawler and content extraction technologies can also be used to collect data from news websites or industry media websites.
Various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, Internet media data is first examined to discover the content of media events in the specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (e.g., the negativity, significance, suddenness, propagation speed and range, and credibility of the event) to select the media events that meet the requirements.
For different types of Internet media data, different processing techniques can be used to identify the directly related entities corresponding to a media event. For example, entity recognition based on natural language processing can be used to identify entities in text data, image or video recognition can be used to identify entities in image or video data, and speech recognition can be used to identify entities in audio or video data. Those skilled in the art will appreciate that the present invention places no restriction on the media type or language of the Internet media data.
In step S15, the industry knowledge graph database is supplemented based on the specific media events and their corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.
After an industry-related specific media event and its corresponding directly related entities have been obtained (e.g., a corruption scandal involving the chairman of a listed company, and the company, persons, and locations involved in it), the event is added to the industry knowledge graph database as an abstract entity. At the same time, entity linking and semantic disambiguation are performed on the directly related entities involved in the event; that is, the corresponding entities in the industry knowledge graph database are found and associated with the abstract entity representing the event. If an entity involved in the event does not exist in the industry knowledge graph database, it can be added in the manner described for step S13 above. Once the industry knowledge graph database has been supplemented in this way, the other, indirectly related entities of the abstract entity representing the media event can be found in the database based on the relationships between the event's directly related entities and other entities.
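A minimal sketch of adding an event as an abstract entity is given below, assuming the graph is held as plain Python dictionaries and sets. The node and edge layout, the `involves` relation label, and the event identifier are illustrative assumptions, and the directly related entities are assumed to have been linked already.

```python
# Hypothetical in-memory graph: typed nodes and (head, relation, tail) edges.
graph = {
    "nodes": {"Company B": {"type": "company"}, "Person Y": {"type": "person"}},
    "edges": {("Person Y", "works_at", "Company B")},
}


def add_event(graph, event_id, attributes, direct_entities):
    """Add a media event as an abstract entity and connect it to the
    entities it directly involves, creating any that are missing."""
    graph["nodes"][event_id] = {"type": "event", **attributes}
    for name, entity_type in direct_entities.items():
        # Entities not yet in the graph are supplemented (as in step S13).
        graph["nodes"].setdefault(name, {"type": entity_type})
        graph["edges"].add((event_id, "involves", name))


add_event(
    graph,
    "event:001",
    {"summary": "corruption scandal", "negativity": 0.9},
    {"Person Y": "person", "Company B": "company", "City Q": "location"},
)
```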
Once the industry knowledge graph database has been constructed in the above manner, Internet media events can be monitored automatically and in depth based on the information it contains. Preferably, after the initial construction of the industry knowledge graph database, the database may also be updated to keep its information complete and valid. For example, steps S12 and S13 may be performed periodically at a predetermined interval, and steps S14 and S15 may be performed continuously in real time.
In addition, those skilled in the art will appreciate that the industry data, industry-related data, Internet media data, and other data involved in the present invention may be in multiple languages and of multiple types (e.g., text, image, video, speech). The present invention places no restriction on this.
FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention. The method can monitor industry-related specific media events based on the industry knowledge graph database constructed in the present invention, and may include steps S31-S35.
In step S31, Internet media data is acquired.
As described above, Internet media data can be obtained from Internet data sources in a variety of ways. For example, some social media sites (e.g., Sina Weibo, Facebook, Twitter) provide open APIs for retrieving their data. Web crawler and content extraction technologies can also be used to collect data from news websites or industry media websites.
In step S32, event detection, event evaluation, and screening are performed on the acquired Internet media data to obtain the industry-related specific media events.
As described above, various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, Internet media data is first examined to discover the content of media events in the specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (e.g., the negativity, significance, suddenness, propagation speed and range, and credibility of the event) to select the media events that meet the requirements.
Specifically, in one embodiment, the technical steps involved in event detection may include topic classification, entity recognition, sentiment analysis, and event discovery.
In the topic classification step, the content of the acquired Internet media data is classified by topic to obtain content on specific topics. The purpose of topic classification is to select, from the acquired content, the text that belongs to a topic of interest or to a category relevant to the customer's needs. Topic classification is a text mining technique in which a classification model is generally trained on annotated data using machine learning or deep learning and then applied to text to determine its topic category. Any existing classification model (e.g., naive Bayes, decision trees, support vector machines, artificial neural networks) can be used in the present invention.
In the entity recognition step, the entities involved are identified in the content obtained. The purpose of entity recognition is to find the entities mentioned in an article for further analysis. For example, entity recognition may include extracting entities from text using information extraction techniques from natural language processing, identifying entities in images (including video) using image recognition, and identifying entities in speech using speech recognition. The entities identified from text, images, and speech may also be merged.
In the sentiment analysis step, sentiment analysis is performed on the content obtained and on the identified entities, and the content is filtered based on the results. Sentiment analysis determines the sentiment polarity expressed by the content as a whole and toward individual entities, in order to find the content that meets the monitoring conditions. Existing techniques generally implement sentiment analysis as text classification (e.g., classifying sentiment as positive, neutral, or negative) or as regression analysis (e.g., expressing sentiment as a score between -5 and +5). To determine the sentiment toward a particular entity, the context of the entity in the text can be used, or a dependency parser can be used to find the parts of the text related to that entity, so that entity-level sentiment analysis can be performed.
In the event discovery step, event discovery is performed on the filtered content to cluster media events and discover new ones. The purpose of event discovery is to extract event information (e.g., the time and place of an event) from different texts, cluster and merge the related information into abstract "events", identify newly emerging events by comparing them with existing events, and cluster events according to the similarity or relatedness of their content.
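The clustering part of event discovery can be sketched with a single-pass procedure, assuming that each text is reduced to a set of lower-cased words and that Jaccard similarity with a fixed threshold decides whether a text joins an existing event or starts a new one. Both choices are simplifications made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_events(texts, threshold=0.5):
    """Assign each text to the most similar existing event cluster,
    or open a new cluster (a newly discovered event) if none is close."""
    clusters = []  # each cluster: {"terms": set of words, "docs": list of indices}
    for i, text in enumerate(texts):
        terms = set(text.lower().split())
        best = max(clusters, key=lambda c: jaccard(terms, c["terms"]), default=None)
        if best is not None and jaccard(terms, best["terms"]) >= threshold:
            best["docs"].append(i)
            best["terms"] |= terms
        else:
            clusters.append({"terms": terms, "docs": [i]})
    return [c["docs"] for c in clusters]


texts = [
    "company b chairman arrested for bribery",
    "chairman of company b arrested bribery probe",
    "company a launches new phone",
]
event_clusters = cluster_events(texts)
```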
In one embodiment, optionally, the authenticity of an event may also be analyzed during event detection based on the attributes of the media event (e.g., the time and place of the event, and the publisher of the media event and the publisher's attributes), and the media events may be ranked and/or filtered according to the results of the analysis.
Those skilled in the art will appreciate that the implementations listed for the operations in the above steps are merely exemplary. Other approaches known in the art can also implement these operations, and the present invention places no restriction on the specific way in which they are implemented.
In step S33, the directly related entities corresponding to the specific media event are identified.
In one embodiment, the directly related entities of each media event can be obtained through the entity recognition and event discovery operations performed during event detection. At the same time, as described above, each directly related entity can be linked to the corresponding entity concept in the industry knowledge graph database, or added to the database, through entity linking and semantic disambiguation.
In step S34, the industry knowledge graph database is accessed based on the directly related entities to determine the indirectly related entities corresponding to the specific media event.
In one embodiment, the other, indirectly related entities that are associated with an event's directly related entities can be queried directly from the industry knowledge graph database using various preset conditions. For example, the preset conditions may be: 1) entities associated with a directly related entity within N hops (N may be 1, 2, 3, ...); 2) other entities whose degree of association with a directly related entity satisfies a certain condition (e.g., exceeds a specified threshold); 3) entities that have a specific relationship with a directly related entity (e.g., a supply relationship or an investment relationship); 4) entities that have a specific attribute (e.g., belonging to a specified industry, located in a certain place, or holding a certain position). These preset conditions can be used individually or in any combination.
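Conditions 1) and 3) above can be combined in a breadth-first search, as in the following sketch. The edge list and relation labels are hypothetical, and edges are treated as undirected for simplicity.

```python
from collections import deque

# Hypothetical edges: (head, relation, tail).
EDGES = [
    ("Person Y", "works_at", "Company B"),
    ("Company B", "supplies", "Company C"),
    ("Company C", "invests_in", "Company D"),
]


def neighbours(node):
    """Yield (relation, other_node) pairs, treating edges as undirected."""
    for head, relation, tail in EDGES:
        if head == node:
            yield relation, tail
        elif tail == node:
            yield relation, head


def indirect_entities(direct, max_hops, relations=None):
    """Return {entity: hop distance} for entities within max_hops of the
    directly related entities, optionally following only given relations."""
    seen = set(direct)
    queue = deque((d, 0) for d in direct)
    result = {}
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for relation, other in neighbours(node):
            if relations is not None and relation not in relations:
                continue
            if other not in seen:
                seen.add(other)
                result[other] = depth + 1
                queue.append((other, depth + 1))
    return result


within_two = indirect_entities(["Person Y"], max_hops=2)
suppliers_only = indirect_entities(["Company B"], max_hops=2, relations={"supplies"})
```

In a deployed system the same query would more likely be expressed in the query language of the underlying database, but the traversal logic is the same.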
In another embodiment, data mining methods can be applied on top of the industry knowledge graph database, using multiple conditions to mine the indirectly related entities of an event. For example, a specific implementation may use link prediction for graph data; that is, the problem of detecting the indirectly related entities of an event is formulated as the technical problem of "predicting whether an edge exists between the node representing the event in the industry knowledge graph database and entity nodes other than the directly related entity nodes". The conditions that can be used for link prediction include, but are not limited to, the characteristics of the event itself (e.g., its type, time and location attributes, and negativity), the relationships between the event and historical events (including relationship type and strength), the relationships between the event's directly related entities and other entities (including relationship type and strength), and entity types and attributes, that is, all the knowledge that can be mined from the knowledge graph database. This allows a comprehensive judgment of the indirectly related entities of a specific media event.
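As one simple baseline for link prediction, the sketch below scores each candidate entity by the Jaccard overlap between its neighbours and the event's directly related entities. This neighbourhood-overlap score is an assumption chosen for illustration; it uses only graph structure and ignores the event features, relationship strengths, and attributes mentioned above.

```python
def predict_links(adjacency, event, top_k=3):
    """Rank entities not directly linked to the event by the Jaccard
    overlap between their neighbours and the event's direct entities."""
    direct = adjacency[event]
    scores = {}
    for node, nbrs in adjacency.items():
        if node == event or node in direct:
            continue
        common = direct & nbrs
        if common:
            scores[node] = len(common) / len(direct | nbrs)
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical undirected adjacency sets.
adjacency = {
    "event:001": {"Company B", "Person Y"},
    "Company B": {"event:001", "Person Y", "Company C"},
    "Person Y": {"event:001", "Company B", "Company C"},
    "Company C": {"Company B", "Person Y", "Company D"},
    "Company D": {"Company C"},
}
predicted = predict_links(adjacency, "event:001")
```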
In step S35, an alert message is sent to the directly related entities and/or the indirectly related entities.
After the directly and indirectly related entities corresponding to a specific media event have been identified, an alert message can be sent to the users of the corresponding entities through a variety of channels (e.g., e-mail, SMS, instant messaging tools, social network platforms). The alert message may contain a textual description of the event, pictures, statistics on its propagation, event evaluation indicators, the ways in which the related entities may be affected by the event, and so on.
Those skilled in the art will appreciate that the specific media events described in the present invention may be any type of event that meets the conditions set by the user and can be obtained from Internet media, such as negative events, emergencies, crisis events, mass incidents, or public opinion events. The present invention places no restriction on this.
As a preferred embodiment, FIG. 4 shows an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S41, S421/S422, and S43-S45.
In step S41, industry data is obtained from an industry data source, and entities together with their corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge graph database.
In step S421, entities, entity attributes, and entity relationships related to the industry are obtained from a structured data source by querying it through an application programming interface. In one embodiment, the structured data source may be a structured open data platform such as Wikidata or DBpedia, from which industry-related data can be obtained through an API.
In step S422, entity recognition and relation extraction are performed on data from a semi-structured or unstructured data source using natural language processing techniques, in order to extract entities, entity attributes, and entity relationships related to the industry. In one embodiment, the semi-structured or unstructured data source may be an open data platform such as Wikipedia or Baidu Baike, or any relevant third-party data source (e.g., professional websites or content published in Internet media), and industry-related data can be obtained through web crawling or content extraction techniques.
Preferably, steps S421 and/or S422 and step S43 may be performed periodically at a predetermined interval.
In step S43, the industry knowledge graph database is supplemented or updated based on the industry-related entities and their corresponding entity attributes and entity relationships.
In step S44, Internet media data is obtained from an Internet data source, and specific media events related to the industry, together with their corresponding directly related entities, are extracted from the Internet media data.
In step S45, the industry knowledge graph database is supplemented based on the specific media events and their corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.
Preferably, steps S44 and S45 may be performed continuously in real time.
FIG. 5 is an exemplary flowchart of another method of constructing an industry knowledge graph database provided by the present invention. The method may include steps S51-S53:
In step S51, industry data is obtained from a data source;
In step S52, data processing is performed on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
In step S53, the industry knowledge graph database is constructed based on the extracted entities, entity attributes, and/or entity relationships.
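Steps S51-S53 can be pictured as a three-stage pipeline. The following is only a minimal sketch under stated assumptions: the `KnowledgeGraph` class, the record layout, and the entity names are hypothetical illustrations and are not taken from the specification, which does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy in-memory graph: attributes per entity plus (head, relation, tail) triples."""
    attributes: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)

    def add_entity(self, name, **attrs):
        self.attributes.setdefault(name, {}).update(attrs)

    def add_relation(self, head, relation, tail):
        self.add_entity(head)
        self.add_entity(tail)
        self.relations.add((head, relation, tail))


def acquire(source):
    """S51: obtain raw industry data (here the source is already a list of records)."""
    return list(source)


def extract(records):
    """S52: pull entities, attributes and relationships out of each record."""
    entities, relations = [], []
    for rec in records:
        entities.append((rec["name"], {"type": rec["type"]}))
        for relation, tail in rec.get("relations", []):
            relations.append((rec["name"], relation, tail))
    return entities, relations


def build(entities, relations):
    """S53: assemble the extracted items into a knowledge graph."""
    kg = KnowledgeGraph()
    for name, attrs in entities:
        kg.add_entity(name, **attrs)
    for head, relation, tail in relations:
        kg.add_relation(head, relation, tail)
    return kg


# Hypothetical sample data, used only to exercise the pipeline.
sample_source = [
    {"name": "AcmeDairy", "type": "company", "relations": [("supplied_by", "GreenFarm")]},
    {"name": "GreenFarm", "type": "supplier"},
]
kg = build(*extract(acquire(sample_source)))
```

Keeping the three stages as separate functions mirrors the way the later embodiments swap in different acquisition and processing techniques while leaving the overall flow unchanged.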
As described above, the industry knowledge graph database can draw on a wide variety of data sources, including but not limited to open encyclopedia-style data sources, structured databases, and any relevant third-party semi-structured or unstructured Internet data sources. In addition, as described above, the data source of the industry knowledge graph database may also be an Internet media data source.
In one embodiment, the data source may be a structured industry database, and the method may be implemented as follows: in step S51(1), structured industry data comprising a plurality of fields is obtained from a third-party industry database; in step S52(1), before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted, data cleaning and extract-transform-load (ETL) processing are performed on the structured industry data; in step S53(1), the industry knowledge graph database is generated based on the extracted entities, entity attributes, and/or entity relationships.
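The cleaning and ETL processing of steps S51(1)-S53(1) might look like the sketch below. It assumes, purely for illustration, that the third-party database exports CSV with the hypothetical fields `company`, `industry`, and `supplier`; the specification names no concrete schema or cleaning rules.

```python
import csv
import io

# Hypothetical export with stray whitespace, a duplicate row and a missing field.
RAW = """company, industry ,supplier
 AcmeDairy ,Dairy,GreenFarm
AcmeDairy,Dairy,GreenFarm
BetaFoods,,GreenFarm
"""


def clean(rows):
    """Data cleaning: trim whitespace, drop rows with empty fields, remove duplicates."""
    seen, out = set(), []
    for row in rows:
        rec = {k.strip(): (v or "").strip() for k, v in row.items()}
        key = tuple(sorted(rec.items()))
        if all(rec.values()) and key not in seen:
            seen.add(key)
            out.append(rec)
    return out


def to_triples(records):
    """Transform: map fields to (head, relation, tail) triples ready for loading."""
    triples = []
    for rec in records:
        triples.append((rec["company"], "in_industry", rec["industry"]))
        triples.append((rec["company"], "supplied_by", rec["supplier"]))
    return triples


records = clean(csv.DictReader(io.StringIO(RAW)))
triples = to_triples(records)
```

Only the first row survives cleaning here: the second is a duplicate after trimming, and the third lacks an industry value.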
In another embodiment, the data source may be an unstructured or semi-structured Internet data source, and the method may be implemented as follows: in step S51(2), industry-related data is obtained from an Internet data source using web crawling techniques, the Internet data source including unstructured or semi-structured data sources; in step S52(2), entity recognition and relation extraction are performed on the industry-related data using information extraction techniques from natural language processing, in order to extract the entities, entity attributes, and/or entity relationships; in step S53(2), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.
In addition, steps S51(2)-S53(2) may be executed periodically at a predetermined interval.
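As a rough illustration of the entity recognition and relation extraction in step S52(2), the sketch below uses hand-written regular-expression patterns. This is an assumption made for brevity: the specification only refers to information extraction techniques in natural language processing, and a practical system would more likely use trained models. The patterns, relation names, and sample text are all hypothetical.

```python
import re

# Each pattern captures a head and tail entity around a fixed relation phrase.
PATTERNS = [
    (re.compile(r"(?P<head>[A-Z]\w+) acquired (?P<tail>[A-Z]\w+)"), "acquired"),
    (re.compile(r"(?P<head>[A-Z]\w+) is a supplier of (?P<tail>[A-Z]\w+)"), "supplier_of"),
]


def extract_relations(text):
    """Return (head, relation, tail) triples found by the patterns above."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("head"), relation, match.group("tail")))
    return triples


def extract_entities(triples):
    """Entities are simply every head or tail that appeared in a triple."""
    return {name for head, _, tail in triples for name in (head, tail)}


text = "AcmeDairy acquired BetaFoods in 2015. GreenFarm is a supplier of AcmeDairy."
triples = extract_relations(text)
entities = extract_entities(triples)
```

The resulting triples could then be merged into the existing database, which is the supplement-or-update behaviour of step S53(2).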
In another embodiment, the data source may be an open Internet data source, and the method may be implemented as follows: in step S51(3), industry-related data is obtained from an Internet data source by querying through an application programming interface (API); in step S52(3), before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted, data cleaning and extract-transform-load (ETL) processing are performed on the industry-related data; in step S53(3), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.
In addition, steps S51(3)-S53(3) may be executed periodically at a predetermined interval.
In another embodiment, the data source may be an Internet media data source, and the method may be implemented as follows: in step S51(4), Internet media data is obtained from an Internet data source using an application programming interface (API) or web crawling techniques; in step S52(4), event detection, event evaluation, and screening are performed on the Internet media data to extract specific media events related to the industry, and the corresponding directly related entities are identified from the Internet media data; in step S53(4), the industry knowledge graph database is supplemented based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.
For example, in step S52(4) the directly related entities corresponding to a specific media event may be identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition processing; or identifying entities from audio or video data based on speech recognition processing.
For example, the specific media events may include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.
In addition, steps S51(4)-S53(4) may be executed continuously in real time.
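One way to picture a specific media event being added "as an abstract entity" in step S53(4) is to store the event as a node of its own and connect it to each directly related entity. The sketch below assumes a hypothetical `Graph` class, an `involves` relation name, and made-up event data; none of these appear in the specification.

```python
from collections import defaultdict


class Graph:
    """Toy undirected labelled graph."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)  # node -> {(relation, neighbour)}

    def add_node(self, name, **attrs):
        self.nodes.setdefault(name, {}).update(attrs)

    def add_edge(self, a, relation, b):
        # Stored in both directions so either endpoint can be used for lookup.
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))


def add_media_event(graph, event_id, event_type, summary, direct_entities):
    """Insert the event as an abstract entity linked to its directly related entities."""
    graph.add_node(event_id, kind="event", event_type=event_type, summary=summary)
    for entity in direct_entities:
        graph.add_node(entity)
        graph.add_edge(event_id, "involves", entity)


g = Graph()
g.add_node("AcmeDairy", kind="company")
add_media_event(g, "event-001", "negative", "Product recall reported", ["AcmeDairy"])
```

Because the event is an ordinary node, later queries can reach it, and through it the entities it involves, with the same traversal used for any other entity.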
In another embodiment, the supplementing or updating of the industry knowledge graph database in steps S53(2), S53(3), and S53(4) above may include performing semantic disambiguation and entity linking on the extracted entities. For example, the semantic disambiguation and entity linking may be performed in at least one of the following ways: based on entity knowledge, disambiguating and linking each extracted entity mention independently, one at a time; or, based on a topic consistency assumption, using the associations among candidate entities in the knowledge base to disambiguate and link the extracted entity mentions jointly.
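The difference between the two linking strategies can be shown with a small sketch. The candidate table, the prior scores, and the relatedness set below are invented for illustration, and the scoring rule (prior plus one point per related pair) is an assumption rather than the method of the specification.

```python
from itertools import product

# Hypothetical candidate entities per mention, with made-up prior scores.
CANDIDATES = {
    "Jaguar": {"Jaguar (carmaker)": 0.4, "Jaguar (animal)": 0.6},
    "Land Rover": {"Land Rover (carmaker)": 1.0},
}
# Hypothetical knowledge-base associations between candidate entities.
RELATED = {frozenset({"Jaguar (carmaker)", "Land Rover (carmaker)"})}


def link_independently(mentions):
    """Resolve each mention on its own, using only its prior score."""
    return {m: max(CANDIDATES[m], key=CANDIDATES[m].get) for m in mentions}


def link_collectively(mentions):
    """Pick the joint assignment maximising priors plus pairwise coherence."""
    best, best_score = None, float("-inf")
    for combo in product(*(CANDIDATES[m] for m in mentions)):
        prior = sum(CANDIDATES[m][c] for m, c in zip(mentions, combo))
        coherence = sum(
            1.0
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
            if frozenset({combo[i], combo[j]}) in RELATED
        )
        if prior + coherence > best_score:
            best, best_score = dict(zip(mentions, combo)), prior + coherence
    return best


mentions = ["Jaguar", "Land Rover"]
independent = link_independently(mentions)
collective = link_collectively(mentions)
```

With these numbers the independent strategy picks the animal sense of "Jaguar" because its prior is higher, while the collective strategy picks the carmaker because it is associated with the other mention's candidate. The exhaustive search is exponential in the number of mentions and is used here only to keep the example short.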
The foregoing has described, by way of embodiments, a method of constructing an industry knowledge graph database provided by the present invention. Those skilled in the art will understand that various combinations of these embodiments also fall within the concept of this method of constructing an industry knowledge graph database.
FIG. 6 is an exemplary block diagram of a system for monitoring media events provided by the present invention. The system includes a data acquisition unit, a data processing unit, a database construction unit, a database storage unit, a media event monitoring unit, a database access unit, and a message sending unit.
Data acquisition unit: configured to obtain industry data from a data source;
Data processing unit: configured to perform data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
Database construction unit: configured to construct the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships;
Database storage unit: configured to store the constructed industry knowledge graph database;
Media event monitoring unit: configured to obtain Internet media data, perform event detection, event evaluation, and screening based on the obtained Internet media data to obtain the industry-related specific media events, and identify the directly related entities corresponding to the specific media events;
Database access unit: configured to access the industry knowledge graph database based on the directly related entities, in order to determine the indirectly related entities corresponding to the specific media events;
Message sending unit: configured to send alert messages to the directly related entities and/or the indirectly related entities.
In one embodiment, the data acquisition unit includes a structured data acquisition unit, configured to obtain structured data comprising a plurality of fields from a third-party industry database; the data processing unit includes a structured data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the structured data; and the database construction unit includes a database generation unit, configured to generate the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
In another embodiment, the data acquisition unit includes an industry-related data acquisition unit, configured to obtain industry-related data from an Internet data source using web crawling techniques, the Internet data source including unstructured or semi-structured data sources; the data processing unit includes an industry-related data processing unit, configured to perform entity recognition and relation extraction on the industry-related data using information extraction techniques from natural language processing, in order to extract the entities, entity attributes, and/or entity relationships; and the database construction unit includes a database supplement/update unit, configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
In another embodiment, the data acquisition unit includes an industry-related data acquisition unit, configured to obtain industry-related data from an Internet data source by querying through an application programming interface (API), the Internet data source including open data sources; the data processing unit includes an industry-related data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the industry-related data before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted; and the database construction unit includes a database supplement/update unit, configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
In another embodiment, the data acquisition unit includes a media data acquisition unit, configured to obtain industry-related Internet media data from an Internet data source using an application programming interface (API) or web crawling techniques; the data processing unit includes a media data processing unit, configured to perform event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and to identify the corresponding directly related entities from the Internet media data; and the database construction unit includes a database supplement/update unit, configured to supplement the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.
In one embodiment, the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.
In one embodiment, the media event monitoring unit is further configured to: classify the content of the obtained Internet media data by topic to obtain content on a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis; and perform event discovery on the filtered content, clustering media events and discovering new media events. In another embodiment, the media event monitoring unit is further configured to analyze the authenticity of events based on the attributes of the media events, and to rank and/or filter the media events according to the results of the analysis.
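The four stages performed by the media event monitoring unit (topic classification, entity identification, sentiment-based filtering, and event discovery by clustering) can be sketched as follows. Everything here is a simplifying assumption: keyword lookups stand in for real classifiers, grouping by shared entity set stands in for a clustering algorithm, and the word lists, entity names, and posts are hypothetical.

```python
TOPIC_KEYWORDS = {"food safety": {"recall", "contamination", "contaminated", "inspection"}}
NEGATIVE_WORDS = {"recall", "contaminated", "contamination", "scandal"}
KNOWN_ENTITIES = {"AcmeDairy", "BetaFoods"}


def tokens(text):
    """Lower-cased words with trailing punctuation removed."""
    return {w.strip(".,").lower() for w in text.split()}


def classify_topic(text):
    """Topic classification by keyword overlap."""
    words = tokens(text)
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return None


def find_entities(text):
    """Entity identification by dictionary lookup."""
    return {e for e in KNOWN_ENTITIES if e in text}


def is_negative(text):
    """Sentiment analysis reduced to a negative-word check."""
    return bool(tokens(text) & NEGATIVE_WORDS)


def detect_events(posts, topic):
    """Keep on-topic, negative posts and group them by the entities they mention."""
    clusters = {}
    for post in posts:
        if classify_topic(post) != topic:
            continue
        entities = find_entities(post)
        if not entities or not is_negative(post):
            continue
        clusters.setdefault(frozenset(entities), []).append(post)
    return clusters


posts = [
    "AcmeDairy announces recall of contaminated milk.",
    "Regulator confirms contamination at AcmeDairy plant.",
    "BetaFoods passes inspection.",
    "BetaFoods opens a new store.",
]
events = detect_events(posts, "food safety")
```

The third post is on topic but not negative, so the sentiment filter removes it; the fourth is off topic. The two remaining posts fall into one cluster, which corresponds to a single candidate media event.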
In one embodiment, the database access unit is further configured to query the industry knowledge graph database based on the directly related entities, in order to determine the indirectly related entities. In another embodiment, the database access unit is further configured to apply data mining techniques to the industry knowledge graph database based on the directly related entities, in order to determine the indirectly related entities.
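Determining indirectly related entities by querying the graph can be illustrated with a bounded breadth-first search that starts from the directly related entities. The adjacency table, the entity names, and the `max_hops` cut-off are assumptions for this sketch; the specification does not state how far from the directly related entities a query should reach.

```python
from collections import deque

# Hypothetical adjacency list for a small industry graph.
EDGES = {
    "AcmeDairy": {"GreenFarm", "BetaFoods"},
    "GreenFarm": {"AcmeDairy", "FreshMart"},
    "BetaFoods": {"AcmeDairy"},
    "FreshMart": {"GreenFarm"},
}


def indirectly_related(direct, max_hops=2):
    """Return {entity: hop distance} for entities reachable within max_hops,
    excluding the directly related entities themselves."""
    direct = set(direct)
    seen = set(direct)
    queue = deque((entity, 0) for entity in direct)
    found = {}
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in EDGES.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                found[neighbour] = dist + 1
                queue.append((neighbour, dist + 1))
    return found


related = indirectly_related(["AcmeDairy"])
near_only = indirectly_related(["AcmeDairy"], max_hops=1)
```

The hop distance gives a simple way to prioritise alert messages, for instance by notifying nearer entities first.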
The foregoing has described, by way of embodiments, a system for monitoring media events provided by the present invention. Those skilled in the art will understand that the operational steps of the various methods described above with reference to FIGS. 1 and 3-5 can be applied in the constituent units of the system, and they are therefore not repeated here.
Those skilled in the art will also appreciate that the various exemplary method steps and units described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To illustrate clearly this interchangeability of hardware and software, the exemplary steps and units above have been described generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and on the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The term "example/exemplary" as used in this specification means serving as an example, instance, or illustration. Any technical solution described in the specification as "exemplary" should not be construed as preferred or advantageous over other technical solutions.
The foregoing description of the disclosed technical content is provided to enable those skilled in the art to make or use the present invention. Many modifications and variations of this technical content will be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Accordingly, the present invention is not limited to the specific embodiments shown above, but is to be accorded the widest scope consistent with the inventive concepts disclosed herein.

Claims (37)

  1. A method of constructing an industry knowledge graph database, characterized by comprising the following steps:
    step 101: obtaining industry data from a data source;
    step 102: performing data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
    step 103: constructing the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  2. The method according to claim 1, characterized in that:
    step 101 is implemented by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields;
    step 102 is implemented by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    step 103 is implemented by generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  3. The method according to claim 1, characterized in that:
    step 101 is implemented by obtaining industry-related data from an Internet data source using web crawling techniques, the Internet data source including unstructured or semi-structured data sources;
    step 102 is implemented by performing entity recognition and relation extraction on the industry-related data using information extraction techniques from natural language processing, in order to extract the entities, entity attributes, and/or entity relationships;
    step 103 is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  4. The method according to claim 1, characterized in that:
    step 101 is implemented by obtaining industry-related data from an Internet data source by querying through an application programming interface (API), the Internet data source including open data sources;
    step 102 is implemented by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    step 103 is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  5. The method according to claim 1, characterized in that:
    step 101 is implemented by obtaining industry-related Internet media data from an Internet data source using an application programming interface (API) or web crawling techniques;
    step 102 is implemented by performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data;
    step 103 is implemented by supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.
  6. The method according to claim 5, characterized in that, in step 102, the directly related entities corresponding to the specific media events are further identified in at least one of the following ways:
    identifying entities from text data based on entity recognition in natural language processing;
    identifying entities from image or video data based on image or video recognition processing; or
    identifying entities from audio or video data based on speech recognition processing.
  7. The method according to any one of claims 3-5, characterized in that step 103 comprises performing semantic disambiguation and entity linking on the extracted entities.
  8. The method according to claim 7, characterized in that the step of performing semantic disambiguation and entity linking on the extracted entities is further implemented in at least one of the following ways:
    based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one at a time;
    based on a topic consistency assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly.
  9. The method according to claim 5, characterized in that the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.
  10. The method according to claim 3 or 4, characterized in that steps 101-103 are executed periodically at a predetermined interval.
  11. The method according to claim 5, characterized in that steps 101-103 are executed continuously in real time.
  12. A method of monitoring industry-related specific media events based on the industry knowledge graph database according to any one of claims 1-11, characterized by comprising the following steps:
    step 1201: obtaining Internet media data;
    step 1202: performing event detection, event evaluation, and screening based on the obtained Internet media data to obtain the industry-related specific media events;
    step 1203: identifying the directly related entities corresponding to the specific media events;
    step 1204: accessing the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media events;
    step 1205: sending alert messages to the directly related entities and/or the indirectly related entities.
  13. The method according to claim 12, characterized in that the event detection in step 1202 comprises the following steps:
    classifying the content of the obtained Internet media data by topic to obtain content on a specific topic;
    identifying the entities involved in the obtained content;
    performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis;
    performing event discovery on the filtered content, so as to cluster media events and discover new media events.
  14. The method according to claim 13, characterized in that the event detection in step 1202 further comprises the following step:
    analyzing the authenticity of events based on the attributes of the media events, and ranking and/or filtering the media events according to the results of the analysis.
  15. The method according to claim 12, characterized in that, in step 1203, the directly related entities corresponding to the specific media events are identified in at least one of the following ways:
    identifying entities from text data based on entity recognition in natural language processing;
    identifying entities from image or video data based on image or video recognition processing; or
    identifying entities from audio or video data based on speech recognition processing.
  16. The method according to claim 12, characterized in that step 1204 is implemented by:
    querying the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
  17. The method according to claim 12, characterized in that step 1204 is implemented by:
    applying data mining techniques to the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.
  18. An apparatus for constructing an industry knowledge graph database, characterized by comprising:
    a data acquisition module, configured to obtain industry data from a data source;
    a data processing module, configured to perform data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
    a database construction module, configured to construct the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  19. The apparatus according to claim 18, characterized in that:
    the data acquisition module obtains industry data by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields;
    the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    the database construction module constructs the industry knowledge graph database by generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  20. The apparatus according to claim 18, characterized in that:
    the data acquisition module obtains industry data by obtaining industry-related data from an Internet data source using web crawling techniques, the Internet data source including unstructured or semi-structured data sources;
    the data processing module performs data processing by performing entity recognition and relation extraction on the industry-related data using information extraction techniques from natural language processing, in order to extract the entities, entity attributes, and/or entity relationships;
    the database construction module constructs the industry knowledge graph database by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  21. The apparatus according to claim 18, characterized in that:
    the data acquisition module obtains industry data by obtaining industry-related data from an Internet data source by querying through an application programming interface (API), the Internet data source including open data sources;
    the data processing module performs data processing by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    the database construction module constructs the industry knowledge graph database by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.
  22. The apparatus of claim 18, wherein:
    the data acquisition module acquires industry data by: using an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from Internet data sources;
    the data processing module performs data processing by: performing event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data;
    the database building module builds the industry knowledge graph database by: supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
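One way to picture a media event being added to the graph as an abstract entity is the following Python sketch. The `KnowledgeGraph` class, the `involves` relation label and the sample identifiers are all hypothetical; the claims do not prescribe a storage model.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy in-memory graph: nodes carry attributes, edges carry a relation label."""

    def __init__(self):
        self.nodes = {}                # node id -> attribute dict
        self.edges = defaultdict(set)  # node id -> {(relation, neighbour id)}

    def add_entity(self, node_id, **attrs):
        # Creates the node if missing; existing attributes are kept unless overwritten.
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_relation(self, src, relation, dst):
        # Stored in both directions so either end can start a traversal.
        self.edges[src].add((relation, dst))
        self.edges[dst].add((relation, src))

def add_media_event(graph, event_id, summary, direct_entities):
    """Insert the event as a node of kind 'event' and link it to its directly related entities."""
    graph.add_entity(event_id, kind="event", summary=summary)
    for entity_id in direct_entities:
        graph.add_entity(entity_id)
        graph.add_relation(event_id, "involves", entity_id)

graph = KnowledgeGraph()
graph.add_entity("AcmeDairy", kind="company")
add_media_event(graph, "event:recall-001", "Product recall reported", ["AcmeDairy"])
print(graph.edges["AcmeDairy"])  # {('involves', 'event:recall-001')}
```

Treating the event as an ordinary node means the same traversal that finds related companies can also find related events.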
  23. The apparatus of claim 22, wherein the database building module further identifies the directly related entities corresponding to the specific media event in at least one of the following ways:
    identifying entities from text data based on entity recognition in natural language processing;
    identifying entities from image or video data based on image or video recognition processing; or
    identifying entities from audio or video data based on speech recognition processing.
  24. The apparatus of any one of claims 20-22, wherein the database building module comprises: a module for performing semantic disambiguation and entity linking on the extracted entities.
  25. The apparatus of claim 24, wherein the module for performing semantic disambiguation and entity linking on the extracted entities further performs semantic disambiguation and entity linking in at least one of the following ways:
    based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one by one;
    based on a topic consistency assumption, using the associations among candidate entities in a knowledge base, performing semantic disambiguation and entity linking on the extracted entity mentions jointly, so that the linked entities are mutually consistent.
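The difference between the two linking strategies can be shown with a small Python sketch. The candidate table, the prior scores and the `RELATED` set are invented for illustration, and the brute-force search is only practical for a handful of mentions; it is a sketch of the idea, not the method of the application.

```python
from itertools import combinations, product

# Hypothetical candidates: mention -> {candidate entity: prior score}
CANDIDATES = {
    "Apple": {"Apple_Inc": 0.4, "Apple_fruit": 0.6},
    "iPhone": {"iPhone_product": 1.0},
}
# Hypothetical knowledge-base associations between candidate entities
RELATED = {frozenset(("Apple_Inc", "iPhone_product"))}

def link_independently(mentions):
    """Pick the highest-prior candidate for each mention, ignoring the other mentions."""
    return {m: max(CANDIDATES[m], key=CANDIDATES[m].get) for m in mentions}

def link_collectively(mentions):
    """Pick the joint assignment maximising priors plus pairwise knowledge-base associations."""
    best, best_score = None, float("-inf")
    for combo in product(*(CANDIDATES[m] for m in mentions)):
        score = sum(CANDIDATES[m][e] for m, e in zip(mentions, combo))
        score += sum(1.0 for a, b in combinations(combo, 2) if frozenset((a, b)) in RELATED)
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best))

mentions = ["Apple", "iPhone"]
print(link_independently(mentions))  # {'Apple': 'Apple_fruit', 'iPhone': 'iPhone_product'}
print(link_collectively(mentions))   # {'Apple': 'Apple_Inc', 'iPhone': 'iPhone_product'}
```

In this toy data the independent strategy picks the fruit sense because its prior is higher, while the collective strategy prefers the company sense because it is associated with the other entity in the same text.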
  26. The method of claim 22, wherein the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of industry significance.
  27. A system for monitoring specific media events related to an industry, comprising:
    a data acquisition unit, configured to obtain industry data from data sources;
    a data processing unit, configured to process the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships;
    a database construction unit, configured to build the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships;
    a database storage unit, configured to store the constructed industry knowledge graph database;
    a media event monitoring unit, configured to acquire Internet media data, perform event detection, event evaluation and screening based on the acquired Internet media data to obtain the specific media events related to the industry, and identify the directly related entities corresponding to the specific media events;
    a database access unit, configured to access the industry knowledge graph database based on the directly related entities, so as to determine the indirectly related entities corresponding to the specific media events;
    a message sending unit, configured to send an alert message to the directly related entities and/or the indirectly related entities.
  28. The system of claim 27, wherein:
    the data acquisition unit comprises: a structured data acquisition unit, configured to obtain structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields;
    the data processing unit comprises: a structured data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    the database construction unit comprises: a database generation unit, configured to generate the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
  29. The system of claim 27, wherein:
    the data acquisition unit comprises: an industry-related data acquisition unit, configured to use web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources;
    the data processing unit comprises: an industry-related data processing unit, configured to apply information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes and/or entity relationships;
    the database construction unit comprises: a database supplement/update unit, configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
  30. The system of claim 27, wherein:
    the data acquisition unit comprises: an industry-related data acquisition unit, configured to use an application programming interface (API) to obtain industry-related data from Internet data sources by query, the Internet data sources including open data sources;
    the data processing unit comprises: an industry-related data processing unit, configured to perform data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships;
    the database construction unit comprises: a database supplement/update unit, configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
  31. The system of claim 27, wherein:
    the data acquisition unit comprises: a media data acquisition unit, configured to use an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from Internet data sources;
    the data processing unit comprises: a media data processing unit, configured to perform event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry, and to identify the corresponding directly related entities from the Internet media data;
    the database construction unit comprises: a database supplement/update unit, configured to supplement the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein the specific media events are added to the industry knowledge graph database as abstract entities.
  32. The system of any one of claims 29-31, wherein the database supplement/update unit is further configured to: perform semantic disambiguation and entity linking on the extracted entities.
  33. The system of claim 27, wherein the media event monitoring unit is further configured to:
    classify the content of the acquired Internet media data by topic, so as to obtain content on a specific topic;
    identify the entities involved in the obtained content;
    perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis;
    perform event discovery based on the filtered content, so as to cluster media events and discover new media events.
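The four steps above (topic classification, entity identification, sentiment filtering, event discovery) can be sketched end to end in Python. Everything concrete here is an assumption made for illustration: the keyword lists, the entity lexicon, and the use of "same set of entities" as a stand-in for a real clustering algorithm.

```python
from collections import defaultdict

TOPIC_KEYWORDS = {"food safety": {"recall", "contamination"}}  # hypothetical topic lexicon
KNOWN_ENTITIES = {"AcmeDairy", "BetaFoods"}                     # hypothetical entity lexicon
NEGATIVE_WORDS = {"recall", "contamination", "lawsuit"}         # hypothetical sentiment lexicon

def tokens(text):
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split()}

def classify_topic(text):
    """Step 1: keyword-based topic classification; returns None if no topic matches."""
    words = tokens(text)
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return None

def find_entities(text):
    """Step 2: lexicon lookup standing in for entity recognition."""
    words = tokens(text)
    return {e for e in KNOWN_ENTITIES if e.lower() in words}

def is_negative(text):
    """Step 3: lexicon-based sentiment check used as the filter."""
    return bool(tokens(text) & NEGATIVE_WORDS)

def discover_events(posts, topic):
    """Step 4: group the surviving posts by the entities they mention."""
    clusters = defaultdict(list)
    for post in posts:
        if classify_topic(post) != topic:
            continue
        entities = find_entities(post)
        if entities and is_negative(post):
            clusters[frozenset(entities)].append(post)
    return dict(clusters)

posts = [
    "AcmeDairy announces recall of milk powder.",
    "Regulator confirms contamination at AcmeDairy plant.",
    "BetaFoods opens a new store.",
]
events = discover_events(posts, "food safety")
print(len(events[frozenset({"AcmeDairy"})]))  # 2
```

A production system would replace each lexicon with a trained model, but the order of the stages and the data passed between them would stay the same.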
  34. The system of claim 33, wherein the media event monitoring unit is further configured to:
    analyze the authenticity of the media events based on their attributes, and sort and/or filter the media events according to the analysis results.
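As a rough illustration of sorting and filtering events by an authenticity analysis, the Python sketch below scores each event from two attributes. The attributes (`source_count`, `has_official_source`), the weights and the threshold are hypothetical; this claim does not state which event attributes are used.

```python
from dataclasses import dataclass

@dataclass
class MediaEvent:
    title: str
    source_count: int          # distinct outlets reporting the event (assumed attribute)
    has_official_source: bool  # whether an official statement confirms it (assumed attribute)

def credibility(event):
    """Score in [0, 1]: up to 0.6 from independent sources, 0.4 from official confirmation."""
    score = min(event.source_count, 5) / 5 * 0.6
    if event.has_official_source:
        score += 0.4
    return score

def rank_events(events, threshold=0.5):
    """Drop events below the threshold and return the rest, most credible first."""
    kept = [e for e in events if credibility(e) >= threshold]
    return sorted(kept, key=credibility, reverse=True)

events = [
    MediaEvent("B", source_count=1, has_official_source=False),
    MediaEvent("C", source_count=3, has_official_source=True),
    MediaEvent("A", source_count=5, has_official_source=True),
]
print([e.title for e in rank_events(events)])  # ['A', 'C']
```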
  35. The system of claim 27, wherein the database access unit is further configured to:
    query the industry knowledge graph database based on the directly related entities, so as to determine the indirectly related entities.
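One plausible form of such a query is a bounded breadth-first traversal starting from the directly related entities. The Python sketch below assumes an adjacency-set graph and a hypothetical `max_hops` limit; the graph contents are invented sample data.

```python
from collections import deque

# Hypothetical adjacency sets: entity -> entities it has a relationship with
GRAPH = {
    "AcmeDairy": {"FarmCo", "ShopMart"},
    "FarmCo": {"AcmeDairy", "FeedSupplier"},
    "ShopMart": {"AcmeDairy"},
    "FeedSupplier": {"FarmCo"},
}

def indirectly_related(graph, direct_entities, max_hops=2):
    """Return {entity: hop distance} for entities reachable within max_hops of the direct ones."""
    direct = set(direct_entities)
    seen = set(direct)
    queue = deque((e, 0) for e in direct)
    found = {}
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                found[neighbour] = dist + 1
                queue.append((neighbour, dist + 1))
    return found

print(indirectly_related(GRAPH, ["AcmeDairy"]) ==
      {"FarmCo": 1, "ShopMart": 1, "FeedSupplier": 2})  # True
```

The hop distance gives a simple way to prioritise alerts: entities one relationship away from the event's direct participants would typically be notified before those further out.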
  36. The system of claim 27, wherein the database access unit is further configured to:
    apply data mining techniques to the industry knowledge graph database based on the directly related entities, so as to determine the indirectly related entities.
  37. The system of claim 27, wherein the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of industry significance.
PCT/CN2017/087000 2016-08-24 2017-06-02 Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database WO2018036239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610716109.4A CN107783973B (en) 2016-08-24 2016-08-24 Method, device and system for monitoring internet media event based on industry knowledge map database
CN201610716109.4 2016-08-24

Publications (1)

Publication Number Publication Date
WO2018036239A1 true WO2018036239A1 (en) 2018-03-01

Family

ID=61246067

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/087000 WO2018036239A1 (en) 2016-08-24 2017-06-02 Method, apparatus and system for monitoring internet media events based on industry knowledge mapping database

Country Status (3)

Country Link
CN (1) CN107783973B (en)
TW (1) TWI664539B (en)
WO (1) WO2018036239A1 (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670048A (en) * 2018-11-19 2019-04-23 平安科技(深圳)有限公司 Map construction method, apparatus and computer equipment based on air control management
CN109684313A (en) * 2018-12-14 2019-04-26 浪潮软件集团有限公司 A kind of data cleansing processing method and system
CN109828965A (en) * 2019-01-09 2019-05-31 北京小乘网络科技有限公司 A kind of method and electronic equipment of data processing
CN109947952A (en) * 2019-03-20 2019-06-28 武汉市软迅科技有限公司 Search method, device, equipment and storage medium based on english knowledge map
CN109977291A (en) * 2019-03-20 2019-07-05 武汉市软迅科技有限公司 Search method, device, equipment and storage medium based on physical knowledge map
CN110489565A (en) * 2019-08-15 2019-11-22 广州拓尔思大数据有限公司 Based on the object root type design method and system in domain knowledge map ontology
CN110781249A (en) * 2019-10-16 2020-02-11 华电国际电力股份有限公司技术服务分公司 Knowledge graph-based multi-source data fusion method and device for thermal power plant
CN110781311A (en) * 2019-09-18 2020-02-11 上海生腾数据科技有限公司 Enterprise consistent action calculation system and method
CN110866123A (en) * 2019-11-06 2020-03-06 浪潮软件集团有限公司 Method for constructing data map based on data model and system for constructing data map
CN110895568A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN110928963A (en) * 2019-11-28 2020-03-27 西安理工大学 Column-level authority knowledge graph construction method for operation and maintenance service data table
CN111061883A (en) * 2019-10-25 2020-04-24 珠海格力电器股份有限公司 Method, device and equipment for updating knowledge graph and storage medium
CN111090683A (en) * 2019-11-29 2020-05-01 上海勘察设计研究院(集团)有限公司 Engineering field knowledge graph construction method and generation device thereof
CN111159411A (en) * 2019-12-31 2020-05-15 哈尔滨工业大学(深圳) Knowledge graph fused text position analysis method, system and storage medium
CN111177284A (en) * 2019-12-31 2020-05-19 清华大学 Emergency plan model generation method, device and equipment
CN111291191A (en) * 2018-12-07 2020-06-16 国家新闻出版广电总局广播科学研究院 Radio and television knowledge graph construction method and device
CN111309827A (en) * 2020-03-23 2020-06-19 平安医疗健康管理股份有限公司 Knowledge graph construction method and device, computer system and readable storage medium
CN111325355A (en) * 2020-03-19 2020-06-23 中国建设银行股份有限公司 Method and device for determining actual control persons of enterprises, computer equipment and medium
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN111339310A (en) * 2019-11-28 2020-06-26 哈尔滨工业大学(深圳) Social media-oriented online dispute generation method, system and storage medium
CN111339214A (en) * 2020-02-18 2020-06-26 北京航空航天大学 Automatic knowledge base construction method and system
CN111368145A (en) * 2018-12-26 2020-07-03 沈阳新松机器人自动化股份有限公司 Knowledge graph creating method and system and terminal equipment
CN111382277A (en) * 2018-12-28 2020-07-07 上海汽车集团股份有限公司 Knowledge graph construction method and device for automobile field
CN111538842A (en) * 2019-11-15 2020-08-14 国家电网有限公司 Intelligent sensing and predicting method and device for network space situation and computer equipment
CN111582488A (en) * 2020-04-23 2020-08-25 傲林科技有限公司 Event deduction method and device
CN111897947A (en) * 2020-07-30 2020-11-06 杭州橙鹰数据技术有限公司 Data analysis processing method and device based on open source information
CN111897914A (en) * 2020-07-20 2020-11-06 杭州叙简科技股份有限公司 Entity information extraction and knowledge graph construction method for field of comprehensive pipe gallery
CN111914096A (en) * 2020-07-06 2020-11-10 同济大学 Public transport passenger satisfaction evaluation method and system based on public opinion knowledge graph
CN111930956A (en) * 2020-06-17 2020-11-13 西安交通大学 Integrated system for recommending and stream-driving multiple innovation methods by adopting knowledge graph
CN111966836A (en) * 2020-08-29 2020-11-20 深圳呗佬智能有限公司 Knowledge graph vector representation method and device, computer equipment and storage medium
CN111967761A (en) * 2020-08-14 2020-11-20 国网电子商务有限公司 Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN111984931A (en) * 2020-08-20 2020-11-24 上海大学 Public opinion calculation and deduction method and system for social event web text
CN112015908A (en) * 2020-08-19 2020-12-01 新华智云科技有限公司 Knowledge graph construction method and system, and query method and system
CN112035672A (en) * 2020-07-23 2020-12-04 深圳技术大学 Knowledge graph complementing method, device, equipment and storage medium
CN112073415A (en) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 Method and device for constructing network security knowledge graph
CN112100324A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Knowledge graph automatic check iteration method based on greedy entity link
CN112182235A (en) * 2020-08-29 2021-01-05 深圳呗佬智能有限公司 Method and device for constructing knowledge graph, computer equipment and storage medium
CN112269885A (en) * 2020-11-16 2021-01-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN112328876A (en) * 2020-11-03 2021-02-05 平安科技(深圳)有限公司 Electronic card generation and pushing method and device based on knowledge graph
CN112633889A (en) * 2020-11-12 2021-04-09 中科金审(北京)科技有限公司 Enterprise gene sequencing system and method
CN112685405A (en) * 2020-12-21 2021-04-20 福建新大陆软件工程有限公司 Data management method, system, equipment and medium based on knowledge graph
CN112711705A (en) * 2020-11-30 2021-04-27 泰康保险集团股份有限公司 Public opinion data processing method, equipment and storage medium
CN112765368A (en) * 2021-01-29 2021-05-07 北京索为系统技术股份有限公司 Knowledge graph spectrum establishing method, device, equipment and medium based on industrial APP
CN113010696A (en) * 2021-04-21 2021-06-22 上海勘察设计研究院(集团)有限公司 Engineering field knowledge graph construction method based on metadata model
CN113094516A (en) * 2021-04-27 2021-07-09 东南大学 Multi-source data fusion-based power grid monitoring field knowledge graph construction method
CN113140134A (en) * 2021-03-12 2021-07-20 北京航空航天大学 System architecture of intelligent air traffic control system
CN113204636A (en) * 2021-01-08 2021-08-03 北京欧拉认知智能科技有限公司 Knowledge graph-based user dynamic personalized image drawing method
CN113326381A (en) * 2020-02-28 2021-08-31 拓尔思天行网安信息技术有限责任公司 Semantic and knowledge graph analysis method, platform and equipment based on dynamic ontology
CN113342987A (en) * 2021-04-21 2021-09-03 国网浙江省电力有限公司杭州供电公司 Composite network construction method of special corpus for power distribution DTU acceptance
CN113610626A (en) * 2021-07-26 2021-11-05 建信金融科技有限责任公司 Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN113627535A (en) * 2021-08-12 2021-11-09 福建中信网安信息科技有限公司 Data grading classification system and method based on data security and privacy protection
CN113656590A (en) * 2021-07-16 2021-11-16 北京百度网讯科技有限公司 Industry map construction method and device, electronic equipment and storage medium
CN113706002A (en) * 2021-08-20 2021-11-26 华中农业大学 Food safety knowledge base-based supervision platform, method and storage medium
CN113761971A (en) * 2020-06-02 2021-12-07 中国人民解放军战略支援部队信息工程大学 Method and device for constructing target knowledge graph of remote sensing image
CN113836293A (en) * 2021-09-23 2021-12-24 平安国际智慧城市科技股份有限公司 Data processing method, device and equipment based on knowledge graph and storage medium
CN114090771A (en) * 2021-10-19 2022-02-25 广州数说故事信息科技有限公司 Big data based propagation proposition and consumer story analysis method and system
CN115907144A (en) * 2022-11-21 2023-04-04 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Event prediction method and device, terminal equipment and storage medium

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596439A (en) * 2018-03-29 2018-09-28 北京中兴通网络科技股份有限公司 A kind of the business risk prediction technique and system of knowledge based collection of illustrative plates
CN108763333B (en) * 2018-05-11 2022-05-17 北京航空航天大学 Social media-based event map construction method
CN108829858B (en) * 2018-06-22 2021-09-17 京东数字科技控股有限公司 Data query method and device and computer readable storage medium
CN109086316B (en) * 2018-06-27 2021-09-14 南京邮电大学 Knowledge graph autonomous construction system for industrial Internet of things resources
CN108549731A (en) * 2018-07-11 2018-09-18 中国电子科技集团公司第二十八研究所 A kind of knowledge mapping construction method based on ontology model
CN108932340A (en) * 2018-07-13 2018-12-04 华融融通(北京)科技有限公司 The construction method of financial knowledge mapping under a kind of non-performing asset operation field
CN109614495B (en) * 2018-08-08 2023-11-28 深圳市宏骏大数据服务有限公司 Related company mining method combining knowledge graph and text information
CN108959270B (en) * 2018-08-10 2022-08-19 新华智云科技有限公司 Entity linking method based on deep learning
CN109242548A (en) * 2018-08-20 2019-01-18 北京众标智能科技有限公司 A kind of sales lead recognition methods of knowledge based map and device
CN109255037B (en) * 2018-08-31 2022-03-08 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN109255035B (en) * 2018-08-31 2024-03-26 北京字节跳动网络技术有限公司 Method and device for constructing knowledge graph
CN109299362B (en) * 2018-09-21 2023-04-14 平安科技(深圳)有限公司 Similar enterprise recommendation method and device, computer equipment and storage medium
CN109597894B (en) * 2018-09-30 2023-10-03 创新先进技术有限公司 Correlation model generation method and device, and data correlation method and device
CN109522396B (en) * 2018-10-22 2020-12-25 中国船舶工业综合技术经济研究院 Knowledge processing method and system for national defense science and technology field
CN109376202B (en) * 2018-10-30 2021-08-03 青岛理工大学 NLP-based enterprise supply relationship automatic extraction and analysis method
CN109508383A (en) * 2018-10-30 2019-03-22 北京国双科技有限公司 The construction method and device of knowledge mapping
CN109597855A (en) * 2018-11-29 2019-04-09 北京邮电大学 Domain knowledge map construction method and system based on big data driving
CN109308323A (en) * 2018-12-07 2019-02-05 中国科学院长春光学精密机械与物理研究所 A kind of construction method, device and the equipment of causality knowledge base
CN109635298B (en) * 2018-12-11 2022-12-30 平安科技(深圳)有限公司 Group state identification method and device, computer equipment and storage medium
CN109669994B (en) * 2018-12-21 2023-03-14 吉林大学 Construction method and system of health knowledge map
CN109726819B (en) * 2018-12-29 2021-09-14 东软集团股份有限公司 Method and device for realizing event reasoning
CN109783484A (en) * 2018-12-29 2019-05-21 北京航天云路有限公司 The construction method and system of the data service platform of knowledge based map
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment
CN109918452A (en) * 2019-02-14 2019-06-21 北京明略软件系统有限公司 A kind of method, apparatus of data processing, computer storage medium and terminal
CN109979592A (en) * 2019-03-25 2019-07-05 广东邮电职业技术学院 Mental health method for early warning, user terminal, server and system
CN110175239A (en) * 2019-04-23 2019-08-27 成都数联铭品科技有限公司 A kind of construction method and system of knowledge mapping
CN111984737A (en) * 2019-05-23 2020-11-24 楼荣平 Intelligent main body and transaction capability construction system
CN110347811A (en) * 2019-06-11 2019-10-18 福建奇点时空数字科技有限公司 A kind of professional knowledge question and answer robot system based on artificial intelligence
CN110309234B (en) * 2019-06-14 2023-06-09 广发证券股份有限公司 Knowledge graph-based customer warehouse-holding early warning method and device and storage medium
CN110245241A (en) * 2019-06-18 2019-09-17 卓尔智联(武汉)研究院有限公司 Plastics knowledge mapping construction device, method and computer readable storage medium
CN110287338B (en) * 2019-06-21 2022-04-29 北京百度网讯科技有限公司 Industry hotspot determination method, device, equipment and medium
CN110334220A (en) * 2019-07-15 2019-10-15 中国人民解放军战略支援部队航天工程大学 A kind of knowledge mapping construction method based on multi-data source
CN110413784A (en) * 2019-07-23 2019-11-05 国家计算机网络与信息安全管理中心 The public sentiment association analysis method and system of knowledge based map
CN110363449B (en) * 2019-07-25 2022-04-15 中国工商银行股份有限公司 Risk identification method, device and system
CN112561457A (en) * 2019-09-26 2021-03-26 鸿富锦精密电子(天津)有限公司 Talent recruitment method based on face recognition, terminal server and storage medium
CN110837566B (en) * 2019-11-15 2022-05-13 北京邮电大学 Dynamic construction method of knowledge graph for CNC (computerized numerical control) machine tool fault diagnosis
CN110866126A (en) * 2019-11-22 2020-03-06 福建工程学院 College online public opinion risk assessment method
CN111046189A (en) * 2019-11-27 2020-04-21 广东电网有限责任公司 Modeling method of power distribution network knowledge graph model
CN110990748B (en) * 2019-12-18 2023-06-27 成都迪普曼林信息技术有限公司 Public opinion data collection and release system
CN111221978A (en) * 2019-12-31 2020-06-02 北京明略软件系统有限公司 Method and device for constructing knowledge graph, computer storage medium and terminal
CN111191046A (en) * 2019-12-31 2020-05-22 北京明略软件系统有限公司 Method, device, computer storage medium and terminal for realizing information search
TWI767192B (en) * 2020-02-26 2022-06-11 傑睿資訊服務股份有限公司 Application method of intelligent analysis system
CN111475612A (en) * 2020-03-02 2020-07-31 深圳壹账通智能科技有限公司 Construction method, device and equipment of early warning event map and storage medium
CN111737488B (en) * 2020-06-12 2021-02-02 南京中孚信息技术有限公司 Information tracing method and device based on domain entity extraction and correlation analysis
CN111899089A (en) * 2020-07-01 2020-11-06 苏宁金融科技(南京)有限公司 Enterprise risk early warning method and system based on knowledge graph
CN112131392A (en) * 2020-08-01 2020-12-25 赛飞特工程技术集团有限公司 Public health epidemic situation early warning method and system based on knowledge graph
CN111768869B (en) * 2020-09-03 2020-12-11 成都索贝数码科技股份有限公司 Medical guide mapping construction search system and method for intelligent question-answering system
CN112100156B (en) * 2020-09-15 2024-02-20 北京百度网讯科技有限公司 Method, device, medium and system for constructing knowledge base based on user behaviors
US11615150B2 (en) * 2020-10-09 2023-03-28 Cherre, Inc. Neighborhood-based entity disambiguation system and method
CN112417456B (en) * 2020-11-16 2022-02-08 中国电子科技集团公司第三十研究所 Structured sensitive data reduction detection method based on big data
CN112507691A (en) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Interpretable financial subject matter generating method and device fusing emotion, industrial chain and case logic
CN112487208B (en) * 2020-12-14 2023-06-30 杭州安恒信息技术股份有限公司 Network security data association analysis method, device, equipment and storage medium
CN113282703B (en) * 2021-04-01 2022-05-06 中科雨辰科技有限公司 Method and device for constructing event associated map of news data
CN113468340B (en) * 2021-06-28 2024-05-07 北京众标智能科技有限公司 Construction system and construction method of industrial knowledge graph
CN113868508B (en) * 2021-09-23 2022-09-27 北京百度网讯科技有限公司 Writing material query method and device, electronic equipment and storage medium
CN114417012A (en) * 2022-01-20 2022-04-29 上海弘玑信息技术有限公司 Method for generating knowledge graph and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN102831220A (en) * 2012-08-23 2012-12-19 江苏物联网研究发展中心 Subject-oriented customized news information extraction system
CN104573016A (en) * 2015-01-12 2015-04-29 武汉泰迪智慧科技有限公司 System and method for analyzing vertical public opinions based on industry
CN105183869A (en) * 2015-09-16 2015-12-23 分众(中国)信息技术有限公司 Building knowledge mapping database and construction method thereof

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201118619A (en) * 2009-11-30 2011-06-01 Inst Information Industry An opinion term mining method and apparatus thereof
US10075764B2 (en) * 2012-07-09 2018-09-11 Eturi Corp. Data mining system for agreement compliance controlled information throttle
CN103136352B (en) * 2013-02-27 2016-02-03 华中师范大学 Text retrieval system based on double-deck semantic analysis
US10073840B2 (en) * 2013-12-20 2018-09-11 Microsoft Technology Licensing, Llc Unsupervised relation detection model training
CN103955505B (en) * 2014-04-24 2017-09-26 中国科学院信息工程研究所 A kind of event method of real-time and system based on microblogging
CN104091054B (en) * 2014-06-26 2017-12-05 中国科学院自动化研究所 Towards the Mass disturbance method for early warning and system of short text
CN105468605B (en) * 2014-08-25 2019-04-12 济南中林信息科技有限公司 Entity information map generation method and device
CN105550190B (en) * 2015-06-26 2019-03-29 许昌学院 Cross-media retrieval system towards knowledge mapping
CN105630901A (en) * 2015-12-21 2016-06-01 清华大学 Knowledge graph representation learning method

Cited By (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895568B (en) * 2018-09-13 2023-07-21 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN110895568A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Method and system for processing court trial records
CN109670048A (en) * 2018-11-19 2019-04-23 平安科技(深圳)有限公司 Map construction method, apparatus and computer equipment based on air control management
CN109670048B (en) * 2018-11-19 2023-06-23 平安科技(深圳)有限公司 Atlas construction method and apparatus based on wind control management and computer device
CN111291191B (en) * 2018-12-07 2024-05-03 国家新闻出版广电总局广播科学研究院 Broadcast television knowledge graph construction method and device
CN111291191A (en) * 2018-12-07 2020-06-16 国家新闻出版广电总局广播科学研究院 Radio and television knowledge graph construction method and device
CN109684313A (en) * 2018-12-14 2019-04-26 浪潮软件集团有限公司 A kind of data cleansing processing method and system
CN111368145A (en) * 2018-12-26 2020-07-03 沈阳新松机器人自动化股份有限公司 Knowledge graph creating method and system and terminal equipment
CN111382277A (en) * 2018-12-28 2020-07-07 上海汽车集团股份有限公司 Knowledge graph construction method and device for automobile field
CN111382277B (en) * 2018-12-28 2023-08-01 上海汽车集团股份有限公司 Knowledge graph construction method and device for automobile field
CN109828965A (en) * 2019-01-09 2019-05-31 北京小乘网络科技有限公司 A kind of method and electronic equipment of data processing
CN109828965B (en) * 2019-01-09 2021-06-15 千城数智(北京)网络科技有限公司 Data processing method and electronic equipment
CN109947952B (en) * 2019-03-20 2021-03-02 武汉市软迅科技有限公司 Retrieval method, device, equipment and storage medium based on English knowledge graph
CN109977291A (en) * 2019-03-20 2019-07-05 武汉市软迅科技有限公司 Search method, device, equipment and storage medium based on physical knowledge map
CN109947952A (en) * 2019-03-20 2019-06-28 武汉市软迅科技有限公司 Search method, device, equipment and storage medium based on english knowledge map
CN110489565B (en) * 2019-08-15 2023-05-16 广州拓尔思大数据有限公司 Method and system for designing object root type in domain knowledge graph body
CN110489565A (en) * 2019-08-15 2019-11-22 广州拓尔思大数据有限公司 Based on the object root type design method and system in domain knowledge map ontology
CN110781311B (en) * 2019-09-18 2024-02-27 上海合合信息科技股份有限公司 Enterprise-consistent-person operation system and method
CN110781311A (en) * 2019-09-18 2020-02-11 上海生腾数据科技有限公司 Enterprise consistent action calculation system and method
CN110781249A (en) * 2019-10-16 2020-02-11 华电国际电力股份有限公司技术服务分公司 Knowledge graph-based multi-source data fusion method and device for thermal power plant
CN111061883A (en) * 2019-10-25 2020-04-24 珠海格力电器股份有限公司 Method, device and equipment for updating knowledge graph and storage medium
CN111061883B (en) * 2019-10-25 2023-12-08 珠海格力电器股份有限公司 Method, device, equipment and storage medium for updating knowledge graph
CN110866123A (en) * 2019-11-06 2020-03-06 浪潮软件集团有限公司 Method for constructing data map based on data model and system for constructing data map
CN110866123B (en) * 2019-11-06 2023-10-27 浪潮软件集团有限公司 Method for constructing data map based on data model and system for constructing data map
CN111538842A (en) * 2019-11-15 2020-08-14 国家电网有限公司 Intelligent sensing and predicting method and device for network space situation and computer equipment
CN111538842B (en) * 2019-11-15 2023-10-03 国家电网有限公司 Intelligent sensing and predicting method and device for network space situation and computer equipment
CN110928963A (en) * 2019-11-28 2020-03-27 西安理工大学 Column-level authority knowledge graph construction method for operation and maintenance service data table
CN110928963B (en) * 2019-11-28 2023-10-24 西安理工大学 Column-level authority knowledge graph construction method for operation and maintenance service data table
CN111339310B (en) * 2019-11-28 2023-05-16 哈尔滨工业大学(深圳) Social media-oriented online dispute generation method, system and storage medium
CN111339310A (en) * 2019-11-28 2020-06-26 哈尔滨工业大学(深圳) Social media-oriented online dispute generation method, system and storage medium
CN111090683A (en) * 2019-11-29 2020-05-01 上海勘察设计研究院(集团)有限公司 Engineering field knowledge graph construction method and generation device thereof
CN111090683B (en) * 2019-11-29 2023-12-22 上海勘察设计研究院(集团)股份有限公司 Knowledge graph construction method and generation device thereof in engineering field
CN111339311A (en) * 2019-12-30 2020-06-26 智慧神州(北京)科技有限公司 Method, device and processor for extracting structured events based on generative network
CN111177284A (en) * 2019-12-31 2020-05-19 清华大学 Emergency plan model generation method, device and equipment
CN111159411A (en) * 2019-12-31 2020-05-15 哈尔滨工业大学(深圳) Knowledge graph fused text position analysis method, system and storage medium
CN111159411B (en) * 2019-12-31 2023-04-14 哈尔滨工业大学(深圳) Knowledge graph fused text position analysis method, system and storage medium
CN111339214A (en) * 2020-02-18 2020-06-26 北京航空航天大学 Automatic knowledge base construction method and system
CN111339214B (en) * 2020-02-18 2023-09-15 北京航空航天大学 Automatic knowledge base construction method and system
CN113326381A (en) * 2020-02-28 2021-08-31 拓尔思天行网安信息技术有限责任公司 Semantic and knowledge graph analysis method, platform and equipment based on dynamic ontology
CN111325355A (en) * 2020-03-19 2020-06-23 中国建设银行股份有限公司 Method and device for determining actual control persons of enterprises, computer equipment and medium
CN111325355B (en) * 2020-03-19 2023-12-19 中国建设银行股份有限公司 Method and device for determining actual control person of enterprise, computer equipment and medium
CN111309827A (en) * 2020-03-23 2020-06-19 平安医疗健康管理股份有限公司 Knowledge graph construction method and device, computer system and readable storage medium
CN111582488A (en) * 2020-04-23 2020-08-25 傲林科技有限公司 Event deduction method and device
CN113761971B (en) * 2020-06-02 2023-06-20 中国人民解放军战略支援部队信息工程大学 Remote sensing image target knowledge graph construction method and device
CN113761971A (en) * 2020-06-02 2021-12-07 中国人民解放军战略支援部队信息工程大学 Method and device for constructing target knowledge graph of remote sensing image
CN111930956B (en) * 2020-06-17 2023-05-30 西安交通大学 Multi-innovation method recommendation and flow driving integrated system adopting knowledge graph
CN111930956A (en) * 2020-06-17 2020-11-13 西安交通大学 Integrated system for recommending and stream-driving multiple innovation methods by adopting knowledge graph
CN111914096B (en) * 2020-07-06 2024-02-02 同济大学 Public opinion knowledge graph-based public transportation passenger satisfaction evaluation method and system
CN111914096A (en) * 2020-07-06 2020-11-10 同济大学 Public transport passenger satisfaction evaluation method and system based on public opinion knowledge graph
CN111897914B (en) * 2020-07-20 2023-09-19 杭州叙简科技股份有限公司 Entity information extraction and knowledge graph construction method for comprehensive pipe rack field
CN111897914A (en) * 2020-07-20 2020-11-06 杭州叙简科技股份有限公司 Entity information extraction and knowledge graph construction method for field of comprehensive pipe gallery
CN112035672A (en) * 2020-07-23 2020-12-04 深圳技术大学 Knowledge graph complementing method, device, equipment and storage medium
CN112035672B (en) * 2020-07-23 2023-05-09 深圳技术大学 Knowledge graph completion method, device, equipment and storage medium
CN111897947A (en) * 2020-07-30 2020-11-06 杭州橙鹰数据技术有限公司 Data analysis processing method and device based on open source information
CN111967761B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN111967761A (en) * 2020-08-14 2020-11-20 国网电子商务有限公司 Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN112015908A (en) * 2020-08-19 2020-12-01 新华智云科技有限公司 Knowledge graph construction method and system, and query method and system
CN111984931B (en) * 2020-08-20 2022-06-03 上海大学 Public opinion calculation and deduction method and system for social event web text
CN111984931A (en) * 2020-08-20 2020-11-24 上海大学 Public opinion calculation and deduction method and system for social event web text
CN112100324B (en) * 2020-08-28 2023-05-05 广州探迹科技有限公司 Knowledge graph expansion method and device, storage medium and computing equipment
CN112100324A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Knowledge graph automatic check iteration method based on greedy entity link
CN111966836A (en) * 2020-08-29 2020-11-20 深圳呗佬智能有限公司 Knowledge graph vector representation method and device, computer equipment and storage medium
CN112182235A (en) * 2020-08-29 2021-01-05 深圳呗佬智能有限公司 Method and device for constructing knowledge graph, computer equipment and storage medium
CN112073415A (en) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 Method and device for constructing network security knowledge graph
CN112328876A (en) * 2020-11-03 2021-02-05 平安科技(深圳)有限公司 Electronic card generation and pushing method and device based on knowledge graph
CN112633889A (en) * 2020-11-12 2021-04-09 中科金审(北京)科技有限公司 Enterprise gene sequencing system and method
CN112269885B (en) * 2020-11-16 2024-05-10 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN112269885A (en) * 2020-11-16 2021-01-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing data
CN112711705A (en) * 2020-11-30 2021-04-27 泰康保险集团股份有限公司 Public opinion data processing method, equipment and storage medium
CN112711705B (en) * 2020-11-30 2023-05-09 泰康保险集团股份有限公司 Public opinion data processing method, equipment and storage medium
CN112685405A (en) * 2020-12-21 2021-04-20 福建新大陆软件工程有限公司 Data management method, system, equipment and medium based on knowledge graph
CN113204636A (en) * 2021-01-08 2021-08-03 北京欧拉认知智能科技有限公司 Knowledge graph-based user dynamic personalized image drawing method
CN113204636B (en) * 2021-01-08 2023-12-05 北京欧拉认知智能科技有限公司 Knowledge graph-based user dynamic personalized image drawing method
CN112765368A (en) * 2021-01-29 2021-05-07 北京索为系统技术股份有限公司 Knowledge graph establishing method, device, equipment and medium based on industrial APP
CN112765368B (en) * 2021-01-29 2023-08-22 索为技术股份有限公司 Knowledge graph establishment method, device, equipment and medium based on industrial APP
CN113140134A (en) * 2021-03-12 2021-07-20 北京航空航天大学 System architecture of intelligent air traffic control system
CN113342987A (en) * 2021-04-21 2021-09-03 国网浙江省电力有限公司杭州供电公司 Composite network construction method of special corpus for power distribution DTU acceptance
CN113342987B (en) * 2021-04-21 2024-05-14 国网浙江省电力有限公司杭州供电公司 Composite network construction method of distribution DTU acceptance special corpus
CN113010696A (en) * 2021-04-21 2021-06-22 上海勘察设计研究院(集团)有限公司 Engineering field knowledge graph construction method based on metadata model
CN113094516A (en) * 2021-04-27 2021-07-09 东南大学 Multi-source data fusion-based power grid monitoring field knowledge graph construction method
CN113656590B (en) * 2021-07-16 2023-12-15 北京百度网讯科技有限公司 Industry map construction method and device, electronic equipment and storage medium
CN113656590A (en) * 2021-07-16 2021-11-16 北京百度网讯科技有限公司 Industry map construction method and device, electronic equipment and storage medium
CN113610626A (en) * 2021-07-26 2021-11-05 建信金融科技有限责任公司 Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN113627535A (en) * 2021-08-12 2021-11-09 福建中信网安信息科技有限公司 Data grading classification system and method based on data security and privacy protection
CN113706002A (en) * 2021-08-20 2021-11-26 华中农业大学 Food safety knowledge base-based supervision platform, method and storage medium
CN113836293B (en) * 2021-09-23 2024-04-16 平安国际智慧城市科技股份有限公司 Knowledge graph-based data processing method, device, equipment and storage medium
CN113836293A (en) * 2021-09-23 2021-12-24 平安国际智慧城市科技股份有限公司 Data processing method, device and equipment based on knowledge graph and storage medium
CN114090771A (en) * 2021-10-19 2022-02-25 广州数说故事信息科技有限公司 Big data based propagation proposition and consumer story analysis method and system
CN115907144A (en) * 2022-11-21 2023-04-04 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Event prediction method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
TWI664539B (en) 2019-07-01
CN107783973A (en) 2018-03-09
TW201807602A (en) 2018-03-01
CN107783973B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
TWI664539B (en) System, apparatus and method for monitoring internet media events based on a constructed industry knowledge graph database
US11599714B2 (en) Methods and systems for modeling complex taxonomies with natural language understanding
Guellil et al. Social big data mining: A survey focused on opinion mining and sentiments analysis
US20160196491A1 (en) Method For Recommending Content To Ingest As Corpora Based On Interaction History In Natural Language Question And Answering Systems
CN102779114B Unstructured data support using automatic rule generation
WO2019196226A1 (en) System information querying method and apparatus, computer device, and storage medium
CN107918644B (en) News topic analysis method and implementation system in reputation management framework
US11188819B2 (en) Entity model establishment
US20220083949A1 (en) Method and apparatus for pushing information, device and storage medium
WO2019171328A1 (en) Flexible and scalable artificial intelligence and analytics platform with advanced content analytics and data ingestion
KR20150096295A (en) System and method for building Q&A database, and search system and method using the same
CN111967761A (en) Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
AU2021105938A4 (en) Automatic and dynamic contextual analysis of sentiment of social content and feedback reviews based on machine learning model
Aliprandi et al. CAPER: Collaborative information, acquisition, processing, exploitation and reporting for the prevention of organised crime
Aliprandi et al. CAPER: Crawling and analysing Facebook for intelligence purposes
Javed et al. Automating corpora generation with semantic cleaning and tagging of tweets for multi-dimensional social media analytics
Liu et al. Research on relation extraction of named entity on social media in smart cities
Al-Abri et al. A scheme for extracting information from collaborative social interaction tools for personalized educational environments
SCALIA Network-based content geolocation on social media for emergency management
Yin et al. Research of integrated algorithm establishment of a spam detection system
Qureshi et al. Detecting social polarization and radicalization
KR102180329B1 (en) System for determining fake news
Cherichi et al. Using big data values to enhance social event detection pattern
Javed et al. Framework for participative and collaborative governance using social media mining techniques

Legal Events

Date Code Title Description
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (PCT application filed from 20040101)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17842664; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17842664; Country of ref document: EP; Kind code of ref document: A1)