TWI664539B

TWI664539B - System, apparatus and method for monitoring internet media events based on a constructed industry knowledge graph database

Info

Publication number: TWI664539B
Application number: TW106127958A
Authority: TW
Inventors: 超何; 穎琪梁; 慧詩車
Original assignee: 慧科訊業有限公司
Priority date: 2016-08-24
Filing date: 2017-08-17
Publication date: 2019-07-01
Also published as: TW201807602A; WO2018036239A1; CN107783973A; CN107783973B

Abstract

The invention provides a method for constructing an industry knowledge graph database, comprising the following steps: obtaining industry data from a data source; processing the industry data to extract entities related to the industry together with the corresponding entity attributes and/or entity relationships; and constructing the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships. The invention also provides a method for monitoring specific industry-related media events based on the constructed industry knowledge graph database, comprising the following steps: obtaining Internet media data; performing event detection, event evaluation and screening on the obtained Internet media data to obtain the specific industry-related media events; identifying the directly related entities corresponding to a specific media event; accessing the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media event; and sending an early-warning message to the directly related entities and/or the indirectly related entities.

Description

Method, device and system for monitoring Internet media events based on an industry knowledge graph database

The present invention relates to the field of Internet media monitoring, and in particular to a technique for constructing an industry knowledge graph database and a technique for monitoring Internet media events based on the constructed industry knowledge graph database.

The rapid development of computer, communication and network technologies has steadily improved the performance of terminal devices, including PCs, tablets, smartphones and Internet TVs. Accordingly, Internet media, and Internet social media in particular, have gradually become one of the main channels through which the public obtains news, owing to their diversity, speed, interactivity, ease of replication and multimedia nature.

However, while Internet media information is timely and can be accessed flexibly and conveniently, the openness of its information sources and dissemination channels also causes the following problem: without authorization or verification, sensitive information (for example, trade secrets) and even false information can be spread rapidly by large numbers of users on Internet media platforms, and thereby evolve into media events that adversely affect the individuals, enterprises or institutions, industries and even society concerned. It is therefore necessary to monitor media events in Internet media and, once a media event meeting certain conditions is detected, to take appropriate measures to reduce or eliminate its potential impact.

Existing Internet media monitoring technologies have the following drawbacks: 1) they provide monitoring by interest matching, which requires users to define the content topics and related entities of interest themselves; monitoring can therefore identify only events directly related to entities the user has defined, and cannot identify events that the user has not defined but that are indirectly related to the entities the user is interested in; 2) the monitored objects are limited in scope, so that monitoring is provided only for a single media category and data source (for example, a specific social media platform, news outlet, forum or blog), a single data type (generally text) and a single language.

One object of the present invention is to provide a technique for constructing an industry knowledge graph database, in which data relevant to a specific industry or field is extracted and stored in the knowledge graph database. The constructed industry knowledge graph database can be applied to Internet media monitoring to achieve automated, in-depth monitoring of relevant Internet media events.

Another object of the present invention is to provide a technique for monitoring Internet media events based on the constructed industry knowledge graph database, which can identify the indirectly related entities corresponding to a specific media event during monitoring and can monitor multiple types of Internet media data.

To achieve the above objects, the present invention provides the following technical solutions.

The present invention provides a method for constructing an industry knowledge graph database, comprising the following steps: obtaining industry data from a data source; processing the industry data to extract entities related to the industry together with the corresponding entity attributes and/or entity relationships; and constructing the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the step of obtaining industry data is implemented by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the step of processing the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data; and the step of constructing the industry knowledge graph database is implemented by generating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.
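The cleaning and ETL step for structured records can be illustrated with a minimal Python sketch. The field names (`company`, `industry`, `chairman`), the relation label `chairman_of` and the sample rows are assumptions made purely for illustration; the patent does not specify the fields of the third-party database or the cleaning rules.

```python
def clean(rows):
    """Data cleaning: trim whitespace and drop rows that lack the key field."""
    cleaned = []
    for row in rows:
        row = {k: v.strip() for k, v in row.items() if isinstance(v, str)}
        if row.get("company"):
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Map each cleaned row to entities, entity attributes and entity relationships."""
    entities = {}      # entity name -> attribute dict
    relations = set()  # (subject, relation label, object)
    for row in rows:
        company = row["company"]
        entities.setdefault(company, {"type": "company"})
        if row.get("industry"):
            entities[company]["industry"] = row["industry"]
        if row.get("chairman"):
            entities.setdefault(row["chairman"], {"type": "person"})
            relations.add((row["chairman"], "chairman_of", company))
    return entities, relations


# Hypothetical structured records with several fields each.
rows = [
    {"company": " Company A ", "industry": "Banking", "chairman": "Person X"},
    {"company": "", "industry": "Retail", "chairman": "Person Y"},
    {"company": "Company A", "industry": "Banking", "chairman": "Person X"},
]
entities, relations = transform(clean(rows))
```

In this sketch the duplicate and incomplete rows collapse into one company entity, one person entity and one relationship, which is the form the graph database is then generated from.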

Preferably, the step of obtaining industry data is implemented by using web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the step of processing the industry data is implemented by using information extraction techniques from natural language processing to perform entity recognition and relationship extraction on the industry-related data, so as to extract the entities, entity attributes and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.

Preferably, the step of obtaining industry data is implemented by using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the step of processing the industry data is implemented by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the step of constructing the industry knowledge graph database is implemented by supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships. Further preferably, the above steps are performed periodically at a predetermined interval.

Preferably, the step of obtaining industry data is implemented by using an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from Internet data sources; the step of processing the industry data is implemented by performing event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data; and the step of constructing the industry knowledge graph database is implemented by supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity. Further preferably, in the step of processing the industry data, the directly related entities corresponding to the specific media event are identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition. Further preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events or other events of significance to the industry. Further preferably, the above steps are performed continuously in real time.
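Adding a media event to the graph as an abstract entity can be sketched as follows. This is only an illustration under assumed names: the dictionary-based graph, the `event:N` identifier scheme and the `involves` relation label are hypothetical and are not taken from the patent.

```python
import itertools

# A toy graph: node name -> attributes, plus a set of (subject, label, object) edges.
graph = {"nodes": {}, "edges": set()}
_event_ids = itertools.count(1)


def add_entity(graph, name, kind):
    """Add an entity node if it is not already present."""
    graph["nodes"].setdefault(name, {"kind": kind})


def add_event(graph, summary, direct_entities):
    """Store a media event as an abstract entity linked to its directly related entities."""
    event_id = f"event:{next(_event_ids)}"
    graph["nodes"][event_id] = {"kind": "event", "summary": summary}
    for name in direct_entities:
        add_entity(graph, name, "unknown")  # keeps the existing node if already known
        graph["edges"].add((event_id, "involves", name))
    return event_id


add_entity(graph, "Company A", "company")
eid = add_event(graph, "Product recall reported on social media", ["Company A"])
```

Because the event is an ordinary node, later queries can traverse from an event to the entities it involves, and onward to their neighbours, with the same mechanism used for any other entity.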

Preferably, the step of constructing the industry knowledge graph database includes performing semantic disambiguation and entity linking on the extracted entities. Further preferably, the semantic disambiguation and entity linking are implemented in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one by one; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.
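The difference between the two linking strategies can be shown with a toy Python sketch. The knowledge base, the candidate lists and both scoring functions (keyword overlap for independent linking, link counts for collective linking) are assumptions chosen for brevity; the patent does not prescribe particular scoring methods.

```python
from itertools import product

# Hypothetical knowledge base: candidate entity -> keywords and linked entities.
KB = {
    "Orion Bank": {"keywords": {"bank", "loan", "shares"}, "links": {"Lee (banker)"}},
    "Orion (constellation)": {"keywords": {"star", "sky"}, "links": set()},
    "Lee (banker)": {"keywords": {"ceo", "executive"}, "links": {"Orion Bank"}},
    "Lee (chef)": {"keywords": {"restaurant", "chef"}, "links": set()},
}
# Hypothetical candidate entities for each ambiguous mention.
CANDIDATES = {
    "Orion": ["Orion Bank", "Orion (constellation)"],
    "Lee": ["Lee (banker)", "Lee (chef)"],
}


def link_independently(mention, context_words):
    """Link one mention on its own: pick the candidate whose keywords best match the context."""
    return max(CANDIDATES[mention], key=lambda c: len(KB[c]["keywords"] & context_words))


def link_collectively(mentions):
    """Link all mentions jointly: pick the assignment whose candidates are most interlinked."""
    def coherence(assignment):
        return sum(1 for a in assignment for b in assignment if b in KB[a]["links"])

    best = max(product(*(CANDIDATES[m] for m in mentions)), key=coherence)
    return dict(zip(mentions, best))


single = link_independently("Orion", {"shares", "fell"})
joint = link_collectively(["Orion", "Lee"])
```

Here the mention "Lee" has no helpful context of its own, but joint linking resolves it to the banker because that candidate is associated with "Orion Bank" in the knowledge base.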

The present invention also provides a method for monitoring specific industry-related media events based on the industry knowledge graph database constructed according to the invention, comprising the following steps: obtaining Internet media data; performing event detection, event evaluation and screening on the obtained Internet media data to obtain the specific industry-related media events; identifying the directly related entities corresponding to a specific media event; accessing the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media event; and sending an early-warning message to the directly related entities and/or the indirectly related entities.

Preferably, the event detection in the step of event detection, event evaluation and screening includes the following steps: performing topic classification on the content of the obtained Internet media data to obtain content on a specific topic; identifying the entities involved in the obtained content; performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis; and performing event discovery on the filtered content, so as to cluster media events and discover new media events. Further preferably, the event detection also includes the following step: analyzing the authenticity of an event based on the attributes of the media event, and ranking and/or filtering media events according to the analysis results.
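The four event detection steps can be sketched end to end in Python. Every component below is a deliberately simple stand-in: keyword lexicons replace real topic and sentiment classifiers, a dictionary lookup replaces entity recognition, and grouping by (entity, negative term) replaces a real clustering algorithm. All names and sample posts are hypothetical.

```python
from collections import defaultdict

TOPIC_KEYWORDS = {"finance": {"bank", "loan", "shares"}}    # hypothetical topic lexicon
KNOWN_ENTITIES = {"Company A", "Company B"}                  # hypothetical entity dictionary
NEGATIVE_WORDS = {"fraud", "recall", "scandal", "lawsuit"}   # hypothetical sentiment lexicon


def tokens(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def detect_events(posts, topic):
    clusters = defaultdict(list)
    for text in posts:
        words = tokens(text)
        if not words & TOPIC_KEYWORDS[topic]:                # 1. topic classification
            continue
        entities = [e for e in KNOWN_ENTITIES if e in text]  # 2. entity recognition
        if not entities:
            continue
        negative = words & NEGATIVE_WORDS                    # 3. sentiment analysis and filtering
        if not negative:
            continue
        for entity in entities:                              # 4. event discovery by grouping
            for word in negative:
                clusters[(entity, word)].append(text)
    return dict(clusters)


posts = [
    "Company A bank faces fraud probe.",
    "Regulators confirm fraud at Company A, shares fall",
    "Company B opens a new bank branch",
    "Company A announces recall of toys",
]
events = detect_events(posts, "finance")
```

Only the first two posts survive all filters and they fall into a single cluster, which represents one candidate media event about Company A.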

Preferably, in the step of identifying the directly related entities corresponding to a specific media event, the directly related entities are identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.

Preferably, the step of accessing the industry knowledge graph database is implemented by querying the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.
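One plausible form of such a query is a bounded breadth-first traversal that starts from the directly related entities. The sketch below is an assumption about how the query could work, not the patent's stated algorithm; the edge list, the relation labels and the `max_hops` parameter are hypothetical.

```python
from collections import deque

# Hypothetical relationships stored in the graph: (subject, relation label, object).
EDGES = [
    ("Person X", "chairman_of", "Company A"),
    ("Company A", "supplier_of", "Company B"),
    ("Company B", "subsidiary_of", "Company C"),
    ("Company D", "competitor_of", "Company E"),
]


def indirectly_related(edges, direct_entities, max_hops=2):
    """Return entities reachable within max_hops of the direct entities, with hop counts."""
    adjacency = {}
    for subject, _, obj in edges:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)

    seen = set(direct_entities)
    queue = deque((entity, 0) for entity in direct_entities)
    found = {}
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found[neighbor] = hops + 1
                queue.append((neighbor, hops + 1))
    return found


related = indirectly_related(EDGES, ["Company A"])
```

For an event that directly involves Company A, this returns Person X and Company B at one hop and Company C at two hops, while the unconnected Company D and Company E are not alerted.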

Preferably, the step of accessing the industry knowledge graph database is implemented by applying data mining techniques to the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.

The present invention also provides a device for constructing an industry knowledge graph database, comprising: a data acquisition module for obtaining industry data from a data source; a data processing module for processing the industry data to extract entities related to the industry together with the corresponding entity attributes and/or entity relationships; and a database construction module for constructing the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition module obtains industry data by obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the data processing module processes the data by performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by generating it based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition module obtains industry data by using web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the data processing module processes the data by using information extraction techniques from natural language processing to perform entity recognition and relationship extraction on the industry-related data, so as to extract the entities, entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition module obtains industry data by using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the data processing module processes the data by performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction module constructs the industry knowledge graph database by supplementing or updating it based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition module obtains industry data by using an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from Internet data sources; the data processing module processes the data by performing event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry, and identifying the corresponding directly related entities from the Internet media data; and the database construction module constructs the industry knowledge graph database by supplementing it based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

Preferably, the database construction module further identifies the directly related entities corresponding to the specific media event in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.

Preferably, the database construction module includes a module for performing semantic disambiguation and entity linking on the extracted entities. Further preferably, this module performs semantic disambiguation and entity linking in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking on each extracted entity mention independently, one by one; or, based on a topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.

Preferably, the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events or other events of significance to the industry.

The present invention also provides a system for monitoring specific industry-related media events, comprising: a data acquisition unit for obtaining industry data from a data source; a data processing unit for processing the industry data to extract entities related to the industry together with the corresponding entity attributes and/or entity relationships; a database construction unit for constructing the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships; a database storage unit for storing the constructed industry knowledge graph database; a media event monitoring unit for obtaining Internet media data, performing event detection, event evaluation and screening on the obtained Internet media data to obtain the specific industry-related media events, and identifying the directly related entities corresponding to a specific media event; a database access unit for accessing the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities corresponding to the specific media event; and a message sending unit for sending an early-warning message to the directly related entities and/or the indirectly related entities.

Preferably, the data acquisition unit includes a structured data acquisition unit for obtaining structured industry data from a third-party industry database, the structured industry data comprising a plurality of fields; the data processing unit includes a structured data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the structured industry data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit includes a database generation unit for generating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition unit includes an industry-related data acquisition unit for using web crawler technology to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; the data processing unit includes an industry-related data processing unit for using information extraction techniques from natural language processing to perform entity recognition and relationship extraction on the industry-related data, so as to extract the entities, entity attributes and/or entity relationships; and the database construction unit includes a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition unit includes an industry-related data acquisition unit for using an application programming interface (API) to query Internet data sources for industry-related data, the Internet data sources including open data sources; the data processing unit includes an industry-related data processing unit for performing data cleaning and extract-transform-load (ETL) processing on the industry-related data before extracting the entities related to the industry and the corresponding entity attributes and/or entity relationships; and the database construction unit includes a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

Preferably, the data acquisition unit includes a media data acquisition unit for using an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from Internet data sources; the data processing unit includes a media data processing unit for performing event detection, event evaluation and screening on the Internet media data to extract specific media events related to the industry, and for identifying the corresponding directly related entities from the Internet media data; and the database construction unit includes a database supplement/update unit for supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

Preferably, the database supplement/update unit is further configured to perform semantic disambiguation and entity linking on the extracted entities.

Preferably, the media event monitoring unit is further configured to: perform topic classification on the content of the obtained Internet media data to obtain content on a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis; and perform event discovery on the filtered content, so as to cluster media events and discover new media events. Further preferably, the media event monitoring unit is further configured to analyze the authenticity of an event based on the attributes of the media event, and to rank and/or filter media events according to the analysis results.

Preferably, the database access unit is further configured to query the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.

Preferably, the database access unit is further configured to apply data mining techniques to the industry knowledge graph database, based on the directly related entities, to determine the indirectly related entities.

Implementing the technical solutions provided by the present invention achieves the following technical effects: 1) for one or more target fields or industries, automated, in-depth monitoring of relevant Internet media events is achieved, and the indirectly related entities corresponding to a specific media event can be identified; 2) during monitoring, Internet media data from multiple data sources, of multiple data types and in multiple languages is processed automatically.

S11-S15‧‧‧Steps

S31-S35‧‧‧Steps

S41-S45‧‧‧Steps

S421-S422‧‧‧Steps

S51-S53‧‧‧Steps

60‧‧‧Data source

61‧‧‧Data acquisition unit

62‧‧‧Data processing unit

63‧‧‧Database construction unit

64‧‧‧Database storage unit

65‧‧‧Database access unit

66‧‧‧Media event monitoring unit

67‧‧‧Message sending unit

601‧‧‧Third-party industry

602‧‧‧Internet data source

611‧‧‧Structured data acquisition unit

612‧‧‧Industry-related data acquisition unit

613‧‧‧Media data acquisition unit

621‧‧‧Structured data processing unit

622‧‧‧Industry-related data processing unit

623‧‧‧Media data processing unit

631‧‧‧Database generation unit

632‧‧‧Database supplement/update unit

FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention; FIG. 2 shows exemplary structured industry data provided by the present invention; FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention; FIG. 4 is an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention; FIG. 5 is an exemplary flowchart of a further method for constructing an industry knowledge graph database provided by the present invention; FIG. 6 is an exemplary functional block diagram of a system for monitoring media events provided by the present invention.

Specific implementations of the present invention are described below in the form of embodiments with reference to the accompanying drawings, so that those skilled in the art can understand the objectives, technical solutions, and advantages of the present invention. Those skilled in the art will understand that the specific implementations described in the form of embodiments are merely exemplary, and that the concept of the present invention can be realized without these specific details.

To achieve its purpose, the present invention provides a technique for constructing an industry knowledge graph database and a technique for monitoring Internet media events based on the constructed industry knowledge graph database.

The present invention relates to the application of knowledge graph database technology. A knowledge graph database is a special kind of database for knowledge management that facilitates collecting, organizing, and extracting knowledge in a given field. A knowledge graph database defines entities, entity attributes, and entity relationships. An entity corresponds to a thing in the real world (for example, a company A or a person X), and each entity can be identified by a globally unique ID. Entity attributes describe the intrinsic properties of an entity (for example, the Chinese and English names of company A or person X). Entity relationships connect entities and describe the links between them (for example, the employment relationship between person X and company A). Constructing a knowledge graph database makes it possible to use the knowledge formed by entities, entity attributes, and entity relationships more efficiently and more deeply, and to discover complex connections between things.
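The entity, attribute, and relationship model described above can be illustrated with a minimal in-memory sketch. The class and method names (`Entity`, `Relation`, `KnowledgeGraph`, `add_entity`, `add_relation`, `neighbors`) and the relation label `holds_position_at` are hypothetical and chosen only for illustration; the specification does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Entity:
    entity_id: int                       # globally unique ID
    entity_type: str                     # e.g. "company" or "person"
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    source_id: int
    relation_type: str
    target_id: int


class KnowledgeGraph:
    """Hypothetical in-memory store for entities and their relationships."""

    def __init__(self):
        self._ids = count(1)             # simple unique-ID generator
        self.entities = {}
        self.relations = set()

    def add_entity(self, entity_type, **attributes):
        entity = Entity(next(self._ids), entity_type, dict(attributes))
        self.entities[entity.entity_id] = entity
        return entity

    def add_relation(self, source, relation_type, target):
        relation = Relation(source.entity_id, relation_type, target.entity_id)
        self.relations.add(relation)
        return relation

    def neighbors(self, entity):
        """Return (relation_type, entity) pairs linked to `entity` in either direction."""
        result = []
        for r in self.relations:
            if r.source_id == entity.entity_id:
                result.append((r.relation_type, self.entities[r.target_id]))
            elif r.target_id == entity.entity_id:
                result.append((r.relation_type, self.entities[r.source_id]))
        return result


kg = KnowledgeGraph()
company_a = kg.add_entity("company", name_en="Company A")
person_x = kg.add_entity("person", name_en="Person X")
kg.add_relation(person_x, "holds_position_at", company_a)
```

In a production system the same model would more likely live in a graph database or a triple store, as the specification goes on to discuss.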

As a database, a knowledge graph database can be stored in various forms. For example, it may use a traditional relational database, may be stored as Semantic Web RDF (Resource Description Framework) triples, or may use a newer non-relational database. Preferably, the knowledge graph database is stored in a graph database, such as Neo4j, OrientDB, Titan-BerkeleyDB, or HyperGraphDB.

Depending on the scale and purpose of the knowledge graph database, the data sources used to construct it can vary widely. For example, a data source may be an open encyclopedia (for example, Baidu Baike or Wikipedia), a structured database (for example, Wikidata, DBpedia, a vertical website, or a specialized database for a particular industry), or any relevant third-party semi-structured or unstructured data source (for example, specialized websites and content published in Internet media, including news, company annual reports, and corporate announcements).

Those skilled in the art should understand that the knowledge graph database of the present invention is constructed with a specific field or industry as its orientation, but is not limited to a single industry. The constructed knowledge graph database integrates and links entities and events related to one or more industries, the attributes of those entities and events, and the relationships between entities, between entities and events, and between events, into a single graph of knowledge.

FIG. 1 is an exemplary flowchart of a method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S11-S15.

In step S11, industry data is obtained from an industry data source, and entities together with their corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge graph database.

An industry data source is a source of basic data for one or more specific fields or industries that are the target of monitoring. In one embodiment, the industry data source may be a structured industry database, so that the basic industry data obtained is of the highest possible quality. The structured database may be accessed through an application programming interface (API), and the data may be obtained by querying (for example, with query commands).

Through Extract-Transform-Load (ETL) processing, the obtained industry data can be transformed, and entities, entity attributes, and entity relationships can then be extracted from the transformed data and loaded into the industry knowledge graph database proposed by the present invention. The specific steps of the ETL operation can be carried out with existing data integration techniques. For example, in an ontology-based data integration method, the mapping between the fields of different databases and the various kinds of entity information is defined in a predetermined way, so that entities, entity attributes, and entity relationships are extracted according to the fields and their contents, completing the construction of a basic industry knowledge graph database. In addition, because industry databases differ in structure and may contain noisy, missing, or erroneous data, data cleaning may also be required while processing the industry data. Data cleaning can be implemented with techniques known in the art, in combination with the ETL processing.

As an example, FIG. 2 shows exemplary structured industry data which, as described above, may be obtained from a structured industry database. In FIG. 2, Table 1 is an example of structured data on listed companies. It contains two data entries, company A and company B, and each entry contains multiple fields such as the company's Chinese and English names, registered address, stock code, and chairman of the board. By performing the ETL operation on this structured data, the entities (company A, company B, person X, and person Y), the entity attributes (the specific information on company A and company B), and the entity relationships (the employment relationships between company A and person X and between company B and person Y) can be extracted, thereby generating a knowledge graph database for the industry concerned.
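The field-to-entity mapping described for Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the row contents are placeholders (the actual values of Table 1 are not reproduced in the text), and the names `ROWS`, `ATTRIBUTE_FIELDS`, `RELATION_FIELDS`, `clean`, `etl`, and the relation label `chairman_of` are hypothetical.

```python
# Hypothetical rows standing in for Table 1; every value is a placeholder.
ROWS = [
    {"name_en": "Company A", "registered_address": " Address 1 ",
     "stock_code": "0001", "chairman": "Person X"},
    {"name_en": "Company B", "registered_address": "Address 2",
     "stock_code": "", "chairman": "Person Y"},
]

# Predefined mapping from source fields to knowledge-graph roles.
ATTRIBUTE_FIELDS = ["name_en", "registered_address", "stock_code"]
RELATION_FIELDS = {"chairman": ("person", "chairman_of")}


def clean(value):
    """Basic data cleaning: trim whitespace and treat empty strings as missing."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def etl(rows):
    """Extract entities, attributes, and relations from flat records."""
    entities = {}    # (type, name) -> attribute dict
    relations = []   # (subject key, relation type, object key)
    for row in rows:
        name = clean(row.get("name_en"))
        if name is None:
            continue  # a row without a usable key cannot become an entity
        company_key = ("company", name)
        attrs = entities.setdefault(company_key, {})
        for field_name in ATTRIBUTE_FIELDS:
            value = clean(row.get(field_name))
            if value is not None:
                attrs[field_name] = value
        for field_name, (entity_type, relation_type) in RELATION_FIELDS.items():
            value = clean(row.get(field_name))
            if value is not None:
                person_key = (entity_type, value)
                entities.setdefault(person_key, {"name_en": value})
                relations.append((person_key, relation_type, company_key))
    return entities, relations


entities, relations = etl(ROWS)
```

Keeping the mapping in data (`ATTRIBUTE_FIELDS`, `RELATION_FIELDS`) rather than in code is one way to accommodate industry databases with differing structures.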

In another embodiment, the industry data source may instead be a semi-structured or unstructured data source on the Internet. Industry data can be crawled from the data source with web crawler technology, and entities, entity attributes, and entity relationships can be extracted with information extraction operations based on natural language processing.

In step S12, data related to the industry is obtained from an Internet data source, and entities related to the industry together with their corresponding entity attributes and entity relationships are extracted from that data.

In this step, data related to the specific field or industry mentioned above is first obtained from an Internet data source. The Internet data source may be structured, semi-structured, or unstructured, so industry-related data can be obtained in different ways depending on the structural characteristics of the source. Entities and their corresponding entity attributes and entity relationships are then extracted from the industry-related data.

For a structured Internet data source, the corresponding data content can be queried through an API to obtain entities, entity attributes, and entity relationships. For a semi-structured data source, the content can be crawled and then analyzed with information extraction operations from natural language processing, so as to extract industry-related entities, entity attributes, and entity relationships. A semi-structured data source contains partly structured and partly unstructured data, so the corresponding parts of the semi-structured data can be handled in the ways used for structured and unstructured data, respectively. For example, HTML and XML files are the most common kinds of semi-structured data. When processing HTML and XML files, the tag-based structured information can be used directly, and information extraction techniques can be combined with machine learning techniques to extract the remaining information required.
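As a small illustration of using the tag-based structure of an HTML file, the sketch below collects table cells with Python's standard `html.parser` module. The sample document `HTML` and the class name `TableCellParser` are hypothetical; the free text in the `<p>` element is the kind of unstructured part that would be passed on to information extraction instead.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a table cell is treated as structured data.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical semi-structured page: a table plus free text.
HTML = """
<table>
  <tr><td>Company A</td><td>Person X</td></tr>
  <tr><td>Company B</td><td>Person Y</td></tr>
</table>
<p>Person X has chaired Company A since its listing.</p>
"""

parser = TableCellParser()
parser.feed(HTML)
structured_rows = parser.rows
```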

Information extraction operations include entity recognition and relation extraction.

Entity recognition may use existing natural language processing tools (for example, part-of-speech tagging or named entity recognition tools), or an entity recognition model may be trained on specific annotated data with machine learning methods. Note that some natural language processing tasks and tools are language-dependent (for example, Chinese text requires word segmentation, whereas English text does not). Machine learning methods represent data in different languages and formats numerically, and then train models with general, language-independent algorithms (for example, conditional random fields and hidden Markov models).

Relation extraction can be implemented with a variety of existing statistical learning or machine learning methods. For example, a template learning method can be used: entities in the knowledge graph database that satisfy a given relationship are taken as seed instances; the sentence patterns and contexts in which these instances occur are extracted from a large body of text and counted to form relation extraction templates; and the resulting templates are then applied to text data to extract new instances. If an extracted instance does not yet exist in the knowledge graph database, it can be added to the knowledge graph database.
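The template learning idea above can be sketched in a few lines. This is a deliberately simplified illustration, not the method of the specification: it assumes English sentences, treats any capitalized word sequence as a candidate name, and assumes each name occurs once per sentence. The function names `learn_templates` and `apply_templates` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Capitalized word sequences stand in for entity mentions (a strong simplification).
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"


def learn_templates(seed_pairs, sentences, min_count=1):
    """Turn sentences that mention a known (person, company) pair into templates."""
    counts = Counter()
    for person, company in seed_pairs:
        for sentence in sentences:
            if person in sentence and company in sentence:
                template = (sentence.replace(person, "{PERSON}")
                                    .replace(company, "{COMPANY}"))
                counts[template] += 1
    return [t for t, c in counts.items() if c >= min_count]


def apply_templates(templates, sentences):
    """Match templates against sentences and return the (person, company) pairs found."""
    found = set()
    for template in templates:
        pattern = re.escape(template)
        pattern = pattern.replace(re.escape("{PERSON}"), f"(?P<person>{NAME})")
        pattern = pattern.replace(re.escape("{COMPANY}"), f"(?P<company>{NAME})")
        regex = re.compile("^" + pattern + "$")
        for sentence in sentences:
            match = regex.match(sentence)
            if match:
                found.add((match.group("person"), match.group("company")))
    return found


seeds = {("Person X", "Company A")}   # pairs already in the knowledge graph
corpus = [
    "Person X is the chairman of Company A.",
    "Person Y is the chairman of Company B.",
    "Company B opened a new office.",
]
templates = learn_templates(seeds, corpus)
new_instances = apply_templates(templates, corpus) - seeds
```

The pairs in `new_instances` are candidates for addition to the knowledge graph database; a real system would also score templates by reliability before trusting their output.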

In step S13, the industry knowledge graph database is supplemented or updated based on the industry-related entities and their corresponding entity attributes and entity relationships.

After the industry-related entities and their corresponding entity attributes and entity relationships have been extracted, they can be associated and compared with the corresponding information in the knowledge graph database. New entities, entity attributes, and entity relationships can then be added to the knowledge graph database as needed, and existing entity attributes and entity relationships can be updated.

As described above, the industry knowledge graph database proposed by the present invention may use a traditional relational database, an RDF triple store, or a newer non-relational database (for example, a graph database). Correspondingly, the specific operations for supplementing or updating the knowledge graph database can be implemented in a customized way with a database query language, for example SQL for relational databases, the RDF triple query language SPARQL, or the Cypher language used with the Neo4j graph database.

The description continues with the example in FIG. 2. Suppose the structured data on listed-company executives in Table 2 has been obtained from a structured Internet data source through API queries. The industry knowledge graph database can then be supplemented and updated as follows: 1) person Z, the entity attributes of person Z, and the employment relationship between person Z and company B are added to the knowledge graph database; 2) the entity attributes of person X and person Y are supplemented; 3) the employment relationship between person Y and company B is updated (that is, from "currently holds the position" to "formerly held the position").
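The three kinds of change in this example (adding an entity, supplementing attributes, updating a relationship) can be sketched as a single upsert routine. The dictionary layout, the function name `upsert`, the status labels `current` and `former`, and the rows in `TABLE_2` are hypothetical placeholders; the actual contents of Table 2 are not reproduced in the text.

```python
# Hypothetical state of the graph after processing Table 1.
graph = {
    "entities": {
        "Company A": {"type": "company"},
        "Company B": {"type": "company"},
        "Person X": {"type": "person"},
        "Person Y": {"type": "person"},
    },
    # (person, company) -> status of the employment relationship
    "positions": {
        ("Person X", "Company A"): "current",
        ("Person Y", "Company B"): "current",
    },
}

# Hypothetical rows standing in for Table 2; attribute values are placeholders.
TABLE_2 = [
    {"person": "Person X", "company": "Company A", "status": "current",
     "attrs": {"title": "Chairman"}},
    {"person": "Person Y", "company": "Company B", "status": "former",
     "attrs": {"title": "Chairman"}},
    {"person": "Person Z", "company": "Company B", "status": "current",
     "attrs": {"title": "Chairman"}},
]


def upsert(graph, rows):
    """Add missing entities and relations, merge attributes, update changed relations."""
    log = []
    for row in rows:
        person = row["person"]
        if person not in graph["entities"]:
            graph["entities"][person] = {"type": "person"}
            log.append(("added_entity", person))
        graph["entities"][person].update(row["attrs"])   # supplement attributes
        key = (person, row["company"])
        old_status = graph["positions"].get(key)
        if old_status is None:
            log.append(("added_relation", key))
        elif old_status != row["status"]:
            log.append(("updated_relation", key, old_status, row["status"]))
        graph["positions"][key] = row["status"]
    return log


change_log = upsert(graph, TABLE_2)
```

With a graph database, the same logic would typically be expressed in the database's own query language rather than in application code.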

Supplementing or updating the industry knowledge graph database requires entity linking and semantic disambiguation.

Entity linking maps an entity mention appearing in the data content to the relevant entity concept in the knowledge graph database. For example, in the two sentences "Jobs was one of the founders of Apple" and "Steve Jobs founded NeXT in the United States in 1985", the entity mentions "Jobs" and "Steve Jobs" should both correspond to the same person entity concept in the knowledge graph database, "Steve Jobs (ex-CEO of Apple)", so entity linking is needed to associate the two mentions with the same entity. Semantic disambiguation resolves ambiguous entity mentions. For example, the mention "Apple" can correspond to several entities, such as "Apple (fruit)", "Apple Inc.", "Apple Daily", and "Apple (film)", whereas "Apple" in the first sentence of the example above should correspond to the company entity concept "Apple Inc." in the knowledge graph database rather than to "Apple (fruit)", "Apple (film)", or "Apple Daily". Entity linking and semantic disambiguation are usually performed together. Because semantic disambiguation is the means of entity linking and entity linking is the goal of semantic disambiguation, the two terms are often used interchangeably.

Any existing entity linking and semantic disambiguation technique can be used in the present invention. For example, one class of methods uses entity knowledge to disambiguate and link entity mentions one by one, independently of each other. Entity knowledge includes, but is not limited to, the occurrence probability of an entity, the distribution of its names (full name, aliases, abbreviations, and so on), its context (for example, word co-occurrence information and word distributions), and its category information in the knowledge base (for example, company entity, person entity, or location entity). Probability-based means (such as linear regression or logistic regression) or machine learning means (such as support vector machines or random forests) can be used to learn and train semantic disambiguation and entity linking models based on entity knowledge. Another class of methods is based on the assumption of topical coherence (that is, the entities in an article are usually related to the topic of the text, so these entities are also semantically related to each other). These methods use the associations, in a knowledge base (such as Wikipedia or the knowledge graph constructed by the present invention), among the candidate entities of all entity mentions in the text to disambiguate and link all entity mentions in an article jointly and consistently. Such methods usually rely on collective inference over a graph data structure: the candidate entities of all entity mentions in the article are assembled, using their relationships in the knowledge base, into a candidate entity graph, and the density of the graph reflects the degree of semantic association between the different candidate entity nodes. Entity linking then consists of iteratively propagating evidence (the possible degree of association between different entities) along the dependency structure of the candidate entity graph, so that the evidence is reinforced collectively until convergence. The two classes of methods can also be combined flexibly to improve disambiguation and linking performance.
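The first class of methods, which links each mention independently using entity knowledge, can be sketched with a score that adds a prior occurrence probability to a context-overlap count. This is a toy stand-in for a trained model: the candidate table `CANDIDATES`, its prior values and context words, the entity IDs, and the function name `link` are all hypothetical.

```python
import re

# Hypothetical candidate table: mention -> candidate entities with a prior
# probability and a bag of typical context words. All numbers are made up.
CANDIDATES = {
    "Apple": [
        {"id": "apple_inc", "prior": 0.6,
         "context": {"founders", "founder", "company", "ceo", "iphone"}},
        {"id": "apple_fruit", "prior": 0.3,
         "context": {"fruit", "tree", "eat", "juice"}},
        {"id": "apple_daily", "prior": 0.1,
         "context": {"newspaper", "daily", "editor"}},
    ],
    "Jobs": [{"id": "steve_jobs", "prior": 1.0, "context": {"apple", "next"}}],
    "Steve Jobs": [{"id": "steve_jobs", "prior": 1.0, "context": {"apple", "next"}}],
}


def link(mention, sentence, candidates=CANDIDATES, context_weight=1.0):
    """Return the ID of the best-scoring candidate for `mention`, or None."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    best_id, best_score = None, float("-inf")
    for candidate in candidates.get(mention, []):
        overlap = len(tokens & candidate["context"])
        score = candidate["prior"] + context_weight * overlap
        if score > best_score:
            best_id, best_score = candidate["id"], score
    return best_id


first = link("Apple", "Jobs was one of the founders of Apple")
second = link("Apple", "She likes to eat an apple from the tree")
```

Here the context words outweigh the prior in the second sentence, which is the behavior a trained disambiguation model is expected to learn from data; the collective, graph-based methods described above would additionally take the other mentions in the article into account.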

In step S14, Internet media data related to the industry is obtained from an Internet data source, and specific media events related to the industry together with their corresponding directly related entities are extracted from the Internet media data.

Internet media data can be obtained from Internet data sources in a variety of ways. For example, some social media websites (for example, Sina Weibo, Facebook, and Twitter) provide open APIs for obtaining their data. Web crawler and content extraction techniques can also be used to crawl data from news websites or industry media websites.

Various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, the Internet media data is first examined to discover the content of media events in the specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (for example, the negativity, significance, suddenness, speed and scope of propagation, and credibility of the event) to select the media events that meet the requirements.

For different types of Internet media data, different processing techniques can be used to identify the directly related entities corresponding to a media event. For example, entity recognition based on natural language processing can be used to identify entities in text data, image or video recognition can be used to identify entities in image or video data, and speech recognition can be used to identify entities in audio or video data. Those skilled in the art will understand that the present invention places no restriction on the media types or languages of the Internet media data.

In step S15, the industry knowledge graph database is supplemented based on the specific media event and its corresponding directly related entities, wherein the specific media event is added to the industry knowledge graph database as an abstract entity.

After an industry-related specific media event and its corresponding directly related entities have been obtained (for example, a corruption scandal involving the chairman of a listed company, and the companies, persons, and locations involved in it), the event is added to the industry knowledge graph database as an abstract entity. At the same time, entity linking and semantic disambiguation are performed on the directly related entities involved in the event; that is, the entities in the industry knowledge graph database that correspond to them are found and associated with the abstract entity representing the event. If an entity involved in the event does not exist in the industry knowledge graph database, it can be added in the manner described for step S13 above. Once the industry knowledge graph database has been supplemented, the other, indirectly related entities of the abstract entity representing the media event can be found in the industry knowledge graph database, based on the relationships between the event's directly related entities and other entities in the knowledge graph database.

After the industry knowledge graph database has been constructed in this way, Internet media events can be monitored automatically and in depth based on the information it contains. Preferably, after the initial construction of the industry knowledge graph database, the database is also updated to keep the information complete and valid. For example, steps S12 and S13 can be performed periodically at a predetermined interval, and steps S14 and S15 can be performed continuously in real time.

In addition, those skilled in the art will understand that the content of the various kinds of data involved in the present invention, such as industry data, industry-related data, and Internet media data, may be in multiple languages and of multiple types (for example, text, image, video, and speech), and the present invention places no restriction on this.

FIG. 3 is an exemplary flowchart of a method for monitoring media events provided by the present invention. The method can monitor industry-related specific media events based on the industry knowledge graph database constructed in the present invention, and may include steps S31-S35.

In step S31, Internet media data is obtained.

As described above, Internet media data can be obtained from Internet data sources in a variety of ways. For example, some social media websites (for example, Sina Weibo, Facebook, and Twitter) provide open APIs for obtaining their data. Web crawler and content extraction techniques can also be used to crawl data from news websites or industry media websites.

In step S32, event detection, event evaluation, and screening are performed on the obtained Internet media data to obtain the industry-related specific media events.

As described above, various technical implementations already exist in the art for monitoring Internet media to obtain specific media events. For example, in one implementation, the Internet media data is first examined to discover the content of media events in the specific field or industry of interest and the entities involved in those events. The newly discovered media events are then evaluated against different indicators (for example, the negativity, significance, suddenness, speed and scope of propagation, and credibility of the event) to select the media events that meet the requirements.

Specifically, in one embodiment, the technical steps involved in event detection may include topic classification, entity recognition, sentiment analysis, and event discovery.

In the topic classification step, the content of the obtained Internet media data is classified by topic to obtain content on specific topics. The purpose of topic classification is to select, from the obtained content, the texts that belong to a topic of interest or to a category relevant to customer needs. Topic classification is a text mining technique: a classification model is generally trained on annotated data with machine learning or deep learning methods and then applied to a text to determine its topic category. Any existing classification model (for example, naive Bayes, decision trees, support vector machines, or artificial neural networks) can be used in the present invention.
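As one concrete instance of the classification models listed above, the following is a minimal multinomial naive Bayes sketch with add-one smoothing, written with the standard library only. The training sentences, the labels `finance` and `sports`, and the function names `train` and `classify` are hypothetical; whitespace tokenization is assumed, which would not be adequate for Chinese text.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label; returns (word_counts, doc_counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in doc_counts:
        logp = math.log(doc_counts[label] / total_docs)       # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue                                       # ignore unseen words
            # Add-one (Laplace) smoothing over the vocabulary.
            logp += math.log((word_counts[label][token] + 1)
                             / (total_words + len(vocabulary)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical annotated data.
TRAINING = [
    ("shares fell after the profit warning", "finance"),
    ("the company reported quarterly profit", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored in the match", "sports"),
]
model = train(TRAINING)
topic = classify("profit of the company fell", model)
```

Texts whose predicted topic is not among the monitored topics would simply be dropped before the later steps.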

In the entity recognition step, the entities involved are identified in the obtained content. The purpose of entity extraction is to find the entities mentioned in an article for further analysis. For example, entity recognition may include extracting entities from text with information extraction techniques from natural language processing, recognizing entities in images (including video) with image recognition techniques, and recognizing entities in speech with speech recognition techniques. The entities recognized in text, images, and speech can also be merged.

In the sentiment analysis step, sentiment analysis is performed on the obtained content and the recognized entities, and the obtained content is filtered based on the results. Sentiment analysis determines the sentiment polarity expressed by the content as a whole and toward individual entities, in order to find the content that meets the monitoring conditions. The prior art generally implements sentiment analysis with text classification methods (for example, classifying sentiment as positive, neutral, or negative) or regression methods (for example, expressing sentiment as a score between -5 and +5). To determine the sentiment toward a particular entity, the context of the entity in the text can be used, or a dependency parsing tool can be used to find the parts of the text related to that entity, and entity-level sentiment analysis is then performed on them.

In the event discovery step, event discovery is performed on the filtered content to cluster media events and discover new media events. The purpose of event discovery is to extract event information (for example, the time and place of an event) from different texts and then cluster and merge the related information into abstract "events". Newly emerged events are identified by comparison with existing events, and events are clustered according to the similarity or relevance of their content.
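The clustering and new-event detection described above can be sketched with a single-pass procedure that compares each document with the existing event clusters. The choice of Jaccard similarity over keyword sets, the threshold value, the sample documents, and the function names `jaccard` and `cluster_events` are assumptions made for illustration; the specification does not fix a particular similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_events(docs, threshold=0.5):
    """Assign each document to the most similar event, or open a new event.

    `docs` is an iterable of (doc_id, keywords) pairs, where the keywords stand
    for extracted event information such as entities, time, and place.
    Returns the list of event clusters and the IDs of documents that started one.
    """
    clusters = []
    new_event_docs = []
    for doc_id, keywords in docs:
        keywords = set(keywords)
        best = max(clusters, key=lambda c: jaccard(c["keywords"], keywords),
                   default=None)
        if best is not None and jaccard(best["keywords"], keywords) >= threshold:
            best["docs"].append(doc_id)
            best["keywords"] |= keywords       # merge the event information
        else:
            clusters.append({"keywords": keywords, "docs": [doc_id]})
            new_event_docs.append(doc_id)
    return clusters, new_event_docs


# Hypothetical documents after topic, entity, and sentiment filtering.
DOCS = [
    ("d1", {"company a", "chairman", "corruption", "city p"}),
    ("d2", {"company a", "chairman", "corruption", "arrest"}),
    ("d3", {"company b", "product", "recall"}),
]
clusters, new_event_docs = cluster_events(DOCS)
```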

Optionally, during event detection, the authenticity of an event can also be analyzed based on the attributes of the media event (for example, the time and place of the event, and the publisher of the media event and its related attributes), and the media events can be ranked and/or filtered according to the results of the analysis.

Those skilled in the art will understand that the implementations listed for the operations in the above steps are merely exemplary, that other existing approaches in the art can also implement these operations, and that the present invention places no restriction on the specific way in which the operations are implemented.

In step S33, the directly related entities corresponding to the specific media event are identified.

In one embodiment, the directly related entities of each media event can be obtained through the entity recognition and event discovery operations of event monitoring. At the same time, as described above, entity linking and semantic disambiguation can be used to associate each directly related entity with the corresponding entity concept in the industry knowledge graph database, or to add it to the industry knowledge graph database.

在步驟S34中，基於所述直接相關實體，訪問所述行業知識圖譜資料庫，以確定與所述特定媒體事件對應的非直接相關實體。 In step S34, based on the directly related entities, the industry knowledge map database is accessed to determine non-directly related entities corresponding to the specific media event.

In one embodiment, other entities that are associated with the event's directly related entities, that is, the indirectly related entities, can be queried directly in the industry knowledge graph database using various preset conditions. For example, the preset conditions may be: 1) entities associated with a directly related entity within N layers (N may be 1, 2, 3, ...); 2) other entities whose degree of association with a directly related entity meets some condition (such as exceeding a specified threshold); 3) entities that have a specific relationship (for example, a supply relationship or an investment relationship) with a directly related entity; 4) entities that have a specific attribute (for example, belonging to a specified industry, being located in a certain place, or holding a certain position). These preset conditions can be used individually or in any combination.
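The four preset conditions can be combined in a single graph query. The following is a minimal sketch assuming an in-memory edge list of `(source, relation, target, strength)` tuples and a dictionary of entity attributes; the function name, parameter names, and data layout are illustrative assumptions, not the database interface of the invention.

```python
from collections import deque


def find_indirect_entities(edges, attributes, direct, max_hops=2,
                           min_strength=0.0, relations=None,
                           required_attrs=None):
    """Breadth-first search for entities related to the direct entities.

    max_hops       -- condition 1: association within N layers
    min_strength   -- condition 2: association strength threshold
    relations      -- condition 3: allowed relation types (None = any)
    required_attrs -- condition 4: attribute values a result must have
    """
    adjacency = {}
    for src, rel, dst, strength in edges:
        if strength < min_strength:
            continue
        if relations is not None and rel not in relations:
            continue
        adjacency.setdefault(src, set()).add(dst)  # treat edges as undirected
        adjacency.setdefault(dst, set()).add(src)

    seen = set(direct)
    found = set()
    queue = deque((entity, 0) for entity in direct)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                queue.append((neighbor, depth + 1))

    if required_attrs:
        # The attribute filter is applied last so that the traversal may
        # still pass through entities that do not match it.
        found = {e for e in found
                 if all(attributes.get(e, {}).get(k) == v
                        for k, v in required_attrs.items())}
    return found
```

Applying the attribute condition after traversal is a design choice of this sketch; filtering during traversal would instead restrict which paths are allowed.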

In another embodiment, data mining methods can be applied on top of the industry knowledge graph database, using multiple conditions, to mine the indirectly related entities of an event. For example, a specific implementation can use link prediction techniques for graph data; that is, the problem of detecting the indirectly related entities of an event is formulated as the technical problem of "predicting whether an edge exists between the node representing the event in the industry knowledge graph database and entity nodes other than the directly related entity nodes". The conditions that can be used for link prediction include, but are not limited to, the characteristics of the event itself (for example, the type of event, its time and place attributes, and its negativity), the relationship between the event and historical events (including relationship type and relationship strength), the relationships between the directly related entities and other entities (including relationship type and relationship strength), and entity types and attributes, that is, all knowledge that can be mined from the knowledge graph database, so as to reach a comprehensive judgment on the indirectly related entities of a specific media event.
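To make the link prediction formulation concrete, the sketch below ranks candidate entities for a missing event-to-entity edge using the Adamic-Adar index, a standard topology-only score. This is a simplification chosen for illustration: it ignores the event features, historical-event relations, and entity attributes that the embodiment lists as usable conditions, and the function names are assumptions.

```python
from math import log


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def predict_indirect_entities(adjacency, event_node):
    """Rank non-neighbors of the event node by Adamic-Adar score.

    Each neighbor shared between the event node and a candidate contributes
    1 / log(degree of the shared neighbor), so connections through highly
    connected hubs count for less.
    """
    direct = adjacency[event_node]  # the directly related entities
    scores = {}
    for shared in direct:
        degree = len(adjacency[shared])
        if degree < 2:
            continue  # log(1) == 0, and such a node links to nothing else
        for candidate in adjacency[shared]:
            if candidate == event_node or candidate in direct:
                continue
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / log(degree)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Candidates whose score exceeds a chosen threshold would be treated as indirectly related entities; a learned model over the richer features named above could replace the score without changing the overall formulation.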

In step S35, an early warning message is sent to the directly related entities and/or the indirectly related entities.

After the directly and indirectly related entities corresponding to a specific media event have been identified, early warning messages can be sent to the users of the corresponding entities through various channels (for example, email, mobile phone text messages, instant messaging tools, and social network platforms). An early warning message may contain a textual description of the event itself, pictures, statistics on its dissemination, event evaluation indicators, the ways in which the related entities may be affected by the event, and so on.

Those skilled in the art will understand that the specific media events described in the present invention may be various types of events that meet the conditions set by the user and can be obtained from Internet media, for example, negative events, emergencies, crisis events, mass incidents, or public opinion events. The present invention places no limitation on this.

As a preferred embodiment, FIG. 4 shows an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S41, S421/S422, and S43 to S45.

In step S41, industry data is obtained from industry data sources, and entities and the corresponding entity attributes and entity relationships are extracted from the industry data to generate the industry knowledge graph database.

In step S421, based on structured data sources, entities, entity attributes, and entity relationships related to the industry are obtained by querying through an application programming interface (API). In one embodiment, the structured data source may be a structured open data platform such as Wikidata or DBpedia, from which industry-related data can be obtained through an API.

In step S422, based on semi-structured or unstructured data sources, natural language processing techniques are used to perform entity recognition and relation extraction on the data, so as to extract entities, entity attributes, and entity relationships related to the industry. In one embodiment, the semi-structured or unstructured data source may be an open data platform such as Wikipedia or Baidu Baike, or any relevant third-party data source (for example, professional websites or content published in Internet media), and industry-related data can be obtained through web crawlers or content extraction techniques.

Preferably, steps S421 and/or S422, and S43, may be performed periodically at a predetermined interval.

In step S43, the industry knowledge graph database is supplemented or updated based on the industry-related entities and the corresponding entity attributes and entity relationships.

In step S44, Internet media data is obtained from Internet data sources, and specific media events related to the industry, together with the corresponding directly related entities, are extracted from the Internet media data.

In step S45, the industry knowledge graph database is supplemented based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

Preferably, steps S44 and S45 may be performed continuously in real time.
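Step S45 treats a media event as an abstract entity, that is, as a node of its own that is connected to its directly related entities. A minimal in-memory sketch follows; the `KnowledgeGraph` class, the `involves` relation name, and the `unknown` placeholder type are hypothetical and stand in for whatever graph store an implementation actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Toy graph: nodes keyed by id, edges as (source, relation, target)."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_entity(self, entity_id, entity_type, **attrs):
        """Create or update an entity node with a type and attributes."""
        node = self.nodes.setdefault(entity_id, {})
        node["type"] = entity_type
        node.update(attrs)

    def add_event(self, event_id, direct_entities, **attrs):
        """Add a media event as an abstract entity linked to its entities."""
        self.add_entity(event_id, "event", **attrs)
        for entity_id in direct_entities:
            # Entities not yet in the graph are added with a placeholder type.
            self.nodes.setdefault(entity_id, {"type": "unknown"})
            self.edges.add((event_id, "involves", entity_id))


kg = KnowledgeGraph()
kg.add_entity("AcmeFoods", "company", industry="food")
kg.add_event("evt-1", ["AcmeFoods", "MegaMart"], category="negative")
```

Storing events as nodes is what lets later queries and link prediction relate a new event to historical events and to entities it does not mention directly.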

FIG. 5 is an exemplary flowchart of another method for constructing an industry knowledge graph database provided by the present invention. The method may include steps S51 to S53: in step S51, industry data is obtained from a data source; in step S52, data processing is performed on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships; in step S53, the industry knowledge graph database is constructed based on the extracted entities, entity attributes, and/or entity relationships.

As described above, the data sources of the industry knowledge graph database can be diverse, including but not limited to open encyclopedia-type data sources, structured databases, and any relevant third-party semi-structured or unstructured Internet data sources. As also described above, the data source of the industry knowledge graph database may be an Internet media data source.

In one embodiment, the data source may be a structured industry database, and the method may be implemented as follows: in step S51(1), structured industry data comprising multiple fields is obtained from a third-party industry database; in step S52(1), before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted, data cleaning and extract-transform-load (ETL) processing are performed on the structured industry data; in step S53(1), the industry knowledge graph database is generated based on the extracted entities, entity attributes, and/or entity relationships.
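The cleaning and ETL step for structured records can be sketched as follows. The field names `company`, `industry`, and `supplier`, and the `supplies` relation, are invented for illustration; a real third-party industry database would have its own schema.

```python
def records_to_graph(records):
    """Clean structured rows and load them as entities and relations.

    Returns (entities, relations) where `entities` maps an entity name to
    its attribute dict and `relations` is a set of
    (source, relation, target) triples.
    """
    entities = {}
    relations = set()
    for row in records:
        # Cleaning: trim whitespace and skip rows without an entity name.
        name = (row.get("company") or "").strip()
        if not name:
            continue
        attrs = entities.setdefault(name, {})

        # Transform: normalize the industry label to lower case.
        industry = (row.get("industry") or "").strip().lower()
        if industry:
            attrs["industry"] = industry

        # Load: turn the supplier field into an entity relationship.
        supplier = (row.get("supplier") or "").strip()
        if supplier:
            entities.setdefault(supplier, {})
            relations.add((supplier, "supplies", name))
    return entities, relations
```

Duplicate rows collapse naturally because entities are keyed by name and relations are stored in a set.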

In another embodiment, the data source may be an unstructured or semi-structured Internet data source, and the method may be implemented as follows: in step S51(2), web crawler technology is used to obtain industry-related data from Internet data sources, the Internet data sources including unstructured or semi-structured data sources; in step S52(2), information extraction techniques from natural language processing are used to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; in step S53(2), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.

In addition, steps S51(2) to S53(2) may be performed periodically at a predetermined interval.

In another embodiment, the data source may be an open Internet data source, and the method may be implemented as follows: in step S51(3), an application programming interface (API) is used to obtain industry-related data from Internet data sources by querying; in step S52(3), before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted, data cleaning and extract-transform-load (ETL) processing are performed on the industry-related data; in step S53(3), the industry knowledge graph database is supplemented or updated based on the extracted entities, entity attributes, and/or entity relationships.

In addition, steps S51(3) to S53(3) may be performed periodically at a predetermined interval.

In another embodiment, the data source may be an Internet media data source, and the method may be implemented as follows: in step S51(4), an application programming interface (API) or web crawler technology is used to obtain Internet media data from Internet data sources; in step S52(4), event detection, event evaluation, and screening are performed on the Internet media data to extract specific media events related to the industry, and the corresponding directly related entities are identified from the Internet media data; in step S53(4), the industry knowledge graph database is supplemented based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

For example, in step S52(4), the directly related entities corresponding to a specific media event may be identified in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.

For example, the specific media events may include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.

In addition, steps S51(4) to S53(4) may be performed continuously in real time.

In another embodiment, the step of supplementing or updating the industry knowledge graph database in steps S53(2), S53(3), and S53(4) above may include performing semantic disambiguation and entity linking on the extracted entities. For example, the semantic disambiguation and entity linking may be performed in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking independently on each extracted entity mention, one at a time; or, based on the topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.
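The difference between the two linking strategies can be shown with a small sketch. It assumes each mention comes with a list of candidate entities, a prior score per (mention, candidate) pair, and a set of related candidate pairs taken from the knowledge base; the exhaustive search over assignments is only practical for a handful of mentions and is an illustrative choice, not the method of the invention.

```python
from itertools import product


def link_independently(mentions, priors):
    """Pick, for each mention on its own, the candidate with the best prior."""
    return {
        mention: max(candidates, key=lambda c: priors.get((mention, c), 0.0))
        for mention, candidates in mentions.items()
    }


def link_collectively(mentions, priors, related):
    """Pick the joint assignment maximizing priors plus pairwise coherence.

    `related` is a set of frozensets, each holding two candidate entities
    that are associated in the knowledge base; every related pair in an
    assignment adds 1.0 to its score (an assumed weight).
    """
    names = list(mentions)
    best, best_score = None, float("-inf")
    for combo in product(*(mentions[name] for name in names)):
        score = sum(priors.get((m, c), 0.0) for m, c in zip(names, combo))
        score += sum(
            1.0
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
            if frozenset((combo[i], combo[j])) in related
        )
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(names, best))
```

With mentions "Apple" and "Cook" in the same text, the collective variant can prefer the company and its executive over individually more popular senses, because the two candidates are associated in the knowledge base.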

The above describes, by way of embodiments, a method for constructing an industry knowledge graph database provided by the present invention. Those skilled in the art will understand that the various combinations of these embodiments also fall within the concept of this method for constructing an industry knowledge graph database.

FIG. 6 is an exemplary functional block diagram of a system for monitoring media events provided by the present invention. The system includes a data acquisition unit 61, a data processing unit 62, a database construction unit 63, a database storage unit 64, a media event monitoring unit 66, a database access unit 65, and a message sending unit 67.

The data acquisition unit 61 is configured to obtain industry data from a data source 60.

The data processing unit 62 is configured to perform data processing on the industry data to extract entities related to the industry and the corresponding entity attributes and/or entity relationships. The database construction unit 63 is configured to construct the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships. The database storage unit 64 is configured to store the constructed industry knowledge graph database. The media event monitoring unit 66 is configured to obtain Internet media data, perform event detection, event evaluation, and screening on the obtained Internet media data to obtain the specific media events related to the industry, and identify the directly related entities corresponding to the specific media events. The database access unit 65 is configured to access the industry knowledge graph database based on the directly related entities to determine the indirectly related entities corresponding to the specific media events. The message sending unit 67 is configured to send early warning messages to the directly related entities and/or the indirectly related entities.

In one embodiment, the data acquisition unit 61 includes a structured data acquisition unit 611 configured to obtain structured data comprising multiple fields from a third-party industry database 601; the data processing unit 62 includes a structured data processing unit 621 configured to perform data cleaning and extract-transform-load (ETL) processing on the structured data; and the database construction unit 63 includes a database generation unit 631 configured to generate the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

In another embodiment, the data acquisition unit 61 includes an industry-related data acquisition unit 612 configured to use web crawler technology to obtain industry-related data from an Internet data source 602, the Internet data source 602 including unstructured or semi-structured data sources; the data processing unit 62 includes an industry-related data processing unit 622 configured to use information extraction techniques from natural language processing to perform entity recognition and relation extraction on the industry-related data, so as to extract the entities, entity attributes, and/or entity relationships; and the database construction unit 63 includes a database supplement/update unit 632 configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

In another embodiment, the data acquisition unit 61 includes an industry-related data acquisition unit 612 configured to use an application programming interface (API) to obtain industry-related data from an Internet data source 602 by querying, the Internet data source 602 including open data sources; the data processing unit 62 includes an industry-related data processing unit 622 configured to perform data cleaning and extract-transform-load (ETL) processing on the industry-related data before the entities related to the industry and the corresponding entity attributes and/or entity relationships are extracted; and the database construction unit 63 includes a database supplement/update unit 632 configured to supplement or update the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

In another embodiment, the data acquisition unit 61 includes a media data acquisition unit 613 configured to use an application programming interface (API) or web crawler technology to obtain industry-related Internet media data from an Internet data source 602; the data processing unit 62 includes a media data processing unit 623 configured to perform event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry, and to identify the corresponding directly related entities from the Internet media data; and the database construction unit 63 includes a database supplement/update unit 632 configured to supplement the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

In one embodiment, the database supplement/update unit 632 is further configured to perform semantic disambiguation and entity linking on the extracted entities.

In one embodiment, the media event monitoring unit 66 is further configured to: classify the content of the obtained Internet media data by topic to obtain content on a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the result of the sentiment analysis; and perform event discovery on the filtered content to cluster media events and discover new media events. In another embodiment, the media event monitoring unit 66 is further configured to analyze the authenticity of events based on the attributes of the media events, and to sort and/or filter the media events according to the analysis result.

In one embodiment, the database access unit 65 is further configured to query the industry knowledge graph database based on the directly related entities to determine the indirectly related entities. In another embodiment, the database access unit 65 is further configured to apply data mining techniques to the industry knowledge graph database based on the directly related entities to determine the indirectly related entities.

The above describes, by way of embodiments, a system for monitoring media events provided by the present invention. Those skilled in the art will understand that the operation steps of the various methods described above with reference to FIG. 1 and FIGS. 3 to 5 can be applied in the constituent units of the system, and they are therefore not repeated here.

Those skilled in the art should also understand that the various exemplary method steps and units described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the various exemplary steps and units above have been described generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The term "example/exemplary" as used in this specification means serving as an example, instance, or illustration. Any technical solution described as "exemplary" in the specification should not be construed as preferred or advantageous over other technical solutions.

The above description of the disclosed technical content is provided to enable those skilled in the art to make or use the present invention. Many modifications and variations of this technical content will be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the specific embodiments shown above, but is to be accorded the widest scope consistent with the inventive concept disclosed herein.

Claims

A method for constructing an industry knowledge graph database, comprising the following steps: step 101, obtaining, from a data source, industry data of an industry in one or more specific fields, the industry data including Internet media data; step 102, performing data processing on the industry data of the industry in the one or more specific fields to extract entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships, wherein an entity refers to a part of the industry data of the industry in the one or more specific fields; and performing event detection, event evaluation, and screening on the Internet media data to extract specific media events related to the industry in the one or more specific fields; and step 103, performing semantic disambiguation and entity linking on the extracted entities, and constructing the industry knowledge graph database based on the extracted specific media events and the entities, entity attributes, and/or entity relationships.

The method according to claim 1, wherein step 101 further comprises: obtaining structured industry data comprising a plurality of fields from a third-party industry database; step 102 further comprises: before extracting the entities related to the industry in the one or more specific fields and the corresponding entity attributes and/or entity relationships, performing data cleaning and extract-transform-load (ETL) processing on the structured industry data; and step 103 further comprises: generating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

The method according to claim 1, wherein step 101 further comprises: using web crawler technology to obtain data related to the industry in the one or more specific fields from an Internet data source, the Internet data source including an unstructured or semi-structured data source; step 102 further comprises: using information extraction techniques from natural language processing to perform entity recognition and relation extraction on the data related to the industry in the one or more specific fields, so as to extract the entities, entity attributes, and/or entity relationships; and step 103 further comprises: supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

The method according to claim 1, wherein step 101 further comprises: using an application programming interface (API) to obtain, by querying, data related to the industry in the one or more specific fields from an Internet data source, the Internet data source including an open data source; step 102 further comprises: before extracting the entities related to the industry in the one or more specific fields and the corresponding entity attributes and/or entity relationships, performing data cleaning and extract-transform-load (ETL) processing on the data related to the industry in the one or more specific fields; and step 103 further comprises: supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes, and/or entity relationships.

The method according to claim 1, wherein step 101 further comprises: using an application programming interface (API) or web crawler technology to obtain, from an Internet data source, Internet media data related to the industry in the one or more specific fields; step 102 further comprises: identifying the corresponding directly related entities from the Internet media data; and step 103 further comprises: supplementing the industry knowledge graph database based on the specific media events and the corresponding directly related entities, wherein each specific media event is added to the industry knowledge graph database as an abstract entity.

The method according to claim 5, wherein in step 102 the directly related entities corresponding to the specific media event are extracted in at least one of the following ways: identifying entities from text data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition; or identifying entities from audio or video data based on speech recognition.

The method according to any one of claims 3 to 5, wherein step 103 comprises: performing semantic disambiguation and entity linking on the extracted entities to identify the corresponding entities in the industry knowledge graph database.

The method according to claim 7, wherein the step of performing semantic disambiguation and entity linking on the extracted entities is implemented in at least one of the following ways: based on entity knowledge, performing semantic disambiguation and entity linking independently on each extracted entity mention, one at a time; or, based on the topic coherence assumption, using the associations among candidate entities in the knowledge base to perform semantic disambiguation and entity linking on the extracted entity mentions jointly and consistently.

The method according to claim 1, wherein the specific media events include negative events, emergencies, crisis events, mass incidents, public opinion events, or other events of significance to the industry.

The method according to claim 3 or 4, wherein steps 101 to 103 are performed periodically at a predetermined interval.

The method according to claim 5, wherein steps 101 to 103 are performed continuously in real time.

A method for monitoring a specific media event related to an industry in one or more specific fields according to the industry knowledge map database according to any one of the claims 1 to 11 in the scope of patent applications, which comprises the following steps: Step 1201, obtaining Internet media materials; step 1202, performing event detection, event evaluation, and screening based on the obtained Internet media materials to obtain the specific media event related to the industry; step 1203, extracting the corresponding media event Step 1204, based on the association relationship between the directly related entity and the industry knowledge map database, determine a non-directly related entity corresponding to the specific media event; and step 1205, report to the directly related entity A related entity and / or the indirectly related entity sends an early warning message.

The method according to item 12 of the scope of patent application, wherein the event detection in step 1202 includes the following steps: classifying content in the obtained Internet media materials to obtain content targeted to a specific topic; identifying the entities involved in the obtained content; performing sentiment analysis on the obtained content and the identified entities, and filtering the obtained content based on the results of the sentiment analysis; and performing event discovery based on the filtered content to cluster media events and discover new media events.
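The four sub-steps of event detection recited above (topic classification, entity identification, sentiment filtering, event discovery by clustering) can be sketched as follows. This is a simplified illustration under stated assumptions: the keyword sets, the entity list, the lexicon-based sentiment score, and the Jaccard-similarity clustering are hypothetical stand-ins chosen for brevity; the claim does not prescribe any of them.

```python
import re

# Hypothetical vocabularies for illustration only.
TOPIC_KEYWORDS = {"recall", "contamination", "safety"}
NEGATIVE_WORDS = {"recall", "contamination", "lawsuit", "illness"}
KNOWN_ENTITIES = {"acme dairy", "borealis foods"}


def tokens(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def on_topic(text):
    """Topic classification stand-in: keep text that mentions a topic keyword."""
    return bool(tokens(text) & TOPIC_KEYWORDS)


def find_entities(text):
    """Entity identification stand-in: substring match against known names."""
    low = text.lower()
    return {e for e in KNOWN_ENTITIES if e in low}


def sentiment(text):
    """Sentiment stand-in: minus the number of negative lexicon hits."""
    return -len(tokens(text) & NEGATIVE_WORDS)


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0


def detect_events(posts, threshold=0.3):
    """Filter posts, then greedily cluster the survivors into media events."""
    kept = []
    for text in posts:
        if not on_topic(text):
            continue
        ents = find_entities(text)
        if ents and sentiment(text) < 0:  # keep negative, entity-bearing content
            kept.append((text, ents))

    clusters = []  # each: {"tokens": set, "entities": set, "posts": list}
    for text, ents in kept:
        tk = tokens(text)
        for c in clusters:
            if jaccard(tk, c["tokens"]) >= threshold:
                c["tokens"] |= tk
                c["entities"] |= ents
                c["posts"].append(text)
                break
        else:
            # No similar cluster exists: treat this as a newly discovered event.
            clusters.append({"tokens": set(tk), "entities": set(ents), "posts": [text]})
    return clusters
```

In a production setting each stand-in would presumably be replaced by a trained classifier, a named-entity recognizer, and a sentiment model; the control flow is the part that mirrors the claim.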

The method according to item 13 of the scope of patent application, wherein the event detection in step 1202 further includes the steps of analyzing the authenticity of the event based on the attributes of the media event, and sorting and/or filtering the media events according to the analysis result.

The method according to item 12 of the scope of patent application, characterized in that in step 1203, directly related entities corresponding to the specific media event are extracted by at least one of the following methods: identifying entities from textual data based on entity recognition in natural language processing; identifying entities from image or video data based on image or video recognition processing; or identifying entities from audio or video data based on speech recognition processing.

The method according to item 12 of the scope of patent application, wherein the step 1204 is implemented by: performing semantic disambiguation and entity connection on the directly related entities to identify the corresponding entities in the industry knowledge map database, and, based on the directly related entities, querying other entities and/or entity relationships in the industry knowledge map database under preset conditions to determine the non-directly related entities.
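One way to read "querying under preset conditions" is a bounded traversal of the knowledge graph starting from the directly related entities. The sketch below assumes, purely for illustration, that the preset conditions are a set of allowed relation types and a maximum hop count, and that the graph is stored as (head, relation, tail) triples; the entity and relation names are hypothetical.

```python
from collections import deque

# Hypothetical knowledge-graph triples: (head entity, relation, tail entity).
TRIPLES = [
    ("acme_dairy", "supplies", "borealis_foods"),
    ("borealis_foods", "supplies", "city_mart"),
    ("acme_dairy", "competes_with", "delta_milk"),
    ("acme_dairy", "subsidiary_of", "acme_group"),
]


def indirect_entities(direct, triples, allowed_relations, max_hops):
    """Breadth-first search from the direct entities.

    Returns {entity: hop distance} for every entity reachable within
    max_hops over the allowed relation types, excluding the direct entities.
    Relations are treated as undirected for simplicity.
    """
    neighbors = {}
    for head, rel, tail in triples:
        if rel in allowed_relations:
            neighbors.setdefault(head, set()).add(tail)
            neighbors.setdefault(tail, set()).add(head)

    seen = set(direct)
    queue = deque((e, 0) for e in sorted(direct))
    found = {}
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # preset hop limit reached; do not expand further
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return found
```

For example, starting from `acme_dairy` with the relations `supplies` and `subsidiary_of` and a two-hop limit, the search reaches `borealis_foods` and `acme_group` at one hop and `city_mart` at two hops, while the competitor `delta_milk` is excluded by the relation filter. The hop distance could then be used to prioritize early warning messages.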

The method according to item 12 of the scope of patent application, wherein the step 1204 is implemented by: performing semantic disambiguation and entity connection on the directly related entities to identify the corresponding entities in the industry knowledge map database, and, based on the directly related entities, applying data extraction technology to other entities and/or entity relationships in the industry knowledge map database to determine the non-directly related entities.

An apparatus for constructing an industry knowledge map database, characterized in that it includes: a data acquisition module for obtaining, from a data source, industry data of an industry in one or more specific fields, the industry data of the industry in the one or more specific fields including Internet media data; a data processing module configured to perform data processing on the industry data of the industry in the one or more specific fields to extract entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships, where the entity refers to part of the industry data of the industry in the one or more specific fields, and to perform event detection, event evaluation, and screening to extract specific media events related to the industry in the one or more specific fields; and a database building module for performing semantic disambiguation and entity linking on the extracted entities and for building the industry knowledge map database based on the extracted specific media events and the entities, entity attributes, and/or entity relationships.

The device according to item 18 of the scope of patent application, wherein the data acquisition module is further configured to: obtain structured industry data from a third-party industry database, where the structured industry data includes multiple fields; the data processing module is further configured to: perform data cleaning and extraction-transformation-loading (ETL) processing on the structured industry data before extracting entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships; and the database building module is further configured to: generate the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.

The device according to item 18 of the scope of patent application, wherein the data acquisition module is further configured to: use web crawler technology to obtain data related to the industry in the one or more specific fields from an Internet data source, the Internet data source including unstructured or semi-structured data sources; the data processing module is further configured to: use information extraction technology in natural language processing to perform entity recognition and relationship extraction on the industry-related data to extract the entities, entity attributes, and/or entity relationships; and the database building module is further configured to: supplement or update the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.

The device according to item 18 of the scope of patent application, wherein the data acquisition module is further configured to: use an application programming interface (API) to obtain, by query, data related to the industry in the one or more specific fields from an Internet data source, the Internet data source including open data sources; the data processing module is further configured to: perform data cleaning and extraction-transformation-loading (ETL) processing on the data related to the industry in the one or more specific fields before extracting entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships; and the database building module is further configured to: supplement or update the industry knowledge map database based on the extracted entities, entity attributes, and/or entity relationships.

The device according to item 18 of the scope of patent application, wherein the data acquisition module is further configured to: use an application programming interface (API) or web crawler technology to obtain, from an Internet data source, Internet media data related to the industry in the one or more specific fields; the data processing module is further configured to: identify the corresponding directly related entity from the Internet media data; and the database building module is further configured to: supplement the industry knowledge graph database based on the specific media event and the corresponding directly related entity, wherein the specific media event is added to the industry knowledge graph database as an abstract entity.
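Adding a media event "as an abstract entity" can be pictured as creating an event node and linking it to each directly related entity. The following sketch assumes a minimal in-memory graph made of a node dictionary and an edge set; the `involves` relation name, the node `type` field, and the identifiers are hypothetical and are not taken from the patent.

```python
def add_event(graph, event_id, attributes, direct_entities):
    """Insert a media event as a node and connect it to its direct entities.

    graph is a dict with:
      "nodes": {node id -> attribute dict}
      "edges": set of (head, relation, tail) triples
    """
    # The event becomes a first-class node, typed so queries can tell it apart.
    graph["nodes"][event_id] = {**attributes, "type": "event"}
    for ent in direct_entities:
        # Create the entity node if the graph does not know it yet.
        graph["nodes"].setdefault(ent, {"type": "entity"})
        graph["edges"].add((event_id, "involves", ent))
    return graph


# Usage: record a hypothetical recall event involving two companies.
graph = {"nodes": {"acme_dairy": {"type": "entity"}}, "edges": set()}
add_event(graph, "event_001", {"label": "product recall"}, ["acme_dairy", "borealis_foods"])
```

Storing events as nodes means later events can be related to earlier ones through the entities they share, using the same query mechanism as for ordinary entities.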

The device according to item 22 of the scope of patent application, wherein the data processing module further extracts directly related entities corresponding to the specific media event in at least one of the following ways: identifying entities from textual materials based on entity recognition in natural language processing; identifying entities from image or video materials based on image or video recognition processing; or identifying entities from audio or video materials based on speech recognition processing.

The device according to any one of claims 20 to 22, wherein the database building module includes: a module for performing semantic disambiguation and entity linking on the extracted entities to identify the corresponding entities in the industry knowledge map database.

The device according to item 24 of the scope of patent application, wherein the module for performing semantic disambiguation and entity connection on the extracted entities further performs semantic disambiguation and entity connection by at least one of the following methods: based on entity knowledge, performing semantic disambiguation and entity connection on each extracted entity reference independently; and, based on the topic consistency assumption, using the associations among candidate entities in the knowledge base to perform consistent semantic disambiguation and entity connection on the extracted entity references.

The device according to item 22 of the scope of patent application, wherein the specific media event includes a negative event, an emergency event, a crisis event, a group event, a public opinion event, or other events of industrial significance.

A system for monitoring specific media events related to an industry, characterized in that it includes: a data acquisition unit for obtaining, from a data source, industry data of an industry in one or more specific fields; a data processing unit for performing data processing on the industry data of the industry in the one or more specific fields to extract entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships, where the entity refers to part of the industry data of the industry in the one or more specific fields; a database building unit for performing semantic disambiguation and entity connection on the extracted entities and for building an industry knowledge map database based on the extracted entities, entity attributes and/or entity relationships; a database storage unit for storing the constructed industry knowledge map database; a media event monitoring unit for obtaining Internet media materials, performing event detection, event evaluation and screening based on the obtained Internet media materials to obtain the specific media events related to the industry, and identifying a directly related entity corresponding to the specific media event; a database access unit configured to access the industry knowledge map database based on the directly related entity to determine a non-directly related entity corresponding to the specific media event; and a message transmission unit configured to send an early warning message to the directly related entity and/or the non-directly related entity.

The system according to item 27 of the scope of patent application, wherein the data acquisition unit includes: a structured data acquisition unit for obtaining structured industry data from a third-party industry database, the structured industry data including a plurality of fields; the data processing unit includes: a structured data processing unit configured to perform data cleaning and extraction-transformation-loading (ETL) processing on the structured industry data before extracting entities related to the industry and corresponding entity attributes and/or entity relationships; and the database building unit includes: a database generation unit for generating the industry knowledge map database based on the extracted entities, entity attributes and/or entity relationships.

The system according to item 27 of the scope of patent application, wherein the data acquisition unit includes: an industry-related data acquisition unit for obtaining, from an Internet data source using web crawler technology, data related to the industry in the one or more specific fields, the Internet data source including unstructured or semi-structured data sources; the data processing unit includes: an industry-related data processing unit that uses information extraction technology in natural language processing to perform entity recognition and relationship extraction on the data related to the industry in the one or more specific fields to extract the entities, entity attributes and/or entity relationships; and the database building unit includes: a database supplement/update unit for supplementing or updating the industry knowledge map database based on the extracted entities, entity attributes and/or entity relationships.

The system according to item 27 of the scope of patent application, wherein the data acquisition unit includes: an industry-related data acquisition unit for obtaining, by query from an Internet data source using an application programming interface (API), data related to the industry in the one or more specific fields, the Internet data source including open data sources; the data processing unit includes: an industry-related data processing unit for performing data cleaning and extraction-transformation-loading (ETL) processing on the data related to the industry in the one or more specific fields before extracting entities related to the industry in the one or more specific fields and corresponding entity attributes and/or entity relationships; and the database building unit includes: a database supplement/update unit for supplementing or updating the industry knowledge graph database based on the extracted entities, entity attributes and/or entity relationships.

The system according to item 27 of the scope of patent application, wherein the data acquisition unit includes: a media data acquisition unit for obtaining, from Internet data sources using an application programming interface (API) or web crawler technology, Internet media materials related to the industry in the one or more specific fields; the data processing unit includes: a media data processing unit for performing event detection, event evaluation, and screening on the Internet media materials to extract specific media events related to the industry, and for identifying corresponding directly related entities from the Internet media materials; and the database building unit includes: a database supplement/update unit for supplementing the industry knowledge map database based on the specific media events and the corresponding directly related entities, wherein the specific media event is added to the industry knowledge map database as an abstract entity.

The system according to any one of claims 29 to 31, wherein the database supplement / update unit is further configured to perform semantic disambiguation and entity connection on the extracted entities.

The system according to item 27 of the scope of patent application, wherein the media event monitoring unit is further configured to: classify content in the obtained Internet media materials to obtain content targeted to a specific topic; identify the entities involved in the obtained content; perform sentiment analysis on the obtained content and the identified entities, and filter the obtained content based on the results of the sentiment analysis; and perform event discovery based on the filtered content to cluster media events and discover new media events.

The system according to item 33 of the scope of patent application, wherein the media event monitoring unit is further configured to analyze the authenticity of events based on the attributes of the media events, and to sort and/or filter the media events according to the analysis results.

The system according to item 27 of the scope of patent application, wherein the database access unit is further configured to: based on the directly related entities, query other entities and/or entity relationships in the industry knowledge map database under preset conditions to determine the non-directly related entities.

The system according to item 27 of the scope of patent application, wherein the database access unit is further configured to: based on the directly related entities, apply data acquisition techniques to other entities and/or entity relationships in the industry knowledge map database to determine the non-directly related entities.

The system according to item 27 of the scope of the patent application, wherein the specific media event includes a negative event, an emergency event, a crisis event, a group event, a public opinion event, or other events of industrial significance.