WO2018019289A1 - Method, system, computer device and computer-readable medium for automatically generating a Chinese ontology library based on structured network knowledge - Google Patents

Method, system, computer device and computer-readable medium for automatically generating a Chinese ontology library based on structured network knowledge

Info

Publication number
WO2018019289A1
Authority
WO
WIPO (PCT)
Prior art keywords
concept
interest
Chinese text
text corpus
knowledge
Prior art date
Application number
PCT/CN2017/094881
Other languages
English (en)
French (fr)
Inventor
李应樵
Original Assignee
万云数码媒体有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 万云数码媒体有限公司 filed Critical 万云数码媒体有限公司
Priority to CN201780046326.XA priority Critical patent/CN109643315B/zh
Publication of WO2018019289A1 publication Critical patent/WO2018019289A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The invention relates to a method and a system for automatically generating an ontology library, and in particular to automatically generating a Chinese ontology library based on structured network knowledge.
  • Ontology represents the unique similarities and connections between different concepts and can be used to help semantically search for information or files obtained on a network, enterprise computer network, or any other database.
  • Ontologies can be generated from knowledge in a variety of languages. Whatever the language, the user must process a corpus in that language and refine key fields for ontology generation. Some languages, such as Chinese, have no explicit delimiters between words and are more difficult or complicated to process than English, which makes keyword extraction harder. As a result, the semantic content of a Chinese corpus is not easy to understand. Natural language processing (NLP) and latent semantic analysis (LSA) are used in computer science in fields involving the interaction between computers and human languages. Combining NLP and LSA allows lexical, grammatical, syntactic, and semantic analysis of a Chinese corpus.
  • This analysis specifically involves word segmentation, part-of-speech tagging, token refinement, statistical analysis, and the determination of token relevance.
  • NLP and LSA may not necessarily efficiently and accurately refine the correct keywords or concepts for ontology generation.
  • A Chinese ontology library can be generated automatically using structured network knowledge.
  • Structured network knowledge is a structured information database stored on the network.
  • Each article covers a topic that is usually manually edited by data users with knowledge of that topic. If erroneous or invalid information is found, it can be reported to the host of the web-based encyclopedia for correction. Each topic can therefore be regarded as manually edited and curated by experts, and thus as expert opinion on the topic.
  • Each topic can further be treated as a concept when used to generate an ontology.
  • Data users can point to associated articles by inserting links into an article.
  • Such a link can be regarded as a junction between concepts, and therefore represents a semantic relationship between different concepts.
  • Because structured network knowledge is built on a large number of concepts and the relationships between them, and unlike an ANN requires no pre-training, ontology generation using structured network knowledge can be automated without large amounts of manpower to prepare data. The present invention therefore requires no human intervention and is more efficient at ontology generation.
  • NLP and LSA are computer-implemented programs that perform lexical, grammatical, syntactic, and semantic analysis of Chinese text corpora. NLP and LSA can be regarded as using a computer language to understand human language, and this understanding may be less accurate and effective than a native Chinese speaker's understanding of a Chinese corpus.
  • The present invention uses hyperlinks in a structured knowledge network to discover associated concepts and thereby extract Chinese knowledge efficiently. Since these hyperlinks have already been reviewed by experts, they can be considered to describe the relationships between concepts more accurately.
  • Described below are a method for automatically generating a Chinese ontology library based on structured network knowledge, and a computer-readable medium encoding instructions that, when executed by a processor, cause the processor to implement the method, comprising the following steps: crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for automatic Chinese ontology library generation; filtering irrelevant links; extracting knowledge related to the concept of interest; discovering associated concepts of the concept of interest; inferring the semantic relevance of the concept of interest and its associated concepts by a cosine similarity measure; and storing the inferred semantic relevance data.
  • The step of crawling structured knowledge from the structured knowledge network comprises the steps of: browsing the structured knowledge through the Hypertext Transfer Protocol (HTTP); using a breadth-first search algorithm to visit the hyperlinks in the structured-knowledge category pages until all linked Chinese text corpora have been visited; obtaining at least one Chinese text corpus from the structured knowledge network, wherein the subject, abstract, and content of the Chinese text corpus are determined by the HTML header, title, and body tags in the static Hypertext Markup Language (HTML) page containing the corpus; and generating a link record for each Chinese text corpus obtained.
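The crawl-and-extract step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory page graph, the helper names, and the tag-extraction regexes are hypothetical, and a real crawler would fetch pages over HTTP rather than from a dictionary.

```python
import re
from collections import deque

# Hypothetical stand-in for the structured knowledge network: each entry maps
# a URL to a static HTML page containing a title, a body, and anchor links.
PAGES = {
    "cat": '<html><head><title>Category</title></head>'
           '<body><a href="view/1.htm">A</a><a href="view/2.htm">B</a></body></html>',
    "view/1.htm": '<html><head><title>Concept A</title></head>'
                  '<body>text <a href="view/2.htm">B</a></body></html>',
    "view/2.htm": '<html><head><title>Concept B</title></head><body>text</body></html>',
}

def extract_links(html):
    # The description finds anchor elements with "<a(.*?)</a>"; here we pull
    # out the href targets of those anchors.
    return re.findall(r'href="([^"]+)"', html)

def extract_corpus(html):
    # Subject and content are determined by the HTML title and body tags.
    title = re.search(r"<title>(.*?)</title>", html).group(1)
    body = re.search(r"<body>(.*?)</body>", html, re.S).group(1)
    return {"title": title, "body": body}

def crawl(start):
    """Breadth-first visit of hyperlinks until all linked pages are seen."""
    seen, queue, records = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        html = PAGES[url]
        records[url] = extract_corpus(html)  # one link record per corpus
        queue.extend(extract_links(html))    # BFS: enqueue rather than recurse
    return records

records = crawl("cat")
```

Breadth-first order matters here only in that every reachable page is eventually visited exactly once; the `seen` set is what terminates the crawl.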
  • The step of crawling structured knowledge from the structured knowledge network includes the step of generating a unique identifier for each Chinese text corpus obtained.
  • The step of crawling structured knowledge from the structured knowledge network includes the step of storing, for each Chinese text corpus obtained, a uniform resource locator (URL), an identifier, and/or a last modified time.
  • The step of crawling structured knowledge from the structured knowledge network includes the steps of: scanning all obtained Chinese text corpora at predetermined time intervals; generating or updating Chinese text corpus records by checking whether a matching record with the same last modified time exists; and eliminating all duplicate Chinese text corpora.
  • The step of eliminating duplicate Chinese text corpora includes the steps of: retaining only one identifier for each Chinese text corpus; and converting all other, different identifiers of the same Chinese text corpus into redirect identifiers.
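The de-duplication step can be sketched like this. The record structure is a hypothetical assumption (any equality key for corpus content would do): each corpus keeps one canonical identifier, and every other identifier observed for the same corpus becomes a redirect to it.

```python
def deduplicate(corpus_ids):
    """corpus_ids maps identifier -> corpus content key.
    Returns (canonical, redirects): one retained identifier per corpus,
    plus a redirect map for every duplicate identifier."""
    canonical, redirects = {}, {}
    for ident, corpus in corpus_ids.items():
        if corpus in canonical:
            # Same corpus already seen under another identifier: redirect.
            redirects[ident] = canonical[corpus]
        else:
            canonical[corpus] = ident
    return canonical, redirects

# The same corpus crawled under a browse page and a sub-browse page:
ids = {"1005619": "corpus-X", "1005619-sub": "corpus-X", "2347": "corpus-Y"}
kept, redirects = deduplicate(ids)
```

This preserves the uniqueness of identifiers in the link records: a lookup of a redirected identifier resolves to the single retained one.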
  • The step of filtering irrelevant links comprises the step of noise-filtering irrelevant links to external web pages, irrelevant links in the access menu that do not concern knowledge of the concept of interest, and links that recur in the structured knowledge network.
  • The step of extracting knowledge related to the concept of interest comprises the step of extracting relevant noun terms from the Chinese text corpus describing the concept of interest.
  • The step of discovering associated concepts of the concept of interest comprises the step of extracting a list of hyperlinks from the Chinese text corpus of the concept of interest, wherein the Chinese text corpus of each hyperlink represents a concept related to the concept of interest.
  • The step of inferring the semantic relevance of the concept of interest and its associated concepts by a cosine similarity measure comprises the steps of: calculating a term frequency weight vector V1 of the concept of interest; visiting the hyperlinks in the Chinese text corpus of the concept of interest to locate its associated concepts; calculating a term frequency weight vector for each associated concept, wherein the term frequency weight vector of each associated concept represents that concept's unique semantics; and calculating the cosine similarity between the term frequency weight vectors of the concept of interest and of each associated concept.
  • The term frequency weight vector V1 is calculated by the following equation:
  • V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
  • where tf(t1,c1) is the term frequency of the first related term in the Chinese text corpus of the concept of interest c1;
  • tf(t2,c1) is the term frequency of the second related term in the Chinese text corpus of the concept of interest c1; and
  • tf(tn,c1) is the term frequency of the nth related term in the Chinese text corpus of the concept of interest c1.
  • The term frequency weight vector of each associated concept is calculated by the following equation:
  • V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
  • where V2 is the term frequency weight vector of the associated concept c2;
  • tf(t1,c2) is the term frequency of the first related term in the Chinese text corpus of the associated concept c2;
  • tf(t2,c2) is the term frequency of the second related term in the Chinese text corpus of the associated concept c2; and
  • tf(tn,c2) is the term frequency of the nth related term in the Chinese text corpus of the associated concept c2.
  • The cosine similarity between the concept of interest and each associated concept is calculated by the following equation: similarity(c1, c2) = (V1 · V2) / (‖V1‖ ‖V2‖), where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
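The term frequency vectors and cosine similarity described above can be computed directly. The sketch below is illustrative: the term list and token counts are invented toy data, not from the patent.

```python
import math
from collections import Counter

def tf_vector(tokens, terms):
    """Term frequency weight vector over a fixed related-term list t1..tn."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]

def cosine_similarity(v1, v2):
    """Cosine of the angle between two term frequency weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

terms = ["history", "battle", "kingdom"]  # hypothetical related terms t1..tn
c1 = tf_vector(["history", "battle", "battle", "kingdom"], terms)   # concept of interest
c2 = tf_vector(["history", "battle", "kingdom", "kingdom"], terms)  # associated concept

sim = cosine_similarity(c1, c2)  # close to 1 => strongly semantically related
```

A value near 1 indicates largely similar content between the two concepts; 0 indicates completely different content.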
  • The step of storing the inferred semantic relevance data comprises: storing the semantic relevance in a web ontology language; and indexing the semantic relevance information.
  • The web ontology language used is the Resource Description Framework (RDF).
  • The step of indexing the semantic relevance information comprises building a concept map comprising the concept of interest, the associated concepts, the number of associated concepts, and an RDF icon.
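The concept-map index can be pictured as one record per concept of interest. The field names and example values below are illustrative assumptions, not the patent's schema.

```python
def concept_map_entry(concept, associated, rdf_url):
    """One index record for the concept map: the concept of interest, its
    associated concepts, their total count, and the link behind the RDF icon."""
    return {
        "concept": concept,
        "associated": sorted(associated),
        "count": len(associated),  # total shown under the concept of interest
        "rdf": rdf_url,            # lets the user download the RDF triples
    }

# Hypothetical entry for the concept "三国" (Three Kingdoms):
entry = concept_map_entry("三国", {"曹操", "刘备", "孙权"}, "/rdf/sanguo.rdf")
```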
  • The step of crawling structured knowledge from the structured knowledge network comprises the step of crawling structured knowledge from a web-based Chinese encyclopedia.
  • The step of crawling structured knowledge from the structured knowledge network comprises the step of crawling structured knowledge from Baidu Encyclopedia or Chinese Wikipedia.
  • A system for automatically generating a Chinese ontology library based on structured network knowledge comprises: a web crawling module configured to crawl structured knowledge from a structured knowledge network; a noise filtering module configured to filter irrelevant links; a knowledge extraction module configured to extract knowledge related to the concept of interest from Chinese text corpora; a database storing the Chinese text corpora downloaded from the structured network knowledge; and a relationship discovery module configured to extract associated concepts of the concept of interest and to calculate the semantic relevance between the concept of interest and its associated concepts using a cosine similarity measure.
  • The irrelevant links are irrelevant links to external web pages, irrelevant links in the access menu that do not concern knowledge of the concept of interest, and links that recur in the structured knowledge network.
  • The system includes a visual interface for displaying a concept map, wherein the concept map includes the concept of interest, the associated concepts, the number of associated concepts, and an RDF icon, wherein the number of associated concepts is the total number of associated concepts relating to the concept of interest, and the RDF icon allows the user to download the RDF triples of the concept of interest.
  • The semantic relevance is encoded in RDF.
  • FIG. 1 is a block diagram of a possible implementation of a system for automatically generating a Chinese ontology library based on structured network knowledge.
  • FIG. 2 is a flowchart showing the main steps of automatically generating a Chinese ontology library based on structured network knowledge. It should be understood that although the steps in the flowchart of FIG. 2 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering constraint on these steps, and they may be performed in other sequences. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
  • Figure 3 is a flow chart showing further steps of relationship discovery.
  • Figure 4 is a concept map of the concept "Three Kingdoms".
  • Figure 5 shows topics and their mutual semantic relevance displayed in RDF format.
  • FIG. 6 is a schematic diagram showing the internal structure of a computer device in an embodiment.
  • Embodiments of the systems, methods, and computer readable media disclosed herein automatically generate a Chinese ontology library based on structured network knowledge.
  • The system 2 for automatically generating a Chinese ontology library based on structured network knowledge includes a web crawling module 21, a noise filtering module 22, a knowledge extraction module 23, a database 24, a relationship discovery module 25, and a visualization module 26; each module may be implemented in whole or in part by software, hardware, or a combination thereof.
  • Figure 2 shows a flow chart for automatically generating a Chinese ontology library based on structured network knowledge.
  • In step S21, static HTML web pages 1 of a structured knowledge network, such as a web-based Chinese encyclopedia, can be crawled from the network by the web crawling module 21.
  • For example, the web-based Chinese encyclopedia can be the well-known Baidu Encyclopedia or Chinese Wikipedia.
  • Each static HTML page 1 describes a particular concept and has links to related web pages.
  • The web crawling module 21 browses the directories in the structured knowledge network through the HTTP protocol and uses a breadth-first search algorithm to visit the hyperlinks in the directory pages until all linked directories have been visited.
  • The web crawling module 21 then obtains and extracts only the Chinese text corpus from the linked static HTML web pages 1, where the subject, abstract, and content are determined by the HTML tags (e.g., the header, title, and body tags) on the obtained static HTML page.
  • The web crawling module 21 can find all possible links in the structured knowledge network using the regular expression "<a(.*?)</a>", create a link record for each obtained Chinese text corpus, and store the link record and the obtained corpus in the database 24.
  • Each Chinese text corpus obtained from the captured static HTML web page 1 can be identified by the URL of the captured static HTML web page 1.
  • For ease of identification, a unique identifier can be generated for a Chinese text corpus based on the URL representing it. For example, if Chinese text corpus A is obtained from the static HTML web page 1 crawled at the URL http://baike.baidu.com/view/2347.htm, corpus A will have the identifier 2347. If Chinese text corpus B is obtained from the static HTML web page 1 crawled at the URL http://baike.baidu.com/view/10088.htm, corpus B will have the identifier 10088. The URL, identifier, and last modified time of each Chinese text corpus are stored in the database 24.
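The identifier scheme in the example above (deriving 2347 from .../view/2347.htm) can be sketched with a regular expression. The helper name and the handling of subview-style URLs are assumptions for illustration.

```python
import re

def corpus_identifier(url):
    """Derive a unique identifier from a Baidu-Baike-style URL:
    .../view/2347.htm -> "2347"; .../subview/1005619/1005619.htm -> "1005619"."""
    match = re.search(r"/(?:sub)?view/(\d+)", url)
    return match.group(1) if match else None

ids = [corpus_identifier("http://baike.baidu.com/view/2347.htm"),
       corpus_identifier("http://baike.baidu.com/view/10088.htm"),
       corpus_identifier("http://baike.baidu.com/subview/1005619/1005619.htm")]
```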
  • The web crawling module 21 scans all downloaded Chinese texts at preset time intervals, and creates or updates the stored link records by checking whether the last modified time of a downloaded Chinese text corpus matches the last modified time in an existing link record.
  • The web crawling module 21 can also scan for and find the same Chinese text corpus in two or more crawled static HTML web pages 1 having different web addresses.
  • For example, the same Chinese text corpus may exist under both the browse page and a sub-browse page of crawled static HTML web pages 1 with the following different URLs: (under the browse page) http://baike.baidu.com/view/1005619.htm and (under the sub-browse page) http://baike.baidu.com/subview/1005619/1005619.htm
  • Such copies obtained from different URLs would produce different identifiers and make identifiers non-unique. To eliminate duplicates in the database 24, the web crawling module 21 can set the identifier of the Chinese text corpus on the sub-browse page as a redirect identifier, redirecting that corpus to the identifier under the browse page. Thus each Chinese text corpus has only one identifier, which preserves the uniqueness of identifiers in the link records.
  • In short, the web crawling module 21 can scan all link records extracted with the above regular expression, extract the identifier from a link through the matching "href" attribute value in the <a> tag, and use that identifier to look up the record in the database 24.
  • Link records for all downloaded Chinese text corpora are then created in the database 24.
  • In step S22, the noise filtering module 22 filters all irrelevant links to external web pages, irrelevant links in the access menu unrelated to the knowledge described in the Chinese text corpus, and links that recur in the structured knowledge network.
  • Each obtained Chinese text corpus can represent a concept, and this concept is often the subject of the Chinese text corpus.
  • A concept is an abstract idea. People can understand a concept by examining the details related to it: the events, people, objects, places, times, properties, and characteristics associated with it. All of this information can be regarded as knowledge of the concept.
  • In step S23, the knowledge extraction module 23 extracts the conceptual knowledge in the Chinese text corpus. There are many ways to extract conceptual knowledge. One is to extract the relevant noun terms from the Chinese text corpus describing the concept. It is to be understood that any substantially accurate knowledge extraction measure derived from any known or later-developed means may be adopted without departing from the spirit and scope of the invention.
  • The knowledge extracted from the Chinese text corpus can be used to calculate the term frequency weight vector of that corpus. Since each Chinese text corpus represents a concept, the term frequency weight vector of a Chinese text corpus can also be the term frequency weight vector of a concept.
  • V1 is the term frequency weight vector of the concept of interest c1 and is calculated as follows:
  • V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
  • where tf(t1,c1) is the term frequency of the first related term in the Chinese text corpus of the concept of interest c1;
  • tf(t2,c1) is the term frequency of the second related term in the Chinese text corpus of the concept of interest c1; and
  • tf(tn,c1) is the term frequency of the nth related term in the Chinese text corpus of the concept of interest c1.
  • A Chinese text corpus has hyperlinks to other Chinese text corpora. These hyperlinked corpora represent concepts associated with the original concept of interest.
  • The relationship discovery module 25 finds the connections between concepts by calculating the cosine similarity between the term frequency weight vectors of the Chinese text corpus (representing the concept of interest) and of each hyperlinked Chinese text corpus (representing an associated concept).
  • First, the hyperlink list is extracted from the crawled static HTML web page 1 of concept c1.
  • Each hyperlink in the Chinese text corpus represents an associated concept.
  • the associated concept is identified by accessing the hyperlink found in the Chinese text corpus of the concept of interest.
  • the corresponding term frequency weight vector for the associated concept can also be found.
  • For example, the associated concepts c2 and c3 may be found in the Chinese text corpus of the concept of interest c1, and the term frequency weight vectors of the associated concepts c2 and c3 may be calculated as follows:
  • V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
  • V3 = (tf(t1,c3), tf(t2,c3), ..., tf(tn,c3))
  • where V2 is the term frequency weight vector of the associated concept c2;
  • V3 is the term frequency weight vector of the associated concept c3;
  • tf(t1,c2) is the term frequency of the first related term in the Chinese text corpus of the associated concept c2;
  • tf(t2,c2) is the term frequency of the second related term in the Chinese text corpus of the associated concept c2;
  • tf(tn,c2) is the term frequency of the nth related term in the Chinese text corpus of the associated concept c2;
  • tf(t1,c3) is the term frequency of the first related term in the Chinese text corpus of the associated concept c3;
  • tf(t2,c3) is the term frequency of the second related term in the Chinese text corpus of the associated concept c3; and
  • tf(tn,c3) is the term frequency of the nth related term in the Chinese text corpus of the associated concept c3.
  • each associated concept has a term frequency weight vector that represents its unique semantics.
  • the semantic relevance of the associated concept is inferred from the cosine similarity measure.
  • The cosine similarity between a concept and an associated concept can be used to infer the degree of similarity between the two concepts, namely the cosine of the angle between their term frequency weight vectors:
  • similarity(c1, c2) = (V1 · V2) / (‖V1‖ ‖V2‖)
  • where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
  • If the cosine similarity between two concepts is close to 1, the content of the two concepts is largely similar; in other words, the two concepts are largely semantically related. If the cosine similarity between two concepts equals 0, the two concepts have completely different content, meaning they may be completely unrelated from a semantic point of view. Cosine similarity thus helps quantify the similarity between associated concepts.
  • All Chinese text corpus records can be retrieved from database 24, each of which represents a concept, and the term frequency weight vector for each Chinese text corpus is calculated.
  • the cosine similarity between each Chinese text corpus record and all Chinese text corpus records connected to it via a hyperlink is derived.
  • The main subject can be encoded in a formal ontology language, such as the Web Ontology Language (OWL) or the Resource Description Framework and its schema language (RDF or RDFS). Other ontology languages can also be used.
  • the Chinese text corpus is converted into an RDF triple.
  • All associated concepts with the term frequency weights are also recorded in the form of RDF triples.
  • All associated concepts of the Chinese text corpus, together with their semantic relevance, are stored in RDF format in step S35, and an RDF file with the semantic relevance information is indexed in step S36.
  • the generated RDF triples and stored RDF data can be used for further queries and operations.
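Storing the inferred relevance as RDF triples can be sketched with plain N-Triples serialization. The namespace, predicate names, blank-node scheme, and rounding below are illustrative assumptions, not the patent's vocabulary.

```python
def to_ntriples(concept, related):
    """Serialize one concept of interest and its associated concepts as
    N-Triples lines, attaching each cosine similarity via a blank node."""
    base = "http://example.org/concept/"   # hypothetical namespace
    onto = "http://example.org/ontology/"  # hypothetical vocabulary
    lines = []
    for i, (other, sim) in enumerate(related):
        node = f"_:rel{i}"
        lines.append(f'<{base}{concept}> <{onto}hasRelation> {node} .')
        lines.append(f'{node} <{onto}target> <{base}{other}> .')
        lines.append(f'{node} <{onto}similarity> "{sim:.2f}" .')
    return lines

triples = to_ntriples("SanGuo", [("CaoCao", 0.87), ("LiuBei", 0.91)])
```

Serialized this way, the triples can be stored, indexed, and offered for download behind the concept map's RDF icon.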
  • The system 2 includes a visualization interface 26 to facilitate expanded searching.
  • The visualization interface 26 presents a concept map in which the concept of interest 51 (i.e., "Three Kingdoms" in this embodiment) is displayed in the center of the diagram, with all associated concepts 52 displayed around it.
  • A number under the concept of interest 51 represents the total number of concepts 52 associated with the concept of interest 51.
  • the visualization interface 26 can also present an RDF icon that allows the user to download the RDF triples of the concept of interest 51.
  • The number and placement of the concepts, the associated concepts, the count of associated concepts, and the RDF icons may vary without departing from the scope of the present disclosure.
  • As shown in the schematic diagram of its internal structure, the computer device includes a processor, a non-volatile storage medium, an internal memory, and a network interface coupled through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions for implementing a method for automatically generating a Chinese ontology library based on structured network knowledge.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire device.
  • Computer readable instructions may be stored in the internal memory of the computer device, the computer readable instructions being executable by the processor to cause the processor to perform a method for automatically generating a Chinese ontology library based on structured network knowledge.
  • the network interface of the computer device is used to communicate with external terminals via a network connection.
  • The structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution may be applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange components differently.
  • The description and examples of the exemplary embodiments are provided with reference to particular embodiments of the invention, and it is understood that modifications and variations are possible within the spirit and scope of the claims.
  • The above-described embodiments illustrate possible implementations of the specification but do not limit the scope of the disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, system, computer device, and computer-readable medium for automatically generating a Chinese ontology library based on structured network knowledge. The method comprises the steps of: crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for the automatic generation of the Chinese ontology library; filtering irrelevant links; extracting knowledge related to the concept of interest; discovering associated concepts of the concept of interest; inferring the semantic relevance between the concept of interest and its associated concepts based on a cosine similarity measure; and storing the inferred semantic relevance data. The invention provides a more efficient system and method for automatic Chinese ontology library generation, to cope with the rapidly growing data world and meet the needs of data users.

Description

Method, system, computer device and computer-readable medium for automatically generating a Chinese ontology library based on structured network knowledge
This application claims priority to Hong Kong patent application No. 16109078.8, filed with the Intellectual Property Department of the Government of the Hong Kong Special Administrative Region on July 29, 2016 and entitled "Method, system, computer device and computer-readable medium for automatically generating a Chinese ontology library based on structured network knowledge", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to a method and system for automatically generating an ontology library, and in particular to automatically generating a Chinese ontology library based on structured network knowledge.
Background
In the era of information technology, large amounts of data are uploaded to and downloaded from networks, enterprise computer networks, and other databases every day. Data users always expect to obtain the various kinds of information they need from a network, an enterprise computer network, or a database, but they do not always obtain the correct information. An ontology represents the distinctive similarities and connections between different concepts and can be used to support semantic search over information or files obtained from a network, an enterprise computer network, or any other database.
Traditional ontology generation is usually done by experts manually entering the relationships between concepts, and therefore requires considerable manpower. Currently, various computer-implemented programs, such as artificial neural networks (ANNs), can be used to discover semantic relevance between words in a corpus. However, an ANN must be trained in advance, so a great deal of manpower is still needed to prepare data with a variety of input patterns. An ANN may therefore be unable to keep pace with the rate at which data on a network, an enterprise computer network, or any database is updated.
An ontology can be generated from knowledge in a variety of languages. Whatever the language, the user must process a corpus in that language and refine key fields for ontology generation. Some languages, such as Chinese, have no explicit delimiters between words, which makes language processing more difficult or complicated than for English and makes keyword extraction harder. As a result, the semantic content of a Chinese text corpus is not easy to understand. Natural language processing (NLP) and latent semantic analysis (LSA) are used in computer science in fields involving the interaction between computers and human languages. Combining NLP and LSA allows lexical, grammatical, syntactic, and semantic analysis of a Chinese text corpus. Such analysis specifically involves word segmentation, part-of-speech tagging, token refinement, statistical analysis, and the determination of token relevance. However, because of the complexity of the Chinese language, NLP and LSA may not efficiently and accurately refine the correct keywords or concepts for ontology generation.
In short, a more efficient system and method, preferably a computer-automated method and system, is needed for Chinese ontology library generation, to cope with the rapidly growing data world and to meet the needs of data users.
Summary of the Invention
A Chinese ontology library can be generated automatically using structured network knowledge. Structured network knowledge is a structured information database stored on a network. For example, there are many web-based Chinese encyclopedias, such as Baidu Encyclopedia and Chinese Wikipedia, which are popular public knowledge bases consisting of millions of articles. Each article covers a topic that is usually manually edited by data users with knowledge of that topic. If erroneous or invalid information is found, it can be reported to the host of the web-based encyclopedia so that it can be corrected. Each topic can therefore be regarded as manually edited and curated by experts, and thus as expert opinion on the topic. When used to generate an ontology, each topic can further be treated as a concept. In addition, data users can point to associated articles by inserting links into an article. Such a link can be regarded as a junction between concepts and therefore represents a semantic relationship between different concepts. Because structured network knowledge is built on a large number of concepts and the relationships between them, and unlike an ANN requires no pre-training, ontology generation using structured network knowledge can be fully automated without large amounts of manpower to prepare data. The present invention therefore requires no human intervention and is more efficient at ontology generation.
Since the Chinese language has no explicit delimiters between words, the accuracy of the knowledge refined when generating a Chinese ontology library usually depends on how sentences are segmented and which tokens are selected for refinement. Chinese ontology library generation usually uses NLP and LSA for knowledge extraction. NLP and LSA are computer-executed programs that perform lexical, grammatical, syntactic, and semantic analysis of a Chinese text corpus. NLP and LSA can be regarded as using a computer language to understand human language, and this understanding may be less accurate and effective than a native Chinese speaker's understanding of a Chinese corpus. With this in mind, the present invention uses hyperlinks in a structured knowledge network to discover associated concepts and thereby extract Chinese knowledge efficiently. Because these hyperlinks have already been reviewed by experts, they can be considered to describe the relationships between concepts more accurately.
Described below are a method for automatically generating a Chinese ontology library based on structured network knowledge, and a computer-readable medium encoding instructions that, when executed by a processor, cause the processor to implement the method, comprising the following steps: crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for automatic Chinese ontology library generation; filtering irrelevant links; extracting knowledge related to the concept of interest; discovering associated concepts of the concept of interest; inferring the semantic relevance of the concept of interest and its associated concepts by a cosine similarity measure; and storing the inferred semantic relevance data.
Preferably, the step of crawling structured knowledge from the structured knowledge network comprises the following steps: browsing the structured knowledge through the Hypertext Transfer Protocol (HTTP); using a breadth-first search algorithm to visit the hyperlinks in the structured-knowledge category pages until all linked Chinese text corpora have been visited; obtaining at least one Chinese text corpus from the structured knowledge network, wherein the subject, abstract, and content of the Chinese text corpus are determined by the HTML header, title, and body tags in the static Hypertext Markup Language (HTML) page containing the corpus; and generating a link record for each Chinese text corpus obtained.
Further, the step of crawling structured knowledge from the structured knowledge network comprises the following step: generating a unique identifier for each Chinese text corpus obtained.
Further, the step of crawling structured knowledge from the structured knowledge network comprises the following step: storing, for each Chinese text corpus obtained, a uniform resource locator (URL), an identifier, and/or a last modified time.
Further, the step of crawling structured knowledge from the structured knowledge network comprises the following steps: scanning all obtained Chinese text corpora at preset time intervals; generating or updating Chinese text corpus records by checking whether a matching record with the same last modified time exists; and eliminating all duplicate Chinese text corpora.
Further, the step of eliminating duplicate Chinese text corpora comprises the following steps: retaining only one identifier for each Chinese text corpus; and converting all other, different identifiers of the same Chinese text corpus into redirect identifiers.
Preferably, the step of filtering irrelevant links comprises the following step: noise-filtering irrelevant links to external web pages, irrelevant links in the access menu that do not concern knowledge of the concept of interest, and links that recur in the structured knowledge network.
Preferably, the step of extracting knowledge related to the concept of interest comprises the following step: extracting relevant noun terms from the Chinese text corpus describing the concept of interest.
Preferably, the step of discovering associated concepts of the concept of interest comprises the following step: extracting a list of hyperlinks from the Chinese text corpus of the concept of interest, wherein the Chinese text corpus of each hyperlink represents a concept related to the concept of interest.
Preferably, the step of inferring the semantic relevance of the concept of interest and its associated concepts by a cosine similarity measure comprises the following steps: calculating a term frequency weight vector V1 of the concept of interest; visiting the hyperlinks in the Chinese text corpus of the concept of interest to locate its associated concepts; calculating a term frequency weight vector for each associated concept, wherein the term frequency weight vector of each associated concept represents that concept's unique semantics; and calculating the cosine similarity between the term frequency weight vectors of the concept of interest and of each associated concept.
Further, the term frequency weight vector V1 is calculated by the following equation:
V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
where tf(t1,c1) is the term frequency of the first related term in the Chinese text corpus of the concept of interest c1;
tf(t2,c1) is the term frequency of the second related term in the Chinese text corpus of the concept of interest c1; and
tf(tn,c1) is the term frequency of the nth related term in the Chinese text corpus of the concept of interest c1.
Further, the term frequency weight vector of each associated concept is calculated by the following equation:
V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
where V2 is the term frequency weight vector of the associated concept c2;
tf(t1,c2) is the term frequency of the first related term in the Chinese text corpus of the associated concept c2;
tf(t2,c2) is the term frequency of the second related term in the Chinese text corpus of the associated concept c2; and
tf(tn,c2) is the term frequency of the nth related term in the Chinese text corpus of the associated concept c2.
Further, the cosine similarity between the term frequency weight vectors of the concept of interest and of each associated concept is calculated by the following equation:
similarity(c1, c2) = (V1 · V2) / (‖V1‖ ‖V2‖)
where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
Further, the step of storing the inferred semantic relevance data comprises: storing the semantic relevance in a web ontology language; and indexing the semantic relevance information.
Preferably, the web ontology language used is the Resource Description Framework (RDF).
Preferably, the step of indexing the semantic relevance information comprises building a concept map comprising the concept of interest, the associated concepts, the number of associated concepts, and an RDF icon.
Preferably, the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from a web-based Chinese encyclopedia.
Preferably, the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from Baidu Encyclopedia or Chinese Wikipedia.
Also disclosed is a system for automatically generating a Chinese ontology library based on structured network knowledge, comprising: a web crawling module configured to crawl structured knowledge from a structured knowledge network; a noise filtering module configured to filter irrelevant links; a knowledge extraction module configured to extract knowledge related to the concept of interest from Chinese text corpora; a database storing the Chinese text corpora downloaded from the structured network knowledge; and a relationship discovery module configured to extract associated concepts of the concept of interest and to calculate the semantic relevance between the concept of interest and the associated concepts using a cosine similarity measure.
Preferably, the irrelevant links are irrelevant links to external web pages, irrelevant links in the access menu that do not concern knowledge of the concept of interest, and links that recur in the structured knowledge network.
In addition, the system includes a visual interface that displays a concept map, wherein the concept map includes the concept of interest, the associated concepts, the number of associated concepts, and an RDF icon, wherein the number of associated concepts is the total number of associated concepts relating to the concept of interest, and the RDF icon allows the user to download the RDF triples of the concept of interest.
Preferably, the semantic relevance is encoded in RDF.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a block diagram of a possible implementation of a system for automatically generating a Chinese ontology library based on structured network knowledge.
Figure 2 is a flowchart showing the main steps of automatically generating a Chinese ontology library based on structured network knowledge. It should be understood that although the steps in the flowchart of Figure 2 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering constraint on these steps, and they may be performed in other orders. Moreover, at least some of the steps in Figure 2 may comprise multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Figure 3 is a flowchart showing further steps of relationship discovery.
Figure 4 is a concept map of the concept "Three Kingdoms" (三国).
Figure 5 shows topics and their mutual semantic relevance displayed in RDF format.
Figure 6 is a schematic diagram of the internal structure of a computer device in one embodiment.
Detailed Description
To make the technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
With reference to the examples shown in the drawings, the details of exemplary embodiments are described, where like reference numerals throughout refer to like elements.
The drawings and the following description relate to preferred embodiments by way of illustration only. It should be noted that, from the following discussion, alternative embodiments of the structures and methods disclosed herein will readily be recognized as viable alternatives that do not depart from the claimed principles.
Embodiments of the systems, methods and computer-readable media disclosed herein automatically generate a Chinese ontology library based on structured network knowledge.
As shown in FIG. 1, the system 2 for automatically generating a Chinese ontology library based on structured network knowledge comprises a web crawling module 21, a noise filtering module 22, a knowledge extraction module 23, a database 24, a relationship discovery module 25 and a visualization module 26; each module may be implemented in whole or in part by software, hardware or a combination thereof. FIG. 2 shows the flowchart of automatically generating a Chinese ontology library based on structured network knowledge.
In step S21, static HTML web pages 1 of a structured knowledge network, such as a web-based Chinese encyclopedia, may be crawled from the network by the web crawling module 21. For example, the web-based Chinese encyclopedia may be the well-known Baidu Baike or Chinese Wikipedia. Each static HTML web page 1 describes a particular concept and contains links to related pages. To crawl all the static HTML web pages 1 (including all linked pages) from the structured knowledge network, the web crawling module 21 browses the directory of the structured knowledge network via the HTTP protocol and visits the hyperlinks in the directory pages using a breadth-first search algorithm until all linked directories have been visited. The web crawling module 21 then fetches and extracts only the Chinese text corpus from the linked static HTML web pages 1, where the subject, abstract and content are determined by the HTML tags (for example the head, title and body tags) on the fetched static HTML pages. One possible implementation of the web crawling module 21 is described below. The web crawling module 21 may use the regular expression "<a(.*?)</a>" to find all possible links in the structured knowledge network, create a link record for each fetched Chinese text corpus, and store the link record together with the fetched Chinese text corpus in the database 24. Each Chinese text corpus fetched from a crawled static HTML web page 1 can be identified by the URL of that crawled page. For ease of identification, a unique identifier can be generated for each Chinese text corpus based on the web address ("URL") representing that corpus. For example, if a Chinese text corpus A is fetched from the static HTML web page 1 crawled at the URL http://baike.baidu.com/view/2347.htm, corpus A will have the identifier 2347. If a Chinese text corpus B is fetched from the static HTML web page 1 crawled at the URL http://baike.baidu.com/view/10088.htm, corpus B will have the identifier 10088. The URL, identifier and last modification time of each Chinese text corpus are stored in the database 24.
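The link-extraction step above can be sketched in a few lines. The snippet below is only an illustrative reading of the description, not the patented implementation: it applies the regular expression "<a(.*?)</a>" named above to a fetched page and derives the numeric identifier from a Baidu-Baike-style /view/ or /subview/ URL; the helper names and the href/identifier patterns are our own assumptions.

```python
import re

ANCHOR_RE = re.compile(r'<a(.*?)</a>', re.S)     # the regular expression named in the description
HREF_RE = re.compile(r'href="([^"]+)"')          # assumed: links carry a quoted href attribute
ID_RE = re.compile(r'/(?:view|subview)/(\d+)')   # e.g. .../view/2347.htm -> "2347"

def extract_link_records(html):
    """Return (url, identifier) pairs for every <a> tag found on a fetched page."""
    records = []
    for anchor in ANCHOR_RE.findall(html):
        href = HREF_RE.search(anchor)
        if href is None:
            continue                             # anchor without an href carries no link
        url = href.group(1)
        match = ID_RE.search(url)
        records.append((url, match.group(1) if match else None))
    return records
```

Each returned pair corresponds to one link record to be stored in the database alongside the fetched corpus.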
The web crawling module 21 scans all downloaded Chinese text corpora at a preset time interval and creates or updates the stored link records by checking whether the last modification time of a downloaded Chinese text corpus matches the last modification time in an existing link record. The web crawling module 21 can also scan for and find identical Chinese text corpora in two or more crawled static HTML web pages 1 with different URLs. For example, the same Chinese text corpus may exist under the view page and the subview page of crawled static HTML web pages 1 with the following different URLs:
(under the view page) http://baike.baidu.com/view/1005619.htm
(under the subview page) http://baike.baidu.com/subview/1005619/1005619.htm
Such duplication of Chinese text corpora fetched from different URLs would produce different identifiers and make the identifiers non-unique. To eliminate duplicate Chinese text corpora in the database 24, the web crawling module 21 may set the identifier of the Chinese text corpus on the subview page as a redirect identifier that redirects that corpus to the identifier under the view page. As a result, each Chinese text corpus has only one identifier, which keeps the identifiers in the link records unique.
In summary, the web crawling module 21 can scan all link records extracted with the above regular expression, extract an identifier from each link via the matched "href" attribute value in the <a> tag, use that identifier to look up the unique identifier of the corpus stored in the records of the database 24, and update the redirect identifier in a link record when one exists. Link records for all downloaded Chinese text corpora are then created in the database 24.
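One way to realize the duplicate elimination described above is to hash the fetched text and record every later identifier as a redirect to the first one seen. This is only an illustrative sketch; keying duplicates by a content hash is our assumption and is not stated in the patent.

```python
import hashlib

def build_redirects(corpora):
    """corpora: iterable of (identifier, text) pairs. The first identifier seen for
    a given text is kept as canonical; every later identifier for the same text is
    mapped to it in the returned redirect table, so each corpus keeps one identifier."""
    canonical_by_digest = {}
    redirects = {}
    for identifier, text in corpora:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in canonical_by_digest:
            redirects[identifier] = canonical_by_digest[digest]  # redirect identifier
        else:
            canonical_by_digest[digest] = identifier             # canonical identifier
    return redirects
```

Applied to the view/subview example above, the identifier obtained from the subview URL would be redirected to the one obtained from the view URL.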
In step S22, the noise filtering module 22 filters out all irrelevant links to external web pages, irrelevant links in access menus that are unrelated to the knowledge described in the Chinese text corpus, and links that recur within the structured knowledge network.
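A bare-bones version of this filtering step might look like the following. The host name and the menu-path prefixes are assumptions made for illustration only; the patent does not specify them.

```python
from urllib.parse import urlparse

SITE_HOST = "baike.baidu.com"                     # assumed host of the structured knowledge network
MENU_PATHS = ("/help", "/login", "/feedback")     # assumed access-menu paths, hypothetical examples

def filter_irrelevant_links(urls):
    """Keep only in-site article links: drop external pages, menu links and repeats."""
    kept, seen = [], set()
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc != SITE_HOST:
            continue                               # link to an external web page
        if any(parsed.path.startswith(p) for p in MENU_PATHS):
            continue                               # access-menu link with no concept knowledge
        if url in seen:
            continue                               # link recurring in the network
        seen.add(url)
        kept.append(url)
    return kept
```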
Each fetched Chinese text corpus can represent a concept, and this concept is often the subject of that corpus. A concept is an abstract idea. By examining the detailed information related to a concept, such as the events, people, objects, places, times, properties and characteristics associated with it, people can understand the concept. All of the above information can be regarded as the knowledge of the concept. In step S23, the knowledge extraction module 23 extracts the concept knowledge from the Chinese text corpus. There are many ways to extract concept knowledge. One of them is to extract the relevant noun terms from the Chinese text corpus describing the concept. It will be appreciated that any essentially accurate knowledge extraction measure derived from all means known now or developed in the future may be adopted without departing from the spirit and scope of the present invention.
The knowledge extracted from a Chinese text corpus can be used to compute the term frequency weight vector of that corpus. Since each Chinese text corpus represents a concept, the term frequency weight vector of a Chinese text corpus can also be regarded as the term frequency weight vector of a concept. V1 is the term frequency weight vector of the concept of interest c1 and is computed as follows:
V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
where tf(t1,c1) is the term frequency of the first relevant term in the Chinese text corpus of the concept of interest c1;
tf(t2,c1) is the term frequency of the second relevant term in the Chinese text corpus of the concept of interest c1; and
tf(tn,c1) is the term frequency of the n-th relevant term in the Chinese text corpus of the concept of interest c1.
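As a concrete reading of the formula above, V1 can be computed by counting each relevant term in the tokenized corpus. This is a minimal sketch; the tokenization itself (Chinese word segmentation) is assumed to have been done already and is outside the snippet.

```python
from collections import Counter

def tf_vector(relevant_terms, corpus_tokens):
    """V = (tf(t1,c), tf(t2,c), ..., tf(tn,c)): the raw frequency of each
    relevant term in the (already segmented) Chinese text corpus."""
    counts = Counter(corpus_tokens)
    return [counts[term] for term in relevant_terms]
```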
A Chinese text corpus contains hyperlinks to other Chinese text corpora. These hyperlinked Chinese text corpora represent concepts associated with the original concept of interest. In step S24, the relationship discovery module 25 discovers the relations between concepts by computing the term frequency weight vectors obtained from the Chinese text corpus (representing the concept of interest) and from the hyperlinked text corpora (representing the associated concepts), and by computing the cosine similarity between the term frequency weight vectors of the Chinese text corpus and the hyperlinked Chinese text corpora.
As further illustrated in FIG. 3, one possible implementation of the relationship discovery module 25 is described as follows. In step S31, a list of hyperlinks is extracted from the crawled static HTML web page 1 of concept c1. Each hyperlink in the Chinese text corpus represents an associated concept. In step S32, the associated concepts are identified by visiting the hyperlinks found in the Chinese text corpus of the concept of interest. The corresponding term frequency weight vectors of the associated concepts can also be found. For example, associated concepts c2 and c3 may be found in the Chinese text corpus of the concept of interest c1, and the term frequency weight vectors of the associated concepts c2 and c3 can be computed as follows:
V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
V3 = (tf(t1,c3), tf(t2,c3), ..., tf(tn,c3))
where V2 is the term frequency weight vector of the associated concept c2;
V3 is the term frequency weight vector of the associated concept c3;
tf(t1,c2) is the term frequency of the first relevant term in the Chinese text corpus of the associated concept c2;
tf(t2,c2) is the term frequency of the second relevant term in the Chinese text corpus of the associated concept c2;
tf(tn,c2) is the term frequency of the n-th relevant term in the Chinese text corpus of the associated concept c2;
tf(t1,c3) is the term frequency of the first relevant term in the Chinese text corpus of the associated concept c3;
tf(t2,c3) is the term frequency of the second relevant term in the Chinese text corpus of the associated concept c3; and
tf(tn,c3) is the term frequency of the n-th relevant term in the Chinese text corpus of the associated concept c3.
In step S33, each associated concept then has a term frequency weight vector representing its unique semantics. In step S34, the semantic relatedness of the associated concepts is inferred by a measure of cosine similarity. The degree of closeness between a concept and one of its associated concepts can be inferred from their cosine similarity, that is, by measuring the cosine of the angle between the term frequency weight vectors of the concept and the associated concept:
sim(c1, c2) = cos(V1, V2) = (V1 · V2) / (‖V1‖ ‖V2‖)
where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
If the cosine similarity between two concepts is close to 1, the contents of the two concepts are largely similar to each other; in other words, the two concepts are very likely semantically related. If the cosine similarity between two concepts equals 0, the two concepts have completely different contents, meaning they are likely completely unrelated from a semantic point of view. Cosine similarity therefore helps quantify the similarity of associated concepts.
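The cosine measure interpreted above follows directly from its definition; a minimal sketch:

```python
import math

def cosine_similarity(v1, v2):
    """cos(V1, V2) = (V1 · V2) / (|V1| |V2|); returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical vectors give a value of 1 and orthogonal vectors give 0, matching the two boundary cases discussed above.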
All Chinese text corpus records, each representing a concept, can be retrieved from the database 24, and the term frequency weight vector of each corpus is computed. The cosine similarity between each Chinese text corpus record and all Chinese text corpus records connected to it by hyperlinks is derived. The main subjects can be encoded in a formal language, such as the Web Ontology Language ("OWL") or the Resource Description Framework ("RDF" or "RDFS"). Other ontology languages can also be used. In this embodiment, as shown in FIG. 5, the Chinese text corpora are converted into RDF triples. All associated concepts with term frequency weights are also recorded as RDF triples. For example, all associated concepts of a Chinese text corpus with semantic relatedness are stored in RDF format in step S35, and the RDF files with semantic relatedness information are indexed in step S36. The generated RDF triples and the stored RDF data can be used for further queries and operations.
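A minimal serialization of the relatedness data as RDF triples (N-Triples syntax) could look like the following. The example.org namespace and the relatedTo predicate are invented placeholders for illustration, not the vocabulary used by the patented system.

```python
def relatedness_to_ntriples(concept_id, associated_ids):
    """Emit one N-Triples line per associated concept. The namespace and the
    relatedTo predicate below are illustrative assumptions only."""
    base = "http://example.org/concept/"
    pred = "http://example.org/ontology/relatedTo"
    return "\n".join(
        "<%s%s> <%s> <%s%s> ." % (base, concept_id, pred, base, assoc)
        for assoc in associated_ids
    )
```

The emitted lines can then be loaded into any RDF store for the queries and operations mentioned above.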
To facilitate concept retrieval when generating the Chinese ontology library, indexes of titles and abstracts can be built. Concept retrieval, and the display of associated concepts in a concept map, can be achieved by measuring the relatedness of concepts.
In one embodiment, the system 2 includes a visualization interface 26 in the form of a concept map user interface as shown in FIG. 4 to facilitate searching. The visualization interface 26 displays a concept map in which the concept of interest 51 ("三国", the Three Kingdoms, in this embodiment) is shown at the center of the map, surrounded by all associated concepts 52. A number below the concept of interest 51 represents the total number of associated concepts 52 related to the concept of interest 51. As shown in FIG. 4, there are 707 concepts associated with "三国". The visualization interface 26 can also display an RDF icon allowing the user to download the RDF triples of the concept of interest 51. The positions and orientations of the concept of interest, the associated concepts and the RDF icon may vary without departing from the scope of this disclosure.
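The data behind such a concept-map view reduces to the concept of interest, its associated concepts and their count. A sketch of the record the interface might render follows; the field names are our own and are not taken from the patent.

```python
def concept_map_entry(concept, associated):
    """Build the record a concept-map view needs: the concept of interest,
    its associated concepts and their total count (707 for "三国" in FIG. 4)."""
    return {
        "concept_of_interest": concept,
        "associated_concepts": list(associated),
        "count": len(associated),
    }
```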
As shown in FIG. 6, in one embodiment a schematic diagram of the internal structure of a computer device is provided. The computer device includes a processor, a non-volatile storage medium, an internal memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions, the computer-readable instructions being used to implement a method for automatically generating a Chinese ontology library based on structured network knowledge. The processor of the computer device provides computing and control capabilities and supports the operation of the whole device. Computer-readable instructions may be stored in the internal memory of the computer device; when executed by the processor, these instructions cause the processor to perform a method for automatically generating a Chinese ontology library based on structured network knowledge. The network interface of the computer device is used to communicate with external terminals over a network connection. The structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components. The description and examples here are provided with particular reference to exemplary embodiments, but it will be appreciated that variants and modifications within the spirit and scope of the claims are also valid. The specific embodiments above show the possible scope of the specification but do not limit the scope of this disclosure.

Claims (43)

  1. A method for automatically generating a Chinese ontology library based on structured network knowledge, comprising the following steps:
    crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for automatically generating the Chinese ontology library;
    filtering irrelevant links;
    extracting knowledge related to the concept of interest;
    discovering associated concepts of the concept of interest;
    inferring the semantic relatedness of the concept of interest and its associated concepts by a measure of cosine similarity; and
    storing the inferred semantic relatedness data.
  2. The method of claim 1, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following steps:
    browsing the structured knowledge via the Hypertext Transfer Protocol ("HTTP");
    visiting classification pages of the structured knowledge using a breadth-first search algorithm until all linked Chinese text corpora have been visited; fetching at least one Chinese text corpus from the structured knowledge network, wherein the subject, abstract and content of the Chinese text corpus are determined by the HTML head, title and body tags presented on the static Hypertext Markup Language ("HTML") page containing the Chinese text corpus; and
    generating a link record for each fetched Chinese text corpus.
  3. The method of claim 2, further comprising the following step:
    generating a unique identifier for each fetched Chinese text corpus.
  4. The method of claim 3, further comprising the following step:
    storing a web address ("URL"), an identifier and/or a last modification time for each fetched Chinese text corpus.
  5. The method of claim 4, further comprising the following steps:
    scanning all fetched Chinese text corpora at a preset time interval;
    creating or updating Chinese text corpus records by checking whether a matching record with the same last modification time exists; and
    eliminating all duplicate Chinese text corpora.
  6. The method of claim 5, wherein the step of eliminating all duplicate Chinese text corpora comprises the following steps:
    keeping only one identifier for each Chinese text corpus; and
    converting all other different identifiers of the same Chinese text corpus into redirect identifiers.
  7. The method of claim 1, wherein the step of filtering irrelevant links comprises the following step:
    noise-filtering irrelevant links to external web pages, irrelevant links in access menus that do not concern the knowledge of the concept of interest, and links that recur within the structured knowledge network.
  8. The method of claim 1, wherein the step of extracting knowledge related to the concept of interest comprises the following step: extracting relevant noun terms from the Chinese text corpus describing the concept of interest.
  9. The method of claim 1, wherein the step of discovering associated concepts of the concept of interest comprises the following step: extracting a list of hyperlinks from the Chinese text corpus of the concept of interest, wherein each hyperlinked Chinese text corpus represents a concept associated with the concept of interest.
  10. The method of claim 1, wherein the step of inferring the semantic relatedness of the concept of interest and its associated concepts by a measure of cosine similarity comprises the following steps:
    computing a term frequency weight vector V1 of the concept of interest;
    visiting the hyperlinks in the Chinese text corpus of the concept of interest to locate the associated concepts of the concept of interest;
    computing a term frequency weight vector for each associated concept, wherein the term frequency weight vector of each associated concept represents the unique semantics of that associated concept; and
    computing the cosine similarity between the term frequency weight vectors of the concept of interest and each associated concept.
  11. The method of claim 10, wherein the step of computing the term frequency weight vector V1 is implemented by the following equation:
    V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
    where tf(t1,c1) is the term frequency of the first relevant term in the Chinese text corpus of the concept of interest c1;
    tf(t2,c1) is the term frequency of the second relevant term in the Chinese text corpus of the concept of interest c1; and
    tf(tn,c1) is the term frequency of the n-th relevant term in the Chinese text corpus of the concept of interest c1.
  12. The method of claim 10, wherein the step of computing the term frequency weight vector of each associated concept is implemented by the following equation:
    V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
    where V2 is the term frequency weight vector of the associated concept c2;
    tf(t1,c2) is the term frequency of the first relevant term in the Chinese text corpus of the associated concept c2;
    tf(t2,c2) is the term frequency of the second relevant term in the Chinese text corpus of the associated concept c2; and
    tf(tn,c2) is the term frequency of the n-th relevant term in the Chinese text corpus of the associated concept c2.
  13. The method of claim 10, wherein the cosine similarity between the term frequency weight vectors of the concept of interest and each associated concept is computed by the following equation:
    sim(c1, c2) = cos(V1, V2) = (V1 · V2) / (‖V1‖ ‖V2‖)
    where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
  14. The method of claim 1, wherein the step of storing the inferred semantic relatedness data comprises:
    storing the semantic relatedness in a web ontology language; and
    indexing the semantic relatedness information.
  15. The method of claim 14, wherein the web ontology language is the Resource Description Framework ("RDF").
  16. The method of claim 14, wherein the step of indexing the semantic relatedness information comprises: building a concept map that includes the concept of interest, the associated concepts, the number of associated concepts and an RDF icon.
  17. The method of claim 1, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from a web-based Chinese encyclopedia.
  18. The method of claim 1, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from Baidu Baike or Chinese Wikipedia.
  19. A system for automatically generating a Chinese ontology library based on structured network knowledge, comprising:
    a web crawling module configured to crawl structured knowledge from a structured knowledge network;
    a noise filtering module configured to filter irrelevant links;
    a knowledge extraction module configured to extract knowledge related to a concept of interest from a Chinese text corpus;
    a database storing the Chinese text corpora downloaded from the structured network knowledge; and
    a relationship discovery module configured to extract the associated concepts of the concept of interest and to compute the semantic relatedness between the concept of interest and the associated concepts using a measure of cosine similarity.
  20. The system of claim 19, wherein the irrelevant links are irrelevant links to external web pages, irrelevant links in access menus that do not concern the knowledge of the concept of interest, and links that recur within the structured knowledge network.
  21. The system of claim 19, further comprising a visualization interface that displays a concept map, wherein the concept map includes the concept of interest, the associated concepts, the number of associated concepts and an RDF icon.
  22. The system of claim 21, wherein the number of associated concepts is the total number of associated concepts relating to the concept of interest.
  23. The system of claim 21, wherein the RDF icon allows a user to download the RDF triples of the concept of interest.
  24. The system of claim 19, wherein the semantic relatedness is encoded in RDF.
  25. A computer-readable medium encoded with instructions that, when executed by a processor, cause the processor to implement a method comprising the following steps:
    crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for automatically generating a Chinese ontology library;
    filtering irrelevant links;
    extracting knowledge related to the concept of interest;
    discovering associated concepts of the concept of interest;
    inferring the semantic relatedness of the concept of interest and its associated concepts by a measure of cosine similarity; and
    storing the inferred semantic relatedness data.
  26. The computer-readable medium of claim 25, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following steps:
    browsing the structured knowledge via the Hypertext Transfer Protocol ("HTTP");
    visiting classification pages of the structured knowledge using a breadth-first search algorithm until all linked Chinese text corpora have been visited; fetching at least one Chinese text corpus from the structured knowledge network, wherein the subject, abstract and content of the Chinese text corpus are determined by the HTML head, title and body tags presented on the static Hypertext Markup Language ("HTML") page containing the Chinese text corpus; and
    generating a link record for each fetched Chinese text corpus.
  27. The computer-readable medium of claim 26, the method further comprising the following step:
    generating a unique identifier for each fetched Chinese text corpus.
  28. The computer-readable medium of claim 27, the method further comprising the following step:
    storing a web address ("URL"), an identifier and/or a last modification time for each fetched Chinese text corpus.
  29. The computer-readable medium of claim 28, the method further comprising the following steps:
    scanning all fetched Chinese text corpora at a preset time interval;
    creating or updating Chinese text corpus records by checking whether a matching record with the same last modification time exists; and
    eliminating all duplicate Chinese text corpora.
  30. The computer-readable medium of claim 29, wherein the step of eliminating all duplicate Chinese text corpora comprises the following steps:
    keeping only one identifier for each Chinese text corpus; and
    converting all other different identifiers of the same Chinese text corpus into redirect identifiers.
  31. The computer-readable medium of claim 25, wherein the step of filtering irrelevant links comprises the following step:
    noise-filtering irrelevant links to external web pages, irrelevant links in access menus that do not concern the knowledge of the concept of interest, and links that recur within the structured knowledge network.
  32. The computer-readable medium of claim 25, wherein the step of extracting knowledge related to the concept of interest comprises the following step: extracting relevant noun terms from the Chinese text corpus describing the concept of interest.
  33. The computer-readable medium of claim 25, wherein the step of discovering associated concepts of the concept of interest comprises the following step: extracting a list of hyperlinks from the Chinese text corpus of the concept of interest, wherein each hyperlinked Chinese text corpus represents a concept associated with the concept of interest.
  34. The computer-readable medium of claim 25, wherein the step of inferring the semantic relatedness of the concept of interest and its associated concepts by a measure of cosine similarity comprises the following steps:
    computing a term frequency weight vector V1 of the concept of interest;
    visiting the hyperlinks in the Chinese text corpus of the concept of interest to locate the associated concepts of the concept of interest;
    computing a term frequency weight vector for each associated concept, wherein the term frequency weight vector of each associated concept represents the unique semantics of that associated concept; and
    computing the cosine similarity between the term frequency weight vectors of the concept of interest and each associated concept.
  35. The computer-readable medium of claim 34, wherein the step of computing the term frequency weight vector V1 is implemented by the following equation:
    V1 = (tf(t1,c1), tf(t2,c1), ..., tf(tn,c1))
    where tf(t1,c1) is the term frequency of the first relevant term in the Chinese text corpus of the concept of interest c1;
    tf(t2,c1) is the term frequency of the second relevant term in the Chinese text corpus of the concept of interest c1; and
    tf(tn,c1) is the term frequency of the n-th relevant term in the Chinese text corpus of the concept of interest c1.
  36. The computer-readable medium of claim 34, wherein the step of computing the term frequency weight vector of each associated concept is implemented by the following equation:
    V2 = (tf(t1,c2), tf(t2,c2), ..., tf(tn,c2))
    where V2 is the term frequency weight vector of the associated concept c2;
    tf(t1,c2) is the term frequency of the first relevant term in the Chinese text corpus of the associated concept c2;
    tf(t2,c2) is the term frequency of the second relevant term in the Chinese text corpus of the associated concept c2; and
    tf(tn,c2) is the term frequency of the n-th relevant term in the Chinese text corpus of the associated concept c2.
  37. The computer-readable medium of claim 34, wherein the cosine similarity between the term frequency weight vectors of the concept of interest and each associated concept is computed by the following equation:
    sim(c1, c2) = cos(V1, V2) = (V1 · V2) / (‖V1‖ ‖V2‖)
    where V1 and V2 are the term frequency weight vectors of the concept of interest c1 and the associated concept c2, respectively.
  38. The computer-readable medium of claim 25, wherein the step of storing the inferred semantic relatedness data comprises:
    storing the semantic relatedness in a web ontology language; and
    indexing the semantic relatedness information.
  39. The computer-readable medium of claim 38, wherein the web ontology language is the Resource Description Framework ("RDF").
  40. The computer-readable medium of claim 38, wherein the step of indexing the semantic relatedness information comprises: building a concept map that includes the concept of interest, the associated concepts, the number of associated concepts and an RDF icon.
  41. The computer-readable medium of claim 25, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from a web-based Chinese encyclopedia.
  42. The computer-readable medium of claim 25, wherein the step of crawling structured knowledge from the structured knowledge network comprises the following step: crawling structured knowledge from Baidu Baike or Chinese Wikipedia.
  43. A computer device comprising a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform a method for automatically generating a Chinese ontology library based on structured network knowledge, the method comprising the following steps:
    crawling structured knowledge from a structured knowledge network, wherein the structured knowledge includes at least one concept of interest for automatically generating the Chinese ontology library;
    filtering irrelevant links;
    extracting knowledge related to the concept of interest;
    discovering associated concepts of the concept of interest;
    inferring the semantic relatedness of the concept of interest and its associated concepts by a measure of cosine similarity; and
    storing the inferred semantic relatedness data.
Applications Claiming Priority (2)

HK16109078.8, filed 2016-07-29
HK16109078.8A (published as HK1220319A2), filed 2016-07-29: Method, system and computer-readable medium for automatic Chinese ontology library construction based on structured network knowledge




Also Published As

TW201804345A, published 2018-02-01
HK1220319A2, published 2017-04-28
CN109643315A, published 2019-04-16; granted as CN109643315B on 2024-05-07


Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number 17833592; country of ref document: EP; kind code: A1)
NENP: non-entry into the national phase (ref country code: DE)
122 EP: PCT application non-entry in European phase (ref document number 17833592; country of ref document: EP; kind code: A1)