CN108415900A - Visual text information discovery method and system based on a multi-level co-occurrence relationship word graph - Google Patents

Visual text information discovery method and system based on a multi-level co-occurrence relationship word graph

Info

Publication number
CN108415900A
Authority
CN
China
Prior art keywords
word
document
text
occurrence
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810112596.2A
Other languages
Chinese (zh)
Inventor
李鹏
王斌
郭莉
梅钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201810112596.2A
Publication of CN108415900A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a visual text information discovery method based on a multi-level co-occurrence relationship word graph, the steps of which include: extracting the text content of a document and segmenting the text content to obtain text fragments; segmenting the text fragments, extracting keywords, and labeling them with word category labels; constructing a multi-level co-occurrence relationship word graph according to the co-occurrence relationships of the keywords in the text fragments, in which the nodes correspond to keywords and the edges correspond to keyword co-occurrences; building a word-document inverted index for each keyword in the graph, used to retrieve the documents that contain the keyword; and obtaining visual text information through the co-occurrence relationship word graph. The present invention also provides a visual text information discovery system based on a multi-level co-occurrence relationship word graph, comprising a document preprocessing module, a keyword extraction module, a multi-level word graph construction module, a word-document index construction module, and a visual information discovery module.

Description

A visual text information discovery method and system based on a multi-level co-occurrence relationship word graph

Technical field

The invention belongs to the fields of text mining and natural language processing, and relates to a visual text information discovery method and system based on a multi-level co-occurrence relationship word graph.

Background

With the development of the Internet and the digitization of office work, text information is growing explosively, and more text is being generated than in any previous era. On the one hand, text contains a large amount of valuable information; on the other hand, the sheer volume of text significantly raises the cost of discovering useful information. For the vast majority of applications (such as publishing, industry research, and regulation), it is no longer feasible for users to read every document in a collected document set in order to find useful information. How to use computers to help mine valuable information from massive text (text mining) has become an important problem that urgently needs to be solved.

Text mining can be divided into two categories according to the characteristics of the target information. The first is text mining in which the useful information can be clearly defined, such as classification or search with a clear goal; existing computers can basically meet everyday needs here through matching computations. The second is text mining in which the useful information is hard to define clearly, such as scenarios with vague search requirements; existing approaches generally discover information in an "exploratory" manner. "Exploratory" information discovery relies on search underneath: the user enters query terms, manually reviews the search results, formulates the next query, and searches again, and this process repeats until the result is found. In "exploratory" information discovery, as the user's understanding of the results develops, the query used at the end is likely to be completely different from the initial one.

Current "exploratory" information discovery methods have three problems. First, manually screening search results is inefficient: manually browsing documents (search results) is very time-consuming, and the target information cannot be located quickly. Second, the process lacks a global view of the target document set, so users often lose track of "where they came from and where they are going" during discovery, and the state of an inspection cannot be restored or effectively reused in the next inspection. Third, documents that have already been inspected cannot be filtered out, which makes repeated inspection hard to avoid.

Summary of the invention

To overcome the above shortcomings of information discovery, the present invention proposes a visual text information discovery method and system based on a multi-level co-occurrence relationship word graph.

To solve the above technical problems, the present invention adopts the following technical solutions:

A visual text information discovery method based on a multi-level co-occurrence relationship word graph, as shown in Figure 1, the steps of which include:

extracting the text content of a document and segmenting the text content to obtain text fragments;

segmenting the text fragments, extracting keywords, and labeling them with word category labels;

analyzing the text fragments and constructing a multi-level co-occurrence relationship word graph according to the co-occurrence relationships of the keywords in the text fragments, where the nodes in the graph correspond to keywords and the edges correspond to keyword co-occurrences;

building a word-document inverted index for each keyword in the graph, used to retrieve the documents that contain the keyword;

obtaining visual text information through the co-occurrence relationship word graph.

Further, before the text content of a document is extracted, the format of the document is parsed.

Further, segmentation is performed using symbols, which include punctuation marks; or segmentation is performed using a fixed window, for which a window size and a step size are set, the window is moved from the beginning of the text to the end, and each text fragment enclosed by the window is output.

Further, the word category labels include part-of-speech labels, entity word labels, document core word labels, semantic role labels, and custom type labels.

Further, the entity word labels include compound entity words.

Further, for document core word labels, the method of finding document core words includes computing word weights with TF-IDF or TextRank, ranking the keywords by word weight, and taking the top-k ranked keywords as the document core words.
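The TF-IDF variant of this ranking can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `top_k_core_words`, the smoothed IDF formula, the alphabetical tie-breaking, and the sample token lists are all assumptions made for the example.

```python
import math
from collections import Counter


def top_k_core_words(docs, k):
    """Return the k highest-weighted words of each document.

    docs: list of token lists, one list per document (hypothetical input shape).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        if not doc:
            result.append([])
            continue
        tf = Counter(doc)
        # Smoothed IDF; the patent does not fix a formula, so this is an assumption.
        weights = {
            word: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[word])) + 1.0)
            for word, count in tf.items()
        }
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weights, key=lambda word: (-weights[word], word))
        result.append(ranked[:k])
    return result


# Made-up sample data, for illustration only.
docs = [
    ["graph", "graph", "index", "word"],
    ["index", "word", "search"],
]
core = top_k_core_words(docs, k=1)
```

Words that are frequent in one document but rare across the set rank highest, which is why "graph" and "search" win in the sample above. A TextRank-based ranking would replace only the weight computation.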

Further, the co-occurrence relationships of keywords include co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence within the whole document.

Further, a pair of keywords may exist only in the single co-occurrence relationship word graph that corresponds to its closest co-occurrence relationship; from closest to farthest, the co-occurrence relationships are co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence within the whole document.

Further, the method of obtaining visual text information through the co-occurrence relationship word graph is shown in Figure 2 and includes: online browsing of the global graph and local graphs, selective browsing and extended browsing of local graphs, switched display and side-by-side display of co-occurrence relationships, word graph browsing history, word node marking, and document marking.

Online browsing of the global graph and local graphs means the following. The global graph displays all words, which lets the user form an overview of the document set; a local graph displays the word nodes adjacent to a selected word node, which lets the user browse key regions of the document set. The graph shows different content for different co-occurrence windows. The global and local graph functions are implemented by having the display front end load, on demand, word graph information that was drawn offline.

Selective browsing and extended browsing of local graphs mean the following. Selective browsing includes running a full-text search over the words in the global graph, selecting a word of interest, and displaying the local graph centered on that word; it also includes selecting nodes in the graph by word type label. Extended browsing means the user can click a neighbor node in the local graph, and the local graph is automatically updated to the one centered on that neighbor node.

Switched display and side-by-side display of co-occurrence relationships mean the following. Switched display lets the user, with one word as the center, load different local graphs by selecting different co-occurrence levels (window sizes); side-by-side display lets the user, with one word as the center, show the local graphs at different co-occurrence levels next to each other. Both make it easy for the user to view a word's context flexibly and find relevant clues.

Word graph browsing history means the following. During extended browsing, the system records the nodes the user has clicked and the associated paths. The paths are saved in a graph structure, and the user can later load and search the historical paths, which helps in recalling and restoring the inspection state.

Word node marking and document marking mean the following. During browsing, the user can mark word nodes and the related documents. There are two kinds of marks: the first is a favorite mark, after which the user can focus later inspection on the marked nodes and related documents; the second is a deletion mark, after which the marked nodes and related documents are deleted from the document set and the corresponding multi-level co-occurrence relationship word graph is updated.

A visual text information discovery system based on a multi-level co-occurrence relationship word graph, as shown in Figure 3, comprising a document preprocessing module, a keyword extraction module, a multi-level word graph construction module, a word-document index construction module, and a visual information discovery module.

Document preprocessing module: the input of this module is a set of document files, and the output is a set of <document number, text fragment list> pairs. Processing each document file includes parsing the file format, extracting the text content, and segmenting the whole text according to predefined rules to obtain an ordered list of text fragments.

Keyword extraction module: this module takes the output of the document preprocessing module as input, numbers each text fragment, and further segments the text fragments to obtain a set of <word, word category> pairs. Word categories can be labeled with natural language processing tools or by user-defined processing.

Multi-level word graph construction module: this module takes the output of the keyword extraction module as input and constructs the multi-level co-occurrence relationship word graph. "Multi-level" means that different window sizes are used to examine word co-occurrence, which produces multiple co-occurrence relationship word graphs: for example, co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence in the same document.

Word-document index construction module: for each word in the word graph, this module builds a word-document inverted index used to retrieve the documents that contain the word.

Visual information discovery module: this module provides document browsing and discovery based on word categories and the word co-occurrence relationship word graph, provides document marking, and provides state saving for word graph traversal, so that information of interest can be browsed and discovered from multiple angles.

The method of the present invention performs visual information discovery on a given document set. It first uses natural language processing techniques to segment and filter the documents into a keyword set, then examines word co-occurrence with windows of different sizes to construct the multi-level co-occurrence relationship word graph, also called the word graph. The user discovers information visually by browsing the word graph. Visual information discovery supports searching for words in the word graph; selecting a word as the center and viewing related words through co-occurrence relationships; focused inspection of the documents that contain a selected word; deleting a word node, which deletes the related documents and updates the co-occurrence relationship word graph; and saving the paths along which the user traverses the word graph.

Screening information with the word graph can improve the efficiency of document screening, because the word graph in effect provides a summary of the document content. The co-occurrence relationships in the word graph make extended inspection easy; recording the user's traversal path through the word graph helps the user keep track of inspection progress; and marking word nodes for deletion reduces the number of documents to inspect later and avoids repeated inspection.

The method of the present invention is flexible and convenient. The size of the resulting text fragments can be adjusted through a user-defined window size, and different fragment sizes yield different word associations. Keywords can also be customized: which words are extracted, and their categories, can be determined by the discovery requirements.

Description of drawings

Figure 1 is a flowchart of a visual text information discovery method based on a multi-level co-occurrence relationship word graph.

Figure 2 is a schematic diagram of the visual text information discovery functions.

Figure 3 is a diagram of a visual text information discovery system based on a multi-level co-occurrence relationship word graph.

Figure 4 is a schematic diagram of document preprocessing and keyword extraction.

Figure 5 is a schematic diagram of the co-occurrence information used by the multi-level word graph construction module.

Figure 6 is the global graph of the one-window co-occurrence graph.

Figure 7 is the global graph of the two-window co-occurrence graph.

Figure 8 is a local graph of the one-window co-occurrence graph (centered on "Tang Dechuan").

Figure 9 is a local graph of the two-window co-occurrence graph (centered on "Tang Dechuan").

Figure 10 is a schematic diagram of extended browsing (the center word changes from "Tang Dechuan" to "profitable enterprise").

Detailed description

To make the above features and advantages of the present invention clearer and easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.

This embodiment provides a visual text information discovery method based on a multi-level co-occurrence relationship word graph and performs information discovery on a document set that contains two documents. As shown in Figure 1, the steps of the method include:

1. Document preprocessing:

For each document in the document set, output <document number, text fragment list>. The processing includes: (1) parsing the document format and extracting the valid text content; (2) segmenting the text content, where the resulting text fragments generally correspond to meaningful semantic units. Either of the following two methods can be used for segmentation: (a) segmentation by symbols, where the symbols are specified by the user and include common punctuation and layout marks such as periods, commas, line breaks, and paragraph indents; (b) segmentation by a fixed window, where two parameters, the window size and the step size, are set, the window is moved from the beginning of the document to the end, and each text fragment enclosed by the window is output.

In this example, the text content is segmented with method (a): the comma is chosen as the delimiter to split the documents into a set of sentences. The result of document preprocessing is shown in Figure 4.
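The two segmentation methods can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names `split_by_symbols` and `split_by_window`, the choice to measure the window in characters, and the handling of the final short window are not specified by the patent.

```python
def split_by_symbols(text, symbols):
    """Method (a): cut the text at every character contained in `symbols`."""
    fragments, current = [], []
    for ch in text:
        if ch in symbols:
            fragment = "".join(current).strip()
            if fragment:  # skip empty fragments between adjacent delimiters
                fragments.append(fragment)
            current = []
        else:
            current.append(ch)
    fragment = "".join(current).strip()
    if fragment:
        fragments.append(fragment)
    return fragments


def split_by_window(text, size, step):
    """Method (b): slide a window of `size` characters forward by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    fragments = []
    start = 0
    while start < len(text):
        fragments.append(text[start:start + size])
        if start + size >= len(text):  # the window has reached the end of the text
            break
        start += step
    return fragments


# Made-up inputs, for illustration only.
sentence_fragments = split_by_symbols("first clause, second clause, third clause", {","})
window_fragments = split_by_window("abcdefghi", size=4, step=2)
```

With a step smaller than the window size, consecutive fragments overlap, so words near a fragment boundary still co-occur with their neighbors in at least one window.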

2. Keyword extraction:

For each text fragment of each document, this step numbers the text fragment and segments it to obtain a list of <word, word category> pairs. The word category labels are determined by the user according to need and can be extracted with suitable natural language processing toolkits. Commonly used word category labels include: (a) part-of-speech labels, such as noun and verb; (b) entity word labels, such as time, place, person name, and organization name; entities also include compound entities, that is, new entities referred to by a combination of several words, such as "group commendation meeting", where "group" and "commendation meeting" are each entity words and their combination refers to a new entity; (c) document core word labels, which can be obtained by computing word weights with TF-IDF or TextRank, ranking the words by weight, and taking the top-k ranked words as core words; (d) semantic role labels (Semantic Role Labeling), such as beneficiary, condition, purpose, and cause; (e) custom types, which can be obtained by post-processing the results of syntactic parsing, such as the subject, predicate, and object produced by OpenIE.

In this example, the word category labels "noun, compound entity, person name, place name, organization name" are retained, and information discovery is performed on the documents based on words in these categories. The result of keyword extraction is shown in Figure 4. For example, from the sentence "when Tang Dechuan praised the South District at the group commendation meeting", extraction yields a sequence of three words with their categories: "Tang Dechuan/person name", "group commendation meeting/compound entity", and "South District/place name".

3. Construction of the multi-level word graph (that is, the co-occurrence relationship word graph):

The nodes of the word graph are the words output by step 2, and the edges are determined by the co-occurrence relationships of the words. "Multi-level" means that different window sizes are used to examine word co-occurrence, which produces multiple co-occurrence relationship word graphs: for example, co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence within the whole document.

A given pair of words is required to appear in only a single word graph, namely the co-occurrence relationship word graph corresponding to the smallest window in which the pair occurs. Edges obtained through co-occurrence can also be filtered out, with the filtering rules determined by the user as needed.

In this example, two levels of co-occurrence are used: co-occurrence in the same window and co-occurrence in two adjacent windows, where the window unit is the sentence. The word graphs generated are called the "one-window co-occurrence graph" and the "two-window co-occurrence graph" respectively. The resulting word co-occurrence combinations are shown in Figure 5 and appear as edges in the word graph. Specifically, taking co-occurrence in the same window as an example, the three words ["Tang Dechuan/person name", "group commendation meeting/compound entity", "South District/place name"] appear in the same sentence, so the edges contributed by this sentence are the pairwise combinations of the three words: <Tang Dechuan, group commendation meeting>, <Tang Dechuan, South District>, and <group commendation meeting, South District>.

Taking co-occurrence in two adjacent windows as an example, the words in word list 1 ["Tang Dechuan/person name", "group commendation meeting/compound entity", "South District/place name"] and the words in word list 2 ["South District/place name", "South District representative to the group/compound entity"] co-occur within a range of two windows, so pairing each word in word list 1 with each word in word list 2 gives the edges of the two-window co-occurrence graph. Note that <Tang Dechuan, South District> and <group commendation meeting, South District> already appear in the "one-window co-occurrence graph"; by the rule that a given pair of words may appear in only a single word graph, these two edges are deleted from the "two-window co-occurrence graph".
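The construction described in this step can be sketched as follows, using the two word lists from the example. This is a minimal sketch under assumptions: the function name `build_multilevel_graphs`, the representation of an undirected edge as a `frozenset`, and the input shape (a list of documents, each a list of per-fragment keyword lists) are illustrative choices, not the patent's data structures.

```python
from itertools import combinations


def build_multilevel_graphs(docs, max_window=2):
    """Return {window size: set of edges}; each word pair is kept only at its smallest window.

    docs: list of documents; each document is an ordered list of keyword lists,
    one keyword list per text fragment.
    """
    closest = {}  # edge -> smallest window size in which the pair co-occurs
    for fragments in docs:
        for i in range(len(fragments)):
            for window in range(1, max_window + 1):
                j = i + window - 1  # last fragment covered by a window starting at i
                if j >= len(fragments):
                    break
                if window == 1:
                    # Pairs of distinct words inside the same fragment.
                    pairs = combinations(sorted(set(fragments[i])), 2)
                else:
                    # Pairs spanning the first and last fragment of the window.
                    pairs = ((a, b) for a in fragments[i] for b in fragments[j] if a != b)
                for a, b in pairs:
                    edge = frozenset((a, b))
                    if window < closest.get(edge, max_window + 1):
                        closest[edge] = window
    graphs = {window: set() for window in range(1, max_window + 1)}
    for edge, window in closest.items():
        graphs[window].add(edge)
    return graphs


# The two word lists from the example, treated as adjacent sentences of one document.
word_list_1 = ["Tang Dechuan", "group commendation meeting", "South District"]
word_list_2 = ["South District", "South District representative to the group"]
graphs = build_multilevel_graphs([[word_list_1, word_list_2]], max_window=2)
```

For these inputs the two-window graph keeps only the edges from "Tang Dechuan" and "group commendation meeting" to "South District representative to the group"; the pairs <Tang Dechuan, South District> and <group commendation meeting, South District> stay in the one-window graph, as the text requires. A document-level graph could be added as one more level in the same way.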

4. Word-document index construction:

For each word in the word graph, a word-document inverted index is built and used to retrieve the documents that contain the word.
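A word-document inverted index of this kind can be sketched with a dictionary of sets. The function name `build_inverted_index`, the input shape, and the contents of the two sample documents are assumptions made for illustration; the actual documents of the embodiment appear only in Figure 4.

```python
from collections import defaultdict


def build_inverted_index(doc_keywords):
    """Map each keyword to the sorted list of document numbers that contain it.

    doc_keywords: {document number: iterable of keywords} (hypothetical input shape).
    """
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for word in keywords:
            index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# Made-up document contents, for illustration only.
index = build_inverted_index({
    1: ["Tang Dechuan", "South District"],
    2: ["South District", "profitable enterprise"],
})
documents_with_word = index.get("South District", [])
```

Looking up a clicked word node then costs one dictionary access, which suits the on-demand loading used by the browsing functions.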

Steps 1-4 generate the data structures of the multi-level co-occurrence relationship word graph and the inverted index; subsequent visual information discovery is carried out by looking up and loading these data structures on demand.

5. Visual information discovery, whose core functions include:

1) Online browsing of the global graph and local graphs.

The global graph displays the associations among all words, which lets the user form an overview of the document set. Figure 6 shows the global graph of the one-window co-occurrence graph, and Figure 7 shows the global graph of the two-window co-occurrence graph. A local graph displays the word nodes adjacent to a selected word node, which lets the user browse key regions of the document set. Figure 8 shows a local graph of the one-window co-occurrence graph.

The graph shows different content for co-occurrence windows of different sizes. The global and local graph functions are implemented by having the display front end load, on demand, word graph information that was drawn offline.

2) Selective browsing and extended browsing of local graphs.

Selective browsing includes running a full-text search over the words in the global graph, selecting a word of interest, and displaying the local graph centered on that word; it also includes selecting nodes in the graph by word type label. Extended browsing means the user can click a neighbor node in the local graph, and the local graph is automatically updated to the one centered on that neighbor node.

Figure 10 gives an example of extended browsing. The user clicks "Tang Dechuan" to display the local graph centered on "Tang Dechuan", in which only four neighbor nodes are highlighted; the user then clicks the neighbor node "profitable enterprise" to display the local graph centered on "profitable enterprise".
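Local graphs and extended browsing reduce to neighbor lookups on the edge set. The sketch below assumes edges stored as two-element `frozenset` objects; the function names `local_graph` and `expand` and the sample edges are hypothetical, since the actual neighbors shown in Figure 10 are not listed in the text.

```python
def local_graph(edges, center):
    """Return the sorted neighbors of `center`, i.e. the nodes of its local graph."""
    neighbors = set()
    for edge in edges:
        if center in edge:
            neighbors.update(edge - {center})
    return sorted(neighbors)


def expand(edges, center, clicked):
    """Extended browsing: re-center the local graph on a clicked neighbor node."""
    if clicked not in local_graph(edges, center):
        raise ValueError(f"{clicked!r} is not a neighbor of {center!r}")
    return clicked, local_graph(edges, clicked)


# Made-up edges, for illustration only.
edges = {
    frozenset(("Tang Dechuan", "profitable enterprise")),
    frozenset(("Tang Dechuan", "South District")),
    frozenset(("profitable enterprise", "annual report")),
}
first_view = local_graph(edges, "Tang Dechuan")
new_center, second_view = expand(edges, "Tang Dechuan", "profitable enterprise")
```

Running the same lookup against the edge set of a different co-occurrence level gives the local graph for that level, which is all that switched and side-by-side display need.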

3) Switched display and side-by-side display of co-occurrence relationships.

Switched display lets the user, with one word as the center, load different local graphs by selecting different co-occurrence levels (window sizes) while the position of the center word stays unchanged. Side-by-side display lets the user, with one word as the center, show the local graphs at different co-occurrence levels next to each other. Both make it easy for the user to view a word's context flexibly and find relevant clues.

Figures 8 and 9 show the words that co-occur with the center word "Tang Dechuan": Figure 8 is the one-window local graph, and Figure 9 is the two-window local graph. Switched display fixes the position of the word "Tang Dechuan" and switches between Figure 8 and Figure 9; side-by-side display shows the local graphs of multiple levels at the same time.

4) Word graph browsing history. When users click words in the word graph to inspect the related documents closely, they often use the extended browsing function of function 2. During browsing, the system records the nodes the user has clicked and the associated paths. The paths are saved in a tree structure, and the user can load and search the historical paths, which helps the user recall and restore the inspection state.

For Figure 10, the "Tang Dechuan" and "profitable enterprise" nodes that the user clicked are saved.
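One way to keep such a history is a tree whose nodes are clicked words and whose root-to-node paths are the browsing paths. The class `HistoryNode` and the function `find_paths` below are hypothetical names for a minimal sketch; the patent states only that paths are stored in a tree structure and can be loaded and searched.

```python
class HistoryNode:
    """One clicked word in the browsing history tree."""

    def __init__(self, word, parent=None):
        self.word = word
        self.parent = parent
        self.children = []

    def click(self, word):
        """Record a click on a neighbor node and return the new history node."""
        child = HistoryNode(word, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Return the words from the root of the history down to this node."""
        words = []
        node = self
        while node is not None:
            words.append(node.word)
            node = node.parent
        return words[::-1]


def find_paths(root, word):
    """Search the history for every recorded path that ends at `word`."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.word == word:
            found.append(node.path())
        stack.extend(node.children)
    return found


# The clicks of Figure 10: "Tang Dechuan", then "profitable enterprise".
root = HistoryNode("Tang Dechuan")
current = root.click("profitable enterprise")
restored_path = current.path()
```

Going back to an earlier node and clicking a different neighbor creates a sibling branch, so alternative exploration routes are preserved rather than overwritten.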

5) Word node marking and document marking.

During browsing, the user can mark word nodes and their related documents. There are two kinds of marks:

The first is the favorite mark: the user can later examine marked nodes and their related documents closely;

The second is the delete mark: marked nodes and their related documents are removed from the document set, and the corresponding multi-level co-occurrence word graph is updated accordingly.
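The effect of a delete mark can be sketched as follows, under simplifying assumptions: each document is a list of text fragments, each fragment is a list of keywords, and the graph is rebuilt from scratch using same-fragment co-occurrence only. The function names and data layout are hypothetical, and a real system might update the graph incrementally instead.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Docs = Dict[str, List[List[str]]]  # doc_id -> fragments -> keywords


def build_graph(docs: Docs) -> Set[Tuple[str, str]]:
    """Edges between keywords that share a text fragment (one level only)."""
    edges: Set[Tuple[str, str]] = set()
    for fragments in docs.values():
        for fragment in fragments:
            for a, b in combinations(sorted(set(fragment)), 2):
                edges.add((a, b))
    return edges


def apply_delete_mark(docs: Docs, word: str) -> Tuple[Docs, Set[Tuple[str, str]]]:
    """Drop every document containing `word`, then rebuild the graph."""
    kept = {
        doc_id: fragments
        for doc_id, fragments in docs.items()
        if not any(word in fragment for fragment in fragments)
    }
    return kept, build_graph(kept)


docs: Docs = {"d1": [["a", "b"]], "d2": [["b", "c"]]}
kept_docs, edges = apply_delete_mark(docs, "a")
```

Because the node `a` appears only in the removed document, it disappears from the rebuilt graph together with its edges.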

This embodiment also provides a visual text information discovery system based on a multi-level co-occurrence word graph, which implements the above method. As shown in Figure 3, it comprises a document preprocessing module, a keyword extraction module, a multi-level word graph construction module, a word-document index construction module, and a visual information discovery module.
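As a rough illustration of what the word graph and index modules produce, the sketch below assigns each keyword pair to its nearest co-occurrence level and builds a word-document inverted index. It assumes documents are already segmented into fragments of keywords, numbers the levels 1 (same fragment), 2 (within `n` adjacent fragments), and 3 (same document), and treats "within `n` adjacent fragments" as a fragment gap smaller than `n`. These choices and the function name are assumptions, not details given in the patent.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_multilevel(
    docs: Dict[str, List[List[str]]], n: int = 2
) -> Tuple[Dict[Tuple[str, str], int], Dict[str, Set[str]]]:
    """Return (pair -> nearest level, word -> ids of documents containing it)."""
    levels: Dict[Tuple[str, str], int] = {}
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, fragments in docs.items():
        # Fragment positions of every keyword inside this document.
        positions: Dict[str, Set[int]] = defaultdict(set)
        for i, fragment in enumerate(fragments):
            for word in fragment:
                positions[word].add(i)
                index[word].add(doc_id)
        for a, b in combinations(sorted(positions), 2):
            gap = min(abs(i - j) for i in positions[a] for j in positions[b])
            level = 1 if gap == 0 else 2 if gap < n else 3
            # Keep each pair only at its nearest level across all documents.
            levels[(a, b)] = min(level, levels.get((a, b), level))
    return levels, dict(index)


levels, index = build_multilevel({"d1": [["a", "b"], ["c"], ["d"]]}, n=2)
```

Looking up `index[word]` then gives the documents to show when the user clicks that word's node in the graph.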

The above embodiments only illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention is defined by the claims.

Claims (10)

1. A visual text information discovery method based on a multi-level co-occurrence word graph, comprising the steps of:
extracting the text content of a document and segmenting the text content to obtain text fragments;
segmenting the text fragments, extracting keywords, and labeling each keyword with a word class label;
constructing a multi-level co-occurrence word graph according to the co-occurrence relations of the keywords in the text fragments, wherein the nodes of the graph correspond to keywords and the edges of the graph correspond to keyword co-occurrences;
building a word-document inverted index for each keyword in the graph, so as to retrieve the documents containing the keyword;
obtaining visual text information through the co-occurrence word graph.
2. The method according to claim 1, characterized in that, before the text content of the document is extracted, the format of the document is first parsed.
3. The method according to claim 1, characterized in that the text content and the text fragments are segmented using symbols or a fixed window, wherein the symbols include punctuation marks and the fixed window moves from the beginning of the text to its end.
4. The method according to claim 1, characterized in that the word class labels include part-of-speech labels, entity word labels, document core word labels, semantic role labels, and custom type labels.
5. The method according to claim 4, characterized in that the entity word labels include compound entity words.
6. The method according to claim 4, characterized in that, for the document core word label, the method of finding document core words comprises computing word weights using TF-IDF or TextRank, ranking the keywords by word weight, and taking the top-k highest-ranked keywords as the document core words.
7. The method according to claim 1, characterized in that the co-occurrence relations of keywords include co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence within the whole document.
8. The method according to claim 7, characterized in that, for a pair of keywords, their co-occurrence relation exists only in the single co-occurrence word graph of the nearest level, the co-occurrence relations being ordered from nearest to farthest as: co-occurrence in the same text fragment, co-occurrence within N adjacent text fragments, and co-occurrence within the whole document.
9. The method according to claim 1, characterized in that the ways of obtaining visual text information through the co-occurrence word graph include: online browsing of the global graph and local graphs, selective browsing and extended browsing of local graphs, switched display and side-by-side display of co-occurrence relations, word graph browsing history, and word node marking and document marking.
10. A visual text information discovery system based on a multi-level co-occurrence word graph, comprising:
a document preprocessing module, for parsing the format of a document, extracting its text content, and segmenting the text content to obtain an ordered list of text fragments;
a keyword extraction module, for numbering each text fragment and further segmenting the text fragments to obtain a set of <word, word class> pairs;
a multi-level word graph construction module, for constructing a multi-level co-occurrence word graph according to the co-occurrence relations of the keywords in the text fragments;
a word-document index construction module, for building a word-document inverted index to retrieve the documents containing a keyword;
a visual information discovery module, for implementing document browsing, marking, and state saving functions based on the co-occurrence word graph.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810112596.2A CN108415900A (en) 2018-02-05 2018-02-05 Visual text information discovery method and system based on a multi-level co-occurrence word graph

Publications (1)

Publication Number Publication Date
CN108415900A (en) 2018-08-17

Family

ID=63127814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810112596.2A Pending CN108415900A (en) 2018-02-05 2018-02-05 Visual text information discovery method and system based on a multi-level co-occurrence word graph

Country Status (1)

Country Link
CN (1) CN108415900A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180817