CN118377912A - Electronic manual processing method, interactive system, electronic device and readable storage medium - Google Patents
- Publication number
- CN118377912A (application CN202410841066.7A)
- Authority
- CN
- China
- Prior art keywords
- text
- chapter
- title
- article
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of computer application technology, and in particular to an electronic manual processing method, an interactive system, an electronic device, and a readable storage medium.
Background Art
An Interactive Electronic Technical Manual (IETM) is a digital interactive system for technical manuals, designed for equipment, devices, or large-scale engineering systems. It addresses the pressing need to digitize the massive volume of technical documents that accompany such equipment, can effectively improve the efficiency of system operation and maintenance support, and is an important component of, and a key technical means for, equipment support.
As the volume of data grows, it is usually difficult to query an electronic manual quickly and accurately. Keywords can be extracted from articles by means of manually created rules, which makes keyword-based queries possible; however, querying by keywords alone cannot cope with the complex linguistic logic of articles, and therefore cannot deliver efficient interaction.
In summary, how to effectively solve problems such as document processing in interactive electronic manuals is a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention
The purpose of the present invention is to provide an electronic manual processing method, an interactive system, an electronic device, and a readable storage medium that extract text association information from technical documents and, based on that information, respond effectively to operation behaviors of an interactive electronic manual client.
To solve the above technical problem, the present invention provides the following technical solutions:
An electronic manual processing method, comprising:
acquiring a technical document, and decomposing the technical document into chapters to obtain chapter texts;
extracting a text feature vector of each chapter text, and classifying the chapter text by article content based on the text feature vector to obtain an article classification;
performing topic extraction on the chapter texts of different classes to obtain text topics;
performing keyword extraction on the chapter text to obtain article keywords;
storing the article classification, the text topics, and the article keywords as text association information of the corresponding chapter text;
in response to an operation behavior of an interactive electronic manual client, outputting, based on the text association information, a target chapter text associated with the operation behavior.
Preferably, decomposing the technical document into chapters to obtain chapter texts comprises:
acquiring the technical document, and parsing the technical document to obtain article titles and title levels;
decomposing the technical document into chapters based on the article titles and the title levels to obtain the chapter texts.
Preferably, performing keyword extraction on the chapter text to obtain article keywords comprises:
performing word segmentation on the chapter text to obtain the words that make up the chapter text;
calculating the term frequency and the inverse document frequency of each word, and taking the product of the term frequency and the inverse document frequency as the word score;
selecting the article keywords from the words according to the word scores.
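The keyword-scoring steps above can be illustrated with a minimal Python sketch. It assumes the chapter texts have already been segmented into word lists. The function name `tfidf_keywords`, the parameter `top_k`, and the particular inverse document frequency form log(N / n_w) are illustrative assumptions; the source states only that the word score is the product of term frequency and inverse document frequency.

```python
import math
from collections import Counter

def tfidf_keywords(chapters, top_k=3):
    """Return the top_k highest-scoring words for each tokenized chapter.

    chapters: list of word lists, one per chapter text (already segmented).
    The IDF variant log(N / n_w) is an assumption; the source does not fix one.
    """
    n_docs = len(chapters)
    # Document frequency: number of chapters in which each word occurs.
    doc_freq = Counter(word for words in chapters for word in set(words))
    results = []
    for words in chapters:
        counts = Counter(words)
        total = len(words)
        # Word score = term frequency * inverse document frequency.
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results

# Hypothetical, already-segmented chapter texts.
chapters = [
    ["pump", "pressure", "valve", "pump"],
    ["valve", "inspection", "schedule"],
    ["pump", "valve", "replacement"],
]
keywords = tfidf_keywords(chapters, top_k=2)
```

In this toy corpus the word "valve" occurs in every chapter, so its inverse document frequency is zero and it is never selected, which is the behavior the scoring is meant to produce for words that do not distinguish one chapter from another.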
Preferably, performing topic extraction on the chapter texts of different classes to obtain text topics comprises:
taking the chapter texts of different classes as a text group;
training topics for the text group using a topic model algorithm;
after convergence, determining article topic similarity using a machine learning text similarity algorithm;
determining the text topics from the trained topics based on the article topic similarity.
Preferably, classifying the chapter text by article content based on the text feature vector to obtain an article classification comprises:
using the naive Bayes algorithm, which is based on Bayes' theorem, to classify by calculating the probability that the features in the text feature vector occur given a class, thereby obtaining the article classification.
Preferably, acquiring the technical document comprises:
receiving the technical document through a designated interface;
obtaining title format information of the technical document through the designated interface, the title format information including title levels.
Preferably, obtaining the title format information of the technical document through the designated interface comprises:
while receiving the technical document through the designated interface, creating chapter objects corresponding to the technical document, wherein each chapter object stores chapter title content, chapter text content, and a sub-chapter list;
iterating over each paragraph of text in the technical document, and using the title format information to determine whether the current text is a title;
if not, determining that the current text is body text, and writing the current text into the chapter text content of the most recently generated title object;
if so: when the current text is at the same level as, or at a higher level than, the most recently generated title object, determining that the text content of the most recently generated title object is complete, generating a new title object, and writing the current text into its chapter title content; when the current text is a lower-level title of the most recently generated title object, generating a new title object, writing the current text into its chapter title content, and adding the new title object to the sub-chapter list of the most recently generated title object;
using the chapter objects, converting the technical document into a JSON-format array of objects organized according to the document structure, wherein each value in the array corresponds to one chapter object of the technical document, and a chapter object that has sub-chapters holds a list of sub-chapter objects as an attribute.
An interactive system, comprising:
an interactive electronic manual client, a file management server, and a file parsing server;
wherein the file parsing server is configured to: acquire a technical document and decompose it into chapters to obtain chapter texts; extract a text feature vector of each chapter text and classify the chapter text by article content based on the text feature vector to obtain an article classification; perform topic extraction on the chapter texts of different classes to obtain text topics; perform keyword extraction on the chapter text to obtain article keywords; take the article classification, the text topics, and the article keywords as text association information of the corresponding chapter text; and send the text association information and the technical document to the file management server;
the file management server is configured to receive and store the text association information and the technical document, and, in response to an operation behavior of the interactive electronic manual client, to output, based on the text association information, the target chapter text associated with the operation behavior;
the interactive electronic manual client is configured to provide an operation interface, interact with the file management server, and output the target chapter text returned by the file management server.
An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the above electronic manual processing method when executing the computer program.
A readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above electronic manual processing method.
The method provided by the embodiments of the present invention comprises: acquiring a technical document and decomposing it into chapters to obtain chapter texts; extracting a text feature vector of each chapter text and classifying the chapter text by article content based on the text feature vector to obtain an article classification; performing topic extraction on the chapter texts of different classes to obtain text topics; performing keyword extraction on the chapter text to obtain article keywords; storing the article classification, the text topics, and the article keywords as text association information of the corresponding chapter text; and, in response to an operation behavior of an interactive electronic manual client, outputting, based on the text association information, the target chapter text associated with the operation behavior.
In the present invention, so that each part of a technical document can be processed effectively, the complete technical document is first decomposed into chapters to obtain chapter texts. Classifying the chapter texts yields article classifications; extracting keywords from the chapter texts yields their keywords; and extracting topics from chapter texts of different classes yields text topics. The article classification, text topics, and article keywords of each chapter text are stored as that chapter text's text association information. When a user operates the interactive electronic manual client, the system responds to the operation behavior by outputting, based on the text association information, the target chapter text associated with it. Because the text association information is stored at the granularity of chapter texts and comprises the article classification, text topics, and article keywords, the present invention can provide more precise output during interaction than an approach that responds at the granularity of whole documents using keywords alone.
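As a rough illustration of chapter-level storage and lookup, the following Python sketch keeps one record of text association information per chapter text and ranks chapters by how many query terms match that information. The record layout (`ChapterRecord`), the function `find_target_chapters`, and the match-count ranking are all assumptions made for illustration; the source does not specify how an operation behavior is matched against the stored information.

```python
from dataclasses import dataclass, field

@dataclass
class ChapterRecord:
    """Text association information stored for one chapter text (hypothetical layout)."""
    chapter_id: str
    text: str
    category: str
    topics: set = field(default_factory=set)
    keywords: set = field(default_factory=set)

def find_target_chapters(records, query_terms):
    """Rank chapter ids by the number of query terms found in their association info."""
    terms = {term.lower() for term in query_terms}
    scored = []
    for rec in records:
        labels = {rec.category.lower()}
        labels |= {topic.lower() for topic in rec.topics}
        labels |= {keyword.lower() for keyword in rec.keywords}
        hits = len(terms & labels)
        if hits:
            scored.append((hits, rec.chapter_id))
    # Most matches first; ties are ordered by chapter id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chapter_id for _, chapter_id in scored]

# Hypothetical stored records.
records = [
    ChapterRecord("1.1", "Replace the pump seal ...", "maintenance", {"hydraulics"}, {"pump", "seal"}),
    ChapterRecord("2.3", "Start the pump ...", "operation", {"hydraulics"}, {"pump", "startup"}),
    ChapterRecord("3.2", "Replace the fuse ...", "maintenance", {"electrical"}, {"fuse"}),
]
result = find_target_chapters(records, ["pump", "maintenance"])
```

Because the class, topics, and keywords are all consulted, a query can reach a chapter through any of the three, which is the advantage the paragraph above attributes to chapter-level association information over keyword-only matching on whole documents.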
Technical effect: the technical document is decomposed chapter by chapter, and key information is extracted from, and classification is performed on, each chapter by combining different algorithms, yielding text association information that comprises the article classification, text topics, and article keywords. This text association information can improve the query and interaction efficiency of the electronic manual.
Correspondingly, the embodiments of the present invention also provide a system, a device, and a readable storage medium corresponding to the above artificial-intelligence-based electronic manual processing method. They have the technical effects described above, which are not repeated here.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the related art more clearly, the drawings used in describing the embodiments or the related art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an electronic manual processing method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system in an embodiment of the present invention;
FIG. 3 is a detailed implementation flowchart of an electronic manual processing method in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device in an embodiment of the present invention;
FIG. 5 is a detailed schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Please refer to FIG. 1, a flowchart of an electronic manual processing method in an embodiment of the present invention. The method can be applied to the interactive system (a digital interactive system for technical manuals) shown in FIG. 2, and comprises the following steps:
S101. Acquire a technical document, and decompose the technical document into chapters to obtain chapter texts.
A digital interactive system for technical manuals holds a large number of technical documents. To enable efficient interaction based on these documents, this embodiment proposes splitting each technical document and processing it in units of chapter texts.
Specifically, the technical document may be obtained by receiving it, downloading it, or reading it directly from a storage medium. This embodiment does not limit the channel through which the technical document is obtained.
In practice, the technical document may be a technical manual for a device or system, or another technical document. This embodiment limits neither the specific content recorded in the technical document nor the way its chapters are divided.
After the technical document is obtained, it can be split into chapters using information in the document such as the title format, yielding several chapter texts of the technical document.
In one embodiment of the present invention, decomposing the technical document into chapters to obtain chapter texts comprises:
acquiring the technical document, and parsing the technical document to obtain article titles and title levels;
decomposing the technical document into chapters based on the article titles and the title levels to obtain the chapter texts.
For ease of description, the above steps are explained together below.
After the technical document is obtained, it can be parsed to obtain the article titles and title levels. The technical document can then be split into chapters on the basis of those titles and levels, yielding the text of each chapter.
In one embodiment of the present invention, acquiring the technical document comprises:
receiving the technical document through a designated interface;
obtaining title format information of the technical document through the designated interface, the title format information including title levels.
For ease of description, the above two steps are explained together below.
Specifically, the designated interface may be Apache POI or another interface capable of uploading documents and obtaining their title format information. Apache POI (Poor Obfuscation Implementation) is a free, open-source, cross-platform API written in Java that gives Java programs the ability to read and write files in Microsoft Office formats.
In other words, Apache POI can be used to extract information from Word documents: technical material is uploaded as Word documents in a unified format, and all title format information in each Word document is stored, with title levels distinguished at storage time.
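The idea of deriving title levels from title format information can be sketched in Python as follows. The sketch does not call Apache POI; it assumes that some document interface has already produced `(style_name, text)` pairs for each paragraph and that titles carry style names of the form "Heading N". Both the input shape and the helper name `heading_level` are assumptions made for illustration.

```python
import re

# Assumed convention: title paragraphs use style names such as "Heading 1".
_HEADING_STYLE = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)

def heading_level(style_name):
    """Return the title level encoded in a style name, or None for body text."""
    if not style_name:
        return None
    match = _HEADING_STYLE.match(style_name.strip())
    return int(match.group(1)) if match else None

# Hypothetical paragraphs as (style name, text) pairs.
paragraphs = [
    ("Heading 1", "1 Overview"),
    ("Normal", "This manual describes ..."),
    ("Heading 2", "1.1 Scope"),
]

# Stored title format information: each title together with its level.
title_format_info = [
    (text, heading_level(style))
    for style, text in paragraphs
    if heading_level(style) is not None
]
```

Keeping the level next to each title is what later allows the document to be split into a chapter hierarchy rather than a flat list of sections.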
In one embodiment of the present invention, obtaining the title format information of the technical document comprises:
while receiving the technical document through the designated interface, creating chapter objects corresponding to the technical document, wherein each chapter object stores chapter title content, chapter text content, and a sub-chapter list;
iterating over each paragraph of text in the technical document, and using the title format information to determine whether the current text is a title;
if not, determining that the current text is body text, and writing the current text into the chapter text content of the most recently generated title object;
if so: when the current text is at the same level as, or at a higher level than, the most recently generated title object, determining that the text content of the most recently generated title object is complete, generating a new title object, and writing the current text into its chapter title content; when the current text is a lower-level title of the most recently generated title object, generating a new title object, writing the current text into its chapter title content, and adding the new title object to the sub-chapter list of the most recently generated title object;
using the chapter objects, converting the technical document into a JSON-format array of objects organized according to the document structure, wherein each value in the array corresponds to one chapter object of the technical document, and a chapter object that has sub-chapters holds a list of sub-chapter objects as an attribute.
For ease of description, the above steps are explained together below.
Specifically, a chapter object can be defined to store the chapter title content, the chapter text content, and the sub-chapter list.
Each paragraph of text in the document is visited in turn, and its format is used to decide whether it is a title. If it is not a title, it is treated as body text and appended to the text content of the most recently generated title object. If it is a title at the same level as, or at a higher level than, the most recently generated title object, the text content of that title object is regarded as complete and a new title object is generated. If it is a lower-level title of the most recently generated title object, a new title object is generated and recorded as a sub-title of that title object.
In practice, the whole document can also be converted, according to the defined chapter objects, into an array of JSON (JavaScript Object Notation, a lightweight data interchange format) objects organized by document structure. Each value in the array corresponds to one chapter object of the article; a chapter object that has sub-chapters holds a list of sub-chapter objects as an attribute, in which all of its sub-chapter objects are stored.
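The traversal described above can be sketched in Python with a stack of currently open chapters. The input is assumed to be a list of `(level, text)` pairs in which `level` is an integer for titles (1 being the top level) and `None` for body text. The dictionary keys `title`, `text`, and `children`, and the function name `build_chapters`, are illustrative assumptions rather than names taken from the source.

```python
import json

def build_chapters(paragraphs):
    """Convert (level, text) pairs into a nested list of chapter objects.

    level is an int for titles (1 = top level) and None for body text.
    Body text that appears before the first title is ignored in this sketch.
    """
    roots = []   # top-level chapter objects, in document order
    stack = []   # (level, chapter) pairs for the chapters that are still open
    for level, text in paragraphs:
        if level is None:
            # Body text is written into the most recently generated chapter object.
            if stack:
                stack[-1][1]["text"].append(text)
            continue
        chapter = {"title": text, "text": [], "children": []}
        # A same-level or higher-level title completes the open chapters
        # at that depth and deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            # A lower-level title is added to its parent's sub-chapter list.
            stack[-1][1]["children"].append(chapter)
        else:
            roots.append(chapter)
        stack.append((level, chapter))
    return roots

# Hypothetical document outline.
paragraphs = [
    (1, "1 Overview"), (None, "Intro."),
    (2, "1.1 Scope"), (None, "Scope text."),
    (2, "1.2 Terms"),
    (1, "2 Operation"), (None, "Steps."),
]
doc = build_chapters(paragraphs)
as_json = json.dumps(doc, ensure_ascii=False, indent=2)
```

For simplicity every chapter object here carries a `children` list, which is empty when the chapter has no sub-chapters; an implementation closer to the text could omit the attribute in that case.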
To improve text quality, after the technical document has been decomposed into chapter texts, operations such as text cleaning, word segmentation, and stop-word filtering can be applied to the text data, chapter text by chapter text.
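A minimal Python sketch of these three preprocessing operations follows. It assumes English-like text in which words can be found with a regular expression; Chinese text would need a dedicated word segmenter, which is not shown. The stop-word list and the function name `preprocess` are illustrative assumptions.

```python
import re

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess(chapter_text, stop_words=STOP_WORDS):
    """Clean, segment, and filter one chapter text into a list of words."""
    # Cleaning and segmentation: lowercase, then keep runs of letters and digits,
    # which discards punctuation and extra whitespace.
    tokens = re.findall(r"[a-z0-9]+", chapter_text.lower())
    # Stop-word filtering: drop very common words that carry little meaning.
    return [token for token in tokens if token not in stop_words]

words = preprocess("Open the valve, and check the pressure of the pump.")
```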
S102. Extract a text feature vector of each chapter text, and classify the chapter text by article content based on the text feature vector to obtain an article classification.
Once the chapter texts are obtained, the text feature vector of each chapter text can be extracted and the chapter text classified on the basis of that vector, yielding the article classification.
In practice, each chapter text can be fed into a text content classification model for classification, yielding the article classification of each chapter text.
In one embodiment of the present invention, classifying the chapter text by article content based on the text feature vector to obtain an article classification comprises:
using the naive Bayes algorithm, which is based on Bayes' theorem, to classify by calculating the probability that the features in the text feature vector occur given a class, thereby obtaining the article classification.
The naive Bayes algorithm is based on Bayes' theorem and classifies by calculating the probability of each feature occurring given a class, which makes it suitable for text classification. For a given text sample, the algorithm calculates the posterior probability of each class and selects the class with the highest posterior probability as the classification result.
Specifically, text feature vectors can be obtained for all chapter texts of the technical document using a machine learning text feature extraction algorithm, and the article content can then be classified using the naive Bayes algorithm.
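The classification rule described above can be sketched in Python as a multinomial naive Bayes classifier over word counts. The choice of bag-of-words features, the Laplace smoothing constant, the class name `NaiveBayesClassifier`, and the sample class labels are all assumptions made for illustration; the source does not specify the feature extraction algorithm or any smoothing.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Multinomial naive Bayes over word lists (a sketch, not the source's model)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing               # Laplace smoothing (assumed)
        self.class_doc_counts = Counter()        # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocabulary = set()

    def fit(self, documents, labels):
        for words, label in zip(documents, labels):
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def log_scores(self, words):
        """Return log prior + log likelihood per class (unnormalized log posterior)."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for word in words:
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                score += math.log(
                    (counts[word] + self.smoothing)
                    / (total_words + self.smoothing * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, words):
        # The class with the highest posterior is the classification result.
        scores = self.log_scores(words)
        return max(scores, key=scores.get)

# Hypothetical training data: segmented chapter texts with class labels.
train_docs = [
    ["replace", "filter", "inspect"],
    ["inspect", "seal", "replace"],
    ["start", "engine", "throttle"],
    ["throttle", "start", "shutdown"],
]
train_labels = ["maintenance", "maintenance", "operation", "operation"]
clf = NaiveBayesClassifier().fit(train_docs, train_labels)
predicted = clf.predict(["inspect", "filter"])
```

Working in log space avoids numerical underflow when many per-word probabilities are multiplied, and since the normalizing constant is the same for every class it can be dropped without changing which class has the highest posterior.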
S103. Perform topic extraction on chapter texts of different categories to obtain text topics.
In practical applications, chapter texts that belong to different categories may still concern the same topic. So that interactive recommendations can be made by topic as well as by category, in this embodiment topic extraction is performed on chapter texts of different categories to obtain their common text topics. Specifically, algorithms such as clustering algorithms can be used to extract the text topics.
In one embodiment of the present invention, performing topic extraction on chapter texts of different categories to obtain text topics includes:
Treating chapter texts of different categories as a text group;
Training the topics of the text group using a topic model algorithm;
After convergence, determining article topic similarity using a machine learning text similarity algorithm;
Determining the text topics from the trained topics based on the article topic similarity.
In other words, based on the article classification results, articles of different categories can be grouped together for topic extraction: the LDA algorithm is trained until it converges, a machine learning text similarity algorithm is then used to determine the similarity of article topics, and topics with high similarity are stored as the same topic.
The LDA (Latent Dirichlet Allocation) topic model is a topic model algorithm that can be used to infer the topic distribution of documents. It gives the topics of each document in a document set as a probability distribution, so that once the topic distributions of some documents have been extracted, topic clustering or text classification can be performed based on those distributions.
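The similarity step that follows topic training can be sketched as below. This sketch does not implement LDA itself; it assumes the trained topics are already available as word-weight dictionaries, and it uses cosine similarity with a fixed threshold as one possible text similarity measure. The function names, the threshold of 0.8, and the sample topics are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar_topics(topics, threshold=0.8):
    """Greedily group topics that are similar to a group's first member."""
    groups = []
    for name, distribution in topics.items():
        for group in groups:
            representative = topics[group[0]]
            if cosine_similarity(distribution, representative) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical topic-word distributions, as a topic model might produce
# for chapter texts drawn from two different categories.
topics = {
    "maintenance_topic_0": {"brake": 0.5, "pad": 0.3, "replace": 0.2},
    "troubleshooting_topic_0": {"brake": 0.45, "pad": 0.35, "noise": 0.2},
    "maintenance_topic_1": {"engine": 0.6, "oil": 0.4},
}
topic_groups = group_similar_topics(topics)
print(topic_groups)
```

Each resulting group can then be stored as one shared text topic. Comparing against only the first member of a group keeps the sketch short; a production system might compare against a group centroid instead.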
S104. Perform keyword extraction on the chapter text to obtain article keywords.
Keyword-based interaction is another way to make interaction more effective. In this embodiment, keyword extraction can therefore be performed on the chapter text to obtain article keywords.
In practical applications, the chapter text can first be segmented into words, and each word can then be counted so that article keywords are selected based on how often they recur.
In one embodiment of the present invention, performing keyword extraction on the chapter text to obtain article keywords includes:
Performing word segmentation on the chapter text to obtain the words that make up the chapter text;
Calculating the term frequency and inverse document frequency of each word, and taking the product of the two as the word score;
Selecting the article keywords from the words using the word scores.
For ease of description, the above steps are explained together below.
TF (Term Frequency) is the term frequency, and IDF (Inverse Document Frequency) is the inverse document frequency.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
That is, in practical applications, the TF-IDF keyword extraction algorithm can be run with each chapter of an article as the unit, so that the key information contained in each chapter is obtained and the keywords of the chapter are extracted and stored.
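The keyword scoring described above can be sketched as follows, treating each chapter as one document. The sketch assumes the chapters are already segmented into words and uses the plain variant score = (count / chapter length) * log(N / document frequency); the embodiment does not say which TF-IDF variant is used, and the function name `tf_idf_keywords` and the sample chapters are hypothetical.

```python
import math
from collections import Counter


def tf_idf_keywords(chapters, top_k=3):
    """Return the top_k highest-scoring words for each tokenized chapter."""
    chapter_count = len(chapters)
    # Document frequency: the number of chapters that contain each word.
    document_frequency = Counter()
    for words in chapters.values():
        document_frequency.update(set(words))

    keywords = {}
    for title, words in chapters.items():
        counts = Counter(words)
        scores = {}
        for word, count in counts.items():
            tf = count / len(words)
            idf = math.log(chapter_count / document_frequency[word])
            scores[word] = tf * idf  # word score = TF * IDF
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[title] = ranked[:top_k]
    return keywords


# Hypothetical chapters, already segmented into words.
chapters = {
    "Brake system": ["brake", "pad", "replace", "brake", "wear"],
    "Engine oil": ["engine", "oil", "replace", "oil", "filter"],
    "Tires": ["tire", "pressure", "check", "tire"],
}
chapter_keywords = tf_idf_keywords(chapters, top_k=2)
print(chapter_keywords)
```

In this sample, "replace" occurs in two of the three chapters, so its IDF is low and it ranks below words that are specific to a single chapter.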
S105. Store the article classification, text topic and article keywords as the text association information of the corresponding chapter text.
In practical applications, a file management server can be set up to store all text data and text association information of all data files.
The text association information includes the article classification, text topic and article keyword information.
S106. In response to an operation behavior of the interactive electronic manual client, output the target chapter text associated with the operation behavior based on the text association information.
The interactive electronic manual client can provide an operation interface through which the user performs operations. The client captures the user's operation behavior and feeds it back to the server, and the server can then respond to that behavior. Specifically, the target chapter text associated with the operation behavior can be output based on the text association information.
The operation interface can also be presented using technologies such as VR (Virtual Reality, also called virtual technology or virtual environment). This embodiment does not limit the specific form in which the interaction is presented.
For example, the user can enter query information in the interactive electronic manual client; a similarity calculation is performed between the key content of the query and the text association information, and the target chapter texts for the result list are selected based on the calculated distance. Alternatively, when the client displays chapter text A, the text association information of chapter text A is obtained and used to find, in the file management server, a chapter text B whose text association information is closely related to that of chapter text A.
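Both lookup paths in the example above can be sketched with a simple set-based distance. The embodiment does not name a distance measure; Jaccard distance over the union of category, topic and keywords is an assumption made here for illustration, and `ChapterAssociation`, `rank_chapters` and `related_chapters` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ChapterAssociation:
    """Text association information stored for one chapter text."""
    title: str
    category: str
    topic: str
    keywords: set = field(default_factory=set)

    def terms(self):
        return self.keywords | {self.category, self.topic}


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; 1.0 means nothing in common."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def rank_chapters(query_terms, associations, limit=5):
    """Return chapter titles ordered by increasing distance to the query."""
    query = set(query_terms)
    scored = [(jaccard_distance(query, item.terms()), item.title)
              for item in associations]
    scored.sort()
    # Chapters that share nothing with the query are left out.
    return [title for distance, title in scored[:limit] if distance < 1.0]


def related_chapters(current, associations, limit=5):
    """Find chapters whose association information is close to `current`."""
    others = [a for a in associations if a.title != current.title]
    return rank_chapters(current.terms(), others, limit)


# Hypothetical stored association records.
pad = ChapterAssociation("Brake pad replacement", "maintenance", "brakes",
                         {"brake", "pad", "replace"})
noise = ChapterAssociation("Brake noise diagnosis", "troubleshooting", "brakes",
                           {"brake", "noise", "squeal"})
oil = ChapterAssociation("Engine oil change", "maintenance", "engine",
                         {"engine", "oil", "filter"})
library = [pad, noise, oil]

query_hits = rank_chapters(["brake", "noise"], library)
neighbours = related_chapters(pad, library)
print(query_hits, neighbours)
```

Because the category and topic are included in each chapter's term set, two chapters can be related through a shared topic even when their keywords barely overlap.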
As a further example, the electronic manual processing method can be applied to a VR-based vehicle equipment maintenance system in combination with interactive electronic manual technology. Specifically, the user identifies the vehicle fault point to be repaired through virtual-real fusion technology, and the target chapter texts on the parts to be repaired and the related maintenance techniques are obtained according to the virtual identified part that the user operates through virtual-real interaction. Alternatively, the relevant articles are searched directly using article keywords entered by the user, the text association information associated with the text topic of the article is obtained, the associated chapter texts are obtained through that text association information, and the relevant articles and the associated chapter texts are displayed together on the user terminal.
The method provided by the embodiment of the present invention includes: obtaining a technical document and decomposing it into chapters to obtain chapter texts; extracting text feature vectors of the chapter texts and classifying the article content of the chapter texts based on the text feature vectors to obtain article classifications; performing topic extraction on chapter texts of different categories to obtain text topics; performing keyword extraction on the chapter texts to obtain article keywords; storing the article classification, text topic and article keywords as the text association information of the corresponding chapter text; and, in response to an operation behavior of the interactive electronic manual client, outputting the target chapter text associated with the operation behavior based on the text association information.
In the present invention, in order to process each part of a technical document effectively, the complete technical document is first decomposed into chapters to obtain chapter texts. Classifying the chapter texts yields article classifications; extracting keywords from the chapter texts yields their keywords; extracting topics from chapter texts of different categories yields text topics. The article classification, text topic and article keywords of each chapter text are stored as its text association information. When the user operates the interactive electronic manual client, the target chapter text associated with the operation behavior can be output based on the text association information in response to that behavior. Because the text association information is stored at the level of individual chapter texts and includes the article classification, text topic and article keywords, the present invention can provide more accurate output during interaction than an approach that responds based only on keywords at the level of the complete document.
Technical effect: the technical document is decomposed chapter by chapter, and key information is extracted from and classified for each chapter by combining different algorithms, yielding text association information that includes the article classification, text topic and article keywords. This text association information can improve the query and interaction efficiency of the electronic manual.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an interactive system. The interactive system described below and the electronic manual processing method described above can be referred to in correspondence with each other.
As shown in FIG. 2, the system includes:
An interactive electronic manual client 100, a file management server 200 and a file parsing server 300;
The file parsing server is used to obtain a technical document and decompose it into chapters to obtain chapter texts; extract text feature vectors of the chapter texts and classify the article content of the chapter texts based on the text feature vectors to obtain article classifications; perform topic extraction on chapter texts of different categories to obtain text topics;
perform keyword extraction on the chapter texts to obtain article keywords; take the article classification, text topic and article keywords as the text association information of the corresponding chapter text; and send the text association information and the technical document to the file management server;
The file management server is used to receive and store the text association information and the technical document, and, in response to an operation behavior of the interactive electronic manual client, to output the target chapter text associated with the operation behavior based on the text association information;
The interactive electronic manual client is used to provide an operation interface, interact with the file management server, and output the target chapter text returned by the file management server.
The system provided by the embodiment of the present invention obtains a technical document and decomposes it into chapters to obtain chapter texts; extracts text feature vectors of the chapter texts and classifies the article content of the chapter texts based on the text feature vectors to obtain article classifications; performs topic extraction on chapter texts of different categories to obtain text topics; performs keyword extraction on the chapter texts to obtain article keywords; stores the article classification, text topic and article keywords as the text association information of the corresponding chapter text; and, in response to an operation behavior of the interactive electronic manual client, outputs the target chapter text associated with the operation behavior based on the text association information.
In a specific implementation of the present invention, the file parsing server is specifically used to obtain the technical document and parse it to obtain the article titles and title levels;
and to decompose the technical document into chapters based on the article titles and title levels to obtain the chapter texts.
In a specific implementation of the present invention, the file parsing server is specifically used to perform word segmentation on the chapter text to obtain the words that make up the chapter text;
calculate the term frequency and inverse document frequency of each word, and take the product of the two as the word score;
and select the article keywords from the words using the word scores.
In a specific implementation of the present invention, the file parsing server is specifically used to treat chapter texts of different categories as a text group;
train the topics of the text group using a topic model algorithm;
after convergence, determine article topic similarity using a machine learning text similarity algorithm;
and determine the text topics from the trained topics based on the article topic similarity.
In a specific implementation of the present invention, the file parsing server is specifically used to apply the naive Bayes algorithm, which is based on Bayes' theorem, and to classify by calculating the probability that each feature in the text feature vector appears given a category, thereby obtaining the article classification.
In a specific implementation of the present invention, the file parsing server is specifically used to receive the technical document from a specified interface;
and to obtain the title format information of the technical document using the specified interface, where the title format information includes the title level.
In a specific implementation of the present invention, the file parsing server is specifically used to create chapter objects corresponding to the technical document while receiving the technical document through the specified interface, where a chapter object stores the chapter title content, the chapter text content and a sub-chapter list;
loop through each paragraph of text in the technical document and use the title format information to determine whether the current text is a title;
if not, the current text is determined to be body text and is written into the chapter text content of the most recently generated title object;
if so, then when the current text is at the same level as, or at a higher level than, the previously generated title object, the text content of the previously generated title object is considered complete, a new title object is generated, and the current text is written into its chapter title content; when the current text is a lower-level title of the previously generated title object, a new title object is generated, the current text is written into its chapter title content, and the new title object is added to the sub-chapter list of the previously generated title object;
using the chapter objects, convert the technical document into an array of JSON objects organized by document structure, where each value in the array corresponds to one chapter object of the technical document and, if a chapter object has sub-chapters, it has a list of sub-chapter objects as an attribute.
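The chapter-object construction and JSON conversion described above can be sketched as follows. The sketch assumes the title format information has already been reduced to a numeric level for each paragraph (0 for body text, 1 and above for heading levels); how that level is read from the source file is outside the sketch. The names `Chapter`, `build_chapters` and `chapters_to_json` are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Chapter:
    """Chapter object: title content, text content and sub-chapter list."""
    title: str
    level: int
    text: str = ""
    children: list = field(default_factory=list)


def build_chapters(paragraphs):
    """Build a chapter tree from (level, text) pairs; level 0 is body text."""
    roots = []
    stack = []  # currently open chapters, outermost first
    for level, text in paragraphs:
        if level == 0:
            # Body text belongs to the most recently generated title object.
            # Text that appears before any title is skipped in this sketch.
            if stack:
                current = stack[-1]
                current.text = f"{current.text}\n{text}" if current.text else text
            continue
        # A title at the same or a higher level completes the open chapters
        # at that level and below.
        while stack and stack[-1].level >= level:
            stack.pop()
        chapter = Chapter(title=text, level=level)
        if stack:
            stack[-1].children.append(chapter)  # lower-level title: sub-chapter
        else:
            roots.append(chapter)
        stack.append(chapter)
    return roots


def chapters_to_json(roots):
    """Serialize the chapter tree as a JSON array of chapter objects."""
    return json.dumps([asdict(chapter) for chapter in roots],
                      ensure_ascii=False, indent=2)


# Hypothetical paragraphs with pre-computed heading levels.
paragraphs = [
    (1, "1 Brake system"),
    (0, "Overview of the brake system."),
    (2, "1.1 Brake pads"),
    (0, "Inspect pads every 10,000 km."),
    (2, "1.2 Brake fluid"),
    (1, "2 Engine"),
]
roots = build_chapters(paragraphs)
print(chapters_to_json(roots))
```

Keeping a stack of open chapters generalizes the "previously generated title object" comparison: a new title that jumps up several levels at once still attaches to the correct parent.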
The electronic manual processing method can be applied in the above system. To help those skilled in the art better understand and implement the method, it is described in detail below using a specific application scenario as an example.
Specifically, the electronic manual processing method can be implemented in the system shown in FIG. 2, as follows:
Interactive electronic manual client (referred to herein as the client): used to browse the information of the interactive electronic manual or to perform interactive operations on equipment through a visual operation interface. For example, the client uses VR to perform maintenance operations on simulated equipment and then displays the relevant technical documents according to the maintenance operation information from that interaction.
File management server: used to store all text data and text association data of all data files.
File parsing server: used to parse the searched data files and obtain the relevant text association data and article structure.
The text association data (that is, the text association information) consists of the article classification, text topic and article keyword information.
Referring to FIG. 3, the method includes the following steps:
S1. A data file is uploaded to the file parsing server, which decomposes the file into chapters according to the parsed article titles and title levels, obtaining the individual chapter titles and the article content each contains.
That is, the file parsing server obtains the positions of all titles in the document and determines the chapter hierarchy of the document from the text style or numbering of the titles;
Specifically, Apache POI can be used to extract the information of a Word document. Word documents are uploaded in a unified format, and all title format information in the Word document is stored, with title levels distinguished when it is stored;
Further, a chapter object is defined to store the chapter title content, the chapter text content and the sub-chapter list. Each paragraph of text in the document is traversed in a loop, and its format is used to determine whether it is a title. If it is not a title, it is treated as body text and becomes part of the text content of the previous title object. If it is a title at the same level as, or at a higher level than, the previously generated title object, the text content of the previously generated title object is considered complete and a new title object is generated. If it is a lower-level title of the previously generated title object, a new title object is generated and made a sub-title of the previous title object.
Further, based on the defined chapter objects, the entire document is converted into an array of JSON objects organized by document structure. Each value in the array corresponds to one chapter object of the article; if a chapter object has sub-chapters, it has a list of sub-chapter objects as an attribute, which stores all of its sub-chapter objects.
S2. The file parsing server analyzes the chapter texts using AI algorithms, then collates the analysis results and sends them to the file management server.
For text analysis on the file parsing server, all uploaded data files are first preprocessed with the article as the unit, including text cleaning, word segmentation, stop word filtering and other operations on the text data.
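The three preprocessing operations named above can be sketched as follows. This is only an illustration for whitespace-delimited text: splitting on whitespace stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the `STOP_WORDS` set and the `preprocess` function are hypothetical.

```python
import re

# Illustrative stop word list; a real deployment would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Clean the text, split it into words and drop stop words."""
    # Text cleaning: strip leftover markup tags and punctuation.
    cleaned = re.sub(r"<[^>]+>", " ", text)
    cleaned = re.sub(r"[^\w\s]", " ", cleaned.lower())
    # Word segmentation: whitespace split as a stand-in for a segmenter.
    tokens = cleaned.split()
    # Stop word filtering.
    return [token for token in tokens if token not in stop_words]


tokens = preprocess("<p>Replace the brake pads, and check the fluid.</p>")
print(tokens)
```

The resulting word lists are the input that the later classification, topic extraction and keyword extraction steps operate on.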
When text data is uploaded through the file parsing server for the first time, text data that has already been classified is selected for upload, and the algorithm parameters (weights and smoothing factor) are not set. After model training and prediction are complete, the results are evaluated with an evaluation algorithm, and the parameter settings are adjusted and training is repeated until the evaluation value reaches its maximum. New data files are then added one at a time to refine the classification model.
Based on the article classification results, articles of different categories are grouped together for topic extraction: the LDA algorithm is trained until it converges, a machine learning text similarity algorithm is then used to determine the similarity of article topics, and topics with high similarity are stored as the same topic.
The TF-IDF keyword extraction algorithm is run with each chapter of an article as the unit, so that the key information contained in each chapter is obtained and the keywords of the chapter are extracted and stored.
S3. The file management server stores the document analysis information according to the results produced by the file parsing server.
S4. When information is browsed or interacted with on the client, the associated information is displayed.
When the client performs VR electronic document interaction, the associated article chapters are retrieved from the file management server for display according to the text association data related to the maintenance interaction, and the content of the relevant articles can also be browsed directly.
It can be seen that the present invention addresses the low efficiency of manual searching when an interactive electronic manual stores a large number of data files. Extraction of article text information is automated through machine learning algorithms, and several different algorithms are combined to compensate for the shortcomings of any single algorithm.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an electronic device. The electronic device described below and the electronic manual processing method described above can be referred to in correspondence with each other.
As shown in FIG. 4, the electronic device includes:
A memory 332, used to store a computer program;
A processor 322, used to implement the steps of the electronic manual processing method of the above method embodiment when executing the computer program.
Specifically, refer to FIG. 5, which is a schematic diagram of the specific structure of an electronic device provided in this embodiment. The electronic device may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) and a memory 332, where the memory 332 stores one or more computer programs 342 or data 344. The memory 332 may be transient storage or persistent storage. The program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the data processing device. Further, the processor 322 may be configured to communicate with the memory 332 and to execute, on the electronic device 301, the series of instruction operations in the memory 332.
The electronic device 301 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps of the electronic manual processing method described above can be implemented by the structure of the electronic device.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a readable storage medium. The readable storage medium described below and the electronic manual processing method described above can be referred to in correspondence with each other.
A readable storage medium stores a computer program which, when executed by a processor, implements the steps of the electronic manual processing method of the above method embodiment.
The readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other readable storage medium capable of storing program code.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be understood by cross-reference. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410841066.7A CN118377912B (en) | 2024-06-27 | 2024-06-27 | Electronic manual processing method, interaction system, electronic device and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118377912A (en) | 2024-07-23 |
CN118377912B (en) | 2024-11-08 |
Family
ID=91906124
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202410841066.7A | Active | CN118377912B (en) | 2024-06-27 | 2024-06-27 | Electronic manual processing method, interaction system, electronic device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118377912B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004145626A (en) * | 2002-10-24 | 2004-05-20 | Telecommunication Advancement Organization Of Japan | Documents classification support device and computer program |
CN108399228A (en) * | 2018-02-12 | 2018-08-14 | 平安科技(深圳)有限公司 | Article sorting technique, device, computer equipment and storage medium |
KR20210089340A (en) * | 2020-01-08 | 2021-07-16 | 삼성에스디에스 주식회사 | Method and apparatus for categorizing text in document |
CN114064890A (en) * | 2021-11-08 | 2022-02-18 | 福建正孚软件有限公司 | Data analysis method and storage medium |
CN114328983A (en) * | 2021-12-31 | 2022-04-12 | 北京索为系统技术股份有限公司 | Document shredding method, data retrieval method, device and electronic device |
CN115374781A (en) * | 2022-08-25 | 2022-11-22 | 上海浦东发展银行股份有限公司 | Text data information mining method, device and equipment |
CN116205212A (en) * | 2023-02-27 | 2023-06-02 | 华润数字科技有限公司 | Bid file information extraction method, device, equipment and storage medium |
CN116258131A (en) * | 2023-02-13 | 2023-06-13 | 电科云(北京)科技有限公司 | Template engine-based scheme compiling method and system |
Also Published As
Publication number | Publication date |
---|---|
CN118377912B (en) | 2024-11-08 |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |