WO2018184600A1 - Approximate entry structure recommendation method and system - Google Patents

Approximate entry structure recommendation method and system

Info

Publication number
WO2018184600A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
entry
format
root
text format
Prior art date
Application number
PCT/CN2018/084818
Other languages
French (fr)
Chinese (zh)
Inventor
马也驰
谭红
Original Assignee
上海颐为网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海颐为网络科技有限公司
Publication of WO2018184600A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to a method and system for recommending approximate term structures, and more particularly to a technique for recommending term structures based on the cosine similarity parameter.
  • the object of the present invention is to solve the above problems by providing an approximate term structure recommendation method and system that can automatically identify similar term structures and offer them as a reference to users creating new terms, thereby improving the efficiency with which users build term structures and deepening their understanding of those structures.
  • the technical solution of the present invention is as follows:
  • the present invention discloses an approximate term structure recommendation method, including:
  • Step 1 Receive the structure of the newly created root term of the user, convert the structure format into a text format and store it in real time;
  • Step 2 Perform a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format
  • Step 3 Convert the text format of the existing root term whose cosine similarity exceeds the preset threshold into a structural format and present it to the user, otherwise it is not presented to the user.
  • in the process of converting the term structure format into a text format, the term attributes in the term structure are stored as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms; during the conversion, the term attributes of the root term and of all sub-terms under the root term are read out to form the text format.
  • step two further includes:
  • Step 1 Import the gensim library
  • Step 2 Import all existing terms into the documents list, separating the terms with commas;
  • Step 3 Vectorize all existing terms
  • Step 4 Construct the corresponding TF-IDF model from the vector values in step 3;
  • Step 5 Calculate the TF-IDF value of each term with the TF-IDF model
  • Step 6 Construct the corresponding LSI model from the TF-IDF value of each term;
  • Step 7 Import the newly created root entry of the user and vectorize it
  • Step 8 Import the vector value of the newly created root term in step 7 into the LSI model constructed in step 6;
  • Step 9 Import the vector value of the term in step 3 into the LSI model constructed in step 6, and construct a cosine similarity calculation model;
  • Step 10 Import the value obtained in step 8 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing terms.
  • in the process of converting the text format into the term structure format in step 3, the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner.
  • the term attributes include a term identifier, a term name, a term text, a parent term, and child terms.
  • According to an embodiment of the approximate term structure recommendation method of the present invention, step 3 further includes:
  • Step 1 Use the basic command hgetall of the redis hash to extract the attributes of the root entry and the attributes of all the sub-terms of the root entry to an object;
  • Step 2 The web front end loads the D3.js open source library
  • Step 3 Define a tree object using the d3.layout.tree command and determine the image area size
  • Step 4 The web front end requests data from the server, and the server transmits the object of step 1 to the web front end according to the JSON format;
  • Step 5 Generate the node set nodes from the JSON data of step 4.
  • Step 6 Generate the nodes from the nodes set
  • Step 7 Use the tree.links(nodes) command to get the node relationship set.
  • Step 8 Set Bezier curve connections for the relationship set
  • Step 9 Add a circular mark to each node: black if the node has child nodes, otherwise white;
  • Step 10 Add a description text to the node according to the document attribute of the JSON data
  • Step 11 Complete the conversion of the text format to the structure format.
  • the invention also discloses an approximate term structure recommendation system, comprising:
  • a text format conversion module that converts the structural format of the root term into a text format
  • a storage module that stores a structure format of all terms and a corresponding text format thereof
  • a cosine similarity comparison module that performs a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format, selects the text formats of the existing root terms whose cosine similarity is higher than the preset threshold, and outputs them in term structure format;
  • a structure format conversion module that converts the text format of the root term into the structural format of the term.
  • in the text format conversion module, the term attributes in the term structure are stored as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms; in the process of converting the term structure format into a text format, the term attributes of the root term and of all sub-terms under the root term are read out to form the text format.
  • in the structure format conversion module, the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms.
  • Figure 1 shows a flow chart of an embodiment of the approximate term structure recommendation method of the present invention.
  • Fig. 2 shows the two term structures used as examples in the present invention.
  • Figure 3 is a flow chart of calculating the cosine similarity between terms in the present invention.
  • Figure 4 is a flow chart of converting the text format into the term structure format in the present invention.
  • Figure 5 shows a schematic diagram of an embodiment of an approximate term structure recommendation system of the present invention.
  • FIG. 1 shows an implementation of an embodiment of the approximate term structure recommendation method of the present invention.
  • in the description of this embodiment, the two term structures shown in FIG. 2, term structure 1 and term structure 2, are used as examples.
  • Step S1 Receive the structure of the newly created root term of the user, convert the structure format into a text format and store it in real time.
  • the entry attributes include the entry identifier (ID), the entry name (name), the entry text (document), the parent term (parent), and the child term (children).
  • structured display on the web mostly uses the D3 open source library; that is, the D3 library displays the terms stored on the server as a tree diagram.
  • the term attributes are stored as key-value pairs, that is, as a mapping table of string-type fields and values, so the hash storage mode suits this storage.
  • the web back end uses the key-value database redis to store terms and term attributes, and the term attributes of each term are stored in redis in the hash storage mode.
  • the basic redis hash command hgetall is used to take out the attributes of the root term and the attributes of all sub-terms of the root term.
  • Step S2 Performing a pairwise cosine similarity comparison between the newly created root term converted into a text format and the existing root term converted into a text format.
  • Step S201 Import the gensim library.
  • Step S202 Import all existing terms into the documents list, separating the terms with commas.
  • Step S203 Vectorize all existing terms.
  • Step S204 Construct the corresponding TF-IDF model from the vector values in step S203.
  • Step S205 Calculate the TF-IDF value of each term by using the TF-IDF model.
  • Step S206 Construct the corresponding LSI model from the TF-IDF value of each term.
  • Step S207 Import the newly created root term of the user and vectorize it.
  • Step S208 The vector value of the newly created root term in step S207 is introduced into the LSI model constructed in step S206.
  • Step S209 The vector value of the term in step S203 is introduced into the LSI model constructed in step S206, and a cosine similarity calculation model is constructed.
  • Step S210 Import the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing entries.
  • Step S3 Convert the text format of the existing root term whose cosine similarity exceeds the preset threshold into a structural format and present it to the user, otherwise it is not presented to the user.
  • the existing root terms whose cosine similarity exceeds a preset threshold (such as 80%) are identified, and their text format is converted into a structural format.
  • the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms. All terms and term attributes are stored in the redis database in hash format. The specific implementation steps are further shown in FIG. 4, as follows.
  • Step S301 Use the basic command hgetall of the redis hash to extract the attribute of the root term and the attributes of all the sub-terms of the root term to an object.
  • Step S302 The web front end loads the D3.js open source library.
  • Step S303 Define a tree object using the d3.layout.tree command, and determine the image area size.
  • Step S304 The web front end requests data from the server, and the server transmits the object of step S301 to the web front end according to the JSON format.
  • Step S305 Generate the node set nodes from the JSON data of step S304.
  • Step S306 Generate the nodes from the nodes set.
  • Step S307 Obtain the node relationship set by using the tree.links(nodes) command.
  • Step S308 Set Bezier curve connections for the relationship set.
  • Step S309 Add a circular mark to each node: black if the node has child nodes, otherwise white.
  • Step S310 Add a description text to the node according to the document attribute of the JSON data.
  • Step S311 Complete conversion of the text format to the structural format.
  • the tools mentioned in this embodiment are open source: gensim is a Python library, D3 is a JavaScript library, and redis is a key-value database that can be accessed from Python. documents is a self-created list, the TF-IDF and LSI models are models of the gensim library, hgetall is a basic redis command, tree is the object defined by the D3 command d3.layout.tree, JSON is a data format, and nodes is a self-created node collection object.
  • Figure 5 illustrates the principles of an embodiment of the approximate term structure recommendation system of the present invention.
  • the system of this embodiment includes a text format conversion module 1, a cosine similarity comparison module 2, a structural format conversion module 3, and a storage module 4.
  • the text format conversion module 1 is used to implement the conversion of the structural format of the root term into a text format.
  • the entry attribute includes a term identification (ID), a name (name), a document (document), a parent term (parent), and a child term (children).
  • structured display on the web mostly uses the D3 open source library; that is, the D3 library displays the terms stored on the server as a tree diagram.
  • the term attributes are stored as key-value pairs, that is, as a mapping table of string-type fields and values, so the hash storage mode suits this storage.
  • the web back end uses the key-value database redis to store terms and term attributes, and the term attributes of each term are stored in redis in the hash storage mode.
  • the basic redis hash command hgetall is used to take out the attributes of the root term and the attributes of all sub-terms of the root term.
  • the storage module 4 is configured to store the structural format of all the terms and their corresponding text formats.
  • the cosine similarity comparison module 2 performs a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format, selects the text formats of the existing root terms whose cosine similarity is higher than the preset threshold, and outputs them in term structure format.
  • Step S201 Import the gensim library.
  • Step S202 Import all existing terms into the documents list, separating the terms with commas.
  • Step S203 Vectorize all existing terms.
  • Step S204 Construct the corresponding TF-IDF model from the vector values in step S203.
  • Step S205 Calculate the TF-IDF value of each term by using the TF-IDF model.
  • Step S206 Construct the corresponding LSI model from the TF-IDF value of each term.
  • Step S207 Import the newly created root term of the user and vectorize it.
  • Step S208 The vector value of the newly created root term in step S207 is introduced into the LSI model constructed in step S206.
  • Step S209 The vector value of the term in step S203 is introduced into the LSI model constructed in step S206, and a cosine similarity calculation model is constructed.
  • Step S210 Import the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing entries.
  • the structure format conversion module 3 is used to convert the text format of the root entry into the structural format of the entry.
  • the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms. All terms and term attributes are stored in the redis database in hash format.
  • the specific implementation steps are further shown in FIG. 4, as follows.
  • Step S301 Use the basic command hgetall of the redis hash to extract the attribute of the root term and the attributes of all the sub-terms of the root term to an object.
  • Step S302 The web front end loads the D3.js open source library.
  • Step S303 Define a tree object using the d3.layout.tree command, and determine the image area size.
  • Step S304 The web front end requests data from the server, and the server transmits the object of step S301 to the web front end according to the JSON format.
  • Step S305 Generate the node set nodes from the JSON data of step S304.
  • Step S306 Generate the nodes from the nodes set.
  • Step S307 Obtain the node relationship set by using the tree.links(nodes) command.
  • Step S308 Set Bezier curve connections for the relationship set.
  • Step S309 Add a circular mark to each node: black if the node has child nodes, otherwise white.
  • Step S310 Add a description text to the node according to the document attribute of the JSON data.
  • Step S311 Complete conversion of the text format to the structural format.
  • the tools mentioned in this embodiment are open source: gensim is a Python library, D3 is a JavaScript library, and redis is a key-value database that can be accessed from Python. documents is a self-created list, the TF-IDF and LSI models are models of the gensim library, hgetall is a basic redis command, tree is the object defined by the D3 command d3.layout.tree, JSON is a data format, and nodes is a self-created node collection object.
  • digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein may be used for the implementation or execution.
  • a general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • the processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • a software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor to enable the processor to read and write information to/from the storage medium.
  • the storage medium can be integrated into the processor.
  • the processor and the storage medium can reside in an ASIC.
  • the ASIC can reside in the user terminal.
  • the processor and the storage medium may reside as a discrete component in the user terminal.
  • the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented as a computer program product in software, the functions may be stored on or transmitted as one or more instructions or code on a computer readable medium.
  • Computer readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available media that can be accessed by a computer.
  • such computer readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • any connection is also properly referred to as a computer readable medium.
  • for example, if the software is transmitted from a web site, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium.
  • Disks and discs as used herein include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer readable media.

Abstract

Disclosed are an approximate entry structure recommendation method and system, capable of automatically identifying approximate entry structures and providing them for reference to users who create entries, thereby improving the efficiency with which users create entry structures and enhancing users' comprehension of the entry structure. The technical solution of the present invention comprises: receiving the structure of a root entry created by a user, converting the structure format into a text format in real time and storing it; performing a pairwise cosine similarity comparison between the created root entry converted into the text format and the other existing root entries converted into the text format; and converting the text format of the existing root entries whose cosine similarity exceeds a preset threshold into a structure format and presenting it to the user, while entries that do not exceed the threshold are not presented.

Description

Approximate term structure recommendation method and system
Technical field
The present invention relates to a method and system for recommending approximate term structures, and more particularly to a technique for recommending term structures based on the cosine similarity parameter.
Background
On an information platform based on term structures, as the number of users grows, many users will define and structure the same body of knowledge. When a user creates a new root term in the system in order to build a term structure, the system has often already stored term structures similar to that new root term.
On existing information platforms, even when a similar term structure exists, the user creating the new root term is not told about it, so the term structures already known to the platform cannot serve that user. The user still builds the term structure without any reference, which lowers the efficiency of using the platform. It also tends to produce a large number of terms with similar structural formats on the platform, which hinders organizing and displaying the platform's information.
Therefore, the industry urgently needs a means of automatically retrieving the approximate term structures already stored in the system and providing them to the user for reference.
Summary of the invention
A brief overview of one or more aspects is given below to provide a basic understanding of these aspects. This overview is not an exhaustive survey of all contemplated aspects, and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The object of the present invention is to solve the above problems by providing an approximate term structure recommendation method and system that can automatically identify similar term structures and offer them as a reference to users creating new terms, thereby improving the efficiency with which users build term structures and deepening their understanding of those structures.
The technical solution of the present invention is as follows. The present invention discloses an approximate term structure recommendation method, including:
Step 1: Receive the structure of the root term newly created by the user, convert the structure format into a text format in real time, and store it;
Step 2: Perform a pairwise cosine similarity comparison between the newly created root term converted into text format and the other existing root terms converted into text format;
Step 3: Convert the text format of each existing root term whose cosine similarity exceeds the preset threshold into a structure format and present it to the user; existing root terms that do not exceed the threshold are not presented.
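For reference, the cosine similarity compared against the threshold in steps 2 and 3 is the standard measure of the angle between two vectors. The notation below is introduced here for illustration only and does not appear in the original text: q denotes the vector of the newly created root term, d_i the vector of the i-th existing root term, and τ the preset threshold.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{d}_i)
  = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d}_i \rVert}
  = \frac{\sum_{k} q_k \, d_{i,k}}{\sqrt{\sum_{k} q_k^{2}} \; \sqrt{\sum_{k} d_{i,k}^{2}}},
\qquad
\text{present } \mathbf{d}_i \text{ to the user only if } \operatorname{sim}(\mathbf{q}, \mathbf{d}_i) > \tau .
\]
```

For vectors with non-negative components the value lies between 0 and 1, with 1 indicating that the two vectors point in the same direction.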
According to an embodiment of the approximate term structure recommendation method of the present invention, in the process of converting the term structure format into a text format, the term attributes in the term structure are stored as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms. During the conversion, the term attributes of the root term and of all sub-terms under the root term are read out to form the text format.
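The conversion described above can be sketched as a recursive walk over the term structure. This is a minimal illustration, not the patented implementation: the function name `structure_to_text`, the choice to concatenate only the name and document attributes, the space separator, and the sample terms are all assumptions made for the example.

```python
def structure_to_text(term):
    """Flatten a root term and all of its descendants into one text string.

    Assumption: only the `name` and `document` attributes contribute to the
    text, joined by single spaces in depth-first order.
    """
    parts = [term["name"], term["document"]]
    for child in term.get("children", []):
        parts.append(structure_to_text(child))
    return " ".join(part for part in parts if part)


# Hypothetical term structure with one root term and two sub-terms.
root_term = {
    "ID": "1", "name": "fruit", "document": "edible plant product", "parent": None,
    "children": [
        {"ID": "2", "name": "apple", "document": "pome fruit", "parent": "1", "children": []},
        {"ID": "3", "name": "banana", "document": "tropical fruit", "parent": "1", "children": []},
    ],
}

text_format = structure_to_text(root_term)
print(text_format)
```

Reading the whole subtree into a single string means two structures are compared on their combined vocabulary, which is what the later similarity step operates on.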
According to an embodiment of the approximate term structure recommendation method of the present invention, step 2 further includes:
Step 1: Import the gensim library;
Step 2: Import all existing terms into the documents list, separating the terms with commas;
Step 3: Vectorize all existing terms;
Step 4: Construct the corresponding TF-IDF model from the vector values in step 3;
Step 5: Calculate the TF-IDF value of each term with the TF-IDF model;
Step 6: Construct the corresponding LSI model from the TF-IDF value of each term;
Step 7: Import the root term newly created by the user and vectorize it;
Step 8: Import the vector value of the newly created root term from step 7 into the LSI model constructed in step 6;
Step 9: Import the vector values of the terms from step 3 into the LSI model constructed in step 6, and construct a cosine similarity calculation model;
Step 10: Import the value obtained in step 8 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all existing terms.
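The core of the ten steps above (vectorize, weight by TF-IDF, compare by cosine similarity) can be sketched with the Python standard library alone. This is a simplified stand-in, not the gensim pipeline the text names: it omits the LSI projection of steps 6, 8, and 9, and every function name, the smoothed IDF formula, and the sample documents are assumptions made for illustration. In gensim, the corresponding pieces would be `corpora.Dictionary`, `models.TfidfModel`, `models.LsiModel`, and `similarities.MatrixSimilarity`.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """Inverse document frequency for every word seen in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    # Smoothed IDF so that words present in every document keep a non-zero weight.
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {word: weight}; words unknown to the corpus are dropped."""
    counts = Counter(token for token in tokens if token in idf)
    return {word: count * idf[word] for word, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical text formats of three existing root terms and one new root term.
documents = [
    "fruit apple banana pear",
    "vehicle car truck bicycle",
    "fruit apple orange grape",
]
new_root_term = "fruit apple banana"

tokenized = [tokenize(doc) for doc in documents]
idf = build_idf(tokenized)
corpus_vectors = [tfidf_vector(doc, idf) for doc in tokenized]
query_vector = tfidf_vector(tokenize(new_root_term), idf)
similarities = [cosine_similarity(query_vector, vec) for vec in corpus_vectors]

for doc, score in zip(documents, similarities):
    print(f"{score:.3f}  {doc}")
```

In this sample the first document shares the most vocabulary with the new root term and scores highest, while the second shares none and scores zero. An LSI step would additionally let terms match on related but non-identical words, which plain TF-IDF cannot do.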
According to an embodiment of the approximate term structure recommendation method of the present invention, in the process of converting the text format into the term structure format in step 3, the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms.
According to an embodiment of the approximate term structure recommendation method of the present invention, step 3 further includes:
Step 1: Use the basic redis hash command hgetall to take out the attributes of the root term and of all its sub-terms into an object;
Step 2: The web front end loads the D3.js open source library;
Step 3: Use the d3.layout.tree command to define a tree object and determine the size of the image area;
Step 4: The web front end requests data from the server, and the server passes the object of step 1 to the web front end in JSON format;
Step 5: Generate the node set nodes from the JSON data of step 4;
Step 6: Generate the nodes from the nodes set;
Step 7: Use the tree.links(nodes) command to obtain the node relationship set;
Step 8: Set Bezier curve connections for the relationship set;
Step 9: Add a circular mark to each node: black if the node has child nodes, otherwise white;
Step 10: Add descriptive text to each node according to the document attribute of the JSON data;
Step 11: Complete the conversion from text format to structure format.
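The data side of the steps above can be sketched in Python: nesting flat term records into the JSON object the server sends in step 4, then deriving the node set, the parent-child relationship set, and the marker colour of step 9. In the described method the last three are produced in the browser by D3; the code below only mirrors the resulting data shapes and is not the D3 API. All function names, the comma-separated `children` field, and the sample records are assumptions made for illustration.

```python
import json


def build_tree(records, term_id):
    """Nest flat term records into a {name, document, children} tree."""
    record = records[term_id]
    child_ids = [cid for cid in record["children"].split(",") if cid]
    return {
        "name": record["name"],
        "document": record["document"],
        "children": [build_tree(records, cid) for cid in child_ids],
    }


def collect_nodes(tree):
    """Depth-first list of every node in the tree, root first."""
    nodes = [tree]
    for child in tree["children"]:
        nodes.extend(collect_nodes(child))
    return nodes


def collect_links(nodes):
    """One (parent name, child name) pair per edge of the tree."""
    return [(node["name"], child["name"]) for node in nodes for child in node["children"]]


def marker_color(node):
    """Rule from step 9: black when the node has children, otherwise white."""
    return "black" if node["children"] else "white"


# Hypothetical flat records, keyed by term identifier.
records = {
    "1": {"name": "fruit", "document": "edible plant product", "children": "2,3"},
    "2": {"name": "apple", "document": "pome fruit", "children": ""},
    "3": {"name": "banana", "document": "tropical fruit", "children": ""},
}

tree = build_tree(records, "1")
payload = json.dumps(tree)          # what the server would send to the front end
nodes = collect_nodes(tree)
links = collect_links(nodes)
colors = {node["name"]: marker_color(node) for node in nodes}
print(payload)
```

A nested object with a `children` array is used here because that is the shape a hierarchical tree layout conventionally consumes; whether the original system used exactly these field names is not stated in the text.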
The present invention also discloses an approximate term structure recommendation system, comprising:
a text format conversion module, which converts the structure format of a root term into a text format;
a storage module, which stores the structure formats of all terms and their corresponding text formats;
a cosine similarity comparison module, which performs a pairwise cosine similarity comparison between the newly created root term converted into text format and the other existing root terms converted into text format, selects the text formats of the existing root terms whose cosine similarity is higher than the preset threshold, and outputs them in term structure format;
a structure format conversion module, which converts the text format of a root term into the structure format of the term.
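The selection performed by the cosine similarity comparison module can be sketched as a simple filter. This is a minimal illustration that assumes the similarity scores have already been computed: the function name `recommend`, the term identifiers, and the best-first ordering are hypothetical, and the 0.8 default only mirrors the 80% example threshold mentioned in the embodiment.

```python
def recommend(similarities, root_term_ids, threshold=0.8):
    """Return (term_id, score) pairs whose score exceeds the threshold, best first."""
    matches = [
        (term_id, score)
        for term_id, score in zip(root_term_ids, similarities)
        if score > threshold  # strictly greater: the text says "exceeds"
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for four existing root terms.
scores = [0.83, 0.10, 0.80, 0.95]
term_ids = ["t1", "t2", "t3", "t4"]

recommended = recommend(scores, term_ids)
print(recommended)
```

Terms at or below the threshold are simply left out of the result, which matches the rule that such structures are not presented to the user.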
According to an embodiment of the approximate term structure recommendation system of the present invention, in the text format conversion module, the term attributes in the term structure are stored as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms. In the process of converting the term structure format into a text format, the term attributes of the root term and of all sub-terms under the root term are read out to form the text format.
According to an embodiment of the approximate term structure recommendation system of the present invention, in the structure format conversion module, the term attributes related to the text format are stored into a term structure as key-value pairs in a hash storage manner, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms.
Brief Description of the Drawings
Figure 1 shows a flow chart of an embodiment of the approximate term structure recommendation method of the present invention.
Figure 2 shows two term structures used as examples in the present invention.
Figure 3 shows a flow chart of computing the cosine similarity between terms in the present invention.
Figure 4 shows a flow chart of converting the text format into the term structure format in the present invention.
Figure 5 shows a schematic diagram of an embodiment of the approximate term structure recommendation system of the present invention.
Detailed Description
The above features and advantages of the present invention can be better understood after reading the detailed description of the embodiments of the present disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components with similar related properties or features may have the same or similar reference numerals.
Embodiment of the approximate term structure recommendation method
Figure 1 shows the implementation of an embodiment of the approximate term structure recommendation method of the present invention. The description of this embodiment uses the two term structures shown in Figure 2 as examples, namely term structure 1 and term structure 2.
Step S1: Receive the structure of a root term newly created by the user, convert the structure format into a text format in real time, and store it.
The term attributes include the term identifier (ID), term name (name), term text (document), parent term (parent), and child terms (children). When the term structure format is converted into the text format, the term attributes of the root term in the term structure and of all sub-terms under the root term are read out to form the text format.
Structured content on the web is now mostly displayed with the open-source D3 library, which renders the terms stored on the server as a tree diagram. Term attributes are stored as key-value pairs, that is, as a mapping table from string fields to values, so the hash storage method is suitable for storing them.
The web back end uses the key-value database Redis to store terms and term attributes; the attributes of each created term are stored in Redis as a hash. When format conversion is needed, the basic Redis hash command hgetall is used to retrieve the attributes of the root term and of all its sub-terms. Taking Figure 2 as an example, part of the information stored in the database for the term structures is as follows:
Text 1:
Title 1
XXXXXX This is the content of Title 1 XXXXXX
Chapter 1
XXXXXX Content of Chapter 1 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Chapter 2
XXXXXX Content of Chapter 2 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Section 3
XXXXXX Content of Section 3 XXXXXX
Chapter 3
XXXXXX Content of Chapter 3 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Text 2:
Title 2
XXXXXX This is the content of Title 2 XXXXXX
Chapter 1
XXXXXX Content of Chapter 1 XXXXXX
Chapter 2
XXXXXX Content of Chapter 2 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Chapter 3
XXXXXX Content of Chapter 3 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
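The structure-to-text conversion of step S1 can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the embodiment: a plain Python dict stands in for the Redis hashes, the `hgetall` helper only mimics the shape of the Redis command of the same name, and the field layout (child IDs stored as a comma-separated string) and the function name `structure_to_text` are assumptions made for the example.

```python
# Hypothetical in-memory stand-in for the Redis hashes described above:
# one hash (here, one dict of string fields) per term, keyed by term ID.
TERMS = {
    "1": {"id": "1", "name": "Title 1", "document": "content of Title 1",
          "parent": "", "children": "2,3"},
    "2": {"id": "2", "name": "Chapter 1", "document": "content of Chapter 1",
          "parent": "1", "children": "4"},
    "3": {"id": "3", "name": "Chapter 2", "document": "content of Chapter 2",
          "parent": "1", "children": ""},
    "4": {"id": "4", "name": "Section 1", "document": "content of Section 1",
          "parent": "2", "children": ""},
}


def hgetall(store, term_id):
    """Return a copy of all fields of one term, mimicking a Redis HGETALL call."""
    return dict(store[term_id])


def structure_to_text(store, root_id):
    """Walk the tree depth-first and emit the name and document of every term."""
    lines = []
    stack = [root_id]
    while stack:
        term = hgetall(store, stack.pop())
        lines.append(term["name"])
        lines.append(term["document"])
        child_ids = [c for c in term["children"].split(",") if c]
        # Push children in reverse so they are visited in their stored order.
        stack.extend(reversed(child_ids))
    return "\n".join(lines)


text = structure_to_text(TERMS, "1")
print(text)
```

The resulting text lists each term's name followed by its content, in the same top-down order as Text 1 and Text 2 above.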
Step S2: Compare the newly created root term, converted into text format, pairwise by cosine similarity against the existing root terms that have been converted into text format.
The calculation of the cosine similarity between terms is shown in Figure 3; the specific steps are as follows.
Step S201: Import the gensim library.
Step S202: Import all existing terms into the documents list, with the terms separated by commas.
Step S203: Vectorize all existing terms.
Step S204: Build the corresponding TF-IDF model from the vector values of step S203.
Step S205: Use the TF-IDF model to compute the TF-IDF value of each term.
Step S206: Build the corresponding LSI model from the TF-IDF values of the terms.
Step S207: Import the root term newly created by the user and vectorize it.
Step S208: Feed the vector value of the new root term from step S207 into the LSI model built in step S206.
Step S209: Feed the vector values of the terms from step S203 into the LSI model built in step S206, and build the cosine similarity calculation model.
Step S210: Feed the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the new root term and every existing term.
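The comparison in steps S202 to S210 can be approximated with the standard library alone, as sketched below. This is a simplified stand-in, not the gensim pipeline named in the steps: it weights words by TF-IDF and compares the sparse vectors directly, omitting the LSI projection of steps S206 to S209. The whitespace tokenizer, the sample documents, and all function names are assumptions made for the example; Chinese text would need a word segmenter instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (a simplification)."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Compute log2(N / df) for every token seen in the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {tok: math.log2(n_docs / df) for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens unknown to the corpus are dropped."""
    counts = Counter(tokens)
    return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Existing root terms, already converted to text format (made-up samples).
documents = [
    "machine learning chapter supervised models",
    "cooking chapter recipes soup",
    "gardening chapter soil plants",
]
corpus_tokens = [tokenize(d) for d in documents]
idf = build_idf(corpus_tokens)
corpus_vectors = [tfidf_vector(t, idf) for t in corpus_tokens]

# Newly created root term, vectorized with the same weights.
new_root_term = "supervised machine learning models"
query = tfidf_vector(tokenize(new_root_term), idf)
similarities = [cosine(query, vec) for vec in corpus_vectors]
```

Because "chapter" occurs in every sample document, its weight is zero and it does not contribute to the similarity; the first document therefore matches the new root term almost exactly, while the other two share no weighted words with it.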
Step S3: Convert the text format of each existing root term whose cosine similarity exceeds the preset threshold into the structure format and present it to the user; existing root terms that do not exceed the threshold are not presented.
Existing root terms whose cosine similarity exceeds the preset threshold (for example, 80%) are identified, and their text format is converted into the structure format.
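The threshold test can be illustrated with a short sketch. The function name `recommend`, the term IDs, and the scores are hypothetical, and sorting the hits by score is an added convenience that the embodiment does not prescribe.

```python
def recommend(similarities, threshold=0.8):
    """Return (term_id, score) pairs whose score exceeds the threshold, best first."""
    hits = [(term_id, score) for term_id, score in similarities.items()
            if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up cosine similarities between a new root term and existing root terms.
scores = {"root-17": 0.93, "root-42": 0.80, "root-58": 0.86, "root-99": 0.12}
recommended = recommend(scores)
```

With these sample scores, a term at exactly 0.80 is not recommended, because the condition is "exceeds" rather than "reaches" the threshold.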
The term attributes involved in the text format are stored as key-value pairs, using hash storage, to form the term structure; the term attributes include the term identifier, term name, term text, parent term, and child terms. All terms and their attributes are stored in the Redis database in hash format. The specific implementation steps are shown in Figure 4 and are as follows.
Step S301: Use the basic Redis hash command hgetall to retrieve the attributes of the root term and of all its sub-terms into an object.
Step S302: The web front end loads the open-source D3.js library.
Step S303: Use the d3.layout.tree command to define a tree object and set the size of the drawing area.
Step S304: The web front end requests data from the server, and the server sends the object from step S301 to the web front end in JSON format.
Step S305: Generate the node set nodes from the JSON data of step S304.
Step S306: Generate the nodes from the nodes set.
Step S307: Use the tree.links(nodes) command to obtain the set of node relations.
Step S308: Draw Bezier curve connections for the set of relations.
Step S309: Add a circular marker to each node: black if the node has child nodes, white otherwise.
Step S310: Add caption text to each node according to the document attribute of the JSON data.
Step S311: The conversion from text format to structure format is complete.
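The data side of steps S301 to S310 can be sketched as below. In the embodiment the rendering is done in the browser by D3.js; this Python sketch only mirrors the data handling (nesting flat term records, serializing to JSON, deriving nodes and parent-child links, and choosing the marker colour) and does not reproduce the D3 API. The record layout and all function names are assumptions made for the example.

```python
import json


def build_tree(records, root_id):
    """Nest flat term records (one dict per term) into a parent/children tree."""
    record = records[root_id]
    child_ids = [c for c in record["children"].split(",") if c]
    return {
        "id": record["id"],
        "name": record["name"],
        "document": record["document"],
        "children": [build_tree(records, c) for c in child_ids],
    }


def collect_nodes(tree):
    """Flatten the nested tree into a pre-order list of nodes."""
    nodes = [tree]
    for child in tree["children"]:
        nodes.extend(collect_nodes(child))
    return nodes


def collect_links(nodes):
    """List one (source, target) pair per parent-child relation, by term ID."""
    return [(n["id"], c["id"]) for n in nodes for c in n["children"]]


def marker_fill(node):
    """Black marker for nodes with children, white for leaves (step S309)."""
    return "black" if node["children"] else "white"


# Hypothetical flat records, shaped like the hashes retrieved in step S301.
records = {
    "1": {"id": "1", "name": "Title 2", "document": "content of Title 2",
          "parent": "", "children": "2,3"},
    "2": {"id": "2", "name": "Chapter 1", "document": "content of Chapter 1",
          "parent": "1", "children": ""},
    "3": {"id": "3", "name": "Chapter 2", "document": "content of Chapter 2",
          "parent": "1", "children": "4"},
    "4": {"id": "4", "name": "Section 1", "document": "content of Section 1",
          "parent": "3", "children": ""},
}

tree = build_tree(records, "1")
payload = json.dumps(tree, ensure_ascii=False)  # what a server could send (S304)
nodes = collect_nodes(json.loads(payload))      # node set (S305)
links = collect_links(nodes)                    # node relations (S307)
fills = {n["id"]: marker_fill(n) for n in nodes}
```

A front end would then draw one curve per entry in `links` and one circle per entry in `nodes`, captioned with the node's `document` attribute.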
Of the tools mentioned in this embodiment, gensim and redis are open-source libraries used from Python, while D3 (D3.js) is an open-source JavaScript library loaded by the web front end. documents is a user-created list; the TF-IDF and LSI models are models of the open-source gensim library; hgetall is a basic Redis command; tree is the object defined by the D3 command d3.layout.tree; JSON is a data format; and nodes is a user-created node set object.
Embodiment of the approximate term structure recommendation system
Figure 5 shows the principle of an embodiment of the approximate term structure recommendation system of the present invention. Referring to Figure 5, the system of this embodiment includes a text format conversion module 1, a cosine similarity comparison module 2, a structure format conversion module 3, and a storage module 4.
The text format conversion module 1 converts the structure format of a root term into a text format. In the text format conversion module 1, the term attributes include the term identifier (ID), term name (name), term text (document), parent term (parent), and child terms (children). When the term structure format is converted into the text format, the term attributes of the root term in the term structure and of all sub-terms under the root term are read out to form the text format.
Structured content on the web is now mostly displayed with the open-source D3 library, which renders the terms stored on the server as a tree diagram. Term attributes are stored as key-value pairs, that is, as a mapping table from string fields to values, so the hash storage method is suitable for storing them.
The web back end uses the key-value database Redis to store terms and term attributes; the attributes of each created term are stored in Redis as a hash. When format conversion is needed, the basic Redis hash command hgetall is used to retrieve the attributes of the root term and of all its sub-terms. Taking Figure 2 as an example, part of the information stored in the database for the term structures is as follows:
Text 1:
Title 1
XXXXXX This is the content of Title 1 XXXXXX
Chapter 1
XXXXXX Content of Chapter 1 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Chapter 2
XXXXXX Content of Chapter 2 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Section 3
XXXXXX Content of Section 3 XXXXXX
Chapter 3
XXXXXX Content of Chapter 3 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Text 2:
Title 2
XXXXXX This is the content of Title 2 XXXXXX
Chapter 1
XXXXXX Content of Chapter 1 XXXXXX
Chapter 2
XXXXXX Content of Chapter 2 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
Chapter 3
XXXXXX Content of Chapter 3 XXXXXX
Section 1
XXXXXX Content of Section 1 XXXXXX
Section 2
XXXXXX Content of Section 2 XXXXXX
The storage module 4 stores the structure format of every term and its corresponding text format.
The cosine similarity comparison module 2 compares the newly created root term, converted into text format, pairwise by cosine similarity against the existing root terms that have been converted into text format, selects the text format of each existing root term whose cosine similarity exceeds the preset threshold, and outputs it as a term structure format.
The calculation of the cosine similarity between terms in the cosine similarity comparison module 2 is shown in Figure 3; the specific steps are as follows.
Step S201: Import the gensim library.
Step S202: Import all existing terms into the documents list, with the terms separated by commas.
Step S203: Vectorize all existing terms.
Step S204: Build the corresponding TF-IDF model from the vector values of step S203.
Step S205: Use the TF-IDF model to compute the TF-IDF value of each term.
Step S206: Build the corresponding LSI model from the TF-IDF values of the terms.
Step S207: Import the root term newly created by the user and vectorize it.
Step S208: Feed the vector value of the new root term from step S207 into the LSI model built in step S206.
Step S209: Feed the vector values of the terms from step S203 into the LSI model built in step S206, and build the cosine similarity calculation model.
Step S210: Feed the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the new root term and every existing term.
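For reference, the two quantities used in these steps can be written out explicitly. The source text does not state the formulas; the definitions below are the standard ones and are given here as an assumption, with one common choice of TF-IDF weighting. Here tf is the number of occurrences of word t in text d, N is the number of texts in the corpus, df is the number of texts containing t, and q and d are the vectors of the new root term and of an existing root term.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_2 \frac{N}{\mathrm{df}_t},
\qquad
\cos(\mathbf{q}, \mathbf{d}) =
  \frac{\mathbf{q} \cdot \mathbf{d}}
       {\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
```

As a small worked case, for q = (1, 1, 0) and d = (1, 1, 1) the dot product is 2 and the norms are √2 and √3, so the cosine similarity is 2/√6 ≈ 0.816, which would just exceed a threshold of 0.8.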
The structure format conversion module 3 converts the text format of a root term into the structure format of the term. In the structure format conversion module 3, the term attributes involved in the text format are stored as key-value pairs, using hash storage, to form the term structure; the term attributes include the term identifier, term name, term text, parent term, and child terms. All terms and their attributes are stored in the Redis database in hash format. The specific implementation steps are shown in Figure 4 and are as follows.
Step S301: Use the basic Redis hash command hgetall to retrieve the attributes of the root term and of all its sub-terms into an object.
Step S302: The web front end loads the open-source D3.js library.
Step S303: Use the d3.layout.tree command to define a tree object and set the size of the drawing area.
Step S304: The web front end requests data from the server, and the server sends the object from step S301 to the web front end in JSON format.
Step S305: Generate the node set nodes from the JSON data of step S304.
Step S306: Generate the nodes from the nodes set.
Step S307: Use the tree.links(nodes) command to obtain the set of node relations.
Step S308: Draw Bezier curve connections for the set of relations.
Step S309: Add a circular marker to each node: black if the node has child nodes, white otherwise.
Step S310: Add caption text to each node according to the document attribute of the JSON data.
Step S311: The conversion from text format to structure format is complete.
Of the tools mentioned in this embodiment, gensim and redis are open-source libraries used from Python, while D3 (D3.js) is an open-source JavaScript library loaded by the web front end. documents is a user-created list; the TF-IDF and LSI models are models of the open-source gensim library; hgetall is a basic Redis command; tree is the object defined by the D3 command d3.layout.tree; JSON is a data format; and nodes is a user-created node set object.
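As a rough illustration of how the four modules might fit together, the sketch below wires a storage module, a text conversion function, a similarity function, and a recommendation function into one flow. Everything here is hypothetical: the class and function names are invented for the example, the store is an in-memory dict rather than Redis, and the similarity is computed on raw word counts rather than through the TF-IDF and LSI models of the embodiment.

```python
import math
from collections import Counter


class StorageModule:
    """Keeps each root term's structure and its text form (in-memory stand-in)."""

    def __init__(self):
        self.structures = {}
        self.texts = {}

    def save(self, root_id, structure, text):
        self.structures[root_id] = structure
        self.texts[root_id] = text


def to_text(structure):
    """Text format conversion: join name and document of every term, depth-first."""
    parts = [structure["name"], structure["document"]]
    for child in structure.get("children", []):
        parts.append(to_text(child))
    return " ".join(parts)


def cosine(text_a, text_b):
    """Cosine similarity on raw word counts (a simplification)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend_structures(storage, new_structure, threshold=0.8):
    """Compare a new root term with stored ones and return the matching structures."""
    new_text = to_text(new_structure)
    return [
        storage.structures[root_id]
        for root_id, text in storage.texts.items()
        if cosine(new_text, text) > threshold
    ]


storage = StorageModule()
similar = {"name": "Python guide", "document": "intro",
           "children": [{"name": "Syntax", "document": "basics"}]}
other = {"name": "Cooking", "document": "recipes", "children": []}
for root_id, structure in (("a", similar), ("b", other)):
    storage.save(root_id, structure, to_text(structure))

new_root = {"name": "Python guide", "document": "intro",
            "children": [{"name": "Syntax", "document": "basics examples"}]}
matches = recommend_structures(storage, new_root)
```

Returning the stored structure directly stands in for the structure format conversion module; in the embodiment that step rebuilds the structure from the stored hashes and renders it in the browser.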
Although the above methods are illustrated and described as a series of acts for simplicity of explanation, it should be understood and appreciated that these methods are not limited by the order of the acts, because, in accordance with one or more embodiments, some acts may occur in a different order and/or concurrently with other acts that are illustrated and described herein, or that are not illustrated and described herein but would be understood by those skilled in the art.
Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled persons may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, such computer-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

  1. An approximate term structure recommendation method, characterized by comprising:
    Step 1: receiving the structure of a root term newly created by a user, converting the structure format into a text format in real time, and storing it;
    Step 2: comparing the newly created root term, converted into text format, pairwise by cosine similarity against the existing root terms that have been converted into text format;
    Step 3: converting the text format of each existing root term whose cosine similarity exceeds a preset threshold into the structure format and presenting it to the user, and otherwise not presenting it to the user.
  2. The approximate term structure recommendation method according to claim 1, characterized in that, in converting the term structure format into the text format, the term attributes in the term structure are stored as key-value pairs using hash storage, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms, and in converting the term structure format into the text format, the term attributes of the root term in the term structure and of all sub-terms under the root term are read out to form the text format.
  3. The approximate term structure recommendation method according to claim 1, characterized in that Step 2 further comprises:
    Step 1: importing the gensim library;
    Step 2: importing all existing terms into the documents list, with the terms separated by commas;
    Step 3: vectorizing all existing terms;
    Step 4: building the corresponding TF-IDF model from the vector values of step 3;
    Step 5: using the TF-IDF model to compute the TF-IDF value of each term;
    Step 6: building the corresponding LSI model from the TF-IDF values of the terms;
    Step 7: importing the root term newly created by the user and vectorizing it;
    Step 8: feeding the vector value of the new root term from step 7 into the LSI model built in step 6;
    Step 9: feeding the vector values of the terms from step 3 into the LSI model built in step 6, and building the cosine similarity calculation model;
    Step 10: feeding the value obtained in step 8 into the cosine similarity calculation model, and outputting the cosine similarity between the new root term and every existing term.
  4. The approximate term structure recommendation method according to claim 1, characterized in that, in converting the text format into the term structure format in Step 3, the term attributes involved in the text format are stored as key-value pairs, using hash storage, to form the term structure, wherein the term attributes include the term identifier, term name, term text, parent term, and child terms.
  5. The approximate term structure recommendation method according to claim 4, characterized in that Step 3 further comprises:
    Step 1: using the basic Redis hash command hgetall to retrieve the attributes of the root term and of all its sub-terms into an object;
    Step 2: the web front end loading the open-source D3.js library;
    Step 3: using the d3.layout.tree command to define a tree object and setting the size of the drawing area;
    Step 4: the web front end requesting data from the server, and the server sending the object of step 1 to the web front end in JSON format;
    Step 5: generating the node set nodes from the JSON data of step 4;
    Step 6: generating the nodes from the nodes set;
    Step 7: using the tree.links(nodes) command to obtain the set of node relations;
    Step 8: drawing Bezier curve connections for the set of relations;
    Step 9: adding a circular marker to each node, black if the node has child nodes and white otherwise;
    Step 10: adding caption text to each node according to the document attribute of the JSON data;
    Step 11: completing the conversion from text format to structure format.
  6. An approximate term structure recommendation system, characterized by comprising:
    a text format conversion module, which converts the structure format of a root term into a text format;
    a storage module, which stores the structure format of every term and its corresponding text format;
    a cosine similarity comparison module, which compares the newly created root term, converted into text format, pairwise by cosine similarity against the existing root terms that have been converted into text format, selects the text format of each existing root term whose cosine similarity exceeds a preset threshold, and outputs it as a term structure format;
    a structure format conversion module, which converts the text format of a root term into the structure format of the term.
  7. The approximate term structure recommendation system according to claim 6, characterized in that, in the text format conversion module, the term attributes in a term structure are stored as key-value pairs in a hash storage manner, the term attributes including a term identifier, a term name, a term text, a parent term, and child terms; when the term structure format is converted into the text format, the term attributes of the root term and of all sub-terms under the root term are read out to form the text format.
  8. The approximate term structure recommendation system according to claim 6, characterized in that, in the structure format conversion module, the term attributes involved in the text format are stored as key-value pairs in a hash storage manner to form a term structure, the term attributes including a term identifier, a term name, a term text, a parent term, and child terms.
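The modules of claims 6 to 8 can be sketched in Python as follows. This is a rough illustration, not the claimed implementation. Several points are assumptions: each term is held as a dict (a hash of key-value pairs) with the keys `id`, `name`, `text`, `parent` and `children`; only the name and text attributes are read into the text format; the texts are compared as whitespace-tokenized term-frequency vectors; and the threshold is 0.5. The claims do not fix the vectorization or the threshold value.

```python
import math
from collections import Counter

# Hypothetical storage module: term identifier -> hash of term attributes.
STORE = {
    "a": {"id": "a", "name": "fruit", "text": "sweet edible plant food",
          "parent": None, "children": ["a1"]},
    "a1": {"id": "a1", "name": "apple", "text": "red fruit",
           "parent": "a", "children": []},
    "b": {"id": "b", "name": "fruit", "text": "sweet edible plant food",
          "parent": None, "children": ["b1"]},
    "b1": {"id": "b1", "name": "pear", "text": "green fruit",
           "parent": "b", "children": []},
    "c": {"id": "c", "name": "engine", "text": "machine converting fuel into motion",
          "parent": None, "children": []},
}


def to_text(store, root_id):
    """Read the root term and all of its sub-terms out into one text string."""
    parts, stack = [], [root_id]
    while stack:
        term = store[stack.pop()]
        parts.extend([term["name"], term["text"]])
        stack.extend(reversed(term["children"]))  # depth-first, in child order
    return " ".join(parts)


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts as term-frequency vectors (assumed model)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(store, new_root_id, existing_root_ids, threshold=0.5):
    """Return (root id, score) for existing roots scoring above the threshold."""
    new_text = to_text(store, new_root_id)
    scored = [(rid, cosine_similarity(new_text, to_text(store, rid)))
              for rid in existing_root_ids]
    return sorted([s for s in scored if s[1] > threshold],
                  key=lambda s: s[1], reverse=True)


recommended = recommend(STORE, "a", ["b", "c"])
```

With this sample data, root term `b` shares most of its words with the new root term `a` and is recommended, while the unrelated root term `c` scores zero and is filtered out. The recommended identifiers can then be looked up in the store to return the corresponding term structures.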
PCT/CN2018/084818 2017-03-07 2018-04-27 Approximate entry structure recommendation method and system WO2018184600A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710131132.1A CN108572954B (en) 2017-03-07 2017-03-07 Method and system for recommending approximate entry structure
CN201710131132.1 2017-03-07

Publications (1)

Publication Number Publication Date
WO2018184600A1 (en) 2018-10-11

Family

ID=63577212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/084818 WO2018184600A1 (en) 2017-03-07 2018-04-27 Approximate entry structure recommendation method and system

Country Status (2)

Country Link
CN (1) CN108572954B (en)
WO (1) WO2018184600A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829862A (en) * 2024-03-04 2024-04-05 贵州联广科技股份有限公司 Interconnection-based data source tracing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360358A (en) * 2011-09-28 2012-02-22 百度在线网络技术(北京)有限公司 Keyword recommendation method and system
CN103150376A (en) * 2013-03-12 2013-06-12 中科软科技股份有限公司 Construction method for industrial application software root table
CN104408148A (en) * 2014-12-03 2015-03-11 复旦大学 Field encyclopedia establishment system based on general encyclopedia websites
US20160224636A1 (en) * 2015-01-30 2016-08-04 Nec Europe Ltd. Scalable system and method for weighted similarity estimation in massive datasets revealed in a streaming fashion

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0823865B2 (en) * 1987-08-28 1996-03-06 株式会社日立製作所 DATA SEARCH METHOD AND DEVICE
JP2004005337A (en) * 2002-03-28 2004-01-08 Nippon Telegr & Teleph Corp <Ntt> Word relation database constructing method and device, word/document processing method and device using word relation database, explanation expression adequacy verifying method, programs for these, storage medium storing them, word similarity computing method, word grouping method, representive word extracting method, and word concept hierarchial method
EP2000925A1 (en) * 2007-06-08 2008-12-10 Deutsche Telekom AG An intelligent graph-based expert searching system
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
CN103150667B (en) * 2013-03-14 2016-06-15 北京大学 A kind of personalized recommendation method based on body construction
CN103593792B (en) * 2013-11-13 2016-09-28 复旦大学 A kind of personalized recommendation method based on Chinese knowledge mapping and system
US9684709B2 (en) * 2013-12-14 2017-06-20 Microsoft Technology Licensing, Llc Building features and indexing for knowledge-based matching
CN104142918B (en) * 2014-07-31 2017-04-05 天津大学 Short text clustering and focus subject distillation method based on TF IDF features
US10210246B2 (en) * 2014-09-26 2019-02-19 Oracle International Corporation Techniques for similarity analysis and data enrichment using knowledge sources
CN104484374B (en) * 2014-12-08 2018-11-16 百度在线网络技术(北京)有限公司 A kind of method and device creating network encyclopaedia entry
CN104572970B (en) * 2014-12-31 2017-09-12 浙江大学 A kind of SPARQL query statements generation system based on ontology library content
CN105989088B (en) * 2015-02-12 2019-05-14 马正方 Learning device under digitized environment
CN104866614A (en) * 2015-06-05 2015-08-26 深圳市爱学堂教育科技有限公司 Entry creating method and entry creating device
CN105653650B (en) * 2015-12-28 2019-02-26 湖北工业大学 A kind of discussion system mind map and its development approach based on D3
CN106250526A (en) * 2016-08-05 2016-12-21 浪潮电子信息产业股份有限公司 A kind of text class based on content and user behavior recommends method and apparatus
CN106295912B (en) * 2016-08-30 2021-10-22 成都科来网络技术有限公司 Method and device for configuring and displaying transaction path based on business logic
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN106372194B (en) * 2016-08-31 2019-12-20 杭州追灿科技有限公司 Method and system for presenting search results

Also Published As

Publication number Publication date
CN108572954B (en) 2023-04-28
CN108572954A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
US10120844B2 (en) Determining the likelihood that an input descriptor and associated text content match a target field using natural language processing techniques in preparation for an extract, transform and load process
US10332012B2 (en) Knowledge driven solution inference
CN106776495B (en) Document logic structure reconstruction method
CN102693269B (en) For the Extensible surface of consumption information extraction service
US8271513B1 (en) Module and method for searching named entity of terms from the named entity database using named entity database and mining rule merged ontology schema
EP2737421A1 (en) Weighting metric for visual search of entity-relationship databases
US20140195532A1 (en) Collecting digital assets to form a searchable repository
US9720895B1 (en) Device for construction of computable linked semantic annotations
US20110022627A1 (en) Method and apparatus for functional integration of metadata
WO2018205852A1 (en) Method and system for automatically creating folder tree diagram
CN114817481A (en) Big data-based intelligent supply chain visualization method and device
CN111046135A (en) Unstructured text processing method and device, computer equipment and storage medium
US10489024B2 (en) UI rendering based on adaptive label text infrastructure
WO2022019986A1 (en) Enterprise knowledge graphs using multiple toolkits
US20130006979A1 (en) Enhancing cluster analysis using document metadata
WO2018184600A1 (en) Approximate entry structure recommendation method and system
WO2022020005A1 (en) Enterprise knowledge graphs using user-based mining
KR20230115964A (en) Method and apparatus for generating knowledge graph
Wagenpfeil et al. Graph codes-2d projections of multimedia feature graphs for fast and effective retrieval
CN110363206A (en) Cluster, data processing and the data identification method of data object
CN112632223A (en) Case and event knowledge graph construction method and related equipment
CN104281570A (en) Information processing method and device and method and device for standardizing organization names
Koutras Data as a Language: A Novel Approach to Data Integration.
WO2022020012A1 (en) Annotations for enterprise knowledge graphs using multiple toolkits
Kamran et al. SemanticHadith: An ontology-driven knowledge graph for the hadith corpus

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18780786; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 18780786; Country of ref document: EP; Kind code of ref document: A1)