WO2018184600A1 - Approximate entry structure recommendation method and system

Info

Publication number: WO2018184600A1
Application number: PCT/CN2018/084818
Authority: WO
Inventors: 马也驰; 谭红
Original assignee: 上海颐为网络科技有限公司
Priority date: 2017-03-07
Filing date: 2018-04-27
Publication date: 2018-10-11
Also published as: CN108572954B; CN108572954A

Abstract

Disclosed are an approximate entry structure recommendation method and system, capable of automatically identifying approximate entry structures and providing them for reference to users who create entries, thereby improving the efficiency with which users create entry structures and enhancing users' comprehension of the entry structure. The technical solution of the present invention comprises: receiving a structure of a root entry created by a user, converting the structure format into a text format in real time and storing same; performing pairwise cosine similarity comparison between the created root entry that is converted into the text format and other existing root entries that are converted into the text format; and converting the text format of the existing root entries whose cosine similarity exceeds a preset threshold into a structure format and then presenting same to the user, while not presenting the remaining entries to the user.

Description

Approximate term structure recommendation method and system

Technical field

The present invention relates to a method and system for recommending approximate term structures, and more particularly to a technique for recommending a term structure based on a cosine similarity parameter.

Background technique

On an information platform based on term structures, as the number of users increases, many users will define and structure the same knowledge system. When a user creates a new root term in order to establish a term structure, a term structure similar to that of the new root term is often already stored in the system.

On previous information platforms, even if a similar term structure exists, the user creating the new root term is not informed, so the term structures already known to the platform cannot serve that user. Users still build term structures without any reference, which reduces their efficiency on the platform. Moreover, this easily leads to a large number of terms with similar structures on the platform, which hinders the collation and display of information.

Therefore, the industry currently needs a means to automatically obtain the approximate term structure already stored in the system and provide it to the user for reference.

Summary of the invention

A brief overview of one or more aspects is provided below to give a basic understanding of these aspects. This summary is not an extensive overview of all contemplated aspects, and is not intended to identify key or critical elements of all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that follows.

The object of the present invention is to solve the above problems by providing an approximate term structure recommendation method and system that can automatically identify similar term structures and provide them for reference to users who create new terms, thereby improving the efficiency with which users establish term structures and deepening the users' understanding of the term structure.

The technical solution of the present invention is as follows: The present invention discloses an approximate term structure recommendation method, including:

Step 1: Receive the structure of the newly created root term of the user, convert the structure format into a text format and store it in real time;

Step 2: Perform a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format;

Step 3: Convert the text format of the existing root term whose cosine similarity exceeds the preset threshold into a structural format and present it to the user, otherwise it is not presented to the user.

According to an embodiment of the approximate term structure recommendation method of the present invention, in the process of converting the term structure format into a text format, the term attributes in the term structure are stored in a hash storage manner according to key-value pairs, wherein the term attributes include an entry identifier, entry name, entry text, parent entry, and child entries. In the process of converting the term structure format into a text format, the term attributes of the root entry in the entry structure and the term attributes of all sub-entries under the root entry are read out to form the text format.

According to an embodiment of the approximate term structure recommendation method of the present invention, step 2 further includes:

Step 1: Import the gensim library;

Step 2: Import all existing entries into the documents list, with the entries separated by commas;

Step 3: Vectorize all existing entries;

Step 4: Construct a corresponding TF-IDF model by using the vector values in step 3;

Step 5: Calculate the TF-IDF value of each entry through the TF-IDF model;

Step 6: Construct a corresponding LSI model by using the TF-IDF value of each entry;

Step 7: Import the newly created root entry of the user and vectorize it;

Step 8: Import the vector value of the newly created root term in step 7 into the LSI model constructed in step 6;

Step 9: Import the vector value of the term in step 3 into the LSI model constructed in step 6, and construct a cosine similarity calculation model;

Step 10: Import the value obtained in step 8 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing terms.
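
For reference, the cosine similarity output in step 10 is the standard measure written out below. The symbols a and b are introduced here for illustration only; they stand for the n-dimensional vector representations of the newly created root term and of one existing entry.

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

A value of 1 means the two vectors point in the same direction, and a value of 0 means they are orthogonal, that is, they share no weighted component.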

According to an embodiment of the approximate term structure recommendation method of the present invention, in the process of converting the text format into the term structure format in step 3, the term attributes related to the text format are stored as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms.

According to an embodiment of the approximate term structure recommendation method of the present invention, step 3 further includes:

Step 1: Use the basic command hgetall of the redis hash to extract the attributes of the root entry and the attributes of all the sub-terms of the root entry to an object;

Step 2: The web front end loads the D3.js open source library;

Step 3: Define a tree object using the d3.layout.tree command and determine the image area size;

Step 4: The web front end requests data from the server, and the server transmits the object of step 1 to the web front end according to the JSON format;

Step 5: Generate the node set nodes according to the JSON data of step 4;

Step 6: Generate the nodes according to the nodes collection;

Step 7: Use the tree.links(nodes) command to get the node relationship set;

Step 8: Set a Bezier curve connection for the relationship set;

Step 9: Add a circular mark to each node, black if the node has child nodes, otherwise white;

Step 10: Add a description text to the node according to the document attribute of the JSON data;

Step 11: Complete the conversion of the text format to the structure format.

The invention also discloses an approximate term structure recommendation system, comprising:

a text format conversion module that converts the structural format of the root term into a text format;

a storage module that stores the structure format of all terms and the corresponding text format thereof;

a cosine similarity comparison module that performs a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format, compares the cosine similarity with the preset threshold, and outputs the text format of the root entries exceeding the threshold for conversion into the form of an entry structure;

a structure format conversion module that converts the text format of the root entry into the structural format of the entry.

According to an embodiment of the approximate term structure recommendation system of the present invention, in the text format conversion module, the term attributes in the term structure are stored in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms. In the process of converting the term structure format into a text format, the term attributes of the root term in the term structure and the term attributes of all sub-terms under the root term are read out to form the text format.

According to an embodiment of the approximate term structure recommendation system of the present invention, in the structure format conversion module, the term attributes related to the text format are stored as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms.

Drawings

Figure 1 shows a flow chart of an embodiment of the approximate term structure recommendation method of the present invention.

Figure 2 shows the two term structures used as examples in the present invention.

Figure 3 is a flow chart showing the calculation of the cosine similarity between terms in the present invention.

Figure 4 is a flow chart showing the conversion of the text format into the term structure form in the present invention.

Figure 5 shows a schematic diagram of an embodiment of an approximate term structure recommendation system of the present invention.

Detailed description

The above features and advantages of the present invention will be better understood after reading the following detailed description of the embodiments in conjunction with the drawings. In the figures, components are not necessarily drawn to scale, and components having similar properties or features may have the same or similar reference numerals.

Embodiment of approximate term structure recommendation method

Figure 1 shows the flow of an embodiment of the approximate term structure recommendation method of the present invention. In the description of this embodiment, the two term structures shown in Figure 2, namely entry structure 1 and entry structure 2, are used as an example.

Step S1: Receive the structure of the newly created root term of the user, convert the structure format into a text format and store it in real time.

The entry attributes include the entry identifier (ID), the entry name (name), the entry text (document), the parent term (parent), and the child term (children). In the process of converting the term structure format into a text format, the term attribute of the root term in the term structure and the term attribute of all sub-terms under the root term are read out to form a text format.

Structured display on the web now mostly uses the D3 open source library, which displays the entries stored on the server as a tree diagram. The entry attributes are stored as key-value pairs, that is, a mapping table of string-type fields and values, so the hash storage mode is suitable for this storage.

The web back end uses the key-value database redis to store terms and term attributes, and the term attributes of each entry are stored in redis in the hash storage mode. When a format conversion is required, the basic redis hash command hgetall is used to take out the attributes of the root entry and the attributes of all sub-terms of the root entry. Taking Figure 2 as an example, the locally stored information of the entry structures in the database is as follows:

Text 1:

Title 1

XXXXXX This is the content of Title 1 XXXXXX

Chapter 1

XXXXXX Content of Chapter 1 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Chapter 2

XXXXXX Content of Chapter 2 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Section 3

XXXXXX Content of Section 3 XXXXXX

Chapter 3

XXXXXX Content of Chapter 3 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Text 2:

Title 2

XXXXXX This is the content of Title 2 XXXXXX

Chapter 1

XXXXXX Content of Chapter 1 XXXXXX

Chapter 2

XXXXXX Content of Chapter 2 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Chapter 3

XXXXXX Content of Chapter 3 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX
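
The flattening that produces a text of this shape can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: a plain dictionary stands in for the redis hashes so that the sketch runs without a server, the field names follow the attributes listed above, and storing the children as a comma-separated string of identifiers is an assumption made here.

```python
# Hypothetical in-memory stand-in for the redis hashes: entry ID -> fields.
# The comma-separated "children" field is an assumption for illustration.
entries = {
    "1": {"id": "1", "name": "Title 1", "document": "content of Title 1",
          "parent": "", "children": "2,3"},
    "2": {"id": "2", "name": "Chapter 1", "document": "content of Chapter 1",
          "parent": "1", "children": "4"},
    "3": {"id": "3", "name": "Chapter 2", "document": "content of Chapter 2",
          "parent": "1", "children": ""},
    "4": {"id": "4", "name": "Section 1", "document": "content of Section 1",
          "parent": "2", "children": ""},
}


def structure_to_text(entries, root_id):
    """Read the root entry and all of its sub-entries, depth first, into one text."""
    lines = []
    stack = [root_id]
    while stack:
        entry = entries[stack.pop()]
        lines.append(entry["name"])
        lines.append(entry["document"])
        child_ids = [c for c in entry["children"].split(",") if c]
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(child_ids))
    return "\n".join(lines)


text = structure_to_text(entries, "1")
print(text)
```

With a live database, each dictionary in entries would instead be the result of an hgetall call on the corresponding key.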

Step S2: Perform a pairwise cosine similarity comparison between the newly created root term converted into a text format and the existing root terms converted into a text format.

The calculation of the cosine similarity between terms is shown in Figure 3. The specific steps are as follows.

Step S201: Import the gensim library.

Step S202: Import all existing entries into the documents list, with the entries separated by commas.

Step S203: Vectorize all existing entries.

Step S204: Construct a corresponding TF-IDF model from the vector values in step S203.

Step S205: Calculate the TF-IDF value of each entry by using the TF-IDF model.

Step S206: Construct a corresponding LSI model from the TF-IDF value of each entry.

Step S207: Import the newly created root term of the user and vectorize it.

Step S208: The vector value of the newly created root term in step S207 is introduced into the LSI model constructed in step S206.

Step S209: The vector value of the term in step S203 is introduced into the LSI model constructed in step S206, and a cosine similarity calculation model is constructed.

Step S210: Import the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing entries.
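
The core of these steps can be illustrated without gensim by a pure-Python stand-in. The sketch below is not the gensim pipeline of the embodiment: it omits the LSI projection of steps S206 to S209, it uses one common smoothed form of the inverse document frequency, and it tokenizes by whitespace, which would have to be replaced by a word segmenter for Chinese text. All function names and sample entries are hypothetical.

```python
import math
from collections import Counter


def build_idf(token_lists):
    """Fit a smoothed inverse document frequency on the existing entries."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return lambda term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf(tokens, idf):
    """Vectorize one entry as a sparse {term: weight} mapping."""
    return {term: count * idf(term) for term, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Existing root entries, already converted to text and tokenized.
existing = {
    "Title 2": "title chapter one chapter two section one section two".split(),
    "Recipes": "flour sugar butter oven".split(),
}
idf = build_idf(list(existing.values()))

# Newly created root entry, vectorized with the same model.
new_vec = tfidf("title chapter one section one".split(), idf)
similarities = {name: cosine(new_vec, tfidf(tokens, idf))
                for name, tokens in existing.items()}
print(similarities)
```

The entry that shares vocabulary with the new root entry receives a positive score, while the unrelated one scores zero, which is the property the threshold comparison in step S3 relies on.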

Step S3: Convert the text format of the existing root term whose cosine similarity exceeds the preset threshold into a structural format and present it to the user, otherwise it is not presented to the user.

The existing root terms whose cosine similarity exceeds a preset threshold (such as 80%) are identified, and their text format is converted into a structural format.
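
The threshold comparison itself might look like the sketch below. The function name, the sample scores, and the ordering of the results by descending similarity are assumptions added for illustration; the text only requires that entries above the threshold be presented and the others not.

```python
def select_similar(similarities, threshold=0.8):
    """Return (entry_id, score) pairs whose score exceeds the threshold, best first."""
    hits = [(entry_id, score) for entry_id, score in similarities.items()
            if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity scores for three existing root entries.
recommended = select_similar({"a": 0.93, "b": 0.41, "c": 0.85})
print(recommended)
```

An entry whose score equals the threshold exactly is not selected, matching the wording "exceeds".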

The term attributes related to the text format are stored as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms. All terms and term attributes are stored in the redis database in hash format. The specific implementation steps are further shown in FIG. 4, as follows.

Step S301: Use the basic command hgetall of the redis hash to extract the attribute of the root term and the attributes of all the sub-terms of the root term to an object.

Step S302: The web front end loads the D3.js open source library.

Step S303: Define a tree object using the d3.layout.tree command, and determine the image area size.

Step S304: The web front end requests data from the server, and the server transmits the object of step S301 to the web front end according to the JSON format.

Step S305: Generate the node set nodes according to the JSON data of step S304.

Step S306: Generate the nodes according to the nodes collection.

Step S307: Obtain a node relationship set by using the tree.links(nodes) command.

Step S308: Set a Bezier curve connection for the relationship set.

Step S309: Add a circular mark to each node, black if the node has child nodes, otherwise white.

Step S310: Add a description text to the node according to the document attribute of the JSON data.

Step S311: Complete conversion of the text format to the structural format.
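
The server side of steps S301 and S304, together with the data that steps S307 and S309 derive from it, can be sketched in Python as follows. This is an illustrative sketch under assumptions: a dictionary again stands in for the hgetall results, the comma-separated children field is hypothetical, and the links and marker_colour helpers only mimic the information that the D3 front end computes; they are not D3 calls.

```python
import json

# Hypothetical stand-in for the hgetall results of one term structure.
entries = {
    "1": {"id": "1", "name": "Title 2", "document": "content of Title 2", "children": "2,3"},
    "2": {"id": "2", "name": "Chapter 1", "document": "content of Chapter 1", "children": ""},
    "3": {"id": "3", "name": "Chapter 2", "document": "content of Chapter 2", "children": "4"},
    "4": {"id": "4", "name": "Section 1", "document": "content of Section 1", "children": ""},
}


def build_tree(entries, root_id):
    """Assemble the flat hash records into one nested object for the front end."""
    entry = entries[root_id]
    child_ids = [c for c in entry["children"].split(",") if c]
    return {
        "id": entry["id"],
        "name": entry["name"],
        "document": entry["document"],
        "children": [build_tree(entries, c) for c in child_ids],
    }


def links(node):
    """Parent-child pairs, analogous to the node relationship set of step S307."""
    pairs = []
    for child in node["children"]:
        pairs.append((node["name"], child["name"]))
        pairs.extend(links(child))
    return pairs


def marker_colour(node):
    """Colour rule of step S309: black if the node has children, otherwise white."""
    return "black" if node["children"] else "white"


tree = build_tree(entries, "1")
payload = json.dumps(tree)  # what the server would send in step S304
print(links(tree))
```

Nesting the children inside each node gives the front end the hierarchical shape that tree layouts generally expect as input.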

Of the tools mentioned in this embodiment, gensim is an open source Python library, redis is an open source key-value database that can be accessed from Python through a client library, and D3 is an open source JavaScript library that runs in the web front end. documents is a self-created list; the TF-IDF and LSI models are models of the gensim library; hgetall is a basic redis hash command; tree is the object defined by the D3 command d3.layout.tree; JSON is a data format; and nodes is a self-created node collection object.

Embodiment of approximate term structure recommendation system

Figure 5 illustrates the principles of an embodiment of the approximate term structure recommendation system of the present invention. Referring to FIG. 5, the system of this embodiment includes a text format conversion module 1, a cosine similarity comparison module 2, a structural format conversion module 3, and a storage module 4.
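
How the four modules cooperate might be sketched as below. The class, its method, and the toy callables passed in are all hypothetical; in particular the word-overlap function is only a placeholder for the cosine similarity comparison so that the sketch stays short and self-contained.

```python
class RecommendationSystem:
    """Hypothetical wiring of the four modules; all names are illustrative."""

    def __init__(self, to_text, similarity, to_structure, threshold=0.8):
        self.to_text = to_text            # text format conversion module 1
        self.similarity = similarity      # cosine similarity comparison module 2
        self.to_structure = to_structure  # structure format conversion module 3
        self.store = {}                   # storage module 4: root ID -> (structure, text)
        self.threshold = threshold

    def add_root(self, root_id, structure):
        """Store a new root entry and return the similar existing structures."""
        text = self.to_text(structure)
        scores = {rid: self.similarity(text, other_text)
                  for rid, (_, other_text) in self.store.items()}
        self.store[root_id] = (structure, text)
        return [self.to_structure(self.store[rid][1])
                for rid, score in scores.items() if score > self.threshold]


def word_overlap(a, b):
    """Toy placeholder for module 2: Jaccard overlap of the words in two texts."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Toy structures are word lists; joining and splitting stand in for modules 1 and 3.
system = RecommendationSystem(" ".join, word_overlap, str.split, threshold=0.4)
first = system.add_root("r1", ["title", "chapter", "section"])
second = system.add_root("r2", ["title", "chapter", "appendix"])
print(first, second)
```

Passing the modules in as callables keeps the storage and threshold logic independent of how the conversions and the similarity are actually computed.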

The text format conversion module 1 is used to convert the structural format of the root term into a text format. In the text format conversion module 1, the entry attributes include the entry identifier (ID), the entry name (name), the entry text (document), the parent term (parent), and the child terms (children). In the process of converting the term structure format into a text format, the term attributes of the root term in the term structure and the term attributes of all sub-terms under the root term are read out to form a text format.

Text 1:

Title 1

XXXXXX This is the content of Title 1 XXXXXX

Chapter 1

XXXXXX Content of Chapter 1 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Chapter 2

XXXXXX Content of Chapter 2 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Section 3

XXXXXX Content of Section 3 XXXXXX

Chapter 3

XXXXXX Content of Chapter 3 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Text 2:

Title 2

XXXXXX This is the content of Title 2 XXXXXX

Chapter 1

XXXXXX Content of Chapter 1 XXXXXX

Chapter 2

XXXXXX Content of Chapter 2 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

Chapter 3

XXXXXX Content of Chapter 3 XXXXXX

Section 1

XXXXXX Content of Section 1 XXXXXX

Section 2

XXXXXX Content of Section 2 XXXXXX

The storage module 4 is configured to store the structural format of all the terms and their corresponding text formats.

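The cosine similarity comparison module 2 performs a pairwise cosine similarity comparison between the newly created root term converted into a text format and the existing root terms converted into a text format, compares the cosine similarity with a preset threshold, and outputs the text format of the root entries exceeding the threshold for conversion into term structure form.

The calculation of the cosine similarity between terms in the cosine similarity comparison module 2 is shown in FIG. 3, and the specific steps are as follows.

Step S201: Import the gensim library.

Step S202: Import all existing entries into the documents list, with the entries separated by commas.

Step S203: Vectorize all existing entries.

Step S204: Construct a corresponding TF-IDF model from the vector values in step S203.

Step S205: Calculate the TF-IDF value of each entry by using the TF-IDF model.

Step S206: Construct a corresponding LSI model from the TF-IDF value of each entry.

Step S207: Import the newly created root term of the user and vectorize it.

Step S208: The vector value of the newly created root term in step S207 is introduced into the LSI model constructed in step S206.

Step S209: The vector value of the term in step S203 is introduced into the LSI model constructed in step S206, and a cosine similarity calculation model is constructed.

Step S210: Import the value obtained in step S208 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing entries.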

The structure format conversion module 3 is used to convert the text format of the root entry into the structural format of the entry. In the structure format conversion module 3, the term attributes related to the text format are stored as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms. All terms and term attributes are stored in the redis database in hash format. The specific implementation steps are further shown in FIG. 4, as follows.

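Step S301: Use the basic command hgetall of the redis hash to extract the attributes of the root term and the attributes of all the sub-terms of the root term into an object.

Step S302: The web front end loads the D3.js open source library.

Step S303: Define a tree object using the d3.layout.tree command, and determine the image area size.

Step S304: The web front end requests data from the server, and the server transmits the object of step S301 to the web front end according to the JSON format.

Step S305: Generate the node set nodes according to the JSON data of step S304.

Step S306: Generate the nodes according to the nodes collection.

Step S307: Obtain a node relationship set by using the tree.links(nodes) command.

Step S308: Set a Bezier curve connection for the relationship set.

Step S309: Add a circular mark to each node, black if the node has child nodes, otherwise white.

Step S310: Add a description text to the node according to the document attribute of the JSON data.

Step S311: Complete conversion of the text format to the structural format.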

Although the above methods are illustrated and described as a series of acts for simplicity of explanation, it should be understood and appreciated that these methods are not limited by the order of the acts, as, in accordance with one or more embodiments, some acts may occur in different orders and/or concurrently with other acts that are illustrated and described herein, or that are not illustrated and described herein but would be understood by those skilled in the art.

Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps are described above generally in the form of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and design constraints imposed on the overall system. The skilled person will be able to implement the described functionality in a different manner for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the invention.

The various illustrative logic blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or executed with general purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor to enable the processor to read and write information to/from the storage medium. In the alternative, the storage medium can be integrated into the processor. The processor and the storage medium can reside in an ASIC. The ASIC can reside in the user terminal. In the alternative, the processor and the storage medium may reside as a discrete component in the user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented as a computer program product in software, the functions may be stored on or transmitted as one or more instructions or code on a computer readable medium. Computer readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, such computer readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also properly termed a computer readable medium. For example, if the software is transmitted from a web site, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the present disclosure will be obvious to those skilled in the art, and the general principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. The present disclosure is therefore not intended to be limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.

Claims

An approximate term structure recommendation method, comprising:

Step 1: Receive the structure of the newly created root term of the user, convert the structure format into a text format and store it in real time;

Step 2: Perform a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format;

Step 3: Convert the text format of the existing root term whose cosine similarity exceeds the preset threshold into a structural format and present it to the user, otherwise it is not presented to the user.
The approximate term structure recommendation method according to claim 1, wherein in the process of converting the term structure format into a text format, the term attributes in the term structure are stored in a hash storage manner according to key-value pairs, the term attributes including a term identifier, a term name, a term text, a parent term, and child terms, and wherein in the process of converting the term structure format into a text format, the term attributes of the root term in the term structure and the term attributes of all sub-terms under the root term are read out to form the text format.
The approximate term structure recommendation method according to claim 1, wherein step 2 further comprises:

Step 1: Import the gensim library;

Step 2: Import all existing entries into the documents list, with the entries separated by commas;

Step 3: Vectorize all existing entries;

Step 4: Construct a corresponding TF-IDF model by using the vector values in step 3;

Step 5: Calculate the TF-IDF value of each entry through the TF-IDF model;

Step 6: Construct a corresponding LSI model by using the TF-IDF value of each entry;

Step 7: Import the newly created root entry of the user and vectorize it;

Step 8: Import the vector value of the newly created root term in step 7 into the LSI model constructed in step 6;

Step 9: Import the vector value of the term in step 3 into the LSI model constructed in step 6, and construct a cosine similarity calculation model;

Step 10: Import the value obtained in step 8 into the cosine similarity calculation model, and output the cosine similarity between the newly created root term and all the existing terms.
The approximate term structure recommendation method according to claim 1, wherein in the process of converting the text format into the term structure format in step 3, the term attributes related to the text format are stored as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms.
The approximate term structure recommendation method according to claim 4, wherein step 3 further comprises:

Step 1: Use the basic command hgetall of the redis hash to extract the attributes of the root entry and the attributes of all the sub-terms of the root entry to an object;

Step 2: The web front end loads the D3.js open source library;

Step 3: Define a tree object using the d3.layout.tree command and determine the image area size;

Step 4: The web front end requests data from the server, and the server transmits the object of step 1 to the web front end according to the JSON format;

Step 5: Generate the node set nodes according to the JSON data of step 4;

Step 6: Generate the nodes according to the nodes collection;

Step 7: Use the tree.links(nodes) command to get the node relationship set;

Step 8: Set a Bezier curve connection for the relationship set;

Step 9: Add a circular mark to each node, black if the node has child nodes, otherwise white;

Step 10: Add a description text to the node according to the document attribute of the JSON data;

Step 11: Complete the conversion of the text format to the structure format.
An approximate term structure recommendation system, comprising:

a text format conversion module that converts the structural format of the root term into a text format;

a storage module that stores the structure format of all terms and the corresponding text format thereof;

a cosine similarity comparison module that performs a pairwise cosine similarity comparison between the newly created root term converted into text format and the existing root terms converted into text format, compares the cosine similarity with the preset threshold, and outputs the text format of the root entries exceeding the threshold for conversion into the form of an entry structure;

a structure format conversion module that converts the text format of the root entry into the structural format of the entry.
The approximate term structure recommendation system according to claim 6, wherein in the text format conversion module, the term attributes in the term structure are stored in a hash storage manner according to key-value pairs, the term attributes including a term identifier, a term name, a term text, a parent term, and child terms, and wherein in the process of converting the term structure format into a text format, the term attributes of the root term in the term structure and the term attributes of all sub-terms under the root term are read out to form the text format.
The approximate term structure recommendation system according to claim 6, wherein the structure format conversion module stores the term attributes related to the text format as a term structure in a hash storage manner according to key-value pairs, wherein the term attributes include a term identifier, a term name, a term text, a parent term, and child terms.