CN113536761A

CN113536761A - Method for calculating sentence similarity based on frame importance

Info

Publication number: CN113536761A
Application number: CN202110776700.XA
Authority: CN
Inventors: 王铁鑫; 史荟; 刘文静; 严欣华
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-07-09
Filing date: 2021-07-09
Publication date: 2021-10-22
Anticipated expiration: 2041-07-09
Also published as: CN113536761B

Abstract

The invention discloses a method for calculating sentence similarity based on frame importance, comprising the following steps. Step 1: all frames in the English sentence S form a frame semantic information set E. Step 2: the core frame elements of each frame in the set E are extracted. Step 3: the importance of each frame is calculated from the number of core frame elements in each frame in the set E. Step 4: all frames in the English sentence S' form a frame semantic information set E', and the importance of each frame in the set E' is calculated. Step 5: each frame that appears in both the set E and the set E' is taken as a frame group; the smaller of the two frame importances in each frame group is selected as the importance of that frame group; the importances of all frame groups are accumulated, and the similarity of the English sentences S and S' is calculated from the accumulated value. The method can be applied to natural language processing tasks such as textual entailment recognition and text summarization.

Description

Method for calculating sentence similarity based on frame importance

Technical Field

The invention belongs to the technical field of natural language processing.

Background

The frame semantic library FrameNet is a semantic knowledge base built on frame semantics and is used in research fields such as linguistics, computational linguistics, and natural language processing. Frame semantics makes it possible to uncover the conceptual structures and semantic scenes that lie behind words.

A frame in FrameNet is the semantic structural form of a sentence expressing a specific scene; it is composed of lexical units (LUs) and the frame elements (FEs) associated with them. The participants, external conditions, and so on involved in a frame are referred to as frame elements. According to their importance, frame elements are divided into core frame elements (Core FEs) and non-core frame elements (Peripheral, Extra-Thematic). Core frame elements are the components that are conceptually necessary to a frame; their number and type differ from frame to frame and give each frame its distinctive character. Non-core frame elements express general semantic components such as time and place.

When a sentence contains multiple frames, the frames are not necessarily equally important. To measure the similarity between sentences accurately, the importance of the frames must be considered along with the frames themselves. Measuring the importance of the frames in a sentence is not easy, however, because the result varies with the importance criterion used. The choice of the frame importance criterion is therefore the key to measuring frame importance. Existing similarity calculation methods based on word-level features do not consider the structural information of sentences, and methods based on sentence structure features fail to consider sentence semantics fully. Conventional sentence similarity calculation methods focus mainly on sentence keywords and sentence structure; because they capture sentence semantics incompletely and are poorly interpretable, their similarity results are not accurate enough.

Disclosure of Invention

The purpose of the invention is as follows: the invention provides a method for calculating sentence similarity based on frame importance, which aims to solve the problems in the prior art.

The technical scheme is as follows: the invention provides a method for calculating sentence similarity based on frame importance, which comprises the following steps:

Step 1: extracting all frames in the English sentence S, and forming a frame semantic information set E from all the frames;

Step 2: constructing a visualization tool GIFN for the frame semantic library FrameNet, and extracting the core frame elements of each frame in the frame semantic information set E through GIFN;

Step 3: calculating a frame influence factor for each frame based on the number of core frame elements in the frame; establishing a frame importance function from the frame influence factors to obtain the importance w(f_{E,i}) of the i-th frame in the frame semantic information set E, where f_{E,i} denotes the i-th frame in the frame semantic information set E, i = 1, 2, ..., frame_S, and frame_S is the total number of frames in the frame semantic information set E;

Step 4: forming a frame semantic information set E' from all frames in the English sentence S' according to steps 1 to 3, and calculating the importance of each frame in the frame semantic information set E';

Step 5: taking each frame that appears in both E and E' as a frame group, to obtain frame_same frame groups; comparing the importance of the two frames in the j-th frame group and selecting the smaller frame importance as the frame importance min_j of the j-th frame group, j = 1, 2, ..., frame_same; and accumulating the frame importance of the frame_same frame groups, and calculating the similarity of the English sentences S and S' based on the accumulated value.
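The grouping and accumulation in step 5 can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function name `accumulate_shared_importance` and the importance values are hypothetical, and the final similarity formula is not applied because it is not reproduced in the text.

```python
def accumulate_shared_importance(importance_s, importance_s_prime):
    """Group frames that occur in both sentences and sum the smaller importance of each group.

    Both arguments map a frame name to its importance w(f) in one sentence.
    Returns (per-group minima, accumulated value).
    """
    shared_frames = sorted(set(importance_s) & set(importance_s_prime))
    group_minima = {
        frame: min(importance_s[frame], importance_s_prime[frame])
        for frame in shared_frames
    }
    return group_minima, sum(group_minima.values())


# Hypothetical importance values for two sentences (illustration only).
w_s = {"Statement": 0.5, "Possession": 0.3, "Compliance": 0.2}
w_s_prime = {"Statement": 0.4, "Possession": 0.6}

minima, accumulated = accumulate_shared_importance(w_s, w_s_prime)
print(minima)
print(accumulated)
```

Taking the minimum means a shared frame contributes only as much as it matters in the sentence where it matters less, so a frame that is central in S but marginal in S' adds little to the accumulated value.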

Further, in step 1, the English sentence S is input into the open-source semantic frame extraction tool SEMAFOR, and SEMAFOR analyzes the input English sentence S according to the structure of the frame semantic library FrameNet, thereby extracting the frames in the English sentence S.

Further, the specific method for constructing the FrameNet visualization tool GIFN in step 2 is as follows: all frames in FrameNet are taken as nodes, the semantic relations between frames and the semantic relations between lexical units and frames are taken as edges, and the nodes and edges are stored in the graph database Neo4j.

Further, the similarity of the English sentence S and the sentence S' is calculated by the following formula:

wherein Similarity_score is the similarity between the English sentence S and the sentence S'; frame_S' is the total number of frames in the frame semantic information set E'; and Maximum(·) takes the maximum value. The expression for Path_score is as follows:

wherein frame_rel is the number of shortest path frame pairs. The shortest path frame pairs are obtained as follows: the frames that also occur in the frame semantic information set E' are removed from the frame semantic information set E to obtain a set E1; the frames that also occur in the frame semantic information set E are removed from the frame semantic information set E' to obtain a set E'1; the number of edges required for each frame in the set E1 to reach any frame in the set E'1 is obtained through the visualization tool GIFN; and the two frames requiring the minimum number of edges are taken as a shortest path frame pair. The expression for Path_value_{i'} is as follows:

wherein CountPath is the number of edges required for one frame in the i'-th shortest path frame pair to reach the other frame, and weight_t is the weight of the t-th edge.

Further, the frame influence factor in step 3 is:

wherein c_i is the total number of core frame elements in f_{E,i}; n_i is the total number of frame elements in f_{E,i}; and β_i is the frame influence factor of f_{E,i}.

Further, the frame importance function in step 3 is:

wherein

is the exponential score of β_i.

Advantageous effects: the invention considers the importance of each frame as well as the frame itself, and can therefore measure the similarity between sentences more accurately. The method can be applied to natural language processing tasks such as textual entailment recognition and text summarization.

Drawings

FIG. 1 is a schematic flow chart of a method for calculating the importance of a frame according to the present invention;

FIG. 2 is a flow chart of extracting core frame elements according to a frame semantic library FrameNet;

FIG. 3 is a flow chart of computing the frame importance function;

FIG. 4 is a diagram of the semantic relationships between partial frames in the GIFN.

Detailed Description

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.

In the method for calculating sentence similarity based on frame importance, the core frame elements of each frame are extracted according to the frame semantic library FrameNet, and the importance of the frames is distinguished by the number of core frame elements each frame contains. The method is readily applied to natural language processing tasks such as textual entailment recognition and text summarization.

The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so as to fully understand how to implement the technical solution of the present invention and achieve the technical effects. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.

The FrameNet described in this embodiment is a semantic knowledge base built on frame semantics at the University of California, Berkeley, USA, and is used in research fields such as linguistics, computational linguistics, and natural language processing. Frame semantics makes it possible to uncover the conceptual structures and semantic scenes that lie behind words. In FrameNet, a frame is the semantic structural form of a sentence expressing a particular scene, made up of lexical units and their associated frame elements. The participants, external conditions, and so on involved in a frame are referred to as frame elements; in a real corpus they correspond to the words that describe an event or its circumstances in context. According to their importance, frame elements are divided into core frame elements and non-core frame elements. Core frame elements are the components that are conceptually necessary to a frame; their number and type differ from frame to frame and give each frame its distinctive character. Non-core frame elements express general semantic components such as time and place.

SEMAFOR is an open-source frame-semantic parser. It automatically analyzes English sentences according to the FrameNet structure and outputs the frames evoked by the sentence content, the frame elements, the text spans that fill the frame elements, and so on. In the key steps of this implementation, frame semantic information is obtained by the open-source tool SEMAFOR according to the semantic knowledge base FrameNet.

Neo4j is a high-performance NoSQL graph database that stores structured data in a graph (network) rather than in tables. It is an embedded, disk-based Java persistence engine with full transactional support. Neo4j scales to billions of nodes, relationships, and properties on a single machine and can be extended to multiple machines running in parallel. Compared with relational databases, graph databases are well suited to large volumes of complex, interconnected, loosely structured data that changes rapidly and is queried frequently; in a relational database, such queries require many table joins and therefore cause performance problems. Neo4j addresses the performance degradation that a traditional RDBMS suffers on join-heavy queries. Because the data is modeled as a graph, Neo4j traverses nodes and edges at a speed that does not depend on the total amount of data in the graph.

As shown in FIG. 1, the present embodiment provides a method for calculating sentence similarity based on frame importance, which includes:

Step one: identifying all frame semantic information in the English sentence S. The sentence S is analyzed with the open-source frame-semantic parsing tool SEMAFOR according to the FrameNet structure to obtain the frames evoked by the content of the sentence S, where each frame comprises lexical units and the frame elements connected to them; all the frames form a frame semantic information set E. The input to SEMAFOR is the English sentence S, and the output is the parsing result produced by SEMAFOR.

Using the development tool Eclipse, FrameNet is mapped into Neo4j to obtain the FrameNet visualization tool GIFN (Graphic Interpretation for FrameNet): all frames, frame elements, and lexical units in the frame semantic library FrameNet are taken as nodes (a frame is treated as a node because it is the semantic structural form of a sentence expressing a specific scene, composed of lexical units and the frame elements connected to them), and the relationships between frames and between lexical units and frames are taken as edges; the nodes and edges are stored in the graph database Neo4j.
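The node-and-edge layout described above can be sketched in Python with a small in-memory graph. This is an illustration only: it stands in for the Neo4j store rather than using it, the class `FrameGraph` and its methods are hypothetical names, and the example nodes and the `evokes` relation label are placeholders rather than FrameNet data.

```python
from collections import defaultdict


class FrameGraph:
    """In-memory stand-in for a graph store: typed nodes and labeled, directed edges."""

    def __init__(self):
        self.nodes = set()              # each node is a (kind, name) pair
        self.edges = defaultdict(list)  # node -> list of (relation, neighbor)

    def add_node(self, kind, name):
        node = (kind, name)
        self.nodes.add(node)
        return node

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[source].append((relation, target))

    def neighbors(self, node):
        return list(self.edges.get(node, []))


graph = FrameGraph()
# Placeholder nodes; kinds follow the text: frame, frame element, lexical unit.
frame_a = graph.add_node("frame", "Frame_A")
frame_b = graph.add_node("frame", "Frame_B")
unit = graph.add_node("lexical_unit", "word.v")

graph.add_edge(frame_a, "Inherits from", frame_b)  # frame-to-frame relation
graph.add_edge(unit, "evokes", frame_a)            # lexical-unit-to-frame relation (label assumed)

print(graph.neighbors(frame_a))
```

Keying nodes by a (kind, name) pair keeps a frame and a frame element with the same name, such as Possession, from colliding.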

Step two: extracting the frames of each line from the SEMAFOR parsing result. A FrameExtraction class is defined, with a searchFrame method that extracts the frames contained in each line of the parsing result; calling searchFrame outputs all the frames contained in each line.

A FrameFEExtraction class is defined, with a searchFE method that extracts the frame elements contained in each frame in the SEMAFOR parsing result; calling searchFE outputs all the frame elements contained under each frame.

Part of the key code for step two is as follows:

Part of the results obtained in step two is collated in Table 1:

TABLE 1

Frame	FEs (frame elements)
Statement	{Message, Speaker}
Sign_agreement	{Agreement, Signatory}
Ordinal_numbers	{Type}
Possession	{Possession}
Compliance	{Protagonist}
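The extraction in step two can be sketched in Python as follows. The sketch assumes the parser output is one JSON object per line with a `frames` list whose entries carry `target.name` and `annotationSets[].frameElements[].name`; these field names are an assumption about the parser output and are not specified in the text. The functions `search_frames` and `search_fes` only mirror the roles of the searchFrame and searchFE methods, and the sample line is hand-written for illustration.

```python
import json


def search_frames(parsed_line):
    """Return the names of all frames found in one line of parser output."""
    record = json.loads(parsed_line)
    return [frame["target"]["name"] for frame in record.get("frames", [])]


def search_fes(parsed_line):
    """Return a mapping from frame name to the set of frame element names under it."""
    record = json.loads(parsed_line)
    frame_elements = {}
    for frame in record.get("frames", []):
        names = frame_elements.setdefault(frame["target"]["name"], set())
        for annotation_set in frame.get("annotationSets", []):
            for element in annotation_set.get("frameElements", []):
                names.add(element["name"])
    return frame_elements


# Hand-written sample line in the assumed layout (not real parser output).
sample_line = json.dumps({
    "frames": [
        {"target": {"name": "Statement"},
         "annotationSets": [{"frameElements": [{"name": "Message"}, {"name": "Speaker"}]}]},
        {"target": {"name": "Possession"},
         "annotationSets": [{"frameElements": [{"name": "Possession"}]}]},
    ]
})

print(search_frames(sample_line))
print(search_fes(sample_line))
```

If a sentence evokes the same frame more than once, this sketch merges the frame elements of the repeated frame under one key; whether repeated frames should be kept separate is not stated in the text.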

Step three: the FrameNet visualization tool GIFN presents the frame semantic library FrameNet as a graph topology. In GIFN, frames and LUs (lexical units) contain an annoset, and FEs (frame elements) contain an FEcoreSet (core frame element set); the FEcoreSet holds the core frame elements of the frames in the set E. The specific flow is shown in FIG. 2. An FEcoreExtraction class is defined to extract the core frame elements contained in each frame in the SEMAFOR parsing result; within this class, the core frame elements of each frame are looked up through the FEcoreSet, and the output is the core frame elements contained in each frame.

Part of the results obtained in step three is summarized in Table 2:

TABLE 2

Frame	Core FEs (core frame elements)
Statement	{Message, Speaker}
Sign_agreement	{Agreement, Signatory}
Ordinal_numbers	{Type}
Possession	{Possession}
Compliance	{Protagonist}
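The core frame element lookup in step three can be sketched in Python as a set intersection. The dictionary `fe_core_set` below is a hypothetical stand-in for the FEcoreSet lookup in GIFN, `extract_core_fes` is an illustrative name, and the `Time` element is added to the parsed input only to show a non-core element being filtered out.

```python
def extract_core_fes(frame_fes, fe_core_set):
    """Keep, for each frame, only the frame elements listed as core for that frame.

    frame_fes:   frame name -> set of frame elements found in the sentence
    fe_core_set: frame name -> set of core frame elements (stand-in for the GIFN lookup)
    """
    return {
        frame: elements & fe_core_set.get(frame, set())
        for frame, elements in frame_fes.items()
    }


# Stand-in core sets, using the frames listed in Table 2.
fe_core_set = {
    "Statement": {"Message", "Speaker"},
    "Possession": {"Possession"},
}

# Parsed frame elements; "Time" is an illustrative non-core element.
frame_fes = {
    "Statement": {"Message", "Speaker", "Time"},
    "Possession": {"Possession"},
}

core = extract_core_fes(frame_fes, fe_core_set)
print({frame: sorted(elements) for frame, elements in core.items()})
```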

Step four: calculating the frame influence factor of each frame based on the number of core frame elements in the frame. The proportion (treated as a probability) of the number of core frame elements covered by each frame among the number of frame elements covered by the frames in the whole sentence S is calculated and defined as the frame influence factor in the frame semantic information set E. The calculation formula is as follows:

wherein c_i is the total number of core frame elements in f_{E,i}; n_i is the total number of frame elements in f_{E,i}; f_{E,i} denotes the i-th frame in the set E, i = 1, 2, ..., frame_S; and frame_S is the total number of frames covered by the sentence S. The more core frame elements a frame covers, the higher its importance and the larger its influence factor.

When the number of core frame elements covered by the two frames is the same, the semantic importance of the two frames is considered to be the same, and the influence factor values are the same.

In this embodiment, an FEsCalculation class is defined to calculate the total number of frame elements contained in the English sentence S: FrameNum is defined to calculate the number of frames covered by each sentence, FEsNum is defined to calculate the number of frame elements covered by each frame, and FEsNull is defined to accumulate the numbers of frame elements; the output is the total number of frame elements contained in the whole sentence. A CoreFEsCalculation class is defined to calculate the proportion of the number of core frame elements covered by each frame among the number of frame elements covered by the frames in the whole sentence S: a CoreFEsNum method is defined to calculate the number of frame elements covered by each frame, and a CoreFEsPeer method is defined to calculate the proportion using formula (1).
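The calculation in step four can be sketched in Python under one stated assumption. Formula (1) is not reproduced in the text, so the sketch takes the factor to be the number of core frame elements of a frame divided by the total number of frame elements over all frames of the sentence, which matches the verbal description and the remark that two frames with the same number of core frame elements receive the same factor. The function name `influence_factors` is hypothetical, and the printed values follow from this assumption; they are not the entries of Table 3.

```python
from fractions import Fraction


def influence_factors(core_counts, fe_counts):
    """Assumed form of the factor: beta_i = c_i / (sum of n_k over all frames in the sentence)."""
    total_fes = sum(fe_counts.values())
    if total_fes == 0:
        raise ValueError("the sentence has no frame elements")
    return {frame: Fraction(count, total_fes) for frame, count in core_counts.items()}


# Counts read off Tables 1 and 2 (each listed frame element is also a core frame element).
fe_counts = {"Statement": 2, "Sign_agreement": 2, "Ordinal_numbers": 1,
             "Possession": 1, "Compliance": 1}
core_counts = dict(fe_counts)

betas = influence_factors(core_counts, fe_counts)
for frame, beta in betas.items():
    print(frame, beta)
```

Exact fractions are used so that frames with equal core counts compare as exactly equal.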

The partial calculation results obtained in step four are shown in table 3:

TABLE 3

Step five: constructing a frame influence factor matrix as follows:

M = (β_i)_{frame_S × 1}

Step six: measuring the importance of the frames according to the influence factors of the frames in the set E, defining an importance function for each frame in the sentence S, and calculating the importance of each frame in the frame information set E. A weight is assigned according to the number of core frame elements covered by each frame, and the importance of each frame of the sentence S to the sentence is calculated, specifically:

The importance of each frame in the frame information set E is initialized. The initialization formula for the importance of each frame in the frame information set E is:

The importance of the frames in the English sentence S is then normalized. The normalized importance of each frame of the English sentence S with respect to the sentence is calculated as follows:

wherein

is the exponential score of each element of the frame influence factor matrix, that is, of β_i, and 0 < w(f_{E,i}) ≤ 1.

According to one embodiment of the invention, a FrameWeight class is defined to calculate the frame importance: a CoreFEsPerall method is defined to accumulate the frame influence factors, and a FrameWeight method is defined to calculate the frame importance using formula (2). The output is the importance of each frame of a sentence with respect to that sentence. A flow chart for defining the frame importance function is shown in FIG. 3.
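The importance calculation in step six can be sketched in Python under a stated assumption. Formula (2) is not reproduced in the text; the description (an exponential score of β_i, followed by normalization, with the result in the range 0 < w ≤ 1) is consistent with a softmax over the influence factors, and that form is assumed here. The function name `frame_importance` and the input values are hypothetical.

```python
import math


def frame_importance(betas):
    """Assumed softmax form: w_i = exp(beta_i) / sum over k of exp(beta_k)."""
    if not betas:
        raise ValueError("at least one frame is required")
    scores = {frame: math.exp(beta) for frame, beta in betas.items()}  # exponential scores
    total = sum(scores.values())                                       # accumulated scores
    return {frame: score / total for frame, score in scores.items()}   # normalized importance


# Hypothetical influence factors for three frames.
betas = {"Frame_A": 0.4, "Frame_B": 0.4, "Frame_C": 0.2}
weights = frame_importance(betas)
for frame, weight in weights.items():
    print(frame, round(weight, 4))
```

Under this assumption the importances of one sentence sum to 1, frames with equal influence factors receive equal importance, and a sentence with a single frame gives that frame importance 1.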

Step seven: forming a frame semantic information set E' from all frames in the English sentence S' according to steps one to six, and calculating the importance of each frame in the frame semantic information set E';

Step eight: taking each frame that appears in both the frame semantic information set E and the frame semantic information set E' as a frame group, to obtain frame_same frame groups; comparing the importance of the two frames in the j-th frame group and selecting the smaller frame importance as the frame importance min_j of the j-th frame group, j = 1, 2, ..., frame_same; and accumulating the frame importance of the frame_same frame groups and calculating the similarity of the English sentences S and S' based on the following formula:

wherein Similarity_score is the similarity between the English sentence S and the sentence S'; frame_S' is the total number of frames in the frame semantic information set E'; and Maximum(·) takes the maximum value. The calculation formula of Path_score is as follows:

wherein frame_rel is the number of shortest path frame pairs. The shortest path frame pairs are obtained as follows: the frames that also occur in the frame semantic information set E' are removed from the frame semantic information set E to obtain a set E1; the frames that also occur in the frame semantic information set E are removed from the frame semantic information set E' to obtain a set E'1; the number of edges required for each frame in the set E1 to reach any frame in the set E'1 is obtained through the visualization tool GIFN, where the semantic relations among some of the frames are shown in FIG. 4; and the two frames requiring the minimum number of edges are taken as a shortest path frame pair. The expression for Path_value_{i'} is as follows:

wherein CountPath is the number of edges required for one frame in the i'-th shortest path frame pair to reach the other frame, and weight_t is the weight of the t-th edge. The weight of each path in FIG. 4 is shown in Table 4:

TABLE 4

Inter-frame semantic relation (a path in GIFN, i.e. an edge)	Path weight
Inherits from	0.55
Is Inherited by	0.55
Perspective on	0.45
Is Perspectivized in	0.45
Uses	0.3
Is Used by	0.3
Subframe of	0.35
Has Subframe(s)	0.35
Precedes	0.2
Is Preceded by	0.2
Is Inchoative of	0.3
Is Causative of	0.3
See also	0.4
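The search for shortest path frame pairs can be sketched in Python with a breadth-first search over a small relation graph. Several points are assumptions: the graph below uses placeholder frames rather than FrameNet data, only a few rows of Table 4 are copied into `PATH_WEIGHTS`, and one pair is formed per frame of E1 by taking its nearest reachable frame in E'1. The sketch stops at CountPath and the list of edge weights, because the expression that combines them into Path_value is not reproduced in the text.

```python
from collections import deque

# A few rows of Table 4 (relation name -> path weight).
PATH_WEIGHTS = {"Inherits from": 0.55, "Is Inherited by": 0.55,
                "Precedes": 0.2, "Is Preceded by": 0.2}


def shortest_relation_path(edges, start, goals):
    """Breadth-first search; returns (goal, relations along the path) or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        frame, relations = queue.popleft()
        if frame in goals:
            return frame, relations
        for relation, neighbor in edges.get(frame, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, relations + [relation]))
    return None


def shortest_path_frame_pairs(edges, frames_s, frames_s_prime):
    """Drop the shared frames, then pair each remaining frame of S with its nearest frame of S'."""
    e1 = set(frames_s) - set(frames_s_prime)
    e1_prime = set(frames_s_prime) - set(frames_s)
    pairs = []
    for frame in sorted(e1):
        found = shortest_relation_path(edges, frame, e1_prime)
        if found is not None:
            goal, relations = found
            pairs.append({"pair": (frame, goal),
                          "count_path": len(relations),
                          "weights": [PATH_WEIGHTS[r] for r in relations]})
    return pairs


# Placeholder graph: Frame_A inherits from Frame_B, and Frame_B precedes Frame_C.
edges = {
    "Frame_A": [("Inherits from", "Frame_B")],
    "Frame_B": [("Is Inherited by", "Frame_A"), ("Precedes", "Frame_C")],
    "Frame_C": [("Is Preceded by", "Frame_B")],
}

pairs = shortest_path_frame_pairs(edges, ["Frame_A", "Frame_X"], ["Frame_C", "Frame_X"])
print(pairs)
```

Breadth-first search visits frames in order of edge count, so the first frame of E'1 it reaches is one that requires the minimum number of edges.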

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. To avoid unnecessary repetition, the possible combinations are not described separately.

Claims

1. A method for calculating sentence similarity based on frame importance is characterized by comprising the following steps:

Step 5: taking each frame that appears in both E and E' as a frame group, to obtain frame_same frame groups; comparing the importance of the two frames in the j-th frame group and selecting the smaller frame importance as the frame importance min_j of the j-th frame group, j = 1, 2, ..., frame_same; and accumulating the frame importance of the frame_same frame groups, and calculating the similarity of the English sentences S and S' based on the accumulated value.

2. The method for calculating sentence similarity based on frame importance according to claim 1, wherein in step 1 the English sentence S is input into the open-source semantic frame extraction tool SEMAFOR, and SEMAFOR analyzes the input English sentence S according to the structure of the frame semantic library FrameNet, thereby extracting the frames in the English sentence S.

3. The method for calculating sentence similarity based on frame importance according to claim 1, wherein the specific method for constructing the FrameNet visualization tool GIFN in step 2 is as follows: all frames in FrameNet are taken as nodes, the semantic relations between frames and the semantic relations between lexical units and frames are taken as edges, and the nodes and edges are stored in the graph database Neo4j.

4. The method for calculating sentence similarity based on frame importance according to claim 3, wherein the similarity of the English sentence S and the sentence S' is calculated by the following formula:

wherein frame_rel is the number of shortest path frame pairs, and the shortest path frame pairs are obtained as follows: the frames that also occur in the frame semantic information set E' are removed from the frame semantic information set E to obtain a set E1; the frames that also occur in the frame semantic information set E are removed from the frame semantic information set E' to obtain a set E'1; the number of edges required for each frame in the set E1 to reach any frame in the set E'1 is obtained through the visualization tool GIFN; and the two frames requiring the minimum number of edges are taken as a shortest path frame pair; the expression for Path_value_{i'} is as follows:

5. The method for calculating sentence similarity based on frame importance according to claim 1, wherein the frame influence factor in step 3 is:

6. The method for calculating sentence similarity based on frame importance according to claim 5, wherein the frame importance function in step 3 is:

wherein

is the exponential score of β_i.