CN113536761B - Method for calculating sentence similarity based on frame importance

Info

Publication number: CN113536761B
Application number: CN202110776700.XA
Authority: CN
Inventors: 王铁鑫; 史荟; 刘文静; 严欣华
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-07-09
Filing date: 2021-07-09
Publication date: 2024-01-30
Anticipated expiration: 2041-07-09
Also published as: CN113536761A

Abstract

The invention discloses a method for calculating sentence similarity based on frame importance, which comprises the following steps. Step 1: all frames in an English sentence S form a frame semantic information set E. Step 2: the core frame elements of each frame in the set E are extracted. Step 3: the importance of each frame is calculated according to the number of core frame elements in each frame in the set E. Step 4: all frames in an English sentence S' form a frame semantic information set E', and the importance of each frame in the set E' is calculated. Step 5: each frame that appears in both the set E and the set E' forms a frame group; the smaller of the two frame importances in each frame group is taken as the importance of that frame group; the importances of all frame groups are accumulated, and the similarity of the English sentences S and S' is calculated from the accumulated value. The method can be applied to natural language processing tasks such as textual entailment recognition and text summarization.

Description

Method for calculating sentence similarity based on frame importance

Technical Field

The invention belongs to the technical field of natural language processing.

Background

The frame semantic library FrameNet is a semantic knowledge base built on frame semantics and is used for research in linguistics, computational linguistics, natural language processing and related fields. Frame semantics makes it possible to mine the conceptual structures and semantic scenes hidden behind words.

A frame in FrameNet is a semantic structure of a sentence that expresses a specific scene; it consists of lexical units (LUs) and their associated frame elements (FEs). The participants, external conditions and so on involved in a frame are referred to as frame elements. Frame elements are divided by importance into core frame elements (Core FEs) and non-core frame elements (Peripheral, Extra-thematic). Core frame elements are the components that are essential to understanding a frame conceptually; their number and type differ from frame to frame, which gives each frame its individuality. Non-core frame elements express general semantic components such as time and place.

When a sentence contains multiple frames, the frames are not necessarily equally important. To measure the similarity between sentences accurately, the importance of the frames must be considered together with the frames themselves. Measuring the importance of the frames in a sentence is not easy, however, because the result varies with the importance measure that is used. Choosing the importance measure is therefore the key to measuring frame importance. Existing similarity calculation methods based on word-level features do not consider the structural information of sentences, and similarity calculation methods based on sentence structure features fail to consider sentence semantics comprehensively. Traditional sentence similarity calculation methods focus mainly on sentence keywords and structure, so their results are inaccurate because semantics are considered incompletely and the results lack interpretability.

Disclosure of Invention

Object of the invention: to solve the problems existing in the prior art, the invention provides a method for calculating sentence similarity based on frame importance.

Technical solution: the invention provides a method for calculating sentence similarity based on frame importance, which comprises the following steps:

Step 1: extracting all frames in the English sentence S, and forming a frame semantic information set E from all the frames;

Step 2: constructing a visualization tool GIFN for the frame semantic library FrameNet, and extracting the core frame elements of each frame in the frame semantic information set E through GIFN;

Step 3: calculating a frame influence factor for each frame based on the number of core frame elements in each frame; establishing a frame importance function from the frame influence factors to obtain the importance $w(f_{E,i})$ of the i-th frame in the frame semantic information set E, where $f_{E,i}$ denotes the i-th frame in E, i = 1, 2, ..., frame_S, and frame_S is the total number of frames in E;

Step 4: forming a frame semantic information set E' from all frames in the English sentence S' according to steps 1 to 3, and calculating the importance of each frame in the frame semantic information set E';

Step 5: taking each frame that appears in both E and E' as a frame group, thereby obtaining frame_same frame groups; comparing the importance of the two frames in the j-th frame group and selecting the smaller importance as the frame importance $min_j$ of the j-th frame group, j = 1, 2, ..., frame_same; accumulating the frame importances of the frame_same frame groups, and calculating the similarity of the English sentences S and S' from the accumulated value.
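The accumulation in step 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `accumulated_shared_importance` and the dictionary representation (frame name mapped to importance, one entry per distinct frame) are assumptions made for the example. It stops at the accumulated value, because the final similarity formula is not reproduced in this text.

```python
def accumulated_shared_importance(importance_s, importance_s_prime):
    """Sum the smaller importance of every frame shared by E and E'.

    Both arguments map a frame name to its importance w(f) in one sentence
    (hypothetical representation; assumes each frame occurs once per sentence).
    Returns (accumulated value, frame_same).
    """
    shared_frames = importance_s.keys() & importance_s_prime.keys()
    accumulated = sum(
        min(importance_s[frame], importance_s_prime[frame])
        for frame in shared_frames
    )
    return accumulated, len(shared_frames)


# Illustrative importances only; these numbers do not come from the patent.
w_s = {"Statement": 0.5, "Possession": 0.3, "Compliance": 0.2}
w_s_prime = {"Statement": 0.4, "Possession": 0.6}
total, frame_same = accumulated_shared_importance(w_s, w_s_prime)
```

Taking the minimum means a shared frame contributes only as much as it matters in the sentence where it matters least.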

Further, in step 1, the English sentence S is input into the open-source semantic frame extraction tool SEMAFOR, and SEMAFOR parses the input sentence according to the structure of the frame semantic library FrameNet, thereby extracting the frames in the English sentence S.

Further, the visualization tool GIFN for the frame semantic library FrameNet in step 2 is constructed as follows: all frames in FrameNet are used as nodes, the semantic relations among frames and the semantic relations between lexical units and frames are used as edges, and the nodes and edges are stored in the graph database Neo4j.

Further, the similarity between the English sentence S and the sentence S' is calculated by the following formula:

where Similarity_score is the similarity between the English sentence S and the sentence S'; frame_S' is the total number of frames in the frame semantic information set E'; max() takes the maximum value; and Path_score is expressed as follows:

where frame_rel is the number of shortest path frame pairs. A shortest path frame pair is obtained as follows: the frames that also appear in the frame semantic information set E' are removed from the frame semantic information set E to obtain a set E1; the frames that also appear in the frame semantic information set E are removed from the frame semantic information set E' to obtain a set E'1; the number of edges required for each frame in the set E1 to reach any frame in the set E'1 is obtained through the visualization tool GIFN; and the two frames requiring the minimum number of edges are taken as a shortest path frame pair. The expression for $path\_value_{i'}$ is as follows:

where CountPath is the number of edges required for one frame to reach the other frame in the i'-th shortest path frame pair, and $weight_t$ is the weight of the t-th edge.

Further, in step 3, the frame influence factor is:

where $c_i$ is the total number of core frame elements in $f_{E,i}$; $n_i$ is the total number of frame elements in $f_{E,i}$; and $\beta_i$ is the frame influence factor of $f_{E,i}$.

Further, in step 3, the frame importance function is:

where $e^{\beta_i}$ is the exponential of $\beta_i$.

Beneficial effects: the invention considers the importance of the frames together with the frames themselves, and can therefore measure the similarity between sentences more accurately. The method can be applied to natural language processing tasks such as textual entailment recognition and text summarization.

Drawings

FIG. 1 is a schematic flow chart of a frame importance calculating method according to the present invention;

FIG. 2 is a flow diagram of extracting core frame elements from a frame semantic library FrameNet;

FIG. 3 is a flow chart for calculating a frame importance function;

FIG. 4 is a diagram of the semantic relationships among some of the frames in GIFN.

Detailed Description

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

In the method for calculating sentence similarity based on frame importance, the core frame elements of each frame are extracted according to the frame semantic library FrameNet, and the importance of the frames is distinguished by the number of core frame elements each frame contains. The method is readily applied to natural language processing tasks such as textual entailment recognition and text summarization.

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, provided no conflict arises, the embodiments of the invention and the features of each embodiment may be combined with one another, and the resulting technical solutions all fall within the protection scope of the invention.

The FrameNet described in this embodiment is a semantic knowledge base based on frame semantics, constructed at the University of California, Berkeley, and used for research in linguistics, computational linguistics, natural language processing and related fields. Frame semantics makes it possible to mine the conceptual structures and semantic scenes hidden behind words. In FrameNet, a frame is a semantic structure of a sentence that expresses a particular scene; it consists of lexical units and the frame elements associated with them. The participants, external conditions and so on involved in a frame are referred to as frame elements; in a real corpus they correspond to the words in the context that describe the event or the form of the event. Frame elements are divided by importance into core frame elements and non-core frame elements. Core frame elements are the components that are essential to understanding a frame conceptually; their number and type differ from frame to frame, which gives each frame its individuality. Non-core frame elements express general semantic components such as time and place.

SEMAFOR is an open-source frame-semantic parser. It automatically analyzes English sentences according to the FrameNet structure to obtain the frames evoked by the sentence content, the frame elements, the specific content each frame element refers to, and so on. In the key steps of this implementation, the frame semantic information is obtained with the open-source tool SEMAFOR according to the semantic knowledge base FrameNet.

Neo4j is a high-performance NoSQL graph database that stores structured data as a graph (network) rather than in tables. It is an embedded, disk-based Java persistence engine with full transactional properties. Neo4j scales to large graphs: it can handle billions of nodes, relationships and properties on one machine, and can be extended to multiple machines running in parallel. Graph databases are well suited to large volumes of complex, interconnected, loosely structured data that change rapidly and are queried frequently; in a relational database such queries require a large number of table joins and therefore cause performance problems. Neo4j focuses on the performance degradation that a conventional RDBMS suffers on join-heavy queries. Because the data are modeled as a graph, Neo4j traverses nodes and edges at a constant speed that does not depend on the amount of data making up the graph.

As shown in fig. 1, the present embodiment provides a method for calculating sentence similarity based on frame importance, which includes:

Step one: all frame semantic information is identified from the English sentence S. The open-source frame-semantic parsing tool SEMAFOR analyzes the sentence S according to the FrameNet structure to obtain the frames evoked by the content of the sentence S, where each frame comprises a lexical unit and the frame elements associated with it; all the frames form a frame semantic information set E. The input to SEMAFOR is the English sentence S, and the output is the result of the SEMAFOR analysis.

Using the development tool Eclipse, FrameNet is mapped into Neo4j to build the FrameNet visualization tool GIFN (Graphical Interpretation for FrameNet): all frames, frame elements and lexical units in the frame semantic library FrameNet are taken as nodes (frames are taken as nodes because a frame is the semantic structure of a sentence that expresses a specific scene and is formed by lexical units and their associated frame elements), and the relations among frames and the relations between lexical units and frames are taken as edges and stored in the graph database Neo4j.
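The node-and-edge model described above can be illustrated with a small in-memory graph. This is only a sketch under stated assumptions: the source stores the graph in Neo4j, whereas the hypothetical `FrameGraph` class below uses plain Python dictionaries, and the node names, the `evokes` relation label and the toy data are chosen for illustration rather than taken from GIFN.

```python
from collections import defaultdict


class FrameGraph:
    """In-memory stand-in for a GIFN-style graph (hypothetical helper class)."""

    def __init__(self):
        self.nodes = {}                 # node name -> kind ("frame", "lexical_unit", ...)
        self.edges = defaultdict(list)  # source name -> [(relation, target name)]

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, source, relation, target):
        # Edges are directed and labeled with the semantic relation.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[source].append((relation, target))

    def neighbors(self, name):
        return [target for _, target in self.edges[name]]


def build_toy_graph():
    """Build a three-node example; the data are illustrative only."""
    graph = FrameGraph()
    graph.add_node("Statement", "frame")
    graph.add_node("Communication", "frame")
    graph.add_node("say.v", "lexical_unit")
    graph.add_edge("Statement", "Inherits from", "Communication")
    graph.add_edge("Communication", "Is Inherited by", "Statement")
    graph.add_edge("say.v", "evokes", "Statement")  # "evokes" is an assumed label
    return graph
```

Storing each relation in both directions (for example "Inherits from" and "Is Inherited by") lets a path search move between frames regardless of which side of the relation it starts from.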

Step two: the frames in each row of the result parsed by SEMAFOR are extracted. A frame extraction class is defined for this purpose, and its searchFrame method extracts the frames in each row; calling the searchFrame method outputs all the frames contained in each row.

A frame element extraction class is defined to extract the frame elements contained in each frame in the result parsed by SEMAFOR; calling its searchFE method outputs all the frame elements contained in each frame.

Part of the key code for step two is as follows:

Part of the results obtained in step two is shown in Table 1:

TABLE 1

Frame	FEs (frame elements)
Statement	{Message, Speaker}
Sign_agreement	{Agreement, Signatory}
Ordinal_numbers	{Type}
Possession	{Possession}
Compliance	{Protagonist}
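A mapping like Table 1 could be collected from one line of parser output roughly as sketched below. The JSON layout used here (a `frames` list whose entries carry `target.name` and `annotationSets[].frameElements[].name`) is an assumption about the parser's output format and should be checked against the SEMAFOR version in use; the function name `frames_with_elements` is hypothetical and is not the searchFrame or searchFE method of the patent.

```python
import json


def frames_with_elements(parse_line):
    """Map each frame name in one parsed sentence to the set of its frame element names.

    Assumes one JSON object per sentence with the layout described in the
    lead-in; missing keys are treated as empty lists.
    """
    parsed = json.loads(parse_line)
    result = {}
    for frame in parsed.get("frames", []):
        frame_name = frame["target"]["name"]
        elements = result.setdefault(frame_name, set())
        for annotation in frame.get("annotationSets", []):
            for element in annotation.get("frameElements", []):
                elements.add(element["name"])
    return result


# Hand-written sample in the assumed layout, mirroring two rows of Table 1.
sample_line = json.dumps({
    "frames": [
        {"target": {"name": "Statement"},
         "annotationSets": [{"frameElements": [{"name": "Message"},
                                               {"name": "Speaker"}]}]},
        {"target": {"name": "Possession"},
         "annotationSets": [{"frameElements": [{"name": "Possession"}]}]},
    ]
})
table_rows = frames_with_elements(sample_line)
```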

Step three: the FrameNet visualization tool GIFN displays the frame semantic library FrameNet in graphical form. In GIFN, frames and lexical units (LUs) contain an anonset, and frame elements (FEs) contain a FEcoreSet (core frame element set), where the FEcoreSet represents the core frame elements of the frames in the set E. The specific flow is shown in FIG. 2. A FEcoreextraction class is defined to extract the core frame elements contained in each frame in the result parsed by SEMAFOR; the core frame elements of each frame are looked up through the FEcoreSet in the FEcoreextraction class, and the output is the core frame elements contained in each frame.

The partial result arrangement obtained in the third step is shown in table 2:

TABLE 2

Frame	CoreFEs (core frame elements)
Statement	{Message, Speaker}
Sign_agreement	{Agreement, Signatory}
Ordinal_numbers	{Type}
Possession	{Possession}
Compliance	{Protagonist}
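Conceptually, going from Table 1 to Table 2 is a set intersection between the frame elements found in the sentence and each frame's core set. The sketch below assumes the core sets are already available as a dictionary; `core_elements` and the sample data (including the non-core element `Time`) are hypothetical and stand in for the FEcoreSet lookup in GIFN.

```python
def core_elements(frame_elements, core_sets):
    """Keep only the core frame elements of each frame.

    frame_elements: frame name -> set of frame elements found in the sentence.
    core_sets: frame name -> set of core frame elements (assumed lookup table).
    Frames missing from core_sets get an empty result.
    """
    return {
        frame: elements & core_sets.get(frame, set())
        for frame, elements in frame_elements.items()
    }


# Illustrative data: "Time" stands for a non-core element and is filtered out.
found = {"Statement": {"Message", "Speaker", "Time"}, "Ordinal_numbers": {"Type"}}
core_lookup = {"Statement": {"Message", "Speaker"}, "Ordinal_numbers": {"Type"}}
core_found = core_elements(found, core_lookup)
```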

Step four: the frame influence factor of each frame is calculated based on the number of core frame elements in each frame. The proportion of the number of core frame elements covered by each frame to the number of frame elements covered by the frames in the whole sentence S is calculated; this proportion is defined as the frame influence factor in the frame semantic information set E, and is calculated as follows:

where $c_i$ is the total number of core frame elements in $f_{E,i}$; $n_i$ is the total number of frame elements in $f_{E,i}$; $f_{E,i}$ is the i-th frame in the set E, i = 1, 2, ..., frame_S; and frame_S is the total number of frames covered in the sentence S. The more core frame elements a frame has, the more important it is and the larger its influence factor.

When two frames cover the same number of core frame elements, their semantic importance is considered to be the same, and their influence factors are equal.

In this embodiment, an FEs calculation class is defined to calculate the total number of frame elements contained in the English sentence S: the frame num method calculates the number of frames contained in each sentence, the FEsNum method calculates the number of frame elements contained in each frame and accumulates these numbers, and the output is the total number of frame elements contained in the whole sentence. A CoreFEs calculation class is defined to calculate the proportion of the number of core frame elements covered by each frame to the number of frame elements covered by the frames in the whole sentence S: the CoreFEs num method calculates the number of core frame elements covered by each frame, and the CoreFEs Per method calculates the proportion using formula (1).
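The influence factor calculation can be sketched as below. Since formula (1) is not reproduced in this text, the sketch rests on an explicit assumption: the denominator is the total number of frame elements over all frames of the sentence, which matches the statement that frames with the same number of core frame elements receive the same influence factor. A per-frame denominator is another possible reading of the definition of $n_i$. The function name `influence_factors` is hypothetical.

```python
def influence_factors(core_counts, element_counts):
    """Compute a frame influence factor beta_i for every frame of one sentence.

    core_counts: frame name -> number of core frame elements (c_i).
    element_counts: frame name -> number of frame elements in that frame.
    ASSUMPTION: beta_i = c_i / (sum of frame element counts over the sentence).
    """
    total_elements = sum(element_counts.values())
    if total_elements == 0:
        raise ValueError("the sentence has no frame elements")
    return {frame: count / total_elements for frame, count in core_counts.items()}


# Counts read off Tables 1 and 2 (partial results, used here as toy input).
core = {"Statement": 2, "Sign_agreement": 2, "Ordinal_numbers": 1,
        "Possession": 1, "Compliance": 1}
elements = dict(core)  # in these rows every listed frame element is a core element
beta = influence_factors(core, elements)
```

With these counts the total is 7, so frames with two core frame elements receive 2/7 and frames with one receive 1/7.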

The partial calculation results obtained in the fourth step are shown in table 3:

TABLE 3

Step five: a frame influence factor matrix is constructed as follows:

$M = (\beta_i)_{frame\_S \times 1}$

Step six: the importance of the frames is measured according to the frame influence factors in the set E; an importance function is defined for each frame in the sentence S, and the importance of each frame in the frame information set E is calculated. A weight is assigned according to the number of core frame elements covered by each frame, and the importance of each frame of the sentence S for the sentence is calculated as follows.

The importance of each frame in the frame information set E is initialized. The initialization formula for the importance of each frame in the frame information set E is:

The importance of the frames in the English sentence S is then normalized. The importance of each frame in the normalized English sentence S for the sentence is calculated as follows:

where $e^{\beta_i}$ is the exponential of the element $\beta_i$ of the frame influence factor matrix, and $0 < w(f_{E,i}) \le 1$.

In one embodiment of the invention, a frame weight class is defined to calculate the importance of the frames: the CoreFEs Perall method accumulates the frame influence factors, and the frame weight method calculates the importance of the frames using formula (2); the output is the importance of each frame for its sentence. The flow for defining the frame importance function is shown in FIG. 3.
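One way to read the initialization and normalization described above is a softmax over the influence factors: each frame starts from $e^{\beta_i}$ and is divided by the sum over all frames of the sentence. This form is an assumption, since formula (2) is not reproduced in this text; it is consistent with the exponential of $\beta_i$ and with importances in the range (0, 1]. The function name `frame_importance` is hypothetical.

```python
import math


def frame_importance(betas):
    """Normalize frame influence factors into importances w(f).

    betas: frame name -> frame influence factor beta_i.
    ASSUMPTION: w(f_i) = exp(beta_i) / sum_k exp(beta_k), a softmax form.
    """
    if not betas:
        return {}
    exponentials = {frame: math.exp(beta) for frame, beta in betas.items()}
    normalizer = sum(exponentials.values())
    return {frame: value / normalizer for frame, value in exponentials.items()}


# Influence factors from the earlier toy counts (2/7 and 1/7), for illustration.
weights = frame_importance({"Statement": 2 / 7, "Sign_agreement": 2 / 7,
                            "Possession": 1 / 7})
```

Under this assumption the importances of one sentence sum to 1, equal influence factors give equal importances, and a larger influence factor always gives a larger importance.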

Step seven: all frames in the English sentence S' are formed into a frame semantic information set E' according to steps one to six, and the importance of each frame in the frame semantic information set E' is calculated.

Step eight: each frame that appears in both the frame semantic information set E and the frame semantic information set E' is taken as a frame group, which yields frame_same frame groups. The importance of the two frames in the j-th frame group is compared, and the smaller importance is selected as the frame importance $min_j$ of the j-th frame group, j = 1, 2, ..., frame_same. The frame importances of the frame_same frame groups are accumulated, and the similarity of the English sentences S and S' is calculated by the following formula:

where Similarity_score is the similarity between the English sentence S and the sentence S'; frame_S' is the total number of frames in the frame semantic information set E'; max() takes the maximum value; and Path_score is calculated as follows:

where frame_rel is the number of shortest path frame pairs. A shortest path frame pair is obtained as follows: the frames that also appear in the frame semantic information set E' are removed from the frame semantic information set E to obtain a set E1; the frames that also appear in the frame semantic information set E are removed from the frame semantic information set E' to obtain a set E'1; the number of edges required for each frame in the set E1 to reach any frame in the set E'1 is obtained through the visualization tool GIFN, in which the semantic relationships among some of the frames are shown in FIG. 4; and the two frames requiring the minimum number of edges are taken as a shortest path frame pair. The expression for $path\_value_{i'}$ is as follows:

where CountPath is the number of edges required for one frame to reach the other frame in the i'-th shortest path frame pair, and $weight_t$ is the weight of the t-th edge. The weight of each path in FIG. 4 is shown in Table 4:

TABLE 4

Inter-frame semantic relationship (represented as an edge in GIFN)	Path weight
Inherits from	0.55
Is Inherited by	0.55
Perspective on	0.45
Is Perspectivized in	0.45
Uses	0.3
Is Used by	0.3
Subframe of	0.35
Has Subframe(s)	0.35
Precedes	0.2
Is Preceded by	0.2
Is Inchoative of	0.3
Is Causative of	0.3
See also	0.4
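Finding a shortest path frame pair and the edge weights along it can be sketched with a breadth-first search. The sketch makes several assumptions: the graph is a plain adjacency dictionary rather than the Neo4j store, the frame names are placeholders, and only the single pair with the fewest edges is returned, because the text does not spell out how several pairs are collected. It stops at CountPath and the weights, since the expression that combines them into a path value is not reproduced in this text.

```python
from collections import deque

# Edge weights from Table 4, keyed by relation name.
RELATION_WEIGHTS = {
    "Inherits from": 0.55, "Is Inherited by": 0.55,
    "Perspective on": 0.45, "Is Perspectivized in": 0.45,
    "Uses": 0.3, "Is Used by": 0.3,
    "Subframe of": 0.35, "Has Subframe(s)": 0.35,
    "Precedes": 0.2, "Is Preceded by": 0.2,
    "Is Inchoative of": 0.3, "Is Causative of": 0.3,
    "See also": 0.4,
}


def shortest_path_weights(edges, start, goal):
    """Breadth-first search; return the edge weights along one fewest-edge path.

    edges: frame name -> list of (relation, neighbor frame) pairs.
    Returns None when goal cannot be reached. len(result) is CountPath.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, weights = queue.popleft()
        if node == goal:
            return weights
        for relation, neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, weights + [RELATION_WEIGHTS[relation]]))
    return None


def shortest_path_frame_pair(edges, e1, e1_prime):
    """Return (frame in E1, frame in E'1, weights) for the pair with the fewest edges."""
    best = None
    for source in sorted(e1):          # sorted for a deterministic tie-break
        for target in sorted(e1_prime):
            weights = shortest_path_weights(edges, source, target)
            if weights is not None and (best is None or len(weights) < len(best[2])):
                best = (source, target, weights)
    return best


# Placeholder frames: Frame_A -> Frame_B -> Frame_C.
toy_edges = {
    "Frame_A": [("Inherits from", "Frame_B")],
    "Frame_B": [("Uses", "Frame_C")],
    "Frame_C": [],
}
pair = shortest_path_frame_pair(toy_edges, {"Frame_A", "Frame_B"}, {"Frame_C"})
```

Breadth-first search is used because the pair is chosen by the number of edges, not by their weights; the weights are only collected along the path that is found.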

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations of the invention are not described in detail in order to avoid unnecessary repetition.

Claims

1. The method for calculating the sentence similarity based on the frame importance is characterized by comprising the following steps:

step 3: calculating a frame influence factor for each frame based on the number of core frame elements in each frame; establishing a frame importance function from the frame influence factors to obtain the importance $w(f_{E,i})$ of the i-th frame in the frame semantic information set E, where $f_{E,i}$ denotes the i-th frame in the frame semantic information set E, i = 1, 2, ..., frame_S, and frame_S is the total number of frames in the frame semantic information set E;

step 5: taking each frame that appears in both E and E' as a frame group, thereby obtaining frame_same frame groups; comparing the importance of the two frames in the j-th frame group and selecting the smaller importance as the frame importance $min_j$ of the j-th frame group, j = 1, 2, ..., frame_same; accumulating the frame importances of the frame_same frame groups, and calculating the similarity of the English sentences S and S' from the accumulated value;

the similarity between the English sentence S and the sentence S' is calculated by the following formula:

where frame_rel is the number of shortest path frame pairs, and a shortest path frame pair is obtained as follows: removing from the frame semantic information set E the frames that also appear in the frame semantic information set E' to obtain a set E1; removing from the frame semantic information set E' the frames that also appear in the frame semantic information set E to obtain a set E'1; obtaining, through the visualization tool GIFN, the number of edges required for each frame in the set E1 to reach any frame in the set E'1; and taking the two frames requiring the minimum number of edges as a shortest path frame pair; the expression for $path\_value_{i'}$ is as follows:

where CountPath is the number of edges required for one frame to reach the other frame in the i'-th shortest path frame pair, and $weight_t$ is the weight of the t-th edge;

the frame influence factor in step 3 is as follows:

where $c_i$ is the total number of core frame elements in $f_{E,i}$; $n_i$ is the total number of frame elements in $f_{E,i}$; and $\beta_i$ is the frame influence factor of $f_{E,i}$;

the frame importance function in the step 3 is as follows:

2. The method for calculating sentence similarity based on frame importance according to claim 1, wherein in step 1 the English sentence S is input into the open-source semantic frame extraction tool SEMAFOR, and SEMAFOR parses the input English sentence S according to the structure of the frame semantic library FrameNet, thereby extracting the frames in the English sentence S.

3. The method for calculating sentence similarity based on frame importance according to claim 1, wherein the visualization tool GIFN for the frame semantic library FrameNet in step 2 is constructed as follows: all frames in FrameNet are used as nodes, the semantic relations among frames and the semantic relations between lexical units and frames are used as edges, and the nodes and edges are stored in the graph database Neo4j.