CN113204620A - Method, system, equipment and computer storage medium for automatically constructing narrative table - Google Patents

Method, system, equipment and computer storage medium for automatically constructing narrative table

Info

Publication number
CN113204620A
Authority
CN
China
Prior art keywords
word
words
narrative
occurrence
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110515734.3A
Other languages
Chinese (zh)
Inventor
张凯
周建设
刘杰
王伟丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University
Priority to CN202110515734.3A
Publication of CN113204620A
Legal status: Pending


Classifications

    • G06F16/3346: Information retrieval of unstructured textual data; query execution using a probabilistic model
    • G06F16/3344: Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F18/23: Pattern recognition; clustering techniques
    • G06F40/216: Handling natural language data; parsing using statistical methods
    • G06F40/279: Handling natural language data; recognition of textual entities
    • G06F40/284: Handling natural language data; lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method for automatically constructing a narrative word list. The method carries out word co-occurrence statistics and distributional-similarity calculation, and then identifies the hierarchical relations among words, so as to compile a natural-language narrative word list. First, the co-occurrence weight between words is calculated from the frequency of the words in the documents, the co-occurrence frequency between the words, and an adjustment factor. Next, feature vectors are constructed and semantic similarity is calculated, so that the words are merged into clusters. The words in each cluster are then assigned to levels according to a grade coefficient, and the broader-narrower relations between words are identified. Finally, the narrative word list is constructed from the relatedness relations among the words and the broader-narrower relations of the narrative word set.

Description

Method, system, equipment and computer storage medium for automatically constructing narrative table
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a method, a system, a device, and a computer storage medium for automatically constructing a narrative word list.
Background
The rapid development of networks has brought explosive growth of information resources. While this offers great convenience, it has also made people gradually realize that they are submerged in an ocean of information, so that accurately and efficiently acquiring the required information from this mass has become an urgent problem. Most existing network information retrieval tools (such as search engines) adopt full-text retrieval based on literal keyword matching. This approach is simple, feasible, and convenient to use, and full-text retrieval achieves a high recall rate, but it returns too much information, of which only a small part meets the searcher's needs; precision is therefore low, and both missed and false retrievals occur. Applying a normalized, controlled narrative word list to the indexing and searching process can effectively improve retrieval performance. However, traditional narrative word lists are difficult to establish and maintain, and are difficult to apply in the network information retrieval environment, so research on how to automatically construct natural-language narrative word lists is of great significance.
At present, using computer technology to automatically identify the semantic relationships among narrative words (equivalence, hierarchy, and relatedness) is the key link in realizing automatic construction of a narrative word list, and it is also the main difficulty.
Disclosure of Invention
To address the difficulty of compiling and maintaining narrative word lists, the application provides a method, a system, a device, and a computer storage medium for automatically constructing a narrative word list.
In a first aspect of the present application, a method for automatically constructing a narrative table is provided, wherein the method comprises:
S1, collecting vocabulary: inputting the original data files required for building the narrative word list;
S2, extracting each word from the original data files to form a narrative word set;
S3, calculating the co-occurrence weight between words in the narrative word set according to the frequency of the words in the files, the co-occurrence frequency between the words, and an adjustment factor, thereby obtaining the degree of association between words;
S4, constructing, for each word, a feature vector over other words according to the degree of association, where the other words are the K most relevant words;
S5, performing hierarchical clustering on the words of the narrative word set: calculating the semantic similarity between words from the feature vectors, setting a threshold, and merging words whose semantic similarity value is smaller than the threshold to form clusters;
S6, dividing the words in each cluster into levels according to the grade coefficient and identifying the broader-narrower relations;
and S7, finally, constructing the narrative word list according to the relatedness relations among the words and the broader-narrower relations of the narrative word set.
Preferably, the co-occurrence weight between words is calculated by the following formula (rendered as an image in the original):
[co-occurrence weight formula, image in original]
where W(T_i, T_j) denotes the co-occurrence weight of the words T_i and T_j, tf(T_i T_j) denotes the frequency with which T_i and T_j co-occur in the corpus, tf(T_i) denotes the frequency of T_i in the corpus, and WeightingFactor(T_i, T_j) is the adjustment factor.
preferably, the formula of the adjustment factor is:
Figure BDA0003061603720000022
min(length(di) ) express a word TiAnd TjThe minimum length in the co-occurrence corpus,
Figure BDA0003061603720000023
represents the average length of the co-occurrence corpus, and k is the co-occurrence corpus length.
Preferably, the feature vector is calculated by the following formula:
V(T) = (<T_1, W_1>, <T_2, W_2>, …, <T_k, W_k>)
where T_1, T_2, …, T_k denote the words related to the word T, and W_1, W_2, …, W_k are the co-occurrence weights of T with T_1, T_2, …, T_k, respectively.
Preferably, the semantic similarity is calculated by the following formula (rendered as an image in the original):
[semantic similarity formula, image in original]
where Sim(T_1, T_2) denotes the semantic similarity of the words T_1 and T_2, W_1i denotes the weight of the i-th dimension of the feature vector of T_1, W_2i denotes the weight of the i-th dimension of the feature vector of T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words shared by the two feature vectors.
Preferably, the grade coefficient is calculated by the following formula (rendered as an image in the original):
[grade coefficient formula, image in original]
where H(T_i) is the grade coefficient of the word T_i, tf(T_i) denotes the word frequency of T_i, and len(T_i) denotes the length of the word T_i.
Preferably, the hierarchical clustering algorithm is one of: single linkage, complete linkage, and average linkage.
Preferably, average linkage is used.
Preferably, the threshold is 0.1.
Preferably, the algorithm flow for identifying the broader-narrower relations of the words in a cluster is:
S501, determining the number of levels and assigning the words in the cluster to word levels according to their grade coefficients, words with higher grade coefficients occupying higher levels; the highest level is L_0, and the remaining levels are L_1, L_2, …, L_i in sequence;
S502, generating broader-narrower relations between adjacent word levels: for a word T in level L_i, calculating the similarity between T and each word in level L_{i-1}, and taking the word with the maximum similarity as the broader term of the word T; continuing to take words from level L_i until broader-narrower relations have been established for all words in L_i; then examining the words in level L_{i-1} and moving any word that has no narrower term down to level L_i;
and S503, judging whether the bottom level has been reached; if so, ending; otherwise, continuing to execute the operation of S502.
In a second aspect, the present application provides a system for automatically constructing a narrative word list, the system comprising: an original file acquisition module, a word segmentation module, a narrative word extraction module, and a narrative word list construction module, wherein:
the original file acquisition module is used for acquiring original file data;
the word segmentation module is used for obtaining each word in the original files;
the narrative word extraction module is used for carrying out the calculations of the above method so as to determine the relatedness relations and broader-narrower relations among words;
and the narrative word list construction module is used for constructing the narrative word list according to the relatedness relations and broader-narrower relations among the words.
A third aspect of the present application provides an apparatus for automatically constructing a narrative table, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the method as described above.
A fourth aspect of the present application provides a computer storage medium having stored thereon computer instructions for executing the method as described above when the computer instructions are invoked.
The beneficial effects of the invention are:
Compared with existing approaches to constructing a narrative word list, analyzing and calculating the relatedness between words makes it possible to identify similarity even between words that share no characters. On this basis, the level-identification method can largely distinguish words that express different subject categories; the generated word clusters are more evenly distributed, and the similarity between words within a cluster is higher. The adopted level-recognition algorithm can, in general, assign the words in a cluster to appropriate levels. A narrative word list is thus constructed automatically from the relatedness relations and the broader-narrower relations among words.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of a method for automatically constructing a narrative table disclosed in the embodiment of the application.
Fig. 2 is a schematic diagram of an algorithm flow for identifying a context relationship of words in a cluster in a method for automatically constructing a narrative table disclosed in an embodiment of the present application.
FIG. 3 is a schematic structural diagram of a system for automatically constructing a narrative table disclosed in an embodiment of the present application.
FIG. 4 is a schematic structural diagram of an apparatus for automatically constructing narrative tables disclosed in the embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it should be noted that if the terms "upper", "lower", "inside", "outside", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings or the orientation or positional relationship which the present invention product is usually put into use, it is only for convenience of describing the present application and simplifying the description, but it is not intended to indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and thus, should not be construed as limiting the present application.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
Example 1
Referring to fig. 1, fig. 1 is a flow chart illustrating a method for automatically constructing a narrative table according to an embodiment of the present disclosure. As shown in fig. 1, a first aspect of the present application provides a method for automatically constructing a narrative table, the method comprising:
S1, collecting vocabulary: inputting the original data files required for building the narrative word list;
S2, extracting each word from the original data files to form a narrative word set;
S3, calculating the co-occurrence weight between words in the narrative word set according to the frequency of the words in the files, the co-occurrence frequency between the words, and an adjustment factor, thereby obtaining the degree of association between words;
S4, constructing, for each word, a feature vector over other words according to the degree of association, where the other words are the K most relevant words;
S5, performing hierarchical clustering on the words of the narrative word set: calculating the semantic similarity between words from the feature vectors, setting a threshold, and merging words whose semantic similarity value is smaller than the threshold to form clusters;
S6, dividing the words in each cluster into levels according to the grade coefficient and identifying the broader-narrower relations;
and S7, finally, constructing the narrative word list according to the relatedness relations among the words and the broader-narrower relations of the narrative word set.
In this embodiment, the co-occurrence weight between words is calculated by the following formula (rendered as an image in the original):
[co-occurrence weight formula, image in original]
where W(T_i, T_j) denotes the co-occurrence weight of the words T_i and T_j, tf(T_i T_j) denotes the frequency with which T_i and T_j co-occur in the corpus, tf(T_i) denotes the frequency of T_i in the corpus, and WeightingFactor(T_i, T_j) is the adjustment factor.
In this embodiment, the adjustment factor is calculated by the following formula (rendered as an image in the original):
[adjustment factor formula, image in original]
where min(length(d_i)) denotes the minimum length of the documents in which the words T_i and T_j co-occur, the average-length term (also an image in the original) denotes the average length of those co-occurring documents, and k is the number of documents in the co-occurrence corpus. By calculating the degree of co-occurrence association between words, an "associated concept space" can be constructed: an undirected graph with words as nodes and co-occurrence weights as edge weights.
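For illustration only, the co-occurrence statistics of step S3 and the resulting "associated concept space" graph could be computed as in the Python sketch below. The Dice-style normalization and the weighting_factor callable are assumptions: the patent renders the actual expressions as images, so this is not the patented formula.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_weights(documents, weighting_factor):
        # documents: list of tokenized documents (lists of words).
        # weighting_factor: callable (ti, tj) -> float standing in for the
        # patent's WeightingFactor(T_i, T_j), whose exact form is an image.
        tf = Counter()    # tf(T_i): number of documents containing T_i
        cotf = Counter()  # tf(T_i T_j): number of documents containing both
        for doc in documents:
            words = sorted(set(doc))
            tf.update(words)
            cotf.update(combinations(words, 2))
        weights = {}
        for (ti, tj), co in cotf.items():
            # Dice-style normalization (an assumption, see above).
            weights[(ti, tj)] = 2 * co / (tf[ti] + tf[tj]) * weighting_factor(ti, tj)
        return weights

The returned dictionary is the undirected graph just described: each key is an edge between two word nodes, and each value is its edge weight.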
In this embodiment, the feature vector is constructed by the following formula:
V(T) = (<T_1, W_1>, <T_2, W_2>, …, <T_k, W_k>)
where T_1, T_2, …, T_k denote the words related to the word T, and W_1, W_2, …, W_k are the co-occurrence weights of T with T_1, T_2, …, T_k, respectively.
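A corresponding sketch of step S4, under the same assumptions; K = 10 is an illustrative choice, since the patent leaves K open:

    def feature_vector(word, weights, K=10):
        # V(T) = (<T_1, W_1>, ..., <T_K, W_K>): the K words most strongly
        # associated with `word`, together with their co-occurrence weights.
        related = {}
        for (ti, tj), w in weights.items():
            if ti == word:
                related[tj] = w
            elif tj == word:
                related[ti] = w
        pairs = sorted(related.items(), key=lambda kv: kv[1], reverse=True)
        return pairs[:K]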
In this embodiment, the semantic similarity between words is calculated by the following formula (rendered as an image in the original):
[semantic similarity formula, image in original]
where Sim(T_1, T_2) denotes the semantic similarity of the words T_1 and T_2, W_1i denotes the weight of the i-th dimension of the feature vector of T_1, W_2i denotes the weight of the i-th dimension of the feature vector of T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words shared by the two feature vectors.
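Because the similarity formula is likewise an image in the original, the sketch below substitutes a cosine measure over the sparse feature vectors. This matches the described ingredients (the n shared words and the vector dimension k) but is an assumption, not the patented expression:

    import math

    def semantic_similarity(v1, v2):
        # v1, v2: feature vectors as lists of (word, weight) pairs.
        d1, d2 = dict(v1), dict(v2)
        shared = d1.keys() & d2.keys()  # the n identical words
        dot = sum(d1[t] * d2[t] for t in shared)
        n1 = math.sqrt(sum(w * w for w in d1.values()))
        n2 = math.sqrt(sum(w * w for w in d2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0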
In this embodiment, the grade coefficient is calculated by the following formula (rendered as an image in the original):
[grade coefficient formula, image in original]
where H(T_i) is the grade coefficient of the word T_i, tf(T_i) denotes the word frequency of T_i, and len(T_i) denotes the length of the word T_i.
In this embodiment, the candidate hierarchical clustering algorithms include: single linkage, complete linkage, and average linkage.
Here, average-linkage hierarchical clustering is adopted, and the best results are obtained with a threshold of 0.1.
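The merging step of S5 can be sketched with SciPy's average-linkage implementation. Treating the 0.1 threshold as a cut on distance (1 - similarity) is an interpretation of the embodiment's wording, which is ambiguous on this point:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_terms(terms, sim, threshold=0.1):
        # Build a symmetric distance matrix from pairwise similarities.
        n = len(terms)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dist[i, j] = dist[j, i] = 1.0 - sim(terms[i], terms[j])
        z = linkage(squareform(dist), method="average")
        # Words whose linkage distance falls below the threshold share a cluster.
        return fcluster(z, t=threshold, criterion="distance")

fcluster returns one cluster label per input word, so words carrying equal labels form the clusters passed on to step S6.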
In this embodiment, the broader-narrower relations of the words in a cluster are identified by the following algorithm flow (a code sketch follows the steps):
S501, determining the number of levels and assigning the words in the cluster to word levels according to their grade coefficients, words with higher grade coefficients occupying higher levels; the highest level is L_0, and the remaining levels are L_1, L_2, …, L_i in sequence;
S502, generating broader-narrower relations between adjacent word levels: for a word T in level L_i, calculating the similarity between T and each word in level L_{i-1}, and taking the word with the maximum similarity as the broader term of the word T; continuing to take words from level L_i until broader-narrower relations have been established for all words in L_i; then examining the words in level L_{i-1} and moving any word that has no narrower term down to level L_i;
and S503, judging whether the bottom level has been reached; if so, ending; otherwise, continuing to execute the operation of S502.
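Putting S501-S503 together as a sketch: the patent does not specify how the number of levels is chosen or how grade coefficients map to levels, so the equal-size split and the grade and sim callables below are assumptions:

    def build_hierarchy(cluster_words, grade, sim, n_levels):
        # S501: rank by grade coefficient; high coefficients go to high levels.
        ranked = sorted(cluster_words, key=grade, reverse=True)
        size = max(1, len(ranked) // n_levels)
        levels = [ranked[i * size:(i + 1) * size] for i in range(n_levels - 1)]
        levels.append(ranked[(n_levels - 1) * size:])  # L_0, L_1, ..., L_i
        broader = {}
        for i in range(1, len(levels)):
            # S502: link each word to its most similar word one level up.
            for word in levels[i]:
                broader[word] = max(levels[i - 1], key=lambda u: sim(u, word))
            # Words in L_{i-1} without a narrower term move down to L_i.
            has_narrower = set(broader.values())
            dropped = [u for u in levels[i - 1] if u not in has_narrower]
            levels[i - 1] = [u for u in levels[i - 1] if u in has_narrower]
            levels[i].extend(dropped)
        # S503: the loop ends once the bottom level has been processed.
        return levels, broader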
Example 2
Referring to fig. 3, fig. 3 is a schematic structural diagram of a system for automatically constructing a narrative word list according to an embodiment of the present disclosure. As shown in fig. 3, a second aspect of the present application provides a system for automatically constructing a narrative word list, the system comprising: an original file acquisition module, a word segmentation module, a narrative word extraction module, and a narrative word list construction module, wherein:
the original file acquisition module is used for acquiring original file data;
the word segmentation module is used for obtaining each word in the original files;
the narrative word extraction module is used for carrying out the calculations of the above method so as to determine the relatedness relations and broader-narrower relations among words;
and the narrative word list construction module is used for constructing the narrative word list according to the relatedness relations and broader-narrower relations among the words.
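A minimal composition sketch of the four modules; the class and parameter names are illustrative, since the patent specifies module responsibilities rather than concrete interfaces:

    class ThesaurusBuilder:
        def __init__(self, read_file, segment, extract, assemble):
            self.read_file = read_file  # original file acquisition module
            self.segment = segment      # word segmentation module
            self.extract = extract      # narrative word extraction module
            self.assemble = assemble    # narrative word list construction module

        def build(self, paths):
            docs = [self.segment(self.read_file(p)) for p in paths]
            relatedness, hierarchy = self.extract(docs)  # method of Embodiment 1
            return self.assemble(relatedness, hierarchy)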
Example 3
Referring to fig. 4, fig. 4 is a schematic structural diagram of an apparatus for automatically constructing a narrative table disclosed in an embodiment of the present application. As shown in fig. 4, a third aspect of the present application provides an apparatus for automatically constructing a narrative table, comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the method for automatically constructing a narrative word list described in Embodiment 1.
Example 4
This embodiment provides a computer storage medium, wherein the computer storage medium stores computer instructions which, when invoked, execute the method for automatically constructing a narrative word list described in Embodiment 1.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for automatically constructing a narrative word list, characterized in that the method comprises the following steps:
S1, collecting vocabulary: inputting the original data files required for building the narrative word list;
S2, extracting each word from the original data files to form a narrative word set;
S3, calculating the co-occurrence weight between words in the narrative word set according to the frequency of the words in the files, the co-occurrence frequency between the words, and an adjustment factor, thereby obtaining the degree of association between words;
S4, constructing, for each word, a feature vector over other words according to the degree of association, where the other words are the K most relevant words;
S5, performing hierarchical clustering on the words of the narrative word set: calculating the semantic similarity between words from the feature vectors, setting a threshold, and merging words whose semantic similarity value is smaller than the threshold to form clusters;
S6, dividing the words in each cluster into levels according to the grade coefficient and identifying the broader-narrower relations;
and S7, finally, constructing the narrative word list according to the relatedness relations among the words and the broader-narrower relations of the narrative word set.
2. The method of claim 1, wherein the co-occurrence weight between words is calculated by the following formula (rendered as an image in the original):
[co-occurrence weight formula, image in original]
wherein W(T_i, T_j) denotes the co-occurrence weight of the words T_i and T_j, tf(T_i T_j) denotes the frequency with which T_i and T_j co-occur in the corpus, tf(T_i) denotes the frequency of T_i in the corpus, and WeightingFactor(T_i, T_j) is the adjustment factor.
3. The method of claim 2, wherein the adjustment factor is calculated by the following formula (rendered as an image in the original):
[adjustment factor formula, image in original]
wherein min(length(d_i)) denotes the minimum length of the documents in which the words T_i and T_j co-occur, the average-length term (also an image in the original) denotes the average length of those co-occurring documents, and k is the number of documents in the co-occurrence corpus.
4. The method of claim 1, wherein the feature vector is calculated by the following formula:
V(T) = (<T_1, W_1>, <T_2, W_2>, …, <T_k, W_k>)
wherein T_1, T_2, …, T_k denote the words related to the word T, and W_1, W_2, …, W_k are the co-occurrence weights of T with T_1, T_2, …, T_k, respectively.
5. The method of claim 4, wherein the semantic similarity is calculated by the following formula (rendered as an image in the original):
[semantic similarity formula, image in original]
wherein Sim(T_1, T_2) denotes the semantic similarity of the words T_1 and T_2, W_1i denotes the weight of the i-th dimension of the feature vector of T_1, W_2i denotes the weight of the i-th dimension of the feature vector of T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words shared by the two feature vectors.
6. The method of claim 1, wherein the grade coefficient is calculated by the following formula (rendered as an image in the original):
[grade coefficient formula, image in original]
wherein H(T_i) is the grade coefficient of the word T_i, tf(T_i) denotes the word frequency of T_i, and len(T_i) denotes the length of the word T_i.
7. The method of claim 1, wherein the hierarchical clustering algorithm comprises: single linkage, complete linkage, and average linkage.
8. The method of claim 7, wherein the hierarchical clustering algorithm is preferably average linkage.
9. The method of claim 8, wherein the threshold is preferably 0.1.
10. The method of claim 1, wherein the algorithm for identifying the broader-narrower relations of the words in a cluster comprises:
step 1: determining the number of levels and assigning the words in the cluster to word levels according to their grade coefficients, words with higher grade coefficients occupying higher levels; the highest level is L_0, and the remaining levels are L_1, L_2, …, L_i in sequence;
step 2: generating broader-narrower relations between adjacent word levels: for a word T in level L_i, calculating the similarity between T and each word in level L_{i-1}, and taking the word with the maximum similarity as the broader term of the word T; continuing to take words from level L_i until broader-narrower relations have been established for all words in L_i; then examining the words in level L_{i-1} and moving any word that has no narrower term down to level L_i;
and step 3: judging whether the bottom level has been reached; if so, ending; otherwise, returning to step 2.
CN202110515734.3A 2021-05-12 2021-05-12 Method, system, equipment and computer storage medium for automatically constructing narrative table Pending CN113204620A (en)

Priority Applications (1)

CN202110515734.3A (published as CN113204620A), priority date 2021-05-12, filing date 2021-05-12: Method, system, equipment and computer storage medium for automatically constructing narrative table

Applications Claiming Priority (1)

CN202110515734.3A (published as CN113204620A), priority date 2021-05-12, filing date 2021-05-12: Method, system, equipment and computer storage medium for automatically constructing narrative table

Publications (1)

CN113204620A, published 2021-08-03

Family

ID: 77031933

Family Applications (1)

CN202110515734.3A (CN113204620A, pending), priority date 2021-05-12, filing date 2021-05-12: Method, system, equipment and computer storage medium for automatically constructing narrative table

Country Status (1)

Country Link
CN (1) CN113204620A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004053735A1 (en) * 2002-12-12 2004-06-24 Honda Motor Co., Ltd. Information processing device, information processing method, and information processing program
CN104102847A (en) * 2014-07-25 2014-10-15 中国科学技术信息研究所 Chinese descriptor list building system
CN112307204A (en) * 2020-10-22 2021-02-02 首都师范大学 Clustering grade relation based automatic identification method, system, equipment and storage medium
CN112328736A (en) * 2020-11-13 2021-02-05 首都师范大学 Method and system for constructing theme word list and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜慧平; 侯汉清: "网络环境中汉语叙词表的自动构建研究" (Research on automatic construction of Chinese thesauri in the network environment), 情报学报 (Journal of the China Society for Scientific and Technical Information), no. 06

Similar Documents

Publication Publication Date Title
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN108804421B (en) Text similarity analysis method and device, electronic equipment and computer storage medium
NZ524988A (en) A document categorisation system
CN112732871B (en) Multi-label classification method for acquiring client intention labels through robot induction
CN111858912A (en) Abstract generation method based on single long text
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN112347223B (en) Document retrieval method, apparatus, and computer-readable storage medium
JP5094830B2 (en) Image search apparatus, image search method and program
CN112836029A (en) Graph-based document retrieval method, system and related components thereof
CN112633011B (en) Research front edge identification method and device for fusing word semantics and word co-occurrence information
CN115048464A (en) User operation behavior data detection method and device and electronic equipment
CN114997288A (en) Design resource association method
CN116401345A (en) Intelligent question-answering method, device, storage medium and equipment
JP2012079186A (en) Image retrieval device, image retrieval method and program
AU2018226420B2 (en) Voice assisted intelligent searching in mobile documents
CN114461783A (en) Keyword generation method and device, computer equipment, storage medium and product
CN112307364B (en) Character representation-oriented news text place extraction method
CN111125329B (en) Text information screening method, device and equipment
CN112307204A (en) Clustering grade relation based automatic identification method, system, equipment and storage medium
US20220318318A1 (en) Systems and methods for automated information retrieval
CN114943285B (en) Intelligent auditing system for internet news content data
CN113204620A (en) Method, system, equipment and computer storage medium for automatically constructing narrative table
CN115794987A (en) Cross-language information retrieval system and equipment based on shared semantic model
CN112328736A (en) Method and system for constructing theme word list and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination