CN111611341B

CN111611341B - Method and device for acquiring structural position weight of term document

Info

Publication number: CN111611341B
Application number: CN202010274874.1A
Authority: CN
Inventors: 邓吉秋; 路馥毓; 刘文毅; 李晨菡; 何美香
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2023-04-25
Anticipated expiration: 2040-04-09
Also published as: CN111611341A

Abstract

The invention relates to a method and a device for acquiring the document structure position weight of a term, the method comprising the following steps: acquiring a first weight corresponding to each position type, based on a plurality of preset position types of document structure positions and the document level corresponding to each position type; acquiring the number of terms in the document structure position corresponding to the position type; acquiring a second weight corresponding to the position type based on the first weight corresponding to the position type and the number of terms in the document structure position corresponding to the position type; acquiring a third weight corresponding to the position type, the third weight being the sum of the first weight and the second weight corresponding to the position type; and acquiring the document structure position weight of a preset specific term based on the third weights of the position types to which the preset specific term corresponds.

Description

Method and device for acquiring structural position weight of term document

Technical Field

The invention relates to the field of natural language processing, in particular to a method and a device for acquiring a structural position weight of a term document.

Background

The most commonly used and effective text characterization method is to build a term-document matrix. Each element of the matrix is the weight of the term in its row with respect to the document in its column, i.e., the importance of the term to the document. Whether a term is important to a document is reflected in two aspects: the more times the term appears in the document, the greater its importance to that document; and the more times the term appears in the whole corpus, the less meaningful, i.e., the less important, it is for the document. This is the idea of the TF-IDF algorithm.

Keyword extraction based on TextRank is another type of method, which can extract keywords from a single document. The task of TextRank keyword extraction is to automatically extract several meaningful words or phrases from a given text; the TextRank algorithm ranks candidate keywords using the relation between local words (a co-occurrence window) and extracts the keywords directly from the text itself.

The same term may appear at different positions in a document, and its effect in characterizing the subject of the document may differ accordingly. For example, the term "study" may appear in the title of a document, in a section title, in a specific paragraph, in a reference, and so on. Clearly, "study" appearing in the document title has the greatest effect in characterizing the content of the document, while "study" appearing in the references has a lesser effect. The general term-document matrix represents the characterization of the document theme by a term purely through the occurrence frequency of the term: a term that occurs frequently in a specific document but rarely in other documents is taken as a subject term, so TF-IDF tends to filter out common words and keep important words. The TextRank algorithm ranks candidate keywords using the relation between local words (a co-occurrence window) and considers only the co-occurrence relation between locally adjacent terms. Neither conventional method considers how the different structural positions of a term in a document affect document characterization, so the characterization of the document theme is inaccurate.

Disclosure of Invention

(I) Technical problem to be solved

In order to solve the problem that the prior art does not consider how the different structural positions of a term in a document affect document characterization, the invention provides a method and a device for acquiring the structural position weight of a term document.

(II) Technical solution

In order to achieve the above object, the present invention provides a method for obtaining a positional weight of a term document structure, comprising the steps of:

a1, acquiring a first weight corresponding to a position type based on a plurality of position types of a preset document structure position and a document level corresponding to the position type of the document structure position;

the first weight is N, wherein $N = 2^{n-1}$; n is the document level corresponding to the location type;

a2, acquiring the number of terms in the document structure position corresponding to the position type;

a3, acquiring a second weight corresponding to the position type based on the first weight corresponding to the position type and the number of terms in the document structure position corresponding to the position type;

the second weight corresponding to the position type is the ratio of the first weight corresponding to the position type to the number of terms in the document structure position corresponding to the position type;

a4, acquiring a third weight corresponding to the position type based on the first weight and the second weight corresponding to the position type;

the third weight corresponding to the position type is the sum of the first weight and the second weight corresponding to the position type;

a5, acquiring the document structure position weight of the preset specific term based on a third weight corresponding to the position type and the preset specific term corresponding to the position type;

the document structure position weight of the preset specific term is the sum of third weights corresponding to all position types corresponding to the specific term.

Preferably, the method further comprises:

a6, sorting the preset specific terms according to the document structure position weights of the preset specific terms to obtain specific terms in a first sequence;

the first sequence is as follows: the structure position weights are in the sequence from high to low;

a7, acquiring the first M specific terms of the first sequence according to the specific terms of the first sequence;

wherein M is a preset value.

Preferably, the plurality of location types of the preset document structure location include: a first location type, a second location type, a third location type, a fourth location type, a fifth location type, a sixth location type, a seventh location type, an eighth location type, a ninth location type, a tenth location type, an eleventh location type, a twelfth location type, a thirteenth location type, a fourteenth location type, a fifteenth location type, a sixteenth location type.

Preferably, the specific term corresponding to the position type is a term corresponding to the first position type and/or the second position type and/or the third position type and/or the fourth position type and/or the fifth position type and/or the sixth position type and/or the seventh position type and/or the eighth position type and/or the ninth position type and/or the tenth position type and/or the eleventh position type and/or the twelfth position type and/or the thirteenth position type and/or the fourteenth position type and/or the fifteenth position type and/or the sixteenth position type.

A device for acquiring the position weight of a term document structure, wherein the device for acquiring the position weight of the term document structure stores a first instruction;

the first instruction causes the obtaining device of the term document structure position weight to execute the obtaining method of the term document structure position weight according to any one of the above.

(III) Beneficial effects

The beneficial effects of the invention are as follows: the invention takes into account the differences in how terms at different structural positions in a document characterize the document theme, so that the calculation of term weights is more effective, and it highlights both the characterization effect of keyword terms located at key document structure positions and the combined effect of the different structure positions in which a term occurs.

Drawings

FIG. 1 is a flowchart of a method for obtaining a term document structure position weight according to the present invention;

FIG. 2 is a schematic diagram of the method for obtaining the term document structure position weight in the second embodiment of the present invention.

Detailed Description

The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.

Example 1

As shown in FIG. 1, the method for obtaining the structural position weight of the term document provided in this embodiment comprises the following steps:

a1, acquiring a first weight corresponding to a position type based on a plurality of preset position types of a document structure position and a document level corresponding to the position type of the document structure position.

The first weight is N, wherein $N = 2^{n-1}$; n is the preset document level corresponding to the position type.

In this embodiment, the first weight represents the weight difference between terms at different document levels. Terms in document structure positions of a high document level are more important for document characterization than terms in positions of a low document level, but they are also far fewer. For example, the word frequency of terms in the title is extremely low, whereas the word frequency of terms in text paragraphs can reach tens of thousands; relying on frequency alone therefore fails to reflect the characterization effect of terms in high-level document structure positions.

Considering that the number of terms at a lower document level grows geometrically relative to the number of terms at the level above it, this embodiment adopts $2^{n-1}$ as the first weight corresponding to the location type in practical application.
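The doubling behaviour of the first weight can be illustrated with a minimal Python sketch. The function name `first_weight` and the range of levels shown are assumptions made for illustration; the actual level assigned to each position type is given by Table 2.

```python
def first_weight(level: int) -> int:
    """Return the first weight N = 2^(n-1) for document level n (n >= 1)."""
    if level < 1:
        raise ValueError("document level must be a positive integer")
    return 2 ** (level - 1)


# Hypothetical levels 1..5: the weight doubles with each higher level.
weights = [first_weight(n) for n in range(1, 6)]
print(weights)  # [1, 2, 4, 8, 16]
```

Because the weight doubles at every level, a single occurrence at one level carries as much first weight as two occurrences at the level below it.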

The plurality of location types of the preset document structure location include: a first location type, a second location type, a third location type, a fourth location type, a fifth location type, a sixth location type, a seventh location type, an eighth location type, a ninth location type, a tenth location type, an eleventh location type, a twelfth location type, a thirteenth location type, a fourteenth location type, a fifteenth location type, a sixteenth location type.

The first location type in this embodiment is a document title; the second position type is a summary keyword; the third position type is summary content; the fourth location type is a catalog entry chapter title; the fifth location type is a non-directory entry chapter title; the sixth location type is the numbered item title; the seventh location type is a non-chapter directory entry; the eighth location type is cover non-title content; the ninth location type is flyleaf non-title content; the tenth location type is the unnumbered item content; the eleventh location type is a graph; the twelfth location type is a table; the thirteenth location type is a text paragraph; the fourteenth location type is an annex title; the fifteenth location type is the annex content; the sixteenth position type is other content including references, back covers, and the like.

A2, acquiring the number of terms in the document structure position corresponding to the position type.

A3, acquiring a second weight corresponding to the position type based on the first weight corresponding to the position type and the number of terms in the document structure position corresponding to the position type.

The second weight corresponding to the position type is the ratio of the first weight corresponding to the position type to the number of terms in the document structure position corresponding to the position type.

For example, if the location type is a document title of a first location type, the second weight corresponding to the document title is a ratio of the first weight corresponding to the document title to the number of terms contained in the document title.

In this embodiment, the second weight is a fine adjustment of the first weight. Position types at the same document level may contain very different numbers of terms; for example, the abstract may contain fewer terms than the catalog. The second weight therefore raises the weight of the position type that contains fewer terms among the position types of the same document level. Since a term at a given level should not be weighted higher than a term at a higher level, the weight difference between the two adjacent levels is divided equally among the terms in the position, which highlights terms in document structure positions that contain few terms.
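As a rough illustration of this adjustment, the following Python sketch computes the second weight for two position types that share one document level but hold different numbers of terms. The function name, the level value 3, and the term counts are hypothetical and chosen only for the example.

```python
def second_weight(level: int, term_count: int) -> float:
    """Return the second weight: first weight divided by the number of terms."""
    if term_count < 1:
        raise ValueError("a position must contain at least one term")
    first = 2 ** (level - 1)  # first weight N = 2^(n-1)
    return first / term_count


# Two hypothetical position types at the same level 3 (first weight 4).
sparse = second_weight(3, 4)   # position holding 4 terms
dense = second_weight(3, 16)   # position holding 16 terms
print(sparse, dense)  # 1.0 0.25
```

The position with fewer terms receives the larger adjustment, which matches the intent described above.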

A4, acquiring a third weight corresponding to the position type based on the first weight and the second weight corresponding to the position type.

The third weight corresponding to the location type is the sum of the first weight and the second weight corresponding to the location type.

In this embodiment, adding the first weight and the second weight to obtain the third weight ensures that the third weight of a term at a given document structure position does not exceed the third weight at a position of a higher document level, which preserves the differences in document characterization between document structure positions of different document levels.
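The ordering property stated here can be checked with a short derivation. The symbols $N_n$, $c_n$ and $T_n$ are notation introduced only for this illustration and do not appear in the original text: $N_n = 2^{n-1}$ is the first weight at level n, $c_n \ge 1$ is the number of terms in a position of level n, and $T_n$ is the resulting third weight.

```latex
T_n = N_n + \frac{N_n}{c_n} = 2^{n-1}\left(1 + \frac{1}{c_n}\right), \qquad c_n \ge 1
\;\Longrightarrow\;
2^{n-1} < T_n \le 2^{n} = N_{n+1} < T_{n+1}.
```

Under these assumptions, a third weight at level n never exceeds the first weight at level n+1, and therefore stays strictly below every third weight at level n+1.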

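A5, acquiring the document structure position weight of the preset specific term based on the third weight corresponding to the position type and the preset specific term corresponding to the position type.

The document structure position weight of the preset specific term is the sum of the third weights corresponding to the position types corresponding to the specific term.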

In this embodiment, the subject terms of the document may also be extracted according to the document structure position weights of the preset specific terms. The subject terms of the document are the preset number x of terms with the highest document structure position weights.

In this embodiment, the document is first processed with the existing TF-IDF algorithm to obtain the important terms in the document; the document structure position weights of these terms are then obtained with the method of this embodiment; finally, the subject terms of the document are extracted, the subject terms being a preset number of terms with the highest document structure position weights.

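The embodiment further includes:

A6, sorting the preset specific terms according to their document structure position weights to obtain the specific terms in a first sequence.

The first sequence is the order of document structure position weights from high to low.

A7, acquiring the first M specific terms of the first sequence.

Wherein M is a preset value.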
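The sorting and truncation to the first M terms can be sketched in Python as follows. The function name `top_m_terms`, the sample terms, and their weights are hypothetical and serve only to illustrate the ordering from high to low.

```python
from typing import Dict, List, Tuple


def top_m_terms(weights: Dict[str, float], m: int) -> List[Tuple[str, float]]:
    """Sort terms by document structure position weight, high to low, keep the first m."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


# Hypothetical terms and weights, for illustration only.
example = {"method": 10.0, "mineral": 28.0, "figure": 1.25, "survey": 18.0}
top_two = top_m_terms(example, 2)
print(top_two)  # [('mineral', 28.0), ('survey', 18.0)]
```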

This embodiment takes into account the differences in how terms at different structural positions in the document characterize the document theme, and highlights both the characterization effect of keyword terms located at key document structure positions and the combined effect of the different structure positions in which a term occurs.

In this embodiment, the document structure position weight is a comprehensive representation of the weights of a term at the different positions of the document. Several terms appearing at the same high document level position have equal third weights. If one of these terms also appears at a low document level position, it additionally carries the weight of that position and is therefore more important for document characterization than the other terms that appear only at the high-level position. The low-level position weight should thus be taken into account on top of the high-level position weight, and accumulating the third weights of a specific term over the document structure positions of different position types is the most suitable way to do so. By contrast, averaging the third weights would pull the document structure position weight down, and multiplying them could make the document structure position weight of the term greater than the weight of a term located at the next higher document level.

In this embodiment, the specific term corresponding to the preset position type is a preset term corresponding to the first position type and/or the second position type and/or the third position type and/or the fourth position type and/or the fifth position type and/or the sixth position type and/or the seventh position type and/or the eighth position type and/or the ninth position type and/or the tenth position type and/or the eleventh position type and/or the twelfth position type and/or the thirteenth position type and/or the fourteenth position type and/or the fifteenth position type and/or the sixteenth position type.

In this embodiment, if the preset specific term corresponds to both the first location type and the second location type, the structural location weight of the preset specific term is the sum of the third weight corresponding to the first location type and the third weight of the second location type.
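A small Python sketch can make the accumulation concrete and contrast it with averaging and multiplying. The levels (5 and 4) and term counts (8 and 4) are hypothetical stand-ins for the first and second position types, since the level of each position type is defined in Table 2; the function name is likewise an assumption.

```python
from math import prod
from statistics import mean


def third_weight(level: int, term_count: int) -> float:
    """Third weight = first weight 2^(n-1) plus second weight (first / term count)."""
    first = 2 ** (level - 1)
    return first + first / term_count


# Hypothetical: the term occurs in one position at level 5 holding 8 terms
# and in another position at level 4 holding 4 terms.
per_position = [third_weight(5, 8), third_weight(4, 4)]  # [18.0, 10.0]

position_weight = sum(per_position)   # accumulation used by the method
averaged = mean(per_position)         # falls below the high-level weight alone
multiplied = prod(per_position)       # grows far beyond either weight
print(position_weight, averaged, multiplied)  # 28.0 14.0 180.0
```

In this example the sum (28.0) ranks the term above one that appears only in the higher-level position (18.0), whereas the average (14.0) would rank it below.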

In this embodiment, different terms in a document structure position of the same position type have equal weights, which reflects the equal standing of different terms within a key part of the document structure.

Example 2

As shown in fig. 2, a method for obtaining a term document structure position weight is provided in the second embodiment.

(1) Input description in this embodiment

The input is the term document structure position table words_list of a specific document. It is a database table containing all terms extracted from the specific document together with their document structure position information. Each term, identified by a specific number, may have several records in the table, at the same structure position or at different structure positions of the document. The specific field definitions are shown in Table 1.

Table 1 term document structure location table definition

The pos_id value in table 1 depends on the document structure and its specific location, and the specific number is shown in table 2.

Table 2 term document structure location level definition

Each document structure position in table 2 corresponds to a certain document level, and terms in a high document level position are more capable of characterizing the subject characteristics of the document than terms in a low document level position.

(2) Output description in this embodiment

The output is the term document structure position weight table words_weights, a database table containing the correspondence between term numbers and document structure position weights. The specific field definitions are shown in Table 3.

Table 3 term document structure location weight table definition

Field name	Meaning of field	Field type	Field description
word_id	Term number	INTEGER	Unique number of a specific term
pos_weights	Comprehensive position weight	DECIMAL	Comprehensive document structure position weight of the term

(3) The term document structure position weight calculation process specifically comprises the following steps:

(3-1) System initialization: define a database operation statement execution function sql_execution. Its input parameter is a text sql, which is a database operation statement conforming to the SQL-92 standard. The function calls the database system to execute the text sql; the result of the execution is a change to a table, or to the data in a table, in the database, and the function does not directly output a result. Then proceed to (3-2).
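One possible shape for such an execution function is sketched below in Python using the standard sqlite3 module. The choice of SQLite, the explicit connection argument, and the sample table columns are assumptions made for illustration; the original text specifies only that the function takes the text sql and returns no result.

```python
import sqlite3


def sql_execution(conn: sqlite3.Connection, sql: str) -> None:
    """Execute one SQL statement for its effect on the database; return nothing."""
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(sql)


# Hypothetical in-memory database, used only to exercise the function.
conn = sqlite3.connect(":memory:")
sql_execution(conn, "CREATE TABLE words_list (word_id INTEGER, pos_id INTEGER, pos_level INTEGER)")
sql_execution(conn, "INSERT INTO words_list VALUES (1, 1, 3)")
row_count = conn.execute("SELECT COUNT(*) FROM words_list").fetchone()[0]
print(row_count)  # 1
```

Keeping the function free of return values mirrors the description: every intermediate result lives in a database table rather than in program variables.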

(3-2) Set the text sql to: SELECT pos_id, pos_level, COUNT(pos_id) AS cnt INTO temp1 FROM words_list GROUP BY pos_id, pos_level ORDER BY pos_id. Calling the function sql_execution summarizes the number of terms at the different document structure positions into the structure position weight table temp1, which contains the position number pos_id, the position level pos_level, and the number of terms at the position cnt. Then proceed to (3-3).

(3-3) Set the text sql to: ALTER TABLE temp1 ADD level_weight DECIMAL, ADD average_weight DECIMAL. Calling the function sql_execution adds two fields, level_weight and average_weight, to the structure position weight table temp1; they record the first weight and the second weight respectively. Then proceed to (3-4).

(3-4) Set the text sql to: UPDATE temp1 SET level_weight = POWER(2, pos_level-1), where POWER(2, pos_level-1) denotes 2 raised to the power pos_level-1. Calling the function sql_execution calculates the first weight of each document structure position. Then proceed to (3-5).

(3-5) Set the text sql to: UPDATE temp1 SET average_weight = level_weight/cnt. Calling the function sql_execution calculates the second weight of each document structure position. Then proceed to (3-6).

(3-6) Set the text sql to: SELECT DISTINCT word_id, pos_id, level_weight, average_weight INTO temp2 FROM words_list, temp1 WHERE words_list.pos_id = temp1.pos_id GROUP BY word_id, pos_id. Calling the function sql_execution creates the term position weight table temp2, which records the first weight and the second weight of each term's document structure position; the same term at the same position is recorded only once. Then proceed to (3-7).

(3-7) Set the text sql to: ALTER TABLE temp2 ADD pos_weight DECIMAL. Calling the function sql_execution adds a field pos_weight to the term position weight table temp2 for recording the third weight. Then proceed to (3-8).

(3-8) Set the text sql to: UPDATE temp2 SET pos_weight = level_weight + average_weight. Calling the function sql_execution calculates the third weight of each term at the different document structure positions. Then proceed to (3-9).

(3-9) Set the text sql to: SELECT word_id, SUM(pos_weight) AS pos_weights INTO words_weights FROM temp2 GROUP BY word_id. Calling the function sql_execution sums the third weights of each term over the different document structure positions, which yields the document structure position weight of the term. Then proceed to (3-10).

(3-10) Output the term document structure position weight table words_weights.
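Steps (3-2) to (3-10) can be tried end to end with the following Python sketch on an in-memory SQLite database. It is an adaptation under stated assumptions rather than the statements of the embodiment: SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is used instead; the bit shift `1 << (pos_level - 1)` stands in for POWER(2, pos_level-1); the join is written explicitly; and the input rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical input rows (word_id, pos_id, pos_level), one per term occurrence.
cur.execute("CREATE TABLE words_list (word_id INTEGER, pos_id INTEGER, pos_level INTEGER)")
cur.executemany(
    "INSERT INTO words_list VALUES (?, ?, ?)",
    [
        (1, 1, 3), (2, 1, 3),                        # position 1, level 3, 2 terms
        (1, 2, 2), (3, 2, 2), (4, 2, 2), (3, 2, 2),  # position 2, level 2, 4 terms
        (2, 3, 1), (4, 3, 1), (4, 3, 1), (5, 3, 1),  # position 3, level 1, 4 terms
    ],
)

steps = [
    # (3-2) count the terms at each document structure position
    "CREATE TABLE temp1 AS SELECT pos_id, pos_level, COUNT(pos_id) AS cnt "
    "FROM words_list GROUP BY pos_id, pos_level",
    # (3-3) add fields for the first and second weights
    "ALTER TABLE temp1 ADD COLUMN level_weight REAL",
    "ALTER TABLE temp1 ADD COLUMN average_weight REAL",
    # (3-4) first weight: 2^(pos_level - 1), written as a bit shift for SQLite
    "UPDATE temp1 SET level_weight = (1 << (pos_level - 1))",
    # (3-5) second weight: first weight divided by the term count
    "UPDATE temp1 SET average_weight = level_weight * 1.0 / cnt",
    # (3-6) one record per distinct (term, position) pair
    "CREATE TABLE temp2 AS SELECT DISTINCT w.word_id, w.pos_id, "
    "t.level_weight, t.average_weight "
    "FROM words_list AS w JOIN temp1 AS t ON w.pos_id = t.pos_id",
    # (3-7) add the field for the third weight
    "ALTER TABLE temp2 ADD COLUMN pos_weight REAL",
    # (3-8) third weight: first weight plus second weight
    "UPDATE temp2 SET pos_weight = level_weight + average_weight",
    # (3-9) sum the third weights of each term over its positions
    "CREATE TABLE words_weights AS SELECT word_id, SUM(pos_weight) AS pos_weights "
    "FROM temp2 GROUP BY word_id",
]
for sql in steps:
    cur.execute(sql)

# (3-10) read back the output table
result = dict(cur.execute(
    "SELECT word_id, pos_weights FROM words_weights ORDER BY word_id"
).fetchall())
print(result)  # {1: 8.5, 2: 7.25, 3: 2.5, 4: 3.75, 5: 1.25}
```

In this sample, term 1 appears at levels 3 and 2 and receives 6.0 + 2.5 = 8.5, while term 3 appears twice at the same position and is counted once, giving 2.5.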

In this embodiment, the number of terms at the document structure positions corresponding to the different position types is counted by aggregation; the first weight of a specific position is calculated from the document level corresponding to its position type; the second weight is calculated from the number of terms in the same position type; and the two weights are added to obtain the third weight of the term. Finally, the third weights of a term over the different position types are accumulated to obtain the document structure position weight of the term. By raising the level weight of high-level document structure positions, the embodiment promotes and highlights keywords located at those positions when characterizing the document theme. Different terms in document structure positions of the same position type have equal weights, which reflects the equal standing of different terms within a key part of the document structure. The weights of the same term at different document structure positions are accumulated into the final weight of the term, so that the characterization of the document theme by both the high-level and low-level positions of the term is considered comprehensively, and differences between terms that share the same high-level position are reflected. The method is suitable for any term weight calculation that needs to highlight how terms at different document structure positions differ in characterizing the document.

The technical principles of the present invention have been described above in connection with specific embodiments, which are provided for the purpose of explaining the principles of the present invention and are not to be construed as limiting the scope of the present invention in any way. Other embodiments of the invention will be apparent to those skilled in the art from consideration of this specification without undue burden.

Claims

1. The method for acquiring the position weight of the term document structure is characterized by comprising the following steps:

2. The method as recited in claim 1, further comprising:

the first sequence is as follows: the document structure position weight is in the order from high to low;

wherein M is a preset value.

3. The method of claim 1, wherein the plurality of location types of the predetermined document structure location include: a first location type, a second location type, a third location type, a fourth location type, a fifth location type, a sixth location type, a seventh location type, an eighth location type, a ninth location type, a tenth location type, an eleventh location type, a twelfth location type, a thirteenth location type, a fourteenth location type, a fifteenth location type, a sixteenth location type.

4. A method according to claim 3, characterized in that the specific term corresponding to the position type is a term corresponding to the first position type and/or the second position type and/or the third position type and/or the fourth position type and/or the fifth position type and/or the sixth position type and/or the seventh position type and/or the eighth position type and/or the ninth position type and/or the tenth position type and/or the eleventh position type and/or the twelfth position type and/or the thirteenth position type and/or the fourteenth position type and/or the fifteenth position type and/or the sixteenth position type.

5. The acquisition device of the structural position weight of the term document is characterized in that the acquisition device of the structural position weight of the term document stores a first instruction;

the first instruction causes the obtaining means of the term document structure position weight to execute the obtaining method of the term document structure position weight as recited in any one of claims 1 to 4.