CN111611342A - Method and device for obtaining lexical item and paragraph association weight - Google Patents
Method and device for obtaining lexical item and paragraph association weight
- Publication number
- CN111611342A (CN202010274876.0A)
- Authority
- CN
- China
- Prior art keywords
- paragraph
- document structure
- structure position
- term
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method and a device for obtaining the paragraph association weight of a term. The method comprises the following steps: A1, for any paragraph in the document structure position identified by a document structure position number, obtaining the number of terms in the paragraph and the sum of the weights of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions where the terms are located, and the weights of the terms, wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located; A2, obtaining the paragraph association weight of any term among the plurality of preset terms based on the number of terms in the paragraph and the sum of the weights of all terms in the paragraph.
Description
Technical Field
The invention relates to the technical field of document extraction, in particular to a method and a device for obtaining term paragraph association weight.
Background
Most current Chinese text classification systems use words as feature items, called feature words. Feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and a user target. Generally, a score is calculated for every feature according to a feature evaluation function, the features are ranked by score, and the highest-scoring words are selected as feature words.
The most common and effective text representation method is to build a term-document matrix. Each element of the term-document matrix is the weight of the term in the corresponding row with respect to the document in the corresponding column, i.e., the importance of that term to the document. Whether a term is important for a document is reflected in two aspects: the more often a term occurs in a document, the more important it is to that document; and the more often a term occurs across the entire corpus, the less meaningful, i.e., less important, it is for the document. This is the idea of the TF-IDF algorithm. Keyword extraction based on TextRank is another method, and it can extract keywords from a single document. The task of TextRank keyword extraction is to automatically extract several meaningful words or phrases from a given text; the TextRank algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and extracts the keywords directly from the text.
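The TF-IDF idea described above can be illustrated with a minimal sketch. The function name `tf_idf`, the sample documents, and the specific variant (term frequency as count divided by document length, inverse document frequency as log(N / df)) are assumptions made for illustration; the patent does not prescribe a particular formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized corpus; each inner list is one document.
docs = [
    ["paragraph", "weight", "term", "term"],
    ["paragraph", "document", "structure"],
    ["paragraph", "corpus"],
]
matrix = tf_idf(docs)  # one column of the term-document matrix per document
```

In this sketch a term that occurs in every document ("paragraph") receives weight 0, while a term repeated within one document and absent elsewhere ("term") receives the largest weight in that document.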
The same term may appear in different paragraphs within the same structural position of a document, and its effect on characterizing the document topic may differ accordingly. For example, paragraph 1 and paragraph 2 of a section of a document are generally continuous in their line of reasoning, and the terms in paragraph 1 are necessarily related in some way to the terms in paragraph 2 (terms may be repeated, may be semantically equivalent, or may be logically related through causal or sequential exposition, etc.). The general term-document matrix characterizes the document topic purely by the number of occurrences of terms, taking as topic words the terms that occur frequently in a specific document but rarely in other documents; TF-IDF thus tends to filter out common terms and keep important ones. The TextRank algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and only considers the co-occurrence relation between locally adjacent words. Neither of the two common methods considers how differences in the paragraph adjacency relations of terms within the same structural position of a document affect the representation of the document.
Disclosure of Invention
(I) Technical problem to be solved
In order to solve the problem that the prior art does not consider how differences in the paragraph adjacency relations of terms at the same structural position of a document affect the document representation, the invention provides a method and a device for obtaining term paragraph association weights.
(II) technical scheme
In order to achieve the above object, the present invention provides a method for obtaining term paragraph association weights, comprising the steps of:
a1, for any paragraph in the document structure position identified by a document structure position number, obtaining the number of terms in the paragraph and the sum of the weights of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions where the terms are located, and the weights of the terms;
the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located;
a2, obtaining the paragraph association weight of any term among the plurality of preset terms based on the number of terms in the paragraph and the sum of the weights of all terms in the paragraph.
Preferably, the step a2 includes:
a2-1, obtaining a first numerical value of any paragraph in the document structure position identified by the document structure position number, based on the number of terms in the paragraph and the sum of the weights of all terms in the paragraph;
wherein the first numerical value is: the average of the weights of all terms in the paragraph;
a2-2, acquiring a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the first order is: the first numerical value of the paragraphs in the document structure position corresponding to the number of the document structure position is arranged from high to low;
a2-3, determining a first association weight of any paragraph in the document structure position according to a preset initial value aiming at the document structure position corresponding to the number of the document structure position;
wherein, the first association weight of the paragraph is a preset initial value;
a2-4, acquiring any term paragraph association weight in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position.
Preferably, the step a2-4 includes:
a2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: for each paragraph preceding the paragraph in the first order, the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of that preceding paragraph;
a2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: for each paragraph preceding the paragraph in the first order, the value of 2^n;
wherein n is the absolute value of the difference between the number of the paragraph and the number of that preceding paragraph;
a2-4-3, acquiring a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the third absolute value comprises: for each paragraph preceding the paragraph in the first order, the quotient of the corresponding first absolute value and second absolute value;
a2-4-4, acquiring a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on the third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the fourth average value is: the average of the third absolute values of the paragraph with respect to all paragraphs preceding it in the first order;
a2-4-5, determining a paragraph association weight of the term based on a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph, the number of the document structure position where the term is located, and the number of the paragraph in the document structure position where the term is located.
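Steps a2-4-1 to a2-4-4 can be sketched as follows. The function name `fourth_averages`, the input layout (a mapping from paragraph number to first numerical value for one document structure position), and the sample values are hypothetical. The text does not say what the fourth average value is for the top-ranked paragraph, which has no predecessors in the first order; the sketch assumes 0.

```python
def fourth_averages(first_values):
    """Map paragraph number -> fourth average value.

    first_values: {paragraph number: average term weight of that paragraph}
    for the paragraphs of one document structure position.
    """
    # First order: paragraphs sorted by first numerical value, high to low.
    order = sorted(first_values, key=first_values.get, reverse=True)
    result = {}
    for rank, p in enumerate(order):
        thirds = []
        for q in order[:rank]:  # every paragraph ahead of p in the first order
            first_abs = abs(first_values[p] - first_values[q])  # a2-4-1
            second_abs = 2 ** abs(p - q)                        # a2-4-2: 2^n
            thirds.append(first_abs / second_abs)               # a2-4-3
        # a2-4-4; assumption: 0.0 when no paragraph precedes p.
        result[p] = sum(thirds) / len(thirds) if thirds else 0.0
    return result


# Hypothetical first numerical values for a three-paragraph position.
averages = fourth_averages({1: 4.0, 2: 8.0, 3: 2.0})
```

Here paragraph 2 ranks first and gets 0; paragraph 1 gets |4 - 8| / 2^1 = 2.0; paragraph 3 gets the mean of |2 - 8| / 2^1 = 3.0 and |2 - 4| / 2^2 = 0.5, i.e. 1.75.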
Preferably, the step a2-4-5 comprises:
a2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on a corresponding fourth average value of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph;
a2-4-5-2, determining a third association weight value of any paragraph in the document structure position corresponding to the number of the document structure position based on the second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold;
a2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight value and the first numerical value of the paragraph in which the term is located;
a2-4-5-4, obtaining paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position.
Preferably, the step A2-4-5-2 comprises the following steps:
a2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
a2-4-5-2-2, determining a third associated weight value of the paragraph based on the judgment result.
Preferably, the step A2-4-5-2-2 comprises the following steps:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight value of the paragraph is the preset threshold;
and if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight of the paragraph.
Preferably, the preset threshold is 2.
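The second, third and fourth association weights can be sketched as below. The function names are hypothetical, and the sketch rests on one interpretive assumption: the second association weight is read as the quotient of the fourth average value and the first numerical value, added to the first association weight (the preset initial value). It also assumes the first numerical value is positive.

```python
def third_association_weight(fourth_average, first_value,
                             initial=1.0, threshold=2.0):
    """Second association weight, capped at the preset threshold."""
    # Assumed reading of a2-4-5-1: quotient plus the first association weight.
    second = fourth_average / first_value + initial
    # a2-4-5-2: a value above the threshold is replaced by the threshold.
    return min(second, threshold)


def fourth_association_weight(fourth_average, first_value,
                              initial=1.0, threshold=2.0):
    """a2-4-5-3: third association weight times the paragraph's first value."""
    third = third_association_weight(fourth_average, first_value,
                                     initial, threshold)
    return third * first_value


moderate = third_association_weight(2.0, 4.0)  # 2.0 / 4.0 + 1.0
capped = third_association_weight(3.0, 1.0)    # 4.0 exceeds the threshold
```

With the preferred initial value 1 and threshold 2, the cap means a paragraph's weight can at most double.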
Preferably, the step A2-4-5-4 comprises:
a2-4-5-4-1, obtaining a total value of fourth association weights of any term in any paragraph in a document structure position corresponding to the number of the document structure position based on the fourth association weight of any term;
a2-4-5-4-2, acquiring the number of any term in a plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of the paragraph in the document structure position where the term is located;
a2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term among the plurality of terms;
wherein the paragraph association weight is the average of all fourth association weights of the term.
Preferably, the preset initial value is 1.
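The averaging in steps a2-4-5-4-1 to a2-4-5-4-3 can be sketched as follows. The function name, the tuple layout of an occurrence, and the sample data are hypothetical; the sketch assumes that every record of a term counts as one occurrence and that the fourth association weight of a term equals the value computed for the paragraph it occurs in.

```python
from collections import defaultdict


def term_paragraph_association_weights(occurrences, fourth_weights):
    """Average the fourth association weights over each term's occurrences.

    occurrences: iterable of (term, section number, paragraph number).
    fourth_weights: {(section number, paragraph number): fourth association weight}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for term, section, paragraph in occurrences:
        totals[term] += fourth_weights[(section, paragraph)]  # a2-4-5-4-1
        counts[term] += 1                                     # a2-4-5-4-2
    # a2-4-5-4-3: total divided by the number of occurrences.
    return {term: totals[term] / counts[term] for term in totals}


# Hypothetical data: one structure position with two paragraphs.
occurrences = [("weight", 1, 1), ("weight", 1, 2), ("term", 1, 2)]
fourth_weights = {(1, 1): 6.0, (1, 2): 4.0}
result = term_paragraph_association_weights(occurrences, fourth_weights)
```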
An apparatus for obtaining term paragraph association weights, the apparatus for obtaining term paragraph association weights storing computer instructions; the computer instructions cause the apparatus for obtaining term paragraph association weights to perform the method for obtaining term paragraph association weights as described in any one of the above.
(III) advantageous effects
The invention has the following beneficial effects: when characterizing the document topic, the method takes into account the adjacency between a paragraph and paragraphs with a high average term weight, raises the paragraph association weight of the terms in those adjacent paragraphs, and thereby raises and highlights the standing of terms near the important paragraphs of the document structure.
The invention simultaneously considers the differing degrees of influence of multiple paragraphs at different adjacency distances within the same document structure position, and reflects the combined effect of multiple paragraphs.
The invention averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively considers how the same term at different document structure positions contributes differently to the document topic representation.
Drawings
FIG. 1 is a flow chart of a method for obtaining term paragraph association weights according to the present invention;
FIG. 2 is a schematic diagram illustrating a method for obtaining term paragraph association weights in the second embodiment of the present invention.
Detailed Description
For the purpose of better explaining the invention and to facilitate understanding, the invention is described in detail below by way of specific embodiments with reference to the accompanying drawings.
First Embodiment
Referring to fig. 1, the method for obtaining term paragraph association weights in the first embodiment includes the steps of:
a1, for any paragraph in the document structure position identified by a document structure position number, obtaining the number of terms in the paragraph and the sum of the weights of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions where the terms are located, and the weights of the terms.
The number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
A2, obtaining the paragraph association weight of any term among the plurality of preset terms based on the number of terms in the paragraph and the sum of the weights of all terms in the paragraph.
In this embodiment, the topic words of the document may be extracted according to the paragraph association weight of each of the plurality of preset terms. The topic words of the document are the n terms with the highest paragraph association weight.
In this embodiment, the document may first be processed with the existing TF-IDF algorithm to obtain its important terms; the paragraph association weights of these terms are then obtained according to the method of this embodiment; and finally the topic words of the document are extracted, the topic words being the n terms with the highest paragraph association weight among these terms.
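Selecting the n terms with the highest paragraph association weight might look like the short sketch below. The function name `topic_terms` and the sample weights are hypothetical.

```python
import heapq


def topic_terms(paragraph_association_weights, n):
    """Return the n terms with the highest paragraph association weight."""
    return heapq.nlargest(n, paragraph_association_weights,
                          key=paragraph_association_weights.get)


# Hypothetical paragraph association weights keyed by term.
weights = {"paragraph": 1.0, "weight": 3.0, "term": 2.0}
top_two = topic_terms(weights, 2)
```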
Preferably, in this embodiment, the step a2 includes:
a2-1, obtaining a first numerical value of any paragraph in the document structure position identified by the document structure position number, based on the number of terms in the paragraph and the sum of the weights of all terms in the paragraph.
Wherein the first numerical value is: the average of the weights of all terms in the paragraph.
A2-2, obtaining a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position based on the first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position.
Wherein the first order is: the paragraphs in the document structure position identified by the document structure position number, arranged by first numerical value from high to low.
A2-3, determining a first association weight of any paragraph in the document structure position according to a preset initial value for the document structure position corresponding to the number of the document structure position.
Wherein, the first association weight of the paragraph is a preset initial value.
For example, if a results section has 5 paragraphs, the paragraph numbers and average weights (original paragraph sequence) are:
Paragraph 1: 5.6
Paragraph 2: 3.2
Paragraph 3: 8.8
Paragraph 4: 1.2
Paragraph 5: 6.6
Sorted by average weight, with the association weights initialized, the paragraphs are as follows (ordered paragraph sequence):
Paragraph 3: 8.8, 1 (initial value)
Paragraph 5: 6.6, 1 (initial value)
Paragraph 1: 5.6, 1 (initial value)
Paragraph 2: 3.2, 1 (initial value)
Paragraph 4: 1.2, 1 (initial value)
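The ordered sequence above can be reproduced with a few lines of code. The variable names are hypothetical; the values are those of the five-paragraph example.

```python
# First numerical values (average weights) keyed by paragraph number.
first_values = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}
INITIAL = 1.0  # preset initial value of the first association weight

# A2-2: first order, from the highest average weight to the lowest.
first_order = sorted(first_values, key=first_values.get, reverse=True)
# A2-3: every paragraph starts with the preset initial value.
first_association = {p: INITIAL for p in first_order}

for p in first_order:
    print(f"Paragraph {p}: {first_values[p]}, {first_association[p]}")
```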
Paragraphs 1, 3 and 5 in the original paragraph sequence have higher average weights, which indicates that these paragraphs contain more terms with high original weights; such paragraphs are more important for characterizing the document features. Given the contextual relevance of natural language, a paragraph with a low average weight, such as paragraph 2 or paragraph 4, may be adjacent to paragraphs with higher average weights; the average weight of such a paragraph does not reflect this contextual relevance, so its weight needs to be raised appropriately. For example, the average weight of paragraph 4 is 1.2 while the average weights of the preceding and following paragraphs 3 and 5 are higher, which indicates that paragraph 4 is meaningful for the characterization of paragraphs 3 and 5. The raised weight of paragraph 4 is higher than its originally calculated average weight, but cannot exceed the weights of paragraphs 3 and 5; the raised weights must not change the original ranking order, and the farther apart two paragraphs are, the weaker their relevance. In the ordered paragraph sequence, paragraph 3, in first place, has the highest average weight and cannot be raised further; only the second and subsequent paragraphs can be raised, so processing starts from the paragraph in second place.
A2-4, acquiring any term paragraph association weight in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position.
Preferably, in this embodiment, the step a2-4 includes:
a2-4-1, obtaining a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of paragraphs in the document structure position corresponding to the number of the document structure position.
Wherein the first absolute value of any paragraph in the document structure position comprises: for each paragraph preceding the paragraph in the first order, the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of that preceding paragraph.
A2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position.
Wherein the second absolute value of any paragraph in the document structure position comprises: for each paragraph preceding the paragraph in the first order, the value of 2^n.
Where n is the absolute value of the difference between the number of the paragraph and the number of that preceding paragraph.
The average weight difference is divided by 2 raised to the power of the paragraph number difference: the farther apart two paragraphs are, the smaller their relevance. The average weight difference between directly adjacent paragraphs is divided by 2, that between paragraphs whose numbers differ by 2 is divided by 4, and so on, i.e., the difference is divided by 2^n, where n is the absolute value of the paragraph number difference, so that paragraphs farther away have less influence on the current paragraph.
A2-4-3, obtaining a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position.
Wherein the third absolute value comprises: for each paragraph preceding the paragraph in the first order, the quotient of the corresponding first absolute value and second absolute value.
A2-4-4, obtaining a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on the third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position.
Wherein the fourth average value is: the average of the third absolute values of the paragraph with respect to all paragraphs preceding it in the first order.
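Steps A2-4-1 to A2-4-4 can be written compactly as formulas. The symbols are introduced here for illustration only and do not appear in the patent: v_i is the first numerical value of paragraph number i, P_i is the set of paragraphs that precede paragraph i in the first order, d_ij, 2^(n_ij) and t_ij are the first, second and third absolute values, and a_i is the fourth average value. The last two lines apply them to paragraphs 5 and 1 of the five-paragraph example.

```latex
\begin{aligned}
d_{ij} &= \lvert v_i - v_j \rvert, \qquad
n_{ij} = \lvert i - j \rvert, \qquad
t_{ij} = \frac{d_{ij}}{2^{\,n_{ij}}}, \\
a_i &= \frac{1}{\lvert P_i \rvert} \sum_{j \in P_i} t_{ij}, \\
a_5 &= \frac{\lvert 6.6 - 8.8 \rvert}{2^{\lvert 5 - 3 \rvert}}
     = \frac{2.2}{4} = 0.55, \\
a_1 &= \frac{1}{2}\left(
       \frac{\lvert 5.6 - 8.8 \rvert}{2^{2}}
     + \frac{\lvert 5.6 - 6.6 \rvert}{2^{4}}\right)
     = \frac{0.8 + 0.0625}{2} = 0.43125 .
\end{aligned}
```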
A2-4-5, determining a paragraph association weight of the term based on a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph, the number of the document structure position where the term is located, and the number of the paragraph in the document structure position where the term is located.
Preferably, in this embodiment, the step a2-4-5 includes:
a2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on the corresponding fourth average value of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph.
Wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph.
A2-4-5-2, determining a third association weight value of any paragraph in the document structure position corresponding to the number of the document structure position based on the second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold value.
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value.
Wherein the fourth association weight of any term in the paragraph is: the product of the third association weight value and the first numerical value of the paragraph in which the term is located.
A2-4-5-4, obtaining paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position.
Preferably, in this embodiment, the step a2-4-5-2 includes:
a2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result.
A2-4-5-2-2, determining a third associated weight value of the paragraph based on the judgment result.
Preferably, in this embodiment, the step a2-4-5-2-2 includes:
If the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight value of the paragraph is the preset threshold.
And if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight of the paragraph.
In this embodiment, the preset threshold is preferably 2.
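Putting the paragraph-level steps together for the five-paragraph example gives the sketch below. The function name `promoted_weights` is hypothetical. The sketch assumes that the second association weight is the quotient plus the initial value, and that the top-ranked paragraph, which has no predecessors in the first order, keeps its original weight.

```python
first_values = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}
INITIAL, THRESHOLD = 1.0, 2.0  # preferred values stated in this embodiment


def promoted_weights(first_values):
    """Return {paragraph number: third association weight * first value}."""
    order = sorted(first_values, key=first_values.get, reverse=True)
    promoted = {}
    for rank, p in enumerate(order):
        # Third absolute values against every paragraph ahead of p.
        thirds = [abs(first_values[p] - first_values[q]) / 2 ** abs(p - q)
                  for q in order[:rank]]
        # Assumption: fourth average is 0.0 for the top-ranked paragraph.
        fourth_avg = sum(thirds) / len(thirds) if thirds else 0.0
        second = fourth_avg / first_values[p] + INITIAL  # assumed reading
        third = min(second, THRESHOLD)                   # cap at threshold
        promoted[p] = third * first_values[p]
    return promoted


for p, w in promoted_weights(first_values).items():
    print(f"Paragraph {p}: {first_values[p]} -> {round(w, 5)}")
```

Under these assumptions the weights become approximately 8.8, 7.15, 6.03, 4.68 and 2.4 for paragraphs 3, 5, 1, 2 and 4. Paragraph 4 hits the threshold, so its weight is doubled rather than raised further, and the original ranking order is preserved.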
Preferably, in this embodiment, the step a2-4-5-4 includes:
a2-4-5-4-1, obtaining the total value of the fourth association weights of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the fourth association weight of any term.
A2-4-5-4-2, obtaining the number of any term in a plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of the paragraph in the document structure position where the term is located.
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term among the plurality of terms.
Wherein the paragraph association weight is the average of all fourth association weights of the term.
In this embodiment, the preset initial value is preferably 1.
In this embodiment, when characterizing the document topic, the adjacency between a paragraph and paragraphs with a high average term weight is taken into account, the paragraph association weight of the terms in those adjacent paragraphs is raised, and the standing of terms near the important paragraphs of the document structure is thereby raised and highlighted.
Detailed description of the invention
To better explain the present invention, and referring to Fig. 2, the term document paragraph position table of this embodiment is input into the computer in advance; this table is explained first.
In this embodiment, the term document paragraph position table words_list that is input for a specific document is a database table containing all terms extracted from that document together with their document paragraph position information. A term with a given number may have multiple records in the table, in different paragraphs of the same document structure position or in different sentences of the same paragraph. The field definitions are shown in Table 1.
Table 1. Field definitions of the term document paragraph position table
Field name | Meaning | Type | Description |
---|---|---|---|
word_id | Term number | INTEGER | Unique number of a specific term |
word_weight | Basic weight of the term | DECIMAL | Basic weight of a specific term |
section_id | Document structure number | INTEGER | Number of the document structure position where the term is located |
parag_id | Document paragraph number | INTEGER | Number of the document paragraph where the term is located |
In Table 1, word_weight is the basic weight of the term, obtained by some method, for example the method of the patent "A method and apparatus for obtaining hierarchical weight of domain document lexical item"; section_id is the number of the document structure position in which the paragraph is located, and one or more paragraphs may exist at the same section_id position; parag_id is the number of the document paragraph in which the term is located, adjacent paragraphs have consecutive numbers, and one or more terms may exist in the same parag_id paragraph.
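To make the table layout concrete, here is a minimal sketch that builds such a table in an in-memory SQLite database. SQLite, the helper name `build_words_list`, and the sample rows are all assumptions chosen for illustration; the patent names no specific database system and gives no sample data.

```python
import sqlite3


def build_words_list(rows):
    """Create the term document paragraph position table and fill it.

    rows: iterable of (word_id, word_weight, section_id, parag_id) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE words_list (
               word_id     INTEGER,  -- unique number of the term
               word_weight DECIMAL,  -- basic weight of the term
               section_id  INTEGER,  -- document structure position number
               parag_id    INTEGER   -- document paragraph number
           )"""
    )
    conn.executemany("INSERT INTO words_list VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Invented sample data: term 1 occurs in paragraphs 1 and 2 of structure position 1.
SAMPLE_ROWS = [
    (1, 0.9, 1, 1),
    (2, 0.5, 1, 1),
    (1, 0.9, 1, 2),
    (3, 0.3, 1, 2),
    (3, 0.3, 1, 3),
]
conn = build_words_list(SAMPLE_ROWS)
```

Note that the same `word_id` may appear in several rows, one per occurrence, which is what later allows the weights to be averaged per term.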
In this embodiment, after the term document paragraph position table of a specific document is input, and referring to Fig. 1, the term paragraph association weights are obtained according to the method for obtaining term paragraph association weights. The specific steps include:
B1, acquiring, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, the number of terms in any paragraph in the document structure position corresponding to a document structure position number and the sum of the weights of all terms in that paragraph;
wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
In a specific application of this embodiment, the specific steps are as follows:
(1-1) Enter system initialization and define a database statement execution function sql_execute. The input parameter of sql_execute is the text sql, a database operation statement conforming to the SQL-92 standard. The function calls a database system function to execute the text sql; the execution result is a table in the database or a change to the data in a table, and the function itself does not directly output a result. Then enter (1-2).
(1-2) Set the text sql to: SELECT section_id, parag_id, SUM(word_weight)/COUNT(word_weight) AS avg_weights, 1.0 AS parag_weight INTO parags_weights FROM words_list GROUP BY section_id, parag_id ORDER BY section_id, avg_weights DESC. By calling the function sql_execute, the term weights are accumulated and the terms are counted for each paragraph; the accumulated term weight of the paragraph is divided by the number of terms to obtain the paragraph term average weight; the initial value of the paragraph association weight is set to 1.0; the records are sorted by ascending document structure position number and descending paragraph term average weight; and the results are stored in the paragraph weight table parags_weights, which comprises the document structure number section_id, the document paragraph number parag_id, the paragraph term average weight avg_weights, and the paragraph association weight parag_weight. Then enter (1-3).
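Steps (1-1) and (1-2) can be sketched as follows, assuming SQLite as the database system (the patent does not name one). Because SQLite has no `SELECT ... INTO`, the sketch uses `CREATE TABLE ... AS SELECT` instead; the sample rows are invented, and `sql_execute` here takes an explicit connection argument, which is an assumption of this example.

```python
import sqlite3


def sql_execute(conn, sql):
    """Execute one statement; the result stays in the database, nothing is returned."""
    conn.execute(sql)
    conn.commit()


conn = sqlite3.connect(":memory:")
sql_execute(conn, "CREATE TABLE words_list (word_id INTEGER, word_weight REAL, "
                  "section_id INTEGER, parag_id INTEGER)")
conn.executemany("INSERT INTO words_list VALUES (?, ?, ?, ?)", [
    (1, 0.8, 1, 1), (2, 0.4, 1, 1),  # paragraph 1: two terms, average about 0.6
    (3, 0.2, 1, 2),                  # paragraph 2: one term, average 0.2
    (1, 0.8, 1, 3), (4, 1.0, 1, 3),  # paragraph 3: two terms, average about 0.9
])

# Per-paragraph average term weight plus the initial association weight 1.0.
sql_execute(conn, """
    CREATE TABLE parags_weights AS
    SELECT section_id, parag_id,
           SUM(word_weight) / COUNT(word_weight) AS avg_weights,
           1.0 AS parag_weight
    FROM words_list
    GROUP BY section_id, parag_id
    ORDER BY section_id, avg_weights DESC
""")

rows = conn.execute(
    "SELECT parag_id, avg_weights, parag_weight FROM parags_weights "
    "ORDER BY section_id, avg_weights DESC"
).fetchall()
```

With this sample data the paragraphs come back in the order 3, 1, 2, that is, from the highest to the lowest average term weight within the structure position.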
B2, acquiring the paragraph association weight of any term among the preset plurality of terms, based on the number of terms in any paragraph in the document structure position corresponding to the document structure position number and the sum of the weights of all terms in that paragraph.
In a specific application of this embodiment, the specific steps include:
(1-3) For each record of the paragraph weight table parags_weights, calculate the paragraph association weight according to how high the term average weight of each paragraph in the same document structure position is and according to the adjacency between paragraphs. Set the current document structure position section to 0, read the first record of the paragraph weight table parags_weights as the current record, and enter (1-3-1).
(1-3-1) Read the document structure number section_id, the paragraph number parag_id, and the paragraph term average weight avg_weights of the current record, and enter (1-3-2).
(1-3-2) Judge whether the current document structure position section is equal to the document structure number section_id; if so, enter (1-3-3); otherwise, enter (1-3-8).
(1-3-3) Set the text sql to: SELECT parag_id, (avg_weights-)/POWER(2, ABS(parag_id-) +1 AS weights INTO temp FROM parags_weights WHERE section_id [%]. Convert the term average weight avg_weights, the paragraph number parag_id, and the document structure number section_id of the current paragraph into character strings and substitute them for the respective placeholders ([#], [%]) in the text sql. By calling the function sql_execute, obtain all paragraphs within the current document structure position whose term average weight is higher than that of the current paragraph; calculate the difference between the term average weight of each such paragraph and that of the current paragraph, and the absolute value of the difference between the number of each such paragraph and the number of the current paragraph; divide the term average weight difference by 2 raised to the power of that absolute value to obtain the accumulated weight that the current paragraph receives from each such paragraph; and write the result into the temporary table temp. Enter (1-3-4).
(1-3-4) Set the text sql to: SELECT SUM(weights)/COUNT(weights)/[#]+1 FROM temp. Convert the term average weight avg_weights of the current paragraph into a character string and substitute it for [#] in the text sql. By calling the function sql_execute, sum the accumulated weights in the temporary table temp, count the paragraphs, take the average, divide the average by the term average weight of the current paragraph, and add 1, which gives the initial association weight of the current paragraph. Enter (1-3-5).
(1-3-5) Judge whether the initial association weight of the current paragraph is greater than 2; if so, change it to 2; otherwise, leave it unchanged. Then enter (1-3-6).
(1-3-6) Set the text sql to: UPDATE parags_weights SET parag_weight = avg_weights # WHERE parag_id =; convert the initial association weight of the current paragraph and the paragraph number parag_id into character strings and substitute them for the placeholders # and % in the text sql, respectively. By calling the function sql_execute, update the association weight and the term average weight of the current paragraph. Then enter (1-3-7).
(1-3-7) Set the text sql to: DROP TABLE temp. By calling the function sql_execute, delete the temporary table temp. Enter (1-3-8).
(1-3-8) Judge whether the current record is the last record of the paragraph weight table parags_weights; if so, enter (1-4); otherwise, read the next record as the current record and enter (1-3-1).
(1-4) Set the text sql to: SELECT word_id, parag_weight INTO temp FROM words_list, parags_weights WHERE words_list.parag_id = parags_weights.parag_id. By calling the function sql_execute, assign to every term of a paragraph, matched by paragraph number, the paragraph association weight of that paragraph, and write the result into the term paragraph weight association table temp. Then enter (1-5).
(1-5) Set the text sql to: SELECT word_id, SUM(parag_weight)/COUNT(parag_weight) AS re_weight INTO words_weights FROM temp GROUP BY word_id. By calling the function sql_execute, accumulate, from the term paragraph weight association table, the paragraph association weights of each term over the different paragraphs of the document, divide the total by the number of occurrences of the term in the document to obtain the term paragraph association weight, and write the result into the term document paragraph association weight table words_weights. Then enter (1-6).
(1-6) Output the term document paragraph association weight table words_weights.
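The loop of steps (1-3) to (1-5) can be sketched in plain Python as below. This is one reading of the described procedure, not the patent's own code: the function names are hypothetical, all paragraph weights are computed from the original (not updated) average weights, paragraph numbers are assumed to be unique across the whole document, and every paragraph average is assumed to be greater than zero. Paragraphs with equal averages are ordered arbitrarily, in which case the earlier one contributes 0 to the later one.

```python
from collections import defaultdict


def paragraph_association_weights(avg_by_parag, initial=1.0, cap=2.0):
    """avg_by_parag: {parag_id: average term weight} for ONE structure position."""
    # Sort paragraphs from the highest to the lowest average term weight.
    ordered = sorted(avg_by_parag, key=lambda p: avg_by_parag[p], reverse=True)
    weights = {}
    for i, p in enumerate(ordered):
        higher = ordered[:i]  # paragraphs ranked before p
        if not higher:
            weights[p] = initial  # the top paragraph keeps the initial value
            continue
        # Weight difference damped by 2 ** (distance in paragraph numbers).
        contributions = [
            abs(avg_by_parag[q] - avg_by_parag[p]) / 2 ** abs(q - p)
            for q in higher
        ]
        mean_contribution = sum(contributions) / len(contributions)
        # Divide by the paragraph's own average, add the initial value, cap at 2.
        weights[p] = min(mean_contribution / avg_by_parag[p] + initial, cap)
    return weights


def term_weights(occurrences, parag_weights):
    """occurrences: iterable of (word_id, parag_id); average the paragraph weights per term."""
    sums, counts = defaultdict(float), defaultdict(int)
    for word_id, parag_id in occurrences:
        sums[word_id] += parag_weights[parag_id]
        counts[word_id] += 1
    return {w: sums[w] / counts[w] for w in sums}


parag_w = paragraph_association_weights({1: 0.6, 2: 0.2, 3: 0.9})
words_w = term_weights([(1, 1), (2, 1), (3, 2), (1, 3), (4, 3)], parag_w)
```

In this invented example, paragraph 3 has the highest average and keeps the weight 1.0; paragraph 1, two paragraphs away from it, is raised to roughly 1.125; and paragraph 2, which is adjacent to both stronger paragraphs, would exceed the threshold and is therefore capped at 2.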
In this embodiment, when characterizing the document topic, the adjacency between a paragraph and the paragraphs with a high average term weight is taken into account: the paragraph association weight of terms in the neighboring paragraphs is raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
In the second embodiment, a document is first processed according to the existing TF-IDF algorithm to obtain the important terms in the document; then the term paragraph association weight table words_weights of those terms is obtained according to the method for obtaining term paragraph association weights of this embodiment; finally, the subject words of the document are extracted, the subject words being the n terms with the highest paragraph association weights.
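The final selection step of the second embodiment amounts to a top-n selection over the words_weights table. A minimal sketch follows, assuming the table has already been loaded into a dictionary; the function name `top_subject_words` and the sample weights are hypothetical.

```python
import heapq


def top_subject_words(words_weights, n):
    """Return the n word_ids with the highest term paragraph association weight.

    words_weights: {word_id: re_weight}, e.g. read from the words_weights table.
    """
    return heapq.nlargest(n, words_weights, key=words_weights.get)


# Invented sample weights for four terms.
subject_words = top_subject_words({1: 1.0625, 2: 1.125, 3: 2.0, 4: 1.0}, 2)
```

Here terms 3 and 2 would be selected as subject words, in that order.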
This embodiment simultaneously considers, within the same document structure position, the differences in influence of several important paragraphs and their adjacency distances, thereby reflecting the combined effect of multiple paragraphs;
this embodiment averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively accounts for how the same term at different document structure positions differs in representing the document topic;
the method of this embodiment is suitable for calculating term weights that highlight differences in document characterization arising from the adjacency relations of different paragraphs.
The technical principles of the present invention have been described above in connection with specific embodiments, which are intended to explain the principles of the present invention and should not be construed as limiting the scope of the present invention in any way. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive efforts, which shall fall within the scope of the present invention.
Claims (10)
1. A method for obtaining term paragraph association weight is characterized by comprising the following steps:
A1, acquiring, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, the number of terms in any paragraph in the document structure position corresponding to a document structure position number and the sum of the weights of all terms in that paragraph;
wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located;
A2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the document structure position number and the sum of the weights of all terms in that paragraph.
2. The method according to claim 1, wherein said step a2 comprises:
A2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position, based on the number of terms in that paragraph and the sum of the weights of all terms in that paragraph;
wherein the first numerical value is: the average of the weights of all terms in the paragraph;
a2-2, acquiring a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the first order is: the first numerical value of the paragraphs in the document structure position corresponding to the number of the document structure position is arranged from high to low;
a2-3, determining a first association weight of any paragraph in the document structure position according to a preset initial value aiming at the document structure position corresponding to the number of the document structure position;
wherein, the first association weight of the paragraph is a preset initial value;
a2-4, acquiring any term paragraph association weight in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position.
3. The method according to claim 2, wherein the step a2-4 comprises:
a2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding it in the first order;
a2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: the value of $2^n$ for the paragraph and each paragraph preceding it in the first order;
wherein n is the absolute value of the difference between the number of the paragraph and the number of the respective paragraph preceding it in the first order;
a2-4-3, acquiring a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the third absolute value comprises: the quotient of the first absolute value and the second absolute value for the paragraph and each paragraph preceding it in the first order;
a2-4-4, acquiring a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on the third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the fourth average is: the average of the third absolute values of the paragraph with respect to all paragraphs preceding it in the first order;
a2-4-5, determining a paragraph association weight of the term based on a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph, the number of the document structure position where the term is located, and the number of the paragraph in the document structure position where the term is located.
4. The method of claim 3, wherein the step A2-4-5 comprises:
a2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on a corresponding fourth average value of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph;
a2-4-5-2, determining a third association weight value of any paragraph in the document structure position corresponding to the number of the document structure position based on the second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold;
a2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value;
wherein the fourth association weight of any term in the paragraph is: the product of the third weight value and the first numerical value of the paragraph in which the term is located;
a2-4-5-4, obtaining paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position.
5. The method of claim 4, wherein the step A2-4-5-2 comprises:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result.
6. The method of claim 5, wherein the step a2-4-5-2-2 comprises:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset value;
and if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is the same as the second association weight of the paragraph.
7. The method according to claim 5 or 6, wherein the predetermined threshold is 2.
8. The method of claim 7, wherein the step a2-4-5-4 comprises:
a2-4-5-4-1, obtaining a total value of fourth association weights of any term in any paragraph in a document structure position corresponding to the number of the document structure position based on the fourth association weight of any term;
a2-4-5-4-2, acquiring the number of any term in a plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of the paragraph in the document structure position where the term is located;
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of that term and the number of occurrences of that term among the plurality of terms;
wherein the paragraph association weight (the final association weight) is the average of all fourth association weights of the term.
9. The method of claim 2, wherein the predetermined initial value is 1.
10. An apparatus for obtaining term paragraph association weights, wherein the apparatus for obtaining term document paragraph association weights stores computer instructions; the computer instructions cause the means for obtaining term document paragraph association weights to perform the method for obtaining term paragraph association weights as claimed in any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010274876.0A CN111611342B (en) | 2020-04-09 | 2020-04-09 | Method and device for obtaining lexical item and paragraph association weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611342A (en) | 2020-09-01 |
CN111611342B (en) | 2023-04-18 |
Family
ID=72201801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010274876.0A Active CN111611342B (en) | 2020-04-09 | 2020-04-09 | Method and device for obtaining lexical item and paragraph association weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611342B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventor after: Deng Jiqiu, Lu Biyu, Liu Wenyi, Li Chenhan, He Meixiang. Inventor before: Deng Jiqiu, Lu Biyu, Li Chenhan |
GR01 | Patent grant | ||