CN111611342B - Method and device for obtaining lexical item and paragraph association weight - Google Patents
- Publication number
- CN111611342B CN202010274876.0A
- Authority
- CN
- China
- Prior art keywords
- paragraph
- document structure
- structure position
- value
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method and a device for obtaining term paragraph association weights. The method comprises the following steps: A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, acquiring the number of terms in any paragraph in the document structure position corresponding to a document structure position number and the total weight of all terms in that paragraph, where the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located; and A2, acquiring the paragraph association weight of any term in the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the document structure position number and the total weight of all terms in that paragraph.
Description
Technical Field
The invention relates to the technical field of document extraction, in particular to a method and a device for obtaining term paragraph association weight.
Background
Most current Chinese text classification systems use words as feature items, called feature words. Feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and a user target. Generally, a score is computed for every feature with some feature evaluation function, the features are ranked by score, and the words with the highest scores are selected as feature words.
The most common and effective text characterization method is to build a term-document matrix. Each element of the matrix is the weight of the term on its row with respect to the document on its column, i.e., the importance of that term to the document. Whether a term is important for a document is reflected in two aspects: the more often a term appears in a document, the more important it is to that document; and the more often a term occurs across the entire corpus, the less meaningful, i.e., less important, it is for the document. This is the idea of the TF-IDF algorithm. Keyword extraction based on TextRank is another method, which can extract keywords from a single document. The TextRank keyword extraction task is to automatically extract several meaningful words or phrases from a given text; the algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and extracts the keywords directly from the text.
The same term may appear in different paragraphs within the same structural position of a document, and its effect in characterizing the document topic may differ accordingly. For example, paragraph 1 and paragraph 2 of a section of a document generally follow a continuous line of thought, and the terms in paragraph 1 and the terms in paragraph 2 are necessarily related in some way (through repeated occurrences of terms, latent semantic equivalence, or a logical relation such as cause and effect or sequential exposition). The general term-document matrix characterizes the document topic purely by the number of occurrences of terms and treats as topic words those terms whose frequency in a specific document is high relative to their frequency in other documents; TF-IDF tends to filter out common terms and keep important ones. The TextRank algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and considers only the co-occurrence of locally adjacent words. Neither of these two common methods considers how the adjacency of the different paragraphs in which a term appears, within the same structural position of the document, affects the document representation.
Disclosure of Invention
(I) Technical problem to be solved
The prior art does not consider how the adjacency of the different paragraphs in which a term appears, within the same structural position of a document, affects the document representation. To solve this problem, the invention provides a method and a device for obtaining term paragraph association weights.
(II) Technical scheme
In order to achieve the above object, the present invention provides a method for obtaining term paragraph association weights, comprising the steps of:
A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, acquiring the number of terms in any paragraph in the document structure position corresponding to a document structure position number and the total weight of all terms in that paragraph;
the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located;
and A2, acquiring the paragraph association weight of any term in the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the document structure position number and the total weight of all terms in that paragraph.
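As a minimal sketch of step A1, assuming each input record is a tuple `(word_id, word_weight, section_id, parag_id)` (a layout chosen here for illustration, as is the function name `aggregate_paragraphs`), the per-paragraph term count and total weight can be accumulated in a single pass:

```python
from collections import defaultdict

def aggregate_paragraphs(records):
    """Step A1 sketch: per (section_id, parag_id), count terms and sum weights.

    `records` is an iterable of (word_id, word_weight, section_id, parag_id)
    tuples; this layout is an assumption made for illustration.
    """
    stats = defaultdict(lambda: [0, 0.0])  # key -> [term count, total weight]
    for _word_id, weight, section_id, parag_id in records:
        entry = stats[(section_id, parag_id)]
        entry[0] += 1
        entry[1] += weight
    return {key: (count, total) for key, (count, total) in stats.items()}

# Hypothetical records: two terms in paragraph 1, one term in paragraph 2.
records = [(101, 4.0, 1, 1), (102, 6.0, 1, 1), (101, 3.0, 1, 2)]
paragraph_stats = aggregate_paragraphs(records)
```

Keying the result by the pair `(section_id, parag_id)` keeps paragraphs of different document structure positions separate, which the later ordering step relies on.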
Preferably, the step A2 includes:
A2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the document structure position number, based on the number of terms in the paragraph and the total weight of all terms in the paragraph;
wherein the first numerical value is: the average of the weights of all terms in the paragraph;
A2-2, acquiring a first order of the paragraphs in the document structure position based on the first numerical value of each paragraph in the document structure position;
wherein the first order is: the paragraphs in the document structure position arranged by first numerical value from high to low;
A2-3, for the document structure position corresponding to the document structure position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value;
wherein the first association weight of the paragraph is the preset initial value;
A2-4, acquiring the paragraph association weight of any term in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position and the first order of the paragraphs in the document structure position.
Preferably, the step A2-4 includes:
A2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the document structure position number, based on the first numerical value of the paragraph and the first order of the paragraphs in the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding it in the first order;
A2-4-2, acquiring a second absolute value of any paragraph in the document structure position, based on the number of the paragraph and the first order of the paragraphs in the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: the value of 2^n for the paragraph and each paragraph preceding it in the first order;
wherein n is the absolute value of the difference between the number of the paragraph and the number of the respective paragraph preceding it in the first order;
A2-4-3, acquiring a third absolute value of any paragraph in the document structure position, based on the first absolute value and the second absolute value of the paragraph;
wherein the third absolute value comprises: the quotient of the first absolute value and the second absolute value for the paragraph and each paragraph preceding it in the first order;
A2-4-4, acquiring a fourth average value of any paragraph in the document structure position, based on the third absolute value of the paragraph;
wherein the fourth average value is: the average of the third absolute values for the paragraph and all paragraphs preceding it in the first order;
A2-4-5, determining the paragraph association weight of the term based on the fourth average value and the first numerical value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph in that document structure position where the term is located.
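The quantities in steps A2-4-1 to A2-4-4 can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the original text: v_p is the first numerical value of paragraph p, i_p is its paragraph number, and B(p) is the set of paragraphs that precede p in the first order.

```latex
\begin{aligned}
d_{p,q} &= \lvert v_p - v_q \rvert, \quad q \in B(p) && \text{(first absolute value)} \\
s_{p,q} &= 2^{\lvert i_p - i_q \rvert} && \text{(second absolute value)} \\
t_{p,q} &= \frac{d_{p,q}}{s_{p,q}} && \text{(third absolute value)} \\
m_p &= \frac{1}{\lvert B(p) \rvert} \sum_{q \in B(p)} t_{p,q} && \text{(fourth average value)}
\end{aligned}
```

Under this reading, the paragraph ranked first in the first order has an empty B(p), so no fourth average value is defined for it.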
Preferably, the step A2-4-5 comprises:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the document structure position number, based on the fourth average value and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph;
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold;
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first numerical value of the paragraph;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located;
A2-4-5-4, acquiring the paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position and the number of terms in the paragraph.
Preferably, the step A2-4-5-2 comprises:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result.
Preferably, the step A2-4-5-2-2 comprises:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold;
and if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is the same as the second association weight of the paragraph.
Preferably, the preset threshold is 2.
Preferably, the step A2-4-5-4 comprises:
A2-4-5-4-1, obtaining the total value of all fourth association weights of any term based on the fourth association weight of the term in each paragraph in the document structure position corresponding to the document structure position number;
A2-4-5-4-2, acquiring the number of occurrences of any term in the preset plurality of terms based on the preset terms, the numbers of the document structure positions where the terms are located, and the numbers of the paragraphs in those document structure positions;
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term;
wherein the paragraph association weight is the average of all fourth association weights of the term.
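A minimal sketch of the averaging in step A2-4-5-4, assuming the fourth association weight has already been computed for each paragraph and that each term occurrence is a `(word_id, section_id, parag_id)` tuple. Both layouts and the function name are hypothetical:

```python
from collections import defaultdict

def term_paragraph_weights(occurrences, fourth_weight):
    """Average the fourth association weights over each term's occurrences.

    occurrences: iterable of (word_id, section_id, parag_id) tuples.
    fourth_weight: dict mapping (section_id, parag_id) to the fourth
    association weight of terms in that paragraph.
    Both layouts are assumptions made for illustration.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word_id, section_id, parag_id in occurrences:
        totals[word_id] += fourth_weight[(section_id, parag_id)]
        counts[word_id] += 1
    return {word_id: totals[word_id] / counts[word_id] for word_id in totals}

# Hypothetical data: term 7 occurs in two paragraphs, term 8 in one.
fourth_weight = {(1, 1): 6.0, (1, 2): 4.0}
occurrences = [(7, 1, 1), (7, 1, 2), (8, 1, 2)]
weights = term_paragraph_weights(occurrences, fourth_weight)
```

Here term 7 receives the mean of 6.0 and 4.0, while term 8 keeps the single value 4.0.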
Preferably, the preset initial value is 1.
An apparatus for obtaining term paragraph association weights, the apparatus for obtaining term paragraph association weights storing computer instructions; the computer instructions cause the apparatus for obtaining term paragraph association weights to perform the method for obtaining term paragraph association weights as described in any one of the above.
(III) advantageous effects
The beneficial effects of the invention are as follows: when characterizing the document topic, the method takes into account the adjacency between a paragraph and paragraphs with a high average term weight, raises the paragraph association weight of terms in those neighboring paragraphs, and thereby promotes and highlights terms located near the important paragraphs of a document structure position.
The invention also considers the differing influence of multiple paragraphs at different adjacency distances within the same document structure position, reflecting the combined effect of those paragraphs.
The invention averages the paragraph association weights of the same term appearing at different document structure positions, and thus takes into account how the same term at different document structure positions contributes differently to characterizing the document topic.
Drawings
FIG. 1 is a flow chart of a method for obtaining term paragraph association weights according to the present invention;
FIG. 2 is a schematic diagram illustrating a method for obtaining term paragraph association weights according to a second embodiment of the present invention.
Detailed Description
For the purpose of better explaining the invention and to facilitate understanding, the invention is described in detail below by way of specific embodiments with reference to the accompanying drawings.
First embodiment
Referring to fig. 1, the method for obtaining term paragraph association weights in the first embodiment includes the steps of:
A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, acquiring the number of terms in any paragraph in the document structure position corresponding to a document structure position number and the total weight of all terms in that paragraph.
The number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
A2, acquiring the paragraph association weight of any term in the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the document structure position number and the total weight of all terms in that paragraph.
In this embodiment, the subject terms of the document may be extracted according to the paragraph association weights of the preset terms. The subject terms of the document are the n terms with the highest paragraph association weights.
In this embodiment, the document may first be processed with the existing TF-IDF algorithm to obtain its important terms; the paragraph association weights of these terms are then obtained by the method of this embodiment; finally, the subject terms of the document are extracted as the n terms with the highest paragraph association weights.
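The final selection of the n highest-weighted terms can be sketched as follows. The term strings and weights are hypothetical placeholders, and the function name `top_subject_terms` is an assumption used only for illustration:

```python
import heapq

def top_subject_terms(paragraph_weights, n):
    """Return the n terms with the highest paragraph association weight."""
    return heapq.nlargest(n, paragraph_weights, key=paragraph_weights.get)

# Hypothetical paragraph association weights keyed by term.
paragraph_weights = {"alloy": 7.2, "furnace": 4.1, "slag": 6.3, "the": 0.4}
subject_terms = top_subject_terms(paragraph_weights, 2)
```

`heapq.nlargest` returns the selected terms in descending order of weight, so the result can be used directly as a ranked list of subject terms.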
Preferably, in this embodiment, the step A2 includes:
A2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the document structure position number, based on the number of terms in the paragraph and the total weight of all terms in the paragraph.
Wherein the first numerical value is: the average of the weights of all terms in the paragraph.
A2-2, acquiring a first order of the paragraphs in the document structure position based on the first numerical value of each paragraph in the document structure position.
Wherein the first order is: the paragraphs in the document structure position arranged by first numerical value from high to low.
A2-3, for the document structure position corresponding to the document structure position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value.
Wherein the first association weight of the paragraph is the preset initial value.
For example, if the results section of a document has 5 paragraphs, the paragraph numbers and average weights (original paragraph sequence) are:
Paragraph 1: 5.6
Paragraph 2: 3.2
Paragraph 3: 8.8
Paragraph 4: 1.2
Paragraph 5: 6.6
The paragraphs sorted by average weight, with their initial association weights (ordered paragraph sequence), are:
Paragraph 3: 8.8, 1 (initial value)
Paragraph 5: 6.6, 1 (initial value)
Paragraph 1: 5.6, 1 (initial value)
Paragraph 2: 3.2, 1 (initial value)
Paragraph 4: 1.2, 1 (initial value)
Paragraphs 1, 3 and 5 in the original paragraph sequence have higher average weights, which indicates that these paragraphs contain more terms with high original weights; such paragraphs are more important for characterizing the document. Given the contextual coherence of natural language, a paragraph with a low average weight whose neighboring paragraphs have higher average weights, such as paragraph 2 or paragraph 4, is not adequately represented by its own average weight, which does not reflect this contextual relevance, so its weight needs to be raised appropriately. For instance, paragraph 4 has an average weight of 1.2 while its neighbors, paragraphs 3 and 5, have higher average weights, which suggests that paragraph 4 contributes to characterizing paragraphs 3 and 5. The raised weight of paragraph 4 should be somewhat higher than its computed original average weight but must not exceed the weights of paragraphs 3 and 5; the raised weights must not change the original ordering, and the relevance between paragraphs weakens as they lie farther apart. In the ordered paragraph sequence, paragraph 3 ranks first with the highest average weight and cannot be raised further; only the second and subsequent paragraphs can be raised, so processing starts with the second paragraph in the order.
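The ordering and initialization for this five-paragraph example can be reproduced with a short sketch. The variable names are assumptions chosen for illustration:

```python
# Average weights from the five-paragraph example, keyed by paragraph number.
first_values = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}
INITIAL_WEIGHT = 1.0  # preset initial value of the first association weight

# First order: paragraph numbers sorted by average weight, highest first.
first_order = sorted(first_values, key=first_values.get, reverse=True)

# Every paragraph starts with the same first association weight.
first_association = {p: INITIAL_WEIGHT for p in first_order}

print(first_order)  # [3, 5, 1, 2, 4]
```

The paragraph numbers are kept alongside the ordering because the later steps need the distance between original paragraph numbers, not between ranks.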
A2-4, acquiring the paragraph association weight of any term in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the document structure position number and the first order of the paragraphs in the document structure position.
Preferably, in this embodiment, the step A2-4 includes:
A2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the document structure position number, based on the first numerical value of the paragraph and the first order of the paragraphs in the document structure position.
Wherein the first absolute value of any paragraph in the document structure position comprises: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding it in the first order.
A2-4-2, acquiring a second absolute value of any paragraph in the document structure position, based on the number of the paragraph and the first order of the paragraphs in the document structure position.
Wherein the second absolute value of any paragraph in the document structure position comprises: the value of 2^n for the paragraph and each paragraph preceding it in the first order.
Where n is the absolute value of the difference between the number of the paragraph and the number of the respective paragraph preceding it in the first order.
The average weight difference is divided by 2 raised to the power of the paragraph number difference. Because paragraphs that are farther apart are less related, the average weight difference between directly adjacent paragraphs is divided by 2, the difference between paragraphs two positions apart is divided by 4, and so on; that is, the difference is divided by 2^n, where n is the absolute value of the paragraph number difference, so that paragraphs farther away have less influence on the current paragraph.
A2-4-3, acquiring a third absolute value of any paragraph in the document structure position, based on the first absolute value and the second absolute value of the paragraph.
Wherein the third absolute value comprises: the quotient of the first absolute value and the second absolute value for the paragraph and each paragraph preceding it in the first order.
A2-4-4, acquiring a fourth average value of any paragraph in the document structure position, based on the third absolute value of the paragraph.
Wherein the fourth average value is: the average of the third absolute values for the paragraph and all paragraphs preceding it in the first order.
A2-4-5, determining the paragraph association weight of the term based on the fourth average value and the first numerical value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph in that document structure position where the term is located.
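Steps A2-4-1 to A2-4-4 can be sketched for a single document structure position as follows. This is one reading of the text, not a definitive implementation; the function name `fourth_averages` and the dictionary layout are assumptions:

```python
def fourth_averages(first_values):
    """Compute the fourth average for every paragraph except the top-ranked one.

    first_values maps paragraph number -> average term weight (first value).
    """
    order = sorted(first_values, key=first_values.get, reverse=True)
    result = {}
    for rank, p in enumerate(order):
        if rank == 0:
            continue  # the top-ranked paragraph has no predecessors
        thirds = []
        for q in order[:rank]:
            first_abs = abs(first_values[p] - first_values[q])  # A2-4-1
            second_abs = 2 ** abs(p - q)                        # A2-4-2
            thirds.append(first_abs / second_abs)               # A2-4-3
        result[p] = sum(thirds) / len(thirds)                   # A2-4-4
    return result

# Average weights from the five-paragraph example above.
averages = fourth_averages({1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6})
```

For the five-paragraph example this gives approximately 0.55 for paragraph 5, 0.431 for paragraph 1, 1.475 for paragraph 2 and 1.888 for paragraph 4; paragraph 3 ranks first and receives no value.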
Preferably, in this embodiment, the step A2-4-5 includes:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the document structure position number, based on the fourth average value and the first numerical value of the paragraph.
Wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph.
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold.
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first numerical value of the paragraph.
Wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located.
A2-4-5-4, acquiring the paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position and the number of terms in the paragraph.
Preferably, in this embodiment, the step A2-4-5-2 includes:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result.
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result.
Preferably, in this embodiment, the step A2-4-5-2-2 includes:
If the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold.
If the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is the same as the second association weight of the paragraph.
In this embodiment, the preset threshold is preferably 2.
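Steps A2-4-5-1 to A2-4-5-3 reduce to a few arithmetic operations per paragraph. The sketch below uses the preferred threshold of 2 and initial value of 1; the function name and the sample inputs are hypothetical. The text does not state the case where the second association weight equals the threshold, but both branches yield the same value there, so `min` is used as an assumption-free shortcut:

```python
THRESHOLD = 2.0       # preset threshold (preferred value in this embodiment)
INITIAL_WEIGHT = 1.0  # preset initial value of the first association weight

def paragraph_weights(first_value, fourth_average):
    """Return (second, third, fourth) association weights for one paragraph."""
    second = fourth_average / first_value + INITIAL_WEIGHT  # A2-4-5-1
    third = min(second, THRESHOLD)                          # A2-4-5-2
    fourth = third * first_value                            # A2-4-5-3
    return second, third, fourth

# Hypothetical inputs: a paragraph whose boost stays below the threshold,
# and one whose boost is capped at the threshold.
uncapped = paragraph_weights(first_value=4.0, fourth_average=1.0)
capped = paragraph_weights(first_value=1.0, fourth_average=3.0)
```

With a threshold of 2, the fourth association weight can at most double the paragraph's original average weight.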
Preferably, in this embodiment, the step A2-4-5-4 includes:
A2-4-5-4-1, obtaining the total value of all fourth association weights of any term based on the fourth association weight of the term in each paragraph in the document structure position corresponding to the document structure position number.
A2-4-5-4-2, acquiring the number of occurrences of any term in the preset plurality of terms based on the preset terms, the numbers of the document structure positions where the terms are located, and the numbers of the paragraphs in those document structure positions.
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term.
Wherein the paragraph association weight is the average of all fourth association weights of the term.
In this embodiment, the preset initial value is preferably 1.
In this embodiment, when characterizing the document topic, the adjacency between a paragraph and paragraphs with a high average term weight is taken into account, the paragraph association weight of terms in those neighboring paragraphs is raised, and terms located near the important paragraphs of a document structure position are thereby promoted and highlighted.
Detailed description of the invention
To better explain the present invention, referring to fig. 2, the term document paragraph position table used in this embodiment, which is input into the computer in advance, is explained first.
In this embodiment, the term document paragraph position table words_list input for a specific document is a database table containing all terms extracted from that document together with their document paragraph position information. Each term with a specific number may have multiple records in the table, in different paragraphs of the same structure position of the document or in different sentences of the same paragraph. The field definitions are shown in table 1.
Table 1 Definition of the term document paragraph position table
Field name | Meaning | Type | Description |
---|---|---|---|
word_id | Term number | INTEGER | Unique number of a specific term |
word_weight | Basic weight of the term | DECIMAL | Basic weight of a specific term |
section_id | Document structure number | INTEGER | Number of the specific structure position of the document where the term is located |
parag_id | Document paragraph number | INTEGER | Number of the specific paragraph of the document where the term is located |
In table 1, word_weight is the basic weight of the term obtained by some method, for example the patented method "A method and apparatus for obtaining hierarchical weight of domain document lexical item"; section_id is the number of the specific structure position of the document where the paragraph is located, and one or more paragraphs may exist in the same section_id position; parag_id is the number of the specific paragraph of the document where the term is located, the numbers of adjacent paragraphs are consecutive, and one or more terms may exist in the same parag_id paragraph.
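For illustration, the rows of the table can be modelled as simple records. This is only a sketch: the class name `TermRecord` and the sample values are hypothetical, while the field names follow table 1.

```python
from typing import NamedTuple

class TermRecord(NamedTuple):
    word_id: int        # unique number of the term
    word_weight: float  # basic weight of the term
    section_id: int     # number of the document structure position
    parag_id: int       # number of the paragraph; adjacent paragraphs are consecutive

# Hypothetical sample: term 1 occurs in two paragraphs of structure position 1,
# so it has two records.
words_list = [
    TermRecord(word_id=1, word_weight=0.9, section_id=1, parag_id=1),
    TermRecord(word_id=2, word_weight=0.5, section_id=1, parag_id=1),
    TermRecord(word_id=1, word_weight=0.9, section_id=1, parag_id=2),
    TermRecord(word_id=3, word_weight=0.2, section_id=1, parag_id=2),
    TermRecord(word_id=2, word_weight=0.5, section_id=2, parag_id=3),
]
```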
In this embodiment, after the term document paragraph position table of a specific document is input, referring to fig. 1, the term paragraph association weights are obtained according to the method for obtaining term paragraph association weights. The specific steps include:
B1, acquiring the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total weight of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in the document structure positions where the terms are located and the weights of the terms;
wherein the number of the paragraph corresponds to the order of the paragraph in the document structure position where the paragraph is located.
In a specific application of this embodiment, the specific steps are as follows:
(1-1) Enter system initialization and define a database operation statement execution function sql_execute. The input parameter of sql_execute is a text sql, which is a database operation statement conforming to the SQL-92 standard. The function calls a database system function to execute the text sql; the execution result is a table in the database or a change to the data in a table, and the function does not output the result directly. Then enter (1-2).
(1-2) Set the text sql to: SELECT section_id, parag_id, SUM(word_weight)/COUNT(word_weight) AS avg_weights, 1.0 AS parag_weight INTO parags_weights FROM words_list GROUP BY section_id, parag_id ORDER BY section_id, avg_weights DESC. By calling sql_execute, term weights are accumulated and terms are counted paragraph by paragraph; the accumulated term weight of each paragraph is divided by its number of terms to obtain the paragraph term average weight; the initial value of the paragraph association weight is set to 1.0; and the rows are sorted by ascending document structure position number and descending paragraph term average weight. The result is stored in the paragraph weight table parags_weights, which contains the document structure number section_id, the document paragraph number parag_id, the paragraph term average weight avg_weights and the paragraph association weight parag_weight. Then enter (1-3).
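The grouping in step (1-2) can be sketched without a database. The Python function below is an illustrative equivalent, not the patent's implementation; the function name and the tuple layout of the input are assumptions.

```python
from collections import defaultdict

def build_parags_weights(records):
    """records: iterable of (word_id, word_weight, section_id, parag_id).

    Returns one row per paragraph with its term average weight and an initial
    association weight of 1.0, ordered by ascending section_id and then by
    descending average weight.
    """
    groups = defaultdict(list)
    for _word_id, word_weight, section_id, parag_id in records:
        groups[(section_id, parag_id)].append(word_weight)

    rows = [
        {
            "section_id": section_id,
            "parag_id": parag_id,
            "avg_weights": sum(weights) / len(weights),  # SUM / COUNT
            "parag_weight": 1.0,                         # preset initial value
        }
        for (section_id, parag_id), weights in groups.items()
    ]
    rows.sort(key=lambda r: (r["section_id"], -r["avg_weights"]))
    return rows
```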
B2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total weight of all terms in the paragraph.
In a specific application of this embodiment, the specific steps include:
(1-3) For each record of the paragraph weight table parags_weights, calculate the paragraph association weight from the level of the term average weight of each paragraph in the same document structure position and from the adjacency between paragraphs. Set the current document structure position section to 0, read the first record of the paragraph weight table parags_weights as the current record, and enter (1-3-1).
(1-3-1) Read the document structure number section_id, paragraph number parag_id and paragraph term average weight avg_weights of the current record, and enter (1-3-2).
(1-3-2) Judge whether the current document structure position section is equal to the document structure number section_id; if so, enter (1-3-3); otherwise, enter (1-3-8).
(1-3-3) Set the text sql to: SELECT parag_id, (avg_weights - {avg}) / POWER(2, ABS(parag_id - #)) AS weights INTO temp FROM parags_weights WHERE section_id = % AND avg_weights > {avg}, where the placeholder for the average weight is written here as {avg}. Convert the term average weight avg_weights of the current paragraph, the paragraph number parag_id and the document structure number section_id into character strings and substitute them for the placeholders {avg}, # and % in the text sql, respectively. By calling the function sql_execute, all paragraphs within the current document structure position whose term average weight is higher than that of the current paragraph are obtained; for each such paragraph, the difference between its term average weight and that of the current paragraph and the absolute value of the difference between its number and the number of the current paragraph are calculated; the weight difference is divided by 2 raised to the power of that absolute value to obtain the accumulated weight that the current paragraph receives from that paragraph; and the results are written into the temporary table temp. Enter (1-3-4).
(1-3-4) Set the text sql to: SELECT SUM(weights)/COUNT(weights)/{avg} + 1 FROM temp. Convert the term average weight avg_weights of the current paragraph into a character string and substitute it for the placeholder {avg} in the text sql. By calling the function sql_execute, the accumulated weights in the temporary table temp are summed and divided by the number of paragraphs to obtain their average, which is then divided by the term average weight of the current paragraph and increased by 1 to obtain the initial association weight of the current paragraph. Enter (1-3-5).
(1-3-5) Judge whether the initial association weight of the current paragraph is greater than 2; if so, change it to 2; otherwise leave it unchanged. Then enter (1-3-6).
(1-3-6) Set the text sql to: UPDATE parags_weights SET parag_weight = #, avg_weights = avg_weights * # WHERE parag_id = %. Convert the initial association weight of the current paragraph and the paragraph number parag_id into character strings and substitute them for # and % in the text sql, respectively. By calling the function sql_execute, the association weight and the term average weight of the current paragraph are updated. Then enter (1-3-7).
(1-3-7) Set the text sql to: DROP TABLE temp. By calling the function sql_execute, the temporary table temp is deleted. Enter (1-3-8).
(1-3-8) Judge whether the current record is the last record of the paragraph weight table parags_weights; if so, enter (1-4); otherwise, read the next record as the current record and enter (1-3-1).
(1-4) Set the text sql to: SELECT word_id, parag_weight INTO temp FROM words_list, parags_weights WHERE words_list.parag_id = parags_weights.parag_id. By calling the function sql_execute, the association weight of each paragraph is assigned, according to the paragraph number, to all terms of that paragraph, and the result is written into the term paragraph weight association table temp. Then enter (1-5).
(1-5) Set the text sql to: SELECT word_id, SUM(parag_weight)/COUNT(parag_weight) AS re_weight INTO words_weights FROM temp GROUP BY word_id. By calling the function sql_execute, the paragraph association weights of each term in the different paragraphs of the document are accumulated from the term paragraph weight association table and divided by the frequency of the term in the document to obtain the term paragraph association weight; the result is written into the term document paragraph association weight table words_weights. Then enter (1-6).
(1-6) Output the term document paragraph association weight table words_weights.
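The loop of steps (1-3) to (1-3-8) can be sketched in Python as follows. This is an illustrative reading, not the patent's implementation, and it rests on several assumptions: paragraph averages are positive; parag_id values are unique across the document; a paragraph with no higher-weighted paragraph in its structure position keeps the initial weight 1.0; and, following the claim wording, the original paragraph averages are used throughout, whereas the in-place update of avg_weights in step (1-3-6) might behave differently in a real database run. The function name is hypothetical.

```python
from collections import defaultdict

def paragraph_association_weights(parags, threshold=2.0, initial=1.0):
    """parags: iterable of (section_id, parag_id, avg_weight) rows.

    Returns {parag_id: paragraph association weight}.
    """
    by_section = defaultdict(list)
    for section_id, parag_id, avg in parags:
        by_section[section_id].append((parag_id, avg))

    weights = {}
    for members in by_section.values():
        for parag_id, avg in members:
            # Step (1-3-3): weight received from every paragraph of the same
            # structure position with a higher average, halved per paragraph
            # of distance.
            gains = [
                (other_avg - avg) / 2 ** abs(other_id - parag_id)
                for other_id, other_avg in members
                if other_avg > avg
            ]
            if not gains:
                weights[parag_id] = initial  # assumption: nothing to inherit
                continue
            # Step (1-3-4): mean gain relative to the paragraph's own average,
            # plus the initial value.
            weight = sum(gains) / len(gains) / avg + initial
            # Step (1-3-5): cap at the preset threshold.
            weights[parag_id] = min(weight, threshold)
    return weights
```

For three paragraphs numbered 1, 2 and 3 in one structure position with averages 1.0, 0.5 and 0.25, this sketch gives association weights 1.0, 1.5 and 1.625: the third paragraph receives 0.75/4 from the first and 0.25/2 from the second, and the mean 0.15625 divided by 0.25 is 0.625.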
In this embodiment, when characterizing the document theme, the neighbor relation between a paragraph and the paragraphs with a high average term weight is taken into account: the paragraph association weight of terms in paragraphs adjacent to such paragraphs is raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
In the second embodiment, a document is first processed with the existing TF-IDF algorithm to obtain the important terms in the document; then the term document paragraph association weight table words_weights of these terms is obtained according to the method for obtaining term paragraph association weights; finally, the subject words of the document are extracted, the subject words being the n terms with the highest paragraph association weights.
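The final selection can be sketched as a top-n lookup over the output table. The function name and the dictionary layout are illustrative assumptions; the patent does not specify how ties are broken.

```python
import heapq

def top_subject_words(words_weights, n):
    """words_weights: {word_id: term paragraph association weight}.

    Returns the n word_ids with the highest weights, highest first.
    """
    return heapq.nlargest(n, words_weights, key=words_weights.get)
```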
This embodiment considers, within the same document structure position, both the differences in influence level of multiple important paragraphs and their adjacency distances, and thus reflects the joint effect of multiple paragraphs;
this embodiment averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively considers the differences in how the same term at different document structure positions characterizes the document theme;
the method of this embodiment is suitable for calculating the weights of all terms for which the differing neighbor relations of paragraphs need to be highlighted in document characterization.
The foregoing describes the technical principles of the present invention in conjunction with specific embodiments, which are provided for the purpose of illustrating the principles of the present invention and are not to be construed as limiting the scope of the present invention in any way. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive efforts, which shall fall within the scope of the present invention.
Claims (3)
1. A method for obtaining term paragraph association weights, comprising the steps of:
A1, acquiring the number of terms in any paragraph in a document structure position corresponding to the number of the document structure position and the total weight of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in the document structure positions where the terms are located and the weights of the terms;
the number of the paragraph corresponds to the order of the paragraph in the document structure position where the paragraph is located;
A2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total weight of all terms in the paragraph;
the step A2 comprises the following steps:
A2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total weight of all terms in the paragraph;
wherein the first numerical value is: the average of the weights of all terms in the paragraph;
A2-2, acquiring a first order of the paragraphs in the document structure position corresponding to the number of the document structure position based on the first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the first order is: the paragraphs in the document structure position corresponding to the number of the document structure position arranged from high to low by their first numerical values;
A2-3, for the document structure position corresponding to the number of the document structure position, determining a first association weight of any paragraph in the document structure position according to a preset initial value;
wherein the first association weight of the paragraph is the preset initial value;
A2-4, acquiring the paragraph association weight of any term among the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first order of the paragraphs in the document structure position corresponding to the number of the document structure position;
the step A2-4 comprises the following steps:
A2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and the first order of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: the absolute values of the differences between the first numerical value of said paragraph and the first numerical values of the paragraphs preceding said paragraph in the first order;
A2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first order of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: the value $2^{n}$ corresponding to said paragraph and the first paragraph;
the first paragraph is any paragraph preceding the paragraph in the first order;
wherein n is the absolute value of the difference between the number of the paragraph in first order and the number of the first paragraph in first order;
A2-4-3, acquiring a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the third absolute value comprises: the quotients of the first absolute value divided by the second absolute value of said paragraph with respect to each paragraph preceding said paragraph in the first order;
A2-4-4, acquiring a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on the third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the fourth average value is: the average of the third absolute values of said paragraph with respect to all paragraphs preceding said paragraph in the first order;
A2-4-5, determining the paragraph association weights of the terms based on the fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph, the numbers of the document structure positions where the terms are located and the numbers of the paragraphs in the document structure positions where the terms are located;
the step A2-4-5 comprises the following steps:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on the fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the sum of the quotient of the fourth average value of the paragraph divided by the first numerical value of the paragraph and the first association weight of the paragraph;
A2-4-5-2, determining a third association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on the second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold;
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located;
A2-4-5-4, acquiring the paragraph association weights of the terms based on the fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position;
the step A2-4-5-2 comprises the following steps:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result;
the step A2-4-5-2-2 comprises the following steps:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold;
if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is the same as the second association weight of the paragraph;
the step A2-4-5-4 comprises the following steps:
A2-4-5-4-1, acquiring the total value of all fourth association weights of any term based on the fourth association weight of that term in any paragraph in the document structure position corresponding to the number of the document structure position;
A2-4-5-4-2, acquiring the number of occurrences of any term among the plurality of preset terms based on the preset terms, the number of the document structure position where each term is located and the number of the paragraph in the document structure position where each term is located;
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of that term and the number of occurrences of that term among the plurality of terms;
wherein the paragraph association weight is the average of all fourth association weights of the term.
2. The method of claim 1, wherein the predetermined threshold is 2.
3. The method of claim 1, wherein the predetermined initial value is 1.
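Read together, the steps of claim 1 can be summarized in formulas. The symbols below are notation introduced here for illustration and do not appear in the patent: $v_p$ is the first numerical value of paragraph $p$, $H_p$ is the set of paragraphs preceding $p$ in the first order, $\mathrm{id}(p)$ is taken to be the paragraph number as in the embodiment's SQL, $T$ is the preset threshold (2 per claim 2), $w^{(1)}_p$ is the preset initial value (1 per claim 3), and $N_t$ is the number of records of term $t$.

```latex
\begin{aligned}
d_{p,q} &= \frac{\lvert v_q - v_p \rvert}{2^{\lvert \mathrm{id}(q) - \mathrm{id}(p) \rvert}}, \quad q \in H_p
  && \text{(third absolute value)} \\
m_p &= \frac{1}{\lvert H_p \rvert} \sum_{q \in H_p} d_{p,q}
  && \text{(fourth average value)} \\
w^{(2)}_p &= \frac{m_p}{v_p} + w^{(1)}_p
  && \text{(second association weight)} \\
w^{(3)}_p &= \min\bigl(w^{(2)}_p,\; T\bigr)
  && \text{(third association weight)} \\
w^{(4)}_{t,p} &= w^{(3)}_p \, v_p
  && \text{(fourth association weight of term } t \text{ in } p\text{)} \\
W_t &= \frac{1}{N_t} \sum_{p \,\ni\, t} w^{(4)}_{t,p}
  && \text{(paragraph association weight of term } t\text{)}
\end{aligned}
```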
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010274876.0A CN111611342B (en) | 2020-04-09 | 2020-04-09 | Method and device for obtaining lexical item and paragraph association weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611342A CN111611342A (en) | 2020-09-01 |
CN111611342B true CN111611342B (en) | 2023-04-18 |
Family
ID=72201801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010274876.0A Active CN111611342B (en) | 2020-04-09 | 2020-04-09 | Method and device for obtaining lexical item and paragraph association weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611342B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
US9201876B1 (en) * | 2012-05-29 | 2015-12-01 | Google Inc. | Contextual weighting of words in a word grouping |
CN105426379A (en) * | 2014-10-22 | 2016-03-23 | 武汉理工大学 | Keyword weight calculation method based on position of word |
CN105760474A (en) * | 2016-02-14 | 2016-07-13 | Tcl集团股份有限公司 | Document collection feature word extracting method and system based on position information |
CN106845265A (en) * | 2016-12-01 | 2017-06-13 | 北京计算机技术及应用研究所 | A kind of document security level automatic identifying method |
WO2018121145A1 (en) * | 2016-12-30 | 2018-07-05 | 北京国双科技有限公司 | Method and device for vectorizing paragraph |
CN109766408A (en) * | 2018-12-04 | 2019-05-17 | 上海大学 | The text key word weighing computation method of comprehensive word positional factor and word frequency factor |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI434187B (en) * | 2010-11-03 | 2014-04-11 | Inst Information Industry | Text conversion method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111611342A (en) | 2020-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107423444B (en) | Hot word phrase extraction method and system | |
WO2020192401A1 (en) | System and method for generating answer based on clustering and sentence similarity | |
US10019515B2 (en) | Attribute-based contexts for sentiment-topic pairs | |
Pranckevičius et al. | Application of logistic regression with part-of-the-speech tagging for multi-class text classification | |
CN107463548B (en) | Phrase mining method and device | |
Usman et al. | Urdu text classification using majority voting | |
CN108363694B (en) | Keyword extraction method and device | |
US11144723B2 (en) | Method, device, and program for text classification | |
CN107885717B (en) | Keyword extraction method and device | |
CN107729337B (en) | Event monitoring method and device | |
CN106844482B (en) | Search engine-based retrieval information matching method and device | |
KR101638535B1 (en) | Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same | |
CN106649308B (en) | Word segmentation and word library updating method and system | |
JP2019200784A (en) | Analysis method, analysis device and analysis program | |
US10474700B2 (en) | Robust stream filtering based on reference document | |
CN112395881B (en) | Material label construction method and device, readable storage medium and electronic equipment | |
CN111611342B (en) | Method and device for obtaining lexical item and paragraph association weight | |
CN107665222B (en) | Keyword expansion method and device | |
Lemnitzer et al. | Combining a rule-based approach and machine learning in a good-example extraction task for the purpose of lexicographic work on contemporary standard German | |
CN104850609B (en) | A kind of filter method for rising space class keywords | |
CN113052544A (en) | Method and device for intelligently adapting workflow according to user behavior and storage medium | |
CN106777191B (en) | Search engine-based retrieval mode generation method and device | |
US20180005300A1 (en) | Information presentation device, information presentation method, and computer program product | |
CN111079425B (en) | Geological document term grading method and device | |
CN106649367B (en) | Method and device for detecting keyword popularization degree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | Inventor after: Deng Jiqiu, Lu Biyu, Liu Wenyi, Li Chenhan, He Meixiang; Inventor before: Deng Jiqiu, Lu Biyu, Li Chenhan |
GR01 | Patent grant | ||