CN111611342B - Method and device for obtaining lexical item and paragraph association weight - Google Patents

Publication number
CN111611342B
Authority
CN
China
Prior art keywords
paragraph
document structure
structure position
value
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010274876.0A
Other languages
Chinese (zh)
Other versions
CN111611342A (en)
Inventor
邓吉秋
路馥毓
刘文毅
李晨菡
何美香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010274876.0A
Publication of CN111611342A
Application granted
Publication of CN111611342B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for obtaining term paragraph association weights. The method comprises the following steps: A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, obtaining, for any paragraph in the document structure position identified by a given position number, the number of terms in the paragraph and the total weight of all terms in the paragraph, where the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located; and A2, obtaining the paragraph association weight of any of the preset terms based on the number of terms in each such paragraph and the total weight of all terms in the paragraph.

Description

Method and device for obtaining lexical item and paragraph association weight
Technical Field
The invention relates to the technical field of document extraction, in particular to a method and a device for obtaining term paragraph association weight.
Background
Most current Chinese text classification systems use words as feature items, called feature words. Feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and a user's target. Generally, a score is computed for every feature with a feature evaluation function, the features are ranked by score, and the highest-scoring words are selected as feature words.
The most common and effective text representation method is to build a term-document matrix. Each element of the matrix is the weight of the term in its row with respect to the document in its column, that is, how important the term is to the document. The importance of a term to a document is reflected in two ways: the more often the term appears in the document, the more important it is to that document; and the more often the term appears across the entire corpus, the less informative, and therefore the less important, it is to that document. This is the idea behind the TF-IDF algorithm. Keyword extraction based on TextRank is another method, and it can extract keywords from a single document. The TextRank keyword extraction task is to automatically extract several meaningful words or phrases from a given text; the algorithm ranks candidate keywords using the relationships between local words (a co-occurrence window) and extracts the keywords directly from the text.
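The two-sided idea behind TF-IDF described above can be illustrated with a minimal sketch. It assumes one common variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the patent text does not say which variant is used, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample corpus of three tokenized documents.
docs = [["ore", "deposit", "ore"], ["ore", "survey"], ["ore", "map"]]
w = tf_idf(docs)
```

In this sample, "ore" occurs in every document, so its inverse document frequency is ln(1) = 0 and its weight is zero everywhere, while "deposit" keeps a positive weight in the first document.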
The same term may appear in different paragraphs within the same structural position of a document, and its effect on characterizing the document theme may differ accordingly. For example, paragraph 1 and paragraph 2 of a section generally follow a continuous line of thought, and the terms in paragraph 1 are necessarily related in some way to the terms in paragraph 2 (through repeated occurrences of a term, through latent semantic similarity, or through a logical relation such as cause and effect or sequential exposition). A general term-document matrix characterizes the document theme purely by the number of times each term occurs: TF-IDF treats terms that occur frequently in a specific document but rarely in other documents as theme words, and so tends to filter out common terms and keep important ones. The TextRank algorithm ranks candidate keywords using the relationships between local words (a co-occurrence window) and considers only the co-occurrence of locally adjacent words. Neither of these two common methods considers how differences in the paragraph adjacency of terms within the same structural position of a document affect the document representation.
Disclosure of Invention
(I) technical problem to be solved
To solve the problem that the prior art does not consider how the paragraph adjacency of terms within the same structural position of a document affects the document representation, the invention provides a method and a device for obtaining term paragraph association weights.
(II) technical scheme
In order to achieve the above object, the present invention provides a method for obtaining term paragraph association weights, comprising the steps of:
A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, obtaining the number of terms in any paragraph of the document structure position identified by a given position number and the total weight of all terms in that paragraph;
wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located;
A2, obtaining the paragraph association weight of any of the preset terms based on the number of terms in any paragraph of the document structure position identified by the position number and the total weight of all terms in that paragraph.
Preferably, the step A2 includes:
A2-1, obtaining a first value of any paragraph in the document structure position identified by the position number, based on the number of terms in the paragraph and the total weight of all terms in the paragraph;
wherein the first value is the average of the weights of all terms in the paragraph;
A2-2, obtaining a first order of the paragraphs in the document structure position based on the first values of the paragraphs;
wherein the first order is the paragraphs of the document structure position arranged by first value from high to low;
A2-3, for the document structure position identified by the position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value;
wherein the first association weight of the paragraph is the preset initial value;
A2-4, obtaining the paragraph association weight of any of the preset terms based on the first value and the first association weight of any paragraph in the document structure position and the first order of the paragraphs in the document structure position.
Preferably, the step A2-4 includes:
A2-4-1, obtaining the first absolute values of any paragraph in the document structure position identified by the position number, based on the first values of the paragraphs and the first order of the paragraphs in the document structure position;
wherein the first absolute values of a paragraph comprise: the absolute value of the difference between the first value of the paragraph and the first value of each paragraph that precedes it in the first order;
A2-4-2, obtaining the second absolute values of any paragraph in the document structure position, based on the numbers of the paragraphs and the first order of the paragraphs in the document structure position;
wherein the second absolute values of a paragraph comprise: the value of $2^n$ for each paragraph that precedes it in the first order;
wherein n is the absolute value of the difference between the number of the paragraph and the number of the preceding paragraph;
A2-4-3, obtaining the third absolute values of any paragraph in the document structure position, based on the first absolute values and the second absolute values of the paragraph;
wherein the third absolute values of a paragraph comprise: for each paragraph that precedes it in the first order, the quotient of the corresponding first absolute value and second absolute value;
A2-4-4, obtaining a fourth average of any paragraph in the document structure position, based on the third absolute values of the paragraph;
wherein the fourth average is the average of the third absolute values of the paragraph taken over all paragraphs that precede it in the first order;
A2-4-5, determining the paragraph association weight of the term based on the fourth average and the first value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph where the term is located.
Preferably, the step A2-4-5 comprises:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position identified by the position number, based on the fourth average of the paragraph and the first value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average of the paragraph and the first value of the paragraph, plus the first association weight of the paragraph;
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold;
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first value of the paragraph;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first value of the paragraph in which the term is located;
A2-4-5-4, obtaining the paragraph association weight of the term based on the fourth association weight of the term in any paragraph in the document structure position and the number of terms in the paragraph identified by each paragraph number in the document structure position.
Preferably, the step A2-4-5-2 comprises:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result.
Preferably, the step A2-4-5-2-2 comprises:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold;
if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is equal to the second association weight of the paragraph.
Preferably, the preset threshold is 2.
Preferably, the step A2-4-5-4 comprises:
A2-4-5-4-1, obtaining the total of all fourth association weights of any term, based on the fourth association weight of the term in each paragraph of the document structure position identified by the position number;
A2-4-5-4-2, obtaining the number of occurrences of any of the preset terms, based on the preset terms, the numbers of the document structure positions where the terms are located, and the numbers of the paragraphs in those document structure positions;
A2-4-5-4-3, determining the paragraph association weight of any term based on the total of all fourth association weights of the term and the number of occurrences of the term;
wherein the paragraph association weight is the average of all fourth association weights of the term.
Preferably, the preset initial value is 1.
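The steps A2-1 to A2-4-5-4 can be summarized compactly in formulas. The symbols below do not appear in the original text and are introduced only for illustration: $N_p$ and $S_p$ are the number of terms and the total term weight of paragraph $p$, $i_p$ is its paragraph number, $B_p$ is the set of paragraphs preceding $p$ in the first order, $a_p^{(1)}$ is the preset initial value, $\theta$ is the preset threshold, and $O_t$ is the set of paragraphs in which the occurrences of term $t$ are recorded (one entry per occurrence).

```latex
\begin{align*}
\bar{w}_p   &= \frac{S_p}{N_p}
            && \text{first value (A2-1)} \\
d_{p,q}     &= \lvert \bar{w}_p - \bar{w}_q \rvert, \quad q \in B_p
            && \text{first absolute value (A2-4-1)} \\
s_{p,q}     &= 2^{\lvert i_p - i_q \rvert}
            && \text{second absolute value (A2-4-2)} \\
t_{p,q}     &= \frac{d_{p,q}}{s_{p,q}}
            && \text{third absolute value (A2-4-3)} \\
m_p         &= \frac{1}{\lvert B_p \rvert} \sum_{q \in B_p} t_{p,q}
            && \text{fourth average (A2-4-4)} \\
a_p^{(2)}   &= \frac{m_p}{\bar{w}_p} + a_p^{(1)}
            && \text{second association weight (A2-4-5-1)} \\
a_p^{(3)}   &= \min\bigl(a_p^{(2)}, \theta\bigr)
            && \text{third association weight (A2-4-5-2)} \\
a_p^{(4)}   &= a_p^{(3)} \, \bar{w}_p
            && \text{fourth association weight (A2-4-5-3)} \\
W_t         &= \frac{1}{\lvert O_t \rvert} \sum_{p \in O_t} a_p^{(4)}
            && \text{paragraph association weight (A2-4-5-4)}
\end{align*}
```

The text states only the cases where the second association weight is greater than or smaller than the threshold; the $\min$ form reproduces both and yields the same value in the equal case.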
An apparatus for obtaining term paragraph association weights, wherein the apparatus stores computer instructions, and the computer instructions cause the apparatus to perform the method for obtaining term paragraph association weights described in any one of the above.
(III) advantageous effects
The beneficial effects of the invention are as follows: when characterizing the document theme, the method considers the adjacency between a paragraph and the paragraphs with a high average term weight, raises the paragraph association weight of the terms in those adjacent paragraphs, and thereby raises and highlights the standing of terms located near the important paragraphs of a document structure position.
The invention simultaneously considers the differing levels of influence of multiple paragraphs and of their adjacency distances within the same document structure position, and reflects the combined effect of multiple paragraphs.
The invention averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively considers how the same term at different document structure positions contributes differently to the document theme representation.
Drawings
FIG. 1 is a flow chart of a method for obtaining term paragraph association weights according to the present invention;
FIG. 2 is a schematic diagram illustrating a method for obtaining term paragraph association weights according to a second embodiment of the present invention.
Detailed Description
To better explain the invention and to facilitate understanding, the invention is described in detail below by way of specific embodiments with reference to the accompanying drawings.
Embodiment 1
Referring to fig. 1, the method for obtaining term paragraph association weights in the first embodiment includes the steps of:
A1, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in those document structure positions, and the weights of the terms, obtaining the number of terms in any paragraph of the document structure position identified by a given position number and the total weight of all terms in that paragraph.
The number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
A2, obtaining the paragraph association weight of any of the preset terms based on the number of terms in any paragraph of the document structure position identified by the position number and the total weight of all terms in that paragraph.
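Step A1 above amounts to a grouped count and sum. The following is a minimal sketch under the assumption that each input record is a tuple of (term, document structure position, paragraph number, term weight); this record layout, the function name, and the sample values are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def paragraph_totals(records):
    """Step A1 sketch: for each (structure position, paragraph number),
    count the term records and sum their weights."""
    totals = defaultdict(lambda: [0, 0.0])
    for term, position, paragraph, weight in records:
        entry = totals[(position, paragraph)]
        entry[0] += 1        # number of terms in the paragraph
        entry[1] += weight   # total weight of all terms in the paragraph
    return {key: (count, total) for key, (count, total) in totals.items()}

# Hypothetical records: (term, structure position, paragraph number, weight).
records = [
    ("ore", "results", 1, 6.0),
    ("fault", "results", 1, 5.2),
    ("ore", "results", 2, 3.2),
]
totals = paragraph_totals(records)
```

Dividing each total by its count then gives the per-paragraph average weight that step A2 works with.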
In this embodiment, the subject words of the document may be extracted according to the paragraph association weights of the preset terms. The subject words of the document are the n terms with the highest paragraph association weights.
In this embodiment, the document may first be processed with the existing TF-IDF algorithm to obtain its important terms; the paragraph association weights of those terms are then obtained by the method of this embodiment; and finally the subject words of the document are extracted, the subject words being the n terms with the highest paragraph association weights.
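Selecting the n terms with the highest paragraph association weights is a plain top-n selection. A minimal sketch follows, assuming the weights are already available as a dictionary; the function name and the sample weights are hypothetical.

```python
import heapq

def subject_words(paragraph_assoc_weight, n):
    """Return the n terms with the highest paragraph association weight,
    ordered from highest to lowest."""
    return heapq.nlargest(n, paragraph_assoc_weight, key=paragraph_assoc_weight.get)

# Hypothetical paragraph association weights for three terms.
weights = {"ore": 5.6, "fault": 2.4, "deposit": 7.15}
top_two = subject_words(weights, 2)
```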
Preferably in this embodiment, the step A2 includes:
A2-1, obtaining a first value of any paragraph in the document structure position identified by the position number, based on the number of terms in the paragraph and the total weight of all terms in the paragraph.
The first value is the average of the weights of all terms in the paragraph.
A2-2, obtaining a first order of the paragraphs in the document structure position based on the first values of the paragraphs.
The first order is the paragraphs of the document structure position arranged by first value from high to low.
A2-3, for the document structure position identified by the position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value.
The first association weight of the paragraph is the preset initial value.
For example, if a results section has 5 paragraphs, the paragraph numbers and average weights are as follows (original paragraph sequence):
Paragraph 1: 5.6
Paragraph 2: 3.2
Paragraph 3: 8.8
Paragraph 4: 1.2
Paragraph 5: 6.6
Sorted by average weight, with the association weights initialized, the paragraphs are as follows (ordered paragraph sequence):
Paragraph 3: 8.8, 1 (initial value)
Paragraph 5: 6.6, 1 (initial value)
Paragraph 1: 5.6, 1 (initial value)
Paragraph 2: 3.2, 1 (initial value)
Paragraph 4: 1.2, 1 (initial value)
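The ordering and initialization in this example can be reproduced with a short sketch. The variable and function names are hypothetical; the data are the five average weights listed above. How ties between equal average weights should be ordered is not stated in the text; this sketch simply keeps their original relative order.

```python
def first_order(first_value):
    """Paragraph numbers sorted by first value (average term weight), high to low."""
    return sorted(first_value, key=first_value.get, reverse=True)

# Average weights of the five paragraphs in the example.
first_value = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}
order = first_order(first_value)

# Step A2-3: every paragraph starts with the preset initial value 1.
first_assoc = {paragraph: 1.0 for paragraph in order}
```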
In the original paragraph sequence, paragraphs 1, 3 and 5 have higher average weights, which indicates that these paragraphs contain more terms with high original weights; such paragraphs are more important for characterizing the document. Given the contextual relevance of natural language, a paragraph with a low average weight, such as paragraph 2 or paragraph 4, may sit next to paragraphs with higher average weights; its own average weight does not reflect this contextual relevance, so its weight needs to be raised appropriately. For example, the average weight of paragraph 4 is 1.2, while the average weights of the preceding paragraph 3 and the following paragraph 5 are higher, which indicates that paragraph 4 contributes to what paragraphs 3 and 5 express. The raised weight of paragraph 4 should be somewhat higher than its originally computed average weight, but must not exceed the weights of paragraphs 3 and 5; the raised weights must not change the original ordering; and the farther apart two paragraphs are, the weaker their relevance. In the ordered paragraph sequence, paragraph 3 ranks first and has the highest average weight, so its weight cannot be raised any further; only the second and subsequent paragraphs can be raised, so processing starts from the paragraph ranked second.
A2-4, obtaining the paragraph association weight of any of the preset terms based on the first value and the first association weight of any paragraph in the document structure position identified by the position number and the first order of the paragraphs in the document structure position.
Preferably, in this embodiment, the step A2-4 includes:
A2-4-1, obtaining the first absolute values of any paragraph in the document structure position, based on the first values of the paragraphs and the first order of the paragraphs in the document structure position.
The first absolute values of a paragraph comprise: the absolute value of the difference between the first value of the paragraph and the first value of each paragraph that precedes it in the first order.
A2-4-2, obtaining the second absolute values of any paragraph in the document structure position, based on the numbers of the paragraphs and the first order of the paragraphs in the document structure position.
The second absolute values of a paragraph comprise: the value of $2^n$ for each paragraph that precedes it in the first order.
Here n is the absolute value of the difference between the number of the paragraph and the number of the preceding paragraph.
The average weight difference is divided by 2 raised to the power of the paragraph number difference. The farther apart two paragraphs are, the weaker their relevance: the average weight difference between directly adjacent paragraphs is divided by 2, that between paragraphs two apart is divided by 4, and so on. That is, the difference is divided by $2^n$, where n is the absolute value of the paragraph number difference, so that more distant paragraphs have less influence on the current paragraph.
A2-4-3, obtaining the third absolute values of any paragraph in the document structure position identified by the position number, based on the first absolute values and the second absolute values of the paragraph.
The third absolute values of a paragraph comprise: for each paragraph that precedes it in the first order, the quotient of the corresponding first absolute value and second absolute value.
A2-4-4, obtaining a fourth average of any paragraph in the document structure position, based on the third absolute values of the paragraph.
The fourth average is the average of the third absolute values of the paragraph taken over all paragraphs that precede it in the first order.
A2-4-5, determining the paragraph association weight of the term based on the fourth average and the first value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph where the term is located.
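Steps A2-4-1 to A2-4-4 can be sketched as follows, applied to the five-paragraph example. The function and variable names are hypothetical. The text does not define a fourth average for the top-ranked paragraph, which has no preceding paragraph; returning 0 for it is an assumption of this sketch, consistent with the statement that its weight cannot be raised.

```python
def fourth_average(paragraph, first_value, first_order):
    """Mean distance-discounted weight difference between `paragraph` and
    every paragraph ranked before it in the first order."""
    rank = first_order.index(paragraph)
    preceding = first_order[:rank]
    if not preceding:
        # Assumption: the top-ranked paragraph receives no boost.
        return 0.0
    third_values = []
    for other in preceding:
        first_abs = abs(first_value[paragraph] - first_value[other])  # A2-4-1
        second_abs = 2 ** abs(paragraph - other)                      # A2-4-2
        third_values.append(first_abs / second_abs)                   # A2-4-3
    return sum(third_values) / len(third_values)                      # A2-4-4

first_value = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}
first_order = sorted(first_value, key=first_value.get, reverse=True)
averages = {p: fourth_average(p, first_value, first_order) for p in first_order}
```

For paragraph 4, the preceding paragraphs in the first order are 3, 5, 1 and 2, giving discounted differences 7.6/2, 5.4/2, 4.4/8 and 2.0/4, whose mean is 1.8875.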
Preferably, in this embodiment, the step A2-4-5 includes:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position identified by the position number, based on the fourth average of the paragraph and the first value of the paragraph.
The second association weight of the paragraph is: the quotient of the fourth average of the paragraph and the first value of the paragraph, plus the first association weight of the paragraph.
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold.
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first value of the paragraph.
The fourth association weight of any term in the paragraph is: the product of the third association weight and the first value of the paragraph in which the term is located.
A2-4-5-4, obtaining the paragraph association weight of the term based on the fourth association weight of the term in any paragraph in the document structure position and the number of terms in the paragraph identified by each paragraph number in the document structure position.
Preferably, in this embodiment, the step A2-4-5-2 includes:
and A2-4-5-2-1, judging the second association weight of the paragraph and the preset threshold value, and obtaining a judgment result.
A2-4-5-2-2, and determining a third associated weight value of the paragraph based on the judgment result.
Preferably, in this embodiment, the step A2-4-5-2-2 includes.
If the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is: the preset value is set in advance.
And if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight value of the paragraph.
In this embodiment, the preset threshold is preferably 2.
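The threshold comparison of step A2-4-5-2 amounts to capping the second association weight. A minimal sketch follows; the function name `clip_association_weight` is hypothetical, and the default of 2.0 is the preferred threshold stated above.

```python
def clip_association_weight(second_weight: float, threshold: float = 2.0) -> float:
    """Return the third association weight of a paragraph.

    A second association weight above the threshold is replaced by the
    threshold; any other value is passed through unchanged.
    """
    return threshold if second_weight > threshold else second_weight
```

When the second association weight equals the threshold exactly, both branches give the same value, so the case not covered by the two conditions above needs no separate treatment.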
Preferably, in this embodiment, the step A2-4-5-4 includes:
A2-4-5-4-1, obtaining a total value of all fourth association weights of any term based on the fourth association weight of that term in any paragraph in the document structure position corresponding to the number of the document structure position.
A2-4-5-4-2, acquiring the number of occurrences of any term among the plurality of preset terms based on the plurality of terms, the numbers of the document structure positions where the terms are located and the numbers of the paragraphs in the document structure positions where the terms are located.
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of that term and the number of occurrences of that term among the plurality of terms.
Wherein the paragraph association weight is the average of all fourth association weights of the term.
In this embodiment, the preset initial value is preferably 1.
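Step A2-4-5-4 is a per-term average. The sketch below assumes the fourth association weights are already available as (term number, weight) pairs, one pair per record; the function name and the input layout are illustrative assumptions, not part of the original method description.

```python
from collections import defaultdict


def average_fourth_weights(occurrences):
    """Average the fourth association weights of each term.

    occurrences: iterable of (word_id, fourth_weight) pairs, one per record.
    Returns a dict mapping word_id to its paragraph association weight.
    """
    totals = defaultdict(float)  # A2-4-5-4-1: total of fourth weights per term
    counts = defaultdict(int)    # A2-4-5-4-2: number of records per term
    for word_id, weight in occurrences:
        totals[word_id] += weight
        counts[word_id] += 1
    # A2-4-5-4-3: total divided by count gives the average
    return {word_id: totals[word_id] / counts[word_id] for word_id in totals}
```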
When characterizing the document theme, this embodiment takes into account how close each paragraph is to the paragraphs with a high average term weight: the paragraph association weights of terms in neighboring paragraphs are raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
Detailed description of the invention
For better explaining the present invention, referring to fig. 2, the term document paragraph position table in this embodiment is input into the computer in advance, and this table is explained first.
In this embodiment, the term document paragraph position table words_list, which is input for a specific document, is a database table that includes all terms extracted from the specific document together with their document paragraph position information. Each term with a specific number may have multiple records in the table, in different paragraphs of the same document structure position or in different sentences of the same paragraph. The specific field definitions are shown in table 1.
Table 1 Term document paragraph position table definition
Field name | Field meaning | Field type | Field description
word_id | Term number | INTEGER | Unique number of a specific term
word_weight | Basic weight of term | DECIMAL | Basic weight of a specific term
section_id | Document structure number | INTEGER | Number of the specific document structure position where the term is located
parag_id | Document paragraph number | INTEGER | Number of the specific document paragraph where the term is located
In table 1, word_weight is the basic weight of the term obtained by some method, for example the patented method "A method and apparatus for obtaining hierarchical weight of domain document lexical item"; section_id is the number of the specific document structure position where the paragraph containing the term is located, and one or more paragraphs may exist at the same section_id position; parag_id is the number of the specific document paragraph where the term is located, the numbers of adjacent paragraphs are consecutive, and one or more terms may exist in the same parag_id paragraph.
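The input table of table 1 can be set up as follows. This is a sketch only: the use of SQLite through Python's `sqlite3` module, the helper name `create_words_list`, and the four sample records are assumptions made for illustration, and the table name `words_list` follows the identifier used in the SQL of step (1-4).

```python
import sqlite3


def create_words_list(conn: sqlite3.Connection) -> None:
    """Create the term document paragraph position table of table 1."""
    conn.execute(
        """
        CREATE TABLE words_list (
            word_id     INTEGER,  -- unique number of a specific term
            word_weight DECIMAL,  -- basic weight of the term
            section_id  INTEGER,  -- document structure position number
            parag_id    INTEGER   -- document paragraph number
        )
        """
    )


conn = sqlite3.connect(":memory:")
create_words_list(conn)
# Hypothetical sample records: term 1 occurs in paragraphs 1 and 2 of section 1.
sample_records = [
    (1, 0.5, 1, 1),
    (2, 1.5, 1, 1),
    (1, 0.5, 1, 2),
    (3, 3.0, 2, 3),
]
conn.executemany("INSERT INTO words_list VALUES (?, ?, ?, ?)", sample_records)
```

The table deliberately has no primary key, because the same term may have several records in one paragraph.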
In this embodiment, after the term document paragraph position table of a specific document is input, referring to fig. 1, the term paragraph association weights are obtained according to the method for obtaining term paragraph association weights, and the specific steps include:
B1, acquiring the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the sum of the weights of all terms in the paragraph based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in the document structure positions where the terms are located and the weights of the terms;
wherein the number of the paragraph corresponds to the sequence of the paragraph in the document structure position where the paragraph is located.
In a specific application of this embodiment, the specific steps are as follows:
(1-1) Entering system initialization and defining a database operation statement execution function sql_execute. The input parameter of sql_execute is a text sql, which is a database operation statement conforming to the SQL-92 standard. The function calls a database system function to execute the text sql; the execution result is a table in the database or a change to data in a table, and the function does not directly output a result. Then entering 1-2).
(1-2) Setting the text sql to: SELECT section_id, parag_id, SUM(word_weight)/COUNT(word_weight) AS avg_weights, 1.0 AS parag_weight INTO parags_weights FROM words_list GROUP BY section_id, parag_id ORDER BY section_id, avg_weights DESC. By calling sql_execute, the term weights are accumulated and the terms are counted paragraph by paragraph; the accumulated weight of the terms in each paragraph is divided by the number of terms to obtain the paragraph term average weight; the initial value of the paragraph association weight is set to 1.0; the records are sorted in ascending order of document structure position number and descending order of paragraph term average weight; and the result is stored in the paragraph weight table parags_weights. The paragraph weight table parags_weights includes the document structure number section_id, the document paragraph number parag_id, the paragraph term average weight avg_weights and the paragraph association weight parag_weight. Then entering 1-3).
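Step (1-2) can be tried out on a small in-memory database. The sketch below makes several assumptions: it uses SQLite, which has no SELECT ... INTO for creating tables, so CREATE TABLE ... AS SELECT is swapped in; the identifiers `words_list`, `parags_weights`, `parag_id` and `parag_weight` are reconstructed from the field descriptions; and the helper name and sample records are hypothetical.

```python
import sqlite3


def build_parags_weights(conn: sqlite3.Connection) -> None:
    """Aggregate term weights per paragraph into the table parags_weights."""
    conn.execute(
        """
        CREATE TABLE parags_weights AS
        SELECT section_id,
               parag_id,
               SUM(word_weight) / COUNT(word_weight) AS avg_weights,
               1.0 AS parag_weight  -- preset initial association weight
        FROM words_list
        GROUP BY section_id, parag_id
        ORDER BY section_id, avg_weights DESC
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words_list ("
    "word_id INTEGER, word_weight REAL, section_id INTEGER, parag_id INTEGER)"
)
conn.executemany(
    "INSERT INTO words_list VALUES (?, ?, ?, ?)",
    [(1, 0.5, 1, 1), (2, 1.5, 1, 1), (1, 0.5, 1, 2), (3, 3.0, 2, 3)],
)
build_parags_weights(conn)
rows = conn.execute(
    "SELECT section_id, parag_id, avg_weights, parag_weight "
    "FROM parags_weights ORDER BY section_id, avg_weights DESC"
).fetchall()
```

The final query sorts explicitly because a database is not required to return rows in the order in which they were stored.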
B2, acquiring a paragraph association weight of any term in the preset multiple terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the sum of the weights of all terms in the paragraph.
In a specific application of this embodiment, the specific steps include:
(1-3) For each record of the paragraph weight table parags_weights, calculating the paragraph association weight according to the level of the term average weight of each paragraph at the same document structure position and the adjacency between the paragraphs. Setting the current document structure position section to 0, reading the first record of the paragraph weight table parags_weights as the current record, and entering 1-3-1).
(1-3-1) Reading the document structure number section_id, the paragraph number parag_id and the paragraph term average weight avg_weights of the current record, and entering 1-3-2).
(1-3-2) Judging whether the current document structure position section is equal to the document structure number section_id; if so, entering 1-3-3); otherwise, entering 1-3-8).
(1-3-3) Setting the text sql to: SELECT parag_id, (avg_weights - $)/POWER(2, ABS(parag_id - #)) + 1 AS weights INTO temp FROM parags_weights WHERE section_id = % AND avg_weights > $. Converting the term average weight avg_weights, the paragraph number parag_id and the document structure number section_id of the current paragraph into character strings and using them to replace $, # and % in the text sql, respectively. By calling the function sql_execute, all paragraphs within the current document structure position whose term average weight is higher than that of the current paragraph are acquired; for each of them, the difference between its term average weight and that of the current paragraph and the absolute value of the difference between its number and that of the current paragraph are calculated; the term average weight difference is divided by 2 raised to the power of the absolute paragraph number difference to obtain the accumulated weight that the current paragraph obtains from that paragraph; and the result is written into a temporary table temp. Entering 1-3-4).
(1-3-4) Setting the text sql to: SELECT SUM(weights)/COUNT(weights)/$ + 1 FROM temp. Converting the term average weight avg_weights of the current paragraph into a character string and using it to replace $ in the text sql. By calling the function sql_execute, the accumulated weights and the number of paragraphs in the temporary table temp are summarized, the average value is calculated, the average value is divided by the term average weight of the current paragraph, and 1 is added to obtain the initial association weight of the current paragraph. Entering 1-3-5).
(1-3-5) Judging whether the initial association weight of the current paragraph is greater than 2; if so, modifying it to 2; otherwise, leaving it unchanged. Then entering 1-3-6).
(1-3-6) Setting the text sql to: UPDATE parags_weights SET parag_weight = #, avg_weights = avg_weights * # WHERE parag_id = %. Converting the initial association weight of the current paragraph and the paragraph number parag_id into character strings and using them to replace # and % in the text sql, respectively. By calling the function sql_execute, the association weight and the term average weight of the current paragraph are updated. Then entering 1-3-7).
(1-3-7) Setting the text sql to: DROP TABLE temp. By calling the function sql_execute, the temporary table temp is deleted. Entering 1-3-8).
(1-3-8) Judging whether the current record is the last record of the paragraph weight table parags_weights; if so, entering 1-4); otherwise, reading the next record as the current record and entering 1-3-1).
(1-4) Setting the text sql to: SELECT word_id, parag_weight INTO temp FROM words_list, parags_weights WHERE words_list.parag_id = parags_weights.parag_id. By calling the function sql_execute, the paragraph association weight of each paragraph is assigned, according to the paragraph number, to all the terms in that paragraph, and the result is written into the term paragraph weight association table temp. Then entering 1-5).
(1-5) Setting the text sql to: SELECT word_id, SUM(parag_weight)/COUNT(parag_weight) AS re_weight INTO words_weights FROM temp GROUP BY word_id. By calling the function sql_execute, the paragraph association weights of each term in the different paragraphs of the document are accumulated from the term paragraph weight association table and divided by the word frequency of the term in the document to obtain the term paragraph association weight, and the result is written into the term document paragraph association weight table words_weights. Then entering 1-6).
(1-6) Outputting the term document paragraph association weight table words_weights.
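Steps (1-3) to (1-5) can be sketched in plain Python without a database. The sketch follows the formula as worded in the claims: each higher-weighted paragraph in the same section contributes the weight difference divided by 2 to the power of the paragraph number distance, the contributions are averaged, divided by the current paragraph's average weight, increased by the initial value 1 and capped at the threshold 2. Several points are assumptions: the function names and input layout are hypothetical, the original average weights are used throughout (the in-place update of avg_weights in step 1-3-6 is not reproduced), average weights are taken to be positive, and a paragraph with no higher-weighted neighbor keeps the initial value.

```python
from collections import defaultdict


def paragraph_association_weights(parags, threshold=2.0, initial=1.0):
    """parags: list of (section_id, parag_id, avg_weight) with avg_weight > 0.

    Returns a dict mapping parag_id to its capped association weight.
    """
    by_section = defaultdict(list)
    for section_id, parag_id, avg in parags:
        by_section[section_id].append((parag_id, avg))

    weights = {}
    for members in by_section.values():
        for parag_id, avg in members:
            # Contribution of every paragraph in the same section whose
            # average term weight is higher than that of the current one.
            contribs = [
                (other_avg - avg) / 2 ** abs(other_id - parag_id)
                for other_id, other_avg in members
                if other_avg > avg
            ]
            if contribs:
                weight = (sum(contribs) / len(contribs)) / avg + initial
            else:
                weight = initial
            weights[parag_id] = min(weight, threshold)
    return weights


def term_association_weights(records, parag_weights):
    """records: list of (word_id, parag_id); averages the weights per term."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word_id, parag_id in records:
        totals[word_id] += parag_weights[parag_id]
        counts[word_id] += 1
    return {word_id: totals[word_id] / counts[word_id] for word_id in totals}


parag_weights = paragraph_association_weights(
    [(1, 1, 1.0), (1, 2, 0.5), (2, 3, 3.0)]
)
word_weights = term_association_weights(
    [(1, 1), (2, 1), (1, 2), (3, 3)], parag_weights
)
```

In this example paragraph 2 sits next to the higher-weighted paragraph 1, so it receives (1.0 - 0.5) / 2 = 0.25, which divided by its own average 0.5 and increased by 1 gives 1.5; term 1, which occurs in paragraphs 1 and 2, therefore ends up with the average of 1.0 and 1.5.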
When characterizing the document theme, this embodiment takes into account how close each paragraph is to the paragraphs with a high average term weight: the paragraph association weights of terms in neighboring paragraphs are raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
In the second embodiment, a document is first processed with the existing TF-IDF algorithm to obtain the important terms in the document; the term document paragraph association weight table words_weights is then obtained for these terms according to the method for obtaining term paragraph association weights of this embodiment; finally, the subject words of the document are extracted, the subject words being the n terms with the highest paragraph association weights.
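The final selection of subject words is a top-n query over the term weights. A minimal sketch, assuming the contents of words_weights have been loaded into a dict that maps each term number to its paragraph association weight; the function name is hypothetical.

```python
import heapq


def top_subject_words(words_weights, n):
    """Return the n term numbers with the highest paragraph association weight."""
    return heapq.nlargest(n, words_weights, key=words_weights.get)
```

For example, `top_subject_words({1: 1.25, 2: 1.0, 3: 1.8}, 2)` selects terms 3 and 1, in that order.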
This embodiment simultaneously considers, within the same document structure position, the differences in influence level and the adjacent distances of multiple important paragraphs, thereby reflecting the joint effect of multiple paragraphs.
This embodiment calculates the average of the paragraph association weights of the same term appearing at different document structure positions, thereby comprehensively considering how occurrences of the same term at different document structure positions differ in representing the document theme.
The method of this embodiment is suitable for all term weight calculations that need to highlight how the neighbor relations of different paragraphs differ in characterizing the document.
The foregoing describes the technical principles of the present invention in conjunction with specific embodiments, which are provided for the purpose of illustrating the principles of the present invention and are not to be construed as limiting the scope of the present invention in any way. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive efforts, which shall fall within the scope of the present invention.

Claims (3)

1. A method for obtaining term paragraph association weight includes the steps of:
a1, acquiring the number of lexical items in any paragraph in a document structure position corresponding to the number of the document structure position and the total number of weights of all lexical items in the paragraph based on a plurality of preset lexical items, the numbers of document structure positions where the lexical items are located, the numbers of paragraphs in the document structure positions where the lexical items are located and the weights of the lexical items;
the number of the paragraph corresponds to the sequence of the paragraph in the document structure position where the paragraph is located;
a2, acquiring paragraph association weight of any term in the preset multiple terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of weights of all terms in the paragraph;
the step A2 comprises the following steps:
a2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of weights of all terms in the paragraph;
wherein the first value is: the average of the weights of all terms in the passage;
a2-2, acquiring a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the first order is: the first numerical values of the paragraphs in the document structure positions corresponding to the numbers of the document structure positions are arranged in a high-to-low sequence;
a2-3, determining a first association weight of any paragraph in the document structure position according to a preset initial value aiming at the document structure position corresponding to the number of the document structure position;
wherein, the first association weight of the paragraph is a preset initial value;
a2-4, acquiring the paragraph association weights of the terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and on the first order of the paragraphs in the document structure position corresponding to the number of the document structure position;
the step A2-4 comprises the following steps:
a2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding the paragraph in the first order;
a2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: the value $2^{n}$ corresponding to the paragraph and a first paragraph;
the first paragraph is any paragraph preceding the paragraph in the first order;
wherein n is the absolute value of the difference between the number of the paragraph in the first order and the number of the first paragraph in the first order;
a2-4-3, acquiring a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the third absolute value comprises: the quotient of the first absolute value and the second absolute value computed between the paragraph and each paragraph preceding the paragraph in the first order;
a2-4-4, acquiring a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the fourth average value is: the average of the third absolute values computed between the paragraph and all paragraphs preceding the paragraph in the first order;
a2-4-5, determining paragraph association weights of the terms based on a fourth average value corresponding to any one paragraph in the document structure positions corresponding to the numbers of the document structure positions and the first numerical value of the paragraph, the numbers of the document structure positions where the terms are located and the numbers of the paragraphs in the document structure positions where the terms are located;
the step A2-4-5 comprises the following steps:
a2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph;
a2-4-5-2, determining a third association weight value of any paragraph in the document structure position corresponding to the number of the document structure position based on a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold;
a2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value;
wherein, the fourth association weight of any term in the paragraph is: the product of the third weight value and the first numerical value of the paragraph in which the term is located;
a2-4-5-4, acquiring paragraph association weights of the terms based on a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position;
the step A2-4-5-2 comprises the following steps:
a2-4-5-2-1, judging the second association weight of the paragraph and the preset threshold value to obtain a judgment result;
a2-4-5-2-2, determining a third associated weight value of the paragraph based on the judgment result;
the step A2-4-5-2-2 comprises the following steps:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight value of the paragraph is the preset threshold;
if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight value of the paragraph;
the step A2-4-5-4 comprises the following steps:
a2-4-5-4-1, acquiring a total value of fourth association weights of all the terms based on the fourth association weights of any term in any paragraph in the document structure position corresponding to the serial number of the document structure position;
a2-4-5-4-2, acquiring the number of any term in a plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of a paragraph in the document structure position where the term is located;
a2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of the term among the plurality of terms;
wherein the paragraph association weight is an average of all fourth association weights of any term.
2. The method of claim 1, wherein the predetermined threshold is 2.
3. The method of claim 1, wherein the predetermined initial value is 1.
CN202010274876.0A 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight Active CN111611342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010274876.0A CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010274876.0A CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Publications (2)

Publication Number Publication Date
CN111611342A CN111611342A (en) 2020-09-01
CN111611342B true CN111611342B (en) 2023-04-18

Family

ID=72201801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010274876.0A Active CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Country Status (1)

Country Link
CN (1) CN111611342B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
US9201876B1 (en) * 2012-05-29 2015-12-01 Google Inc. Contextual weighting of words in a word grouping
CN105426379A (en) * 2014-10-22 2016-03-23 武汉理工大学 Keyword weight calculation method based on position of word
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information
CN106845265A (en) * 2016-12-01 2017-06-13 北京计算机技术及应用研究所 A kind of document security level automatic identifying method
WO2018121145A1 (en) * 2016-12-30 2018-07-05 北京国双科技有限公司 Method and device for vectorizing paragraph
CN109766408A (en) * 2018-12-04 2019-05-17 上海大学 The text key word weighing computation method of comprehensive word positional factor and word frequency factor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI434187B (en) * 2010-11-03 2014-04-11 Inst Information Industry Text conversion method and system

Also Published As

Publication number Publication date
CN111611342A (en) 2020-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Deng Jiqiu, Lu Biyu, Liu Wenyi, Li Chenhan, He Meixiang
Inventor before: Deng Jiqiu, Lu Biyu, Li Chenhan
GR01 Patent grant