CN111611342A - Method and device for obtaining lexical item and paragraph association weight - Google Patents

Info

Publication number
CN111611342A
Authority
CN
China
Prior art keywords
paragraph
document structure
structure position
term
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010274876.0A
Other languages
Chinese (zh)
Other versions
CN111611342B (en)
Inventor
邓吉秋
路馥毓
李晨菡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202010274876.0A
Publication of CN111611342A
Application granted
Publication of CN111611342B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for obtaining term (lexical item) paragraph association weights. The method comprises the following steps: A1, based on a plurality of preset terms, the number of the document structure position where each term is located, the number of the paragraph in that document structure position where each term is located, and the weight of each term, acquiring, for any paragraph in the document structure position identified by a position number, the number of terms in the paragraph and the total of the weights of all terms in the paragraph, wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located; A2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph of the document structure position and the total of the weights of all terms in that paragraph.

Description

Method and device for obtaining lexical item and paragraph association weight
Technical Field
The invention relates to the technical field of document extraction, and in particular to a method and a device for obtaining term paragraph association weights.
Background
Most current Chinese text classification systems use words as feature items, called feature words. Feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and a user's target. Generally, a score is calculated for every feature according to a feature evaluation function, the features are ranked by score, and the highest-scoring words are selected as feature words.
The most common and effective text representation method is to build a term-document matrix. Each element of the matrix is the weight of the term in its row with respect to the document in its column, i.e., the importance of that term to the document. Whether a term is important for a document is reflected in two aspects: the more often a term occurs in a document, the more important it is to that document; and the more often a term occurs across the entire corpus, the less it distinguishes any one document and hence the less important it is to that document. This is the idea of the TF-IDF algorithm. Keyword extraction based on TextRank is another method and can be applied to a single document. The TextRank keyword extraction task is to automatically extract several meaningful words or phrases from a given text; the algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and extracts the keywords directly from the text.
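The TF-IDF idea described above can be illustrated with a minimal sketch. The function name `tf_idf`, the sample documents, and the particular variant used (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions chosen for illustration; the source does not specify which TF-IDF formulation is used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Illustrative, made-up token lists.
docs = [
    ["paragraph", "weight", "term", "term"],
    ["term", "document", "matrix"],
    ["paragraph", "order", "weight"],
]
weights = tf_idf(docs)
```

In this sketch, "matrix" occurs in only one document and therefore outweighs "term" within the second document, even though both occur once there.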
The same term may be located in different paragraphs of the same structural position of a document, and its effect in characterizing the document theme may differ accordingly. For example, paragraph 1 and paragraph 2 of a section generally follow a continuous line of thought, and the terms in paragraph 1 and the terms in paragraph 2 are necessarily related in some way (terms may be repeated, may be semantically equivalent, or may be logically related through causal or sequential exposition, etc.). The general term-document matrix characterizes the document theme purely by term occurrence counts, taking as theme words the terms that occur frequently in a specific document but rarely in other documents; TF-IDF thus tends to filter out common terms and keep important ones. The TextRank algorithm ranks candidate keywords using the relationship between local words (a co-occurrence window) and considers only the co-occurrence of locally adjacent words. Neither of these two common methods considers how differences in the paragraph adjacency of terms within the same structural position of a document affect the document's representation.
Disclosure of Invention
(I) Technical problem to be solved
To solve the problem that the prior art does not consider how differences in the paragraph adjacency of terms within the same structural position of a document affect the document's representation, the invention provides a method and a device for obtaining term paragraph association weights.
(II) Technical solution
In order to achieve the above object, the present invention provides a method for obtaining term paragraph association weights, comprising the steps of:
A1, based on a plurality of preset terms, the number of the document structure position where each term is located, the number of the paragraph in that document structure position where each term is located, and the weight of each term, acquiring, for any paragraph in the document structure position identified by a position number, the number of terms in the paragraph and the total of the weights of all terms in the paragraph;
wherein the number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located;
A2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph of the document structure position and the total of the weights of all terms in that paragraph.
Preferably, the step a2 includes:
A2-1, acquiring a first numerical value of any paragraph in the document structure position identified by the position number, based on the number of terms in the paragraph and the total of the weights of all terms in the paragraph;
wherein the first numerical value is: the average of the weights of all terms in the paragraph;
A2-2, acquiring a first order of the paragraphs in the document structure position based on the first numerical value of each paragraph in the document structure position;
wherein the first order is: the paragraphs in the document structure position arranged by first numerical value from high to low;
A2-3, for the document structure position identified by the position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value;
wherein the first association weight of the paragraph is the preset initial value;
A2-4, acquiring the paragraph association weight of any term among the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position and the first order of the paragraphs in the document structure position.
Preferably, the step a2-4 includes:
A2-4-1, acquiring first absolute values of any paragraph in the document structure position identified by the position number, based on the first numerical value of the paragraph and the first order of the paragraphs in the document structure position;
wherein the first absolute values of a paragraph comprise: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding it in the first order;
A2-4-2, acquiring second absolute values of any paragraph in the document structure position based on the number of the paragraph and the first order of the paragraphs in the document structure position;
wherein the second absolute values of a paragraph comprise: the value 2^n computed for the paragraph with respect to each paragraph preceding it in the first order;
wherein n is the absolute value of the difference between the number of the paragraph and the number of that preceding paragraph;
A2-4-3, acquiring third absolute values of any paragraph in the document structure position based on the first absolute values and the second absolute values of the paragraph;
wherein the third absolute values comprise: for each paragraph preceding the paragraph in the first order, the quotient of the corresponding first absolute value and second absolute value;
A2-4-4, acquiring a fourth average value of any paragraph in the document structure position based on the third absolute values of the paragraph;
wherein the fourth average value is: the average of the third absolute values of the paragraph over all paragraphs preceding it in the first order;
A2-4-5, determining the paragraph association weight of the term based on the fourth average value and the first numerical value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph in that document structure position where the term is located.
Preferably, the step a2-4-5 comprises:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position identified by the position number, based on the fourth average value and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph;
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold;
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first numerical value of the paragraph;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located;
A2-4-5-4, obtaining the paragraph association weight of the term based on the fourth association weight of the term in any paragraph of the document structure position and the number of terms in the paragraph identified by the corresponding paragraph number.
Preferably, the step A2-4-5-2 comprises the following steps:
A2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
A2-4-5-2-2, determining the third association weight of the paragraph based on the judgment result.
Preferably, the step A2-4-5-2-2 comprises the following steps:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold;
if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight of the paragraph is the same as the second association weight of the paragraph.
Preferably, the preset threshold is 2.
Preferably, the step A2-4-5-4 comprises:
A2-4-5-4-1, obtaining, based on the fourth association weights of any term, the total of the fourth association weights of the term over the paragraphs of the document structure positions in which the term is located;
A2-4-5-4-2, acquiring the number of occurrences of any term among the preset plurality of terms based on the preset terms, the number of the document structure position where each term is located, and the number of the paragraph in that document structure position where each term is located;
A2-4-5-4-3, determining the paragraph association weight of the term based on the total of all fourth association weights of the term and the number of occurrences of the term;
wherein the paragraph association weight is the average of all fourth association weights of the term.
Preferably, the preset initial value is 1.
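Read together, the preferred steps above amount to the following computation for one document structure position. This is a summary sketch rather than the claimed wording. The symbols are introduced here for illustration and do not appear in the original: S_p and N_p are the total weight and number of terms of paragraph p, B(p) is the set of paragraphs preceding p in the first order, R_t is the set of records of term t, and theta is the preset threshold. Reading the second association weight as the quotient plus the first association weight is an interpretation of the text, and writing the threshold rule as a minimum also assigns the threshold in the equal case, which the text leaves open.

```latex
\begin{align*}
v_p &= \frac{S_p}{N_p}
  && \text{first numerical value (A2-1)} \\
d_{p,q} &= \frac{\lvert v_p - v_q \rvert}{2^{\lvert p - q \rvert}}, \quad q \in B(p)
  && \text{third absolute values (A2-4-1 to A2-4-3)} \\
m_p &= \frac{1}{\lvert B(p) \rvert} \sum_{q \in B(p)} d_{p,q}
  && \text{fourth average value (A2-4-4)} \\
w^{(2)}_p &= \frac{m_p}{v_p} + w^{(1)}_p, \quad w^{(1)}_p = 1
  && \text{second association weight (A2-4-5-1)} \\
w^{(3)}_p &= \min\bigl(w^{(2)}_p,\ \theta\bigr), \quad \theta = 2
  && \text{third association weight (A2-4-5-2)} \\
u_{t,p} &= w^{(3)}_p \, v_p
  && \text{fourth association weight (A2-4-5-3)} \\
W_t &= \frac{1}{\lvert R_t \rvert} \sum_{(t,p) \in R_t} u_{t,p}
  && \text{paragraph association weight (A2-4-5-4)}
\end{align*}
```

The formulas for m_p and w^{(2)}_p apply to paragraphs with at least one predecessor in the first order; the top-ranked paragraph has none and is not raised.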
An apparatus for obtaining term paragraph association weights, the apparatus for obtaining term paragraph association weights storing computer instructions; the computer instructions cause the apparatus for obtaining term paragraph association weights to perform the method for obtaining term paragraph association weights as described in any one of the above.
(III) Advantageous effects
The beneficial effects of the invention are as follows: when characterizing the document theme, the method takes into account the adjacency between each paragraph and the paragraphs with a high average term weight, raises the paragraph association weight of terms in the neighboring paragraphs, and thereby raises and highlights the standing of terms located near the important paragraphs of the document structure.
The invention also takes into account the differing levels of influence of multiple paragraphs at different adjacency distances within the same document structure position, reflecting the combined effect of multiple paragraphs.
The invention averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively considers how the same term at different document structure positions contributes differently to the representation of the document theme.
Drawings
FIG. 1 is a flow chart of a method for obtaining term paragraph association weights according to the present invention;
FIG. 2 is a schematic diagram illustrating a method for obtaining term paragraph association weights in the second embodiment of the present invention.
Detailed Description
To better explain the invention and facilitate understanding, the invention is described in detail below through specific embodiments with reference to the accompanying drawings.
Embodiment 1
Referring to FIG. 1, the method for obtaining term paragraph association weights in the first embodiment includes the following steps:
A1, based on a plurality of preset terms, the number of the document structure position where each term is located, the number of the paragraph in that document structure position where each term is located, and the weight of each term, acquiring, for any paragraph in the document structure position identified by a position number, the number of terms in the paragraph and the total of the weights of all terms in the paragraph.
The number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
A2, acquiring the paragraph association weight of any term among the preset plurality of terms based on the number of terms in any paragraph of the document structure position and the total of the weights of all terms in that paragraph.
In this embodiment, the subject terms of the document may be extracted according to the paragraph association weights of the preset plurality of terms. The subject terms of the document are the n terms with the highest paragraph association weights.
In this embodiment, the document may first be processed with the existing TF-IDF algorithm to obtain the important terms in the document; the paragraph association weights of these terms are then obtained with the method of this embodiment; and finally the subject terms of the document are extracted, the subject terms being the n terms with the highest paragraph association weights.
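Step A1 above can be sketched as a simple grouping pass. This is a minimal illustration under stated assumptions: the record layout `(term, position_no, paragraph_no, weight)`, the function name `aggregate_paragraphs`, and the sample values are hypothetical and are not taken from the original.

```python
from collections import defaultdict

def aggregate_paragraphs(records):
    """Step A1 sketch: per (position, paragraph), count term records and sum weights."""
    stats = defaultdict(lambda: {"count": 0, "total": 0.0})
    for term, position_no, paragraph_no, weight in records:
        entry = stats[(position_no, paragraph_no)]
        entry["count"] += 1       # number of terms in the paragraph
        entry["total"] += weight  # total of the term weights in the paragraph
    return dict(stats)

# Hypothetical records: (term, position_no, paragraph_no, weight).
records = [
    ("slope", 1, 1, 4.0),
    ("terrain", 1, 1, 6.0),
    ("slope", 1, 2, 3.0),
    ("model", 2, 1, 5.0),
]
stats = aggregate_paragraphs(records)
```

Keying the result by the pair of position number and paragraph number keeps paragraphs with the same number in different document structure positions apart.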
Preferably, in this embodiment, the step a2 includes:
A2-1, acquiring a first numerical value of any paragraph in the document structure position identified by the position number, based on the number of terms in the paragraph and the total of the weights of all terms in the paragraph.
The first numerical value is the average of the weights of all terms in the paragraph.
A2-2, acquiring a first order of the paragraphs in the document structure position based on the first numerical value of each paragraph in the document structure position.
The first order arranges the paragraphs in the document structure position by first numerical value from high to low.
A2-3, for the document structure position identified by the position number, determining a first association weight of any paragraph in the document structure position according to a preset initial value.
The first association weight of the paragraph is the preset initial value.
For example, if a result section has 5 paragraphs (original paragraph sequence), the paragraph numbers and average weights are:
Paragraph 1: 5.6
Paragraph 2: 3.2
Paragraph 3: 8.8
Paragraph 4: 1.2
Paragraph 5: 6.6
The paragraphs ordered by average weight, with their association weights initialized, are as follows (ordered paragraph sequence):
Paragraph 3: 8.8, 1 (initial value)
Paragraph 5: 6.6, 1 (initial value)
Paragraph 1: 5.6, 1 (initial value)
Paragraph 2: 3.2, 1 (initial value)
Paragraph 4: 1.2, 1 (initial value)
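The ordering and initialization in this example can be reproduced with a short sketch. The variable names are illustrative assumptions; the average weights are the ones listed above.

```python
# Average term weight (first numerical value) per paragraph number, from the example.
first_value = {1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6}

PRESET_INITIAL_VALUE = 1.0  # preferred initial value stated in the text

# A2-2: first order, i.e. paragraph numbers sorted by first value, high to low.
first_order = sorted(first_value, key=first_value.get, reverse=True)

# A2-3: every paragraph starts with the preset initial association weight.
first_association_weight = {p: PRESET_INITIAL_VALUE for p in first_order}
```

Here `first_order` holds paragraph numbers rather than weights, so the original paragraph numbers remain available for the distance calculation in later steps.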
Paragraphs 1, 3 and 5 in the original paragraph sequence have higher average weights, which indicates that these paragraphs contain more terms with high original weights; such paragraphs are more important for characterizing the document. Because natural language is contextually related, a paragraph with a low average weight, such as paragraph 2 or paragraph 4, may be adjacent to paragraphs with higher average weights; its own average weight does not reflect this contextual relevance, so its weight needs to be raised appropriately. For instance, the average weight of paragraph 4 is 1.2 while the average weights of the preceding paragraph 3 and the following paragraph 5 are higher, which indicates that paragraph 4 contributes to the characterization of paragraphs 3 and 5. The raised weight of paragraph 4 should be slightly higher than its originally calculated average weight but must not exceed the weights of paragraphs 3 and 5; the raised weights must not change the original ranking, and the farther apart two paragraphs are, the weaker their relevance. In the ordered paragraph sequence, paragraph 3 ranks first with the highest average weight, so its weight cannot be raised further; only the second and subsequent paragraphs can be raised, so the calculation starts from the paragraph ranked second.
A2-4, acquiring the paragraph association weight of any term among the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position and the first order of the paragraphs in the document structure position.
Preferably, in this embodiment, the step A2-4 includes:
A2-4-1, obtaining first absolute values of any paragraph in the document structure position identified by the position number, based on the first numerical value of the paragraph and the first order of the paragraphs in the document structure position.
The first absolute values of a paragraph comprise: the absolute value of the difference between the first numerical value of the paragraph and the first numerical value of each paragraph preceding it in the first order.
A2-4-2, acquiring second absolute values of any paragraph in the document structure position based on the number of the paragraph and the first order of the paragraphs in the document structure position.
The second absolute values of a paragraph comprise: the value 2^n computed for the paragraph with respect to each paragraph preceding it in the first order.
Here n is the absolute value of the difference between the number of the paragraph and the number of that preceding paragraph.
That is, the average weight difference is divided by 2 raised to the power of the paragraph number difference. The farther apart two paragraphs are, the weaker their relevance: the average weight difference between directly adjacent paragraphs is divided by 2, that between paragraphs whose numbers differ by two is divided by 4, and so on, i.e., the difference is divided by 2^n, where n is the absolute value of the paragraph number difference, so that paragraphs farther away have less influence on the current paragraph.
A2-4-3, obtaining third absolute values of any paragraph in the document structure position identified by the position number, based on the first absolute values and the second absolute values of the paragraph.
The third absolute values comprise: for each paragraph preceding the paragraph in the first order, the quotient of the corresponding first absolute value and second absolute value.
A2-4-4, obtaining a fourth average value of any paragraph in the document structure position based on the third absolute values of the paragraph.
The fourth average value is the average of the third absolute values of the paragraph over all paragraphs preceding it in the first order.
A2-4-5, determining the paragraph association weight of the term based on the fourth average value and the first numerical value of any paragraph in the document structure position, the number of the document structure position where the term is located, and the number of the paragraph in that document structure position where the term is located.
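Steps A2-4-1 to A2-4-4 can be sketched for a single document structure position as follows. This is a minimal sketch, not the patented implementation: the function name `fourth_averages` is hypothetical, and the input reuses the five average weights from the example in this embodiment.

```python
def fourth_averages(first_value):
    """Return {paragraph_no: fourth average value} for one document structure position.

    first_value maps paragraph number -> average term weight (step A2-1).
    """
    # First order (A2-2): paragraph numbers sorted by first value, high to low.
    first_order = sorted(first_value, key=first_value.get, reverse=True)
    result = {}
    for rank, p in enumerate(first_order):
        predecessors = first_order[:rank]
        if not predecessors:
            continue  # the top-ranked paragraph has no preceding paragraph
        third_values = []
        for q in predecessors:
            first_abs = abs(first_value[p] - first_value[q])  # A2-4-1
            second_abs = 2 ** abs(p - q)                      # A2-4-2
            third_values.append(first_abs / second_abs)       # A2-4-3
        result[p] = sum(third_values) / len(third_values)     # A2-4-4
    return result

averages = fourth_averages({1: 5.6, 2: 3.2, 3: 8.8, 4: 1.2, 5: 6.6})
```

For paragraph 5, the only predecessor in the first order is paragraph 3, so the result is |6.6 - 8.8| / 2^2, approximately 0.55. Paragraph 3 ranks first and therefore has no entry.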
Preferably, in this embodiment, the step a2-4-5 includes:
A2-4-5-1, determining a second association weight of any paragraph in the document structure position identified by the position number, based on the fourth average value and the first numerical value of the paragraph.
The second association weight of the paragraph is: the quotient of the fourth average value of the paragraph and the first numerical value of the paragraph, plus the first association weight of the paragraph.
A2-4-5-2, determining a third association weight of any paragraph in the document structure position based on the second association weight of the paragraph and a preset threshold.
A2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position based on the third association weight and the first numerical value of the paragraph.
The fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located.
A2-4-5-4, obtaining the paragraph association weight of the term based on the fourth association weight of the term in any paragraph of the document structure position and the number of terms in the paragraph identified by the corresponding paragraph number.
Preferably, in this embodiment, the step a2-4-5-2 includes:
a2-4-5-2-1, judging the second association weight of the paragraph and the preset threshold value, and obtaining the judgment result.
A2-4-5-2-2, determining a third associated weight value of the paragraph based on the judgment result.
Preferably, in this embodiment, the step a2-4-5-2-2 includes.
If the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is: the preset value is set in advance.
And if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight value of the paragraph.
In this embodiment, the preset threshold is preferably 2.
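The threshold judgment of step A2-4-5-2 can be illustrated with a minimal Python sketch. The function name and signature are hypothetical. The sketch assumes that a weight above the threshold is replaced by the threshold itself, as step (1-3-5) of the detailed description does with the value 2. The text does not say what happens when the weight equals the threshold, so that case is left unchanged here.

```python
def third_association_weight(second_weight: float, threshold: float = 2.0) -> float:
    """Cap a paragraph's second association weight at the preset threshold.

    Assumption: a weight exactly equal to the threshold is returned unchanged,
    because the text only covers the "greater" and "smaller" cases.
    """
    if second_weight > threshold:
        return threshold
    return second_weight
```

With the preferred threshold of 2, a second association weight of 2.7 becomes 2.0, while 1.3 is kept as it is.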
Preferably, in this embodiment, step A2-4-5-4 includes:
A2-4-5-4-1, obtaining the total value of the fourth association weights of any term over the paragraphs in the document structure positions in which it occurs, based on the fourth association weight of the term.
A2-4-5-4-2, obtaining the number of occurrences of any term among the plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of the paragraph in the document structure position where the term is located.
A2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term.
Wherein the paragraph association weight is the average of all fourth association weights of the term.
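Steps A2-4-5-3 and A2-4-5-4 can be sketched for a single term as follows. The function name and the input layout (one pair per occurrence of the term) are assumptions made for illustration and are not taken from the original text.

```python
def term_paragraph_association_weight(occurrences):
    """Average the fourth association weights of one term.

    occurrences: list of (third_weight, first_value) pairs, one per occurrence
    of the term, where first_value is the average term weight of the paragraph
    containing that occurrence (hypothetical input layout).
    """
    if not occurrences:
        raise ValueError("the term has no occurrences")
    # Step A2-4-5-3: fourth weight = third association weight * first value.
    fourth_weights = [third * first for third, first in occurrences]
    # Step A2-4-5-4: paragraph association weight = mean of the fourth weights.
    return sum(fourth_weights) / len(fourth_weights)
```

For a term that occurs once in a paragraph with third weight 2.0 and average weight 1.0, and once in a paragraph with third weight 1.0 and average weight 4.0, the fourth weights are 2.0 and 4.0 and the resulting paragraph association weight is 3.0.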
In this embodiment, the preset initial value is preferably 1.
In this embodiment, when characterizing the document theme, the adjacency between each paragraph and the paragraphs with a high average term weight is taken into account: the paragraph association weights of terms in neighboring paragraphs are raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
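The quantities defined in steps A2-1 to A2-4-5-3 can be summarized for one paragraph as a short set of formulas. The symbols are introduced here for illustration only and do not appear in the original text: $\bar{w}_p$ is the first numerical value (average term weight) of the paragraph numbered $p$, $H_p$ is the set of paragraphs of the same document structure position that precede $p$ in the first order, $a_0$ is the preset initial value, and $T$ is the preset threshold. The use of $\min$ assumes that a weight above the threshold is replaced by the threshold.

```latex
\begin{aligned}
c_{p,q} &= \frac{\lvert \bar{w}_q - \bar{w}_p \rvert}{2^{\lvert q - p \rvert}}, \qquad q \in H_p
  && \text{(third absolute value)} \\
m_p &= \frac{1}{\lvert H_p \rvert} \sum_{q \in H_p} c_{p,q}
  && \text{(fourth average)} \\
s_p &= \frac{m_p}{\bar{w}_p} + a_0
  && \text{(second association weight)} \\
t_p &= \min(s_p,\, T)
  && \text{(third association weight)} \\
f_p &= t_p \, \bar{w}_p
  && \text{(fourth association weight of each term in } p\text{)}
\end{aligned}
```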
Detailed description of the invention
To better explain the present invention, and with reference to Fig. 2, the term document paragraph position table used in this embodiment is input into the computer in advance; this table is explained first.
In this embodiment, the term document paragraph position table words_list input for a specific document is a database table containing all terms extracted from that document together with their document paragraph position information. A term with a given number may have multiple records in the table, in different paragraphs of the same document structure position or in different sentences of the same paragraph. The field definitions are shown in Table 1.
Table 1. Definition of the term document paragraph position table
Field name | Meaning | Type | Description
word_id | Term number | INTEGER | Unique number of a specific term
word_weight | Basic weight of term | DECIMAL | Basic weight of a specific term
section_id | Document structure number | INTEGER | Number of the specific structure position of the document where the term is located
parag_id | Document paragraph number | INTEGER | Number of the specific paragraph of the document where the term is located
In Table 1, word_weight is the basic weight of the term obtained by some method, for example the patented method "A method and apparatus for obtaining hierarchical weight of domain document lexical item". section_id is the number of the specific structure position of the document where the paragraph is located; one or more paragraphs may exist at the same section_id position. parag_id is the number of the specific paragraph of the document where the term is located; the numbers of adjacent paragraphs are consecutive, and one or more terms may exist in the same parag_id paragraph.
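As an illustration of the table layout in Table 1, the following sketch creates the table in an in-memory SQLite database. The use of SQLite, the table name words_list, the helper name and the sample rows are all assumptions made for this example; the original text does not name a database system.

```python
import sqlite3

# Hypothetical sample rows: (word_id, word_weight, section_id, parag_id).
# Term 1 has two records because it occurs in two paragraphs of section 1.
SAMPLE_ROWS = [
    (1, 0.9, 1, 1),
    (2, 0.4, 1, 1),
    (1, 0.9, 1, 2),
    (3, 0.7, 2, 3),
]


def create_words_list(conn, rows=SAMPLE_ROWS):
    """Create and fill the term document paragraph position table."""
    conn.execute(
        "CREATE TABLE words_list ("
        " word_id INTEGER,"      # unique number of a term
        " word_weight DECIMAL,"  # basic weight of the term
        " section_id INTEGER,"   # number of the document structure position
        " parag_id INTEGER)"     # number of the paragraph
    )
    conn.executemany("INSERT INTO words_list VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Calling `create_words_list(sqlite3.connect(":memory:"))` returns a connection whose words_list table holds the four sample records.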
In this embodiment, after the term document paragraph position table of a specific document is input, and with reference to Fig. 1, the term paragraph association weights are obtained according to the method for obtaining term paragraph association weights. The specific steps include:
B1, acquiring the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position, and the sum of the weights of all terms in the paragraph, based on a plurality of preset terms, the numbers of the document structure positions where the terms are located, the numbers of the paragraphs in the document structure positions where the terms are located and the weights of the terms.
The number of a paragraph corresponds to the order of the paragraph within the document structure position where it is located.
In a specific application of this embodiment, the specific steps are as follows:
(1-1) Enter system initialization and define a database operation statement execution function sql_execute. The input parameter of the function sql_execute is the text sql, which is a database operation statement conforming to the SQL-92 standard. The function calls a database system function to execute the text sql; the execution result is a table in the database or a change to the data in a table, and the function itself does not directly output a result. Then enter 1-2).
(1-2) Set the text sql to: SELECT section_id, parag_id, SUM(word_weight)/COUNT(word_weight) AS avg_weights, 1.0 AS parag_weight INTO parags_weights FROM words_list GROUP BY section_id, parag_id ORDER BY section_id, avg_weights DESC. By calling the function sql_execute, the term weights are accumulated and the terms are counted paragraph by paragraph, the accumulated term weight of each paragraph is divided by its number of terms to obtain the average term weight of the paragraph, the initial value of the paragraph association weight is set to 1.0, the records are sorted by ascending document structure position number and descending paragraph term average weight, and the result is stored in the paragraph weight table parags_weights. The paragraph weight table parags_weights comprises the document structure number section_id, the document paragraph number parag_id, the paragraph term average weight avg_weights and the paragraph association weight parag_weight. Then enter 1-3).
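Steps (1-1) and (1-2) can be sketched with Python's sqlite3 module. This is only an illustration under several assumptions: SQLite stands in for the unnamed database system, the paragraph weight table is spelled parags_weights, the helper names are hypothetical, and `CREATE TABLE ... AS SELECT` replaces `SELECT ... INTO`, which SQLite does not support. The factor `* 1.0` forces real division, because SQLite may store whole-number weights as integers.

```python
import sqlite3


def make_sql_execute(conn):
    """Step (1-1): return a sql_execute(sql) function bound to a connection.

    The returned function only changes the database and returns nothing.
    """
    def sql_execute(sql):
        conn.execute(sql)
        conn.commit()
    return sql_execute


def build_parags_weights(conn):
    """Step (1-2): average term weight per paragraph, initial weight 1.0."""
    sql_execute = make_sql_execute(conn)
    sql_execute("DROP TABLE IF EXISTS parags_weights")
    sql_execute(
        "CREATE TABLE parags_weights AS "
        "SELECT section_id, parag_id, "
        "SUM(word_weight) * 1.0 / COUNT(word_weight) AS avg_weights, "
        "1.0 AS parag_weight "
        "FROM words_list GROUP BY section_id, parag_id "
        "ORDER BY section_id, avg_weights DESC"
    )
    # Read the table back in the order required by step (1-2).
    return conn.execute(
        "SELECT section_id, parag_id, avg_weights, parag_weight "
        "FROM parags_weights ORDER BY section_id, avg_weights DESC"
    ).fetchall()
```

The function expects a connection that already contains a words_list table with the fields of Table 1 and returns one row per paragraph.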
B2, acquiring paragraph association weight of any term in the preset multiple terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of the weights of all terms in the paragraph.
In a specific application of this embodiment, the specific steps include:
(1-3) For each record of the paragraph weight table parags_weights, calculate the paragraph association weight according to how high the average term weight of each paragraph in the same document structure position is and how close the paragraphs are to one another. Set the current document structure position section to 0, read the first record of the paragraph weight table parags_weights as the current record, and enter 1-3-1).
(1-3-1) Read the document structure number section_id, the paragraph number parag_id and the paragraph term average weight avg_weights of the current record, and enter 1-3-2).
(1-3-2) judging whether the current document structure position section is equal to the document structure number section _ id, and if so, entering 1-3-3); otherwise, enter 1-3-8).
(1-3-3) Set the text sql to: SELECT parag_id, (avg_weights - [#]) / POWER(2, ABS(parag_id - [$])) AS weights INTO temp FROM parags_weights WHERE section_id = [%] AND avg_weights > [#]. Convert the term average weight avg_weights, the paragraph number parag_id and the document structure number section_id of the current paragraph into character strings and substitute them for [#], [$] and [%] in the text sql, respectively. By calling the function sql_execute, all paragraphs within the current document structure position whose term average weight is higher than that of the current paragraph are acquired; for each of them, the difference between its term average weight and that of the current paragraph and the absolute value of the difference between its number and that of the current paragraph are calculated, and the weight difference is divided by 2 raised to the power of that absolute value to obtain the accumulated weight that the current paragraph receives from that paragraph. The result is written into the temporary table temp. Enter 1-3-4).
(1-3-4) Set the text sql to: SELECT SUM(weights)/COUNT(weights)/[#] + 1 FROM temp. Convert the term average weight avg_weights of the current paragraph into a character string and substitute it for [#] in the text sql. By calling the function sql_execute, the accumulated weights and the number of paragraphs in the temporary table temp are totaled and averaged, and the average is divided by the term average weight of the current paragraph and increased by 1 to obtain the initial association weight of the current paragraph. Enter 1-3-5).
(1-3-5) judging whether the initial association weight of the current paragraph is greater than 2, if so, modifying the initial association weight to 2; otherwise, not processing; then, enter 1-3-6).
(1-3-6) Set the text sql to: UPDATE parags_weights SET parag_weight = avg_weights * [#] WHERE parag_id = [%]. Convert the initial association weight of the current paragraph and the paragraph number parag_id into character strings and substitute them for [#] and [%] in the text sql, respectively. By calling the function sql_execute, the association weight of the current paragraph is updated from its initial association weight and its term average weight. Then enter 1-3-7).
(1-3-7) Set the text sql to: DROP TABLE temp. By calling the function sql_execute, the temporary table temp is deleted. Enter 1-3-8).
(1-3-8) Judge whether the current record is the last record of the paragraph weight table parags_weights. If so, enter 1-4); otherwise, read the next record as the current record and enter 1-3-1).
(1-4) Set the text sql to: SELECT word_id, parag_weight INTO temp FROM words_list, parags_weights WHERE words_list.parag_id = parags_weights.parag_id. By calling the function sql_execute, the association weight of each paragraph is assigned, by paragraph number, to all terms in that paragraph, and the result is written into the term paragraph weight association table temp. Then enter 1-5).
(1-5) Set the text sql to: SELECT word_id, SUM(parag_weight)/COUNT(parag_weight) AS re_weight INTO words_weights FROM temp GROUP BY word_id. By calling the function sql_execute, the paragraph association weights of each term in the different paragraphs of the document are accumulated from the term paragraph weight association table and divided by the word frequency of the term in the document to obtain the term paragraph association weight, and the result is written into the term document paragraph association weight table words_weights. Then enter 1-6).
(1-6) Output the term document paragraph association weight table words_weights.
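Steps (1-2) to (1-5) can also be sketched end to end in plain Python, without a database. This is a sketch under stated assumptions, not the patented implementation: the function names are hypothetical, a paragraph with no higher-weighted paragraph in its structure position keeps the initial weight, a weight above the threshold is replaced by the threshold, and all average term weights are assumed to be positive.

```python
from collections import defaultdict


def neighbor_association_weights(paragraphs, initial=1.0, threshold=2.0):
    """paragraphs: (parag_id, avg_weight) pairs of ONE structure position."""
    ordered = sorted(paragraphs, key=lambda p: p[1], reverse=True)
    result = {}
    for index, (parag_id, avg) in enumerate(ordered):
        # Paragraphs whose average term weight is strictly higher.
        higher = [p for p in ordered[:index] if p[1] > avg]
        if not higher:
            result[parag_id] = initial  # assumption: nothing to inherit from
            continue
        # Weight difference divided by 2 ** (paragraph-number distance).
        parts = [(h_avg - avg) / 2 ** abs(h_id - parag_id) for h_id, h_avg in higher]
        weight = sum(parts) / len(parts) / avg + initial
        result[parag_id] = min(weight, threshold)
    return result


def term_paragraph_weights(records, initial=1.0, threshold=2.0):
    """records: (word_id, word_weight, section_id, parag_id) rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _word_id, word_weight, section_id, parag_id in records:
        sums[(section_id, parag_id)] += word_weight
        counts[(section_id, parag_id)] += 1
    # Average term weight of every paragraph, grouped by structure position.
    avg = {key: sums[key] / counts[key] for key in sums}
    by_section = defaultdict(list)
    for (section_id, parag_id), value in avg.items():
        by_section[section_id].append((parag_id, value))
    # Association weight of each paragraph times its average term weight.
    fourth = {}
    for section_id, paragraphs in by_section.items():
        weights = neighbor_association_weights(paragraphs, initial, threshold)
        for parag_id, weight in weights.items():
            fourth[(section_id, parag_id)] = weight * avg[(section_id, parag_id)]
    # Average over all occurrences of each term.
    totals, freq = defaultdict(float), defaultdict(int)
    for word_id, _word_weight, section_id, parag_id in records:
        totals[word_id] += fourth[(section_id, parag_id)]
        freq[word_id] += 1
    return {word_id: totals[word_id] / freq[word_id] for word_id in totals}
```

For one structure position with paragraphs 1, 2 and 3 whose average term weights are 4.0, 1.0 and 2.0, the association weights are 1.0, 2.0 and 1.25: paragraph 2 is adjacent to both stronger paragraphs and reaches the threshold, while paragraph 3 is two positions away from paragraph 1 and receives less.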
In this embodiment, when characterizing the document theme, the adjacency between each paragraph and the paragraphs with a high average term weight is taken into account: the paragraph association weights of terms in neighboring paragraphs are raised, so that terms located around the important paragraphs of a document structure position are promoted and highlighted.
In the second embodiment, after a document is processed according to the existing TF-IDF algorithm, the important terms in the document are obtained. The term paragraph association weight table words_weights is then obtained according to the method for obtaining term paragraph association weights of the second embodiment. Finally, the subject words of the document are extracted; these are the n terms with the highest paragraph association weights.
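The final selection of subject words can be sketched as a top-n lookup. The function name and the dictionary layout (word_id mapped to its paragraph association weight) are assumptions made for this example; the TF-IDF preprocessing mentioned above is not shown.

```python
import heapq


def top_subject_words(words_weights, n):
    """Return the word_ids of the n terms with the highest weights.

    words_weights: dict mapping word_id to its paragraph association weight
    (hypothetical in-memory form of the table words_weights).
    """
    return heapq.nlargest(n, words_weights, key=words_weights.get)
```

For the weights `{1: 1.5, 2: 2.0, 3: 0.7}` and n = 2, the call returns the word numbers 2 and 1, in that order.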
In this embodiment, the differences in influence level of several important paragraphs and their adjacency distances are considered simultaneously within the same document structure position, which reflects the joint effect of multiple paragraphs.
This embodiment averages the paragraph association weights of the same term appearing at different document structure positions, and thus comprehensively considers how the same term at different document structure positions contributes differently to characterizing the document theme.
The method of this embodiment is suitable for calculating term weights that reflect differences in document characterization by highlighting the neighbor relations of different paragraphs.
The technical principles of the present invention have been described above in connection with specific embodiments, which are intended to explain the principles of the present invention and should not be construed as limiting the scope of the present invention in any way. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive efforts, which shall fall within the scope of the present invention.

Claims (10)

1. A method for obtaining term paragraph association weight is characterized by comprising the following steps:
a1, acquiring the number of lexical items in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of weights of all lexical items in the paragraph based on a plurality of preset lexical items, the numbers of the document structure positions where the lexical items are located, the numbers of the paragraphs in the document structure positions where the lexical items are located and the weights of the lexical items;
the number of the paragraph corresponds to the sequence of the paragraph in the document structure position where the paragraph is located;
a2, acquiring paragraph association weight of any term in the preset multiple terms based on the number of terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of the weights of all terms in the paragraph.
2. The method according to claim 1, wherein said step a2 comprises:
a2-1, acquiring a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the terms in any paragraph in the document structure position corresponding to the number of the document structure position and the total number of the weights of all terms in the paragraph;
wherein the first value is: the average of the weights of all terms in the passage;
a2-2, acquiring a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the first order is: the first numerical value of the paragraphs in the document structure position corresponding to the number of the document structure position is arranged from high to low;
a2-3, determining a first association weight of any paragraph in the document structure position according to a preset initial value aiming at the document structure position corresponding to the number of the document structure position;
wherein, the first association weight of the paragraph is a preset initial value;
a2-4, acquiring any term paragraph association weight in the preset plurality of terms based on the first numerical value and the first association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position.
3. The method according to claim 2, wherein the step a2-4 comprises:
a2-4-1, acquiring a first absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on a first numerical value of any paragraph in the document structure position corresponding to the number of the document structure position and a first sequence of paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the first absolute value of any paragraph in the document structure position comprises: the absolute values of the differences between the first numerical value of the paragraph and the first numerical value of each paragraph preceding the paragraph in the first order;
a2-4-2, acquiring a second absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the number of the paragraph corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first sequence of the paragraphs in the document structure position corresponding to the number of the document structure position;
wherein the second absolute value of any paragraph in the document structure position comprises: the values of 2^n computed for the paragraph with respect to each paragraph preceding the paragraph in the first order;
wherein n is the absolute value of the difference between the number of the paragraph and the number of the paragraph preceding the paragraph in the first order;
a2-4-3, acquiring a third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position based on the first absolute value and the second absolute value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the third absolute value comprises: the quotients of the first absolute value and the second absolute value of the paragraph with respect to each paragraph preceding the paragraph in the first order;
a2-4-4, acquiring a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position based on the third absolute value of any paragraph in the document structure position corresponding to the number of the document structure position;
wherein the fourth average is: the average of the third absolute values of the paragraph with respect to all paragraphs preceding the paragraph in the first order;
a2-4-5, determining a paragraph association weight of the term based on a fourth average value corresponding to any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph, the number of the document structure position where the term is located, and the number of the paragraph in the document structure position where the term is located.
4. The method of claim 3, wherein the step A2-4-5 comprises:
a2-4-5-1, determining a second association weight of any paragraph in the document structure position corresponding to the number of the document structure position based on a corresponding fourth average value of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value of the paragraph;
wherein the second association weight of the paragraph is: the sum of the first association weight of the paragraph and the quotient of the fourth average value of the paragraph divided by the first numerical value of the paragraph;
a2-4-5-2, determining a third association weight value of any paragraph in the document structure position corresponding to the number of the document structure position based on the second association weight of any paragraph in the document structure position corresponding to the number of the document structure position and a preset threshold;
a2-4-5-3, determining a fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position based on the third association weight of any paragraph in the document structure position corresponding to the number of the document structure position and the first numerical value;
wherein the fourth association weight of any term in the paragraph is: the product of the third association weight and the first numerical value of the paragraph in which the term is located;
a2-4-5-4, obtaining paragraph association weight of the term based on the fourth association weight of any term in any paragraph in the document structure position corresponding to the number of the document structure position and the number of terms in the paragraph corresponding to the number of any paragraph in the document structure position corresponding to the number of the document structure position.
5. The method of claim 4, wherein the step A2-4-5-2 comprises:
a2-4-5-2-1, comparing the second association weight of the paragraph with the preset threshold to obtain a judgment result;
a2-4-5-2-2, determining a third associated weight value of the paragraph based on the judgment result.
6. The method of claim 5, wherein the step a2-4-5-2-2 comprises:
if the judgment result is that the second association weight of the paragraph is greater than the preset threshold, determining that the third association weight of the paragraph is the preset threshold;
and if the judgment result is that the second association weight of the paragraph is smaller than the preset threshold, determining that the third association weight value of the paragraph is the same as the second association weight value of the paragraph.
7. The method according to claim 5 or 6, wherein the predetermined threshold is 2.
8. The method of claim 7, wherein the step a2-4-5-4 comprises:
a2-4-5-4-1, obtaining a total value of fourth association weights of any term in any paragraph in a document structure position corresponding to the number of the document structure position based on the fourth association weight of any term;
a2-4-5-4-2, acquiring the number of any term in a plurality of preset terms based on the preset terms, the number of the document structure position where the term is located and the number of the paragraph in the document structure position where the term is located;
a2-4-5-4-3, determining the paragraph association weight of any term based on the total value of all fourth association weights of the term and the number of occurrences of the term among the plurality of terms;
wherein the paragraph association weight is the average of all fourth association weights of the term.
9. The method of claim 2, wherein the predetermined initial value is 1.
10. An apparatus for obtaining term paragraph association weights, wherein the apparatus for obtaining term document paragraph association weights stores computer instructions; the computer instructions cause the means for obtaining term document paragraph association weights to perform the method for obtaining term paragraph association weights as claimed in any one of claims 1 to 9.
CN202010274876.0A 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight Active CN111611342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010274876.0A CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010274876.0A CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Publications (2)

Publication Number Publication Date
CN111611342A true CN111611342A (en) 2020-09-01
CN111611342B CN111611342B (en) 2023-04-18

Family

ID=72201801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010274876.0A Active CN111611342B (en) 2020-04-09 2020-04-09 Method and device for obtaining lexical item and paragraph association weight

Country Status (1)

Country Link
CN (1) CN111611342B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
US20120109624A1 (en) * 2010-11-03 2012-05-03 Institute For Information Industry Text conversion method and text conversion system
US9201876B1 (en) * 2012-05-29 2015-12-01 Google Inc. Contextual weighting of words in a word grouping
CN105426379A (en) * 2014-10-22 2016-03-23 武汉理工大学 Keyword weight calculation method based on position of word
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information
CN106845265A (en) * 2016-12-01 2017-06-13 北京计算机技术及应用研究所 A kind of document security level automatic identifying method
WO2018121145A1 (en) * 2016-12-30 2018-07-05 北京国双科技有限公司 Method and device for vectorizing paragraph
CN109766408A (en) * 2018-12-04 2019-05-17 上海大学 The text key word weighing computation method of comprehensive word positional factor and word frequency factor

Also Published As

Publication number Publication date
CN111611342B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Deng Jiqiu, Lu Biyu, Liu Wenyi, Li Chenhan, He Meixiang
Inventor before: Deng Jiqiu, Lu Biyu, Li Chenhan
GR01 Patent grant