CN115879442A

CN115879442A - Method and system for dynamically calculating weight of keyword

Info

Publication number: CN115879442A
Application number: CN202111153756.6A
Authority: CN
Inventors: 王军华; 周健
Original assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Current assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2023-03-31

Abstract

The invention provides a method and a system for dynamically calculating keyword weight. The method comprises the following steps: acquiring text data, and segmenting the text with a word segmentation tool to obtain a word segmentation list; loading a preset stop word list to remove stop words from the word segmentation list; counting and recording the position of each participle in the text, and dividing the whole text into three parts, namely a first segment, a middle segment and a tail segment, according to the total text length and the total number of paragraphs; judging whether the position where each participle first appears is located in the first segment, the middle segment or the tail segment; and obtaining the weight value corresponding to each keyword according to whether the participle appears in the first sentence, whether the participle is related to the title, and whether continuous participles exist. The invention fully considers the importance carried by participles at different positions and remedies the defect that TF-IDF and similar methods cannot take the weight of a specific position into account.

Description

Method and system for dynamically calculating weight of keyword

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a system for dynamically calculating keyword weight.

Background

The keyword extraction technology is a relatively basic part in a natural language processing task and is applied to tasks such as text similarity, text classification, personalized recommendation, text summarization and the like.

TF-IDF derives keyword weights from word frequency. Term frequency (TF) is the number of times a given entry appears in a document; the idea is that the more often an entry t occurs in a document, the better it reflects the theme of the document and the larger its weight. Inverse document frequency (IDF) works across the corpus: the fewer documents contain the entry t, the larger its IDF and the better the entry distinguishes between categories. The IDF of a particular entry is obtained by dividing the total number of documents by the number of documents that contain the entry and taking the logarithm of the quotient. The method calculates the TF of each participle in the text, then the IDF of each participle over the whole corpus, and finally multiplies the two to obtain the weight value, as follows:

TF formula: TF = number of occurrences of entry t/total number of entries in document

IDF formula: IDF = log (total number of documents in corpus/(number of documents containing entry t + 1))

TF-IDF formula: TF-IDF = TF x IDF
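The three formulas above can be illustrated with a minimal Python sketch. The function name `tf_idf` and the toy corpus are assumptions introduced only for illustration; the documents are taken to be already segmented into word lists.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using the formulas above."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / total                             # TF formula
            idf = math.log(n_docs / (doc_freq[term] + 1))  # IDF formula (+1 in the denominator)
            doc_weights[term] = tf * idf                   # TF-IDF formula
        weights.append(doc_weights)
    return weights


# Hypothetical, already-segmented toy corpus.
corpus = [
    ["keyword", "weight", "keyword"],
    ["text", "position"],
    ["text", "summary"],
    ["title", "summary"],
]
scores = tf_idf(corpus)
```

Note that with the "+ 1" in the denominator, an entry that occurs in every document receives a slightly negative IDF, and that nothing in the calculation depends on where in the document the entry occurs.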

The TF-IDF method measures the importance of a word by word frequency alone, which is not comprehensive enough: an important word does not necessarily occur many times. Moreover, the algorithm cannot reflect the position information of words; it treats words that appear early in the text and words that appear late as equally important, which is clearly incorrect.

The TextRank algorithm is a graph-based ranking algorithm for keyword extraction and text summarization. It generates a keyword list from the relationships between neighboring words (co-occurrence windows) and then ranks the candidate keywords by iterative calculation. It comprises the following steps:

(1) Segment the article: divide the given text into a plurality of sentences by splitting on punctuation or spaces;

Text = [S1, S2, ..., Sn]

(2) Retain keywords: for each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words with specified parts of speech, such as nouns, verbs and adjectives; these are the candidate keywords.

Si = [W1, W2, ..., Wn]

(3) Select a co-occurrence window size n;

(4) Regenerate the keyword lists of all sentences in Text according to the co-occurrence window;

[W1, W2, ..., Wn], [W2, W3, ..., Wn+1]

(5) Calculate the co-occurrence, i.e. the confidence, between each pair of words.

(6) Initialize the word co-occurrence square matrix M and the mean value matrix U;

$U_0 = [1/n, 1/n, \ldots, 1/n]$

$U_n = \alpha M^{T} U_{n-1} + (1 - \alpha) U_0$

where M is the co-occurrence matrix, which can also be understood as a transition probability matrix.

(7) Iteratively propagate the weight of each node according to the above formula until convergence.

(8) Sort the node weights in descending order to obtain the N most important words as candidate keywords.
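Steps (3) to (8) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `textrank`, the damping value 0.85, the convergence tolerance and the toy token list are not taken from the text, and sentence splitting and part-of-speech filtering (steps (1) and (2)) are assumed to have been done already.

```python
def textrank(words, window=3, alpha=0.85, tol=1e-8, max_iter=200):
    """Rank words by iterating U_n = alpha * M^T U_{n-1} + (1 - alpha) * U_0."""
    vocab = sorted(set(words))
    n = len(vocab)

    # Symmetric co-occurrence counts for words that share a sliding window.
    cooc = {}
    for i, a in enumerate(words):
        for b in words[i + 1:i + window]:
            if a != b:
                cooc[(a, b)] = cooc.get((a, b), 0.0) + 1.0
                cooc[(b, a)] = cooc.get((b, a), 0.0) + 1.0

    # Row totals turn counts into transition probabilities M[a][b].
    out_total = {a: sum(cooc.get((a, b), 0.0) for b in vocab) for a in vocab}

    u0 = {w: 1.0 / n for w in vocab}  # uniform start vector U_0
    u = dict(u0)
    for _ in range(max_iter):
        new_u = {}
        for b in vocab:
            # (M^T U)[b]: probability mass flowing into b from its neighbours.
            incoming = sum(
                u[a] * cooc.get((a, b), 0.0) / out_total[a]
                for a in vocab
                if out_total[a] > 0
            )
            new_u[b] = alpha * incoming + (1 - alpha) * u0[b]
        delta = sum(abs(new_u[w] - u[w]) for w in vocab)
        u = new_u
        if delta < tol:
            break

    # Descending order of node weight, as in step (8).
    return sorted(u.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate keywords after stop-word and part-of-speech filtering.
tokens = ["keyword", "weight", "keyword", "position", "keyword", "title"]
ranking = textrank(tokens, window=2)
```

In this toy input every other word co-occurs only with "keyword", so that word collects the largest score. The ranking depends only on co-occurrence, not on where in the text a word appears.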

However, the TextRank algorithm has the following disadvantages:

1) The result is strongly affected by word segmentation and text cleaning; for example, whether certain stop words are retained directly affects the final result.

2) Although word frequency is utilized, as compared with TF-IDF, the method is still influenced by high-frequency words; filtering that combines part of speech and word frequency is therefore needed to achieve a better effect, but part-of-speech tagging is itself a problem.

3) The algorithm cannot take the text position information of the participles into account.

Disclosure of Invention

In view of the above, the invention improves weight calculation methods such as TF-IDF on the basis of specific position information, and remedies the defect that TF-IDF and similar methods cannot adequately account for weights at the text semantic level.

Based on the above purpose, the present invention provides a method for dynamically calculating keyword weight, comprising:

acquiring text data, and segmenting the text with a word segmentation tool to obtain a word segmentation list;

loading a preset stop word list to perform stop word removal processing on the word segmentation list;

counting and recording the position of each participle in the text, and dividing the whole text into three parts, namely a first segment, a middle segment and a tail segment, according to the total text length and the total number of paragraphs;

judging whether the position where the participle first appears is located in the first segment, the middle segment or the tail segment;

and obtaining the weight value corresponding to each keyword according to whether the participle appears in the first sentence, whether the participle is related to the title, and whether continuous participles exist.

Further, the word segmentation tool is jieba.

Further, the position of the participle in the text is a subscript of each participle in the text.

Further, the whole text is equally divided into three parts, namely a first section, a middle section and a tail section according to the subscript.

Further, whether the participle appears in the first sentence is judged by whether the participle in the first segment lies within the first 1/3 of that segment; if so, the participle appears in the first sentence, otherwise it does not.

Further, whether the participle is related to the title is judged by whether the participle has a similarity greater than 0.6 with any participle appearing in the title; if such a similarity exists, the participle is related to the title, otherwise it is not.

Further, whether continuous participles exist is judged by calculating whether the distance between the text positions of two participles is less than 4 times the participle length.

Further, the method further comprises the following steps:

and selecting a preset number of words as the text keywords in descending order of weight value.
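The three tests described above (first sentence, title relatedness, continuity) might be expressed as small predicates like the ones below. This is a sketch under several assumptions that the text does not settle: positions are token subscripts, the "first 1/3" is measured within the segment that contains the participle, similarity is cosine similarity between word vectors (which a Word2vec model would supply in practice), and all function names are hypothetical.

```python
import math


def in_first_sentence(index, segment_start, segment_end):
    """True if `index` falls in the first third of [segment_start, segment_end)."""
    segment_length = segment_end - segment_start
    return segment_start <= index < segment_start + segment_length / 3


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_to_title(word_vec, title_vecs, threshold=0.6):
    """True if the word is more than `threshold` similar to any title word."""
    return any(cosine(word_vec, t) > threshold for t in title_vecs)


def is_continuous(pos_a, pos_b, word_length):
    """True if two positions lie less than 4 word lengths apart."""
    return abs(pos_a - pos_b) < 4 * word_length
```

For example, `in_first_sentence(2, 0, 9)` holds because subscript 2 lies in the first three positions of a nine-token segment, while `is_continuous(10, 18, 2)` does not hold because the distance 8 is not strictly less than 4 x 2.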

Based on the above object, the present invention further provides a system for dynamically calculating keyword weight, comprising:

the word segmentation module is used for acquiring text data and segmenting the text with a word segmentation tool to obtain a word segmentation list;

the stop word module is used for loading a preset stop word list to perform stop word removal processing on the word segmentation list;

the position counting module is used for counting and recording the position of each participle in the text, and dividing the whole text into three parts, namely a first section, a middle section and a tail section, according to the total text length and the total number of paragraphs;

the position judging module is used for judging whether the position where the participle first appears is located in the first section, the middle section or the tail section;

and the weight calculation module is used for obtaining the weight value corresponding to each keyword according to whether the participle appears in the first sentence, whether the participle is related to the title and whether continuous participles exist.

In general, the advantage of the invention and the benefit it brings to the user is that it addresses the problem that TF-IDF and similar methods cannot take the weight value of a specific position into account.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:

Fig. 1 illustrates a flowchart of a method of dynamically calculating keyword weights according to an embodiment of the present invention.

Fig. 2 is a diagram illustrating a specific implementation process of a method for dynamically calculating a keyword weight according to an embodiment of the present invention.

Fig. 3 is a block diagram illustrating a system for dynamically calculating a keyword weight according to an embodiment of the present invention.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

In the prior art, the weight value is expressed only through the frequency of a participle in the document, which cannot reflect the position information of the participle, even though the importance of the information carried by a participle differs with its position in an article. The invention makes full use of the position information of participles, obtains the weight value by hierarchical calculation, and remedies the defect that position information cannot be considered in the prior art.

Keyword terminology and technical abbreviations

Keyword terms:

word2vec: model for representing word vector through deep neural network training

Fig. 1 illustrates a flowchart of a method of dynamically calculating keyword weights according to an embodiment of the present invention. The method comprises the following steps:

step 101: acquiring text data, and segmenting the text with a word segmentation tool to obtain a word segmentation list;

step 102: loading a preset stop word list to perform stop word removal processing on the word segmentation list;

step 103: counting and recording the position of each participle in the text, and dividing the whole text into three parts, namely a first segment, a middle segment and a tail segment, according to the total text length and the total number of paragraphs;

step 104: judging whether the position where the participle first appears is located in the first segment, the middle segment or the tail segment;

step 105: and obtaining the weight value corresponding to each keyword according to whether the participle appears in the first sentence, whether the participle is related to the title or not and whether continuous participles exist or not.
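Steps 101 and 102 could look roughly like the following sketch. The whitespace tokenizer is only a stand-in chosen so that the example runs without extra packages; the function names and the sample stop words are assumptions for illustration.

```python
def segment(text):
    """Stand-in tokenizer that splits on whitespace.

    A real pipeline would call a word segmentation tool here
    (for example jieba.lcut(text) for Chinese text).
    """
    return text.split()


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the preset stop-word list."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical stop-word list; in practice the lines would come from a file.
stop_words = load_stop_words(["the\n", "of\n", "\n"])
tokens = remove_stop_words(segment("the weight of the keyword"), stop_words)
```

Passing an open file object to `load_stop_words` works as well, since it only iterates over lines.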

Fig. 2 is a diagram illustrating a specific implementation process of a method for dynamically calculating a keyword weight according to an embodiment of the present invention. Text data is read and segmented with a word segmentation tool such as jieba, and a prepared stop word list is loaded to remove stop words from the word segmentation list. The position of each participle in the text (namely, the subscript of each participle in the text) is then counted and recorded, and the whole text is divided into three parts, namely a first section, a middle section and a tail section, according to the total text length and the total number of paragraphs. Each participle is assigned to the first, middle or tail section according to the position where it first appears. The weight values of the participles in the first, middle and tail sections are then calculated respectively according to the processing flow in Fig. 2.
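The position bookkeeping in this paragraph can be sketched as below, assuming that the text is divided equally into thirds by token subscript; how the total number of paragraphs enters the division is not spelled out, so it is left out here. The function names and the sample token list are hypothetical.

```python
def first_positions(tokens):
    """Map each distinct token to the subscript of its first occurrence."""
    first = {}
    for index, token in enumerate(tokens):
        first.setdefault(token, index)
    return first


def segment_of(index, total):
    """Return 'first', 'middle' or 'tail' for a subscript in a list of `total` tokens."""
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if index * 3 < total:
        return "first"
    if index * 3 < 2 * total:
        return "middle"
    return "tail"


# Hypothetical token list of length 9, so each third holds three subscripts.
sample = ["a", "b", "c", "a", "d", "e", "b", "f", "g"]
positions = first_positions(sample)
sections = {token: segment_of(index, len(sample)) for token, index in positions.items()}
```

Because only the first occurrence counts, the second "a" at subscript 3 does not move "a" out of the first section.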

Keywords in the first section are judged in 3 stages according to Fig. 2. First, it is judged whether the participle appears in the first sentence; the criterion is whether the participle in the first section lies within the first 1/3 of the paragraph (again measured by the subscript of the participle). If it is in the first sentence, the corresponding weight value is given directly, for example 0.61. If not, the second step is executed: it is judged whether the participle is related to the title; the criterion is whether the participle has a similarity greater than 0.6 with a participle appearing in the title (the similarity value is calculated by a Word2vec model). If so, the corresponding weight value is given directly, for example 0.82. Otherwise, the third step is executed: it is judged whether the participles in the first section are continuous; the criterion is whether the distance between the text positions of two participles is less than 4 times the participle length. If they are continuous, the corresponding weight value is given directly, for example 0.7; if not, the weight value shown in Fig. 2 is given directly, for example 0.5.

Keywords in the middle section are judged in 3 stages according to Fig. 2. First, it is judged whether the participle appears in the first sentence; the criterion is whether the participle in the middle section lies within the first 1/3 of the paragraph (again measured by the subscript of the participle). If it is in the first sentence, the corresponding weight value is given directly, for example 0.53. If not, the second step is executed: it is judged whether the participle is related to the title; the criterion is whether the participle has a similarity greater than 0.6 with a participle appearing in the title (the similarity value is calculated by a Word2vec model). If so, the corresponding weight value is given directly, for example 0.68. Otherwise, the third step is executed: it is judged whether the participles in the middle section are continuous; the criterion is whether the distance between the text positions of two participles is less than 4 times the participle length. If they are continuous, the corresponding weight value is given directly, for example 0.6; if not, the weight value shown in Fig. 2 is given directly, for example 0.32.

Keywords in the tail section are judged in 3 stages according to Fig. 2. First, it is judged whether the participle appears in the first sentence; the criterion is whether the participle in the tail section lies within the first 1/3 of the paragraph (again measured by the subscript of the participle). If it is in the first sentence, the corresponding weight value is given directly, for example 0.48. If not, the second step is executed: it is judged whether the participle is related to the title; the criterion is whether the participle has a similarity greater than 0.6 with a participle appearing in the title (the similarity value is calculated by a Word2vec model). If so, the corresponding weight value is given directly, for example 0.36. Otherwise, the third step is executed: it is judged whether the participles in the tail section are continuous; the criterion is whether the distance between the text positions of two participles is less than 4 times the participle length. If they are continuous, the corresponding weight value is given directly, for example 0.65; if not, the weight value shown in Fig. 2 is given directly, for example 0.25.
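The three-stage judgment for the three sections can be summarized as a lookup table plus a short cascade. The sketch below uses the example weight values quoted above; the names `EXAMPLE_WEIGHTS` and `keyword_weight` are hypothetical, and the three boolean inputs are assumed to have been computed beforehand.

```python
# Example weights quoted in the description, keyed by section and by the first
# test that succeeds; "default" applies when none of the three tests succeeds.
EXAMPLE_WEIGHTS = {
    "first": {"first_sentence": 0.61, "title": 0.82, "continuous": 0.70, "default": 0.50},
    "middle": {"first_sentence": 0.53, "title": 0.68, "continuous": 0.60, "default": 0.32},
    "tail": {"first_sentence": 0.48, "title": 0.36, "continuous": 0.65, "default": 0.25},
}


def keyword_weight(section, in_first_sentence, related_to_title, continuous,
                   weights=EXAMPLE_WEIGHTS):
    """Return the weight of a participle; the tests are tried in the stated order."""
    table = weights[section]
    if in_first_sentence:
        return table["first_sentence"]
    if related_to_title:
        return table["title"]
    if continuous:
        return table["continuous"]
    return table["default"]
```

Since the cascade stops at the first test that succeeds, a first-section participle that is in the first sentence receives 0.61 even when it is also related to the title, which on its own would give 0.82.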

Finally, the participles are sorted by weight value, and the top K words are selected as the text keywords.
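Selecting the top K words is a straightforward sort; a minimal sketch using Python's standard `heapq` module is shown below. The function name and the sample weights are assumptions for illustration.

```python
import heapq


def top_k_keywords(weights, k):
    """Return the k (word, weight) pairs with the largest weights, highest first."""
    return heapq.nlargest(k, weights.items(), key=lambda item: item[1])


# Hypothetical weights produced by the judgment stages.
sample_weights = {"title": 0.82, "tail": 0.25, "sentence": 0.61, "plain": 0.5}
keywords = top_k_keywords(sample_weights, 2)
```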

In addition to the improvement strategy provided by the invention, after word segmentation with the jieba word segmentation tool, the position information of each participle can be marked by calculating its subscript after segmentation, the semantic similarity between every participle and the participles appearing in the title can be calculated by a Word2vec model, and that similarity can be taken as the weight value of each participle, which achieves an improvement effect similar to that of the invention.
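This similarity-based variant might be sketched as follows. The text does not say how similarities to several title words are combined, so taking the maximum is an assumption, as are the function names and the two-dimensional toy vectors; in practice the vectors would come from a trained Word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def title_similarity_weights(word_vectors, title_words):
    """Weight each word by its highest similarity to any word in the title."""
    title_vecs = [word_vectors[t] for t in title_words if t in word_vectors]
    return {
        word: max((cosine(vec, t) for t in title_vecs), default=0.0)
        for word, vec in word_vectors.items()
    }


# Hypothetical toy vectors standing in for Word2vec embeddings.
vectors = {"keyword": [1.0, 0.0], "weight": [0.8, 0.6], "banana": [0.0, 1.0]}
weights = title_similarity_weights(vectors, ["keyword"])
```

With these toy vectors, "weight" receives a similarity of about 0.8 to the title word and "banana" receives 0.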

An embodiment of the present application provides a system for dynamically calculating a keyword weight, which is used for executing the method for dynamically calculating a keyword weight according to the above embodiment. As shown in Fig. 3, the system includes:

the word segmentation module 501 is configured to obtain text data, and perform word segmentation on the text through a word segmentation tool to obtain a word segmentation list;

a stop word removing module 502, configured to load a preset stop word list to perform stop word removing processing on the word segmentation list;

the position counting module 503 is configured to count and record the position of each participle in the text, and divide the entire text into three parts, namely a first section, a middle section, and a tail section, according to the total text length and the total number of paragraphs;

a position determining module 504, configured to determine whether the position where the participle first appears is located in the first section, the middle section, or the tail section;

and the weight calculating module 505 is configured to obtain a weight value corresponding to each keyword according to whether a participle appears in the first sentence, whether the participle is related to the title, and whether continuous participles exist.

The system for dynamically calculating the keyword weight provided by the above embodiment of the present invention is based on the same inventive concept as the method for dynamically calculating the keyword weight provided by the embodiment of the present invention, and has the same beneficial effects as the method it adopts, runs or implements.

The embodiment of the invention also provides an electronic device corresponding to the method for dynamically calculating keyword weight provided by the foregoing embodiment, so as to execute that method. The embodiments of the present invention are not limited in this respect.

Referring to Fig. 4, a schematic diagram of an electronic device according to some embodiments of the invention is shown. As shown in Fig. 4, the electronic device 2 includes a processor 200, a memory 201, a bus 202 and a communication interface 203, wherein the processor 200, the communication interface 203 and the memory 201 are connected through the bus 202; the memory 201 stores a computer program that can be executed on the processor 200, and the processor 200 executes the method for dynamically calculating keyword weight according to any of the foregoing embodiments when running the computer program.

The Memory 201 may include a Random Access Memory (RAM) and a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.

Bus 202 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 201 is used for storing a program, and the processor 200 executes the program after receiving an execution instruction, and the method for dynamically calculating the keyword weight disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 200, or implemented by the processor 200.

The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 200. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the method in combination with its hardware.

The electronic device provided by the embodiment of the invention and the method for dynamically calculating the keyword weight provided by the embodiment of the invention have the same inventive concept and have the same beneficial effects as the method adopted, operated or realized by the electronic device.

Referring to fig. 5, the computer readable storage medium is an optical disc 30, and a computer program (i.e., a program product) is stored thereon, and when being executed by a processor, the computer program performs the method for dynamically calculating the keyword weight according to any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.

The computer-readable storage medium provided by the above-mentioned embodiment of the present invention and the method for dynamically calculating the keyword weight provided by the embodiment of the present invention have the same beneficial effects as the method adopted, operated or implemented by the application program stored in the computer-readable storage medium.

It should be noted that:

the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Moreover, those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments, not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a system for dynamically calculating keyword weight according to embodiments of the present invention. The present invention may also be embodied as apparatus or system programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several systems, several of these systems can be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for dynamically calculating keyword weight, comprising:

loading a preset stop word list to perform stop word removal processing on the word segmentation list;

counting and recording the position of each participle in the text, and dividing the whole text into three parts, namely a first section, a middle section and a tail section, according to the total text length and the total number of paragraphs;

judging whether the position where the participle first appears is located in the first section, the middle section or the tail section;

2. The method of claim 1,

the word segmentation tool is jieba.

3. The method of claim 2,

the position of the participle in the text is a subscript of each participle in the text.

4. The method of claim 3,

and equally dividing the whole text into three parts, namely a first section, a middle section and a tail section according to the subscript.

5. The method of claim 4,

judging whether the participle appears in the first sentence by judging whether the participle in the first section lies within the first 1/3 of that section; if so, the participle appears in the first sentence, otherwise the participle does not appear in the first sentence.

6. The method of claim 5,

judging whether the participle is related to the title by judging whether the participle has a similarity greater than 0.6 with a participle appearing in the title; if so, the participle is related to the title, otherwise the participle is not related to the title.

7. The method of claim 6,

judging whether continuous participles exist by calculating whether the distance between the text positions of two participles is less than 4 times the participle length.

8. The method of claim 7, further comprising:

9. A system for dynamically calculating keyword weights, comprising:

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any one of claims 1-8.