WO2022105178A1

WO2022105178A1 - Keyword extraction method and related device

Info

Publication number: WO2022105178A1
Application number: PCT/CN2021/097021
Authority: WO
Inventors: 李弦; 阮晓雯; 徐亮; 洪博然
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-11-23
Filing date: 2021-05-29
Publication date: 2022-05-27
Also published as: CN112417101B; CN112417101A

Abstract

A keyword extraction method and a related device. The method comprises: performing keyword extraction on a text file to be processed, to obtain a first keyword set (101); calculating the frequency with which each keyword in the first keyword set occurs in a corpus (102); taking keywords in the first keyword set that have an occurrence frequency in the corpus that is lower than a first threshold as a second keyword set (103), a text file contained in the corpus and the text file to be processed belonging to the same file type; and taking the second keyword set as a keyword set of the text file to be processed (104). According to the method and the device, the accuracy of keyword extraction is improved by improving the keyword selection method during keyword extraction.

Description

Keyword extraction method and related device

This application claims priority to Chinese patent application No. 202011321892.7, filed on November 23, 2020 and entitled "A method and related device for keyword extraction", the entire contents of which are incorporated by reference into this application.

Technical field

The embodiments of the present application relate to the field of natural language processing, and in particular, to a keyword extraction method and related apparatus.

Background

Keywords are the terms a given medium uses when building and using indexes, and keyword extraction from text files has long been a research hotspot in the industry. The term frequency-inverse document frequency (TF-IDF) method extracts the keywords of a text file based on word frequency: first, the text file is segmented into words; next, the term frequency and the inverse document frequency of each word segment are counted; finally, the term frequency is multiplied by the inverse document frequency to give the weight value of the word segment. The word segments are sorted by weight value from large to small, and the top-ranked word segments can be used as the keywords of the text file.

The inventor realized that in the above TF-IDF method, the importance of a word is proportional to the number of times it appears in the text file and inversely proportional to the number of times it appears in the articles of the corpus. When a corpus for the relevant field is lacking, the keywords extracted by the TF-IDF method may not be representative.

Summary of the invention

The embodiment of the present application discloses a method for extracting keywords and a related device. By improving the method for selecting keywords during keyword extraction, the accuracy of keyword extraction is improved.

In a first aspect, an embodiment of the present application discloses a keyword extraction method, including:

performing keyword extraction on the text file to be processed to obtain a first keyword set;

counting the frequency with which each keyword in the first keyword set occurs in a corpus, and taking the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, where the text files contained in the corpus belong to the same file type as the text file to be processed; and

using the second keyword set as the keyword set of the text file to be processed.

In a second aspect, an embodiment of the present application discloses a keyword extraction device, including:

an extraction unit, configured to perform keyword extraction on the text file to be processed to obtain a first keyword set;

a statistical unit, configured to count the frequency with which each keyword in the first keyword set occurs in a corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed; and

a determining unit, configured to take the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, and to use the second keyword set as the keyword set of the text file to be processed.

In a third aspect, an embodiment of the present application discloses a server, comprising: a processor and a memory, wherein a computer program is stored in the memory, and the processor calls the computer program stored in the memory to execute the following method:

In a fourth aspect, an embodiment of the present application discloses a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed on one or more processors, the following method is performed:

The embodiments of the present application first improve the method for selecting keywords, then optimize the ranking of the keywords in the keyword set by considering how the keywords are distributed in the text file to be processed, and finally combine keywords with adjacent word segments in the order in which they appear in the text file to be processed, so that compound words that were split by word segmentation are extracted. The accuracy of keyword extraction is thereby improved.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application or the background technology, the following briefly introduces the accompanying drawings that are required in the embodiments or the background technology of the present application.

FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application;

FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application;

FIG. 3 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a device for keyword extraction disclosed in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a server disclosed in an embodiment of the present application.

Detailed description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be further described below with reference to the accompanying drawings.

The terms "first" and "second" in the description, claims and drawings of the present application are only used to distinguish different objects, rather than to describe a specific order. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to such a process, method, product or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the above phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are they separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

In this application, "at least one (item)" means one or more, "plurality" means two or more, and "at least two (items)" means two, or three or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refers to any combination of these items. For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c".

The technical solutions of the present application relate to the technical field of artificial intelligence and/or big data, such as natural language processing technology. The present application can be applied to scenarios such as text processing to realize keyword extraction, so as to improve the accuracy of keyword extraction, thereby promoting the construction of smart cities. Optionally, the data involved in this application, such as text files and/or keywords, etc., may be stored in a database, or may be stored in a blockchain, such as distributed storage through a blockchain, which is not limited in this application.

The embodiments of the present application are applicable to extracting keywords from a text file to be processed when few text files of the same field type are available, that is, when a corpus is lacking. The present application optimizes the traditional keyword extraction method TF-IDF and improves the accuracy of keyword extraction by improving how keywords are selected during extraction. To describe the solution of the present application more clearly, some knowledge related to the TF-IDF method is introduced below.

Corpus: refers to a large-scale electronic text database that has been sampled and processed, that is, a database that stores text files.

Term frequency (TF): the frequency with which a given word appears in the current text file. Since the same word may occur more often in a long file than in a short one, the count of a given word needs to be normalized by the length of the text file. The term frequency is therefore the number of occurrences of the given word in the current text file divided by the total number of words in the current text file, and the formula for term frequency can be expressed as:
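
Written out from the definition above, with $n_{w,d}$ denoting the number of occurrences of word $w$ in the current text file $d$ (the symbols are notation assumed here), the term frequency is:

$$\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}$$

where the denominator sums the occurrences of all words $k$ in $d$, that is, the total number of words in the current text file.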

Inverse document frequency (IDF): a measure of the general importance of a given word. If a given word appears in only a few text files of the corpus, it better represents the gist of a text file and its weight value should be larger; if it appears in a large number of text files of the corpus, it cannot represent the main idea of a text file, that is, it does not clearly characterize the content, and its weight value should be smaller. The formula for the inverse document frequency can be expressed as:
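
One commonly used form, given here as an assumption because smoothing conventions differ between implementations, is:

$$\mathrm{IDF}(w) = \log \frac{N}{1 + \lvert \{ d \in D : w \in d \} \rvert}$$

where $D$ is the corpus, $N$ is the total number of text files in $D$, and the set in the denominator contains the text files in which word $w$ appears; adding 1 avoids division by zero when $w$ appears in no text file.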

In the TF-IDF keyword extraction method, the importance of a word is proportional to the number of times it appears in the current text file and inversely proportional to its frequency in the text files of the corpus. If a given word appears frequently in the text file to be processed and rarely in the text files of the corpus, the word better represents the meaning of the current text file and can serve as a keyword of that text file.

When the TF-IDF method is used to extract keywords from the text file to be processed, the following steps are usually followed:

1. Preprocess the text file to be processed.

Preprocessing the text file to be processed includes word segmentation, part-of-speech tagging, and stop-word removal. Many word segmentation tools are available, including jieba ("stuttering") word segmentation and Pangu word segmentation. Here the most commonly used one, jieba, can be used to segment the text file to be processed. jieba performs efficient word-graph scanning based on a prefix dictionary, generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence, and then uses dynamic programming to find the maximum-probability path, that is, the most probable segmentation based on word frequency. Because jieba is a very typical word segmentation tool, its specific principle is not repeated here.

Part-of-speech tagging adds an appropriate part-of-speech tag to each word segment to facilitate analysis of the sentence. Stop words are then removed from the word segment set, for example by retaining only the word segments whose parts of speech are nouns, proper nouns, or verbs. Because part-of-speech tagging and stop-word removal are very typical processing steps, their specific principles are not repeated here.

In this way, a candidate keyword set containing n word segments can be obtained, denoted as:

2. Calculate the term frequency TF, in the text file to be processed, of each word segment in the candidate keyword set.

3. Calculate the inverse document frequency IDF, in the entire corpus, of each word segment in the candidate keyword set.

4. Multiply TF by IDF to obtain the TF-IDF value of each word segment in the candidate keyword set.

5. Sort the word segments in the candidate keyword set by TF-IDF value from large to small; the top-ranked word segments can be used as the keywords of the text file to be processed.
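
Steps 2 to 5 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular tool: the function name tf_idf_keywords, the toy data, and the smoothed IDF form log(N / (1 + df)) are assumptions, and the input is taken to be already segmented and filtered as in step 1.

```python
import math
from collections import Counter


def tf_idf_keywords(segments, corpus, top_k=5):
    """Rank the word segments of one document by TF-IDF.

    segments: preprocessed word segments of the text file to be processed.
    corpus:   list of documents, each given as a list of word segments.
    """
    counts = Counter(segments)
    total = len(segments)
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    scores = {}
    for word, count in counts.items():
        tf = count / total                        # step 2: term frequency
        df = sum(1 for doc in doc_sets if word in doc)
        idf = math.log(n_docs / (1 + df))         # step 3: assumed smoothed IDF
        scores[word] = tf * idf                   # step 4: TF-IDF value
    # Step 5: sort from large to small and keep the top-ranked segments.
    ranked = sorted(scores, key=lambda w: scores[w], reverse=True)
    return ranked[:top_k]


# Hypothetical toy data: one segmented document and a four-document corpus.
doc = ["trade", "tariff", "enterprise", "tariff", "policy"]
corpus = [
    ["enterprise", "tax", "policy"],
    ["enterprise", "policy", "notice"],
    ["enterprise", "tax", "report"],
    ["tariff", "trade"],
]
top_two = tf_idf_keywords(doc, corpus, top_k=2)
```

In this toy corpus "enterprise" occurs in three of the four documents, so its IDF is zero and it drops out of the top positions even though it appears in the document.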

The TF-IDF method relies on the corpus. When few text files belong to the same field as the text file to be processed, the corpus as a whole consists largely of text files unrelated to that field, and the keywords extracted by the TF-IDF method are likely to be unrepresentative in the field to which the text file to be processed belongs. The present application provides a new keyword extraction method. First, the method for selecting keywords is improved to filter out keywords that are not representative in the field to which the text file to be processed belongs. Then, the ranking of the keywords in the keyword set is optimized by considering both the distribution of the keywords in the text file to be processed and their inverse document frequency, so that more representative keywords rank at the top. Finally, keywords are combined with adjacent word segments in the order in which they appear in the text file to be processed, and the compound words that were split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.

Next, the embodiments of the present application will be described with reference to the accompanying drawings in the embodiments of the present application.

Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the above method includes:

S101: Perform keyword extraction on the text file to be processed to obtain a first keyword set.

Before optimizing the method for selecting keywords, it is necessary to initially extract keywords from the text file to be processed to obtain a set of keywords, which can be recorded as the first keyword set.

In the above step, the keyword extraction tool may be an open-source program that extracts keywords based on a large-scale corpus and the TF-IDF algorithm; for example, jieba.analyse.extract_tags can be used to extract keywords from the text file to be processed.

If the corpus contains a large number and variety of text files, including text files both of the same type as and of different types from the text file to be processed, it can be called a large-scale corpus. The corpus used when extracting the first keyword set is a large-scale corpus.

In order not to miss keywords and to facilitate subsequent selection of each keyword in the keyword set, the number of keywords may be set to be greater than or equal to 20.
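
A minimal sketch of step S101 with jieba is shown below. The wrapper name extract_first_keyword_set is hypothetical; jieba.analyse.extract_tags and its topK and withWeight parameters belong to the third-party jieba package, which must be installed separately. The sketch assumes that jieba's bundled IDF table stands in for the large-scale corpus.

```python
def extract_first_keyword_set(text, top_k=20):
    """Return up to top_k (keyword, weight) pairs for the given text.

    jieba is imported inside the function so that the module can still be
    loaded on machines where the third-party package is not installed.
    """
    import jieba.analyse  # third-party: pip install jieba

    # extract_tags ranks word segments by TF-IDF; with withWeight=True it
    # returns (keyword, weight) tuples ordered from large to small weight.
    return jieba.analyse.extract_tags(text, topK=top_k, withWeight=True)
```

The default of 20 follows the suggestion above that at least 20 keywords be extracted, so that later filtering steps have enough candidates to work with.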

S102: Count the frequency of each keyword in the first keyword set appearing in the corpus.

In the above step, the frequency of each keyword in the first keyword set appearing in the corpus is counted, wherein the text file contained in the corpus and the text file to be processed belong to the same file type.

In step S101, the TF-IDF method used to obtain the first keyword set is based on a large-scale corpus, so the keywords in the first keyword set may be representative with respect to the large-scale corpus but not in the field to which the text file to be processed belongs. By using text files of the same type as the text file to be processed as the corpus, the importance of keywords in the field to which the text file to be processed belongs can be described.

The frequency with which a keyword appears in the corpus can be characterized along different dimensions. Based on text files, it can be expressed as the number of text files in the corpus that contain the keyword divided by the total number of text files in the corpus. Based on words, it can be expressed as the total number of times the keyword appears in the corpus divided by the total number of words in the corpus. The more frequently a keyword appears in the corpus, the less it can represent the main idea of the text file.
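
The two ways of measuring frequency can be sketched as follows. The function names and the toy corpus are hypothetical, and each text file is assumed to be represented as a list of word segments.

```python
def document_frequency(keyword, corpus):
    """Share of text files in the corpus that contain the keyword (0 to 1)."""
    containing = sum(1 for doc in corpus if keyword in doc)
    return containing / len(corpus)


def word_frequency(keyword, corpus):
    """Occurrences of the keyword divided by all word segments in the corpus."""
    occurrences = sum(doc.count(keyword) for doc in corpus)
    total_words = sum(len(doc) for doc in corpus)
    return occurrences / total_words


# Hypothetical corpus of text files of the same type as the file to process.
same_type_corpus = [
    ["enterprise", "tax", "policy", "enterprise"],
    ["enterprise", "notice"],
    ["tariff", "enterprise"],
    ["epidemic", "notice"],
]
```

Here document_frequency("enterprise", same_type_corpus) is 3/4, while word_frequency("enterprise", same_type_corpus) is 4/10; both values lie between 0 and 1, so either can be compared against a threshold in that range.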

S103: Take the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than the first threshold as the second keyword set.

When the text files contained in the corpus and the text file to be processed belong to the same file type, the frequency with which each keyword in the first keyword set occurs in the corpus describes well how important the keyword is in the field to which the text file to be processed belongs: the less frequently a keyword appears in the corpus, the more representative it is in that field.

A threshold, the first threshold, is set for the frequency with which keywords appear in the corpus, and the keywords whose frequency is lower than the first threshold can be used as the representative keywords in the field to which the text file to be processed belongs. The frequency obtained is a number between 0 and 1, and the first threshold can be set to any number between 0 and 1 according to experimental needs, as long as at least one keyword in the first keyword set has a frequency lower than the first threshold.
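
Steps S102 and S103 can be sketched together as a simple filter. The function name, the toy data, and the choice of the text-file-based frequency are assumptions made for illustration.

```python
def second_keyword_set(first_keywords, corpus, first_threshold):
    """Keep the keywords whose frequency in the corpus is below the threshold.

    The frequency used here is the share of corpus text files that contain
    the keyword; each text file is a list of word segments.
    """
    selected = []
    for keyword in first_keywords:
        frequency = sum(1 for doc in corpus if keyword in doc) / len(corpus)
        if frequency < first_threshold:
            selected.append(keyword)
    return selected


# Hypothetical first keyword set and same-type corpus.
first_keywords = ["enterprise", "tariff", "epidemic", "notice"]
same_type_corpus = [
    ["enterprise", "tax", "policy"],
    ["enterprise", "notice"],
    ["tariff", "enterprise"],
    ["epidemic", "notice"],
]
filtered = second_keyword_set(first_keywords, same_type_corpus, 0.5)
```

With a first threshold of 0.5, "enterprise" (frequency 0.75) and "notice" (frequency 0.5, not strictly lower) are removed, leaving "tariff" and "epidemic".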

S104: Use the second keyword set as the keyword set of the text file to be processed.

The keywords in the second keyword set are obtained by screening the first keyword set according to the frequency with which each keyword appears in a corpus of the same file type as the text file to be processed, which yields the keywords that are representative in the field to which the text file to be processed belongs. Using the second keyword set as the keyword set of the text file to be processed therefore gives more representative keywords.

The solution of the present application further includes: counting the inverse document frequency of each keyword in the first keyword set in the corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed. To distinguish it from the inverse document frequency obtained in the traditional TF-IDF method, this inverse document frequency can be called the intra-class inverse document frequency. The higher the intra-class inverse document frequency of a keyword, the more representative the keyword is in the field to which the text file to be processed belongs. A threshold, the second threshold, is set for the intra-class inverse document frequency; the keywords whose intra-class inverse document frequency is higher than the second threshold are taken as the third keyword set, and the third keyword set is used as the keyword set of the text file to be processed. This filters out keywords that are representative in the large-scale corpus but not in the field to which the text file to be processed belongs, thereby improving the accuracy of keyword extraction.

It should be noted that the second threshold is independent of the first threshold. The second threshold is a number greater than 0 whose specific value can be adjusted according to experimental results; it is only necessary to ensure that at least one keyword in the first keyword set has an intra-class inverse document frequency higher than the second threshold.
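
The intra-class variant can be sketched as follows. The function names and toy data are hypothetical, and the smoothed form log(N / (1 + df)) is an assumed IDF formula rather than one taken from the text.

```python
import math


def intra_class_idf(keyword, same_type_corpus):
    """Inverse document frequency computed only on same-type text files."""
    df = sum(1 for doc in same_type_corpus if keyword in doc)
    return math.log(len(same_type_corpus) / (1 + df))  # assumed smoothing


def third_keyword_set(first_keywords, same_type_corpus, second_threshold):
    """Keep the keywords whose intra-class IDF exceeds the second threshold."""
    return [
        keyword
        for keyword in first_keywords
        if intra_class_idf(keyword, same_type_corpus) > second_threshold
    ]


# Hypothetical first keyword set and same-type corpus.
first_keywords = ["enterprise", "tariff", "epidemic", "notice"]
same_type_corpus = [
    ["enterprise", "tax", "policy"],
    ["enterprise", "notice"],
    ["tariff", "enterprise"],
    ["epidemic", "notice"],
]
kept = third_keyword_set(first_keywords, same_type_corpus, 0.5)
```

In this toy data "enterprise" appears in three of four text files, so its intra-class inverse document frequency is 0 and it is filtered out, while "tariff" and "epidemic" (about 0.69 each) pass a second threshold of 0.5.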

The solution of the present application further includes: after counting the intra-class inverse document frequency of each keyword in the first keyword set, sorting the keywords in the first keyword set by intra-class inverse document frequency from high to low to obtain the rank of each keyword; selecting keywords in order of rank, starting from the smallest rank, as the fourth keyword set; and using the fourth keyword set as the keyword set of the text file to be processed. The selection may start from the first position in the ranking and take the required number of keywords. For example, if the first keyword set contains 25 keywords, then after sorting by intra-class inverse document frequency from high to low, the top 5, top 10, or top 15 keywords can be selected as the fourth keyword set. It is sufficient that the number of selected keywords is less than the number of keywords in the first keyword set; the specific number can be adjusted according to experimental results.
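
A minimal sketch of this rank-based selection is given below. The function name and the example values are hypothetical; the intra-class inverse document frequencies are assumed to have been computed beforehand and passed in as a dictionary.

```python
def fourth_keyword_set(intra_class_idf_by_keyword, top_k):
    """Sort keywords by intra-class IDF from high to low and keep the top_k."""
    if not 0 < top_k < len(intra_class_idf_by_keyword):
        raise ValueError("top_k must be positive and smaller than the set size")
    ranked = sorted(
        intra_class_idf_by_keyword,
        key=intra_class_idf_by_keyword.get,
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical intra-class IDF values for a small first keyword set.
idf_values = {"enterprise": 0.0, "tariff": 0.69, "epidemic": 0.92, "notice": 0.29}
selected = fourth_keyword_set(idf_values, top_k=2)
```

The check on top_k mirrors the requirement that the number of selected keywords be smaller than the number of keywords in the first keyword set.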

For example, during the Sino-US trade war and the epidemic, the official documents issued by the Bureau of Industry and Information Technology of a certain city contain vocabulary specific to that period, and few corpus resources exist in the field to which these official documents belong. With the traditional TF-IDF method, keywords are extracted from these official documents based on a large-scale corpus. If 20 keywords are extracted, they often include keywords such as "enterprise" or "tax", but these are very common words in the field of official documents issued by the Bureau of Industry and Information Technology and are not representative; what is wanted are keywords related to the trade war and the epidemic. If the text files in the corpus are replaced with text files in the same field as the official documents issued by the Bureau of Industry and Information Technology, the keywords "enterprise" and "tax" occur in a large number of text files in the corpus, so their intra-class inverse document frequency is low. It is then possible to keep only the keywords whose intra-class inverse document frequency is greater than 0.5, or to sort the 20 keywords by intra-class inverse document frequency and keep only the top 10 as the keywords of the official documents, thereby filtering out the keywords that are not representative in the field of these official documents.

The selection may also start from a keyword whose rank is greater than 1 and take, in order, as many keywords as the experiment requires. In particular, to improve the specificity of the extracted keywords in the field to which the text file to be processed belongs, the intra-class inverse document frequency of each keyword in the first keyword set can be counted and the keywords sorted by it from high to low. The keywords ranked within the top first percentage are taken as the first candidate keyword set, and the keywords ranked within the top second percentage but not in the first candidate keyword set are taken as the second candidate keyword set. Both percentages lie between 0 and 100%, and the first percentage is less than the second percentage. Keywords are then selected from the first candidate keyword set and the second candidate keyword set to form the second keyword set.

For example, suppose the first keyword set contains 20 keywords. After the keywords in the first keyword set are sorted by intra-class inverse document frequency from high to low, the keywords ranked in the top 40%, that is, 8 keywords, are taken as the first candidate keyword set, and the keywords ranked in the top 80% but not in the first candidate keyword set, that is, the 8 keywords ranked in the 40% to 80% range, are taken as the second candidate keyword set. Keywords are then selected according to the needs of the experiment. For example, if a total of 10 keywords need to be extracted, 5 keywords can be selected from the first candidate keyword set and then 5 from the second candidate keyword set.
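
The percentage-based split can be sketched as follows. The function name and the ten placeholder keywords are hypothetical, the input is assumed to be already sorted by intra-class inverse document frequency from high to low, and percentages are passed as integers so that the cut positions are computed exactly.

```python
def candidate_sets(ranked_keywords, first_pct, second_pct):
    """Split a ranked keyword list into two candidate keyword sets.

    first_pct and second_pct are integer percentages, first_pct < second_pct.
    """
    if not 0 < first_pct < second_pct <= 100:
        raise ValueError("require 0 < first_pct < second_pct <= 100")
    n = len(ranked_keywords)
    first_cut = n * first_pct // 100
    second_cut = n * second_pct // 100
    first_candidates = ranked_keywords[:first_cut]
    # Keywords inside the top second percentage but not in the first set.
    second_candidates = ranked_keywords[first_cut:second_cut]
    return first_candidates, second_candidates


# Hypothetical ranked list of ten keywords, best first.
ranked = ["k1", "k2", "k3", "k4", "k5", "k6", "k7", "k8", "k9", "k10"]
first_candidates, second_candidates = candidate_sets(ranked, 40, 80)
# Take 3 keywords from the first candidate set and 2 from the second.
chosen = first_candidates[:3] + second_candidates[:2]
```

How many keywords are drawn from each candidate set is left to the experiment; the 3 and 2 used here are arbitrary.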

Using text files in the same field as the text file to be processed as the corpus, and screening out the keywords that are representative in that field based on the frequency with which keywords appear in the corpus, can improve the accuracy of keyword extraction. Besides this frequency-based filtering, the distribution of a keyword within the text file to be processed is also a factor in measuring its importance. Some keywords have a high weight value but appear in only one or a few sentences of the text file, and such keywords are not very representative; other keywords have a low weight value but appear in many places in the text file to be processed, and such keywords are more representative. Please refer to FIG. 2. FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the method includes:

S201: Determine the number of different paragraphs in which the first keyword appears in the text file to be processed, and obtain the number of paragraphs corresponding to the first keyword.

Here, the first keyword is not the keyword in first position in the fourth keyword set; it refers to any keyword in the fourth keyword set, in no particular order. For any keyword obtained from the fourth keyword set, the number of different paragraphs in which the keyword appears in the text file to be processed is determined to obtain the number of paragraphs of the keyword. For example, suppose the text file to be processed has 5 paragraphs in total and the keyword "infrastructure" appears 3 times in the first paragraph, 5 times in the second paragraph, and 7 times in each of the fourth and fifth paragraphs. The number of different paragraphs in which the keyword "infrastructure" appears in the text file to be processed is then 4.
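
Step S201 can be sketched as follows, reproducing the "infrastructure" example. The function name and the filler word segments are hypothetical, and each paragraph is assumed to be a list of word segments.

```python
def paragraph_count(keyword, paragraphs):
    """Number of different paragraphs in which the keyword appears."""
    # Repeated occurrences inside one paragraph count only once.
    return sum(1 for paragraph in paragraphs if keyword in paragraph)


# Five paragraphs: the keyword occurs 3, 5, 0, 7 and 7 times respectively.
paragraphs = [
    ["infrastructure"] * 3 + ["plan"],
    ["infrastructure"] * 5,
    ["budget", "report"],
    ["infrastructure"] * 7,
    ["infrastructure"] * 7 + ["city"],
]
count = paragraph_count("infrastructure", paragraphs)
```

The result is 4: how often the keyword occurs inside a paragraph does not matter, only whether it occurs there at all.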

S202: Determine the order of the number of paragraphs corresponding to each keyword in the fourth keyword set, and obtain a first order value.

Each keyword in the fourth keyword set has its own number of paragraphs. Sorting the keywords in the fourth keyword set by number of paragraphs gives a ranking, which can be recorded as rank1. The position of the first keyword in rank1 is its sorting value, which can be called the first sorting value. It should be noted that the first sorting value is simply the sorting value corresponding to the first keyword; "first" has no special sequential meaning.
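
One way to turn the numbers of paragraphs into rank1 is sketched below. The function name and the example counts are hypothetical; ranking from the largest value (rank 1) downward and breaking ties by input order are assumptions of this sketch.

```python
def ordinal_ranks(values_by_keyword):
    """Rank keywords from the largest value (rank 1) to the smallest.

    Ties keep their input order, because Python's sort is stable.
    """
    ordered = sorted(values_by_keyword, key=values_by_keyword.get, reverse=True)
    return {keyword: position for position, keyword in enumerate(ordered, start=1)}


# Hypothetical numbers of paragraphs for a small fourth keyword set.
paragraph_counts = {"epidemic": 5, "trade war": 5, "infrastructure": 4, "subsidy": 2}
rank1 = ordinal_ranks(paragraph_counts)
first_sorting_value = rank1["infrastructure"]
```

The same helper can be applied to the weight values of the keywords to obtain the ranking by weight value.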

S203: Calculate the product of the term frequency of the first keyword in the text file to be processed and its second inverse document frequency in a corpus to obtain the weight value of the first keyword, where the corpus includes text files of both the same file type as and different file types from the text file to be processed.

In the preliminary extraction of keywords, the term frequency TF of each candidate keyword is multiplied by its inverse document frequency IDF to obtain the TF-IDF value of the candidate keyword. In this step, the term frequency TF of the first keyword in the text file to be processed is multiplied by the second inverse document frequency to obtain the weight value of the first keyword. The second inverse document frequency is the inverse document frequency computed on a corpus whose text files include both the same file type as and different file types from the text file to be processed, that is, the large-scale corpus mentioned above.

S204: Determine the order of the weight values corresponding to each keyword in the fourth keyword set, and obtain a second order value.

Just as each keyword has a number of paragraphs, each keyword in the fourth keyword set has a weight value. Sorting the keywords in the fourth keyword set by weight value from large to small gives a ranking of weight values, which can be recorded as rank2. The position of the first keyword in rank2 can be recorded as the second sorting value. It should be noted that the second sorting value is also a sorting value corresponding to the first keyword; "second" has no special sequential meaning.

S205: Use the weighted sum of the first sorting value and the second sorting value as the sorting reference value of the first keyword.

The first sorting value is the rank of the first keyword in the ranking by number of paragraphs, and the second sorting value is the rank of the first keyword in the ranking by weight value. The first sorting value and the second sorting value are each assigned a weighting value, which can be adjusted through experiments, and the weighted sum of the two sorting values is calculated. This weighted sum is the sorting reference value of the first keyword.

For example, the number of paragraphs corresponding to the keyword "infrastructure" is 4, which, when the numbers of paragraphs of the keywords in the fourth keyword set are sorted from more to fewer, places it 3rd in rank1; the weight value of the keyword "infrastructure" places it 5th in rank2. If rank1 and rank2 are given weighting values of 0.5 and 0.6 respectively, the sorting reference value of the keyword "infrastructure" is 3*0.5+5*0.6, which is 4.5.

In particular, when weighting values are assigned to the first sorting value and the second sorting value, the two weighting values can be set to sum to 1, with the weighting value of the first sorting value greater than 0 and less than or equal to 0.5, so that the final ranking is dominated by the weight value and supplemented by the distribution over paragraphs.
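
Step S205, together with the constraint on the weighting values, can be sketched as follows. The function names are hypothetical, and the ranks 3 and 5 with a first weighting value of 0.25 are illustrative inputs only.

```python
def constrained_weights(first_weighting_value):
    """Return the two weighting values under the constraint described above.

    The first weighting value must be in (0, 0.5]; the two values sum to 1.
    """
    if not 0 < first_weighting_value <= 0.5:
        raise ValueError("first weighting value must be in (0, 0.5]")
    return first_weighting_value, 1 - first_weighting_value


def sorting_reference_value(first_sorting_value, second_sorting_value, w1, w2):
    """Weighted sum of the paragraph-count rank and the weight-value rank."""
    return w1 * first_sorting_value + w2 * second_sorting_value


w1, w2 = constrained_weights(0.25)
reference = sorting_reference_value(3, 5, w1, w2)  # 0.25 * 3 + 0.75 * 5
```

Because the second weighting value is at least as large as the first, the rank by weight value contributes at least as much to the reference value as the rank by number of paragraphs.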

S206: Determine the order of the first keywords in the fourth keyword set according to the size of the sorting reference value of the first keywords.

For any keyword in the fourth keyword set, that is, the first keyword, the size of its sorting reference value is the basis for its position in the fourth keyword set. For the fourth keyword set as a whole, each keyword corresponds to a sorting reference value; sorting the keywords in the fourth keyword set by sorting reference value from large to small adjusts the order of the keywords so that the truly representative keywords come first.

When extracting keywords, word segmentation is usually performed on the text file to be processed. After segmentation, a combined word or new word that could serve as a keyword may be split into two or more word segments, and keyword extraction methods can often extract only one of those segments rather than the complete keyword. Please refer to FIG. 3, which is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the method includes:

S301: Combine the second keyword with the word segments whose distance from the second keyword, counted in word segments, is less than a third threshold, in the order in which they appear in the text file to be processed, to obtain a combined word set;

Here, the second keyword is any keyword in the fourth keyword set; "second" has no special ordinal meaning. A combined word may be split into multiple word segments by segmentation, and those word segments must be adjacent in the text file to be processed. In this step, the position of the second keyword in the original text may be located first, and the word segments adjacent to that position on the left and right are then selected and combined with the second keyword to obtain a combined word set. When selecting word segments adjacent to that position, the distance of a selected word segment from the second keyword, counted in word segments, must be less than a threshold, namely the third threshold, which is a positive integer. When combining, the word segments and the second keyword must keep the order they have in the text file to be processed.

For example, denote the second keyword as wordn, and select for combination the word segments in the text file to be processed that are fewer than 4 word segments away from wordn, that is, the three word segments on each side of the second keyword. The selected word segments together with the second keyword can be recorded as [wordn-3, wordn-2, wordn-1, wordn, wordn+1, wordn+2, wordn+3]. When combining the keyword with word segments, the order in the text file to be processed is preserved, that is, combinations such as [wordn-3, wordn-2, wordn-1, wordn], [wordn-2, wordn-1, wordn], [wordn-1, wordn], [wordn, wordn+1], [wordn, wordn+1, wordn+2], [wordn, wordn+1, wordn+2, wordn+3], and [wordn-1, wordn, wordn+1] are formed to obtain the combined word set.

It should be noted that the number of word segments selected can be adjusted through experiments. In the example above, three word segments before and after the keyword are selected mainly because, in practice, a combined word is generally formed from at most four separate word segments. The second keyword may appear at multiple positions in the text file to be processed; the steps above are performed at each position to select and combine word segments and obtain the combined word set.
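Step S301 can be sketched as follows. This is an illustrative reading of the step, not the implementation of the application: the function name `combined_word_candidates`, the representation of the text as a list of word segments, and the cap `max_len` (taken from the remark that a combined word is generally formed from at most four word segments) are assumptions.

```python
from typing import List, Set, Tuple


def combined_word_candidates(
    tokens: List[str],
    keyword: str,
    third_threshold: int = 4,  # neighbours must be fewer than this many segments away
    max_len: int = 4,          # hypothetical cap on segments per combined word
) -> Set[Tuple[str, ...]]:
    """Collect contiguous runs of word segments that contain the keyword.

    Each run keeps the order of the original text, and every occurrence of the
    keyword contributes its own candidates.
    """
    reach = third_threshold - 1
    candidates: Set[Tuple[str, ...]] = set()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - reach)
        hi = min(len(tokens) - 1, i + reach)
        for start in range(lo, i + 1):
            for end in range(i, hi + 1):
                length = end - start + 1
                # A run of length 1 is just the keyword itself, so skip it.
                if 2 <= length <= max_len:
                    candidates.add(tuple(tokens[start:end + 1]))
    return candidates
```

Returning tuples of word segments rather than joined strings leaves the choice of joining rule (with or without spaces) to the caller, which matters for languages written without word delimiters.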

S302: When the word frequency of a combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than a fourth threshold, use the combined word as the second keyword.

Here, the combined word is any combined word in the combined word set. The full text of the text file to be processed is traversed for the combined word to obtain the word frequency of the combined word in that file. If the word frequency of the combined word divided by the word frequency of the second keyword is greater than the threshold, the combined word is more representative of the text file to be processed, so the combined word can be used as the second keyword, that is, the combined word replaces the second keyword.

The fourth threshold is a number greater than 0.5 and less than 1. To ensure that the combined word is representative of the text file to be processed, the threshold can be set to a number greater than or equal to 0.75 and less than 1, and it can also be adjusted according to experimental results; this application places no restriction on this. Traversing the full text of the text file to be processed for the combined word and then using word frequency to evaluate the combined word guarantees the importance of the combined word in that file. That is, a combined word is used as a keyword only when its frequency in the text relative to that of the keyword exceeds a certain value, so the extracted combined words are representative keywords.
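The frequency test of step S302 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `count_run` and `promote_combined_word` are hypothetical, frequencies are counted over a list of word segments rather than over the raw text, and the default threshold of 0.75 is simply the value mentioned above.

```python
from typing import Sequence


def count_run(tokens: Sequence[str], run: Sequence[str]) -> int:
    """Count how often a contiguous run of word segments occurs in the text."""
    n = len(run)
    target = tuple(run)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == target
    )


def promote_combined_word(
    tokens: Sequence[str],
    keyword: str,
    combined: Sequence[str],
    fourth_threshold: float = 0.75,
) -> bool:
    """Return True when the combined word should replace the keyword."""
    keyword_freq = count_run(tokens, [keyword])
    if keyword_freq == 0:
        return False  # nothing to compare against
    return count_run(tokens, combined) / keyword_freq > fourth_threshold
```

Because every occurrence of the combined word also contains the keyword, the ratio never exceeds 1, which is consistent with the fourth threshold being less than 1.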

In particular, after the combination and screening of all keywords and word segments in the fourth keyword set is complete, if an inclusion relationship exists between keywords in the fourth keyword set, the keywords in the fourth keyword set are segmented, the non-repeated word segments are used as the fourth keyword set, and the fourth keyword set is used as the keyword set of the text file to be processed.

To sum up, the keyword extraction method provided by this application first improves the way keywords are selected, filtering out keywords that are not representative in the field to which the text file to be processed belongs. It then optimizes the ranking of each keyword in the keyword set by combining the distribution of the keywords in the text file to be processed with the inverse document frequency of the keywords, so that more representative keywords rank higher. Finally, it combines keywords with adjacent word segments in the order in which they appear in the text file to be processed, so that combined words split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.

The methods of the embodiments of the present application are described in detail above, and the apparatuses of the embodiments of the present application are provided below.

FIG. 4 is a schematic structural diagram of a keyword extraction apparatus disclosed in an embodiment of the present application. The keyword extraction apparatus 40 may include an extraction unit 401, a statistics unit 402, and a determining unit 403, which are described as follows:

Extraction unit 401, configured to perform keyword extraction on the text file to be processed to obtain a first keyword set;

Statistics unit 402, configured to count the frequency with which each keyword in the first keyword set occurs in a corpus, where the text files contained in the corpus and the text file to be processed belong to the same file type;

Determining unit 403, configured to use the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, and to use the second keyword set as the keyword set of the text file to be processed.
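The screening performed by the statistics unit and the determining unit can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the application: the function name `filter_by_corpus_frequency`, the representation of each corpus file as a set of word segments, and the reading of "frequency of occurrence" as the share of corpus files that contain the keyword are all assumptions made for the example.

```python
from typing import Iterable, List, Set


def filter_by_corpus_frequency(
    first_keywords: Iterable[str],
    corpus_docs: List[Set[str]],
    first_threshold: float,
) -> List[str]:
    """Keep keywords that occur in less than `first_threshold` of the corpus files.

    The corpus is assumed to hold files of the same type as the file being
    processed, so words that are common across the corpus are treated as
    unrepresentative and dropped.
    """
    if not corpus_docs:
        return list(first_keywords)  # no evidence to filter on
    second_keywords = []
    for kw in first_keywords:
        frequency = sum(1 for doc in corpus_docs if kw in doc) / len(corpus_docs)
        if frequency < first_threshold:
            second_keywords.append(kw)
    return second_keywords
```

With a corpus of research reports, for instance, a word such as "report" would appear in nearly every file and be removed, while a topic-specific word would survive.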

In a possible implementation, the statistics unit 402 is further configured to count the first inverse document frequency of each keyword in the first keyword set in the corpus, where the text files contained in the corpus corresponding to the first inverse document frequency and the text file to be processed belong to the same file type;

The determining unit 403 is further configured to use the keywords in the first keyword set whose first inverse document frequency is higher than a second threshold as a third keyword set, and to use the third keyword set as the keyword set of the text file to be processed.
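The inverse-document-frequency variant of the screening can be sketched in the same way. The formula log(N / (1 + df)) used here is a common smoothed definition and is an assumption, as are the function names and the representation of corpus files as sets of word segments.

```python
import math
from typing import Iterable, List, Set


def inverse_document_frequency(keyword: str, corpus_docs: List[Set[str]]) -> float:
    """Smoothed IDF: log(N / (1 + df)), where df is the number of files containing the keyword."""
    df = sum(1 for doc in corpus_docs if keyword in doc)
    return math.log(len(corpus_docs) / (1 + df))


def filter_by_idf(
    first_keywords: Iterable[str],
    corpus_docs: List[Set[str]],
    second_threshold: float,
) -> List[str]:
    """Keep keywords whose IDF in the same-type corpus exceeds the second threshold."""
    return [
        kw
        for kw in first_keywords
        if inverse_document_frequency(kw, corpus_docs) > second_threshold
    ]
```

A high inverse document frequency corresponds to a low corpus frequency, so this filter and the frequency-based one select keywords in the same spirit.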

In a possible implementation, the above device further includes:

A sorting unit 404, configured to sort the above-mentioned first keyword set according to the above-mentioned first inverse document frequency from high to low;

The determining unit 403 is further configured to select keywords in ascending order of rank, that is, starting from the top-ranked keyword, as a fourth keyword set, and to use the fourth keyword set as the keyword set of the text file to be processed.

In a possible implementation, the determining unit 403 is further configured to use the keywords ranked within the top first percentage as a first candidate keyword set, and to use the keywords that rank within the top second percentage but are not in the first candidate keyword set as a second candidate keyword set, the second percentage being greater than the first percentage; and to select keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
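The split into two candidate sets can be sketched as follows. Only the partition is shown; how keywords are then drawn from the two sets is not covered here. The function name, the use of fractions between 0 and 1 for the percentages, and rounding the cut-off positions up are assumptions.

```python
import math
from typing import List, Tuple


def split_candidate_sets(
    ranked_keywords: List[str],
    first_pct: float,
    second_pct: float,
) -> Tuple[List[str], List[str]]:
    """Split keywords, already sorted best-first, into two candidate sets.

    The first set holds the top `first_pct` share; the second holds keywords
    within the top `second_pct` share that are not already in the first set.
    """
    if not 0 < first_pct < second_pct <= 1:
        raise ValueError("expected 0 < first_pct < second_pct <= 1")
    n = len(ranked_keywords)
    first_cut = math.ceil(n * first_pct)    # rounding up is an assumption
    second_cut = math.ceil(n * second_pct)
    return ranked_keywords[:first_cut], ranked_keywords[first_cut:second_cut]
```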

In a possible implementation, the above device further includes:

The determining unit 403 is further configured to determine the number of different paragraphs of the text file to be processed in which a first keyword appears, to obtain the number of paragraphs corresponding to the first keyword; and to determine the ranking of the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value; the first keyword is any keyword in the fourth keyword set;

The calculation unit 405 is configured to calculate the product of the word frequency of the first keyword in the text file to be processed and its second inverse document frequency in a corpus, to obtain the weight value of the first keyword, where that corpus includes text files of the same type as, and of different types from, the text file to be processed;

The determining unit 403 is further configured to determine the ranking of the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value; to use the weighted sum of the first ranking value and the second ranking value as the sorting reference value of the first keyword; and to determine the order of the first keyword in the fourth keyword set according to the size of the sorting reference value of the first keyword.
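The weight value computed by the calculation unit can be sketched as follows. This is an illustration only: the function name `weight_values`, the use of raw counts as word frequency, the smoothed formula log(N / (1 + df)) for inverse document frequency, and the representation of corpus files as sets of word segments are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def weight_values(
    doc_tokens: List[str],
    keywords: Iterable[str],
    general_corpus_docs: List[Set[str]],
) -> Dict[str, float]:
    """Weight value = word frequency in the file x inverse document frequency.

    `general_corpus_docs` stands for the mixed corpus that contains files of
    the same type as, and of different types from, the file being processed.
    """
    term_freq = Counter(doc_tokens)
    n_docs = len(general_corpus_docs)
    weights: Dict[str, float] = {}
    for kw in keywords:
        df = sum(1 for doc in general_corpus_docs if kw in doc)
        idf = math.log(n_docs / (1 + df))
        weights[kw] = term_freq[kw] * idf
    return weights
```

Sorting the resulting values from large to small gives the ranking from which the second ranking value is read.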

In a possible implementation manner, the sum of the weighting value of the first sorting value and the weighting value of the second sorting value is 1, and the weighting value of the first sorting value is greater than 0 and less than or equal to 0.5.

In this embodiment of the present application, when weighting values are assigned to the first sorting value and the second sorting value, the sum of the two weighting values is set to 1, and the weighting value of the first sorting value is greater than 0 and less than or equal to 0.5, so that the final ranking is dominated by the weight value and supplemented by the distribution of the keyword across paragraphs.

In a possible implementation, the above device further includes:

The combining unit 406 is configured to combine the second keyword with the word segments whose distance from the second keyword, counted in word segments, is less than a third threshold, in the order in which they appear in the text file to be processed, to obtain a combined word set, the second keyword being any keyword in the fourth keyword set;

The determining unit 403 is further configured to use a combined word as the second keyword when the word frequency of the combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than a fourth threshold, the combined word being any combined word in the combined word set.

In this embodiment of the present application, a combined word may be split into multiple word segments by segmentation, and those word segments must be adjacent in the text file to be processed. In the steps above, the position of the second keyword in the original text can be located first, and the word segments adjacent to that position on the left and right are then selected and combined with the second keyword to obtain a combined word set. When selecting word segments adjacent to that position, the distance of a selected word segment from the second keyword, counted in word segments, must be less than the threshold; when combining, the word segments and the second keyword must keep the order they have in the text file to be processed, to obtain the combined words.

The full text of the text file to be processed is traversed for the combined word to obtain the word frequency of the combined word in that file. If the word frequency of the combined word divided by the word frequency of the second keyword is greater than the threshold, the combined word is more representative of the text file to be processed, so the combined word can be used as the second keyword, that is, the combined word replaces the second keyword.

The threshold can be set to 0.75 and can also be adjusted according to experimental results. Traversing the full text of the text file to be processed for the combined word and then using word frequency to evaluate the combined word guarantees the importance of the combined word in that file. That is, a combined word is used as a keyword only when its frequency in the text relative to that of the keyword exceeds a certain value, so the extracted combined words are representative keywords.

It should be noted that the number of word segments selected can be adjusted through experiments. In the example above, three word segments before and after the keyword are selected mainly because, in practice, a combined word is generally formed from at most four separate word segments.

The above-mentioned second keyword may appear in multiple positions in the above-mentioned text file to be processed, and the above steps are taken for each position to perform word segmentation selection and combination to obtain a combined word set.

In particular, after the combination and screening of all keywords and word segments in the fourth keyword set is complete, if an inclusion relationship exists between keywords in the fourth keyword set, the keywords in the fourth keyword set are segmented, the non-repeated word segments are used as the fourth keyword set, and the fourth keyword set is used as the keyword set of the text file to be processed.

To sum up, the keyword extraction method provided by this application first improves the way keywords are selected, then optimizes the ordering of each keyword in the keyword set by considering the distribution of the keywords in the text file to be processed, and finally combines keywords with adjacent word segments in the order in which they appear in the text file to be processed, so that combined words split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.

Please refer to FIG. 5, which is a schematic structural diagram of a server disclosed in an embodiment of the present application. The server 50 may include a memory 501 and a processor 502. Optionally, it may further include a communication interface 503 and a bus 504, wherein the memory 501, the processor 502, and the communication interface 503 are communicatively connected to one another through the bus 504. The communication interface 503 is used for data interaction with the keyword extraction apparatus 40.

The memory 501 provides storage space in which data such as an operating system and computer programs can be stored. The memory 501 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).

The processor 502 is a module that performs arithmetic and logical operations, and may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor unit (MPU).

A computer program is stored in the memory 501, and the processor 502 calls the computer program stored in the memory 501 to perform the following operations:

Counting the frequency of occurrence of each keyword in the above-mentioned first keyword set in the corpus, the text file contained in the above-mentioned corpus and the above-mentioned text file to be processed belong to the same file type;

Using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set;

It should be noted that, the specific implementation of the server 50 may also correspond to the corresponding descriptions of the method embodiments shown in FIG. 2 and FIG. 3 .

Embodiments of the present application further provide a computer-readable storage medium in which a computer program is stored; when the computer program is executed on one or more processors, the keyword extraction methods shown in FIG. 1, FIG. 2, and FIG. 3 are performed.

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

To sum up, this application first improves the way keywords are selected, filtering out keywords that are not representative in the field to which the text file to be processed belongs. It then optimizes the ranking of each keyword in the keyword set by combining the distribution of the keywords in the text file to be processed with the inverse document frequency of the keywords, so that more representative keywords rank first. Finally, it combines keywords with adjacent word segments in the order in which they appear in the text file to be processed, so that combined words split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The aforementioned storage medium includes read-only memory (ROM), random-access memory (RAM), magnetic disks, optical discs, and other media that can store computer program code.

Claims

1. A method for keyword extraction, comprising:

Perform keyword extraction on the text file to be processed to obtain a first keyword set;

Count the frequency with which each keyword in the first keyword set appears in a corpus, and use the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus belong to the same file type as the text file to be processed;

The second keyword set is used as the keyword set of the text file to be processed.

2. The method of claim 1, wherein the method further comprises:

Counting the first inverse document frequency of each keyword in the first keyword set in the corpus, the text file contained in the corpus corresponding to the first inverse document frequency and the text file to be processed belong to the same file type;

In the first keyword set, the keywords whose first inverse document frequency is higher than the second threshold are used as the third keyword set;

The third keyword set is used as the keyword set of the text file to be processed.

3. The method of claim 2, wherein the method further comprises:

Sorting the first keyword set according to the first inverse document frequency from high to low to obtain the ranking of each keyword in the first keyword set;

Select keywords in ascending order of rank, starting from the top-ranked keyword, as a fourth keyword set;

The fourth keyword set is used as the keyword set of the text file to be processed.

4. The method according to claim 3, wherein selecting keywords in ascending order of rank as the fourth keyword set comprises:

Taking the keywords ranked within the top first percentage as a first candidate keyword set, and taking the keywords that rank within the top second percentage but are not in the first candidate keyword set as a second candidate keyword set, the second percentage being greater than the first percentage;

Keywords are selected from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.

5. The method of claim 3, wherein the method further comprises:

Determine the number of different paragraphs of the text file to be processed in which a first keyword appears, to obtain the number of paragraphs corresponding to the first keyword; determine the ranking of the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value; the first keyword is any keyword in the fourth keyword set;

Calculate the product of the word frequency of the first keyword in the text file to be processed and its second inverse document frequency in a corpus, to obtain the weight value of the first keyword, the corpus including text files of the same type as, and of different types from, the text file to be processed; determine the ranking of the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;

Using the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;

The order of the first keywords in the fourth keyword set is determined according to the size of the sorting reference value of the first keywords.

6. The method of claim 5, wherein the method further comprises:

The sum of the weighted value of the first sorting value and the weighted value of the second sorting value is 1, and the weighted value of the first sorting value is greater than 0 and less than or equal to 0.5.

7. The method of claim 6, wherein the method further comprises:

Combining the second keyword with the word segments whose distance from the second keyword, counted in word segments, is less than a third threshold, in the order in which they appear in the text file to be processed, to obtain a combined word set, the second keyword being any keyword in the fourth keyword set;

In the case that the word frequency of the combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than a fourth threshold, the combined word is used as the second keyword, the combined word being any combined word in the combined word set.

8. An apparatus for keyword extraction, wherein the apparatus comprises:

an extraction unit, used to extract keywords from the text file to be processed to obtain a first keyword set;

A statistical unit, configured to count the frequency of occurrence of each keyword in the first keyword set in the corpus, where the text file contained in the corpus and the text file to be processed belong to the same file type;

A determining unit, configured to use the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set, and to use the second keyword set as the keyword set of the text file to be processed.

9. A server, wherein the server includes a processor and a memory, wherein a computer program is stored in the memory, and the processor invokes the computer program stored in the memory to execute the following method:

Perform keyword extraction on the text file to be processed to obtain a first keyword set;

Count the frequency with which each keyword in the first keyword set appears in a corpus, and use the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus belong to the same file type as the text file to be processed;

The second keyword set is used as the keyword set of the text file to be processed.

10. The server of claim 9, wherein the processor is further configured to perform:

Counting the first inverse document frequency of each keyword in the first keyword set in the corpus, the text file contained in the corpus corresponding to the first inverse document frequency and the text file to be processed belong to the same file type;

In the first keyword set, the keywords whose first inverse document frequency is higher than the second threshold are used as the third keyword set;

The third keyword set is used as the keyword set of the text file to be processed.

11. The server of claim 10, wherein the processor is further configured to perform:

Sorting the first keyword set according to the first inverse document frequency from high to low to obtain the ranking of each keyword in the first keyword set;

Select keywords in ascending order of rank, starting from the top-ranked keyword, as a fourth keyword set;

The fourth keyword set is used as the keyword set of the text file to be processed.

12. The server according to claim 11, wherein selecting keywords in ascending order of rank as the fourth keyword set comprises:

Taking the keywords ranked within the top first percentage as a first candidate keyword set, and taking the keywords that rank within the top second percentage but are not in the first candidate keyword set as a second candidate keyword set, the second percentage being greater than the first percentage;

Keywords are selected from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.

13. The server of claim 11, wherein the processor is further configured to perform:

Determine the number of different paragraphs of the text file to be processed in which a first keyword appears, to obtain the number of paragraphs corresponding to the first keyword; determine the ranking of the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value; the first keyword is any keyword in the fourth keyword set;

Calculate the product of the word frequency of the first keyword in the text file to be processed and its second inverse document frequency in a corpus, to obtain the weight value of the first keyword, the corpus including text files of the same type as, and of different types from, the text file to be processed; determine the ranking of the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;

Using the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;

The order of the first keywords in the fourth keyword set is determined according to the size of the sorting reference value of the first keywords.

14. The server according to claim 13, wherein the sum of the weighted value of the first sorting value and the weighted value of the second sorting value is 1, and the weighted value of the first sorting value is greater than 0 and less than or equal to 0.5;

The processor is also used to execute:

Combining the second keyword with the word segments whose distance from the second keyword, counted in word segments, is less than a third threshold, in the order in which they appear in the text file to be processed, to obtain a combined word set, the second keyword being any keyword in the fourth keyword set;

In the case that the word frequency of the combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than a fourth threshold, the combined word is used as the second keyword, the combined word being any combined word in the combined word set.

15. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program runs on one or more processors, the following method is performed:

Perform keyword extraction on the text file to be processed to obtain a first keyword set;

Count the frequency with which each keyword in the first keyword set appears in a corpus, and use the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus belong to the same file type as the text file to be processed;

The second keyword set is used as the keyword set of the text file to be processed.

16. The computer-readable storage medium of claim 15, wherein the computer program, when run on one or more processors, is further configured to perform:

Counting the first inverse document frequency of each keyword in the first keyword set in the corpus, the text file contained in the corpus corresponding to the first inverse document frequency and the text file to be processed belong to the same file type;

In the first keyword set, the keywords whose first inverse document frequency is higher than the second threshold are used as the third keyword set;

The third keyword set is used as the keyword set of the text file to be processed.

17. The computer-readable storage medium of claim 16, wherein the computer program, when run on one or more processors, is further configured to perform:

Sorting the first keyword set according to the first inverse document frequency from high to low to obtain the ranking of each keyword in the first keyword set;

Select keywords in ascending order of rank, starting from the top-ranked keyword, as a fourth keyword set;

The fourth keyword set is used as the keyword set of the text file to be processed.
The computer-readable storage medium according to claim 17, wherein selecting keywords in ascending order of rank number as the fourth keyword set comprises:

Taking the keywords ranked within the top first percentage as a first candidate keyword set, and taking the keywords that rank within the top second percentage, excluding the keywords in the first candidate keyword set, as a second candidate keyword set, wherein the second percentage is greater than the first percentage;

Selecting keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
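
The ranking and tiering steps of claims 17 and 18 can be sketched as below. This is an illustration under assumptions: the name `tier_by_idf_rank` is hypothetical, percentages are given as fractions, cut-offs are rounded up, and the sketch stops at the two candidate sets because the claims do not say how keywords are then picked from them.

```python
import math


def tier_by_idf_rank(idf_by_keyword, first_pct, second_pct):
    """Split keywords into two candidate tiers by descending IDF rank.

    first_pct and second_pct are fractions in (0, 1]; rounding the cut-off
    positions up is an assumption, not something the claims specify.
    """
    if not 0 < first_pct < second_pct <= 1:
        raise ValueError("require 0 < first_pct < second_pct <= 1")
    # Rank 1 is the keyword with the highest inverse document frequency.
    ranked = sorted(idf_by_keyword, key=lambda kw: idf_by_keyword[kw], reverse=True)
    first_cut = math.ceil(len(ranked) * first_pct)
    second_cut = math.ceil(len(ranked) * second_pct)
    first_candidates = ranked[:first_cut]
    # The second tier excludes everything already in the first tier.
    second_candidates = ranked[first_cut:second_cut]
    return first_candidates, second_candidates


idf = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.3,
       "f": 0.2, "g": 0.8, "h": 0.4, "i": 0.6, "j": 0.05}
first_tier, second_tier = tier_by_idf_rank(idf, first_pct=0.2, second_pct=0.5)
```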
The computer-readable storage medium of claim 17, wherein the computer program, when run on one or more processors, is further configured to perform:

Determining the number of different paragraphs of the text file to be processed in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword; determining the rank of that paragraph count among the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value; the first keyword being any keyword in the fourth keyword set;

Calculating the product of the word frequency of the first keyword in the text file to be processed and a second inverse document frequency of the first keyword in a corpus, to obtain a weight value of the first keyword, wherein the corpus includes text files of the same type as, and of different types from, the text file to be processed; determining the rank of that weight value among the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;

Using the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;

The position of the first keyword in the fourth keyword set is determined according to the size of the ranking reference value of the first keyword.
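
The ordering procedure above can be sketched in Python as follows. Several points are assumptions made for illustration only: rank 1 is given to the largest paragraph count and the largest weight value, ties keep input order, a smaller ranking reference value places a keyword earlier, and the weights `w_paragraph` and `w_tfidf` are illustrative parameters. All names are hypothetical.

```python
def rank_positions(scores):
    """Map each keyword to its rank; rank 1 is the highest score."""
    ordered = sorted(scores, key=lambda kw: scores[kw], reverse=True)
    return {kw: pos for pos, kw in enumerate(ordered, start=1)}


def order_keywords(paragraphs, idf2, keywords, w_paragraph=0.3, w_tfidf=0.7):
    """Order keywords by a weighted sum of two rank values.

    paragraphs: the text file to be processed, as a list of token lists.
    idf2: second inverse document frequency per keyword (mixed-type corpus).
    """
    tokens = [tok for para in paragraphs for tok in para]
    # Number of distinct paragraphs in which each keyword appears.
    para_count = {kw: sum(1 for p in paragraphs if kw in p) for kw in keywords}
    # Weight value: word frequency in the file times the second IDF.
    weight = {kw: tokens.count(kw) * idf2[kw] for kw in keywords}
    first_rank = rank_positions(para_count)
    second_rank = rank_positions(weight)
    reference = {
        kw: w_paragraph * first_rank[kw] + w_tfidf * second_rank[kw]
        for kw in keywords
    }
    # Assumption: a smaller reference value means an earlier position.
    return sorted(keywords, key=lambda kw: reference[kw])


paragraphs = [
    ["fraud", "claim", "fraud"],
    ["claim", "premium"],
    ["claim", "fraud"],
]
idf2 = {"fraud": 1.0, "claim": 0.2, "premium": 1.5}
ordered = order_keywords(paragraphs, idf2, ["fraud", "claim", "premium"])
```

Here "claim" appears in the most paragraphs but has a low weight value, so with the larger weight on the second ranking value it ends up last.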
The computer-readable storage medium of claim 19, wherein the sum of the weight of the first ranking value and the weight of the second ranking value is 1, and the weight of the first ranking value is greater than 0 and less than or equal to 0.5;

The computer program, when run on one or more processors, is also used to perform:

Combining each word segment that is separated from a second keyword by fewer than a third threshold number of word segments with the second keyword, in the order in which they appear in the text file to be processed, to obtain a combined word set, the second keyword being any keyword in the fourth keyword set;

In the case that the word frequency of a combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than a fourth threshold, the combined word is used as the second keyword, the combined word being any combined word in the combined word set.
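
The combined-word step can be sketched as below. This is a rough illustration, not the applicant's implementation: it assumes the distance between word segments is their index difference, that a combined word joins the neighbour and the keyword in text order with a separator, and that when several combined words pass the fourth threshold the most frequent one is chosen. The function name is hypothetical.

```python
from collections import Counter


def promote_combined_word(tokens, keyword, third_threshold, fourth_threshold, sep=" "):
    """Replace keyword by a frequent combined word, if one qualifies."""
    keyword_tf = tokens.count(keyword)
    if keyword_tf == 0:
        return keyword
    combined_tf = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        # Neighbours whose index distance to the keyword is < third_threshold.
        lo = max(0, i - third_threshold + 1)
        hi = min(len(tokens), i + third_threshold)
        for j in range(lo, hi):
            if j == i:
                continue
            # Join in the order the two segments appear in the text.
            pair = (tokens[j], keyword) if j < i else (keyword, tokens[j])
            combined_tf[sep.join(pair)] += 1
    if not combined_tf:
        return keyword
    best, best_tf = combined_tf.most_common(1)[0]
    # Promote only when the ratio of word frequencies exceeds the threshold.
    return best if best_tf / keyword_tf > fourth_threshold else keyword


tokens = "machine learning model uses machine learning for deep learning".split()
promoted = promote_combined_word(tokens, "learning", third_threshold=2, fourth_threshold=0.5)
kept = promote_combined_word(tokens, "learning", third_threshold=2, fourth_threshold=0.8)
```

With these inputs "machine learning" accompanies two of the three occurrences of "learning", so it replaces the keyword at a fourth threshold of 0.5 but not at 0.8.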