WO2022105178A1 - Keyword extraction method and related device - Google Patents

Keyword extraction method and related device

Info

Publication number
WO2022105178A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
keyword set
processed
keywords
text file
Prior art date
Application number
PCT/CN2021/097021
Other languages
French (fr)
Chinese (zh)
Inventor
李弦
阮晓雯
徐亮
洪博然
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022105178A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The embodiments of the present application relate to the field of natural language processing, and in particular to a keyword extraction method and related apparatus.
  • Keywords are the vocabulary that a single medium uses when building and using indexes, and keyword extraction from text files has long been a research hotspot in the industry.
  • The term frequency-inverse document frequency (TF-IDF) method extracts the keywords of a text file based on word frequency. First, the text file from which keywords are to be extracted is segmented into words. Then the term frequency and the inverse document frequency of each word segment are counted. Finally, the term frequency is multiplied by the inverse document frequency to give the weight value of the word segment, the word segments are sorted by weight value from largest to smallest, and the top-ranked word segments are used as the keywords of the text file.
  • The inventor realized that in the TF-IDF method, the importance of a word is proportional to the number of times it appears in the text file and inversely proportional to the number of times it appears in the articles of the corpus. When a corpus for the relevant field is lacking, the keywords extracted by the TF-IDF method may not be representative.
  • The embodiments of the present application disclose a keyword extraction method and a related device. By improving how keywords are selected during keyword extraction, the accuracy of keyword extraction is improved.
  • An embodiment of the present application discloses a method for keyword extraction, including:
  • performing keyword extraction on a text file to be processed to obtain a first keyword set;
  • counting the frequency with which each keyword in the first keyword set appears in a corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed;
  • using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set; and
  • using the second keyword set as the keyword set of the text file to be processed.
  • An embodiment of the present application discloses a device for keyword extraction, including:
  • an extraction unit configured to perform keyword extraction on the text file to be processed to obtain a first keyword set;
  • a statistical unit configured to count the frequency with which each keyword in the first keyword set appears in the corpus, where the text files contained in the corpus and the text file to be processed belong to the same file type; and
  • a determining unit configured to use the keywords in the first keyword set whose frequency in the corpus is lower than the first threshold as the second keyword set, and to use the second keyword set as the keyword set of the text file to be processed.
  • An embodiment of the present application discloses a server, comprising a processor and a memory, where a computer program is stored in the memory and the processor calls the computer program stored in the memory to execute the following method:
  • performing keyword extraction on a text file to be processed to obtain a first keyword set;
  • counting the frequency with which each keyword in the first keyword set appears in a corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed;
  • using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set; and
  • using the second keyword set as the keyword set of the text file to be processed.
  • An embodiment of the present application discloses a computer-readable storage medium in which a computer program is stored; when the computer program is executed on one or more processors, the following method is performed:
  • performing keyword extraction on a text file to be processed to obtain a first keyword set;
  • counting the frequency with which each keyword in the first keyword set appears in a corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed;
  • using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set; and
  • using the second keyword set as the keyword set of the text file to be processed.
  • The embodiment of the present application first improves the method for selecting keywords, then optimizes the ordering of the keywords in the keyword set by considering how the keywords are distributed in the text file to be processed, and finally combines keywords with adjacent word segments in the order in which they appear in the text file to be processed, so that combined words split apart by word segmentation are extracted, thereby improving the accuracy of keyword extraction.
  • FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a device for keyword extraction disclosed in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a server disclosed in an embodiment of the present application.
  • "At least one (item)" means one or more.
  • "A plurality" means two or more.
  • "At least two (items)" means two, or three or more.
  • "And/or" describes the relationship between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can each be singular or plural.
  • The character "/" generally indicates an "or" relationship between the associated objects.
  • "At least one of the following" or similar expressions refers to any combination of these items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c".
  • The technical solutions of the present application relate to the technical field of artificial intelligence and/or big data, such as natural language processing technology.
  • The present application can be applied to scenarios such as text processing to realize keyword extraction and improve its accuracy, thereby promoting the construction of smart cities.
  • The data involved in this application, such as text files and/or keywords, may be stored in a database or in a blockchain, for example as distributed storage through a blockchain, which is not limited in this application.
  • The embodiments of the present application apply when few text files belong to the same field type as the text file to be processed, that is, when keywords must be extracted from the text file to be processed while a corpus is lacking. The present application optimizes the traditional keyword extraction method TF-IDF and improves the accuracy of keyword extraction by improving how keywords are selected during extraction.
  • Some background on the TF-IDF method is introduced below.
  • Corpus: a large-scale electronic text database that has been sampled and processed, that is, a database that stores text files.
  • Term frequency (TF): the frequency with which a given word appears in the current text file. Because the same word may occur more often in a long file than in a short one, the count is normalized by the length of the text file: the term frequency is the number of occurrences of the given word in the current text file divided by the total number of words in the current text file, which can be expressed as $\mathrm{TF}_w = n_w / N$, where $n_w$ is the number of occurrences of word $w$ in the current text file and $N$ is the total number of words in that file.
  • Inverse document frequency (IDF): a measure of the general importance of a given word. If the given word appears in only a few text files in the corpus, it represents the gist of the text file well and its weight value should be larger; if it appears in a large number of text files in the corpus, it cannot represent the main idea of the text file and its weight value should be smaller. In its commonly used form, the inverse document frequency can be expressed as $\mathrm{IDF}_w = \log\big(D / (d_w + 1)\big)$, where $D$ is the total number of text files in the corpus and $d_w$ is the number of text files that contain word $w$.
  • TF-IDF: the importance of a word is proportional to the number of times it appears in the current text file and inversely proportional to its frequency in the text files of the corpus. Therefore, if a given word appears frequently in the text file to be processed and rarely in the text files of the corpus, the word represents the meaning of the current text file well and becomes a keyword of that text file.
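  • The TF, IDF, and TF-IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the application: the function names, the toy documents, and the add-one smoothing in the IDF denominator are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: occurrences of each word divided by the total number of words."""
    total = len(tokens)  # assumes a non-empty token list
    return {word: count / total for word, count in Counter(tokens).items()}


def inverse_document_frequency(word, corpus_docs):
    """IDF: log of (number of documents / (documents containing word + 1)).

    The +1 smoothing is an assumption; it avoids division by zero.
    """
    containing = sum(1 for doc in corpus_docs if word in doc)
    return math.log(len(corpus_docs) / (1 + containing))


def tf_idf(tokens, corpus_docs):
    """Weight value of each word: TF multiplied by IDF."""
    tf = term_frequency(tokens)
    return {w: tf[w] * inverse_document_frequency(w, corpus_docs) for w in tf}


# Hypothetical, already-segmented document and corpus (each doc is a set of words).
doc = ["grid", "upgrade", "grid", "tax", "grid"]
corpus = [{"tax", "enterprise"}, {"tax", "policy"}, {"grid", "tax"}, {"enterprise"}]

weights = tf_idf(doc, corpus)
# Sort word segments by weight value from largest to smallest.
ranked = sorted(weights, key=weights.get, reverse=True)
```

  • In this toy corpus "tax" appears in three of the four documents, so its weight drops to zero even though it occurs in the document, which is the behaviour the IDF term is meant to provide.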
  • The preprocessing operations on the text file to be processed include word segmentation, part-of-speech tagging, and stop-word removal.
  • For word segmentation, many tools are available, including jieba and Pangu.
  • For example, jieba, the most commonly used of these, can be used to segment the text file to be processed.
  • jieba performs efficient word-graph scanning based on a prefix dictionary, generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence, and then uses dynamic programming to find the maximum-probability path, that is, the best segmentation based on word frequency. Because jieba is a very typical word segmentation tool, its principles are not described further here.
  • Part-of-speech tagging adds an appropriate part-of-speech tag to each word segment to facilitate analysis of the sentence. Stop words are then removed from the set of word segments, for example by retaining only the word segments whose part of speech is noun, proper noun, or verb. Because part-of-speech tagging and stop-word removal are very typical processing steps, their principles are not repeated here.
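  • The filtering step described above can be sketched as follows. The sketch assumes the text has already been segmented and tagged by some tool; the tag names in KEPT_POS, the stop-word list, and the function name are hypothetical choices for illustration and are not taken from the application.

```python
# Hypothetical part-of-speech tags: noun, proper-noun variants, and verb.
KEPT_POS = {"n", "nr", "ns", "nt", "nz", "v"}


def preprocess(tagged_tokens, stop_words, kept_pos=KEPT_POS):
    """Keep word segments with a retained part of speech that are not stop words."""
    return [
        word
        for word, pos in tagged_tokens
        if pos in kept_pos and word not in stop_words
    ]


# Hypothetical (word, tag) pairs as produced by a segmenter with POS tagging.
tagged = [
    ("city", "n"),
    ("of", "p"),
    ("builds", "v"),
    ("quickly", "d"),
    ("Shenzhen", "ns"),
    ("is", "v"),
]
kept = preprocess(tagged, stop_words={"of", "is"})
```

  • Here "of" and "quickly" are dropped by the part-of-speech filter, and "is" is dropped by the stop-word list even though its tag is retained.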
  • The TF-IDF method relies on the corpus. When few text files belong to the same field as the text file to be processed, the corpus as a whole contains a large number of text files unrelated to that field, so the keywords extracted by the TF-IDF method are likely to be unrepresentative in the field to which the text file to be processed belongs.
  • To address this, the present application provides a new method for keyword extraction. First, the method for selecting keywords is improved to filter out keywords that are not representative in the field to which the text file to be processed belongs. Then, the distribution of the keywords in the text file to be processed and the inverse document frequency of the keywords are used to optimize the ordering of the keywords in the keyword set, so that more representative keywords rank at the top. Finally, keywords are combined with adjacent word segments in the order in which they appear in the text file to be processed, so that combined words split apart by word segmentation are extracted.
  • FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application. As shown in FIG. 1, the method includes:
  • S101 Perform keyword extraction on the text file to be processed to obtain a first keyword set.
  • The keyword extraction tool may be an open-source program that extracts keywords based on a large-scale corpus and the TF-IDF algorithm; for example, jieba.analyse.extract_tags can be used to extract keywords from the text file to be processed.
  • The corpus used in this step contains text files of both the same and different file types as the text file to be processed, and can be called a large-scale corpus.
  • That is, the corpus used to extract the first keyword set is a large-scale corpus.
  • The number of keywords to extract may be set to 20 or more.
  • The frequency with which each keyword in the first keyword set appears in the corpus is counted, where the text files contained in the corpus and the text file to be processed belong to the same file type.
  • In step S101, the TF-IDF method used to obtain the first keyword set is based on a large-scale corpus, so the keywords in the first keyword set may be representative in the large-scale corpus but not representative in the field to which the text file to be processed belongs.
  • The frequency with which a keyword appears in the corpus can be characterized along different dimensions. Based on text files, it can be expressed as the number of text files in the corpus that contain the keyword divided by the total number of text files in the corpus. Based on words, it can be expressed as the total number of times the keyword appears in the corpus divided by the total number of words in the corpus. The more frequently a keyword appears in the corpus, the less it can represent the main idea of the text file.
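  • The two ways of measuring a keyword's frequency in the corpus can be sketched as below. The function names and the toy same-type corpus are assumptions for illustration; each document is represented as a list of word segments.

```python
def document_frequency_ratio(keyword, corpus_docs):
    """Text-file dimension: files containing the keyword / total files."""
    containing = sum(1 for doc in corpus_docs if keyword in doc)
    return containing / len(corpus_docs)


def word_frequency_ratio(keyword, corpus_docs):
    """Word dimension: occurrences of the keyword / total words in the corpus."""
    total_words = sum(len(doc) for doc in corpus_docs)
    occurrences = sum(doc.count(keyword) for doc in corpus_docs)
    return occurrences / total_words


# Hypothetical corpus of the same file type as the text file to be processed.
corpus = [
    ["tax", "enterprise", "tax"],
    ["enterprise", "subsidy"],
    ["tax", "epidemic", "relief"],
]

tax_by_file = document_frequency_ratio("tax", corpus)
tax_by_word = word_frequency_ratio("tax", corpus)
epidemic_by_file = document_frequency_ratio("epidemic", corpus)
```

  • Both measures fall between 0 and 1, and on either one "tax" is far more frequent than "epidemic" in this toy corpus.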
  • The frequency with which each keyword in the first keyword set appears in the corpus therefore describes well how representative that keyword is in the field of the text file to be processed.
  • A threshold, the first threshold, is set for the frequency with which keywords appear in the corpus. Keywords whose frequency is lower than the first threshold can be used as the representative keywords in the field to which the text file to be processed belongs, and the set finally obtained is the second keyword set.
  • The frequency is a number between 0 and 1, so the first threshold can be set to any number between 0 and 1 according to experimental needs, provided that at least one keyword in the first keyword set has a frequency lower than the first threshold.
  • S104 Use the second keyword set as the keyword set of the text file to be processed.
  • The keywords in the second keyword set are obtained by screening the first keyword set according to the frequency with which each keyword appears in a corpus of the same file type as the text file to be processed. This yields the keywords that are representative in the field to which the text file to be processed belongs, so using the second keyword set as the keyword set of the text file to be processed gives more representative keywords.
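  • The screening that produces the second keyword set can be sketched as a simple threshold filter. The function name, the frequency values, and the threshold of 0.5 are hypothetical; the frequencies are assumed to have been counted beforehand over a corpus of the same file type.

```python
def select_second_keyword_set(first_keyword_set, frequency_in_corpus, first_threshold):
    """Keep keywords whose frequency in the same-type corpus is below the threshold."""
    return [
        keyword
        for keyword in first_keyword_set
        if frequency_in_corpus.get(keyword, 0.0) < first_threshold
    ]


# Hypothetical first keyword set and precomputed in-corpus frequencies (0 to 1).
first_keyword_set = ["enterprise", "tax", "epidemic", "trade war"]
frequency_in_corpus = {
    "enterprise": 0.9,
    "tax": 0.8,
    "epidemic": 0.2,
    "trade war": 0.1,
}

second_keyword_set = select_second_keyword_set(
    first_keyword_set, frequency_in_corpus, first_threshold=0.5
)
```

  • A keyword missing from the frequency table is treated as having frequency 0 and is kept; that default is a design choice of this sketch.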
  • The solution of the present application further includes: counting the inverse document frequency of each keyword in the first keyword set in the corpus, where the text files contained in the corpus belong to the same file type as the text file to be processed.
  • This inverse document frequency can be called the intra-class inverse document frequency.
  • A threshold, the second threshold, is set for the intra-class inverse document frequency. The keywords whose intra-class inverse document frequency is higher than the second threshold are used as the third keyword set, and the third keyword set is used as the keyword set of the text file to be processed. This filters out keywords that are representative in the large-scale corpus but not in the field to which the text file to be processed belongs, thereby improving the accuracy of keyword extraction.
  • The second threshold is a number greater than 0. Its specific value can be adjusted according to experimental results, provided that at least one keyword in the first keyword set has an intra-class inverse document frequency higher than the second threshold.
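  • A minimal sketch of filtering by intra-class inverse document frequency follows. The IDF form with add-one smoothing, the function names, the toy documents, and the threshold value are assumptions made for illustration.

```python
import math


def intra_class_idf(keyword, same_type_docs):
    """IDF computed only over text files of the same type (add-one smoothing assumed)."""
    containing = sum(1 for doc in same_type_docs if keyword in doc)
    return math.log(len(same_type_docs) / (1 + containing))


def select_third_keyword_set(first_keyword_set, same_type_docs, second_threshold):
    """Keep keywords whose intra-class IDF is higher than the second threshold."""
    return [
        keyword
        for keyword in first_keyword_set
        if intra_class_idf(keyword, same_type_docs) > second_threshold
    ]


# Hypothetical same-type corpus; each document is a set of word segments.
same_type_docs = [
    {"enterprise", "tax"},
    {"enterprise", "epidemic"},
    {"enterprise", "tax", "subsidy"},
    {"tax"},
]

third_keyword_set = select_third_keyword_set(
    ["enterprise", "tax", "epidemic"], same_type_docs, second_threshold=0.5
)
```

  • Words such as "enterprise" that occur in most same-type documents get an intra-class IDF near zero and are filtered out, while the rarer "epidemic" is kept.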
  • The solution of the present application further includes: after counting the intra-class inverse document frequency of each keyword in the first keyword set, sorting the keywords in the first keyword set by intra-class inverse document frequency from high to low to obtain the rank of each keyword in the first keyword set. Keywords are then selected in ascending order of rank as the fourth keyword set, and the fourth keyword set is used as the keyword set of the text file to be processed.
  • The selection may start from the first position in the ranking and take as many keywords as required. For example, the first keyword set contains 25 keywords.
  • For example, the official documents issued by the Bureau of Industry and Information Technology of a certain city contain vocabulary specific to the period, and corpus resources in the field to which these official documents belong are lacking.
  • If keywords are extracted from these official documents based on a large-scale corpus and 20 keywords are extracted, the 20 keywords often include words such as "enterprise" or "tax". However, these are very common words in the field of official documents issued by the Bureau of Industry and Information Technology and are not representative.
  • Keywords related to trade wars and epidemics are more representative of the period. When the corpus is replaced with text files in the same field as the official documents issued by the Bureau of Industry and Information Technology, the keywords "enterprise" and "tax" appear in a large number of the text files in the corpus, so their intra-class inverse document frequency is low.
  • Alternatively, the selection may start from a keyword whose rank is greater than 1, and the required number of keywords is then selected in sequence.
  • Alternatively, the intra-class inverse document frequency of each keyword in the first keyword set can be counted and the keywords sorted by it from high to low. The keywords ranked within the top first percentage are taken as the first candidate keyword set, and the keywords ranked within the top second percentage but not in the first candidate keyword set are taken as the second candidate keyword set.
  • The first percentage and the second percentage are both between 0 and 100%, and the first percentage is less than the second percentage.
  • Keywords are then selected from the first candidate keyword set and the second candidate keyword set to form the second keyword set.
  • For example, the first keyword set contains 20 keywords.
  • The keywords ranked in the top 40% are taken as the first candidate keyword set, and the keywords that are not in the first candidate keyword set and rank in the top 80%, that is, the 11 keywords ranked in the 40% to 80% range, are taken as the second candidate keyword set. Keywords are then selected according to the needs of the experiment. For example, if a total of 10 keywords need to be extracted, 5 keywords can be selected from the first candidate keyword set and 5 from the second candidate keyword set.
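  • The percentage-based split into two candidate sets can be sketched as follows. The input is assumed to be already sorted by intra-class inverse document frequency from high to low; the function name, the use of integer percentages with floor rounding, and the ten placeholder keywords are assumptions of this example, not figures from the application.

```python
def split_candidates(ranked_keywords, first_pct, second_pct):
    """Split a ranked keyword list into two candidate sets by percentage cut-offs.

    first_pct and second_pct are integer percentages with first_pct < second_pct.
    Cut positions are rounded down (an assumption of this sketch).
    """
    n = len(ranked_keywords)
    first_cut = n * first_pct // 100
    second_cut = n * second_pct // 100
    first_candidates = ranked_keywords[:first_cut]
    second_candidates = ranked_keywords[first_cut:second_cut]
    return first_candidates, second_candidates


# Hypothetical ranking of ten keywords, best first.
ranked = [f"k{i}" for i in range(1, 11)]
first_candidates, second_candidates = split_candidates(ranked, 40, 80)

# Take the same number of keywords from each candidate set.
selected = first_candidates[:2] + second_candidates[:2]
```

  • Drawing from both bands keeps some mid-ranked keywords in the result instead of taking only the very top of the ranking.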
  • FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in FIG. 2, the method includes:
  • S201 Determine the number of different paragraphs in which the first keyword appears in the text file to be processed, and obtain the number of paragraphs corresponding to the first keyword.
  • Here, "first keyword" does not mean the keyword in first position in the fourth keyword set; it refers to any keyword in the fourth keyword set, in no particular order.
  • For any keyword taken from the fourth keyword set, the number of different paragraphs of the text file to be processed in which the keyword appears is determined, giving the number of paragraphs of that keyword.
  • For example, the text file to be processed has 5 paragraphs in total.
  • If the keyword "infrastructure" appears 3 times in the first paragraph, 5 times in the second paragraph, and 7 times in each of the fourth and fifth paragraphs, then the number of different paragraphs in which "infrastructure" appears in the text file to be processed is 4.
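  • The paragraph count in this example can be reproduced with a short sketch. The function name and the filler words are hypothetical; each paragraph is assumed to be a list of word segments.

```python
def paragraph_count(keyword, paragraphs):
    """Number of different paragraphs in which the keyword appears at least once."""
    return sum(1 for paragraph in paragraphs if keyword in paragraph)


# Five paragraphs mirroring the example: the keyword occurs 3, 5, 0, 7, and 7 times.
paragraphs = [
    ["infrastructure"] * 3 + ["plan"],
    ["infrastructure"] * 5,
    ["budget", "review"],
    ["infrastructure"] * 7,
    ["infrastructure"] * 7 + ["summary"],
]

count = paragraph_count("infrastructure", paragraphs)
```

  • Only presence in a paragraph matters here, so the number of occurrences within each paragraph does not affect the count.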
  • S202 Determine the rank of the number of paragraphs corresponding to each keyword in the fourth keyword set, and obtain a first sorting value.
  • Each keyword in the fourth keyword set has its own number of paragraphs. Sorting the keywords in the fourth keyword set by number of paragraphs yields a ranking, which can be recorded as rank1; the position of the first keyword in rank1 is then determined.
  • This position may be called the first sorting value. Note that the first sorting value is simply the sorting value corresponding to the first keyword; "first" has no special sequential meaning.
  • S203 Calculate the product of the term frequency of the first keyword in the text file to be processed and its second inverse document frequency in the corpus to obtain the weight value of the first keyword, where the corpus includes text files of both the same and different file types as the text file to be processed.
  • In the TF-IDF method, the term frequency TF of each candidate keyword is multiplied by its inverse document frequency IDF to obtain the TF-IDF value of that candidate keyword.
  • Here, the term frequency TF of the first keyword in the text file to be processed is multiplied by the second inverse document frequency to obtain the weight value of the first keyword.
  • The second inverse document frequency is the inverse document frequency computed over a corpus whose text files include both the same and different file types as the text file to be processed, that is, the large-scale corpus mentioned above.
  • S204 Determine the rank of the weight value corresponding to each keyword in the fourth keyword set, and obtain a second sorting value.
  • Each keyword in the fourth keyword set corresponds to a weight value. Sorting the keywords in the fourth keyword set by weight value from largest to smallest yields a ranking of weight values, which can be recorded as rank2; the position of the first keyword in rank2 is recorded as the second sorting value.
  • The second sorting value is simply this sorting value corresponding to the first keyword; "second" has no special sequential meaning.
  • S205 Use the weighted sum of the first sorting value and the second sorting value as the sorting reference value of the first keyword.
  • The first sorting value is the rank of the first keyword in the ranking obtained by sorting according to the number of paragraphs, and the second sorting value is the rank of the first keyword in the ranking obtained by sorting according to the weight value.
  • The first sorting value and the second sorting value are each assigned a weighting value, which can be adjusted through experiments. The weighted sum of the two sorting values is calculated, and this weighted sum is the sorting reference value of the first keyword.
  • For example, the number of paragraphs corresponding to the keyword "infrastructure" is 4. When the keywords in the keyword set are ranked by number of paragraphs from most to fewest, "infrastructure" ranks 3rd in rank1, and its weight value ranks 5th in rank2. If rank1 and rank2 are given weighting values of 0.5 and 0.6 respectively, the final sorting reference value of "infrastructure" is 3*0.5+5*0.6, which is 4.5.
  • In one implementation, the weighting values of the first sorting value and the second sorting value sum to 1, and the weighting value of the first sorting value is greater than 0 and less than or equal to 0.5, so that the final ordering is dominated by the weight value and supplemented by the distribution over paragraphs.
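  • Steps S202 to S205 can be sketched as below. The function names, the three sample keywords with their paragraph counts and weight values, and the weighting values 0.4 and 0.6 (chosen to sum to 1 with the first no greater than 0.5) are hypothetical. Ties are broken by input order, which is a simplification of this sketch.

```python
def rank_values(scores):
    """Map each keyword to its 1-based position when sorted from largest to smallest."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {keyword: position + 1 for position, keyword in enumerate(ordered)}


def sorting_reference_values(paragraph_counts, weight_values, w_rank1=0.4, w_rank2=0.6):
    """Weighted sum of the paragraph-count rank (rank1) and the weight-value rank (rank2)."""
    rank1 = rank_values(paragraph_counts)
    rank2 = rank_values(weight_values)
    return {
        keyword: w_rank1 * rank1[keyword] + w_rank2 * rank2[keyword]
        for keyword in paragraph_counts
    }


# Hypothetical per-keyword paragraph counts and TF-IDF weight values.
paragraph_counts = {"infrastructure": 4, "epidemic": 5, "subsidy": 2}
weight_values = {"infrastructure": 0.30, "epidemic": 0.10, "subsidy": 0.20}

rank1 = rank_values(paragraph_counts)
rank2 = rank_values(weight_values)
reference = sorting_reference_values(paragraph_counts, weight_values)
```

  • With these inputs "infrastructure" is 2nd in rank1 and 1st in rank2, so its reference value is 0.4*2 + 0.6*1, roughly 1.4, the smallest of the three.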
  • S206 Determine the position of the first keyword in the fourth keyword set according to the size of the sorting reference value of the first keyword.
  • For the first keyword, the size of its sorting reference value is the basis for its position in the fourth keyword set. For the fourth keyword set as a whole, each keyword corresponds to a sorting reference value, and the keywords in the fourth keyword set are sorted according to their sorting reference values from large to small. In this way the order of the keywords in the keyword set is adjusted so that the truly representative keywords come first.
  • FIG. 3 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in FIG. 3, the method includes:
  • The second keyword is any keyword in the fourth keyword set; "second" has no special order meaning.
  • Because of word segmentation, a combined word may have been split into multiple word segments, and these word segments must be adjacent in the text file to be processed.
  • The position of the second keyword in the original text may be located first, and the word segments adjacent to that position on the left and right are then selected and combined with the second keyword to obtain a combined word set.
  • The number of word segments separating a selected word segment from the second keyword is less than a threshold, the third threshold, which is a positive integer. When the word segments are combined with the second keyword, they must be combined in the order in which they appear in the text file to be processed to obtain the combined words.
  • For example, the second keyword is denoted as wordn. The word segments whose distance from wordn in the text file to be processed is less than 4 word segments are selected for combination, that is, three word segments on each side of the second keyword are selected. The selected word segments and the second keyword can be recorded as [wordn-3, wordn-2, wordn-1, wordn, wordn+1, wordn+2, wordn+3].
  • The number of word segments can be adjusted through experiments. Three word segments before and after are selected mainly because, in practice, combined words are generally formed from at most four separate word segments.
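  • One way to enumerate candidate combined words around a keyword is sketched below. It assumes that a combined word is a contiguous run of word segments containing the keyword, kept in text order; the function name and the max_parts and sep parameters are hypothetical additions for illustration.

```python
def combined_word_candidates(tokens, index, window=3, max_parts=4, sep=""):
    """Contiguous runs of word segments that contain tokens[index].

    Runs extend at most `window` segments to each side of the keyword, keep
    the original order, and contain between 2 and `max_parts` segments.
    """
    lo = max(0, index - window)
    hi = min(len(tokens) - 1, index + window)
    candidates = []
    for start in range(lo, index + 1):
        for end in range(index, hi + 1):
            length = end - start + 1
            if 2 <= length <= max_parts:
                candidates.append(sep.join(tokens[start:end + 1]))
    return candidates


# Hypothetical segmented sentence; the keyword is the segment at index 2.
tokens = ["new", "type", "infrastructure", "construction", "plan"]
candidates = combined_word_candidates(tokens, index=2, window=1, sep=" ")
```

  • The default empty separator suits Chinese text, where word segments are concatenated directly; a space is passed here only because the example uses English words.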
  • The combined word here is any combined word in the combined word set.
  • The full text of the text file to be processed is traversed for the combined word to obtain the term frequency of the combined word in the text file to be processed.
  • If the term frequency of the combined word divided by the term frequency of the second keyword is greater than a threshold, the fourth threshold, the combined word is more representative of the text file to be processed, and the combined word can replace the second keyword as a keyword.
  • The fourth threshold is a number greater than 0.5 and less than 1.
  • For example, the fourth threshold can be set to a number greater than or equal to 0.75 and less than 1, and it can also be adjusted according to experimental results; this application places no restriction on it.
  • Traversing the full text of the text file to be processed for the combined word and then measuring the combined word by its term frequency guarantees the importance of the combined word in the text file to be processed. That is, a combined word can be used as a keyword only when its frequency in the text, relative to that of the keyword, exceeds a certain value, so the extracted combined words are representative keywords.
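  • The frequency-ratio test for a combined word can be sketched as follows. Because the combined word and the keyword are counted in the same text file, the ratio of their term frequencies is treated here as the ratio of their raw counts. The function names, the sample segments, and the threshold of 0.75 are hypothetical.

```python
def count_occurrences(tokens, phrase_tokens):
    """Count positions where phrase_tokens occurs as a contiguous run in tokens."""
    n = len(phrase_tokens)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens
    )


def promote_combined_word(tokens, keyword, phrase_tokens, fourth_threshold=0.75, sep=" "):
    """Return the combined word if its frequency ratio exceeds the threshold, else the keyword."""
    keyword_count = count_occurrences(tokens, [keyword])
    if keyword_count == 0:
        return keyword
    ratio = count_occurrences(tokens, phrase_tokens) / keyword_count
    return sep.join(phrase_tokens) if ratio > fourth_threshold else keyword


# Hypothetical segmented text: "infrastructure" occurs 5 times, 4 of them after "new".
tokens = [
    "new", "infrastructure", "plan",
    "new", "infrastructure", "fund",
    "old", "infrastructure",
    "new", "infrastructure",
    "new", "infrastructure",
]

promoted = promote_combined_word(tokens, "infrastructure", ["new", "infrastructure"])
unchanged = promote_combined_word(tokens, "infrastructure", ["infrastructure", "plan"])
```

  • The first candidate has a ratio of 4/5 and replaces the keyword, while the second has a ratio of 1/5 and leaves the keyword as it is.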
  • the keywords in the fourth keyword set After completing the combination and screening of all keywords and word segmentations in the fourth keyword set, if there is an inclusion relationship between the keywords in the fourth keyword set, the keywords in the fourth keyword set The word segmentation is performed, and the non-repetitive word segmentation is used as the fourth keyword set, and the above-mentioned fourth keyword set is used as the above-mentioned keyword set of the text file to be processed.
  • the method for keyword extraction provided by this application first improves the method for selecting keywords and filters out the keywords that are not representative in the field to which the text file to be processed belongs;
  • then, by combining keywords with adjacent word segments in the order in which they appear in the text file to be processed, the combined words that were split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.
  • keyword extraction apparatus 40 may include an extraction unit 401, a statistics unit 402, and a determination unit 403, wherein the descriptions of each unit are as follows:
  • Extraction unit 401 configured to perform keyword extraction on the text file to be processed to obtain a first keyword set
  • a statistical unit 402 configured to count the frequency of occurrence of each keyword in the above-mentioned first keyword set in the corpus, where the text file contained in the above-mentioned corpus and the above-mentioned text file to be processed belong to the same file type;
  • the determining unit 403 is configured to use, in the above-mentioned first keyword set, keywords whose frequency of occurrence in the above-mentioned corpus is lower than the first threshold as the second keyword set; and use the above-mentioned second keyword set as the keyword set of the above-mentioned text file to be processed.
  • the above-mentioned statistical unit 402 is further configured to count the first inverse document frequency of each keyword in the above-mentioned first keyword set in the corpus, where the text files contained in the corpus corresponding to the first inverse document frequency are of the same file type as the above-mentioned text file to be processed;
  • the above-mentioned determining unit 403 is further configured to use the keywords whose first inverse document frequency is higher than the second threshold in the above-mentioned first keyword set as the third keyword set; and use the above-mentioned third keyword set as the keyword set of the above-mentioned text file to be processed.
  • the above device further includes:
  • a sorting unit 404 configured to sort the above-mentioned first keyword set according to the above-mentioned first inverse document frequency from high to low;
  • the above determining unit 403 is further configured to select keywords in ascending order of rank as the fourth keyword set according to the ranking of each keyword; and use the above fourth keyword set as the keyword set of the above-mentioned text file to be processed.
  • the above determining unit 403 is further configured to use the keywords ranked within the top first percentage as the first candidate keyword set, and use the keywords that are not included in the first candidate keyword set and are ranked within the top second percentage as a second candidate keyword set; the second percentage is greater than the first percentage; keywords are selected from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
  • the above device further includes:
  • the above determining unit 403 is further configured to determine the number of different paragraphs in which the first keyword appears in the to-be-processed text file, and obtain the number of paragraphs corresponding to the first keyword; Sort the number of paragraphs corresponding to each keyword in the fourth keyword set to obtain a first ranking value; the first keyword is any keyword in the fourth keyword set;
  • the calculation unit 405 is used to calculate the product of the word frequency of the above-mentioned first keyword in the above-mentioned text file to be processed and its second inverse document frequency in the corpus, to obtain the weight value of the above-mentioned first keyword, where the above-mentioned corpus includes text files of the same type as, and of different types from, the text file to be processed;
  • the above-mentioned determining unit 403 is further configured to rank the weight values corresponding to the keywords in the above-mentioned fourth keyword set to obtain a second ranking value; use the weighted sum of the above-mentioned first ranking value and the above-mentioned second ranking value as the sorting reference value of the first keyword; and determine the order of the first keyword in the fourth keyword set according to the size of the sorting reference value of the first keyword.
  • the sum of the weighting value of the first sorting value and the weighting value of the second sorting value is 1, and the weighting value of the first sorting value is greater than 0 and less than or equal to 0.5.
  • the sum of the weighting value of the first sorting value and the weighting value of the second sorting value is set to 1, and the weighting value of the first sorting value is greater than 0 and less than or equal to 0.5, so that the final sorting is dominated by the weight value and supplemented by the distribution of the number of paragraphs.
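The weighted ranking described above can be illustrated with a short Python sketch. It rests on several assumptions that the text does not fix: rank 1 is taken to be the best position in both rankings, ties are broken alphabetically, a smaller sorting reference value is taken to mean a better final position, and the names `rank_positions`, `order_keywords`, and `alpha` are hypothetical.

```python
from typing import Dict, List


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each keyword to its rank (1 = highest score); ties are broken alphabetically."""
    ordered = sorted(scores, key=lambda k: (-scores[k], k))
    return {keyword: position + 1 for position, keyword in enumerate(ordered)}


def order_keywords(paragraph_counts: Dict[str, int],
                   weight_values: Dict[str, float],
                   alpha: float = 0.3) -> List[str]:
    """Order keywords by alpha * first_rank + (1 - alpha) * second_rank.

    alpha is the weighting value of the first sorting value (paragraph distribution);
    the two weighting values sum to 1 and alpha must lie in (0, 0.5].
    """
    if not 0 < alpha <= 0.5:
        raise ValueError("alpha must be greater than 0 and at most 0.5")
    first_rank = rank_positions({k: float(v) for k, v in paragraph_counts.items()})
    second_rank = rank_positions(weight_values)
    reference = {k: alpha * first_rank[k] + (1 - alpha) * second_rank[k]
                 for k in weight_values}
    # Assumption: a smaller reference value means a better final position.
    return sorted(reference, key=lambda k: (reference[k], k))


if __name__ == "__main__":
    paragraphs = {"a": 1, "b": 5, "c": 3}
    weights = {"a": 0.9, "b": 0.5, "c": 0.1}
    print(order_keywords(paragraphs, weights, alpha=0.3))
```

Keeping `alpha` at or below 0.5 means the TF-IDF weight ranking can never be outweighed by the paragraph distribution, which matches the stated intent that the weight value dominates.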
  • the above device further includes:
  • the combining unit 406 is used to combine the second keyword with the word segments whose distance from the second keyword, counted in word segments, is less than a third threshold, in the order in which they appear in the text file to be processed, to obtain a combined word set; the second keyword is any keyword in the fourth keyword set;
  • the above determining unit 403 is further configured to use the combined word as the second keyword when the word frequency of the combined word in the text file to be processed divided by the word frequency of the second keyword in the text file to be processed is greater than the fourth threshold; the combined word is any combined word in the combined word set.
  • the combined word may be divided into multiple word segments after word segmentation, and the above-mentioned multiple word segments must be adjacent in the above-mentioned text file to be processed.
  • the position of the fourth keyword in the original text can be located first, and then the word segments adjacent to that position on the left and right are selected and combined with the second keyword to obtain a combined word set.
  • the number of word segmentation is less than the threshold;
  • the word segments and the above-mentioned second keyword must be combined in the order in which they appear in the text file to be processed to obtain the combined words.
  • the above threshold can be set to 0.75, and can also be adjusted according to the experimental results.
  • the above-mentioned second keyword may appear in multiple positions in the above-mentioned text file to be processed, and the above steps are taken for each position to perform word segmentation selection and combination to obtain a combined word set.
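One way to read the combination step above is sketched below. This is an interpretation, not the applicant's code: the function name `build_combined_words`, the parameter `sep`, and the choice to enumerate every contiguous span that contains the keyword are assumptions. A word segment is treated as eligible when its distance from the keyword, counted in word segments, is less than `third_threshold`, and the segments are always joined in their original order.

```python
from typing import List, Set


def build_combined_words(tokens: List[str], keyword: str,
                         third_threshold: int = 2, sep: str = "") -> Set[str]:
    """Collect candidate combined words around every occurrence of keyword.

    tokens is the segmented text in original order. sep is "" for Chinese text
    and " " for space-delimited languages.
    """
    reach = third_threshold - 1  # largest allowed distance from the keyword
    combined: Set[str] = set()
    for index, token in enumerate(tokens):
        if token != keyword:
            continue
        lowest = max(0, index - reach)
        highest = min(len(tokens), index + reach + 1)
        # Every contiguous span that contains the keyword and at least one neighbour.
        for start in range(lowest, index + 1):
            for end in range(index + 1, highest + 1):
                if end - start > 1:
                    combined.add(sep.join(tokens[start:end]))
    return combined


if __name__ == "__main__":
    segments = ["keyword", "extraction", "method", "is", "useful"]
    print(sorted(build_combined_words(segments, "extraction", sep=" ")))
```

Each candidate in the returned set would then be screened with the word-frequency ratio test before it replaces the keyword.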
  • the keywords in the second keyword set are segmented, the non-repeated word segments are used as the second keyword set, and the above-mentioned second keyword set is used as the keyword set of the above-mentioned text file to be processed.
  • the method for keyword extraction provided by this application first improves the method for selecting keywords, and then optimizes the ordering of each keyword in the keyword set by considering the distribution of keywords in the text file to be processed;
  • finally, by combining keywords with adjacent word segments in the order in which they appear in the text file to be processed, the combined words that were split by word segmentation are extracted, thereby improving the accuracy of keyword extraction.
  • FIG. 5 is a schematic structural diagram of a server disclosed in an embodiment of the present application.
  • the above server 50 may include a memory 501 and a processor 502. Further optionally, a communication interface 503 and a bus 504 may also be included, wherein the memory 501, the processor 502, and the communication interface 503 communicate with each other through the bus 504.
  • the communication interface 503 is used for data interaction with the above-mentioned keyword extraction apparatus 40 .
  • the memory 501 is used to provide a storage space, and data such as an operating system and a computer program can be stored in the storage space.
  • the memory 501 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
  • the processor 502 is a module that performs arithmetic operations and logical operations, and can be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor unit (MPU).
  • a computer program is stored in the memory 501, and the processor 502 calls the computer program stored in the memory 501 to perform the following operations:
  • the keywords whose frequency that appears in the above-mentioned corpus is lower than the first threshold value are used as the second keyword set;
  • the above-mentioned second keyword set is used as the above-mentioned keyword set of the text file to be processed.
  • server 50 may also correspond to the corresponding descriptions of the method embodiments shown in FIG. 2 and FIG. 3 .
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed on one or more processors, the keyword extraction methods shown in FIG. 1, FIG. 2, and FIG. 3 are performed.
  • the storage medium involved in this application such as a computer-readable storage medium, may be non-volatile or volatile.
  • the above-mentioned processes can be completed by hardware related to computer programs.
  • the above-mentioned computer programs can be stored in a computer-readable storage medium.
  • the aforementioned storage medium includes: read-only memory ROM or random-access storage memory RAM, magnetic disk or optical disk and other media that can store computer program codes.

Abstract

A keyword extraction method and a related device. The method comprises: performing keyword extraction on a text file to be processed, to obtain a first keyword set (101); calculating the frequency with which each keyword in the first keyword set occurs in a corpus (102); taking keywords in the first keyword set that have an occurrence frequency in the corpus that is lower than a first threshold as a second keyword set (103), a text file contained in the corpus and the text file to be processed belonging to the same file type; and taking the second keyword set as a keyword set of the text file to be processed (104). According to the method and the device, the accuracy of keyword extraction is improved by improving the keyword selection method during keyword extraction.

Description

Keyword Extraction Method and Related Device
This application claims priority to Chinese patent application No. 202011321892.7, entitled "A Method and Related Device for Keyword Extraction", filed with the China Patent Office on November 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of natural language processing, and in particular to a keyword extraction method and a related device.
Background
Keywords are the vocabulary that a single medium uses when building and using an index, and keyword extraction from text files has long been a research focus in the industry. The term frequency-inverse document frequency (TF-IDF) method extracts the keywords of a text file based on word frequency: the text file from which keywords are to be extracted is first segmented into words; the term frequency and inverse document frequency of each word segment are then counted; finally, the term frequency multiplied by the inverse document frequency is used as the weight value of the word segment, the word segments are sorted by weight value in descending order, and the top-ranked word segments are used as the keywords of the text file.
The inventors realized that, in the above TF-IDF method, the importance of a word is proportional to the number of times the word appears in the text file and inversely proportional to the number of times the word appears in the articles of the corpus. When corpus material in the field of the article from which keywords are to be extracted is scarce, the keywords extracted by the TF-IDF method may not be representative.
Summary of the Invention
The embodiments of the present application disclose a keyword extraction method and a related device, which improve the accuracy of keyword extraction by improving the way keywords are selected during extraction.
In a first aspect, an embodiment of the present application discloses a keyword extraction method, including:
performing keyword extraction on a text file to be processed to obtain a first keyword set;
counting the frequency with which each keyword in the first keyword set occurs in a corpus, and using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, where the text files contained in the corpus are of the same file type as the text file to be processed;
using the second keyword set as the keyword set of the text file to be processed.
In a second aspect, an embodiment of the present application discloses a keyword extraction device, including:
an extraction unit, configured to perform keyword extraction on a text file to be processed to obtain a first keyword set;
a statistical unit, configured to count the frequency with which each keyword in the first keyword set occurs in a corpus, where the text files contained in the corpus are of the same file type as the text file to be processed;
a determining unit, configured to use the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, and to use the second keyword set as the keyword set of the text file to be processed.
In a third aspect, an embodiment of the present application discloses a server, including a processor and a memory, where a computer program is stored in the memory, and the processor calls the computer program stored in the memory to execute the following method:
performing keyword extraction on a text file to be processed to obtain a first keyword set;
counting the frequency with which each keyword in the first keyword set occurs in a corpus, and using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, where the text files contained in the corpus are of the same file type as the text file to be processed;
using the second keyword set as the keyword set of the text file to be processed.
In a fourth aspect, an embodiment of the present application discloses a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on one or more processors, the following method is performed:
performing keyword extraction on a text file to be processed to obtain a first keyword set;
counting the frequency with which each keyword in the first keyword set occurs in a corpus, and using the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, where the text files contained in the corpus are of the same file type as the text file to be processed;
using the second keyword set as the keyword set of the text file to be processed.
The embodiments of the present application first improve the method for selecting keywords, then optimize the ordering of the keywords in the keyword set by considering how the keywords are distributed in the text file to be processed, and finally combine keywords with adjacent word segments in the order in which they appear in the text file to be processed, extracting the combined words that were split by word segmentation and thereby improving the accuracy of keyword extraction.
Description of Drawings
To describe the technical solutions in the embodiments of the present application or in the background more clearly, the accompanying drawings required by the embodiments or the background are briefly introduced below.
FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application;
FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application;
FIG. 3 is a schematic flowchart of yet another keyword extraction method disclosed in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a keyword extraction device disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a server disclosed in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described below with reference to the accompanying drawings.
The terms "first", "second", and the like in the description, claims, and drawings of the present application are used only to distinguish different objects, not to describe a specific order. In addition, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
In this application, "at least one (item)" means one or more, "a plurality" means two or more, and "at least two (items)" means two, or three or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean that only A exists, that only B exists, or that both A and B exist, where A and B can be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or a similar expression refers to any combination of the listed items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c".
The technical solutions of the present application relate to the technical fields of artificial intelligence and/or big data, and may specifically involve natural language processing technology. The present application can be applied to scenarios such as text processing to perform keyword extraction with improved accuracy, thereby promoting the construction of smart cities. Optionally, the data involved in this application, such as text files and/or keywords, may be stored in a database or in a blockchain, for example through distributed blockchain storage; this application places no limitation on this.
The embodiments of the present application are suited to keyword extraction from a text file to be processed when few text files of the same field type are available, that is, when corpus material is scarce. The present application optimizes the traditional keyword extraction method TF-IDF, improving the accuracy of keyword extraction by improving the way keywords are selected during extraction. To describe the solution of the present application more clearly, some background on the TF-IDF method is introduced first.
Corpus: a large-scale electronic text collection that has been sampled and processed, that is, a database that stores text files.
Term frequency (TF): the frequency with which a given word appears in the current text file. Because the same word may occur more times in a long file than in a short one, the count needs to be normalized by the length of the text file. The term frequency is therefore the number of times the given word appears in the current text file divided by the total number of words in the current text file, and can be expressed as:
$$\mathrm{TF}(w) = \frac{\text{number of occurrences of } w \text{ in the current text file}}{\text{total number of words in the current text file}}$$
Inverse document frequency (IDF): a measure of the general importance of a given word. The fewer text files in the corpus a given word appears in, the better the word represents the gist of a text file, and the larger its weight value should be. If a given word appears in a large number of text files in the corpus, it cannot represent the gist of a text file, that is, it cannot clearly indicate the content it stands for, and its weight value should be small. The inverse document frequency can be expressed as:
Figure PCTCN2021097021-appb-000002
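The formula image is not preserved in this text. For orientation only, a commonly used form of the inverse document frequency is given below; it is the conventional textbook definition, not necessarily the exact expression in the original figure. The symbols $N$ (total number of text files in the corpus) and $n_w$ (number of text files containing the word $w$) are introduced here for illustration, and the added 1 in the denominator is a common smoothing choice that avoids division by zero.

```latex
\mathrm{IDF}(w) = \log\frac{N}{1 + n_w}
```

With this form, a word that appears in few text files of the corpus receives a large value and a word that appears in most of them receives a small one, which matches the description above.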
In the TF-IDF keyword extraction method, the importance of a word is proportional to the number of times it appears in the current text file and inversely proportional to its frequency in the text files of the corpus. Therefore, the more frequently a given word appears in the text file to be processed and the less frequently it appears in the text files of the corpus, the better the word represents the meaning of the current text file, and the more suitable it is as a keyword of that text file.
Keyword extraction from the text file to be processed with the TF-IDF method usually follows these steps:
1. Preprocess the text file to be processed.
Preprocessing of the text file to be processed includes word segmentation, part-of-speech tagging, and stop-word removal. Many word segmentation tools are available, including Jieba and Pangu; this step can use Jieba, the most commonly used one, to segment the text file to be processed. Jieba performs efficient word-graph scanning based on a prefix dictionary, builds a directed acyclic graph of all the possible words that the Chinese characters in a sentence can form, and then uses dynamic programming to find the maximum-probability path, that is, the best segmentation based on word frequency. Because Jieba is a very typical word segmentation tool, its principles are not described further here.
Part-of-speech tagging attaches an appropriate part-of-speech label to each word segment, which facilitates sentence analysis and the removal of stop words from the word segment set, for example by retaining only the word segments whose part of speech is noun, proper noun, or verb. Because part-of-speech tagging and stop-word removal are very typical processing steps, their principles are not described further here.
In this way, a candidate keyword set containing n word segments is obtained, denoted as:
2. Calculate the term frequency TF, in the text file to be processed, of each word segment in the candidate keyword set.
3. Calculate the inverse document frequency IDF, over the entire corpus, of each word segment in the candidate keyword set.
4. Multiply TF by IDF to obtain the TF-IDF value of each word segment in the candidate keyword set.
5. Sort the word segments in the candidate keyword set by TF-IDF value in descending order; the top-ranked word segments can be used as the keywords of the text file to be processed.
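Steps 2 to 5 above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the method of any particular library: the input is assumed to be already segmented and filtered (step 1), the corpus is represented as a list of token sets, the IDF uses the conventional smoothed form log(N / (1 + n)), and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def term_frequencies(tokens: List[str]) -> Dict[str, float]:
    """Step 2: occurrences of each word divided by the total number of words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(word: str, corpus: List[Set[str]]) -> float:
    """Step 3: smoothed IDF over a corpus given as one token set per text file."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / (1 + containing))


def top_keywords(tokens: List[str], corpus: List[Set[str]], k: int = 3) -> List[str]:
    """Steps 4 and 5: score each word by TF * IDF and return the k best."""
    tf = term_frequencies(tokens)
    scores = {word: tf[word] * inverse_document_frequency(word, corpus) for word in tf}
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


if __name__ == "__main__":
    document = ["patent", "keyword", "keyword", "the"]
    corpus = [{"the", "cat"}, {"the", "dog"}, {"the", "keyword"}, {"the", "patent", "law"}]
    print(top_keywords(document, corpus, k=2))
```

In the sample, "the" appears in every corpus file and therefore receives a negative score, while "keyword" ranks first because it combines the highest term frequency with a low document count.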
Because the TF-IDF method depends on the corpus, when only a few text files belong to the same field as the text file to be processed, the corpus as a whole contains a large number of text files unrelated to that field, and the keywords extracted by the TF-IDF method are likely to be unrepresentative in the field to which the text file to be processed belongs. The present application provides a new keyword extraction method. First, the method for selecting keywords is improved to filter out keywords that are not representative in the field to which the text file to be processed belongs. Next, the distribution of keywords in the text file to be processed and the inverse document frequency of the keywords are considered together to optimize the rank of each keyword in the keyword set, so that more representative keywords rank higher. Finally, keywords are combined with adjacent word segments in the order in which they appear in the text file to be processed, extracting the combined words that were split by word segmentation and thereby improving the accuracy of keyword extraction.
接下来结合本申请实施例中的附图对本申请实施例进行描述。Next, the embodiments of the present application will be described with reference to the accompanying drawings in the embodiments of the present application.
请参阅图1,图1是本申请实施例公开的一种关键词提取方法的流程示意图,如图所示,上述方法包括:Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the above method includes:
S101:对待处理的文本文件进行关键词提取,得到第一关键词集合。S101: Perform keyword extraction on the text file to be processed to obtain a first keyword set.
在对关键词的选择方法进行优化之前,首先需要对待处理的文本文件进行关键词的初步提取,得到一组关键词集合,可以将上述关键词集合记为第一关键词集合。Before optimizing the method for selecting keywords, it is necessary to initially extract keywords from the text file to be processed to obtain a set of keywords, which can be recorded as the first keyword set.
在上述步骤中,具体的关键词提取工具可以为基于大规模语料和TF-IDF算法的关键词提取开源程序,例如,利用jieba.analyse.extract_tags算法包对待处理的文本文件进行关键词提取。In the above steps, the specific keyword extraction tool may be an open source program for keyword extraction based on large-scale corpus and TF-IDF algorithm, for example, using the jieba.analyse.extract_tags algorithm package to extract keywords from the text file to be processed.
If a corpus contains a large number and variety of text files, including text files of the same type as the text file to be processed as well as text files of different types, the corpus may be called a large-scale corpus, and the text files in it may be called large-scale corpus material. The corpus used to extract the first keyword set is a large-scale corpus.
To avoid missing keywords and to leave room for the subsequent selection among the keywords in the set, the number of extracted keywords may be set to 20 or more.
S102: Count the frequency with which each keyword in the first keyword set appears in a corpus.
In this step, the frequency with which each keyword in the first keyword set appears in the corpus is counted, where the text files contained in this corpus are of the same file type as the text file to be processed.
In step S101, the TF-IDF method used to obtain the first keyword set is based on a large-scale corpus. The keywords in the first keyword set may therefore be representative with respect to the large-scale corpus yet unrepresentative within the field to which the text file to be processed belongs. By building the corpus from text files of the same type as the text file to be processed, the importance of a keyword within that field can be characterized.
The frequency with which a keyword appears in the corpus can be characterized along different dimensions. On a per-document basis, the frequency can be expressed as the number of text files in the corpus that contain the keyword divided by the total number of text files in the corpus. On a per-word basis, the frequency can be expressed as the total number of occurrences of the keyword in the corpus divided by the total number of words in the corpus. The more frequently a keyword appears in the corpus, the less it can represent the main idea of a particular text file.
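The two frequency definitions above can be sketched as follows. This is a minimal illustration, assuming the corpus has already been segmented into token lists; the function names and the toy corpus are hypothetical and do not come from the application.

```python
from typing import List


def document_frequency_ratio(keyword: str, corpus: List[List[str]]) -> float:
    """Per-document view: share of corpus documents containing the keyword."""
    if not corpus:
        return 0.0
    containing = sum(1 for doc in corpus if keyword in doc)
    return containing / len(corpus)


def token_frequency_ratio(keyword: str, corpus: List[List[str]]) -> float:
    """Per-word view: keyword occurrences divided by all tokens in the corpus."""
    total_tokens = sum(len(doc) for doc in corpus)
    if total_tokens == 0:
        return 0.0
    occurrences = sum(doc.count(keyword) for doc in corpus)
    return occurrences / total_tokens


# Toy same-domain corpus (assumed data): three pre-segmented documents.
corpus = [
    ["enterprise", "tax", "policy"],
    ["enterprise", "epidemic", "enterprise"],
    ["tax", "enterprise", "subsidy", "notice"],
]
doc_ratio = document_frequency_ratio("enterprise", corpus)  # 3 of 3 documents
tok_ratio = token_frequency_ratio("enterprise", corpus)     # 4 of 10 tokens
```

Either ratio lies between 0 and 1, so both can be compared against the same kind of threshold.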
S103: Take the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set.
When the text files in the corpus are of the same file type as the text file to be processed, the frequency of each keyword of the first keyword set in the corpus characterizes well the importance of that keyword in the field to which the text file to be processed belongs: the lower a keyword's frequency in the corpus, the more representative the keyword is in that field.
A threshold, namely the first threshold, is set for the frequency with which the keywords appear in the corpus. Keywords whose frequency is lower than the first threshold are taken as keywords that are representative in the field to which the text file to be processed belongs. Since the resulting frequency is a number between 0 and 1, the first threshold may be set, according to experimental needs, to any number greater than 0 and less than 1, provided that at least one keyword in the first keyword set has a frequency lower than the first threshold.
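Step S103 can be sketched as a simple filter. The sketch assumes the same-domain frequencies have already been computed and are passed in as a dictionary; the function name, the sample frequencies, and the threshold value 0.5 are illustrative assumptions.

```python
from typing import Dict, List


def filter_by_first_threshold(
    frequencies: Dict[str, float], first_threshold: float
) -> List[str]:
    """Keep keywords whose same-domain corpus frequency is below the threshold."""
    if not 0.0 < first_threshold < 1.0:
        raise ValueError("first_threshold must be strictly between 0 and 1")
    return [kw for kw, freq in frequencies.items() if freq < first_threshold]


# Assumed frequencies of first-set keywords in a same-domain corpus.
frequencies = {"enterprise": 0.9, "tax": 0.7, "trade war": 0.1, "epidemic": 0.2}
second_keyword_set = filter_by_first_threshold(frequencies, first_threshold=0.5)
```

With these sample values, the common domain words are dropped and only the two low-frequency keywords remain.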
S104: Take the second keyword set as the keyword set of the text file to be processed.
The keywords in the second keyword set are obtained by screening and filtering the first keyword set. Based on the frequency with which each keyword appears in a corpus whose text files are of the same file type as the text file to be processed, keywords that are representative in the field of the text file to be processed are obtained. Taking the second keyword set as the keyword set of the text file to be processed therefore yields more representative keywords.
The solution of the present application further includes counting the inverse document frequency, in a corpus, of each keyword in the first keyword set, where the text files contained in the corpus are of the same file type as the text file to be processed. To distinguish it from the inverse document frequency obtained in the traditional TF-IDF method, this inverse document frequency may be called the intra-class inverse document frequency. According to the formula for inverse document frequency, the larger the intra-class inverse document frequency, the more representative the keyword is in the field to which the text file to be processed belongs. A threshold, namely a second threshold, is set for the intra-class inverse document frequency; the keywords whose intra-class inverse document frequency is higher than the second threshold are taken as a third keyword set, and the third keyword set is taken as the keyword set of the text file to be processed. Keywords that are representative in the large-scale corpus but not in the field of the text file to be processed are thereby filtered out, which improves the precision of keyword extraction.
It should be noted that the second threshold is unrelated to the first threshold. The second threshold is a number greater than 0, and its specific value may be adjusted according to experimental results, provided that at least one keyword in the first keyword set has an intra-class inverse document frequency higher than the second threshold.
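A minimal sketch of the intra-class inverse document frequency and the second-threshold filter follows. The application does not state the exact formula, so the sketch assumes the common smoothed form log(N / (1 + df)), where N is the number of same-domain documents and df is the number of them containing the keyword; the function names and data are hypothetical.

```python
import math
from typing import List, Set


def intra_class_idf(keyword: str, domain_corpus: List[Set[str]]) -> float:
    """Assumed formula: log(N / (1 + df)) over a same-domain corpus."""
    if not domain_corpus:
        raise ValueError("domain_corpus must contain at least one document")
    containing = sum(1 for doc in domain_corpus if keyword in doc)
    return math.log(len(domain_corpus) / (1 + containing))


def filter_by_second_threshold(
    keywords: List[str], domain_corpus: List[Set[str]], second_threshold: float
) -> List[str]:
    """Keep keywords whose intra-class IDF is above the second threshold."""
    if second_threshold <= 0:
        raise ValueError("second_threshold must be greater than 0")
    return [
        kw for kw in keywords
        if intra_class_idf(kw, domain_corpus) > second_threshold
    ]


# Assumed same-domain corpus, each document reduced to its set of tokens.
domain_corpus = [
    {"enterprise", "tax"},
    {"enterprise", "epidemic"},
    {"enterprise", "tax", "subsidy"},
    {"enterprise", "notice"},
]
first_keyword_set = ["enterprise", "tax", "epidemic", "trade war"]
third_keyword_set = filter_by_second_threshold(
    first_keyword_set, domain_corpus, second_threshold=0.5
)
```

Under the assumed formula, a word present in every same-domain document receives a value at or below zero, so any positive threshold removes it.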
The solution of the present application further includes the following. After the intra-class inverse document frequency of each keyword in the first keyword set is counted, the keywords in the first keyword set are sorted by intra-class inverse document frequency from high to low to obtain the rank of each keyword. Keywords are then selected in ascending order of rank to form a fourth keyword set, and the fourth keyword set is taken as the keyword set of the text file to be processed. Specifically, the selection may start from the first-ranked keyword and take the required number of keywords. For example, if the first keyword set contains 25 keywords, then after sorting by intra-class inverse document frequency from high to low, the top 5, top 10, or top 15 keywords may be selected as the fourth keyword set. The number of selected keywords only needs to be smaller than the number of keywords in the first keyword set, and the specific choice may be adjusted according to experimental results.
For example, during the China-US trade war and during the epidemic, official documents issued by a city's Bureau of Industry and Information Technology contain vocabulary specific to that period, while corpus resources for the field of such official documents are scarce. When keywords are extracted from such a document with the traditional TF-IDF method based on a large-scale corpus, and 20 keywords are extracted, those 20 keywords often include words such as "enterprise" or "tax". These words are very common in the field of official documents issued by the Bureau of Industry and Information Technology and are therefore not representative; the keywords actually desired are those related to the trade war and the epidemic. If the text files in the corpus are replaced with text files from the same field as the official documents, the keywords "enterprise" and "tax" occur in a large share of the corpus, so their intra-class inverse document frequency is low. One option is to retain only the keywords whose intra-class inverse document frequency is greater than 0.5. Another is to sort the 20 keywords by intra-class inverse document frequency and retain only the top 10 as the keywords of the official document. Either way, keywords that are not representative in the field of the official document are filtered out.
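The top-ranked selection that forms the fourth keyword set can be sketched as below. The sketch assumes the intra-class inverse document frequencies are already available as a dictionary; the function name and the sample values are hypothetical.

```python
from typing import Dict, List


def top_k_by_intra_class_idf(idf_by_keyword: Dict[str, float], k: int) -> List[str]:
    """Sort keywords by intra-class IDF, high to low, and keep the first k."""
    if not 0 < k < len(idf_by_keyword):
        raise ValueError("k must be positive and smaller than the set size")
    # sorted() is stable, so keywords with equal values keep their input order.
    ranked = sorted(
        idf_by_keyword, key=lambda kw: idf_by_keyword[kw], reverse=True
    )
    return ranked[:k]


# Assumed intra-class IDF values for a five-keyword first set.
idf_by_keyword = {
    "enterprise": 0.05,
    "tax": 0.10,
    "trade war": 1.40,
    "epidemic": 0.90,
    "tariff": 1.10,
}
fourth_keyword_set = top_k_by_intra_class_idf(idf_by_keyword, k=3)
```

The check on k mirrors the stated requirement that fewer keywords are selected than the first keyword set contains.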
Alternatively, the selection may start from a keyword whose rank is greater than 1 and proceed in order until the number of keywords required by the experiment has been selected. In particular, to improve the specificity of the extracted keywords in the field to which the text file to be processed belongs, the intra-class inverse document frequency of each keyword in the first keyword set may be counted and the keywords sorted by it from high to low. The keywords ranked within the top first percentage are taken as a first candidate keyword set, and the keywords ranked within the top second percentage but not belonging to the first candidate keyword set are taken as a second candidate keyword set. The first percentage and the second percentage each take a value between 0 and 100%, and the first percentage is smaller than the second percentage. Keywords are then selected from the first candidate keyword set and the second candidate keyword set to form the fourth keyword set.
For example, suppose the first keyword set contains 20 keywords. After the keywords in the first keyword set are sorted by intra-class inverse document frequency from high to low, the 8 keywords ranked in the top 40% are taken as the first candidate keyword set, and the keywords ranked in the top 80% but not contained in the first candidate keyword set, that is, the 8 keywords ranked between 40% and 80%, are taken as the second candidate keyword set. Keywords are then selected according to experimental needs. For example, if 10 keywords are to be extracted in total, 5 keywords may be selected from the first candidate keyword set and 5 from the second candidate keyword set.
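The two-tier candidate selection can be sketched as follows. The text does not specify how a percentage is rounded to a keyword count or which keywords are picked inside each tier, so the sketch assumes integer percentages with floor rounding and picks the highest-ranked keywords of each tier; all names and the ten-keyword example are hypothetical.

```python
from typing import List, Tuple


def split_candidate_tiers(
    ranked: List[str], first_pct: int, second_pct: int
) -> Tuple[List[str], List[str]]:
    """Split a ranked list into the top first_pct% and the band up to second_pct%."""
    if not 0 < first_pct < second_pct <= 100:
        raise ValueError("require 0 < first_pct < second_pct <= 100")
    n = len(ranked)
    first_cut = n * first_pct // 100    # floor rounding is an assumption
    second_cut = n * second_pct // 100
    return ranked[:first_cut], ranked[first_cut:second_cut]


def select_from_tiers(
    first_tier: List[str], second_tier: List[str], take_first: int, take_second: int
) -> List[str]:
    """Assumed policy: take the highest-ranked keywords from each tier."""
    return first_tier[:take_first] + second_tier[:take_second]


# Ten keywords already sorted by intra-class IDF, high to low (assumed data).
ranked = [f"kw{i}" for i in range(1, 11)]
first_tier, second_tier = split_candidate_tiers(ranked, first_pct=40, second_pct=80)
selected = select_from_tiers(first_tier, second_tier, take_first=2, take_second=2)
```

Here the top 40% holds four keywords and the 40% to 80% band holds the next four, with two taken from each.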
Using text files from the same field as the text file to be processed as the corpus, and screening by the frequency with which keywords appear in that corpus, selects keywords that are representative in the field and improves extraction precision. Besides this frequency-based screening, the distribution of a keyword within the text file to be processed is also a factor in measuring its importance. Some keywords have a high weight value but appear only in one sentence, or in a few closely grouped sentences, of the text file to be processed; such keywords are not very representative. Other keywords have a weight value that is not very high but appear in many places in the text file to be processed; such keywords are more representative. Referring to FIG. 2, FIG. 2 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the method includes:
S201: Determine the number of different paragraphs of the text file to be processed in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword.
The first keyword here is not the first-ranked keyword in the fourth keyword set; it refers to any keyword in the fourth keyword set, with no particular order implied. For any keyword obtained from the fourth keyword set, the number of different paragraphs of the text file to be processed in which the keyword appears is determined, giving the paragraph count of the keyword. For example, suppose the text file to be processed has 5 paragraphs and the keyword "infrastructure" appears 3 times in the first paragraph, 5 times in the second paragraph, and 7 times in each of the fourth and fifth paragraphs. The keyword "infrastructure" then appears in 4 different paragraphs of the text file to be processed.
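Counting the paragraphs that contain a keyword can be sketched as below. The sketch assumes the document has already been split into paragraph strings and uses plain substring matching; the function name and the synthetic paragraphs are hypothetical.

```python
from typing import List


def paragraph_count(keyword: str, paragraphs: List[str]) -> int:
    """Number of distinct paragraphs in which the keyword occurs at least once."""
    return sum(1 for paragraph in paragraphs if keyword in paragraph)


# Five synthetic paragraphs: the keyword occurs 3, 5, 0, 7 and 7 times.
paragraphs = [
    "infrastructure " * 3,
    "infrastructure " * 5,
    "a paragraph on another topic",
    "infrastructure " * 7,
    "infrastructure " * 7,
]
count = paragraph_count("infrastructure", paragraphs)
```

Only presence per paragraph matters here; the number of occurrences inside a paragraph does not change the count.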
S202: Determine the position of this paragraph count in the ordering of the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value.
Each keyword in the fourth keyword set has its own paragraph count. Sorting the keywords in the fourth keyword set by paragraph count from most to fewest gives a paragraph-count ranking, denoted rank1. The position of the first keyword in this ranking is determined and is called the first ranking value. It should be noted that the first ranking value is the ranking value corresponding to the first keyword; "first" here carries no particular ordinal meaning.
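The ranking rank1 can be sketched with a generic helper that turns scores into 1-based rank positions. How ties are handled is not specified in the text, so the sketch assumes tied keywords keep their input order; the helper name and the sample counts are hypothetical.

```python
from typing import Dict


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each keyword to its 1-based position when sorted by score, high to low."""
    # sorted() is stable, so tied keywords keep their input order (assumption).
    ordered = sorted(scores, key=lambda kw: scores[kw], reverse=True)
    return {kw: position for position, kw in enumerate(ordered, start=1)}


# Assumed paragraph counts for the keywords of the fourth keyword set.
paragraph_counts = {"infrastructure": 4, "epidemic": 5, "tariff": 2, "trade war": 5}
rank1 = rank_positions(paragraph_counts)
first_ranking_value = rank1["infrastructure"]
```

The same helper can be reused for any score for which a larger value should receive a better rank.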
S203: Calculate the product of the term frequency of the first keyword in the text file to be processed and its second inverse document frequency in a corpus, to obtain the weight value of the first keyword, where the corpus includes text files of the same type as the text file to be processed as well as text files of different types.
In the preliminary keyword extraction, the term frequency TF of each candidate keyword is multiplied by its inverse document frequency IDF to obtain the TF-IDF value of the candidate keyword. In this step, the term frequency TF of the first keyword in the text file to be processed is multiplied by the second inverse document frequency to obtain the weight value of the first keyword. The second inverse document frequency is computed over a corpus whose text files include both the same file type as the text file to be processed and different file types, that is, the large-scale corpus described above.
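The weight value of step S203 can be sketched as a term frequency multiplied by an inverse document frequency over the general corpus. The exact TF and IDF formulas are not given in the text, so the sketch assumes TF as occurrences divided by document length and IDF as log(N / (1 + df)); all names and data are hypothetical.

```python
import math
from typing import List, Set


def term_frequency(keyword: str, tokens: List[str]) -> float:
    """Assumed TF: keyword occurrences divided by the document's token count."""
    return tokens.count(keyword) / len(tokens) if tokens else 0.0


def inverse_document_frequency(keyword: str, corpus: List[Set[str]]) -> float:
    """Assumed IDF: log(N / (1 + df)) over the large-scale corpus."""
    if not corpus:
        raise ValueError("corpus must contain at least one document")
    containing = sum(1 for doc in corpus if keyword in doc)
    return math.log(len(corpus) / (1 + containing))


def keyword_weight(keyword: str, tokens: List[str], corpus: List[Set[str]]) -> float:
    """Weight value: TF in the document times IDF in the large-scale corpus."""
    return term_frequency(keyword, tokens) * inverse_document_frequency(keyword, corpus)


# Assumed pre-segmented document and a mixed-domain corpus of token sets.
document_tokens = ["infrastructure", "investment", "infrastructure", "policy"]
general_corpus = [
    {"infrastructure", "bonds"},
    {"football", "league"},
    {"recipe", "noodles"},
    {"tax", "enterprise"},
]
weight = keyword_weight("infrastructure", document_tokens, general_corpus)
```

In this toy case the term frequency is 2/4 and the keyword occurs in one of four corpus documents, so the weight equals 0.5 times log(2).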
S204: Determine the position of this weight value in the ordering of the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value.
As with the paragraph count of the first keyword, each keyword in the fourth keyword set has a weight value. Sorting the keywords in the fourth keyword set by weight value from largest to smallest gives a weight-value ranking, denoted rank2. The position of the first keyword in rank2 is taken as the second ranking value. It should be noted that the second ranking value is the ranking value corresponding to the first keyword; "second" here carries no particular ordinal meaning.
S205: Take the weighted sum of the first ranking value and the second ranking value as the ranking reference value of the first keyword.
The first ranking value is the position of the first keyword in the ranking by paragraph count, and the second ranking value is its position in the ranking by weight value. The first ranking value and the second ranking value are each assigned a weighting coefficient, which may be adjusted through experiments. The weighted sum of the two positions and their weighting coefficients is calculated, and this weighted sum is the reference value by which the first keyword is ranked.
For example, suppose the paragraph count corresponding to the keyword "infrastructure" is 4, which places it 3rd in rank1, the ranking of the keywords in the fourth keyword set by paragraph count from most to fewest, and suppose its weight value places it 5th in rank2. If rank1 and rank2 are assigned weighting coefficients of 0.5 and 0.6 respectively, the final ranking reference value of the keyword "infrastructure" is 3*0.5+5*0.6, which is 4.5.
In particular, when weighting coefficients are assigned to the first ranking value and the second ranking value, the sum of the two coefficients may be set to 1, with the coefficient of the first ranking value greater than 0 and less than or equal to 0.5. In this way, the final ranking is driven mainly by the weight value, with the paragraph distribution as a secondary factor.
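The ranking reference value can be sketched as a weighted sum of the two rank positions. The sketch follows the constraint that the two coefficients sum to 1 and that the paragraph-rank coefficient lies in (0, 0.5]; the function names and the sample ranks and coefficient are illustrative assumptions.

```python
from typing import Tuple


def complementary_coefficients(paragraph_coefficient: float) -> Tuple[float, float]:
    """Return (paragraph-rank coefficient, weight-rank coefficient) summing to 1."""
    if not 0.0 < paragraph_coefficient <= 0.5:
        raise ValueError("paragraph_coefficient must be in (0, 0.5]")
    return paragraph_coefficient, 1.0 - paragraph_coefficient


def ranking_reference_value(
    first_ranking_value: int,
    second_ranking_value: int,
    paragraph_coefficient: float,
    weight_coefficient: float,
) -> float:
    """Weighted sum of the paragraph-count rank and the weight-value rank."""
    return (
        first_ranking_value * paragraph_coefficient
        + second_ranking_value * weight_coefficient
    )


# Assumed inputs: 3rd by paragraph count, 5th by weight value, coefficients 0.4 / 0.6.
c1, c2 = complementary_coefficients(0.4)
reference_value = ranking_reference_value(3, 5, c1, c2)  # 3*0.4 + 5*0.6, about 4.2
```

Because the weight-rank coefficient is never smaller than the paragraph-rank coefficient, the weight value dominates the combined score.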
S206: Determine the order of the first keyword in the fourth keyword set according to the magnitude of its ranking reference value.
For any keyword in the fourth keyword set, that is, the first keyword, the magnitude of the ranking reference value is the basis for its position in the fourth keyword set. For the fourth keyword set as a whole, each keyword has a ranking reference value; sorting the keywords in the fourth keyword set by their ranking reference values from largest to smallest adjusts the order of the keywords in the fourth keyword set so that the truly representative keywords come first.
When keywords are extracted, the text file to be processed is usually segmented into words first. After segmentation, some compound words or new words that could themselves serve as keywords of the text file to be processed may be split into two or more tokens, and keyword extraction methods can then often extract only one of those tokens rather than the complete keyword. Referring to FIG. 3, FIG. 3 is a schematic flowchart of another keyword extraction method disclosed in an embodiment of the present application. As shown in the figure, the method includes:
S301: Combine the tokens that are separated from a second keyword by fewer than a third threshold number of tokens with the second keyword, in the order in which they appear in the text file to be processed, to obtain a compound word set.
The second keyword is any keyword in the fourth keyword set and carries no particular ordinal meaning. A compound word may be split into several tokens by word segmentation, and those tokens are necessarily adjacent in the text file to be processed. In this step, the position of the second keyword in the original text may be located first, and the tokens adjacent to that position on the left and right are then selected and combined with the second keyword to obtain the compound word set. When the adjacent tokens are selected, their number on each side is smaller than a threshold, namely the third threshold, which is a positive integer. When the tokens are combined with the second keyword, the tokens and the second keyword must be combined in the order in which they appear in the text file to be processed.
For example, denote the second keyword as word_n. The tokens in the text file to be processed that are separated from word_n by fewer than 4 tokens are selected for combination, that is, the three tokens on each side of the second keyword. The selected tokens together with the second keyword can be written as [word_{n-3}, word_{n-2}, word_{n-1}, word_n, word_{n+1}, word_{n+2}, word_{n+3}]. The keyword and tokens are combined in the order in which they appear in the text file to be processed, for example as [word_{n-3}, word_{n-2}, word_{n-1}, word_n], [word_{n-2}, word_{n-1}, word_n], [word_{n-1}, word_n], [word_n, word_{n+1}], [word_n, word_{n+1}, word_{n+2}], [word_n, word_{n+1}, word_{n+2}, word_{n+3}], [word_{n-1}, word_n, word_{n+1}], and so on, to obtain the compound word set.
It should be noted that the number of tokens may be adjusted through experiments. The example above selects three tokens on each side mainly because, in practice, a compound word generally consists of at most 4 individual tokens. The second keyword may appear at several positions in the text file to be processed; the steps above are applied at each position to select and combine tokens and obtain the compound word set.
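Generating the candidate compound words around one keyword position can be sketched as below. The sketch enumerates contiguous token spans that contain the keyword and reach at most `third_threshold - 1` tokens to either side; capping each span at `third_threshold` tokens is an assumption drawn from the remark that compound words generally consist of at most 4 tokens. The function name and sample tokens are hypothetical.

```python
from typing import List, Tuple


def candidate_combinations(
    tokens: List[str], index: int, third_threshold: int = 4
) -> List[Tuple[str, ...]]:
    """Contiguous spans around tokens[index], kept in document order."""
    if third_threshold < 1:
        raise ValueError("third_threshold must be a positive integer")
    reach = third_threshold - 1            # tokens allowed on each side
    lowest = max(0, index - reach)
    highest = min(len(tokens) - 1, index + reach)
    combos = []
    for start in range(lowest, index + 1):
        for end in range(index, highest + 1):
            length = end - start + 1
            # Skip the keyword alone; cap span length (assumed cap).
            if 1 < length <= third_threshold:
                combos.append(tuple(tokens[start:end + 1]))
    return combos


# Assumed segmentation of a short phrase; the keyword sits at index 2.
tokens = ["new", "type", "infrastructure", "construction", "project"]
combos = candidate_combinations(tokens, index=2, third_threshold=4)
```

If the keyword occurs at several positions, the function would be called once per position and the results merged into one compound word set.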
S302: When the term frequency of a compound word in the text file to be processed divided by the term frequency of the second keyword in the text file to be processed is greater than a fourth threshold, take the compound word as the second keyword.
The compound word is any compound word in the compound word set. The full text of the text file to be processed is traversed to obtain the term frequency of the compound word in the text file. If the term frequency of the compound word divided by the term frequency of the second keyword is greater than the threshold, the compound word is more representative of the text file to be processed, and the compound word can be taken as the second keyword; that is, the compound word replaces the second keyword.
The fourth threshold is a number greater than 0.5 and less than 1. To ensure that the compound word is representative of the text file to be processed, the threshold may be set to a number less than 1 and greater than or equal to 0.75, or it may be adjusted according to experimental results; the present application imposes no limitation on this. Traversing the full text of the text file to be processed for the compound word and using term frequency to evaluate it ensures the importance of the compound word in the text file: only when the compound word occurs in the text with a frequency exceeding a certain proportion of the keyword's frequency is the compound word taken as a keyword, so the compound words extracted in this way are representative keywords.
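The replacement test of step S302 can be sketched as a ratio of occurrence counts in the full text. The text does not say what to do when several compound words pass the test, so the sketch assumes the one with the highest ratio is chosen; the function name, the sample text, and the default threshold of 0.75 are illustrative assumptions.

```python
from typing import List


def replace_with_compound(
    keyword: str, compounds: List[str], text: str, fourth_threshold: float = 0.75
) -> str:
    """Return a compound word if it occurs often enough relative to the keyword."""
    if not 0.5 < fourth_threshold < 1.0:
        raise ValueError("fourth_threshold must be between 0.5 and 1")
    keyword_count = text.count(keyword)
    if keyword_count == 0:
        return keyword
    best, best_ratio = keyword, fourth_threshold
    for compound in compounds:
        ratio = text.count(compound) / keyword_count
        if ratio > best_ratio:  # assumed policy: keep the highest ratio
            best, best_ratio = compound, ratio
    return best


# Assumed text: the keyword occurs 5 times, 4 of them as "new infrastructure".
text = (
    "new infrastructure plan. new infrastructure fund. infrastructure bonds. "
    "new infrastructure sites. new infrastructure jobs."
)
result = replace_with_compound(
    "infrastructure", ["new infrastructure", "infrastructure plan"], text
)
```

With a ratio of 4/5 the compound word passes a threshold of 0.75, while a stricter threshold such as 0.9 would leave the original keyword in place.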
In particular, after all keywords in the fourth keyword set have been combined with tokens and screened, if containment relationships exist among the keywords of the fourth keyword set, the keywords in the fourth keyword set are segmented into tokens, the non-duplicate tokens are taken as the fourth keyword set, and the fourth keyword set is taken as the keyword set of the text file to be processed.
In summary, the keyword extraction method provided by the present application first improves the keyword selection method, filtering out keywords that are not representative in the field to which the text file to be processed belongs. It then considers both the distribution of each keyword within the text file to be processed and the inverse document frequency of the keyword to optimize the ranking of the keywords in the keyword set, so that more representative keywords rank higher. Finally, it combines each keyword with its adjacent tokens in the order in which they appear in the text file to be processed, recovering compound words that were split by word segmentation and thereby improving the precision of keyword extraction.
The method of the embodiments of the present application has been described in detail above; the apparatus of the embodiments of the present application is provided below.
FIG. 4 is a schematic structural diagram of a keyword extraction apparatus disclosed in an embodiment of the present application. The keyword extraction apparatus 40 may include an extraction unit 401, a statistics unit 402, and a determining unit 403, which are described as follows:
The extraction unit 401 is configured to perform keyword extraction on the text file to be processed to obtain a first keyword set;
The statistics unit 402 is configured to count the frequency with which each keyword in the first keyword set appears in a corpus, where the text files contained in the corpus are of the same file type as the text file to be processed;
The determining unit 403 is configured to take the keywords in the first keyword set whose frequency in the corpus is lower than a first threshold as a second keyword set, and to take the second keyword set as the keyword set of the text file to be processed.
In a possible implementation, the statistics unit 402 is further configured to count a first inverse document frequency, in a corpus, of each keyword in the first keyword set, where the text files contained in the corpus corresponding to the first inverse document frequency are of the same file type as the text file to be processed;
The determining unit 403 is further configured to take the keywords in the first keyword set whose first inverse document frequency is higher than a second threshold as a third keyword set, and to take the third keyword set as the keyword set of the text file to be processed.
In a possible implementation, the apparatus further includes:
a sorting unit 404, configured to sort the first keyword set by the first inverse document frequency from high to low;
The determining unit 403 is further configured to select keywords in ascending order of rank position as a fourth keyword set, and to take the fourth keyword set as the keyword set of the text file to be processed.
In a possible implementation, the determining unit 403 is further configured to take the keywords ranked in the top first percentage as a first candidate keyword set, and to take the keywords that are ranked in the top second percentage and are not in the first candidate keyword set as a second candidate keyword set, where the second percentage is greater than the first percentage; and to select keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
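The sorting and percentage-based tiering can be sketched as below. The function name `split_candidate_tiers`, the use of integer percentages, and rounding the tier sizes up are assumptions; the application also leaves open how keywords are finally chosen from the two candidate sets, so the sketch stops at building them.

```python
def split_candidate_tiers(idf_by_keyword, first_pct, second_pct):
    """Rank keywords by IDF (high to low) and split them into two tiers.

    first_pct, second_pct: integer percentages, with second_pct > first_pct.
    Returns (first_candidates, second_candidates).
    """
    if not 0 < first_pct < second_pct <= 100:
        raise ValueError("require 0 < first_pct < second_pct <= 100")
    # Rank 1 is the keyword with the highest inverse document frequency.
    ranked = sorted(idf_by_keyword, key=lambda kw: idf_by_keyword[kw], reverse=True)
    n = len(ranked)
    # Ceiling division so that small sets still yield at least one keyword.
    k1 = (n * first_pct + 99) // 100
    k2 = (n * second_pct + 99) // 100
    first_candidates = ranked[:k1]
    # Second tier: inside the top second_pct but not already in the first tier.
    second_candidates = ranked[k1:k2]
    return first_candidates, second_candidates


idf = {"a": 3.0, "b": 2.5, "c": 2.0, "d": 1.0, "e": 0.5}
tier1, tier2 = split_candidate_tiers(idf, first_pct=20, second_pct=60)
```

With five keywords, the top 20% is the single highest-ranked keyword and the second tier holds the next two.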
In a possible implementation, the apparatus further includes:
The determining unit 403 is further configured to determine the number of distinct paragraphs of the text file to be processed in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword; and to determine the rank of that paragraph count among the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value, where the first keyword is any keyword in the fourth keyword set;
a calculation unit 405, configured to calculate the product of the term frequency of the first keyword in the text file to be processed and a second inverse document frequency in a corpus, to obtain a weight value of the first keyword, where the corpus includes text files both of the same type as and of different types from the text file to be processed;
The determining unit 403 is further configured to determine the rank of the weight value among the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value; to take the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword; and to determine the order of the first keyword in the fourth keyword set according to the magnitude of its ranking reference value.
In a possible implementation, the weighting coefficient of the first ranking value and the weighting coefficient of the second ranking value sum to 1, and the weighting coefficient of the first ranking value is greater than 0 and less than or equal to 0.5.
In this embodiment of the present application, when weighting coefficients are assigned to the first ranking value and the second ranking value, their sum is set to 1 and the weighting coefficient of the first ranking value is greater than 0 and less than or equal to 0.5. In this way, the final ordering is driven mainly by the weight value, with the paragraph-count distribution as a secondary factor.
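The combined ranking can be sketched as follows. Several details are assumptions rather than statements from the application: rank 1 is given to the largest paragraph count and the largest weight value, a smaller ranking reference value places a keyword earlier, ties keep input order, and the second inverse document frequencies are passed in precomputed as the hypothetical dictionary `idf2`.

```python
def rank_positions(scores):
    """Map each keyword to a 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda kw: scores[kw], reverse=True)
    return {kw: i + 1 for i, kw in enumerate(ordered)}


def order_keywords(paragraphs, keywords, idf2, w1=0.3):
    """Order keywords by a weighted sum of two ranks.

    paragraphs: list of paragraphs, each a list of word segments.
    idf2: precomputed second inverse document frequency per keyword.
    w1: weighting coefficient of the paragraph-count rank, 0 < w1 <= 0.5.
    """
    if not 0 < w1 <= 0.5:
        raise ValueError("w1 must be in (0, 0.5]")
    # Number of distinct paragraphs in which each keyword appears.
    para_count = {kw: sum(1 for p in paragraphs if kw in p) for kw in keywords}
    # Weight value: term frequency in the document times the second IDF.
    weight = {
        kw: sum(p.count(kw) for p in paragraphs) * idf2[kw] for kw in keywords
    }
    first_rank = rank_positions(para_count)
    second_rank = rank_positions(weight)
    reference = {
        kw: w1 * first_rank[kw] + (1 - w1) * second_rank[kw] for kw in keywords
    }
    # Assumption: a smaller ranking reference value means an earlier position.
    return sorted(keywords, key=lambda kw: reference[kw])


paragraphs = [["loan", "rate"], ["loan", "bank"], ["loan"], ["rate"]]
ordered = order_keywords(
    paragraphs,
    ["loan", "rate", "bank"],
    idf2={"loan": 0.5, "rate": 2.0, "bank": 3.5},
)
```

In this example "loan" appears in the most paragraphs but has the lowest weight value, so with `w1 = 0.3` it ends up last, which illustrates the weight value dominating the final order.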
In a possible implementation, the apparatus further includes:
a combining unit 406, configured to combine word segments whose distance from a second keyword, measured in number of word segments, is less than a third threshold with the second keyword, in the order in which they appear in the text file to be processed, to obtain a combined-word set, where the second keyword is any keyword in the fourth keyword set;
The determining unit 403 is further configured to take a combined word as the second keyword when the term frequency of the combined word in the text file to be processed divided by the term frequency of the second keyword in the text file to be processed is greater than a fourth threshold, where the combined word is any combined word in the combined-word set.
In this embodiment of the present application, a combined word may be split into multiple word segments during word segmentation, and those word segments are necessarily adjacent in the text file to be processed. In the above steps, the position of the second keyword (a keyword from the fourth keyword set) in the original text may first be located, and the word segments adjacent to that position on the left and right are then selected and combined with the second keyword to obtain the combined-word set. When the adjacent word segments are selected, the number of word segments is less than the third threshold; when they are combined with the second keyword, the word segments and the second keyword must be joined in the order in which they appear in the text file to be processed.
A full-text traversal of the text file to be processed is performed for the combined word to obtain its term frequency in that file. If the term frequency of the combined word divided by the term frequency of the second keyword is greater than the fourth threshold, the combined word is more representative of the text file to be processed, so the combined word can be taken as the second keyword; that is, the combined word replaces the second keyword.
The fourth threshold may be set to 0.75, or adjusted according to experimental results. Traversing the full text for the combined word and then using term frequency to evaluate it ensures that the combined word is important in the text file to be processed: only when the combined word occurs often enough in the text relative to the keyword, so that the ratio exceeds a certain value, can the combined word be taken as a keyword. Combined words extracted in this way are representative keywords.
It should be noted that the number of word segments may be adjusted through experiments. In the above example, three word segments before and after the keyword are selected, mainly because combined words in practice are generally formed from at most four individual word segments.
The second keyword may appear at multiple positions in the text file to be processed; the above steps are performed at each position, selecting and combining word segments to obtain the combined-word set.
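The combined-word step can be sketched as follows. This is a hedged illustration only: the function names, the representation of the document as a flat list of word segments, counting occurrences as contiguous segment sequences rather than raw substrings, and limiting each candidate to at most `window` extra segments in total are all assumptions not fixed by the application.

```python
def count_span(tokens, span):
    """Count occurrences of a contiguous segment sequence in the document."""
    n = len(span)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == span
    )


def promote_combined_words(tokens, keyword, window=3, fourth_threshold=0.75):
    """Return combined words whose frequency ratio to `keyword` exceeds the threshold.

    tokens: word segments of the document in original order.
    window: assumed maximum number of neighbouring segments joined to the keyword.
    """
    keyword_freq = tokens.count(keyword)
    if keyword_freq == 0:
        return {}
    candidates = set()
    # Build candidates at every position where the keyword occurs.
    for pos, tok in enumerate(tokens):
        if tok != keyword:
            continue
        for left in range(window + 1):
            for right in range(window + 1 - left):
                if left + right == 0:
                    continue  # the keyword alone is not a combined word
                start, end = pos - left, pos + right + 1
                if start < 0 or end > len(tokens):
                    continue
                # Slicing keeps the segments in their original document order.
                candidates.add(tuple(tokens[start:end]))
    promoted = {}
    for span in candidates:
        ratio = count_span(tokens, span) / keyword_freq
        if ratio > fourth_threshold:
            promoted[span] = ratio
    return promoted


tokens = [
    "machine", "learning", "is", "fun",
    "machine", "learning", "works",
    "machine", "learning",
]
promoted = promote_combined_words(tokens, "machine")
```

In this toy document every occurrence of "machine" is followed by "learning", so the pair is the only candidate whose ratio exceeds 0.75 and would replace the single keyword.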
In particular, after all keywords in the fourth keyword set have been combined with word segments and screened, if an inclusion relationship exists between keywords in the fourth keyword set, word segmentation is performed on the keywords in the second keyword set, the non-repeated word segments are taken as the second keyword set, and the second keyword set is taken as the keyword set of the text file to be processed.
In summary, the keyword extraction method provided by the present application first improves the way keywords are selected, then optimizes the order of the keywords in the keyword set by taking into account how the keywords are distributed in the text file to be processed, and finally combines keywords with adjacent word segments in the order in which they appear in the text file to be processed, so as to recover combined words that were split by word segmentation, thereby improving the precision of keyword extraction.
Please refer to FIG. 5, which is a schematic structural diagram of a server disclosed in an embodiment of the present application. The server 50 may include a memory 501 and a processor 502. Optionally, it may further include a communication interface 503 and a bus 504, where the memory 501, the processor 502, and the communication interface 503 are communicatively connected to one another through the bus 504. The communication interface 503 is used for data exchange with the keyword extraction apparatus 40.
The memory 501 is used to provide storage space, in which data such as an operating system and computer programs can be stored. The memory 501 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
The processor 502 is a module that performs arithmetic and logical operations, and may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor unit (MPU).
A computer program is stored in the memory 501, and the processor 502 calls the computer program stored in the memory 501 to perform the following operations:
performing keyword extraction on a text file to be processed to obtain a first keyword set;
counting the frequency with which each keyword in the first keyword set occurs in a corpus, where the text files contained in the corpus are of the same file type as the text file to be processed;
taking the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set;
taking the second keyword set as the keyword set of the text file to be processed.
It should be noted that, for the specific implementation of the server 50, reference may also be made to the corresponding descriptions of the method embodiments shown in FIG. 2 and FIG. 3.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program runs on one or more processors, the keyword extraction methods shown in FIG. 1, FIG. 2, and FIG. 3 can be implemented.
Optionally, the storage medium involved in the present application, such as a computer-readable storage medium, may be non-volatile or volatile.
In summary, the keyword selection method is first improved to filter out keywords that are not representative of the field to which the text file to be processed belongs. Next, the rank positions of the keywords in the keyword set are optimized by jointly considering the distribution of each keyword in the text file to be processed and its inverse document frequency, so that more representative keywords rank higher. Finally, keywords are combined with adjacent word segments in the order in which they appear in the text file to be processed, recovering combined words that were split by word segmentation and thereby improving the precision of keyword extraction.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media that can store computer program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (20)

  1. A keyword extraction method, comprising:
    performing keyword extraction on a text file to be processed to obtain a first keyword set;
    counting the frequency with which each keyword in the first keyword set occurs in a corpus, and taking the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus are of the same file type as the text file to be processed;
    taking the second keyword set as the keyword set of the text file to be processed.
  2. The method according to claim 1, wherein the method further comprises:
    computing a first inverse document frequency of each keyword in the first keyword set in a corpus, wherein the text files contained in the corpus corresponding to the first inverse document frequency are of the same file type as the text file to be processed;
    taking the keywords in the first keyword set whose first inverse document frequency is higher than a second threshold as a third keyword set;
    taking the third keyword set as the keyword set of the text file to be processed.
  3. The method according to claim 2, wherein the method further comprises:
    sorting the first keyword set by the first inverse document frequency from high to low to obtain the rank position of each keyword in the first keyword set;
    selecting keywords in ascending order of rank position as a fourth keyword set;
    taking the fourth keyword set as the keyword set of the text file to be processed.
  4. The method according to claim 3, wherein selecting keywords in ascending order of rank position as the fourth keyword set comprises:
    taking the keywords ranked in the top first percentage as a first candidate keyword set, and taking the keywords that are ranked in the top second percentage and are not in the first candidate keyword set as a second candidate keyword set, wherein the second percentage is greater than the first percentage;
    selecting keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
  5. The method according to claim 3, wherein the method further comprises:
    determining the number of distinct paragraphs of the text file to be processed in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword; determining the rank of the paragraph count among the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value, wherein the first keyword is any keyword in the fourth keyword set;
    calculating the product of the term frequency of the first keyword in the text file to be processed and a second inverse document frequency in a corpus, to obtain a weight value of the first keyword, wherein the corpus includes text files both of the same type as and of different types from the text file to be processed; determining the rank of the weight value among the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;
    taking the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;
    determining the order of the first keyword in the fourth keyword set according to the magnitude of the ranking reference value of the first keyword.
  6. The method according to claim 5, wherein the method further comprises:
    the weighting coefficient of the first ranking value and the weighting coefficient of the second ranking value sum to 1, and the weighting coefficient of the first ranking value is greater than 0 and less than or equal to 0.5.
  7. The method according to claim 6, wherein the method further comprises:
    combining word segments whose distance from a second keyword, measured in number of word segments, is less than a third threshold with the second keyword, in the order in which they appear in the text file to be processed, to obtain a combined-word set, wherein the second keyword is any keyword in the fourth keyword set;
    taking a combined word as the second keyword when the term frequency of the combined word in the text file to be processed divided by the term frequency of the second keyword in the text file to be processed is greater than a fourth threshold, wherein the combined word is any combined word in the combined-word set.
  8. A keyword extraction apparatus, wherein the apparatus comprises:
    an extraction unit, configured to perform keyword extraction on a text file to be processed to obtain a first keyword set;
    a statistics unit, configured to count the frequency with which each keyword in the first keyword set occurs in a corpus, wherein the text files contained in the corpus are of the same file type as the text file to be processed;
    a determining unit, configured to take the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, and to take the second keyword set as the keyword set of the text file to be processed.
  9. A server, wherein the server comprises a processor and a memory, the memory stores a computer program, and the processor calls the computer program stored in the memory to execute the following method:
    performing keyword extraction on a text file to be processed to obtain a first keyword set;
    counting the frequency with which each keyword in the first keyword set occurs in a corpus, and taking the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus are of the same file type as the text file to be processed;
    taking the second keyword set as the keyword set of the text file to be processed.
  10. The server according to claim 9, wherein the processor is further configured to perform:
    computing a first inverse document frequency of each keyword in the first keyword set in a corpus, wherein the text files contained in the corpus corresponding to the first inverse document frequency are of the same file type as the text file to be processed;
    taking the keywords in the first keyword set whose first inverse document frequency is higher than a second threshold as a third keyword set;
    taking the third keyword set as the keyword set of the text file to be processed.
  11. The server according to claim 10, wherein the processor is further configured to perform:
    sorting the first keyword set by the first inverse document frequency from high to low to obtain the rank position of each keyword in the first keyword set;
    selecting keywords in ascending order of rank position as a fourth keyword set;
    taking the fourth keyword set as the keyword set of the text file to be processed.
  12. The server according to claim 11, wherein performing the selecting of keywords in ascending order of rank position as the fourth keyword set comprises:
    taking the keywords ranked in the top first percentage as a first candidate keyword set, and taking the keywords that are ranked in the top second percentage and are not in the first candidate keyword set as a second candidate keyword set, wherein the second percentage is greater than the first percentage;
    selecting keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
  13. The server according to claim 11, wherein the processor is further configured to perform:
    determining the number of distinct paragraphs of the text file to be processed in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword; determining the rank of the paragraph count among the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value, wherein the first keyword is any keyword in the fourth keyword set;
    calculating the product of the term frequency of the first keyword in the text file to be processed and a second inverse document frequency in a corpus, to obtain a weight value of the first keyword, wherein the corpus includes text files both of the same type as and of different types from the text file to be processed; determining the rank of the weight value among the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;
    taking the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;
    determining the order of the first keyword in the fourth keyword set according to the magnitude of the ranking reference value of the first keyword.
  14. The server according to claim 13, wherein the weighting coefficient of the first ranking value and the weighting coefficient of the second ranking value sum to 1, and the weighting coefficient of the first ranking value is greater than 0 and less than or equal to 0.5;
    the processor is further configured to perform:
    combining word segments whose distance from a second keyword, measured in number of word segments, is less than a third threshold with the second keyword, in the order in which they appear in the text file to be processed, to obtain a combined-word set, wherein the second keyword is any keyword in the fourth keyword set;
    taking a combined word as the second keyword when the term frequency of the combined word in the text file to be processed divided by the term frequency of the second keyword in the text file to be processed is greater than a fourth threshold, wherein the combined word is any combined word in the combined-word set.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program runs on one or more processors, the following method is executed:
    performing keyword extraction on a text file to be processed to obtain a first keyword set;
    counting the frequency with which each keyword in the first keyword set occurs in a corpus, and taking the keywords in the first keyword set whose frequency of occurrence in the corpus is lower than a first threshold as a second keyword set, wherein the text files contained in the corpus are of the same file type as the text file to be processed;
    taking the second keyword set as the keyword set of the text file to be processed.
  16. The computer-readable storage medium according to claim 15, wherein the computer program, when run on one or more processors, is further used to perform:
    computing a first inverse document frequency of each keyword in the first keyword set in a corpus, wherein the text files contained in the corpus corresponding to the first inverse document frequency are of the same file type as the text file to be processed;
    taking the keywords in the first keyword set whose first inverse document frequency is higher than a second threshold as a third keyword set;
    taking the third keyword set as the keyword set of the text file to be processed.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when run on one or more processors, is further used to perform:
    sorting the first keyword set by the first inverse document frequency from high to low to obtain the rank position of each keyword in the first keyword set;
    selecting keywords in ascending order of rank position as a fourth keyword set;
    taking the fourth keyword set as the keyword set of the text file to be processed.
  18. The computer-readable storage medium according to claim 17, wherein performing the selecting of keywords in ascending order of rank position as the fourth keyword set comprises:
    taking the keywords ranked in the top first percentage as a first candidate keyword set, and taking the keywords that are ranked in the top second percentage and are not in the first candidate keyword set as a second candidate keyword set, wherein the second percentage is greater than the first percentage;
    selecting keywords from the first candidate keyword set and the second candidate keyword set as the fourth keyword set.
  19. The computer-readable storage medium of claim 17, wherein the computer program, when run on one or more processors, is further configured to perform:
    determining the number of distinct paragraphs of the to-be-processed text file in which a first keyword appears, to obtain the paragraph count corresponding to the first keyword; determining the rank of that paragraph count among the paragraph counts corresponding to the keywords in the fourth keyword set, to obtain a first ranking value; the first keyword being any keyword in the fourth keyword set;
    calculating the product of the term frequency of the first keyword in the to-be-processed text file and its second inverse document frequency in a corpus, to obtain a weight value of the first keyword, the corpus including text files of the same type as, and of different types from, the to-be-processed text file; determining the rank of the weight value among the weight values corresponding to the keywords in the fourth keyword set, to obtain a second ranking value;
    using the weighted sum of the first ranking value and the second ranking value as a ranking reference value of the first keyword;
    determining the position of the first keyword in the fourth keyword set according to the magnitude of its ranking reference value.
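
The ranking steps of claim 19 can be sketched as below. Several details are assumptions made only for illustration: ranks are 1-based with 1 as the best, ties are broken alphabetically, the inverse document frequency uses a smoothed logarithmic form, and a smaller ranking reference value is treated as better. The names `rank_positions`, `order_keywords`, `doc_freq`, `corpus_size`, and `w_paragraph` are hypothetical.

```python
import math
from collections import Counter


def rank_positions(scores):
    """Map each keyword to a 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda k: (-scores[k], k))
    return {kw: i + 1 for i, kw in enumerate(ordered)}


def order_keywords(paragraphs, keywords, doc_freq, corpus_size, w_paragraph=0.4):
    """Order keywords by a weighted sum of two ranks.

    paragraphs: list of token lists for the text being processed.
    doc_freq:   keyword -> number of corpus documents containing it.
    """
    tf = Counter(tok for para in paragraphs for tok in para)
    # First ranking value: rank of the number of distinct paragraphs.
    para_count = {k: sum(1 for p in paragraphs if k in p) for k in keywords}
    # Second ranking value: rank of the TF-IDF weight (IDF form is assumed).
    weight = {
        k: tf[k] * math.log(corpus_size / (1 + doc_freq.get(k, 0)))
        for k in keywords
    }
    r1, r2 = rank_positions(para_count), rank_positions(weight)
    # Ranking reference value: weighted sum of the two ranks.
    ref = {k: w_paragraph * r1[k] + (1 - w_paragraph) * r2[k] for k in keywords}
    # Assumption: a smaller reference value places the keyword earlier.
    return sorted(keywords, key=lambda k: (ref[k], k))


if __name__ == "__main__":
    paras = [["risk", "model", "risk"], ["model", "data"], ["model", "risk"]]
    df = {"risk": 1, "model": 8, "data": 4}
    print(order_keywords(paras, ["risk", "model", "data"], df, corpus_size=10))
```

The default `w_paragraph=0.4` lies in the interval (0, 0.5] that claim 20 places on the weight of the first ranking value, so the TF-IDF rank contributes at least as much as the paragraph-count rank.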
  20. The computer-readable storage medium of claim 19, wherein the sum of the weight of the first ranking value and the weight of the second ranking value is 1, and the weight of the first ranking value is greater than 0 and less than or equal to 0.5;
    the computer program, when run on one or more processors, is further configured to perform:
    combining each segmented word that is separated from a second keyword by fewer than a third threshold number of segmented words with the second keyword, in the order in which they appear in the to-be-processed text file, to obtain a combined-word set, the second keyword being any keyword in the fourth keyword set;
    where the term frequency of a combined word in the to-be-processed text file divided by the term frequency of the second keyword in the to-be-processed text file is greater than a fourth threshold, using the combined word as the second keyword, the combined word being any combined word in the combined-word set.
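
The combined-word step of claim 20 can be sketched as follows. The claim leaves several points open, so the sketch makes labeled assumptions: the distance between two segmented words is taken as the difference of their positions, the frequency of a combined word is counted as the number of times the pair occurs within that distance, and when several combined words pass the threshold the most frequent one is chosen. The name `merge_keyword` and its default threshold values are hypothetical.

```python
from collections import Counter


def merge_keyword(tokens, keyword, third_threshold=2, fourth_threshold=0.5, sep=" "):
    """Return the combined word that replaces `keyword`, or `keyword` itself."""
    keyword_tf = tokens.count(keyword)
    if keyword_tf == 0:
        return keyword
    combined_tf = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        # Neighbours whose position differs by less than third_threshold.
        lo = max(0, i - third_threshold + 1)
        hi = min(len(tokens), i + third_threshold)
        for j in range(lo, hi):
            if j == i or tokens[j] == keyword:
                continue
            # Keep document order: the earlier token comes first.
            pair = (tokens[j], keyword) if j < i else (keyword, tokens[j])
            combined_tf[sep.join(pair)] += 1
    if not combined_tf:
        return keyword
    # Assumption: pick the most frequent combined word (ties by name).
    best, best_tf = max(combined_tf.items(), key=lambda kv: (kv[1], kv[0]))
    return best if best_tf / keyword_tf > fourth_threshold else keyword


if __name__ == "__main__":
    text = "machine learning model uses machine learning for deep learning"
    print(merge_keyword(text.split(), "learning"))
```

For Chinese text, where segmented words are written without spaces, `sep=""` would be the natural choice.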
PCT/CN2021/097021 2020-11-23 2021-05-29 Keyword extraction method and related device WO2022105178A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011321892.7 2020-11-23
CN202011321892.7A CN112417101B (en) 2020-11-23 2020-11-23 Keyword extraction method and related device

Publications (1)

Publication Number Publication Date
WO2022105178A1 (en) 2022-05-27

Family

ID=74778716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097021 WO2022105178A1 (en) 2020-11-23 2021-05-29 Keyword extraction method and related device

Country Status (2)

Country Link
CN (1) CN112417101B (en)
WO (1) WO2022105178A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417101B (en) * 2020-11-23 2023-08-18 平安科技(深圳)有限公司 Keyword extraction method and related device

Citations (6)

Publication number Priority date Publication date Assignee Title
CN104346418A (en) * 2013-03-15 2015-02-11 国际商业机器公司 Anonymizing Sensitive Identifying Information Based on Relational Context Across a Group
CN106328147A (en) * 2016-08-31 2017-01-11 中国科学技术大学 Speech recognition method and device
CN108399165A (en) * 2018-03-28 2018-08-14 广东技术师范学院 A kind of keyword abstraction method based on position weighting
US20180342241A1 (en) * 2017-05-25 2018-11-29 Baidu Online Network Technology (Beijing) Co., Ltd. Method and Apparatus of Recognizing Field of Semantic Parsing Information, Device and Readable Medium
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN112417101A (en) * 2020-11-23 2021-02-26 平安科技(深圳)有限公司 Keyword extraction method and related device

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US8509537B2 (en) * 2010-08-05 2013-08-13 Xerox Corporation Learning weights of fonts for typed samples in handwritten keyword spotting
CN108073568B (en) * 2016-11-10 2020-09-11 腾讯科技(深圳)有限公司 Keyword extraction method and device
CN107122413B (en) * 2017-03-31 2020-04-10 北京奇艺世纪科技有限公司 Keyword extraction method and device based on graph model
KR102019194B1 (en) * 2017-11-22 2019-09-06 주식회사 와이즈넛 Core keywords extraction system and method in document
CN110147425B (en) * 2019-05-22 2021-04-06 华泰期货有限公司 Keyword extraction method and device, computer equipment and storage medium
CN110427626B (en) * 2019-07-31 2022-12-09 北京明略软件系统有限公司 Keyword extraction method and device

Also Published As

Publication number Publication date
CN112417101B (en) 2023-08-18
CN112417101A (en) 2021-02-26

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21893333; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21893333; Country of ref document: EP; Kind code of ref document: A1)