CN114492433A - Method for automatically selecting proper keyword combination to extract text - Google Patents
- Publication number
- CN114492433A (application CN202210100206.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- original text
- phrases
- combinations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for automatically selecting suitable keyword combinations for extracting text, comprising the following steps: S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text, and filtering stop words; S2, selecting candidate keywords; and S3, recommending keyword combinations: generating sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text and, using F1-score as the evaluation metric, recommending from those sets the keyword combinations suitable for searching for the target text. Because a machine rather than a person selects the keyword combinations used to search for the target text, the labor and time the task requires are reduced, and the selected keyword combinations are of better quality and more complete.
Description
Technical Field
The invention relates to natural language processing technology in the field of artificial intelligence, and in particular to a method for automatically selecting suitable keyword combinations for extracting text.
Background
Searching for content using keywords and keyword combinations is a common text search method. It is efficient and returns results quickly, but it places high demands on the keywords and keyword combinations the user selects. Whether suitable keywords and keyword combinations can be found is the key to extracting the target text the user wants. When the selected keywords and phrases are of poor quality, the search results contain a large amount of noise (irrelevant text), and the user must filter the results further to obtain the target text, which increases the user's workload.
At present, selecting suitable keywords and keyword combinations for searching text relies mainly on manual summarization. In some tasks, however, the volume of text is large and the information it contains is messy and complicated. Summarizing keyword combinations that exclude as much noise as possible while retaining as much of the target text as possible is difficult and time-consuming. Moreover, the results of this manual work are not reusable: in a new text search task, keywords and keyword combinations must be summarized again for the new search target. The invention therefore provides a method for automatically selecting suitable keyword combinations for searching text, which can greatly reduce the labor and time this work requires.
Disclosure of Invention
To solve the above technical problems, the invention provides the following technical solution:
The invention relates to a method for automatically selecting suitable keyword combinations for extracting text, comprising the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information;
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of each word contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech to obtain the final weight of the word;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight exceeds the threshold as candidate keywords;
and S3, recommending keyword combinations: generating sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text and, using F1-score as the evaluation metric, recommending from those sets the keyword combinations suitable for searching for the target text.
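The TF-IDF weighting in S2.1 can be sketched as follows. This is only an illustration: the text does not say which TF-IDF variant is used, so the sketch assumes the common form tf × log(N / df), and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs: list of token lists (already segmented, stop words removed).
    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Toy corpus: three pieces of data, each a list of words.
docs = [["loan", "rate", "loan"], ["rate", "weather"], ["loan", "bank"]]
weights = tf_idf(docs)
```

A word that occurs in every document receives weight 0 under this variant, which is why smoothed IDF forms are often preferred in practice.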
As a preferred technical solution of the present invention, the specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text, calculating the left-right entropy and the mutual information of each N-gram, setting minimum thresholds for the left-right entropy and the mutual information, and taking the N-grams whose left-right entropy and mutual information both exceed the thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate phrase contains another, the longer phrase is retained and the shorter one is deleted;
and S1.3, further filtering the candidate fixed collocation phrases by part of speech.
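The scoring in S1.1 can be sketched as below. The text does not give formulas, so this sketch makes two labeled assumptions: "left-right entropy" is taken as the smaller of the entropies of the left-neighbor and right-neighbor distributions, and "mutual information" is taken as pointwise mutual information, log(p(ngram) / Π p(token)). The function names and the toy token sequence are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (natural log) of a count distribution; 0.0 if empty."""
    total = sum(counter.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_ngrams(tokens, n):
    """Return {ngram: (min(left entropy, right entropy), pmi)}."""
    unigram = Counter(tokens)
    grams = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        grams[gram] += 1
        if i > 0:                      # token just before this occurrence
            left[gram][tokens[i - 1]] += 1
        if i + n < len(tokens):        # token just after this occurrence
            right[gram][tokens[i + n]] += 1
    total_grams = sum(grams.values())
    scores = {}
    for gram, count in grams.items():
        p_gram = count / total_grams
        p_independent = math.prod(unigram[t] / len(tokens) for t in gram)
        pmi = math.log(p_gram / p_independent)
        scores[gram] = (min(entropy(left[gram]), entropy(right[gram])), pmi)
    return scores

def candidate_phrases(tokens, max_n, min_entropy, min_pmi):
    """N-grams (N = 2..max_n) whose two scores both exceed the thresholds."""
    return {
        gram
        for n in range(2, max_n + 1)
        for gram, (h, pmi) in score_ngrams(tokens, n).items()
        if h > min_entropy and pmi > min_pmi
    }

tokens = "a b c a b d a b e".split()
phrases = candidate_phrases(tokens, max_n=2, min_entropy=0.5, min_pmi=0.5)
```

In this toy sequence only the bigram `("a", "b")` has varied neighbors on both sides and co-occurs more often than chance, so it is the only candidate that survives both thresholds.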
As a preferred technical solution of the present invention, segmenting the original text and filtering stop words in S1 are performed as follows: the fixed collocation phrases obtained are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to a stop word dictionary, and the word segmentation result is further filtered against that stop word dictionary.
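The effect of a user dictionary followed by stop word filtering can be illustrated with a small stand-in segmenter. Note that this is not pkuseg: it is a simple forward maximum-matching segmenter written only so the sketch runs without external packages. With pkuseg itself, one would presumably build the segmenter with `pkuseg.pkuseg(user_dict=...)` and call its `cut` method instead. All names and the sample sentence are hypothetical.

```python
def max_match_cut(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest
    dictionary entry, falling back to a single character.
    Stand-in for a real segmenter such as pkuseg."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def preprocess(text, user_dict, stop_words):
    """Segment with the user dictionary, then drop stop words."""
    return [t for t in max_match_cut(text, user_dict) if t not in stop_words]

# Fixed collocation phrases found earlier go into the user dictionary.
user_dict = {"机器学习", "人工智能", "分支"}
stop_words = {"是", "的"}
tokens = preprocess("机器学习是人工智能的分支", user_dict, stop_words)
```

Because the fixed collocation phrases are in the dictionary, they are kept as single tokens rather than being split into shorter words.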
As a preferred technical solution of the present invention, the specific steps of S3 are as follows:
S3.1, traversing lengths 1 to N to generate the sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text is labeled as a positive sample and data not belonging to the target text is labeled as a negative sample;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score exceeds the threshold as an intermediate result;
S3.4, deduplicating the intermediate result: when the intermediate result contains an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The resulting set of co-occurrence word combinations is the set of keyword combinations selected for searching for the target text.
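The scoring in S3.3 can be sketched as follows. The text does not define how an ordered combination is matched, so this sketch assumes that an unordered combination matches when all of its words occur in a piece of data, and an ordered combination matches when its words occur in the given order (not necessarily adjacent). F1-score is the usual 2PR / (P + R). Function names and the labeled samples are hypothetical.

```python
def matches(tokens, combo, ordered):
    """True if the piece of data is retrieved by the keyword combination."""
    if not ordered:
        return set(combo) <= set(tokens)
    remaining = iter(tokens)
    # Each `in` consumes the iterator, so words must appear in order.
    return all(word in remaining for word in combo)

def f1_score(combo, ordered, labelled):
    """labelled: list of (tokens, is_target) pairs from step S3.2."""
    tp = fp = fn = 0
    for tokens, is_target in labelled:
        hit = matches(tokens, combo, ordered)
        if hit and is_target:
            tp += 1
        elif hit:
            fp += 1
        elif is_target:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labelled = [
    (["apply", "loan", "rate"], True),
    (["loan", "apply", "bank"], True),
    (["rate", "loan", "apply"], False),
    (["weather", "rain"], False),
]
unordered_f1 = f1_score(("apply", "loan"), False, labelled)
ordered_f1 = f1_score(("apply", "loan"), True, labelled)
```

Here the unordered combination retrieves both positive samples plus one negative, while the ordered one retrieves a single positive and no negatives, so the two score differently even though they contain the same words.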
The beneficial effects of the invention are:
1. Compared with manually selected keyword combinations, the keyword combinations selected by the invention are of better quality and more complete. When keyword combinations for searching the target text are selected manually, labor and time costs mean that people generally follow the "satisficing principle": they settle for a combination whose search results are satisfactory rather than exhausting all combinations to find the optimal one. The invention avoids this limitation: by exhaustively enumerating all combinations, it makes it possible to select the optimal solution from among them.
2. The invention keeps the code efficient by improving the code logic, using multithreading, and similar measures, so that the technical solution takes relatively little time to run, thereby reducing its time cost.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is an overall flow diagram of the present invention for automatically selecting appropriate keyword combinations for extracting text.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
As shown in FIG. 1, the method of the present invention for automatically selecting suitable keyword combinations for extracting text includes the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from its statistical information (TF-IDF is a common prior-art weighting technique for information retrieval and data mining; it is often used to mine keywords from an article, is simple and efficient, and is not described further here);
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of each word contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech (for example, when the word is a noun or a verb, the sum is increased by a set proportion, and when it is an adverb or an adjective, the sum is decreased by a set proportion) to obtain the final weight of the word;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight exceeds the threshold as candidate keywords;
and S3, recommending keyword combinations: generating sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text and, using F1-score as the evaluation metric, recommending from those sets the keyword combinations suitable for searching for the target text.
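Steps S2.3 and S2.4 can be sketched as below, taking the TF-IDF weights and LDA topic saliencies as given inputs. The correction proportions (here +20% for nouns and verbs, -20% for adverbs and adjectives), the part-of-speech tag names, the threshold, and all numbers are hypothetical placeholders; the text only says that a "set proportion" is applied.

```python
# Hypothetical correction factors keyed by part-of-speech tag:
# nouns (n) and verbs (v) are boosted, adjectives (a) and adverbs (d) reduced.
POS_FACTOR = {"n": 1.2, "v": 1.2, "a": 0.8, "d": 0.8}

def final_weights(tfidf, saliency, pos_tags):
    """S2.3: (TF-IDF weight + topic saliency) corrected by part of speech."""
    return {
        word: (tfidf[word] + saliency.get(word, 0.0))
        * POS_FACTOR.get(pos_tags.get(word), 1.0)
        for word in tfidf
    }

def select_candidates(weights, threshold):
    """S2.4: sort by weight, high to low, and keep words above the threshold."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, weight in ranked if weight > threshold]

# Made-up inputs for one piece of data.
tfidf = {"loan": 0.30, "apply": 0.20, "quickly": 0.25, "good": 0.10}
saliency = {"loan": 0.40, "apply": 0.10, "quickly": 0.05, "good": 0.10}
pos_tags = {"loan": "n", "apply": "v", "quickly": "d", "good": "a"}

weights = final_weights(tfidf, saliency, pos_tags)
candidates = select_candidates(weights, threshold=0.3)
```

With these numbers, "apply" and "quickly" start from the same uncorrected sum of 0.30, but the part-of-speech correction lifts the verb above the threshold and pushes the adverb below it.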
The specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text (with N set from 1 to 6), calculating the left-right entropy and the mutual information of each N-gram, setting minimum thresholds for the left-right entropy and the mutual information, and taking the N-grams whose left-right entropy and mutual information both exceed the thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate phrase contains another, the longer phrase is retained and the shorter one is deleted;
and S1.3, further filtering the candidate fixed collocation phrases by part of speech. Most fixed collocation phrases are noun phrases and follow certain part-of-speech patterns. After a word segmentation tool is used to obtain the part-of-speech sequence of each candidate fixed collocation phrase, the candidates are filtered against the summarized part-of-speech patterns to obtain the final set of fixed collocation phrases.
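Steps S1.2 and S1.3 can be sketched as follows, treating each candidate phrase as a tuple of words. The allowed part-of-speech patterns and tag names are hypothetical; the text says only that the patterns are summarized from noun-phrase-like collocations.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def drop_contained(phrases):
    """S1.2: keep the longer phrase when one candidate contains another."""
    kept = []
    # Visit longest first so any containing phrase is already in `kept`.
    for phrase in sorted(set(phrases), key=len, reverse=True):
        if not any(contains(longer, phrase) for longer in kept):
            kept.append(phrase)
    return kept

def filter_by_pos(phrases, pos_of, allowed_patterns):
    """S1.3: keep phrases whose part-of-speech sequence is an allowed pattern."""
    return [p for p in phrases
            if tuple(pos_of[word] for word in p) in allowed_patterns]

candidates = [
    ("machine", "learning"),
    ("machine", "learning", "model"),
    ("very", "fast"),
]
pos_of = {"machine": "n", "learning": "n", "model": "n", "very": "d", "fast": "a"}
allowed_patterns = {("n", "n"), ("n", "n", "n"), ("a", "n")}  # hypothetical

deduplicated = drop_contained(candidates)
final_phrases = filter_by_pos(deduplicated, pos_of, allowed_patterns)
```

The two-word phrase is dropped because the three-word phrase contains it, and the adverb-adjective pair is dropped because its tag sequence is not among the allowed patterns.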
In S1, segmenting the original text and filtering stop words are performed as follows: the fixed collocation phrases obtained are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to a stop word dictionary, and the word segmentation result is further filtered against that stop word dictionary.
The specific steps of S3 are as follows:
S3.1, traversing lengths 1 to 6 to generate the sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text is labeled as a positive sample and data not belonging to the target text is labeled as a negative sample;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score exceeds the threshold as an intermediate result;
S3.4, deduplicating the intermediate result: when the intermediate result contains an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The resulting set of co-occurrence word combinations is the set of keyword combinations selected for searching for the target text.
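Steps S3.1 and S3.4 can be sketched as below. Two points are assumptions, since the text leaves them open: ordered combinations are generated here as all permutations of each unordered combination, and when several ordered variants of the same word set survive, only the best-scoring entry per word set is kept. The F1 values in the example are made up.

```python
from itertools import combinations, permutations

def co_occurrence_sets(keywords, max_len):
    """S3.1: unordered and ordered combinations of length 1..max_len."""
    unordered, ordered = set(), set()
    vocabulary = sorted(set(keywords))
    for size in range(1, max_len + 1):
        for combo in combinations(vocabulary, size):
            unordered.add(combo)
            if size > 1:  # order is meaningless for a single word
                ordered.update(permutations(combo))
    return unordered, ordered

def deduplicate(scored):
    """S3.4: per word set keep the higher F1; on a tie prefer unordered.

    scored: list of (combo, is_ordered, f1) tuples above the F1 threshold.
    """
    best = {}
    for combo, is_ordered, f1 in scored:
        key = frozenset(combo)
        current = best.get(key)
        if (current is None
                or f1 > current[2]
                or (f1 == current[2] and current[1] and not is_ordered)):
            best[key] = (combo, is_ordered, f1)
    return list(best.values())

unordered, ordered = co_occurrence_sets(["loan", "apply", "rate"], max_len=2)

intermediate = [
    (("apply", "loan"), False, 0.80),
    (("apply", "loan"), True, 0.67),
    (("bank", "loan"), True, 0.90),
    (("bank", "loan"), False, 0.90),
]
recommended = deduplicate(intermediate)
```

The number of ordered combinations grows factorially with length, which is one reason an implementation would need the efficiency measures the description mentions when traversing lengths up to 6.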
Example 2
As shown in FIG. 1, the invention is used as follows: a text set T is input; the fixed collocation phrases in T are obtained from the left-right entropy and the mutual information and added to the pkuseg user dictionary; the text set is segmented with pkuseg; and stop words are removed to obtain a high-quality word segmentation result.
An LDA model is trained on the text set T, and the trained model is used to calculate the topic saliency of the words in the text of each category. The final topic weight of the words in the text of each category i is calculated by combining the topic saliency with TF-IDF and the part of speech of each word. After the topic weights are sorted, the words whose topic weight is higher than a threshold v are taken as the candidate keywords of the i-th category of text.
The candidate keywords of the i-th category of text are traversed to generate the unordered and ordered keyword combinations of that category, and these combinations are used to search the manually labeled positive and negative examples. An F1-score threshold f is set, and the keyword combinations whose F1-score exceeds f are obtained. These keyword combinations are then deduplicated (if the F1-score of an ordered keyword combination is higher than that of the unordered keyword combination containing the same words, the ordered combination is kept; otherwise the unordered combination is kept) to obtain the final keyword combination recommendations.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (4)
1. A method for automatically selecting suitable keyword combinations for extracting text, characterized by comprising the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information;
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of each word contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech to obtain the final weight of the word;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight exceeds the threshold as candidate keywords;
and S3, recommending keyword combinations: generating sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text and, using F1-score as the evaluation metric, recommending from those sets the keyword combinations suitable for searching for the target text.
2. The method of claim 1, wherein the specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text, calculating the left-right entropy and the mutual information of each N-gram, setting minimum thresholds for the left-right entropy and the mutual information, and taking the N-grams whose left-right entropy and mutual information both exceed the thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate phrase contains another, the longer phrase is retained and the shorter one is deleted;
and S1.3, further filtering the candidate fixed collocation phrases by part of speech.
3. The method of claim 1, wherein segmenting the original text and filtering stop words in S1 are performed by adding the fixed collocation phrases obtained to a user dictionary, segmenting the original text with the pkuseg word segmentation tool to obtain a word segmentation result, collecting common stop words and adding them to a stop word dictionary, and further filtering the word segmentation result against the stop word dictionary.
4. The method of claim 1, wherein the specific steps of S3 are as follows:
S3.1, traversing lengths 1 to N to generate the sets of unordered and ordered co-occurrence word combinations from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text is labeled as a positive sample and data not belonging to the target text is labeled as a negative sample;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score exceeds the threshold as an intermediate result;
S3.4, deduplicating the intermediate result: when the intermediate result contains an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted; the resulting set of co-occurrence word combinations is the set of keyword combinations selected for searching for the target text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210100206.6A CN114492433A (en) | 2022-01-27 | 2022-01-27 | Method for automatically selecting proper keyword combination to extract text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114492433A (en) | 2022-05-13 |
Family
ID=81475903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210100206.6A Pending CN114492433A (en) | 2022-01-27 | 2022-01-27 | Method for automatically selecting proper keyword combination to extract text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114492433A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104063387A (en) * | 2013-03-19 | 2014-09-24 | 三星电子(中国)研发中心 | Device and method abstracting keywords in text |
CN108052500A (en) * | 2017-12-13 | 2018-05-18 | 北京数洋智慧科技有限公司 | A kind of text key message extracting method and device based on semantic analysis |
CN108920456A (en) * | 2018-06-13 | 2018-11-30 | 北京信息科技大学 | A kind of keyword Automatic method |
CN110807326A (en) * | 2019-10-24 | 2020-02-18 | 江汉大学 | Short text keyword extraction method combining GPU-DMM and text features |
US20200081977A1 (en) * | 2017-10-20 | 2020-03-12 | Tencent Technology (Shenzhen) Company Limited | Keyword extraction method and apparatus, storage medium, and electronic apparatus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
CN110059311B (en) | Judicial text data-oriented keyword extraction method and system | |
CN109710947B (en) | Electric power professional word bank generation method and device | |
CN109299480B (en) | Context-based term translation method and device | |
CN106776574B (en) | User comment text mining method and device | |
CN111177365A (en) | Unsupervised automatic abstract extraction method based on graph model | |
CN112395395B (en) | Text keyword extraction method, device, equipment and storage medium | |
WO2024131111A1 (en) | Intelligent writing method and apparatus, device, and nonvolatile readable storage medium | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN110263154A (en) | A kind of network public-opinion emotion situation quantization method, system and storage medium | |
CN107526841A (en) | A kind of Tibetan language text summarization generation method based on Web | |
CN109062895A (en) | A kind of intelligent semantic processing method | |
CN107688630A (en) | A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme | |
CN111090994A (en) | Chinese-internet-forum-text-oriented event place attribution province identification method | |
CN114266256A (en) | Method and system for extracting new words in field | |
CN117251524A (en) | Short text classification method based on multi-strategy fusion | |
CN111563372A (en) | Typesetting document content self-duplication checking method based on teaching book publishing | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN113032550B (en) | Viewpoint abstract evaluation system based on pre-training language model | |
CN108595413B (en) | Answer extraction method based on semantic dependency tree | |
CN112632969B (en) | Incremental industry dictionary updating method and system | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN109002540B (en) | Method for automatically generating Chinese announcement document question answer pairs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||