CN114492433A - Method for automatically selecting proper keyword combination to extract text - Google Patents

Publication number
CN114492433A
Authority
CN
China
Prior art keywords
word
text
original text
phrases
combinations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210100206.6A
Other languages
Chinese (zh)
Inventor
王栋平
李颜戎
杨学鑫
刘秀美
周晶
钱柏丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202210100206.6A
Publication of CN114492433A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for automatically selecting suitable keyword combinations to extract text, comprising the following steps: S1, preprocessing the original text to obtain a high-quality word segmentation result, wherein the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words; S2, selecting candidate keywords; and S3, recommending keyword combinations: sets of unordered and ordered co-occurrence word combinations are generated from the candidate keywords contained in each piece of data in the original text, and suitable keyword combinations for searching the target text are recommended from these sets with F1-score as the evaluation index. By using a machine instead of manual work to select the keyword combinations used to search for the target text, the method effectively reduces the labor and time costs of the task, and the selected keyword combinations are of better quality and more complete in content.

Description

Method for automatically selecting proper keyword combination to extract text
Technical Field
The invention relates to a natural language processing technology in the field of artificial intelligence, in particular to a method for automatically selecting a proper keyword combination to extract a text.
Background
Searching for content with keywords and keyword combinations is a common text search method. It is efficient and returns results to the user quickly, but it places high demands on the keywords and keyword combinations the user selects. Whether suitable keywords and keyword combinations can be found is the key to extracting target text that satisfies the user. When the selected keywords and keyword combinations are of poor quality, the search results contain a large amount of irrelevant text, and the user must filter the results further to obtain the target text, which increases the user's workload.
At present, selecting suitable keywords and keyword combinations for searching text relies mainly on manual summarization. In some tasks, however, the volume of text data is large and the information it contains is messy and complicated. Summarizing keyword combinations that exclude as much irrelevant text as possible while retaining as much of the target text as possible is difficult and time-consuming. Moreover, the results of this manual work are not reusable: in a new text search task, keywords and keyword combinations must be summarized again for the new search target. The invention therefore provides a method for automatically selecting suitable keyword combinations for searching text, which can greatly reduce the labor and time this work requires.
Disclosure of Invention
To solve the above technical problems, the invention provides the following technical solution:
The invention relates to a method for automatically selecting suitable keyword combinations to extract text, comprising the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, wherein the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information;
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of the words contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech to obtain the word's final weight;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight is greater than the threshold as candidate keywords;
S3, recommending keyword combinations: sets of unordered and ordered co-occurrence word combinations are generated from the candidate keywords contained in each piece of data in the original text, and suitable keyword combinations for searching the target text are recommended from these sets with F1-score as the evaluation index.
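The quantities used in S2.3, S2.4 and S3 can be written compactly as follows. The notation is illustrative and does not appear in the original text: the part-of-speech correction is assumed here to be a multiplicative factor $c_{\mathrm{pos}}(w)$, $\theta$ stands for the minimum weight threshold, and $TP$, $FP$ and $FN$ count the labeled positive samples a combination retrieves, the negative samples it retrieves, and the positive samples it misses. The F1-score is given in its standard form.

```latex
\begin{align}
  W(w) &= c_{\mathrm{pos}}(w)\,\bigl(\mathrm{tfidf}(w) + \mathrm{sal}(w)\bigr), \\
  K &= \{\, w : W(w) > \theta \,\}, \\
  P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad
  F_1 = \frac{2PR}{P + R}.
\end{align}
```

Here $K$ is the set of candidate keywords, and $F_1$ is computed separately for every unordered and ordered co-occurrence word combination.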
As a preferred technical solution of the present invention, the specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text, calculating the left and right entropy and the mutual information of each N-gram word string, setting minimum thresholds for the left and right entropy and for the mutual information, and taking the N-gram word strings whose left and right entropy and mutual information all exceed the set thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate fixed collocation phrase contains another, the longer phrase is retained and the shorter one is deleted;
S1.3, further filtering the candidate fixed collocation phrases by part of speech.
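The containment rule in S1.2 can be sketched as below. This is a minimal illustration, not the patent's implementation: it assumes candidate phrases are plain strings and that "inclusion" means one phrase is a substring of another; the function name `drop_contained_phrases` is hypothetical.

```python
def drop_contained_phrases(phrases):
    """Keep only the phrases that are not contained in a longer kept phrase."""
    kept = []
    # Visit longer phrases first, so every shorter phrase is checked
    # against all longer candidates that might contain it.
    for phrase in sorted(set(phrases), key=lambda p: (-len(p), p)):
        if not any(phrase in longer for longer in kept):
            kept.append(phrase)
    return kept


candidates = [
    "natural language",
    "language",
    "natural language processing",
    "keyword",
]
print(drop_contained_phrases(candidates))
```

Because substring containment is transitive, comparing each phrase only against the phrases already kept is enough.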
As a preferred technical solution of the present invention, the word segmentation of the original text and the stop-word filtering in S1 are performed as follows: the fixed collocation phrases obtained are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to a stop-word dictionary, and the word segmentation result is filtered against that dictionary.
As a preferred technical solution of the present invention, the specific steps of S3 are as follows:
S3.1, generating, by traversal, sets of unordered and ordered co-occurrence word combinations of length 1 to N from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text are labeled as positive samples and data not belonging to the target text are labeled as negative samples;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score is greater than the threshold as intermediate results;
S3.4, deduplicating the intermediate results: when the intermediate results contain an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The co-occurrence word combinations that remain are the selected keyword combinations suitable for searching the target text.
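The deduplication rule of S3.4 might be sketched as follows. The data layout is an assumption made for illustration: each intermediate result is a hypothetical `(words, ordered, f1)` tuple, and combinations are grouped by the set of words they contain. Where several ordered variants of the same words exist, the sketch compares each of them with the unordered combination, which is one possible reading of the rule.

```python
from collections import defaultdict


def dedupe_combinations(results):
    """results: iterable of (words, ordered, f1); words is a tuple of str."""
    groups = defaultdict(list)
    for words, ordered, f1 in results:
        groups[frozenset(words)].append((words, ordered, f1))

    kept = []
    for group in groups.values():
        unordered = [r for r in group if not r[1]]
        ordered = [r for r in group if r[1]]
        if not unordered:
            # Nothing to compare against: keep the ordered combinations.
            kept.extend(ordered)
            continue
        base = max(unordered, key=lambda r: r[2])
        # An ordered combination survives only if it is strictly better;
        # on a tie the unordered combination is retained instead.
        better = [r for r in ordered if r[2] > base[2]]
        kept.extend(better if better else [base])
    return kept


intermediate = [
    (("a", "b"), False, 0.8),
    (("a", "b"), True, 0.8),
    (("b", "a"), True, 0.9),
    (("c",), False, 0.7),
]
print(dedupe_combinations(intermediate))
```

The strict comparison `>` is what implements the tie-break in favor of the unordered combination.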
The beneficial effects of the invention are:
1. Compared with manually selected keyword combinations, the keyword combinations selected by the invention are of better quality and more complete in content. When keyword combinations for searching the target text are selected manually, labor and time costs mean that people generally follow the "satisficing principle": they stop once a keyword combination with a reasonably satisfactory search effect has been found, without exhausting all combinations to pick the optimal one. The invention uses a machine to exhaust all combinations, which makes it possible to select the optimal solution from among them, so the selected keyword combinations are of better quality and more complete in content.
2. The invention keeps the code efficient by improving the code logic, using multithreading, and similar measures, so the technical solution takes relatively little time to run, which reduces its time cost.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is an overall flow diagram of the present invention for automatically selecting appropriate keyword combinations for extracting text.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
As shown in fig. 1, the method of the present invention for automatically selecting suitable keyword combinations to extract text includes the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, wherein the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information (TF-IDF is a common prior-art weighting technique for information retrieval and data mining, often used to mine the keywords of an article; it is simple and efficient and is not described further here);
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of the words contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech (for example, when the word is a noun or a verb, the sum is increased by a set proportion, and when the word is an adverb or an adjective, the sum is decreased by a set proportion) to obtain the word's final weight;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight is greater than the threshold as candidate keywords;
S3, recommending keyword combinations: sets of unordered and ordered co-occurrence word combinations are generated from the candidate keywords contained in each piece of data in the original text, and suitable keyword combinations for searching the target text are recommended from these sets with F1-score as the evaluation index.
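Steps S2.3 and S2.4 can be illustrated with a short sketch. It assumes the TF-IDF weights and LDA topic saliencies have already been computed and are supplied as dictionaries; the function name, the part-of-speech tags (`n`, `v`, `d`, `a`), the correction factors and the threshold are all hypothetical values chosen for the example, and the correction is assumed to be multiplicative.

```python
def select_candidate_keywords(tfidf, saliency, pos_tags, pos_factor, threshold):
    """Combine TF-IDF and topic saliency, correct by part of speech, and threshold."""
    weights = {}
    for word, tfidf_weight in tfidf.items():
        total = tfidf_weight + saliency.get(word, 0.0)
        # Words with an unknown part of speech keep their uncorrected sum.
        factor = pos_factor.get(pos_tags.get(word), 1.0)
        weights[word] = total * factor
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, weight) for word, weight in ranked if weight > threshold]


# Toy inputs (assumed values, for illustration only).
tfidf = {"network": 0.4, "quickly": 0.3, "deploy": 0.2}
saliency = {"network": 0.5, "quickly": 0.1, "deploy": 0.3}
pos_tags = {"network": "n", "quickly": "d", "deploy": "v"}
pos_factor = {"n": 1.2, "v": 1.2, "d": 0.8, "a": 0.8}

for word, weight in select_candidate_keywords(tfidf, saliency, pos_tags, pos_factor, 0.5):
    print(word, round(weight, 3))
```

With these toy values the noun and the verb are boosted above the threshold, while the adverb is pushed below it.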
The specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text (N is set to 1 to 6), calculating the left and right entropy and the mutual information of each N-gram word string, setting minimum thresholds for the left and right entropy and for the mutual information, and taking the N-gram word strings whose left and right entropy and mutual information all exceed the set thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate fixed collocation phrase contains another, the longer phrase is retained and the shorter one is deleted;
S1.3, further filtering the candidate fixed collocation phrases by part of speech. Most fixed collocation phrases are noun phrases and conform to certain part-of-speech patterns. After the part-of-speech combination of each candidate fixed collocation phrase is obtained with the word segmentation tool, the candidates are filtered against the summarized part-of-speech patterns to obtain the final set of fixed collocation phrases.
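The statistics in S1.1 can be sketched as follows. The text does not give formulas, so several choices here are assumptions: entropy is the Shannon entropy of the units seen immediately to the left or right of an n-gram, mutual information is taken as the pointwise mutual information of the weakest binary split of the n-gram, and the input is a list of already-split unit sequences (characters or preliminary tokens). The sketch covers n >= 2, since mutual information between parts is undefined for a single unit.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbours):
    """Shannon entropy (natural log) of a Counter of neighbouring units."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def score_ngrams(sequences, n):
    """Return {ngram: (left_entropy, right_entropy, pmi)} for n >= 2."""
    counts = Counter()  # counts of all k-grams, 1 <= k <= n
    left, right = defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        for k in range(1, n + 1):
            for i in range(len(seq) - k + 1):
                gram = tuple(seq[i:i + k])
                counts[gram] += 1
                if k == n:
                    if i > 0:
                        left[gram][seq[i - 1]] += 1
                    if i + k < len(seq):
                        right[gram][seq[i + k]] += 1

    totals = Counter()  # number of k-gram occurrences for each k
    for gram, c in counts.items():
        totals[len(gram)] += c

    def prob(gram):
        return counts[gram] / totals[len(gram)]

    scores = {}
    for gram in [g for g in counts if len(g) == n]:
        # Weakest binary split: a phrase is only as cohesive as its loosest join.
        pmi = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, n)
        )
        scores[gram] = (entropy(left[gram]), entropy(right[gram]), pmi)
    return scores


corpus = [["a", "b", "c"], ["a", "b", "d"], ["e", "b", "c"]]
for gram, (left_h, right_h, pmi) in score_ngrams(corpus, 2).items():
    print(gram, round(left_h, 3), round(right_h, 3), round(pmi, 3))
```

Candidate phrases would then be the n-grams whose two entropies and mutual information all exceed the chosen minimum thresholds.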
In S1, the word segmentation of the original text and the stop-word filtering are performed as follows: the fixed collocation phrases obtained are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to a stop-word dictionary, and the word segmentation result is filtered against that dictionary.
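The filtering half of this step can be sketched independently of the segmenter. In the sketch below the segmenter is passed in as a callable so that the example runs without extra dependencies; in the setting described above it would be a pkuseg segmenter loaded with the user dictionary of fixed collocation phrases. The function names and the toy stop words are hypothetical.

```python
def build_stop_words(common, supplemental=()):
    """Merge a common stop-word list with task-specific additions."""
    return {w.strip() for w in (*common, *supplemental) if w.strip()}


def segment_and_filter(texts, segment, stop_words):
    """Segment each text and drop stop words and whitespace-only tokens."""
    results = []
    for text in texts:
        tokens = [t for t in segment(text) if t.strip() and t not in stop_words]
        results.append(tokens)
    return results


stop_words = build_stop_words(["the", "of"], supplemental=["a"])
# str.split stands in for the real word segmenter in this toy example.
print(segment_and_filter(["the cost of a search"], str.split, stop_words))
```

Keeping the stop words in a set makes each membership check constant time, which matters when the text volume is large.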
The specific steps of S3 are as follows:
S3.1, generating, by traversal, sets of unordered and ordered co-occurrence word combinations of length 1 to 6 from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text are labeled as positive samples and data not belonging to the target text are labeled as negative samples;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score is greater than the threshold as intermediate results;
S3.4, deduplicating the intermediate results: when the intermediate results contain an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The co-occurrence word combinations that remain are the selected keyword combinations suitable for searching the target text.
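Steps S3.1 to S3.3 can be sketched as below. The matching semantics are assumptions, since the text does not define them: an unordered combination matches a sample when all of its words occur in it, and an ordered combination matches when its words occur in the given order, not necessarily adjacent. Ordered combinations are generated here as permutations, which grows quickly with the combination length; the function names and toy data are hypothetical.

```python
from itertools import combinations, permutations


def matches(tokens, combo, ordered):
    """Does a segmented sample match a keyword combination?"""
    if not ordered:
        return set(combo) <= set(tokens)
    remaining = iter(tokens)
    # Subsequence test: each word must appear after the previous one.
    return all(word in remaining for word in combo)


def f1_score(combo, ordered, labelled):
    """labelled: list of (tokens, is_target) pairs."""
    tp = fp = fn = 0
    for tokens, is_target in labelled:
        hit = matches(tokens, combo, ordered)
        if hit and is_target:
            tp += 1
        elif hit:
            fp += 1
        elif is_target:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def recommend(candidates_per_sample, labelled, max_len, min_f1):
    """Score every combination and keep those above the F1 threshold."""
    combos = set()
    for candidates in candidates_per_sample:
        unique = sorted(set(candidates))
        for n in range(1, max_len + 1):
            combos.update((c, False) for c in combinations(unique, n))
            if n > 1:  # order is meaningless for a single word
                combos.update((p, True) for p in permutations(unique, n))
    scored = ((c, o, f1_score(c, o, labelled)) for c, o in combos)
    return sorted(
        (r for r in scored if r[2] > min_f1),
        key=lambda r: (-r[2], r[1], r[0]),
    )


labelled = [
    (["reset", "router"], True),
    (["reset", "the", "router"], True),
    (["router", "reset"], False),
    (["buy", "router"], False),
]
for combo, ordered, f1 in recommend([["reset", "router"]], labelled, 2, 0.7):
    print(combo, "ordered" if ordered else "unordered", round(f1, 3))
```

In this toy data the ordered combination separates the positive and negative samples better than the unordered one, which is the situation the ordered sets are meant to capture.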
Example 2
As shown in fig. 1, the invention is used as follows. A text set T is input. Fixed collocation phrases in T are obtained from the left and right entropy and the mutual information and added to the pkuseg user dictionary; the texts in T are then segmented with pkuseg and stop words are removed, giving a high-quality word segmentation result.
An LDA model is trained on the text set T, and the trained model is used to calculate the topic saliency of the words in the text of each category. The final topic weight of the words in the text of each category i is calculated by combining the topic saliency with TF-IDF and part of speech. After the topic weights are sorted, the words whose topic weight is higher than a threshold v are taken as the candidate keywords of the category-i text.
The candidate keywords of the category-i text are traversed to generate the unordered and ordered keyword combinations for that category, and these combinations are applied to the manually labeled positive and negative examples. An F1-score threshold F is set, and the keyword combinations whose F1-score is greater than F are obtained. These combinations are then deduplicated (if the F1-score of an ordered keyword combination is higher than that of the unordered keyword combination containing the same words, the ordered combination is kept; otherwise the unordered combination is kept) to give the final keyword combination recommendation.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A method for automatically selecting suitable keyword combinations to extract text, characterized by comprising the following steps:
S1, preprocessing the original text to obtain a high-quality word segmentation result, wherein the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words;
S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:
S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information;
S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of the words contained in each piece of data in the original text;
S2.3, adding each word's TF-IDF weight to its topic saliency from the trained LDA model, and correcting the sum according to the word's part of speech to obtain the word's final weight;
S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight is greater than the threshold as candidate keywords;
S3, recommending keyword combinations: sets of unordered and ordered co-occurrence word combinations are generated from the candidate keywords contained in each piece of data in the original text, and suitable keyword combinations for searching the target text are recommended from these sets with F1-score as the evaluation index.
2. The method for automatically selecting suitable keyword combinations to extract text according to claim 1, wherein the specific steps of finding fixed collocation phrases in S1 are as follows:
S1.1, generating all N-gram word strings from the original text, calculating the left and right entropy and the mutual information of each N-gram word string, setting minimum thresholds for the left and right entropy and for the mutual information, and taking the N-gram word strings whose left and right entropy and mutual information all exceed the set thresholds as candidate fixed collocation phrases;
S1.2, deduplicating the candidate fixed collocation phrases: when one candidate fixed collocation phrase contains another, the longer phrase is retained and the shorter one is deleted;
S1.3, further filtering the candidate fixed collocation phrases by part of speech.
3. The method for automatically selecting suitable keyword combinations to extract text according to claim 1, wherein the word segmentation of the original text and the stop-word filtering in S1 are performed as follows: the fixed collocation phrases obtained are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to a stop-word dictionary, and the word segmentation result is filtered against that dictionary.
4. The method for automatically selecting suitable keyword combinations to extract text according to claim 1, wherein the specific steps of S3 are as follows:
S3.1, generating, by traversal, sets of unordered and ordered co-occurrence word combinations of length 1 to N from the candidate keywords contained in each piece of data in the original text;
S3.2, extracting part of the data from the original text for labeling, where data belonging to the target text are labeled as positive samples and data not belonging to the target text are labeled as negative samples;
S3.3, searching the labeled text with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score is greater than the threshold as intermediate results;
S3.4, deduplicating the intermediate results: when the intermediate results contain an unordered and an ordered co-occurrence word combination made up of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The co-occurrence word combinations that remain are the selected keyword combinations suitable for searching the target text.
CN202210100206.6A 2022-01-27 2022-01-27 Method for automatically selecting proper keyword combination to extract text Pending CN114492433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210100206.6A CN114492433A (en) 2022-01-27 2022-01-27 Method for automatically selecting proper keyword combination to extract text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210100206.6A CN114492433A (en) 2022-01-27 2022-01-27 Method for automatically selecting proper keyword combination to extract text

Publications (1)

Publication Number Publication Date
CN114492433A (en) 2022-05-13

Family

ID=81475903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210100206.6A Pending CN114492433A (en) 2022-01-27 2022-01-27 Method for automatically selecting proper keyword combination to extract text

Country Status (1)

Country Link
CN (1) CN114492433A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063387A (en) * 2013-03-19 2014-09-24 三星电子(中国)研发中心 Device and method abstracting keywords in text
CN108052500A (en) * 2017-12-13 2018-05-18 北京数洋智慧科技有限公司 A kind of text key message extracting method and device based on semantic analysis
CN108920456A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of keyword Automatic method
CN110807326A (en) * 2019-10-24 2020-02-18 江汉大学 Short text keyword extraction method combining GPU-DMM and text features
US20200081977A1 (en) * 2017-10-20 2020-03-12 Tencent Technology (Shenzhen) Company Limited Keyword extraction method and apparatus, storage medium, and electronic apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination