CN114492433A

CN114492433A - Method for automatically selecting proper keyword combination to extract text

Info

Publication number: CN114492433A
Application number: CN202210100206.6A
Authority: CN
Inventors: 王栋平; 李颜戎; 杨学鑫; 刘秀美; 周晶; 钱柏丞
Original assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Current assignee: Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date: 2022-01-27
Filing date: 2022-01-27
Publication date: 2022-05-13

Abstract

The invention discloses a method for automatically selecting a proper keyword combination to extract text, comprising the following steps: S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words; S2, selecting candidate keywords; and S3, recommending keyword combinations: unordered and ordered co-occurrence word combination sets are generated from the candidate keywords contained in each piece of data in the original text, and proper keyword combinations for searching the target text are recommended from these sets, using F1-score as the evaluation index. Because a machine replaces manual work in selecting the keyword combinations used to search the target text, the labor and time costs of the task are effectively reduced, and the selected keyword combinations are of better quality and more complete.

Description

Method for automatically selecting proper keyword combination to extract text

Technical Field

The invention relates to a natural language processing technology in the field of artificial intelligence, in particular to a method for automatically selecting a proper keyword combination to extract a text.

Background

Searching for content using keywords and keyword combinations is a common text search method. The method is efficient and returns search results to the user quickly, but it places high demands on the keywords and keyword combinations the user selects. Whether suitable keywords and keyword combinations can be found is the key to extracting the target text that satisfies the user. When the selected keywords and phrases are of poor quality, the search results contain a large number of irrelevant items, and the user must filter the results further to obtain the target text, which increases the user's workload.

At present, selecting suitable keywords and keyword combinations for searching text relies mainly on manual summarization. In some tasks, however, the volume of text data is large and the information it contains is messy and complicated. Summarizing keyword combinations that exclude irrelevant results as far as possible while retaining as much of the target text as possible is difficult and time-consuming. Moreover, the results of this manual work are not reusable: in a new text search task, keywords and keyword combinations must be summarized again for the new search target. The invention therefore provides a method for automatically selecting proper keyword combinations for searching text, which can greatly reduce the labor and time cost of this work.

Disclosure of Invention

In order to solve the technical problems, the invention provides the following technical scheme:

The invention relates to a method for automatically selecting a proper keyword combination to extract text, comprising the following steps:

S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words;

S2, selecting candidate keywords: words with strong topic characteristics are selected from the word segmentation result of S1 as candidate keywords, with the following specific steps:

S2.1, using the TF-IDF algorithm to assign each word contained in each piece of data in the original text a weight calculated from the word's statistical information;

S2.2, training an LDA model on the original text, and using the trained LDA model to calculate the topic saliency of the words contained in each piece of data in the original text;

S2.3, for each word, adding the weight calculated by the TF-IDF algorithm to the topic saliency calculated by the trained LDA model, and correcting the sum according to the word's part of speech to obtain the final weight of the word;

S2.4, sorting the words contained in each piece of data in the original text by weight from high to low, setting a minimum weight threshold, and taking the words whose weight is greater than the threshold as candidate keywords;

S3, recommending keyword combinations: unordered and ordered co-occurrence word combination sets are generated from the candidate keywords contained in each piece of data in the original text, and proper keyword combinations for searching the target text are recommended from these sets, using F1-score as the evaluation index.
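The weighting in steps S2.1, S2.3 and S2.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the TF-IDF variant (relative term frequency times log inverse document frequency), the multiplicative `POS_FACTOR` values, the tag letters, and the toy `saliency` numbers that stand in for the LDA output of S2.2 are all hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """S2.1: return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        if not doc:                       # empty document: nothing to weight
            weights.append({})
            continue
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical correction factors: boost nouns (n) and verbs (v),
# damp adverbs (d) and adjectives (a). The patent only says "a set proportion".
POS_FACTOR = {"n": 1.2, "v": 1.2, "d": 0.8, "a": 0.8}

def candidate_keywords(tfidf, saliency, pos_tags, threshold):
    """S2.3 and S2.4: sum both scores, correct by part of speech, sort, threshold."""
    final = {
        word: (score + saliency.get(word, 0.0))
        * POS_FACTOR.get(pos_tags.get(word), 1.0)
        for word, score in tfidf.items()
    }
    ranked = sorted(final.items(), key=lambda item: item[1], reverse=True)
    return [(word, weight) for word, weight in ranked if weight > threshold]

# Toy data; the saliency values are placeholders for a trained LDA model's output.
docs = [["network", "fault", "report"], ["network", "upgrade", "quickly"]]
tfidf = tfidf_weights(docs)
saliency = {"fault": 0.5, "report": 0.1, "network": 0.05}
pos_tags = {"network": "n", "fault": "n", "report": "n"}
keywords = candidate_keywords(tfidf[0], saliency, pos_tags, threshold=0.3)
print(keywords)
```

In this toy run "network" appears in every document, so its TF-IDF weight is zero and it falls below the threshold even after the noun boost.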

As a preferred technical solution of the present invention, the specific steps of finding a fixed collocation phrase in S1 are as follows:

S1.1, generating all N-gram word strings from the original text, calculating the left-right entropy and the mutual information of each N-gram word string, setting minimum thresholds for the left-right entropy and the mutual information, and taking the N-gram word strings whose left-right entropy and mutual information are both greater than the set thresholds as candidate fixed collocation phrases;

S1.2, deduplicating the candidate fixed collocation phrases: when one candidate fixed collocation phrase contains another, the longer phrase is retained and the shorter phrase is deleted;

S1.3, further filtering the candidate fixed collocation phrases based on part of speech.
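Steps S1.1 and S1.2 can be sketched in Python as follows. The patent does not give formulas, so this sketch assumes common definitions: mutual information is the minimum pointwise mutual information over all two-way splits of the N-gram, and left-right entropy is the smaller of the entropies of the left-neighbour and right-neighbour distributions. The function names and threshold values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def ngram_stats(tokens, max_n):
    """Count every n-gram (n = 1..max_n) together with its left/right neighbours."""
    counts, left, right = Counter(), defaultdict(Counter), defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            if i > 0:
                left[gram][tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[gram][tokens[i + n]] += 1
    return counts, left, right

def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-count distribution."""
    total = sum(neighbours.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())

def candidate_phrases(tokens, max_n=6, min_entropy=0.5, min_pmi=1.0):
    """S1.1: keep n-grams (n >= 2) whose PMI and left-right entropy both pass."""
    counts, left, right = ngram_stats(tokens, max_n)
    totals = Counter()
    for gram, count in counts.items():
        totals[len(gram)] += count

    def prob(gram):
        return counts[gram] / totals[len(gram)]

    result = {}
    for gram in counts:
        if len(gram) < 2:
            continue
        pmi = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        lr_entropy = min(entropy(left[gram]), entropy(right[gram]))
        if pmi > min_pmi and lr_entropy > min_entropy:
            result[gram] = (pmi, lr_entropy)
    return result

def drop_contained(phrases):
    """S1.2: drop a phrase when a longer kept phrase contains it contiguously."""
    kept = []
    for gram in sorted(phrases, key=len, reverse=True):
        contained = any(
            gram == big[i:i + len(gram)]
            for big in kept
            for i in range(len(big) - len(gram) + 1)
        )
        if not contained:
            kept.append(gram)
    return kept

tokens = ("we use machine learning today they like machine learning models "
          "i teach machine learning here").split()
phrases = candidate_phrases(tokens)
print(drop_contained(phrases))
```

In the toy token stream only "machine learning" recurs with varied neighbours on both sides, so it is the only N-gram that clears both thresholds.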

As a preferred technical solution of the present invention, in S1 the original text is segmented and stop words are filtered as follows: the obtained fixed collocation phrases are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to the stop word dictionary, and the word segmentation result is further filtered based on the stop word dictionary.
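A minimal Python sketch of the segment-then-filter step. To stay runnable without pkuseg installed, it takes the segmenter as a parameter and uses a whitespace splitter as a stand-in. The helper names, the stop word set, and the `phrases.txt` path in the comment are hypothetical.

```python
def segment_and_filter(texts, cut, stop_words):
    """Segment each text with `cut`, then drop stop words and blank tokens."""
    return [
        [token for token in cut(text) if token.strip() and token not in stop_words]
        for text in texts
    ]

def whitespace_cut(text):
    """Stand-in segmenter for this sketch; real Chinese text needs a real segmenter."""
    return text.split()

# With pkuseg installed, the segmenter could instead be built along these lines
# (an assumption about usage; "phrases.txt" is a hypothetical user dictionary
# file holding the fixed collocation phrases found earlier, one per line):
#   import pkuseg
#   cut = pkuseg.pkuseg(user_dict="phrases.txt").cut

stop_words = {"the", "was", "a", "of"}
segmented = segment_and_filter(
    ["the network fault was reported"], whitespace_cut, stop_words
)
print(segmented)
```

Passing the segmenter in as a callable keeps the stop-word filtering independent of the segmentation tool, so the same filter works whichever tool produces the tokens.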

As a preferred technical solution of the present invention, the specific steps of S3 are as follows:

S3.1, generating unordered and ordered co-occurrence word combination sets, with combination lengths traversed from 1 to N, from the candidate keywords contained in each piece of data in the original text;

S3.2, extracting part of the data from the original text for labeling, labeling the data that belongs to the target text as positive samples and the data that does not belong to the target text as negative samples;

S3.3, searching the labeled texts with the unordered and ordered co-occurrence word combinations, calculating the F1-score of each combination from the search results, setting a minimum F1-score threshold, and taking the co-occurrence word combinations whose F1-score is greater than the set threshold as an intermediate result;

S3.4, deduplicating the intermediate result: when the intermediate result contains an unordered co-occurrence word combination and an ordered co-occurrence word combination that consist of exactly the same words, the combination with the larger F1-score is retained and the one with the smaller F1-score is deleted; when the F1-scores are equal, the unordered combination is retained and the ordered combination is deleted. The co-occurrence word combinations that remain are the selected proper keyword combinations for searching the target text.
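Steps S3.1 to S3.4 can be sketched in Python as follows. The sketch rests on assumptions the patent leaves open: an unordered combination matches a text when all of its words occur, an ordered combination matches when its words occur in the given order (not necessarily adjacent), and ordered combinations are generated in each text's order of appearance. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def matches(combo, tokens, ordered):
    """Unordered: every word occurs. Ordered: words occur as a subsequence."""
    if not ordered:
        return set(combo) <= set(tokens)
    remaining = iter(tokens)
    return all(word in remaining for word in combo)

def f1_score(combo, ordered, labelled):
    """S3.3: F1 of 'text matches combo' as a predictor of the positive label."""
    tp = sum(1 for toks, pos in labelled if pos and matches(combo, toks, ordered))
    fp = sum(1 for toks, pos in labelled if not pos and matches(combo, toks, ordered))
    fn = sum(1 for toks, pos in labelled if pos and not matches(combo, toks, ordered))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def generate(docs_keywords, max_len):
    """S3.1: build unordered (frozenset) and ordered (tuple) combination sets."""
    unordered, ordered = set(), set()
    for keywords in docs_keywords:
        unique = list(dict.fromkeys(keywords))  # keep first-appearance order
        for n in range(1, min(max_len, len(unique)) + 1):
            for combo in combinations(unique, n):
                ordered.add(combo)
                unordered.add(frozenset(combo))
    return unordered, ordered

def recommend(docs_keywords, labelled, max_len=6, min_f1=0.5):
    """S3.3 and S3.4: threshold on F1, then deduplicate per word set."""
    unordered, ordered = generate(docs_keywords, max_len)
    scored = [(f1_score(tuple(c), False, labelled), False, tuple(sorted(c)))
              for c in unordered]
    scored += [(f1_score(c, True, labelled), True, c) for c in ordered]
    best = {}
    for f1, is_ordered, combo in scored:
        if f1 <= min_f1:
            continue
        old = best.get(frozenset(combo))
        # Keep the higher F1; on a tie prefer the unordered combination.
        if old is None or f1 > old[0] or (f1 == old[0] and old[1] and not is_ordered):
            best[frozenset(combo)] = (f1, is_ordered, combo)
    return sorted(best.values(), key=lambda r: (-r[0], r[2]))

# S3.2: a small hand-labelled sample (True = belongs to the target text).
labelled = [
    (["network", "fault", "report"], True),
    (["fault", "network", "upgrade"], False),
    (["network", "fault", "alarm"], True),
    (["weather", "report"], False),
]
result = recommend([toks for toks, _ in labelled], labelled, max_len=2, min_f1=0.7)
for f1, is_ordered, combo in result:
    print(f1, "ordered" if is_ordered else "unordered", combo)
```

In the toy data the ordered pair "network" then "fault" separates the positive samples from the negative sample that contains the same two words in reverse order, so the ordered combination replaces the unordered one during deduplication.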

The beneficial effects of the invention are:

1. Compared with manually selected keyword combinations, the keyword combinations selected by the invention are of better quality and more complete. When keyword combinations for searching the target text are selected manually, labor and time costs mean that the "satisficing principle" is generally followed: a combination with a reasonably satisfactory search effect is chosen without exhausting all combinations, rather than the optimal solution among all combinations. The invention effectively avoids this situation. By exhausting all combinations, it makes it possible to select the optimal solution from among them, so that the selected keyword combinations are of better quality and more complete.

2. The invention keeps the code efficient by improving the code logic, using multithreaded operation, and similar methods, so that the technical scheme takes relatively little time to run, which reduces its time cost.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is an overall flow diagram of the present invention for automatically selecting appropriate keyword combinations for extracting text.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1

As shown in fig. 1, the method for extracting text by automatically selecting a suitable keyword combination according to the present invention includes the following steps:

S1, preprocessing the original text to obtain a high-quality word segmentation result, where the preprocessing comprises finding fixed collocation phrases, segmenting the original text into words, and filtering stop words;

S2.1, using the TF-IDF algorithm (a common prior-art weighting technique for information retrieval and data mining that is often used to mine the keywords of an article; it is simple and efficient and is not described further here) to assign each word contained in each piece of data in the original text a weight calculated from statistical information;

S2.3, for each word, adding the weight calculated by the TF-IDF algorithm to the topic saliency calculated by the trained LDA model, and correcting the sum according to the word's part of speech (for example, when the word is a noun or a verb, the sum is increased by a set proportion, and when the word is an adverb or an adjective, the sum is decreased by a set proportion) to obtain the final weight of each word;
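One way to write the correction in S2.3 as a formula, assuming the "set proportion" is applied multiplicatively. The symbols w, sal, c and r are introduced here for illustration and do not appear in the original text.

```latex
\[
w(t) = \bigl(\mathrm{tfidf}(t) + \mathrm{sal}(t)\bigr)\cdot c(t),
\qquad
c(t) =
\begin{cases}
1 + r, & \text{if } t \text{ is a noun or a verb} \\
1 - r, & \text{if } t \text{ is an adverb or an adjective} \\
1,     & \text{otherwise}
\end{cases}
\]
```

For instance, with tfidf(t) = 0.30, sal(t) = 0.20 and an assumed r = 0.2, a noun would receive w(t) = 0.50 × 1.2 = 0.60, while an adverb with the same two scores would receive w(t) = 0.50 × 0.8 = 0.40.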

Wherein, the specific steps of finding the fixed collocation phrases in S1 are as follows:

S1.1, generating all N-gram word strings (with N set from 1 to 6) from the original text, calculating the left-right entropy and the mutual information of each N-gram word string, setting minimum thresholds for the left-right entropy and the mutual information, and taking the N-gram word strings whose left-right entropy and mutual information are both greater than the set thresholds as candidate fixed collocation phrases;

S1.3, further filtering the candidate fixed collocation phrases based on part of speech. Most fixed collocation phrases are noun phrases and conform to certain part-of-speech patterns. After the word segmentation tool is used to obtain the part-of-speech combination of each candidate fixed collocation phrase, the candidates are further filtered against the summarized part-of-speech patterns to obtain the final set of fixed collocation phrases.
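The part-of-speech filtering in S1.3 can be sketched in Python as follows. The allowed patterns, the single-letter tags, and the toy tag dictionary are hypothetical. In practice the tags would come from the word segmentation tool, and the patterns from whatever has been summarized for the corpus.

```python
# Hypothetical allowed patterns, written as tuples of coarse part-of-speech tags
# (n = noun, a = adjective, v = verb, d = adverb).
NOUN_PHRASE_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n"), ("n", "n", "n")}

def filter_by_pos_pattern(phrases, tag_of, patterns=NOUN_PHRASE_PATTERNS):
    """Keep only the phrases whose tag sequence is one of the allowed patterns."""
    return [
        phrase for phrase in phrases
        if tuple(tag_of(word) for word in phrase) in patterns
    ]

# Toy tag lookup standing in for a real part-of-speech tagger.
tags = {"machine": "n", "learning": "n", "very": "d",
        "quickly": "d", "deep": "a", "network": "n"}
candidates = [("machine", "learning"), ("very", "quickly"), ("deep", "network")]
fixed_phrases = filter_by_pos_pattern(candidates, tags.get)
print(fixed_phrases)
```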

In S1, the original text is segmented and stop words are filtered as follows: the obtained fixed collocation phrases are added to the user dictionary, and the original text is segmented with the pkuseg word segmentation tool to obtain a word segmentation result; common stop words are then collected and added to the stop word dictionary, and the word segmentation result is further filtered based on the stop word dictionary.

The specific steps of S3 are as follows:

S3.1, generating unordered and ordered co-occurrence word combination sets, with combination lengths traversed from 1 to 6, from the candidate keywords contained in each piece of data in the original text;

S3.2, extracting part of the data from the original text for labeling, labeling the data that belongs to the target text as positive samples and the data that does not belong to the target text as negative samples;

Example 2

As shown in fig. 1, the invention is used as follows. A text set T is input, the fixed collocation phrases in T are obtained according to the left-right entropy and the mutual information, and these phrases are added to the pkuseg user dictionary. The text set is then segmented with pkuseg and stop words are removed to obtain a high-quality word segmentation result.

An LDA model is trained on the text set T, and the trained model is used to calculate the topic saliency of the words in each category of text. The final topic weight of the words in the text of each category i is calculated by combining the topic saliency with TF-IDF and the part of speech of each word. After the words are sorted by topic weight, those whose topic weight is higher than a threshold v are taken as the candidate keywords of the i-th category of text.

The candidate keywords of the i-th category of text are traversed to generate the unordered keyword combinations and the ordered keyword combinations of that category, and these combinations are used to search the manually labeled positive and negative samples. An F1-score threshold F is set to obtain the keyword combinations whose F1-score is greater than F, and these combinations are then deduplicated (if the F1-score of an ordered keyword combination is higher than that of the unordered keyword combination containing the same words, the ordered combination is kept; otherwise the unordered combination is kept) to obtain the final keyword combination recommendation result.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for automatically selecting a proper keyword combination to extract a text is characterized by comprising the following steps:

2. The method of claim 1, wherein the step of finding fixed collocation phrases in S1 comprises the following steps:

3. The method of claim 1, wherein the segmenting of the original text and the filtering of stop words in S1 are performed by adding the obtained fixed collocation phrases to a user dictionary, segmenting the original text by using a pkuseg segmentation tool to obtain segmentation results, and then performing further filtering of the segmentation results based on the stop word dictionary by collecting common stop words to supplement the stop word dictionary.

4. The method for extracting text by automatically selecting proper keyword combinations according to claim 1, wherein the specific steps of S3 are as follows: