CN113743090B

CN113743090B - Keyword extraction method and device

Info

Publication number: CN113743090B
Application number: CN202111048659.0A
Authority: CN
Inventors: 张雅琴
Original assignee: Du Xiaoman Technology Beijing Co Ltd
Current assignee: Du Xiaoman Technology Beijing Co Ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2024-04-12
Anticipated expiration: 2041-09-08
Also published as: CN113743090A

Abstract

The application provides a keyword extraction method and device. After word segmentation is performed on a sentence to be processed, broken-word merging is applied to the segmentation result, and the TF-IDF value of each word is then obtained from a keyword dictionary. The sentence to be processed is also divided into short sentences; each short sentence undergoes word segmentation and broken-word merging to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are determined from the words it contains, their TF-IDF values, and the core phrases. Because words are first extracted from the whole sentence and core phrases are then extracted from each short sentence, important information is not omitted. In addition, broken-word merging after segmentation reduces the number of words while making the extracted keyword information more complete.

Description

Keyword extraction method and device

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a keyword extraction method and device.

Background

Keyword extraction automatically extracts from a text the words that best express its meaning. Current keyword extraction techniques, such as term frequency-inverse document frequency (TF-IDF), TextRank, and topic-model-based extraction, are essentially designed for text corpora such as documents and articles, which are characterized by many words, a large amount of information, a clear topic, and clear contextual relations.

In an automatic question-answering scenario, a user inputs a sentence and the system needs to extract the keywords of that sentence. Sentences in an automatic question-answering system generally have the following characteristics: (1) the content is short and contains relatively few words; (2) the purpose is clear; (3) one sentence may contain several meanings; (4) the wording is highly colloquial, with flexible expression and varied styles. The corpus of an automatic question-answering system is therefore completely different from a long-text corpus, so keyword extraction techniques designed for long text are not suitable for its short-text corpus.

Disclosure of Invention

Accordingly, the present invention aims to provide a keyword extraction method and apparatus, so as to solve the above technical problems, and the specific technical scheme disclosed in the present invention includes:

In a first aspect, the present application provides a keyword extraction method, including:

performing word segmentation on the sentence to be processed to obtain a word segmentation result, and performing broken-word merging on the word segmentation result to obtain a word segmentation merging result;

obtaining, based on a keyword dictionary obtained through pre-training, the term frequency-inverse document frequency (TF-IDF) of each word in the word segmentation merging result, wherein the keyword dictionary comprises the TF-IDF corresponding to each keyword;

dividing the sentence to be processed into short sentences, performing word segmentation and broken-word merging on each short sentence to obtain the words contained in the short sentence, and performing dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence;

and obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase.

In a possible implementation manner of the first aspect, obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase includes:

acquiring a weight coefficient corresponding to each word contained in the sentence to be processed, wherein the weight coefficient comprises a weight corresponding to the position of the word and a weight corresponding to the core phrase;

obtaining a target weight corresponding to each word based on the weight coefficient and the TF-IDF corresponding to the word;

and determining a preset number of words as the keywords of the sentence to be processed, in descending order of the target weights corresponding to the words in the sentence to be processed.

In another possible implementation manner of the first aspect, the weight coefficient includes a first weight corresponding to the TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to the position of the short sentence in the sentence to be processed, and a fourth weight corresponding to the part of speech of each core phrase;

obtaining the target weight corresponding to each word based on the weight coefficient and the TF-IDF corresponding to the word comprises the following steps:

calculating the product of the first weight corresponding to the word and the TF-IDF of the word;

and calculating the sum of the product, the second weight, the third weight and the fourth weight to obtain the target weight corresponding to the word.

In a further possible implementation manner of the first aspect, the maximum value of the sum of the first weight, the second weight, the third weight and the fourth weight is equal to 1;

the second weight corresponding to a word in the core phrase is a preset second-weight value, and the second weight corresponding to a word outside the core phrase is 0;

the third weight corresponding to a short sentence at the head or tail of the sentence to be processed is higher than the third weight corresponding to short sentences at other positions in the sentence to be processed;

and the fourth weights corresponding to words with different parts of speech are different.

In still another possible implementation manner of the first aspect, dividing the sentence to be processed into short sentences and performing word segmentation and broken-word merging on each short sentence to obtain the words contained in the short sentence includes:

dividing the sentence to be processed into short sentences according to the punctuation marks contained in the sentence to be processed;

and performing word segmentation on each short sentence to obtain a word segmentation result, and merging the words in the word segmentation result whose co-occurrence frequency is greater than a preset threshold, to obtain the words contained in the short sentence.

In another possible implementation manner of the first aspect, performing dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence includes:

analyzing semantic dependency relationships among the words contained in the short sentence by using a dependency syntax analysis method;

and determining the core phrase in the short sentence according to the semantic dependency relationships.

In a further possible implementation manner of the first aspect, determining the core phrase contained in the short sentence according to the semantic dependency relationships includes:

extracting an initial core word of the short sentence according to the semantic dependency relationships;

and expanding the initial core word according to the semantic dependency relationships corresponding to the initial core word to obtain the core phrase.

In a further possible implementation manner of the first aspect, the process of obtaining the keyword dictionary includes:

for any sentence in a training sentence set, performing word segmentation and broken-word merging to obtain the words contained in the sentence;

and for each word, calculating the TF-IDF of the word according to the term frequency of the word and the number of sentences containing the word, to obtain the TF-IDF of each word contained in the training sentence set.

In a second aspect, the present application further provides a keyword extraction apparatus, including:

a word segmentation and broken-word merging module, configured to perform word segmentation on the sentence to be processed to obtain a word segmentation result, and to perform broken-word merging on the word segmentation result to obtain a word segmentation merging result;

a TF-IDF acquisition module, configured to obtain the TF-IDF of each word in the word segmentation merging result based on a keyword dictionary obtained through pre-training, wherein the keyword dictionary comprises the TF-IDF corresponding to each keyword;

a core phrase acquisition module, configured to divide the sentence to be processed into short sentences, perform word segmentation and broken-word merging on each short sentence to obtain the words contained in the short sentence, and perform dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence;

and a keyword determining module, configured to obtain the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase.

In a third aspect, the present application further provides an electronic device, including: a memory and a processor;

the memory stores a program, and the processor invokes the program in the memory to implement the keyword extraction method according to any one of the possible implementation manners of the first aspect.

In a fourth aspect, the present application further provides a computer readable storage medium, where a program is stored, where the program, when executed by a computing device, implements the keyword extraction method of the first aspect or any one of possible implementations.

According to the keyword extraction method, word segmentation is performed on the sentence to be processed, broken-word merging is applied to the segmentation result, and the TF-IDF value of each word is then obtained from a keyword dictionary. Further, the sentence to be processed is divided into short sentences; word segmentation and broken-word merging are performed on each short sentence to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are determined from the words it contains, their TF-IDF values, and the core phrases. Because words are first extracted from the whole sentence and core phrases are then extracted from each short sentence, important information is not omitted. In addition, broken-word merging after segmentation reduces the number of words while making the extracted keyword information more complete. In summary, the scheme is designed for the corpus of an automatic question-answering system, and the keywords it extracts from such a corpus are more accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a specific flow diagram of a keyword extraction process provided in an embodiment of the present application;

Fig. 2 is a flowchart of a keyword extraction method provided in an embodiment of the present application;

Fig. 3 is a flowchart of a process for obtaining a keyword dictionary using corpus training provided by an embodiment of the present application;

Fig. 4 is a flowchart of a process for obtaining keywords of a sentence to be processed provided by an embodiment of the present application;

Fig. 5 is a block diagram of a keyword extraction apparatus provided in an embodiment of the present application.

Detailed Description

Before describing in detail the method embodiments provided herein, technical terms referred to herein will be described.

Term frequency-inverse document frequency: abbreviated as TF-IDF, where TF is the number of times a word appears in an article, namely the term frequency, and IDF is derived from the ratio of the total number of documents in the corpus to the number of documents containing the term. TF-IDF measures the importance of a term from a statistical perspective.

Chinese word segmentation: a Chinese word segmentation algorithm segments a sequence of Chinese characters into individual words. To identify the semantics of Chinese text, several characters must be combined into words in order to express the true meaning.

Broken-word merging: as the name implies, two words with a high co-occurrence probability are merged into one word.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1 and fig. 2, fig. 1 shows a specific flowchart of a keyword extraction process provided in an embodiment of the present application, and fig. 2 shows a flowchart of a keyword extraction method provided in an embodiment of the present application.

The keyword extraction method is applied to electronic equipment, and the electronic equipment can be terminal equipment such as a mobile phone, a computer, a tablet personal computer and the like, and can also be a server.

As shown in Fig. 1, word segmentation and broken-word merging are first performed on the sentence to be processed, and the keyword dictionary is queried to obtain the TF-IDF value of each word contained in the sentence (see S110 and S120 in Fig. 2 for details). The sentence to be processed is then divided into several short sentences according to punctuation marks, and the following steps are performed for each short sentence: word segmentation and broken-word merging (see S130 in Fig. 2), followed by dependency syntax analysis and extraction of the core phrase from the dependency analysis result (see S140 in Fig. 2). Finally, the target weight of each word in the sentence to be processed is obtained from its TF-IDF value and its weights, and the first n words in descending order of target weight are determined as the keywords of the sentence to be processed.

The following describes the procedure of the keyword extraction method provided in the present application with reference to fig. 2:

S110, performing word segmentation on the sentence to be processed to obtain a word segmentation result, and performing broken-word merging on the word segmentation result to obtain a word segmentation merging result.

The to-be-processed sentence refers to any sentence needing to extract a keyword, for example, in an on-line customer service automatic question-answering application scene, the to-be-processed sentence is a sentence input by a user.

A Chinese word segmentation tool can be used to segment the sentence to be processed and obtain the corresponding word segmentation result. For example, for the original sentence "trouble you tell me where to return goods", the word segmentation result is "trouble, you, tell, me, where, return goods".

The main functions of broken-word merging are as follows: 1) Merging colloquial words. For example, after broken-word merging, "trouble, you, tell, me, where, return goods" becomes "trouble you, tell me, where, return goods". Broken-word merging can greatly reduce the number of segments in a sentence.

2) Merging and restoring proper nouns. For example, the proper noun "drop insurance" may be segmented into "drop, insurance", which changes its semantics, so the two words need to be merged back into "drop insurance".

Broken-word merging counts the number of times two adjacent words occur together; if the co-occurrence count exceeds a certain threshold, the two words are merged into one word. This technique can greatly reduce the number of words and keeps the extracted keywords more semantically complete.
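The counting-and-threshold rule described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the greedy left-to-right merge order, the `joiner` parameter, and the toy corpus of English glosses are all assumptions made for the example.

```python
from collections import Counter


def count_adjacent_pairs(segmented_sentences):
    """Count how often each pair of adjacent words occurs across a segmented corpus."""
    counts = Counter()
    for words in segmented_sentences:
        counts.update(zip(words, words[1:]))
    return counts


def merge_broken_words(words, pair_counts, threshold, joiner=""):
    """Greedily merge adjacent words whose co-occurrence count exceeds the threshold.

    The greedy left-to-right order is an assumption; the source only states
    that pairs above a threshold are merged.
    """
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and pair_counts[(words[i], words[i + 1])] > threshold:
            merged.append(words[i] + joiner + words[i + 1])
            i += 2  # skip the word that was just absorbed
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus of already segmented sentences (hypothetical data).
corpus = [
    ["trouble", "you", "tell", "me", "where", "return"],
    ["trouble", "you", "check", "order"],
    ["tell", "me", "price"],
    ["trouble", "you", "tell", "me", "time"],
]
pair_counts = count_adjacent_pairs(corpus)
print(merge_broken_words(corpus[0], pair_counts, threshold=2, joiner=" "))
# ['trouble you', 'tell me', 'where', 'return']
```

For Chinese text the default empty `joiner` concatenates the two words directly; the space is used here only because the toy data is in English.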

S120, based on a keyword dictionary obtained through pre-training, TF-IDF values corresponding to each word of the word segmentation merging result are obtained.

The keyword dictionary comprises the TF-IDF value corresponding to each keyword.

As shown in fig. 3, the process of obtaining the keyword dictionary by corpus training includes the following steps:

S121, performing word segmentation and broken-word merging on each sentence in the input sentence set to obtain the keywords contained in the sentence.

During training, all input sentences are combined into a data set, and word segmentation and word breaking combination are carried out on each sentence in the data set.

S122, calculating corresponding TF-IDF values of each keyword.

After the complete keyword set is obtained, for each keyword in the set, the term frequency (TF) of the word and the number of texts in the corpus that contain the word are counted, the TF-IDF value of the word is calculated, and finally each word and its TF-IDF value are output as a dictionary.

TF-IDF is used to evaluate the importance of a term to one document in a document set or corpus. The importance of a term increases in proportion to the number of times it appears in a document, but decreases with the frequency with which it appears across the corpus. In other words, the more often a term appears in one text and the fewer documents it appears in overall, the better it represents that text.

TF, term frequency, refers to the number of occurrences of a given word in a document, which is typically normalized, e.g., TF is equal to the quotient of the number of occurrences of a word in a text divided by the total number of words in the text.

IDF is the inverse document frequency: the fewer documents contain the word, the larger the IDF. The IDF of a given term may be obtained by dividing the total number of texts in the corpus by the number of texts containing the term and taking the logarithm of the quotient, i.e., the calculation formula of the IDF is as follows:

IDF = log(total number of texts in the corpus / number of texts containing the word)

Finally, the TF-IDF value is calculated from the TF value and the IDF value, using the formula TF-IDF = TF * IDF.

S123, generating a keyword dictionary according to each keyword and the corresponding TF-IDF value.

After the TF-IDF value of each word in the corpus is calculated, each word and its TF-IDF value are output as the keyword dictionary, for example {'w1': tfidf_1, 'w2': tfidf_2, 'w3': tfidf_3, …, 'wn': tfidf_n}, where 'w1' represents a word and tfidf_1 represents the TF-IDF value corresponding to the word 'w1'; by analogy, 'wn' represents the n-th word and tfidf_n represents the TF-IDF value corresponding to the word 'wn'.
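Steps S121 to S123 can be sketched as below. This is only an illustration under stated assumptions: TF is taken as the word's count over the whole training set divided by the total number of words, each training sentence is treated as one "text" for the IDF, and the natural logarithm is used. The source does not fix these choices, and the function name is hypothetical.

```python
import math
from collections import Counter


def build_keyword_dictionary(segmented_sentences):
    """Return {word: TF-IDF} for a corpus of segmented, merged sentences.

    Assumptions: TF is normalized over the whole corpus, each sentence counts
    as one text for the IDF, and log is the natural logarithm.
    """
    total_sentences = len(segmented_sentences)
    term_counts = Counter()      # occurrences of each word in the corpus
    sentence_counts = Counter()  # number of sentences containing each word
    for words in segmented_sentences:
        term_counts.update(words)
        sentence_counts.update(set(words))

    total_terms = sum(term_counts.values())
    dictionary = {}
    for word, count in term_counts.items():
        tf = count / total_terms
        idf = math.log(total_sentences / sentence_counts[word])
        dictionary[word] = tf * idf
    return dictionary


# Hypothetical two-sentence training set.
keyword_dictionary = build_keyword_dictionary([["return", "goods"], ["return", "price"]])
print(keyword_dictionary["return"])  # 0.0, because the word occurs in every sentence
```

A word that appears in every training sentence receives an IDF of zero under this formula, which is why purely colloquial filler words tend to score low.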

The keyword dictionary can subsequently be used to query the TF-IDF value of a given word directly. For example, if the keyword dictionary contains a keyword identical to the word to be queried (as determined, for example, by a Chinese word matching algorithm), the TF-IDF value of that keyword is read and used as the TF-IDF value of the word to be queried.

S130, sentence segmentation and broken word combination are carried out on the sentences to be processed, and words contained in each sentence are obtained.

In an automatic question-answering system, a sentence entered by a user may contain several pieces of information, which are separated by punctuation marks. To improve the accuracy of keyword extraction in this scenario, the sentence to be processed is first split into short sentences according to its punctuation marks. Keywords are then extracted for each short sentence, so that important information is not lost.

For each short sentence, word segmentation processing and word breaking merging are carried out first to obtain the words contained in the short sentence.
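The punctuation-based split can be sketched with a regular expression. The set of delimiters below (common Chinese and Western punctuation) is an assumption, since the source does not list which punctuation marks are used, and the function name is hypothetical.

```python
import re

# Assumed delimiter set: common full-width and half-width punctuation marks.
CLAUSE_DELIMITERS = re.compile(r"[，。！？；,.!?;]+")


def split_into_short_sentences(sentence):
    """Split a sentence into short sentences at punctuation, dropping empty pieces."""
    parts = CLAUSE_DELIMITERS.split(sentence)
    return [part.strip() for part in parts if part.strip()]


print(split_into_short_sentences("Hello, I bought a phone. Where can I return it?"))
# ['Hello', 'I bought a phone', 'Where can I return it']
```

Each returned short sentence would then go through word segmentation and broken-word merging. Note that this simple pattern also splits on a period inside a decimal number, so a production delimiter set would need more care.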

S140, performing dependency syntactic analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence.

Dependency syntax analysis is the automatic derivation of the syntax structure of sentences from a given syntax hierarchy, analyzing the syntax units contained in sentences and the relationships between these syntax units.

In one possible implementation, based on the dependency syntax analysis, semantic dependencies between terms contained in a short sentence are analyzed, and a core phrase in the short sentence is further determined according to the semantic dependencies.

In another possible implementation, semantic dependency relationships among the words contained in a short sentence are analyzed by dependency syntax analysis, the core word of the short sentence is extracted according to the semantic dependency relationships and used as the initial core word, and the core word is then expanded according to the subject-predicate structure, verb-object structure, adverbial-head structure, and the like of the initial core word, to obtain the core phrase of the short sentence.

For example, take the short sentence "Does my situation now meet the application requirements?" First, word segmentation and broken-word merging turn the sentence into "now | whether | my situation | meets | application requirements". Dependency syntax analysis shows that "meets" is a verb and the predicate of the whole sentence, so the core word of the sentence is "meets". "Meets" and "application requirements" form a verb-object structure, and "application requirements" has a specific meaning, so the core phrase is expanded to "meets application requirements".

Although "my situation" and "meets" form a subject-predicate structure, "my situation" is not a word with a specific meaning, and expanding it into the core phrase would not help it serve as a keyword. Therefore, "my situation" is not expanded into the core phrase.
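The expansion step can be sketched over a pre-computed dependency parse. Everything here is an assumption made for illustration: the arcs are supplied by hand rather than produced by a real parser, the relation labels "HED", "SBV", "VOB" and "ADV" stand for root, subject-predicate, verb-object and adverbial-head relations, the English glosses stand in for the original words, and the `meaningful_words` set is a hypothetical stand-in for whatever test decides that a word "has a specific meaning", which the source does not specify.

```python
def extract_core_phrase(words, arcs, meaningful_words,
                        expand_relations=("SBV", "VOB", "ADV")):
    """Return the core phrase of a short sentence as a list of words.

    arcs: list of (head_index, dependent_index, relation); head_index -1 marks the root.
    meaningful_words: hypothetical set of words considered to carry specific meaning.
    """
    # The initial core word is the dependent of the root arc (typically the predicate).
    root = next(dep for head, dep, rel in arcs if head == -1)
    selected = {root}
    # Expand with direct dependents of the core word that carry specific meaning.
    for head, dep, rel in arcs:
        if head == root and rel in expand_relations and words[dep] in meaningful_words:
            selected.add(dep)
    return [words[i] for i in sorted(selected)]


words = ["now", "whether", "my situation", "meets", "application requirements"]
arcs = [(-1, 3, "HED"), (3, 0, "ADV"), (3, 1, "ADV"), (3, 2, "SBV"), (3, 4, "VOB")]
print(extract_core_phrase(words, arcs, meaningful_words={"application requirements"}))
# ['meets', 'application requirements']
```

Here "my situation" is attached to the core word by a subject-predicate arc but is left out because it is not in the assumed set of meaningful words, which mirrors the reasoning in the example.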

And S150, obtaining keywords of the sentence to be processed based on the words contained in the sentence to be processed, TF-IDF values corresponding to the words and the core phrase.

In one embodiment of the present application, as shown in fig. 4, the process of obtaining the keywords of the sentence to be processed may include:

S151, acquiring the words contained in the sentence to be processed and the weight coefficient corresponding to each word.

In one possible implementation, a corresponding weight coefficient, e.g., a position weight, a weight corresponding to the core phrase, etc., is set for each word.

S152, obtaining the target weight corresponding to each word based on the weight coefficient and the TF-IDF value corresponding to the word.

For a given term, several factors affect whether it is a keyword, such as the TF-IDF value of the term, the position of the term in the sentence, the part of speech of the term, and whether the term belongs to a core phrase. The embodiments of the present application describe the influence on the keyword from each of these four dimensions. Thus, in one embodiment of the present application, the weight coefficient for each term includes four weights:

1) One weight tfidf_weight, i.e., the first weight, is set for each TF-IDF value.

The first weight is used for representing the influence degree of the dimension of the TF-IDF value on the keyword, the tfidf_weight values corresponding to different TF-IDF values are the same, the size of the tfidf_weight values can be determined according to practical conditions, the larger the value is, the larger the influence degree of the dimension of the TF-IDF value on the keyword is, and conversely, the smaller the value is, the smaller the influence degree of the TF-IDF value on the keyword is.

2) And setting a weight w_word_group, namely a second weight, for the core phrase.

For a word contained in a sentence, if the word is a core phrase, the probability that the word is a keyword is greater than a word that is not a core phrase. Thus, a weight is set for words belonging to the core phrase. The weight characterizes the influence degree of the dimension of the core phrase on the keywords. The second weight may be a fixed value, e.g., for a given word, the second weight takes a corresponding set value if the word is a core phrase and is 0 if the word is not a core phrase.

3) The location weight location_weight corresponding to the word is the third weight.

In an automatic question-answering system, the purpose of the words input by the user is clear, the purpose is generally embodied in a first sentence or a last sentence, and the information of the positions of the words is particularly important, so that the position weights of the words are introduced, and the position weights represent the influence degree of the words at different positions on whether the words are keywords or not.

The numerical value of the position weight can be determined according to the actual situation, and the numerical value of the position weight corresponding to the position capable of reflecting the statement purpose is larger, and the numerical value of the position weight corresponding to other positions in the statement is smaller.

For example, one sentence includes d phrases, a position weight is set for each position of each phrase, the position weights corresponding to the d phrases are {1:location_weight_1,2:location_weight_2, …, and d:location_weight_d }, where location_weight_1 represents the position weight corresponding to the 1 st phrase in the sentence, and so on, location_weight_d represents the position weight corresponding to the d-th phrase.

For example, the values of location_weight_1 and location_weight_d are larger, and may be equal or unequal, and the phrases at other locations have smaller location weights.

4) A part-of-speech weight w4, the fourth weight, is set for the parts of speech.

Words of different parts of speech play different roles in a sentence, for example, a verb is typically the central component of a sentence, and words of other parts of speech are typically governed by the verb, so that the weight of the verb is greatest, the corresponding weight of the noun is less, and the corresponding weight of the adjective is least. Of course, the weight coefficients corresponding to other parts of speech may also be set, and will not be described in detail here.

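For example, the part-of-speech weights may be {'verb': verb_w, 'noun': noun_w, 'adjective': adj_w}, where verb_w, noun_w and adj_w are the weights corresponding to verbs, nouns and adjectives, respectively.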
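The position and part-of-speech weight tables described above can be sketched as simple lookups. The numeric values, the function names, and the default of 0.0 for parts of speech not in the table are all hypothetical; the source only states that the first and last short sentences get larger position weights and that verbs outweigh nouns, which outweigh adjectives.

```python
def position_weights(d, edge_weight, inner_weight):
    """Return {1: w_1, ..., d: w_d}, giving the first and last short sentence the larger weight."""
    return {i: (edge_weight if i in (1, d) else inner_weight)
            for i in range(1, d + 1)}


# Hypothetical part-of-speech weights with verb > noun > adjective.
POS_WEIGHTS = {"verb": 0.15, "noun": 0.10, "adjective": 0.05}


def pos_weight(part_of_speech, default=0.0):
    """Look up the fourth weight; unknown parts of speech fall back to an assumed default."""
    return POS_WEIGHTS.get(part_of_speech, default)


print(position_weights(4, edge_weight=0.2, inner_weight=0.05))
# {1: 0.2, 2: 0.05, 3: 0.05, 4: 0.2}
```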

After the weight coefficients are determined, the target weight Final_weight of a given word c is calculated according to the following formula:

Final_weight = tfidf_weight * tfidf + w_word_group + location_weight + w4

For example, if the word c is located in the first short sentence of the whole sentence and its part of speech is a verb, the target weight corresponding to the word c is:

Final_weight_c = tfidf_weight * tfidf_c + w_word_group + location_weight_1 + verb_w

The specific values of the four weights can be determined according to the actual situation, which is not limited in the present application.
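The target-weight formula can be written as a small function. The parameter names follow the symbols in the formula; the `is_core` flag, which sets the second weight to 0 for words outside the core phrase, and the numeric values in the example are assumptions for illustration.

```python
def target_weight(tfidf, tfidf_weight, is_core, w_word_group, location_weight, pos_weight):
    """Compute tfidf_weight * tfidf + second weight + location_weight + part-of-speech weight."""
    # The second weight applies only to words in the core phrase, otherwise it is 0.
    second_weight = w_word_group if is_core else 0.0
    return tfidf_weight * tfidf + second_weight + location_weight + pos_weight


# Hypothetical values: a core-phrase verb in the first short sentence.
weight_c = target_weight(tfidf=0.5, tfidf_weight=0.4, is_core=True,
                         w_word_group=0.3, location_weight=0.2, pos_weight=0.1)
print(round(weight_c, 6))  # 0.8
```

Only the TF-IDF term is scaled multiplicatively; the other three weights are added, so a word with a low TF-IDF value can still rank highly if it is a core-phrase verb in the first or last short sentence.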

S153, determining a preset number of words as keywords of the sentence to be processed according to the sequence from high to low of the target weight corresponding to each word in the sentence to be processed.

After the target weight of each word in the sentence to be processed is calculated according to the formula, sorting is carried out according to the target weight from high to low, and the first n words are selected as keywords of the sentence to be processed.
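The final ranking step amounts to a descending sort followed by a cut at n. A minimal sketch, with a hypothetical function name and made-up target weights:

```python
def top_n_keywords(target_weights, n):
    """Return the n words with the highest target weights, highest first."""
    ranked = sorted(target_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical target weights for the words of one sentence.
weights = {"trouble you": 0.1, "return goods": 0.8, "where": 0.5}
print(top_n_keywords(weights, 2))  # ['return goods', 'where']
```

How ties are broken is not specified in the source; in this sketch, words with equal weights keep their original order because Python's sort is stable.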

According to the keyword extraction method provided by this embodiment, after word segmentation is performed on the sentence to be processed, the segmentation result is merged by broken-word merging, and the TF-IDF value of each word in the word segmentation merging result is then obtained from a keyword dictionary. The sentence to be processed is divided into short sentences; word segmentation and broken-word merging are performed on each short sentence to obtain the words it contains, and dependency syntax analysis is performed on those words to obtain the core phrase of the short sentence. The keywords of the sentence to be processed are determined from the words it contains, their TF-IDF values, and the core phrases. Because words are first extracted from the whole sentence and core phrases are then extracted from each short sentence, important information is not omitted. In addition, broken-word merging after segmentation reduces the number of words while making the extracted keyword information more complete. In summary, the scheme is designed for the corpus of an automatic question-answering system, and the keywords it extracts from such a corpus are more accurate.

Further, when extracting keywords the scheme introduces a weight for the position of a word in the whole sentence, which ensures that words at positions carrying important information can be extracted and ultimately improves the accuracy of the extracted keywords. In addition, weights for other dimensions are set, such as the weight corresponding to TF-IDF, the weight corresponding to the core phrase and the weight corresponding to the part of speech, so that whether a word is a keyword is measured along several different dimensions, which further improves the accuracy of the extracted keywords. These dimensions are chosen according to the characteristics of the corpus in an automatic question-answering system, so the scheme is better suited to such a system.

Corresponding to the keyword extraction method embodiment, the application also provides a keyword extraction device embodiment.

Referring to Fig. 5, a block diagram of a keyword extraction apparatus provided in an embodiment of the present application is shown. The apparatus is applied to an electronic device and, as shown in Fig. 5, may include:

the word segmentation and broken word combination module 110 is configured to perform word segmentation on the sentence to be processed to obtain a word segmentation result, and perform broken word combination on the word segmentation result to obtain a word segmentation combination result.

In one possible implementation, the word segmentation and broken word combination module 110 includes:

The short sentence dividing sub-module is used for dividing the sentence to be processed into short sentences according to the punctuation marks contained in the sentence to be processed.

The word segmentation and broken word merging sub-module is used for performing word segmentation on each short sentence to obtain a word segmentation result, and merging the words in the word segmentation result whose co-occurrence frequency is larger than a preset threshold to obtain the words contained in the short sentence.
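The two sub-modules above can be sketched as follows. This is only an illustration under stated assumptions: the punctuation set, the use of adjacent-pair counts as the co-occurrence frequency, and every function name are hypothetical, and the word lists are assumed to come from some existing word segmenter.

```python
import re
from collections import Counter

# Assumed punctuation set; the text only says "punctuation marks".
PUNCTUATION = r"[,.;!?，。；！？]"


def split_short_sentences(sentence):
    """Split a sentence into short sentences at punctuation marks."""
    parts = re.split(PUNCTUATION, sentence)
    return [p.strip() for p in parts if p.strip()]


def count_adjacent_pairs(segmented_corpus):
    """Count how often two words appear next to each other in a corpus.

    Adjacent-pair counts stand in for "co-occurrence frequency" here;
    this measure is an assumption, not taken from the text.
    """
    counts = Counter()
    for words in segmented_corpus:
        counts.update(zip(words, words[1:]))
    return counts


def merge_broken_words(words, pair_counts, threshold, joiner=""):
    """Merge adjacent words whose pair count is larger than the threshold."""
    merged = []
    last = None  # last original (unmerged) word that was consumed
    for word in words:
        if last is not None and pair_counts.get((last, word), 0) > threshold:
            merged[-1] = merged[-1] + joiner + word
        else:
            merged.append(word)
        last = word
    return merged


# Made-up, already segmented corpus.
corpus = [
    ["credit", "card", "limit"],
    ["credit", "card", "bill"],
    ["apply", "credit", "card"],
]
pair_counts = count_adjacent_pairs(corpus)
print(split_short_sentences("How do I repay early, and is there a fee?"))
print(merge_broken_words(["apply", "credit", "card", "limit"], pair_counts, 2, joiner=" "))
```

The `joiner` argument exists only so that the English example stays readable; for Chinese text the merged words would normally be concatenated directly.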

The term frequency-inverse document frequency (TF-IDF) obtaining module 120 is configured to obtain the TF-IDF value of each word in the word segmentation merging result based on a keyword dictionary obtained through training in advance.

The keyword dictionary includes TF-IDF values corresponding to each keyword.

In one embodiment of the present application, the training keyword dictionary process includes:

The core phrase obtaining module 130 is configured to divide the sentence to be processed into short sentences, perform word segmentation and broken-word merging on each short sentence to obtain the words contained in the short sentence, and perform dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence.

In one possible implementation, the dependency syntax analysis process includes: analyzing the semantic dependency relationships among the words contained in the short sentence by using a dependency syntax analysis method; and determining the core phrase in the short sentence according to the semantic dependency relationships.

The process of determining the core phrase contained in the short sentence comprises the following steps: extracting the initial core word of the short sentence according to the semantic dependency relationships; and expanding the initial core word according to the semantic dependency relationships corresponding to the initial core word to obtain the core phrase.
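The two steps above (extract the initial core word, then expand it) can be sketched as follows. The arc format, the relation labels (`HED`, `SBV`, `VOB`, `ADV`, `ATT`) and the choice of which relations trigger expansion are illustrative assumptions; the arcs in the example are written by hand and are not the output of any particular parser.

```python
# Relations used to expand the initial core word (an assumed label set):
# SBV = subject-verb, VOB = verb-object, ADV = adverbial.
EXPAND_RELATIONS = {"SBV", "VOB", "ADV"}


def extract_core_phrase(words, arcs):
    """Return the core phrase of a short sentence as a list of words.

    words: list of words in the short sentence.
    arcs: list of (head_index, relation, dependent_index) tuples; the root
          arc uses head_index -1 and relation "HED" (hypothetical format).
    """
    # Step 1: the initial core word is the dependent of the root arc.
    core = next(dep for head, rel, dep in arcs if rel == "HED")
    # Step 2: add the words attached directly to the core word through
    # one of the selected relations.
    selected = {core}
    for head, rel, dep in arcs:
        if head == core and rel in EXPAND_RELATIONS:
            selected.add(dep)
    # Keep the original word order of the short sentence.
    return [words[i] for i in sorted(selected)]


words = ["I", "quickly", "repay", "the", "loan"]
arcs = [
    (2, "SBV", 0),   # "I" is the subject of "repay"
    (2, "ADV", 1),   # "quickly" modifies "repay"
    (-1, "HED", 2),  # "repay" is the root of the short sentence
    (4, "ATT", 3),   # "the" is attached to "loan", not to the core word
    (2, "VOB", 4),   # "loan" is the object of "repay"
]
print(extract_core_phrase(words, arcs))
```

In this sketch only direct dependents of the core word are added; whether the embodiment expands further along the dependency tree is not specified in this passage.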

The keyword determining module 140 is configured to obtain the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF values corresponding to the words, and the core phrase.

In one embodiment of the present application, the keyword determination module 140 may include:

The weight acquisition sub-module is used for acquiring the weight coefficients corresponding to the words contained in the sentence to be processed.

The weight coefficient comprises a weight corresponding to the position of the word and a weight corresponding to the core phrase.

The target weight calculation sub-module is used for obtaining the target weight corresponding to each word based on the weight coefficient corresponding to the word and the TF-IDF value of the word.

In one possible implementation manner, the weight coefficient includes a first weight corresponding to the TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to the position of the short sentence in the sentence to be processed, and a fourth weight corresponding to the part of speech of each core phrase.

Wherein, the first weights corresponding to words with different TF-IDF values are the same.

The second weight corresponding to a word in the core phrase is a preset second weight value, and the second weight corresponding to a word that does not belong to the core phrase is 0.

And the numerical value of the third weight corresponding to the short sentence at the head or tail of the sentence to be processed is higher than the third weight corresponding to the short sentence at other positions in the sentence to be processed.

The target weight calculation sub-module may include: the first weight calculation sub-module and the second weight calculation sub-module.

The first weight calculation sub-module is used for calculating the product of the first weight corresponding to the word and the TF-IDF value of the word.

The second weight calculation sub-module is used for calculating the sum of the product, the second weight, the third weight and the fourth weight to obtain the target weight corresponding to the word.

The keyword selection sub-module is used for determining a preset number of words as the keywords of the sentence to be processed, in descending order of the target weight corresponding to each word in the sentence to be processed.
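The target weight described above (the first weight multiplied by the TF-IDF value, plus the second, third and fourth weights) can be sketched as follows. All numeric values, parameter names and function names are hypothetical; the text only fixes the structure of the calculation, the rule that words outside the core phrase get a second weight of 0, and the rule that short sentences at the head or tail get a higher third weight.

```python
def position_weight(index, total, edge_weight=0.2, middle_weight=0.1):
    """Third weight: higher for the first or last short sentence.

    index: position of the short sentence that contains the word.
    total: number of short sentences in the sentence to be processed.
    The two weight values are assumed, not taken from the text.
    """
    return edge_weight if index in (0, total - 1) else middle_weight


def target_weight(tfidf, in_core_phrase, short_sentence_index,
                  total_short_sentences, pos_weight,
                  first_weight=0.4, second_weight_preset=0.3):
    """Combine the four weight dimensions into one target weight."""
    # First weight calculation: product of the first weight and TF-IDF.
    product = first_weight * tfidf
    # Second weight: preset value for core-phrase words, 0 otherwise.
    second = second_weight_preset if in_core_phrase else 0.0
    # Third weight: depends on where the short sentence sits.
    third = position_weight(short_sentence_index, total_short_sentences)
    # Fourth weight: part-of-speech weight, supplied by the caller.
    fourth = pos_weight
    # Second weight calculation: sum of the product and the other weights.
    return product + second + third + fourth


# A core-phrase word in the first of three short sentences.
print(target_weight(0.5, True, 0, 3, pos_weight=0.1))
# A non-core word in the middle short sentence.
print(target_weight(0.5, False, 1, 3, pos_weight=0.1))
```

Passing the part-of-speech weight in as an argument keeps the sketch independent of any particular tag set; a real system would look it up from the word's part-of-speech tag.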

According to the keyword extraction device provided by this embodiment, after words are extracted from the whole sentence, the whole sentence is divided into short sentences and a core phrase is extracted from each short sentence, which ensures that important information is not missed. In addition, after word segmentation is performed on the sentence, the scheme also performs broken-word merging, that is, it merges words with a high co-occurrence frequency, which reduces the number of words while making the extracted keyword information more complete. In summary, the scheme is designed for the corpus of an automatic question-answering system, for which the extracted keywords are more accurate.

An electronic device includes a processor and a memory having stored thereon a program executable on the processor. The above-mentioned keyword extraction method embodiment is implemented when the processor runs the program stored in the memory.

The application also provides a storage medium executable by the computing device, wherein the storage medium stores a program, and the program realizes the keyword extraction method when being executed by the electronic device.

For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present invention is not limited by the order of acts, as some steps may, in accordance with the present invention, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

It should be noted that the technical features described in the embodiments of this specification may be replaced or combined with each other. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to each other. Because the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief, and the relevant points can be found in the description of the method embodiments.

The steps in the methods of the embodiments of the present application may be reordered, combined, and deleted according to actual needs.

The modules and sub-modules in the device and the terminal in the embodiments of the present application may be combined, divided, and deleted according to actual needs.

In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative: the division into modules or sub-modules is merely a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or modules, and may be electrical, mechanical, or in other forms.

The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional module or sub-module in each embodiment of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims

1. A keyword extraction method, comprising:

obtaining keywords of the sentence to be processed based on the words contained in the sentence to be processed, the term frequency-inverse document frequency (TF-IDF) corresponding to the words, and the core phrase;

wherein performing dependency syntax analysis on the words contained in each short sentence to obtain the core phrase contained in the short sentence comprises:

analyzing the semantic dependency relationships among the words contained in a short sentence based on a dependency syntax analysis method; extracting a core word of the short sentence according to the semantic dependency relationships, the core word serving as the initial core word of the short sentence; and expanding the initial core word at least according to the subject-predicate structure, the verb-object structure and the adverbial-head structure of the initial core word to obtain the core phrase of the short sentence;

wherein obtaining the keywords of the sentence to be processed based on the words contained in the sentence to be processed, the TF-IDF corresponding to the words, and the core phrase comprises:

acquiring a weight coefficient corresponding to each word contained in the sentence to be processed, wherein the weight coefficient comprises a first weight corresponding to the TF-IDF, a second weight corresponding to the core phrase, a third weight corresponding to the position of the short sentence in the sentence to be processed, and a fourth weight corresponding to the part of speech of each core phrase, the first weight represents the degree of influence of the TF-IDF value dimension on the word, and the third weight represents the degree of influence of the position of the word on whether the word is a keyword;

calculating the product of the first weight corresponding to the word and the TF-IDF value of the word; calculating the sum of the product, the second weight, the third weight and the fourth weight to obtain a target weight corresponding to the word;

determining a preset number of words as the keywords of the sentence to be processed, in descending order of the target weight corresponding to each word in the sentence to be processed;

the sum of the maximum values of the first weight, the second weight, the third weight and the fourth weight is equal to 1;

2. The method of claim 1, wherein dividing the sentence to be processed into short sentences and performing word segmentation and broken-word merging on each short sentence to obtain the words contained in the short sentence comprises:

3. The method of claim 1, wherein the process of obtaining a keyword dictionary comprises:

4. A keyword extraction apparatus, characterized by comprising:

the keyword determining module is used for obtaining keywords of the sentence to be processed based on the words contained in the sentence to be processed, the term frequency-inverse document frequency (TF-IDF) corresponding to the words, and the core phrase;

the core phrase acquisition module is specifically configured to:

the keyword determining module is specifically configured to:

5. A computer-readable storage medium, characterized in that the storage medium has stored therein a program which, when executed by a computing device, implements the keyword extraction method of any one of claims 1-3.