CN109241525B - Keyword extraction method, device and system

Info

Publication number: CN109241525B
Application number: CN201810953403.6A
Authority: CN
Inventors: 马凯; 刘云峰; 徐易楠; 吴悦; 陈正钦; 杨振宇; 胡晓; 汶林丁
Original assignee: Shenzhen Zhuiyi Technology Co Ltd
Current assignee: Shenzhen Zhuiyi Technology Co Ltd
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2022-05-06
Anticipated expiration: 2038-08-20
Also published as: WO2020038253A1; CN109241525A

Abstract

The application relates to a keyword extraction method, device and system. The method comprises: extracting a preset number of results from a text corpus according to a preset number of keyword extraction models; fusing the preset number of results according to a fusion mode to obtain a fusion result; outputting the fusion result; and evaluating the fusion result and updating the fusion mode. For irregular information content such as colloquial expressions, the respective advantages of the keyword extraction models can be brought into play, so keyword extraction is more accurate and an ideal result is easier to extract.

Description

Keyword extraction method, device and system

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, and a system for extracting a keyword.

Background

With the arrival of the internet information era, people rely more and more on the internet in daily life and work. For example, the way goods are purchased has changed greatly with the development of the internet, and most consumers have shopped online. Because the actual goods cannot be touched when shopping online, consumers talk to customer service to find goods that suit them. However, different consumers phrase the same question differently, and customer service staff must read every question to determine what the consumer is actually asking, which places a heavy burden on them. Keyword extraction technology was introduced to solve this problem.

At present, keywords are mainly extracted by two kinds of methods: rule-based and statistics-based. Because Chinese usage scenarios are complex and new words keep emerging in the network era, rule-based methods are difficult to apply to all situations and perform poorly in practice. Statistics-based methods are currently popular: a large amount of text corpus is statistically analyzed to obtain statistical features of the corpus, and keywords are extracted automatically.

However, in the related art, most statistics-based methods use only a single machine learning model to extract keywords. Because current information content contains many colloquial expressions that follow no uniform standard, single-model keyword extraction has difficulty recognizing text corpora that express the same meaning in different colloquial ways, and it is difficult to accurately extract an ideal result.

Disclosure of Invention

In order to overcome the problems in the related art at least to a certain extent, the application provides a keyword extraction method, device and system.

According to a first aspect of an embodiment of the present application, there is provided a keyword extraction method, including:

extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;

fusing the preset number of results according to a fusion mode to obtain a fusion result;

outputting the fusion result;

and evaluating the fusion result and updating the fusion mode.

Optionally, the preset number of keyword extraction models include keyword extraction models depending on word segmentation;

when the keyword extraction model is a keyword extraction model depending on word segmentation, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:

performing word segmentation on the text corpus to obtain a word segmentation corpus;

and extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation.

Optionally, the word segmentation corpus includes a plurality of words; the keyword extraction model depending on word segmentation includes a segmentation-dependent model based on vocabulary association information and/or a segmentation-dependent model based on word frequency information, and the segmentation-dependent model based on vocabulary association information includes an entropy and mutual information model.

When the keyword extraction model depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation includes:

respectively calculating entropy values and mutual information values of the words;

obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the words;

extracting, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.

Optionally, among the preset number of keyword extraction models, the keyword extraction models other than those depending on word segmentation include a segmentation-independent model based on vocabulary association information and/or a segmentation-independent model based on word frequency information, and the segmentation-independent model based on vocabulary association information includes an entropy and mutual information model.

When a keyword extraction model other than those depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:

setting the maximum number of characters contained in the keywords to be extracted;

enumerating all character strings of the text corpus;

respectively calculating entropy values and mutual information values of all the character strings;

obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;

and extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value as results of the preset number according to the maximum number of the characters.

Optionally, the fusing the preset number of results according to feedback to obtain a fused result includes:

obtaining an intersection of the results extracted from the word segmentation corpus by the keyword extraction models depending on word segmentation, to obtain a word segmentation result;

obtaining an intersection of the results extracted from the text corpus by the keyword extraction models other than those depending on word segmentation, to obtain a non-word-segmentation result;

and taking a union of the word segmentation result and the non-word-segmentation result to obtain the fusion result.

Optionally, fusing the preset number of results to obtain a fusion result includes:

calculating the intersection of every two of the preset number of results to obtain a plurality of intersection results;

and taking a union of the plurality of intersection results to obtain the fusion result.

According to a second aspect of the embodiments of the present application, there is provided an apparatus for extracting a keyword, including:

the extraction module is used for extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the predetermined number is at least 2;

the fusion module is used for fusing the preset number of results according to a fusion mode to obtain a fusion result;

the output module is used for outputting the fusion result;

and the evaluation module is used for evaluating the fusion result and updating the fusion mode.

Optionally, when the keyword extraction model is a keyword extraction model depending on word segmentation, the extraction module includes:

the word segmentation unit is used for segmenting the text corpus to obtain a word segmentation corpus;

the first extraction unit is used for extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation.

Optionally, when the keyword extraction model depending on word segmentation is the entropy and mutual information model, the first extraction unit includes:

a calculating subunit, configured to calculate entropy values and mutual information values of the multiple words, respectively;

the threshold value determining subunit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the plurality of words;

an extracting subunit, configured to extract, as the preset number of results, the plurality of terms whose entropy values are greater than the entropy threshold and the plurality of terms whose mutual information values are greater than the mutual information threshold.

Optionally, when a keyword extraction model other than those depending on word segmentation is the entropy and mutual information model, the extraction module includes:

the setting unit is used for setting the maximum number of characters contained in the keywords to be extracted;

the enumeration unit is used for enumerating all character strings of the text corpus;

a calculating unit, configured to calculate entropy values and mutual information values of all the character strings respectively;

a threshold determining unit, configured to obtain an entropy threshold and a mutual information threshold according to the entropy values and the mutual information values of all the character strings;

and the second extraction unit is used for extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value according to the maximum number of the characters as the preset number of results.

Optionally, the fusion module includes:

the first intersection unit is used for obtaining an intersection of the results extracted from the word segmentation corpus by the keyword extraction models depending on word segmentation, to obtain a word segmentation result;

the second intersection unit is used for obtaining an intersection of the results extracted from the text corpus by the keyword extraction models other than those depending on word segmentation, to obtain a non-word-segmentation result;

and the first merging unit is used for taking a union of the word segmentation result and the non-word-segmentation result to obtain the fusion result.

Optionally, the fusion module includes:

the third intersection unit is used for solving the intersection of every two results with the preset number to obtain a plurality of intersection results;

and the second union unit is used for solving a union set of the intersection set results to obtain the fusion result.

According to a third aspect of embodiments herein, there is provided a non-transitory computer readable storage medium having instructions thereon, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a keyword extraction method, the method including:

extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the predetermined number is at least 2;

outputting the fusion result;

and evaluating the fusion result and updating the fusion mode.

enumerating all character strings of the text corpus;

According to a fourth aspect of the embodiments of the present application, there is provided an apparatus for extracting a keyword, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to:

outputting the fusion result;

and evaluating the fusion result and updating the fusion mode.

enumerating all character strings of the text corpus;

obtaining an intersection of the results extracted from the word segmentation corpus by the keyword extraction models depending on word segmentation, to obtain a word segmentation result;

According to a fifth aspect of the embodiments of the present application, there is provided a keyword extraction system, including: a processor, and a memory coupled to the processor;

the memory is used for storing a computer program used for executing the following keyword extraction method:

outputting the fusion result;

and evaluating the fusion result and updating the fusion mode.

enumerating all character strings of the text corpus;

obtaining a union set of the intersection results to obtain the fusion result;

the processor is used for calling and executing the computer program in the memory.

According to a sixth aspect of the embodiments of the present application, there is provided a storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the following steps in the keyword extraction method:

outputting the fusion result;

and evaluating the fusion result and updating the fusion mode.

enumerating all character strings of the text corpus;

calculating intersection of every two results with the preset number to obtain a plurality of intersection results;

The technical scheme provided by the embodiments of the application can have the following beneficial effects: a preset number of results are extracted from the text corpus according to a preset number of keyword extraction models, the preset number of results are fused according to a fusion mode to obtain a fusion result, the fusion result is output, and the fusion result is evaluated to update the fusion mode. On this basis, for irregular information content such as colloquial expressions, the respective advantages of the keyword extraction models can be brought into play, keyword extraction is more accurate, and an ideal result is easier to extract.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic flowchart of a keyword extraction method according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating a keyword extraction method when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a second embodiment of the present application.

Fig. 3 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model based on entropy and mutual information when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a third embodiment of the present application.

Fig. 4 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model of entropy and mutual information when the keyword extraction model is a keyword extraction model independent of word segmentation according to a fourth embodiment of the present application.

Fig. 5 is a schematic flowchart of a fusion method provided in the fifth embodiment of the present application.

Fig. 6 is a schematic flow chart of a fusion method provided in the sixth embodiment of the present application.

Fig. 7 is a schematic structural diagram of an apparatus for extracting keywords according to a seventh embodiment of the present application.

Fig. 8 is a schematic structural diagram of an apparatus for extracting keywords according to an eighth embodiment of the present application.

Fig. 9 is a schematic structural diagram of an apparatus for extracting keywords according to a ninth embodiment of the present application.

Fig. 10 is a schematic structural diagram of an apparatus for extracting keywords according to a tenth embodiment of the present application.

Fig. 11 is a schematic structural diagram of an apparatus for extracting keywords according to an eleventh embodiment of the present application.

Fig. 12 is a schematic structural diagram of an apparatus for extracting keywords according to a twelfth embodiment of the present application.

Fig. 13 is a schematic structural diagram of a keyword extraction system according to a thirteenth embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

Referring to fig. 1, the method for extracting a keyword provided in this embodiment includes:

step S11, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;

step S12, fusing a preset number of results according to a fusion mode to obtain a fusion result;

step S13, outputting a fusion result;

step S14, evaluating the fusion result and updating the fusion mode;

Because a preset number of results are extracted from the text corpus according to a preset number of keyword extraction models, the results are fused according to the fusion mode and output, and the fusion result is evaluated to update the fusion mode, the respective advantages of the multiple keyword extraction models can be brought into play for irregular information content such as colloquial expressions. Keyword extraction is therefore more accurate, and an ideal result is easier to extract.
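Steps S11 to S14 can be illustrated with a minimal Python sketch. The embodiment does not specify how the fusion result is evaluated or how the fusion mode is updated, so the F1 score against a reference keyword set and the "switch to the best-scoring mode" rule below are assumptions made purely for illustration; the names `extract_and_fuse`, `f1_score`, `extractors`, `fusion_modes` and `reference` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Extractor = Callable[[str], Set[str]]
FusionMode = Callable[[List[Set[str]]], Set[str]]


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """Assumed evaluation metric: F1 of predicted keywords against a reference set."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def extract_and_fuse(
    corpus: str,
    extractors: List[Extractor],
    fusion_modes: Dict[str, FusionMode],
    current_mode: str,
    reference: Set[str],
) -> Tuple[Set[str], str]:
    """Run S11 to S14 once; return the fusion result and the mode to use next time."""
    if len(extractors) < 2:
        raise ValueError("the preset number of models must be at least 2")
    # S11: one result per keyword extraction model (one-to-one correspondence).
    results = [extract(corpus) for extract in extractors]
    # S12: fuse the results according to the current fusion mode.
    fused = fusion_modes[current_mode](results)
    # S14 (assumed rule): score every available fusion mode on the same results
    # and select the best-scoring one as the updated fusion mode.
    scores = {name: f1_score(mode(results), reference)
              for name, mode in fusion_modes.items()}
    next_mode = max(scores, key=scores.get)
    # S13: output the fusion result (returned to the caller here).
    return fused, next_mode
```

In this sketch the fusion mode is just a named function, so updating it amounts to choosing a different key of `fusion_modes` for the next run.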

It should be noted that the keyword extraction models are divided into keyword extraction models that depend on word segmentation and keyword extraction models that do not depend on word segmentation, the latter being the keyword extraction models other than those depending on word segmentation.

Referring to fig. 2, the method for extracting keywords according to the present embodiment includes:

step S211, performing word segmentation on the text corpus to obtain a word segmentation corpus;

step S212, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation;

step S22, fusing a preset number of results according to a fusion mode to obtain a fusion result;

step S23, outputting a fusion result;

and step S24, evaluating the fusion result and updating the fusion mode.

It should be noted that both the keyword extraction models depending on word segmentation and those independent of word segmentation may include a keyword extraction model based on word frequency information and a keyword extraction model based on association information. Moreover, the extraction steps of these two kinds of model differ between the segmentation-dependent case and the segmentation-independent case.

For the keyword extraction model based on word frequency information, taking TF-IDF (Term Frequency-Inverse Document Frequency) as an example, the procedure can be summarized in the following steps:

step 1, calculating the word frequency (TF), wherein the specific formula is as follows:

TF = (number of times the word appears in an article) / (total number of words in the article);

step 2, calculating the inverse document frequency (IDF), wherein the specific formula is as follows:

IDF = log(total number of documents in the corpus / (number of documents containing the word + 1));

step 3, calculating TF-IDF, wherein the specific formula is as follows:

TF-IDF = TF × IDF.
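The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes each document has already been split into a list of words and that the logarithm is the natural logarithm (the text does not state the base); the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF-IDF} mapping per document."""
    n_docs = len(documents)
    # Number of documents that contain each word.
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Step 1: TF = occurrences of the word / total words in the document.
            # Step 2: IDF = log(total documents / (documents containing the word + 1)).
            # Step 3: TF-IDF = TF * IDF.
            word: (count / total) * math.log(n_docs / (doc_freq[word] + 1))
            for word, count in counts.items()
        })
    return scores
```

Note that with the "+ 1" in the denominator, a word that occurs in every document receives a slightly negative IDF, so such words rank below words that are concentrated in a few documents.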

For the keyword extraction model based on association information, the following takes the entropy and mutual information keyword extraction model as an example, and describes the segmentation-dependent type and the segmentation-independent type in turn.

Referring to fig. 3, the method of the present embodiment includes:

step S311, performing word segmentation on the text corpus to obtain a word segmentation corpus;

step S3121, respectively calculating entropy values and mutual information values of a plurality of words;

step S3122, obtaining an entropy threshold and a mutual information threshold according to the entropy values and the mutual information values of the plurality of words;

step S3123, extracting a plurality of words of which the entropy values are greater than an entropy threshold value and a plurality of words of which the mutual information values are greater than a mutual information threshold value as a preset number of results;

step S32, fusing a preset number of results according to a fusion mode to obtain a fusion result;

step S33, outputting a fusion result;

and step S34, evaluating the fusion result and updating the fusion mode.
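Steps S3121 to S3123 can be sketched in Python as below. The embodiment does not define how the entropy value or the mutual information value of a word is computed, nor how the thresholds are derived, so this sketch makes three loudly labelled assumptions: entropy is the Shannon entropy of the words that immediately follow a word, the mutual information value of a word is the highest pointwise mutual information it reaches with an adjacent word, and each threshold is the mean of the corresponding values. The input is assumed to be already segmented (step S311), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def neighbour_entropy(sentences: List[List[str]]) -> Dict[str, float]:
    """Assumed entropy: Shannon entropy (bits) of the words seen right after each word."""
    followers: Dict[str, Counter] = defaultdict(Counter)
    for words in sentences:
        for left, right in zip(words, words[1:]):
            followers[left][right] += 1
    entropy = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropy[word] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy


def best_pmi(sentences: List[List[str]]) -> Dict[str, float]:
    """Assumed mutual information: highest PMI of each word with an adjacent word."""
    unigrams = Counter(w for words in sentences for w in words)
    bigrams = Counter(pair for words in sentences for pair in zip(words, words[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores: Dict[str, float] = {}
    for (a, b), c in bigrams.items():
        pmi = math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for w in (a, b):
            scores[w] = max(scores.get(w, float("-inf")), pmi)
    return scores


def above_mean(scores: Dict[str, float]) -> Set[str]:
    """S3122 and S3123 with an assumed threshold: the mean of all values."""
    if not scores:
        return set()
    threshold = sum(scores.values()) / len(scores)
    return {w for w, v in scores.items() if v > threshold}


def entropy_and_mi_results(sentences: List[List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return the entropy result and the mutual information result as two sets."""
    return above_mean(neighbour_entropy(sentences)), above_mean(best_pmi(sentences))
```

The two returned sets are kept separate so that each can be passed to the fusion step as the result of one model. Words that only ever appear at the end of a sentence have no follower and therefore receive no entropy value in this sketch.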

Referring to fig. 4, the method of the present embodiment includes:

step S411, setting the maximum number of characters contained in the keywords to be extracted;

step S412, enumerating all character strings of the text corpus;

step S413, respectively calculating entropy values and mutual information values of all the character strings;

step S414, obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;

step S415, extracting a plurality of character strings with entropy values larger than an entropy threshold value and a plurality of character strings with mutual information values larger than a mutual information threshold value according to the maximum number of characters as a preset number of results;

step S42, fusing a preset number of results according to a fusion mode to obtain a fusion result;

step S43, outputting a fusion result;

and step S44, evaluating the fusion result and updating the fusion mode.
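For the segmentation-independent case, steps S411 to S415 can be sketched in Python as below. As in the word-based case, the embodiment does not define the entropy, the mutual information or the thresholds, so the sketch assumes a formulation commonly used for new-word discovery: the entropy of a string is the Shannon entropy of the character that follows it, its mutual information is the minimum pointwise mutual information over all ways of splitting it into two parts, probabilities are approximated as count divided by corpus length, and each threshold is the mean of the values. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Set, Tuple


def count_strings(corpus: str, max_chars: int) -> Counter:
    """S411 and S412: count every character string of 1..max_chars characters."""
    counts: Counter = Counter()
    for n in range(1, max_chars + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts


def right_entropy(corpus: str, max_chars: int) -> Dict[str, float]:
    """Assumed entropy: Shannon entropy (bits) of the character following each string."""
    followers: Dict[str, Counter] = defaultdict(Counter)
    for n in range(1, max_chars + 1):
        for i in range(len(corpus) - n):
            followers[corpus[i:i + n]][corpus[i + n]] += 1
    result = {}
    for s, nxt in followers.items():
        total = sum(nxt.values())
        result[s] = -sum((c / total) * math.log2(c / total) for c in nxt.values())
    return result


def internal_pmi(counts: Counter, corpus_len: int) -> Dict[str, float]:
    """Assumed mutual information: minimum PMI over every two-part split of a string."""
    scores = {}
    for s, c in counts.items():
        if len(s) < 2:
            continue  # a single character cannot be split
        scores[s] = min(
            math.log2((c / corpus_len)
                      / ((counts[s[:k]] / corpus_len) * (counts[s[k:]] / corpus_len)))
            for k in range(1, len(s))
        )
    return scores


def above_mean(scores: Dict[str, float]) -> Set[str]:
    """S414 and S415 with an assumed threshold: the mean of all values."""
    if not scores:
        return set()
    threshold = sum(scores.values()) / len(scores)
    return {s for s, v in scores.items() if v > threshold}


def extract_strings(corpus: str, max_chars: int) -> Tuple[Set[str], Set[str]]:
    """Return the entropy result and the mutual information result as two sets."""
    counts = count_strings(corpus, max_chars)
    return (above_mean(right_entropy(corpus, max_chars)),
            above_mean(internal_pmi(counts, len(corpus))))
```

Because only strings of at most `max_chars` characters are enumerated, the maximum keyword length set in step S411 bounds every candidate. The sketch treats the corpus as one string and does not strip punctuation, which a real implementation would probably need to handle.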

In addition, the fusion method in the above embodiments may be various, and two fusion methods are described in detail below.

Referring to fig. 5, the fusion method of the present embodiment includes:

step S51, obtaining an intersection of the results extracted from the word segmentation corpus by the keyword extraction models depending on word segmentation, to obtain a word segmentation result;

step S52, obtaining an intersection of the results extracted from the text corpus by the keyword extraction models other than those depending on word segmentation, to obtain a non-word-segmentation result;

and step S53, taking a union of the word segmentation result and the non-word-segmentation result to obtain a fusion result.
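Steps S51 to S53 map directly onto set operations. The following Python sketch assumes each model's result is a set of keyword strings; the names `intersect_all` and `fuse_by_group` are hypothetical.

```python
from typing import List, Set


def intersect_all(results: List[Set[str]]) -> Set[str]:
    """Intersection of all result sets; empty if there are no results."""
    if not results:
        return set()
    return set(results[0]).intersection(*results[1:])


def fuse_by_group(seg_results: List[Set[str]], nonseg_results: List[Set[str]]) -> Set[str]:
    # S51: keywords agreed on by every segmentation-dependent model.
    seg = intersect_all(seg_results)
    # S52: keywords agreed on by every segmentation-independent model.
    nonseg = intersect_all(nonseg_results)
    # S53: union of the two group-level results.
    return seg | nonseg
```

With this mode a keyword survives only if every model within its own group extracted it, while the two groups are allowed to contribute different keywords.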

Referring to fig. 6, the fusion method of the present embodiment includes:

step S61, calculating intersection of every two results with preset number to obtain multiple intersection results;

and step S62, obtaining a fusion result by taking a union of a plurality of intersection results.
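Steps S61 and S62 can be sketched in Python as follows, again assuming each result is a set of keyword strings; the name `fuse_pairwise` is hypothetical.

```python
from itertools import combinations
from typing import List, Set


def fuse_pairwise(results: List[Set[str]]) -> Set[str]:
    fused: Set[str] = set()
    # S61: intersect every pair of results.
    for first, second in combinations(results, 2):
        # S62: accumulate the union of all pairwise intersections.
        fused |= first & second
    return fused
```

The outcome is equivalent to keeping every keyword that at least two models extracted, regardless of whether those models depend on word segmentation.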

Referring to fig. 7, the apparatus of the present embodiment includes an extraction module, a fusion module, an output module and an evaluation module.

Specifically, the following is set forth for each module:

the extraction module is used for extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;

the fusion module is used for fusing a preset number of results according to a fusion mode to obtain a fusion result;

the output module is used for outputting a fusion result;

and the evaluation module is used for evaluating the fusion result and updating the fusion mode.

Further, referring to fig. 8, when the keyword extraction model is a segmentation-dependent keyword extraction model, the extraction module includes:

the first extraction unit is used for extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation.

Referring to fig. 9, when the keyword extraction model depending on word segmentation is an entropy and mutual information model, the first extraction unit includes:

the calculation subunit is used for respectively calculating entropy values and mutual information values of the plurality of words;

the threshold value determining subunit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the multiple words;

and the extraction subunit is used for extracting a plurality of words of which the entropy values are greater than the entropy threshold value and a plurality of words of which the mutual information values are greater than the mutual information threshold value as a preset number of results.

Referring to fig. 10, when a keyword extraction model other than those depending on word segmentation is an entropy and mutual information model, the extraction module includes:

the calculation unit is used for respectively calculating entropy values and mutual information values of all the character strings;

the threshold value determining unit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;

and the second extraction unit is used for extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value as a preset number of results according to the maximum number of the characters.

There may be multiple fusion modes. Referring to fig. 11, when one of the fusion modes is adopted, the fusion module may include:

the first intersection unit is used for obtaining an intersection of the results extracted from the word segmentation corpus by the keyword extraction models depending on word segmentation, to obtain a word segmentation result;

the second intersection unit is used for obtaining an intersection of the results extracted from the text corpus by the keyword extraction models other than those depending on word segmentation, to obtain a non-word-segmentation result;

and the first merging unit is used for taking a union of the word segmentation result and the non-word-segmentation result to obtain a fusion result.

Referring to fig. 12, when another fusion method is adopted, the fusion module may include:

the third intersection unit is used for solving the intersection of every two results with preset numbers to obtain a plurality of intersection results;

and the second union unit is used for obtaining a union of the intersection results to obtain a fusion result.

Referring to fig. 13, the system of the present embodiment includes: a processor, and a memory coupled to the processor;

the memory is used for storing a computer program, the computer program is at least used for executing the following keyword extraction method:

extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;

fusing a preset number of results according to a fusion mode to obtain a fusion result;

outputting a fusion result;

and evaluating the fusion result and updating the fusion mode.

Further, the preset number of keyword extraction models comprises keyword extraction models depending on word segmentation;

when the keyword extraction model is a keyword extraction model depending on word segmentation, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:

and extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation.

Further, the word segmentation corpus comprises a plurality of words; the keyword extraction model depending on word segmentation comprises a model based on vocabulary association information and/or a model based on word frequency information, and the model based on vocabulary association information comprises an entropy and mutual information model.

When the keyword extraction model depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation includes:

respectively calculating entropy values and mutual information values of a plurality of words;

and extracting a plurality of words of which the entropy values are larger than the entropy threshold value and a plurality of words of which the mutual information values are larger than the mutual information threshold value as a preset number of results.

Further, among the preset number of keyword extraction models, the keyword extraction models other than those depending on word segmentation include a segmentation-independent model based on vocabulary association information and/or a segmentation-independent model based on word frequency information, and the segmentation-independent model based on vocabulary association information includes an entropy and mutual information model.

When a keyword extraction model other than those depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:

enumerating all character strings of the text corpus;

respectively calculating entropy values and mutual information values of all character strings;

and extracting a plurality of character strings of which the entropy values are greater than an entropy threshold value and a plurality of character strings of which the mutual information values are greater than a mutual information threshold value as a preset number of results according to the maximum number of the characters.
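As a non-limiting illustration, the segmentation-independent steps above may be sketched as follows. The sketch rests on several assumptions that are not stated in the application: "maximum number of characters" is read as an upper bound `max_len` on the length of the enumerated character strings; the entropy value is the smaller of the left and right neighboring-character entropies; the mutual information value is the lowest pointwise mutual information over all two-part splits of the string; and single characters are not returned as keywords. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _entropy(counter):
    """Shannon entropy, in bits, of a Counter of neighboring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_strings(texts, max_len):
    """Enumerate substrings of 1..max_len characters and score each one."""
    counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                s = text[i:j]
                counts[s] += 1
                if i > 0:
                    left[s][text[i - 1]] += 1
                if j < len(text):
                    right[s][text[j]] += 1
    total_chars = sum(len(t) for t in texts)
    entropy = {s: min(_entropy(left[s]), _entropy(right[s])) for s in counts}
    mutual_info = {}
    for s, c in counts.items():
        if len(s) < 2:
            continue  # a single character cannot be split into two parts
        p = c / total_chars
        mutual_info[s] = min(
            math.log2(p / ((counts[s[:k]] / total_chars) * (counts[s[k:]] / total_chars)))
            for k in range(1, len(s))
        )
    return entropy, mutual_info


def extract_strings(texts, max_len, entropy_threshold, mi_threshold):
    """Keep multi-character strings above either threshold (union, by assumption)."""
    entropy, mutual_info = score_strings(texts, max_len)
    by_entropy = {s for s, h in entropy.items() if len(s) > 1 and h > entropy_threshold}
    by_mi = {s for s, m in mutual_info.items() if m > mi_threshold}
    return by_entropy | by_mi
```

Because every substring is enumerated, the cost grows with both the corpus size and `max_len`, which is one plausible reason to bound the string length.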

Further, fusing the preset number of results according to feedback to obtain a fusion result comprises:

taking an intersection of the results extracted from the segmented corpus by the plurality of segmentation-dependent keyword extraction models to obtain a segmentation result;

and taking a union of the segmentation result and the non-segmentation result to obtain the fusion result.
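As a non-limiting illustration, this fusion mode may be sketched as follows. The sketch assumes that each model's result is a set of keyword strings and that the non-segmentation result has already been obtained as a single set; `fuse_segmented_then_union` is a hypothetical name.

```python
def fuse_segmented_then_union(segmented_results, non_segmented_result):
    """Intersect the segmentation-dependent results, then add the other result.

    segmented_results: list of keyword sets, one per segmentation-dependent model.
    non_segmented_result: keyword set from the segmentation-independent side
    (assumed to be computed elsewhere).
    """
    if segmented_results:
        segmented = set.intersection(*map(set, segmented_results))
    else:
        segmented = set()
    return segmented | set(non_segmented_result)
```

The intersection keeps only keywords on which all segmentation-dependent models agree, while the union retains strings that word segmentation may have split incorrectly.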

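Further, fusing the preset number of results according to feedback to obtain a fusion result comprises:

taking an intersection of every two of the preset number of results to obtain a plurality of intersection results;

and taking a union of the intersection results to obtain the fusion result.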
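As a non-limiting illustration, the pairwise fusion mode may be sketched as follows, assuming each model's result is a set of keyword strings; `fuse_pairwise` is a hypothetical name.

```python
from itertools import combinations


def fuse_pairwise(results):
    """Union of the intersections of every pair of results."""
    fused = set()
    for a, b in combinations(results, 2):
        fused |= set(a) & set(b)
    return fused
```

This is equivalent to keeping every keyword that at least two of the models extracted, which is less strict than a full intersection and more selective than a plain union.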

The processor is used to call and execute the computer program in the memory.

In addition, the present application further provides a storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the following steps of the keyword extraction method are implemented:

outputting a fusion result;

and evaluating the fusion result and updating the fusion mode.

Optionally, the preset number of keyword extraction models includes a keyword extraction model depending on word segmentation;

when the keyword extraction model is a segmentation-dependent keyword extraction model, extracting the preset number of results from the text corpus according to the preset number of keyword extraction models includes:

Optionally, the segmented corpus includes a plurality of words; the segmentation-dependent keyword extraction model includes a model based on word association information and/or a model based on word frequency information, and the model based on word association information includes an entropy and mutual information model.

Optionally, the keyword extraction models among the preset number of keyword extraction models other than the segmentation-dependent keyword extraction models include a segmentation-independent model based on word association information and/or a segmentation-independent model based on word frequency information, and the segmentation-independent model based on word association information includes an entropy and mutual information model.

enumerating all character strings of the text corpus;

Optionally, fusing the preset number of results according to the feedback to obtain a fusion result includes:

Optionally, fusing the preset number of results according to the feedback to obtain a fusion result includes: taking an intersection of every two of the preset number of results to obtain a plurality of intersection results; and taking a union of the intersection results to obtain the fusion result.

With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, "a plurality" means at least two unless otherwise specified.

Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application pertain.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A keyword extraction method, characterized by comprising the following steps:

extracting a preset number of results from a text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2; the keyword extraction models are divided into segmentation-dependent keyword extraction models and segmentation-independent keyword extraction models;

fusing the preset number of results according to a fusion mode to obtain a fusion result; the fusion mode comprises an intersection operation and a union operation;

outputting the fusion result;

evaluating the fusion result and updating the fusion mode;

when the keyword extraction model is the segmentation-dependent keyword extraction model, extracting the preset number of results from the text corpus according to the preset number of keyword extraction models comprises:

extracting the preset number of results from a segmented corpus according to the segmentation-dependent keyword extraction model;

wherein the segmented corpus comprises a plurality of words; when the segmentation-dependent keyword extraction model is an entropy and mutual information model, extracting the preset number of results from the segmented corpus according to the segmentation-dependent keyword extraction model comprises:

extracting, as the preset number of results, the words whose entropy values are greater than an entropy threshold and the words whose mutual information values are greater than a mutual information threshold;

wherein fusing the preset number of results according to feedback to obtain the fusion result comprises:

2. The method according to claim 1, wherein the segmentation-dependent keyword extraction model comprises a segmentation-dependent model based on word association information and/or a segmentation-dependent model based on word frequency information, and the segmentation-dependent model based on word association information comprises an entropy and mutual information model.

3. The method of claim 1, wherein the keyword extraction models among the preset number of keyword extraction models other than the segmentation-dependent keyword extraction models comprise a segmentation-independent model based on word association information and/or a segmentation-independent model based on word frequency information, the segmentation-independent model based on word association information comprising an entropy and mutual information model;

when a keyword extraction model other than the segmentation-dependent keyword extraction models is the entropy and mutual information model, extracting the preset number of results from the text corpus according to the preset number of keyword extraction models comprises:

enumerating all character strings of the text corpus;

4. The method according to claim 1, wherein the fusing the preset number of results according to feedback to obtain a fused result comprises:

5. A keyword extraction device, characterized by comprising:

an extraction module, configured to extract a preset number of results from a text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;

an output module, configured to output the fusion result;

an evaluation module, configured to evaluate the fusion result and update the fusion mode;

wherein the preset number of keyword extraction models comprise segmentation-dependent keyword extraction models;

when the keyword extraction model is the segmentation-dependent keyword extraction model, the extraction module comprises:

a first extraction unit, configured to extract the preset number of results from a segmented corpus according to the segmentation-dependent keyword extraction model;

wherein the segmented corpus comprises a plurality of words; when the segmentation-dependent keyword extraction model is an entropy and mutual information model, the first extraction unit comprises:

an extraction subunit, configured to extract, as the preset number of results, the words whose entropy values are greater than an entropy threshold and the words whose mutual information values are greater than a mutual information threshold;

wherein the fusion module comprises:

6. The apparatus of claim 5, wherein the segmentation-dependent keyword extraction model comprises a segmentation-dependent model based on word association information and/or a segmentation-dependent model based on word frequency information, and the segmentation-dependent model based on word association information comprises an entropy and mutual information model.

7. The apparatus of claim 5, wherein the keyword extraction models among the preset number of keyword extraction models other than the segmentation-dependent keyword extraction models comprise a segmentation-independent model based on word association information and/or a segmentation-independent model based on word frequency information, the segmentation-independent model based on word association information comprising an entropy and mutual information model;

when a keyword extraction model other than the segmentation-dependent keyword extraction models is the entropy and mutual information model, the extraction module comprises:

8. The apparatus of claim 5, wherein the fusion module comprises:

9. A keyword extraction system, comprising:

a processor, and a memory coupled to the processor;

the memory is configured to store a computer program, the computer program being used at least to execute the keyword extraction method according to any one of claims 1 to 4;

10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the keyword extraction method according to any one of claims 1 to 4.