CN109241525B - Keyword extraction method, device and system

Publication number
CN109241525B
Authority
CN
China
Prior art keywords
keyword extraction
preset number
results
entropy
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810953403.6A
Other languages
Chinese (zh)
Other versions
CN109241525A (en)
Inventor
马凯
刘云峰
徐易楠
吴悦
陈正钦
杨振宇
胡晓
汶林丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN201810953403.6A (CN109241525B)
Publication of CN109241525A
Priority to PCT/CN2019/100322 (WO2020038253A1)
Application granted
Publication of CN109241525B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a keyword extraction method, device and system. The method comprises the following steps: extracting a preset number of results from a text corpus according to a preset number of keyword extraction models; fusing the preset number of results according to a fusion mode to obtain a fusion result; outputting the fusion result; and evaluating the fusion result and updating the fusion mode. For irregular content such as spoken-language expressions, the strengths of the individual keyword extraction models can each be brought into play, so that keyword extraction is more accurate and the desired results are easier to obtain.

Description

Keyword extraction method, device and system
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, and a system for extracting a keyword.
Background
With the arrival of the internet information era, people rely more and more on the internet in daily life and work. For example, the way goods are purchased has changed greatly with the development of the internet, and most consumers have shopped online. Because the actual goods cannot be handled when shopping online, consumers talk to customer service in order to find goods that suit them. However, different consumers phrase the same question in different ways, and customer service staff must read every question to work out what the consumer is actually asking. This places a heavy burden on customer service, and keyword extraction technology has been proposed to address the problem.
At present, keywords are mainly extracted by two kinds of methods: rule-based methods and methods based on statistical information. Because Chinese usage scenarios are complex and new words emerge constantly in the network era, rule-based methods are difficult to apply to all situations and perform poorly in practice. Methods based on statistical information are currently popular: a large text corpus is analyzed statistically to obtain its statistical features, and keywords are extracted automatically from those features.
However, in the related art, most methods based on statistical information use only a single machine learning model to extract keywords. Because much of today's content consists of spoken-language expressions that follow no uniform specification, a single-model technique has difficulty recognizing text corpora that express the same meaning in different spoken forms, and therefore has difficulty extracting the desired results accurately.
Disclosure of Invention
In order to overcome the problems in the related art at least to a certain extent, the application provides a keyword extraction method, device and system.
According to a first aspect of an embodiment of the present application, there is provided a keyword extraction method, including:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
fusing the preset number of results according to a fusion mode to obtain a fusion result;
outputting the fusion result;
and evaluating the fusion result and updating the fusion mode.
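The four steps above can be sketched as a small loop. This is a minimal illustration, not the patented implementation: the names `run_round`, `fuse_union`, `fuse_intersection` and `f1_score` are hypothetical, the two toy extractors stand in for real keyword extraction models, and the use of an F1 score against a hand-labelled reference set to evaluate the fusion result is an assumption (the text does not say how the evaluation is performed).

```python
from typing import Callable, Dict, List, Set, Tuple

Extractor = Callable[[str], Set[str]]
Fusion = Callable[[List[Set[str]]], Set[str]]


def fuse_union(results: List[Set[str]]) -> Set[str]:
    """Hypothetical fusion mode: keep every candidate from any model."""
    return set().union(*results)


def fuse_intersection(results: List[Set[str]]) -> Set[str]:
    """Hypothetical fusion mode: keep only candidates every model agrees on."""
    return set.intersection(*results)


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """Assumed evaluation metric: F1 against a hand-labelled reference set."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def run_round(
    corpus: str,
    extractors: List[Extractor],
    fusion_modes: Dict[str, Fusion],
    current_mode: str,
    reference: Set[str],
) -> Tuple[Set[str], str]:
    """Extract, fuse, output, then evaluate and pick the next fusion mode."""
    if len(extractors) < 2:
        raise ValueError("the preset number of models must be at least 2")
    # Step 1: one result set per model (one-to-one correspondence).
    results = [extract(corpus) for extract in extractors]
    # Steps 2 and 3: fuse with the current mode; the fused set is the output.
    fused = fusion_modes[current_mode](results)
    # Step 4 (assumption): score each available mode and keep the best one.
    scores = {name: f1_score(mode(results), reference)
              for name, mode in fusion_modes.items()}
    next_mode = max(scores, key=scores.get)
    return fused, next_mode


# Toy stand-ins for two keyword extraction models.
extractors = [
    lambda text: {w for w in text.split() if len(w) > 4},
    lambda text: {w for w in text.split() if w.endswith("d")},
]
modes = {"union": fuse_union, "intersection": fuse_intersection}
fused, next_mode = run_round(
    "refund my order and refund shipping", extractors, modes, "union", {"refund"}
)
```

In this toy run the union mode returns four candidates while only "refund" is in the reference set, so the evaluation step selects the intersection mode for the next round.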
Optionally, the preset number of keyword extraction models includes a word-segmentation-dependent keyword extraction model;
when a keyword extraction model is a word-segmentation-dependent keyword extraction model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word-segmented corpus;
and extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model.
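As a rough illustration of the two steps above, the sketch below segments the text first and only then runs an extractor on the segmented words. Both pieces are assumptions made for the example: `segment` is a whitespace splitter standing in for a real Chinese word segmenter (the text names none), and `frequency_keywords` is a hypothetical count-based extractor chosen only because it is easy to follow.

```python
from collections import Counter
from typing import List, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmenter: lower-cases and splits on whitespace.

    Assumption: real Chinese text would need a dedicated segmenter here.
    """
    return text.lower().split()


def frequency_keywords(words: List[str], min_count: int = 2) -> Set[str]:
    """Hypothetical frequency model: keep words seen at least min_count times."""
    counts = Counter(words)
    return {word for word, count in counts.items() if count >= min_count}


segmented_corpus = segment("Refund my order please refund the order")
keywords = frequency_keywords(segmented_corpus)
```

The extractor never sees the raw text, only the segmented words, which is what makes it dependent on word segmentation.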
Optionally, the word-segmented corpus includes a plurality of words; the word-segmentation-dependent keyword extraction model includes a word-segmentation-dependent model based on word association information and/or a word-segmentation-dependent model based on word frequency information, and the word-segmentation-dependent model based on word association information includes an entropy and mutual information model.
When the word-segmentation-dependent keyword extraction model is the entropy and mutual information model, the extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model includes:
calculating an entropy value and a mutual information value for each of the plurality of words;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of the plurality of words;
and extracting, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.
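The text above does not define how the entropy value, the mutual information value or the thresholds are computed, so the following sketch fixes one plausible reading purely for illustration. Assumptions: the entropy of a word is the Shannon entropy of the words that immediately follow it, the mutual information of a word is its highest pointwise mutual information with an adjacent word (floored at 0), and each threshold is the mean of the corresponding values. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def entropy_and_mi(
    sentences: List[List[str]],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Score every word of a segmented corpus (assumed definitions, see above)."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    followers: Dict[str, Counter] = defaultdict(Counter)
    for words in sentences:
        unigrams.update(words)
        for left, right in zip(words, words[1:]):
            bigrams[(left, right)] += 1
            followers[left][right] += 1
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    entropy: Dict[str, float] = {}
    for word in unigrams:
        counts = list(followers[word].values())
        n = sum(counts)
        # Shannon entropy (bits) of the right-neighbour distribution.
        entropy[word] = sum(c / n * math.log2(n / c) for c in counts) if n else 0.0

    mutual_info: Dict[str, float] = {word: 0.0 for word in unigrams}
    for (left, right), count in bigrams.items():
        p_pair = count / total_pairs
        p_left = unigrams[left] / total_words
        p_right = unigrams[right] / total_words
        pmi = math.log2(p_pair / (p_left * p_right))
        # Keep the strongest association seen for each word.
        mutual_info[left] = max(mutual_info[left], pmi)
        mutual_info[right] = max(mutual_info[right], pmi)
    return entropy, mutual_info


def extract_words(sentences: List[List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (words above the entropy threshold, words above the MI threshold)."""
    entropy, mutual_info = entropy_and_mi(sentences)
    if not entropy:
        return set(), set()
    entropy_threshold = sum(entropy.values()) / len(entropy)          # assumed: mean
    mi_threshold = sum(mutual_info.values()) / len(mutual_info)       # assumed: mean
    high_entropy = {w for w, v in entropy.items() if v > entropy_threshold}
    high_mi = {w for w, v in mutual_info.items() if v > mi_threshold}
    return high_entropy, high_mi


corpus = [["a", "b"], ["a", "c"], ["d", "b"]]
word_entropy, word_mi = entropy_and_mi(corpus)
high_entropy, high_mi = extract_words(corpus)
```

The two selected sets are returned separately because the text does not state whether they are reported as separate results or combined.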
Optionally, the keyword extraction models other than the word-segmentation-dependent keyword extraction models among the preset number of keyword extraction models include a word-segmentation-independent model based on word association information and/or a word-segmentation-independent model based on word frequency information, and the word-segmentation-independent model based on word association information includes an entropy and mutual information model.
When a keyword extraction model other than the word-segmentation-dependent keyword extraction models is the entropy and mutual information model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
setting the maximum number of characters that a keyword to be extracted may contain;
enumerating all character strings of the text corpus;
calculating an entropy value and a mutual information value for each of the character strings;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of all the character strings;
and extracting, as the preset number of results and subject to the maximum number of characters, the character strings whose entropy values are greater than the entropy threshold and the character strings whose mutual information values are greater than the mutual information threshold.
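The segmentation-free variant above can be sketched by enumerating every substring up to the maximum length and scoring it directly. As before, the scoring formulas are assumptions, since the text does not give them: entropy is taken as the smaller of the left-neighbour and right-neighbour character entropies, mutual information as the lowest pointwise mutual information over all ways of splitting the string in two, probabilities are approximated by dividing counts by the total character count, and thresholds are means. Single characters are skipped because they have no split point. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def shannon(neighbours: Counter) -> float:
    """Shannon entropy (bits) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in neighbours.values())


def score_strings(
    lines: List[str], max_chars: int
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Enumerate substrings of 1..max_chars characters and score those of length >= 2."""
    counts: Counter = Counter()
    left: Dict[str, Counter] = defaultdict(Counter)
    right: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        for start in range(len(line)):
            for end in range(start + 1, min(start + max_chars, len(line)) + 1):
                text = line[start:end]
                counts[text] += 1
                if start > 0:
                    left[text][line[start - 1]] += 1
                if end < len(line):
                    right[text][line[end]] += 1
    total = sum(len(line) for line in lines)

    entropy: Dict[str, float] = {}
    mutual_info: Dict[str, float] = {}
    for text, count in counts.items():
        if len(text) < 2:
            continue  # a single character has no internal split point
        entropy[text] = min(shannon(left[text]), shannon(right[text]))
        # Weakest association over every two-part split of the string.
        mutual_info[text] = min(
            math.log2(
                (count / total)
                / ((counts[text[:i]] / total) * (counts[text[i:]] / total))
            )
            for i in range(1, len(text))
        )
    return entropy, mutual_info


def extract_strings(lines: List[str], max_chars: int) -> Tuple[Set[str], Set[str]]:
    """Return (strings above the entropy threshold, strings above the MI threshold)."""
    entropy, mutual_info = score_strings(lines, max_chars)
    if not entropy:
        return set(), set()
    entropy_threshold = sum(entropy.values()) / len(entropy)        # assumed: mean
    mi_threshold = sum(mutual_info.values()) / len(mutual_info)     # assumed: mean
    high_entropy = {s for s, v in entropy.items() if v > entropy_threshold}
    high_mi = {s for s, v in mutual_info.items() if v > mi_threshold}
    return high_entropy, high_mi


string_entropy, string_mi = score_strings(["abcab"], 2)
candidates = extract_strings(["abcab"], 2)
```

Enumerating every substring grows quickly with corpus size and maximum length, which is one practical reason for capping the number of characters before scoring.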
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
taking the intersection of the results extracted from the word-segmented corpus by the word-segmentation-dependent keyword extraction models to obtain a word segmentation result;
taking the intersection of the results extracted from the text corpus by the keyword extraction models other than the word-segmentation-dependent keyword extraction models to obtain a non-word-segmentation result;
and merging the word segmentation result and the non-word-segmentation result to obtain the fusion result.
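This fusion mode can be sketched with plain set operations. The function names are hypothetical, and reading "merging" as a set union is an assumption made for the example.

```python
from typing import List, Set


def intersect_all(results: List[Set[str]]) -> Set[str]:
    """Intersection of every result set; an empty group yields an empty set."""
    return set.intersection(*results) if results else set()


def fuse_by_group(
    segmented_results: List[Set[str]], unsegmented_results: List[Set[str]]
) -> Set[str]:
    """Intersect within each group of models, then merge the two group results."""
    word_segmentation_result = intersect_all(segmented_results)
    non_word_segmentation_result = intersect_all(unsegmented_results)
    # Assumption: "merging" is taken to mean a set union.
    return word_segmentation_result | non_word_segmentation_result


group_fused = fuse_by_group(
    [{"refund", "order"}, {"refund", "shipping"}],
    [{"return policy", "refund"}, {"return policy"}],
)
```

A candidate survives if every model within its own group agrees on it, so agreement is required inside a group but not across the two groups.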
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
calculating the intersection of every two of the preset number of results to obtain a plurality of intersection results;
and taking the union of the plurality of intersection results to obtain the fusion result.
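The pairwise variant can be sketched in a few lines; `fuse_pairwise` is a hypothetical name. Taking the union of all pairwise intersections is equivalent to keeping every candidate that at least two models proposed, which the second helper below checks by counting.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def fuse_pairwise(results: List[Set[str]]) -> Set[str]:
    """Union of the intersections of every pair of result sets."""
    fused: Set[str] = set()
    for first, second in combinations(results, 2):
        fused |= first & second
    return fused


def proposed_by_at_least_two(results: List[Set[str]]) -> Set[str]:
    """Equivalent formulation: candidates that appear in two or more result sets."""
    votes = Counter(candidate for result in results for candidate in result)
    return {candidate for candidate, count in votes.items() if count >= 2}


model_results = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
pairwise_fused = fuse_pairwise(model_results)
```

Compared with intersecting all results at once, this mode is more tolerant: a keyword missed by one model is still kept as long as any two other models agree on it.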
According to a second aspect of the embodiments of the present application, there is provided an apparatus for extracting a keyword, including:
the extraction module is used for extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
the fusion module is used for fusing the preset number of results according to a fusion mode to obtain a fusion result;
the output module is used for outputting the fusion result;
and the evaluation module is used for evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models includes a word-segmentation-dependent keyword extraction model;
when a keyword extraction model is a word-segmentation-dependent keyword extraction model, the extraction module comprises:
a word segmentation unit, configured to perform word segmentation on the text corpus to obtain a word-segmented corpus;
and a first extraction unit, configured to extract the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model.
Optionally, the word-segmented corpus includes a plurality of words; the word-segmentation-dependent keyword extraction model includes a word-segmentation-dependent model based on word association information and/or a word-segmentation-dependent model based on word frequency information, and the word-segmentation-dependent model based on word association information includes an entropy and mutual information model.
When the word-segmentation-dependent keyword extraction model is the entropy and mutual information model, the first extraction unit includes:
a calculating subunit, configured to calculate an entropy value and a mutual information value for each of the plurality of words;
a threshold determining subunit, configured to obtain an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of the plurality of words;
and an extracting subunit, configured to extract, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.
Optionally, the keyword extraction models other than the word-segmentation-dependent keyword extraction models among the preset number of keyword extraction models include a word-segmentation-independent model based on word association information and/or a word-segmentation-independent model based on word frequency information, and the word-segmentation-independent model based on word association information includes an entropy and mutual information model.
When a keyword extraction model other than the word-segmentation-dependent keyword extraction models is the entropy and mutual information model, the extraction module includes:
a setting unit, configured to set the maximum number of characters that a keyword to be extracted may contain;
an enumeration unit, configured to enumerate all character strings of the text corpus;
a calculating unit, configured to calculate an entropy value and a mutual information value for each of the character strings;
a threshold determining unit, configured to obtain an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of all the character strings;
and a second extraction unit, configured to extract, as the preset number of results and subject to the maximum number of characters, the character strings whose entropy values are greater than the entropy threshold and the character strings whose mutual information values are greater than the mutual information threshold.
Optionally, the fusion module includes:
a first intersection unit, configured to take the intersection of the results extracted from the word-segmented corpus by the word-segmentation-dependent keyword extraction models to obtain a word segmentation result;
a second intersection unit, configured to take the intersection of the results extracted from the text corpus by the keyword extraction models other than the word-segmentation-dependent keyword extraction models to obtain a non-word-segmentation result;
and a first union unit, configured to merge the word segmentation result and the non-word-segmentation result to obtain the fusion result.
Optionally, the fusion module includes:
a third intersection unit, configured to calculate the intersection of every two of the preset number of results to obtain a plurality of intersection results;
and a second union unit, configured to take the union of the plurality of intersection results to obtain the fusion result.
According to a third aspect of embodiments herein, there is provided a non-transitory computer readable storage medium having instructions thereon, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a keyword extraction method, the method including:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
fusing the preset number of results according to a fusion mode to obtain a fusion result;
outputting the fusion result;
and evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models includes a word-segmentation-dependent keyword extraction model;
when a keyword extraction model is a word-segmentation-dependent keyword extraction model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word-segmented corpus;
and extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model.
Optionally, the word-segmented corpus includes a plurality of words; the word-segmentation-dependent keyword extraction model includes a word-segmentation-dependent model based on word association information and/or a word-segmentation-dependent model based on word frequency information, and the word-segmentation-dependent model based on word association information includes an entropy and mutual information model.
When the word-segmentation-dependent keyword extraction model is the entropy and mutual information model, the extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model includes:
calculating an entropy value and a mutual information value for each of the plurality of words;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of the plurality of words;
and extracting, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.
Optionally, the keyword extraction models other than the word-segmentation-dependent keyword extraction models among the preset number of keyword extraction models include a word-segmentation-independent model based on word association information and/or a word-segmentation-independent model based on word frequency information, and the word-segmentation-independent model based on word association information includes an entropy and mutual information model.
When a keyword extraction model other than the word-segmentation-dependent keyword extraction models is the entropy and mutual information model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
setting the maximum number of characters that a keyword to be extracted may contain;
enumerating all character strings of the text corpus;
calculating an entropy value and a mutual information value for each of the character strings;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of all the character strings;
and extracting, as the preset number of results and subject to the maximum number of characters, the character strings whose entropy values are greater than the entropy threshold and the character strings whose mutual information values are greater than the mutual information threshold.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
taking the intersection of the results extracted from the word-segmented corpus by the word-segmentation-dependent keyword extraction models to obtain a word segmentation result;
taking the intersection of the results extracted from the text corpus by the keyword extraction models other than the word-segmentation-dependent keyword extraction models to obtain a non-word-segmentation result;
and merging the word segmentation result and the non-word-segmentation result to obtain the fusion result.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
calculating the intersection of every two of the preset number of results to obtain a plurality of intersection results;
and taking the union of the plurality of intersection results to obtain the fusion result.
According to a fourth aspect of the embodiments of the present application, there is provided an apparatus for extracting a keyword, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the following steps:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
fusing the preset number of results according to a fusion mode to obtain a fusion result;
outputting the fusion result;
and evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models includes a word-segmentation-dependent keyword extraction model;
when a keyword extraction model is a word-segmentation-dependent keyword extraction model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word-segmented corpus;
and extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model.
Optionally, the word-segmented corpus includes a plurality of words; the word-segmentation-dependent keyword extraction model includes a word-segmentation-dependent model based on word association information and/or a word-segmentation-dependent model based on word frequency information, and the word-segmentation-dependent model based on word association information includes an entropy and mutual information model.
When the word-segmentation-dependent keyword extraction model is the entropy and mutual information model, the extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model includes:
calculating an entropy value and a mutual information value for each of the plurality of words;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of the plurality of words;
and extracting, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.
Optionally, the keyword extraction models other than the word-segmentation-dependent keyword extraction models among the preset number of keyword extraction models include a word-segmentation-independent model based on word association information and/or a word-segmentation-independent model based on word frequency information, and the word-segmentation-independent model based on word association information includes an entropy and mutual information model.
When a keyword extraction model other than the word-segmentation-dependent keyword extraction models is the entropy and mutual information model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
setting the maximum number of characters that a keyword to be extracted may contain;
enumerating all character strings of the text corpus;
calculating an entropy value and a mutual information value for each of the character strings;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of all the character strings;
and extracting, as the preset number of results and subject to the maximum number of characters, the character strings whose entropy values are greater than the entropy threshold and the character strings whose mutual information values are greater than the mutual information threshold.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
taking the intersection of the results extracted from the word-segmented corpus by the word-segmentation-dependent keyword extraction models to obtain a word segmentation result;
taking the intersection of the results extracted from the text corpus by the keyword extraction models other than the word-segmentation-dependent keyword extraction models to obtain a non-word-segmentation result;
and merging the word segmentation result and the non-word-segmentation result to obtain the fusion result.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
calculating the intersection of every two of the preset number of results to obtain a plurality of intersection results;
and taking the union of the plurality of intersection results to obtain the fusion result.
According to a fifth aspect of the embodiments of the present application, there is provided a keyword extraction system, including: a processor, and a memory coupled to the processor;
the memory is used for storing a computer program used for executing the following keyword extraction method:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
fusing the preset number of results according to a fusion mode to obtain a fusion result;
outputting the fusion result;
and evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models includes a word-segmentation-dependent keyword extraction model;
when a keyword extraction model is a word-segmentation-dependent keyword extraction model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word-segmented corpus;
and extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model.
Optionally, the word-segmented corpus includes a plurality of words; the word-segmentation-dependent keyword extraction model includes a word-segmentation-dependent model based on word association information and/or a word-segmentation-dependent model based on word frequency information, and the word-segmentation-dependent model based on word association information includes an entropy and mutual information model.
When the word-segmentation-dependent keyword extraction model is the entropy and mutual information model, the extracting the preset number of results from the word-segmented corpus according to the word-segmentation-dependent keyword extraction model includes:
calculating an entropy value and a mutual information value for each of the plurality of words;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of the plurality of words;
and extracting, as the preset number of results, the words whose entropy values are greater than the entropy threshold and the words whose mutual information values are greater than the mutual information threshold.
Optionally, the keyword extraction models other than the word-segmentation-dependent keyword extraction models among the preset number of keyword extraction models include a word-segmentation-independent model based on word association information and/or a word-segmentation-independent model based on word frequency information, and the word-segmentation-independent model based on word association information includes an entropy and mutual information model.
When a keyword extraction model other than the word-segmentation-dependent keyword extraction models is the entropy and mutual information model, the extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
setting the maximum number of characters that a keyword to be extracted may contain;
enumerating all character strings of the text corpus;
calculating an entropy value and a mutual information value for each of the character strings;
obtaining an entropy threshold and a mutual information threshold from the entropy values and the mutual information values of all the character strings;
and extracting, as the preset number of results and subject to the maximum number of characters, the character strings whose entropy values are greater than the entropy threshold and the character strings whose mutual information values are greater than the mutual information threshold.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
taking the intersection of the results extracted from the word-segmented corpus by the word-segmentation-dependent keyword extraction models to obtain a word segmentation result;
taking the intersection of the results extracted from the text corpus by the keyword extraction models other than the word-segmentation-dependent keyword extraction models to obtain a non-word-segmentation result;
and merging the word segmentation result and the non-word-segmentation result to obtain the fusion result.
Optionally, the fusing the preset number of results according to a fusion mode to obtain a fusion result includes:
calculating the intersection of every two of the preset number of results to obtain a plurality of intersection results;
and taking the union of the plurality of intersection results to obtain the fusion result;
the processor is used for calling and executing the computer program in the memory.
According to a sixth aspect of the embodiments of the present application, there is provided a storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the following steps in the keyword extraction method:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
fusing the preset number of results according to a fusion mode to obtain a fusion result;
outputting the fusion result;
and evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models include keyword extraction models depending on word segmentation;
when the keyword extraction model is the keyword extraction model depending on the participle, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models, including:
performing word segmentation on the text corpus to obtain a word segmentation corpus;
and extracting a preset number of results from the participle corpus according to a keyword extraction model depending on participles.
Optionally, the participle corpus includes a plurality of words; the keyword extraction model depending on the participles comprises a model depending on the participles and based on vocabulary associated information and/or a model depending on the participles and based on word frequency information, and the model depending on the participles and based on the vocabulary associated information comprises an entropy and mutual information model.
When the keyword extraction model dependent on participles is the entropy and mutual information model, extracting a preset number of results from the participle corpus according to the keyword extraction model dependent on participles, including:
respectively calculating entropy values and mutual information values of the words;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the words;
extracting the plurality of terms with the entropy values larger than the entropy threshold value and the plurality of terms with the mutual information values larger than the mutual information threshold value as the preset number of results.
Optionally, the keyword extraction models other than the keyword extraction models dependent on the participles in the plurality of keyword extraction models include a model based on vocabulary associated information independent of the participles and/or a model based on word frequency information independent of the participles, and the model based on vocabulary associated information independent of the participles includes an entropy and mutual information model.
When the keyword extraction models other than the preset number of keyword extraction models depending on the participles are the entropy and mutual information models, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models, including:
setting the maximum number of characters contained in the keywords to be extracted;
enumerating all character strings of the text corpus;
respectively calculating entropy values and mutual information values of all the character strings;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
and extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value as results of the preset number according to the maximum number of the characters.
Optionally, the fusing the preset number of results according to feedback to obtain a fused result includes:
obtaining an intersection of a preset number of results extracted from the participle corpus according to the plurality of keyword extraction models depending on the participles to obtain participle results;
obtaining an intersection of a preset number of results extracted from the text corpus according to the keyword extraction models except the plurality of keyword extraction models depending on the participles to obtain a non-participle result;
and merging the word segmentation result and the non-word segmentation result to obtain a fusion result.
Optionally, the fusing the preset number of results according to feedback to obtain a fused result includes:
calculating intersection of every two results with the preset number to obtain a plurality of intersection results;
and solving a union set of the intersection set results to obtain the fusion result.
The technical solutions provided by the embodiments of the present application can have the following beneficial effects: a preset number of results are extracted from the text corpus according to a preset number of keyword extraction models, the preset number of results are fused according to a fusion mode to obtain a fusion result, the fusion result is output, and the fusion result is evaluated to update the fusion mode. On this basis, for irregular information content such as spoken expressions, the respective advantages of the multiple keyword extraction models can be brought into play, so that keyword extraction is more accurate and ideal results are easier to obtain.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a keyword extraction method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a keyword extraction method when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a second embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model based on entropy and mutual information when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a third embodiment of the present application.
Fig. 4 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model of entropy and mutual information when the keyword extraction model is a keyword extraction model independent of word segmentation according to a fourth embodiment of the present application.
Fig. 5 is a schematic flowchart of a fusion method provided in the fifth embodiment of the present application.
Fig. 6 is a schematic flow chart of a fusion method provided in the sixth embodiment of the present application.
Fig. 7 is a schematic structural diagram of an apparatus for extracting keywords according to a seventh embodiment of the present application.
Fig. 8 is a schematic structural diagram of an apparatus for extracting keywords according to an eighth embodiment of the present application.
Fig. 9 is a schematic structural diagram of an apparatus for extracting keywords according to a ninth embodiment of the present application.
Fig. 10 is a schematic structural diagram of an apparatus for extracting keywords according to a tenth embodiment of the present application.
Fig. 11 is a schematic structural diagram of an apparatus for extracting keywords according to an eleventh embodiment of the present application.
Fig. 12 is a schematic structural diagram of an apparatus for extracting keywords according to a twelfth embodiment of the present application.
Fig. 13 is a schematic structural diagram of a keyword extraction system according to a thirteenth embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Fig. 1 is a schematic flowchart of a keyword extraction method according to an embodiment of the present application.
Referring to fig. 1, the method for extracting a keyword provided in this embodiment includes:
step S11, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;
step S12, fusing a preset number of results according to a fusion mode to obtain a fusion result;
step S13, outputting a fusion result;
step S14, evaluating the fusion result and updating the fusion mode;
Because a preset number of results are extracted from the text corpus according to a preset number of keyword extraction models, the results are fused according to the fusion mode and output, and the fusion result is evaluated to update the fusion mode, the respective advantages of the multiple keyword extraction models can be brought into play when extracting from irregular information content such as spoken expressions. Keyword extraction is therefore more accurate, and ideal results are easier to obtain.
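The four steps S11 to S14 can be illustrated with a minimal Python sketch. The embodiment does not say how the fusion result is evaluated or how the fusion mode is updated, so everything specific here is an assumption made for illustration: the names `run_pipeline`, `union_all`, `intersect_all` and `f1` are hypothetical, the evaluation is an F1 score against an optional reference keyword set, and the update simply picks the best-scoring fusion mode for the next run.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Extractor = Callable[[str], Set[str]]          # one keyword extraction model
Fusion = Callable[[List[Set[str]]], Set[str]]  # one fusion mode


def union_all(results: List[Set[str]]) -> Set[str]:
    """Hypothetical fusion mode: keep every keyword any model produced."""
    return set().union(*results)


def intersect_all(results: List[Set[str]]) -> Set[str]:
    """Hypothetical fusion mode: keep only keywords all models agree on."""
    return set.intersection(*results) if results else set()


def f1(predicted: Set[str], reference: Set[str]) -> float:
    """F1 score of a keyword set against reference keywords (assumed metric)."""
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def run_pipeline(
    text: str,
    extractors: List[Extractor],
    fusion_modes: Dict[str, Fusion],
    mode: str,
    reference: Optional[Set[str]] = None,
) -> Tuple[Set[str], str]:
    if len(extractors) < 2:
        raise ValueError("the preset number of models must be at least 2")
    # S11: one result per model (one-to-one correspondence).
    results = [extract(text) for extract in extractors]
    # S12: fuse the results with the currently selected fusion mode.
    fused = fusion_modes[mode](results)
    # S14 (assumed form): score every fusion mode against the reference
    # keywords and select the best one for the next run.
    next_mode = mode
    if reference:
        next_mode = max(
            fusion_modes, key=lambda m: f1(fusion_modes[m](results), reference)
        )
    # S13: the fused keywords are the output.
    return fused, next_mode
```

In this sketch the caller feeds `next_mode` back in as `mode` on the following call, which is one simple way to close the evaluate-and-update loop.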
It should be noted that the keyword extraction models are divided into keyword extraction models that depend on word segmentation and the remaining keyword extraction models, which do not depend on word segmentation.
Fig. 2 is a flowchart illustrating a keyword extraction method when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a second embodiment of the present application.
Referring to fig. 2, the method for extracting keywords according to the present embodiment includes:
step S211, performing word segmentation on the text corpus to obtain a word segmentation corpus;
step S212, extracting a preset number of results from the participle corpus according to the keyword extraction model depending on the participle;
step S22, fusing a preset number of results according to a fusion mode to obtain a fusion result;
step S23, outputting a fusion result;
and step S24, evaluating the fusion result and updating the fusion mode.
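Steps S211 and S212 can be sketched as follows. This is only an illustration under stated assumptions: `segment` is a regex stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the two models `top_frequency` and `long_repeated` are hypothetical examples of segmentation-dependent models, not models named in the application.

```python
import re
from collections import Counter
from typing import Callable, List, Set


def segment(text: str) -> List[str]:
    """S211 stand-in: split the text corpus into words with a simple regex."""
    return re.findall(r"\w+", text.lower())


def top_frequency(tokens: List[str], k: int = 3) -> Set[str]:
    """Hypothetical model: the k most frequent words."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def long_repeated(tokens: List[str], min_len: int = 5) -> Set[str]:
    """Hypothetical model: words of at least min_len characters seen twice or more."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= 2 and len(w) >= min_len}


def extract_with_segmentation(
    text: str, models: List[Callable[[List[str]], Set[str]]]
) -> List[Set[str]]:
    # S211: segment once; S212: every segmentation-dependent model
    # extracts its own result from the same word segmentation corpus.
    tokens = segment(text)
    return [model(tokens) for model in models]
```

Segmenting once and sharing the tokens keeps the results comparable, since every model sees the same word boundaries.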
It should be noted that both the segmentation-dependent keyword extraction models and the segmentation-independent keyword extraction models may include keyword extraction models based on word frequency information and keyword extraction models based on association information. Moreover, the extraction steps of these two kinds of model differ depending on whether the model depends on word segmentation.
For the keyword extraction model based on word frequency information, taking TF-IDF (Term Frequency-Inverse Document Frequency) as an example, the extraction can be summarized in the following steps:
step 1, calculating the term frequency (TF), with the specific formula:
$\mathrm{TF} = \dfrac{\text{number of times a word appears in an article}}{\text{total number of words in the article}}$;
step 2, calculating the inverse document frequency (IDF), with the specific formula:
$\mathrm{IDF} = \log\left(\dfrac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1}\right)$;
step 3, calculating TF-IDF, with the specific formula:
$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$.
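The three TF-IDF steps can be sketched in Python as below. The function name `tf_idf` and the input layout (a list of already tokenized documents) are assumptions for illustration; the arithmetic follows the formulas as stated, including the `+ 1` in the denominator and the natural logarithm as one possible choice of base.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF-IDF score} dictionary per tokenized document."""
    n_docs = len(docs)
    # Number of documents that contain each word.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {
                # step 1: TF = occurrences / total words in the document
                # step 2: IDF = log(total documents / (documents with word + 1))
                # step 3: TF-IDF = TF * IDF
                word: (count / total) * math.log(n_docs / (doc_freq[word] + 1))
                for word, count in counts.items()
            }
        )
    return scores
```

With this smoothing, a word that appears in every document receives a slightly negative score, so a threshold or top-k selection applied afterwards naturally discards it.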
For the keyword extraction model based on association information, the entropy and mutual information model is taken as an example below, and the segmentation-dependent type and the segmentation-independent type are described in turn.
Fig. 3 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model based on entropy and mutual information when the keyword extraction model is a keyword extraction model dependent on word segmentation according to a third embodiment of the present application.
Referring to fig. 3, the method of the present embodiment includes:
step S311, performing word segmentation on the text corpus to obtain a word segmentation corpus;
step S3121, respectively calculating entropy values and mutual information values of a plurality of words;
step S3122, obtaining an entropy threshold and a mutual information threshold according to the entropy values and the mutual information values of the plurality of words;
step S3123, extracting a plurality of words of which the entropy values are greater than an entropy threshold value and a plurality of words of which the mutual information values are greater than a mutual information threshold value as a preset number of results;
step S32, fusing a preset number of results according to a fusion mode to obtain a fusion result;
step S33, outputting a fusion result;
and step S34, evaluating the fusion result and updating the fusion mode.
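Steps S3122 and S3123 can be sketched as a thresholding routine over precomputed scores. The embodiment does not state how the thresholds are derived from the values, so this sketch assumes the arithmetic mean of each score set; the names `select_above_threshold` and `entropy_mi_results` are hypothetical, and the entropy and mutual information values are taken as given inputs.

```python
from typing import Dict, List, Optional, Set


def select_above_threshold(
    scores: Dict[str, float], threshold: Optional[float] = None
) -> Set[str]:
    """Keep the words whose score is strictly greater than the threshold.

    When no threshold is supplied, the mean score is used (an assumption;
    the source only says the threshold is obtained from the values).
    """
    if not scores:
        return set()
    if threshold is None:
        threshold = sum(scores.values()) / len(scores)
    return {word for word, value in scores.items() if value > threshold}


def entropy_mi_results(
    entropy: Dict[str, float], mutual_info: Dict[str, float]
) -> List[Set[str]]:
    # S3122: derive one threshold per measure; S3123: extract the words
    # above the entropy threshold and the words above the MI threshold.
    return [select_above_threshold(entropy), select_above_threshold(mutual_info)]
```

Returning the two word sets separately leaves the choice of how to combine them to the fusion step.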
Fig. 4 is a flowchart illustrating a method for extracting keywords by using a keyword extraction model of entropy and mutual information when the keyword extraction model is a keyword extraction model independent of word segmentation according to a fourth embodiment of the present application.
Referring to fig. 4, the method of the present embodiment includes:
step S411, setting the maximum number of characters contained in the keywords to be extracted;
step S412, enumerating all character strings of the text corpus;
step S413, respectively calculating entropy values and mutual information values of all the character strings;
step S414, obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
step S415, extracting a plurality of character strings with entropy values larger than an entropy threshold value and a plurality of character strings with mutual information values larger than a mutual information threshold value according to the maximum number of characters as a preset number of results;
step S42, fusing a preset number of results according to a fusion mode to obtain a fusion result;
step S43, outputting a fusion result;
and step S44, evaluating the fusion result and updating the fusion mode.
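Steps S411 to S415 can be sketched without any word segmentation. The application does not define the entropy or mutual information measures, so this sketch assumes two common choices: the entropy of the character that follows a string (right branching entropy) and the minimum pointwise mutual information over all two-way splits of a string. Probabilities are approximated as count divided by text length, thresholds are assumed to be mean values, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set


def enumerate_strings(text: str, max_len: int) -> Counter:
    """S411/S412: count every substring of at most max_len characters."""
    counts: Counter = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def right_entropy(text: str, max_len: int) -> Dict[str, float]:
    """Assumed entropy measure: entropy of the character following each string."""
    followers = defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(text) - n):
            followers[text[i:i + n]][text[i + n]] += 1
    result = {}
    for s, nxt in followers.items():
        total = sum(nxt.values())
        result[s] = -sum(c / total * math.log(c / total) for c in nxt.values())
    return result


def min_pmi(counts: Counter, text_len: int) -> Dict[str, float]:
    """Assumed MI measure: minimum PMI over all splits of strings of length >= 2."""
    result = {}
    for s, c in counts.items():
        if len(s) < 2:
            continue
        p_s = c / text_len
        result[s] = min(
            math.log(p_s / ((counts[s[:k]] / text_len) * (counts[s[k:]] / text_len)))
            for k in range(1, len(s))
        )
    return result


def above_mean(scores: Dict[str, float]) -> Set[str]:
    """S414/S415 with an assumed mean threshold."""
    if not scores:
        return set()
    threshold = sum(scores.values()) / len(scores)
    return {s for s, v in scores.items() if v > threshold}


def extract_strings(text: str, max_len: int) -> List[Set[str]]:
    counts = enumerate_strings(text, max_len)
    return [
        above_mean(right_entropy(text, max_len)),  # S413-S415, entropy side
        above_mean(min_pmi(counts, len(text))),    # S413-S415, MI side
    ]
```

The maximum character count bounds the enumeration, so the number of candidate strings grows linearly with the text length rather than quadratically.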
In addition, the fusion method in the above embodiments may be various, and two fusion methods are described in detail below.
Fig. 5 is a schematic flowchart of a fusion method provided in the fifth embodiment of the present application.
Referring to fig. 5, the fusion method of the present embodiment includes:
step S51, obtaining intersection of a preset number of results extracted from a participle corpus according to a plurality of keyword extraction models depending on participles to obtain participle results;
step S52, obtaining the non-participle result by obtaining the intersection of the results of the preset number extracted from the text corpus according to the keyword extraction models except the keyword extraction models depending on participles;
and step S53, merging the word segmentation result with the non-word segmentation result to obtain a fusion result.
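Steps S51 to S53 amount to two intersections followed by a union, which can be sketched with Python sets. The function name `fuse_by_group` and the choice to treat an empty group as contributing nothing are assumptions for illustration.

```python
from typing import Iterable, List, Set


def fuse_by_group(
    seg_results: List[Iterable[str]], non_seg_results: List[Iterable[str]]
) -> Set[str]:
    """Intersect within each model group, then take the union of both groups."""

    def intersect(results: List[Iterable[str]]) -> Set[str]:
        sets = [set(r) for r in results]
        # An empty group contributes no keywords (assumed behaviour).
        return set.intersection(*sets) if sets else set()

    word_seg_result = intersect(seg_results)          # S51
    non_word_seg_result = intersect(non_seg_results)  # S52
    return word_seg_result | non_word_seg_result      # S53
```

For example, `fuse_by_group([{"card", "fee"}, {"card", "loss"}], [{"report loss", "card"}, {"report loss"}])` keeps `"card"` from the first group and `"report loss"` from the second.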
Fig. 6 is a schematic flow chart of a fusion method provided in the sixth embodiment of the present application.
Referring to fig. 6, the fusion method of the present embodiment includes:
step S61, calculating intersection of every two results with preset number to obtain multiple intersection results;
and step S62, obtaining a fusion result by taking a union of a plurality of intersection results.
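Steps S61 and S62 can be sketched with `itertools.combinations`. The function name `fuse_pairwise` is hypothetical; the logic is exactly the pairwise intersection followed by a union.

```python
from itertools import combinations
from typing import Iterable, List, Set


def fuse_pairwise(results: List[Iterable[str]]) -> Set[str]:
    """Union of the intersections of every pair of results."""
    fused: Set[str] = set()
    for a, b in combinations(results, 2):  # S61: every two results
        fused |= set(a) & set(b)           # S62: accumulate the union
    return fused
```

A keyword survives this fusion exactly when at least two models extracted it, so the mode behaves like a two-vote agreement rule.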
Fig. 7 is a schematic structural diagram of an apparatus for extracting keywords according to a seventh embodiment of the present application.
Referring to fig. 7, the device of the present embodiment includes an extraction module, a fusion module, an output module and an evaluation module.
Specifically, each module is described as follows:
the extraction module is used for extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;
the fusion module is used for fusing a preset number of results according to a fusion mode to obtain a fusion result;
the output module is used for outputting a fusion result;
and the evaluation module is used for evaluating the fusion result and updating the fusion mode.
Fig. 8 is a schematic structural diagram of an apparatus for extracting keywords according to an eighth embodiment of the present application.
Further, referring to fig. 8, when the keyword extraction model is a segmentation-dependent keyword extraction model, the extraction module includes:
the word segmentation unit is used for segmenting the text corpus to obtain a word segmentation corpus;
the first extraction unit is used for extracting a preset number of results from the participle corpus according to the keyword extraction model depending on the participle.
Fig. 9 is a schematic structural diagram of an apparatus for extracting keywords according to a ninth embodiment of the present application.
Referring to fig. 9, when the keyword extraction model depending on the participle is an entropy and mutual information model, the first extraction unit includes:
the calculation subunit is used for respectively calculating entropy values and mutual information values of the plurality of words;
the threshold value determining subunit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the multiple words;
and the extraction subunit is used for extracting a plurality of words of which the entropy values are greater than the entropy threshold value and a plurality of words of which the mutual information values are greater than the mutual information threshold value as a preset number of results.
Fig. 10 is a schematic structural diagram of an apparatus for extracting keywords according to a tenth embodiment of the present application.
Referring to fig. 10, when the keyword extraction models other than the preset number of keyword extraction models depending on the participles are entropy and mutual information models, the extraction module includes:
the setting unit is used for setting the maximum number of characters contained in the keywords to be extracted;
the enumeration unit is used for enumerating all character strings of the text corpus;
the calculation unit is used for respectively calculating entropy values and mutual information values of all the character strings;
the threshold value determining unit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
and the second extraction unit is used for extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value as a preset number of results according to the maximum number of the characters.
Fig. 11 is a schematic structural diagram of an apparatus for extracting keywords according to an eleventh embodiment of the present application.
There may be multiple fusion modes. Referring to fig. 11, when one of the fusion modes is adopted, the fusion module may include:
the first intersection unit is used for solving the intersection of a preset number of results extracted from the participle corpus according to a plurality of keyword extraction models depending on participles to obtain participle results;
the second intersection unit is used for obtaining the intersection of a preset number of results extracted from the text corpus according to the keyword extraction models except the plurality of keyword extraction models depending on the participles to obtain a non-participle result;
and the first union unit is used for taking the union of the word segmentation result and the non-word segmentation result to obtain a fusion result.
Fig. 12 is a schematic structural diagram of an apparatus for extracting keywords according to a twelfth embodiment of the present application.
Referring to fig. 12, when another fusion method is adopted, the fusion module may include:
the third intersection unit is used for solving the intersection of every two results with preset numbers to obtain a plurality of intersection results;
and the second union unit is used for obtaining a union of the intersection results to obtain a fusion result.
Fig. 13 is a schematic structural diagram of a keyword extraction system according to a thirteenth embodiment of the present application.
Referring to fig. 13, the present embodiment includes: a processor, and a memory coupled to the processor;
the memory is used for storing a computer program, the computer program is at least used for executing the following keyword extraction method:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;
fusing a preset number of results according to a fusion mode to obtain a fusion result;
outputting a fusion result;
and evaluating the fusion result and updating the fusion mode.
Further, the preset number of keyword extraction models comprises keyword extraction models depending on word segmentation;
when the keyword extraction model is a keyword extraction model depending on word segmentation, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word segmentation corpus;
and extracting a preset number of results from the participle corpus according to the keyword extraction model depending on the participle.
Further, the word segmentation corpus comprises a plurality of words; the keyword extraction model depending on word segmentation comprises a model based on vocabulary association information and/or a model based on word frequency information, and the model based on vocabulary association information comprises an entropy and mutual information model.
When the keyword extraction model depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation includes:
respectively calculating entropy values and mutual information values of a plurality of words;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the words;
and extracting a plurality of words of which the entropy values are larger than the entropy threshold value and a plurality of words of which the mutual information values are larger than the mutual information threshold value as a preset number of results.
Further, among the preset number of keyword extraction models, the keyword extraction models other than those depending on word segmentation include a segmentation-independent model based on vocabulary association information and/or a segmentation-independent model based on word frequency information, and the segmentation-independent model based on vocabulary association information includes an entropy and mutual information model.
When the keyword extraction model other than those depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:
setting the maximum number of characters contained in the keywords to be extracted;
enumerating all character strings of the text corpus;
respectively calculating entropy values and mutual information values of all character strings;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
and extracting a plurality of character strings of which the entropy values are greater than an entropy threshold value and a plurality of character strings of which the mutual information values are greater than a mutual information threshold value as a preset number of results according to the maximum number of the characters.
Further, fusing the preset number of results according to feedback to obtain a fusion result, including:
obtaining an intersection of a preset number of results extracted from a participle corpus according to a plurality of keyword extraction models depending on participles to obtain a participle result;
obtaining an intersection of a preset number of results extracted from the text corpus according to the keyword extraction models except the plurality of keyword extraction models depending on the participles to obtain a non-participle result;
and taking a union set of the word segmentation result and the non-word segmentation result to obtain a fusion result.
Further, fusing the preset number of results according to feedback to obtain a fusion result, including:
obtaining an intersection of every two results with preset numbers to obtain a plurality of intersection results;
and obtaining a fusion result by taking a union set of the intersection results.
The processor is used to call and execute the computer program in the memory.
In addition, the present application further provides a storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the following steps of the keyword extraction method:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one by one; the preset number is at least 2;
fusing a preset number of results according to a fusion mode to obtain a fusion result;
outputting a fusion result;
and evaluating the fusion result and updating the fusion mode.
Optionally, the preset number of keyword extraction models includes a keyword extraction model depending on word segmentation;
when the keyword extraction model is a keyword extraction model depending on word segmentation, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models includes:
performing word segmentation on the text corpus to obtain a word segmentation corpus;
and extracting a preset number of results from the participle corpus according to the keyword extraction model depending on the participle.
Optionally, the word segmentation corpus includes a plurality of words; the keyword extraction model depending on word segmentation comprises a model based on vocabulary association information and/or a model based on word frequency information, and the model based on vocabulary association information comprises an entropy and mutual information model.
When the keyword extraction model depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model depending on word segmentation includes:
respectively calculating entropy values and mutual information values of a plurality of words;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the words;
and extracting a plurality of words of which the entropy values are larger than the entropy threshold value and a plurality of words of which the mutual information values are larger than the mutual information threshold value as a preset number of results.
Optionally, among the preset number of keyword extraction models, the keyword extraction models other than those depending on word segmentation include a segmentation-independent model based on vocabulary association information and/or a segmentation-independent model based on word frequency information, and the segmentation-independent model based on vocabulary association information includes an entropy and mutual information model.
When the keyword extraction model other than those depending on word segmentation is the entropy and mutual information model, extracting a preset number of results from the text corpus according to the preset number of keyword extraction models includes:
setting the maximum number of characters contained in the keywords to be extracted;
enumerating all character strings of the text corpus;
respectively calculating entropy values and mutual information values of all character strings;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
and extracting a plurality of character strings of which the entropy values are greater than an entropy threshold value and a plurality of character strings of which the mutual information values are greater than a mutual information threshold value as a preset number of results according to the maximum number of the characters.
Optionally, the preset number of results are fused according to the feedback to obtain a fusion result, including:
obtaining an intersection of a preset number of results extracted from a participle corpus according to a plurality of keyword extraction models depending on participles to obtain a participle result;
obtaining an intersection of a preset number of results extracted from the text corpus according to the keyword extraction models except the plurality of keyword extraction models depending on the participles to obtain a non-participle result;
and taking a union set of the word segmentation result and the non-word segmentation result to obtain a fusion result.
Optionally, the preset number of results are fused according to the feedback to obtain a fusion result, including:
obtaining an intersection of every two of the preset number of results to obtain a plurality of intersection results;
and obtaining a fusion result by taking a union set of the intersection results.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware; the program may be stored in a computer readable storage medium, and when executed, the program performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A keyword extraction method is characterized by comprising the following steps:
extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2; the keyword extraction models are divided into keyword extraction models that depend on word segmentation and keyword extraction models that do not depend on word segmentation;
fusing the preset number of results according to a fusion mode to obtain a fusion result; the fusion mode comprises an intersection operation and a union operation;
outputting the fusion result;
evaluating the fusion result and updating the fusion mode;
when the keyword extraction model is a keyword extraction model that depends on word segmentation, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models comprises:
performing word segmentation on the text corpus to obtain a word segmentation corpus;
extracting a preset number of results from the word segmentation corpus according to the keyword extraction model that depends on word segmentation;
wherein the word segmentation corpus comprises a plurality of words; when the keyword extraction model that depends on word segmentation is an entropy and mutual information model, extracting a preset number of results from the word segmentation corpus according to the keyword extraction model that depends on word segmentation comprises:
respectively calculating entropy values and mutual information values of the words;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the words;
extracting a plurality of words of which the entropy values are larger than the entropy threshold value and a plurality of words of which the mutual information values are larger than the mutual information threshold value as the preset number of results;
wherein fusing the preset number of results according to feedback to obtain a fusion result comprises:
obtaining the intersection of the results extracted from the word segmentation corpus by the plurality of keyword extraction models that depend on word segmentation, to obtain a word segmentation result;
obtaining the intersection of the results extracted from the text corpus by the keyword extraction models other than the plurality of keyword extraction models that depend on word segmentation, to obtain a non-word segmentation result;
and merging the word segmentation result and the non-word segmentation result to obtain the fusion result.
2. The method according to claim 1, wherein the segmentation-dependent keyword extraction model comprises a segmentation-dependent vocabulary association information-based model and/or a segmentation-dependent word frequency information-based model, and the segmentation-dependent vocabulary association information-based model comprises an entropy and mutual information model.
3. The method of claim 1, wherein the keyword extraction models other than the plurality of segmentation-dependent keyword extraction models of the plurality of keyword extraction models comprise segmentation-independent lexical relevance information-based models and/or segmentation-independent word frequency information-based models, the segmentation-independent lexical relevance information-based models comprising entropy and mutual information models;
when the keyword extraction models other than the keyword extraction models that depend on word segmentation are entropy and mutual information models, extracting a preset number of results from the text corpus according to a preset number of keyword extraction models comprises:
setting the maximum number of characters contained in the keywords to be extracted;
enumerating all character strings of the text corpus;
respectively calculating entropy values and mutual information values of all the character strings;
obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of all the character strings;
and extracting, subject to the maximum number of characters, a plurality of character strings whose entropy values are greater than the entropy threshold and a plurality of character strings whose mutual information values are greater than the mutual information threshold as the preset number of results.
4. The method according to claim 1, wherein the fusing the preset number of results according to feedback to obtain a fused result comprises:
calculating the intersection of every two results with the preset number to obtain a plurality of intersection results;
and solving a union set of the intersection set results to obtain the fusion result.
5. An extraction device of a keyword, characterized by comprising:
the extraction module is used for extracting a preset number of results from the text corpus according to a preset number of keyword extraction models; the preset number of results corresponds to the preset number of keyword extraction models one to one; the preset number is at least 2;
the fusion module is used for fusing the preset number of results according to a fusion mode to obtain a fusion result;
the output module is used for outputting the fusion result;
the evaluation module is used for evaluating the fusion result and updating the fusion mode;
the preset number of keyword extraction models comprise keyword extraction models that depend on word segmentation;
when the keyword extraction model is a keyword extraction model that depends on word segmentation, the extraction module comprises:
a word segmentation unit, configured to perform word segmentation on the text corpus to obtain a word segmentation corpus;
a first extraction unit, configured to extract a preset number of results from the word segmentation corpus according to the keyword extraction model that depends on word segmentation;
wherein the word segmentation corpus comprises a plurality of words; when the keyword extraction model that depends on word segmentation is an entropy and mutual information model, the first extraction unit comprises:
a calculating subunit, configured to calculate entropy values and mutual information values of the multiple words, respectively;
the threshold value determining subunit is used for obtaining an entropy threshold value and a mutual information threshold value according to the entropy values and the mutual information values of the plurality of words;
an extraction subunit, configured to extract, as the preset number of results, the plurality of terms whose entropy values are greater than the entropy threshold and the plurality of terms whose mutual information values are greater than the mutual information threshold;
wherein the fusion module comprises:
a first intersection unit, configured to obtain the intersection of the results extracted from the word segmentation corpus by the plurality of keyword extraction models that depend on word segmentation, to obtain a word segmentation result;
a second intersection unit, configured to obtain the intersection of the results extracted from the text corpus by the keyword extraction models other than the plurality of keyword extraction models that depend on word segmentation, to obtain a non-word segmentation result;
and a first merging unit, configured to merge the word segmentation result and the non-word segmentation result to obtain the fusion result.
6. The apparatus of claim 5, wherein the segmentation-dependent keyword extraction model comprises a segmentation-dependent vocabulary association information-based model and/or a segmentation-dependent word frequency information-based model, and the segmentation-dependent vocabulary association information-based model comprises an entropy and mutual information model.
7. The apparatus of claim 5, wherein the keyword extraction models of the plurality of keyword extraction models other than the plurality of segmentation-dependent keyword extraction models comprise segmentation-independent vocabulary-related information-based models and/or segmentation-independent word frequency information-based models, the segmentation-independent vocabulary-related information-based models comprising entropy and mutual information models;
when the keyword extraction models other than the keyword extraction models that depend on word segmentation are entropy and mutual information models, the extraction module comprises:
the setting unit is used for setting the maximum number of characters contained in the keywords to be extracted;
the enumeration unit is used for enumerating all character strings of the text corpus;
a calculating unit, configured to calculate entropy values and mutual information values of all the character strings respectively;
a threshold determining unit, configured to obtain an entropy threshold and a mutual information threshold according to the entropy values and the mutual information values of all the character strings;
and the second extraction unit is used for extracting a plurality of character strings of which the entropy values are greater than the entropy threshold value and a plurality of character strings of which the mutual information values are greater than the mutual information threshold value according to the maximum number of the characters as the preset number of results.
8. The apparatus of claim 5, wherein the fusion module comprises:
the third intersection unit is used for solving the intersection of every two results with the preset number to obtain a plurality of intersection results;
and the second union unit is used for solving a union set of the intersection set results to obtain the fusion result.
9. A keyword extraction system, comprising:
a processor, and a memory coupled to the processor;
the memory is configured to store a computer program, the computer program being used at least to execute the keyword extraction method according to any one of claims 1 to 4;
the processor is used for calling and executing the computer program in the memory.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the keyword extraction method according to any one of claims 1 to 4.
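The segmentation-independent extraction steps recited in claim 3 (set a maximum length, enumerate character strings, score them, derive thresholds, filter) might be sketched as below. The claims do not fix the formulas, so this sketch makes several assumptions that are not taken from the application: entropy is taken as the smaller of the left-neighbour and right-neighbour Shannon entropies, mutual information as the minimum pointwise mutual information over all two-part splits of a string, and each threshold as the mean of the corresponding scores. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def enumerate_strings(text, max_len):
    """Count every substring of 1..max_len characters and record its neighbours."""
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each string
    right = defaultdict(Counter)  # characters seen immediately after each string
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            s = text[i:i + k]
            counts[s] += 1
            if i > 0:
                left[s][text[i - 1]] += 1
            if i + k < n:
                right[s][text[i + k]] += 1
    return counts, left, right


def shannon_entropy(counter):
    """Shannon entropy (natural log) of a frequency table; 0.0 if it is empty."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def score_strings(text, max_len):
    """Assumed scoring: neighbour entropy and minimum pointwise mutual information."""
    counts, left, right = enumerate_strings(text, max_len)
    totals = Counter()
    for s, c in counts.items():
        totals[len(s)] += c

    def prob(s):
        return counts[s] / totals[len(s)]

    entropy, mi = {}, {}
    for s in counts:
        if len(s) < 2:  # single characters cannot be split into two parts
            continue
        entropy[s] = min(shannon_entropy(left[s]), shannon_entropy(right[s]))
        mi[s] = min(
            math.log(prob(s) / (prob(s[:j]) * prob(s[j:])))
            for j in range(1, len(s))
        )
    return entropy, mi


def extract(text, max_len=4):
    """Return (strings above the entropy threshold, strings above the MI threshold)."""
    entropy, mi = score_strings(text, max_len)
    if not entropy:
        return set(), set()
    entropy_threshold = sum(entropy.values()) / len(entropy)  # assumed: mean
    mi_threshold = sum(mi.values()) / len(mi)                 # assumed: mean
    by_entropy = {s for s, v in entropy.items() if v > entropy_threshold}
    by_mi = {s for s, v in mi.items() if v > mi_threshold}
    return by_entropy, by_mi
```

For the toy input "xabyabz" with a maximum length of 2, only "ab" occurs with more than one distinct neighbour on both sides, so it is the only string that passes the entropy filter under these assumptions. How the two returned sets are combined is left to the caller, since the claim wording admits more than one reading.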
CN201810953403.6A 2018-08-20 2018-08-20 Keyword extraction method, device and system Active CN109241525B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810953403.6A CN109241525B (en) 2018-08-20 2018-08-20 Keyword extraction method, device and system
PCT/CN2019/100322 WO2020038253A1 (en) 2018-08-20 2019-08-13 Keyword extraction method, system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810953403.6A CN109241525B (en) 2018-08-20 2018-08-20 Keyword extraction method, device and system

Publications (2)

Publication Number Publication Date
CN109241525A CN109241525A (en) 2019-01-18
CN109241525B true CN109241525B (en) 2022-05-06

Family

ID=65071218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810953403.6A Active CN109241525B (en) 2018-08-20 2018-08-20 Keyword extraction method, device and system

Country Status (2)

Country Link
CN (1) CN109241525B (en)
WO (1) WO2020038253A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241525B (en) * 2018-08-20 2022-05-06 深圳追一科技有限公司 Keyword extraction method, device and system
CN110162614B (en) * 2019-05-29 2021-08-27 腾讯科技(深圳)有限公司 Question information extraction method and device, electronic equipment and storage medium
CN110188181B (en) * 2019-05-31 2021-06-18 腾讯科技(深圳)有限公司 Method and device for determining domain keywords, electronic equipment and storage medium
CN111353050A (en) * 2019-12-27 2020-06-30 北京合力亿捷科技股份有限公司 Word stock construction method and tool in vertical field of telecommunication customer service
CN111898034A (en) * 2020-09-29 2020-11-06 江西汉辰信息技术股份有限公司 News content pushing method and device, storage medium and computer equipment
CN112395404B (en) * 2020-12-05 2023-01-24 中国南方电网有限责任公司 Voice key information extraction method applied to power dispatching
CN112597776A (en) * 2021-03-08 2021-04-02 中译语通科技股份有限公司 Keyword extraction method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937471A (en) * 2010-09-21 2011-01-05 上海大学 Multidimensional space evaluation method of keyword extraction algorithm
CN103699625A (en) * 2013-12-20 2014-04-02 北京百度网讯科技有限公司 Method and device for retrieving based on keyword
CN105243053A (en) * 2015-09-15 2016-01-13 百度在线网络技术(北京)有限公司 Method and apparatus for extracting key sentence of document
CN105653547A (en) * 2014-11-12 2016-06-08 北大方正集团有限公司 Method and device for extracting keywords of text
CN106126588A (en) * 2016-06-17 2016-11-16 广州视源电子科技股份有限公司 The method and apparatus that related term is provided
CN107168954A (en) * 2017-05-18 2017-09-15 北京奇艺世纪科技有限公司 Text key word generation method and device and electronic equipment and readable storage medium storing program for executing
WO2018086470A1 (en) * 2016-11-10 2018-05-17 腾讯科技(深圳)有限公司 Keyword extraction method and device, and server
CN108133045A (en) * 2018-01-12 2018-06-08 广州杰赛科技股份有限公司 Keyword extracting method and system, keyword extraction model generating method and system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067808B (en) * 2007-05-24 2010-12-15 上海大学 Text key word extracting method
CN102411563B (en) * 2010-09-26 2015-06-17 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
US9483557B2 (en) * 2011-03-04 2016-11-01 Microsoft Technology Licensing Llc Keyword generation for media content
CN104572622B (en) * 2015-01-05 2018-01-02 武汉传神信息技术有限公司 A kind of screening technique of term
CN104778201B (en) * 2015-01-23 2018-01-02 湖南科技大学 A kind of first technology search method merged based on more Query Results
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN108228556A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Key phrase extracting method and device
CN108334490B (en) * 2017-04-07 2021-05-07 腾讯科技(深圳)有限公司 Keyword extraction method and keyword extraction device
CN107577671B (en) * 2017-09-19 2020-09-22 中央民族大学 Subject term extraction method based on multi-feature fusion
CN109241525B (en) * 2018-08-20 2022-05-06 深圳追一科技有限公司 Keyword extraction method, device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Data Fusion: Boosting Performance in Keyword Extraction; Thomas Bohne et al.; 20th Annual IEEE International Conference and Workshops on the Engineering of Computer Based Systems (ECBS); 2013-09-19; pp. 166-173 *
A Chinese keyword extraction method fusing multiple features (融合多特征的中文关键词提取方法); Pan Limin (潘丽敏) et al.; Netinfo Security (信息网络安全); 2014-08-31 (No. 8); pp. 40-43 *

Also Published As

Publication number Publication date
WO2020038253A1 (en) 2020-02-27
CN109241525A (en) 2019-01-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant