CN115858773A

CN115858773A - Keyword mining method, device and medium suitable for long document

Info

Publication number: CN115858773A
Application number: CN202210357739.2A
Authority: CN
Inventors: 段兴涛; 赵国庆; 周长安
Original assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Current assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2023-03-28

Abstract

The application discloses a keyword mining method, device and medium suitable for long documents. The method comprises: acquiring first texts to be processed; performing cluster analysis on the first texts to determine second texts belonging to a target cluster; determining a candidate keyword set for each second text; performing relevance calculation on the candidate keyword sets to obtain relevance indexes for the keywords in each candidate keyword set; and screening keywords according to those relevance indexes to obtain a keyword dictionary for the target cluster. Through cluster analysis, the method and device reduce the redundancy caused by document label information in the related art; at the same time, the association information among words improves the indexes of the keywords, thereby improving the accuracy of keyword mining.

Description

Keyword mining method, device and medium suitable for long document

Technical Field

The present application relates to the field of document processing technologies, and in particular, to a keyword mining method, apparatus, and medium suitable for long documents.

Background

Keyword mining is one of the basic tasks of natural language processing (NLP). It supports application scenarios such as retrieval in search engines, semantic understanding in question-answering systems, text matching, in-domain dictionary expansion, and knowledge graphs. The quality of keyword mining therefore directly affects the results of downstream tasks. Current keyword mining methods fall mainly into deep learning and machine learning. Machine learning cannot mine semantic representations; deep learning captures semantic information but is not suitable for long documents; and both rely on bag-of-words models, so the efficiency of keyword mining tends to be too low to meet business requirements. For example, deep learning models such as RNN, CNN, BERT, RNN+CRF, CNN+CRF and BERT+CRF are used to mine keywords as a named entity recognition task, but this approach limits the input document length and cannot process very long articles, and large models also need GPU hardware support: the input text length handled by a BiLSTM (bidirectional LSTM) is about 700 words, and that of the BERT model is about 512 words. As another example, mining domain keywords with association analysis is inefficient, mainly because the label information of all documents in the same domain is too redundant, so that the four indexes of confidence, support, lift (promotion degree) and information entropy measure poorly; the redundant information needs to be reduced in advance.

Disclosure of Invention

The application provides a keyword mining method and device, electronic equipment and a computer-readable storage medium suitable for long documents, which can solve at least one of the problems. The technical scheme is as follows:

in a first aspect, a keyword mining method suitable for a long document is provided, and the method includes:

acquiring each first text to be processed obtained through speech recognition ASR conversion;

performing cluster analysis on each first text to determine each second text belonging to a target cluster;

determining candidate keyword sets respectively corresponding to the second texts;

performing relevance calculation on the candidate keyword sets respectively corresponding to the second texts to obtain relevance indexes respectively corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts;

and performing keyword screening according to the relevance indexes corresponding to the keywords in the candidate keyword sets respectively corresponding to the second texts to obtain a keyword dictionary for the target cluster.

Further, the step of performing cluster analysis on each first text and determining each second text belonging to the target cluster class includes:

vectorizing each character corresponding to each first text to obtain a vector sequence corresponding to each first text;

and clustering the vector sequences respectively corresponding to the first texts to obtain second texts belonging to the target cluster.

Furthermore, the vectorizing step of each character corresponding to each first text includes:

and performing term frequency-inverse document frequency (TF-IDF) vector conversion on each character corresponding to each first text to obtain a TF-IDF vector sequence corresponding to each first text.

Further, before the step of vectorizing each character corresponding to each first text, the method further includes:

and preprocessing each first text to obtain each first text only comprising Chinese characters.

Further, the step of determining a candidate keyword set corresponding to each second text includes:

and performing word segmentation on each second text according to a plurality of string lengths preconfigured for the N-Gram model to obtain a candidate keyword set corresponding to each second text.

Further, the relevance indicator includes at least one of:

support degree, confidence degree, promotion degree and information entropy.

Further, the step of performing keyword screening according to a plurality of relevance indexes corresponding to each keyword included in the candidate keyword set respectively corresponding to each second text to obtain a keyword dictionary for the target cluster includes:

and if the support degree, the confidence degree, the promotion degree and the information entropy corresponding to any candidate keyword contained in the candidate keyword set corresponding to any second text are all larger than respective threshold values, determining that candidate keyword as a target keyword, and adding the target keyword into the keyword dictionary for the target cluster.

In a second aspect, a keyword mining apparatus suitable for a long document is provided, the apparatus comprising:

the text acquisition module is used for acquiring each first text to be processed, which is obtained by the conversion of the speech recognition ASR;

the cluster analysis module is used for carrying out cluster analysis on each first text and determining each second text belonging to a target cluster;

the keyword extraction module is used for determining candidate keyword sets corresponding to the second texts respectively;

the correlation calculation module is used for performing correlation calculation on the candidate keyword sets respectively corresponding to the second texts to obtain correlation indexes respectively corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts;

and the keyword processing module is used for screening keywords according to the relevance indexes corresponding to the keywords in the candidate keyword set respectively corresponding to each second text to obtain a keyword dictionary for the target cluster.

Further, the cluster analysis module includes:

the vector conversion sub-module is used for vectorizing each character corresponding to each first text to obtain a vector sequence corresponding to each first text;

and the vector clustering submodule is used for clustering the vector sequences respectively corresponding to the first texts to obtain the second texts belonging to the target cluster.

Further, the vector conversion sub-module includes:

Further, before the cluster analysis module vectorizes each character corresponding to each first text, the apparatus further includes:

and the text preprocessing submodule is used for preprocessing each first text to obtain each first text only comprising Chinese characters.

Further, the keyword extraction module includes:

and the text word segmentation sub-module is used for segmenting each second text according to a plurality of string lengths preconfigured for the N-Gram model to obtain a candidate keyword set corresponding to each second text.

Further, the relevance indicator comprises at least one of:

support degree, confidence degree, promotion degree and information entropy.

Further, the keyword processing module includes:

and the keyword screening sub-module is used for determining any candidate keyword included in the candidate keyword set corresponding to any second text as a target keyword, and adding the target keyword into the keyword dictionary for the target cluster, if the support degree, the confidence degree, the promotion degree and the information entropy corresponding to that candidate keyword are all larger than respective threshold values.

In a third aspect, an electronic device is provided, which includes:

one or more processors;

a memory;

one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the above keyword mining method suitable for long documents.

In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the above-described keyword mining method applicable to long documents.

According to the present application, each first text to be processed, obtained through speech recognition (ASR) conversion, is acquired; cluster analysis is performed on the first texts to determine the second texts belonging to a target cluster; a candidate keyword set is extracted for each second text; relevance calculation is performed on these candidate keyword sets to obtain the relevance indexes of each keyword they contain; and keywords are screened according to those relevance indexes to obtain a keyword dictionary for the target cluster. Because the documents are first classified by cluster analysis and the keywords are then mined per cluster, the redundancy caused by document label information in the related art is reduced. At the same time, the association information among words improves the indexes of the keywords, thereby improving the accuracy of keyword mining.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 is a flowchart illustrating a keyword mining method for long documents according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a keyword mining apparatus suitable for long documents according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.

The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

The embodiment of the application provides a keyword mining method suitable for long documents. As shown in FIG. 1, the method comprises step S101 to step S105.

Step S101, acquiring each first text to be processed, obtained through ASR conversion.

Specifically, the first text may come from a preset real-time speech recognition (ASR) interface, or from a designated platform such as a credit card center.

Step S102, performing cluster analysis on each first text, and determining each second text belonging to a target cluster.

Specifically, once cluster analysis has determined the category to which a text belongs, the text is labeled with that category, and a text library for the target cluster is built from the labeled category identifiers, so that texts can be classified quickly.

Step S103, determining candidate keyword sets corresponding to the second texts respectively.

Specifically, the keywords in a candidate keyword set may all have the same string length or may have different string lengths. For example, the candidate keyword set includes keywords consisting of N Chinese characters, where N is a natural number greater than or equal to 1.

Step S104, performing relevance calculation on the candidate keyword sets respectively corresponding to the second texts to obtain a plurality of relevance indexes corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts.

Specifically, for any keyword extracted from a given document, its co-occurrence frequency across all documents can be counted by traversal, and the various relevance indexes are then calculated from these counts.

Step S105, performing keyword screening according to a plurality of correlation indexes corresponding to the keywords in the candidate keyword set corresponding to each second text, and obtaining a keyword dictionary for the target cluster.

Specifically, threshold values for different relevance indexes may be preset, so that keywords that are consistent with documents of the same category are screened out from a plurality of keywords of the same document. For example, when each relevance index of a certain keyword is greater than each relevance index threshold, the certain keyword is determined as a keyword belonging to the category of documents.

In some implementations, step S102 further includes:

Specifically, each Chinese character of the text may be visited by traversal and converted to a vector one character at a time. In this way, the long text is vectorized.

Specifically, a preset clustering algorithm, such as the K-means clustering algorithm, may be used to cluster the first texts and determine the cluster to which each first text belongs, so that the category of each first text is determined from its cluster.
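The clustering step can be illustrated with a minimal pure-Python K-means sketch. The function names `kmeans` and `texts_in_cluster` are hypothetical, and seeding the centroids with the first k vectors is an assumption made here only to keep the example deterministic; the application does not specify an initialization or an iteration limit.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Assign each equal-length numeric vector to one of k clusters.

    Assumption: the first k vectors serve as the initial centroids.
    Returns a list holding the cluster index of every input vector.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def texts_in_cluster(texts, labels, target):
    """Return the 'second texts': those whose cluster index equals target."""
    return [t for t, lab in zip(texts, labels) if lab == target]
```

Given the vectors of the first texts, `kmeans` yields one cluster index per text, and `texts_in_cluster` then selects the texts of the target cluster for the later steps.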

For example, suppose there are two texts, "我是中国人" ("I am Chinese") and "我爱中国" ("I love China"). At the character level, Counter("我是中国人") = {"我": 1, "是": 1, "中": 1, "国": 1, "人": 1} and Counter("我爱中国") = {"我": 1, "爱": 1, "中": 1, "国": 1}, and the resulting vectors are TF-IDF("我是中国人") = [-0.4, 0, -0.4, -0.4] and TF-IDF("我爱中国") = [-0.4, 0, 0, -0.4, -0.4, 0]. The TF-IDF calculation formula is

$$\text{TF-IDF}(w) = \text{counter}(w) \times \log_2 \frac{N_{\text{total}}}{N_w + 1}$$

where N_total is the total number of texts, N_w is the number of texts in which the character w appears, and counter(w) is the frequency with which w appears in one text. After the vector conversion is completed, the K-means clustering algorithm is used to determine the clusters.
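As a rough illustration of the character-level conversion, the sketch below applies the formula counter(w) × log(N_total / (N_w + 1)) to a list of texts. The function name `char_tfidf`, the sorted vocabulary order, and the `log` parameter are assumptions introduced here. The logarithm is left configurable because the formula is printed with base 2, whereas example values of about -0.4 appear to correspond to a natural logarithm (ln(2/3) ≈ -0.405).

```python
import math
from collections import Counter


def char_tfidf(texts, log=math.log2):
    """Character-level TF-IDF: counter(w) * log(N_total / (N_w + 1)).

    Returns the vocabulary (sorted characters) and one vector per text,
    with one component per vocabulary character.
    """
    vocab = sorted({ch for text in texts for ch in text})
    n_total = len(texts)
    # N_w: the number of texts that contain the character w.
    doc_freq = Counter(ch for text in texts for ch in set(text))
    vectors = []
    for text in texts:
        counts = Counter(text)  # counter(w) within this single text
        vectors.append(
            [counts[ch] * log(n_total / (doc_freq[ch] + 1)) for ch in vocab]
        )
    return vocab, vectors


# Example call with the two texts above, using a natural logarithm.
vocab, vectors = char_tfidf(["我是中国人", "我爱中国"], log=math.log)
```

Characters that occur in every text receive a negative weight under this formula, and characters that occur in only one of two texts receive a weight of zero, which is the pattern visible in the example vectors.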

In some implementations, the vectorizing, in step S102, each character corresponding to each first text includes:

According to the embodiment of the application, the TF-IDF vector obtained by the conversion is highly correlated with the document, which solves the problems in the related art that semantic representations cannot be mined and that long documents cannot be handled.

In some implementations, prior to step S102, the method further includes:

each first text is preprocessed to obtain each first text including only Chinese characters.

In application, after all the characters in a document have been read one by one, preprocessing can be performed with Python's built-in re module to remove every character other than Chinese characters.
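A minimal sketch of this preprocessing with the `re` module follows. The helper name `keep_chinese_only` is hypothetical, and treating "Chinese characters" as the CJK Unified Ideographs range U+4E00 to U+9FFF is an assumption; the application does not state which code points it keeps.

```python
import re

# Assumption: "Chinese characters" means U+4E00..U+9FFF (CJK Unified Ideographs).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def keep_chinese_only(text: str) -> str:
    """Remove every character that is not in the assumed Chinese range."""
    return NON_CHINESE.sub("", text)


cleaned = keep_chinese_only("你好, world! 123 信用卡。")
```

Digits, Latin letters, whitespace and punctuation (including full-width punctuation such as "。") fall outside the assumed range and are dropped.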

In some implementations, step S103 further includes:

and performing word segmentation on each second text according to a plurality of string lengths preconfigured for the N-Gram model to obtain a candidate keyword set corresponding to each second text.

The document is segmented according to an N-Gram model with N set to 3, 4 and 5; that is, the text is cut into character strings of length 3, 4 and 5 respectively, and the candidate keyword set of each second text is obtained by counting the resulting strings.
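The segmentation can be sketched as a sliding window over the characters of a text. The function name `candidate_keywords` is hypothetical, and taking every overlapping window as a candidate is an assumption about how the N-Gram cutting is applied.

```python
from collections import Counter


def candidate_keywords(text, sizes=(3, 4, 5)):
    """Count every character n-gram of the given sizes in one text."""
    counts = Counter()
    for n in sizes:
        # Slide a window of n characters across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


grams = candidate_keywords("信用卡还款")
```

For a text shorter than a given size, the corresponding loop simply produces no candidates.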

In some implementations, the relevance indicators include at least one of:

support degree, confidence degree, promotion degree and information entropy.

In the embodiment of the application, the support degree represents the number of times an N-Gram word and a label appear together; the more times, the more likely the combination is strongly associated. The confidence degree represents the likelihood of label y given that word x occurs; the larger the value, the greater the likelihood. The promotion degree (lift) represents how much the occurrence of word x raises the probability of label y. The information entropy represents the purity of the information; the smaller the value, the higher the purity.

In some implementations, step S105 further includes:

and if the support degree, the confidence degree, the promotion degree and the information entropy corresponding to any keyword included in the candidate keyword set corresponding to any second text are all larger than respective threshold values, determining that keyword as a target keyword, and adding the target keyword into the keyword dictionary for the target cluster.

In application, the candidate keyword sets obtained in step S103 are collected into samples of the form Sample(x, y), where x is an N-Gram word and y is a label (i.e., a cluster), and word frequencies are then counted. Here the word frequency statistic is the co-occurrence count of a word and a label over all documents, which are covered by traversing every document in the same cluster: the counter cnt starts at 0 and is incremented by one each time the word and the label appear together. The support degree represents the number of times an N-Gram word and a label appear together; the more times, the more likely the combination is strongly associated. The confidence degree represents the likelihood of label y given that word x occurs; the larger the value, the greater the likelihood. The promotion degree (lift) represents how much the occurrence of word x raises the probability of label y, and reflects the correlation between x and y: a value of 1 indicates no correlation, a value greater than 1 indicates positive correlation, and a value less than 1 indicates negative correlation. The information entropy represents the purity of information; the smaller the value, the higher the purity.
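The four indexes can be sketched from (x, y) samples as follows. The application gives no formulas, so this sketch assumes the usual association-rule definitions: support as the raw co-occurrence count, confidence as count(x, y) / count(x), lift as confidence / P(y), and entropy as the Shannon entropy of the label distribution given x. The function name `association_metrics` and the sample layout are likewise assumptions.

```python
import math
from collections import Counter


def association_metrics(samples):
    """Compute support, confidence, lift and entropy per (word, label) pair.

    samples: iterable of (x, y) pairs, one per observed co-occurrence of
    an N-Gram word x with a cluster label y.
    """
    samples = list(samples)
    total = len(samples)
    pair_cnt = Counter(samples)              # cnt for each (x, y)
    x_cnt = Counter(x for x, _ in samples)   # occurrences of each word
    y_cnt = Counter(y for _, y in samples)   # occurrences of each label
    metrics = {}
    for (x, y), cnt in pair_cnt.items():
        confidence = cnt / x_cnt[x]              # estimate of P(y | x)
        lift = confidence / (y_cnt[y] / total)   # P(y | x) / P(y)
        # Entropy of the label distribution among samples containing x.
        entropy = sum(
            (c / x_cnt[x]) * math.log2(x_cnt[x] / c)
            for (x2, _), c in pair_cnt.items()
            if x2 == x
        )
        metrics[(x, y)] = {
            "support": cnt,
            "confidence": confidence,
            "lift": lift,
            "entropy": entropy,
        }
    return metrics
```

Under these assumed definitions, a word that only ever occurs with a single label has confidence 1 and entropy 0, which matches the reading that a smaller entropy means higher purity.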

In summary, the text is segmented with the N-Gram model, association analysis is performed using the dependencies among words, the four indexes of confidence degree, support degree, promotion degree and information entropy are computed, and thresholds are set on them to screen the keywords.
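The threshold screening of step S105 can be sketched as a simple filter. The function name `screen_keywords`, the dictionary layout, and every numeric value below are hypothetical; the comparison follows the condition as stated in the text, namely that all four indexes must exceed their respective thresholds.

```python
def screen_keywords(metrics, thresholds):
    """Return the keywords whose four indexes all exceed their thresholds.

    metrics: {keyword: {"support": ..., "confidence": ..., "lift": ..., "entropy": ...}}
    thresholds: a dict with the same four keys.
    """
    names = ("support", "confidence", "lift", "entropy")
    return {
        keyword
        for keyword, values in metrics.items()
        if all(values[name] > thresholds[name] for name in names)
    }


# Hypothetical index values and thresholds, for illustration only.
example_metrics = {
    "信用卡": {"support": 5, "confidence": 0.8, "lift": 1.5, "entropy": 0.6},
    "的时候": {"support": 9, "confidence": 0.3, "lift": 0.9, "entropy": 0.9},
}
example_thresholds = {"support": 2, "confidence": 0.5, "lift": 1.0, "entropy": 0.5}
keyword_dictionary = screen_keywords(example_metrics, example_thresholds)
```

The returned set plays the role of the keyword dictionary for the target cluster; the thresholds would be tuned per domain.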

Still another embodiment of the present application provides a keyword mining apparatus suitable for long documents, as shown in fig. 2, the apparatus 20 includes: a text acquisition module 201, a cluster analysis module 202, a keyword extraction module 203, a correlation calculation module 204, and a keyword processing module 205.

A text acquiring module 201, configured to acquire each first text to be processed obtained through ASR conversion;

the cluster analysis module 202 is configured to perform cluster analysis on each first text, and determine each second text belonging to a target cluster;

the keyword extraction module 203 is configured to determine candidate keyword sets corresponding to the second texts, respectively;

a correlation calculation module 204, configured to perform correlation calculation on the candidate keyword sets respectively corresponding to the second texts to obtain correlation indexes respectively corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts;

the keyword processing module 205 is configured to perform keyword screening according to the relevance indexes corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts, so as to obtain a keyword dictionary for the target cluster.

According to the present application, each first text to be processed, obtained through speech recognition (ASR) conversion, is acquired; cluster analysis is performed on the first texts to determine the second texts belonging to a target cluster; a candidate keyword set is extracted for each second text; relevance calculation is performed on these candidate keyword sets to obtain the relevance indexes of each keyword they contain; and keywords are screened according to those relevance indexes to obtain a keyword dictionary for the target cluster. Because the documents are first classified by cluster analysis and the keywords are then mined per cluster, the redundancy caused by document label information in the related art is reduced. At the same time, the association information among words improves the indexes of the keywords, thereby improving the accuracy of keyword mining.

Further, the cluster analysis module includes:

and the vector clustering sub-module is used for clustering the vector sequences respectively corresponding to the first texts to obtain the second texts belonging to the target cluster.

Further, the vector conversion sub-module includes:

Further, the keyword extraction module includes:

and the text word segmentation sub-module is used for segmenting each second text according to a plurality of string lengths preconfigured for the N-Gram model to obtain a candidate keyword set corresponding to each second text.

Further, the relevance indicator comprises at least one of:

support degree, confidence degree, promotion degree and information entropy.

Further, the keyword processing module includes:

and the keyword screening sub-module is used for determining any candidate keyword included in the candidate keyword set corresponding to any second text as a target keyword, and adding the target keyword into the keyword dictionary for the target cluster, if the support degree, the confidence degree, the promotion degree and the information entropy corresponding to that candidate keyword are all larger than respective threshold values.

The keyword mining apparatus for long documents of the present embodiment can execute the keyword mining method for long documents in the first embodiment of the present application, and the implementation principles thereof are similar, and are not described herein again.

Another embodiment of the present application provides a terminal including a processor and a memory storing a computer program; when the processor executes the computer program, the keyword mining method suitable for long documents is implemented.

In particular, the processor may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like.

In particular, the processor is coupled to the memory via a bus, which may include a path for communicating information. The bus may be a PCI bus or EISA bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc.

The memory may be, but is not limited to, ROM or other type of static storage device that can store static information and instructions, RAM or other type of dynamic storage device that can store information and instructions, EEPROM, CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

Optionally, the memory is used for storing codes of computer programs for executing the scheme of the application, and the processor is used for controlling the execution. The processor is configured to execute the application program code stored in the memory to implement the actions of the keyword mining apparatus for long documents provided by the above embodiments.

Yet another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the above-described keyword mining method applicable to long documents.

The above-described embodiments of the apparatus are merely illustrative, and the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A keyword mining method suitable for long documents is characterized by comprising the following steps:

determining candidate keyword sets corresponding to the second texts respectively;

performing relevance calculation on the candidate keyword sets respectively corresponding to the second texts to obtain relevance indexes corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts;

and screening the keywords according to the relevance indexes corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts to obtain a keyword dictionary for the target cluster.

2. The method of claim 1, wherein the step of performing cluster analysis on each first text to determine each second text belonging to the target cluster class comprises:

3. The method of claim 2, wherein the vectorizing of the characters corresponding to the first texts comprises:

4. The method of claim 1, wherein before the step of vectorizing the characters corresponding to the first texts, the method further comprises:

5. The method of claim 1, wherein the step of determining the candidate keyword sets corresponding to the second texts respectively comprises:

6. The method of claim 1, wherein the relevance indicator comprises at least one of:

support degree, confidence degree, promotion degree and information entropy.

7. The method according to claim 1, wherein the step of performing keyword screening according to a plurality of relevance indicators corresponding to the keywords included in the candidate keyword sets respectively corresponding to the second texts to obtain the keyword dictionary for the target cluster class comprises:

and if the support degree, the confidence degree, the promotion degree and the information entropy corresponding to any candidate keyword contained in the candidate keyword set corresponding to any second text are all larger than respective threshold values, determining that candidate keyword as a target keyword, and adding the target keyword into the keyword dictionary for the target cluster class.

8. A keyword mining apparatus suitable for a long document, comprising:

the text acquisition module is used for acquiring each first text to be processed, which is obtained by speech recognition ASR conversion;

the cluster analysis module is used for carrying out cluster analysis on each first text and determining each second text belonging to the target cluster;

and the keyword processing module is used for screening keywords according to the relevance indexes corresponding to the keywords in the candidate keyword set corresponding to each second text to obtain a keyword dictionary for the target cluster class.

9. An electronic device, comprising:

one or more processors;

a memory;

one or more applications, wherein the one or more applications are stored in the memory and configured for execution by the one or more processors, the one or more applications being configured to perform the method according to any one of claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.