CN112487132A - Keyword determination method and related equipment - Google Patents

Keyword determination method and related equipment

Info

Publication number
CN112487132A
Authority
CN
China
Prior art keywords
word
document
words
score
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910863849.4A
Other languages
Chinese (zh)
Inventor
戴泽辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201910863849.4A
Publication of CN112487132A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Abstract

The invention provides a keyword determination method. The method obtains document sets of a plurality of different clusters, segments each document in the document set of any one cluster according to a plurality of different segmentation lengths to obtain words of different word-number lengths, calculates an occurrence frequency score, a length weight score and a composite score for each word, and determines a preset number of words with the highest composite scores as the keywords of that cluster's document set. Because the documents are segmented into words of several lengths and the score of each word is corrected by a length weight, longer words score higher, the determined keywords reflect the text content more objectively, and the accuracy of keyword determination is improved. In addition, the invention provides a keyword determination apparatus to ensure that the method can be applied and implemented in practice.

Description

Keyword determination method and related equipment
Technical Field
The invention relates to the technical field of natural language processing, in particular to a keyword determining method and related equipment.
Background
With the development of the information age, information is expressed in increasingly diverse ways, and text remains one of the most direct. Keywords are an abstraction of the topic of a text: they summarize its main content and help users understand it quickly. Because the volume of text is so large, a method for determining keywords is very important.
Current keyword determination methods segment a text using a preset dictionary and then, based on the word segmentation result, select keywords with a keyword determination algorithm such as a statistical algorithm based on term frequency-inverse document frequency (tf-idf), the text ranking (textrank) algorithm, or a word entropy algorithm. However, the keywords obtained in this way are usually only 2-3 characters long, so the keywords cannot be determined accurately.
Disclosure of Invention
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
in a first aspect, the present invention provides a method for determining a keyword, including:
obtaining a plurality of different clustered document sets;
taking the document set of any cluster as the document set of the current cluster, and executing the following keyword determination operation:
performing word segmentation on each document contained in the current clustered document set according to various different segmentation lengths to obtain various words with different word number lengths;
calculating occurrence frequency scores of the terms in the document sets of the different clusters;
calculating length weight scores of the words based on the word number lengths of the words; wherein the length weight score is used to represent the degree of influence of the word number length of a word on the word being determined as a keyword;
calculating a composite score of each word based on the occurrence frequency score and the length weight score;
and determining the keywords of the current clustered document set according to the comprehensive scores of all the words.
In a second aspect, the present invention provides an apparatus for determining a keyword, including:
the acquiring unit is used for acquiring a plurality of document sets of different clusters;
the execution unit is used for taking the document set of any cluster as the document set of the current cluster, and executing the following keyword determination operation:
performing word segmentation on each document contained in the current clustered document set according to various different segmentation lengths to obtain various words with different word number lengths;
calculating occurrence frequency scores of the terms in the document sets of the different clusters;
calculating length weight scores of the words based on the word number lengths of the words; wherein the length weight score is used to represent the degree of influence of the word number length of a word on the word being determined as a keyword;
calculating a composite score of each word based on the occurrence frequency score and the length weight score;
determining the words with the comprehensive scores ranked in the top preset number as the keywords of the document set of the current cluster.
In a third aspect, the present invention provides a storage medium having a program stored thereon, the program, when executed by a processor, implementing the method for determining a keyword as described above.
In a fourth aspect, the present invention provides a keyword determining apparatus, including at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory complete mutual communication through a bus; the processor is used for calling the program instructions in the memory to execute the keyword determination method.
Compared with the prior art, the invention has the following advantages:
The invention provides a keyword determination method. The method obtains document sets of a plurality of different clusters, segments each document in the document set of any one cluster according to a plurality of different segmentation lengths to obtain words of different word-number lengths, calculates an occurrence frequency score, a length weight score and a composite score for each word, and determines a preset number of words with the highest composite scores as the keywords of that cluster's document set. In the invention, segmenting the documents yields words of several word-number lengths, all of which are scored, and the length weight corrects the score of each word so that longer words score higher. Words that objectively reflect the content of the documents are therefore more likely to be determined as keywords, and the determined keywords are more accurate.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below are merely embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for determining keywords according to the present invention;
FIG. 2 is a schematic structural diagram of a keyword determination apparatus provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The invention is operational with numerous general purpose or special purpose computing device environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multi-processor apparatus, distributed computing environments that include any of the above devices or equipment, and the like.
The invention provides a keyword determination method. FIG. 1 shows a flowchart of the method according to an embodiment of the invention; the method includes steps S101 and S102.
S101: Document sets of a plurality of different clusters are obtained.
Specifically, the document sets may be obtained by a clustering algorithm, for example a hierarchical tree clustering algorithm. Each document set comprises one or more documents, and each document is a collection of text. The clustering criterion may be the subject matter of the documents, so that documents with different subject matter fall into different clusters. For example, the document set of one cluster contains documents on the subject "science and technology", and the document set of another cluster contains documents on the subject "demon".
Specifically, one implementation of obtaining a document set of a plurality of different clusters may include the following steps:
First, a corpus in a specific field is obtained. Each text in the corpus can serve as a document in the invention, and the documents are segmented using language technology software, such as the LTP language technology platform of Harbin Institute of Technology, to obtain a word segmentation result for each document. A corpus is language material that occurs in the actual use of a language; it is generally stored in a corpus database of the corresponding field, and the specific field can be determined by the user's requirements. Segmenting a document yields word units of a preset word-number length. For example, if the content of a document is "the international crude oil futures price oscillates between $89 and $106/barrel throughout the month", the word segmentation result of the document is: international, crude oil, futures, price, whole month, at, 89, -, 106, dollars, /, barrel, range, oscillate.
Then, the word frequency of each word unit is counted, and a dictionary is constructed from the word frequencies; for example, the dictionary can consist of the word units whose word frequency is greater than or equal to a threshold, such as 3. The word segmentation result of each document is then converted into a one-hot vector in the dictionary space.
Specifically, the word units of each document are converted into a vector of 0s and 1s according to the position of each word unit in the dictionary. A 0 indicates that the dictionary word at that position does not appear in the document's segmentation result, and a 1 indicates that it does. For example, if the dictionary is "barrel, crude oil, futures, price" and the segmentation result of a certain document is "crude oil, futures", the one-hot vector converted from the segmentation result is "0, 1, 1, 0".
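The dictionary construction and vector conversion described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `build_dictionary` and `to_indicator_vector` are hypothetical, and sorting the dictionary alphabetically is an assumption made only to fix the positions.

```python
from collections import Counter
from typing import List, Sequence


def build_dictionary(segmented_docs: Sequence[Sequence[str]], min_freq: int = 3) -> List[str]:
    """Keep word units whose total frequency is >= min_freq (3 is the example threshold)."""
    counts = Counter(unit for doc in segmented_docs for unit in doc)
    # Alphabetical order is an assumption; any fixed order works.
    return sorted(unit for unit, freq in counts.items() if freq >= min_freq)


def to_indicator_vector(doc_units: Sequence[str], dictionary: Sequence[str]) -> List[int]:
    """Return 1 at each dictionary position whose word occurs in the document, else 0."""
    present = set(doc_units)
    return [1 if entry in present else 0 for entry in dictionary]


# The example from the text: dictionary and one document's segmentation result.
dictionary = ["barrel", "crude oil", "futures", "price"]
vector = to_indicator_vector(["crude oil", "futures"], dictionary)
print(vector)  # [0, 1, 1, 0]
```

Because a document usually contains several dictionary words, the resulting vector generally has more than one 1; the text nevertheless calls it a one-hot vector.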
After the one-hot vector of each document is obtained, each one-hot vector is converted into a feature vector of the document by using a feature vector algorithm.
The feature vector algorithm may be the tf-idf (term frequency-inverse document frequency) weighting algorithm, the LDA (Linear Discriminant Analysis) algorithm, or the like. The one-hot vector of each document is input into the feature vector algorithm, which computes the feature vector of that document.
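As one possible reading of the tf-idf option, the sketch below weights each 0/1 entry by an unsmoothed inverse document frequency, log(N / df). The function name `tfidf_vectors` and the choice of this particular idf variant are assumptions; the text does not specify which tf-idf formulation is used.

```python
import math
from typing import List, Sequence


def tfidf_vectors(indicator_vectors: Sequence[Sequence[int]]) -> List[List[float]]:
    """Weight each 0/1 entry by idf = log(N / df); terms with df == 0 get weight 0."""
    n_docs = len(indicator_vectors)
    n_terms = len(indicator_vectors[0]) if indicator_vectors else 0
    # Document frequency: in how many documents each dictionary word occurs.
    df = [sum(vec[j] for vec in indicator_vectors) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[vec[j] * idf[j] for j in range(n_terms)] for vec in indicator_vectors]


# Two documents over the dictionary "barrel, crude oil, futures, price".
features = tfidf_vectors([[0, 1, 1, 0], [0, 1, 0, 1]])
```

In this variant a word that occurs in every document ("crude oil" here) receives weight 0, while a word found in only one document receives weight log 2.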
Finally, the feature vectors of the documents are clustered by using a clustering algorithm to obtain a plurality of different clusters.
A clustering algorithm is an unsupervised algorithm that groups texts with similar content, and it is the basis of many downstream tasks. It can place documents with similar content, i.e., with the same subject matter, into the same cluster. For example, the two texts "how do retirement fees are higher than payroll, see americans" and "what we take to save retired oneself?" are similar in content and can be placed in the same class, whereas "the aged industry development promoted by the ecological development platform of the health industry built by the insurance profit" cannot be placed in that class.
The clustering algorithm may be the K-Means clustering algorithm, a hierarchical tree clustering algorithm such as the BIRCH clustering algorithm, or the like. Based on the feature vector of each document and a preset threshold, the clustering algorithm groups the feature vectors into a plurality of different clusters, each containing several documents.
S102: For the document set of any one cluster, the following keyword determination operation is executed: performing word segmentation on each document contained in the document set of the cluster according to a plurality of different segmentation lengths to obtain words of different word-number lengths; calculating the occurrence frequency score of each word in the document sets of the different clusters; calculating the length weight score of each word based on its word-number length, wherein the length weight score represents the degree to which the word-number length of a word influences whether the word is determined as a keyword; calculating a composite score of each word based on the occurrence frequency score and the length weight score; and determining the keywords of the cluster's document set according to the composite scores of the words.
Specifically, the previous step yields multiple clusters. In this step, for any one of those clusters, keywords are determined for the documents in the cluster by the keyword determination operation. The cluster on which the operation is being performed is referred to as the current cluster; accordingly, "any one cluster" in each step means the current cluster.
The keyword determination operation segments the documents according to a segmentation length. The segmentation length may be preset and is determined by the word-number length of the words to be obtained; for example, to obtain words of length 2, the segmentation length is set to 2.
There are multiple segmentation lengths, so words of several different lengths are obtained. For example, the segmentation lengths may be 2, 3, 4 and 5, which yields words of lengths 2, 3, 4 and 5. The segmentation length may refer to the number of characters in a word or to the number of word units in a word. Taking word units as an example, segmentation lengths of 2, 3, 4 and 5 mean that the segmented words contain 2, 3, 4 and 5 word units respectively. For example, "fund company" is a word of 2 word units, "Chinese fund company" is a word of 3 word units, "Chinese fund company stock" is a word of 4 word units, and "Chinese fund company stock prediction" is a word of 5 word units.
It should be noted that, because there are multiple segmentation lengths, the document may be segmented multiple times, each pass cutting the document by one segmentation length. Alternatively, the document may first be segmented into word units, and adjacent word units are then combined according to the number of word units specified by the segmentation length. For example, suppose preliminary segmentation of the document "A fund company is the best fund company" yields the word units "A", "fund", "company", "is", "best", "fund", "company", and the segmentation length is 2. Combining each pair of adjacent word units then yields the words of segmentation length 2: "A fund", "fund company", "company is", "is best", "best fund", "fund company". Words of other segmentation lengths are obtained in the same way.
After the words are obtained through the segmentation operation, the number of occurrences of each word in the document set is counted. In the segmentation result above, the words "A fund", "company is", "is best" and "best fund" each occur once, and "fund company" occurs twice. That is, the same document is segmented under different word-unit-count settings to obtain words composed of different numbers of word units, and the occurrences of the words of each length are counted.
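The second segmentation approach, combining adjacent word units and counting the resulting words, can be sketched as follows. The function name `segment_by_lengths` and the English word units (a rendering of the fund-company example) are illustrative assumptions; joining units with a space is chosen only for readability.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def segment_by_lengths(units: Sequence[str], lengths: Iterable[int] = (2, 3, 4, 5)) -> List[str]:
    """Combine adjacent word units into words containing n units, for each n in lengths."""
    words = []
    for n in lengths:
        # Slide a window of n word units across the document.
        for start in range(len(units) - n + 1):
            words.append(" ".join(units[start:start + n]))
    return words


# Assumed preliminary segmentation of the example sentence.
units = ["A", "fund", "company", "is", "best", "fund", "company"]
bigrams = segment_by_lengths(units, lengths=[2])
counts = Counter(bigrams)
print(counts["fund company"])  # 2
```

Calling `segment_by_lengths(units)` with the default lengths returns the words of 2, 3, 4 and 5 word units in a single list, which can be counted in the same way.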
The number of occurrences may be regarded as one form of occurrence frequency; alternatively, the occurrence frequency may be the ratio of the number of occurrences of a word to the total number of words. An occurrence frequency score is calculated from the occurrence frequency such that the higher the occurrence frequency, the higher the score. Specifically, a frequency score calculation formula is used to calculate the occurrence frequency score of each segmented word in the obtained document sets of the plurality of different clusters.
Further, based on the word count length of the word, a length weight score of each word is calculated. The length weight score is used to represent the degree of influence of the word number length of a word on the word being determined as a keyword.
When determining keywords, words with a larger word-number length often convey the specific content of a document better. In the prior art, however, text is segmented only into shorter words, and the occurrence frequency score does not account for length, so a longer keyword with stronger expressive power may score markedly lower than a shorter keyword with weaker expressive power.
To address this, the method introduces a length weight score, which prevents this situation from distorting keyword determination.
Specifically, the way the length weight score is calculated can be tuned through repeated experiments, so that the calculated length weight score allows a long word to rank ahead of a word with a smaller word-number length and a higher occurrence frequency score.
It should be noted that the present invention does not specifically limit the definition of a long word; those skilled in the art may define it according to the common understanding of word length in the field.
The composite score of each word is calculated from its occurrence frequency score and its length weight score. After the composite scores are obtained, the words are ranked by composite score, and the keywords of the current cluster's document set are determined from the ranking. Specifically, the composite scores are sorted from high to low, and a preset number of top-ranked words are determined as the keywords of the cluster's document set.
Specifically, the step of calculating the occurrence frequency scores of the terms in the document sets of the different clusters comprises the following steps:
The occurrence frequency of each word in the document set of the current cluster is calculated; the occurrence inverse frequency of each word in the document sets of the clusters other than the current cluster is calculated; and the product of the occurrence frequency and the occurrence inverse frequency of each word is taken as the occurrence frequency score of that word in the document sets of the plurality of different clusters.
Here, the occurrence inverse frequency is the reciprocal of the occurrence frequency of the word in the document sets of the clusters other than the current cluster. The higher the occurrence inverse frequency, the lower the occurrence frequency of the word in the document sets of the other clusters; conversely, the lower the occurrence inverse frequency, the higher the occurrence frequency of the word in the document sets of the other clusters. The occurrence inverse frequency may also be the reciprocal of the occurrence frequency of the word in the document sets of certain designated clusters. The other clusters thus serve as comparison clusters for correcting the occurrence frequency score; they may be all clusters other than the current cluster, or some designated clusters other than the current cluster.
The term "inverse frequency" describes how a word occurs in the document sets of the other clusters: the more frequently a word occurs in those document sets, the smaller its inverse frequency, and the less likely it is to be a keyword of the current cluster's documents. Multiplying the occurrence frequency of the word in the current cluster's document set by its occurrence inverse frequency in the other clusters therefore reflects how likely the word is to be a keyword; the larger the product, the higher the likelihood. This processing considers both how the word occurs in the current cluster's document set and how it occurs in the other clusters' document sets. Taking the reciprocal and multiplying quantifies the two conditions in a single value, and correcting the likelihood through the inverse frequency makes the keyword determination result more objective and accurate.
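A minimal Python sketch of the product described above, assuming the two frequencies have already been computed. The function names and the `floor` parameter, which guards against division by zero when a word never occurs in the other clusters, are illustrative assumptions and are not specified in the text.

```python
def inverse_frequency(freq_in_other_clusters: float, floor: float = 1e-6) -> float:
    """Reciprocal of the word's occurrence frequency in the comparison clusters.

    `floor` is an assumed safeguard: a frequency of 0 would otherwise divide by zero.
    """
    return 1.0 / max(freq_in_other_clusters, floor)


def frequency_score(freq_in_current_cluster: float,
                    freq_in_other_clusters: float,
                    floor: float = 1e-6) -> float:
    """Occurrence frequency in the current cluster times the occurrence inverse frequency."""
    return freq_in_current_cluster * inverse_frequency(freq_in_other_clusters, floor)


# Same frequency in the current cluster, rarer elsewhere -> higher score.
rare_elsewhere = frequency_score(0.5, 0.25)
common_elsewhere = frequency_score(0.5, 0.5)
```

With these inputs the word that is rarer in the other clusters scores 2.0 and the word that is equally common elsewhere scores 1.0, matching the direction of the correction described in the text.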
Specifically, the step of calculating the occurrence frequency of each term in the document set of any one cluster is as follows:
The number of documents contained in the document set of the current cluster is counted, together with the number of those documents in which each word occurs; the occurrence frequency of each word in the document set of the current cluster is then calculated from that occurrence count and the number of documents.
Correspondingly, the step of calculating the occurrence inverse frequency of each word in the document sets of the clusters other than the current cluster comprises the following steps:
The number of documents in the document sets of the other clusters in which each word occurs is counted; the occurrence inverse frequency of each word in the document sets of the other clusters is then calculated from that count.
Specifically, the occurrence frequency may be calculated by dividing the occurrence count by the total, that is, the number of documents in a cluster's document set in which the word occurs divided by the total number of documents in that set; correspondingly, the occurrence inverse frequency is obtained by the same division followed by taking the reciprocal. The specific formula is as follows:
[Formula image BDA0002200663160000081; not reproduced in the text extraction]
where a is the number of documents in which the word occurs, b is the number of documents in which it does not occur, and a + b is the total number of documents in the category. The occurrence frequency is first obtained as a/(a + b); in the formula this frequency is squared to increase its influence. The log term that multiplies it computes the occurrence inverse frequency of the word in the document sets of the clusters other than the current cluster: the more frequently the word occurs in the other clusters, the smaller this term, indicating that the word is not characteristic of the current cluster but common to all clusters. Overall, the formula means that the more frequently a word occurs in the current cluster and the less it occurs in the other categories, the higher its occurrence frequency score.
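The structure described above, a squared occurrence frequency multiplied by a log term, can be sketched as follows. The squared factor (a/(a + b))^2 follows the text, but the log term here is an assumption: the formula image is not available, so a smoothed idf-style term log((n_other + 1) / (c + 1)) is substituted, where the hypothetical parameters `c` and `n_other` are the number of documents in the other clusters that contain the word and the total number of documents in those clusters.

```python
import math


def frequency_score_from_counts(a: int, b: int, c: int, n_other: int) -> float:
    """Squared in-cluster frequency times an ASSUMED log-based inverse-frequency term.

    a: documents in the current cluster that contain the word
    b: documents in the current cluster that do not contain the word
    c: documents in the other clusters that contain the word (hypothetical parameter)
    n_other: total documents in the other clusters (hypothetical parameter)
    """
    if a + b == 0:
        raise ValueError("the current cluster must contain at least one document")
    frequency = a / (a + b)
    # Assumed form of the log term; it shrinks as the word gets more common elsewhere.
    log_term = math.log((n_other + 1) / (c + 1))
    return frequency ** 2 * log_term


characteristic = frequency_score_from_counts(a=8, b=2, c=0, n_other=9)
common_everywhere = frequency_score_from_counts(a=8, b=2, c=9, n_other=9)
```

Under this assumed log term, a word that appears in every document of the other clusters scores 0, while the same in-cluster frequency with no occurrences elsewhere gives the highest score.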
Alternatively, the occurrence frequency may be obtained by dividing the number of occurrences of a word in all documents by the total number of occurrences of all words with the same word-number length. For example, if "fund" occurs 50 times, its word-number length is 2, and words of length 2 occur 500 times in total, then a in the above formula is 50, and a + b is replaced by a parameter expressing the total number of occurrences, whose value is 500. The occurrence frequency score of each word in the document sets of the plurality of different clusters can thus also be obtained from occurrence counts.
Specifically, based on the word-number length of each word, the length weight score is calculated as follows:
the number of phrases contained in the word is determined and used as the exponent of an exponential expression with a preset base to calculate an exponential correction score; a length correction score is calculated over the lengths included in the word-number length of the word;
the specific formula is as follows:
[Formula image BDA0002200663160000091; not reproduced in the text extraction]
where 2 is the preset base, n is the number of phrases contained in the word, and l is the word-number length of the word. The result of the formula is a length correction term. Put simply, the larger the number of phrases n and the larger the word-number length l, the higher the length weight score; conversely, the smaller n and l, the lower the length weight score. The formula grows approximately exponentially as the length and the number of phrases change. Using this formula to calculate the length weight score of each word therefore better ensures that a long word ranks ahead of a word with a smaller word-number length and a higher occurrence frequency score.
After the occurrence frequency score and the length weight score are obtained, the composite score of each word is calculated from them. Specifically, the product of the occurrence frequency score and the length weight score of each word is taken as its composite score:
$\mathrm{score} = \mathrm{score}_{\mathrm{freq}} \times \mathrm{score}_{\mathrm{length}}$
As the formula shows, the length weight score acts as an adjustment parameter that corrects the occurrence frequency score, avoiding the situation described above in which a longer keyword with stronger expressive power scores markedly lower than a shorter keyword with weaker expressive power. Multiplying the two gives a corrected composite score that better reflects the expressive power of the word.
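The product can be sketched in Python as below. The composite score follows the formula above; the length weight, however, is only a partial stand-in: it uses the exponential part described in the text (preset base 2 raised to the number of phrases n) and omits the length correction over l, because the full length-weight formula is not reproduced here. The function names are hypothetical.

```python
def length_weight(num_phrases: int, base: float = 2.0) -> float:
    """ASSUMED simplification: only the exponential part, base ** n.

    The source formula also involves the word-number length l, which is omitted here.
    """
    return base ** num_phrases


def composite_score(freq_score: float, num_phrases: int, base: float = 2.0) -> float:
    """score = score_freq * score_length, as in the formula above."""
    return freq_score * length_weight(num_phrases, base)


# A longer word with a lower frequency score can outrank a shorter, more frequent one.
long_word = composite_score(freq_score=0.5, num_phrases=3)   # 0.5 * 2**3
short_word = composite_score(freq_score=0.9, num_phrases=2)  # 0.9 * 2**2
```

Here the three-phrase word ends up ahead of the two-phrase word even though its occurrence frequency score is lower, which is the ranking behaviour the length weight is meant to produce.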
Specifically, determining the keywords of the document set of the current cluster according to the composite scores of the words comprises the following steps:
sorting the composite scores from high to low, and determining the preset number of top-ranked words as the keywords of the document set of the current cluster, wherein the preset number is set manually, for example according to the number of keywords the user wants to see based on historical experience. For example, the preset number may be 5, i.e., the top 5 ranked words are determined as the keywords of the document set of this cluster and returned to the user.
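The scoring and selection steps above can be sketched in Python as follows. This is a minimal illustration that assumes the occurrence frequency scores and length weight scores have already been computed and are held in dictionaries keyed by word; the function name `top_keywords` and the tie-breaking rule (alphabetical order) are assumptions, not part of the described method.

```python
from typing import Dict, List, Tuple


def top_keywords(
    freq_scores: Dict[str, float],
    length_scores: Dict[str, float],
    preset_number: int = 5,
) -> List[Tuple[str, float]]:
    """Rank words by composite score and keep the top `preset_number`."""
    # Composite score = occurrence frequency score * length weight score.
    composite = {
        word: freq_scores[word] * length_scores[word]
        for word in freq_scores
        if word in length_scores
    }
    # Sort from high to low; ties are broken alphabetically (an assumption).
    ranked = sorted(composite.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:preset_number]


freq = {"financing": 0.9, "robust financing": 0.5}
length = {"financing": 1.0, "robust financing": 4.0}
# The longer word overtakes the shorter one once the length weight is applied.
best = top_keywords(freq, length, preset_number=1)
```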
It should be noted that the keyword determination methods currently in use first segment a text into words using a preset dictionary and then apply a keyword determination algorithm to extract keywords from the segmentation result. To ensure high accuracy, the segmentation granularity is fine, so the keywords extracted by conventional keyword determination algorithms such as TextRank are mainly 2 to 3 characters long and cannot necessarily meet the need for long keywords. For example, in the field of pension user review analysis, the keyword "robust financing" needs to be determined from the sentence "The most cost-effective robust financing on Alipay, WeChat and JD.com … …!", but the segmentation result splits it into the two words "robust" and "financing". As another example, in the field of security checking of industry internet text, the keyword "international crude oil futures price" needs to be determined from the sentence "the international crude oil futures price fluctuated between $89 and $106 per barrel throughout the month", but the segmentation result usually splits it into the four words "international", "crude oil", "futures" and "price". Although the granularity of the segmented words can be increased by customizing a user dictionary, constructing such a dictionary requires a great deal of manual work and is time-consuming and labor-intensive.
By contrast, the keyword determination method provided by the present invention obtains document sets of a plurality of different clusters, segments each document contained in the document set of any cluster according to a plurality of different segmentation lengths to obtain words of a plurality of different word number lengths, calculates the occurrence frequency score, length weight score and composite score of each word, and determines the preset number of words with the highest composite scores as the keywords of the document set of that cluster. Unlike other open-source algorithms, which can only determine short keywords or rely on manually constructed dictionaries to determine feature words, the method segments the document into words of different word number lengths, scores those words, and corrects each score with the length weight during scoring. Longer words therefore receive higher scores and are more likely to be selected, so that the keywords reflect the content of the document objectively and the accuracy of the determined keywords is improved.
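To make the idea of segmenting a document by several different segmentation lengths concrete, here is a small Python sketch. It assumes that segmentation by a given length means sliding a fixed-size character window across the text, which is one plausible reading and not confirmed by this section; the function name `segment` and the default lengths are hypothetical.

```python
from typing import Iterable, List


def segment(text: str, lengths: Iterable[int] = (2, 3, 4)) -> List[str]:
    """Collect every character window of each requested length.

    ASSUMPTION: a sliding character window stands in for the
    segmentation step; whitespace is removed first so that windows
    do not straddle spaces.
    """
    cleaned = "".join(text.split())
    words: List[str] = []
    for n in lengths:
        if n <= 0:
            raise ValueError("segmentation lengths must be positive")
        # range() is empty when the text is shorter than the window.
        for start in range(len(cleaned) - n + 1):
            words.append(cleaned[start:start + n])
    return words


candidates = segment("abcd", lengths=(2, 3))
```

Because every length produces its own candidates, a long keyword and its shorter fragments all enter the scoring stage, where the length weight decides between them.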
The method has considerable application value in fields such as sentiment analysis of comment text, text retrieval and text recommendation. Determining text keywords is an upstream task in these fields that provides data support for downstream text processing tasks; if keyword extraction is inaccurate, the effectiveness of the downstream text processing suffers. For example, in the field of text recommendation, after a user reads a certain text, other texts with similar content can be recommended to the user. To this end, the massive number of texts on the internet can be processed in advance to extract the keywords of each text. A target text sharing keywords with the text the user has read is then searched for among these texts and recommended to the user. Because the keywords determined by the method reflect the text content more objectively and more accurately, the texts pushed to the user are more accurate. As another example, a paper website can cluster and manage papers according to their keywords; if the determined keywords are inaccurate, the papers are classified poorly, which affects the overall management work. It should be noted that these application scenarios are only examples, and the keyword determination scheme provided by the present invention may also be applied to other scenarios in which text is processed according to keywords.
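The text recommendation scenario can be illustrated with a short Python sketch. It assumes that the keywords of every text have already been extracted and that "similar" simply means sharing at least one keyword; the names `build_keyword_index` and `recommend` are hypothetical and the matching rule is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_keyword_index(doc_keywords: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each keyword to the ids of the texts that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return dict(index)


def recommend(
    read_doc_id: str,
    doc_keywords: Dict[str, Set[str]],
    index: Dict[str, Set[str]],
) -> List[str]:
    """Return ids of other texts sharing a keyword with the text just read."""
    hits: Set[str] = set()
    for keyword in doc_keywords[read_doc_id]:
        hits |= index.get(keyword, set())
    hits.discard(read_doc_id)  # do not recommend the text itself
    return sorted(hits)


doc_keywords = {
    "a": {"robust financing", "fund"},
    "b": {"robust financing"},
    "c": {"crude oil"},
}
index = build_keyword_index(doc_keywords)
suggestions = recommend("a", doc_keywords, index)
```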
Further, after determining the keyword, the embodiment of the present invention may further include the following steps:
searching the document set of the current cluster for a target document containing the keyword;
and adding a keyword tag containing the keyword to the target document.
Specifically, after keywords corresponding to each document are acquired, a keyword tag containing the keywords is added to each document. Alternatively, keyword tags may be added to the clusters. By means of the keyword tags, it can be intuitively determined which keywords are contained in the document, or the document can be retrieved by means of the keyword tags.
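The tagging step can be sketched as follows in Python. The sketch assumes that a document "contains" a keyword when the keyword occurs as a substring of its text and that tags are stored as a list per document id; both choices, and the name `add_keyword_tags`, are illustrative assumptions.

```python
from typing import Dict, List


def add_keyword_tags(
    documents: Dict[str, str],
    keywords: List[str],
) -> Dict[str, List[str]]:
    """Attach to each document the keywords that occur in its text."""
    tags: Dict[str, List[str]] = {}
    for doc_id, text in documents.items():
        # Simple substring containment stands in for "contains the keyword".
        tags[doc_id] = [keyword for keyword in keywords if keyword in text]
    return tags


documents = {
    "d1": "robust financing is popular",
    "d2": "crude oil futures",
}
tags = add_keyword_tags(documents, ["robust financing", "futures"])
```

The resulting mapping can be inverted to retrieve documents by tag.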
Corresponding to the method described in Fig. 1, an embodiment of the present invention further provides a device for determining a keyword, which specifically implements the method in Fig. 1. A schematic structural diagram of the device is shown in Fig. 2, and the device specifically includes:
an obtaining unit 201, configured to obtain a document set of a plurality of different clusters;
an executing unit 202, configured to take the document set of any cluster as the document set of the current cluster, and execute the following keyword determination operations: performing word segmentation on each document contained in the document set of the current cluster according to a plurality of different segmentation lengths to obtain a plurality of words of different word number lengths; calculating the occurrence frequency score of each word in the document sets of the plurality of different clusters; calculating the length weight score of each word based on the word number length of the word, wherein the length weight score represents the degree to which the word number length of a word influences whether the word is determined as a keyword; calculating a composite score of each word based on the occurrence frequency score and the length weight score; and determining the keywords of the document set of the current cluster according to the composite scores of the words.
In one implementation, the execution unit is configured to calculate the occurrence frequency scores of the words in the document sets of the plurality of different clusters by:
calculating the occurrence frequency of each word in the document set of the current cluster;
calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster;
and taking the product of the occurrence frequency and the inverse occurrence frequency of each word as the occurrence frequency score of that word in the document sets of the plurality of different clusters.
In one implementation, the execution unit is configured to calculate the occurrence frequency of each word in the document set of the current cluster by:
counting the number of documents contained in the document set of the current cluster and the occurrence count of each word in the document set of the current cluster;
and calculating the occurrence frequency of each word in the document set of the current cluster based on the occurrence count and the number of documents.
In one implementation, the execution unit is configured to calculate the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster by:
counting the occurrence count of each word in the document sets of the clusters other than the current cluster;
and calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster based on the occurrence count.
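The three calculations above can be sketched in Python as shown below. The exact formulas are not stated in this passage, so the sketch makes two labeled assumptions: the occurrence count is the number of documents in which the word appears, and the inverse frequency is 1 / (1 + occurrence count in the other clusters). Function names are hypothetical.

```python
from typing import List


def occurrence_frequency(word: str, cluster_docs: List[str]) -> float:
    """Occurrence count in the current cluster divided by its document count."""
    if not cluster_docs:
        return 0.0
    # ASSUMPTION: occurrence count = number of documents containing the word.
    count = sum(1 for doc in cluster_docs if word in doc)
    return count / len(cluster_docs)


def inverse_frequency(word: str, other_docs: List[str]) -> float:
    """Shrinks as the word appears in more documents of the other clusters."""
    count = sum(1 for doc in other_docs if word in doc)
    # ASSUMPTION: 1 / (1 + count); the source only says the value is
    # computed from the occurrence count in the other clusters.
    return 1.0 / (1 + count)


def frequency_score(word: str, cluster_docs: List[str], other_docs: List[str]) -> float:
    """Occurrence frequency score = occurrence frequency * inverse frequency."""
    return occurrence_frequency(word, cluster_docs) * inverse_frequency(word, other_docs)


current_cluster = ["ab cd", "ab ef"]
other_clusters = ["ab", "zz"]
score_ab = frequency_score("ab", current_cluster, other_clusters)
score_ef = frequency_score("ef", current_cluster, other_clusters)
```

A word that is frequent in the current cluster but also common elsewhere is penalized, in the same spirit as TF-IDF.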
In one implementation, the execution unit is configured to calculate the length weight score of each word based on the word number length of the word by:
determining the number of phrases contained in the word, using the number of phrases as the exponent of an exponential expression with a preset base, and calculating an exponential correction score;
calculating a length correction score over all lengths included in the word number length of the word;
and taking the product of the exponential correction score and the length correction score of the word as the length weight score of the word.
In one implementation, the execution unit is configured to calculate the composite score of each word based on the occurrence frequency score and the length weight score by:
taking the product of the occurrence frequency score and the length weight score of each word as the composite score of that word.
In one implementation, the device further includes:
an adding unit, configured to search the document set of the current cluster for a target document containing the keyword and to add a keyword tag containing the keyword to the target document.
The device for determining the keyword comprises a processor and a memory, wherein the acquiring unit, the executing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of keyword determination is improved by adjusting the kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the method for determining the keyword when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for determining the keywords is executed when the program runs.
The embodiment of the invention provides equipment, which comprises at least one processor, at least one memory and a bus, wherein the memory and the bus are connected with the processor; the processor and the memory complete mutual communication through a bus; the processor is used for calling the program instructions in the memory to execute the keyword determination method. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The invention also provides a computer program product adapted, when executed on a data processing device, to execute a program initialized with the following method steps:
obtaining document sets of a plurality of different clusters;
taking the document set of any cluster as the document set of the current cluster, and executing the following keyword determination operations:
performing word segmentation on each document contained in the document set of the current cluster according to a plurality of different segmentation lengths to obtain a plurality of words of different word number lengths;
calculating the occurrence frequency score of each word in the document sets of the plurality of different clusters;
calculating the length weight score of each word based on the word number length of the word, wherein the length weight score represents the degree to which the word number length of a word influences whether the word is determined as a keyword;
calculating a composite score of each word based on the occurrence frequency score and the length weight score;
and determining the keywords of the document set of the current cluster according to the composite scores of the words.
In one implementation, calculating the occurrence frequency scores of the words in the document sets of the plurality of different clusters includes: calculating the occurrence frequency of each word in the document set of the current cluster; calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster; and taking the product of the occurrence frequency and the inverse occurrence frequency of each word as the occurrence frequency score of that word in the document sets of the plurality of different clusters.
In one implementation, calculating the occurrence frequency of each word in the document set of the current cluster includes: counting the number of documents contained in the document set of the current cluster and the occurrence count of each word in the document set of the current cluster; and calculating the occurrence frequency of each word in the document set of the current cluster based on the occurrence count and the number of documents.
In one implementation, calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster includes: counting the occurrence count of each word in the document sets of the clusters other than the current cluster; and calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster based on the occurrence count.
In one implementation, calculating the length weight score of each word based on the word number length of the word includes: determining the number of phrases contained in the word, using the number of phrases as the exponent of an exponential expression with a preset base, and calculating an exponential correction score; calculating a length correction score over all lengths included in the word number length of the word; and taking the product of the exponential correction score and the length correction score of the word as the length weight score of the word.
In one implementation, calculating the composite score of each word based on the occurrence frequency score and the length weight score includes: taking the product of the occurrence frequency score and the length weight score of each word as the composite score of that word.
In one implementation, the method for determining the keyword further includes: searching the document set of the current cluster for a target document containing the keyword; and adding a keyword tag containing the keyword to the target document.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include non-permanent memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present invention, and are not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A method for determining a keyword, comprising:
obtaining document sets of a plurality of different clusters;
taking the document set of any cluster as the document set of the current cluster, and executing the following keyword determination operations:
performing word segmentation on each document contained in the document set of the current cluster according to a plurality of different segmentation lengths to obtain a plurality of words of different word number lengths;
calculating the occurrence frequency score of each word in the document sets of the plurality of different clusters;
calculating the length weight score of each word based on the word number length of the word, wherein the length weight score represents the degree to which the word number length of a word influences whether the word is determined as a keyword;
calculating a composite score of each word based on the occurrence frequency score and the length weight score;
and determining the keywords of the document set of the current cluster according to the composite scores of the words.
2. The method for determining a keyword according to claim 1, wherein calculating the occurrence frequency score of each word in the document sets of the plurality of different clusters comprises:
calculating the occurrence frequency of each word in the document set of the current cluster;
calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster;
and taking the product of the occurrence frequency and the inverse occurrence frequency of each word as the occurrence frequency score of that word in the document sets of the plurality of different clusters.
3. The method for determining a keyword according to claim 2, wherein calculating the occurrence frequency of each word in the document set of the current cluster comprises:
counting the number of documents contained in the document set of the current cluster and the occurrence count of each word in the document set of the current cluster;
and calculating the occurrence frequency of each word in the document set of the current cluster based on the occurrence count and the number of documents.
4. The method for determining a keyword according to claim 2, wherein calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster comprises:
counting the occurrence count of each word in the document sets of the clusters other than the current cluster;
and calculating the inverse occurrence frequency of each word in the document sets of the clusters other than the current cluster based on the occurrence count.
5. The method for determining a keyword according to claim 1, wherein calculating the length weight score of each word based on the word number length of the word comprises:
determining the number of phrases contained in the word, using the number of phrases as the exponent of an exponential expression with a preset base, and calculating an exponential correction score;
calculating a length correction score over all lengths included in the word number length of the word;
and taking the product of the exponential correction score and the length correction score of the word as the length weight score of the word.
6. The method for determining a keyword according to claim 1, wherein calculating the composite score of each word based on the occurrence frequency score and the length weight score comprises:
taking the product of the occurrence frequency score and the length weight score of each word as the composite score of that word.
7. The method for determining a keyword according to claim 1, further comprising:
searching a target document containing the keyword in the document set of the current cluster;
and adding a keyword tag containing the keyword to the target document.
8. An apparatus for determining a keyword, comprising:
an acquiring unit, configured to acquire document sets of a plurality of different clusters;
an execution unit, configured to take the document set of any cluster as the document set of the current cluster, and execute the following keyword determination operations:
performing word segmentation on each document contained in the document set of the current cluster according to a plurality of different segmentation lengths to obtain a plurality of words of different word number lengths;
calculating the occurrence frequency score of each word in the document sets of the plurality of different clusters;
calculating the length weight score of each word based on the word number length of the word, wherein the length weight score represents the degree to which the word number length of a word influences whether the word is determined as a keyword;
calculating a composite score of each word based on the occurrence frequency score and the length weight score; and determining the preset number of words with the highest composite scores as the keywords of the document set of the current cluster.
9. A storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the method for determining a keyword according to any one of claims 1 to 7.
10. A keyword determining device, comprising at least one processor, and at least one memory and a bus connected to the processor; wherein the processor and the memory communicate with each other through the bus; and the processor is configured to call program instructions in the memory to execute the method for determining a keyword according to any one of claims 1 to 7.
CN201910863849.4A 2019-09-12 2019-09-12 Keyword determination method and related equipment Pending CN112487132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910863849.4A CN112487132A (en) 2019-09-12 2019-09-12 Keyword determination method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863849.4A CN112487132A (en) 2019-09-12 2019-09-12 Keyword determination method and related equipment

Publications (1)

Publication Number Publication Date
CN112487132A true CN112487132A (en) 2021-03-12

Family

ID=74920528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863849.4A Pending CN112487132A (en) 2019-09-12 2019-09-12 Keyword determination method and related equipment

Country Status (1)

Country Link
CN (1) CN112487132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393838A (en) * 2021-06-30 2021-09-14 北京探境科技有限公司 Voice processing method and device, computer readable storage medium and computer equipment
CN113407584A (en) * 2021-06-29 2021-09-17 微民保险代理有限公司 Label extraction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528524A (en) * 2016-09-22 2017-03-22 中山大学 Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN108763196A (en) * 2018-05-03 2018-11-06 上海海事大学 A kind of keyword extraction method based on PMI
CN109408818A (en) * 2018-10-12 2019-03-01 平安科技(深圳)有限公司 New word identification method, device, computer equipment and storage medium
WO2019041524A1 (en) * 2017-08-31 2019-03-07 平安科技(深圳)有限公司 Method, electronic apparatus, and computer readable storage medium for generating cluster tag

Similar Documents

Publication Publication Date Title
Gopi et al. Classification of tweets data based on polarity using improved RBF kernel of SVM
Atoum Computer and Information Sciences
Singh et al. Sentiment analysis of movie reviews and blog posts
US20190163807A1 (en) Feature vector profile generation for interviews
US20190340688A1 (en) Utilizing artificial intelligence to make a prediction about an entity based on user sentiment and transaction history
Jiang et al. Analyzing market performance via social media: a case study of a banking industry crisis
US10417338B2 (en) External resource identification
Castro et al. Smoothed n-gram based models for tweet language identification: A case study of the brazilian and european portuguese national varieties
Nokhbeh Zaeem et al. PrivacyCheck v2: A tool that recaps privacy policies for you
CN110990532A (en) Method and device for processing text
CN116521865A (en) Metadata classification method, storage medium and system based on automatic identification technology
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
Saleiro et al. TexRep: A text mining framework for online reputation monitoring
CN112487132A (en) Keyword determination method and related equipment
Altuncu et al. Graph-based topic extraction from vector embeddings of text documents: Application to a corpus of news articles
Abdi et al. Using an auxiliary dataset to improve emotion estimation in users’ opinions
CN112487181A (en) Keyword determination method and related equipment
CN114445043B (en) Open ecological cloud ERP-based heterogeneous graph user demand accurate discovery method and system
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN114741501A (en) Public opinion early warning method and device, readable storage medium and electronic equipment
Kustanto et al. Sentiment Analysis of Indonesia’s National Health Insurance Mobile Application using Naïve Bayes Algorithm
EP3745281A1 (en) Providing machine-learning training data for a differentiator module to identify document properties
Sriphaew et al. Cool blog identification using topic-based models
Liu et al. Stratify Mobile App Reviews: E-LDA Model Based on Hot "Entity" Discovery
Cheng et al. A model for age and gender profiling of social media accounts based on post contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination