CN114461783A - Keyword generation method and device, computer equipment, storage medium and product

Info

Publication number: CN114461783A
Application number: CN202210044513.7A
Authority: CN
Inventors: 蒋乐怡
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2022-05-10

Abstract

The embodiments of the application disclose a keyword generation method and apparatus, a computer device, a storage medium and a product. A first candidate keyword corresponding to a text set is obtained, together with the relevance and the inverse class frequency of the first candidate keyword with respect to the text set; the discrimination of the first candidate keyword is calculated according to the inverse class frequency and the relevance; second candidate keywords meeting a preset condition are screened from the first candidate keywords according to the discrimination; and the second candidate keywords are clustered to obtain target keywords for text screening. Because second candidate keywords with strong exclusivity are screened from the first candidate keywords according to the relevance and the inverse class frequency and are then clustered, target keywords with strong exclusivity can be determined from the second candidate keywords, so that target texts can be screened accurately on the basis of the target keywords, the exclusivity of the generated target keywords is improved, and the labor cost is reduced.

Description

Keyword generation method and device, computer equipment, storage medium and product

Technical Field

The present application relates to the field of communications technologies, and in particular, to a keyword generation method, apparatus, computer device, storage medium, and product.

Background

Filtering text by keywords is a common text processing method. For example, when a certain class of merchants needs to be identified, screening can be implemented by matching keywords in the merchant names. However, extracting the required keywords directly from the text is time consuming, it is difficult to enumerate all the keywords, and when new text appears, the old keywords may no longer cover the text that needs to be screened.

Keywords can be extracted from text by methods such as term frequency-inverse text frequency (TF-IDF), but keywords generated in this way have poor exclusivity, where exclusivity reflects how accurately the keywords screen out the target text. Manual participation is therefore often needed when keywords are generated: candidate keywords are first selected by TF-IDF or a similar method, and the required keywords are then selected manually. When new text appears, the same process has to be repeated, that is, keywords are selected by TF-IDF or a similar method and the target keywords are then screened manually.

Disclosure of Invention

The embodiments of the application provide a keyword generation method and apparatus, a computer device, a storage medium and a product. Second candidate keywords with strong exclusivity are screened from the first candidate keywords according to the relevance and the inverse class frequency, and target keywords with strong exclusivity are determined from the second candidate keywords by clustering them, so that target texts can be screened accurately when texts are screened based on the target keywords, the exclusivity of the generated target keywords is improved, and the labor cost is reduced.

The keyword generation method provided by the embodiment of the application comprises the following steps:

acquiring a first candidate keyword corresponding to a text set, and acquiring the relevance and the inverse class frequency of the first candidate keyword with respect to the text set;

calculating the discrimination of the first candidate keywords according to the inverse class frequency and the relevance;

screening second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination;

and clustering the second candidate keywords to obtain target keywords for text screening.

Correspondingly, an embodiment of the present application further provides a keyword generation apparatus, including:

the obtaining unit is used for acquiring a first candidate keyword corresponding to a text set and acquiring the relevance and the inverse class frequency of the first candidate keyword with respect to the text set;

the calculating unit is used for calculating the discrimination of the first candidate keywords according to the inverse class frequency and the relevance;

the screening unit is used for screening second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination;

and the clustering unit is used for clustering the second candidate keywords to obtain target keywords for text screening.

In an embodiment, the obtaining unit includes:

the statistics subunit is configured to perform statistics on the text category to which the text including the first candidate keyword in the text set belongs to obtain a text category statistic;

the category acquisition subunit is used for acquiring the total number of text categories contained in the text set;

and the frequency calculating subunit is used for calculating the inverse class frequency of the first candidate keyword aiming at the text set according to the text category statistics and the text category total number.

In an embodiment, the keyword generation apparatus further includes:

the data acquisition unit is used for acquiring an initial text set and acquiring initial keywords;

the text screening unit is used for performing text screening on the initial text set through the initial keywords to obtain a target text containing the initial keywords;

and the marking unit is used for labeling the text category of the target text in the initial text set to obtain a text set.

In one embodiment, the data acquisition unit includes:

the keyword obtaining subunit is configured to obtain a first alternative keyword corresponding to the text sample;

a keyword screening subunit, configured to screen a second alternative keyword from the first alternative keyword according to the relevance and the inverse class frequency of the first alternative keyword;

the sample screening subunit is used for performing text screening on the text sample based on the second alternative keyword to obtain a screening result;

and the index calculation subunit is used for calculating the evaluation index of the second alternative keyword according to the screening result and selecting the initial keyword from the second alternative keyword through the evaluation index.

In one embodiment, the index calculating subunit includes:

the selection module is used for selecting seed keywords from the second alternative keywords according to the evaluation indexes;

the information acquisition module is used for acquiring first characteristic information of the first alternative keyword and second characteristic information of the seed keyword;

the similarity calculation module is used for calculating the similarity between the first alternative keyword and the seed keyword based on the first characteristic information and the second characteristic information;

and the determining module is used for determining the initial keyword from the first alternative keywords according to the seed keyword and the similarity.

In an embodiment, the obtaining unit includes:

a frequency information obtaining subunit, configured to obtain a word frequency and an inverse text frequency of the first candidate keyword;

and the relevance calculating subunit is used for calculating the relevance of the first candidate keyword according to the word frequency and the inverse text frequency.

In an embodiment, the second candidate keyword includes a plurality of keywords, and the clustering unit includes:

the keyword clustering subunit is used for clustering the second candidate keywords according to the density reachable relation among the plurality of second candidate keywords to obtain a target keyword cluster;

and the target keyword determining subunit is used for determining the target keyword from the target keyword cluster.

Correspondingly, the embodiment of the application also provides computer equipment, which comprises a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program in the memory to execute any one of the keyword generation methods provided by the embodiments of the present application.

Accordingly, embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program, and the computer program is loaded by a processor to execute any one of the keyword generation methods provided in the embodiments of the present application.

Correspondingly, an embodiment of the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements any one of the keyword generation methods provided in the embodiment of the present application.

According to the method, a first candidate keyword corresponding to the text set is obtained, together with the relevance and the inverse class frequency of the first candidate keyword with respect to the text set; the discrimination of the first candidate keywords is calculated according to the inverse class frequency and the relevance; second candidate keywords meeting preset conditions are screened from the first candidate keywords according to the discrimination; and the second candidate keywords are clustered to obtain target keywords for text screening.

In this scheme, second candidate keywords with strong exclusivity are screened from the first candidate keywords according to the relevance and the inverse class frequency and are then clustered, so that target keywords with strong exclusivity can be determined from the second candidate keywords. Target texts can therefore be screened accurately when texts are screened on the basis of the target keywords, the exclusivity of the generated target keywords is improved, and the labor cost is reduced.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a scene diagram of a keyword generation method provided in an embodiment of the present application;

Fig. 2 is a flowchart of a keyword generation method provided in an embodiment of the present application;

Fig. 3 is another flowchart of a keyword generation method provided in an embodiment of the present application;

Fig. 4 is a schematic diagram of a keyword generation apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a keyword generation method, a keyword generation device, computer equipment and a computer readable storage medium. The keyword generation apparatus may be integrated into a computer device, and the computer device may be a server or a terminal.

The terminal may include a mobile phone, a wearable smart device, a tablet Computer, a notebook Computer, a Personal Computer (PC), a vehicle-mounted Computer, and the like.

The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network), big data and artificial intelligence platforms and the like.

For example, as shown in Fig. 1, the computer device obtains a first candidate keyword corresponding to the text set, and obtains the relevance and the inverse class frequency of the first candidate keyword with respect to the text set; calculates the discrimination of the first candidate keywords according to the inverse class frequency and the relevance; screens second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination; and clusters the second candidate keywords to obtain target keywords for text screening. In this scheme, second candidate keywords with strong exclusivity are screened from the first candidate keywords according to the relevance and the inverse class frequency and are then clustered, so that target keywords with strong exclusivity can be determined from the second candidate keywords. Target texts can therefore be screened accurately when texts are screened on the basis of the target keywords, the exclusivity of the generated target keywords is improved, and the labor cost is reduced.

The following are detailed descriptions. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.

The embodiment will be described from the perspective of a keyword generation apparatus, which may be specifically integrated in a computer device, where the computer device may be a server or a terminal.

As shown in fig. 2, a specific process of the keyword generation method provided in the embodiment of the present application may be as follows:

101. Acquire a first candidate keyword corresponding to the text set, and acquire the relevance and the inverse class frequency of the first candidate keyword with respect to the text set.

The text set may include a plurality of texts, one text may be a sentence, a paragraph, or a chapter, and the plurality of texts include a positive text, where the positive text represents a text to be screened from the text set.

The first candidate keyword may include a keyword extracted from the text set.

The relevance may represent how closely a first candidate keyword is related to the positive texts in the text set; the higher the relevance, the more positive texts can be screened out according to the first candidate keyword.

The inverse class frequency represents the category identification capability of the first candidate keyword. When the first candidate keyword appears only in texts of one text category, its inverse class frequency is larger and its category identification capability is stronger, so that texts of that text category can be screened according to the first candidate keyword without texts of other text categories being selected by mistake. When the first candidate keyword appears in texts of a plurality of text categories, its inverse class frequency is smaller and its category identification capability is weaker, so that it is difficult to screen only the texts of a certain text category.

For example, word segmentation processing may be specifically performed on each text included in the text set, each text is segmented into a plurality of words, and the words segmented from each text are used as the first candidate keywords corresponding to the text set.
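As a minimal sketch of this word segmentation step, the following assumes space-delimited text and a simple regular-expression tokenizer; the function name `first_candidate_keywords` and the sample texts are hypothetical and are not part of the embodiment. Text in a language without word delimiters would need a dedicated word segmenter instead.

```python
import re

def first_candidate_keywords(text_set):
    """Segment every text into words and pool them as first candidate keywords."""
    candidates = set()
    for text in text_set:
        # Assumed tokenizer: lowercase runs of word characters.
        candidates.update(re.findall(r"\w+", text.lower()))
    return candidates

# Hypothetical text set of merchant names.
texts = ["Sunny Noodle House", "Sunny Auto Repair"]
keywords = first_candidate_keywords(texts)
```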

The relevance is calculated for each first candidate keyword and can be obtained through a keyword extraction algorithm, such as the term frequency-inverse text frequency (TF-IDF) algorithm, the TextRank text ranking algorithm, the Latent Semantic Analysis (LSA) algorithm, the Latent Semantic Indexing (LSI) algorithm, the Latent Dirichlet Allocation (LDA) algorithm, and the like.

The TextRank algorithm is a graph-based ranking algorithm for keyword extraction and document summarization, and extracts keywords by using co-occurrence information (semantics) between words in a text, so that the keywords of the text can be extracted from a given text.

The TextRank algorithm calculates the relevance of each first candidate keyword based on the following formula:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where d is a damping coefficient and can be taken as 0.85; WS(V_i) represents the weight, namely the relevance, of the first candidate keyword i; WS(V_j) represents the weight of the first candidate keyword j; In(V_i) represents the set of first candidate keywords having a co-occurrence relationship with the first candidate keyword i; Out(V_j) represents the set of first candidate keywords to which the first candidate keyword j is linked; and w_ji represents the similarity between the first candidate keyword i and the first candidate keyword j.
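The TextRank iteration can be sketched as follows, in its commonly published form. This is an illustration under stated assumptions rather than the embodiment's implementation: the edge weights are taken to be co-occurrence counts within a sliding window, and the function name `textrank`, the `window` and `iterations` parameters, and the sample tokens are all hypothetical.

```python
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iterations=50):
    """Score words by iterating the TextRank update on a co-occurrence graph."""
    # Assumed weighting: undirected edges weighted by co-occurrence counts.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for vi in weights:
            rank = 0.0
            for vj, w_ji in weights[vi].items():
                # Normalize by the total weight of the edges leaving vj.
                out_sum = sum(weights[vj].values())
                rank += w_ji / out_sum * scores[vj]
            new_scores[vi] = (1 - d) + d * rank
        scores = new_scores
    return scores

# Hypothetical token sequence in which "noodle" co-occurs with every other word.
scores = textrank(["noodle", "house", "noodle", "shop", "noodle", "bar"])
top_word = max(scores, key=scores.get)
```

In this toy graph the hub word receives the highest weight, which is the behavior the relevance score is meant to capture.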

The LSA algorithm and the LSI algorithm map each text into a vector through a bag-of-words (BoW) model and concatenate the word vectors of the first candidate keywords contained in each text in the text set to obtain a keyword-text matrix. A Singular Value Decomposition (SVD) operation is performed on the keyword-text matrix, and the matrix is mapped into a low-dimensional space according to the SVD result, so that each first candidate keyword is represented as a point in a space formed by k topics. The similarity between each first candidate keyword and each text is then calculated and used as the relevance of the first candidate keyword with respect to the text set.

Topics corresponding to the texts in the text set are determined through the LDA algorithm, where each topic is represented by n keywords and the probabilities corresponding to those keywords; if a first candidate keyword of a text matches a keyword of a topic, the probability of that keyword in the topic is determined as the relevance of the first candidate keyword.

The TF-IDF algorithm calculates the relevance of each first candidate keyword according to the word frequency and the inverse text frequency of the first candidate keyword, that is, in an embodiment, the step "obtaining the relevance of the first candidate keyword with respect to the text set" may specifically include:

acquiring the word frequency and the inverse text frequency of the first candidate keyword;

and calculating the relevancy of the first candidate keyword according to the word frequency and the inverse text frequency.

The word frequency may represent the frequency with which the first candidate keyword appears in the texts of one text category, and may be determined as the ratio of the number of occurrences of the first candidate keyword in the texts of that text category to the total number of words included in those texts.

The inverse text frequency reflects how common texts containing the first candidate keyword are in the text set: the fewer texts contain the first candidate keyword, the larger its inverse text frequency.

For example, the positive texts in the text set may be segmented into words to obtain the first candidate keywords corresponding to the positive texts, and the total word count of those keywords may be counted. For each first candidate keyword, the number of its occurrences in the positive texts is counted, and the ratio of that number to the total word count is taken as the word frequency, that is, TF = number of occurrences of the first candidate keyword in the positive texts / total word count of the positive texts.

The total number of texts included in the text set may be counted, and, for each first candidate keyword, the number of texts including that keyword may be counted; the inverse text frequency (namely IDF) of each first candidate keyword is then obtained through formula (1). To prevent the denominator from being 0, the inverse text frequency may instead be calculated through formula (2), where a is greater than 0, for example, a is 1:

$$IDF = \log\frac{N}{n} \quad (1)$$

$$IDF = \log\frac{N}{n + a} \quad (2)$$

where N is the total number of texts included in the text set and n is the number of texts including the first candidate keyword.

And for each first candidate keyword, taking the word frequency and the inverse text frequency of the candidate keyword as the relevancy of the first candidate keyword to the text set.

Optionally, for each candidate keyword, a result obtained by multiplying the word frequency of the first candidate keyword by the inverse text frequency is used as a correlation degree of the first candidate keyword with respect to the text set.
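The TF-IDF relevance described above can be sketched as follows. The sketch assumes the common logarithmic form of the inverse text frequency with the smoothing term a added to the denominator, takes the word frequency over the positive texts and the inverse text frequency over the whole text set, and uses whitespace tokenization; the function name `tfidf_relevance` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_relevance(positive_texts, all_texts, a=1.0):
    """Relevance = TF (over the positive texts) * IDF (over the whole text set)."""
    positive_words = [w for text in positive_texts for w in text.split()]
    counts = Counter(positive_words)
    total_words = len(positive_words)
    n_texts = len(all_texts)
    relevance = {}
    for word, count in counts.items():
        tf = count / total_words
        # Number of texts in the whole set that contain the word.
        containing = sum(1 for text in all_texts if word in text.split())
        # Assumed smoothed form: log(N / (n + a)) with a > 0.
        idf = math.log(n_texts / (containing + a))
        relevance[word] = tf * idf
    return relevance

# Hypothetical data: two positive texts inside a text set of six texts.
positive = ["noodle house", "noodle bar"]
text_set = positive + ["auto repair", "auto wash", "tea house", "book shop"]
relevance = tfidf_relevance(positive, text_set)
best = max(relevance, key=relevance.get)
```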

When a first candidate keyword has high relevance but appears in both positive texts and negative texts, it has no discriminating power: it is difficult to use it to screen the positive texts out of the text set accurately without also selecting negative texts by mistake.

Therefore, for each first candidate keyword, the relevance and the inverse class frequency may be combined to determine the discrimination of the first candidate keyword. In an embodiment, the relevance and the inverse class frequency of each first candidate keyword with respect to the text set may be obtained from a database or a blockchain.

In another embodiment, the inverse class frequency may be calculated according to the total number of text categories to which the texts included in the text set belong and the text category statistic corresponding to the texts including the first candidate keyword, that is, the step "acquiring the inverse class frequency of the first candidate keyword with respect to the text set" may specifically include:

counting the text category to which the text containing the first candidate keywords in the text set belongs to obtain a text category statistic;

acquiring the total number of text categories contained in a text set;

and calculating the inverse class frequency of the first candidate keyword aiming at the text set according to the text category statistics and the text category total number.

The text category statistic may be the number of different text categories whose texts contain a first candidate keyword; for example, the keyword may appear only in texts of one text category, or in texts of two text categories, and so on.

The total number of text categories may be the total number of categories of text categories to which the text included in the text set belongs, for example, if the text set includes two types of texts, namely a positive type text and a negative type text, the total number of text categories is 2.

For example, the total number of text categories of the text set may be determined according to the text categories to which the texts included in the text set belong. For each first candidate keyword, the text categories of the texts in which the keyword appears are determined and counted to obtain the text category statistic, and the inverse class frequency (namely ICF) of each first candidate keyword is calculated through formula (3). To prevent the denominator from being 0, the inverse class frequency may instead be calculated through formula (4), where a is greater than 0, for example, a is 1:

$$ICF = \log\frac{C}{c} \quad (3)$$

$$ICF = \log\frac{C}{c + a} \quad (4)$$

where C is the total number of text categories contained in the text set and c is the text category statistic of the first candidate keyword.
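The inverse class frequency calculation can be sketched as follows, assuming a logarithmic form analogous to the inverse text frequency, with the smoothing term a added to the denominator. The function name `inverse_class_frequency`, the (text, label) pair representation and the sample data are hypothetical.

```python
import math

def inverse_class_frequency(keyword, labeled_texts, a=1.0):
    """ICF from the total category count and the categories containing the keyword."""
    # Total number of text categories in the text set.
    categories = {label for _, label in labeled_texts}
    # Text category statistic: categories whose texts contain the keyword.
    seen_in = {label for text, label in labeled_texts if keyword in text.split()}
    # Assumed smoothed form: log(C / (c + a)) with a > 0.
    return math.log(len(categories) / (len(seen_in) + a))

# Hypothetical labeled text set: label 1 = positive text, label 0 = negative text.
labeled_texts = [
    ("noodle house", 1),
    ("noodle bar", 1),
    ("tea house", 0),
    ("auto repair", 0),
]
icf_noodle = inverse_class_frequency("noodle", labeled_texts)  # one category
icf_house = inverse_class_frequency("house", labeled_texts)    # both categories
```

With two categories and a = 1, a keyword confined to one category scores 0 and a keyword found in both scores below 0, so the ordering described above (larger for the more exclusive keyword) is preserved.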

It can be understood that the text category of the text in the text set may be manually labeled, the text that needs to be screened is manually labeled in advance, or the text that needs to be screened may be automatically labeled, that is, in an embodiment, before the step "obtaining the first candidate keyword corresponding to the text set", the keyword generation method provided in the embodiment of the present application may further include:

acquiring an initial text set and acquiring initial keywords;

performing text screening on the initial text set through the initial keywords to obtain a target text containing the initial keywords;

and labeling the text category of the target text in the initial text set to obtain a text set.

Wherein the initial text set is a set containing unlabeled text.

Wherein the initial keywords may include keywords used for preliminary text screening of the initial text set.

The target text may include text that needs to be screened out.

For example, an initial text set and initial keywords may be obtained; the texts in the initial text set are searched for texts containing any initial keyword, and those texts are screened out as the target texts. The target texts are then labeled with a text category, for example by adding a text label 1 to each target text to indicate that it is a text that needs to be screened out, thereby obtaining the text set.
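The automatic labeling step can be sketched as follows. The label 1 for target texts follows the example above; assigning label 0 to all remaining texts, the whitespace tokenization, and the function name `label_text_set` are assumptions made for illustration.

```python
def label_text_set(initial_text_set, initial_keywords):
    """Label texts containing any initial keyword as 1 (target), others as 0."""
    keyword_set = set(initial_keywords)
    labeled = []
    for text in initial_text_set:
        words = set(text.split())
        # A text is a target text if it shares at least one word with the keywords.
        label = 1 if words & keyword_set else 0
        labeled.append((text, label))
    return labeled

# Hypothetical unlabeled texts and initial keywords.
initial_texts = ["sunny noodle house", "city auto repair", "ramen bar"]
labeled = label_text_set(initial_texts, ["noodle", "ramen"])
```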

In an embodiment, the initial keyword may be a preset keyword and may be obtained from a database or a blockchain. In another embodiment, the initial keyword may be obtained by performing keyword extraction on a text sample, that is, the step "obtaining the initial keyword" may specifically include:

acquiring a first alternative keyword corresponding to a text sample;

screening a second alternative keyword from the first alternative keyword according to the relevance and the inverse class frequency of the first alternative keyword;

performing text screening on the text sample based on the second alternative keywords to obtain a screening result;

and calculating evaluation indexes of the second alternative keywords according to the screening result, and selecting the initial keywords from the second alternative keywords through the evaluation indexes.

The text sample may include texts screened from the text set. Optionally, the text set may be an updated set (hereinafter referred to as the updated text set), in which case the text sample may be a plurality of texts, or all of the texts, in the text set before updating. The text sample may contain texts that need to be screened out, i.e. positive text samples.

The first alternative keywords are keywords obtained by performing word segmentation processing on the text samples and segmenting each text sample into a plurality of words.

The screening result may include text screened from the text sample.

The evaluation index may represent the screening effect of a second alternative keyword, and may include, for example, precision, coverage, accuracy, recall, and the harmonic mean of precision and recall (also referred to as the F1 score, F1_Score).

If the set of true positive samples contained in the text sample is T(u) and the screening result of one alternative keyword is R(u), then the precision is $\mathrm{Precision} = |T(u) \cap R(u)| / |R(u)|$; the recall is $\mathrm{Recall} = |T(u) \cap R(u)| / |T(u)|$; and the coverage is $\mathrm{Cover} = |R(u)| / |T(u)|$.
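These evaluation indexes can be computed as in the following sketch, which treats T(u) and R(u) as sets of text identifiers and uses the standard definition of F1 as the harmonic mean of precision and recall. The function name `evaluate` and the sample identifiers are hypothetical.

```python
def evaluate(true_positive_ids, screened_ids):
    """Precision, recall, coverage and F1 for one keyword's screening result."""
    t, r = set(true_positive_ids), set(screened_ids)
    hit = len(t & r)  # |T(u) ∩ R(u)|
    precision = hit / len(r) if r else 0.0
    recall = hit / len(t) if t else 0.0
    coverage = len(r) / len(t) if t else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "coverage": coverage, "f1": f1}

# Hypothetical case: texts 1-4 are true positives, the keyword screens out 1, 2 and 5.
metrics = evaluate({1, 2, 3, 4}, {1, 2, 5})
```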

For example, the first alternative keywords corresponding to the text sample may be obtained, together with the relevance and the inverse class frequency of each first alternative keyword (the relevance may be calculated in the manner described above for the first candidate keywords, which is not repeated here). The discrimination of each first alternative keyword is calculated according to the relevance and the inverse class frequency (for the specific calculation, refer to the description of step 102, which is not repeated here), and the first alternative keywords whose discrimination satisfies a condition (for example, the discrimination is greater than a preset threshold, or ranks within a preset number of top positions when sorted by discrimination) are screened out as the second alternative keywords.

And screening the text sample through the second alternative keywords, and screening out a text containing any second alternative keywords from the text sample to obtain a screening result, wherein the screening result may contain one or more of the positive text sample and other text samples.

The evaluation index of each second alternative keyword is calculated according to the number of positive text samples in the screening result of that keyword, the number of positive text samples contained in the text sample, and the like, and the second alternative keywords whose evaluation index meets the screening condition (for example, is greater than or smaller than a threshold, depending on the specific evaluation index) are taken as the initial keywords.

For example, in an embodiment, the precision, the coverage, and the F1_Score of each second alternative keyword may be calculated, and the second alternative keywords for which all three evaluation indexes satisfy the condition are taken as the initial keywords.

In order to improve the screening effect, after the second alternative keywords are screened according to the evaluation index, keywords similar to them may be queried and used as initial keywords as well, which improves the coverage of the positive text samples by the initial keywords. That is, the step "selecting the initial keywords from the second alternative keywords through the evaluation index" may specifically include:

selecting seed keywords from the second alternative keywords according to the evaluation indexes;

acquiring first characteristic information of the first alternative keyword and second characteristic information of the seed keyword;

calculating the similarity between the first alternative keyword and the seed keyword based on the first characteristic information and the second characteristic information;

and determining an initial keyword from the first alternative keywords according to the seed keyword and the similarity.

The first feature information may include information characterizing the first alternative keyword, and the second feature information may include information characterizing the seed keyword; for example, each may take the form of a feature vector.

For example, the second alternative keywords whose evaluation indexes satisfy the conditions may be screened out from the plurality of second alternative keywords as the seed keywords. Each seed keyword is then encoded, for example by one-hot encoding or word vector embedding, to obtain the second feature information of the seed keyword, and the first feature information of each first alternative keyword is obtained in the same manner.

For each seed keyword, the similarity between the seed keyword and each first alternative keyword is calculated from the second feature information of the seed keyword and the first feature information of that first alternative keyword. A preset number of first alternative keywords with the highest similarity, or the first alternative keywords whose similarity is greater than a preset similarity threshold, are taken as the near-synonyms of the seed keyword.

Each seed keyword and its near-synonyms are taken as the initial keywords.
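The expansion of seed keywords into initial keywords can be sketched as follows. This is a minimal illustration only: the function name `expand_seed_keywords`, the `vectors` lookup table, and the choice of cosine similarity are assumptions made for the example, since the text does not fix a particular similarity measure.

```python
import numpy as np


def expand_seed_keywords(seeds, candidates, vectors, top_n=None, min_sim=None):
    """Return the seed keywords plus their near-synonyms among `candidates`.

    vectors: dict mapping each word to its feature vector (hypothetical lookup).
    top_n:   keep the top_n most similar candidates per seed, or
    min_sim: keep every candidate whose similarity exceeds min_sim.
    """
    if top_n is None and min_sim is None:
        raise ValueError("set either top_n or min_sim")

    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    initial = list(seeds)
    for seed in seeds:
        # Score every candidate against the current seed, most similar first.
        scored = sorted(
            ((cosine(vectors[seed], vectors[c]), c) for c in candidates if c != seed),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if top_n is not None:
            picked = [c for _, c in scored[:top_n]]
        else:
            picked = [c for s, c in scored if s > min_sim]
        for c in picked:
            if c not in initial:
                initial.append(c)
    return initial
```

Either selection rule from the text can be used: `top_n` corresponds to the preset number of most similar keywords, and `min_sim` to the preset similarity threshold.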

102. Calculate the discrimination of the first candidate keywords according to the inverse class frequency and the relevance.

The discrimination characterizes how well a first candidate keyword can distinguish texts of different text categories in the text set.

For example, for each first candidate keyword, the inverse class frequency and the relevance of the first candidate keyword are multiplied to obtain the discrimination of the first candidate keyword.

Optionally, the discrimination of the first candidate keyword may instead be obtained by adding the inverse class frequency and the relevance.

103. Screen second candidate keywords meeting a preset condition from the first candidate keywords according to the discrimination.

For example, the first candidate keywords whose discrimination is greater than a preset discrimination may be screened out as the second candidate keywords; optionally, a preset number of first candidate keywords with the largest discrimination may instead be used as the second candidate keywords.
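Steps 102 and 103 can be illustrated with a short sketch. The function names and the `mode` parameter are hypothetical; the two combination rules (product or sum) and the two screening rules (threshold or top-N) are the ones described above.

```python
def discrimination(icf, relevance, mode="product"):
    """Combine inverse class frequency and relevance into a discrimination score."""
    if mode == "product":
        return icf * relevance
    if mode == "sum":
        return icf + relevance
    raise ValueError("mode must be 'product' or 'sum'")


def screen_candidates(scores, threshold=None, top_n=None):
    """Select second candidate keywords from a dict of keyword -> discrimination.

    If `threshold` is given, keep keywords scoring above it; otherwise keep the
    `top_n` highest-scoring keywords (all keywords when top_n is None).
    """
    if threshold is not None:
        return [word for word, score in scores.items() if score > threshold]
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```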

104. Cluster the second candidate keywords to obtain target keywords for text screening.

For example, the second candidate keywords may be clustered by a k-means algorithm, a Gaussian Mixture Model (GMM), a hierarchical clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and the like, and the target keywords may be determined according to the clustering result.

With the k-means algorithm, the distance between each second candidate keyword and each cluster center is calculated repeatedly, each second candidate keyword is assigned to its nearest cluster center, and each cluster center is updated to the mean of the second candidate keywords assigned to it, until the cluster centers become stable. The second candidate keywords are thereby clustered into a plurality of keyword clusters, each corresponding to one cluster center. The distance between each second candidate keyword and the center of its cluster is then calculated, and the second candidate keywords whose distance satisfies a preset distance threshold are determined as the target keywords.
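The k-means variant can be sketched as below. This is an illustrative sketch, assuming that each second candidate keyword has already been mapped to a feature vector, that the first `k` vectors serve as initial centers (a simplification chosen to keep the example deterministic), and that "satisfying the distance threshold" means lying within `max_dist` of the cluster center.

```python
import numpy as np


def _assign(X, centers):
    """Index of the nearest center for every row of X (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans_select(vectors, k, max_dist, n_iter=100):
    """Cluster keyword vectors with k-means and keep those close to a center.

    Returns (labels, centers, keep), where `keep` holds the indices of the
    keywords whose distance to their own cluster center is at most max_dist.
    """
    X = np.asarray(vectors, dtype=float)
    centers = X[:k].copy()  # assumed initialisation: first k points
    for _ in range(n_iter):
        labels = _assign(X, centers)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers are stable
        centers = new_centers
    labels = _assign(X, centers)
    dist_to_center = np.linalg.norm(X - centers[labels], axis=1)
    keep = np.flatnonzero(dist_to_center <= max_dist)
    return labels, centers, keep
```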

The DBSCAN algorithm may cluster the second candidate keywords according to a density-reachable relationship among the plurality of second candidate keywords to obtain a target keyword cluster, and use the second candidate keywords in the target keyword cluster as the target keywords. That is, the step "clustering the second candidate keywords to obtain the target keywords for text screening" may specifically include:

clustering the second candidate keywords according to the density reachable relation among the plurality of second candidate keywords to obtain a target keyword cluster;

a target keyword is determined from the target keyword cluster.

The density-reachable relationship includes direct density reachability, indirect density reachability, and the like.

For example, all the second candidate keywords may be used as the input data set $D = (x_1, x_2, \ldots, x_m)$, and the neighborhood parameters $(\varepsilon, MinPts)$ and the keyword distance measure are determined.

Here, $\varepsilon$ is the neighborhood radius of each object, and $MinPts$ is the neighborhood density threshold, that is, the threshold on the number of objects in the $\varepsilon$-neighborhood of an object. If the $\varepsilon$-neighborhood of an object O (i.e., a keyword O) contains at least $MinPts$ objects, the object O is a core object. If an object $p$ is within the $\varepsilon$-neighborhood of a core object $q$, then $p$ is directly density-reachable from $q$. If there is a chain of objects $\{p_1, p_2, \ldots, p_n\}$ with $p_1 = p$ and $p_n = q$ such that each $p_i$ is directly density-reachable from $p_{i+1}$ ($i = 1, 2, \ldots, n-1$), then $p$ is density-reachable from the core object $q$.

The keyword Distance measurement method may include Euclidean Distance (Euclidean Distance), Manhattan Distance (Manhattan Distance), Chebyshev Distance (Chebyshev Distance), Hamming Distance (Hamming Distance), and the like.

The specific clustering process is as follows:

Step 1: Initialize the core object set $\Omega = \emptyset$, the cluster number $k = 0$, the unvisited data set $\Gamma = D$, and the cluster partition $C = \emptyset$.

Step 2: For $j = 1, 2, \ldots, m$, find all core objects according to steps a) and b) below:

a) Using the distance measure, find the $\varepsilon$-neighborhood sub-data set $N_\varepsilon(x_j)$ of object $x_j$;

b) If the number of objects in the sub-data set satisfies $|N_\varepsilon(x_j)| \geq MinPts$, add object $x_j$ to the core object set: $\Omega = \Omega \cup \{x_j\}$.

Step 3: If the core object set $\Omega = \emptyset$, the algorithm ends; otherwise, go to step 4.

Step 4: Randomly select a core object $o$ from the core object set $\Omega$, initialize the current cluster core object queue $\Omega_{cur} = \{o\}$, set the cluster number $k = k + 1$, initialize the current cluster object set $C_k = \{o\}$, and update the unvisited data set $\Gamma = \Gamma - \{o\}$.

Step 5: If the current cluster core object queue $\Omega_{cur} = \emptyset$, the current cluster $C_k$ is complete; update the cluster partition $C = \{C_1, C_2, \ldots, C_k\}$, update the core object set $\Omega = \Omega - C_k$, and go to step 3; otherwise, go to step 6.

Step 6: Take a core object $o'$ out of the current cluster core object queue $\Omega_{cur}$, find its $\varepsilon$-neighborhood sub-data set $N_\varepsilon(o')$ according to the neighborhood distance threshold $\varepsilon$, let $\Delta = N_\varepsilon(o') \cap \Gamma$, update the current cluster object set $C_k = C_k \cup \Delta$, update the unvisited data set $\Gamma = \Gamma - \Delta$, update the current cluster core object queue $\Omega_{cur} = \Omega_{cur} \cup (\Delta \cap \Omega) - \{o'\}$, and go to step 5.

The output when the algorithm ends is the target keyword cluster partition $C = \{C_1, C_2, \ldots, C_k\}$.
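The clustering process of steps 1 to 6 can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: it assumes the keywords are already represented as feature vectors, uses the Euclidean distance, and picks the smallest-index core object instead of a random one so that the result is reproducible. The names `dbscan`, `eps`, and `min_pts` are chosen for the example.

```python
import numpy as np


def dbscan(X, eps, min_pts):
    """Return a list of clusters, each a set of row indices of X.

    Objects that belong to no cluster (noise) are simply left out.
    """
    X = np.asarray(X, dtype=float)
    m = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # eps-neighborhood of every object (the object itself is included).
    neighbours = [set(np.flatnonzero(dist[j] <= eps).tolist()) for j in range(m)]

    # Steps 1-2: collect the core objects.
    core = {j for j in range(m) if len(neighbours[j]) >= min_pts}
    unvisited = set(range(m))
    clusters = []

    # Steps 3-6: grow one cluster per remaining core object.
    while core:
        o = min(core)          # deterministic stand-in for the random choice
        queue = [o]            # current cluster core object queue
        cluster = {o}
        unvisited.discard(o)
        while queue:
            current = queue.pop(0)
            delta = neighbours[current] & unvisited
            cluster |= delta
            unvisited -= delta
            queue.extend(sorted(delta & core))  # only core objects expand further
        clusters.append(cluster)
        core -= cluster
    return clusters
```

With two dense groups and one isolated point, the isolated point is not a core object and is not reachable from one, so it ends up in no cluster.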

Optionally, the second candidate keywords in the target keyword cluster that differ from the initial keywords are taken as the target keywords, and the target keywords and the initial keywords are used together as the keywords for text screening.

For example, when a certain type of business or a certain type of text needs to be identified, keywords (i.e., target keywords and initial keywords) for the user to filter the certain type of business or the certain type of text may be generated through the above process, and then the certain type of business or the certain type of text may be filtered from the database according to the generated keywords.

As can be seen from the above, in the embodiment of the application, the first candidate keyword corresponding to the text set is obtained, and the relevancy and the inverse frequency of the first candidate keyword with respect to the text set are obtained; calculating the discrimination of the first candidate keywords according to the inverse class frequency and the correlation; screening second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination; and clustering the second candidate keywords to obtain target keywords for text screening.

On the basis of the above-described embodiments, further details will be given below by way of example.

The present embodiment will be described from the perspective of a keyword generation apparatus, which may be specifically integrated in a server.

The keyword generation method provided in the embodiment of the present application includes two stages: a keyword generation stage and a keyword update iteration stage. As shown in fig. 3, a specific flow of the keyword generation method may be as follows:

I. Keyword generation stage.

201. The server obtains a text set before updating, and performs word segmentation processing on text samples in the text set before updating to obtain a plurality of first alternative keywords.

The text set before updating may be a text set to which no new text has been added. It includes positive text samples and negative text samples, where a positive text sample is a text sample that needs to be extracted from the text set before updating.

For example, the server may perform word segmentation on each text sample contained in the text set before updating, segment each text sample into a plurality of words, and use the words segmented from the text samples as the first alternative keywords corresponding to the text set before updating.

202. The server calculates the discrimination of each first alternative keyword according to its word frequency, inverse text frequency, and inverse class frequency.

For example, the server may count the total number of words in the positive text samples of the text set before updating and the number of times each first alternative keyword occurs in the positive text samples, and then, for each first alternative keyword, compute the ratio of its number of occurrences to that total to obtain its word frequency, that is, TF = (number of occurrences of the first alternative keyword in the positive text samples) / (total number of words in the positive text samples).

The server counts the total number of text samples contained in the text set before updating, counts, for each first alternative keyword, the number of texts in that text set that contain the keyword, and calculates the inverse text frequency (IDF) of each first alternative keyword through the following formula (5).

Because the text set before updating contains positive text samples and negative text samples, the total number of text categories is 2. For each first alternative keyword, the server determines the text categories of the text samples in which the keyword appears and counts them to obtain a text category statistic. Because a first alternative keyword may appear only in positive text samples, only in negative text samples, or in both, this statistic is either 1 or 2. The inverse class frequency (ICF) of each first alternative keyword is then calculated by the following formula (6).

The server multiplies the inverse class frequency of each first alternative keyword by its relevance, which is calculated from the word frequency and the inverse text frequency, to obtain the discrimination of that keyword.
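The scoring in step 202 can be sketched as follows. Formulas (5) and (6) are not reproduced in this text, so the logarithmic forms of IDF and ICF used below, and the use of TF × IDF as the relevance, are assumptions made for illustration only; the function name `score_keywords` and the label convention (1 = positive, 0 = negative) are likewise hypothetical.

```python
import math
from collections import Counter


def score_keywords(samples):
    """Compute a discrimination score for every word in the sample set.

    samples: list of (tokens, label) pairs, label 1 = positive, 0 = negative.
    """
    pos_tokens = [t for tokens, label in samples if label == 1 for t in tokens]
    tf_counts = Counter(pos_tokens)          # occurrences in positive samples
    n_pos_words = len(pos_tokens)            # total words in positive samples
    n_texts = len(samples)
    doc_freq = Counter(t for tokens, _ in samples for t in set(tokens))

    # Text categories in which each word appears (statistic is 1 or 2 here).
    classes = {}
    for tokens, label in samples:
        for t in set(tokens):
            classes.setdefault(t, set()).add(label)
    n_classes = len({label for _, label in samples})

    scores = {}
    for word, df in doc_freq.items():
        tf = tf_counts[word] / n_pos_words if n_pos_words else 0.0
        idf = math.log(n_texts / df)                          # assumed form
        icf = math.log(1 + n_classes / len(classes[word]))    # assumed form
        scores[word] = icf * (tf * idf)   # discrimination = ICF x relevance
    return scores
```

Under these assumed forms, a word that occurs only in negative samples has TF = 0 and therefore a discrimination of 0, while a word confined to the positive class receives a larger ICF than one that appears in both classes.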

203. The server selects second alternative keywords from the first alternative keywords, and performs text screening on the text samples based on the second alternative keywords to obtain a screening result.

For example, the server may screen out the first alternative keywords whose discrimination meets a condition (for example, the discrimination is greater than a preset threshold, or the keyword ranks within a preset number of top positions when sorted by discrimination) and use them as the second alternative keywords.

The server then screens the text samples in the text set before updating with the second alternative keywords, that is, it selects the text samples that contain any second alternative keyword to obtain the screening result, which may contain positive text samples, negative text samples, or both.

204. The server selects seed keywords from the second alternative keywords according to the evaluation indexes of the second alternative keywords, and determines the initial keywords from the seed keywords and their near-synonyms among the first alternative keywords.

For example, the evaluation indexes of each second alternative keyword, namely the precision, the coverage rate, and the F1-Score, may be calculated from the number of positive text samples in the screening result of that keyword and the number of positive text samples contained in the text set before updating, and the second alternative keywords whose evaluation indexes all satisfy the screening condition (for example, being greater than or less than a threshold, depending on the specific evaluation index) are used as the seed keywords.
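The evaluation indexes of a single keyword can be sketched as below. The definitions used here are the common ones and are assumptions for this example: precision is the share of screened texts that are positive, coverage rate is the share of all positive texts that are screened, and F1-Score is their harmonic mean. Matching a keyword by substring containment is also an assumption.

```python
def evaluate_keyword(keyword, samples):
    """Return (precision, coverage, f1) of one keyword.

    samples: list of (text, label) pairs, label 1 = positive, 0 = negative.
    """
    # Screening result: labels of all texts that contain the keyword.
    hits = [label for text, label in samples if keyword in text]
    total_pos = sum(label for _, label in samples)
    hit_pos = sum(hits)

    precision = hit_pos / len(hits) if hits else 0.0
    coverage = hit_pos / total_pos if total_pos else 0.0
    denom = precision + coverage
    f1 = 2 * precision * coverage / denom if denom else 0.0
    return precision, coverage, f1
```

A keyword would then be kept as a seed keyword only if all three values satisfy the chosen screening conditions.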

Optionally, the seed keywords may be screened from the second candidate keywords in a manual participation manner.

To improve the coverage rate of the initial keywords over the positive text samples and thereby improve the screening effect, near-synonyms of the seed keywords can be searched for among the first alternative keywords, and the seed keywords together with their near-synonyms are used as the initial keywords. Specifically, the seed keywords and the first alternative keywords are mapped into a feature space through word vector embedding to obtain a second feature vector for each seed keyword and a first feature vector for each first alternative keyword. For each seed keyword, the distances between its second feature vector and the first feature vectors are calculated, and the preset number of first alternative keywords nearest to the seed keyword are used, together with the seed keyword, as initial keywords.

In order to ensure the exclusivity of the keywords, the precision, the F1-Score and the coverage rate of each synonym of the seed keyword can be calculated, and the synonym with the precision, the F1-Score and the coverage rate within a preset range is used as an initial keyword.

II. Keyword update iteration stage.

205. When newly added text exists in the text library, the server acquires an initial text set, and performs text category labeling on the target texts in the initial text set based on the initial keywords to obtain a text set.

For example, when newly added text exists in the text library, the server acquires an initial text set that includes the newly added text, and acquires the initial keywords obtained in the keyword generation stage. The server then searches the texts in the initial text set for texts that contain any initial keyword, takes these as the target texts, and labels the target texts with a text category, for example by adding the text label 1 to indicate that a target text is a text that needs to be screened out, thereby obtaining the text set.
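The labeling in step 205 can be sketched as follows. The label 1 for target texts follows the example above; assigning 0 to all other texts and matching by substring containment are assumptions made for this sketch, and the function name is hypothetical.

```python
def label_texts(texts, initial_keywords):
    """Label each text 1 if it contains any initial keyword, else 0."""
    return [
        (text, 1 if any(keyword in text for keyword in initial_keywords) else 0)
        for text in texts
    ]
```

The resulting labeled pairs form the text set from which the first candidate keywords are extracted in step 206.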

206. The server obtains first candidate keywords corresponding to the text set, and calculates the discrimination of the first candidate keywords according to the relevancy and the inverse class frequency of each first candidate keyword aiming at the text set.

For example, the server may specifically perform word segmentation on each text included in the text set, segment each text into a plurality of words, and use the word segmented from each text as the first candidate keyword corresponding to the text set.

The discrimination of each first candidate keyword is calculated in the manner described in step 202.

207. The server screens second candidate keywords meeting a preset condition from the first candidate keywords according to the discrimination.

For example, the server may screen out the first candidate keywords whose discrimination is greater than a preset discrimination as the second candidate keywords; optionally, it may instead take a preset number of first candidate keywords with the largest discrimination as the second candidate keywords.

Optionally, stop word filtering may also be performed on the first candidate keywords to remove stop words. Stop words are words that carry little meaning compared with other words and on which correct text can hardly be screened, such as "what", "is", "the", and "want".

208. The server clusters the second candidate keywords to obtain target keywords for text screening.

For example, the DBSCAN algorithm may be used to cluster the second candidate keywords according to the density-reachable relationship among the plurality of second candidate keywords to obtain the target keyword cluster.

The second candidate keywords in the target keyword cluster are taken as the target keywords. Optionally, the second candidate keywords in the target keyword cluster that differ from the initial keywords are taken as the target keywords, and the target keywords and the initial keywords are used together as the keywords for text screening.

Optionally, the precision and the coverage rate of each second candidate keyword in the target keyword cluster with respect to the text library are calculated, and a second candidate keyword is used as a target keyword when its precision and coverage rate are within a preset range. In this way, no manual participation is needed in the keyword update iteration stage, which reduces labor cost and improves the timeliness of keyword update iterations.

As can be seen from the above, the server in the embodiment of the application obtains the text set before updating, and performs word segmentation processing on the text samples in the text set before updating to obtain a plurality of first candidate keywords; calculating the discrimination of each first alternative keyword according to the word frequency, the inverse text frequency and the inverse class frequency of the first alternative keyword; selecting second alternative keywords from the first alternative keywords, and performing text screening on the text sample based on the second alternative keywords to obtain a screening result; selecting seed keywords from the second alternative keywords according to the evaluation indexes of the second alternative keywords, and determining initial keywords from keywords which are similar words of the seed keywords in the first alternative keywords; when a newly added text exists in the text library, the server acquires an initial text set, and performs text category labeling on a target text in the initial text set based on the initial keywords to obtain a text set; acquiring first candidate keywords corresponding to the text set, and calculating the discrimination of the first candidate keywords according to the relevancy and the inverse class frequency of each first candidate keyword aiming at the text set; screening second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination; and clustering the second candidate keywords to obtain target keywords for text screening.

To better implement the keyword generation method provided by the embodiment of the present application, an embodiment further provides a keyword generation apparatus. The terms used have the same meanings as in the keyword generation method, and specific implementation details can be found in the description of the method embodiment.

The keyword generation apparatus may be specifically integrated in a computer device. As shown in fig. 4, the keyword generation apparatus may include an obtaining unit 301, a calculating unit 302, a screening unit 303, and a clustering unit 304, specifically as follows:

(1) An obtaining unit 301, configured to obtain a first candidate keyword corresponding to the text set, and obtain the relevance and the inverse class frequency of the first candidate keyword with respect to the text set.

In an embodiment, the obtaining unit 301 includes a statistics subunit, a category obtaining subunit, and a frequency calculating subunit, specifically:

In an embodiment, the keyword generation apparatus further includes a data acquisition unit, a text filtering unit, and a labeling unit, specifically:

and the marking unit is used for carrying out text type marking on the target text in the initial text sample set to obtain a text set.

In an embodiment, the data obtaining unit includes a keyword obtaining subunit, a keyword screening subunit, a sample screening subunit, and an index calculating subunit, specifically:

the keyword screening subunit is used for screening a second alternative keyword from the first alternative keyword according to the relevance and the inverse class frequency of the first alternative keyword;

and the index calculation subunit is used for calculating the evaluation index of the second alternative keywords according to the screening result and selecting the initial keywords from the second alternative keywords through the evaluation index.

In an embodiment, the index calculating subunit includes a selecting module, an information obtaining module, a similarity calculating module, and a determining module, specifically:

the similarity calculation module is used for calculating the similarity between the first alternative keyword and the seed keyword based on the first characteristic information and the second characteristic information;

In an embodiment, the obtaining unit 301 includes a frequency information obtaining subunit and a relevance calculation subunit, specifically:

the frequency information obtaining subunit is used for obtaining the word frequency and the inverse text frequency of the first candidate keyword;

and the relevance calculation subunit is used for calculating the relevance of the first candidate keyword according to the word frequency and the inverse text frequency.

(2) A calculating unit 302, configured to calculate the discrimination of the first candidate keyword according to the inverse class frequency and the relevance.

(3) A screening unit 303, configured to screen second candidate keywords meeting a preset condition from the first candidate keywords according to the discrimination.

(4) A clustering unit 304, configured to cluster the second candidate keywords to obtain target keywords for text screening.

In an embodiment, the second candidate keyword includes a plurality of keywords, and the clustering unit 304 includes, specifically:

and a target keyword determination subunit for determining a target keyword from the target keyword cluster.

As can be seen from the above, the keyword generation apparatus in the embodiment of the application obtains the first candidate keyword corresponding to the text set through the obtaining unit 301, and obtains the relevancy and the inverse frequency of the first candidate keyword with respect to the text set; the calculating unit 302 calculates the degree of distinction of the first candidate keyword according to the inverse class frequency and the degree of correlation; the screening unit 303 screens the second candidate keywords meeting the preset condition from the first candidate keywords according to the discrimination; finally, clustering is performed on the second candidate keywords through the clustering unit 304, so as to obtain target keywords for text screening.

An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, as shown in fig. 5, which shows a schematic structural diagram of the computer device according to the embodiment of the present application, and specifically:

the computer device may include components such as a processor 1001 of one or more processing cores, memory 1002 of one or more computer-readable storage media, a power supply 1003, and an input unit 1004. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 5 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:

the processor 1001 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 1002 and calling data stored in the memory 1002, thereby monitoring the computer device as a whole. Optionally, processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor, which mainly handles operating systems, user interfaces, computer programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1001.

The memory 1002 may be used to store software programs and modules, and the processor 1001 executes various functional applications and data processing by operating the software programs and modules stored in the memory 1002. The memory 1002 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 1002 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 access to the memory 1002.

The computer device further comprises a power supply 1003 for supplying power to each component, and preferably, the power supply 1003 is logically connected to the processor 1001 through a power management system, so that functions of managing charging, discharging, power consumption and the like are realized through the power management system. The power source 1003 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The computer device may also include an input unit 1004, and the input unit 1004 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 1001 in the computer device loads the executable file corresponding to the process of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 runs the computer programs stored in the memory 1002, so as to implement various functions as follows:

obtaining a first candidate keyword corresponding to the text set, and obtaining the relevancy and the inverse frequency of the first candidate keyword aiming at the text set;

The above operations can be implemented in the foregoing embodiments, and are not described herein.

As can be seen from the above, the computer device in the embodiment of the present application may obtain the first candidate keyword corresponding to the text set, and obtain the relevancy and the inverse frequency of the first candidate keyword with respect to the text set; calculating the discrimination of the first candidate keywords according to the inverse class frequency and the correlation; screening second candidate keywords meeting preset conditions from the first candidate keywords according to the discrimination; and clustering the second candidate keywords to obtain target keywords for text screening. According to the scheme, the second candidate keywords with strong exclusivity are screened from the first candidate keywords according to the relevance and the inverse frequency, and then the second candidate keywords are clustered, so that the target keywords with strong exclusivity can be determined from the second candidate keywords, the target texts can be accurately screened when the texts are screened on the basis of the target keywords, the exclusivity of the generated target keywords is improved, and the labor cost is reduced.

According to an aspect of the present application, there is provided a computer program product comprising a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.

To this end, the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute any one of the keyword generation methods provided in the present application.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.

Since the computer program stored in the computer-readable storage medium can execute any keyword generation method provided in the embodiments of the present application, beneficial effects that can be achieved by any keyword generation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.

The foregoing describes a keyword generation method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product provided in the embodiments of the present application in detail, and specific examples are applied herein to explain the principles and embodiments of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and the core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A keyword generation method, comprising:

2. The method of claim 1, wherein before obtaining the first candidate keyword corresponding to the text set, the method further comprises:

acquiring an initial text set and acquiring initial keywords;

3. The method of claim 2, wherein the obtaining the initial keyword comprises:

acquiring a first alternative keyword corresponding to a text sample;

screening second alternative keywords from the first alternative keywords according to the relevance and the inverse frequency of the first alternative keywords;

performing text screening on the text sample based on the second alternative keyword to obtain a screening result;

and calculating evaluation indexes of the second alternative keywords according to the screening result, and selecting the initial keywords from the second alternative keywords according to the evaluation indexes.

4. The method according to claim 3, wherein said selecting the initial keyword from the second candidate keywords by the evaluation index comprises:

and determining the initial keyword from the first alternative keywords according to the seed keyword and the similarity.

5. The method of claim 1, wherein obtaining the relevancy of the first candidate keyword with respect to the text set comprises:

6. The method according to any one of claims 1 to 5, wherein the second candidate keywords comprise a plurality of keywords, and the clustering the second candidate keywords to obtain target keywords for text screening comprises:

determining the target keyword from the target keyword cluster.

7. A keyword generation apparatus, comprising:

the calculating unit is used for calculating the discrimination of the first candidate keyword according to the inverse class frequency and the correlation;

8. A computer device comprising a memory and a processor; the memory stores a computer program, and the processor is configured to execute the computer program in the memory to perform the keyword generation method according to any one of claims 1 to 7.

9. A computer-readable storage medium for storing a computer program which is loaded by a processor to perform the keyword generation method of any one of claims 1 to 7.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the keyword generation method of any of claims 1-7.