CN112347778A - Keyword extraction method and device, terminal equipment and storage medium


Info

Publication number
CN112347778A
Authority
CN
China
Prior art keywords
keyword
target
candidate
keywords
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011229490.4A
Other languages
Chinese (zh)
Other versions
CN112347778B (en)
Inventor
饶刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011229490.4A
Publication of CN112347778A
Priority to PCT/CN2021/091083 (WO2022095374A1)
Application granted
Publication of CN112347778B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application is applicable to the technical field of artificial intelligence and provides a keyword extraction method and device, terminal equipment and a storage medium. The method comprises: acquiring a plurality of participles in a target article; determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library; calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each candidate keyword; and inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining a target keyword from the candidate keywords according to the word probabilities. Extracting target keywords from target articles by this method ensures that the extracted target keywords are high-quality words with a high degree of association with the target article.

Description

Keyword extraction method and device, terminal equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a keyword extraction method and device, terminal equipment and a storage medium.
Background
In the prior art, keyword extraction is widely applied in many fields of text processing, such as text clustering, text summarization and information retrieval. In the current data age, keyword extraction is generally performed by judging each word in a text based on a single piece of information about that word. At present, it is popular to obtain text keywords using the graph-based ranking algorithm TextRank or a topic model such as Latent Dirichlet Allocation (LDA). However, some special words, such as names of people and names of places, are often ignored, even though such information may be important information in the text. Therefore, existing methods for extracting text keywords find it difficult to accurately extract high-quality keywords related to the text.
Disclosure of Invention
The embodiment of the application provides a keyword extraction method, a keyword extraction device, terminal equipment and a storage medium, and can solve the problem that the existing method for extracting text keywords is difficult to accurately extract high-quality keywords related to texts.
In a first aspect, an embodiment of the present application provides a keyword extraction method, including:
acquiring a plurality of participles in a target article;
determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library;
respectively calculating a plurality of score values corresponding to each candidate keyword in the candidate keywords according to the candidate keywords and the target article;
and inputting a plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the candidate keywords according to the word probability.
In an embodiment, before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library, the method further includes:
determining an article field of the target article, and acquiring a field text belonging to the article field;
calculating the domain association degree between each domain participle according to the plurality of domain participles in the domain text;
determining a target relevance degree which is greater than a preset relevance degree from a plurality of field relevance degrees, and determining a target field word segmentation corresponding to the target relevance degree;
and storing the target field participles into the keyword library.
In an embodiment, before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library, the method further includes:
determining an article field of the target article, and acquiring a plurality of field keywords belonging to the article field;
storing the plurality of domain keywords into the keyword library.
In an embodiment, the determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library includes:
determining whether the keyword library contains target participles, wherein the target participles are any one of the participles;
if the keyword library contains the target participle, taking the target participle as a candidate keyword;
if the keyword library does not contain the target participle, judging whether the target participle belongs to an entity word; if the target participle belongs to an entity word, inputting the target participle belonging to the entity word into the supervision model to obtain the keyword probability of the target participle belonging to the entity word; and if the keyword probability is greater than a probability threshold, taking the target participle corresponding to the keyword probability as a candidate keyword.
In one embodiment, the supervision model is trained by:
acquiring a training sample, and acquiring labeled training keywords from the training sample;
performing word segmentation on the text content in the training sample to obtain a plurality of sample word segmentations, and respectively calculating a sample score value corresponding to each sample word segmentation;
determining sample keywords from the plurality of sample participles according to a plurality of sample score values;
determining a label category of the sample keyword based on the sample keyword and the training keyword;
extracting keyword features of the sample keywords;
and performing model training based on the keyword features and the mark categories of the sample keywords to obtain the supervision model.
In an embodiment, the plurality of score values comprises a first score value, a second score value, a third score value, and a fourth score value;
the calculating a plurality of score values corresponding to each candidate keyword in the plurality of candidate keywords respectively according to the plurality of candidate keywords and the target article comprises:
counting the number of the multiple participles, respectively calculating the word frequency of each candidate keyword in the target article according to the number, and correspondingly calculating a first score value of each candidate keyword according to the word frequency;
determining the positions of the candidate keywords in the target article, and calculating a second score value of each candidate keyword based on the positions of the candidate keywords in the target article;
respectively determining an initial position and an end position of each candidate keyword in the target article, and calculating a third score value corresponding to each candidate keyword according to the initial position and the end position;
and calculating a fourth score value corresponding to each candidate keyword according to a preset text sorting algorithm.
In one embodiment, the target keyword comprises a plurality of keywords;
after the inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability, the method further comprises:
counting the total number of occurrences of each target keyword in the multiple target articles, and calculating the ratio between the totals of the target keywords;
and performing article recall according to the ratio and each target keyword to obtain an article set, wherein, in the article set, the ratio between the numbers of articles containing each target keyword is equal to the ratio.
In a second aspect, an embodiment of the present application provides a keyword extraction apparatus, including:
the first acquisition module is used for acquiring a plurality of participles in a target article;
the first determining module is used for determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library;
the first calculation module is used for respectively calculating a plurality of score values corresponding to each candidate keyword in the candidate keywords according to the candidate keywords and the target article;
and the second determining module is used for inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the above first aspects.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method of any one of the above first aspects.
In the embodiment of the application, a plurality of participles are obtained by performing word segmentation on a target article and are compared with a preset keyword library, candidate keywords are determined from the plurality of participles, a plurality of score values of each candidate keyword are calculated respectively, and the target keyword is further determined from the plurality of candidate keywords according to the plurality of score values. Therefore, on the basis of maintaining a high-quality keyword library for selecting candidate keywords from the target article, the word probability of each candidate keyword can be further calculated by the supervision model, and the target keyword is determined from the plurality of candidate keywords according to the word probabilities, ensuring that the extracted target keywords are high-quality words with a high degree of association with the target article.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart illustrating an implementation of a keyword extraction method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating an implementation of a keyword extraction method according to another embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a keyword extraction method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of a keyword extraction method according to another embodiment of the present application;
fig. 5 is a schematic diagram illustrating an implementation manner of S102 in a keyword extraction method according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating an implementation of a supervised model training procedure in a keyword extraction method according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating feature extraction of sample keywords in a keyword extraction method according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an implementation manner of S103 of a keyword extraction method according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating an implementation of a keyword extraction method according to yet another embodiment of the present application;
fig. 10 is a schematic structural diagram of a keyword extraction apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The keyword extraction method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook and the like, and the embodiment of the application does not limit the specific types of the terminal devices.
Referring to fig. 1, fig. 1 shows a flowchart of an implementation of a keyword extraction method provided in an embodiment of the present application, where the method includes the following steps:
s101, acquiring a plurality of word segments in the target article.
In application, the target article may be a microblog article, a news article, and the like, which is not limited. The target article may be acquired by the terminal device through a network, or an existing target article may be acquired by the terminal device from a specified storage path. The text language of the target article may be Chinese, English or another language, which is not limited herein. In order to better explain the keyword extraction method, this embodiment takes a Chinese-language text as an example for explanation.
In application, the plurality of participles can be obtained by performing word segmentation processing on the target article. For example, a target article of the news category often includes characters such as a news source and a reprint notice, but these characters are irrelevant information and may interfere with the accuracy of extracting keywords from the target article. Therefore, the target article can be cleaned in advance by data cleaning. Word segmentation of the target article may rely on a word segmentation library established in advance, which includes the words usable in one language (for example, Chinese). For the target article, a sentence or a segment of character string in the target article can be taken out according to a forward maximum matching algorithm or a reverse maximum matching algorithm and compared with the words in the word segmentation library. If a match is found, the character string can be taken as a word representing one meaning, i.e., a participle. If the word segmentation library has no matching word, the length of the character string can be reduced (for example, by excluding the tail character of the character string) and matched against the word segmentation library again, until all character strings are matched, thereby obtaining the plurality of participles.
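For illustration only, the following is a minimal sketch of forward maximum matching as described above, not the exact procedure of the application; the word library `word_library` and the maximum word length are assumed inputs, and a single character that matches nothing is kept as a one-character participle.

```python
def forward_max_match(text, word_library, max_word_len=6):
    """Split `text` into participles by greedily matching the longest entry
    in the word segmentation library, shrinking the candidate from the tail."""
    participles = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, then drop the tail character and retry.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in word_library:
                matched = candidate
                break
        participles.append(matched)
        i += len(matched)
    return participles
```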
S102, determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library.
In application, the preset keyword library may be built by letting a user preset a plurality of interest vocabularies, which are stored as keywords of the keyword library in a storage path specified by the terminal device. For example, when a user is interested in the content of an article being read and wants to frequently read articles related to the field of that content, the user can select words from the article content as interest words and store them in the keyword library. Alternatively, after the terminal device determines, according to a determination instruction of the user, that the user is interested in the article content, the terminal device may determine the field to which the article belongs according to the currently read article content, crawl a plurality of keywords of a plurality of articles in that field from the network as interest words, and store them in the keyword library. The preset keyword library may include words such as names of people, names of places and times, which are often ignored by currently popular keyword extraction algorithms. Therefore, by separately curating such specific vocabulary, the quality of the candidate keywords determined from the plurality of participles can be ensured.
When determining the candidate keywords from the participles, if a word identical to a participle exists in the keyword library, the participle may be determined as a candidate keyword, so that a plurality of candidate keywords may be obtained, as shown in the sketch below. It is to be understood that, if a keyword included in the keyword library and a participle are similar words, the participle may also be used as a candidate keyword, which is not limited herein.
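For illustration only, a minimal sketch of this library lookup, assuming `keyword_library` is a set of curated keywords; the similar-word matching mentioned above is not shown.

```python
def select_candidate_keywords(participles, keyword_library):
    # A participle becomes a candidate keyword if an identical word exists
    # in the preset keyword library.
    return [word for word in participles if word in keyword_library]
```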
S103, respectively calculating a plurality of score values corresponding to each candidate keyword in the candidate keywords according to the candidate keywords and the target article.
In application, the plurality of score values include, but are not limited to, a position score value for the position of the candidate keyword in the target article and a frequency score value for how often the candidate keyword appears in the target article. Illustratively, different score values are assigned based on the different positions of the candidate keywords in the target article. For example, the headline of a news-category target article can generally be considered the core of the article, containing its main content. Therefore, the position score value of a candidate keyword appearing in the title can be set higher than that of one appearing in the body text. It should be noted that, if the same candidate keyword appears in both the title and the body text, the highest score value may be selected as the position score value of the candidate keyword (i.e., the position score value of the title). The frequency score value of a candidate keyword can be obtained by calculating the ratio between the number of times the candidate keyword appears in the target article and the total number of participles in the target article.
S104, inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability.
In application, there are a plurality of candidate keywords and each candidate keyword has a plurality of score values, but the target keywords may be only some of the candidate keywords. For example, the sum of the plurality of score values of each candidate keyword may be counted, or the average of the plurality of score values of each candidate keyword may be calculated, as a measure of the degree of association (word probability) of each candidate keyword with the target article. Further, a target keyword may be determined from the plurality of candidate keywords. For example, a preset number of candidate keywords having the highest score values (word probabilities) may be determined as the target keywords, as illustrated in the sketch below.
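For illustration only, a minimal sketch of selecting a preset number of target keywords by word probability; `word_probs` and `top_n` are assumed inputs.

```python
def select_target_keywords(word_probs, top_n=5):
    # `word_probs` maps each candidate keyword to the word probability output
    # by the supervision model; keep the top-n candidates as target keywords.
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]
```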
In application, the supervision model may be obtained by performing model training according to existing article content and the corresponding keywords. The objective of supervised learning is to learn a function (model) whose sample data (existing article content) and output values (keywords) are known, fitting the relationship between input and output as closely as possible. The word probability that each candidate keyword belongs to the target keywords of the target article can be obtained through the score values and the supervision model. Furthermore, on the basis of determining a plurality of candidate keywords through the preset keyword library, the target keyword is further determined from the candidate keywords through the supervision model, ensuring that the extracted keywords are high-quality words.
In this embodiment, a plurality of participles are obtained by performing word segmentation processing on a target article and are compared with a preset keyword library, candidate keywords are determined from the plurality of participles, a plurality of score values of each candidate keyword are calculated respectively, and the target keyword is further determined from the plurality of candidate keywords according to the plurality of score values. On the basis of maintaining a high-quality keyword library for selecting candidate keywords from the target article, the word probability of each candidate keyword can be further calculated according to the supervision model, and the target keyword is determined from the plurality of candidate keywords according to the word probabilities, ensuring that the extracted target keywords are high-quality words with a high degree of association with the target article.
Referring to fig. 2, in an embodiment, before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library at S102, the method further includes the following steps S102A-S102D, which are detailed as follows:
S102A, determining the article field of the target article, and acquiring the field text belonging to the article field.
In application, the target article can be obtained by the terminal device crawling the network. It can be understood that, for a target article browsed by the user on the terminal device, the target article usually carries a field tag (article field) assigned when it was published. Therefore, the terminal device can determine the article field of the target article at the same time as acquiring the target article. Illustratively, for a terminal device that is a smartphone, when a browser is used to browse a target article, the target article already has an exact article field. Specifically, reference may be made to the related channels in fig. 3: each term in the related channels may be regarded as an article field of the target article, and a plurality of texts in that article field may be regarded as field texts.
S102B, calculating the domain association degree between each domain participle according to the domain participles in the domain text.
In application, the domain participles can also be obtained by the word segmentation method described in S101, which will not be described in detail again. Calculating the domain association degree between each pair of domain participles may be calculating the mutual information between the domain participles. Specifically, the following formula can be referred to:
PMI(x, y) = log( p(x, y) / (p(x) · p(y)) )
wherein p(x, y) is the probability that the domain participle x and the domain participle y appear simultaneously in the plurality of domain texts, p(x) is the probability that the domain participle x appears independently in the plurality of domain texts, p(y) is the probability that the domain participle y appears independently in the plurality of domain texts, and PMI(x, y) is the mutual information of the domain participle x and the domain participle y. The number of acquired domain texts can be counted, the number of domain texts in which the domain participle x and the domain participle y appear simultaneously can be calculated over the plurality of domain texts, and the numbers of domain texts in which the domain participle x and the domain participle y each appear independently can also be calculated. Mutual information between each pair of domain participles can then be calculated according to the above formula.
In other applications, after the mutual information is calculated, the left and right information of each domain participle is further calculated according to the mutual information to obtain left-right mutual information, and the left-right mutual information is used as the domain association degree. Illustratively, consider three participles appearing in the domain text, rendered here as the characters "Ping" (平), "An" (安) and "Fu" (符). The mutual information (domain association degree) between "Ping" and "An", between "Ping" and "Fu", and between "An" and "Fu" can be calculated respectively through the mutual information formula above. Then, according to the magnitude of the mutual information, it can be determined that the domain participle "Ping An" formed by "Ping" and "An" has a higher domain association degree. "Ping An" can then be used as a domain participle, and the left-right mutual information between "Ping An" and "Fu" can be calculated. If the mutual information value for forming "An Fu" on the right is calculated to be very low, it is determined that "An Fu" does not form a domain participle. However, if the mutual information value for forming "Ping An Fu" on the left is calculated to be high, it is determined that "Ping An Fu" can constitute a new domain participle. Finally, a plurality of domain association degrees (left-right mutual information) among the three domain participles "Ping", "An" and "Fu" can be obtained. It can be understood that, when calculating the mutual information between "Ping" and "An", it is also necessary to calculate the left-right mutual information between the two orderings of the characters, and to determine from the left-right mutual information which ordering can be used as a domain participle.
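For illustration only, a minimal sketch of the mutual information calculation above, estimating the probabilities from document counts over the domain texts as described; the left-right mutual information step is not shown.

```python
import math

def pmi(participle_x, participle_y, domain_texts):
    n = len(domain_texts)
    n_x = sum(1 for text in domain_texts if participle_x in text)
    n_y = sum(1 for text in domain_texts if participle_y in text)
    n_xy = sum(1 for text in domain_texts
               if participle_x in text and participle_y in text)
    if n_x == 0 or n_y == 0 or n_xy == 0:
        return float("-inf")  # the two participles never co-occur
    p_x, p_y, p_xy = n_x / n, n_y / n, n_xy / n
    return math.log(p_xy / (p_x * p_y))
```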
S102C, determining a target relevance degree which is larger than a preset relevance degree from the multiple domain relevance degrees, and determining a target domain participle corresponding to the target relevance degree.
S102D, storing the target domain participles into the keyword library.
In application, the preset association degree may be a value set by the user according to the actual situation, or may be a fixed value preset by the terminal device, which is not limited herein. After the domain association degree between each pair of domain participles is obtained, the target association degree can be determined from the multiple domain association degrees, and the target domain participle corresponding to the target association degree can be determined. For example, when a domain association degree is greater than the preset association degree, that domain association degree is determined to be a target association degree, and the domain participle corresponding to the target association degree is determined to be a target domain participle. As described in S102, the preset keyword library may store a plurality of interest vocabularies preset by the user as keywords in a storage path specified by the terminal device. Therefore, the terminal device may likewise store the target domain participles into the keyword library. For example, "Ping An" and "Ping An Fu" described above may be used as target domain participles and stored in the keyword library.
Referring to fig. 4, in an embodiment, before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library at S102, the method further includes the following steps S102E-S102F, which are detailed as follows:
S102E, determining the article field of the target article, and acquiring a plurality of field keywords belonging to the article field.
S102F, storing the plurality of domain keywords into the keyword library.
In application, the above has described how to determine the article domain of a target article, and has described that a plurality of domain texts can be crawled from the network after determining the article domain. Based on the method, the terminal equipment can also directly acquire the marked vocabularies of each field text as the field keywords to generate a keyword library. For example, referring to fig. 3, in fig. 3, a plurality of words ("5G channel", "internet") in the same column as "related channel" can be considered as the article field. In addition, as can be seen from fig. 3, when the user selects the article field of the "internet" in the terminal device, the terminal device may obtain the corresponding field text from the network according to the article field, and obtain that each field text already has the field keyword (the word indicated by the arrow for the first article in the drawing) when being published. The domain keywords can be regarded as high-frequency words or core words defined by the issuing organization for each domain text. Thus, a plurality of domain keywords for each domain text under the article domain may be stored into the keyword library.
Referring to fig. 5, in an embodiment, the step S102 of determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library further includes the following sub-steps S1021 to S1023, which are detailed as follows:
S1021, determining whether the keyword library contains a target participle, wherein the target participle is any one of the plurality of participles.
And S1022, if the keyword library contains the target participle, taking the target participle as a candidate keyword.
In application, the participles stored in the keyword library are high-quality vocabularies in the text field. Therefore, after the plurality of segmented words are obtained, the plurality of segmented words can be respectively compared with the segmented words in the keyword library. If the segmentation is consistent with the stored segmentation in the keyword library, the segmentation can be primarily used as a candidate keyword. Among the multiple participles, the participle compared with the participle in the keyword library can be regarded as the target participle.
S1023, if the keyword library does not contain the target participle, judging whether the target participle belongs to an entity word; if the target participle belongs to an entity word, inputting the target participle belonging to the entity word into the supervision model to obtain the keyword probability of the target participle belonging to the entity word; and if the keyword probability is greater than a probability threshold, taking the target participle corresponding to the keyword probability as a candidate keyword.
In application, entity words are words that describe things which exist independently. After it is determined that a participle is not stored in the keyword library, it may be determined whether that participle belongs to an entity word. If not, the participle can be considered to carry no useful meaning and can therefore be deleted. Whether a participle belongs to an entity word can be judged through Named Entity Recognition (NER) technology. Specifically, named entity recognition, also called "proper name recognition", refers to recognizing entities with specific meanings in text, mainly including names of people, places, organizations, proper nouns, and the like.
In application, when determining that the participle which is not stored in the keyword library belongs to the entity word, the participle can be input into the supervision model to obtain the keyword probability of the participle. The supervision model is a classification model trained in advance and used for judging the keyword probability of the participle belonging to the candidate keyword again. Reference may be made specifically to the description of the supervision model in S104 above, which will not be described in detail.
In application, the supervision model can extract word features of each participle in a target article, and then outputs keyword probability belonging to the target article according to the word features. The word features of the participles can be comprehensively extracted by the supervision model according to information such as the occurrence positions of the participles in the target article, the occurrence number of the participles in the target article, the word lengths of the participles and the like, and are classified according to the word features to output the keyword probability that the participles belong to the target article. The probability threshold may be a numerical value preset by a user, or may be a probability threshold set after a supervision model performs training analysis according to existing big data, which is not limited in this respect. When the probability of the keyword is greater than the probability threshold, the participle corresponding to the probability of the keyword is regarded as the candidate keyword.
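For illustration only, a minimal sketch of this filtering step; `is_entity_word`, `extract_word_features` and the `predict_proba` interface of the supervision model are hypothetical placeholders rather than a specific library API.

```python
def filter_unlisted_participle(participle, article, supervision_model,
                               is_entity_word, extract_word_features,
                               prob_threshold=0.5):
    # Participles absent from the keyword library are kept only if they are
    # entity words and the supervision model gives them a high enough
    # keyword probability.
    if not is_entity_word(participle):
        return None  # non-entity participles are discarded
    features = extract_word_features(participle, article)
    keyword_prob = supervision_model.predict_proba([features])[0][1]
    return participle if keyword_prob > prob_threshold else None
```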
Referring to fig. 6, in an embodiment, the supervision model can be obtained by the following steps S201 to S206, which are detailed as follows:
S201, obtaining a training sample, and obtaining labeled training keywords from the training sample.
In application, the training samples may be regarded as the described domain text, and the corresponding training keywords may be regarded as the target domain participles corresponding to the domain text. The method for acquiring the training sample can be to crawl a plurality of domain texts in the same article domain from the network. Based on the word segmentation method described in S101, the training samples can be segmented to obtain a plurality of sample segmented words, which will not be described in detail.
S202, performing word segmentation on the text content in the training sample to obtain a plurality of sample word segmentations, and respectively calculating a sample score value corresponding to each sample word segmentation.
S203, determining sample keywords from the sample participles according to the sample score values.
In application, the sample score value may be determined according to the article position of the sample participle in the training sample, or the word span of the sample participle in the training sample may be calculated as the score value, which is not limited. Based on the magnitude of the sample score values, a sample score threshold may be set. When a sample score value is greater than the sample score threshold, the sample participle corresponding to that sample score value is used as a sample keyword; alternatively, the plurality of sample score values may be sorted and the sample participles corresponding to the top preset number of sample score values used as sample keywords, which is not limited.
S204, determining the label type of the sample keyword based on the sample keyword and the training keyword.
In application, the labeled categories are used to assign specific values to the sample keywords for calculation when computing the training loss value of the model. Specifically, if a sample keyword is consistent with any training keyword, the labeled category of the sample keyword may be set to 1; otherwise, the labeled category of the sample keyword may be set to 0.
S205, extracting the keyword features of the sample keywords.
S206, model training is carried out based on the keyword features and the mark types of the sample keywords, and the supervision model is obtained.
In application, extracting the keyword features of the sample keywords may be regarded as performing feature engineering on the sample keywords, that is, extracting word features of the sample keywords from multiple aspects. Specifically, as shown in fig. 7, feature engineering is performed on the sample keywords, and fig. 7 shows the keyword features to be extracted for the sample keywords in the training sample.
In application, after a plurality of keyword features of a sample keyword are obtained, the features can be fused through the neural network structure in an initial supervision model to obtain a fusion feature, so that the fusion feature can comprehensively represent the multiple pieces of feature information of the sample keyword. The model can then output the probability that the sample keyword belongs to the keywords according to the fusion feature, and the training loss is calculated by combining this with the labeled category of the sample keyword. Finally, the model parameters are iteratively updated according to the training loss, and when the training loss converges, the current model is taken as the trained supervision model. This improves the accuracy of the supervision model in determining target keywords in the target article.
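For illustration only, a minimal sketch of the supervised training loop (S201 to S206); the application describes fusing keyword features with a neural network structure, but this sketch substitutes a plain logistic regression classifier, and `build_feature_vector` is a hypothetical helper that turns a sample keyword into numeric keyword features (word frequency, position, span, and so on).

```python
from sklearn.linear_model import LogisticRegression

def train_supervision_model(sample_keywords, training_keywords,
                            training_samples, build_feature_vector):
    X, y = [], []
    for keyword, sample in zip(sample_keywords, training_samples):
        X.append(build_feature_vector(keyword, sample))
        # Labeled category: 1 if the sample keyword matches a labeled
        # training keyword, 0 otherwise.
        y.append(1 if keyword in training_keywords else 0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)  # iteratively minimises the training loss
    return model
```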
Referring to fig. 8, in an embodiment, the plurality of score values include a first score value, a second score value, a third score value and a fourth score value; s103 respectively calculating a plurality of score values corresponding to each of the plurality of candidate keywords according to the plurality of candidate keywords and the target article, and further includes the following substeps S1031 to S1034, which are detailed as follows:
S1031, counting the number of the plurality of participles, respectively calculating the word frequency of each candidate keyword in the target article according to the number, and correspondingly calculating the first score value of each candidate keyword according to the word frequency.
In application, the number of the plurality of participles is the total number of participles included in the target article, and the word frequency of a candidate keyword may be calculated as the ratio of the number of times the candidate keyword appears in the target article to that total number. The first score value may be the word frequency in the target article, or may be the term frequency-inverse document frequency (TF-IDF) calculated from the word frequency. Specifically, for the inverse document frequency, the terminal device may count a first number, namely the number of domain texts, and a second number, namely the number of those domain texts that contain the candidate keyword. Then, the ratio of the first number to the second number is calculated and its base-10 logarithm is taken; the resulting value is the inverse document frequency of the candidate keyword. In this way the TF-IDF of each candidate keyword can be obtained, and the first score value can be obtained by multiplying the word frequency by the inverse document frequency. It should be noted that the TF-IDF value may be any value between 0 and infinity; for the convenience of subsequent calculation, each TF-IDF value may be normalized so that it lies in the interval from 0 to 1.
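For illustration only, a minimal sketch of the first score value as a TF-IDF computation over assumed inputs `participles` and `domain_texts`; the normalization to the interval from 0 to 1 is omitted.

```python
import math

def first_score(candidate, participles, domain_texts):
    # Word frequency: occurrences of the candidate over the total participle count.
    tf = participles.count(candidate) / len(participles)
    # Inverse document frequency over the domain texts.
    doc_count = sum(1 for text in domain_texts if candidate in text)
    idf = math.log10(len(domain_texts) / max(doc_count, 1))
    return tf * idf
```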
S1032, determining the positions of the candidate keywords in the target article, and calculating a second score value of each candidate keyword based on the positions of the candidate keywords in the target article.
In application, as described in S103, whether a candidate keyword appears in the title or the body of the target article can reflect the importance of the candidate keyword in the target article. Specifically, the second score for a candidate keyword appearing in the title may be set to 0.6, and the second score for a candidate keyword appearing in the body text may be set to 0.4, which may be set according to the actual situation. It can be understood that, if the same candidate keyword appears multiple times in the target article, in multiple positions such as the title and the body text, the sum of the scores corresponding to those positions can be used as the second score value; alternatively, the average value for the same candidate keyword is used as the second score value, which is not limited. In order to distinguish the title from the body text in the target article, a space or a special symbol may be added between them.
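For illustration only, a minimal sketch of the second score value using the example weights above (0.6 for the title, 0.4 for the body), summing the scores of the positions in which the keyword appears; the title and body are assumed to be available as separate strings.

```python
def second_score(candidate, title, body, title_weight=0.6, body_weight=0.4):
    # Sum the position scores of every position type containing the keyword.
    score = 0.0
    if candidate in title:
        score += title_weight
    if candidate in body:
        score += body_weight
    return score
```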
S1033, respectively determining an initial position and an end position of each candidate keyword in the target article, and calculating a third score value corresponding to each candidate keyword according to the initial position and the end position.
In application, the third score may be considered as a word span of each candidate keyword in the target article. Specifically, the candidate keywords are determined from a plurality of participles in the target article, so that each participle in the target article can be respectively sorted according to the text content of the target article, and further the corresponding serial number of the corresponding candidate keyword in the target article can be determined, that is, the position of each candidate keyword in the target article can be determined. When a candidate keyword appears in the target article for multiple times (i.e. one candidate keyword has multiple serial numbers), the minimum serial number of the candidate keyword may be used as the initial position in the target article, and the maximum serial number of the candidate keyword may be used as the end position in the target article. And subtracting the two serial numbers to obtain a difference value, namely a third score value. In addition, for convenience of subsequent calculation, the difference may be divided by the total number of the multiple participles to perform normalization processing on the difference, and the normalized value is used as the third score value, which is not limited.
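For illustration only, a minimal sketch of the third score value (word span), assuming `participles` is the ordered list of participles in the target article.

```python
def third_score(candidate, participles):
    positions = [i for i, word in enumerate(participles) if word == candidate]
    if not positions:
        return 0.0
    # End position minus initial position, normalised by the participle count.
    span = max(positions) - min(positions)
    return span / len(participles)
```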
S1034, calculating a fourth score value corresponding to each candidate keyword according to a preset text sorting algorithm.
In application, the text sorting algorithm may be a graph-based ranking (TextRank) model: the target article is divided into a plurality of component units (participles), a graph model is established, and a voting mechanism is used to rank the important components, i.e., to score and rank the plurality of participles in the target article. Then, according to the score of each participle, the score corresponding to each candidate keyword is determined from the plurality of participles and used as the fourth score value.
It will be appreciated that target keywords generally appear in the title and appear in the target article relatively often. Therefore, calculating the above four score values for each candidate keyword provides a good measure of how critical a keyword is in the target article. Based on the plurality of score values, the terminal device can comprehensively judge the degree of criticality of each candidate keyword in the target article, improving the accuracy of determining high-quality target keywords from the candidate keywords.
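For illustration only, a minimal TextRank-style sketch of the fourth score value: a co-occurrence graph is built over the participles with a sliding window and PageRank-style iterations are run; the window size, damping factor and iteration count are assumed values.

```python
from collections import defaultdict

def textrank_scores(participles, window=3, damping=0.85, iterations=30):
    # Build an undirected co-occurrence graph over the participles.
    neighbors = defaultdict(set)
    for i, word in enumerate(participles):
        for j in range(i + 1, min(i + window, len(participles))):
            other = participles[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # PageRank-style voting iterations.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores
    # Fourth score value of a candidate keyword: scores.get(keyword, 0.0)
    return scores
```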
Referring to fig. 9, in an embodiment, the target keyword includes a plurality of keywords; after the step S104 of inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining a word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability, the method further includes the following steps S104A-S104B, which are detailed as follows:
S104A, counting the total number of occurrences of each target keyword in the multiple target articles, and calculating the ratio between the totals of the target keywords.
In application, the multiple articles may be understood as articles clicked on by the user within a preset time period, and each target keyword may be understood as one or more target keywords extracted by the terminal device from a target article using this method when the user clicks on that target article. Therefore, the terminal device can obtain one or more target keywords for each target article within the preset time period. Illustratively, when the user clicks on a target article and the terminal device records a plurality of target keywords of that target article, the target keywords corresponding to the target article might be: "mother and baby", "family has a cute baby", and "nutritional development". If the user clicks on further articles within the preset time period, and the target keywords "mother and baby", "family has a cute baby" and "nutritional development" recorded from those articles appear multiple times, the terminal device may accumulate the number of occurrences of each target keyword, that is, count the total number of each target keyword across the multiple articles. Further, a ratio may be calculated based on the total number of each target keyword.
It is understood that the target keywords do not necessarily appear in every article, and the above target keywords are only an example. In addition, a target keyword that appears only once in the multiple target articles should still be recorded and participate in the ratio calculation.
S104B, performing article recall according to the ratio and each target keyword to obtain an article set, wherein, in the article set, the ratio between the numbers of articles containing each target keyword is equal to the ratio.
In application, the article set is used to store articles recalled by the terminal device according to the target keywords. After each target keyword and the ratio between the target keywords are determined, the number of articles to recall can be derived from the ratio. Specifically, the total number of articles to be recalled by the terminal device may be preset, and the number of articles containing each target keyword to be recalled may be calculated according to the total number and the ratio. For example, suppose the ratio for "mother and baby", "family has a cute baby" and "nutritional development" above is 5:2:3, and the total number of articles to be recalled by the terminal device is 10. In order for the ratio between the numbers of articles containing each target keyword in the article set to equal this ratio, the terminal device should recall 5 target articles containing the target keyword "mother and baby", 2 target articles containing the target keyword "family has a cute baby", and 3 target articles containing the target keyword "nutritional development". In this way, the terminal device can automatically recall articles of interest to the user from the network according to the target keywords, improving the recall effect of the terminal device.
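For illustration only, a minimal sketch of allocating the recall quota according to the ratio; `keyword_totals` and `total_recall` are assumed inputs, and rounding is not adjusted to guarantee the exact total.

```python
def recall_quota(keyword_totals, total_recall=10):
    # Number of recalled articles per target keyword, proportional to the
    # ratio between the keyword totals.
    total = sum(keyword_totals.values())
    return {keyword: round(total_recall * count / total)
            for keyword, count in keyword_totals.items()}

# Usage with the example above:
# recall_quota({"mother and baby": 5, "family has a cute baby": 2,
#               "nutritional development": 3}, total_recall=10)
# -> {"mother and baby": 5, "family has a cute baby": 2,
#     "nutritional development": 3}
```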
Referring to fig. 10, fig. 10 is a block diagram illustrating a keyword extraction apparatus according to an embodiment of the present disclosure. The units included in the terminal device in this embodiment are used to execute the steps in the embodiments corresponding to fig. 1, fig. 2, fig. 4 to fig. 6, fig. 8 and fig. 9; please refer to those figures and the related descriptions of the corresponding embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 10, the keyword extraction apparatus 1000 includes: a first obtaining module 1010, a first determining module 1020, a first calculating module 1030, and a second determining module 1040, wherein:
the first obtaining module 1010 is configured to obtain a plurality of word segments in the target article.
A first determining module 1020, configured to determine a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library.
A first calculating module 1030, configured to calculate, according to the multiple candidate keywords and the target article, multiple score values corresponding to each of the multiple candidate keywords respectively.
The second determining module 1040 is configured to input the multiple score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtain a word probability of each candidate keyword, and determine a target keyword from the multiple candidate keywords according to the word probability.
In an embodiment, the keyword extraction apparatus 1000 further includes:
and the third determining module is used for determining the article field of the target article and acquiring the field text belonging to the article field.
And the second calculation module is used for calculating the domain association degree between each domain participle according to the plurality of domain participles in the domain text.
The fourth determining module is used for determining a target relevance degree which is larger than a preset relevance degree from the multiple field relevance degrees and determining a target field word segmentation corresponding to the target relevance degree.
And the first generation module is used for storing the target field participles into the keyword library.
In an embodiment, the keyword extraction apparatus 1000 further includes:
and the fifth determining module is used for determining the article field of the target article and acquiring a plurality of field keywords belonging to the article field.
And the second generation module is used for storing the plurality of domain keywords into the keyword library.
In an embodiment, the first determining module 1020 is further configured to:
determining whether the keyword library contains target participles, wherein the target participles are any one of the participles;
if the keyword library contains the target participle, taking the target participle as a candidate keyword;
if the keyword library does not contain the target participle, judging whether the target participle belongs to an entity word; if the target participle belongs to an entity word, inputting the target participle belonging to the entity word into the supervision model to obtain the keyword probability of the target participle belonging to the entity word; and if the keyword probability is greater than a probability threshold, taking the target participle corresponding to the keyword probability as a candidate keyword.
In an embodiment, the keyword extraction apparatus 1000 further includes the following modules for training the supervision model:
and the second acquisition module is used for acquiring a training sample and acquiring the labeled training keywords from the training sample.
And the word segmentation module is used for segmenting the text content in the training sample to obtain a plurality of sample word segmentations and respectively calculating the sample score value corresponding to each sample word segmentation.
And the sixth determining module is used for determining the sample keywords from the sample participles according to the sample score values.
And the seventh determining module is used for determining the mark type of the sample keyword based on the sample keyword and the training keyword.
And the extraction module is used for extracting the keyword characteristics of the sample keywords.
And the training module is used for carrying out model training based on the keyword features and the mark categories of the sample keywords to obtain the supervision model.
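The training flow could be sketched as follows, with a gradient-boosted classifier standing in for the unspecified supervision model; segment and score_candidate are the same hypothetical helpers used in the earlier sketches, and the score-sum cut-off for selecting sample keywords is an assumption.

from sklearn.ensemble import GradientBoostingClassifier

def train_supervision_model(training_samples):
    # training_samples: list of (text, labelled_keywords) pairs
    X, y = [], []
    for text, labelled_keywords in training_samples:
        segments = segment(text)
        scored = {w: score_candidate(w, segments, text) for w in set(segments)}
        # sample keywords: participles whose sample score values pass a simple cut-off
        sample_keywords = [w for w, s in scored.items() if sum(s) > 0.1]
        for w in sample_keywords:
            X.append(scored[w])                           # keyword features
            y.append(1 if w in labelled_keywords else 0)  # mark category
    return GradientBoostingClassifier().fit(X, y)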
In an embodiment, the plurality of score values comprises a first score value, a second score value, a third score value, and a fourth score value; the first calculation module 1030 is further configured to:
counting the number of the multiple participles, respectively calculating the word frequency of each candidate keyword in the target article according to the number, and correspondingly calculating a first score value of each candidate keyword according to the word frequency;
determining the positions of the candidate keywords in the target article, and calculating a second score value of each candidate keyword based on the positions of the candidate keywords in the target article;
respectively determining an initial position and an end position of each candidate keyword in the target article, and calculating a third score value corresponding to each candidate keyword according to the initial position and the end position;
and calculating a fourth score value corresponding to each candidate keyword according to a preset text sorting algorithm.
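The claims do not give closed-form formulas for the four score values, so the sketch below uses simple stand-ins: term frequency for the first score value, an early-position bonus for the second, the span between first and last occurrence for the third, and a TextRank-style centrality computed with the networkx PageRank implementation for the fourth.

import networkx as nx

def score_candidate(word, segments, article):
    total = max(len(segments), 1)
    tf_score = segments.count(word) / total                      # first score value
    first, last = article.find(word), article.rfind(word)
    length = max(len(article), 1)
    pos_score = 1.0 - first / length if first >= 0 else 0.0      # second score value
    span_score = (last - first) / length if first >= 0 else 0.0  # third score value
    # fourth score value: TextRank-style rank over adjacent co-occurrences
    graph = nx.Graph()
    graph.add_edges_from(zip(segments, segments[1:]))
    rank = nx.pagerank(graph).get(word, 0.0) if graph.number_of_nodes() else 0.0
    return [tf_score, pos_score, span_score, rank]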
In one embodiment, the target keyword comprises a plurality of keywords; the keyword extraction apparatus 1000 further includes:
The statistical module is used for counting the total number of each target keyword in a plurality of target articles and calculating the ratio accounted for by the total number of each target keyword;
and the recalling module is used for recalling articles according to the ratios and each target keyword to obtain an article set, wherein, in the article set, the proportion of articles recalled for each target keyword is equal to the corresponding ratio.
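A rough sketch of this ratio-driven article recall follows, assuming an in-memory corpus, substring matching and simple quota rounding; none of these choices come from the disclosure.

def recall_articles(target_keywords, corpus, total_to_recall=100):
    counts = {k: sum(k in art for art in corpus) for k in target_keywords}
    grand_total = sum(counts.values()) or 1
    recalled = []
    for keyword, count in counts.items():
        ratio = count / grand_total              # ratio accounted for by this keyword
        quota = round(total_to_recall * ratio)   # number of articles to recall for it
        recalled.extend([art for art in corpus if keyword in art][:quota])
    return recalled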
It should be understood that, in the structural block diagram of the keyword extraction apparatus shown in fig. 10, each unit/module is configured to execute the steps in the embodiments corresponding to fig. 1, fig. 2, fig. 4 to fig. 6, fig. 8, and fig. 9. These steps have been explained in detail in the above embodiments; for specifics, please refer to the relevant description in those embodiments, which is not repeated herein.
Fig. 11 is a block diagram of a terminal device according to another embodiment of the present application. As shown in fig. 11, the terminal device 1100 of this embodiment includes: a processor 1110, a memory 1120, and a computer program 1130, such as a program for a keyword extraction method, stored in the memory 1120 and executable on the processor 1110. When executing the computer program 1130, the processor 1110 implements the steps of the above-described embodiments of the keyword extraction method, such as S101 to S104 shown in fig. 1. Alternatively, when executing the computer program 1130, the processor 1110 implements the functions of the units in the embodiment corresponding to fig. 10, for example, the functions of the units 1010 to 1040 shown in fig. 10; for details, please refer to the related description in the embodiment corresponding to fig. 10.
Illustratively, the computer program 1130 may be divided into one or more units, which are stored in the memory 1120 and executed by the processor 1110 to complete the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution of the computer program 1130 in the terminal device 1100. For example, the computer program 1130 may be divided into a first acquisition unit, a first determination unit, a first calculation unit, and a second determination unit, each unit functioning as described above.
The terminal device may include, but is not limited to, a processor 1110 and a memory 1120. Those skilled in the art will appreciate that fig. 11 is merely an example of the terminal device 1100 and does not constitute a limitation on the terminal device 1100, which may include more or fewer components than shown, or combine some components, or have different components; for example, the terminal device may also include input and output devices, network access devices, buses, and the like.
The processor 1110 may be a central processing unit, or may be another general purpose processor, a digital signal processor, an application specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or any conventional processor.
The memory 1120 may be an internal storage unit of the terminal device 1100, such as a hard disk or a memory of the terminal device 1100. The memory 1120 may also be an external storage device of the terminal device 1100, such as a plug-in hard disk, a smart memory card, or a flash memory card provided on the terminal device 1100. Further, the memory 1120 may also include both an internal storage unit and an external storage device of the terminal device 1100.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A keyword extraction method is characterized by comprising the following steps:
acquiring a plurality of participles in a target article;
determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library;
respectively calculating a plurality of score values corresponding to each candidate keyword in the candidate keywords according to the candidate keywords and the target article;
and inputting a plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the candidate keywords according to the word probability.
2. The method for extracting keywords according to claim 1, wherein before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library, the method further comprises:
determining an article field of the target article, and acquiring a field text belonging to the article field;
calculating the field association degree between every two field participles according to the plurality of field participles in the field text;
determining a target association degree which is greater than a preset association degree from the plurality of field association degrees, and determining a target field participle corresponding to the target association degree;
and storing the target field participle into the keyword library.
3. The method for extracting keywords according to claim 1, wherein before determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library, the method further comprises:
determining an article field of the target article, and acquiring a plurality of field keywords belonging to the article field;
storing the plurality of field keywords into the keyword library.
4. The method for extracting keywords according to any one of claims 1 to 3, wherein the determining a plurality of candidate keywords from the plurality of segmented words according to a preset keyword library comprises:
determining whether the keyword library contains a target participle, wherein the target participle is any one of the plurality of participles;
if the keyword library contains the target participle, taking the target participle as a candidate keyword;
if the keyword library does not contain the target participle, judging whether the target participle belongs to an entity word; if the target participle belongs to an entity word, inputting the target participle into the supervision model to obtain the keyword probability of the target participle; and if the keyword probability is greater than a probability threshold, taking the target participle corresponding to the keyword probability as a candidate keyword.
5. The keyword extraction method according to claim 4, wherein the supervision model is trained by the following steps:
acquiring a training sample, and acquiring labeled training keywords from the training sample;
performing word segmentation on the text content in the training sample to obtain a plurality of sample word segmentations, and respectively calculating a sample score value corresponding to each sample word segmentation;
determining sample keywords from the plurality of sample participles according to a plurality of sample score values;
determining a label category of the sample keyword based on the sample keyword and the training keyword;
extracting keyword features of the sample keywords;
and performing model training based on the keyword features and the mark categories of the sample keywords to obtain the supervision model.
6. The keyword extraction method according to claim 1, wherein the plurality of score values include a first score value, a second score value, a third score value, and a fourth score value;
the calculating a plurality of score values corresponding to each candidate keyword in the plurality of candidate keywords respectively according to the plurality of candidate keywords and the target article comprises:
counting the number of the multiple participles, respectively calculating the word frequency of each candidate keyword in the target article according to the number, and correspondingly calculating a first score value of each candidate keyword according to the word frequency;
determining the positions of the candidate keywords in the target article, and calculating a second score value of each candidate keyword based on the positions of the candidate keywords in the target article;
respectively determining an initial position and an end position of each candidate keyword in the target article, and calculating a third score value corresponding to each candidate keyword according to the initial position and the end position;
and calculating a fourth score value corresponding to each candidate keyword according to a preset text sorting algorithm.
7. The keyword extraction method according to claim 1, wherein the target keyword comprises a plurality of keywords;
after the inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability, the method further comprises:
counting the total number of each target keyword in a plurality of target articles, and calculating the ratio accounted for by the total number of each target keyword;
and recalling articles according to the ratios and each target keyword to obtain an article set, wherein, in the article set, the proportion of articles recalled for each target keyword is equal to the corresponding ratio.
8. A keyword extraction device is characterized by comprising:
the first acquisition module is used for acquiring a plurality of participles in a target article;
the first determining module is used for determining a plurality of candidate keywords from the plurality of participles according to a preset keyword library;
the first calculation module is used for respectively calculating a plurality of score values corresponding to each candidate keyword in the candidate keywords according to the candidate keywords and the target article;
and the second determining module is used for inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervision model, respectively obtaining the word probability of each candidate keyword, and determining a target keyword from the plurality of candidate keywords according to the word probability.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202011229490.4A 2020-11-06 2020-11-06 Keyword extraction method, keyword extraction device, terminal equipment and storage medium Active CN112347778B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011229490.4A CN112347778B (en) 2020-11-06 2020-11-06 Keyword extraction method, keyword extraction device, terminal equipment and storage medium
PCT/CN2021/091083 WO2022095374A1 (en) 2020-11-06 2021-04-29 Keyword extraction method and apparatus, and terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011229490.4A CN112347778B (en) 2020-11-06 2020-11-06 Keyword extraction method, keyword extraction device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112347778A true CN112347778A (en) 2021-02-09
CN112347778B CN112347778B (en) 2023-06-20

Family

ID=74428396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011229490.4A Active CN112347778B (en) 2020-11-06 2020-11-06 Keyword extraction method, keyword extraction device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112347778B (en)
WO (1) WO2022095374A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270092A (en) * 2021-05-11 2021-08-17 云南电网有限责任公司 Scheduling voice keyword extraction method based on LDA algorithm
CN113536777A (en) * 2021-07-30 2021-10-22 深圳豹耳科技有限公司 Extraction method, device and equipment of news keywords and storage medium
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113792131A (en) * 2021-09-23 2021-12-14 平安国际智慧城市科技股份有限公司 Keyword extraction method and device, electronic equipment and storage medium
WO2022095374A1 (en) * 2020-11-06 2022-05-12 平安科技(深圳)有限公司 Keyword extraction method and apparatus, and terminal device and storage medium
WO2022188585A1 (en) * 2021-03-08 2022-09-15 京东科技控股股份有限公司 Annotation method and apparatus for use in text data, computer device, and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292477B (en) * 2022-07-18 2024-04-16 盐城天眼察微科技有限公司 Method and device for judging push similar articles, storage medium and electronic equipment
CN115269989B (en) * 2022-08-03 2023-05-05 百度在线网络技术(北京)有限公司 Object recommendation method, device, electronic equipment and storage medium
CN116521906B (en) * 2023-04-28 2023-10-24 广州商研网络科技有限公司 Meta description generation method, device, equipment and medium thereof
CN116341521B (en) * 2023-05-22 2023-07-28 环球数科集团有限公司 AIGC article identification system based on text features

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103201718A (en) * 2010-11-05 2013-07-10 乐天株式会社 Systems and methods regarding keyword extraction
CN108319627A (en) * 2017-02-06 2018-07-24 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
CN108334490A (en) * 2017-04-07 2018-07-27 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
US20190163690A1 (en) * 2016-11-10 2019-05-30 Tencent Technology (Shenzhen) Company Limited Keyword extraction method, apparatus and server
CN110008401A (en) * 2019-02-21 2019-07-12 北京达佳互联信息技术有限公司 Keyword extracting method, keyword extracting device and computer readable storage medium
CN111223489A (en) * 2019-12-20 2020-06-02 厦门快商通科技股份有限公司 Specific keyword identification method and system based on Attention mechanism
CN111814770A (en) * 2020-09-04 2020-10-23 中山大学深圳研究院 Content keyword extraction method of news video, terminal device and medium
US20210097238A1 (en) * 2017-08-29 2021-04-01 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction device and method, and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874530B (en) * 2019-10-30 2023-06-13 深圳价值在线信息科技股份有限公司 Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN112347778B (en) * 2020-11-06 2023-06-20 平安科技(深圳)有限公司 Keyword extraction method, keyword extraction device, terminal equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103201718A (en) * 2010-11-05 2013-07-10 乐天株式会社 Systems and methods regarding keyword extraction
US20190163690A1 (en) * 2016-11-10 2019-05-30 Tencent Technology (Shenzhen) Company Limited Keyword extraction method, apparatus and server
CN108319627A (en) * 2017-02-06 2018-07-24 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
CN108334490A (en) * 2017-04-07 2018-07-27 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
US20210097238A1 (en) * 2017-08-29 2021-04-01 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction device and method, and computer-readable storage medium
CN108563636A (en) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Extract method, apparatus, equipment and the storage medium of text key word
CN110008401A (en) * 2019-02-21 2019-07-12 北京达佳互联信息技术有限公司 Keyword extracting method, keyword extracting device and computer readable storage medium
CN111223489A (en) * 2019-12-20 2020-06-02 厦门快商通科技股份有限公司 Specific keyword identification method and system based on Attention mechanism
CN111814770A (en) * 2020-09-04 2020-10-23 中山大学深圳研究院 Content keyword extraction method of news video, terminal device and medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095374A1 (en) * 2020-11-06 2022-05-12 平安科技(深圳)有限公司 Keyword extraction method and apparatus, and terminal device and storage medium
WO2022188585A1 (en) * 2021-03-08 2022-09-15 京东科技控股股份有限公司 Annotation method and apparatus for use in text data, computer device, and storage medium
CN113270092A (en) * 2021-05-11 2021-08-17 云南电网有限责任公司 Scheduling voice keyword extraction method based on LDA algorithm
CN113536777A (en) * 2021-07-30 2021-10-22 深圳豹耳科技有限公司 Extraction method, device and equipment of news keywords and storage medium
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN113792131A (en) * 2021-09-23 2021-12-14 平安国际智慧城市科技股份有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN113792131B (en) * 2021-09-23 2024-02-09 深圳平安智慧医健科技有限公司 Keyword extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022095374A1 (en) 2022-05-12
CN112347778B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN108073568B (en) Keyword extraction method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN106156204B (en) Text label extraction method and device
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
US20130060769A1 (en) System and method for identifying social media interactions
CN107943792B (en) Statement analysis method and device, terminal device and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
Ranjan et al. Document classification using lstm neural network
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN112989118B (en) Video recall method and device
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
CN109344397B (en) Text feature word extraction method and device, storage medium and program product
CN111460808A (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN113255319B (en) Model training method, text segmentation method, abstract extraction method and device
CN110874408A (en) Model training method, text recognition device and computing equipment
Saeed et al. An abstractive summarization technique with variable length keywords as per document diversity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant