CN114943236A - Keyword extraction method and device - Google Patents

Keyword extraction method and device

Info

Publication number
CN114943236A
CN114943236A
Authority
CN
China
Prior art keywords
text
word
features
feature
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210760249.7A
Other languages
Chinese (zh)
Inventor
蒋浩谊
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202210760249.7A priority Critical patent/CN114943236A/en
Publication of CN114943236A publication Critical patent/CN114943236A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a keyword extraction method and a keyword extraction apparatus. The keyword extraction method includes: acquiring a target text; extracting text structure features and text semantic features of the target text, and word structure features and word semantic features of the words in the target text; determining text features according to the text structure features and the text semantic features, and determining word features according to the word structure features and the word semantic features; and determining keywords from the words according to the text features and the word features of the words. The method can improve the accuracy of keyword extraction.

Description

Keyword extraction method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a keyword extraction method. The application also relates to a keyword extraction device, a computing device and a computer readable storage medium.
Background
With the development of artificial intelligence within computer technology, the field of natural language processing has also advanced rapidly, and text-based information retrieval is an important branch of it. Artificial Intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key technologies in the field of artificial intelligence include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, and virtual/augmented reality. Natural Language Processing (NLP) is an important research direction in computer science that studies the theories and methods enabling effective communication between humans and computers in natural language. Concrete applications of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, and speech recognition. As natural language processing technology develops and the pace of life accelerates, the effective information that must be conveyed to users is becoming ever more condensed; keyword extraction techniques in natural language processing can then be used to extract keywords from a text and thereby condense the effective information.
Traditional general-purpose keyword extraction algorithms, such as the word-frequency statistical algorithm (TF-IDF) and the graph-based algorithm (TextRank), mainly extract keywords from medium and long texts. However, these algorithms perform poorly when extracting keywords from short texts, mainly because of the particularity of short texts: a word in a short text typically occurs only once, whereas words in medium and long texts occur with high frequency. An effective solution to this problem is therefore needed.
Disclosure of Invention
In view of this, the embodiment of the present application provides a keyword extraction method to solve the technical defects in the prior art. The embodiment of the application also provides a keyword extraction device, a computing device and a computer readable storage medium.
According to a first aspect of an embodiment of the present application, there is provided a keyword extraction method, including:
acquiring a target text;
extracting text structure features and text semantic features of the target text, and word structure features and word semantic features of words in the target text;
determining text features according to the text structural features and the text semantic features, and determining word features according to the word structural features and the word semantic features;
and determining keywords from the words according to the text features and the word features of the words.
According to a second aspect of the embodiments of the present application, there is provided a keyword extraction apparatus, including:
a first obtaining module configured to obtain a target text;
the extraction module is configured to extract the text structure characteristics and the text semantic characteristics of the target text, and the word structure characteristics and the word semantic characteristics of each word in the target text;
the first determining module is configured to determine text features according to the text structural features and the text semantic features, and determine word features according to the word structural features and the word semantic features;
a second determining module configured to determine keywords from the words according to the text features and the word features of the words.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor realizes the steps of the keyword extraction method when executing the computer-executable instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the keyword extraction method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the keyword extraction method.
The keyword extraction method provided herein acquires a target text; extracts text structure features and text semantic features of the target text, as well as word structure features and word semantic features of the words in the target text; determines text features from the text structure features and text semantic features, and word features from the word structure features and word semantic features; and determines keywords from the words according to the text features and word features. Because the text features are determined from both the structural and semantic features of the text, and the word features from both the structural and semantic features of the words, the semantic- and structure-level information of the text and words can be captured more accurately, making the text features and word features more precise. Keywords can then be determined based on these features, improving the efficiency of keyword determination and avoiding problems such as the uneven distribution of high- and low-frequency words in the semantic space.
Drawings
Fig. 1 is a schematic structural diagram of a keyword extraction method according to an embodiment of the present application;
fig. 2 is a flowchart of a keyword extraction method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a feature extraction model in a keyword extraction method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of another feature extraction model in a keyword extraction method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of another feature extraction model in a keyword extraction method according to an embodiment of the present application;
fig. 6 is a processing flow chart of a keyword extraction method applied to a short text according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a keyword extraction apparatus according to an embodiment of the present application;
fig. 8 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms used in one or more embodiments of this specification are explained.
BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model: learns deep bidirectional representations from unlabeled data through pre-training; after pre-training, an additional output layer is added for fine-tuning, ultimately achieving SOTA (state of the art, i.e., the currently best-performing method or model on a given task) results on multiple NLP (Natural Language Processing) tasks.
Contrastive Learning: contrastive learning is a form of Self-supervised Learning, which in turn is a kind of Unsupervised Learning. Its characteristic is that, without manually labeled class information, the data itself is used as the supervision signal to learn feature representations of the sample data for use in downstream tasks.
Keyword extraction algorithm: in natural language processing, the key information in long or short texts is extracted by a keyword extraction algorithm. Keyword extraction is widely used in recommendation systems, search engines, and other fields, and is an important component of text mining.
SimCSE (Simple Contrastive Learning of Sentence Embeddings) uses a self-supervised approach to improve a model's sentence representations. There are two main ways of constructing training pairs. In the unsupervised setting, the dropout layer is used to construct positive examples: one sample is passed through the encoder twice to obtain a positive pair, and the negatives are the other sentences in the same batch. In the supervised setting, the natural structure of the Stanford Natural Language Inference (SNLI) dataset is used, with the contradiction class providing negative samples and the entailment class providing positive samples.
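The unsupervised positive-pair construction can be illustrated with a toy sketch in plain Python: there is no real encoder here, the embedding values are hypothetical, and `dropout_view` merely mimics the dropout noise that a real encoder would apply internally. Passing the same input through dropout twice yields two slightly different views that serve as a positive pair.

```python
import random

def dropout_view(vec, p=0.1, seed=None):
    """Toy stand-in for encoder dropout: randomly zero elements and rescale
    the survivors, producing one 'view' of the same sentence embedding."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else v / (1 - p) for v in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Two dropout passes over the same (hypothetical) sentence embedding form
# the positive pair; other sentences in the batch act as negatives.
sentence = [0.2, 0.9, 0.4, 0.7]
view_a = dropout_view(sentence, seed=1)
view_b = dropout_view(sentence, seed=2)
print(round(cosine(view_a, view_b), 3))
```

In the real method, the contrastive loss then pulls the two views together while pushing the in-batch negatives apart.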
Next, a keyword extraction method provided in this specification will be briefly described.
Existing keyword extraction techniques fall into three main categories: graph algorithms, word-frequency statistics, and similarity computation, as follows.
TextRank keyword extraction algorithm (graph algorithm): words are nodes in a graph, and the edges between words are determined by co-occurrence, where "co-occurrence" means appearing together within a sliding window of a given size. An unweighted, undirected graph is constructed, and the words with high weights are obtained through the PageRank algorithm.
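As a rough illustration, not the patent's own code, the co-occurrence graph and PageRank iteration can be sketched in a few lines of pure Python; the window size, damping factor, iteration count, and token list are illustrative choices.

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    """Toy TextRank: words are nodes; an unweighted, undirected edge links
    any two distinct words co-occurring within `window` positions."""
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):  # PageRank-style update over the word graph
        score = {w: (1 - d) + d * sum(score[v] / len(neighbors[v])
                                      for v in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)

tokens = ["keyword", "extraction", "method", "keyword", "text",
          "extraction", "text", "feature"]
ranked = textrank(tokens)
print(ranked)
```

Words that co-occur with many other well-connected words accumulate the highest scores and are returned first.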
The TF-IDF (term frequency-inverse document frequency) keyword extraction algorithm (word-frequency statistics) has two parts. The first part computes term frequency, TF = (number of occurrences of the word in the document) / (total number of words in the document); the second part computes inverse document frequency, IDF = log((total number of documents in the corpus) / (number of documents containing the word + 1)). Finally, the words with high TF x IDF weights are taken as the keywords of the document.
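These two formulas can be sketched directly; the corpus below is invented for illustration, and the `+ 1` smoothing follows one common reading of the IDF formula above.

```python
import math

def tf(word, doc):
    # TF = (occurrences of the word in the document) / (total words in the document)
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # IDF = log((total documents in the corpus) / (documents containing the word + 1))
    containing = sum(1 for d in corpus if word in d)
    return math.log(len(corpus) / (containing + 1))

corpus = [
    ["keyword", "extraction", "from", "short", "text"],
    ["text", "classification", "and", "retrieval"],
    ["graph", "based", "ranking"],
]
doc = corpus[0]
weights = {w: tf(w, doc) * idf(w, corpus) for w in set(doc)}
# "text" appears in two documents, so its weight drops to zero here,
# while words unique to this document keep a positive weight.
print(sorted(weights, key=weights.get, reverse=True))
```

This also shows the short-text weakness noted above: in a one-sentence document every TF is identical, so TF contributes nothing to the ranking.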
KeyBERT extracts a document-level representation of a document with BERT and then uses cosine similarity to find the candidate words or phrases most similar to the document.
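The cosine-similarity ranking step can be sketched as follows; the vectors are hypothetical stand-ins for BERT embeddings, not the output of any real model.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(x * x for x in b) ** 0.5))

# Hypothetical embeddings: one vector for the document and one per
# candidate phrase, standing in for BERT outputs.
doc_vec = [0.8, 0.1, 0.6]
candidates = {
    "keyword extraction": [0.7, 0.2, 0.6],
    "weather today": [0.1, 0.9, 0.1],
    "text features": [0.6, 0.3, 0.5],
}
ranked = sorted(candidates, key=lambda c: cosine(doc_vec, candidates[c]),
                reverse=True)
print(ranked[0])  # → keyword extraction
```

The candidates whose vectors point in nearly the same direction as the document vector are returned as keywords.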
Because TF-IDF is fast to compute, the TF-IDF keyword extraction algorithm is used in the search engines of most websites.
Traditional general-purpose keyword extraction algorithms, such as the word-frequency statistical algorithm (TF-IDF) and the graph-based algorithm (TextRank), are mainly designed to extract keywords from medium and long texts. However, these algorithms perform poorly on short texts, mainly because of the particularity of short texts: a word in a short text typically occurs only once, whereas words in medium and long texts occur with high frequency.
Furthermore, KeyBERT relies on BERT for the document-level representation, but word vectors in BERT are not uniformly distributed in space: high-frequency words lie near the origin while low-frequency words lie far from it. Because the low-frequency region is sparsely populated, low-frequency words are insufficiently trained, and the BERT sentence-vector space is consequently semantically non-smooth to a certain degree. For Chinese, the pre-trained language model ERNIE (Enhanced Representation through kNowledge IntEgration) exhibits the same problem, and on datasets from other specific domains ERNIE's limited generalization and transfer capability is further amplified.
Therefore, the present application provides a keyword extraction method that acquires a target text; extracts text structure features and text semantic features of the target text, as well as word structure features and word semantic features of the words in the target text; determines text features from the text structure features and text semantic features, and word features from the word structure features and word semantic features; and determines keywords from the words according to the text features and word features. Determining text features from both the structural and semantic features of the text, and word features from both the structural and semantic features of the words, captures the semantic- and structure-level information of the text and words more accurately, so the text features and word features are more precise. Keywords can then be determined on that basis, improving the efficiency of keyword determination and avoiding problems such as the uneven distribution of high- and low-frequency words in the semantic space.
In the present application, a keyword extraction method is provided. The present application relates to a keyword extraction apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
The execution subject of the keyword extraction method provided in the embodiments of the present application may be a server or a terminal, which is not limited by the embodiments. The terminal may be any electronic product capable of human-computer interaction with a user, such as a PC (Personal Computer), a mobile phone, a palmtop computer (PPC, Pocket PC), a tablet computer, and the like. The server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center, which is likewise not limited by the embodiments of the present application.
Fig. 1 is a schematic structural diagram of a keyword extraction method according to an embodiment of the present application, where a target text is obtained first; then extracting text structure characteristics and text semantic characteristics of the target text, and simultaneously extracting word structure characteristics and word semantic characteristics of all words in the target text, such as extracting word structure characteristics and word semantic characteristics of words 1 to X; and then, determining text characteristics according to the text structure characteristics and the text semantic characteristics, and determining word characteristics of each word according to the word structure characteristics and the word semantic characteristics of each word. And finally, determining keywords of the target text from the words according to the text characteristics and the word characteristics.
According to the keyword extraction method, the text features are determined through the text structure features and the text semantic features, the word features are determined through the word structure features and the word semantic features, the semantics and the structure level information of the text or the words can be determined more accurately, the text features and the word features are more accurate, the keywords can be determined based on the text features and the word features, and the keyword determination efficiency is improved. The problems of uneven distribution of high and low frequency words in a semantic space and the like are avoided.
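The four-step flow of Fig. 1 can be sketched as a pipeline of placeholder callables; every extractor below is hypothetical, since the embodiments only define the concrete feature extraction later.

```python
from typing import Callable, List

def extract_keywords(
    text: str,
    tokenize: Callable[[str], List[str]],
    text_feats: Callable[[str], List[float]],
    word_feats: Callable[[str], List[float]],
    score: Callable[[List[float], List[float]], float],
    top_k: int = 3,
) -> List[str]:
    """Four-step flow: acquire text -> extract features -> combine -> select."""
    tf = text_feats(text)                # text-level features
    words = tokenize(text)               # words of the target text
    scored = {w: score(tf, word_feats(w)) for w in set(words)}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Toy stand-ins: "features" are just lengths, and longer words score higher.
keywords = extract_keywords(
    "keyword extraction from short text",
    tokenize=str.split,
    text_feats=lambda t: [float(len(t))],
    word_feats=lambda w: [float(len(w))],
    score=lambda tf, wf: wf[0] / tf[0],
)
print(keywords)  # → ['extraction', 'keyword', 'short']
```

Swapping in real structural and semantic feature extractors and a real scoring function would recover the flow the embodiments describe.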
Fig. 2 shows a flowchart of a keyword extraction method according to an embodiment of the present application, which specifically includes the following steps:
step 202: and acquiring a target text.
The focus of the embodiments of the present application is keyword extraction. The extraction process is essentially the same for texts of different domains or types, such as medical texts, astronomy texts, long texts, and short texts; the process is described in detail below.
Specifically, a text is a representation of written language, usually a sentence or a combination of sentences with complete, systematic meaning; a sentence, a paragraph, or a chapter can each constitute a text. The target text is the text from which keywords are to be extracted. Preferably, the target text is a short text.
In practical applications, the target text can be obtained in various ways. For example, an operator may send the execution subject an instruction to extract keywords, or an instruction to obtain the target text, whereupon the execution subject begins acquiring the target text. Alternatively, the target text may be acquired automatically at preset intervals: for example, after a preset interval elapses, a server with the keyword extraction function automatically acquires the target text, or a terminal with the keyword extraction function does so. This specification places no restriction on how the target text is obtained.
Step 204: and extracting the text structure characteristic and the text semantic characteristic of the target text, and the word structure characteristic and the word semantic characteristic of each word in the target text.
On the basis of obtaining the target text, further extracting the text structure characteristics and the text semantic characteristics of the target text, and the word structure characteristics and the word semantic characteristics of each word in the target text.
Specifically, structural features refer to the arrangement or position features of text units, that is, surface-level information about the text units; semantic features refer to features corresponding to the linguistic meaning of the text units. Text structure features are the structural features of the text; text semantic features are the semantic features of the text; word structure features are the structural features of a word; and word semantic features are the semantic features of a word.
In a possible implementation manner of the embodiment of the present specification, after the target text is obtained, the text structure features of the target text may be extracted first by using a preset structure feature extraction tool, and then the word structure features of each word in the target text are extracted respectively; text semantic features of the target text are extracted firstly through a preset semantic feature extraction tool, and then word semantic features of words in the target text are extracted respectively. Therefore, the accuracy of determining and extracting the text structure characteristics, the text semantic characteristics, the word structure characteristics and the word semantic characteristics can be improved.
In another possible implementation of the embodiments of this specification, after the target text is obtained, a preset structural-and-semantic feature extraction tool may first extract the text structure features and text semantic features of the target text, and then extract the word structure features and word semantic features of each word in the target text. This improves the speed of extracting the text structure features, text semantic features, word structure features, and word semantic features.
It should be noted that, in order to facilitate processing of each word in the target text, before extracting the text structure feature and the text semantic feature of the target text, and the word structure feature and the word semantic feature of each word in the target text, each word in the target text needs to be determined, or each word in the target text needs to be acquired.
In one possible implementation of the embodiments of this specification, before extracting the text structure features and text semantic features of the target text and the word structure features and word semantic features of each word, the pre-stored words of the target text may be retrieved directly. In this way, the words in the target text can be obtained quickly.
In another possible implementation manner of the embodiment of the present specification, before extracting the text structure feature and the text semantic feature of the target text, and the word structure feature and the word semantic feature of each word in the target text, word segmentation processing may be performed on the target text to obtain a plurality of words. Therefore, the target text is analyzed to obtain each word in the target text, the accuracy and comprehensiveness of each obtained word can be ensured, and the accuracy and efficiency of extracting the keywords are further improved.
Since the target text contains many function words, which are rarely keywords of the target text, the function words can be removed after segmentation in order to reduce the amount of data to be processed. That is, the target text is segmented to obtain a plurality of words, and the specific implementation process may be as follows:
performing word segmentation processing on a target text to obtain a plurality of candidate words, and determining the part of speech of each candidate word;
and screening a plurality of words from the candidate words according to the part of speech.
Specifically, the string-matching word segmentation method applied to the target text may be forward maximum matching, reverse maximum matching, shortest-path segmentation, or bidirectional maximum matching, which is not limited by the present application. The candidate words are all the words obtained by segmenting the target text. Part of speech is the category a word is assigned to according to its characteristics, determined by the word's meaning, form, and grammatical function in its language.
In practical application, the target text is segmented, all the resulting candidate words are tagged with parts of speech, and the part of speech of each candidate word is thereby determined. The candidate words that are function words are then deleted according to their parts of speech, yielding the plurality of words.
Illustratively, the target text is "white chalk is waved on a blackboard, and sometimes white powder falls off". After segmentation, the candidate words include "white", "chalk", "on", "blackboard", "dancing", "from time to time", "white powder", and "falling off", together with several particles and other grammatical tokens. Part-of-speech tagging of these candidates gives "white" (adjective), "chalk" (noun), "on" (preposition), "blackboard" (noun), "dancing" (verb), "from time to time" (adverb), "white powder" (noun), and "falling off" (verb), with the remaining tokens tagged as particles or other function words. The function words are removed according to part of speech, and the screened words are "white", "chalk", "blackboard", "dancing", "white powder", and "falling off".
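The part-of-speech screening in the example above can be sketched as follows; the tagged list is a hypothetical English rendering of a segmenter's output, and the set of parts of speech kept as candidates is an illustrative choice.

```python
# Hypothetical POS-tagged output; a real system would obtain this from a
# Chinese word segmenter with part-of-speech tagging.
tagged = [
    ("white", "adjective"), ("chalk", "noun"), ("on", "preposition"),
    ("blackboard", "noun"), ("dancing", "verb"),
    ("from time to time", "adverb"), ("white powder", "noun"),
    ("falling off", "verb"),
]

CONTENT_POS = {"noun", "verb", "adjective"}  # parts of speech kept as candidates

def screen_by_pos(tagged_words, keep=CONTENT_POS):
    """Drop function words (prepositions, particles, and, in this toy tag
    set, adverbs) according to part of speech."""
    return [w for w, pos in tagged_words if pos in keep]

print(screen_by_pos(tagged))
```

Only the content words survive the screening and go forward as keyword candidates.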
In addition, stop words in the plurality of candidate words or words can be filtered out. Stop words refer to certain characters or words that are automatically filtered before or after natural language text is processed during information retrieval, such as "I" and "you", in order to save storage space and improve search efficiency. These stop words are typically produced by manual input rather than generated automatically, and together they form a stop word list, such as a stop word library. It should be noted that stop word filtering may be performed either before or after the null-word filtering of the plurality of candidate words. Thus, the data processing amount can be further reduced.
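As a concrete illustration of the filtering described above, the following minimal Python sketch first drops null words by part-of-speech tag and then filters the result against a stop-word list. The tag names and the stop-word list are illustrative assumptions, not part of the method itself; a real system would use a trained tagger and a curated stop-word library.

```python
# Hypothetical filtering step: the tag set and stop-word list are assumptions.
NULL_WORD_TAGS = {"auxiliary", "preposition", "locative"}  # assumed "null word" tags
STOP_WORDS = {"i", "you"}                                  # assumed stop-word list

def filter_candidates(tagged_candidates):
    """Drop null words by part of speech, then drop stop words."""
    words = [w for w, tag in tagged_candidates if tag not in NULL_WORD_TAGS]
    return [w for w in words if w.lower() not in STOP_WORDS]

tagged = [("white", "adjective"), ("chalk", "noun"), ("on", "preposition"),
          ("blackboard", "noun"), ("dance", "verb"), ("white powder", "noun"),
          ("drop", "verb")]
print(filter_candidates(tagged))
# → ['white', 'chalk', 'blackboard', 'dance', 'white powder', 'drop']
```

Performing the stop-word pass before or after the null-word pass yields the same final word list here; only the amount of intermediate data differs.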
Step 206: and determining text characteristics according to the text structural characteristics and the text semantic characteristics, and determining word characteristics according to the word structural characteristics and the word semantic characteristics.
On the basis of extracting the text structure characteristics and the text semantic characteristics of the target text and the word structure characteristics and the word semantic characteristics of each word in the target text, further determining the text characteristics according to the text structure characteristics and the text semantic characteristics, and determining the word characteristics according to the word structure characteristics and the word semantic characteristics.
Specifically, the text feature refers to a comprehensive feature of the text, that is, a combination of features determined from multiple aspects, such as a combination of a text structure feature and a text semantic feature; word features refer to the composite features of a word, i.e., a composite of features determined from multiple aspects, such as word structural features and word semantic features.
In a possible implementation manner of the embodiment of the present specification, after extracting the text structure feature and the text semantic feature of the target text, and the word structure feature and the word semantic feature of each word in the target text, the text structure feature and the text semantic feature of the target text may be spliced to obtain the text feature of the target text; and aiming at any word, splicing the word structure characteristics and the word semantic characteristics of the word to obtain the word characteristics of the word. In this manner, the speed of determining text features and word features may be increased.
In another possible implementation manner of the embodiment of the present specification, after extracting the text structure feature and the text semantic feature of the target text, and the word structure feature and the word semantic feature of each word in the target text, the text structure feature and the text semantic feature of the target text may be input into a preset feature calculation formula for calculation, so as to obtain the text feature of the target text; and aiming at any word, inputting the word structure characteristics and the word semantic characteristics of the word into a preset characteristic calculation formula for calculation to obtain the word characteristics of the word. Therefore, the characteristic calculation formula is set according to the requirement, so that the text characteristics and the word characteristics can reflect the characteristics of the text and the words, the accuracy of the text characteristics and the word characteristics can be improved, and the accuracy of extracting the keywords is improved.
Illustratively, the preset feature calculation formula is a weighted average formula, see Formula 1, where a is the preset weight of the structural feature and b is the preset weight of the semantic feature. The text structural feature of the target text is multiplied by a and added to the product of the text semantic feature and b to obtain the text feature of the target text; for any word, the word structural feature of the word is multiplied by a and added to the product of the word semantic feature and b to obtain the word feature of the word.
feature = a × structural feature + b × semantic feature (Formula 1)
Therefore, in the case where a and b are both 0.5, the text feature of the target text is the average of the text structural feature and the text semantic feature, and the word feature of each word is the average of the word structural feature and the word semantic feature.
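The weighted combination of Formula 1 can be sketched in a few lines of Python; the 4-dimensional feature vectors and the default weights below are illustrative assumptions.

```python
import numpy as np

def combine(structural, semantic, a=0.5, b=0.5):
    """Formula 1: feature = a * structural feature + b * semantic feature."""
    return a * np.asarray(structural, dtype=float) + b * np.asarray(semantic, dtype=float)

text_structural = np.array([1.0, 0.0, 2.0, 4.0])
text_semantic = np.array([3.0, 2.0, 0.0, 0.0])
text_feature = combine(text_structural, text_semantic)
print(text_feature)  # with a = b = 0.5 this is the element-wise average: [2. 1. 1. 2.]
```

Setting a and b to values other than 0.5 lets the structural or semantic side dominate, which is the point of making the formula configurable.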
Step 208: and determining keywords from the words according to the text characteristics and the word characteristics of the words.
Specifically, on the basis of determining text features according to the text structure features and the text semantic features and determining word features according to the word structure features and the word semantic features, keywords are further determined according to the text features and the word features of the words.
Specifically, keywords are used to express the subject content of a text; they are used not only in scientific papers, but also in texts such as scientific reports and academic papers.
In a possible implementation manner of the embodiment of the present specification, the text feature of the target text and the word features of the words may be grouped according to a preset classification rule, or clustered by using a preset clustering algorithm, and the words whose word features fall into the same group or the same class as the text feature are determined as the keywords of the target text. Thus, the efficiency of determining the keywords can be improved.
Illustratively, the text feature of the target text is feature 1, the target text has 5 words, the word feature of the first word is feature 2, the word feature of the second word is feature 3, the word feature of the third word is feature 4, the word feature of the fourth word is feature 5, and the word feature of the fifth word is feature 6. Features 1 to 6 are clustered using a clustering algorithm to obtain two classes, where the first class includes features 3 and 5, and the second class includes features 1, 2, 4 and 6; the first word, the third word and the fifth word, corresponding to features 2, 4 and 6 respectively, are then determined as the keywords of the target text.
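The clustering-based selection above can be sketched as follows. The tiny k-means implementation and the 2-D feature vectors are illustrative assumptions; any preset clustering algorithm could be substituted.

```python
import numpy as np

def kmeans(X, k=2, iters=10):
    # Deterministic farthest-point initialization, then standard k-means updates.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def keywords_by_clustering(text_feature, word_features, words, k=2):
    # Words whose features land in the same cluster as the text feature are keywords.
    X = np.vstack([text_feature] + list(word_features))
    labels = kmeans(X, k=k)
    return [w for w, lab in zip(words, labels[1:]) if lab == labels[0]]

text_feature = np.array([0.0, 0.0])
word_features = [np.array(v) for v in
                 [[0.2, 0.1], [9.0, 10.0], [0.1, 0.3], [10.0, 9.0], [0.3, 0.2]]]
words = ["word1", "word2", "word3", "word4", "word5"]
print(keywords_by_clustering(text_feature, word_features, words))
# → ['word1', 'word3', 'word5']
```

As in the worked example, the words whose features cluster with the text feature (here the first, third and fifth) are returned as keywords.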
In a possible implementation manner of the embodiment of the present specification, the keywords may also be determined from each word according to the similarity between the text feature and each word feature. That is, determining keywords from each word according to the text features and the word features of each word, and the specific implementation process may be as follows:
respectively determining first similarity of word features and text features of each word;
and determining the keywords of the target text from the plurality of words according to the first similarity.
Specifically, similarity is used to describe the similarity between two things; the first similarity refers to the similarity between the word feature and the text feature.
In practical applications, after the text feature of the target text and the word feature of each word are determined, the first similarity between the word feature of each word and the text feature may be calculated according to a preset similarity algorithm, where the preset similarity algorithm may be any one of the Euclidean distance algorithm, the Manhattan distance algorithm, the Minkowski distance algorithm, the cosine similarity algorithm, and the like. After the first similarity between the word feature of each word and the text feature is determined, the first similarities may be sorted from large to small, and the words corresponding to the top N first similarities are determined as the keywords of the target text, where N is a preset positive integer; alternatively, a similarity threshold may be set, and the words whose first similarity is greater than the similarity threshold are determined as the keywords of the target text. In this way, the keywords are determined according to the first similarity between the word features and the text feature, that is, based on the degree of association between each word and the target text, so that the accuracy of keyword extraction can be improved.
Illustratively, the text feature of the target text is feature 1, the target text has 5 words, the word feature of the first word is feature 2, the word feature of the second word is feature 3, the word feature of the third word is feature 4, the word feature of the fourth word is feature 5, and the word feature of the fifth word is feature 6. By using the cosine similarity algorithm, the first similarity s1 between feature 1 and feature 2 is 0.2, the first similarity s2 between feature 1 and feature 3 is 0.7, the first similarity s3 between feature 1 and feature 4 is 0.9, the first similarity s4 between feature 1 and feature 5 is 0.1, and the first similarity s5 between feature 1 and feature 6 is 0.5. Assuming that the similarity threshold is 0.4, the first similarity s2, the first similarity s3, and the first similarity s5 are all greater than the similarity threshold, and the second word, the third word, and the fifth word are determined as keywords of the target text.
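The threshold-based selection above can be sketched with cosine similarity as the preset similarity algorithm; the 2-D vectors and the 0.4 threshold are illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keywords_by_similarity(text_feature, word_features, words, threshold=0.4):
    # First similarity of each word feature to the text feature; keep words above threshold.
    sims = [cosine_similarity(text_feature, f) for f in word_features]
    return [w for w, s in zip(words, sims) if s > threshold]

text_feature = np.array([1.0, 0.0])
word_features = [np.array([1.0, 0.0]),   # cosine similarity 1.0
                 np.array([0.0, 1.0]),   # cosine similarity 0.0
                 np.array([1.0, 1.0])]   # cosine similarity ≈ 0.707
print(keywords_by_similarity(text_feature, word_features, ["w1", "w2", "w3"]))
# → ['w1', 'w3']
```

The top-N variant differs only in sorting the similarities and taking the first N words instead of comparing against a threshold.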
In one or more optional embodiments of the present description, before the text structure features and the text semantic features, and the word structure features and the word semantic features are extracted and determined, a pre-trained feature extraction model may be further obtained, then the target text and each word in the target text are input into the feature extraction model, and the feature extraction model performs feature extraction on the target text and each word in the target text to obtain the text features and the word features. Before extracting the text structure features and text semantic features of the target text, and the word structure features and word semantic features of each word in the target text, the method further comprises the following steps:
acquiring a pre-trained feature extraction model, wherein the feature extraction model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
correspondingly, the text structure feature and the text semantic feature of the target text, and the word structure feature and the word semantic feature of each word in the target text are extracted, and the specific implementation process can be as follows:
respectively inputting the target text and each word in the target text into the structural feature extraction submodel to obtain the text structural feature of the target text and the word structural feature of each word;
respectively inputting the target text and each word to a semantic feature extraction submodel to obtain text semantic features of the target text and word semantic features of each word;
correspondingly, determining text characteristics according to the text structure characteristics and the text semantic characteristics, and determining word characteristics according to the word structure characteristics and the word semantic characteristics, wherein the determining comprises the following steps:
aiming at a target text, inputting the text structure characteristics and the text semantic characteristics to an output layer for processing, and outputting the text characteristics of the target text;
and aiming at any word, inputting the word structure characteristics and the word semantic characteristics of the word into an output layer for processing, and outputting the word characteristics of the word.
Specifically, the feature extraction model refers to a pre-trained neural network model, such as a probabilistic neural network model, a BERT model, a Transformer model, or the like; the structural feature extraction submodel refers to the part of the feature extraction model that performs structural feature extraction on texts or words, and can better express the surface-level information of a text or word; the semantic feature extraction submodel refers to the part of the feature extraction model that performs semantic feature extraction on texts or words, and can better express the semantic-level information of a text or word; the output layer refers to the part of the feature extraction model that processes the semantic features and structural features and outputs the result.
In practical application, after a target text and each word in the target text are obtained, a pre-trained feature extraction model comprising a structural feature extraction submodel, a semantic feature extraction submodel and an output layer is obtained. And then inputting the target text and each word in the target text into a structural feature extraction submodel, extracting the structural features of the target text and each word by the structural feature extraction submodel, outputting the text structural features of the target text and the word structural features of each word, inputting the target text and each word into a semantic feature extraction submodel, extracting the semantic features by the semantic feature extraction submodel, and outputting the text semantic features of the target text and the word semantic features of each word. And finally, inputting the text semantic features and the text structure features of the target text, and the word semantic features and the word structure features of all words into an output layer, analyzing and processing the text semantic features and the text structure features of the target text by the output layer to obtain and output the text features of the target text, and analyzing and processing the word semantic features and the word structure features of all words to obtain and output the word features of all words. The target text and each word are subjected to feature extraction through the pre-trained feature extraction model, so that the acquisition rate and accuracy of the text features and the word features can be improved.
Referring to fig. 3, fig. 3 shows a schematic structural diagram of a feature extraction model in a keyword extraction method provided in an embodiment of the present application: the feature extraction model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer, wherein the structural feature extraction submodel is used for receiving the target text and each word and extracting the structural features of the target text and each word to obtain the text structural features and each word structural features; the semantic feature extraction submodel is used for receiving the target text and each word and extracting the semantic features of the target text and each word to obtain text semantic features and each word semantic feature; the output layer is used for receiving and processing the text structure characteristics and the text semantic characteristics to obtain and output the text characteristics, and receiving and processing the word structure characteristics and the word semantic characteristics to obtain and output the word characteristics.
In a possible implementation manner of the embodiment of the present specification, the structural feature extraction sub-model may be a first coding layer, and then the target text and each word in the target text may be respectively input to the first coding layer, so as to obtain the text structural feature of the target text and the word structural feature of each word. That is, under the condition that the structural feature extraction submodel includes the first coding layer, the target text and each word in the target text are respectively input into the structural feature extraction submodel to obtain the text structural feature of the target text and the word structural feature of each word, and the specific implementation process may be as follows:
inputting the target text into a first coding layer for feature extraction to obtain text structure features of the target text;
and aiming at any word, inputting the word into the first coding layer for feature extraction to obtain the word structure feature of the word.
In practical application, the structural feature extraction submodel comprises a first coding layer, a target text can be input into the first coding layer, the first coding layer is used for extracting structural features of the target text, and text structural features of the target text are output; and then inputting each word into the first coding layer, extracting the structural characteristics of each word by the first coding layer, and outputting the word structural characteristics of each word. Therefore, as the structure information at the text level gradually disappears along with the increase of the coding layers, namely, the single coding layer can well acquire the structure information of the text or the words, one coding layer is used as a structure feature extraction submodel to extract the structure features of the target text and the words, so that the obtained text structure features and word structure features can be more accurate.
In a possible implementation manner of the embodiment of the present specification, the semantic feature extraction sub-model may include a plurality of second coding layers, and then the target text and each word in the target text may be respectively input to the plurality of second coding layers connected in series to be processed, so as to obtain a text semantic feature of the target text and a word semantic feature of each word. That is, under the condition that the semantic feature extraction submodel includes N second coding layers and N is a positive integer greater than 2, the target text and each word are respectively input into the semantic feature extraction submodel to obtain the text semantic features of the target text and the word semantic features of each word, and the specific implementation process can be as follows:
for a target text, sequentially taking the output of a previous second coding layer as the input of a current second coding layer from a 1 st second coding layer for feature extraction until an Nth second coding layer, and outputting text semantic features of the target text, wherein the input of the 1 st second coding layer is the target text;
and for any word, sequentially taking the output of the previous second coding layer as the input of the current second coding layer for feature extraction from the 1 st second coding layer until the Nth second coding layer, and outputting the word meaning feature of the word, wherein the input of the 1 st second coding layer is the word.
In practical application, the semantic feature extraction submodel comprises a plurality of second coding layers, the target text can be input into the 1 st second coding layer to obtain the first output of the target text, then the first output of the target text is input into the 2 nd second coding layer to obtain the second output of the target text, and the rest is repeated until the (N-1) th output of the target text is input into the Nth second coding layer to obtain the text semantic features of the target text. Similarly, for any word, inputting the word to the 1 st second coding layer to obtain the first output of the word, inputting the first output of the word to the 2 nd second coding layer to obtain the second output of the word, and so on until the (N-1) th output of the word is input to the Nth second coding layer to obtain the word sense characteristic of the word. Therefore, as the number of the coding layers is increased, the semantic information can be extracted more accurately, and the semantic features of the target text and the words can be extracted by taking the plurality of coding layers as semantic feature extraction submodels, so that the obtained text semantic features and word semantic features can be more accurate.
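The serial pass through the N second coding layers can be sketched as below. Each "coding layer" here is a toy linear-plus-tanh stand-in for whatever encoder layer the real model uses, and the dimensions are illustrative assumptions.

```python
import numpy as np

class CodingLayer:
    """Toy stand-in for a second coding layer (linear map + tanh)."""
    def __init__(self, dim, seed):
        self.W = np.random.default_rng(seed).normal(scale=0.5, size=(dim, dim))
    def __call__(self, x):
        return np.tanh(self.W @ x)

def extract_semantic_feature(x, layers):
    # The output of layer i is the input of layer i+1;
    # the Nth layer's output is the semantic feature.
    h = x
    for layer in layers:
        h = layer(h)
    return h

layers = [CodingLayer(4, seed=i) for i in range(3)]  # N = 3 second coding layers in series
text_embedding = np.ones(4)
semantic_feature = extract_semantic_feature(text_embedding, layers)
print(semantic_feature.shape)  # (4,)
```

The same function applies to the target text and to each word; only the input embedding changes.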
Referring to fig. 4, fig. 4 shows a schematic structural diagram of another feature extraction model in a keyword extraction method provided in an embodiment of the present application: the feature extraction model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer, and the structural feature extraction submodel and the semantic feature extraction submodel are connected in parallel. The structural feature extraction submodel comprises a first coding layer, which is used for receiving the target text and each word and extracting their structural features to obtain the text structural feature and each word structural feature; the semantic feature extraction submodel comprises a plurality of second coding layers connected in series, which are used for receiving the target text and each word and extracting their semantic features to obtain the text semantic feature and each word semantic feature; the output layer is used for receiving and processing the text structural feature and the text semantic feature to obtain and output the text feature, and for receiving and processing the word structural features and the word semantic features to obtain and output the word features. Because the structural feature extraction submodel and the semantic feature extraction submodel are connected in parallel, structural feature extraction and semantic feature extraction can be performed synchronously, which improves the processing efficiency of the model.
In a possible implementation manner of the embodiment of the present specification, the feature extraction model includes a plurality of third coding layers, where the 1st third coding layer constitutes the structural feature extraction submodel; the target text and each word in the target text may then be respectively input to the 1st third coding layer to obtain the text structural feature of the target text and the word structural feature of each word. That is, under the condition that the feature extraction model includes M third coding layers, where M is a positive integer greater than 2, and the structural feature extraction submodel includes the 1st third coding layer, the target text and each word in the target text are respectively input to the structural feature extraction submodel to obtain the text structural feature of the target text and the word structural feature of each word, and the specific implementation process may be as follows:
inputting the target text into a 1 st third coding layer for feature extraction aiming at the target text to obtain a first text coding feature of the target text, and determining the first text coding feature as a text structure feature of the target text;
and (3) inputting the word into the 1 st third coding layer for feature extraction aiming at any word to obtain the first word coding feature of the word, and determining the first word coding feature as the word structure feature of the word.
In practical application, the feature extraction model comprises a plurality of third coding layers, wherein the 1 st third coding layer is a structural feature extraction submodel, the target text can be input into the 1 st third coding layer, the 1 st third coding layer performs structural feature extraction on the target text, and first text coding features, namely text structural features of the target text, are output; and then inputting each word into the 1 st third coding layer, extracting the structural characteristics of each word by the 1 st third coding layer, and outputting the first word coding characteristics of each word, namely the word structural characteristics of each word. Therefore, as the structural information at the text level gradually disappears along with the increase of the coding layers, namely, the single coding layer can well acquire the structural information of the text or the words, the 1 st third coding layer is used as a structural feature extraction submodel to extract the structural features of the target text and the words, and the acquired text structural features and word structural features can be more accurate.
In a possible implementation manner of the embodiment of the present specification, the feature extraction model includes a plurality of third coding layers, where the 1 st third coding layer constitutes a structural feature extraction sub-model, and the 1 st third coding layer to the last third coding layer constitute a semantic feature extraction sub-model. At this time, on the basis that the target text is input to the 1 st third coding layer for feature extraction to obtain the first text coding feature of the target text, and the word is input to the 1 st third coding layer for feature extraction to obtain the first word coding feature of the word, the first text coding feature and the first word coding feature may be respectively input to the remaining third coding layers connected in series for processing to obtain the text semantic feature of the target text and the word semantic feature of each word. Namely, on the basis that the semantic feature extraction submodel comprises the 1 st to the Mth third coding layers, the target text and each word are respectively input into the semantic feature extraction submodel to obtain the text semantic features of the target text and the word semantic features of each word, and the specific implementation process can be as follows:
inputting the first text coding features into the 2 nd third coding layer for feature extraction aiming at the target text to obtain second text coding features of the target text, inputting the (M-1) th text coding features into the Mth third coding layer for feature extraction to obtain Mth text coding features of the target text, and determining the Mth text coding features as text semantic features of the target text;
and aiming at any word, inputting the first word coding feature of the word to the 2 nd third coding layer for feature extraction to obtain the second word coding feature of the word, inputting the (M-1) th word coding feature to the Mth third coding layer for feature extraction to obtain the Mth word coding feature of the word, and determining the Mth word coding feature as the word meaning feature of the word.
In practical application, the feature extraction model comprises M third coding layers, wherein the structural feature extraction submodel comprises the 1st third coding layer, and the semantic feature extraction submodel comprises the 1st third coding layer to the Mth third coding layer. When the structural feature extraction is carried out, the target text is input to the 1st third coding layer for feature extraction to obtain the first text coding feature of the target text, and each word is input to the 1st third coding layer for feature extraction to obtain the first word coding feature of each word. At this time, the first text coding feature and each first word coding feature can be respectively input to the 2nd third coding layer for processing to obtain the second text coding feature of the target text and the second word coding feature of each word; the second text coding features and the second word coding features are respectively input to the 3rd third coding layer for processing, and so on, until the (M-1)th text coding feature and the (M-1)th word coding features are respectively input to the Mth third coding layer for processing to obtain the Mth text coding feature of the target text and the Mth word coding feature of each word, namely the text semantic feature of the target text and the word semantic feature of each word. Since semantic information can be extracted more accurately as the number of coding layers increases, using a plurality of coding layers as the semantic feature extraction submodel makes the obtained text semantic features and word semantic features more accurate.
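The shared-stack variant just described can be sketched as a single forward pass in which the 1st layer's output is kept as the structural feature and the Mth layer's output as the semantic feature. The linear-plus-tanh layers and the dimensions are toy assumptions, not the model's actual layers.

```python
import numpy as np

def encode(x, layer_weights):
    # One pass through the M third coding layers yields both features.
    outputs = []
    h = x
    for W in layer_weights:
        h = np.tanh(W @ h)
        outputs.append(h)
    structural_feature = outputs[0]   # output of the 1st third coding layer
    semantic_feature = outputs[-1]    # output of the Mth third coding layer
    return structural_feature, semantic_feature

rng = np.random.default_rng(0)
layer_weights = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(3)]  # M = 3
x = np.ones(4)
structural_feature, semantic_feature = encode(x, layer_weights)
```

Compared with the parallel architecture of fig. 4, the structural feature here comes for free from the first layer of the semantic stack, which is why fewer coding layers are needed overall.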
Referring to fig. 5, in a keyword extraction method provided in an embodiment of the present application, a structural schematic diagram of another feature extraction model is shown: the feature extraction model comprises a plurality of third coding layers and an output layer, wherein the 1 st third coding layer forms a structural feature extraction submodel, and the 1 st third coding layer to the last third coding layer form a semantic feature extraction submodel. The structural feature extraction submodel is used for receiving the target text and all words and phrases and extracting structural features of the target text and all words and phrases to obtain text structural features and all word structural features; the semantic feature extraction submodel is used for receiving the target text and each word and extracting the semantic features of the target text and each word to obtain text semantic features and each word semantic feature; the output layer is used for receiving and processing the text structure characteristics and the text semantic characteristics to obtain and output the text characteristics, and receiving and processing the word structure characteristics and the word semantic characteristics to obtain and output the word characteristics. The structural feature extraction submodel and the semantic feature extraction submodel have intersection, so that the accuracy and the efficiency of feature extraction can be ensured while the coding layers contained in the feature extraction models are reduced.
Before the feature extraction model trained in advance is obtained, the language representation model needs to be trained so as to obtain the feature extraction model with the feature extraction function. That is, before the pre-trained feature extraction model is obtained, the method further includes:
acquiring a sample text set and a preset language representation model, wherein the language representation model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
extracting at least two sample texts from the sample text set, and respectively inputting each sample text into the structural feature extraction submodel to obtain the predicted structural feature of each sample text;
respectively inputting the predicted structural features of each sample text into a semantic feature extraction submodel to obtain the predicted semantic features of each sample text;
respectively inputting the predicted structural features and the predicted semantic features of each sample text into an output layer for processing, and outputting the predicted text features of each sample text;
calculating a loss value according to the predicted text characteristics of each sample text;
and adjusting model parameters of a structural feature extraction submodel and a semantic feature extraction submodel in the language representation model according to the loss value, continuously executing the step of extracting at least two sample texts from the sample text set, and determining the trained language representation model as the feature extraction model under the condition of reaching a preset training stop condition.
Specifically, the language characterization model refers to a pre-specified pre-trained neural network model, such as RoBERTa model; the sample text refers to a sample for a language representation model and can be sentences, words, articles and the like; the sample text set refers to a set of a plurality of sample texts; the predicted structural features refer to structural features of sample texts extracted by a structural feature extraction sub-model; the predicted semantic features refer to semantic features of sample texts extracted by a semantic feature extraction sub-model; the predicted text feature refers to a feature combining a predicted structural feature and a predicted semantic feature; the training stopping condition may be that the loss value is less than or equal to a preset threshold, or that the number of iterative training times reaches a preset iterative value, or that the loss value converges, i.e., the loss value does not decrease with the continued training.
In practical applications, there are various ways to obtain the sample text set and the preset language characterization model. For example, an operator may send a training instruction for the language characterization model, or an obtaining instruction for the sample text set and the preset language characterization model, to an execution subject; accordingly, the execution subject starts to obtain the sample text set and the preset language characterization model after receiving the instruction. Alternatively, the sample text set and the preset language characterization model may be acquired automatically at preset intervals: for example, after a preset time length, a server with a model training function automatically acquires the sample text set and the preset language characterization model in a specified access area; or, after the preset time length, a terminal with the model training function automatically acquires the locally stored sample text set and the preset language representation model. The present specification does not set any limit on the manner of obtaining the sample text set and the preset language characterization model.
After the sample text set and the preset language representation model are obtained, a plurality of sample texts are extracted from the sample text set and input into the structural feature extraction submodel, which performs structural feature extraction on each sample text and outputs the predicted structural features of each sample text; the predicted structural features of each sample text are then input into the semantic feature extraction submodel, which performs semantic feature extraction and outputs the predicted semantic features of each sample text. The predicted semantic features and the predicted structural features of each sample text are input into the output layer, which analyzes and processes them to obtain and output the predicted text features of each sample text. Next, a loss value is determined according to the predicted text features of each sample text and a preset loss function. If the preset training stop condition is not reached, the model parameters of the language representation model, namely the model parameters of the structural feature extraction submodel and the semantic feature extraction submodel, are adjusted according to the loss value, a plurality of sample texts are extracted from the sample text set again, and the next round of training is performed; if the preset training stop condition is reached, the trained language representation model is determined as the feature extraction model. Training the language representation model with a plurality of sample texts in this way can improve the accuracy and speed of feature extraction by the feature extraction model and improve its robustness.
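The training loop described above can be sketched as follows. The submodels, loss function, and sample texts here are toy stand-ins (the real model would be a pre-trained network such as RoBERTa), so every name and computation below is an illustrative assumption, not the patent's actual implementation:

```python
import random

# Toy stand-ins for the three parts of the language representation model.
def structural_submodel(text):
    # "structural features": here, just token lengths of the first few tokens
    return [float(len(tok)) for tok in text.split()[:3]]

def semantic_submodel(structural_feat):
    # "semantic features": a toy transform of the structural features
    return [v * 0.5 for v in structural_feat]

def output_layer(structural_feat, semantic_feat):
    # combine structural and semantic features into predicted text features
    return [s + m for s, m in zip(structural_feat, semantic_feat)]

def loss_fn(predicted_feats):
    # placeholder loss: mean feature value across the batch
    vals = [v for feats in predicted_feats for v in feats]
    return sum(vals) / len(vals)

sample_set = ["keyword extraction from text", "short text keyword mining",
              "semantic feature learning", "structural feature encoding"]

max_rounds, losses = 3, []
for _ in range(max_rounds):                      # preset training stop condition
    batch = random.sample(sample_set, 2)         # extract at least two sample texts
    structural = [structural_submodel(t) for t in batch]
    semantic = [semantic_submodel(f) for f in structural]
    predicted = [output_layer(s, m) for s, m in zip(structural, semantic)]
    losses.append(loss_fn(predicted))            # parameter update omitted in this toy
```

In a real implementation, the loss would drive a gradient update of the two submodels' parameters each round instead of simply being recorded.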
It should be noted that the training stop condition may be that all sample texts in the sample text set are traversed K times, where K is a preset numerical value. That is, each sample text of the sample text set has K times to be used for training the feature extraction model.
In addition, in order to extract keywords from a target text in a specific field, sample texts of that specific field may be obtained and used to train the language representation model, so as to obtain a feature extraction model dedicated to that field. For example, the language representation model may be trained with sample texts in the medical field to obtain a feature extraction model dedicated to the medical field, or trained with sample texts in the geographic field to obtain a feature extraction model dedicated to the geographic field.
In a possible implementation manner of the embodiment of the present specification, in order to further improve the robustness of the feature extraction model, the loss value may be determined using a contrastive learning method. That is, the loss value is calculated according to the predicted text features of each sample text, and the specific implementation process can be as follows:
inputting the predicted text features of any sample text into a preset random inactivation layer twice for processing aiming at the predicted text features of the sample text to obtain a first sample text feature and a second sample text feature of the sample text;
calculating a second similarity between the sample text features, wherein the sample text features comprise a first sample text feature and a second sample text feature;
and calculating the loss value according to the second similarity.
Specifically, the random inactivation (dropout) layer refers to a processing layer that randomly masks part of the hidden representation; the first sample text feature refers to the output obtained by inputting a given predicted text feature into the random inactivation layer for the first time; the second sample text feature refers to the output obtained by inputting that predicted text feature into the random inactivation layer for the second time; the second similarity refers to a similarity between sample text features, where the sample text features include the first sample text features and the second sample text features.
In practical application, each predicted text feature can be input into the random inactivation layer twice to obtain the first sample text feature and the second sample text feature corresponding to that predicted text feature; then, the second similarity between every two of all the first sample text features and second sample text features is calculated; and the second similarities are input into a preset loss function to obtain the loss value.
Illustratively, suppose there are two sample texts. The predicted text features of the first sample text are input into the random inactivation layer for the first time to obtain sample text feature m1, and input into the random inactivation layer for the second time to obtain sample text feature m2; the predicted text features of the second sample text are input into the random inactivation layer for the first time to obtain sample text feature m3, and for the second time to obtain sample text feature m4. Then, the second similarities among sample text features m1, m2, m3 and m4 are calculated, and these second similarities are input into the loss function shown in formula 2 to obtain the loss value.
L_i = -log( e^{P_i} / Σ_{j=1}^{N} e^{P_{i,j}} )    (formula 2)

In formula 2, L_i represents the loss value corresponding to sample text feature i; P_i represents the similarity between sample text feature i and its positive sample (the other sample text feature of the sample text corresponding to sample text feature i); P_{i,j} represents the similarity between sample text feature i and sample text feature j, where any sample text feature of a sample text other than the one corresponding to sample text feature i serves as a negative sample j; N represents the number of sample text features.
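A minimal sketch of such a contrastive loss, assuming cosine similarity as the similarity measure and treating the two dropout views of the same text as a positive pair and views of other texts as negatives; the vectors m1–m4 below are made-up stand-ins for the sample text features:

```python
import math

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def contrastive_loss(first_views, second_views):
    # For feature i, the positive is the other dropout view of the same text;
    # the negatives are the second views of every other text.
    n = len(first_views)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(first_views[i], second_views[i]))
        denom = sum(math.exp(cosine(first_views[i], second_views[j]))
                    for j in range(n))        # positive term plus negatives
        total += -math.log(pos / denom)
    return total / n

m1, m2 = [1.0, 0.1], [0.9, 0.2]   # two dropout views of the first text's features
m3, m4 = [0.1, 1.0], [0.2, 0.8]   # two dropout views of the second text's features
loss = contrastive_loss([m1, m3], [m2, m4])
```

A production version (e.g. SimCSE) additionally divides each similarity by a temperature hyperparameter before the exponential; that detail is omitted here for brevity.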
It should be noted that, when training the language representation model, the SimCSE unsupervised algorithm may be used. In this way, a small number of unlabeled sample texts in a specific field can be used to perform unsupervised training on the language representation model (such as a pre-trained BERT) to obtain a model suitable for that field, avoiding the uneven distribution caused by the anisotropy of BERT sentence vectors.
In a possible implementation manner of the embodiment of the present specification, in order to ensure that all sample texts in the sample text set can be used for training the feature extraction model, the sample texts in the sample text set may be grouped, and the sample text groups may be extracted according to a certain order. That is, before extracting at least two sample texts from the sample text set, the method further includes:
grouping sample texts in a sample text set to obtain at least one sample text group, wherein the sample text group comprises at least two sample texts;
accordingly, extracting at least two sample texts from the sample text set comprises:
and extracting a sample text group from the sample text set according to a preset extraction sequence.
Specifically, grouping refers to dividing a plurality of sample texts into several groups; the sample text group is a small group which is divided and contains a plurality of sample texts; the preset extraction sequence refers to a preset sequence for extracting the sample text group, such as sequential extraction and reverse extraction.
In practical application, after the sample text set is obtained, the sample texts in the sample text set are divided into a plurality of sample text groups each containing at least two sample texts, and then one sample text group is extracted from the sample text set each time based on the preset extraction sequence, that is, at least two sample texts are extracted from the sample text set. In this way, all sample texts in the sample text set can be used for training the feature extraction model, which improves its robustness; in addition, since only one sample text group is taken each time, problems such as extracting too many sample texts at once, overloading the language representation model with data, and causing it to crash can be effectively avoided.
Illustratively, the sample text set contains 100 sample texts, and the 100 sample texts are divided into 20 sample text groups according to an even distribution principle, with each sample text group containing 5 sample texts. A sample text group is then extracted each time in order from the first sample text group to the twentieth sample text group, namely sequential extraction.
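The grouping in this example can be sketched with a simple chunking helper; the function name and sample texts are illustrative:

```python
def group_samples(sample_texts, group_size):
    # Divide the sample text set into consecutive groups of group_size texts;
    # any leftover texts would form a shorter final group.
    return [sample_texts[i:i + group_size]
            for i in range(0, len(sample_texts), group_size)]

sample_set = [f"sample text {i}" for i in range(100)]
groups = group_samples(sample_set, 5)   # 20 groups of 5, as in the example above
```

Training then iterates over `groups` in the preset extraction sequence (forward for sequential extraction, reversed for reverse extraction).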
According to the keyword extraction method, a target text is obtained; the text structural features and text semantic features of the target text, and the word structural features and word semantic features of each word in the target text, are extracted; text features are determined according to the text structural features and the text semantic features, and word features are determined according to the word structural features and the word semantic features; and keywords are determined from the words according to the text features and the word features of each word. Determining text features from both structural and semantic text features, and word features from both structural and semantic word features, captures the semantic-level and structure-level information of the text and words more accurately, making the text features and word features more accurate; keywords can then be determined based on these features, improving the efficiency of keyword determination and avoiding problems such as the uneven distribution of high- and low-frequency words in the semantic space. In addition, compared with traditional word frequency statistics and graph algorithms, the keyword extraction method of the present application is better suited to keyword extraction from short texts.
The following will further describe the keyword extraction method with reference to fig. 6 by taking the application of the keyword extraction method provided by the present application to short texts as an example. Fig. 6 shows a processing flow chart of a keyword extraction method applied to a short text according to an embodiment of the present application, which specifically includes the following steps:
step 602: and acquiring a sample short text set and a preset language characterization model, wherein the language characterization model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer.
Step 604: and extracting at least two sample short texts from the sample short text set, and respectively inputting each sample short text into the structural feature extraction submodel to obtain the predicted structural feature of each sample short text.
In one or more alternative embodiments of the present specification, before extracting at least two sample short texts from the sample short text set, the method further includes:
grouping the sample short texts in the sample short text set to obtain at least one sample short text group, wherein the sample short text group comprises at least two sample short texts;
extracting at least two sample short texts from the sample short text set, comprising:
and extracting a sample short text group from the sample short text set according to a preset extraction sequence.
Step 606: and respectively inputting the predicted structural features of the short texts of the samples into the semantic feature extraction submodel to obtain the predicted semantic features of the short texts of the samples.
Step 608: and respectively inputting the predicted structural features and the predicted semantic features of the short texts of the samples into an output layer for processing, and outputting the predicted text features of the short texts of the samples.
Step 610: and aiming at the predicted text features of any sample short text, inputting the predicted text features of the sample short text to a preset random inactivation layer twice for processing to obtain a first sample text feature and a second sample text feature of the sample short text.
Step 612: a second similarity between sample text features is calculated, the sample text features including a first sample text feature and a second sample text feature.
Step 614: and calculating the loss value according to the second similarity.
Step 616: and adjusting model parameters of a structural feature extraction submodel and a semantic feature extraction submodel in the language characterization model according to the loss value, continuously executing the step of extracting at least two sample short texts from the sample short text set, and determining the trained language characterization model as the feature extraction model under the condition of reaching a preset training stop condition.
Step 618: and acquiring the target short text.
Step 620: and performing word segmentation processing on the target short text to obtain a plurality of candidate words, and determining the part of speech of each candidate word.
Step 622: and screening a plurality of words from the candidate words according to the part of speech.
Step 624: and respectively inputting the target short text and each word to the structural feature extraction submodel to obtain the text structural feature of the target short text and the word structural feature of each word.
In one or more alternative embodiments of the present description, the structural feature extraction submodel includes a first encoding layer;
correspondingly, the target short text and each word in the target short text are respectively input into the structural feature extraction submodel, and the text structural feature of the target short text and the word structural feature of each word are obtained, wherein the method comprises the following steps:
inputting the target short text into a first coding layer for feature extraction to obtain text structure features of the target short text;
and aiming at any word, inputting the word into the first coding layer for feature extraction to obtain the word structure feature of the word.
In one or more alternative embodiments of the present specification, the feature extraction model includes M third coding layers, where M is a positive integer greater than 2, and the structural feature extraction submodel includes a 1 st third coding layer;
correspondingly, the target short text and each word in the target short text are respectively input into the structural feature extraction submodel, and the text structural feature of the target short text and the word structural feature of each word are obtained, wherein the method comprises the following steps:
inputting the target short text into a 1 st third coding layer for feature extraction aiming at the target short text to obtain a first text coding feature of the target short text, and determining the first text coding feature as a text structure feature of the target short text;
and (3) inputting the word into the 1 st third coding layer for feature extraction aiming at any word to obtain the first word coding feature of the word, and determining the first word coding feature as the word structure feature of the word.
Step 626: and respectively inputting the target short text and each word into the semantic feature extraction submodel to obtain the text semantic features of the target short text and the word semantic features of each word.
In one or more alternative embodiments of the present description, the semantic feature extraction submodel includes N second coding layers, where N is a positive integer greater than 2;
correspondingly, the target short text and each word are respectively input into the semantic feature extraction submodel, and the text semantic features of the target short text and the word semantic features of each word are obtained, wherein the method comprises the following steps:
for the target short text, sequentially taking the output of the previous second coding layer as the input of the current second coding layer for feature extraction from the 1 st second coding layer until the Nth second coding layer, and outputting the text semantic features of the target short text, wherein the input of the 1 st second coding layer is the target short text;
and for any word, sequentially taking the output of the previous second coding layer as the input of the current second coding layer for feature extraction from the 1 st second coding layer until the Nth second coding layer, and outputting the word meaning feature of the word, wherein the input of the 1 st second coding layer is the word.
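The chained second coding layers above can be sketched as follows, with toy affine maps standing in for real transformer encoder layers; all values are illustrative:

```python
def make_second_layer(k):
    # Toy "second coding layer" k: a fixed affine map on a feature vector.
    def layer(vec):
        return [0.9 * v + 0.01 * k for v in vec]
    return layer

N = 4                                            # number of second coding layers
second_layers = [make_second_layer(k) for k in range(1, N + 1)]

def semantic_feature(inputs):
    feats = inputs                               # input of the 1st second coding layer
    for layer in second_layers:                  # output of layer k-1 feeds layer k
        feats = layer(feats)
    return feats                                 # output of the Nth layer

text_semantic = semantic_feature([1.0, 2.0])     # e.g. an embedded target short text
```

The same chain is applied per word to obtain word semantic features; only the input differs.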
In one or more alternative embodiments of the present specification, the semantic feature extraction submodel includes 1 st to mth third coding layers;
correspondingly, the target short text and each word are respectively input into the semantic feature extraction submodel, and the text semantic features of the target short text and the word semantic features of each word are obtained, wherein the method comprises the following steps:
aiming at the target short text, inputting the first text coding features into the 2 nd third coding layer for feature extraction to obtain second text coding features of the target short text, inputting the (M-1) th text coding features into the Mth third coding layer for feature extraction to obtain the Mth text coding features of the target short text, and determining the Mth text coding features as text semantic features of the target short text;
and aiming at any word, inputting the first word coding feature of the word to the 2 nd third coding layer for feature extraction to obtain the second word coding feature of the word, inputting the (M-1) th word coding feature to the Mth third coding layer for feature extraction to obtain the Mth word coding feature of the word, and determining the Mth word coding feature as the word meaning feature of the word.
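A sketch of the shared stack of M third coding layers described above: one forward pass yields both the structural feature (output of the 1st layer) and the semantic feature (output of the Mth layer). The toy layers are illustrative stand-ins for real encoder layers:

```python
M = 3
# Toy third coding layers; k=k pins each layer's index in its closure.
third_layers = [lambda vec, k=k: [v + k for v in vec] for k in range(1, M + 1)]

def extract_both(token_vec):
    first = third_layers[0](token_vec)      # 1st third coding layer -> structural feature
    feats = first
    for layer in third_layers[1:]:          # layers 2..M, chained -> semantic feature
        feats = layer(feats)
    return first, feats

structural, semantic = extract_both([0.0])  # toy one-dimensional input
```

Reusing one stack this way means the structural and semantic submodels share parameters: the structural feature is simply an intermediate activation of the same network.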
Step 628: and inputting the text structure characteristics and the text semantic characteristics to an output layer for processing aiming at the target short text, and outputting the text characteristics of the target short text.
Step 630: and aiming at any word, inputting the word structure characteristics and the word semantic characteristics of the word into an output layer for processing, and outputting the word characteristics of the word.
Step 632: and respectively determining first similarity of the word characteristics and the text characteristics of each word.
Step 634: and determining the keywords of the target short text from the plurality of words according to the first similarity.
According to the keyword extraction method, the text features are determined through the text structure features and the text semantic features, the word features are determined through the word structure features and the word semantic features, the semantics and the structure level information of the text or the words can be determined more accurately, the text features and the word features are more accurate, the keywords can be determined based on the text features and the word features, and the keyword determination efficiency is improved. The problems of uneven distribution of high and low frequency words in a semantic space and the like are avoided.
Corresponding to the above method embodiment, the present application further provides an embodiment of a keyword extraction apparatus, and fig. 7 shows a schematic structural diagram of the keyword extraction apparatus provided in an embodiment of the present application. As shown in fig. 7, the apparatus includes:
a first obtaining module 702 configured to obtain a target text;
an extracting module 704, configured to extract a text structure feature and a text semantic feature of the target text, and a word structure feature and a word semantic feature of each word in the target text;
a first determining module 706 configured to determine text features according to the text structure features and the text semantic features, and determine word features according to the word structure features and the word semantic features;
a second determining module 708 configured to determine keywords from the words based on the text features and the word features of the words.
In one or more alternative embodiments of the present description, the second determining module 708 is further configured to:
respectively determining first similarity of word features and text features of each word;
and determining the keywords of the target text from the plurality of words according to the first similarity.
In one or more alternative embodiments of the present description, the apparatus further includes a second acquisition module configured to:
acquiring a pre-trained feature extraction model, wherein the feature extraction model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
accordingly, the extraction module 704 is further configured to:
respectively inputting the target text and each word in the target text into the structural feature extraction submodel to obtain the text structural feature of the target text and the word structural feature of each word;
respectively inputting the target text and each word to a semantic feature extraction submodel to obtain text semantic features of the target text and word semantic features of each word;
accordingly, the first determination module 706 is further configured to:
aiming at a target text, inputting the text structure characteristics and the text semantic characteristics to an output layer for processing, and outputting the text characteristics of the target text;
and aiming at any word, inputting the word structure characteristics and the word semantic characteristics of the word into an output layer for processing, and outputting the word characteristics of the word.
In one or more alternative embodiments of the present description, the structural feature extraction submodel includes a first encoding layer;
accordingly, the extraction module 704 is further configured to:
inputting the target text into a first coding layer for feature extraction to obtain text structure features of the target text;
and aiming at any word, inputting the word into the first coding layer for feature extraction to obtain the word structure feature of the word.
In one or more alternative embodiments of the present description, the semantic feature extraction submodel includes N second coding layers, where N is a positive integer greater than 2;
accordingly, the extraction module 704 is further configured to:
for a target text, sequentially taking the output of a previous second coding layer as the input of a current second coding layer from a 1 st second coding layer for feature extraction until an Nth second coding layer, and outputting text semantic features of the target text, wherein the input of the 1 st second coding layer is the target text;
and for any word, sequentially taking the output of the previous second coding layer as the input of the current second coding layer for feature extraction from the 1 st second coding layer until the Nth second coding layer, and outputting the word meaning feature of the word, wherein the input of the 1 st second coding layer is the word.
In one or more alternative embodiments of the present specification, the feature extraction model includes M third coding layers, where M is a positive integer greater than 2, and the structural feature extraction submodel includes a 1 st third coding layer;
accordingly, the extraction module 704 is further configured to:
inputting the target text into a 1 st third coding layer for feature extraction aiming at the target text to obtain a first text coding feature of the target text, and determining the first text coding feature as a text structure feature of the target text;
and (3) inputting the word into the 1 st third coding layer for feature extraction aiming at any word to obtain the first word coding feature of the word, and determining the first word coding feature as the word structure feature of the word.
In one or more alternative embodiments of the present description, the semantic feature extraction submodel includes 1 st to mth third coding layers;
accordingly, the extraction module 704 is further configured to:
inputting the first text coding features into the 2 nd third coding layer for feature extraction aiming at the target text to obtain second text coding features of the target text, inputting the (M-1) th text coding features into the Mth third coding layer for feature extraction to obtain Mth text coding features of the target text, and determining the Mth text coding features as text semantic features of the target text;
and aiming at any word, inputting the first word coding feature of the word into the 2 nd third coding layer for feature extraction to obtain the second word coding feature of the word until the (M-1) th word coding feature is input into the Mth third coding layer for feature extraction to obtain the Mth word coding feature of the word, and determining the Mth word coding feature as the word meaning feature of the word.
In one or more alternative embodiments of the present description, the apparatus further includes a training module configured to:
acquiring a sample text set and a preset language characterization model, wherein the language characterization model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
extracting at least two sample texts from the sample text set, and respectively inputting each sample text into the structural feature extraction submodel to obtain the predicted structural feature of each sample text;
respectively inputting the predicted structural features of each sample text into a semantic feature extraction submodel to obtain the predicted semantic features of each sample text;
respectively inputting the predicted structural features and the predicted semantic features of each sample text into an output layer for processing, and outputting the predicted text features of each sample text;
calculating a loss value according to the predicted text characteristics of each sample text;
and adjusting model parameters of a structural feature extraction submodel and a semantic feature extraction submodel in the language representation model according to the loss value, continuously executing the step of extracting at least two sample texts from the sample text set, and determining the trained language representation model as the feature extraction model under the condition of reaching a preset training stop condition.
In one or more alternative embodiments of the present description, the training module is further configured to:
inputting the predicted text features of any sample text into a preset random inactivation layer twice for processing aiming at the predicted text features of the sample text to obtain a first sample text feature and a second sample text feature of the sample text;
calculating a second similarity between the sample text features, wherein the sample text features comprise a first sample text feature and a second sample text feature;
and calculating the loss value according to the second similarity.
In one or more alternative embodiments of the present description, the training module is further configured to:
grouping sample texts in a sample text set to obtain at least one sample text group, wherein the sample text group comprises at least two sample texts;
and extracting a sample text group from the sample text set according to a preset extraction sequence.
In one or more alternative embodiments of the present description, the apparatus further includes a word segmentation module configured to:
and performing word segmentation processing on the target text to obtain a plurality of words.
In one or more alternative embodiments of the present specification, the word segmentation module is further configured to:
performing word segmentation processing on a target text to obtain a plurality of candidate words, and determining the part of speech of each candidate word;
and screening a plurality of words from the candidate words according to the part of speech.
According to the keyword extraction device, the text features are determined through the text structure features and the text semantic features, the word features are determined through the word structure features and the word semantic features, the semantics and the structure level information of the text or words can be determined more accurately, the text features and the word features are more accurate, the keywords can be determined based on the text features and the word features, and the keyword determination efficiency is improved. The problems of uneven distribution of high and low frequency words in a semantic space and the like are avoided.
The foregoing is a schematic solution of the keyword extraction apparatus of this embodiment. It should be noted that the technical solution of the keyword extraction apparatus and the technical solution of the keyword extraction method belong to the same concept; for details not described in the technical solution of the keyword extraction apparatus, reference can be made to the description of the keyword extraction method. Further, the components in the device embodiment should be understood as functional modules established to implement the steps of the program flow or the steps of the method; each functional module need not correspond to an actual physical division or separate definition. Device claims defined by such a set of functional modules should be understood as a framework of functional modules implementing the solution mainly by means of the computer program described in the specification, rather than as a physical device implementing the solution mainly through hardware.
Fig. 8 illustrates a block diagram of a computing device 800 provided according to an embodiment of the present application. The components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and the database 850 is used to store data.
Computing device 800 also includes an access device 840, which enables the computing device 800 to communicate via one or more networks 860. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more network interfaces of any type (e.g., a network interface controller), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of the computing device 800 and other components not shown in fig. 8 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute computer-executable instructions for the keyword extraction method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the keyword extraction method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the keyword extraction method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the keyword extraction method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the keyword extraction method belong to the same concept, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the keyword extraction method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
An embodiment of the present application further provides a chip in which a computer program is stored; the steps of the keyword extraction method are implemented when the computer program is executed by the chip.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts, but those skilled in the art will appreciate that the present application is not limited by the described order of acts, since some steps may be performed in other orders or simultaneously. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. The optional embodiments are not described exhaustively, and the application is not limited to the specific implementations described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, thereby enabling others skilled in the art to understand and make use of the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A keyword extraction method is characterized by comprising the following steps:
acquiring a target text;
extracting text structure features and text semantic features of the target text, and word structure features and word semantic features of words in the target text;
determining text features according to the text structural features and the text semantic features, and determining word features according to the word structural features and the word semantic features;
and determining keywords from the words according to the text characteristics and the word characteristics of the words.
2. The method of claim 1, wherein determining keywords from the words based on the textual features and word features of the words comprises:
respectively determining first similarity of word features of all words and the text features;
and determining keywords of the target text from the plurality of words according to the first similarity.
3. The method according to claim 1, wherein before extracting the text structure feature and the text semantic feature of the target text and the word structure feature and the word semantic feature of each word in the target text, further comprising:
acquiring a pre-trained feature extraction model, wherein the feature extraction model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
correspondingly, the extracting text structure features and text semantic features of the target text, and word structure features and word semantic features of words in the target text includes:
respectively inputting the target text and each word in the target text into the structural feature extraction submodel to obtain the text structural feature of the target text and the word structural feature of each word;
respectively inputting the target text and each word to the semantic feature extraction submodel to obtain text semantic features of the target text and word semantic features of each word;
correspondingly, the determining the text feature according to the text structure feature and the text semantic feature, and determining the word feature according to the word structure feature and the word semantic feature include:
aiming at the target text, inputting the text structure characteristic and the text semantic characteristic into the output layer for processing, and outputting the text characteristic of the target text;
and aiming at any word, inputting the word structure characteristics and the word semantic characteristics of the word into the output layer for processing, and outputting the word characteristics of the word.
4. The method of claim 3, wherein the structural feature extraction submodel comprises a first coding layer;
correspondingly, the step of inputting the target text and each word in the target text to the structural feature extraction submodel respectively to obtain the text structural feature of the target text and the word structural feature of each word includes:
inputting the target text into the first coding layer for feature extraction to obtain text structure features of the target text;
and aiming at any word, inputting the word into the first coding layer for feature extraction to obtain the word structure feature of the word.
5. The method according to claim 3 or 4, wherein the semantic feature extraction submodel comprises N second coding layers, N being a positive integer greater than 2;
correspondingly, the step of respectively inputting the target text and each word into the semantic feature extraction submodel to obtain the text semantic features of the target text and the word semantic features of each word comprises the following steps:
for the target text, feature extraction is performed by taking the output of the previous second coding layer as the input of the current second coding layer in sequence from the 1 st second coding layer until the Nth second coding layer, and the text semantic features of the target text are output, wherein the input of the 1 st second coding layer is the target text;
and for any word, sequentially taking the output of the previous second coding layer as the input of the current second coding layer for feature extraction, from the 1st second coding layer to the Nth second coding layer, and outputting the word semantic features of the word, wherein the input of the 1st second coding layer is the word.
6. The method of claim 3, wherein the feature extraction model comprises M third coding layers, M being a positive integer greater than 2, and the structural feature extraction submodel comprises the 1 st third coding layer;
correspondingly, the step of inputting the target text and each word in the target text to the structural feature extraction submodel respectively to obtain the text structural feature of the target text and the word structural feature of each word includes:
inputting the target text into the 1 st third coding layer for feature extraction to obtain a first text coding feature of the target text, and determining the first text coding feature as a text structure feature of the target text;
and aiming at any word, inputting the word into the 1 st third coding layer for feature extraction to obtain a first word coding feature of the word, and determining the first word coding feature as a word structure feature of the word.
7. The method of claim 6, wherein the semantic feature extraction submodel comprises 1 st to Mth third coding layers;
correspondingly, the step of respectively inputting the target text and each word into the semantic feature extraction submodel to obtain the text semantic features of the target text and the word semantic features of each word comprises:
inputting the first text coding features into the 2nd third coding layer for feature extraction to obtain second text coding features of the target text, and so on, until the (M-1)th text coding features are input into the Mth third coding layer for feature extraction to obtain Mth text coding features of the target text, and determining the Mth text coding features as the text semantic features of the target text;
and for any word, inputting the first word coding feature of the word into the 2nd third coding layer for feature extraction to obtain a second word coding feature of the word, and so on, until the (M-1)th word coding feature is input into the Mth third coding layer for feature extraction to obtain an Mth word coding feature of the word, and determining the Mth word coding feature as the word semantic feature of the word.
8. The method of claim 3, wherein prior to obtaining the pre-trained feature extraction model, further comprising:
acquiring a sample text set and a preset language representation model, wherein the language representation model comprises a structural feature extraction submodel, a semantic feature extraction submodel and an output layer;
extracting at least two sample texts from the sample text set, and respectively inputting each sample text into the structural feature extraction submodel to obtain a predicted structural feature of each sample text;
respectively inputting the predicted structural features of each sample text into the semantic feature extraction submodel to obtain the predicted semantic features of each sample text;
respectively inputting the predicted structural features and the predicted semantic features of each sample text into the output layer for processing, and outputting the predicted text features of each sample text;
calculating a loss value according to the predicted text characteristics of each sample text;
and adjusting model parameters of the structural feature extraction submodel and the semantic feature extraction submodel in the language representation model according to the loss value, continuously executing the step of extracting at least two sample texts from the sample text set, and determining the trained language representation model as a feature extraction model under the condition of reaching a preset training stop condition.
9. The method of claim 8, wherein calculating a loss value based on the predicted text feature of each sample text comprises:
inputting the predicted text features of any sample text into a preset random inactivation layer twice for processing aiming at the predicted text features of the sample text to obtain a first sample text feature and a second sample text feature of the sample text;
calculating a second similarity between sample text features, wherein the sample text features comprise a first sample text feature and a second sample text feature;
and calculating the loss value according to the second similarity.
10. The method of claim 8, wherein prior to extracting at least two sample texts from the sample text set, further comprising:
grouping sample texts in the sample text set to obtain at least one sample text group, wherein the sample text group comprises at least two sample texts;
accordingly, the extracting at least two sample texts from the sample text set comprises:
and extracting a sample text group from the sample text set according to a preset extraction sequence.
11. The method according to claim 1, wherein before extracting the text structure feature and the text semantic feature of the target text and the word structure feature and the word semantic feature of each word in the target text, further comprising:
and performing word segmentation processing on the target text to obtain a plurality of words.
12. The method of claim 11, wherein the tokenizing the target text to obtain a plurality of terms comprises:
performing word segmentation processing on the target text to obtain a plurality of candidate words, and determining the part of speech of each candidate word;
and screening a plurality of words from the candidate words according to the part of speech.
13. A keyword extraction apparatus, comprising:
a first obtaining module configured to obtain a target text;
the extraction module is configured to extract the text structure characteristics and the text semantic characteristics of the target text, and the word structure characteristics and the word semantic characteristics of each word in the target text;
the first determining module is configured to determine text features according to the text structural features and the text semantic features, and determine word features according to the word structural features and the word semantic features;
a second determining module configured to determine keywords from the words according to the text features and word features of the words.
14. A computing device, comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor is used for executing the computer-executable instructions to realize the steps of the keyword extraction method in any one of claims 1 to 12.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the keyword extraction method according to any one of claims 1 to 12.
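The training procedure of claims 8 and 9 — passing each predicted text feature through a random inactivation (dropout) layer twice and computing a loss from the second similarities between the resulting sample text features — resembles dropout-based contrastive learning. The claims do not specify the exact loss form; the NumPy sketch below assumes an InfoNCE-style objective in which the two dropout views of the same text form a positive pair and the other texts in the batch serve as negatives. All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_inactivation(x, p=0.1):
    """Dropout: zero each component with probability p, rescale the rest."""
    mask = rng.random(x.shape) >= p
    return np.where(mask, x / (1.0 - p), 0.0)

def contrastive_loss(predicted_features, temperature=0.05):
    """Two dropout passes per text give the first and second sample text
    features; matching pairs sit on the diagonal of the similarity matrix."""
    a = np.stack([random_inactivation(f) for f in predicted_features])
    b = np.stack([random_inactivation(f) for f in predicted_features])
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    sims = (a @ b.T) / temperature  # second similarities between features
    # Cross-entropy against the diagonal (each text's other dropout view).
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# A toy batch of predicted text features for at least two sample texts.
batch = [rng.standard_normal(16) for _ in range(4)]
loss = contrastive_loss(batch)
print(loss)  # a non-negative scalar
```

In a real training loop this loss would be backpropagated to adjust the structural and semantic feature extraction submodels, as described in claim 8; here only the loss computation is sketched.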
CN202210760249.7A 2022-06-30 2022-06-30 Keyword extraction method and device Pending CN114943236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210760249.7A CN114943236A (en) 2022-06-30 2022-06-30 Keyword extraction method and device

Publications (1)

Publication Number Publication Date
CN114943236A true CN114943236A (en) 2022-08-26

Family

ID=82911537

Country Status (1)

Country Link
CN (1) CN114943236A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal
CN112528653A (en) * 2020-12-02 2021-03-19 支付宝(杭州)信息技术有限公司 Short text entity identification method and system
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN114201953A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Keyword extraction and model training method, device, equipment and storage medium
CN114282528A (en) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576710A (en) * 2024-01-15 2024-02-20 西湖大学 Method and device for generating natural language text based on graph for big data analysis
CN117576710B (en) * 2024-01-15 2024-05-28 西湖大学 Method and device for generating natural language text based on graph for big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination