CN115757680A - Keyword extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN115757680A
Authority: CN (China)
Prior art keywords: word, text, words, extracted, candidate
Legal status: Pending
Application number: CN202211435542.2A
Other languages: Chinese (zh)
Inventors: 徐剑军, 高丽, 李奇
Current Assignee: Beijing Caizhi Technology Co., Ltd.
Original Assignee: Beijing Caizhi Technology Co., Ltd.
Application filed by Beijing Caizhi Technology Co., Ltd.
Priority to: CN202211435542.2A
Publication of: CN115757680A

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure relates to a keyword extraction method and apparatus, an electronic device, and a storage medium, in the technical field of text processing. The method comprises the following steps: performing word segmentation on a text to be extracted based on a constructed word list, dividing the text into a plurality of words; determining the statistical characteristics of the words, calculating a weighted value of each word's statistical characteristics, and screening the words according to the weighted values to obtain candidate words; and inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model and extracting keywords from the text according to the output result, wherein the deep semantic matching model predicts the semantic similarity between the text and the candidate words. By jointly considering statistical and semantic characteristics when extracting keywords from the text to be extracted, the method and apparatus improve both extraction precision and efficiency.

Description

Keyword extraction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of text processing technologies, and in particular, to a keyword extraction method, a keyword extraction apparatus, an electronic device, and a storage medium.
Background
Keyword extraction, a basic natural language processing task, aims to extract from a text a group of words related to its subject so as to express its core content. By extracting the keywords, a user can quickly grasp the gist of a text.
The related art typically extracts keywords from text using either statistics-based methods or methods based on pre-trained language models. Statistics-based keyword extraction is fast, but it ignores the semantics of titles and abstracts and may extract a large number of noise words. Keyword extraction based on a pre-trained language model exploits semantic features, but performs poorly in specific fields and has low computational efficiency.
In order to solve the above problems, the present disclosure provides a keyword extraction method, a keyword extraction apparatus, an electronic device, and a storage medium.
Disclosure of Invention
The disclosure provides a keyword extraction method, a keyword extraction apparatus, an electronic device, and a storage medium, so as to at least solve the problems in the related art that statistics-based keyword extraction algorithms extract a large number of noise words, and that keyword extraction algorithms based on a pre-trained language model perform poorly in specific fields and have low computational efficiency. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, a keyword extraction method is provided, including: performing word segmentation processing on a text to be extracted based on the constructed word list, and dividing the text to be extracted into a plurality of words; determining the statistical characteristics of each word, calculating the weighted value of the statistical characteristics of each word, and screening the words according to the weighted value to obtain candidate words; inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model, and extracting keywords from the text to be extracted according to an output result, wherein the deep semantic matching model is used for predicting semantic similarity between the text to be extracted and the candidate words.
Optionally, the word list may be constructed by the following method: traversing a target corpus, and performing N-Gram word segmentation processing on the corpora in the target corpus to obtain a plurality of word segmentation candidate words; determining the statistical characteristics of the word segmentation candidate words, and calculating the word forming probability of each word segmentation candidate word according to its statistical characteristics, wherein the statistical characteristics of the word segmentation candidate words comprise word frequency, point mutual information, left and right entropy, inverse document frequency, and average time span; and constructing the word list according to the word forming probabilities.
Optionally, the statistical characteristics of the words include word length, word position, word frequency, and inverse document frequency; the determining the statistical characteristics of the words and calculating the weighted values of the statistical characteristics of the words includes: counting the word length of each word; calculating the position weight score of the word position of each word according to a preset position formula; calculating the word frequency of each word in the text to be extracted, determining the inverse document frequency of each word in the target corpus, and calculating the word frequency-inverse document frequency weight score of each word according to the word frequency and the inverse document frequency; and calculating a weighted value according to the word length, the position weight score and the word frequency-inverse document frequency weight score.
Optionally, the screening the words according to the weighted value to obtain candidate words includes: and filtering out the words with the weighted values lower than a preset threshold value, and taking the words with the weighted values higher than or equal to the preset threshold value as candidate words.
Optionally, the inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model, and extracting the keywords from the text to be extracted according to the output result includes: inputting the candidate words and the text to be extracted into a deep semantic matching model, and outputting semantic similarity of the candidate words and the text to be extracted; and extracting candidate words with semantic similarity higher than a preset threshold value from the text to be extracted as keywords.
Optionally, the inputting the candidate words and the text to be extracted into the deep semantic matching model, and outputting semantic similarity between the candidate words and the text to be extracted includes: coding the candidate words and the text to be extracted by utilizing the deep semantic matching model to obtain corresponding vector representation; and calculating the vector distance between the candidate word and the text to be extracted through cosine similarity as semantic similarity.
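The cosine-similarity step described above can be sketched minimally as follows. The 4-dimensional vectors are hypothetical stand-ins for the encodings produced by the deep semantic matching model; a real model would output much higher-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors,
    used here as the semantic similarity score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical encodings of a candidate word and of the text to be extracted
word_vec = [0.1, 0.8, 0.3, 0.4]
text_vec = [0.2, 0.7, 0.2, 0.5]
print(cosine_similarity(word_vec, text_vec))  # high similarity (close to 1)
```

Candidate words whose similarity exceeds the preset threshold would then be kept as keywords.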
Optionally, the deep semantic matching model may be obtained by training as follows: acquiring a plurality of corpora from a target corpus as training samples; for each training sample, acquiring a labeled text corresponding to the training sample from the target corpus as a positive sample, and acquiring a plurality of other texts as negative samples; and training the deep semantic matching model based on the training samples, the positive samples, and the negative samples.
According to a second aspect of the embodiments of the present disclosure, there is provided a keyword extraction apparatus including: the word segmentation processing module is configured to perform word segmentation processing on the text to be extracted based on the constructed word list, and divide the text to be extracted into a plurality of words; the preliminary extraction module is configured to determine statistical characteristics of all words, calculate weighted values of the statistical characteristics of the words, and screen the words according to the weighted values to obtain candidate words; and the final extraction module is configured to input the text to be extracted and the candidate words into a pre-trained deep semantic matching model, extract the keywords from the text to be extracted according to the output result, and the deep semantic matching model is used for predicting semantic similarity between the text to be extracted and the candidate words.
According to a third aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement any one of the keyword extraction methods described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions that, when executed by a processor of an electronic device, enable the electronic device to perform any one of the keyword extraction methods described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, which when executed by a processor implements the keyword extraction method of any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
in the keyword extraction method provided by the embodiments of the disclosure, a text to be extracted is subjected to word segmentation processing based on a constructed word list and divided into a plurality of words; the statistical characteristics of each word are determined, weighted values of those characteristics are calculated, and the words are screened according to the weighted values to obtain candidate words; the text to be extracted and the candidate words are then input into a pre-trained deep semantic matching model, and keywords are extracted from the text according to the output result, the model predicting the semantic similarity between the text and the candidate words. On one hand, after word segmentation, the method performs a preliminary screening of the words based on their statistical characteristics and then screens again with the pre-trained deep semantic matching model, so that statistical and semantic characteristics are jointly considered during keyword extraction, which improves extraction precision. On the other hand, because the semantic similarity between the text to be extracted and the candidate words is calculated using the constructed word list and a pre-trained deep semantic matching model, the extraction effect for keywords in a specific field can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of a system architecture of an exemplary application environment illustrating a keyword extraction method and apparatus in accordance with an exemplary embodiment;
FIG. 2 is a schematic block diagram of a computer system of an electronic device according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a keyword extraction method in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating the structure of a DSSM model for a keyword extraction method according to an exemplary embodiment;
fig. 5 is a block diagram illustrating a keyword extraction apparatus according to an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a keyword extraction method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation. For example, server 105 may be a server cluster comprised of multiple servers, and the like.
The keyword extraction method provided by the embodiments of the present disclosure may be performed by the terminal devices 101, 102, and 103, in which case the keyword extraction apparatus may be disposed in the terminal devices 101, 102, and 103. The method may also be performed by the server 105, in which case the apparatus may be disposed in the server 105. The method may further be performed by the terminal devices 101, 102, 103 and the server 105 together, in which case the apparatus may be disposed in both the terminal devices 101, 102, 103 and the server 105.
For example, the keyword extraction method provided by the embodiments of the present disclosure may be executed by a server. After receiving the text to be extracted, the server performs word segmentation processing on the text to be extracted based on the constructed word list, and divides the text to be extracted into a plurality of words; determining the statistical characteristics of each word, calculating the weighted value of the statistical characteristics of the words, and screening the words according to the weighted value to obtain candidate words; inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model, and extracting keywords from the text to be extracted according to an output result, wherein the deep semantic matching model is used for predicting semantic similarity between the text to be extracted and the candidate words.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 210 as necessary, so that a computer program read out therefrom is mounted into the storage section 208 as necessary.
Fig. 3 is a flowchart illustrating a keyword extraction method according to an exemplary embodiment, as shown in fig. 3, the method including the steps of:
in step S310, word segmentation is performed on the text to be extracted based on the constructed word list, and the text to be extracted is divided into a plurality of words.
The text to be extracted is the text needing keyword extraction, and the user can know the main idea of the text to be extracted more efficiently and quickly by extracting the keywords in the text to be extracted. For example, the text to be extracted may be a title and an abstract of an academic paper, a medical record file, or a text of another subject, which is not particularly limited in the embodiment of the present disclosure.
The word list is used for performing word segmentation processing on a text to be extracted, and the construction process can be realized as follows: traversing a target corpus, and performing N-Gram word segmentation processing on the corpus in the target corpus to obtain a plurality of word segmentation candidate words; determining the statistical characteristics of each word segmentation candidate word, and calculating the word forming probability of each corresponding word segmentation candidate word according to the statistical characteristics of the word segmentation candidate words; the statistical characteristics of the word segmentation candidate words comprise word frequency, point mutual information, left-right entropy, inverse document frequency and average time span of the word segmentation candidate words; and constructing a word list according to the word forming probability.
The target language database is used for providing the language material for constructing the word list, and the selection of the target language material library is determined by the field to which the word list is applied. For example, when the keyword extraction method provided by the embodiment of the disclosure is used for extracting keywords of titles and abstracts of papers, a target corpus for constructing a word list may select a paper corpus; when the keyword extraction method provided by the embodiment of the disclosure is used for extracting the keywords of the medical record, the target corpus used for constructing the word list can select the medical database. The keyword extraction method provided by the embodiment of the disclosure selects the target corpus according to the field to be applied, and improves the extraction effect in the specific field.
Traverse the target corpus and perform N-Gram word segmentation on its corpora to obtain a plurality of word segmentation candidates. Illustratively, the corpora may be segmented by 1-Gram, 2-Gram, 3-Gram, and 4-Gram in turn. Specifically, taking the word 深度学习 ("deep learning") as an example, 1-Gram segmentation yields the candidates 深, 度, 学, 习; 2-Gram segmentation yields 深度, 度学, 学习; 3-Gram segmentation yields 深度学, 度学习; and 4-Gram segmentation yields 深度学习. Other cases follow the same principle.
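The N-Gram segmentation described above can be sketched as follows (a minimal illustration; the function name `char_ngrams` is ours, not from the disclosure):

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of the given text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "深度学习"  # "deep learning"
for n in range(1, 5):  # 1-Gram through 4-Gram, as in the example
    print(n, char_ngrams(word, n))
```

Running all four n-gram sizes over every string in the corpus produces the pool of word segmentation candidates.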
And after the word segmentation candidate words are obtained in the N-Gram word segmentation processing process, determining the statistical characteristics of the word segmentation candidate words, and calculating the word forming probability of the word segmentation candidate words based on the statistical characteristics. The statistical characteristics of the word segmentation candidate words comprise word frequency, point mutual information, left-right entropy, inverse document frequency and average time span of the word segmentation candidate words.
The word frequency is the frequency with which a word segmentation candidate occurs in the target corpus; the more frequently a candidate occurs, the higher its word-formation probability. Taking the segmentation of 深度学习 ("deep learning") as an example, the candidates 深度 ("depth"), 学习 ("learning"), and 深度学习 occur far more frequently in the paper corpus than the other word segmentation candidates.
The point mutual information is a measure of the interdependence between two word segmentation candidates. It is calculated as:

PMI(x, y) = log( p(x, y) / ( p(x) · p(y) ) )

where x and y are word segmentation candidates, p(x, y) is the observed probability of the combination of x and y, and p(x) · p(y) is the probability that the combination would have if x and y were mutually independent, taken as the predicted probability. The logarithm of the ratio of the observed probability to the predicted probability is the point mutual information of x and y; the higher its value, the more likely x and y combine into a word.
For example, taking a medical scenario, the target corpus may be a medical database. Suppose word segmentation of the database's corpora yields the two candidates "allergic" and "rhinitis", occurring in the medical database with frequencies 0.000125 and 0.0001871 respectively. If "allergic" and "rhinitis" were unrelated, the probability of their combination "allergic rhinitis" occurring would be 0.000125 × 0.0001871 ≈ 2.339 × 10⁻⁸. Suppose the observed probability of the combined word "allergic rhinitis" in the medical database is 7.263 × 10⁻⁶ — far higher than the predicted 2.339 × 10⁻⁸. The logarithm of the ratio of the observed probability to the predicted probability is the point mutual information of the candidates "allergic" and "rhinitis"; the higher its value, the higher the word-formation probability of "allergic rhinitis".
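The point mutual information computation can be illustrated with the figures from the example above (the probabilities are the hypothetical corpus statistics given in the text):

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of a candidate pair (x, y):
    log of the observed co-occurrence probability over the probability
    expected if x and y were independent."""
    return math.log(p_xy / (p_x * p_y))

p_allergic, p_rhinitis = 0.000125, 0.0001871  # individual frequencies
p_combined = 7.263e-6                          # observed "allergic rhinitis"
print(pmi(p_combined, p_allergic, p_rhinitis))  # large positive -> likely a word
```

A PMI near zero would indicate the two candidates co-occur no more often than chance.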
The left and right entropy are the information entropies of a word's left-adjacent and right-adjacent characters. They indicate how rich the information of those neighboring characters is; a word segmentation candidate with a high word-formation probability should have rich left- and right-adjacent characters. Information entropy describes the degree of disorder, also called uncertainty, of information, and is calculated as:

H(X) = −Σ_x p(x) · log p(x)

where p(x) is the frequency of occurrence of neighboring character x.
Illustratively, take the tongue-twister 吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮 ("eat grapes without spitting out the skins; don't eat grapes yet spit out the skins"). The left-adjacent characters of 葡萄 ("grape") are {吃, 吐, 吃, 吐} and its right-adjacent characters are {不, 皮, 倒, 皮}. Its left entropy is −(1/2)·log(1/2) − (1/2)·log(1/2) ≈ 0.69, and its right entropy is −(1/2)·log(1/2) − (1/4)·log(1/4) − (1/4)·log(1/4) ≈ 1.04. The left and right entropy thus measure the richness of the characters adjacent to 葡萄.
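The left- and right-entropy computation for the tongue-twister example can be sketched as follows (natural logarithms, matching the figures above; punctuation is omitted so the neighbor sets match the text):

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a list of neighboring characters."""
    counts = Counter(neighbors)
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
word = "葡萄"
left, right = [], []
start = text.find(word)
while start != -1:
    if start > 0:
        left.append(text[start - 1])        # character just before the word
    end = start + len(word)
    if end < len(text):
        right.append(text[end])             # character just after the word
    start = text.find(word, start + 1)

print(round(neighbor_entropy(left), 2), round(neighbor_entropy(right), 2))  # 0.69 1.04
```

The richer (more varied) the neighbors, the higher the entropy and the more freely the candidate combines with surrounding text.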
The inverse document frequency represents the importance of a word segmentation candidate. The target corpus typically contains a plurality of documents, and the more documents a candidate appears in, the less important it is and the lower the probability that it is a keyword; stop words, for example, usually appear in many documents. The inverse document frequency is calculated as:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

where t is a word segmentation candidate, |D| is the total number of documents in the target corpus, and |{d ∈ D : t ∈ d}| is the number of documents that contain t. The more documents contain the candidate, the lower its idf value.
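A minimal sketch of the inverse document frequency computation. The toy four-document corpus is our own illustration, and the substring membership test is a simplification of real document indexing:

```python
import math

def idf(term, documents):
    """Inverse document frequency: log(total docs / docs containing term)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

docs = [
    "deep learning for image recognition",
    "a survey of deep learning",
    "a study of protein folding",
    "a review of graph databases",
]
print(idf("a", docs))               # appears in every doc -> idf of 0
print(idf("deep learning", docs))   # appears in 2 of 4 docs -> higher idf
```

A near-ubiquitous string like "a" scores zero, which is exactly the stop-word suppression the idf term provides.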
Generally, besides combining freely with other characters and occurring frequently, a string must be widely used across a large amount of text over a period of time before it counts as a word, so time is also an important index for judging whether a string is a word. The average time span measures this characteristic: it is the average gap between the creation time of each text containing the word segmentation candidate and the creation time of the text in which the candidate first appeared, used as a time influence factor:

T(x) = (1/n) · Σᵢ (tᵢ − tv), for i = 1 … n

where tᵢ is the creation time of the i-th text containing candidate x and tv is the creation time of the text in which x first appears. Taking paper keyword extraction as the application field, tᵢ may be the publication year of the i-th paper and tv the publication time of the paper in which x first appeared.
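The average-time-span factor can be sketched as follows (the publication years are hypothetical):

```python
def avg_time_span(times, first_time):
    """Average gap between each text's creation time and the time of the
    text in which the candidate first appeared (the time influence factor)."""
    return sum(t - first_time for t in times) / len(times)

# Hypothetical publication years of papers containing a candidate word
years = [2018, 2019, 2021, 2022]
print(avg_time_span(years, min(years)))  # 2.0
```

A candidate that keeps appearing long after its first occurrence yields a larger span, evidence that it is an established word rather than a one-off string.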
After the statistical characteristics of the word segmentation candidates are determined, the word-formation probability of each candidate — the probability that the candidate constitutes a single word — can be calculated from them. Illustratively, the word-formation probability of a candidate x increases with its word frequency freq(x), its point mutual information PMI(x), and its left and right neighbor-character entropies H_l(x) and H_r(x), and is scaled by the inverse document frequency defined above so as to reduce the influence of stop words.
After the word-formation probability of each word segmentation candidate is calculated, words can be selected from the candidates according to their word-formation probabilities to construct the word list. Dividing the text to be extracted into a plurality of words based on the constructed word list thus means dividing it, according to that list, into words with higher word-formation probability.
Illustratively, taking paper keyword extraction as the application scenario, suppose the input text to be extracted is "human behavior recognition research under a complex scene based on deep learning". After word segmentation based on the constructed word list, the resulting words may be: "based on", "deep learning", "of", "complex scene", "under", "human body", "behavior recognition", "research".
In step S320, the statistical characteristics of each word are determined, the weighted value of the statistical characteristics of the word is calculated, and the word is filtered according to the weighted value to obtain a candidate word.
After the text to be extracted is divided into words in step S310, keywords need to be screened out from the resulting words. To improve extraction precision, the embodiment of the disclosure screens the words using both their statistical and their semantic characteristics. Step S320 performs the screening based on statistical characteristics.
The statistical characteristics of a word include word length, word position, word frequency, and inverse document frequency. Word length is the number of characters in the word: for example, the length of "allergic rhinitis" is 5 and that of "machine learning" is 4 (counted in Chinese characters). Word position is the position of the word in the text to be extracted. Word frequency is how often the word appears in the text to be extracted. Inverse document frequency reflects how rarely the word appears across the documents of the target corpus.
The above determining the statistical characteristics of each word and calculating the weighted value of the statistical characteristics of each word can be implemented as follows: counting word lengths of all words; calculating the position weight score of the word position of each word according to a preset position formula; calculating the word frequency of each word in the text to be extracted, determining the inverse document frequency of each word in the target corpus, and calculating the word frequency-inverse document frequency weight score of the word according to the word frequency and the inverse document frequency; and calculating a weighted value according to the word length, the position weight score and the word frequency-inverse document frequency weight score.
The position formula is set according to the specific application field. For example, taking paper keyword extraction as the application field, where the text to be extracted is a paper title and abstract, the preset position formula may be as follows:
Position(x_i) = 2,         if x_i is located in the title
Position(x_i) = 1 + 1/i,   if x_i is located at position i in the abstract
where i is the index of the word, and the position formula computes the position weight score of word x_i. If x_i is located in the title, the position weight score is the constant 2; if x_i is located in the abstract, the score is higher the nearer the word is to the front of the abstract.
The word frequency of the words in the text to be extracted can be calculated by the following formula:
tf(t, d) = f_{t,d} / Σ_{t′∈d} f_{t′,d}
wherein f_{t,d} represents the number of times the word t appears in the text to be extracted, and Σ_{t′∈d} f_{t′,d} represents the total number of words in the text to be extracted.
The above-mentioned determination of the inverse document frequency of each term in the target corpus can be implemented by the following formula:
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
wherein t represents a word, |D| represents the total number of documents in the target corpus, and |{d ∈ D : t ∈ d}| represents the number of documents in the target corpus that contain the word t.
After the word frequency of the word in the text to be extracted and the inverse document frequency of the word relative to the target corpus are obtained, the word frequency-inverse document frequency (TF-IDF) weight score can be calculated by the following formula:
tfidf(t,d,D)=tf(t,d)·idf(t,D)
wherein tf(t, d) represents the word frequency and idf(t, D) the inverse document frequency; the more documents in the target corpus contain the word t, the lower the idf value. The tfidf value is the product of the two: the higher the word frequency and the fewer the documents in the target corpus containing t, the higher the TF-IDF weight score of the word t.
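A minimal sketch of the TF-IDF weight score from the formulas above, over toy tokenized documents (the function name and documents are illustrative):

```python
import math
from collections import Counter

def tfidf(term, doc, docs):
    """tf(t, d) * idf(t, D) per the formulas above."""
    tf = Counter(doc)[term] / len(doc)        # f_{t,d} / total words in d
    df = sum(1 for d in docs if term in d)    # documents containing t
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [["deep", "learning", "research"],
        ["behavior", "recognition", "research"],
        ["complex", "scene", "analysis"]]
print(round(tfidf("deep", docs[0], docs), 4))      # rare term -> higher score
print(round(tfidf("research", docs[0], docs), 4))  # in 2 of 3 docs -> lower
```

As the text states, a term confined to few documents ("deep") scores higher than one spread across the corpus ("research"), even at equal in-document frequency.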
After the word length of each word, the position weight score of its position, and its TF-IDF weight score are calculated according to the corresponding formulas, the weighted value of each word's statistical characteristics can be calculated as follows:
Score(x i )=Length(x i )·Position(x i )·tfidf(x i )
where i is the index of the word, Length(x_i) is the word length of x_i, Position(x_i) is the position weight score of x_i, and tfidf(x_i) is the TF-IDF weight score of x_i.
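The combined score can be sketched as follows. The abstract position score 1 + 1/i is an assumed form (higher nearer the front, as described above) since the exact formula is not preserved in the source; the title score of 2 is as stated:

```python
# Score(x_i) = Length(x_i) * Position(x_i) * tfidf(x_i), per the formula above.

def position_score(i, in_title):
    """Constant 2 in the title; decays with abstract position i (assumed form)."""
    return 2.0 if in_title else 1.0 + 1.0 / i

def statistical_score(word, i, in_title, tfidf_value):
    return len(word) * position_score(i, in_title) * tfidf_value

print(statistical_score("deep learning", 1, True, 0.5))  # 13 * 2 * 0.5 = 13.0
```

Because the factors multiply, a word scores high only when it is reasonably long, well placed, and distinctive in the corpus at the same time.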
After the weighted value of each word's statistical characteristics is calculated, the words can be screened on that basis. Illustratively, this can be implemented as follows: set a threshold, filter out the words whose weighted value is below the threshold, and take the words whose weighted value is at or above the threshold as candidate words, which proceed to the next, semantics-based screening.
Taking paper keyword extraction as the application field, the words obtained in step S310 by segmenting the text to be extracted, "human behavior recognition research under a complex scene based on deep learning", with the constructed word list are: "based on", "deep learning", "of", "complex scene", "under", "human body", "behavior recognition", "research". Assume the statistical characteristics and weighted values determined for these words are as shown in the following table:
table 1:
(Table 1 lists the statistical-feature weighted value of each word; the numeric values are not preserved in the source.)
The screening process and result based on statistical characteristics may be as follows: with the preset threshold set to 1.2, the words "based on", "of" and "under" are filtered out in the statistical screening stage, and the remaining words "deep learning", "complex scene", "human body", "behavior recognition" and "research" are taken as candidate words and proceed to the semantics-based screening stage of step S330.
In step S330, the text to be extracted and the candidate words are input into a pre-trained deep semantic matching model, and keywords are extracted from the text to be extracted according to the output result, where the deep semantic matching model is used to predict semantic similarity between the text to be extracted and the candidate words.
Although the statistical screening in step S320 can filter out some words that are unimportant from a statistical perspective, it may wrongly retain semantically irrelevant words as keywords. Step S330 therefore screens the candidate words again based on semantic characteristics, improving the accuracy of the keywords extracted from the text. The semantic screening is performed according to the semantic similarity between the text to be extracted and each candidate word, and the semantic similarity is obtained from the pre-trained deep semantic matching model.
The training process of the deep semantic matching model can be implemented as follows: obtain a plurality of corpus texts from the target corpus as training samples; for each training sample, obtain its corresponding labeled text from the target corpus as a positive sample and several other texts as negative samples; and train on the training samples, positive samples, and negative samples to obtain the deep semantic matching model.
Illustratively, taking paper keyword extraction as the application scenario and considering the structure of a paper, the semantics of the title and the abstract should be approximately equal, so the distance between them in the semantic space is very small. To model this well, the pre-trained deep semantic matching model in this scenario may be a DSSM model (a dual-tower architecture) originally designed for semantic query matching in search engines. That model uses click logs between query terms and target web pages: a deep neural network represents the query and the target as low-dimensional semantic vectors, the distance between the two vectors is computed via cosine distance, and the semantic similarity between query and target page is output. The model can thus predict the semantic similarity of two sentences and provide a low-dimensional semantic vector representation of a sentence.
Based on the structure of the DSSM model, the training process of the deep semantic matching model may be implemented as follows: the title and abstract of a paper are fed into the two towers, with the title as the target of semantic learning and its abstract as a positive sample similar to the title, while the abstracts of N randomly sampled papers serve as negative samples dissimilar to the title. An N-Gram model reduces the dimensionality of the input words to obtain bag-of-words representations of title and abstract; these are encoded by fully connected neural networks into fixed-dimension vectors; the cosine distances between the title and the positive and negative samples are computed; and finally the network is optimized with a negative log-likelihood loss function.
The above training process is described in detail with reference to the structure of the DSSM model, which, as shown in fig. 4, comprises three layers: an input layer, a representation layer, and a matching layer. Wherein:
The input layer uses an N-Gram model to reduce the dimensionality of the input words, compressing the vectors. For English papers a letter tri-gram model is used, i.e., segmentation is performed every 3 characters: the input word "algorithm" is segmented into "#al", "alg", "lgo", "gor", "ori", "rit", "ith", "thm", "hm#". This has two advantages: first, it compresses the space occupied by word vectors, reducing a one-hot space of 500,000 words to roughly a 30,000-dimensional tri-gram space; second, it strengthens generalization. For Chinese papers a uni-gram model is used, taking each character as the minimum unit: the input word "machine learning" is segmented into its four constituent Chinese characters. With character vectors as input, the vector space is about 15,000-dimensional (the dimension is determined by the number of common Chinese characters).
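The letter tri-gram segmentation of the input layer can be sketched directly; this reproduces the "algorithm" example from the text:

```python
def letter_trigrams(word):
    """Pad with '#' and slide a 3-character window (DSSM word hashing)."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("algorithm"))
# -> ['#al', 'alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm', 'hm#']
```

Every English word maps into the same fixed tri-gram inventory, which is what allows the one-hot vocabulary to shrink from hundreds of thousands of words to tens of thousands of tri-grams.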
The representation layer comprises three fully connected layers, each activated by a nonlinear activation function.
The matching layer calculates the similarity of positive and negative samples by using the cosine distance, and optimizes the neural network by using a negative log-likelihood loss function. The model was trained using the title and abstract of the paper as input data.
Both the intermediate network layers and the output layer are fully connected. Let W_i denote the weight matrix of the i-th layer and b_i its bias term. The hidden-layer vector l_i produced by the i-th intermediate layer and the output vector y produced by the output layer can be expressed as:
l_i = f(W_i · l_{i−1} + b_i), i = 2, …, N−1
y = f(W_N · l_{N−1} + b_N)
wherein f represents a hyperbolic tangent activation function, which is defined as follows:
f(x) = (1 − e^(−2x)) / (1 + e^(−2x))
Encoding through the intermediate network layers and the output layer yields a 128-dimensional semantic vector. The semantic similarity between a paper title and an abstract can then be represented by the cosine similarity of their two semantic vectors:
R(Q, D) = cos(y_Q, y_D) = (y_Q · y_D) / (‖y_Q‖ · ‖y_D‖)
The semantic similarity between the title and the positive-sample abstract can be converted into a posterior probability by the softmax function:
P(D⁺ | Q) = exp(γ · R(Q, D⁺)) / Σ_{D′∈D} exp(γ · R(Q, D′))
where γ is the smoothing factor of the softmax function, D⁺ is the positive-sample abstract for the title Q, D⁻ denotes a randomly sampled negative abstract, and D is the entire candidate set for the title, consisting of D⁺ and the sampled negatives.
In the training phase, the parameters are fitted by maximum likelihood estimation, i.e., the negative log-likelihood loss is minimized, so that after softmax normalization the similarity between a title and its positive-sample abstract is maximal:
L(Λ) = −log Π_{(Q, D⁺)} P(D⁺ | Q)
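The matching-layer computations (cosine similarity, smoothed softmax posterior, negative log-likelihood for one title) can be sketched with toy vectors; γ and the 2-dimensional vectors are illustrative stand-ins for the 128-dimensional outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity R(Q, D) between two semantic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def posterior(sim_pos, sim_negs, gamma=10.0):
    """P(D+ | Q): smoothed softmax over the positive and negative similarities."""
    scores = [math.exp(gamma * s) for s in [sim_pos] + sim_negs]
    return scores[0] / sum(scores)

def nll_loss(sim_pos, sim_negs, gamma=10.0):
    """Negative log-likelihood for one (title, positive abstract) pair."""
    return -math.log(posterior(sim_pos, sim_negs, gamma))

title = [1.0, 0.0]                     # toy semantic vectors
pos_abstract = [0.9, 0.1]
neg_abstracts = [[0.1, 0.9], [0.0, 1.0]]

sim_pos = cosine(title, pos_abstract)
sim_negs = [cosine(title, n) for n in neg_abstracts]
print(round(nll_loss(sim_pos, sim_negs), 4))  # small loss: positive is closest
```

Minimizing this loss pushes the title vector toward its own abstract and away from the sampled negatives, which is what makes the learned encodings usable for similarity scoring at extraction time.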
It should be noted that the above is only an exemplary illustration for the paper keyword extraction scenario; pre-trained deep semantic matching models for keyword extraction in other fields also fall within the protection scope of the embodiments of the disclosure.
After the deep semantic matching model is trained, inputting the text to be extracted and the candidate words into the pre-trained model and extracting keywords according to the output can be implemented as follows: input the candidate words and the text to be extracted into the model and output the semantic similarity between each candidate word and the text; then extract the candidate words whose semantic similarity exceeds a preset threshold from the text to be extracted as keywords.
For example, inputting the candidate words and the text to be extracted into the deep semantic matching model and outputting their semantic similarity may specifically be implemented as follows: encode the candidate words and the text to be extracted with the model to obtain their vector representations; then compute the cosine similarity between the candidate-word vector and the text vector as the semantic similarity.
Taking the paper keyword scenario as an example, after the deep semantic model is trained with title-abstract pairs as similar pairs, the trained model can semantically encode the candidate words, and the semantic similarity between each candidate keyword and the paper title/abstract is computed via cosine distance. Specifically, for the input paper title "human behavior recognition research under a complex scene based on deep learning" and the candidate words "deep learning", "complex scene", "human body", "behavior recognition" and "research", each is encoded by the semantic model trained with the DSSM structure, and the cosine similarity between each candidate keyword and the title is computed and ranked. Assume the results obtained are as shown in table 2 below:
table 2:
Candidate word       | Semantic similarity
Behavior recognition | 0.932
Deep learning        | 0.875
Complex scene        | 0.824
Human body           | 0.541
Research             | 0.323
Assuming the preset semantic-similarity threshold for screening is 0.6, "behavior recognition", "deep learning" and "complex scene" are finally output as the keywords of the text to be extracted.
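The final screening step amounts to a threshold filter over the similarities in table 2:

```python
# Keep candidates whose semantic similarity to the title exceeds the
# preset threshold (0.6), using the table 2 values.
similarities = {
    "behavior recognition": 0.932,
    "deep learning": 0.875,
    "complex scene": 0.824,
    "human body": 0.541,
    "research": 0.323,
}

THRESHOLD = 0.6
keywords = [w for w, s in similarities.items() if s > THRESHOLD]
print(keywords)  # -> ['behavior recognition', 'deep learning', 'complex scene']
```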
Correspondingly, an embodiment of the disclosure also provides a keyword extraction apparatus. As shown in fig. 5, the apparatus includes a word segmentation processing module 510, a preliminary extraction module 520, and a final extraction module 530. Wherein:
a word segmentation processing module 510 configured to perform word segmentation processing on the text to be extracted based on the constructed word list, and divide the text to be extracted into a plurality of words;
a preliminary extraction module 520 configured to determine statistical characteristics of each word, calculate a weighted value of the statistical characteristics of the word, and filter the word according to the weighted value to obtain a candidate word;
and a final extraction module 530 configured to input the text to be extracted and the candidate words into a pre-trained deep semantic matching model, and extract the keywords from the text to be extracted according to the output result, where the deep semantic matching model is used to predict semantic similarity between the text to be extracted and the candidate words.
In this disclosure, the keyword extraction apparatus further includes a vocabulary constructing module, where the vocabulary constructing module is configured to: traversing a target corpus, and performing N-Gram word segmentation processing on the corpus in the target corpus to obtain a plurality of word segmentation candidate words; determining the statistical characteristics of the word segmentation candidate words, and calculating the word forming probability of the corresponding word segmentation candidate words according to the statistical characteristics of the word segmentation candidate words; the statistical characteristics of the word segmentation candidate words comprise word frequency, point mutual information, left-right entropy, inverse document frequency and average time span of the word segmentation candidate words; and constructing a word list according to the word forming probability.
The keyword extraction device further comprises a deep semantic matching model training module, which is used for: acquiring a plurality of corpora from a target corpus as training texts; for each training sample, acquiring a labeled text corresponding to the training sample from the target corpus as a positive sample, and acquiring a plurality of other texts as negative samples; and training based on the training samples, the positive samples and the negative samples to obtain a deep semantic matching model.
The preliminary extraction module comprises a statistical characteristic weighted value calculation unit and a screening unit. Wherein:
the statistical characteristic weight value calculating unit is configured to: counting word lengths of all words; calculating the position weight score of the word position of each word according to a preset position formula; calculating the word frequency of each word in the text to be extracted, determining the inverse document frequency of each word in the target corpus, and calculating the word frequency-inverse document frequency weight score of the word according to the word frequency and the inverse document frequency; and calculating a weighted value according to the word length, the position weight score and the word frequency-inverse document frequency weight score.
The screening unit is used for: and filtering out the words with the weighted values lower than the preset threshold value, and taking the words with the weighted values higher than or equal to the preset threshold value as candidate words.
The final extraction module is specifically configured to: inputting the candidate words and the text to be extracted into a deep semantic matching model, and outputting semantic similarity of the candidate words and the text to be extracted; and extracting candidate words with semantic similarity higher than a preset threshold value from the text to be extracted as keywords.
The method specifically comprises the following steps of inputting the candidate words and the text to be extracted into the deep semantic matching model, and outputting the semantic similarity between the candidate words and the text to be extracted: coding the candidate words and the text to be extracted by using a deep semantic matching model to obtain corresponding vector representation; and calculating the vector distance between the candidate word and the text to be extracted through cosine similarity as semantic similarity.
With regard to the apparatus in the above-described embodiments, the specific manner in which each unit or module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail herein.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may stand alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by such an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A keyword extraction method is characterized by comprising the following steps:
performing word segmentation processing on a text to be extracted based on the constructed word list, and dividing the text to be extracted into a plurality of words;
determining the statistical characteristics of all the words, calculating the weighted values of the statistical characteristics of the words, and screening the words according to the weighted values to obtain candidate words;
and inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model, and extracting keywords from the text to be extracted according to an output result, wherein the deep semantic matching model is used for predicting semantic similarity between the text to be extracted and the candidate words.
2. The keyword extraction method according to claim 1, further comprising constructing the vocabulary;
the constructing the word list comprises the following steps:
traversing a target corpus, and performing N-Gram word segmentation processing on the corpus in the target corpus to obtain a plurality of word segmentation candidate words;
determining the statistical characteristics of each word segmentation candidate word, and calculating the word forming probability of each corresponding word segmentation candidate word according to the statistical characteristics of the word segmentation candidate words; the statistical characteristics of the word segmentation candidate words comprise word frequency, point mutual information, left-right entropy, inverse document frequency and average time span of the word segmentation candidate words;
and constructing the word list according to the word forming probability.
3. The keyword extraction method according to claim 1, wherein the statistical characteristics of the words include word length, word position, word frequency, and inverse document frequency;
the determining the statistical characteristics of the words and calculating the weighted values of the statistical characteristics of the words comprises the following steps:
counting the word length of each word;
calculating a position weight score of the word position of each word according to a preset position formula;
calculating the word frequency of each word in the text to be extracted, determining the inverse document frequency of each word in a target corpus, and calculating the word frequency-inverse document frequency weight score of each word according to the word frequency and the inverse document frequency;
calculating the weighting value according to the word length, the position weight score and the word frequency-inverse document frequency weight score.
4. The method according to claim 1 or 3, wherein the screening of the words according to the weighted values to obtain candidate words comprises:
and filtering out the words with the weighted values lower than a preset threshold, and taking the words with the weighted values higher than or equal to the preset threshold as the candidate words.
5. The method of claim 1, wherein the step of inputting the text to be extracted and the candidate words into a pre-trained deep semantic matching model and extracting keywords from the text to be extracted according to an output result comprises:
inputting the candidate words and the text to be extracted into the deep semantic matching model, and outputting semantic similarity between the candidate words and the text to be extracted;
and extracting the candidate words with the semantic similarity higher than a preset threshold value from the text to be extracted as the keywords.
6. The method according to claim 5, wherein the inputting the candidate word and the text to be extracted into the deep semantic matching model and outputting semantic similarity between the candidate word and the text to be extracted comprises:
coding the candidate words and the text to be extracted by utilizing the deep semantic matching model to obtain corresponding vector representation;
and calculating the vector distance between the candidate word and the text to be extracted through cosine similarity as the semantic similarity.
7. The keyword extraction method according to claim 1, wherein the deep semantic matching model is trained;
the training the deep semantic matching model comprises:
acquiring a plurality of corpora from a target corpus as training texts;
for each training sample, obtaining a labeled text corresponding to the training sample from the target corpus as a positive sample, and obtaining a plurality of other texts as negative samples;
and training based on the training sample, the positive sample and the negative sample to obtain the deep semantic matching model.
8. A keyword extraction device, comprising:
the word segmentation processing module is configured to perform word segmentation processing on a text to be extracted based on the constructed word list, and divide the text to be extracted into a plurality of words;
the preliminary extraction module is configured to determine statistical characteristics of the words, calculate weighted values of the statistical characteristics of the words, and screen the words according to the weighted values to obtain candidate words;
and the final extraction module is configured to input the text to be extracted and the candidate words into a pre-trained deep semantic matching model, extract keywords from the text to be extracted according to an output result, and the deep semantic matching model is used for predicting semantic similarity between the text to be extracted and the candidate words.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the keyword extraction method of any one of claims 1 to 7.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the keyword extraction method of any one of claims 1 to 7.
CN202211435542.2A 2022-11-16 2022-11-16 Keyword extraction method and device, electronic equipment and storage medium Pending CN115757680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211435542.2A CN115757680A (en) 2022-11-16 2022-11-16 Keyword extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115757680A true CN115757680A (en) 2023-03-07


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725922A (en) * 2023-04-13 2024-03-19 书行科技(北京)有限公司 Image generation method, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
Wang et al. Common sense knowledge for handwritten Chinese text recognition
US20200019611A1 (en) Topic models with sentiment priors based on distributed representations
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
EP4138047A2 (en) Method of processing video, method of querying video, and method of training model
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111414746B (en) Method, device, equipment and storage medium for determining matching statement
CN115186665B (en) Semantic-based unsupervised academic keyword extraction method and equipment
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN114091425A (en) Medical entity alignment method and device
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
WO2021190662A1 (en) Medical text sorting method and apparatus, electronic device, and storage medium
CN111159405B (en) Irony detection method based on background knowledge
CN112612892A (en) Special field corpus model construction method, computer equipment and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN115757680A (en) Keyword extraction method and device, electronic equipment and storage medium
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN110309278B (en) Keyword retrieval method, device, medium and electronic equipment
CN117131155A (en) Multi-category identification method, device, electronic equipment and storage medium
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN113836941B (en) Contract navigation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination