CN116306616A - Method and device for determining keywords of text - Google Patents

Info

Publication number
CN116306616A
Authority
CN
China
Prior art keywords
candidate
candidate object
text
redundancy
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310138465.2A
Other languages
Chinese (zh)
Inventor
卢永辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd
Priority to CN202310138465.2A
Publication of CN116306616A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention provide a method and a device for determining keywords of a text, and belong to the field of text processing. The method comprises: performing word segmentation on a text to be processed to split it into at least one element, thereby obtaining a word segmentation result list; determining candidate objects based on the elements of the word segmentation result list, wherein each candidate object comprises at least one element and its elements occupy consecutive positions in the word segmentation result list; screening the candidate objects based on screening features; and determining the keywords of the text to be processed based on the candidate objects obtained by screening. With this method, keywords can be determined from the text to be processed alone, without building a text corpus in advance.

Description

Method and device for determining keywords of text
Technical Field
The invention relates to the field of text processing, in particular to a method and a device for determining keywords of a text.
Background
Keyword extraction is a basic technology in the field of text processing. In today's age of information explosion, tens of thousands of texts are generated on the network every day, so accurately extracting the keywords of web text plays a vital role in managing and applying that text.
Existing keyword extraction methods fall roughly into two classes: supervised and unsupervised. A supervised method works well on text from the domain covered by its data set, but labeling the data set requires a great deal of manual effort, and performance drops markedly on text from domains the data set does not cover. Web text spans a wide and rapidly changing range of domains, so supervised methods are poorly suited to extracting keywords from it. Typical unsupervised methods are TF-IDF and YAKE, both of which extract keywords by constructing statistical features of the words in a text and achieve good results. However, most current unsupervised keyword extraction methods have the following limitations when applied to Chinese web text.
Dependence on a text corpus built in advance. Most unsupervised methods use statistical features similar to IDF, which require a text corpus to be built in advance and then updated and maintained promptly to stay accurate. In the web text scenario, massive amounts of data are generated every day, and collecting, screening, storing, updating and otherwise maintaining that data is costly.
The output granularity is a single word rather than a phrase made up of several words. A single word tends to be semantically broad, so it is less precise than a phrase and cannot accurately describe the key information of the target text. For example, "loan" refers broadly to all kinds of bank lending, while "provident fund loan" refers specifically to lending that draws on a provident fund.
Keyword features are not designed for the characteristics of Chinese text. Many keyword features used for English do not apply to Chinese, such as judging keywords by letter case or by whether a word is a root word. Meanwhile, some text features that are effective for Chinese go unused, such as the number of characters in a keyword (a single character is almost never a keyword, while a proper noun of three or more characters is very likely one) and the position at which a keyword appears in an article (a term appearing at the beginning of the article is not necessarily a keyword, but a term appearing at both the beginning and the end is very likely one).
Post-processing of the output is lacking. First, there is redundancy in output form: a keyword may be contained in another key phrase, or two phrases may largely coincide, for example when "loan" and "provident fund loan" are output together. Second, there is semantic redundancy: two results with very similar meanings may be output together, for example "agency company" and "agency institution". Finally, the output keywords are not checked for consistency with the semantics of the article. For example, for an article titled "How is a house left to a child?", the keyword "Xiaoming" may be output; "Xiaoming" is merely a placeholder name that recurs in the text and is mistaken for a keyword, has no bearing on the meaning of the article as a whole, and cannot be used as a keyword.
Disclosure of Invention
It is an aim of embodiments of the present invention to provide a method and an apparatus for determining keywords of a text that solve, or at least partially solve, the above-mentioned problems.
To achieve the above object, one aspect of the embodiments of the present invention provides a method for determining keywords of a text, the method including: performing word segmentation on a text to be processed to split it into at least one element, thereby obtaining a word segmentation result list; determining candidate objects based on the elements of the word segmentation result list, wherein each candidate object includes at least one element and its elements occupy consecutive positions in the word segmentation result list; screening the candidate objects based on screening features, wherein the screening features include at least one of: whether the candidate object belongs to a preset deactivated element set, a score, output form redundancy between candidate objects, semantic redundancy between candidate objects, and semantic similarity between the candidate object and the text to be processed, wherein the score of a candidate object reflects the likelihood that the candidate object is a keyword of the text to be processed; and determining the keywords of the text to be processed based on the candidate objects obtained by screening.
Accordingly, another aspect of the embodiments of the present invention provides an apparatus for determining keywords of a text, the apparatus including: a word segmentation module, configured to perform word segmentation on a text to be processed to split it into at least one element, thereby obtaining a word segmentation result list; a candidate object determination module, configured to determine candidate objects based on the elements of the word segmentation result list, wherein each candidate object includes at least one element and its elements occupy consecutive positions in the word segmentation result list; a screening module, configured to screen the candidate objects based on screening features, wherein the screening features include at least one of: whether the candidate object belongs to a preset deactivated element set, a score, output form redundancy between candidate objects, semantic redundancy between candidate objects, and semantic similarity between the candidate object and the text to be processed, wherein the score of a candidate object reflects the likelihood that the candidate object is a keyword of the text to be processed; and a keyword determination module, configured to determine the keywords of the text to be processed based on the candidate objects obtained by screening.
Still another aspect of an embodiment of the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the above-described method.
In addition, another aspect of the embodiments of the present invention further provides a processor, configured to execute a program, where the program is executed to perform the method described above.
Furthermore, another aspect of the embodiments of the present invention provides a computer program product comprising a computer program/instruction which, when executed by a processor, implements the method described above.
According to the above technical scheme, word segmentation is performed on the text to be processed to obtain a word segmentation result list, candidate objects are determined based on the list and screened, and the keywords of the text to be processed are then determined. Keywords can therefore be determined from the text to be processed alone, without building a text corpus in advance. Because a candidate object includes at least one element, a determined keyword that includes several elements is no longer just a single word, which improves the accuracy of the determined keywords and describes the key information of the target text more precisely. Screening the candidate objects based on output form redundancy between candidate objects and/or semantic similarity between the candidate object and the text to be processed means that these factors are taken into account when the keywords are determined.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain, without limitation, the embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of a method for determining keywords of text provided by an embodiment of the present invention;
FIG. 2 is a logical schematic diagram of a method for determining keywords of text provided by another embodiment of the present invention;
FIG. 3 is a schematic diagram of a logic for screening candidates based on output form redundancy between candidates according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of a logic for screening candidates based on semantic redundancy between the candidates according to another embodiment of the present invention; and
FIG. 5 is a block diagram of an apparatus for determining keywords of text according to another embodiment of the present invention.
Description of the reference numerals
1. Word segmentation module
2. Candidate object determination module
3. Screening module
4. Keyword determination module
Detailed Description
The following describes the detailed implementation of the embodiments of the present invention with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
One aspect of an embodiment of the present invention provides a method for determining keywords of text.
Fig. 1 is a flowchart of a method for determining keywords of text provided in an embodiment of the present invention. As shown in fig. 1, the method includes the following.
In step S10, word segmentation is performed on the text to be processed to split it into at least one element and obtain a word segmentation result list. The text to be processed is the text whose keywords are to be determined. Optionally, it may be Chinese text, English text, or web text. Word segmentation means dividing continuous text, in a reasonable way, into meaningful language units of smaller granularity, such as single characters, punctuation marks, words and idioms; these smaller units are the elements in the embodiments of the invention. Optionally, word segmentation may be performed with an open-source word segmentation tool, for example HanLP or jieba. In the resulting word segmentation result list, the elements appear in the same order as in the text to be processed, so the original text can be fully recovered from the list.
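The recoverability property of step S10 can be illustrated with a minimal Python sketch. The `segment` function below is a hypothetical stand-in built on a regular expression, not HanLP or jieba and not the segmenter of the embodiment; it only shows that an order-preserving list of elements concatenates back to the original text.

```python
import re
from typing import List


def segment(text: str) -> List[str]:
    """Stand-in segmenter (an assumption for illustration only).

    Splits text into runs of word characters and single non-word
    characters, so no character of the input is dropped.
    """
    return re.findall(r"\w+|\W", text)


text = "How is a house left to a child?"
elements = segment(text)

# The elements keep their original order, so concatenating them
# restores the text to be processed.
restored = "".join(elements)
```

A real deployment would replace `segment` with a proper word segmentation tool; the list-of-elements shape is the only part the later sketches rely on.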
In step S11, candidate objects are determined based on the elements of the word segmentation result list, wherein each candidate object includes at least one element and its elements occupy consecutive positions in the list. Optionally, the number of elements per candidate object may be preset; starting from the first element of the list, the elements are then divided once every preset number of elements, so that each candidate object contains the preset number of elements at consecutive positions. For example, with a preset number of 2, every two elements, starting from the first element of the list, are divided off as one candidate object. Optionally, a range for the number of elements per candidate object may be preset instead, and the list is divided in the same way, starting from its first element, once for each value in the range. For example, if the number is at least 1 and at most 3, the list is divided with the number set to 1, 2 and 3 in turn: when the number is 1, each element forms a candidate object; when it is 2, every two elements form a candidate object; when it is 3, every three elements form a candidate object. When a range is preset, all candidate objects obtained for all values in the range are the determined candidate objects and take part in the subsequent operations.
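The division in step S11 might be sketched in Python as follows. The text can be read either as cutting the list into consecutive groups or as sliding a window one element at a time; the sketch supports both through a `sliding` flag, which is an assumption of this example, as are the function names and the choice to drop a trailing group that is too short.

```python
from typing import List, Tuple

Candidate = Tuple[str, ...]  # elements at consecutive positions


def divide(elements: List[str], size: int, step: int) -> List[Candidate]:
    """Group `size` consecutive elements, advancing `step` positions each time.

    A trailing group with fewer than `size` elements is dropped (assumption).
    """
    return [tuple(elements[i:i + size])
            for i in range(0, len(elements) - size + 1, step)]


def candidates_for_range(elements: List[str], min_size: int = 1,
                         max_size: int = 3,
                         sliding: bool = False) -> List[Candidate]:
    """Collect the candidates for every size in [min_size, max_size]."""
    result: List[Candidate] = []
    for size in range(min_size, max_size + 1):
        step = 1 if sliding else size  # literal reading: cut every `size` elements
        result.extend(divide(elements, size, step))
    return result


tokens = ["real estate", "project", "surrounding", "facilities"]
pairs = divide(tokens, size=2, step=2)
all_candidates = candidates_for_range(tokens)
all_ngrams = candidates_for_range(tokens, sliding=True)
```

With `sliding=True` the result is the usual set of n-grams, which yields more candidates at the cost of more screening work.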
In step S12, the candidate objects are screened based on screening features, wherein the screening features include at least one of: whether the candidate object belongs to a preset deactivated element set, a score, output form redundancy between candidate objects, semantic redundancy between candidate objects, and semantic similarity between the candidate object and the text to be processed. The score of a candidate object reflects the likelihood that the candidate object is a keyword of the text to be processed.
A preset deactivated element is an element that should not appear in a keyword, in effect a stop word, and may be set as the situation requires; for example, "basic", "how" or "general". The preset deactivated element set contains all preset deactivated elements. For example, if a candidate object includes one or more preset deactivated elements, it is determined to belong to the preset deactivated element set and is removed, since it cannot be a keyword of the text to be processed; only when none of its elements is a preset deactivated element is the candidate object determined not to belong to the set.
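The rule above amounts to a membership test over the elements of each candidate object. A minimal Python sketch, in which the function name and the sample data are illustrative assumptions:

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, ...]


def remove_deactivated(candidates: List[Candidate],
                       deactivated: Set[str]) -> List[Candidate]:
    """Keep only candidates none of whose elements is a deactivated element."""
    return [c for c in candidates
            if not any(element in deactivated for element in c)]


deactivated_elements = {"basic", "how", "general"}
candidates = [
    ("provident fund", "loan"),
    ("how", "loan"),        # contains a deactivated element, so it is removed
    ("general",),           # consists only of a deactivated element
    ("agency", "company"),
]
kept = remove_deactivated(candidates, deactivated_elements)
```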
In addition, output form redundancy between candidate objects may be inclusion redundancy or overlap redundancy. Inclusion redundancy exists between two candidate objects when one contains the other, that is, one candidate object includes all elements of the other. For example, if one candidate object is "provident fund loan" and the other is "loan", then "loan" is contained in "provident fund loan", so an inclusion relationship, and hence inclusion redundancy, exists between them. Whether inclusion redundancy exists can be determined from the string relationship, namely by checking whether the shorter candidate object is a substring of the longer one, where length is the number of characters in a candidate object. Overlap redundancy exists between two candidate objects when they overlap, that is, a prefix element of one is the same as a suffix element of the other. For example, if one candidate object is "real estate project" and the other is "project surrounding", the suffix element of the former and the prefix element of the latter are both "project", so the two overlap and overlap redundancy exists. Two candidate objects with overlap redundancy can be spliced into a new term; splicing "real estate project" and "project surrounding" gives "real estate project surrounding".
For any two candidate objects, output form redundancy is determined by checking whether an inclusion relationship or an overlap relationship exists between them. The overlap relationship may be determined as follows. Let candidate object A and candidate object B have lengths SA and SB, and let the spliced term have length S, where length is the number of characters. The length ratio before and after splicing is ratio = S / (SA + SB). When the ratio is greater than 0.35, an overlap relationship exists between candidate object A and candidate object B; otherwise it does not. For example, if inclusion redundancy exists between two candidate objects, the shorter one is removed; if overlap redundancy exists, both are removed and replaced by the term obtained by splicing them.
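The two checks can be sketched in Python as below. The sketch takes the formula ratio = S / (SA + SB) and the 0.35 threshold literally from the text; representing a candidate object as a tuple of elements, joining elements without separators to count characters (as for Chinese text), and the function names are all assumptions of this example.

```python
from typing import Optional, Tuple

Candidate = Tuple[str, ...]  # elements in text order


def char_len(c: Candidate) -> int:
    """Length as defined in the text: the number of characters."""
    return sum(len(element) for element in c)


def has_inclusion(a: Candidate, b: Candidate) -> bool:
    """True if the shorter candidate is a substring of the longer one."""
    shorter, longer = sorted(("".join(a), "".join(b)), key=len)
    return shorter in longer


def splice(first: Candidate, second: Candidate) -> Optional[Candidate]:
    """Join the two where suffix elements of `first` equal prefix elements of `second`."""
    for k in range(min(len(first), len(second)), 0, -1):
        if first[-k:] == second[:k]:
            return first + second[k:]
    return None


def overlap_splice(a: Candidate, b: Candidate,
                   threshold: float = 0.35) -> Optional[Candidate]:
    """Return the spliced term if an overlap relationship exists, else None."""
    for first, second in ((a, b), (b, a)):
        joined = splice(first, second)
        if joined is not None:
            ratio = char_len(joined) / (char_len(a) + char_len(b))
            if ratio > threshold:
                return joined
    return None


estate = ("real estate", "project")
around = ("project", "surrounding")
spliced = overlap_splice(estate, around)
```

Comparing whole elements rather than raw characters avoids treating a single shared letter at a boundary as an overlap.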
In addition, semantic redundancy between candidate objects means that two candidate objects are semantically similar or identical. Optionally, whether semantic redundancy exists between two candidate objects may be determined as follows. The two candidate objects are represented as semantic vectors, for example using an embedding model. The semantic similarity between them is then calculated from the semantic vectors, for example expressed as a cosine distance or a Euclidean distance, and compared with a preset candidate object semantic similarity threshold. If the calculated semantic similarity does not reach the threshold, the two candidate objects are neither similar nor identical and no semantic redundancy exists; if it reaches the threshold, they are similar or identical and semantic redundancy exists. For example, if semantic redundancy exists between two candidate objects, the one with the greater semantic similarity to the text to be processed is retained; the semantic similarity between a candidate object and the text to be processed can be calculated in the same way as that between two candidate objects.
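As a rough Python illustration of the threshold test, the sketch below uses cosine similarity (higher means more similar) on hand-written toy vectors. The vectors, the 0.9 threshold and the function names are assumptions; in practice the vectors would come from an embedding model, which is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantically_redundant(u: Sequence[float], v: Sequence[float],
                           threshold: float = 0.9) -> bool:
    """Redundant when the similarity reaches the preset threshold."""
    return cosine_similarity(u, v) >= threshold


# Toy semantic vectors (hypothetical values, not real embeddings).
agency_company = [0.9, 0.1, 0.2]
agency_institution = [0.85, 0.15, 0.25]
loan = [0.1, 0.9, 0.3]

near_duplicates = semantically_redundant(agency_company, agency_institution)
unrelated = semantically_redundant(agency_company, loan)
```

If a distance is used instead of a similarity, the comparison direction flips: a smaller distance means the candidates are closer in meaning.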
In addition, the semantic similarity between a candidate object and the text to be processed reflects how close the two are in meaning. The higher the semantic similarity, the more accurately the candidate object describes the text. Optionally, it may be determined as follows. Semantic vectors are calculated for the candidate object and for the text to be processed, for example using an embedding model. The semantic similarity between the two is then calculated from these vectors, for example expressed as a cosine distance or a Euclidean distance. For example, a semantic similarity threshold is preset, and candidate objects whose semantic similarity to the text to be processed does not reach the threshold are removed.
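A minimal Python sketch of this threshold screening follows, again with cosine similarity and toy vectors standing in for the output of an embedding model. The vectors, the 0.5 threshold and the names are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def filter_by_text_similarity(candidate_vectors: Dict[str, Sequence[float]],
                              text_vector: Sequence[float],
                              threshold: float) -> List[str]:
    """Keep candidates whose similarity to the text reaches the threshold."""
    return [name for name, vector in candidate_vectors.items()
            if cosine_similarity(vector, text_vector) >= threshold]


# Toy vectors (hypothetical): the article is about leaving a house to a child.
text_vector = [1.0, 0.0]
candidate_vectors = {
    "house inheritance": [0.9, 0.1],
    "Xiaoming": [0.1, 0.9],  # a recurring name unrelated to the topic
}
kept = filter_by_text_similarity(candidate_vectors, text_vector, threshold=0.5)
```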
To screen candidate objects by score, the candidate objects may be sorted by score and those ranked within a preset number of top positions selected; alternatively, a score threshold may be set and the candidate objects whose scores reach the threshold selected.
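Both score-based options are short enough to sketch in Python. How the score itself is computed is not specified in this passage, so the scores below are made-up values and the function names are assumptions.

```python
from typing import Dict, List


def top_ranked(scores: Dict[str, float], count: int) -> List[str]:
    """Sort by score, highest first, and keep the first `count` candidates."""
    return sorted(scores, key=scores.get, reverse=True)[:count]


def reaching_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep candidates whose score reaches the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical scores for illustration only.
scores = {"provident fund loan": 0.82, "agency company": 0.64, "loan": 0.31}
best_two = top_ranked(scores, count=2)
above = reaching_threshold(scores, threshold=0.6)
```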
Optionally, in embodiments of the present invention, when the screening features include several items, the candidate objects may be screened on those items one after another. For example, if the screening features include output form redundancy between candidate objects, semantic redundancy between candidate objects, and semantic similarity between the candidate object and the text to be processed, the candidate objects may be screened in the following order: first by output form redundancy between candidate objects, then by semantic redundancy between candidate objects, and then by semantic similarity between the candidate object and the text to be processed.
In step S13, the keywords of the text to be processed are determined based on the candidate objects obtained by screening. Optionally, the candidate objects obtained by screening are themselves the keywords of the text to be processed. Optionally, a required number of keywords may be preset, and that number of candidate objects is selected at random from the candidate objects obtained by screening to serve as the keywords; or the candidate objects obtained by screening are sorted by score and those in the top preset number of positions are selected as the keywords of the text to be processed.
According to the above technical scheme, word segmentation is performed on the text to be processed to obtain a word segmentation result list, candidate objects are determined based on the list and screened, and the keywords of the text to be processed are then determined. Keywords can therefore be determined from the text to be processed alone, without building a text corpus in advance. Because a candidate object includes at least one element, a determined keyword that includes several elements is no longer just a single word, which improves the accuracy of the determined keywords and describes the key information of the target text more precisely. Screening the candidate objects based on output form redundancy between candidate objects and/or semantic similarity between the candidate object and the text to be processed means that these factors are taken into account when the keywords are determined.
Optionally, in an embodiment of the present invention, screening the candidate objects based on output form redundancy between candidate objects may include the following.
Step 1: A first candidate object set is established from the candidate objects to be screened based on output form redundancy between candidate objects. These may be candidate objects that have already been screened on some of the screening features, or all unscreened candidate objects determined from the elements of the word segmentation result list.
Step 2: A form redundancy comparison candidate object is determined and deleted from the first candidate object set, the form redundancy comparison candidate object being the longest candidate object in the first candidate object set. The length of a candidate object is the number of characters it includes.
Step 3: For each candidate object remaining in the first candidate object set, the following is performed. It is judged whether output form redundancy exists between the candidate object in question and the form redundancy comparison candidate object, in the manner described in the above embodiment. If output form redundancy exists, a screening candidate object is determined according to a first preset rule, the form redundancy comparison candidate object is updated based on the screening candidate object, and the candidate object in question is deleted from the first candidate object set. If no output form redundancy exists, the candidate object in question is deleted from the first candidate object set and saved to a third candidate object set. Updating the form redundancy comparison candidate object based on the screening candidate object means replacing the former with the latter. The name "form redundancy comparison candidate object" continues to be used after such an update; whether it has actually been updated depends on whether output form redundancy was found.
Step 4: The form redundancy comparison candidate object is saved to a second candidate object set. In other words, step 3 compares the form redundancy comparison candidate object determined in step 2 with each candidate object remaining in the first candidate object set, and what is saved here is the form redundancy comparison candidate object as it stands at the end of step 3, updated or not according to whether output form redundancy was found along the way.
Step 5: The candidate objects in the third candidate object set are copied to the first candidate object set, and the third candidate object set is emptied.
Step 6: It is determined whether the first candidate object set is empty.
Step 7: If the first candidate object set is not empty, steps 2 to 6 are repeated until it is empty. The form redundancy comparison candidate objects in the second candidate object set are then the candidate objects obtained by screening based on output form redundancy between candidate objects.
Optionally, in an embodiment of the present invention, the first preset rule may include the following: when the output form redundancy is inclusion redundancy, the screening candidate object is the form redundancy comparison candidate object; and/or when the output form redundancy is overlap redundancy, the screening candidate object is the object obtained by splicing the form redundancy comparison candidate object with the candidate object in question.
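Steps 1 to 7 can be sketched in Python as a loop over three working lists. The sketch is one possible reading, not the embodiment's implementation: candidate objects are tuples of elements, characters are counted on the elements joined without separators, the overlap test uses the ratio and 0.35 threshold given earlier, and `resolve` and the other names are hypothetical.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, ...]


def char_len(c: Candidate) -> int:
    """Length of a candidate: the number of characters it includes."""
    return sum(len(element) for element in c)


def resolve(comparison: Candidate, other: Candidate,
            threshold: float = 0.35) -> Optional[Candidate]:
    """Apply the first preset rule; return None if there is no redundancy."""
    if "".join(other) in "".join(comparison):
        return comparison  # inclusion redundancy: keep the comparison candidate
    for first, second in ((comparison, other), (other, comparison)):
        for k in range(min(len(first), len(second)), 0, -1):
            if first[-k:] == second[:k]:
                joined = first + second[k:]
                ratio = char_len(joined) / (char_len(comparison) + char_len(other))
                if ratio > threshold:
                    return joined  # overlap redundancy: keep the spliced term
                break
    return None


def screen_form_redundancy(candidates: List[Candidate]) -> List[Candidate]:
    first_set = list(candidates)                   # step 1
    second_set: List[Candidate] = []
    while first_set:                               # steps 6 and 7
        comparison = max(first_set, key=char_len)  # step 2: longest candidate
        first_set.remove(comparison)
        third_set: List[Candidate] = []
        for candidate in first_set:                # step 3
            screened = resolve(comparison, candidate)
            if screened is None:
                third_set.append(candidate)        # no redundancy: set aside
            else:
                comparison = screened              # update comparison candidate
        second_set.append(comparison)              # step 4
        first_set = third_set                      # step 5
    return second_set


candidates = [
    ("loan",),
    ("provident fund", "loan"),
    ("real estate", "project"),
    ("project", "surrounding"),
    ("agency",),
]
screened = screen_form_redundancy(candidates)
```

Each pass removes at least the comparison candidate from the first set, so the loop always terminates.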
Alternatively, in an embodiment of the present invention, filtering candidate objects based on semantic redundancy between candidate objects may include the following. Step 8: a fourth set of candidates is established for the candidates to be screened based on semantic redundancy between the candidates. The candidate objects to be screened based on the semantic redundancy between the candidate objects may be candidate objects that have been screened based on part of the content included in the screening feature, or may be candidate objects that have not been screened and have been determined based on elements in the word segmentation result list. Step 9: and determining a semantic redundancy comparison candidate object and deleting the semantic redundancy comparison candidate object from the fourth candidate object set, wherein the semantic redundancy comparison candidate object is the candidate object with the largest semantic similarity with the text to be processed in the fourth candidate object set. Specifically, the semantic similarity between the candidate object and the text to be processed can be calculated with reference to the content described in the above embodiments. Step 10: for each candidate remaining in the fourth candidate set, performing the following: judging whether semantic redundancy exists between the targeted candidate object and the semantic redundancy comparison candidate object, specifically, judging whether semantic redundancy exists by referring to the method described in the above embodiment; deleting the targeted candidate object from the fourth candidate object set in the case of semantic redundancy; in the absence of semantic redundancy, the candidate object in question is deleted from the fourth candidate object set and saved to the sixth candidate object set. Step 11: and saving the semantic redundancy comparison candidate object into a fifth candidate object set. 
In step 10, the semantic redundancy comparison candidate object determined in step 9 is compared with each candidate object remaining in the fourth candidate object set in order to screen those candidate objects; in step 11, the semantic redundancy comparison candidate object determined in step 9 is saved to the fifth candidate object set. Step 12: the candidate objects in the sixth candidate object set are copied to the fourth candidate object set, and the sixth candidate object set is emptied. Step 13: it is determined whether the fourth candidate object set is empty. Step 14: if the fourth candidate object set is not empty, steps 9 to 13 are repeated until it is empty, at which point the semantic redundancy comparison candidate objects in the fifth candidate object set are the candidate objects obtained by screening based on semantic redundancy between candidate objects.
Alternatively, in an embodiment of the present invention, filtering the candidate object based on the semantic similarity between the candidate object and the text to be processed may include the following. For any candidate object to be screened based on the semantic similarity between the candidate object and the text to be processed, the semantic similarity between the candidate object to be screened and the text to be processed is calculated, and specifically, the semantic similarity between the candidate object and the text to be processed may be calculated by referring to the method described in the above embodiment. The candidate to be screened based on the semantic similarity between the candidate and the text to be processed may be a candidate that has been screened based on a part of the content included in the screening feature, or may be all candidates that have not been screened and are determined based on elements in the word segmentation result list. Screening candidate objects to be screened based on the calculated semantic similarity, removing candidate objects to be screened, of which the semantic similarity does not reach a preset semantic similarity threshold, and reserving candidate objects to be screened, of which the semantic similarity reaches the preset semantic similarity threshold, wherein the candidate objects to be screened, of which the semantic similarity reaches the preset semantic similarity threshold, are candidate objects obtained by screening based on the semantic similarity between the candidate objects and the text to be processed.
Optionally, in an embodiment of the present invention, determining the score for any candidate object may include the following. A title score and/or a body score corresponding to the candidate object is determined based on the score determination features of the candidate object, where, for any candidate object, the score determination features include at least one of: a first number of times, a second number of times, the position weight of the candidate object, the length of the candidate object, and the number of elements included in the candidate object. The first number of times is the number of times the candidate object appears in the title of the text to be processed, and the second number of times is the number of times the candidate object appears in the body of the text to be processed. The position weight of the candidate object is the weight corresponding to the position of the candidate object in the body relative to the entire body. For example, the position weight may include a first position weight, corresponding to the ratio of the position of the candidate object's first occurrence in the body to the full length of the body, and a second position weight, corresponding to the ratio of the candidate object's position span in the body to the full length of the body. The length of the candidate object is the number of characters it includes, and the full length of the body is the total number of characters in the body. In addition, the title score is the score determined for the candidate object's occurrences in the title, and the body score is the score determined for its occurrences in the body. The score is determined based on the title score and/or the body score corresponding to the candidate object.
For example, where there is only a title score, the title score is the determined score; where there is only a body score, the body score is the determined score; where there are both a title score and a body score, the score may be determined by summing them or by a weighted sum. The score is thus determined according to features such as the position weight and length of the candidate object, and the candidate objects are screened based on the score, so that the features of the finally obtained keywords are optimized and the accuracy of the obtained keywords is improved.
Optionally, in an embodiment of the present invention, determining the title score for any candidate object includes determining it based on the following formula: Score_title = 1.5 * TF_1 * log_2(M+1) * log_2(P/2*M+0.5), where Score_title represents the title score, TF_1 represents the first number of times, M represents the number of elements included in the candidate object, and P represents the length of the candidate object. In this formula, 1.5 is the score coefficient for title candidates, used because content in the title is more likely than content in the body to become a keyword.
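The title score formula above can be sketched in a few lines of Python. This is only an illustration: the function name `title_score` and its parameter names are hypothetical, and the term P/2*M is evaluated left to right, i.e. as (P/2)*M, which is an assumption because the formula as given carries no parentheses.

```python
import math


def title_score(tf1: int, m: int, p: int) -> float:
    """Title score of one candidate object.

    tf1: number of times the candidate appears in the title (TF_1)
    m:   number of elements in the candidate (M)
    p:   number of characters in the candidate (P)
    """
    # Assumption: "P/2*M" is read left to right, i.e. (P / 2) * M.
    length_term = math.log2(p / 2 * m + 0.5)
    return 1.5 * tf1 * math.log2(m + 1) * length_term


# A single-character, single-element candidate gets a zero length term,
# so it scores 0 no matter how often it appears in the title.
print(title_score(tf1=3, m=1, p=1))
print(title_score(tf1=1, m=1, p=3))
```

Under this reading, the length term is zero for a one-character candidate, so very short candidates cannot accumulate a title score through frequency alone.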
Optionally, in an embodiment of the present invention, the position weight of the candidate object includes a first position weight, corresponding to the ratio of the position of the candidate object's first occurrence in the body to the full length of the body, and a second position weight, corresponding to the ratio of the candidate object's position span in the body to the full length of the body. For any candidate object, determining the body score includes determining it based on the following formula: Score_content = TF_2 * log_2(M+1) * log_2(P/2*M+0.5) * (W_front + W_range), where Score_content represents the body score, TF_2 represents the second number of times, M represents the number of elements, P represents the length, W_front represents the first position weight, and W_range represents the second position weight.
Optionally, in an embodiment of the present invention, for any candidate object, a calculation formula of the first location weight is:
Figure BDA0004087427290000131
where L_first represents the position index of the candidate object's first occurrence in the body, and L_total represents the full length of the body. The position index of the first occurrence is the number of characters in the body that precede the position where the candidate object first appears; the full length of the body is the total number of characters in the body.
Optionally, in an embodiment of the present invention, for any candidate object, the calculation formula of the second position weight is: W_range = 2*L_range/L_total, where L_range represents the position span of the candidate object in the body, that is, the position index of its last occurrence minus the position index of its first occurrence. Clearly, when a candidate object appears only once in the body, W_range = 0.
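As a rough illustration of the second position weight and the body score, the sketch below computes W_range from character offsets and then plugs it into the body score formula. The function names are hypothetical; W_front is passed in as a plain number because its formula is given only as a figure; and P/2*M is again read left to right as (P/2)*M, which is an assumption.

```python
import math


def range_weight(body: str, candidate: str) -> float:
    """W_range = 2 * L_range / L_total, using character offsets in the body."""
    first = body.find(candidate)   # characters preceding the first occurrence
    if not body or first == -1:
        return 0.0                 # candidate absent: no span to measure
    last = body.rfind(candidate)   # offset of the last occurrence
    return 2 * (last - first) / len(body)


def body_score(tf2: int, m: int, p: int, w_front: float, w_range: float) -> float:
    """Score_content = TF_2 * log2(M+1) * log2(P/2*M+0.5) * (W_front + W_range)."""
    # Assumption: "P/2*M" is read left to right, i.e. (P / 2) * M.
    return tf2 * math.log2(m + 1) * math.log2(p / 2 * m + 0.5) * (w_front + w_range)


body = "abcxyzabc"
print(range_weight(body, "abc"))   # spans most of the body
print(range_weight(body, "xyz"))   # appears once, so the span is zero
```

A candidate that recurs from the beginning to the end of the body gets a larger W_range than one that appears once, which matches the observation that W_range is 0 for a single occurrence.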
Optionally, in an embodiment of the present invention, before the candidate objects are screened based on the screening features, the method may further include the following. The elements in the word segmentation result list are tagged with parts of speech, and the screening features further include at least one of: whether the candidate object belongs to a preset basic part-of-speech set, and whether it belongs to a preset boundary part-of-speech set. A preset basic part of speech is set in advance, can be flexibly adjusted for different application scenarios, and applies to individual elements; the preset basic part-of-speech set includes all preset basic parts of speech. For example, the preset basic parts of speech may be punctuation marks, auxiliary words, numerals, exclamations, pronouns, and the like. A candidate object includes at least one element, and a candidate object is judged not to belong to the preset basic part-of-speech set only if none of its elements has a part of speech in that set. Candidate objects judged not to belong to the preset basic part-of-speech set are retained by the screening and may become keywords of the text to be processed. A preset boundary part of speech is likewise set in advance, can be flexibly adjusted for different application scenarios, and applies to individual elements, but it is checked only against the elements on the boundary of a candidate object. For example, the preset boundary parts of speech may be adverbs, conjunctions, prepositions, and the like. The preset boundary part-of-speech set includes all preset boundary parts of speech. For example, if a candidate object is composed of element 1, element 2, and element 3, in that order, then element 1 is the left boundary of the candidate object and element 3 is the right boundary.
For any candidate, the candidate is determined to not belong to the preset boundary part-of-speech set only if both the part-of-speech of the left boundary element and the part-of-speech of the right boundary element of the candidate do not belong to the preset boundary part-of-speech set. The candidate objects which are judged not to belong to the preset boundary part-of-speech set can be filtered and reserved, and possibly become keywords of the text to be processed. Optionally, in the embodiment of the present invention, the left boundary part of speech may be distinguished from the right boundary part of speech, where the preset boundary part of speech set includes a preset left boundary part of speech set and a preset right boundary part of speech set, the preset left boundary part of speech set includes parts of speech set for a left boundary of the candidate object, and the preset right boundary part of speech set includes parts of speech set for a right boundary of the candidate object. When judging whether a candidate object belongs to a preset boundary part-of-speech set or not, comparing the part-of-speech of the left boundary of the candidate object with the preset left boundary part-of-speech in the preset left boundary part-of-speech set, comparing the part-of-speech of the right boundary of the candidate object with the preset right boundary part-of-speech in the preset right boundary part-of-speech set, and judging that the candidate object does not belong to the preset boundary part-of-speech set only when the part-of-speech of the left boundary element of the candidate object does not belong to the preset left boundary part-of-speech set and the part-of-speech of the right boundary element does not belong to the preset right boundary part-of-speech set. 
The preset boundary part-of-speech set is divided into the preset left boundary part-of-speech set and the preset right boundary part-of-speech set, so that the candidate objects obtained through screening are more accurate, and the accuracy of the obtained keywords is improved.
Fig. 2 is a logic diagram of a method for determining keywords of text according to another embodiment of the present invention. An exemplary description of a method for determining keywords of text provided by an embodiment of the present invention is provided below in connection with fig. 2. In this embodiment, an exemplary introduction is made taking the Chinese web text as an example. Briefly, the method generally comprises the following. Firstly, acquiring a Chinese web text to be processed and the number N of target keywords. Then, word segmentation and part-of-speech tagging are carried out on the Chinese network text to be processed; screening candidate objects according to the word segmentation and part-of-speech tagging results, and generating a first candidate object set. Extracting word frequency and word position characteristics of each candidate object in the first candidate object set, and scoring based on the word frequency and the word position; and after sorting according to the scoring result, acquiring Top 2N candidate objects to generate a second candidate object set. And carrying out post-processing on the candidate objects in the second candidate object set, wherein the post-processing mainly comprises three steps of reducing redundancy of an output form, reducing redundancy of semantics and improving accuracy of semantics. And acquiring Top N from the candidate objects subjected to the post-processing result as a final keyword output result. This is described in more detail below in conjunction with fig. 2.
First, the Chinese web text to be processed and the number N of target keywords to be extracted are acquired. The Chinese web text to be processed contains two parts, a title and a body, and its character encoding format is UTF-8. The number N of target keywords is the number of keywords to be extracted from the current Chinese web text; generally, 3 ≤ N ≤ 10. The value of N may be set according to the specific situation and is not limited here.
Second, word segmentation and part-of-speech tagging are performed on the Chinese web text to be processed. Word segmentation means dividing continuous text into smaller-granularity linguistic units with practical meaning, such as single characters, punctuation marks, words, and idioms; these smaller-granularity units are the elements in the embodiments of the invention. Word segmentation generally uses an open-source word segmentation tool such as HanLP or jieba. Word segmentation yields a word list (equivalent to the word segmentation result list in the embodiments of the invention), from which the original text can be completely restored. Part-of-speech tagging means assigning a part-of-speech tag to each element in the word list obtained by word segmentation; for example, with the HanLP tool, the tag for punctuation marks is "w". Part-of-speech tagging yields a part-of-speech list. The word list and the part-of-speech list have the same length, and the elements and their parts of speech correspond one to one through their position indexes in the two lists.
Third, candidate objects are generated from the word segmentation and part-of-speech tagging results. Candidate objects include candidate keywords and candidate key phrases, whose generation processes are independent of each other and depend only on the word segmentation and part-of-speech tagging results; a candidate keyword includes only one element, and a candidate key phrase includes a plurality of elements. M consecutive elements are extracted from the word list obtained by word segmentation as one candidate object: when M = 1, the candidate object is called a candidate keyword; when M ≥ 2, it is called a candidate key phrase. A value range for M is preset, and candidate objects are extracted for each value of M in that range.
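Extracting M consecutive elements for each preset value of M amounts to sliding a window over the word list. The sketch below shows one way this might look; the function name `generate_candidates`, the tuple representation of a candidate, and the sample word list are illustrative assumptions, and the word list is hard-coded rather than produced by a real segmentation tool.

```python
from typing import Iterable, List, Sequence, Tuple


def generate_candidates(words: Sequence[str], m_values: Iterable[int]) -> List[Tuple[str, ...]]:
    """Return every run of M consecutive elements, for each preset M."""
    candidates = []
    for m in m_values:
        # A window of size m fits at len(words) - m + 1 start positions.
        for start in range(len(words) - m + 1):
            candidates.append(tuple(words[start:start + m]))
    return candidates


# Hypothetical word list, standing in for the output of a word segmentation tool.
words = ["二手房", "交易", "流程"]
for candidate in generate_candidates(words, m_values=(1, 2)):
    kind = "keyword" if len(candidate) == 1 else "key phrase"
    print(kind, "".join(candidate))
```

Keeping each candidate as a tuple of elements, rather than a joined string, preserves the element boundaries that the later part-of-speech checks rely on.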
Fourth, screening candidate objects according to whether the candidate objects belong to a preset deactivated element set, whether the candidate objects belong to a preset basic part-of-speech set and whether the candidate objects belong to a preset boundary part-of-speech set, and generating a first candidate object set. Specifically, candidates may be screened in the following order: whether belonging to a preset deactivated element set, whether belonging to a preset basic part-of-speech set, and whether belonging to a preset boundary part-of-speech set. The preset deactivated word set is manually summarized, and the included elements can be "basic", "how", "general", and the like. The preset basic part-of-speech set can be flexibly adjusted according to different application scenes, and can generally comprise punctuation marks, auxiliary words, numerical words, exclamation words, pronouns and the like. The preset boundary part-of-speech set is divided into a preset left boundary part-of-speech set and a preset right boundary part-of-speech set, the preset left and right boundary part-of-speech sets are usually different, and can be flexibly adjusted according to different application scenes, and the preset boundary part-of-speech set can comprise adverbs, conjunctions, prepositions and the like. 
Judging whether a candidate object belongs to the preset deactivated element set means judging whether any element included in the candidate object exactly matches a preset deactivated element in the set, that is, whether at least one of the M elements included in the candidate object belongs to the preset deactivated element set. When at least one element of the candidate object belongs to the preset deactivated element set, the candidate object belongs to the set and is screened out; only when none of its elements belongs to the preset deactivated element set is the candidate object passed on to subsequent operations. As for the preset basic part-of-speech set, the candidate object does not belong to it when none of its elements has a part of speech in the set; otherwise, the candidate object belongs to the basic part-of-speech set. Whether the candidate object belongs to the preset boundary part-of-speech set covers whether its left boundary belongs to the preset left boundary part-of-speech set and whether its right boundary belongs to the preset right boundary part-of-speech set; the candidate object does not belong to the preset boundary part-of-speech set only when its left boundary does not belong to the preset left boundary part-of-speech set and its right boundary does not belong to the preset right boundary part-of-speech set; otherwise, it belongs to the boundary part-of-speech set. When M = 1, both the left and right boundaries of the candidate object are the candidate object itself; when M ≥ 2, the left boundary is the first element of the candidate object and the right boundary is the last element.
In particular, the purpose of distinguishing the preset basic part-of-speech set from the preset boundary part-of-speech set in the above method is as follows: when M ≥ 3, the intermediate elements of the candidate object need only be checked against the preset deactivated element set and the preset basic part-of-speech set, that is, when M > 2, elements belonging to the preset boundary part-of-speech set are allowed to appear in the interior of a candidate result. Optionally, whether the candidate object belongs to the preset deactivated element set, the preset basic part-of-speech set, and the preset boundary part-of-speech set is determined by comparison.
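The three checks described above can be sketched as a single predicate. Everything concrete here is an assumption for illustration: the function name `keep_candidate`, the sample sets, and the single-letter tags ("n" noun, "c" conjunction, "d" adverb, "p" preposition, "u" auxiliary, "w" punctuation, and so on), which merely imitate a typical Chinese part-of-speech tag set.

```python
from typing import Sequence, Set

# Illustrative preset sets; real ones would be tuned per application scenario.
STOP_ELEMENTS: Set[str] = {"基本", "如何", "一般"}
BASIC_POS: Set[str] = {"w", "u", "m", "e", "r"}   # punctuation, auxiliary, numeral, ...
LEFT_BOUNDARY_POS: Set[str] = {"d", "c", "p"}     # adverb, conjunction, preposition
RIGHT_BOUNDARY_POS: Set[str] = {"d", "c", "p"}


def keep_candidate(elements: Sequence[str], tags: Sequence[str]) -> bool:
    """Return True if the candidate passes all three screening checks."""
    # 1. Any element in the deactivated element set rejects the candidate.
    if any(element in STOP_ELEMENTS for element in elements):
        return False
    # 2. Any element with a basic part of speech rejects the candidate.
    if any(tag in BASIC_POS for tag in tags):
        return False
    # 3. Only the first and last elements are checked against the boundary
    #    sets; for M = 1 both boundaries are the single element itself.
    if tags[0] in LEFT_BOUNDARY_POS or tags[-1] in RIGHT_BOUNDARY_POS:
        return False
    return True


print(keep_candidate(["房屋", "和", "土地"], ["n", "c", "n"]))  # conjunction in the middle
print(keep_candidate(["和", "土地"], ["c", "n"]))               # conjunction on the left boundary
```

The first call illustrates why the two part-of-speech sets are kept separate: a conjunction is tolerated in the interior of a three-element candidate but not on its boundary.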
Fifth, subsequent operations are performed on the candidate objects that belong to none of the preset deactivated element set, the preset basic part-of-speech set, and the preset boundary part-of-speech set. Features such as word frequency and word position are extracted for each candidate object, and the candidate objects are scored, that is, their scores are determined as described in the above embodiments. Specifically, the word frequency features include the first number of times and the second number of times described in the above embodiments, and the word position features include the first position weight and the second position weight described in the above embodiments. Scoring is additionally based on the length of the candidate object and the number of elements it includes (the value of M). In this step, a title score and a body score are determined separately for each candidate object, using the calculation formulas described in the above embodiments. Then, for each candidate object, the final score is obtained by summing the title score and the body score.
Sixth, the candidate objects are sorted by score in descending order, and the top 2N candidate objects are selected to generate a second candidate object set.
Seventh, post-processing is performed on the second candidate object set to reduce the output form redundancy and semantic redundancy of the result and to improve its semantic accuracy. Output form redundancy is reduced by screening the candidate objects based on the output form redundancy between candidate objects, semantic redundancy is reduced by screening based on the semantic redundancy between candidate objects, and semantic accuracy is improved by screening based on the semantic similarity between each candidate object and the text to be processed. Specifically, the candidate objects in the second candidate object set are post-processed in the following screening order: reducing output form redundancy, reducing semantic redundancy, and improving semantic accuracy. The set of candidate objects remaining after post-processing is then output. In the embodiment of the invention, this order is the more reasonable one: candidate objects are screened first on their literal form, then on their semantics, and finally on their similarity to the whole article, moving from shallow to deep, so that the screened candidate objects are more accurate and the accuracy of the finally obtained keywords is improved.
The redundancy of the output form between the candidate objects is mainly divided into two types, wherein one type is that a containing relation exists between the candidate objects, namely, the containing redundancy exists between the candidate objects; the other is that there is an overlapping relationship between the candidate object and the candidate object, i.e., there is overlapping redundancy. FIG. 3 is a schematic diagram of a logic for screening candidates based on output form redundancy between candidates according to another embodiment of the present invention. The greedy idea is adopted in the whole, a candidate object W1 with the largest character string length is selected each time, then the rest candidate objects in the candidate set are traversed, and whether output form redundancy exists between the candidate objects and W1 is judged: updating the score value (score value) of W1 if redundancy exists (taking the maximum value between the score of W1 and the score of the candidate) and deleting this candidate from the candidate set; if no redundancy exists, the method is reserved and the next round of processing flow is entered.
Specifically, an exemplary description of how candidate objects are screened based on the output form redundancy between candidate objects is given below with reference to FIG. 3. A first candidate object set C1 (n) is given, which is established for the candidate objects to be screened based on the output form redundancy between candidate objects. When output form redundancy is reduced for the first time, the candidate objects in the set C1 (n) are the top 2N candidate objects selected above. A second candidate object set R1 (n) is created. It is determined whether the set C1 (n) is empty. If so, the candidate objects in the set R1 (n) are the post-processed candidate objects, that is, the candidate objects obtained by the screening that reduces output form redundancy. If not, a third candidate object set T1 (n) is created. The candidate objects in the set C1 (n) are sorted by a length index from longest to shortest, the candidate object ranked first is selected as W1, and W1 is deleted from the set C1 (n), where the length index is the number of characters included in a candidate object. The remaining candidate objects in the set C1 (n), that is, all candidate objects other than W1, are then traversed, and for each traversed candidate object: it is judged whether output form redundancy exists between the candidate object and W1, for which see the method described in the above embodiments; if containment redundancy exists, the currently traversed candidate object is deleted; if overlapping redundancy exists, the two candidate objects are spliced and W1 is replaced with the spliced candidate object; in either case, whether the redundancy is containment redundancy or overlapping redundancy, the weight of W1 is updated, where the weight is the score value, to the maximum of the score value of W1 and the score value of the currently traversed candidate object; if no output form redundancy exists, the currently traversed candidate object and its weight are saved to the set T1 (n).
W1 and its weight are then saved to the set R1 (n), the contents of the set T1 (n) are copied to the set C1 (n), and the set T1 (n) is emptied. The process then returns to judging whether the set C1 (n) is empty, and the operations performed when the set C1 (n) is not empty are repeated until it is empty, at which point the candidate objects in the set R1 (n) are the candidate objects obtained by reducing output form redundancy.
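The greedy loop over C1 (n), T1 (n) and R1 (n) described above can be sketched as follows. The exact tests for containment and overlapping redundancy are defined in an earlier embodiment and are not reproduced here, so this sketch assumes the simplest reading: containment means one string is a substring of the other, and overlap means a suffix of one string equals a prefix of the other. The function names and sample data are hypothetical.

```python
from typing import Dict, Optional


def splice_if_overlapping(a: str, b: str) -> Optional[str]:
    """Assumed overlap test: a suffix of one string equals a prefix of the other."""
    for left, right in ((a, b), (b, a)):
        for k in range(min(len(left), len(right)) - 1, 0, -1):
            if left.endswith(right[:k]):
                return left + right[k:]   # splice the two around the shared part
    return None


def reduce_form_redundancy(scored: Dict[str, float]) -> Dict[str, float]:
    c1 = dict(scored)                     # candidates still to be processed
    r1: Dict[str, float] = {}             # result set
    while c1:
        t1: Dict[str, float] = {}         # candidates kept for the next round
        w1 = max(c1, key=len)             # longest remaining candidate
        w1_score = c1.pop(w1)
        for cand, score in c1.items():
            if cand in w1:                # assumed containment test: substring
                w1_score = max(w1_score, score)
                continue
            spliced = splice_if_overlapping(w1, cand)
            if spliced is not None:       # overlapping redundancy: splice into W1
                w1, w1_score = spliced, max(w1_score, score)
            else:
                t1[cand] = score          # no redundancy: keep for next round
        r1[w1] = max(w1_score, r1.get(w1, w1_score))
        c1 = t1
    return r1


scored = {"二手房交易": 3.0, "二手房": 5.0, "交易流程": 2.0, "贷款": 1.0}
print(reduce_form_redundancy(scored))
```

In this example the shorter "二手房" is absorbed by containment, "交易流程" is spliced in through the shared "交易", and the surviving candidate keeps the highest score seen among the merged items.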
Furthermore, semantic redundancy between candidate objects means that the semantic similarity between candidate objects is too high, that is, they are semantically redundant. FIG. 4 is a logic diagram for screening candidate objects based on the semantic redundancy between candidate objects. A greedy approach is adopted overall: the candidate objects and the Chinese web text to be processed are each represented as semantic vectors by a pre-trained embedding model, and the semantic similarity between each candidate object and the Chinese web text is calculated using the cosine distance; in each round, the candidate object W2 that is semantically closest to the Chinese web text is selected, and the remaining candidate objects in the candidate object set are traversed to judge whether each is semantically redundant with W2. If semantic redundancy exists, the weight of W2 is updated (to the sum of the score value of W2 and the score value of the candidate object) and the candidate object is deleted from the candidate object set; if no semantic redundancy exists, the candidate object is retained and enters the next round of processing.
In particular, with reference to FIG. 4, an exemplary description is provided of how candidates may be screened based on semantic redundancy between the candidates. A fourth set of candidates C2 (n) is given, wherein the fourth set of candidates is established for candidates to be screened based on semantic redundancy between the candidates. In this embodiment, when the semantic redundancy is reduced for the first time, the candidate in the set C2 (n) is the candidate in the set R1 (n) obtained by reducing the output form redundancy. A fifth candidate set R2 (n) is created. And acquiring semantic vectors of the text to be processed and semantic vectors of all candidate objects through a pre-trained Embedding model. It is determined whether the set C2 (n) is empty. If yes, the candidate objects in the set R2 (n) are candidate objects subjected to post-processing, namely, the candidate objects obtained through reducing semantic redundancy screening. If not, a sixth candidate set T2 (n) is created. The candidate objects in the set C2 (n) are ordered according to the semantic similarity indexes of the candidate objects and the text to be processed (specifically, how to calculate the semantic similarity between the candidate objects and the text to be processed can be performed by referring to the method described in the above embodiment), the first candidate object in the first rank is selected as W2, and W2 is deleted from the set C2 (n). Traversing the remaining candidates in the set C2 (n), i.e. 
for all remaining candidate objects in the set C2 (n) other than W2, where for each traversed candidate object: it is judged whether semantic redundancy exists between the candidate object and W2, for which see the method in the above embodiments; if semantic redundancy exists, the currently traversed candidate object is deleted and the weight of W2 is updated, where the weight is the score value and is updated to the sum of the score value of W2 and the score value of the currently traversed candidate object; if no semantic redundancy exists, the currently traversed candidate object and its weight are saved to the set T2 (n). W2 and its weight are then saved to the set R2 (n), the contents of the set T2 (n) are copied to the set C2 (n), and the set T2 (n) is emptied. The process then returns to judging whether the set C2 (n) is empty, and the operations performed when the set C2 (n) is not empty are repeated until it is empty, at which point the candidate objects in the set R2 (n) are the candidate objects obtained by reducing semantic redundancy.
In addition, the semantic accuracy of a candidate object refers to how semantically similar the candidate object is to the Chinese web text to be processed; the higher the similarity, the more accurate the candidate object is considered to be. First, the semantic vectors of the candidate object and of the Chinese web text to be processed are computed by a pre-trained embedding model; the cosine distance between the two semantic vectors is then calculated to measure the semantic similarity between the candidate object and the Chinese web text to be processed; and when the semantic similarity does not reach a preset semantic similarity threshold, the candidate object is deleted directly from the candidate object set. In this embodiment, the candidate object set is the set of candidate objects obtained after semantic redundancy has been reduced.
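The loop over C2 (n), T2 (n) and R2 (n) might be sketched as below. The criterion for semantic redundancy is defined in an earlier embodiment; this sketch assumes it is a cosine similarity between two candidate vectors at or above a threshold. The function names, the `redundancy_threshold` parameter and the toy two-dimensional vectors are all hypothetical, and the vectors stand in for the output of a pre-trained embedding model.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_semantic_redundancy(
    scored: Dict[str, float],
    vectors: Dict[str, Vector],
    text_vector: Vector,
    redundancy_threshold: float = 0.9,   # assumed redundancy criterion
) -> Dict[str, float]:
    c2 = dict(scored)                    # candidates still to be processed
    r2: Dict[str, float] = {}            # result set
    while c2:
        t2: Dict[str, float] = {}        # candidates kept for the next round
        # W2 is the candidate semantically closest to the whole text.
        w2 = max(c2, key=lambda c: cosine(vectors[c], text_vector))
        w2_score = c2.pop(w2)
        for cand, score in c2.items():
            if cosine(vectors[cand], vectors[w2]) >= redundancy_threshold:
                w2_score += score        # redundant: fold its score into W2
            else:
                t2[cand] = score         # not redundant: keep for next round
        r2[w2] = w2_score
        c2 = t2
    return r2


# Toy vectors: "a" and "b" point almost the same way, "c" is orthogonal.
vectors = {"a": (1.0, 0.0), "b": (0.99, 0.1), "c": (0.0, 1.0)}
print(reduce_semantic_redundancy({"a": 2.0, "b": 1.0, "c": 3.0}, vectors, (1.0, 0.0)))
```

Unlike the output form step, which keeps the maximum score, this step adds the score of each redundant candidate to W2, so a concept expressed by several near-synonyms is ranked higher.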
Eighth, for the candidate objects in the candidate object set obtained after post-processing, sorting according to the score from large to small, and selecting the first N as the final keyword extraction result.
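The last two operations, dropping candidates below the semantic similarity threshold and then keeping the N highest-scoring ones, are simple enough to show together. The function name `select_keywords`, the sample scores and similarities, and the threshold value are illustrative assumptions; the similarities would in practice come from the embedding comparison described above.

```python
from typing import Dict, List


def select_keywords(
    scored: Dict[str, float],
    similarity: Dict[str, float],
    threshold: float,
    n: int,
) -> List[str]:
    """Filter by similarity to the text, then return the top-N by score."""
    # Remove candidates whose similarity does not reach the preset threshold.
    kept = {c: s for c, s in scored.items() if similarity[c] >= threshold}
    # Sort the survivors by score in descending order and keep the first N.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:n]]


scored = {"a": 3.0, "b": 5.0, "c": 4.0, "d": 1.0}
similarity = {"a": 0.8, "b": 0.2, "c": 0.6, "d": 0.9}
print(select_keywords(scored, similarity, threshold=0.5, n=2))
```

Here "b" has the highest score but is removed first because its similarity to the text is below the threshold, which shows why the accuracy filter runs before the final ranking.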
As can be seen from the above, the technical solution provided by the embodiments of the present invention solves the following problems: 1) Relying on pre-built corpus text libraries and updating and maintenance thereof; 2) The granularity of the output result only supports a single word, not a phrase consisting of several words; 3) The design of key word characteristics is not carried out aiming at the characteristics of Chinese texts; 4) There is a lack of post-processing of the output results of the algorithm.
In the technical solution provided by the embodiments of the present invention, the method for extracting keywords of Chinese web text has the following beneficial effects: 1) to eliminate the dependence on a corpus text library, keyword features that rely on a pre-built corpus text library, as typified by IDF, are abandoned, so that the keyword extraction task can be completed based on the current text alone and the method can be flexibly applied to texts in various fields; 2) to support outputting keywords and key phrases at the same time, a keyword filtering method and a key phrase generation method based on parts of speech are provided, in which the extraction of keywords and the extraction of key phrases are completely independent of each other, and the output key phrases can greatly improve the semantic accuracy of the result; 3) to optimize keyword features for Chinese text, a series of keyword features designed for Chinese is provided, in which, on the basis of the TF feature, the weight of a candidate object is adjusted using features such as its length and position, improving the accuracy of the keywords, where the TF feature is the number of times the candidate object appears in the text; 4) to optimize the output result, a series of candidate object post-processing methods is provided, comprising three measures: reducing output form redundancy, reducing semantic redundancy, and improving the semantic consistency between the keywords and the article.
Accordingly, another aspect of the embodiments of the present invention provides an apparatus for determining keywords of text.
Fig. 5 is a block diagram of an apparatus for determining keywords of text according to another embodiment of the present invention. As shown in Fig. 5, the apparatus includes a word segmentation module 1, a candidate object determination module 2, a screening module 3, and a keyword determination module 4. The word segmentation module 1 is configured to segment a text to be processed into at least one element so as to obtain a word segmentation result list. The candidate object determination module 2 is configured to determine candidate objects based on the elements of the word segmentation result list, where any candidate object includes at least one element occupying consecutive positions in the word segmentation result list. The screening module 3 is configured to screen the candidate objects based on screening features, where the screening features include at least one of: whether the candidate object belongs to a preset stop (deactivated) element set, a score, output form redundancy among the candidate objects, semantic redundancy among the candidate objects, and semantic similarity between the candidate object and the text to be processed, the score of any candidate object reflecting the likelihood that the candidate object is a keyword of the text to be processed. The keyword determination module 4 is configured to determine the keywords of the text to be processed based on the candidate objects obtained by screening.
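The candidate determination and stop-element screening performed by modules 2 and 3 can be sketched roughly as follows. This is a minimal illustration, assuming the word segmentation result list is already available as a list of strings; the function names and the `max_elements` cap are hypothetical and do not come from the source, which only requires that a candidate consist of one or more positionally consecutive elements.

```python
from typing import Iterable, List, Set, Tuple

Candidate = Tuple[str, ...]


def generate_candidates(elements: List[str], max_elements: int = 3) -> List[Candidate]:
    """Return every run of 1..max_elements consecutive elements, without duplicates.

    max_elements is an assumed cap, not a value taken from the source.
    """
    candidates: List[Candidate] = []
    seen: Set[Candidate] = set()
    for start in range(len(elements)):
        for size in range(1, max_elements + 1):
            end = start + size
            if end > len(elements):
                break
            cand = tuple(elements[start:end])
            if cand not in seen:
                seen.add(cand)
                candidates.append(cand)
    return candidates


def drop_stop_candidates(candidates: Iterable[Candidate],
                         stop_elements: Set[str]) -> List[Candidate]:
    """Discard candidates whose surface string is in the preset stop element set."""
    return [c for c in candidates if "".join(c) not in stop_elements]
```

Representing a candidate as a tuple of elements keeps both its surface string (the joined elements) and its element count available for later scoring.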
Optionally, in an embodiment of the present invention, the screening module screening the candidate objects based on output form redundancy among the candidate objects includes: step 1: establishing a first candidate object set for the candidate objects to be screened based on output form redundancy among the candidate objects; step 2: determining a formal redundancy comparison candidate object and deleting it from the first candidate object set, where the formal redundancy comparison candidate object is the candidate object with the greatest length in the first candidate object set; step 3: for each candidate object remaining in the first candidate object set, performing the following: judging whether output form redundancy exists between the targeted candidate object and the formal redundancy comparison candidate object; where output form redundancy exists, determining a screening candidate object according to a first preset rule, updating the formal redundancy comparison candidate object based on the screening candidate object, and deleting the targeted candidate object from the first candidate object set; where no output form redundancy exists, deleting the targeted candidate object from the first candidate object set and saving it into a third candidate object set; step 4: saving the formal redundancy comparison candidate object into a second candidate object set; step 5: copying the candidate objects in the third candidate object set into the first candidate object set and emptying the third candidate object set; step 6: judging whether the first candidate object set is empty; and step 7: where the first candidate object set is not empty, repeating steps 2 to 6 until the first candidate object set is empty, where the formal redundancy comparison candidate objects in the second candidate object set are the candidate objects obtained by screening based on output form redundancy among the candidate objects.
Optionally, in an embodiment of the present invention, the first preset rule includes: where the output form redundancy is containment redundancy, the screening candidate object is the formal redundancy comparison candidate object; and/or where the output form redundancy is overlapping redundancy, the screening candidate object is the object obtained by splicing the formal redundancy comparison candidate object and the targeted candidate object.
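Steps 1 to 7 together with the first preset rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: candidates are treated as plain strings, containment redundancy is assumed to mean that the targeted candidate is a substring of the comparison candidate, and overlapping redundancy is assumed to mean that a suffix of one string equals a prefix of the other. The function names are hypothetical.

```python
from typing import List, Optional


def overlap_merge(a: str, b: str) -> Optional[str]:
    """Splice a and b if a proper suffix of one equals a proper prefix of the other."""
    for left, right in ((a, b), (b, a)):
        # Try the longest overlap first; a full overlap would be containment.
        for k in range(min(len(left), len(right)) - 1, 0, -1):
            if left[-k:] == right[:k]:
                return left + right[k:]
    return None


def filter_form_redundancy(candidates: List[str]) -> List[str]:
    first = list(dict.fromkeys(candidates))      # step 1: first set (deduplicated)
    second: List[str] = []                       # screened result
    while first:                                 # steps 6 and 7: loop until empty
        comparison = max(first, key=len)         # step 2: longest candidate
        first.remove(comparison)
        third: List[str] = []
        for cand in first:                       # step 3
            if cand in comparison:
                # Containment redundancy (assumed definition): the comparison
                # candidate is kept unchanged and the targeted one is dropped.
                continue
            merged = overlap_merge(comparison, cand)
            if merged is not None:
                # Overlapping redundancy: the spliced string becomes the
                # comparison candidate.
                comparison = merged
                continue
            third.append(cand)                   # no redundancy: keep for next round
        second.append(comparison)                # step 4
        first = third                            # step 5
    return second
```

For example, `filter_form_redundancy(["机器学习", "机器", "学习算法", "房价"])` absorbs "机器" by containment, splices "学习算法" onto "机器学习", and leaves "房价" for the next round. A one-character overlap counts as redundancy in this sketch; a stricter minimum overlap may be preferable in practice.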
Optionally, in an embodiment of the present invention, the screening module screening the candidate objects based on semantic redundancy among the candidate objects includes: step 8: establishing a fourth candidate object set for the candidate objects to be screened based on semantic redundancy among the candidate objects; step 9: determining a semantic redundancy comparison candidate object and deleting it from the fourth candidate object set, where the semantic redundancy comparison candidate object is the candidate object in the fourth candidate object set with the greatest semantic similarity to the text to be processed; step 10: for each candidate object remaining in the fourth candidate object set, performing the following: judging whether semantic redundancy exists between the targeted candidate object and the semantic redundancy comparison candidate object; where semantic redundancy exists, deleting the targeted candidate object from the fourth candidate object set; where no semantic redundancy exists, deleting the targeted candidate object from the fourth candidate object set and saving it into a sixth candidate object set; step 11: saving the semantic redundancy comparison candidate object into a fifth candidate object set; step 12: copying the candidate objects in the sixth candidate object set into the fourth candidate object set and emptying the sixth candidate object set; step 13: judging whether the fourth candidate object set is empty; and step 14: where the fourth candidate object set is not empty, repeating steps 9 to 13 until the fourth candidate object set is empty, where the semantic redundancy comparison candidate objects in the fifth candidate object set are the candidate objects obtained by screening based on semantic redundancy among the candidate objects.
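Steps 8 to 14 amount to a greedy selection loop, which can be sketched as below. The sketch assumes that semantic redundancy is decided by comparing a pairwise similarity against a threshold; the threshold value, the `pair_similarity` callable, and the character-level Jaccard function used as a toy stand-in for a real semantic similarity are all assumptions for illustration.

```python
from typing import Callable, Dict, List


def char_jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity: Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def filter_semantic_redundancy(
    candidates: List[str],
    doc_similarity: Dict[str, float],
    pair_similarity: Callable[[str, str], float] = char_jaccard,
    redundancy_threshold: float = 0.7,   # assumed value, not from the source
) -> List[str]:
    fourth = list(candidates)                     # step 8: fourth set
    fifth: List[str] = []                         # screened result
    while fourth:                                 # steps 13 and 14
        # Step 9: candidate most similar to the text to be processed.
        comparison = max(fourth, key=lambda c: doc_similarity[c])
        fourth.remove(comparison)
        # Step 10: redundant candidates are dropped, the rest go to the sixth set.
        sixth = [c for c in fourth
                 if pair_similarity(c, comparison) < redundancy_threshold]
        fifth.append(comparison)                  # step 11
        fourth = sixth                            # step 12
    return fifth
```

Because the comparison candidate is always the one closest to the text, each group of mutually redundant candidates is represented by its most text-relevant member.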
Optionally, in an embodiment of the present invention, the screening module screening the candidate objects based on the semantic similarity between the candidate object and the text to be processed includes: for any candidate object to be screened on this basis, calculating the semantic similarity between that candidate object and the text to be processed; and screening the candidate objects based on the calculated semantic similarity, where the candidate objects whose semantic similarity reaches a preset semantic similarity threshold are the candidate objects obtained by screening based on the semantic similarity between the candidate object and the text to be processed.
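The threshold screening above can be illustrated with a short sketch. The source does not state how the semantic similarity is computed; cosine similarity over precomputed embedding vectors is assumed here purely for illustration, and the function names and default threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_text_similarity(
    candidate_vectors: Dict[str, Sequence[float]],
    text_vector: Sequence[float],
    threshold: float = 0.5,   # assumed preset semantic similarity threshold
) -> List[str]:
    """Keep the candidates whose similarity to the text reaches the threshold."""
    return [cand for cand, vec in candidate_vectors.items()
            if cosine(vec, text_vector) >= threshold]
```

Any embedding model could supply the vectors; the screening logic itself depends only on the similarity values and the preset threshold.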
Optionally, in an embodiment of the present invention, the apparatus further includes a score determination module for determining the score of any candidate object, where the score determination module determining the score of any candidate object includes: determining a title score and/or a text score corresponding to the candidate object based on score determination features of the candidate object, where for any candidate object the score determination features include at least one of: a first number of times, a second number of times, a position weight of the candidate object, a length of the candidate object, and the number of elements included in the candidate object, the first number of times representing the number of times the candidate object appears in the title of the text to be processed and the second number of times representing the number of times the candidate object appears in the body of the text to be processed; and determining the score based on the title score and/or the text score corresponding to the candidate object.
Optionally, in an embodiment of the present invention, the score determination module determines the title score for any candidate object based on the following formula: $\mathrm{Score}_{title} = 1.5 \cdot TF_1 \cdot \log_2(M+1) \cdot \log_2(P/2 \cdot M + 0.5)$, where $\mathrm{Score}_{title}$ represents the title score, $TF_1$ represents the first number of times, $M$ represents the number of elements included in the candidate object, and $P$ represents the length of the candidate object.
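The title score formula can be evaluated with a few lines of code. In this sketch the term written as P/2*M in the source is read literally with the usual operator precedence, that is (P/2)*M; this reading is an assumption, and the expression should be adjusted if P/(2M) was intended. The function name is hypothetical.

```python
import math


def title_score(tf1: int, m: int, p: int) -> float:
    """Title score of a candidate.

    tf1: number of times the candidate appears in the title (first number of times)
    m:   number of elements in the candidate
    p:   length of the candidate
    """
    # Assumption: "P/2*M" is evaluated left to right as (P / 2) * M.
    return 1.5 * tf1 * math.log2(m + 1) * math.log2(p / 2 * m + 0.5)
```

Under this reading, a single-element candidate of length 2 that appears twice in the title scores 1.5 * 2 * log2(2) * log2(1.5), roughly 1.755, while a single-element candidate of length 1 scores zero because log2(1) = 0.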
Optionally, in an embodiment of the present invention, the position weights of the candidate object include a first position weight, corresponding to the ratio of the position at which the candidate object first appears in the body to the full length of the body, and a second position weight, corresponding to the ratio of the position span of the candidate object in the body to the full length of the body. The score determination module determines the text score for any candidate object based on the following formula: $\mathrm{Score}_{content} = TF_2 \cdot \log_2(M+1) \cdot \log_2(P/2 \cdot M + 0.5) \cdot (W_{front} + W_{range})$, where $\mathrm{Score}_{content}$ represents the text score, $TF_2$ represents the second number of times, $M$ represents the number of elements included in the candidate object, $P$ represents the length of the candidate object, $W_{front}$ represents the first position weight, and $W_{range}$ represents the second position weight.
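A rough sketch of the text score follows. The passage above does not say how the two ratios are mapped to the weights W_front and W_range, so the sketch computes the ratios in one helper and takes the weights as inputs. The span is assumed to run from the start of the first occurrence to the end of the last occurrence, and P/2*M is again read literally as (P/2)*M; both are assumptions, as are the function names.

```python
import math
from typing import Optional, Tuple


def position_ratios(body: str, candidate: str) -> Optional[Tuple[float, float]]:
    """Return (first-occurrence ratio, span ratio), or None if the candidate is absent."""
    first = body.find(candidate)
    if first < 0:
        return None
    last = body.rfind(candidate)
    full_length = len(body)
    # Assumed span: start of first occurrence to end of last occurrence.
    span = last + len(candidate) - first
    return first / full_length, span / full_length


def text_score(tf2: int, m: int, p: int, w_front: float, w_range: float) -> float:
    """Text (body) score of a candidate.

    tf2: number of times the candidate appears in the body (second number of times)
    m:   number of elements in the candidate
    p:   length of the candidate
    """
    # Assumption: "P/2*M" is evaluated left to right as (P / 2) * M.
    return tf2 * math.log2(m + 1) * math.log2(p / 2 * m + 0.5) * (w_front + w_range)
```

Keeping the ratio computation separate from the scoring makes it easy to plug in whatever ratio-to-weight mapping an implementation chooses.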
Optionally, in an embodiment of the present invention, the apparatus further includes a part-of-speech tagging module for tagging the elements in the word segmentation result list with parts of speech before the candidate objects are screened based on the screening features, where the screening features further include at least one of: whether the candidate object belongs to a preset basic part-of-speech set and whether it belongs to a preset boundary part-of-speech set.
The specific working principle and benefits of the device for determining keywords of a text provided by the embodiment of the present invention are similar to those of the method for determining keywords of a text provided by the embodiment of the present invention, and will not be described here again.
The device for determining the keywords of the text comprises a processor and a memory, wherein the word segmentation module, the candidate object determination module, the screening module, the keyword determination module and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the keywords can be determined from the text to be processed without relying on a pre-built corpus text library.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
Still another aspect of the embodiments of the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the method described in the above embodiments.
In addition, another aspect of the embodiments of the present invention further provides a processor, configured to execute a program, where the program is executed to perform the method described in the foregoing embodiments.
In addition, another aspect of the embodiment of the present invention further provides an apparatus, where the apparatus includes a processor, a memory, and a program stored in the memory and capable of running on the processor, and the processor executes the program to implement the method described in the foregoing embodiment. The device herein may be a server, PC, PAD, cell phone, etc.
In addition, another aspect of the embodiments of the present invention also provides a computer program product comprising a computer program/instruction which, when executed by a processor, implements the method described in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (11)

1. A method for determining keywords for text, the method comprising:
segmenting a text to be processed into at least one element so as to obtain a word segmentation result list;
determining candidate objects based on the elements of the word segmentation result list, wherein any one of the candidate objects comprises at least one element occupying consecutive positions in the word segmentation result list;
screening the candidate objects based on screening features, wherein the screening features comprise at least one of: whether the candidate object belongs to a preset deactivated element set, a score, output form redundancy between the candidate objects, semantic redundancy between the candidate objects, and semantic similarity between the candidate objects and the text to be processed, wherein the score of any candidate object reflects the likelihood that the candidate object is a keyword of the text to be processed; and
determining the keywords of the text to be processed based on the candidate objects obtained by screening.
2. The method of claim 1, wherein filtering the candidate objects based on output formal redundancy between the candidate objects comprises:
step 1: establishing a first set of candidate objects for the candidate objects to be screened based on output formal redundancy between the candidate objects;
step 2: determining a formal redundancy comparison candidate object and deleting the formal redundancy comparison candidate object from the first candidate object set, wherein the formal redundancy comparison candidate object is the candidate object with the largest length in the first candidate object set;
step 3: for each of the candidate objects remaining in the first set of candidate objects, performing the following:
judging whether output form redundancy exists between the targeted candidate object and the form redundancy comparison candidate object;
under the condition that output form redundancy exists, determining screening candidate objects according to a first preset rule, updating the form redundancy comparison candidate objects based on the screening candidate objects, and deleting the targeted candidate objects from the first candidate object set;
deleting the targeted candidate object from the first candidate object set and saving the targeted candidate object into a third candidate object set under the condition that the output form redundancy does not exist;
step 4: saving the formal redundancy comparison candidate object into a second candidate object set;
step 5: copying the candidate objects in the third candidate object set into the first candidate object set and emptying the third candidate object set;
step 6: judging whether the first candidate object set is empty or not; and
step 7: under the condition that the first candidate object set is not empty, repeating steps 2 to 6 until the first candidate object set is empty, wherein the formal redundancy comparison candidate objects in the second candidate object set are the candidate objects obtained by screening based on output form redundancy among the candidate objects.
3. The method of claim 1, wherein filtering the candidate objects based on semantic redundancy between the candidate objects comprises:
step 8: establishing a fourth set of candidate objects for the candidate objects to be screened based on semantic redundancy between the candidate objects;
step 9: determining a semantic redundancy comparison candidate object and deleting the semantic redundancy comparison candidate object from the fourth candidate object set, wherein the semantic redundancy comparison candidate object is the candidate object with the largest semantic similarity with the text to be processed in the fourth candidate object set;
step 10: for each of the candidate objects remaining in the fourth set of candidate objects, performing the following:
judging whether semantic redundancy exists between the aimed candidate object and the semantic redundancy comparison candidate object;
deleting the targeted candidate object from the fourth candidate object set in the case of semantic redundancy;
deleting the targeted candidate object from the fourth candidate object set and storing the targeted candidate object into a sixth candidate object set under the condition that semantic redundancy does not exist;
step 11: saving the semantic redundancy comparison candidate object into a fifth candidate object set;
step 12: copying the candidate objects in the sixth candidate object set to the fourth candidate object set and emptying the sixth candidate object set;
step 13: judging whether the fourth candidate object set is empty or not; and
step 14: under the condition that the fourth candidate object set is not empty, repeating steps 9 to 13 until the fourth candidate object set is empty, wherein the semantic redundancy comparison candidate objects in the fifth candidate object set are the candidate objects obtained by screening based on the semantic redundancy among the candidate objects.
4. The method of claim 1, wherein screening the candidate object based on semantic similarity between the candidate object and the text to be processed comprises:
for any candidate object to be screened based on the semantic similarity between the candidate object and the text to be processed, calculating the semantic similarity between that candidate object and the text to be processed; and
screening the candidate objects to be screened based on the calculated semantic similarity, wherein the candidate objects whose semantic similarity reaches a preset semantic similarity threshold are the candidate objects obtained by screening based on the semantic similarity between the candidate objects and the text to be processed.
5. The method of claim 1, wherein determining the score for any of the candidate objects comprises:
determining a title score and/or a text score corresponding to the candidate object based on score determination features of the candidate object, wherein for any one of the candidate objects the score determination features comprise at least one of: a first number of times, a second number of times, a position weight of the candidate object, a length of the candidate object, and the number of elements included in the candidate object, wherein the first number of times represents the number of times the candidate object appears in a title of the text to be processed, and the second number of times represents the number of times the candidate object appears in a body of the text to be processed; and
determining the score based on the title score and/or the text score corresponding to the candidate object.
6. The method of claim 5, wherein determining the title score for any of the candidate objects comprises determining based on the following formula:
$\mathrm{Score}_{title} = 1.5 \cdot TF_1 \cdot \log_2(M+1) \cdot \log_2(P/2 \cdot M + 0.5)$
wherein $\mathrm{Score}_{title}$ represents the title score, $TF_1$ represents the first number of times, $M$ represents the number, and $P$ represents the length.
7. The method of claim 5, wherein the position weights of the candidate object include a first position weight, corresponding to a ratio of the position at which the candidate object first appears in the body to the full length of the body, and a second position weight, corresponding to a ratio of the position span of the candidate object in the body to the full length of the body, wherein determining the text score for any of the candidate objects includes determining based on the following formula:
$\mathrm{Score}_{content} = TF_2 \cdot \log_2(M+1) \cdot \log_2(P/2 \cdot M + 0.5) \cdot (W_{front} + W_{range})$
wherein $\mathrm{Score}_{content}$ represents the text score, $TF_2$ represents the second number of times, $M$ represents the number, $P$ represents the length, $W_{front}$ represents the first position weight, and $W_{range}$ represents the second position weight.
8. The method of claim 1, wherein prior to screening for the candidate based on screening characteristics, the method further comprises:
labeling the elements in the word segmentation result list with parts of speech, wherein the screening features further comprise at least one of: whether the candidate object belongs to a preset basic part-of-speech set and whether it belongs to a preset boundary part-of-speech set.
9. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the method of any one of claims 1-8.
10. A processor configured to run a program, wherein the program is configured to perform the method of any of claims 1-8 when run.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-8.
CN202310138465.2A 2023-02-14 2023-02-14 Method and device for determining keywords of text Pending CN116306616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310138465.2A CN116306616A (en) 2023-02-14 2023-02-14 Method and device for determining keywords of text

Publications (1)

Publication Number Publication Date
CN116306616A true CN116306616A (en) 2023-06-23

Family

ID=86777166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310138465.2A Pending CN116306616A (en) 2023-02-14 2023-02-14 Method and device for determining keywords of text

Country Status (1)

Country Link
CN (1) CN116306616A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942189A (en) * 2014-03-19 2014-07-23 百度在线网络技术(北京)有限公司 Method and device for determining keywords of compositions
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device
CN110147425A (en) * 2019-05-22 2019-08-20 华泰期货有限公司 A kind of keyword extracting method, device, computer equipment and storage medium
US20200250376A1 (en) * 2019-12-13 2020-08-06 Beijing Xiaomi Intelligent Technology Co., Ltd. Keyword extraction method, keyword extraction device and computer-readable storage medium
KR20220081009A (en) * 2020-12-08 2022-06-15 주식회사 카카오엔터프라이즈 Keyword extraction apparatus, control method thereof and keyword extraction program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨洁 等: "基于联合权重的多文档关键词抽取技术", 中文信息学报, vol. 22, no. 06, pages 75 - 79 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination