CN111061924A - Phrase extraction method, device, equipment and storage medium - Google Patents


Info

Publication number
CN111061924A
Authority
CN
China
Prior art keywords
preset
entropy
word
word set
selected word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911264855.4A
Other languages
Chinese (zh)
Inventor
肖光昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201911264855.4A priority Critical patent/CN111061924A/en
Publication of CN111061924A publication Critical patent/CN111061924A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The application provides a phrase extraction method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring basic data to be extracted; selecting effective character strings in the basic data to obtain a candidate character string set; and extracting key phrases from the candidate character string set. The method discovers key phrases in massive basic data, thereby improving the efficiency of keyword extraction.

Description

Phrase extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting phrases.
Background
In the internet era, hot words, as a new language phenomenon and cultural landscape, reflect the issues and events of general public concern in certain areas during a certain period. Hot words are characterized by rarely appearing before a certain time period and then suddenly appearing in large numbers, indicating that events related to them have occurred in that period and attracted special public attention.
The process of mining hot words from massive information can be called key phrase discovery. In practice, police officers can discover potential, frequent and emerging crime methods by performing key phrase discovery on the police reports received every day, week and month, thereby improving the efficiency of handling cases.
In the actual key phrase discovery process, there are generally two problems: first, the information generated every day contains a large number of meaningless phrases that need to be removed; second, in traditional key phrase discovery, if a ranking algorithm is used to judge phrase popularity, the resulting key phrases have low discrimination.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for extracting phrases, so as to discover key phrases in massive basic data and thereby improve the efficiency of keyword extraction.
A first aspect of an embodiment of the present application provides a method for extracting a phrase, including: acquiring basic data to be extracted; selecting effective character strings in the basic data to obtain a candidate character string set; and extracting key phrases in the candidate character string set.
In an embodiment, the selecting the valid character strings in the basic data to obtain a candidate character string set includes: deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data; and deleting data symbols which accord with a preset type from the effective text data to generate the candidate character string set.
In an embodiment, the extracting the key phrases in the candidate character string set includes: and inputting the candidate character string set to a preset phrase extraction model, so that the phrase extraction model outputs the key phrases.
In one embodiment, the step of presetting the phrase extraction model includes: in the candidate character string set, carrying out feature labeling on sample character strings in a preset number to obtain labeled sample data; and performing machine learning model training by adopting the sample data to obtain the phrase extraction model.
In an embodiment, the extracting the key phrases in the candidate character string set includes: performing word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set; calculating mutual information values between every two basic words in the basic word set; selecting a selected word set which forms the basic words with the mutual information value larger than a preset mutual information threshold value from the basic word set; and extracting the key phrase from the selected word set according to the information entropy of the selected word.
In an embodiment, the extracting the key phrase from the selected word set according to the information entropy of the selected word includes: respectively calculating a first left information entropy value of each selected word; judging whether the first left information entropy value is larger than a first preset entropy value; when the first left information entropy value is smaller than or equal to the first preset entropy value, expanding the selected word to the left by a first preset length to generate a left expansion word set, and calculating a second left information entropy value of each left expansion word; selecting, from the left expansion word set, the left expansion words whose second left information entropy value is larger than the first preset entropy value, to generate a left selected word set; when the first left information entropy value is larger than the first preset entropy value, using the selected word set as the left selected word set; and extracting the key phrase from the left selected word set.
In an embodiment, the extracting the key phrase from the left selected word set includes: calculating a first right information entropy value of each word in the left selected word set; judging whether the first right information entropy value is larger than a second preset entropy value; when the first right information entropy value is smaller than or equal to the second preset entropy value, expanding the words in the left selected word set to the right by a second preset length to generate a right expansion word set, calculating a second right information entropy value of each right expansion word, and selecting, from the right expansion word set, the right expansion words whose second right information entropy value is larger than the second preset entropy value, to generate a right selected word set; when the first right information entropy value is larger than the second preset entropy value, using the left selected word set as the right selected word set; and extracting the key phrase from the right selected word set.
In an embodiment, the extracting the keyword group from the right selected word set includes: calculating the multi-character mutual information value of each word in the right selected word set; and selecting a phrase forming the multi-character mutual information value larger than a third threshold value from the right selected word set as the key phrase.
In an embodiment, after the extracting the key phrases in the candidate character string set, the method further includes: calculating the similarity between every two key phrases in a plurality of key phrases; and selecting at least two key phrases with the similarity larger than a preset similarity value to form a similarity pair set.
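The embodiment does not name a concrete similarity measure or threshold. As a purely illustrative sketch (the function names, the character-level Jaccard measure, and the 0.5 default threshold are all assumptions, not taken from the patent), the similarity pairing step might look like:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two phrases."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def similar_pairs(phrases, threshold=0.5):
    """Collect every pair of key phrases whose similarity exceeds
    the preset similarity value (threshold is a hypothetical default)."""
    pairs = []
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if jaccard_similarity(phrases[i], phrases[j]) > threshold:
                pairs.append((phrases[i], phrases[j]))
    return pairs
```

Any pairwise measure (edit distance, embedding cosine) could stand in for Jaccard here; the structure of the step is the same.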
A second aspect of the embodiments of the present application provides a phrase extracting apparatus, including: the acquisition module is used for acquiring basic data to be extracted; the selection module is used for selecting effective character strings in the basic data to obtain a candidate character string set; and the extraction module is used for extracting key phrases in the candidate character string set.
In one embodiment, the selection module is configured to: deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data; and deleting data symbols which accord with a preset type from the effective text data to generate the candidate character string set.
In one embodiment, the extraction module is configured to: and inputting the candidate character string set to a preset phrase extraction model, so that the phrase extraction model outputs the key phrases.
In an embodiment, the apparatus further includes a model presetting module, configured to: in the candidate character string set, carry out feature labeling on a preset number of sample character strings to obtain labeled sample data; and perform machine learning model training by adopting the sample data to obtain the phrase extraction model.
In one embodiment, the extraction module is configured to: performing word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set; calculating mutual information values between every two basic words in the basic word set; selecting a selected word set which forms the basic words with the mutual information value larger than a preset mutual information threshold value from the basic word set; and extracting the key phrase from the selected word set according to the information entropy of the selected word.
In an embodiment, the extracting the key phrase from the selected word set according to the information entropy of the selected word includes: respectively calculating a first left information entropy value of each selected word; judging whether the first left information entropy value is larger than a first preset entropy value; when the first left information entropy value is smaller than or equal to the first preset entropy value, expanding the selected word to the left by a first preset length to generate a left expansion word set, and calculating a second left information entropy value of each left expansion word; selecting, from the left expansion word set, the left expansion words whose second left information entropy value is larger than the first preset entropy value, to generate a left selected word set; when the first left information entropy value is larger than the first preset entropy value, using the selected word set as the left selected word set; and extracting the key phrase from the left selected word set.
In an embodiment, the extracting the key phrase from the left selected word set includes: calculating a first right information entropy value of each word in the left selected word set; judging whether the first right information entropy value is larger than a second preset entropy value; when the first right information entropy value is smaller than or equal to the second preset entropy value, expanding the words in the left selected word set to the right by a second preset length to generate a right expansion word set, calculating a second right information entropy value of each right expansion word, and selecting, from the right expansion word set, the right expansion words whose second right information entropy value is larger than the second preset entropy value, to generate a right selected word set; when the first right information entropy value is larger than the second preset entropy value, using the left selected word set as the right selected word set; and extracting the key phrase from the right selected word set.
In an embodiment, the extracting the keyword group from the right selected word set includes: calculating the multi-character mutual information value of each word in the right selected word set; and selecting a phrase forming the multi-character mutual information value larger than a third threshold value from the right selected word set as the key phrase.
In one embodiment, the apparatus further comprises: a calculating module, configured to calculate, after the key phrases are extracted from the candidate character string set, a similarity between every two of the plurality of key phrases; and a pairing module, configured to select at least two key phrases whose similarity is larger than a preset similarity value to form a similar pair set.
A third aspect of embodiments of the present application provides an electronic device, including: a memory to store a computer program; a processor configured to perform the method of the first aspect of the embodiments of the present application and any of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory electronic device-readable storage medium, including: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of an embodiment of the present application and any embodiment thereof.
According to the phrase extraction method, apparatus, device and storage medium, the basic data to be extracted is first preprocessed to eliminate useless information, effective character strings are selected from the data to obtain a candidate character string set, and key phrases are then extracted from the candidate character string set, improving the efficiency of key phrase extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a phrase extraction method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a phrase extraction method according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a phrase extraction method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a phrase extracting apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic device 1 including: at least one processor 11 and a memory 12; one processor is taken as an example in fig. 1. The processor 11 and the memory 12 are connected by a bus 10; the memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11.
In an embodiment, the electronic device 1 may be a computer, a mobile phone, a notebook computer, or the like, and the electronic device 1 may be configured to obtain basic data to be extracted. And selecting effective character strings in the basic data to obtain a candidate character string set. And extracting key phrases in the candidate character string set.
In one embodiment, the basic data may be user posting information from the internet. Based on users' posting content and related information within a certain time period, new key phrases with high activity can be identified and extracted.
Please refer to fig. 2, which is a phrase extraction method according to an embodiment of the present application, and the method may be executed by the electronic device 1 shown in fig. 1, and may be applied to the scene of discovering new words in the internet, so as to extract a keyword phrase based on massive basic data. The method comprises the following steps:
step 201: and acquiring basic data to be extracted.
In this step, the basic data may be from a database composed of network posted contents, such as posted information of a user to a topic in a certain website for a certain period of time. And the required basic data is acquired by docking with a network data platform.
Step 202: and selecting effective character strings in the basic data to obtain a candidate character string set.
In this step, the basic data includes various information related to posting information, such as posting time, sentence characters and emoticons adopted by the content, and pictures. Before extracting the key phrase, the basic data is preprocessed, effective character strings relevant to extracting the key phrase are extracted from the basic data, and a candidate character string set is formed, so that the calculated amount of the basic data during extracting the key phrase is reduced, and the extraction efficiency is improved.
Step 203: and extracting key phrases in the candidate character string set.
In this step, the key information including posting content in the candidate character string set is obtained after the preprocessing, and the key information is used as a processing object for extracting the key phrase, so that the key phrase can be accurately extracted.
According to the phrase extraction method, the basic data to be extracted is first preprocessed to eliminate useless information, effective character strings are selected from the data to obtain a candidate character string set, and key phrases are then extracted from the candidate character string set, improving the efficiency of key phrase extraction.
Please refer to fig. 3, which is a phrase extraction method according to an embodiment of the present application, and the method may be executed by the electronic device 1 shown in fig. 1, and may be applied to the scene of discovering the new words in the internet, so as to extract the key phrases based on massive basic data. The method comprises the following steps:
step 301: and acquiring basic data to be extracted. See the description of step 201 in the above embodiments for details.
Step 302: and deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data.
In this step, the preset corpus knowledge base may be built for a specific use scenario. For example, in the public security field, analyzing case texts yields rules on the format and phrasing of useless information. A case text might read: "a person reported being defrauded of xx yuan on website x; an officer went to the scene, advised the reporter to take precautions, and filed the matter as a criminal case", where content such as "went to the scene" and "filed the matter as a criminal case", being irrelevant to the core of the case, can be regarded as useless information. The preset corpus knowledge base is then generated in combination with the Ratel rule language and used to remove useless information from the basic data text; that is, text data identical to entries in the preset corpus knowledge base is deleted, and the remaining basic data forms the effective text data.
Step 303: and deleting the data symbols which accord with the preset type from the effective text data to generate a candidate character string set.
In this step, the preset types may be set in combination with big data according to the actual application scenario. The effective text data may still contain many useless special symbols, such as spaces and brackets inadvertently introduced by users, as well as special single words that do not contribute to the semantics (function words such as "of" or "is"). In addition, some word groups are stop words. To improve computational efficiency, Chinese word segmentation may be performed on the effective text data in combination with a domain-specific word bank, and the special symbols, special single words and stop words removed, to obtain a cleaned text data set, namely the candidate character string set. After such preprocessing, the computer can better understand the text, and the effect and performance of the algorithm model can be greatly improved.
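The cleaning of steps 302 and 303 might be sketched as follows. The corpus entries, stop-word list and symbol pattern below are hypothetical placeholders for the domain-specific resources the embodiment describes (the real corpus knowledge base would be generated with the Ratel rule language):

```python
import re

# Hypothetical resources; a real system would load these from the
# preset corpus knowledge base and a domain word bank.
USELESS_SENTENCES = {"the police went to the scene"}
STOP_WORDS = {"a", "the", "of"}
SYMBOL_PATTERN = re.compile(r"[\s()\[\]{}<>«»…]+")

def clean_text(sentences):
    """Step 302: drop sentences matching the corpus knowledge base;
    step 303: strip preset symbol types and stop words from the rest."""
    candidates = []
    for s in sentences:
        if s in USELESS_SENTENCES:
            continue                      # known useless text, remove whole entry
        s = SYMBOL_PATTERN.sub(" ", s)    # delete preset-type symbols
        tokens = [t for t in s.split() if t.lower() not in STOP_WORDS]
        if tokens:
            candidates.append(" ".join(tokens))
    return candidates
```

The output list plays the role of the candidate character string set in the following steps.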
Step 304: and inputting the candidate character string set into a preset phrase extraction model, so that the phrase extraction model outputs key phrases.
In this step, a phrase extraction model may be preset, where the phrase extraction model may be a model established based on machine learning; the candidate character string set is then input to the phrase extraction model, which outputs the key phrases to be extracted.
In one embodiment, before step 304, the method further includes: the step of presetting the phrase extraction model may specifically include: and in the candidate character string set, carrying out feature labeling on the sample character strings in a preset number to obtain labeled sample data. And performing machine learning model training by adopting the sample data to obtain a phrase extraction model.
In this step, when the phrase extraction model is established, a preset number of sample character strings may be selected from the candidate character string set as a training set, and the feature elements and paragraphs of the key phrases appearing in the sample texts are labeled with the Raptor text labeling tool. For example, in the public security case text "on x month x day of year x, somebody defrauded somebody of xxx in cash at a certain place by using a pick-up halving scheme", the phrase "pick-up halving scheme" is a key phrase representing the core content of the text and belongs to the feature element "committing means". A machine learning model, such as a neural network model, is then trained with the labeled sample texts to obtain a phrase extraction model corresponding to the labeled feature elements. A further part of the candidate character string set is selected as a test set to evaluate the trained model, and this step is repeated until the phrase extraction model reaches a preset precision, yielding the final phrase extraction model. Inputting the candidate character string set into this model outputs the set of key phrases.
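A minimal sketch of what the labeled sample data and the train/test split might look like. The field names, the example record, and the 20% split ratio are assumptions (the patent specifies neither, and the actual labeling is done with the Raptor tool); the model training itself is omitted:

```python
import random

# Hypothetical annotated sample: labels the key-phrase span and the
# feature element ("committing means") found in the sample text.
labeled = [
    {"text": "somebody defrauded somebody by using a pick-up halving scheme",
     "phrase": "pick-up halving scheme",
     "feature": "committing means"},
    # ... more annotated sample strings from the candidate set
]

def split_samples(samples, test_ratio=0.2, seed=42):
    """Shuffle and split labeled samples into a training set and a test
    set; the embodiment repeats training and evaluation on such a split
    until a preset precision is reached."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]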
Please refer to fig. 4, which is a phrase extraction method according to an embodiment of the present application, and the method may be executed by the electronic device 1 shown in fig. 1, and may be applied to the scene of discovering new words in the internet, so as to extract a keyword phrase based on massive basic data. The method comprises the following steps:
step 401: and acquiring basic data to be extracted. See the description of step 201 in the above embodiments for details.
Step 402: and deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data. See the description of step 302 in the above embodiments for details.
Step 403: and deleting the data symbols which accord with the preset type from the effective text data to generate a candidate character string set. See the description of step 303 in the above embodiments for details.
Step 404: and performing word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set.
In this step, the word order rule may be derived from different language rules, for example, the word order rule may be a chinese word order rule, such as chinese word segmentation for a candidate string "computer engineer", which may be divided into two basic words "computer" and "engineer". And respectively carrying out word segmentation processing on each character string in the candidate character string set to obtain a basic word set.
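The embodiment leaves the segmenter unspecified. A minimal dictionary-based forward-maximum-matching sketch illustrates the idea; the lexicon and maximum word length here are hypothetical, and a production system would more likely use an off-the-shelf Chinese segmenter with a domain word bank:

```python
# Hypothetical lexicon; a real system would load a domain word bank.
LEXICON = {"计算机", "工程师", "网络", "诈骗"}
MAX_LEN = 3  # longest lexicon entry, in characters

def segment(text: str):
    """Forward maximum matching: at each position take the longest
    lexicon word starting there, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in LEXICON:
                words.append(cand)
                i += l
                break
    return words
```

For the string "计算机工程师" ("computer engineer") this yields the two basic words "计算机" and "工程师", matching the example in the text.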
Step 405: and calculating mutual information values between every two basic words in the basic word set.
In this step, mutual information is a useful information measure in information theory; it can be regarded as the amount of information one random variable contains about another. The higher the mutual information value between basic word X and basic word Y, the stronger their correlation and the more likely they form a phrase. Conversely, the lower the mutual information value, the weaker the correlation and the more likely a phrase boundary lies between them. Mutual information values are calculated between every two basic words in the basic word set.
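Steps 405 and 406 can be sketched with pointwise mutual information over adjacent word pairs, PMI(x, y) = log2(p(x, y) / (p(x) p(y))); the exact estimator and the threshold value are not fixed by the patent, so the code below is one plausible reading:

```python
import math
from collections import Counter

def pmi(pairs):
    """Pointwise mutual information for each observed word pair,
    with probabilities estimated from raw counts."""
    pair_counts = Counter(pairs)
    word_counts = Counter()
    for x, y in pairs:
        word_counts[x] += 1
        word_counts[y] += 1
    n_pairs = sum(pair_counts.values())
    n_words = sum(word_counts.values())
    scores = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

def select_pairs(scores, threshold):
    """Step 406: keep only pairs whose mutual information value
    exceeds the preset mutual information threshold."""
    return {pair for pair, v in scores.items() if v > threshold}
```

`select_pairs` implements the threshold selection of step 406; the surviving pairs form the selected word set.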
Step 406: and selecting the selected word set which forms the basic words with mutual information values larger than a preset mutual information threshold value from the basic word set.
In this step, the preset mutual information threshold may be set according to actual needs and language habits and rules, for example, a relatively high preset mutual information threshold may be set, and only when the mutual information value between the basic word X and the basic word Y is greater than the threshold, the requirement that the basic word X and the basic word Y form a phrase may be satisfied, and if not, it is stated that the basic word X and the basic word Y cannot form a phrase. And selecting the basic words with the mutual information value larger than the preset mutual information threshold value to form a selected word set.
Then, a key phrase can be extracted from the selected word set according to the information entropy of the selected word. In an embodiment, the step may specifically include:
step 407: and respectively calculating a first left information entropy value of each acquired word.
In this step, entropy represents a measure of the uncertainty of a random variable, and information entropy is the concept used in information theory to measure information content. The left information entropy is the entropy of the left boundary of a multi-word expression: the possible words within a certain length (the first length) to the left of a character string and their frequencies are collected, and the entropy is calculated over that distribution. The first left information entropy value may thus be computed from the frequencies of the phrases formed by each selected word and the words to its left in the selected word set. Take "computer" and "engineer" as an example: "computer" is a left-hand word that may combine with "engineer" to form a phrase.
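Step 407 can be sketched as follows over a tokenized corpus (the corpus layout, a list of token lists, is an assumption; the embodiment only says the entropy is computed from words and frequencies within the first length to the left):

```python
import math
from collections import Counter

def left_entropy(word, token_lists):
    """Entropy of the distribution of words appearing immediately to
    the left of `word` across the tokenized corpus. A high value means
    the left boundary is free, suggesting `word` is a complete unit;
    a low value suggests `word` should be extended leftward."""
    neighbors = Counter()
    for tokens in token_lists:
        for i, t in enumerate(tokens):
            if t == word and i > 0:
                neighbors[tokens[i - 1]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbors.values())
```

The first preset entropy value of step 408 would then be compared against this quantity.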
Step 408: and judging whether the first left information entropy value is larger than a first preset entropy value or not.
In this step, the left information entropy represents the uncertainty of the phrase formed by a selected word and the word group to its left. The first preset entropy value may be chosen for the practical application: for example, based on word usage habits observed in big data, one can determine within what range of left information entropy words tend to form phrases, and set the first preset entropy value at that range boundary. When the first left information entropy value is larger than the first preset entropy value, the method proceeds to step 411; otherwise, it proceeds to step 409.
Step 409: and after the selected words are expanded to the left by a first preset length, generating a left expansion word set, and calculating a second left information entropy value of the left expansion words.
In this step, when the first left information entropy is less than or equal to the first preset entropy, it is indicated that the first left information entropy of the selected word is smaller, and within the first length, the selected word and the word group on the left of the selected word cannot meet the requirement of forming a phrase, and then the selected word is expanded to the left by the first preset length, a left expansion word set is generated, and a second left information entropy of each left expansion word is calculated.
Step 410: and selecting the left acquisition word forming the second left information entropy value larger than the first preset entropy value from the left expansion word set to generate a left acquisition word set. Step 412 is entered.
In this step, in the left expansion word set, under the condition that the second left information entropy may still be less than or equal to the first preset entropy, the left expansion words forming the second left information entropy larger than the first preset entropy are selected, and a left selected word set is generated.
In an embodiment, the left expansion words forming the second left information entropy value smaller than or equal to the first preset entropy value may be expanded to the left again (ensuring that the expanded words are within the specified length range), and the left information entropy thereof is calculated and compared with the first preset entropy value until the second left information entropy value is greater than the first preset entropy value, and the corresponding left expansion word group is taken as the left selected word set, and the process proceeds to step 412.
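The expand-and-recheck loop of steps 408 to 410, including the repeated re-expansion just described, might be sketched as below. Expanding by one token per round, absorbing the most frequent left neighbour, and the length cap are all assumptions about the unspecified "first preset length"; the entropy helper generalizes the single-word case to token sequences so expanded candidates can be re-checked:

```python
import math
from collections import Counter

def left_entropy(seq, corpus):
    """Entropy of the tokens immediately left of the sequence `seq`."""
    n = len(seq)
    neigh = Counter()
    for toks in corpus:
        for i in range(1, len(toks) - n + 1):
            if tuple(toks[i:i + n]) == seq:
                neigh[toks[i - 1]] += 1
    total = sum(neigh.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neigh.values())

def expand_left(word, corpus, preset_entropy, max_len=4):
    """While the left entropy stays at or below the preset value,
    absorb the candidate's most frequent left neighbour and re-check,
    up to a length cap (steps 408-410)."""
    seq = (word,)
    while len(seq) < max_len and left_entropy(seq, corpus) <= preset_entropy:
        neigh = Counter()
        for toks in corpus:
            for i in range(1, len(toks) - len(seq) + 1):
                if tuple(toks[i:i + len(seq)]) == seq:
                    neigh[toks[i - 1]] += 1
        if not neigh:
            break
        seq = (neigh.most_common(1)[0][0],) + seq
    return seq
```

In a corpus where "halving" is always preceded by "pickup" but "pickup halving" has varied left neighbours, the candidate expands exactly once and then stops.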
Step 411: and taking the selected word set as a left selected word set.
In this step, when the left information entropy is greater than the first preset entropy, it indicates that the phrase on the left of the selected word can satisfy the requirement for forming the phrase within the first length, and the selected word set may be used as the left selected word set.
Then, a key phrase is extracted from the left selected word set. The specific steps include:
step 412: and calculating a first right information entropy value of each word in the left selected word set.
In this step, the right entropy refers to the entropy of the right boundary of a multi-word expression: for example, the entropy may be calculated over the possible words and their frequencies within a certain length (set as the second length) to the right of a character string, and then summed. The first right information entropy value may be calculated from the word frequency of each word in the left selected word set together with the phrase it forms with the word to its right. Take "computer" and "engineer" as an example: for "computer", "engineer" is a word on its right with which it may form a phrase.
Step 413: and judging whether the first right information entropy value is larger than a second preset entropy value or not.
In this step, the principle is similar to that of the left information entropy: the right information entropy represents the uncertainty of the phrase formed by a left selected word and the word group to its right. The second preset entropy value may be chosen based on the practical application; for example, based on word usage habits observed in big data, one may determine within what range of right information entropy values words tend to form phrases, and set the second preset entropy value at the boundary of that range. When the first right information entropy value is greater than the second preset entropy value, the process proceeds to step 416; otherwise, it proceeds to step 414.
Step 414: and expanding the words in the left selected word set to the right by a second preset length to generate a right expansion word set, and calculating a second right information entropy value of each right expansion word.
In this step, a first right information entropy value less than or equal to the second preset entropy value indicates that the first right information entropy of the left selected word is small: within the second length, the left selected word and the word group to its right cannot meet the requirement for forming a phrase. The left selected word is therefore expanded to the right by the second preset length to generate a right expansion word set, and a second right information entropy value is calculated for each right expansion word.
Step 415: and selecting, from the right expansion word set, the right expansion words whose second right information entropy value is greater than the second preset entropy value to generate a right selected word set. Step 417 is entered.
In this step, because some second right information entropy values in the right expansion word set may still be less than or equal to the second preset entropy value, only the right expansion words whose second right information entropy value is greater than the second preset entropy value are selected to generate the right selected word set. Step 417 is entered.
In an embodiment, the right expansion words whose second right information entropy value is less than or equal to the second preset entropy value may be expanded to the right again (ensuring that the expanded words remain within the specified length range), and their right information entropy values are calculated and compared with the second preset entropy value until the second right information entropy value is greater than the second preset entropy value; the corresponding right expansion word group is then taken as the right selected word set, and the process proceeds to step 417.
Step 416: and taking the left selected word set as the right selected word set.
In this step, when the first right information entropy value is greater than the second preset entropy value, it indicates that, within the second length, the left selected word and the word group to its right can meet the requirement for forming a phrase, and the left selected word set may be used directly as the right selected word set.
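Steps 412 to 416 mirror the left-entropy pass on the right boundary, including the expand-and-retest loop. The sketch below is a simplified, assumed implementation: single-character boundary entropy, one-character expansion steps drawn from continuations actually seen in the corpus, and an illustrative length cap — none of these specifics are fixed by the application.

```python
import math
from collections import Counter

def boundary_entropy(word, corpus, side="right"):
    """Entropy of the single character adjacent to `word` on the given side."""
    neighbors = Counter()
    i = corpus.find(word)
    while i != -1:
        j = i - 1 if side == "left" else i + len(word)
        if 0 <= j < len(corpus):
            neighbors[corpus[j]] += 1
        i = corpus.find(word, i + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())

def expand_right(words, corpus, threshold, max_len=6):
    """Keep words whose right entropy exceeds `threshold`; expand the rest
    rightward by one character (every continuation seen in the corpus) and
    re-test, up to `max_len` characters -- the loop of steps 413 to 415."""
    selected, frontier, tested = set(), set(words), set()
    while frontier:
        word = frontier.pop()
        if word in tested:
            continue
        tested.add(word)
        if boundary_entropy(word, corpus, "right") > threshold:
            selected.add(word)  # step 416: entropy clears the second preset entropy
        elif len(word) < max_len:
            i = corpus.find(word)
            while i != -1:  # step 414: expand rightward and re-test
                ext = corpus[i:i + len(word) + 1]
                if len(ext) == len(word) + 1:
                    frontier.add(ext)
                i = corpus.find(word, i + 1)
    return selected

# "ab" always continues with "c" (right entropy 0), but "abc" has three continuations
print(expand_right(["ab"], "abcx abcy abcz", threshold=1.5))  # {'abc'}
```

The `tested` set guards against re-examining the same string when several parents expand to it.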
In an embodiment, the order in which the left information entropy value and the right information entropy value are calculated is not limited to the order described above.
Then, a key phrase is extracted from the right selected word set. The specific steps include:
step 417: and calculating the multi-character mutual information value of each word in the right selected word set.
In this step, a multi-character mutual information value is calculated for each word in the right selected word set obtained in the above steps. The multi-character mutual information value represents the degree of association between two words: the larger the value, the stronger the association between the two words and the higher the possibility that they form a phrase.
Step 418: and selecting, from the right selected word set, a word group whose multi-character mutual information value is greater than a third threshold value as a key phrase.
In this step, the third threshold may be chosen according to the language habits of the actual application scenario: based on network big data, the language habits of users are counted, the common range in which two word groups likely to form a new word fall is analyzed, and the third threshold is set at the boundary of that range. When the multi-character mutual information value is greater than the third threshold, the word group forming that value may serve as a new key phrase, thereby completing the identification and extraction of key phrases.
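The description does not fix a formula for the multi-character mutual information value; a standard choice is pointwise mutual information (PMI) over corpus counts. The sketch below uses that standard formula with purely illustrative counts and threshold:

```python
import math

def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information: log2( p(a,b) / (p(a) * p(b)) )."""
    p_ab = pair_count / total
    p_a, p_b = count_a / total, count_b / total
    return math.log2(p_ab / (p_a * p_b))

THIRD_THRESHOLD = 3.0  # illustrative "third threshold"

# hypothetical counts from a 10,000-token corpus: "computer" and "engineer"
# co-occur 30 times, and appear 50 and 40 times respectively
score = pmi(pair_count=30, count_a=50, count_b=40, total=10_000)
print(score > THIRD_THRESHOLD)  # True: the pair would qualify as a key phrase
```

PMI is high when the pair co-occurs far more often than independence would predict, matching the "stronger association, higher possibility of forming a phrase" reading above.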
Step 419: and calculating the similarity between every two key phrases in the plurality of key phrases.
In this step, after a plurality of key phrases are selected in step 418, the similarity between every two key phrases is calculated; for example, the text similarity of two key phrases may be computed from pre-trained word vectors.
Step 420: and selecting at least two key phrases with the similarity larger than a preset similarity value to form a similarity pair set.
In this step, key phrases whose pairwise similarity is greater than the preset similarity value may be regarded as semantically similar key phrases and combined into a similar pair set. Connected-graph computation may then be performed on the resulting similar pair set, so that semantically similar results are aggregated together and output comprehensively.
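Steps 419 and 420, together with the connected-graph aggregation, can be sketched as follows. The two-dimensional vectors and the 0.95 threshold are toy stand-ins for pre-trained word vectors and the preset similarity value; the grouping uses a simple union-find over pairs that clear the threshold.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def similar_groups(vectors, threshold):
    """Union-find over phrases whose pairwise similarity exceeds `threshold`,
    returning the connected groups with more than one member."""
    parent = {p: p for p in vectors}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p
    for a, b in itertools.combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) > threshold:
            parent[find(a)] = find(b)  # a similar pair: join the components
    groups = {}
    for p in vectors:
        groups.setdefault(find(p), set()).add(p)
    return [g for g in groups.values() if len(g) > 1]

vectors = {
    "machine learning": (0.9, 0.1),
    "deep learning": (0.85, 0.2),
    "stock market": (0.1, 0.95),
}
print(similar_groups(vectors, threshold=0.95))  # one group: the two "learning" phrases
```

Each returned group corresponds to one connected component of the similar-pair graph, i.e. one cluster of semantically similar key phrases to output together.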
Please refer to fig. 5, which shows a phrase extraction apparatus 500 according to an embodiment of the present application. The apparatus may be applied to the electronic device shown in fig. 1 and to the scenario of discovering new words on the internet, so as to extract key phrases from massive basic data. The apparatus includes an obtaining module 501, a selecting module 502 and an extracting module 503, whose principle relationship is as follows:
an obtaining module 501, configured to obtain basic data to be extracted.
A selecting module 502, configured to select an effective character string in the basic data to obtain a candidate character string set. See the description of step 202 in the above embodiments for details.
The extracting module 503 is configured to extract a keyword group in the candidate character string set. See the description of step 203 in the above embodiments for details.
In one embodiment, the selection module 502 is configured to: and deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data. And deleting the data symbols which accord with the preset type from the effective text data to generate a candidate character string set. See the description of steps 302 through 303 in the above embodiments in detail.
In one embodiment, the extraction module 503 is configured to: and inputting the candidate character string set into a preset phrase extraction model, so that the phrase extraction model outputs key phrases. See the description of step 304 in the above embodiments for details.
In an embodiment, the apparatus further includes a model presetting module, configured to: perform feature labeling on a preset number of sample character strings in the candidate character string set to obtain labeled sample data; and perform machine learning model training using the sample data to obtain the phrase extraction model. See the relevant description in the above embodiments for details.
In one embodiment, the extraction module 503 is configured to: perform word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set; calculate mutual information values between every two basic words in the basic word set; select, from the basic word set, the basic words whose mutual information values are greater than a preset mutual information threshold value to form a selected word set; and extract key phrases from the selected word set according to the information entropy of the selected words. See the description of steps 404 to 406 in the above embodiments for details.
In an embodiment, extracting a keyword group from the selected word set according to the information entropy of the selected words includes: respectively calculating a first left information entropy value of each selected word; judging whether the first left information entropy value is greater than a first preset entropy value; when the first left information entropy value is less than or equal to the first preset entropy value, expanding the selected words to the left by a first preset length to generate a left expansion word set, and calculating a second left information entropy value of each left expansion word; selecting, from the left expansion word set, the left expansion words whose second left information entropy value is greater than the first preset entropy value to generate a left selected word set; when the first left information entropy value is greater than the first preset entropy value, taking the selected word set as the left selected word set; and extracting key phrases from the left selected word set. Refer to the description of steps 407 to 411 in the above embodiments for details.
In an embodiment, extracting the keyword group from the left selected word set includes: calculating a first right information entropy value of each word in the left selected word set; judging whether the first right information entropy value is greater than a second preset entropy value; when the first right information entropy value is less than or equal to the second preset entropy value, expanding the words in the left selected word set to the right by a second preset length to generate a right expansion word set, and calculating a second right information entropy value of each right expansion word; selecting, from the right expansion word set, the right expansion words whose second right information entropy value is greater than the second preset entropy value to generate a right selected word set; when the first right information entropy value is greater than the second preset entropy value, taking the left selected word set as the right selected word set; and extracting key phrases from the right selected word set. See the description of steps 412 to 416 in the above embodiments for details.
In one embodiment, the extracting the key word group from the right selected word set includes: and calculating the multi-character mutual information value of each word in the right selected word set. And selecting a phrase which forms the multi-character mutual information value larger than a third threshold value from the right selected word set as a key phrase. See the description of steps 417 through 418 in the above embodiments in detail.
In one embodiment, the method further comprises: the calculating module 504 is configured to calculate a similarity between every two key phrases in the plurality of key phrases after extracting the key phrases in the candidate character string set. The group pairing module 505 is configured to select at least two key phrases with similarity greater than a preset similarity value to form a similar pair set. See the description of steps 419-420 in the above embodiments for details.
For a detailed description of the phrase extracting apparatus 500, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory electronic-device-readable storage medium, including: a program that, when run on an electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like. The storage medium may also comprise a combination of the above types of memory.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (20)

1. A method for extracting phrases, comprising:
acquiring basic data to be extracted;
selecting effective character strings in the basic data to obtain a candidate character string set;
and extracting key phrases in the candidate character string set.
2. The method of claim 1, wherein the selecting valid strings from the base data to obtain a candidate string set comprises:
deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data;
and deleting data symbols which accord with a preset type from the effective text data to generate the candidate character string set.
3. The method of claim 1, wherein the extracting the key phrase in the candidate character string set comprises:
and inputting the candidate character string set to a preset phrase extraction model, so that the phrase extraction model outputs the key phrases.
4. The method of claim 3, wherein the step of presetting the phrase extraction model comprises:
in the candidate character string set, carrying out feature labeling on sample character strings in a preset number to obtain labeled sample data;
and performing machine learning model training by adopting the sample data to obtain the phrase extraction model.
5. The method of claim 1, wherein the extracting the key phrase in the candidate character string set comprises:
performing word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set;
calculating mutual information values between every two basic words in the basic word set;
selecting, from the basic word set, the basic words whose mutual information values are larger than a preset mutual information threshold value to form a selected word set;
and extracting the key phrase from the selected word set according to the information entropy of the selected word.
6. The method according to claim 5, wherein extracting the keyword group from the obtained word set according to the information entropy of the obtained word comprises:
respectively calculating a first left information entropy value of each selected word;
judging whether the first left information entropy value is larger than a first preset entropy value or not;
when the first left information entropy value is smaller than or equal to the first preset entropy value, expanding the selected word to the left by a first preset length to generate a left expansion word set, and calculating a second left information entropy value of the left expansion words;
selecting, from the left expansion word set, the left expansion words whose second left information entropy value is larger than the first preset entropy value to generate a left selected word set;
when the first left information entropy value is larger than the first preset entropy value, taking the selected word set as the left selected word set;
and extracting the key phrase from the left selected word set.
7. The method of claim 6, wherein extracting the keyword group from the left selected word set comprises:
calculating a first right information entropy value of each word in the left selected word set;
judging whether the first right information entropy value is larger than a second preset entropy value or not;
when the first right information entropy value is smaller than or equal to the second preset entropy value, expanding words in the left selected word set to the right by a second preset length to generate a right expansion word set, and calculating a second right information entropy value of the right expansion words;
selecting, from the right expansion word set, the right expansion words whose second right information entropy value is larger than the second preset entropy value to generate a right selected word set;
when the first right information entropy value is larger than the second preset entropy value, taking the left selected word set as the right selected word set;
and extracting the key phrase from the right selected word set.
8. The method of claim 7, wherein extracting the keyword group from the right selected word set comprises:
calculating the multi-character mutual information value of each word in the right selected word set;
and selecting, from the right selected word set, a word group whose multi-character mutual information value is larger than a third threshold value as the key phrase.
9. The method according to claim 1, further comprising, after said extracting the keyword group in the candidate character string set:
calculating the similarity between every two key phrases in a plurality of key phrases;
and selecting at least two key phrases with the similarity larger than a preset similarity value to form a similarity pair set.
10. A phrase extraction device characterized by comprising:
the acquisition module is used for acquiring basic data to be extracted;
the selection module is used for selecting effective character strings in the basic data to obtain a candidate character string set;
and the extraction module is used for extracting key phrases in the candidate character string set.
11. The apparatus of claim 10, wherein the selection module is configured to:
deleting the text data which is the same as the text data in the preset corpus knowledge base from the basic data to obtain effective text data;
and deleting data symbols which accord with a preset type from the effective text data to generate the candidate character string set.
12. The apparatus of claim 10, wherein the extraction module is configured to:
and inputting the candidate character string set to a preset phrase extraction model, so that the phrase extraction model outputs the key phrases.
13. The apparatus of claim 12, further comprising a model pre-set module configured to:
in the candidate character string set, carrying out feature labeling on sample character strings in a preset number to obtain labeled sample data;
and performing machine learning model training by adopting the sample data to obtain the phrase extraction model.
14. The apparatus of claim 10, wherein the extraction module is configured to:
performing word segmentation processing on the character strings in the candidate character string set according to a preset word order rule to generate a basic word set;
calculating mutual information values between every two basic words in the basic word set;
selecting, from the basic word set, the basic words whose mutual information values are larger than a preset mutual information threshold value to form a selected word set;
and extracting the key phrase from the selected word set according to the information entropy of the selected word.
15. The apparatus of claim 14, wherein the extracting the key phrase from the set of selected words according to the information entropy of the selected words comprises:
respectively calculating a first left information entropy value of each selected word;
judging whether the first left information entropy value is larger than a first preset entropy value or not;
when the first left information entropy value is smaller than or equal to the first preset entropy value, expanding the selected word to the left by a first preset length to generate a left expansion word set, and calculating a second left information entropy value of the left expansion words;
selecting, from the left expansion word set, the left expansion words whose second left information entropy value is larger than the first preset entropy value to generate a left selected word set;
when the first left information entropy value is larger than the first preset entropy value, taking the selected word set as the left selected word set;
and extracting the key phrase from the left selected word set.
16. The apparatus of claim 15, wherein the extracting the keyword group from the left selected word set comprises:
calculating a first right information entropy value of each word in the left selected word set;
judging whether the first right information entropy value is larger than a second preset entropy value or not;
when the first right information entropy value is smaller than or equal to the second preset entropy value, expanding words in the left selected word set to the right by a second preset length to generate a right expansion word set, and calculating a second right information entropy value of the right expansion words;
selecting, from the right expansion word set, the right expansion words whose second right information entropy value is larger than the second preset entropy value to generate a right selected word set;
when the first right information entropy value is larger than the second preset entropy value, taking the left selected word set as the right selected word set;
and extracting the key phrase from the right selected word set.
17. The apparatus of claim 16, wherein the extracting the keyword group from the right selected word set comprises:
calculating the multi-character mutual information value of each word in the right selected word set;
and selecting, from the right selected word set, a word group whose multi-character mutual information value is larger than a third threshold value as the key phrase.
18. The apparatus of claim 10, further comprising:
a calculating module, configured to calculate, after the extracting of the keyword groups in the candidate character string set, a similarity between every two keyword groups in the plurality of keyword groups;
and the group pairing module is used for selecting at least two key phrases whose similarity is greater than a preset similarity value to form a similar pair set.
19. An electronic device, comprising:
a memory to store a computer program;
a processor to perform the method of any one of claims 1 to 9.
20. A non-transitory electronic device readable storage medium, comprising: program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 9.
CN201911264855.4A 2019-12-11 2019-12-11 Phrase extraction method, device, equipment and storage medium Pending CN111061924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911264855.4A CN111061924A (en) 2019-12-11 2019-12-11 Phrase extraction method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111061924A true CN111061924A (en) 2020-04-24

Family

ID=70300514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911264855.4A Pending CN111061924A (en) 2019-12-11 2019-12-11 Phrase extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111061924A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966791A (en) * 2020-09-03 2020-11-20 深圳市小满科技有限公司 Extraction method and retrieval method of customs data product words

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016133960A (en) * 2015-01-19 2016-07-25 日本電気株式会社 Keyword extraction system, keyword extraction method, and computer program
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
CN109408818A (en) * 2018-10-12 2019-03-01 平安科技(深圳)有限公司 New word identification method, device, computer equipment and storage medium
CN109635296A (en) * 2018-12-08 2019-04-16 广州荔支网络技术有限公司 Neologisms method for digging, device computer equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966791A (en) * 2020-09-03 2020-11-20 深圳市小满科技有限公司 Extraction method and retrieval method of customs data product words
CN111966791B (en) * 2020-09-03 2024-04-19 深圳市小满科技有限公司 Method for extracting and retrieving customs data product words

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN109635296B (en) New word mining method, device computer equipment and storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN109299228B (en) Computer-implemented text risk prediction method and device
CN109543007A (en) Put question to data creation method, device, computer equipment and storage medium
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN109740152B (en) Text category determination method and device, storage medium and computer equipment
CN111767716A (en) Method and device for determining enterprise multilevel industry information and computer equipment
EP3608802A1 (en) Model variable candidate generation device and method
CN110909531A (en) Method, device, equipment and storage medium for discriminating information security
CN112307164A (en) Information recommendation method and device, computer equipment and storage medium
CN116227466B (en) Sentence generation method, device and equipment with similar semantic different expressions
CN110866102A (en) Search processing method
JP2017527913A (en) Systems and processes for analyzing, selecting, and capturing sources of unstructured data by experience attributes
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN113220996B (en) Scientific and technological service recommendation method, device, equipment and storage medium based on knowledge graph
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN107665222B (en) Keyword expansion method and device
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN111061924A (en) Phrase extraction method, device, equipment and storage medium
CN115563515A (en) Text similarity detection method, device and equipment and storage medium
CN113343012B (en) News matching method, device, equipment and storage medium
CN112035670B (en) Multi-modal rumor detection method based on image emotional tendency
CN112257408A (en) Text comparison method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20200424)