CN116911278A - Word mining method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116911278A
CN116911278A (application CN202310822249.XA)
Authority
CN
China
Prior art keywords
candidate, word, unregistered words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310822249.XA
Other languages
Chinese (zh)
Inventor
阮禄
周航成
冉猛
秦蛟禹
危枫
王晨子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202310822249.XA
Publication of CN116911278A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a word mining method and device, an electronic device, and a storage medium, belonging to the field of data processing. The method includes: acquiring a text sequence to be analyzed; extracting keywords from the text sequence to be analyzed to obtain candidate unregistered words; determining the inclusion relations among the candidate unregistered words and, when any candidate unregistered word contains other candidate unregistered words, taking the set of contained candidates as the closure subset of that word; and deleting the closure subsets from the candidate unregistered words to obtain target unregistered words. Denoising and filtering the candidates through their closure subsets removes candidates contained in other candidates, which reduces the repetition and similarity among the resulting target unregistered words and improves their quality, so that the input text sequence can be segmented accurately in subsequent natural language processing and the accuracy of that processing is improved.

Description

Word mining method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a word mining method, a word mining device, electronic equipment and a storage medium.
Background
With the development of artificial intelligence and big data, natural language processing has been deployed across industries. Chinese word segmentation is a core low-level component of natural language processing: it splits an input text sequence into individual words, the smallest units that express and carry semantics, and its result affects all downstream natural language processing.
Typically, Chinese word segmentation matches the input text sequence against the words in a segmentation vocabulary according to some strategy; if a character string is found in the vocabulary, the match succeeds and the string is segmented as one word.
However, a text sequence may contain unregistered words, that is, words that are not in the segmentation vocabulary but should nevertheless be segmented as words. Because unregistered words cannot be matched and identified, they cannot be segmented, which causes their semantics to be lost and reduces the accuracy of natural language processing.
Disclosure of Invention
The embodiments of the present application aim to provide a word mining method and device, an electronic device, and a storage medium, which can solve the problem that unregistered words in a text sequence cannot be segmented, causing their semantics to be lost and reducing the accuracy of natural language processing.
In a first aspect, an embodiment of the present application provides a word mining method, including:
acquiring a text sequence to be analyzed;
extracting keywords from the text sequence to be analyzed to obtain candidate unregistered words;
determining inclusion relations among the candidate unregistered words, and taking a set of the included other candidate unregistered words as a closure subset of any candidate unregistered word when any candidate unregistered word includes other candidate unregistered words;
and deleting the closure subset from the candidate unregistered words to obtain target unregistered words.
Optionally, extracting the keyword from the text sequence to be analyzed to obtain candidate unregistered words, including:
performing co-occurrence analysis on characters in the text sequence to be analyzed, and determining candidate character strings;
determining information entropy and point-wise mutual information of the candidate character strings based on the context information of the candidate character strings;
and taking the candidate character strings whose information entropy and point-wise mutual information meet preset conditions as candidate unregistered words.
Optionally, the performing co-occurrence analysis on the characters in the text sequence to be analyzed to determine candidate character strings includes:
counting the co-occurrence of adjacent characters in the text sequence to be analyzed;
and under the condition that the co-occurrence meets the co-occurrence condition, fusing the adjacent characters to obtain candidate character strings.
Optionally, the determining, based on the context information of the candidate character string, information entropy of the candidate character string includes:
determining left information entropy of the candidate character string based on the candidate character string and a character string before the candidate character string;
determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string;
fusing the left information entropy and the right information entropy to obtain the information entropy of the candidate character string; or, taking the minimum value between the left information entropy and the right information entropy as the information entropy of the candidate character string.
Optionally, deleting the closure subset from the candidate unregistered words to obtain target unregistered words includes:
deleting the closure subset from the candidate unregistered words to obtain first to-be-filtered unregistered words;
determining the proportion of stop words among the characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words;
and deleting the first to-be-filtered unregistered words whose stop word assimilation rate is greater than a preset threshold, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
Optionally, deleting the closure subset from the candidate unregistered words to obtain target unregistered words includes:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered;
matching the second to-be-filtered unregistered word in a preset word set;
and deleting the second to-be-filtered unregistered words that are successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
Optionally, the matching the second to-be-filtered unregistered word in the preset word set includes:
connecting the second to-be-filtered unregistered words into a character string to serve as the character string to be matched;
traversing a preset word set by using a multi-pattern matching algorithm to match the character string to be matched.
In a second aspect, an embodiment of the present application provides a device for word mining, including:
the acquisition module is used for acquiring a text sequence to be analyzed;
the extraction module is used for extracting keywords from the text sequence to be analyzed to obtain candidate unregistered words;
the determining module is used for determining the inclusion relation among the candidate unregistered words, and taking the set of the included other candidate unregistered words as a closure subset of any candidate unregistered word when any candidate unregistered word includes other candidate unregistered words;
and the denoising module is used for deleting the closure subset from the candidate unregistered words to obtain target unregistered words.
Optionally, the extraction module is specifically configured to:
performing co-occurrence analysis on characters in the text sequence to be analyzed, and determining candidate character strings;
determining information entropy and point-wise mutual information of the candidate character strings based on the context information of the candidate character strings;
and taking the candidate character strings whose information entropy and point-wise mutual information meet preset conditions as candidate unregistered words.
Optionally, the extraction module is specifically configured to:
counting the co-occurrence of adjacent characters in the text sequence to be analyzed;
and under the condition that the co-occurrence meets the co-occurrence condition, fusing the adjacent characters to obtain candidate character strings.
Optionally, the extraction module is specifically configured to:
determining left information entropy of the candidate character string based on the candidate character string and a character string before the candidate character string;
determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string;
fusing the left information entropy and the right information entropy to obtain the information entropy of the candidate character string; or, taking the minimum value between the left information entropy and the right information entropy as the information entropy of the candidate character string.
Optionally, the denoising module is further configured to:
deleting the closure subset from the candidate unregistered words to obtain first to-be-filtered unregistered words;
determining the proportion of stop words among the characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words;
and deleting the first to-be-filtered unregistered words whose stop word assimilation rate is greater than a preset threshold, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
Optionally, the denoising module is further configured to:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered;
matching the second to-be-filtered unregistered word in a preset word set;
and deleting the second to-be-filtered unregistered words that are successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
Optionally, the denoising module is specifically configured to:
connecting the second to-be-filtered unregistered words into a character string to serve as the character string to be matched;
traversing a preset word set by using a multi-pattern matching algorithm to match the character string to be matched.
In a third aspect, an embodiment of the present application provides an electronic device including a processor and a memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiments of the present application, a text sequence to be analyzed is obtained; keywords are extracted from it to obtain candidate unregistered words; the inclusion relations among the candidate unregistered words are determined and, when any candidate unregistered word contains other candidate unregistered words, the set of contained candidates is taken as the closure subset of that word; and the closure subsets are deleted from the candidate unregistered words to obtain target unregistered words.
In this way, after candidate unregistered words are extracted from the text sequence to be analyzed, the application performs closure-subset denoising and filtering on them based on their inclusion relations, removing candidates contained in other candidates. This reduces the repetition and similarity of the resulting target unregistered words and improves their quality, which helps subsequent natural language processing segment the input text sequence accurately and thereby improves the accuracy of natural language processing.
Drawings
FIG. 1 is a flowchart illustrating a word mining method according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a multi-pattern matching in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a word mining method, according to an exemplary embodiment;
FIG. 4 is a lexicon-tree diagram of multi-pattern matching according to an exemplary embodiment;
FIG. 5 is a block diagram of a word mining apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram of an electronic device for word mining according to an exemplary embodiment;
FIG. 7 is a block diagram of an apparatus for word mining according to an exemplary embodiment.
Detailed Description
The technical solutions of the embodiments of the present application will be described clearly below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application fall within the scope of protection of the present application.
The terms "first", "second", and the like in the description and claims distinguish similar objects and do not describe a particular order or sequence. It should be understood that terms so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects identified by "first" and "second" are generally of one type, and their number is not limited; for example, there may be one or more first objects. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The word mining method provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof by combining the accompanying drawings.
FIG. 1 is a flowchart illustrating a word mining method according to an exemplary embodiment, the word mining method including the following steps.
In step S11, a text sequence to be analyzed is acquired.
With the development of artificial intelligence and big data, natural language processing has been applied in various industries. Chinese word segmentation, a core low-level component of natural language processing, matches the input text sequence against the words in a segmentation vocabulary according to some strategy; if a character string is found in the vocabulary, the match succeeds and the string is segmented as one word.
However, a text sequence may contain unregistered words, that is, words that are not recorded in the segmentation vocabulary but must nevertheless be segmented as words. In the related art, such words cannot be matched and identified, so they cannot be segmented. The word mining method provided by the present application can mine target unregistered words from the text sequence to be analyzed, thereby solving these problems.
The text sequence to be analyzed can come from the following sources:
1. Business data: interactive voice response (Interactive Voice Response, IVR) audio transcribed into text. This data comes directly from the front line and is highly valuable; it can be used to mine user demands, build user profiles, and target user groups precisely, improving the returns of business departments. In dialogue scenarios, the text contains a large amount of dialect and colloquial speech, and professional vocabulary is often transcribed incorrectly or incompletely.
2. Work order data: work orders in the subdivided fields of complaints, consultations, transactions, inquiries, and faults. This data helps business departments analyze and review many cases; for example, semantic understanding can reveal the specific reasons for user complaints so that the problems of users and staff are better resolved. Natural language processing can further assist agent tagging, improving agents' working efficiency and reducing the company's labor costs. However, such data contains many proper nouns and descriptions that are difficult to recognize correctly, causing a large loss of semantics in subsequent semantic understanding and reducing the mining effect.
3. Various document data: such data can be used to build a knowledge base or knowledge graph to empower services, helping knowledge collection and editing staff search and query data accurately through the knowledge graph, thereby improving service efficiency. It likewise contains many proper nouns and descriptions that are difficult to recognize correctly.
By mining the target unregistered words of the relevant scenarios, word mining can effectively improve the segmentation of colloquial expressions, proper nouns, and descriptions in business data, work order data, and document data, and with it the effect of natural language processing.
In step S12, keyword extraction is performed on the text sequence to be analyzed, and candidate unregistered words are obtained.
In this step, keyword extraction on the text sequence to be analyzed determines the candidate unregistered words that may be target unregistered words. Since Chinese is written as a sequence of single morphemes without delimiters, and morphemes can combine into multi-character words or stand alone as single-character words, word boundaries can be measured with two indicators: the degree of freedom and the degree of internal cohesion.
The degree of freedom means that a word, as a basic semantic unit in Chinese, can be used flexibly in different contexts. For example, "artificial intelligence" (人工智能) can pair with many verbs and nouns, as in "learning artificial intelligence knowledge" or "working in the artificial intelligence industry"; by contrast, the fragment 人工智 is almost always followed by the single character 能 ("energy"). Therefore 人工智 cannot be considered a complete word.
From another perspective, a high degree of freedom means weak dependence on fixed neighbors: the higher the freedom, the more independent the word, that is, the richer its set of left and right neighboring words.
The degree of internal cohesion measures how tightly the characters inside a word bind to one another. If certain characters often appear together, they may be considered to form a word, regardless of how frequently the combination occurs. For example, if "concert" and "singing" occur with similar frequency, internal cohesion can measure which string better meets the standard for a new word.
In one implementation, extracting keywords from a text sequence to be analyzed to obtain candidate unregistered words includes:
performing co-occurrence analysis on the characters in the text sequence to be analyzed to determine candidate character strings; determining the information entropy and point-wise mutual information of the candidate character strings based on their context information; and taking the candidate character strings whose information entropy and point-wise mutual information meet preset conditions as candidate unregistered words.
A co-occurring word (collocation) is a word that co-occurs with another in the same term with a certain frequency; the candidate character strings can be counted through a graph structure.
Information entropy is a measure of the amount of information: it reflects how much information is obtained, on average, from learning the outcome of an event. The richness of a character combination's left and right neighbors can therefore be expressed with information entropy (Entropy):

H(x) = -Σ_y p(y|x)·log p(y|x)

where p(y|x) represents the probability that the random variable y occurs under the condition that the random variable x is determined.
To quantify the cohesion of a character combination, the probability of the combination occurring can be divided by the product of the occurrence probabilities of its components, in a form similar to mutual information (Mutual Information, MI).
Mutual information measures how much the uncertainty of one random variable y is reduced after another random variable x is determined. In the present application, each random variable takes a unique value, i.e., corresponds to a fixed character, so point-wise mutual information (PMI) can be used as the indicator of internal cohesion:

PMI(x, y) = log( p(x, y) / (p(x)·p(y)) )

where p(x, y) represents the probability that the character combination corresponding to x and y occurs, and p(x) and p(y) represent the occurrence probabilities of the characters corresponding to x and y, respectively. For the log function, the value is at least 0 whenever the argument is at least 1, so taking the logarithm maps the ratio into the interval [0, +∞) and slows its growth.
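As a rough sketch of the PMI indicator described above (the corpus, counts, and function names below are illustrative assumptions, not the patent's exact estimator), PMI over adjacent characters can be computed like this:

```python
import math
from collections import Counter

def pmi(text, x, y):
    """Point-wise mutual information of the adjacent character pair (x, y).

    p(x) and p(y) are unigram probabilities over the text; p(x, y) is the
    probability of the bigram xy among all adjacent character pairs.
    Assumes the pair actually occurs in the text (no zero-count smoothing).
    """
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_x = unigrams[x] / len(text)
    p_y = unigrams[y] / len(text)
    p_xy = bigrams[x + y] / (len(text) - 1)
    return math.log(p_xy / (p_x * p_y))

corpus = "机器学习机器学习机器视觉学习机器"
# The cohesive pair 学习 (a real word) scores higher than the looser pair 器学.
print(pmi(corpus, "学", "习"), pmi(corpus, "器", "学"))
```

Candidates whose PMI falls below a preset threshold would then be discarded as insufficiently cohesive.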
In one implementation, performing co-occurrence analysis on characters in a text sequence to be analyzed to determine candidate character strings includes:
counting the co-occurrence between adjacent characters in the text sequence to be analyzed; and under the condition that the co-occurrence meets the co-occurrence condition, fusing adjacent characters to obtain candidate character strings.
Specifically, taking words as units, N-th-order co-occurrences can be counted over the keyword-extraction result of the text sequence to be analyzed. Taking "security broadband 299 yuan" and first-, second-, and third-order co-occurrence as examples: first-order co-occurrence is the frequency of each individual word, such as "security"; second-order co-occurrence is "security→broadband"; third-order co-occurrence appends the successor of the second-order phrase, namely "security→broadband→299 yuan". Furthermore, to facilitate later calculation, second-order strings such as "broadband→299 yuan" and possible predecessors such as "handling→security→broadband" also need to be counted.
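The N-th-order counting described above can be sketched as follows (the token names and the "→" join are illustrative; the real input would be the word-level keyword-extraction result):

```python
from collections import Counter

def ngram_cooccurrence(tokens, max_order=3):
    """Count first- to max_order-th-order co-occurrences of adjacent tokens.

    Order 1 is plain token frequency; order n counts each run of n adjacent
    tokens, joined with '→' for readability.
    """
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in counts:
        for i in range(len(tokens) - n + 1):
            counts[n]["→".join(tokens[i:i + n])] += 1
    return counts

tokens = ["handle", "security", "broadband", "299 yuan",
          "handle", "security", "broadband"]
c = ngram_cooccurrence(tokens)
print(c[2]["security→broadband"])           # 2
print(c[3]["security→broadband→299 yuan"])  # 1
```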
In one implementation, determining the information entropy of the candidate string based on the context information of the candidate string includes:
determining left information entropy of the candidate character string based on the candidate character string and the character string before the candidate character string; determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string; fusing left information entropy and right information entropy to obtain information entropy of candidate character strings; or, the minimum value between the left information entropy and the right information entropy is used as the information entropy of the candidate character string.
Specifically, if an outcome occurs with probability p, then confirming that it did occur yields an amount of information -log(p): the smaller p is, the more information is obtained. Higher information entropy thus means a richer amount of information and greater uncertainty.
The more words a text fragment can collocate with, the greater its left and right information entropy. In one implementation, the minimum of the left and right information entropy can be taken as the information entropy of the candidate character string.
Alternatively, a custom statistic can be designed to measure the degree of freedom. For example, the information entropy of the candidate character string can be determined with the following formula, which jointly considers the magnitudes of the left and right information entropy of the text fragment and the absolute value of their difference:

E = log( (LE·e^RE + RE·e^LE) / |LE - RE| )

where LE represents the left information entropy and RE represents the right information entropy.
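The left and right neighbor entropy itself can be sketched as follows (a toy illustration; the scanning loop and the min-based score are assumptions, not the patent's exact procedure):

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy of a neighbor-character frequency table."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def branch_entropies(text, candidate):
    """Left and right neighbor-character entropy of `candidate` in `text`."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1   # character just before the match
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1        # character just after the match
        start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)

# "ab" occurs with three distinct left neighbors (x, z, u) and three distinct
# right neighbors (y, w, v), so both entropies equal log(3).
le, re_ = branch_entropies("xaby zabw uabv", "ab")
score = min(le, re_)  # one simple freedom score: bounded by the weaker side
```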
In step S13, the inclusion relation between the candidate unregistered words is determined, and when any one of the candidate unregistered words includes another candidate unregistered word, the set of the included other candidate unregistered words is used as a closure subset of any one of the candidate unregistered words.
In discrete mathematics, the closure of a relation R is the smallest relation that contains R and possesses reflexivity, symmetry, or transitivity; it is obtained by adding the minimum number of ordered pairs to R.
Following this definition of closure, the inclusion relation between two candidate unregistered words can be treated as a closure relation: when one candidate unregistered word is completely contained in another, the contained word must be a subset of the other. The purpose of defining the closure here is to find the words of maximal length.
When the composite scores of two candidate unregistered words do not differ greatly, the longer candidate can be decomposed into many subsets by permutation and combination, and these candidates satisfy the symmetry and transitivity of the closure definition: symmetry means that a candidate contains all the information of the similar candidates it subsumes, and transitivity means that the information obtained by decomposing and arranging a long candidate forms subsets of its information, which, passed along step by step, reconstruct the complete information.
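The closure-subset filtering of steps S13 and S14 amounts to keeping only the maximal-length candidates; a direct O(n²) sketch (the candidate strings below are made up for illustration):

```python
def remove_closure_subsets(candidates):
    """Drop every candidate that is a substring of a longer candidate.

    A candidate contained in another belongs to that word's closure subset
    and is removed; the survivors are the maximal-length words.
    """
    return [w for w in candidates
            if not any(w != other and w in other for other in candidates)]

cands = ["broadband package", "broadband", "package", "fiber"]
print(remove_closure_subsets(cands))  # ['broadband package', 'fiber']
```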
In step S14, the closure subset is deleted from the candidate unregistered words to obtain target unregistered words.
In the present application, for a candidate unregistered word, it is necessary to determine whether it is a qualified word, i.e., the candidate character string must be "unregistered" and must genuinely qualify as a "word". For domain-specific proper nouns such as "189 proper fusion packages" and "proper fusion packages", which of the two is the true unregistered word must be determined from how similarly the two words appear and where they are located. Therefore, the candidate unregistered words require further filtering.
As can be seen from the foregoing, removing all candidate unregistered words belonging to a closure subset prevents excessive noise among the target unregistered words, enables accurate word segmentation, and improves the natural language processing effect.
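The closure-subset removal can be sketched as follows, treating the closure subset as full substring containment (the semantically inverted subsets discussed elsewhere in the scheme are not handled in this minimal sketch; function and parameter names are illustrative):

```python
def remove_closure_subsets(candidates):
    """Keep only maximal candidates: drop any candidate whose characters are
    fully contained, in order, inside a longer retained candidate."""
    kept = []
    # Visit longest candidates first, so containers are kept before
    # their closure subsets are examined.
    for word in sorted(candidates, key=len, reverse=True):
        if any(word in longer for longer in kept):
            continue  # word is a closure subset of an already-kept candidate
        kept.append(word)
    return kept
```

For instance, among "abcde", "abc", and "bcd", only the maximal word "abcde" survives, matching the rule that the closure definition retains the word of maximum length.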
In one implementation, deleting the closure subset from the candidate unregistered word to obtain the target unregistered word includes:
deleting the closure subset from the candidate unregistered words to obtain first unregistered words to be filtered; determining the proportion of stop words in characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words; deleting the first to-be-filtered unregistered words with the stop word assimilation rate larger than a preset threshold value, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
Specifically, stop words are words without substantive meaning. Stop-word denoising means that, for a candidate unregistered word, segmenting it again yields several character sets, many of which are words in the stop-word list; if these character sets account for a large proportion of the candidate unregistered word, there is reason to suspect that it is an incomplete word, and it should be filtered out of the candidate unregistered word set.
For example, the stop-word ratio may be expressed as:
P(W_sw) = |{W_i : W_i ∈ W_sw}| / |{W_i}|
where W_i ∈ W_sw denotes the stop-word characters included in the first to-be-filtered unregistered word and W_i ranges over all characters it includes. The smaller P(W_sw) is, the smaller the stop-word proportion in the word and the more likely the word is an unregistered word; by setting a threshold, a candidate unregistered word can be filtered out when its P(W_sw) exceeds that threshold.
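A minimal sketch of this filter, assuming a character-level ratio and an illustrative threshold of 0.5 (the patent leaves the threshold configurable):

```python
def stopword_ratio(word, stopwords):
    """P(W_sw): fraction of characters in `word` that are stop-word characters."""
    return sum(1 for ch in word if ch in stopwords) / len(word)

def filter_by_stopword_ratio(candidates, stopwords, threshold=0.5):
    """Drop candidates whose stop-word assimilation rate exceeds the threshold."""
    return [w for w in candidates if stopword_ratio(w, stopwords) <= threshold]
```

A word made entirely of stop-word characters has a ratio of 1.0 and is always dropped, consistent with the 100% assimilation-rate example given later in the scheme.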
In one implementation, deleting the closure subset from the candidate unregistered word to obtain the target unregistered word includes:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered; matching the second to-be-filtered unregistered word in a preset word set; deleting the second to-be-filtered unregistered words successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
Specifically, preset-word-set denoising refers to removing, from the candidate unregistered words, long words that already exist in an existing large-scale preset word set. Because the amount of information available to the algorithm's probability calculation may be insufficient, or the corpus used for mining new words may be too small, the two indexes describing unregistered words (internal cohesion and degree of freedom) may be incomplete at this stage; many candidate unregistered words are therefore actually subsets of already-recorded words and should be filtered out.
The matching of the second to-be-filtered unregistered word in the preset word set includes:
connecting the second to-be-filtered unregistered word into a character string serving as a character string to be matched; traversing a preset word set by using a multimode matching algorithm, and matching the character strings to be matched.
Specifically, the multi-pattern matching (Aho-Corasick) algorithm is a string search algorithm used to match preset words within an input character string. Unlike ordinary string matching, it matches all preset words simultaneously, and its time complexity is approximately linear in the length of the input string plus the number of matches, which greatly improves the performance of searching the preset word set for candidate unregistered words. Ordinary string matching, by contrast, must locate every match by comparing substrings one by one, and if every substring is compared the time complexity of the algorithm approaches a quadratic function of the input length.
As shown in fig. 2, which is a schematic diagram of the principle of multi-pattern matching, taking the words "he", "she", "his", and "hers" as examples, the multi-pattern matching process is as follows:
firstly, storing preset words into a dictionary tree (Trie tree) according to the character sequence;
Then, for each node, constructing a failure pointer;
the construction rule of the failure pointer is as follows: for a node a whose state matching fails, if the failure node of its parent can successfully transfer its state to another node b on the transition character of node a, then the failure pointer of node a is pointed to node b; if the failure node of the parent cannot transfer its state on that character, the same check is applied recursively to that node's own failure node; if the recursion traces back to the root node without success, the failure pointer of node a is pointed to the root node, and the failure pointer of the root node points to the root node itself;
furthermore, the second to-be-filtered unregistered words are connected into one character string with a preset separator, and this string is traversed through the Trie tree constructed above. During traversal, after each round of jumps ends, if the node reached is in a terminal state, the pattern string corresponding to that node has been matched successfully; in addition, starting from the node reached, the path indicated by the failure pointers is followed recursively up to the root node, and every node passed along the way that is in a terminal state also corresponds to a successful match.
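The Trie-plus-failure-pointer procedure described above can be sketched in Python as follows; the node layout and the (start_index, pattern) output format are illustrative assumptions, not part of the patent:

```python
from collections import deque

class AhoCorasick:
    """Minimal multi-pattern matcher: a Trie whose nodes carry failure pointers."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-node child transitions
        self.fail = [0]    # per-node failure pointer (root points to root)
        self.out = [[]]    # patterns that end at each node
        for pat in patterns:                 # 1) store words in the Trie
            node = 0
            for ch in pat:
                if ch not in self.goto[node]:
                    self.goto.append({}); self.fail.append(0); self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(pat)
        queue = deque(self.goto[0].values())  # 2) BFS to build failure pointers
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                # recurse through failure nodes until one can consume `ch`,
                # falling back to the root if none can
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # inherit terminal matches reachable via the failure path
                self.out[child] += self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every match in `text`."""
        node, found = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]       # jump via failure pointer
            node = self.goto[node].get(ch, 0)
            for pat in self.out[node]:
                found.append((i - len(pat) + 1, pat))
        return found
```

For the example words above, `AhoCorasick(["he", "she", "his", "hers"]).search("ushers")` finds "she", "he", and "hers" in one left-to-right pass.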
As shown in fig. 3, a schematic diagram of a word mining method includes a module 1 and a module 2, wherein the module 1 includes the following steps:
S1: graph processing counts scattered character strings, i.e., counts the co-occurrence of adjacent characters in the text sequence to be analyzed; when the co-occurrence meets the co-occurrence condition, adjacent characters are fused to obtain candidate character strings.
S2: calculate the pointwise mutual information of the candidate unregistered words.
S3: compare the pointwise mutual information of the candidate unregistered words with a set threshold, and retain the candidate unregistered words greater than the threshold.
S4 and S6: and respectively calculating left information entropy and right information entropy of the candidate unregistered words.
S5 and S7: and respectively comparing the left information entropy and the right information entropy of the candidate unregistered words with a set threshold value, and reserving the candidate unregistered words larger than the threshold value.
S8: merge the left information entropy and the right information entropy to finally obtain the information entropy of the whole candidate unregistered word; candidate unregistered words with insufficient information entropy are filtered out by threshold filtering, yielding the final candidate unregistered words.
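Steps S1 to S3 above can be sketched as follows; the `min_count` and PMI `threshold` values are illustrative assumptions, not values fixed by the patent:

```python
import math
from collections import Counter

def candidate_strings(text, min_count=2):
    """S1: count co-occurrences of adjacent characters and fuse the pairs
    whose count meets the (assumed) co-occurrence condition."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {p for p, c in pairs.items() if c >= min_count}

def pmi(text, candidate):
    """S2: pointwise mutual information of a two-character candidate,
    PMI(xy) = log(p(xy) / (p(x) * p(y)))."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_xy = pairs[candidate] / (len(text) - 1)
    p_x = chars[candidate[0]] / len(text)
    p_y = chars[candidate[1]] / len(text)
    return math.log(p_xy / (p_x * p_y))

def filter_by_pmi(text, candidates, threshold=0.5):
    """S3: retain only candidates whose PMI exceeds the threshold."""
    return {c for c in candidates if pmi(text, c) > threshold}
```

High PMI indicates strong internal cohesion: the two characters co-occur far more often than their individual frequencies would predict, which is the criterion S3 thresholds on.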
The module 2 comprises the following steps:
S9: closure-subset denoising is performed on the candidate unregistered words. Take the three words "security broadband 299 yuan", "security broadband", and "broadband security" as examples: "security broadband" is a completely contained subset of "security broadband 299 yuan", and "broadband security" is a semantically inverted subset of it. According to the definition in the technical scheme, S9 completely removes the candidate unregistered words belonging to the closure subset and prevents the final candidate unregistered words from being excessively noisy.
S10: stop-word denoising is performed on the candidate unregistered words. For a candidate unregistered word, segmenting it again yields several character sets, many of which are words in the stop-word list; if these character sets account for a large proportion of the candidate, there is reason to suspect that it is an incomplete word. For example, if the stop-word assimilation rate of a word reaches 100%, it cannot serve as an unregistered word, because even if it could form a word, it would be a meaningless one.
S11: Chinese word-forest denoising is performed on the candidate unregistered words, mainly removing those that already exist in an existing large-scale preset word set. By definition, an unregistered word is a word that already exists in actual use but has not yet been recorded in the existing preset word set. Words such as "flow" and "business hall" mined by module 1 are meaningful candidate unregistered words, but since their entries already exist in the preset word set, they belong to noise words and are filtered out. For example, a large-scale Chinese word forest may be introduced as the preset word set; such a word forest may cover the vocabulary of 12 fields, such as human science, medical science, and social engineering.
As shown in fig. 4, which is a dictionary tree diagram of multi-pattern matching: if the candidate unregistered words are {business hall, machine-shift testing, solid-shift fusion, operator}, the connected character string is "business hall,machine-shift testing,solid-shift fusion,operator". Traversing this string through the Trie tree, the two words "business hall" and "operator" in the preset word set can be found in a single pass according to the matching rule. Compared with the traditional traversal-and-comparison algorithm, whose complexity is O(n²), the traversal algorithm using multi-pattern matching reduces the complexity to approximately O(n), where n is the length of the traversed string. These words are finally filtered out, thereby ensuring the quality of the unregistered word mining.
From the above, it can be seen that, in the technical scheme provided by the embodiment of the application, after candidate unregistered words are extracted from the text sequence to be analyzed, closure-subset denoising filtering is applied to them based on the inclusion relations among them, and the candidate unregistered words included in other candidates are removed. This reduces the repetition and similarity of the obtained target unregistered words and thereby improves their quality, which helps segment the input text sequence accurately in the subsequent natural language processing process and further improves the accuracy of natural language processing.
In the word mining method provided by the embodiment of the application, the execution subject may be a word mining device. In the embodiment of the application, a word mining device performing the word mining method is taken as an example to describe the device provided by the embodiment of the application.
FIG. 5 is a block diagram of a word mining apparatus, according to an exemplary embodiment, the apparatus including:
an obtaining module 201, configured to obtain a text sequence to be analyzed;
the extracting module 202 is configured to extract keywords from the text sequence to be analyzed to obtain candidate unregistered words;
a determining module 203, configured to determine inclusion relationships between the candidate unregistered words, and in a case where any one candidate unregistered word includes other candidate unregistered words, take a set of the included other candidate unregistered words as a closure subset of the any candidate unregistered word;
and a denoising module 204, configured to delete the closure subset from the candidate unregistered words, to obtain target unregistered words.
Optionally, the extracting module 202 is specifically configured to:
performing co-occurrence analysis on characters in the text sequence to be analyzed, and determining candidate character strings;
determining information entropy and inter-point information of the candidate character strings based on the context information of the candidate character strings;
And taking the candidate character strings of which the information entropy and the mutual information meet preset conditions as candidate unregistered words.
Optionally, the extracting module 202 is specifically configured to:
counting the co-occurrence of adjacent characters in the text sequence to be analyzed;
and under the condition that the co-occurrence meets the co-occurrence condition, fusing the adjacent characters to obtain candidate character strings.
Optionally, the extracting module 202 is specifically configured to:
determining left information entropy of the candidate character string based on the candidate character string and a character string before the candidate character string;
determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string;
fusing the left information entropy and the right information entropy to obtain the information entropy of the candidate character string; or, taking the minimum value between the left information entropy and the right information entropy as the information entropy of the candidate character string.
Optionally, the denoising module 204 is further configured to:
deleting the closure subset from the candidate unregistered words to obtain first to-be-filtered unregistered words;
determining the proportion of stop words in characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words;
And deleting the first to-be-filtered unregistered words with the assimilation rate of the inactive words being greater than a preset threshold value, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
Optionally, the denoising module 204 is further configured to:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered;
matching the second to-be-filtered unregistered word in a preset word set;
and deleting the second to-be-filtered unregistered words successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
Optionally, the denoising module 204 is specifically configured to:
connecting the second to-be-filtered unregistered word into a character string serving as a character string to be matched;
traversing a preset word set by using a multimode matching algorithm, and matching the character strings to be matched.
From the above, it can be seen that, in the technical scheme provided by the embodiment of the application, after candidate unregistered words are extracted from the text sequence to be analyzed, closure-subset denoising filtering is applied to them based on the inclusion relations among them, and the candidate unregistered words included in other candidates are removed. This reduces the repetition and similarity of the obtained target unregistered words and thereby improves their quality, which helps segment the input text sequence accurately in the subsequent natural language processing process and further improves the accuracy of natural language processing.
The word mining device in the embodiment of the application can be an electronic device or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, mobile internet device (MID), augmented reality (AR)/virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), but may also be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine, or self-service machine, etc.; the embodiments of the present application are not specifically limited in this respect.
The word mining device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an ios operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The word mining device provided by the embodiment of the present application can implement each process implemented by the embodiments of the methods of fig. 1 to fig. 4, and in order to avoid repetition, a description is omitted here.
Optionally, as shown in fig. 6, the embodiment of the present application further provides an electronic device 500, including a processor 501 and a memory 502, where the memory 502 stores a program or an instruction that can be executed on the processor 501, and the program or the instruction implements each step of the above-mentioned word mining method embodiment when executed by the processor 501, and can achieve the same technical effect, so that repetition is avoided, and no further description is given here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to: radio frequency unit 1001, network module 1002, audio output unit 1003, input unit 1004, sensor 1005, display unit 1006, user input unit 1007, interface unit 1008, memory 1009, and processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1010 by a power management system to perform functions such as managing charge, discharge, and power consumption by the power management system. The electronic device structure shown in fig. 7 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
From the above, it can be seen that, in the technical scheme provided by the embodiment of the application, after candidate unregistered words are extracted from the text sequence to be analyzed, closure-subset denoising filtering is applied to them based on the inclusion relations among them, and the candidate unregistered words included in other candidates are removed. This reduces the repetition and similarity of the obtained target unregistered words and thereby improves their quality, which helps segment the input text sequence accurately in the subsequent natural language processing process and further improves the accuracy of natural language processing.
It should be appreciated that in an embodiment of the present application, the input unit 1004 may include a graphics processor (Graphics Processing Unit, GPU) 10041 and a microphone 10042, and the graphics processor 10041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes at least one of a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 can include two portions, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
The memory 1009 may be used to store software programs as well as various data. The memory 1009 may mainly include a first memory area storing programs or instructions and a second memory area storing data, wherein the first memory area may store an operating system, and application programs or instructions (such as a sound playing function, an image playing function, etc.) required for at least one function. Further, the memory 1009 may include volatile memory or nonvolatile memory, or the memory 1009 may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), or direct Rambus RAM (DRRAM). The memory 1009 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 1010 may include one or more processing units; optionally, the processor 1010 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, and the like, and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 1010.
The embodiment of the application also provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the above word mining method embodiment, and can achieve the same technical effect, so that repetition is avoided, and no further description is provided herein.
Wherein the processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the processes of the word mining method embodiment, and the same technical effects can be achieved, so that repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
Embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the above-described word mining method embodiment, and achieve the same technical effects, and for avoiding repetition, a detailed description is omitted herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (16)

1. A word mining method, comprising:
acquiring a text sequence to be analyzed;
extracting keywords from the text sequence to be analyzed to obtain candidate unregistered words;
determining inclusion relations among the candidate unregistered words, and taking a set of the included other candidate unregistered words as a closure subset of any candidate unregistered word when any candidate unregistered word includes other candidate unregistered words;
and deleting the closure subset from the candidate unregistered words to obtain target unregistered words.
2. The word mining method according to claim 1, wherein the extracting the keyword from the text sequence to be analyzed to obtain candidate unregistered words includes:
performing co-occurrence analysis on characters in the text sequence to be analyzed, and determining candidate character strings;
determining information entropy and inter-point information of the candidate character strings based on the context information of the candidate character strings;
and taking the candidate character strings of which the information entropy and the mutual information meet preset conditions as candidate unregistered words.
3. The word mining method of claim 2, wherein the performing co-occurrence analysis on characters in the text sequence to be analyzed to determine candidate character strings includes:
Counting the co-occurrence of adjacent characters in the text sequence to be analyzed;
and under the condition that the co-occurrence meets the co-occurrence condition, fusing the adjacent characters to obtain candidate character strings.
4. The word mining method according to claim 2, wherein the determining information entropy of the candidate character string based on the context information of the candidate character string includes:
determining left information entropy of the candidate character string based on the candidate character string and a character string before the candidate character string;
determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string;
fusing the left information entropy and the right information entropy to obtain the information entropy of the candidate character string; or, taking the minimum value between the left information entropy and the right information entropy as the information entropy of the candidate character string.
5. The word mining method according to claim 1, wherein the deleting the closure subset from the candidate unregistered words to obtain target unregistered words includes:
deleting the closure subset from the candidate unregistered words to obtain first to-be-filtered unregistered words;
Determining the proportion of stop words in characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words;
and deleting the first to-be-filtered unregistered words with the assimilation rate of the inactive words being greater than a preset threshold value, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
6. The word mining method according to claim 1, wherein the deleting the closure subset from the candidate unregistered words to obtain target unregistered words includes:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered;
matching the second to-be-filtered unregistered word in a preset word set;
and deleting the second to-be-filtered unregistered words successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
7. The word mining method according to claim 6, wherein the matching the second to-be-filtered unregistered word in the preset word set includes:
connecting the second to-be-filtered unregistered word into a character string serving as a character string to be matched;
traversing a preset word set by using a multimode matching algorithm, and matching the character strings to be matched.
8. A word excavating device comprising:
the acquisition module is used for acquiring a text sequence to be analyzed;
the extraction module is used for extracting keywords from the text sequence to be analyzed to obtain candidate unregistered words;
the determining module is used for determining the inclusion relation among the candidate unregistered words and, in a case where any candidate unregistered word contains other candidate unregistered words, taking the set of the contained other candidate unregistered words as a closure subset of that candidate unregistered word;
and the denoising module is used for deleting the closure subset from the candidate unregistered words to obtain target unregistered words.
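The closure-subset denoising performed by the determining and denoising modules can be sketched as a substring-containment check: any candidate contained in a longer candidate is deleted, keeping only the maximal strings. The function names are illustrative; a naive O(n²) scan is used for clarity.

```python
def closure_subsets(candidates):
    """Map each candidate unregistered word to the set of other
    candidates it contains as substrings (its 'closure subset')."""
    closures = {}
    for word in candidates:
        contained = {o for o in candidates if o != word and o in word}
        if contained:
            closures[word] = contained
    return closures


def remove_closure_subsets(candidates):
    """Delete every word appearing in some closure subset, keeping
    the maximal candidates as target unregistered words."""
    to_delete = set()
    for subset in closure_subsets(candidates).values():
        to_delete |= subset
    return [w for w in candidates if w not in to_delete]
```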
9. The word mining apparatus of claim 8, wherein the extraction module is specifically configured to:
performing co-occurrence analysis on characters in the text sequence to be analyzed, and determining candidate character strings;
determining information entropy and pointwise mutual information of the candidate character strings based on the context information of the candidate character strings;
and taking the candidate character strings whose information entropy and pointwise mutual information meet preset conditions as candidate unregistered words.
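Pointwise mutual information measures the internal cohesion of a candidate string. One simple sketch, assuming a split into the first character versus the remainder (the claim does not fix the split), compares the joint frequency of the whole string against the product of its parts' frequencies:

```python
import math


def pmi(text, candidate):
    """Pointwise mutual information between the first character of a
    candidate string and its remainder; higher values indicate stronger
    cohesion. The choice of split point is an illustrative assumption."""
    n = len(text)
    p_whole = text.count(candidate) / n       # joint probability estimate
    p_left = text.count(candidate[0]) / n     # first character
    p_right = text.count(candidate[1:]) / n   # remainder
    return math.log2(p_whole / (p_left * p_right))
```

In practice the minimum PMI over all binary splits of the candidate is often used so that a single weak joint suffices to reject it.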
10. The word mining apparatus of claim 9, wherein the extraction module is specifically configured to:
counting the co-occurrence of adjacent characters in the text sequence to be analyzed;
and under the condition that the co-occurrence meets the co-occurrence condition, fusing the adjacent characters to obtain candidate character strings.
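The co-occurrence counting and fusing step can be sketched as below. The concrete form of the co-occurrence condition is not specified in the claims; a minimum pair frequency is assumed here for illustration.

```python
from collections import Counter


def candidate_strings(text, min_count=2):
    """Count co-occurrences of adjacent character pairs and fuse the
    pairs whose count meets the (assumed) minimum-frequency condition
    into candidate character strings."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return [pair for pair, count in pairs.items() if count >= min_count]
```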
11. The word mining apparatus of claim 9, wherein the extraction module is specifically configured to:
determining left information entropy of the candidate character string based on the candidate character string and a character string before the candidate character string;
determining right information entropy of the candidate character string based on the candidate character string and a character string following the candidate character string;
fusing the left information entropy and the right information entropy to obtain the information entropy of the candidate character string; or taking the minimum of the left information entropy and the right information entropy as the information entropy of the candidate character string.
12. The word mining apparatus of claim 8, wherein the denoising module is further configured to:
deleting the closure subset from the candidate unregistered words to obtain first to-be-filtered unregistered words;
determining the proportion of stop words among the characters included in the first to-be-filtered unregistered words, and taking the proportion as the stop word assimilation rate of the first to-be-filtered unregistered words;
and deleting the first to-be-filtered unregistered words whose stop word assimilation rate is greater than a preset threshold value, and taking the remaining first to-be-filtered unregistered words as target unregistered words.
13. The word mining apparatus of claim 8, wherein the denoising module is further configured to:
deleting the closure subset from the candidate unregistered words to obtain second unregistered words to be filtered;
matching the second to-be-filtered unregistered word in a preset word set;
and deleting the second to-be-filtered unregistered words that are successfully matched, and taking the remaining second to-be-filtered unregistered words as target unregistered words.
14. The word mining apparatus of claim 13, wherein the denoising module is specifically configured to:
concatenating the second to-be-filtered unregistered words into a character string to serve as the character string to be matched;
and traversing the preset word set using a multi-pattern matching algorithm to match the character string to be matched.
15. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the word mining method of any of claims 1-7.
16. A readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the word mining method of any of claims 1-7.
CN202310822249.XA 2023-07-05 2023-07-05 Word mining method and device, electronic equipment and storage medium Pending CN116911278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310822249.XA CN116911278A (en) 2023-07-05 2023-07-05 Word mining method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116911278A true CN116911278A (en) 2023-10-20

Family

ID=88362119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310822249.XA Pending CN116911278A (en) 2023-07-05 2023-07-05 Word mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116911278A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination