CN112668331A - Special word mining method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112668331A
Authority
CN
China
Prior art keywords
new word
word
domain
new
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110288414.9A
Other languages
Chinese (zh)
Inventor
侯晋峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wofeng Times Data Technology Co ltd
Original Assignee
Beijing Wofeng Times Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention provides a method and a device for mining a special word, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a new word set according to the text data of the multiple fields; acquiring domain difference degrees of the first new word for the plurality of domains and domain exposure degrees for the first domain; wherein the first new word is a new word in the new word set, and the first domain is one of the domains; and determining that the domain difference degree is greater than a first preset threshold value and the domain exposure degree of the first domain is greater than a second preset threshold value, and taking the first new word as a special word of the first domain. The special word mining method provided by the invention can be used for mining special words across multiple fields, and overcomes the defects of low accuracy and low efficiency of mining special words in the prior art.

Description

Special word mining method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for mining a special word, electronic equipment and a storage medium.
Background
New word discovery is one of the basic tasks of natural language processing (NLP): new words are identified by mining existing corpora. It is also called unknown word recognition. Strictly speaking, new words are words that are newly coined, or old words that acquire new meanings, as the times develop. The special words (proper nouns) of a specific field can also be regarded as new words. Vocabulary of the general field is usually easy to find, whereas the special words of a specific field are very difficult to find, so the special words of a specific field are new words relative to general-field vocabulary. New word discovery can therefore mine both new words that arise over time and special words of different fields.
Existing special word mining techniques are based on single-field data and mainly take two forms. One is a statistics-based method, which finds new words by computing the degree of freedom and the degree of solidification between words in a data set. Because the data set covers a single field, every new word found in it is treated as a term of that field, so the accuracy is low and many words that are not specific to the field are mined as special words. The other is a model-based method, in which a sequence labeling model is trained on data in which some special words have been labeled, and the model is then used to identify special words.
Disclosure of Invention
The invention provides a method and a device for mining a special word, which are used for solving the defects that the special word cannot be mined across multiple fields, the mining accuracy of the special word is low, and the efficiency is low in the prior art.
The invention provides a special word mining method, which comprises the following steps: acquiring a new word set according to the text data of the multiple fields;
acquiring domain difference degrees of the first new word for the plurality of domains and domain exposure degrees for the first domain; wherein the first new word is a new word in the new word set, and the first domain is one of the domains;
and determining that the domain difference degree is greater than a first preset threshold value and the domain exposure degree of the first domain is greater than a second preset threshold value, and taking the first new word as a special word of the first domain.
According to the method for mining the special words provided by the invention, the obtaining of the field difference degrees of the first new word to the plurality of fields comprises the following steps:
acquiring the occurrence probability of the first new word in each field according to the occurrence frequency of the first new word in each field of the plurality of fields;
and acquiring the domain difference degrees of the first new word to the plurality of domains according to the probability of the first new word appearing in each domain.
According to the method for mining the special words, the field difference degrees of the first new word for the multiple fields are obtained through the following formula:
[Formula image 001 is not reproduced in the source text.]
where the symbol shown in image 002 denotes the domain difference degree of the first new word A for the plurality of domains, and the symbol shown in image 003 denotes the probability of the first new word A appearing in the i-th domain; i takes values from 1 to n, and n is the total number of domains.
According to the method for mining the special words provided by the invention, the obtaining of the domain exposure of the first new word to the first field comprises the following steps:
acquiring the number of texts in the first field, which contain the first new word, and the sum of the number of texts in the first field;
and taking the ratio of the text quantity to the sum of the text quantities as the domain exposure of the first new word to the first domain.
According to the special word mining method provided by the invention, the new word set is obtained according to the text data of a plurality of fields, and the method comprises the following steps:
acquiring text data of the multiple fields;
based on a word segmentation dictionary, performing word segmentation on the text data of the multiple fields to obtain a word segmentation set;
performing new word extraction on the text data of the plurality of fields based on a new word extraction model to obtain an initial new word set;
and removing new words which appear in the participle set from the initial new word set to obtain the new word set.
According to the special word mining method provided by the invention, the new word extraction model is either constructed based on a degree-of-freedom and degree-of-solidification calculation algorithm, or generated by training a deep learning model on a sequence labeling training set.
According to the method for mining the special words provided by the invention, after the new words which appear in the participle set are removed from the initial new word set and the new word set is obtained, the method further comprises the following steps:
and updating the word segmentation dictionary according to the new word set.
The present invention also provides a proper word mining device, including:
the new word set generating module is used for acquiring a new word set according to the text data of the multiple fields;
the domain difference and domain exposure generating module is used for acquiring domain differences of the first new word to the plurality of domains and domain exposure of the first new word to the first domain; wherein the first new word is a new word in the new word set, and the first domain is one of the domains;
and the special word generation module is used for determining that the field difference is greater than a first preset threshold and the field exposure is greater than a second preset threshold, and then taking the first new word as the special word of the first field.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the proprietary word mining method according to the first aspect when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the proprietary word mining method according to the first aspect.
The invention provides a special word mining method, which comprises the steps of obtaining a new word set according to text data of a plurality of fields, preliminarily screening the text data to obtain a new word set which is used as a candidate special word set of the plurality of fields; by acquiring the field difference degrees of the first new word to the multiple fields and the field exposure degree of the first new word to the first field, and comparing the field difference degrees with the corresponding threshold values, the 'exclusive' degree of the first new word and the receiving and using degrees of the first new word in the first field are measured, the exclusive words can be mined across the multiple fields, and the exclusive word mining in the multiple fields is efficiently and accurately realized.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a method for mining a special word according to the present invention;
FIG. 2 is a schematic flow chart of obtaining a new word set according to text data of multiple domains according to the present invention;
FIG. 3 is a schematic structural diagram of a proprietary word mining device provided by the present invention;
FIG. 4 is a schematic structural diagram of the new word set generating module 31 provided by the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is described below in connection with fig. 1-5.
In a first aspect, the present invention provides a method for mining a proprietary word, as shown in fig. 1, including:
S11, acquiring a new word set according to the text data of the multiple fields;
The multiple fields are selected according to what the user is interested in. The text data may come from technical materials within enterprises, schools and other organizations, from books and periodicals, or from field-related material crawled from web forums; no limitation is imposed here. If the acquired data is not plain text, information such as pictures and links must be removed so that only text data is kept for subsequent processing. The more fields and the more texts there are, the higher the extraction accuracy of the special words. The text data of each field may be divided into a plurality of texts, where one text may be a sentence, a paragraph or an article, which is not limited herein.
After the text data of a plurality of fields are obtained, extracting new words from the text data, and adding the extracted new words into a new word set. The purpose of extracting the new words is to primarily screen the text data of the multiple fields, remove common and general words such as common words and stop words and the like, obtain a new word set in the text data of the multiple fields, and use the new word set as candidate special words of the multiple fields.
S12, acquiring the domain difference degrees of the first new word for the plurality of domains and the domain exposure degree for the first domain; wherein the first new word is a new word in the new word set, and the first domain is one of the domains;
A special word is a word that appears concentrated in one field or in a few closely related fields. The domain difference degree of a new word can therefore be obtained from how differently the new word is distributed across the multiple fields. The larger the domain difference degree, the more concentrated the new word is in one field or a few closely related fields, that is, the higher its degree of "exclusivity"; the smaller the domain difference degree, the more fields the word spans and the more evenly it is distributed, that is, the lower its degree of "exclusivity".
In addition, a special word should be practical and accepted to some extent by people in the field, so the domain exposure is obtained from how the new word is actually used in the text data.
The domain exposure indicates how far the new word has been accepted and used. The higher the domain exposure, the more widely the new word is accepted and used by people in the field; the lower the domain exposure, the lower the acceptance of the new word and the less suitable it is as a special word of the field. For example, suppose "high frequency contactor" is a term introduced in a certain paper purely as an abbreviation for brevity of description, while other documents on similar subjects in the same technical field rarely use it. The new word "high frequency contactor" then has low exposure in the field and is not widely accepted by people in the field, so it does not merit attention and is not suitable as a special word.
S13, determining that the field difference degree is larger than a first preset threshold value, and the field exposure degree of the first field is larger than a second preset threshold value, and taking the first new word as a special word of the first field.
After the domain difference degree of the new word for the multiple fields and its domain exposure for a single field are obtained, each is compared with its corresponding preset threshold. When both values are greater than their corresponding preset thresholds, the new word is taken as a special word of that field. The same new word may relate to several closely related fields and may therefore be a special word of each of them. In that case the domain difference degree is computed once over all fields and reused without repeated calculation, whereas the domain exposure, which describes how far the new word is accepted and used by people in one particular field, must be calculated separately for each field.
In the embodiment, a new word set is obtained according to text data of a plurality of fields, and the text data is preliminarily screened to obtain a new word set which is used as a candidate special word set of the plurality of fields; by acquiring the field difference degrees of the first new word to the multiple fields and the field exposure degree of the first new word to the first field, and comparing the field difference degrees with the corresponding threshold values, the 'exclusive' degree of the first new word and the receiving and using degrees of the first new word in the first field are measured, the exclusive words can be mined across the multiple fields, and the exclusive word mining in the multiple fields is efficiently and accurately realized.
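The decision in step S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `special_word_domains`, the dictionary layout and the threshold values in the usage lines are hypothetical, and the source does not specify any concrete threshold.

```python
def special_word_domains(difference, exposures, diff_threshold, exposure_threshold):
    """Return the domains for which a new word qualifies as a special word.

    difference: domain difference degree of the word, computed once over all domains.
    exposures: dict mapping domain name -> domain exposure of the word in that domain.
    Both thresholds are assumed, user-chosen values.
    """
    # The difference degree is shared by all domains, so it is checked once.
    if difference <= diff_threshold:
        return []
    # The exposure is checked separately for each domain.
    return [d for d, e in exposures.items() if e > exposure_threshold]


# Hypothetical values for illustration only.
exposures = {"medical": 0.30, "finance": 0.01}
qualified = special_word_domains(0.9, exposures, diff_threshold=0.8, exposure_threshold=0.05)
```

Here the word passes the shared difference check and then qualifies only in the domain whose exposure exceeds the second threshold.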
In one embodiment of the present invention, obtaining domain differences of the first new word for a plurality of domains includes: acquiring the probability of the first new word appearing in each field according to the frequency of the first new word appearing in each field of the plurality of fields; and acquiring the domain difference degrees of the first new word to the plurality of domains according to the probability of the first new word appearing in each domain.
The number of times and the probability statistics are based on the text data of the plurality of domains in step S11. The number of times of the first new word appearing in each field can be obtained by searching/matching the first new word with the text data in each field of the multiple fields, and the appearance probability of the first new word in each field with the multiple fields as a sample population is obtained through statistics. And obtaining the domain difference degree of the first new word to the plurality of domains according to the difference of the appearance probability of the first new word in the plurality of domains.
According to the method and the device, the probability of the first new word appearing in each field is obtained by calculating the number of times of the first new word appearing in each field, and then the field difference degree of the first new word to a plurality of fields is obtained, so that the field difference degree is obtained simply and accurately.
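A minimal sketch of the counting step, assuming the probability in each field is the word's occurrence count in that field divided by its total count over all fields (one reading of "with the multiple fields as a sample population"). The function name `domain_probabilities` and the input layout are hypothetical, and simple substring counting stands in for whatever search or matching is actually used.

```python
def domain_probabilities(word, domain_texts):
    """Estimate the probability of `word` appearing in each domain.

    domain_texts: dict mapping domain name -> list of text strings.
    Assumption: probability = count in the domain / total count over all domains.
    """
    counts = {
        domain: sum(text.count(word) for text in texts)
        for domain, texts in domain_texts.items()
    }
    total = sum(counts.values())
    if total == 0:
        # The word never occurs, so no distribution can be estimated.
        return {domain: 0.0 for domain in counts}
    return {domain: count / total for domain, count in counts.items()}


corpus = {
    "medical": ["stent placed, stent checked", "a stent"],
    "finance": ["stent maker shares rose"],
    "sports": ["late goal"],
}
probs = domain_probabilities("stent", corpus)
```

With this toy corpus the word occurs three times in the medical texts and once in the finance texts, so the estimated probabilities are 0.75, 0.25 and 0.0.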
In one embodiment of the present invention, the domain differences of the first new word for the plurality of domains are obtained by the following formula:
[Formula image 001 is not reproduced in the source text.]
where the symbol shown in image 002 denotes the domain difference degree of the first new word A for the plurality of domains, and the symbol shown in image 003 denotes the probability of the first new word A appearing in the i-th domain, with a value range of 0 to 1; i takes values from 1 to n, and n is the total number of domains.
The embodiment provides a simple and accurate method for acquiring the field difference degree.
In an embodiment of the present invention, the sum of the occurrence probabilities of the first new word in the t fields where it occurs most frequently is taken as the domain difference degree of the first new word for the plurality of fields, where t is a positive integer smaller than n whose specific value can be adjusted according to actual requirements.
Because special words appear concentrated in a few fields, summing the occurrence probabilities of the first new word over the t fields where it occurs most frequently gives the probability that the word is concentrated in those t fields, which measures how unevenly the first new word is distributed across the multiple fields.
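The top-t variant described above can be sketched as follows. The function name `top_t_difference` is hypothetical; the input is assumed to be the list of per-field occurrence probabilities of one new word.

```python
def top_t_difference(probabilities, t):
    """Sum the t largest per-field probabilities of a word.

    probabilities: per-field occurrence probabilities of one new word (n values).
    t: positive integer smaller than n, chosen according to actual requirements.
    """
    n = len(probabilities)
    if not 0 < t < n:
        raise ValueError("t must be a positive integer smaller than the number of fields")
    return sum(sorted(probabilities, reverse=True)[:t])


concentrated = top_t_difference([0.75, 0.25, 0.0, 0.0], t=1)
uniform = top_t_difference([0.25, 0.25, 0.25, 0.25], t=1)
```

A word concentrated in one field scores higher than a word spread evenly over all fields, which is the behavior the embodiment relies on.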
In one embodiment of the present invention, obtaining the domain exposure of the first new word to the first domain comprises: acquiring the number of texts containing the first new word in the first field and the sum of the number of texts in the first field; and taking the ratio of the text quantity to the sum of the text quantity as the domain exposure of the first new word to the first domain.
Suppose the domain exposure of word A is to be calculated and there are n domains.
Obtain the total number of texts in domain i, denoted Oi (Oi is a positive integer).
Obtain the number of texts in domain i that contain word A, denoted Ti (1 ≤ Ti ≤ Oi).
Then the exposure of word A in domain i is:
Ti / Oi
The same new word may appear many times within one document because of the document's topic, so using the number of occurrences of the new word as the basis of the exposure could introduce a large error. The exposure is therefore based on the number of texts containing the first new word: the ratio of the number of texts containing the first new word to the total number of texts in the first field measures the exposure of the first new word in the first field.
The embodiment provides a simple and accurate exposure calculation method, and eliminates the influence of frequent appearance of the same new word in the same text on the exposure calculation.
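The exposure calculation can be sketched as below. The function name `domain_exposure` is hypothetical, and a plain substring test is assumed as the way to decide whether a text contains the word.

```python
def domain_exposure(word, texts):
    """Ratio of texts containing `word` to the total number of texts in one field.

    Each text counts at most once, however many times the word occurs in it.
    """
    if not texts:
        return 0.0
    containing = sum(1 for text in texts if word in text)
    return containing / len(texts)


field_texts = ["stent stent stent", "no match here", "a stent", "other topic"]
exposure = domain_exposure("stent", field_texts)
```

The first text mentions the word three times but still contributes only one text to the numerator, so the exposure here is 2 / 4.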
As shown in fig. 2, in an embodiment of the present invention, obtaining a new word set according to text data of a plurality of fields includes:
S21, acquiring text data of the multiple fields;
The multiple fields are selected according to what the user is interested in. The text data may come from technical materials within enterprises, schools and other organizations, from books and periodicals, or from field-related material crawled from web forums; no limitation is imposed here. If the acquired data is not plain text, information such as pictures and links must be removed so that only text data is kept for subsequent processing. The more fields and the more texts there are, the higher the extraction accuracy of the special words.
The text data of each domain may be divided into a plurality of texts, and one text may be a sentence or a paragraph or an article, which is not limited herein.
S22, segmenting the text data of the multiple fields based on a segmentation dictionary to obtain a segmentation set;
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. Chinese character strings in the text data of the multiple fields are matched against the entries of the word segmentation dictionary. If a character string is found in the dictionary, the match succeeds, and each successfully matched word is added to the word segmentation set until the text data of all fields has been matched. The matching method may be forward maximum matching, reverse maximum matching, minimum segmentation, bidirectional maximum matching, or a combination of these, which is not limited herein.
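As an illustration of one of the matching methods named above, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` parameter and the toy dictionary are hypothetical; characters that match no dictionary entry are emitted as single-character tokens, which is a common convention rather than something the source specifies.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by greedily taking the longest dictionary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


dictionary = {"研究", "研究生", "生命", "起源"}
tokens = forward_max_match("研究生命起源", dictionary)
# Only successfully matched words enter the word segmentation set.
participle_set = {t for t in tokens if t in dictionary}
```

The greedy choice of "研究生" at the first position leaves "命" unmatched, which shows why the text allows other matching strategies or combinations of them.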
S23, extracting new words from the text data of the multiple fields based on the new word extraction model to obtain an initial new word set;
The new word extraction model is used for extracting new words from the text data. The model can be constructed with a supervised method, which uses an annotated corpus and treats new word recognition as a classification or sequence labeling problem. For example, a classification model can be trained with certain statistics of the candidate character strings as features. As another example, sequence labeling based on sequence information can yield new words directly, or candidate words can be obtained first and then judged to decide which are new words; common methods include HMM, CRF and SVM. The model can also be constructed with an unsupervised method, which sets thresholds on the statistical information of the candidate character strings to decide which are new words.
And after a new word extraction model is obtained, inputting the text data into the new word extraction model, extracting new words, and adding the obtained new words into the initial new word set until the new words of the text data in multiple fields are extracted.
S24, removing the new words which appear in the participle set from the initial new word set to obtain the new word set.
The new words which appear in the participle set are removed from the initial new word set, the initial new words are screened, common and general words in the text data are removed, and the new word set is obtained and used as a special word candidate.
The method includes the steps of segmenting text data of multiple fields based on a segmentation dictionary to obtain a common and general vocabulary set in the text data of the multiple fields, extracting new words from the text data of the multiple fields based on a new word extraction model, comprehensively extracting vocabularies in the text data of the multiple fields, removing the new words which are already appeared in the segmentation set from an initial new word set, screening the initial new words, removing the common and general vocabularies in the text data, and obtaining a new word set which is a candidate of a special word.
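Step S24 amounts to a set difference, sketched below. The function name `build_new_word_set` and the sample words are hypothetical.

```python
def build_new_word_set(initial_new_words, participle_set):
    """Remove words already found by dictionary segmentation from the extracted candidates."""
    return set(initial_new_words) - set(participle_set)


initial_new_words = {"区块链", "数据", "智能合约"}
participle_set = {"数据", "研究"}
new_words = build_new_word_set(initial_new_words, participle_set)
```

The word "数据" is dropped because the segmentation dictionary already knows it, leaving only the candidates that are new relative to the dictionary.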
In an embodiment of the present invention, after segmenting the text data in multiple fields based on the segmentation dictionary, the method further includes performing deduplication processing on the segmentation result to obtain a segmentation set.
The embodiment removes repeated words in the word segmentation result, and improves the processing efficiency of subsequent steps.
In one embodiment of the invention, the new word extraction model is either constructed based on a degree-of-freedom and degree-of-solidification calculation algorithm, or generated by training a deep learning model on a sequence labeling training set.
The degree of solidification measures how closely the two words that make up a phrase are bound together. Taking a two-word candidate as an example, let the probabilities that word A and word B occur on their own be P(A) and P(B). If the two words were independent, the probability of them occurring together would be P(A)P(B). If they are not independent and form a new word C, the probability P(C) of them occurring together is greater than P(A)P(B). The degree of solidification is taken as P(C) / (P(A)P(B)), and new words can be judged and screened by setting a degree-of-solidification threshold.
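The ratio P(C) / (P(A)P(B)) can be computed from raw counts, as in the sketch below. The function name `solidification_degree` and the count-based estimate of each probability (count divided by a common total) are assumptions made for illustration.

```python
def solidification_degree(count_c, count_a, count_b, total):
    """Compute P(C) / (P(A) * P(B)) with each probability estimated as count / total.

    count_c: occurrences of the combined candidate C (A followed by B).
    count_a, count_b: occurrences of A and B on their own.
    total: common denominator used for all three estimates (an assumption).
    """
    if count_a == 0 or count_b == 0 or total == 0:
        return 0.0
    # (c/N) / ((a/N) * (b/N)) simplifies to c * N / (a * b).
    return count_c * total / (count_a * count_b)


degree = solidification_degree(count_c=50, count_a=100, count_b=100, total=10_000)
```

With these toy counts the pair occurs together 50 times more often than independence would predict; a value near 1 would indicate that the two words merely co-occur by chance.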
The degree of freedom measures how freely a text segment combines with the words on its left and right, and is another important criterion for judging whether the segment forms a word. If a text segment counts as a word, it should appear flexibly in many different contexts and have rich sets of left and right neighbors.
For example, the right neighbor of "巧克" (the first two characters of "巧克力", chocolate) can only be "力", so its degree of freedom is zero and "巧克" cannot be a new word. To compute the degree of freedom, count the occurrences of each left neighbor and each right neighbor of the candidate, compute the probability of each neighbor using the total number of occurrences as the denominator, and compute the left and right information entropies. The smaller of the two entropies is taken as the final degree of freedom (if one side is not free, the segment cannot stand alone as a word), and new words are judged and screened by setting a degree-of-freedom threshold.
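The left and right entropy calculation can be sketched as follows, using the Chinese fragment "巧克" (the first two characters of "巧克力", chocolate) as the illustrative candidate. The function names and the toy text are hypothetical, neighbors are taken as single adjacent characters, and the natural logarithm is an arbitrary choice of base.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Information entropy of the observed neighbor distribution (natural log)."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


def degree_of_freedom(candidate, text):
    """Smaller of the left and right neighbor entropies of `candidate` in `text`."""
    left, right = [], []
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(candidate)
        if end < len(text):
            right.append(text[end])
        start = text.find(candidate, start + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))


text = "吃巧克力买巧克力送巧克力"
fragment_freedom = degree_of_freedom("巧克", text)
word_freedom = degree_of_freedom("巧克力", text)
```

The fragment is always followed by the same character, so its right entropy and hence its degree of freedom is zero, while the full word has varied neighbors on both sides and a positive degree of freedom.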
The unsupervised new word extraction model constructed based on the degree-of-freedom and degree-of-solidification calculation algorithm screens candidates twice, once against each of the two corresponding preset thresholds, and thereby extracts new words from the text data of multiple fields comprehensively and at low cost.
The new word extraction model generated by training a deep learning model on a sequence labeling training set is a supervised new word extraction model: some special words are labeled to form a training text, the deep learning model is trained on the training text, and the trained model is used to recognize new words.
In this embodiment, new words in the text data of multiple fields are extracted efficiently and comprehensively, either by the new word extraction model constructed based on the degree-of-freedom and degree-of-solidification calculation algorithm or by the new word extraction model generated by training a deep learning model on a sequence labeling training set.
In an embodiment of the present invention, after removing new words that have appeared in the participle set from the initial new word set to obtain a new word set, the method further includes: and updating the word segmentation dictionary according to the new word set.
According to the method and the device, after the new word set is obtained, the new words in the new word set are added and updated into the word segmentation dictionary, repeated matching and elimination of words are avoided when the new text data in multiple fields are subjected to the special word mining, and the special word mining efficiency is improved.
In an embodiment of the present invention, after updating the word segmentation dictionary according to the new word set, the method further includes performing word segmentation processing on the text data in the multiple fields according to the updated word segmentation dictionary to obtain a second word segmentation set, where the second word segmentation set is used for relevant statistics of new words in the new word set.
In this embodiment, the updated word segmentation dictionary is used to perform word segmentation on the text data in the multiple fields to obtain a second word segmentation set, which facilitates relevant statistics of each new word in the new word set, such as the number of times that the new word appears in each field, the number of texts in each field containing the first new word, and the like, and improves the efficiency of searching/matching and statistics.
The proprietary word mining apparatus provided by the present invention is described below. The apparatus described below and the proprietary word mining method described above may be read with reference to each other.
As shown in Fig. 3, the present invention provides a proprietary word mining apparatus, including: a new word set generating module 31, a domain difference degree generating module 32, a domain exposure generating module 33, and a proprietary word generating module 34.
The new word set generating module 31 is configured to obtain a new word set according to text data of multiple domains; the domain difference degree generating module 32 is configured to obtain the domain difference degree of a first new word for the multiple domains; the domain exposure generating module 33 is configured to obtain the domain exposure of the first new word for a first domain, where the first new word is a new word in the new word set and the first domain is one of the multiple domains; and the proprietary word generating module 34 is configured to take the first new word as a proprietary word of the first domain upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure is greater than a second preset threshold.
In this embodiment, a new word set is obtained from the text data of multiple domains; this preliminary screening of the text data yields the candidate proprietary words of the multiple domains. The domain difference degree of the first new word for the multiple domains and its domain exposure for the first domain are then obtained and compared with their respective thresholds. These two comparisons measure, respectively, how "proprietary" the first new word is and how widely it is accepted and used in the first domain, so that proprietary words of multiple domains are mined efficiently and accurately.
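The two-threshold decision performed by the proprietary word generating module can be sketched as below. This is a hedged illustration only: the domain difference degrees and domain exposures are taken as precomputed inputs, and the function name, the data layout, and the threshold values 0.5 and 0.05 are assumptions that do not come from the patent text.

```python
def select_proprietary_words(difference, exposure, diff_threshold, exposure_threshold):
    """Assign each new word to the domains in which it passes both thresholds.

    difference: {word: domain difference degree across all domains}
    exposure:   {word: {domain: domain exposure of the word in that domain}}
    Returns {domain: set of proprietary words}.
    """
    result = {}
    for word, diff in difference.items():
        if diff <= diff_threshold:
            continue  # the word is spread too evenly across domains
        for domain, value in exposure.get(word, {}).items():
            if value > exposure_threshold:
                result.setdefault(domain, set()).add(word)
    return result


# Hypothetical precomputed values for three candidate words.
difference = {"netvalue": 0.8, "hello": 0.1, "stent": 0.9}
exposure = {
    "netvalue": {"finance": 0.30, "medical": 0.00},
    "hello": {"finance": 0.50, "medical": 0.50},
    "stent": {"finance": 0.00, "medical": 0.01},
}
selected = select_proprietary_words(difference, exposure, 0.5, 0.05)
```

In this toy run, "hello" fails the difference threshold and "stent" fails the exposure threshold, so only "netvalue" is kept, as a proprietary word of the "finance" domain.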
As shown in Fig. 4, in an embodiment of the present invention, the new word set generating module 31 includes: a text data acquiring module 311, a word segmentation set generating module 312, an initial new word set generating module 313, and a new word set generating module 314.
The text data acquiring module 311 is configured to acquire text data of multiple domains; the word segmentation set generating module 312 is configured to segment the text data of the multiple domains based on a word segmentation dictionary to obtain a word segmentation set; the initial new word set generating module 313 is configured to extract new words from the text data of the multiple domains based on a new word extraction model to obtain an initial new word set; and the new word set generating module 314 is configured to remove from the initial new word set the new words that already appear in the word segmentation set, so as to obtain the new word set.
Segmenting the text data of the multiple domains based on the word segmentation dictionary yields the set of common, general-purpose vocabulary in that text data. Extracting new words based on the new word extraction model captures the vocabulary of the text data comprehensively. Removing from the initial new word set the new words that already appear in the word segmentation set screens out the common, general-purpose vocabulary, leaving a new word set whose members are candidate proprietary words.
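The pipeline of modules 311 to 314 can be sketched as follows. The patent does not name a segmentation algorithm, so forward maximum matching is used here purely as a stand-in, and the new word extraction model is passed in as an opaque callable; `forward_max_match`, `build_new_word_set`, and the toy data are all hypothetical. Unspaced ASCII strings stand in for unsegmented text.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy dictionary segmentation (a stand-in for the real segmenter)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_new_word_set(texts, dictionary, extract_candidates):
    """Initial new word set minus everything the dictionary segmenter produced."""
    segmentation_set = set()
    for text in texts:
        segmentation_set.update(forward_max_match(text, dictionary))
    initial_new_words = set(extract_candidates(texts))
    return initial_new_words - segmentation_set


# Toy data: "nav" is unknown to the dictionary, "fund" and "rises" are known.
dictionary = {"fund", "rises"}
texts = ["fundnavrises", "navrises"]
new_words = build_new_word_set(
    texts, dictionary, lambda _texts: {"fund", "nav", "rises"}
)
```

The set difference is what removes the general vocabulary: any candidate that the dictionary-based segmenter already produces as a token is dropped from the new word set.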
In an embodiment of the present invention, the proprietary word mining apparatus further includes an updating module 304, configured to update the word segmentation dictionary according to the new word set.
After the new word set is obtained, its new words are added to the word segmentation dictionary. This avoids repeatedly matching and eliminating the same words when proprietary words are later mined from the text data of additional domains, and improves mining efficiency.
Fig. 5 illustrates the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with one another via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a proprietary word mining method, the method including: obtaining a new word set according to text data of a plurality of domains; obtaining the domain difference degree of a first new word for the plurality of domains and the domain exposure of the first new word for a first domain, where the first new word is a new word in the new word set and the first domain is one of the plurality of domains; and, upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure for the first domain is greater than a second preset threshold, taking the first new word as a proprietary word of the first domain.
Furthermore, the logic instructions in the memory 530 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the proprietary word mining method provided above, the method including: obtaining a new word set according to text data of a plurality of domains; obtaining the domain difference degree of a first new word for the plurality of domains and the domain exposure of the first new word for a first domain, where the first new word is a new word in the new word set and the first domain is one of the plurality of domains; and, upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure for the first domain is greater than a second preset threshold, taking the first new word as a proprietary word of the first domain.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the proprietary word mining method provided above, the method including: obtaining a new word set according to text data of a plurality of domains; obtaining the domain difference degree of a first new word for the plurality of domains and the domain exposure of the first new word for a first domain, where the first new word is a new word in the new word set and the first domain is one of the plurality of domains; and, upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure for the first domain is greater than a second preset threshold, taking the first new word as a proprietary word of the first domain.
The apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A proprietary word mining method, comprising:
acquiring a new word set according to text data of a plurality of domains;
acquiring a domain difference degree of a first new word for the plurality of domains and a domain exposure of the first new word for a first domain; wherein the first new word is a new word in the new word set, and the first domain is one of the plurality of domains;
and, upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure for the first domain is greater than a second preset threshold, taking the first new word as a proprietary word of the first domain.
2. The proprietary word mining method of claim 1, wherein acquiring the domain difference degree of the first new word for the plurality of domains comprises:
acquiring the probability of the first new word appearing in each domain according to the number of occurrences of the first new word in each of the plurality of domains;
and acquiring the domain difference degree of the first new word for the plurality of domains according to the probability of the first new word appearing in each domain.
3. The proprietary word mining method of claim 2, wherein the domain difference degree of the first new word for the plurality of domains is obtained by the following formula:
[formula image not reproduced in the source text]
wherein [symbol image not reproduced] is the domain difference degree of the first new word A for the plurality of domains, [symbol image not reproduced] is the probability of the first new word A appearing in the i-th domain, i takes values from 1 to n, and n is the total number of domains.
4. The proprietary word mining method of claim 1, wherein acquiring the domain exposure of the first new word for the first domain comprises:
acquiring the number of texts in the first domain that contain the first new word, and the total number of texts in the first domain;
and taking the ratio of the number of texts containing the first new word to the total number of texts as the domain exposure of the first new word for the first domain.
5. The proprietary word mining method of claim 1, wherein acquiring a new word set according to text data of a plurality of domains comprises:
acquiring the text data of the plurality of domains;
performing word segmentation on the text data of the plurality of domains based on a word segmentation dictionary to obtain a word segmentation set;
performing new word extraction on the text data of the plurality of domains based on a new word extraction model to obtain an initial new word set;
and removing from the initial new word set the new words that already appear in the word segmentation set, to obtain the new word set.
6. The proprietary word mining method of claim 5, wherein the new word extraction model is a new word extraction model constructed based on an algorithm that calculates degree of freedom and degree of cohesion, or a new word extraction model generated based on a deep learning model trained on a sequence labeling training set.
7. The proprietary word mining method of claim 5, wherein after removing from the initial new word set the new words that already appear in the word segmentation set to obtain the new word set, the method further comprises:
updating the word segmentation dictionary according to the new word set.
8. A proprietary word mining apparatus, comprising:
a new word set generating module, configured to acquire a new word set according to text data of a plurality of domains;
a domain difference degree generating module, configured to acquire a domain difference degree of a first new word for the plurality of domains;
a domain exposure generating module, configured to acquire a domain exposure of the first new word for a first domain;
wherein the first new word is a new word in the new word set, and the first domain is one of the plurality of domains;
and a proprietary word generating module, configured to take the first new word as a proprietary word of the first domain upon determining that the domain difference degree is greater than a first preset threshold and the domain exposure is greater than a second preset threshold.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the proprietary word mining method of any of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the proprietary word mining method of any one of claims 1 to 7.
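The quantities named in claims 2 and 4 can be illustrated with a short sketch. The domain exposure follows the ratio stated in claim 4. The probability step is only one possible reading of claim 2: it is assumed here that a word's per-domain probability is its occurrence count in that domain divided by its total count over all domains, which the claim does not spell out. The function names are hypothetical, and the domain difference formula of claim 3 is not reproduced because its image is missing from the text.

```python
def domain_probabilities(counts_by_domain):
    """Per-domain appearance probability of one word.

    ASSUMPTION: counts are normalized over all domains; the claim only says
    the probability is derived from the per-domain occurrence counts.
    """
    total = sum(counts_by_domain.values())
    if total == 0:
        return {domain: 0.0 for domain in counts_by_domain}
    return {domain: count / total for domain, count in counts_by_domain.items()}


def domain_exposure(texts_containing_word, total_texts):
    """Claim 4: texts in the domain containing the word / all texts in the domain."""
    if total_texts == 0:
        return 0.0
    return texts_containing_word / total_texts


# Hypothetical counts for one new word in two domains.
probs = domain_probabilities({"finance": 3, "medical": 1})
exposure_in_finance = domain_exposure(2, 8)
```

With these toy counts the word is three times as likely to appear in "finance" as in "medical", and it occurs in a quarter of the "finance" texts.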
CN202110288414.9A 2021-03-18 2021-03-18 Special word mining method and device, electronic equipment and storage medium Pending CN112668331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110288414.9A CN112668331A (en) 2021-03-18 2021-03-18 Special word mining method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112668331A true CN112668331A (en) 2021-04-16

Family

ID=75399544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110288414.9A Pending CN112668331A (en) 2021-03-18 2021-03-18 Special word mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112668331A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020077816A1 (en) * 2000-08-30 2002-06-20 Ibm Corporation Method and system for automatically extracting new word
CN101464898A (en) * 2009-01-12 2009-06-24 腾讯科技(深圳)有限公司 Method for extracting feature word of text
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method
CN106502984A (en) * 2016-10-19 2017-03-15 上海智臻智能网络科技股份有限公司 A kind of method and device of field new word discovery
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning
CN110110322A (en) * 2019-03-29 2019-08-09 泰康保险集团股份有限公司 Network new word discovery method, apparatus, electronic equipment and storage medium
CN111931501A (en) * 2020-09-22 2020-11-13 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
边馥苓: "《时空大数据的技术与方法》", 31 May 2016, 测绘出版社 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210416)