CN108984514A - Word acquisition method and device, storage medium, and processor - Google Patents

Word acquisition method and device, storage medium, and processor

Info

Publication number
CN108984514A
Authority
CN
China
Prior art keywords
word
sequence
terms
words
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710414730.XA
Other languages
Chinese (zh)
Inventor
胡晓
谢心哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201710414730.XA
Publication of CN108984514A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word acquisition method and device, a storage medium, and a processor. The word acquisition method includes: extracting multiple text features from a determined word sequence, and determining the indicator information corresponding to each text feature; filtering the words in the word sequence by a preset word length threshold, and extracting the set of words that satisfies the preset word length threshold; fusing the indicator information corresponding to the multiple text features to obtain an indicator threshold, filtering the set of words that satisfies the preset word length threshold by the indicator threshold, and extracting the candidate word set that satisfies the indicator threshold; and screening the candidate word set according to preset screening indicators to obtain the specified words. The invention solves the problem in the related art that discovering new words consumes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.

Description

Word acquisition method and device, storage medium, and processor
Technical field
The present invention relates to the field of language processing, and in particular to a word acquisition method and device, a storage medium, and a processor.
Background
With the rapid development of information technology, information media have become an indispensable part of people's daily lives. People can browse domestic news on the Internet, for example on Sina and www.qq.com, and can also publish personal opinions on social media such as Sina Weibo and WeChat. Meanwhile, the ultra-large-scale text generated by people's online activity has driven the rapid development of natural language processing technology. The construction of public opinion analysis platforms depends closely on natural language processing, so the accuracy and timeliness of natural language processing have become particularly important. On the one hand, the accuracy of techniques such as Chinese word segmentation and new word discovery ensures the reliability of information analysis results. On the other hand, efficient processing of ultra-large-scale text gives information users the most timely analysis results; for example, timely discovery of new Chinese words helps with monitoring current public opinion and with follow-up handling.
Chinese natural language processing is currently attracting the investment of more and more researchers and engineers, and with the leapfrog development of artificial intelligence it has gradually become one of the hot topics in that field. Meanwhile, the maturing of large-scale distributed computing has brought new breakthrough perspectives to natural language processing, for example the release of product lines such as Baidu Search and the Sogou input method.
However, natural language processing still has many problems that are difficult to solve, such as new word discovery. Traditionally, whether a word is new is judged mainly by the time at which it appeared. With respect to a given corpus dictionary, the vocabulary it contains is defined as old words, that is, vocabulary that appeared in the past. A new word to be discovered can therefore be abstracted as a word that does not exist in the dictionary.
Existing methods fall mainly into statistics-based Chinese word segmentation and machine-learning-based Chinese word segmentation. The former relies mainly on the word-formation rules of Chinese and on the statistical regularities with which characters and words occur in the corpus, but it requires a heavy manual workload to analyze and filter the data case by case, which takes too long. The latter is mainly dictionary-based and segments text with machine learning algorithms, but its segmentation quality depends on the completeness of the dictionary; when dictionaries are largely incomplete, satisfactory results are hard to obtain. In addition, ultra-large-scale corpora, at the TB or PB level, pose a great challenge to algorithm performance. The related art therefore suffers from the problem that discovering new words consumes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.
Summary of the invention
Embodiments of the present invention provide a word acquisition method and device, a storage medium, and a processor, to at least solve the problem in the related art that discovering words consumes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.
According to one embodiment of the present invention, a word acquisition method is provided, including: extracting multiple text features from a determined word sequence, and determining the indicator information corresponding to each text feature; filtering the words in the word sequence by a preset word length threshold, and extracting the set of words that satisfies the preset word length threshold; fusing the indicator information corresponding to the multiple text features to obtain an indicator threshold, filtering the set of words that satisfies the preset word length threshold by the indicator threshold, and extracting the candidate word set that satisfies the indicator threshold; and screening the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the word sequence is determined as follows: word segmentation is performed on the input original corpus to obtain a segmentation result; the segmentation result is then converted into the word sequence according to the order in which words appear in the segmentation result.
Optionally, extracting the multiple text features from the word sequence and determining the indicator information corresponding to each text feature includes at least one of the following: performing frequency statistics on each word in the word sequence, and determining the occurrence frequency of the words of the word sequence according to the result; computing the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence, and determining the PMI sequence of the word sequence from the results in the order in which the words appear; computing the symmetric conditional probability (SCP) of each word in the word sequence, and forming the SCP sequence of the word sequence from the results in the order in which the words appear; computing the adjacency entropy of each word in the word sequence, and determining the adjacency entropy sequence of the word sequence from the results in the order in which the words appear; computing the word overlap rate and length information gain of each word in the word sequence, and determining the information content sequence of the word sequence from the results in the order in which the words appear.
Optionally, the adjacent character strings are the left longest substring and the right longest substring of the word.
Optionally, the indicator information is fused to obtain the indicator threshold $a$ at least by the following formula: $a = \alpha a_1 + \beta a_2 + \gamma a_3 + \delta a_4 + \theta a_5$, where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are condition coefficients whose values are not less than 0, with $\alpha + \beta + \gamma + \delta + \theta = 1$; $a_1$ is the occurrence frequency of the word sequence, $a_2$ is the PMI sequence of the word sequence, $a_3$ is the SCP sequence of the word sequence, $a_4$ is the adjacency entropy sequence of the word sequence, and $a_5$ is the information content sequence of the word sequence.
Optionally, screening the candidate word set according to the preset screening indicators to obtain the specified words includes at least one of the following: judging whether the frequency with which each word in the candidate word set occurs is greater than a preset word frequency threshold, and determining the words whose frequency is greater than the preset word frequency threshold to be the specified words; judging whether each word in the candidate word set is a word in the stop-word list, and determining the words that do not belong to the stop-word list to be specified words; judging whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring, and determining a longest substring whose occurrence frequency equals that of its parent string not to be a specified word; judging whether each word in the candidate word set is a word in the user dictionary, and determining the words that belong to the user dictionary not to be specified words.
Optionally, the above method is applied on the Apache Spark platform.
According to another embodiment of the present invention, a word acquisition device is provided, including: a determining module, configured to extract multiple text features from a determined word sequence and determine the indicator information corresponding to each text feature; a first filtering module, configured to filter the words in the word sequence by a preset word length threshold and extract the set of words that satisfies the preset word length threshold; a second filtering module, configured to fuse the indicator information corresponding to the multiple text features to obtain an indicator threshold, filter the set of words that satisfies the preset word length threshold by the indicator threshold, and extract the candidate word set that satisfies the indicator threshold; and a screening module, configured to screen the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the device is further configured to perform word segmentation on the input original corpus to obtain a segmentation result, and to convert the segmentation result into the word sequence according to the order in which words appear in the segmentation result.
Optionally, the determining module includes: a first determining unit, configured to perform frequency statistics on each word in the word sequence and determine the occurrence frequency of the words of the word sequence according to the result; a second determining unit, configured to compute the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence and determine the PMI sequence of the word sequence from the results in the order in which the words appear; a third determining unit, configured to compute the symmetric conditional probability (SCP) of each word in the word sequence and form the SCP sequence of the word sequence from the results in the order in which the words appear; a fourth determining unit, configured to compute the adjacency entropy of each word in the word sequence and determine the adjacency entropy sequence of the word sequence from the results in the order in which the words appear; and a fifth determining unit, configured to compute the word overlap rate and length information gain of each word in the word sequence and determine the information content sequence of the word sequence from the results in the order in which the words appear.
Optionally, the third processing module includes: a first judging unit, configured to judge whether the frequency with which each word in the candidate word set occurs is greater than a preset word frequency threshold, and determine the words whose frequency is greater than the preset word frequency threshold to be the specified words; a second judging unit, configured to judge whether each word in the candidate word set is a word in the stop-word list, and determine the words that do not belong to the stop-word list to be specified words; a third judging unit, configured to judge whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring, and determine a longest substring whose occurrence frequency equals that of its parent string not to be a specified word; and a fourth judging unit, configured to judge whether each word in the candidate word set is a word in the user dictionary, and determine the words that belong to the user dictionary not to be specified words.
According to yet another embodiment of the present invention, a device running the Apache Spark platform is provided, including the device described above.
According to yet another embodiment of the present invention, a storage medium is also provided. The storage medium includes a stored program, and the program, when run, executes the method described in any of the above.
According to yet another embodiment of the present invention, a processor is also provided. The processor is configured to run a program, and the program, when run, executes the method described in any of the above.
Through the present invention, the input corpus is converted into a word sequence; the indicator information calculated from the word sequence is fused to determine a threshold, and this threshold together with the preset threshold determines the candidate word set. The candidate word set is then screened with the corresponding screening indicators to finally determine the words. This requires little manpower and material resources and places no high requirement on the completeness of the dictionary. It therefore solves the problem in the related art that discovering new words consumes a large amount of manpower and material resources and depends heavily on a highly complete dictionary, and achieves the beneficial effects of avoiding complex computation, saving manpower and material resources, and obtaining new words without relying on a dictionary.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the present invention and form a part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal for a word acquisition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a word acquisition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a word acquisition method applied to a network monitoring system according to an embodiment of the present invention;
Fig. 4 is a flowchart of a word acquisition method based on a large-scale corpus text classification system according to an embodiment of the present invention;
Fig. 5 is a flowchart of a word acquisition method based on a personalized recommendation vocabulary according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a word acquisition device according to an embodiment of the present invention;
Fig. 7 is a structural diagram of a device running on the Apache Spark platform according to an embodiment of the present invention.
Detailed description of the embodiments
Hereinafter, the present invention is described in detail with reference to the drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with each other.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal for a word acquisition method according to an embodiment of the present invention. As shown in Fig. 1, the terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the word acquisition method in the embodiment of the present invention. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is the wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
It should be pointed out that the process described in this embodiment is implemented on the Apache Spark platform. Of course, other platform systems capable of word processing also fall within the protection scope of this embodiment and are not described further here.
It should be noted that word acquisition in this embodiment means acquiring new words, that is, words that do not exist in the existing dictionary. Of course, the new words in this embodiment are not limited to Chinese new words; other languages, such as English, French, Japanese, and the languages of other countries or regions, also fall within the protection scope of this embodiment. In actual operation, slight adjustments may be made according to the language characteristics of each country or region.
This embodiment provides a word acquisition method that runs on the above terminal. Fig. 2 is a flowchart of a word acquisition method according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:
Step S202: extract multiple text features from the determined word sequence, and determine the indicator information corresponding to each text feature.
Optionally, the determination in step S202 may be implemented in one of the following ways:
(1) Perform frequency statistics on each word in the word sequence, and determine the occurrence frequency of the words of the word sequence according to the result.
Specifically, the more frequently a word occurs in the word sequence, the more often it is used, and therefore the more likely it is to form a new word.
(2) Extract the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence, and form the PMI sequence of the word sequence from the PMI values in word order.
Specifically, PMI is an information measure of the correlation between two events. For a candidate new word, the larger the mutual information of its two longest substrings, namely the left longest substring and the right longest substring, the more likely the two are to form a new word. Conversely, the smaller the mutual information, the less likely the two are to form a new word. The mutual information therefore reflects the co-occurrence probability of the longest substrings in the candidate new word: the larger this probability, the more likely the longest substrings are to form a new word.
For example, for a candidate new word $w = c_1c_2\dots c_n$, its two longest substrings are the left longest substring $w_{\text{left}} = c_1c_2\dots c_{n-1}$ and the right longest substring $w_{\text{right}} = c_2c_3\dots c_n$, and the mutual information of the candidate new word $w$ is
$$\mathrm{PMI}(w) = \log\frac{p(w)}{p(w_{\text{left}})\,p(w_{\text{right}})}$$
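The PMI formula above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: estimating each probability as an n-gram count divided by the corpus length is an assumption, and the names `ngram_counts` and `pmi` are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def pmi(word, counts, total):
    """PMI(w) = log(p(w) / (p(w_left) * p(w_right))).

    Probabilities are estimated as count / total (an assumption).
    The word and both of its longest substrings must occur in counts.
    """
    if len(word) < 2:
        raise ValueError("PMI needs a word of at least two characters")
    w_left, w_right = word[:-1], word[1:]
    p_w = counts[word] / total
    p_left = counts[w_left] / total
    p_right = counts[w_right] / total
    return math.log(p_w / (p_left * p_right))

text = "abcabcabd"
counts = ngram_counts(text, 3)
score = pmi("abc", counts, len(text))
```

A string that never occurs has a zero count, so a real pipeline would need to skip such candidates or smooth the counts before taking the logarithm.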
(3) Extract the symmetric conditional probability (SCP) of each word in the word sequence, and form the SCP sequence of the word sequence from the SCP values in word order.
Specifically, SCP is a statistic that measures how tightly the characters inside a character string are bound together. For a candidate word, the larger this probability value, the more tightly the characters that make up the word are bound, and the more likely the string is to form a word; conversely, the smaller the value, the less likely it is to form a word.
For example, an SCP value is calculated for a candidate new word $w = c_1c_2\dots c_n$; the formula itself is not reproduced in this text.
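Because the SCP formula does not survive in this text, the sketch below uses a definition that is common in the multiword-extraction literature: the squared probability of the word divided by the average, over all split points, of the product of the probabilities of the two parts. Whether the patent uses exactly this form is an assumption, as are the count-based probability estimates and the name `scp`.

```python
def scp(word, counts, total):
    """Symmetric conditional probability under a commonly used definition.

    SCP(w) = p(w)^2 / ( (1/(n-1)) * sum_i p(c_1..c_i) * p(c_{i+1}..c_n) )

    This definition is an assumption; the source text omits its formula.
    counts maps each string to its occurrence count, and total is the
    normalizer used to turn counts into probabilities.
    """
    n = len(word)
    if n < 2:
        raise ValueError("SCP needs a word of at least two characters")
    p_w = counts[word] / total
    # Average the product of prefix and suffix probabilities over all splits.
    avg = sum(
        (counts[word[:i]] / total) * (counts[word[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    return p_w ** 2 / avg

demo_counts = {"ab": 2, "a": 4, "b": 2}
demo_value = scp("ab", demo_counts, 10)
```

For a two-character word there is a single split point, so the value reduces to p(w)^2 / (p(c1) p(c2)).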
(4) Extract the adjacency entropy of each word in the word sequence, and form the adjacency entropy sequence of the word sequence from the adjacency entropy values in word order.
Specifically, adjacency entropy (branch entropy) is a statistic that measures the uncertainty of the left and right adjacent characters of a candidate word. The higher this uncertainty, the richer the contexts in which the candidate word appears, and therefore the more likely it is to form a word.
For example, for a candidate new word $w$, let the characters $x$ and $y$ denote its left and right adjacent characters respectively. The left adjacency entropy $H_L(w)$, the right adjacency entropy $H_R(w)$, and the adjacency entropy $BE(w)$ of $w$ are calculated as follows:
$$H_L(w) = -\sum_{x} p(x \mid w)\,\log p(x \mid w)$$
$$H_R(w) = -\sum_{y} p(y \mid w)\,\log p(y \mid w)$$
$$BE(w) = \min\bigl(H_L(w),\, H_R(w)\bigr)$$
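The three entropy formulas can be illustrated with a short Python sketch. It scans a raw string for every occurrence of the candidate and tallies the characters immediately to its left and right. Working on an unsegmented string, assigning an entropy of 0 when no neighbor exists, and the names `neighbor_entropy` and `branch_entropy` are all assumptions made for illustration.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

def branch_entropy(word, text):
    """Return (HL, HR, BE) for word in text, with BE = min(HL, HR)."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1       # left adjacent character x
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1            # right adjacent character y
        start = text.find(word, start + 1)   # allow overlapping matches
    hl = neighbor_entropy(left) if left else 0.0
    hr = neighbor_entropy(right) if right else 0.0
    return hl, hr, min(hl, hr)

hl, hr, be = branch_entropy("ab", "xabyzabywabq")
```

In the sample string, "ab" has three distinct left neighbors but only two distinct right neighbors, so the right entropy is the smaller of the two and determines BE.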
(5) Extract the word overlap rate and length information gain of each word in the word sequence, and form the information content sequence of the word sequence from the word overlap rate and length information gain in word order.
Optionally, before step S204, the method further includes:
performing word segmentation on the input original corpus to obtain a segmentation result;
outputting the segmentation result as the word sequence, according to the order in which words appear in the segmentation result and the output characteristics of Spark.
Optionally, the order in which words appear is retained after segmentation because segmenting the input corpus against an existing dictionary will inevitably split words that are not in the dictionary incorrectly; retaining the word order therefore provides the input for analysis in the subsequent steps.
Step S204: filter the words in the word sequence by a preset word length threshold, and extract the set of words that satisfies the preset word length threshold.
Specifically, the preset word length threshold is a word length that follows general linguistic rules for forming new words. In this embodiment, the preset length threshold is 5 or 6. Therefore, as long as the length of a word satisfies the preset length threshold of 5 or 6, the word has the potential to form a new word.
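As a minimal sketch of the length filter in this step, the following Python function keeps the words whose length does not exceed the preset threshold while preserving their order. Reading the threshold of 5 or 6 as a maximum length is an assumption, since the text does not say whether it is an upper bound or an exact length, and the name `filter_by_length` is hypothetical.

```python
def filter_by_length(words, max_len=5):
    """Keep words of at most max_len characters, preserving input order.

    Treating the preset threshold as an upper bound is an assumption.
    """
    return [w for w in words if len(w) <= max_len]

kept = filter_by_length(["ab", "abcdef", "abcde"])
```

Returning a list rather than a set keeps the order of appearance, which the later steps rely on.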
Step S206: fuse the indicator information corresponding to the multiple text features to obtain an indicator threshold, filter the set of words that satisfies the preset word length threshold by the indicator threshold, and extract the candidate word set that satisfies the indicator threshold.
Optionally, the indicator threshold can be expressed by the following formula:
$$a = \alpha a_1 + \beta a_2 + \gamma a_3 + \delta a_4 + \theta a_5$$
where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are condition coefficients whose values are not less than 0, with $\alpha + \beta + \gamma + \delta + \theta = 1$; $a_1$ is the word frequency of the word sequence, $a_2$ is the PMI sequence of the word sequence, $a_3$ is the SCP sequence of the word sequence, $a_4$ is the adjacency entropy sequence of the word sequence, and $a_5$ is the information content sequence of the word sequence.
It should be pointed out that, as values determined empirically by the inventors, $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are assigned as follows:
$$\alpha = 0.35,\quad \beta = 0.14,\quad \gamma = 0.30,\quad \delta = 0.15,\quad \theta = 0.06$$
Thus a word that preliminarily satisfies the preset word length threshold and also satisfies the indicator threshold $a$ can be regarded as a candidate new word, that is, a word with the capacity to be a new word. To facilitate subsequent input, the candidate new words likewise need to be ordered according to the order in which the words appear after segmentation.
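The weighted fusion can be sketched as below, using the coefficient values stated in the text. Several details are assumptions made for illustration: that the five indicators of each word have already been scaled to comparable ranges, that a word qualifies when its fused value reaches a separate cutoff (the hypothetical parameter `cutoff`), and the function names `fused_score` and `select_candidates`.

```python
# Coefficients alpha, beta, gamma, delta, theta as given in the text.
WEIGHTS = (0.35, 0.14, 0.30, 0.15, 0.06)

def fused_score(features, weights=WEIGHTS):
    """Return a = alpha*a1 + beta*a2 + gamma*a3 + delta*a4 + theta*a5."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    if len(features) != len(weights):
        raise ValueError("one feature value is required per coefficient")
    return sum(w * f for w, f in zip(weights, features))

def select_candidates(words, feature_rows, cutoff):
    """Keep, in their original order, the words whose fused score >= cutoff.

    cutoff is a hypothetical parameter; the text does not give its value.
    """
    return [w for w, row in zip(words, feature_rows)
            if fused_score(row) >= cutoff]

candidates = select_candidates(
    ["x", "y"],
    [(1.0, 1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0, 0.0)],
    cutoff=0.5,
)
```

Iterating over the words in their input order means the candidate list keeps the order of appearance after segmentation, as the paragraph above requires.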
It should be pointed out that if some indicators cannot be computed, because of a system problem or because garbled text or document damage is present in the word sequence, only the indicators that can be computed need to be counted and the others are ignored: the corresponding condition coefficients are set to 0, and the remaining condition coefficients are reassigned by comparing the importance of the indicators. This preserves the accuracy of new word selection as far as possible while widening the applicability of new word extraction, giving the method strong resistance to external factors.
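One simple way to reassign the coefficients when an indicator is unavailable is to zero its coefficient and rescale the rest in proportion so that they again sum to 1. The text only says the coefficients are reassigned according to the importance of the indicators, so proportional rescaling is an assumption, and the name `renormalize` is hypothetical.

```python
def renormalize(weights, available):
    """Zero the coefficients of unavailable indicators and rescale the rest.

    weights:   the original coefficients (non-negative, summing to 1)
    available: one boolean per indicator, False if it could not be computed
    Proportional rescaling is an assumption; other reassignments are possible.
    """
    kept = [w if ok else 0.0 for w, ok in zip(weights, available)]
    total = sum(kept)
    if total == 0:
        raise ValueError("at least one indicator with a positive weight is required")
    return [w / total for w in kept]

# Example: the SCP indicator (third position) could not be computed.
adjusted = renormalize([0.35, 0.14, 0.30, 0.15, 0.06],
                       [True, True, False, True, True])
```

Proportional rescaling keeps the relative importance of the surviving indicators unchanged, which is one reading of comparing the importance of the indicators.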
Step S208: screen the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the specific screening indicators and screening methods are as follows:
(1) Judge whether the frequency with which each word in the candidate word set occurs is greater than a preset word frequency threshold, and determine the words whose frequency is greater than the preset word frequency threshold to be the specified words.
Specifically, a candidate word that occurs too rarely cannot satisfy the conditions for forming a new word. In one optional arrangement in this embodiment, the preset word frequency threshold is set to 3. That is, when a candidate word occurs fewer than 3 times, it is regarded as not satisfying the conditions for a new word and is filtered out; when it occurs more than 3 times, it is regarded as satisfying the conditions for a new word and is retained.
(2) Judge whether each word in the candidate word set is a word in the stop-word list, and determine the words that do not belong to the stop-word list to be specified words.
Specifically, the stop-word list records words that have been taken out of use. If a candidate new word belongs to the stop-word list, it likewise does not satisfy the conditions for a new word and is filtered out; if it does not belong to the stop-word list, the candidate word is retained.
(3) Judge whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring.
Specifically, for an N-character string $c_1c_2\dots c_n$, the (N-1)-character longest substrings are $w_{\text{left}} = c_1c_2\dots c_{n-1}$ and $w_{\text{right}} = c_2c_3\dots c_n$. If the occurrence frequency of a longest substring equals that of its parent string, the longest substring occurs in the text only as part of the parent string and not as a word in its own right. Such a substring is therefore filtered out, and the longest substrings whose occurrence frequency differs from that of the parent string are retained.
(4) Judge whether each word in the candidate word set is a word in the user dictionary, and determine the words that belong to the user dictionary not to be specified words.
Optionally, the candidate words are screened layer by layer by combining screening at the probability level, the lexicon level, and the character string level, which finally determines the new word or new word set that meets the above screening criteria.
It should be noted that the above screening conditions are examples only and are not exhaustive. For example, screening by other criteria such as the position at which a word appears or the part of speech of a word also falls within the protection scope of this embodiment and is not described further here.
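The four screening criteria above can be sketched as a single pass over the candidates. This is an illustrative sketch under stated assumptions: the function name, the argument layout, and the choice to treat every candidate string as a possible parent string are hypothetical, and only the threshold value 3 comes from the embodiment.

```python
from typing import Dict, Iterable, List, Set

FREQ_THRESHOLD = 3  # value given in the embodiment


def screen_candidates(candidates: Iterable[str],
                      freq: Dict[str, int],
                      stop_words: Set[str],
                      user_dict: Set[str]) -> List[str]:
    """Apply screening criteria (1) to (4) and keep the surviving words in order."""
    candidates = list(candidates)

    # Criterion (3): a longest substring whose frequency equals that of its
    # parent string only ever occurs inside the parent, so it is dropped.
    absorbed = set()
    for parent in candidates:
        if len(parent) < 2 or parent not in freq:
            continue
        for sub in (parent[:-1], parent[1:]):  # w_left and w_right
            if freq.get(sub) == freq[parent]:
                absorbed.add(sub)

    kept = []
    for word in candidates:
        if freq.get(word, 0) <= FREQ_THRESHOLD:  # (1) must be greater than the threshold
            continue
        if word in stop_words:                   # (2) stop-word list
            continue
        if word in absorbed:                     # (3) substring of a parent string
            continue
        if word in user_dict:                    # (4) already a known word
            continue
        kept.append(word)
    return kept
```

The order in which the four checks run does not change the result; running the cheap set lookups first is only a convenience.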
Optionally, the distributed Resilient Distributed Datasets (RDDs) on Apache Spark are aggregated to obtain the new word or new word set that meets the screening criteria.
In addition, to aid understanding of the technical solution described above, this embodiment also provides the following practical application scenarios:
Scene 1:
Public microblogs, WeChat Moments, and internet sites used in daily life generate abundant internet data. To carry out supervision of social security and guidance of public opinion on the network effectively, it is very important to analyze and track, in real time, the focus of public attention and the dynamics of public opinion. Fig. 3 is a flowchart of a word acquisition method applied to a network supervision system according to an embodiment of the present invention. The method specifically includes:
S1: Crawl microblog, WeChat, and other internet data, and store it in serialized form;
S2: Clean the data, remove interfering information, and consolidate the data into structured information;
S3: Load the data using Apache Spark, and perform new word discovery on the text information using this method;
S4: Filter the discovered new words; the output new words are regarded as new hotspots of online public opinion and new trends in public opinion;
S5: Display the results on a front-end page.
Scene 2:
A large-scale corpus text classification system can provide real-time classification and query services for large-scale text corpora such as gio signal and government affairs documents. Fig. 4 is a flowchart of a word acquisition method based on a large-scale corpus text classification system according to an embodiment of the present invention. As shown in Fig. 4, the acquisition method includes:
S1: Read in the corpus data and store it in RDD format.
S2: Convert the RDD into a Dataset organized from the RDD.
S3: Convert the words in the Dataset into arrays organized from the RDD.
S4: Calculate the word frequency of each array in the document.
S5: Calculate TF-IDF (term frequency-inverse document frequency), a weighting commonly used in information retrieval and data mining.
S6: Train a Bayesian model and persist it as a file.
S7: Input the text to be analyzed.
S8: Call the previously trained topic model and calculate the topic value of the current text.
S9: Query the topic dictionary according to the calculated value and output the topic of the current text.
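Steps S4 and S5 can be illustrated with a small, self-contained TF-IDF computation. This is a sketch in plain Python rather than Spark, using the textbook unsmoothed formula tf × log(N / df); the function name is hypothetical, and library implementations often apply smoothing, so their values may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each tokenized document, a mapping from term to TF-IDF weight."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)  # S4: word frequency within the document
        total = len(doc)
        result.append({
            term: (count / total) * math.log(n_docs / df[term])  # S5
            for term, count in counts.items()
        })
    return result
```

A term that appears in every document receives weight 0 under this variant, which is why such terms contribute nothing to the classifier trained in S6.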
Scene 3:
The public opinion hotspots and popular vocabulary analyzed and tracked in real time by this system, combined with user interest segmentation and user profiles, can be used to make personalized recommendations to users and to push real-time news or other information that matches user preferences. Fig. 5 is a flowchart of a word acquisition method based on personalized recommendation vocabulary according to an embodiment of the present invention. As shown in Fig. 5, the acquisition method includes:
S1: Crawl microblog, WeChat, and other internet data, and store it in serialized form;
S2: Clean the data, remove interfering information, and consolidate the data into structured information;
S3: Load the data using Apache Spark, and perform new word discovery on the text information using this method;
S4: Filter the discovered new words; the output new words are regarded as new hotspots of online public opinion and new trends in public opinion, and these hotspots are mapped to the publishing users and locations;
S5: Using the publishing users associated with the new words, together with other information, analyze the users and build a personalized profile of each user.
From the description of the above implementations, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
Embodiment 2
This embodiment also provides a word acquisition device. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a structural block diagram of a word acquisition device according to an embodiment of the present invention. As shown in Fig. 6, the device includes a determining module 62, a first filtering module 64, a second filtering module 66, and a screening module 68.
The determining module 62 is configured to extract multiple text features from the determined word sequence and determine the metric information corresponding to each text feature;
The first filtering module 64 is configured to filter the words in the word sequence by a preset word length threshold and extract the word set that meets the preset word length threshold;
The second filtering module 66 is configured to fuse the metric information corresponding to the multiple text features to obtain a metric threshold, filter the word set that meets the preset word length threshold by the metric threshold, and extract the candidate word set that meets the metric threshold;
The screening module 68 is configured to screen the candidate word set according to preset screening criteria to obtain the specified words.
Optionally, the device is further configured to obtain a word segmentation result after performing word segmentation on the input original corpus, and to convert the word segmentation result into the word sequence according to the order in which words appear in the word segmentation result.
Optionally, the determining module 62 includes:
a first determining unit, configured to perform frequency statistics on each word in the word sequence and determine the frequency of occurrence of the words of the word sequence according to the result of the frequency statistics;
a second determining unit, configured to compute the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence and, from the results, determine the PMI sequence of the word sequence in the order in which the words appear;
a third determining unit, configured to compute the symmetric conditional probability (SCP) of each word in the word sequence and, from the results, form the SCP sequence of the word sequence in the order in which the words appear;
a fourth determining unit, configured to compute the adjacency entropy of each word in the word sequence and, from the results, determine the adjacency entropy sequence of the word sequence in the order in which the words appear;
a fifth determining unit, configured to compute the word overlap rate and length information gain of each word in the word sequence and, from the results, determine the information content sequence of the word sequence in the order in which the words appear.
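Two of the metrics computed by these units, PMI and adjacency entropy, can be sketched as follows. The formulas are the standard textbook definitions, not ones quoted from this document; the function names are hypothetical, and taking the smaller of the left and right entropies is an assumed convention. Which pair of adjacent strings is passed to `pmi` is left to the caller.

```python
import math
from collections import Counter
from typing import Iterable


def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2(p_xy / (p_x * p_y))


def neighbor_entropy(neighbors: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the characters adjacent to a word."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def adjacency_entropy(text: str, word: str) -> float:
    """Smaller of the left and right neighbor entropies of `word` in `text`.

    Returns 0.0 when the word has no left or no right neighbor at all.
    """
    left, right = [], []
    start = text.find(word)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(word)
        if end < len(text):
            right.append(text[end])
        start = text.find(word, start + 1)  # allow overlapping occurrences
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A high PMI suggests that the parts of a string occur together more often than chance, and a high adjacency entropy suggests that the string appears in varied contexts; both are commonly read as evidence that the string is a word.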
Optionally, the screening module 68 includes:
a first judging unit, configured to determine whether the frequency of occurrence of each word in the candidate word set is greater than a preset word frequency threshold, and to determine the words whose frequency is greater than the preset word frequency threshold to be the specified words;
a second judging unit, configured to determine whether each word in the candidate word set belongs to the stop-word list, and to determine the words that do not belong to the stop-word list to be specified words;
a third judging unit, configured to determine whether the frequency of occurrence of the longest substring of each character string in the candidate word set is equal to the frequency of occurrence of the parent string of that longest substring, and to determine a longest substring whose frequency of occurrence is equal to that of its parent string not to be a specified word;
a fourth judging unit, configured to determine whether each word in the candidate word set belongs to the user dictionary, and to determine the words that belong to the user dictionary not to be specified words.
It should be noted that each of the above modules can be implemented by software or hardware. In the latter case, this can be done in the following ways, without being limited to them: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
Optionally, the device described in this embodiment may run on equipment of the Apache Spark platform. Fig. 7 is a structural diagram of equipment running on the Apache Spark platform according to an embodiment of the present invention, which is used to implement the functions of the device described above.
Embodiment 3
An embodiment of the present invention also provides a storage medium. The storage medium includes a stored program, and the program, when run, executes the method described in any of the above embodiments.
Optionally, in this embodiment, the storage medium may be configured to store program code for executing the following steps:
S1: Extract multiple text features from the determined word sequence, and determine the metric information corresponding to each text feature;
S2: Filter the words in the word sequence by a preset word length threshold, and extract the word set that meets the preset word length threshold;
S3: Fuse the metric information corresponding to the multiple text features to obtain a metric threshold, filter the word set that meets the preset word length threshold by the metric threshold, and extract the candidate word set that meets the metric threshold;
S4: Screen the candidate word set according to preset screening criteria to obtain the specified words.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
An embodiment of the present invention also provides a processor. The processor is used to run a program, and the program, when run, executes the steps of any of the above methods.
Optionally, in this embodiment, the program is used to execute the following steps:
S1: Extract multiple text features from the determined word sequence, and determine the metric information corresponding to each text feature;
S2: Filter the words in the word sequence by a preset word length threshold, and extract the word set that meets the preset word length threshold;
S3: Fuse the metric information corresponding to the multiple text features to obtain a metric threshold, filter the word set that meets the preset word length threshold by the metric threshold, and extract the candidate word set that meets the metric threshold;
S4: Screen the candidate word set according to preset screening criteria to obtain the specified words.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from the one given here, or they can each be made into an individual integrated circuit module, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention shall fall within the protection scope of the present invention.

Claims (14)

1. A word acquisition method, characterized by comprising:
extracting multiple text features from a determined word sequence, and determining the metric information corresponding to each text feature;
filtering the words in the word sequence by a preset word length threshold, and extracting the word set that meets the preset word length threshold;
fusing the metric information corresponding to the multiple text features to obtain a metric threshold, filtering the word set that meets the preset word length threshold by the metric threshold, and extracting the candidate word set that meets the metric threshold;
screening the candidate word set according to preset screening criteria to obtain specified words.
2. The method according to claim 1, characterized in that the word sequence is determined in the following manner:
performing word segmentation on the input original corpus to obtain a word segmentation result;
converting the word segmentation result into the word sequence according to the order in which words appear in the word segmentation result.
3. The method according to claim 2, characterized in that extracting the multiple text features from the word sequence and determining the metric information corresponding to each text feature comprises at least one of the following:
performing frequency statistics on each word in the word sequence, and determining the frequency of occurrence of the words of the word sequence according to the result of the frequency statistics;
computing the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence and, from the results, determining the PMI sequence of the word sequence in the order in which the words appear;
computing the symmetric conditional probability (SCP) of each word in the word sequence and, from the results, forming the SCP sequence of the word sequence in the order in which the words appear;
computing the adjacency entropy of each word in the word sequence and, from the results, determining the adjacency entropy sequence of the word sequence in the order in which the words appear;
computing the word overlap rate and length information gain of each word in the word sequence and, from the results, determining the information content sequence of the word sequence in the order in which the words appear.
4. The method according to claim 3, characterized in that the adjacent character strings are the left longest substring and the right longest substring of the word.
5. The method according to claim 3, characterized in that the metric information is fused at least by the following formula to obtain the metric threshold a:
a = αa1 + βa2 + γa3 + δa4 + θa5;
where α, β, γ, δ and θ are condition coefficients whose values are not less than 0, and α + β + γ + δ + θ = 1; a1 is the frequency of occurrence of the word sequence, a2 is the PMI sequence of the word sequence, a3 is the SCP sequence of the word sequence, a4 is the adjacency entropy sequence of the word sequence, and a5 is the information content sequence of the word sequence.
6. The method according to claim 2, characterized in that screening the candidate word set according to the preset screening criteria to obtain the specified words comprises at least one of the following:
determining whether the frequency of occurrence of each word in the candidate word set is greater than a preset word frequency threshold, and determining the words whose frequency is greater than the preset word frequency threshold to be the specified words;
determining whether each word in the candidate word set belongs to the stop-word list, and determining the words that do not belong to the stop-word list to be specified words;
determining whether the frequency of occurrence of the longest substring of each character string in the candidate word set is equal to the frequency of occurrence of the parent string of that longest substring, and determining a longest substring whose frequency of occurrence is equal to that of its parent string not to be a specified word;
determining whether each word in the candidate word set belongs to the user dictionary, and determining the words that belong to the user dictionary not to be specified words.
7. The method according to any one of claims 1 to 6, characterized in that the method is applied to the Apache Spark platform.
8. A word acquisition device, characterized by comprising:
a determining module, configured to extract multiple text features from a determined word sequence and determine the metric information corresponding to each text feature;
a first filtering module, configured to filter the words in the word sequence by a preset word length threshold and extract the word set that meets the preset word length threshold;
a second filtering module, configured to fuse the metric information corresponding to the multiple text features to obtain a metric threshold, filter the word set that meets the preset word length threshold by the metric threshold, and extract the candidate word set that meets the metric threshold;
a screening module, configured to screen the candidate word set according to preset screening criteria to obtain specified words.
9. The device according to claim 8, characterized in that the device is further configured to perform word segmentation on the input original corpus to obtain a word segmentation result, and to convert the word segmentation result into the word sequence according to the order in which words appear in the word segmentation result.
10. The device according to claim 9, characterized in that the determining module comprises:
a first determining unit, configured to perform frequency statistics on each word in the word sequence and determine the frequency of occurrence of the words of the word sequence according to the result of the frequency statistics;
a second determining unit, configured to compute the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence and, from the results, determine the PMI sequence of the word sequence in the order in which the words appear;
a third determining unit, configured to compute the symmetric conditional probability (SCP) of each word in the word sequence and, from the results, form the SCP sequence of the word sequence in the order in which the words appear;
a fourth determining unit, configured to compute the adjacency entropy of each word in the word sequence and, from the results, determine the adjacency entropy sequence of the word sequence in the order in which the words appear;
a fifth determining unit, configured to compute the word overlap rate and length information gain of each word in the word sequence and, from the results, determine the information content sequence of the word sequence in the order in which the words appear.
11. The device according to claim 10, characterized in that the screening module comprises:
a first judging unit, configured to determine whether the frequency of occurrence of each word in the candidate word set is greater than a preset word frequency threshold, and to determine the words whose frequency is greater than the preset word frequency threshold to be the specified words;
a second judging unit, configured to determine whether each word in the candidate word set belongs to the stop-word list, and to determine the words that do not belong to the stop-word list to be specified words;
a third judging unit, configured to determine whether the frequency of occurrence of the longest substring of each character string in the candidate word set is equal to the frequency of occurrence of the parent string of that longest substring, and to determine a longest substring whose frequency of occurrence is equal to that of its parent string not to be a specified word;
a fourth judging unit, configured to determine whether each word in the candidate word set belongs to the user dictionary, and to determine the words that belong to the user dictionary not to be specified words.
12. Equipment running the Apache Spark platform, characterized by comprising the device according to any one of claims 8 to 11.
13. A storage medium, characterized in that the storage medium includes a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 7.
14. A processor, characterized in that the processor is used to run a program, wherein the program, when run, executes the method according to any one of claims 1 to 7.
CN201710414730.XA 2017-06-05 2017-06-05 Acquisition methods and device, storage medium, the processor of word Pending CN108984514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710414730.XA CN108984514A (en) 2017-06-05 2017-06-05 Acquisition methods and device, storage medium, the processor of word

Publications (1)

Publication Number Publication Date
CN108984514A true CN108984514A (en) 2018-12-11

Family

ID=64501310

Country Status (1)

Country Link
CN (1) CN108984514A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783649A (en) * 2019-01-02 2019-05-21 腾讯科技(深圳)有限公司 A kind of domain lexicon generation method and device
CN111488727A (en) * 2020-03-24 2020-08-04 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN113342936A (en) * 2021-06-08 2021-09-03 北京明略软件系统有限公司 Word formation compactness determining method and device, electronic equipment and storage medium
CN113779973A (en) * 2020-06-09 2021-12-10 杭州晨熹多媒体科技有限公司 Text data processing method and device
CN115858771A (en) * 2022-01-11 2023-03-28 北京中关村科金技术有限公司 Word searching method and device and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323382A (en) * 2011-07-20 2012-01-18 暨南大学 Multiple index lamination and fusion visualization method for detecting structural damages
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105786991A (en) * 2016-02-18 2016-07-20 中国科学院自动化研究所 Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106649334A (en) * 2015-10-29 2017-05-10 北京国双科技有限公司 Conjunction word set processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
梅莉莉: "基于领域特殊性和统计语言知识的新词抽取方法", 知网, pages 2 *
苏其龙: "微博新词发现研究", 知网, pages 2 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783649A (en) * 2019-01-02 2019-05-21 腾讯科技(深圳)有限公司 A kind of domain lexicon generation method and device
CN109783649B (en) * 2019-01-02 2023-01-24 腾讯科技(深圳)有限公司 Domain dictionary generating method and device
CN111488727A (en) * 2020-03-24 2020-08-04 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN111488727B (en) * 2020-03-24 2023-09-19 南阳柯丽尔科技有限公司 Word file parsing method, word file parsing apparatus, and computer-readable storage medium
CN113779973A (en) * 2020-06-09 2021-12-10 杭州晨熹多媒体科技有限公司 Text data processing method and device
CN113342936A (en) * 2021-06-08 2021-09-03 北京明略软件系统有限公司 Word formation compactness determining method and device, electronic equipment and storage medium
CN113342936B (en) * 2021-06-08 2024-03-22 北京明略软件系统有限公司 Word compactness determining method and device, electronic equipment and storage medium
CN115858771A (en) * 2022-01-11 2023-03-28 北京中关村科金技术有限公司 Word searching method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination