CN108984514A - Word acquisition method and device, storage medium, and processor - Google Patents
Word acquisition method and device, storage medium, and processor
- Publication number
- CN108984514A (application CN201710414730.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a word acquisition method and device, a storage medium, and a processor. The word acquisition method includes: extracting multiple text features from a determined word sequence, and determining the indicator information corresponding to each text feature; filtering the words in the word sequence by a preset word length threshold, and extracting the word set that satisfies the preset word length threshold; fusing the indicator information corresponding to the multiple text features to obtain an indicator threshold, filtering the word set that satisfies the preset word length threshold by the indicator threshold, and extracting the candidate word set that satisfies the indicator threshold; and screening the candidate word set according to preset screening indicators to obtain the specified words. The present invention solves the problem in the related art that discovering new words wastes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.
Description
Technical field
The present invention relates to the field of language processing, and in particular to a word acquisition method and device, a storage medium, and a processor.
Background
With the rapid development of information technology, information media have become an indispensable part of people's daily lives. People can browse domestic news on the Internet, for example on Sina and www.qq.com, and can publish personal opinions on social media such as Sina Weibo and WeChat. Meanwhile, the ultra-large-scale text generated by people's online activity has driven the rapid development of natural language processing technology. The construction of public opinion analysis platforms depends closely on natural language processing, so its accuracy and timeliness have become particularly important. On the one hand, accurate natural language processing techniques, such as Chinese word segmentation and new word discovery, ensure reliable information analysis results. On the other hand, efficient processing of ultra-large-scale text gives information users the most timely analysis results; for example, timely Chinese new word discovery facilitates monitoring current public opinion and following it up.
Chinese natural language processing is currently attracting more and more researchers and engineers, and with the leap-forward development of artificial intelligence it has gradually become one of the hot topics in that field. Meanwhile, the maturity of large-scale distributed computing has opened new perspectives for breakthroughs in natural language processing, for example the release of a series of products such as Baidu Search and the Sogou input method.
However, natural language processing still has many problems that are difficult to solve, new word discovery among them. Traditionally, whether a word counts as new depends mainly on when it appeared. For a given corpus dictionary, the words in it are defined as old words, that is, words that appeared in the past. A discovered new word can therefore be abstracted as a word that does not exist in the dictionary.
Existing methods fall mainly into statistics-based Chinese word segmentation and machine-learning-based Chinese word segmentation. The former relies mainly on Chinese word formation and on the statistical regularities with which characters and words occur in a corpus, but it requires a large amount of manual work to analyze and filter the data case by case, which takes too long. The latter is mainly dictionary-based and combines machine learning algorithms to perform segmentation, but the segmentation quality depends on the completeness of the dictionary; when the dictionary is largely incomplete, satisfactory results are hard to obtain. In addition, ultra-large-scale corpora, for example at the TB or PB level, pose a huge challenge for algorithm performance. The related art therefore has the problem that discovering new words wastes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.
Summary of the invention
The embodiments of the present invention provide a word acquisition method and device, a storage medium, and a processor, to at least solve the problem in the related art that discovering words wastes a large amount of manpower and material resources and depends heavily on a highly complete dictionary.
According to one embodiment of the present invention, a word acquisition method is provided, including: extracting multiple text features from a determined word sequence, and determining the indicator information corresponding to each text feature; filtering the words in the word sequence by a preset word length threshold, and extracting the word set that satisfies the preset word length threshold; fusing the indicator information corresponding to the multiple text features to obtain an indicator threshold, filtering the word set that satisfies the preset word length threshold by the indicator threshold, and extracting the candidate word set that satisfies the indicator threshold; and screening the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the word sequence is determined in the following manner: performing word segmentation on the input original corpus to obtain a segmentation result; and converting the segmentation result into the word sequence according to the order in which words appear in the segmentation result.
Optionally, extracting the multiple text features from the word sequence and determining the indicator information corresponding to each text feature includes at least one of the following: performing frequency statistics on each word in the word sequence, and determining the occurrence frequency of the words of the word sequence from the result; computing the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence, and determining the PMI sequence of the word sequence from the result in the order in which the words appear; computing the symmetric conditional probability (SCP) of each word in the word sequence, and forming the SCP sequence of the word sequence from the result in the order in which the words appear; computing the branch entropy of each word in the word sequence, and determining the branch entropy sequence of the word sequence from the result in the order in which the words appear; computing the word overlap rate and length information gain of each word in the word sequence, and determining the information content sequence of the word sequence from the result in the order in which the words appear.
Optionally, the adjacent character strings are the left longest substring and the right longest substring of the word.
Optionally, the indicator information is fused at least by the following formula to obtain the indicator threshold $a$: $a = \alpha a_1 + \beta a_2 + \gamma a_3 + \delta a_4 + \theta a_5$, where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are conditional coefficients not less than 0 with $\alpha + \beta + \gamma + \delta + \theta = 1$, $a_1$ is the occurrence frequency of the word sequence, $a_2$ is the PMI sequence of the word sequence, $a_3$ is the SCP sequence of the word sequence, $a_4$ is the branch entropy sequence of the word sequence, and $a_5$ is the information content sequence of the word sequence.
Optionally, screening the candidate word set according to the preset screening indicators to obtain the specified words includes at least one of the following: judging whether the occurrence frequency of each word in the candidate word set is greater than a preset word frequency threshold, and determining words whose frequency is greater than the preset word frequency threshold to be specified words; judging whether each word in the candidate word set belongs to a stop-word list, and determining words that do not belong to the stop-word list to be specified words; judging whether the occurrence frequency of a longest substring of each character string in the candidate word set equals the occurrence frequency of the parent string of that longest substring, and determining a longest substring whose occurrence frequency equals that of its parent string not to be a specified word; judging whether each word in the candidate word set belongs to a user dictionary, and determining words that belong to the user dictionary not to be specified words.
Optionally, the above method is applied on the Apache Spark platform.
According to another embodiment of the present invention, a word acquisition device is provided, including: a determining module, configured to extract multiple text features from a determined word sequence and determine the indicator information corresponding to each text feature; a first filtering module, configured to filter the words in the word sequence by a preset word length threshold and extract the word set that satisfies the preset word length threshold; a second filtering module, configured to fuse the indicator information corresponding to the multiple text features to obtain an indicator threshold, filter the word set that satisfies the preset word length threshold by the indicator threshold, and extract the candidate word set that satisfies the indicator threshold; and a screening module, configured to screen the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the device is further configured to perform word segmentation on the input original corpus to obtain a segmentation result, and to convert the segmentation result into the word sequence according to the order in which words appear in the segmentation result.
Optionally, the determining module includes: a first determining unit, configured to perform frequency statistics on each word in the word sequence and determine the occurrence frequency of the words of the word sequence from the result; a second determining unit, configured to compute the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence and determine the PMI sequence of the word sequence from the result in the order in which the words appear; a third determining unit, configured to compute the symmetric conditional probability (SCP) of each word in the word sequence and form the SCP sequence of the word sequence from the result in the order in which the words appear; a fourth determining unit, configured to compute the branch entropy of each word in the word sequence and determine the branch entropy sequence of the word sequence from the result in the order in which the words appear; and a fifth determining unit, configured to compute the word overlap rate and length information gain of each word in the word sequence and determine the information content sequence of the word sequence from the result in the order in which the words appear.
Optionally, the third processing module includes: a first judging unit, configured to judge whether the occurrence frequency of each word in the candidate word set is greater than a preset word frequency threshold, and determine words whose frequency is greater than the preset word frequency threshold to be specified words; a second judging unit, configured to judge whether each word in the candidate word set belongs to a stop-word list, and determine words that do not belong to the stop-word list to be specified words; a third judging unit, configured to judge whether the occurrence frequency of a longest substring of each character string in the candidate word set equals the occurrence frequency of the parent string of that longest substring, and determine a longest substring whose occurrence frequency equals that of its parent string not to be a specified word; and a fourth judging unit, configured to judge whether each word in the candidate word set belongs to a user dictionary, and determine words that belong to the user dictionary not to be specified words.
According to yet another embodiment of the present invention, a device running the Apache Spark platform is provided, including the device described above.
According to yet another embodiment of the present invention, a storage medium is also provided. The storage medium includes a stored program, and the program, when run, executes the method described in any one of the above.
According to yet another embodiment of the present invention, a processor is also provided. The processor is configured to run a program, and the program, when run, executes the method described in any one of the above.
Through the present invention, the input corpus is converted into a word sequence; indicator information computed from the word sequence is fused to determine a threshold, and this threshold together with a preset threshold determines a candidate word set. The candidate word set is then screened using the corresponding screening indicators to finally determine the words. This does not require much manpower or material resources, nor does it place high demands on the completeness of the dictionary. It therefore solves the problem in the related art that discovering new words wastes a large amount of manpower and material resources and depends heavily on a highly complete dictionary, thereby avoiding complicated computation, saving manpower and material resources, and obtaining new words without relying on a dictionary.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal for a word acquisition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a word acquisition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a word acquisition method applied to a network monitoring system according to an embodiment of the present invention;
Fig. 4 is a flowchart of a word acquisition method based on a large-scale corpus text classification system according to an embodiment of the present invention;
Fig. 5 is a flowchart of a word acquisition method based on personalized recommendation vocabulary according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a word acquisition device according to an embodiment of the present invention;
Fig. 7 is a structural diagram of a device running on the Apache Spark platform according to an embodiment of the present invention.
Specific embodiment
Hereinafter, the present invention is described in detail with reference to the drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with each other.
It should be noted that the terms "first", "second", etc. in the description, the claims and the above drawings of the present invention are used to distinguish similar objects, and are not used to describe a specific order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking running on a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal for a word acquisition method according to an embodiment of the present invention. As shown in Fig. 1, the terminal 10 may include one or more (only one is shown in the figure) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the above electronic device. For example, the terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the word acquisition method in the embodiment of the present invention. The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory can be connected to the terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In one example, the transmission device 106 can be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
It should be pointed out that the process described in this embodiment is implemented on the Apache Spark platform. Of course, other platform systems capable of word processing are also within the protection scope of this embodiment and are not described further here.
It should be noted that word acquisition in this embodiment means acquiring new words, that is, words that do not exist in an existing dictionary. Of course, the new words in this embodiment are not limited to Chinese new words; other languages, such as English, French, Japanese, and the languages of other countries or regions, are also within the protection scope of this embodiment. In practice, slight adjustments can be made according to the language characteristics of each country or region.
This embodiment provides a word acquisition method running on the above terminal. Fig. 2 is a flowchart of a word acquisition method according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:
Step S202: extract multiple text features from the determined word sequence, and determine the indicator information corresponding to each text feature;
Optionally, the determination in step S202 can be implemented in at least one of the following ways:
(1) Perform frequency statistics on each word in the word sequence, and determine the occurrence frequency of the words of the word sequence from the result.
Specifically, the more frequently a word occurs in the word sequence, the more often it is used, and therefore the more likely it is to constitute a new word.
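The frequency statistic in (1) can be sketched as follows. This is a minimal illustration that assumes the word sequence is an in-memory Python list; the function names `word_frequencies` and `frequency_sequence` are hypothetical and do not come from the patent.

```python
from collections import Counter

def word_frequencies(word_sequence):
    """Relative occurrence frequency of every distinct word in the sequence."""
    total = len(word_sequence)
    if total == 0:
        return {}
    counts = Counter(word_sequence)
    return {word: count / total for word, count in counts.items()}

def frequency_sequence(word_sequence):
    """Frequencies listed in the order in which the words appear."""
    freqs = word_frequencies(word_sequence)
    return [freqs[word] for word in word_sequence]
```

Keeping a second, order-preserving view mirrors the text's requirement that each indicator sequence follows the order of the word sequence.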
(2) Extract the pointwise mutual information (PMI) between the adjacent character strings of each word in the word sequence, and form the PMI sequence of the word sequence from the PMI values in word order.
Specifically, PMI is an information measure of the correlation between two events. For a candidate new word, the larger the mutual information of its two longest substrings, i.e. the left longest substring and the right longest substring, the more likely the two are to form a new word. Conversely, the smaller the mutual information, the less likely the two are to form a new word. The magnitude of the mutual information therefore reflects the co-occurrence probability of the longest substrings in the candidate new word: the larger it is, the more likely the longest substrings are to form a new word.
For example, for a candidate new word $w = c_1 c_2 \dots c_n$, its two longest substrings are the left longest substring $w_{left} = c_1 c_2 \dots c_{n-1}$ and the right longest substring $w_{right} = c_2 c_3 \dots c_n$, and the mutual information of the candidate new word $w$ is
$$\mathrm{PMI}(w) = \log \frac{p(w)}{p(w_{left})\,p(w_{right})}$$
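The PMI formula above can be sketched in a few lines. The sketch assumes that the probabilities of the candidate and of its two longest substrings have already been estimated and are stored in a dictionary; the name `prob` and the function `pmi` are hypothetical.

```python
import math

def pmi(candidate, prob):
    """PMI of a candidate string from its left and right longest substrings.

    `prob` maps strings to estimated probabilities (an assumed input).
    """
    if len(candidate) < 2:
        raise ValueError("candidate needs at least two characters")
    left = candidate[:-1]   # c1 ... c(n-1)
    right = candidate[1:]   # c2 ... cn
    return math.log(prob[candidate] / (prob[left] * prob[right]))
```

A positive value means the candidate occurs more often than the product of its substring probabilities would suggest.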
(3) Extract the symmetric conditional probability (SCP) of each word in the word sequence, and form the SCP sequence of the word sequence from the SCP values in word order.
Specifically, SCP is a statistic that measures how tightly the characters inside a string are bound together. For a candidate word, the larger this value, the more tightly the characters forming the word are bound, and therefore the more likely the string is to constitute a word; conversely, the smaller the value, the less likely.
For example, for a candidate new word $w = c_1 c_2 \dots c_n$, the SCP is calculated as:
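The formula itself is not reproduced in this text. As an assumption, the sketch below uses a commonly cited definition of SCP for n-grams, which divides the squared probability of the string by the average, over all split points, of the product of the probabilities of the two parts: $\mathrm{SCP}(w) = p(w)^2 / \big(\tfrac{1}{n-1}\sum_{i=1}^{n-1} p(c_1 \dots c_i)\,p(c_{i+1} \dots c_n)\big)$. It may differ from the formula in the original patent; `prob` and `scp` are hypothetical names.

```python
def scp(candidate, prob):
    """Symmetric conditional probability under the assumed common definition.

    `prob` maps every prefix/suffix of the candidate to an estimated probability.
    """
    n = len(candidate)
    if n < 2:
        raise ValueError("candidate needs at least two characters")
    # Average of p(prefix) * p(suffix) over the n-1 possible split points.
    avg = sum(prob[candidate[:i]] * prob[candidate[i:]] for i in range(1, n)) / (n - 1)
    return prob[candidate] ** 2 / avg
```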
(4) Extract the branch entropy of each word in the word sequence, and form the branch entropy sequence of the word sequence from the branch entropy values in word order.
Specifically, branch entropy is a statistic that measures the uncertainty of the left and right adjacent characters of a candidate word. The higher the uncertainty, the richer the contexts in which the candidate word appears, and therefore the more likely it is to constitute a word.
For example:
For a candidate new word $w$, let character $x$ and character $y$ denote its left and right adjacent characters respectively. The left branch entropy $H_L(w)$, right branch entropy $H_R(w)$ and branch entropy $BE(w)$ of $w$ are calculated as follows:
$$H_L(w) = -\sum_{x} p(x \mid w) \log p(x \mid w)$$
$$H_R(w) = -\sum_{y} p(y \mid w) \log p(y \mid w)$$
$$BE(w) = \min\big(H_L(w), H_R(w)\big)$$
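The three entropy formulas above can be sketched by collecting the characters that occur immediately to the left and right of each occurrence of the candidate in a raw text. Working on a plain string, skipping occurrences at the text boundary, and returning 0 when either side has no neighbors are assumptions of this sketch, and `branch_entropy` is a hypothetical name.

```python
import math
from collections import Counter

def _entropy(neighbors):
    """Shannon entropy (natural log) of a neighbor-character count table."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

def branch_entropy(candidate, text):
    """BE(w) = min(HL(w), HR(w)) estimated from occurrences of `candidate` in `text`."""
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[text[start - 1]] += 1   # left adjacent character x
        if end < len(text):
            right[text[end]] += 1        # right adjacent character y
        start = text.find(candidate, start + 1)
    if not left or not right:
        return 0.0  # assumption: no observable context on one side
    return min(_entropy(left), _entropy(right))
```

Taking the minimum means a candidate scores highly only if both its left and right contexts are varied.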
(5) Extract the word overlap rate and length information gain of each word in the word sequence, and form the information content sequence of the word sequence from them in word order.
Optionally, before step S204, the method further includes:
Performing word segmentation on the input original corpus to obtain a segmentation result;
Outputting the segmentation result as the word sequence, using the output characteristics of Spark, in the order in which words appear in the segmentation result.
Optionally, the order of appearance is retained after segmentation because segmenting the input corpus against an existing dictionary will inevitably divide words that are not in the dictionary incorrectly; retaining the word order therefore provides the input that the subsequent steps analyze.
Step S204: filter the words in the word sequence by a preset word length threshold, and extract the word set that satisfies the preset word length threshold;
Specifically, the preset word length threshold is a word length that follows general linguistic rules for forming new words. In this embodiment, the preset length threshold is 5 or 6. A word whose length satisfies the preset length threshold of 5 or 6 therefore has the potential to constitute a new word.
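Step S204 can be sketched as a simple order-preserving filter. The text does not state whether "satisfying" the threshold means being no longer than it; the sketch assumes an upper bound, and `filter_by_length` and `max_length` are hypothetical names.

```python
def filter_by_length(word_sequence, max_length=5):
    """Keep words no longer than `max_length` (assumed reading of the threshold)."""
    return [word for word in word_sequence if len(word) <= max_length]
```

For example, with the default threshold of 5 a six-character string is dropped while shorter strings keep their original order.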
Step S206: fuse the indicator information corresponding to the multiple text features to obtain an indicator threshold, filter the word set that satisfies the preset word length threshold by the indicator threshold, and extract the candidate word set that satisfies the indicator threshold.
Optionally, the fusion can be expressed by the following formula:
$$a = \alpha a_1 + \beta a_2 + \gamma a_3 + \delta a_4 + \theta a_5$$
where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are conditional coefficients not less than 0 with $\alpha + \beta + \gamma + \delta + \theta = 1$, $a_1$ is the word frequency of the word sequence, $a_2$ is the PMI sequence of the word sequence, $a_3$ is the SCP sequence of the word sequence, $a_4$ is the branch entropy sequence of the word sequence, and $a_5$ is the information content sequence of the word sequence.
It should be pointed out that, as values determined empirically by the inventors, $\alpha$, $\beta$, $\gamma$, $\delta$ and $\theta$ are assigned as follows:
α=0.35, β=0.14, γ=0.30, δ=0.15, θ=0.06
Therefore, a word that preliminarily satisfies the preset word length threshold and also satisfies the indicator threshold $a$ can be regarded as a candidate new word with the potential to be a new word. For the convenience of subsequent input, the candidate new words also need to be sorted in the order in which the words appear after segmentation.
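The weighted fusion with the coefficients quoted above can be sketched as follows. The sketch assumes each indicator has already been scaled to a comparable range (the text does not say how), and it reads "satisfying the indicator threshold" as a fused score reaching a cut-off; the dictionary keys, `fused_score`, `select_candidates` and `cutoff` are hypothetical.

```python
# Coefficients quoted in the text; the dictionary keys are hypothetical labels.
WEIGHTS = {"freq": 0.35, "pmi": 0.14, "scp": 0.30, "be": 0.15, "info": 0.06}

def fused_score(indicators, weights=WEIGHTS):
    """Weighted sum of the five indicator values for one word."""
    return sum(weights[name] * indicators[name] for name in weights)

def select_candidates(scored_words, cutoff):
    """Keep words whose fused score reaches the cut-off, preserving input order."""
    return [word for word, score in scored_words if score >= cutoff]
```

Because `select_candidates` iterates over a list of `(word, score)` pairs, the candidates stay in the order in which the words appeared after segmentation.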
It should be pointed out that if some indicators cannot be computed because of a system problem, or because the word sequence contains garbled text or comes from a damaged document, only the indicators that can be computed need to be counted and the others are ignored, with their conditional coefficients set to 0. The remaining conditional coefficients are then reassigned by comparing the importance of the indicators. This preserves the accuracy of new word selection as far as possible, widens the applicability of new word extraction, and gives strong robustness against external factors.
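One possible way to reassign the coefficients is sketched below: the coefficients of unavailable indicators are set to 0 and the remaining ones are rescaled proportionally so that they again sum to 1. Proportional rescaling is an assumption of this sketch; the text only says the coefficients are reconfigured by comparing the importance of the indicators. `reweight` is a hypothetical name.

```python
def reweight(weights, available):
    """Zero the coefficients of missing indicators and renormalize the rest."""
    kept = {name: (w if name in available else 0.0) for name, w in weights.items()}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no usable indicator remains")
    return {name: w / total for name, w in kept.items()}
```

Rescaling keeps the relative importance of the surviving indicators unchanged while still satisfying the constraint that the coefficients sum to 1.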
Step S208: screen the candidate word set according to preset screening indicators to obtain the specified words.
Optionally, the specific screening indicators and screening methods are as follows:
(1) Judge whether the occurrence frequency of each word in the candidate word set is greater than a preset word frequency threshold, and determine words whose frequency is greater than the preset word frequency threshold to be specified words.
Specifically, a candidate word that occurs too rarely cannot satisfy the condition for constituting a new word. In an optional implementation of this embodiment, the preset word frequency threshold is set to 3. That is, when the occurrence frequency of a candidate word is less than 3, the candidate word is considered not to satisfy the condition for a new word and is filtered out; if its occurrence frequency is greater than 3, it is considered to satisfy the condition and is retained.
(2) Judge whether each word in the candidate word set belongs to a stop-word list, and determine words that do not belong to the stop-word list to be specified words.
Specifically, the stop-word list indicates words that are no longer in use. If a candidate new word belongs to the stop-word list, it likewise does not satisfy the condition for a new word and is filtered out; if it does not belong to the stop-word list, it is retained.
(3) Judge whether the occurrence frequency of a longest substring of each character string in the candidate word set equals the occurrence frequency of the parent string of that longest substring.
Specifically, for an N-character string $c_1 c_2 \dots c_n$, its (N-1)-character longest substrings are $w_{left} = c_1 c_2 \dots c_{n-1}$ and $w_{right} = c_2 c_3 \dots c_n$. If the occurrence frequency of a longest substring equals that of its parent string, the longest substring appears in the text only as part of the parent string rather than as a word in its own right. Such a substring is therefore filtered out, and longest substrings whose frequency differs from that of the parent string are retained.
(4) Judge whether each word in the candidate word set belongs to a user dictionary, and determine the words that belong to the user dictionary not to be specified words.
Optionally, the candidate words are screened layer by layer through combined screening at the probability level, the lexicon level, and the character-string level, which finally determines the new word or set of new words that satisfies the above screening indices.
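The four screening indices above can be sketched as a single pass over the candidates. This is a minimal illustration and not the patented implementation: the function name `screen_candidates`, its parameters, and the sample data are hypothetical, and the handling of a count exactly equal to the threshold is an assumption, because the text only covers "less than" and "greater than".

```python
def screen_candidates(candidates, freq, stop_words, user_dict, min_freq=3):
    """Apply screening indices (1)-(4) to a list of candidate strings.

    candidates: iterable of candidate strings
    freq:       mapping from string to its occurrence count in the corpus
    stop_words: set of strings in the stop-word list
    user_dict:  set of strings already present in the user dictionary
    """
    candidates = list(candidates)

    # Index (3): a longest substring whose count equals its parent's count
    # only ever occurs inside that parent, so it is marked for removal.
    absorbed = set()
    for parent in candidates:
        parent_count = freq.get(parent, 0)
        if len(parent) < 2 or parent_count == 0:
            continue
        for sub in (parent[:-1], parent[1:]):  # w_left and w_right
            if freq.get(sub, 0) == parent_count:
                absorbed.add(sub)

    kept = []
    for word in candidates:
        # Index (1): keep only counts strictly greater than the threshold.
        # Dropping a count equal to the threshold is an assumption.
        if freq.get(word, 0) <= min_freq:
            continue
        # Index (2): drop words found in the stop-word list.
        if word in stop_words:
            continue
        # Index (3): drop substrings that never occur outside their parent.
        if word in absorbed:
            continue
        # Index (4): drop words the user dictionary already contains.
        if word in user_dict:
            continue
        kept.append(word)
    return kept
```

Running the checks in one loop keeps each index independent, so further indices such as word position or part of speech could be added as extra conditions.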
It should be pointed out that the above screening conditions are only examples and are not exhaustive. For example, screening by other indices, such as the position at which a word appears or its part of speech, also falls within the protection scope of this embodiment and is not described further here.
Optionally, the resilient distributed datasets (Resilient Distributed Dataset, RDD for short) distributed on Apache Spark are aggregated to obtain the new word or set of new words that satisfies the screening indices.
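The text does not say which Spark operations perform the aggregation. As a rough, pure-Python stand-in for the idea, the sketch below counts character n-grams per partition and then merges the partial counts, which is the role that operations such as `mapPartitions` and `reduceByKey` could play on an RDD. The function names, the `max_len` parameter, and the sample partitions are all hypothetical.

```python
from collections import Counter
from functools import reduce


def count_partition(lines, max_len=4):
    """Count every character n-gram of length 1..max_len in one partition."""
    counts = Counter()
    for line in lines:
        for n in range(1, max_len + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def merge_counts(a, b):
    """Combine two partial count tables into a new one."""
    merged = Counter(a)
    merged.update(b)
    return merged


# Two partitions of a toy corpus, processed independently and then merged.
partitions = [["abab", "ab"], ["ba"]]
total = reduce(merge_counts, (count_partition(p) for p in partitions), Counter())
```

Because merging count tables is associative and commutative, the partial results can be combined in any order, which is what makes the computation suitable for a distributed dataset.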
In addition, this embodiment provides the following practical application scenarios to aid understanding of the technical solution described above:
Scenario 1:
Public microblogs, WeChat Moments, and internet websites used in daily life generate a wealth of internet data. To ensure that the supervision of social security on the network and the guidance of public opinion are carried out effectively, it is very important to analyze and track public hot spots and trends in public opinion in real time. Fig. 3 is a flowchart of a word acquisition method applied to a network supervision system according to an embodiment of the present invention. The method specifically includes:
S1: crawl microblog, WeChat, and other internet data and store it in serialized form;
S2: clean the data, remove interference information, and consolidate the data into structured information;
S3: load the data with Apache Spark and perform new-word discovery on the text information using this method;
S4: filter the discovered new words; the output new words are regarded as new hot spots of online public opinion and new trends in public opinion;
S5: display the results on a front-end page.
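Step S2 of this scenario can be illustrated with a small cleaning routine. The text does not define what counts as interference information, so the sketch assumes that links, @-mentions, and redundant whitespace are the noise to remove. The `Post` record and its fields are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


@dataclass
class Post:
    """Structured record produced by the cleaning step (hypothetical schema)."""
    source: str
    user: str
    text: str


def clean_post(raw):
    """Turn one crawled record (a dict) into a Post, or None if nothing is left."""
    text = URL_RE.sub(" ", raw.get("text", ""))   # assumed noise: links
    text = MENTION_RE.sub(" ", text)              # assumed noise: @-mentions
    text = SPACE_RE.sub(" ", text).strip()        # collapse whitespace
    if not text:
        return None
    return Post(source=raw.get("source", "unknown"),
                user=raw.get("user", ""),
                text=text)
```

Note that `\w` matches CJK characters in Python 3, so a mention that is not followed by a space or punctuation would also swallow the text after it. A production cleaner would need a stricter pattern.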
Scenario 2:
A large-scale corpus text classification system can provide real-time classification and query services for large-scale text corpora such as gio signal and government affairs documents. Fig. 4 is a flowchart of a word acquisition method based on a large-scale corpus text classification system according to an embodiment of the present invention. As shown in Fig. 4, the acquisition method includes:
S1: read in the corpus data and store it in RDD format.
S2: convert the RDD into a Dataset that keeps the order of the RDD.
S3: convert the words in the Dataset into arrays that keep the order of the RDD.
S4: calculate the term frequency of each array in the document.
S5: calculate the term frequency-inverse document frequency (TF-IDF for short), a weighting commonly used in information retrieval and data mining.
S6: train a Bayesian model and persist it as a file.
S7: input the text to be analyzed.
S8: call the pre-trained topic model and calculate the topic value of that text.
S9: query the topic dictionary according to the calculated value and output the topic of the current text.
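Steps S4 and S5 can be made concrete with a plain-Python TF-IDF computation. This is a sketch using the textbook definition, term frequency multiplied by the logarithm of the inverse document frequency, and not the Spark implementation the scenario refers to. Libraries often apply smoothed variants, so their values may differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights
```

With this definition, a term that occurs in every document receives weight 0, which is why frequent function words contribute little to the classification in step S6.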
Scenario 3:
By combining the public-opinion hot spots and popular vocabulary analyzed and tracked in real time by this system with the user's hobby segmentation and user portrait, personalized recommendations can be made to the user, pushing real-time news or other information that matches the user's preferences. Fig. 5 is a flowchart of a word acquisition method based on a personalized recommendation vocabulary according to an embodiment of the present invention. As shown in Fig. 5, the acquisition method includes:
S1: crawl microblog, WeChat, and other internet data and store it in serialized form;
S2: clean the data, remove interference information, and consolidate the data into structured information;
S3: load the data with Apache Spark and perform new-word discovery on the text information using this method;
S4: filter the discovered new words; the output new words are regarded as new hot spots of online public opinion and new trends in public opinion, and these hot spots are mapped to the publishing users and locations;
S5: using the publishing users associated with the new words together with other information, analyze the users and build a personalized portrait of each user.
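Steps S4 and S5 of this scenario, mapping hot words to the users who published them and using the result for recommendation, might look like the sketch below. The text does not describe the portrait or the ranking method, so the per-user word counts and the overlap score used here are assumptions, and `build_profiles` and `recommend` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_profiles(posts, hot_words):
    """posts: iterable of (user, text) pairs; hot_words: discovered new words.

    Returns {user: Counter of hot-word occurrences in that user's posts}.
    """
    profiles = defaultdict(Counter)
    for user, text in posts:
        for word in hot_words:
            hits = text.count(word)
            if hits:
                profiles[user][word] += hits
    return profiles


def recommend(profile, items, top_n=3):
    """Rank items ({title: text}) by the user's counts of the hot words they mention."""
    scored = []
    for title, text in items.items():
        score = sum(count for word, count in profile.items() if word in text)
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties keep the order of `items`.
    scored.sort(key=lambda pair: -pair[0])
    return [title for _, title in scored[:top_n]]
```

Substring matching is used here only to keep the example short. Matching on segmented words would avoid false hits where a hot word happens to occur inside a longer, unrelated word.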
From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present invention.
Embodiment 2
This embodiment further provides a word acquisition device. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a structural block diagram of a word acquisition device according to an embodiment of the present invention. As shown in Fig. 6, the device includes a determining module 62, a first filtering module 64, a second filtering module 66, and a screening module 68.
The determining module 62 is configured to extract multiple text features from the determined word sequence and to determine the indicator information corresponding to each text feature;
The first filtering module 64 is configured to filter the words in the word sequence by a preset word-length threshold and to extract the word set that satisfies the preset word-length threshold;
The second filtering module 66 is configured to fuse the indicator information corresponding to the multiple text features to obtain an index threshold, to filter the word set that satisfies the preset word-length threshold by the index threshold, and to extract the candidate word set that satisfies the index threshold;
The screening module 68 is configured to screen the candidate word set according to preset screening indices to obtain the specified words.
Optionally, the device is further configured to perform word segmentation on the input original corpus to obtain a word segmentation result, and to convert the word segmentation result into the word sequence according to the order in which the words appear in the word segmentation result.
Optionally, the determining module 62 includes:
A first determining unit, configured to perform frequency statistics on each word in the word sequence and to determine the occurrence frequency of the words of the word sequence from the result of the frequency statistics;
A second determining unit, configured to compute statistics on the mutual information (PMI) between the adjacent character strings of each word in the word sequence, and to determine the PMI sequence of the word sequence from the statistical results in the order in which the words appear;
A third determining unit, configured to compute statistics on the symmetric conditional probability (SCP) of each word in the word sequence, and to form the SCP sequence of the word sequence from the statistical results in the order in which the words appear;
A fourth determining unit, configured to compute statistics on the adjacent entropy of each word in the word sequence, and to determine the adjacent-entropy sequence of the word sequence from the statistical results in the order in which the words appear;
A fifth determining unit, configured to compute statistics on the word overlap rate and length information gain of each word in the word sequence, and to determine the information-content sequence of the word sequence from the statistical results in the order in which the words appear.
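The first four indicators can be illustrated on raw character n-gram counts. The sketch below uses textbook definitions rather than the formulas of this patent, which are not given in this passage: PMI as log(p(w) / (p(w_left) · p(w_right))), SCP as p(w)² / (p(w_left) · p(w_right)), and adjacent entropy as the smaller of the left-neighbour and right-neighbour entropies. Taking w_left and w_right as the two longest substrings follows claim 4. Estimating probabilities as count divided by text length is a simplifying assumption, and all function names are hypothetical. The fifth indicator is omitted because it is not defined here.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_len=3):
    """Count n-grams (1..max_len) and the characters seen on either side of each."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return counts, left, right


def entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbours.values())


def features(word, counts, left, right, total_chars):
    """Return frequency, PMI, SCP and adjacent entropy for one candidate."""
    if len(word) < 2 or counts[word] == 0:
        raise ValueError("word must have length >= 2 and occur in the text")

    def p(s):
        # Crude probability estimate: count over text length (an assumption).
        return counts[s] / total_chars

    w_left, w_right = word[:-1], word[1:]
    denom = p(w_left) * p(w_right)
    return {
        "freq": counts[word],
        "pmi": math.log(p(word) / denom),
        "scp": p(word) ** 2 / denom,
        "adjacent_entropy": min(entropy(left[word]), entropy(right[word])),
    }
```

A candidate with high PMI or SCP is internally cohesive, while a high adjacent entropy means it appears in varied contexts. Both properties are commonly taken as evidence that a string is a standalone word.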
Optionally, the screening module 68 includes:
A first judging unit, configured to judge whether the occurrence frequency of each word in the candidate word set is greater than a preset word-frequency threshold, and to determine the words whose frequency is greater than the preset word-frequency threshold to be the specified words;
A second judging unit, configured to judge whether each word in the candidate word set belongs to a stop-word list, and to determine the words that do not belong to the stop-word list to be specified words;
A third judging unit, configured to judge whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring, and to determine a longest substring whose occurrence frequency is equal to that of the parent string not to be a specified word;
A fourth judging unit, configured to judge whether each word in the candidate word set belongs to a user dictionary, and to determine the words that belong to the user dictionary not to be specified words.
It should be noted that the above modules may be implemented by software or hardware. For the latter, this may be done in the following ways, but is not limited to them: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
Optionally, the above device described in this embodiment may run on equipment of the Apache Spark platform. Fig. 7 is a structural diagram of equipment running on the Apache Spark platform according to an embodiment of the present invention, which is used to implement the functions of the device described above.
Embodiment 3
An embodiment of the present invention further provides a storage medium. The storage medium includes a stored program, and the program, when run, executes the method described in any of the above embodiments.
Optionally, in this embodiment, the above storage medium may be configured to store program code for executing the following steps:
S1: extract multiple text features from the determined word sequence, and determine the indicator information corresponding to each text feature;
S2: filter the words in the word sequence by a preset word-length threshold, and extract the word set that satisfies the preset word-length threshold;
S3: fuse the indicator information corresponding to the multiple text features to obtain an index threshold, filter the word set that satisfies the preset word-length threshold by the index threshold, and extract the candidate word set that satisfies the index threshold;
S4: screen the candidate word set according to preset screening indices to obtain the specified words.
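Steps S2 and S3 can be sketched with the weighted fusion given in claim 5, A = α·a1 + β·a2 + γ·a3 + δ·a4 + θ·a5 with non-negative coefficients that sum to 1. How the fused value is then used for filtering is not spelled out in this passage. The sketch assumes that each candidate receives its own fused score, that the score is compared against a cutoff, that the length threshold is an upper bound, and that the five indicators have already been scaled to comparable ranges. The names `fuse` and `select_candidates` are hypothetical.

```python
import math


def fuse(indicators, weights):
    """Weighted sum of indicator values; weights must be >= 0 and sum to 1."""
    if len(indicators) != len(weights):
        raise ValueError("need one weight per indicator")
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * a for w, a in zip(weights, indicators))


def select_candidates(scored, weights, max_len, cutoff):
    """scored: {word: (a1, a2, a3, a4, a5)}. Returns the candidate word list."""
    kept = []
    for word, indicators in scored.items():
        # S2: length filter (treating the threshold as an upper bound is assumed).
        if len(word) > max_len:
            continue
        # S3: keep words whose fused score reaches the cutoff (assumed rule).
        if fuse(indicators, weights) >= cutoff:
            kept.append(word)
    return kept
```

Applying the cheap length filter before the fusion keeps the more expensive indicator arithmetic off strings that would be discarded anyway.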
Optionally, in this embodiment, the above storage medium may include, but is not limited to, a USB flash drive, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a removable hard disk, a magnetic disk, an optical disc, and other media that can store program code.
An embodiment of the present invention further provides a processor. The processor is used to run a program, and the program, when run, executes the steps of any of the above methods.
Optionally, in this embodiment, the above program is used to execute the following steps:
S1: extract multiple text features from the determined word sequence, and determine the indicator information corresponding to each text feature;
S2: filter the words in the word sequence by a preset word-length threshold, and extract the word set that satisfies the preset word-length threshold;
S3: fuse the indicator information corresponding to the multiple text features to obtain an index threshold, filter the word set that satisfies the preset word-length threshold by the index threshold, and extract the candidate word set that satisfies the index threshold;
S4: screen the candidate word set according to preset screening indices to obtain the specified words.
Optionally, for specific examples of this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. In some cases, the steps shown or described can be executed in an order different from the one given here, or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. In this way, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the principles of the present invention shall be included in the protection scope of the present invention.
Claims (14)
1. A word acquisition method, characterized by comprising:
extracting multiple text features from a determined word sequence, and determining the indicator information corresponding to each text feature;
filtering the words in the word sequence by a preset word-length threshold, and extracting the word set that satisfies the preset word-length threshold;
fusing the indicator information corresponding to the multiple text features to obtain an index threshold, filtering the word set that satisfies the preset word-length threshold by the index threshold, and extracting the candidate word set that satisfies the index threshold;
screening the candidate word set according to preset screening indices to obtain specified words.
2. The method according to claim 1, characterized in that the word sequence is determined in the following manner:
performing word segmentation on the input original corpus to obtain a word segmentation result;
converting the word segmentation result into the word sequence according to the order in which the words appear in the word segmentation result.
3. The method according to claim 2, characterized in that extracting the multiple text features from the word sequence and determining the indicator information corresponding to each text feature includes at least one of the following:
performing frequency statistics on each word in the word sequence, and determining the occurrence frequency of the words of the word sequence from the result of the frequency statistics;
computing statistics on the mutual information (PMI) between the adjacent character strings of each word in the word sequence, and determining the PMI sequence of the word sequence from the statistical results in the order in which the words appear;
computing statistics on the symmetric conditional probability (SCP) of each word in the word sequence, and forming the SCP sequence of the word sequence from the statistical results in the order in which the words appear;
computing statistics on the adjacent entropy of each word in the word sequence, and determining the adjacent-entropy sequence of the word sequence from the statistical results in the order in which the words appear;
computing statistics on the word overlap rate and length information gain of each word in the word sequence, and determining the information-content sequence of the word sequence from the statistical results in the order in which the words appear.
4. The method according to claim 3, characterized in that the adjacent character strings are the left longest substring and the right longest substring of the word.
5. The method according to claim 3, characterized in that the indicator information is fused by at least the following formula to obtain the index threshold A:
A = α·a1 + β·a2 + γ·a3 + δ·a4 + θ·a5;
where α, β, γ, δ, and θ are condition coefficients whose values are not less than 0, with α + β + γ + δ + θ = 1; a1 is the occurrence frequency of the word sequence, a2 is the PMI sequence of the word sequence, a3 is the SCP sequence of the word sequence, a4 is the adjacent-entropy sequence of the word sequence, and a5 is the information-content sequence of the word sequence.
6. The method according to claim 2, characterized in that screening the candidate word set according to preset screening indices to obtain the specified words includes at least one of the following:
judging whether the occurrence frequency of each word in the candidate word set is greater than a preset word-frequency threshold, and determining the words whose frequency is greater than the preset word-frequency threshold to be the specified words;
judging whether each word in the candidate word set belongs to a stop-word list, and determining the words that do not belong to the stop-word list to be specified words;
judging whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring, and determining a longest substring whose occurrence frequency is equal to that of the parent string not to be a specified word;
judging whether each word in the candidate word set belongs to a user dictionary, and determining the words that belong to the user dictionary not to be specified words.
7. The method according to any one of claims 1 to 6, characterized in that the method is applied to the Apache Spark platform.
8. A word acquisition device, characterized by comprising:
a determining module, configured to extract multiple text features from a determined word sequence and to determine the indicator information corresponding to each text feature;
a first filtering module, configured to filter the words in the word sequence by a preset word-length threshold and to extract the word set that satisfies the preset word-length threshold;
a second filtering module, configured to fuse the indicator information corresponding to the multiple text features to obtain an index threshold, to filter the word set that satisfies the preset word-length threshold by the index threshold, and to extract the candidate word set that satisfies the index threshold;
a screening module, configured to screen the candidate word set according to preset screening indices to obtain specified words.
9. The device according to claim 8, characterized in that the device is further configured to perform word segmentation on the input original corpus to obtain a word segmentation result, and to convert the word segmentation result into the word sequence according to the order in which the words appear in the word segmentation result.
10. The device according to claim 9, characterized in that the determining module comprises:
a first determining unit, configured to perform frequency statistics on each word in the word sequence and to determine the occurrence frequency of the words of the word sequence from the result of the frequency statistics;
a second determining unit, configured to compute statistics on the mutual information (PMI) between the adjacent character strings of each word in the word sequence, and to determine the PMI sequence of the word sequence from the statistical results in the order in which the words appear;
a third determining unit, configured to compute statistics on the symmetric conditional probability (SCP) of each word in the word sequence, and to form the SCP sequence of the word sequence from the statistical results in the order in which the words appear;
a fourth determining unit, configured to compute statistics on the adjacent entropy of each word in the word sequence, and to determine the adjacent-entropy sequence of the word sequence from the statistical results in the order in which the words appear;
a fifth determining unit, configured to compute statistics on the word overlap rate and length information gain of each word in the word sequence, and to determine the information-content sequence of the word sequence from the statistical results in the order in which the words appear.
11. The device according to claim 10, characterized in that the screening module comprises:
a first judging unit, configured to judge whether the occurrence frequency of each word in the candidate word set is greater than a preset word-frequency threshold, and to determine the words whose frequency is greater than the preset word-frequency threshold to be the specified words;
a second judging unit, configured to judge whether each word in the candidate word set belongs to a stop-word list, and to determine the words that do not belong to the stop-word list to be specified words;
a third judging unit, configured to judge whether the occurrence frequency of a longest substring of each character string in the candidate word set is equal to the occurrence frequency of the parent string of that longest substring, and to determine a longest substring whose occurrence frequency is equal to that of the parent string not to be a specified word;
a fourth judging unit, configured to judge whether each word in the candidate word set belongs to a user dictionary, and to determine the words that belong to the user dictionary not to be specified words.
12. Equipment running the Apache Spark platform, characterized by comprising the device according to any one of claims 8 to 11.
13. A storage medium, characterized in that the storage medium includes a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 7.
14. A processor, characterized in that the processor is used to run a program, wherein the program, when run, executes the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710414730.XA CN108984514A (en) | 2017-06-05 | 2017-06-05 | Acquisition methods and device, storage medium, the processor of word |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710414730.XA CN108984514A (en) | 2017-06-05 | 2017-06-05 | Acquisition methods and device, storage medium, the processor of word |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108984514A true CN108984514A (en) | 2018-12-11 |
Family
ID=64501310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710414730.XA Pending CN108984514A (en) | 2017-06-05 | 2017-06-05 | Acquisition methods and device, storage medium, the processor of word |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108984514A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||