CN108038119A - Method, apparatus and storage medium for discovering investment targets using new word discovery
Publication number: CN108038119A
Application number: CN201711059221.6A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention provides a method for discovering investment targets using new word discovery, comprising: preprocessing the material in a corpus to obtain corpus text data; reading a preprocessed corpus text, and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text; merging adjacent word segments of the corpus text into candidate new words; screening out the true new words of the corpus text by comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and calculating the mutual information between the screened new words and the company names in the corpus, and extracting the company names and new words whose mutual information meets a preset condition as reference investment targets. The invention also provides an electronic device and a computer-readable storage medium. By using new words screened from a news corpus to extract investment targets, the invention improves the efficiency and accuracy of investment.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method, an electronic device and a computer-readable storage medium for discovering investment targets using new word discovery.
Background
At present, when evaluating investment targets, investors lack observations that link investees to hot topics, yet such observations can to some extent improve their understanding of an investment target's business plan, research focus, business growth, raw material demand, team building and so on.
With the spread of the Internet, each news website publishes thousands of news items every day, and the news is updated in real time. If the current market's hot topics, and the enterprises involved in them, can be extracted and analyzed from massive news corpora, investors can learn about the relevant plans, R&D directions or potential needs of target enterprises, and thereby discover and seize business opportunities. Therefore, how to extract and analyze new words from a news corpus, and how to use the extracted new words to discover investment targets, is a problem that urgently needs to be solved.
Summary of the invention
The present invention provides a method, an electronic device and a computer-readable storage medium for discovering investment targets using new word discovery. Its main purpose is to screen and analyze new words from a news corpus and to use the screened new words to extract investment targets.
To achieve the above object, the present invention provides an electronic device comprising a memory and a processor. The memory stores a program for discovering investment targets using new word discovery that can run on the processor, and the program, when executed by the processor, implements the following steps:
A1, preprocessing the material in a corpus to obtain corpus text data and form a corpus text set;
A2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
A3, merging adjacent word segments of the corpus text into candidate new words to form the candidate new word set of the corpus text;
A4, screening out the true new words of the corpus text by comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
A5, calculating the mutual information between the screened new words and the company names in the corpus, and extracting the company names and new words whose mutual information meets a preset condition as reference investment targets.
Preferably, step A4 comprises:
A41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
A42, calculating the solidification degree of each candidate new word selected in step A41, and selecting those whose solidification degree exceeds a second preset threshold; and
A43, calculating the degree of freedom of each candidate new word selected in step A42, and taking those whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
Preferably, the step of "calculating the degree of freedom of each candidate new word selected in step A42" comprises:
calculating the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word selected in step A42; and
taking the smaller of the left-neighbor and right-neighbor information entropies of each candidate new word as the degree of freedom of that word.
In addition, to achieve the above object, the present invention also provides a method for discovering investment targets using new word discovery, the method comprising:
S1, preprocessing the material in a corpus to obtain corpus text data and form a corpus text set;
S2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
S3, merging adjacent word segments of the corpus text into candidate new words to form the candidate new word set of the corpus text;
S4, screening out the true new words of the corpus text by comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
S5, calculating the mutual information between the screened new words and the company names in the corpus, and extracting the company names and new words whose mutual information meets a preset condition as reference investment targets.
Preferably, step S4 comprises:
S41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
S42, calculating the solidification degree of each candidate new word selected in step S41, and selecting those whose solidification degree exceeds a second preset threshold; and
S43, calculating the degree of freedom of each candidate new word selected in step S42, and taking those whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
Preferably, the step of "calculating the degree of freedom of each candidate new word selected in step S42" comprises:
calculating the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word selected in step S42; and
taking the smaller of the left-neighbor and right-neighbor information entropies of each candidate new word as the degree of freedom of that word.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing a program for discovering investment targets using new word discovery. When executed by a processor, the program implements any of the steps of the method described above.
The method, electronic device and computer-readable storage medium proposed by the present invention extract candidate new words from the corpus by segmenting the corpus text and removing stop words, screen out the true new words in the corpus text by calculating the word frequency, solidification degree and degree of freedom of the candidate new words, and finally determine the investment targets by calculating the mutual information between the new words and the company names in the corpus text, which improves the efficiency and accuracy of investment target extraction.
Brief description of the drawings
Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method of the present invention for discovering investment targets using new word discovery;
Fig. 2 is a schematic diagram of the modules of the program for discovering investment targets using new word discovery in Fig. 1;
Fig. 3 is a flowchart of a preferred embodiment of the method of the present invention for discovering investment targets using new word discovery;
Fig. 4 is a detailed flowchart of step S4 of the method of the present invention for discovering investment targets using new word discovery.
The realization of the object, the functional features and the advantages of the present invention are further described below in connection with the embodiments and with reference to the accompanying drawings.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The present invention provides a method for discovering investment targets using new word discovery, applied to an electronic device 1. Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method.
In this embodiment, the electronic device 1 may be a personal computer (PC), or a terminal device such as a smartphone, a tablet computer, an e-book reader or a portable computer.
The electronic device 1 comprises a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store the application software installed on the electronic device 1 and various kinds of data, such as the program 10 for discovering investment targets using new word discovery and the corpus 00, but also to temporarily store data that has been output or is to be output. Specifically, the material is crawled from various websites, for example news material. The corpus 00 holds a large amount of such material, and the present invention extracts new words from the material in the corpus 00 and explores investment targets according to the new words.
In some embodiments the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the program 10 for discovering investment targets using new word discovery.
The communication bus 13 is used to implement connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used to establish a communication connection between the device and other electronic equipment.
Fig. 1 shows only the electronic device 1 with components 11-14. It should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
Optionally, the device may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface.
Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display, which may also be called a display screen or display unit, is used to show the information processed in the electronic device 1 and to present a visual user interface.
In the device embodiment shown in Fig. 1, the memory 11 stores the program for discovering investment targets using new word discovery. When the processor 12 executes the program stored in the memory 11, the following steps are implemented:
A1, preprocessing the material in a corpus to obtain corpus text data and form a corpus text set;
A2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
A3, merging adjacent word segments of the corpus text into candidate new words to form the candidate new word set of the corpus text;
A4, screening out the true new words of the corpus text by comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
A5, calculating the mutual information between the screened new words and the company names in the corpus, and extracting the company names and new words whose mutual information meets a preset condition as reference investment targets.
The material may come from many different fields. This embodiment illustrates the scheme of the present invention with a news corpus as an example, but the scheme is not limited to the news field. When an investor needs to follow current hot news in order to learn about the relevant plans, R&D directions or potential needs of target enterprises, a web crawler is used to crawl online news from the Internet as the news corpus, for example news from Sina, Baidu and Tencent. Because hot news changes over time, the crawled news is filtered along the time dimension so that investors can follow the current hot news more accurately: a preset time period is set and only the news of that period is crawled, for example only the news of the current day. The crawled news is then deduplicated, and the news titles are stored in the corpus 00. Because the news material comes from diverse sources and therefore in many formats, it must be preprocessed to facilitate subsequent processing, yielding news corpus text data that forms a news corpus text set.
In a specific implementation, the preprocessing may include one or more of: unifying the format of the news material into a text format, removing advertising noise from the news material, and filtering out profanity, sensitive words and stop words. When the format is unified into a text format, content that current techniques cannot yet convert into text is filtered out.
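The time-window filtering and deduplication described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the `(title, crawl_date)` input shape and the sample headlines are all assumptions made for the example, and the crawling itself is out of scope.

```python
from datetime import date

def preprocess(items, day, banned_words=()):
    """Keep headlines crawled on `day`, drop duplicates, strip banned words.

    `items` is an iterable of (title, crawl_date) pairs (hypothetical shape).
    """
    seen = set()
    texts = []
    for title, crawled_on in items:
        title = " ".join(title.split())          # normalise whitespace
        if crawled_on != day or not title or title in seen:
            continue                             # outside window, empty, or duplicate
        seen.add(title)
        for word in banned_words:                # filter sensitive words
            title = title.replace(word, "")
        texts.append(title)
    return texts

# Hypothetical sample data.
news = [
    ("Emission rules tighten ", date(2017, 11, 1)),
    ("Emission rules tighten", date(2017, 11, 1)),   # duplicate after normalisation
    ("Old story", date(2017, 10, 30)),               # outside the time window
]
corpus_texts = preprocess(news, date(2017, 11, 1))
print(corpus_texts)
```

Deduplicating on the normalised title keeps the sketch simple; near-duplicate detection would need a different technique.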
After the news corpus text is obtained, each line of it is segmented in turn with a word segmentation tool, such as the Stanford Chinese word segmenter or jieba. For example, segmenting the sentence "went to see a film last night" may yield "yesterday | evening | go | see | [particle] | film". The segmentation result is retained. To further improve the usefulness of the segmentation result, stop words are removed from it, that is, function words that cannot reflect the topic of the news material, such as modal particles, adverbs, prepositions and conjunctions. These function words usually have no definite meaning of their own and play a role only within a complete sentence, for example the common "this", "that", "on", "under" and "where". In other embodiments, after the news corpus text is segmented, only the verb and/or noun segments of the segmentation result are finally retained; in the example above, only the word "film" may be retained. If the result for a line is empty after this processing, the corresponding line of text is filtered out. In other embodiments, the method for segmenting the news corpus text may also include one or more of: segmentation based on string matching, segmentation based on understanding, segmentation based on statistics and segmentation based on a dictionary.
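A minimal sketch of the stop-word step, assuming the text has already been segmented into tokens by an external tool (the description names jieba and the Stanford segmenter; neither is called here so that the example stays self-contained). The stop-word list and the English sample tokens are illustrative assumptions.

```python
# Illustrative stop-word list; a real deployment would load a full lexicon.
STOP_WORDS = {"to", "a", "the", "this", "that", "on", "under", "where"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop blank tokens and stop words from one segmented line."""
    return [t for t in tokens if t.strip() and t.lower() not in stop_words]

def clean_lines(segmented_lines):
    """Clean every line and filter out lines whose result is empty."""
    cleaned = (remove_stop_words(line) for line in segmented_lines)
    return [line for line in cleaned if line]

# Two already-segmented lines; the second consists only of stop words.
lines = [["went", "to", "see", "a", "film"], ["on", "the"]]
segments = clean_lines(lines)
print(segments)
```

The second line disappears entirely, which corresponds to filtering out a line whose segmentation result is empty.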
In other embodiments, to make it easier to determine the scope of subsequent segmentation, the news corpus text may first be split into lines before each news corpus text is segmented. The splitting may break the material at punctuation marks such as full stops, commas, exclamation marks and question marks.
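Splitting at punctuation can be sketched with a regular expression. The character set below, covering both Chinese and ASCII punctuation, is an assumption chosen for illustration.

```python
import re

# Full stop, comma, exclamation mark and question mark, in Chinese and ASCII forms.
# Note: splitting on "." also breaks decimal numbers such as "3.5".
_PUNCT = re.compile(r"[。，！？.,!?]+")

def split_lines(text):
    """Split a text into lines at punctuation, dropping empty pieces."""
    return [piece.strip() for piece in _PUNCT.split(text) if piece.strip()]

print(split_lines("Emissions fall, markets rise! Why?"))
```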
However, in the segmented news corpus text, a term that should be treated as a single word in some field may have been split into several terms, so new word discovery is needed. The adjacent word segments in the segmentation result are therefore merged to form the candidate new words of the news corpus text.
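Merging adjacent word segments amounts to generating n-grams over the segment sequence. The description does not say how many segments may be merged, so the `max_len` parameter below is an assumption, as is the `sep` parameter (empty for Chinese, a space for the English sample).

```python
def candidate_words(segments, max_len=2, sep=""):
    """Join every run of 2..max_len adjacent segments into a candidate new word."""
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(segments) - n + 1):
            candidates.append(sep.join(segments[i:i + n]))
    return candidates

segments = ["pollutant", "emission", "standard"]
print(candidate_words(segments, sep=" "))
print(candidate_words(segments, max_len=3, sep=" "))
```

Duplicates are kept on purpose, since the later frequency step counts how often each candidate occurs.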
Next, the true new words of the news corpus text need to be determined from its candidate new words. In other embodiments, step A4 specifically comprises:
A41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
A42, calculating the solidification degree of each candidate new word selected in step A41, and selecting those whose solidification degree exceeds a second preset threshold; and
A43, calculating the degree of freedom of each candidate new word selected in step A42, and taking those whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
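Steps A41 to A43 form a cascade in which each measure is computed only for the survivors of the previous step. The sketch below assumes the three measures are supplied as callables; the default thresholds (5, 0.02 and 1.92) are the example values given in the description, and the scores in the sample data are made up.

```python
def screen(candidates, freq_of, solidification_of, freedom_of,
           t1=5, t2=0.02, t3=1.92):
    """Apply the three threshold filters in order (A41, A42, A43)."""
    step1 = [w for w in candidates if freq_of(w) > t1]        # A41: word frequency
    step2 = [w for w in step1 if solidification_of(w) > t2]   # A42: solidification degree
    return [w for w in step2 if freedom_of(w) > t3]           # A43: degree of freedom

# Hypothetical precomputed scores; later tables only cover earlier survivors.
freq = {"ab": 9, "cd": 3, "ef": 7, "gh": 8}
solid = {"ab": 0.05, "ef": 0.01, "gh": 0.03}
free = {"ab": 2.1, "gh": 1.0}

true_new_words = screen(["ab", "cd", "ef", "gh"],
                        freq.__getitem__, solid.__getitem__, free.__getitem__)
print(true_new_words)
```

Ordering the filters from cheapest to most expensive keeps the entropy computation off candidates that would fail anyway.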
To extract new words from a news corpus text, it must first be clear what counts as a new word. The first thing to check is whether the word occurs often enough in a corpus text or in the corpus 00, that is, its word frequency. In this embodiment, word frequency is expressed through the inverse document frequency (IDF). IDF here characterizes how frequently a word appears in documents: the higher the frequency, the higher the probability that the word appears in different contexts, which reflects how widely the word is recognized across articles. In general, the higher the IDF, the higher the recognition and the more likely the word is a new word. If the IDF is very high, however, the word is very common and need not be added to the new word set, in particular to avoid polluting the set of new words. In this screening step, all candidate new words whose word frequency exceeds the first preset threshold (for example, more than 5 occurrences in one news corpus text) are selected.
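The example threshold ("more than 5 occurrences in one news corpus text") can be sketched with a plain occurrence count. Note that this counts raw frequency within one text rather than computing an IDF value, which is a simplification made for the example.

```python
from collections import Counter

def frequent_candidates(candidates, min_count=5):
    """Return {candidate: count} for candidates occurring more than min_count times."""
    counts = Counter(candidates)
    return {word: c for word, c in counts.items() if c > min_count}

# Hypothetical candidate list for one corpus text.
candidates = ["pollutant emission"] * 6 + ["emission standard"] * 5
print(frequent_candidates(candidates))
```

A candidate with exactly 5 occurrences is rejected, because the condition is "more than" the threshold.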
However, a candidate new word selected by the above screening may not be a word at all, but a phrase made up of several words. Therefore, besides meeting the word frequency requirement, the solidification degree of the candidate new word must also be considered, that is, how likely each character of the candidate new word is to appear together with its other characters. For example, in one corpus text "film" occurs 389 times while "cinema" occurs only 175 times, yet we are more inclined to treat "cinema" as a word, because intuitively the characters for "film" and "hall" are bound together more tightly. The words with the highest solidification degree are words such as "bat", "spider", "hesitating" and "perturbed" in the original Chinese: each character of these words almost always occurs together with the other, even in other contexts. In this screening step, all candidate new words whose solidification degree exceeds the second preset threshold (for example, 0.02) are selected.
Specifically, taking a two-element word as an example, let the probabilities that character A and character B occur individually be P(A) and P(B). If the two characters were independent, the probability that they occur together would be P(A) * P(B). If they are not independent, the probability that they occur together exceeds this product, i.e. P(C) >> P(A) * P(B). That is, for the solidification degree of a candidate new word to exceed the second preset threshold, the following condition must be met:
P(C) - P(A) * P(B) > m
where A and B denote the characters in the candidate new word, P(C) is the probability that A and B occur together, and m is the second preset threshold.
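The condition P(C) - P(A) * P(B) > m can be evaluated directly from occurrence counts. In the sketch below the counts 389 and 175 come from the "film" and "cinema" example in the description, while the count for the second character (200) and the corpus size (5000) are hypothetical values chosen only to make the arithmetic concrete.

```python
def solidification(count_ab, count_a, count_b, total):
    """Return P(C) - P(A) * P(B), with probabilities estimated as count / total."""
    p_c = count_ab / total
    return p_c - (count_a / total) * (count_b / total)

# 175 joint occurrences, 389 for the first part; 200 and 5000 are assumed.
score = solidification(count_ab=175, count_a=389, count_b=200, total=5000)
m = 0.02  # second preset threshold (example value from the description)
print(score, score > m)
```

Because the measure is a difference of probabilities, its scale depends on corpus size, so the threshold has to be tuned per corpus.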
Besides the word frequency and solidification degree requirements, the degree of freedom of a word must also be considered. The degree of freedom is the extent to which a word can be used freely. Looking only at how tightly a word is bound internally is not enough; its external behavior as a whole must also be examined. Take the two character strings "quilt" and "lifetime" as an example. We can say "buy a quilt", "cover with a quilt", "get into the quilt", "a good quilt", "this quilt" and so on, putting all kinds of characters in front of "quilt". The usage of "lifetime", however, is very fixed: apart from "a lifetime", "this lifetime", "the previous lifetime" and "the next lifetime", essentially nothing else can be put in front of it. The characters that can appear to its left are so limited that intuitively "lifetime" does not form a word by itself; what really forms a word is the whole of "a lifetime", "this lifetime" and the like. The extent to which a string can be used freely is thus also an important criterion for judging whether it is a new word. If a string counts as a new word, it should be able to appear flexibly in many different contexts and have rich sets of left neighbors and right neighbors. The randomness of these left and right neighbor sets can be measured by information entropy. For example, in the sentence "eat grapes without spitting out the grape skins, don't eat grapes but spit out the grape skins", the word "grape" occurs four times; its left neighbors are {eat, spit, eat, spit} and its right neighbors are {not, skin, instead, skin}. By the information entropy formula, the entropy of the left neighbors of "grape" is about 0.693 and that of its right neighbors about 1.04, so in this sentence the right neighbors of "grape" are richer. In this embodiment, the degree of freedom of a word is the smaller of its left-neighbor and right-neighbor information entropies. In this screening step, all candidate new words whose degree of freedom exceeds the third preset threshold (for example, 1.92) are selected as the true new words of the news corpus text. Because the degree of freedom of "grape" is below the third preset threshold, it would not be selected as a new word.
Specifically, the information entropy formula is:
$$H = -\sum_{i=1}^{n} P_i \log P_i$$
where the logarithm is generally taken to base 2, in which case the unit is the bit; n is the number of distinct left neighbors or right neighbors; and P_i is the probability of occurrence of each left neighbor or right neighbor. The example values 0.693 and 1.04 given for "grape" correspond to the natural logarithm.
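The "grape" example can be checked numerically. The sketch below computes the neighbor entropies with the natural logarithm, which reproduces the quoted values of about 0.693 and 1.04; the function names and the English neighbor labels are illustrative, and only the neighbor distribution matters for the result.

```python
import math
from collections import Counter

def entropy(neighbors, base=math.e):
    """Information entropy of a list of neighbor tokens."""
    counts = Counter(neighbors)
    n = len(neighbors)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def freedom(left, right, base=math.e):
    """Degree of freedom: the smaller of the left and right neighbor entropies."""
    return min(entropy(left, base), entropy(right, base))

left = ["eat", "spit", "eat", "spit"]        # left neighbors of "grape"
right = ["not", "skin", "instead", "skin"]   # right neighbors of "grape"

print(round(entropy(left), 3))    # 0.693
print(round(entropy(right), 2))   # 1.04
print(freedom(left, right) > 1.92)  # False: below the example third threshold
```

Passing `base=2` gives the same quantities in bits (1.0 and 1.5 for this example).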
Further, company names are extracted from the news corpus text using word segmentation and a predetermined company-name lexicon. Extracting company names from a news corpus is a mature technique and is not described further here. Suppose the new words finally extracted from the news corpus text include "pollutant emission", and the company names contained in the news corpus are Yunnan Salt & Chemical, Sany Heavy Industry and PowerChina. The mutual information between "pollutant emission" and each of "Yunnan Salt & Chemical", "Sany Heavy Industry" and "PowerChina" is then calculated, and the company names whose mutual information exceeds a fourth preset threshold (for example, 0.8) are retained as reference investment targets.
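The description does not give the formula used for the mutual information value. One common choice, assumed here purely for illustration, is pointwise mutual information over document-level co-occurrence, log2(P(w, c) / (P(w) * P(c))). The function names, the company names "AlphaCo" and "BetaCo" and the sample headlines are all hypothetical.

```python
import math

def pmi(docs, word, company):
    """Pointwise mutual information of `word` and `company` co-occurring in a document."""
    n = len(docs)
    n_w = sum(word in d for d in docs)
    n_c = sum(company in d for d in docs)
    n_wc = sum(word in d and company in d for d in docs)
    if n == 0 or n_wc == 0:
        return float("-inf")          # never co-occur: no evidence of association
    return math.log2((n_wc / n) / ((n_w / n) * (n_c / n)))

def reference_targets(docs, word, companies, threshold=0.8):
    """Keep companies whose PMI with the new word exceeds the threshold."""
    return [c for c in companies if pmi(docs, word, c) > threshold]

docs = [
    "AlphaCo cuts pollutant emission",
    "AlphaCo pollutant emission report",
    "BetaCo wins dam contract",
    "BetaCo faces pollutant emission audit",
    "Market opens higher",
    "Rates unchanged",
    "Oil prices slip",
    "Tech shares rally",
]
targets = reference_targets(docs, "pollutant emission", ["AlphaCo", "BetaCo"])
print(targets)
```

With these eight headlines, "AlphaCo" scores log2(8/3), about 1.415, and "BetaCo" scores log2(4/3), about 0.415, so only the former passes the example threshold of 0.8.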
It should be understood that the preset thresholds and other parameters mentioned in the above embodiments that need to be set in advance can be set by the user according to actual conditions.
The electronic device 1 proposed in the above embodiment extracts candidate new words from the corpus by segmenting the corpus text and removing stop words, screens out the true new words in the corpus text by calculating the word frequency, solidification degree and degree of freedom of the candidate new words, and finally determines the investment targets by calculating the mutual information between the new words and the company names in the corpus text, which improves the efficiency and accuracy of investment target extraction.
Optionally, in other embodiments, the program 10 for discovering investment targets using new word discovery may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module, as referred to in the present invention, is a series of computer program instruction segments capable of performing a specific function. For example, Fig. 2 is a schematic diagram of the modules of the program 10 in Fig. 1. In this embodiment, the program 10 may be divided into a first processing module 110, a second processing module 120, a merging module 130, a calculation module 140 and an extraction module 150. The functions or operation steps implemented by modules 110-150 are similar to those described above and are not detailed again here. By way of example:
The first processing module 110 is used to preprocess the material in a corpus to obtain corpus text data and form a corpus text set;
The second processing module 120 is used to read a preprocessed corpus text, and to perform word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
The merging module 130 is used to merge adjacent word segments of the corpus text into candidate new words to form the candidate new word set of the corpus text;
The calculation module 140 is used to screen out the true new words of the corpus text by comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
The extraction module 150 is used to calculate the mutual information between the screened new words and the company names in the corpus, and to extract the company names and new words whose mutual information meets a preset condition as reference investment targets.
In addition, the present invention also provides a method for discovering investment targets through new word discovery. Fig. 3 is a flowchart of a preferred embodiment of this method. The method may be performed by a device, and the device may be implemented in software and/or hardware.
In this embodiment, the method for discovering investment targets through new word discovery includes:
S1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;
S2, reading one preprocessed corpus text and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
S3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words to form the candidate new word set of the corpus text;
S4, selecting the true new words of the corpus text according to the result of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
S5, calculating the mutual information value, within the corpus, between the selected new words and company names, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.
Corpus material can come from many different fields. This embodiment uses a news corpus as an example to explain the concrete scheme of the present invention, but the scheme is not limited to the news field. An investor needs to follow current hot news in order to obtain information such as the plans, R&D directions or potential needs of target enterprises. A web crawler is therefore used to crawl Internet news as the news corpus, for example news from Sina, Baidu, Tencent and the like. It will be understood that hot news changes over time. To give the investor a more accurate picture of current hot news, the crawled Internet news is filtered along the time dimension: a preset time period is set and only the Internet news of that period is crawled, for example only the news of the current day. The crawled Internet news is then de-duplicated, and the titles of the Internet news are stored in the corpus. Because the news corpus comes from diverse sources, it contains many format types. To facilitate subsequent processing, the news corpus needs to be preprocessed to obtain news corpus text data and form a news corpus text set.
In a specific implementation, the preprocessing may unify the format of the news corpus into text format, remove advertising noise from the news corpus, and filter out one or more of profanity, sensitive words and stop words. When the format of the news corpus is unified into text format, content that current technology cannot yet convert into text format is filtered out.
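The preprocessing described above (time filtering, format unification, noise removal and de-duplication) can be sketched roughly as follows. This is only an illustrative sketch: the function name `preprocess`, the `(title, published_date)` input shape and the `AD_MARKERS` keyword list are assumptions made for the example and do not come from the source, which does not specify how advertising noise is detected.

```python
import html
import re
from datetime import date

# Hypothetical markers used to recognise advertising noise (assumption).
AD_MARKERS = ("advertisement", "sponsored")


def preprocess(items, day):
    """Return cleaned, de-duplicated titles published on `day`.

    `items` is an iterable of (title, published_date) pairs.
    """
    seen = set()
    cleaned = []
    for title, published in items:
        # Time filter: keep only news from the preset period (here, one day).
        if published != day:
            continue
        # Unify to plain text: decode HTML entities, then strip tags.
        text = re.sub(r"<[^>]+>", "", html.unescape(title))
        text = re.sub(r"\s+", " ", text).strip()
        # Drop empty rows and rows that look like advertising.
        if not text or any(m in text.lower() for m in AD_MARKERS):
            continue
        # De-duplicate on the cleaned text.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

De-duplicating on the cleaned text rather than the raw title means that two copies of the same headline with different markup are treated as one item.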
After the news corpus text is obtained, each line of news corpus text is segmented one by one using a word segmentation tool, such as the Stanford Chinese word segmenter or jieba. For example, segmenting the sentence "went to see a film last night" yields the result "yesterday | evening | go | see | [particle] | film". The segmentation result is retained after segmentation. It will be understood that, to further improve the usefulness of the segmentation result, stop-word removal is applied to it. This removes function words that cannot reflect the topic of the news corpus, such as modal particles, adverbs, prepositions and conjunctions. These function words usually have no definite meaning of their own and only play a role when placed in a complete sentence, for example the common "this", "that", "on", "under" and "where". In other embodiments, after the news corpus text is segmented, only the verb and/or noun word segments of the segmentation result are finally retained; in the example above, only the word "film" may be retained. It will be understood that the segmentation result may be empty after this processing, in which case the corresponding line of text is filtered out. In other embodiments, the method of segmenting the news corpus text may also include one or more of: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics and a segmentation method based on a dictionary.
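The stop-word removal step, including the rule that a line whose segmentation result becomes empty is dropped, might look like the sketch below. It assumes the text has already been segmented by some tool (the source names the Stanford segmenter and jieba), so the input is a list of token lists. The name `remove_stop_words` and the tiny `STOP_WORDS` set are illustrative assumptions, not part of the source.

```python
# Illustrative stop-word list only; a real list would be much larger.
STOP_WORDS = {"this", "that", "on", "under", "where"}


def remove_stop_words(segmented_lines, stop_words=STOP_WORDS):
    """Remove stop words from each segmented line and drop empty lines.

    `segmented_lines` is a list of token lists produced by a word segmenter.
    """
    kept = []
    for tokens in segmented_lines:
        # Discard blank tokens and anything in the stop-word list.
        filtered = [t for t in tokens if t.strip() and t not in stop_words]
        # A line with no remaining word segments is filtered out entirely.
        if filtered:
            kept.append(filtered)
    return kept
```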
In other embodiments, to help delimit the scope of the subsequent word segmentation, each news corpus text may first be divided into lines before word segmentation is performed on it. The line splitting may break the corpus text at punctuation marks, for example at full stops, commas, exclamation marks and question marks.
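A minimal sketch of this line splitting with a regular expression is shown below. The punctuation set (Chinese and ASCII full stops, commas, exclamation marks and question marks) and the function name `split_lines` are assumptions made for illustration.

```python
import re

# Break at runs of Chinese or ASCII full stops, commas, exclamation and question marks.
_PUNCT = re.compile(r"[。，！？.,!?]+")


def split_lines(text):
    """Split a corpus text into lines at punctuation, dropping empty pieces."""
    return [piece.strip() for piece in _PUNCT.split(text) if piece.strip()]
```

Note that this simple pattern also splits at the decimal point inside a number such as "3.5"; a production splitter would need to handle that case.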
However, in the news corpus text after word segmentation, a term that should be treated as a single word in some field may have been split into several terms, so new word discovery is needed. Accordingly, adjacent word segments in the segmentation result are merged to form the candidate new words of the news corpus text.
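Merging adjacent word segments into candidate new words can be sketched as n-gram generation over the segment list. The source does not say how many adjacent segments may be combined, so the `max_len` parameter (defaulting to pairs) and the function name `candidate_new_words` are assumptions.

```python
def candidate_new_words(tokens, max_len=2):
    """Combine runs of 2..max_len adjacent word segments into candidates.

    Each candidate is returned as a tuple of its segments so that later
    steps can still see the parts it was built from.
    """
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(tuple(tokens[i:i + n]))
    return candidates
```

For Chinese text the segments of a candidate would typically be joined without a separator, for example with `"".join(candidate)`.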
Next, the true new words of the news corpus text need to be determined from its candidate new words. Fig. 4 is a detailed flow diagram of step S4 of the method of the present invention. In other embodiments, step S4 specifically includes:
S41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
S42, calculating the solidification degree of each candidate new word selected in step S41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and
S43, calculating the degree of freedom of each candidate new word selected in step S42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
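Steps S41 to S43 form a cascade in which each stage only scores the candidates that survived the previous one. A rough sketch of that control flow, with the scoring functions passed in as parameters, is given below. The name `cascade_filter` and the `(score_fn, threshold)` stage representation are assumptions for illustration.

```python
def cascade_filter(candidates, stages):
    """Apply threshold filters in order, as in steps S41 to S43.

    `stages` is a list of (score_fn, threshold) pairs. A candidate survives
    a stage only if score_fn(candidate) is strictly greater than threshold.
    """
    survivors = list(candidates)
    for score_fn, threshold in stages:
        # Later, more expensive scores are computed only for survivors.
        survivors = [c for c in survivors if score_fn(c) > threshold]
    return survivors
```

In use, the three stages would be the word frequency, solidification degree and degree of freedom scores, paired with the first, second and third preset thresholds respectively.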
It will be understood that, to extract new words from a news corpus text, it must first be clear what kind of word counts as a new word. The first consideration is whether the word occurs often enough in a corpus text or in the corpus, i.e. its word frequency. In this embodiment, word frequency is expressed through the inverse document frequency (IDF). Here the IDF characterizes how frequently a word occurs in documents: the more frequently it occurs, the more likely the word is to appear in different contexts, which characterizes how well the word is recognized across different articles. In general, the higher the IDF, the higher the recognition and the more likely the word is a new word. If the IDF is very high, however, the word is instead very common and need not be added to the new word set, in particular to avoid polluting the new words. In this screening step, all candidate new words whose word frequency exceeds the first preset threshold (for example, candidates that occur more than 5 times in a news corpus text) are selected.
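The frequency screening can be sketched with a simple counter, following the source's example threshold of "more than 5 times". This sketch uses the raw occurrence count within one corpus text. It does not compute IDF in its usual textbook form, log(N / document frequency), which decreases as a term becomes more common. The function name `filter_by_frequency` is an assumption.

```python
from collections import Counter


def filter_by_frequency(candidates, min_count=5):
    """Keep candidates that occur more than `min_count` times.

    `candidates` is the list of all candidate occurrences in a corpus text.
    Returns a dict mapping each surviving candidate to its count.
    """
    counts = Counter(candidates)
    return {word: n for word, n in counts.items() if n > min_count}
```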
However, a candidate new word selected by the above screening may not be a single word but a phrase made up of several words. Therefore, besides meeting the word frequency requirement, the solidification degree of the candidate new word must also be considered, i.e. the probability that each word in a candidate new word occurs together with the other words in that candidate. For example, in one corpus text "film" occurs 389 times and "cinema" occurs only 175 times, yet we are more inclined to treat "cinema" as a word, because intuitively "film" and "institute" (the two components of the word "cinema") are bound together more tightly. The words with the highest solidification degree are words such as "bat", "spider", "hesitant" and "perturbed": each character in these words almost always occurs together with the other character, even when used in other contexts. In this screening step, all candidate new words whose solidification degree exceeds the second preset threshold (for example 0.02) are selected.
Specifically, taking a two-element word as an example, let the probabilities that word A and word B occur individually be P(A) and P(B). If the two words are independent, the probability that they occur together is P(A) * P(B). If the two words are not independent, the probability that they occur together is greater than P(A) * P(B), i.e. P(C) >> P(A) * P(B). That is, for the solidification degree of a candidate new word to exceed the second preset threshold, the following condition must be met:
$$P(C) - P(A) \cdot P(B) > m$$
where A and B denote the words in the candidate new word, P(C) is the probability that words A and B occur together, and m is the second preset threshold.
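The condition P(C) - P(A) * P(B) > m can be evaluated from counts as in the sketch below. The source does not say how the probabilities are estimated. Here, as an assumption, each probability is a count divided by the total number of word segments in the text, and the joint probability P(C) is the count of the adjacent pair divided by the same total. The function name `solidification_degrees` is also hypothetical.

```python
from collections import Counter


def solidification_degrees(tokens):
    """Return {(a, b): P(C) - P(A) * P(B)} for every adjacent pair (a, b).

    Probabilities are estimated as counts divided by len(tokens).
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): n / total - (unigrams[a] / total) * (unigrams[b] / total)
        for (a, b), n in bigrams.items()
    }
```

A candidate pair would then be kept when its value exceeds the second preset threshold m, for example `[pair for pair, s in solidification_degrees(tokens).items() if s > m]`.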
In addition to meeting the above requirements on word frequency and solidification degree, the degree of freedom of a word must also be considered. The degree of freedom is the degree to which a word can be used freely. Looking only at how tightly a word is bound internally is not enough; we also need to look at how it behaves externally as a whole. Take the two words "quilt" and "lifetime" as examples. We can say "buy a quilt", "cover with a quilt", "get into the quilt", "good quilt", "this quilt" and so on, placing all kinds of words in front of "quilt". The usage of "lifetime", however, is very fixed: apart from "a lifetime", "this lifetime", "the previous lifetime" and "the next lifetime", essentially no other word can be placed in front of "lifetime". The words that can appear to the left of "lifetime" are so limited that we intuitively feel "lifetime" is not a word on its own, and that what really form words are wholes such as "a lifetime" and "this lifetime". Thus, how freely a word can be used is also an important criterion for judging whether it is a new word. If a word can be counted as a new word, it should be able to appear flexibly in a variety of different contexts and should have rich sets of left-adjacent and right-adjacent words. The randomness of a word's left-adjacent word set and right-adjacent word set can be measured by calculating their information entropy. For example, in the sentence "eat grapes without spitting out the grape skins, don't eat grapes yet spit out the grape skins", the word "grape" occurs four times; its left-adjacent words are {eat, spit, eat, spit} and its right-adjacent words are {not, skin, instead, skin}. Using the information entropy formula, the entropy of the left-adjacent words of "grape" is about 0.693 and the entropy of its right-adjacent words is about 1.04. Thus, in this sentence, the right-adjacent words of "grape" are richer. In this embodiment, the degree of freedom of a word is the smaller of its left-adjacent-word entropy and its right-adjacent-word entropy. In this screening step, all candidate new words whose degree of freedom exceeds the third preset threshold (for example 1.92) are selected as the true new words of the news corpus text. Because the degree of freedom of "grape" is below the third preset threshold, it would not be selected as a new word.
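The degree-of-freedom calculation can be checked against the "grape" example with a short sketch. The quoted values 0.693 and 1.04 are reproduced when the natural logarithm is used; with base 2 the same neighbour sets give 1.0 and 1.5 bits. The function names `neighbour_entropy` and `degree_of_freedom`, and the choice of the natural logarithm as the default base, are assumptions made for this illustration.

```python
import math
from collections import Counter


def neighbour_entropy(neighbours, base=math.e):
    """Entropy of the empirical distribution of a list of adjacent words."""
    counts = Counter(neighbours)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())


def degree_of_freedom(left, right, base=math.e):
    """Smaller of the left-adjacent and right-adjacent word entropies."""
    return min(neighbour_entropy(left, base), neighbour_entropy(right, base))


# Neighbours of "grape" in the example sentence.
left = ["eat", "spit", "eat", "spit"]
right = ["not", "skin", "instead", "skin"]
print(round(neighbour_entropy(left), 3))   # 0.693
print(round(neighbour_entropy(right), 2))  # 1.04
```

Taking the minimum means a candidate is accepted only if both of its sides are varied, which is what rules out fragments such as "lifetime" whose left context is nearly fixed.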
Specifically, the information entropy formula is:
$$H = -\sum_{i=1}^{n} P_i \log_2 P_i$$
where the logarithm is generally taken to base 2, so that the unit is the bit; n is the number of left-adjacent or right-adjacent words; and P_i is the probability with which each left-adjacent or right-adjacent word occurs.
Further, company names are extracted from the news corpus text using word segmentation and a predetermined company name database. Extracting company names from a news corpus is a mature existing technique and is not described here. Suppose that the new words finally extracted from the news corpus text include "pollutant emission", and that the company names contained in the news corpus are Yunnan Salt Chemical, Sany Heavy Industry and China Power Construction. The mutual information values between "pollutant emission" and each of "Yunnan Salt Chemical", "Sany Heavy Industry" and "China Power Construction" are then calculated, and the company names whose mutual information value exceeds a fourth preset threshold (for example 0.8) are retained as reference investment targets.
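The source does not give the formula used for the mutual information value. One common choice, used here purely as an assumption, is pointwise mutual information (PMI) estimated from document co-occurrence: log2 of P(word, company) divided by P(word) * P(company), where each probability is the fraction of corpus texts containing the term or terms. The function names `pmi` and `reference_targets` and the sample data are hypothetical.

```python
import math


def pmi(word, company, documents):
    """Pointwise mutual information of two terms over a list of term sets."""
    n = len(documents)
    n_word = sum(1 for d in documents if word in d)
    n_company = sum(1 for d in documents if company in d)
    n_both = sum(1 for d in documents if word in d and company in d)
    if n_both == 0:
        # The terms never co-occur, so treat the association as minimal.
        return float("-inf")
    return math.log2((n_both / n) / ((n_word / n) * (n_company / n)))


def reference_targets(word, companies, documents, threshold=0.8):
    """Companies whose PMI with `word` exceeds the fourth preset threshold."""
    return [c for c in companies if pmi(word, c, documents) > threshold]
```

For example, with six corpus texts in which "pollutant emission" and a hypothetical "Company A" each appear in exactly the same two texts, the PMI is log2(3), which exceeds the example threshold of 0.8, so "Company A" would be retained.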
It will be understood that parameters that need to be set in advance, such as the preset thresholds involved in the above embodiments, can be configured by the user according to actual conditions.
The method for discovering investment targets through new word discovery proposed in the above embodiment extracts candidate new words from the corpus by applying word segmentation, stop-word removal and similar processing to the corpus text. It then selects the true new words of the corpus text by calculating the word frequency, solidification degree and degree of freedom of the candidate new words. Finally, it calculates the mutual information value between the new words and the company names in the corpus text to determine the final investment targets, which improves the efficiency and accuracy of investment target extraction.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium. The computer-readable storage medium stores a program for discovering investment targets through new word discovery, and the program, when executed by a processor, implements the following operations:
A1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;
A2, reading one preprocessed corpus text and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
A3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words to form the candidate new word set of the corpus text;
A4, selecting the true new words of the corpus text according to the result of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
A5, calculating the mutual information value, within the corpus, between the selected new words and company names, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.
Preferably, step A4 includes:
A41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
A42, calculating the solidification degree of each candidate new word selected in step A41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and
A43, calculating the degree of freedom of each candidate new word selected in step A42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
Preferably, the step of "calculating the degree of freedom of each candidate new word selected in step A42" includes:
calculating the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word selected in step A42; and
taking the smaller of the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word as the degree of freedom of that candidate new word.
The specific embodiments of the computer-readable storage medium of the present invention are essentially the same as the embodiments of the above method for discovering investment targets through new word discovery and of the electronic device, and are not repeated here.
It should be noted that the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. Moreover, the terms "comprise" and "include", and any other variants thereof, are intended herein to cover non-exclusive inclusion, so that a process, device, article or method that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, device, article or method that comprises the element.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
- 1. A method for discovering investment targets through new word discovery, applied to an electronic device, characterized in that the method comprises:
  S1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;
  S2, reading one preprocessed corpus text and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
  S3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words to form the candidate new word set of the corpus text;
  S4, selecting the true new words of the corpus text according to the result of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
  S5, calculating the mutual information value, within the corpus, between the selected new words and company names, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.
- 2. The method for discovering investment targets through new word discovery according to claim 1, characterized in that the preprocessing in step S1 comprises: unifying the format of the corpus material in the corpus into text format, and removing advertising noise from the corpus material.
- 3. The method for discovering investment targets through new word discovery according to claim 1, characterized in that the method of segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics and a segmentation method based on a dictionary.
- 4. The method for discovering investment targets through new word discovery according to claim 1, 2 or 3, characterized in that step S4 comprises:
  S41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
  S42, calculating the solidification degree of each candidate new word selected in step S41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and
  S43, calculating the degree of freedom of each candidate new word selected in step S42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
- 5. The method for discovering investment targets through new word discovery according to claim 4, characterized in that the step of "calculating the degree of freedom of each candidate new word selected in step S42" comprises:
  calculating the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word selected in step S42; and
  taking the smaller of the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word as the degree of freedom of that candidate new word.
- 6. An electronic device, characterized in that the device comprises a memory and a processor, the memory storing a program for discovering investment targets through new word discovery that can run on the processor, wherein the program, when executed by the processor, implements the following steps:
  A1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;
  A2, reading one preprocessed corpus text and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text;
  A3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words to form the candidate new word set of the corpus text;
  A4, selecting the true new words of the corpus text according to the result of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
  A5, calculating the mutual information value, within the corpus, between the selected new words and company names, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.
- 7. The electronic device according to claim 6, characterized in that the preprocessing in step A1 comprises: unifying the format of the corpus material in the corpus into text format, and removing advertising noise from the news corpus;
  and the method of segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics and a segmentation method based on a dictionary.
- 8. The electronic device according to claim 6 or 7, characterized in that step A4 comprises:
  A41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
  A42, calculating the solidification degree of each candidate new word selected in step A41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and
  A43, calculating the degree of freedom of each candidate new word selected in step A42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
- 9. The electronic device according to claim 8, characterized in that the step of "calculating the degree of freedom of each candidate new word selected in step A42" comprises:
  calculating the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word selected in step A42; and
  taking the smaller of the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word as the degree of freedom of that candidate new word.
- 10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for discovering investment targets through new word discovery, and the program, when executed by a processor, implements the steps of the method for discovering investment targets through new word discovery according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711059221.6A CN108038119A (en) | 2017-11-01 | 2017-11-01 | Utilize the method, apparatus and storage medium of new word discovery investment target |
PCT/CN2018/076174 WO2019085335A1 (en) | 2017-11-01 | 2018-02-10 | Method for discovering investment objects with new words, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711059221.6A CN108038119A (en) | 2017-11-01 | 2017-11-01 | Utilize the method, apparatus and storage medium of new word discovery investment target |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038119A true CN108038119A (en) | 2018-05-15 |
Family
ID=62093676
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711059221.6A Pending CN108038119A (en) | 2017-11-01 | 2017-11-01 | Utilize the method, apparatus and storage medium of new word discovery investment target |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108038119A (en) |
WO (1) | WO2019085335A1 (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472022A (en) * | 2018-10-15 | 2019-03-15 | 平安科技(深圳)有限公司 | New word identification method and terminal device based on machine learning |
CN109492224A (en) * | 2018-11-07 | 2019-03-19 | 北京金山数字娱乐科技有限公司 | A kind of method and device of vocabulary building |
CN110457708A (en) * | 2019-08-16 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence |
CN111309898A (en) * | 2018-11-26 | 2020-06-19 | 中移(杭州)信息技术有限公司 | Text mining method and device for new word discovery |
CN111339403A (en) * | 2020-02-11 | 2020-06-26 | 安徽理工大学 | Commodity comment-based new word extraction method |
CN111626053A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | Method and device for recognizing descriptor of new case means, electronic device and storage medium |
CN111626054A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | New illegal behavior descriptor identification method and device, electronic equipment and storage medium |
CN111832299A (en) * | 2020-07-17 | 2020-10-27 | 成都信息工程大学 | Chinese word segmentation system |
CN111914554A (en) * | 2020-08-19 | 2020-11-10 | 网易(杭州)网络有限公司 | Training method of field new word recognition model, field new word recognition method and field new word recognition equipment |
CN111931491A (en) * | 2020-08-14 | 2020-11-13 | 工银科技有限公司 | Domain dictionary construction method and device |
CN112329458A (en) * | 2020-05-21 | 2021-02-05 | 北京明亿科技有限公司 | New organization descriptor recognition method and device, electronic device and storage medium |
CN112541057A (en) * | 2019-09-04 | 2021-03-23 | 上海晶赞融宣科技有限公司 | Distributed new word discovery method and device, computer equipment and storage medium |
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium |
CN112560448A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | New word extraction method and device |
CN112883725A (en) * | 2020-12-29 | 2021-06-01 | 上海讯飞瑞元信息技术有限公司 | File generation method and device, electronic equipment and storage medium |
CN113064990A (en) * | 2021-01-04 | 2021-07-02 | 上海金融期货信息技术有限公司 | Hot event identification method and system based on multi-level clustering |
CN113449082A (en) * | 2021-07-16 | 2021-09-28 | 上海明略人工智能(集团)有限公司 | New word discovery method, system, electronic device and medium |
CN113468317A (en) * | 2021-06-26 | 2021-10-01 | 北京网聘咨询有限公司 | Resume screening method, system, equipment and storage medium |
CN113536787A (en) * | 2021-07-14 | 2021-10-22 | 福建亿榕信息技术有限公司 | Method and equipment for establishing audit professional lexicon |
WO2021217936A1 (en) * | 2020-04-29 | 2021-11-04 | 深圳壹账通智能科技有限公司 | Word combination processing-based new word discovery method and apparatus, and computer device |
WO2021217931A1 (en) * | 2020-04-30 | 2021-11-04 | 深圳壹账通智能科技有限公司 | Classification model-based field extraction method and apparatus, electronic device, and medium |
CN114186557A (en) * | 2022-02-17 | 2022-03-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Method, device and storage medium for determining subject term |
CN114385792A (en) * | 2022-03-23 | 2022-04-22 | 北京零点远景网络科技有限公司 | Method, device, equipment and storage medium for extracting words from work order data |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230416373A1 (en) | 2020-11-14 | 2023-12-28 | Biogen Ma Inc. | Biphasic subcutaneous dosing regimens for anti-vla-4 antibodies |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3940656B2 (en) * | 2002-09-30 | 2007-07-04 | 株式会社東芝 | Dictionary refinement method and program used for text information classification |
US20070265832A1 (en) * | 2006-05-09 | 2007-11-15 | Brian Bauman | Updating dictionary during application installation |
CN102023967A (en) * | 2010-11-11 | 2011-04-20 | 清华大学 | Text emotion classifying method in stock field |
CN105389349A (en) * | 2015-10-27 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and apparatus |
CN105786962A (en) * | 2016-01-15 | 2016-07-20 | 优品财富管理有限公司 | Big data index analysis method and system based on news transmissibility |
CN106934054A (en) * | 2017-03-17 | 2017-07-07 | 前海梧桐(深圳)数据有限公司 | The accurate analysis method of enterprise's segmented industry and its system based on big data |
CN107292744A (en) * | 2017-06-07 | 2017-10-24 | 前海梧桐(深圳)数据有限公司 | Investment Trend analysis method and its system based on machine learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105786991B (en) * | 2016-02-18 | 2019-03-15 | 中国科学院自动化研究所 | In conjunction with the Chinese emotion new word identification method and system of user feeling expression way |
CN105956158B (en) * | 2016-05-17 | 2019-08-09 | 清华大学 | The method that network neologisms based on massive micro-blog text and user information automatically extract |
CN106126606B (en) * | 2016-06-21 | 2019-08-20 | 国家计算机网络与信息安全管理中心 | A kind of short text new word discovery method |
- 2017-11-01: CN application CN201711059221.6A, published as CN108038119A (status: Pending)
- 2018-02-10: WO application PCT/CN2018/076174, published as WO2019085335A1 (status: Application Filing)
Non-Patent Citations (1)
Title |
---|
TING GE: "Sociolinguistics in the Internet Age: SNS-Based Text Data Mining (reposted from MatriX67)", Douban (豆瓣) * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472022A (en) * | 2018-10-15 | 2019-03-15 | 平安科技(深圳)有限公司 | New word identification method and terminal device based on machine learning |
CN109492224A (en) * | 2018-11-07 | 2019-03-19 | 北京金山数字娱乐科技有限公司 | Vocabulary construction method and device |
CN109492224B (en) * | 2018-11-07 | 2024-05-03 | 北京金山数字娱乐科技有限公司 | Vocabulary construction method and device |
CN111309898A (en) * | 2018-11-26 | 2020-06-19 | 中移(杭州)信息技术有限公司 | Text mining method and device for new word discovery |
CN110457708A (en) * | 2019-08-16 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence |
CN110457708B (en) * | 2019-08-16 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Vocabulary mining method and device based on artificial intelligence, server and storage medium |
CN112541057A (en) * | 2019-09-04 | 2021-03-23 | 上海晶赞融宣科技有限公司 | Distributed new word discovery method and device, computer equipment and storage medium |
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium |
CN111339403A (en) * | 2020-02-11 | 2020-06-26 | 安徽理工大学 | Commodity comment-based new word extraction method |
CN111339403B (en) * | 2020-02-11 | 2022-08-02 | 安徽理工大学 | Commodity comment-based new word extraction method |
WO2021217936A1 (en) * | 2020-04-29 | 2021-11-04 | 深圳壹账通智能科技有限公司 | Word combination processing-based new word discovery method and apparatus, and computer device |
WO2021217931A1 (en) * | 2020-04-30 | 2021-11-04 | 深圳壹账通智能科技有限公司 | Classification model-based field extraction method and apparatus, electronic device, and medium |
CN111626053A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | Method and device for recognizing descriptor of new case means, electronic device and storage medium |
CN112329458B (en) * | 2020-05-21 | 2024-05-10 | 北京明亿科技有限公司 | New organization descriptor recognition method and device, electronic equipment and storage medium |
CN112329458A (en) * | 2020-05-21 | 2021-02-05 | 北京明亿科技有限公司 | New organization descriptor recognition method and device, electronic device and storage medium |
CN111626053B (en) * | 2020-05-21 | 2024-04-09 | 北京明亿科技有限公司 | New scheme means descriptor recognition method and device, electronic equipment and storage medium |
CN111626054B (en) * | 2020-05-21 | 2023-12-19 | 北京明亿科技有限公司 | Novel illegal action descriptor recognition method and device, electronic equipment and storage medium |
CN111626054A (en) * | 2020-05-21 | 2020-09-04 | 北京明亿科技有限公司 | New illegal behavior descriptor identification method and device, electronic equipment and storage medium |
CN111832299A (en) * | 2020-07-17 | 2020-10-27 | 成都信息工程大学 | Chinese word segmentation system |
CN111931491A (en) * | 2020-08-14 | 2020-11-13 | 工银科技有限公司 | Domain dictionary construction method and device |
CN111931491B (en) * | 2020-08-14 | 2023-11-14 | 中国工商银行股份有限公司 | Domain dictionary construction method and device |
CN111914554A (en) * | 2020-08-19 | 2020-11-10 | 网易(杭州)网络有限公司 | Training method of field new word recognition model, field new word recognition method and field new word recognition equipment |
CN112883725A (en) * | 2020-12-29 | 2021-06-01 | 上海讯飞瑞元信息技术有限公司 | File generation method and device, electronic equipment and storage medium |
CN113064990A (en) * | 2021-01-04 | 2021-07-02 | 上海金融期货信息技术有限公司 | Hot event identification method and system based on multi-level clustering |
CN112560448B (en) * | 2021-02-20 | 2021-06-22 | 京华信息科技股份有限公司 | New word extraction method and device |
CN112560448A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | New word extraction method and device |
CN113468317A (en) * | 2021-06-26 | 2021-10-01 | 北京网聘咨询有限公司 | Resume screening method, system, equipment and storage medium |
CN113468317B (en) * | 2021-06-26 | 2024-03-08 | 北京网聘信息技术有限公司 | Resume screening method, system, equipment and storage medium |
CN113536787A (en) * | 2021-07-14 | 2021-10-22 | 福建亿榕信息技术有限公司 | Method and equipment for establishing audit professional lexicon |
CN113449082A (en) * | 2021-07-16 | 2021-09-28 | 上海明略人工智能(集团)有限公司 | New word discovery method, system, electronic device and medium |
CN114186557A (en) * | 2022-02-17 | 2022-03-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Method, device and storage medium for determining subject term |
CN114385792B (en) * | 2022-03-23 | 2022-06-24 | 北京零点远景网络科技有限公司 | Method, device, equipment and storage medium for extracting words from work order data |
CN114385792A (en) * | 2022-03-23 | 2022-04-22 | 北京零点远景网络科技有限公司 | Method, device, equipment and storage medium for extracting words from work order data |
Also Published As
Publication number | Publication date |
---|---|
WO2019085335A1 (en) | 2019-05-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108038119A (en) | | Utilize the method, apparatus and storage medium of new word discovery investment target |
CN109271512B (en) | | Emotion analysis method, device and storage medium for public opinion comment information |
CN109145216B (en) | | Network public opinion monitoring method, device and storage medium |
CN109325165B (en) | | Network public opinion analysis method, device and storage medium |
US9336299B2 (en) | | Acquisition of semantic class lexicons for query tagging |
CN110163476A (en) | | Project intelligent recommendation method, electronic device and storage medium |
CN101944109B (en) | | System and method for extracting picture abstract based on page partitioning |
CN109062972A (en) | | Web page classification method, device and computer readable storage medium |
KR20140131327A (en) | | Social media data analysis system and method |
CN102314436A (en) | | Webpage automatic adjusting method and system |
CN112650910B (en) | | Method, device, equipment and storage medium for determining website update information |
WO2021068681A1 (en) | | Tag analysis method and device, and computer readable storage medium |
CN105512104A (en) | | Dictionary dimension reducing method and device and information classifying method and device |
US9064009B2 (en) | | Attribute cloud |
CN104850617A (en) | | Short text processing method and apparatus |
US11687647B2 (en) | | Method and electronic device for generating semantic representation of document to determine data security risk |
CN109241392A (en) | | Recognition methods, device, system and the storage medium of target word |
CN104933074A (en) | | News ordering method and device and terminal equipment |
CN107861945A (en) | | Finance data analysis method, application server and computer-readable recording medium |
CN104462061A (en) | | Word extraction method and word extraction device |
CN103631796A (en) | | Website sort management method and electronic device |
Khemani et al. | | A review on reddit news headlines with nltk tool |
CN104572874B (en) | | A kind of abstracting method and device of webpage information |
CN112579729A (en) | | Training method and device for document quality evaluation model, electronic equipment and medium |
CN111639250A (en) | | Enterprise description information acquisition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |