CN108038119A - Method, apparatus and storage medium for discovering investment targets using new words - Google Patents

Info

Publication number
CN108038119A
Authority
CN
China
Prior art keywords
new words
word
corpus
candidate
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711059221.6A
Other languages
Chinese (zh)
Inventor
汪伟
罗傲雪
陈恋
陈一恋
王晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201711059221.6A priority Critical patent/CN108038119A/en
Priority to PCT/CN2018/076174 priority patent/WO2019085335A1/en
Publication of CN108038119A publication Critical patent/CN108038119A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a method for discovering investment targets using new words, comprising: preprocessing the corpus material in a corpus to obtain corpus text data; reading a preprocessed corpus text, segmenting it into words and removing stop words to obtain multiple word segments of the corpus text; merging adjacent word segments of the corpus text into candidate new words; selecting the true new words of the corpus text by comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text against preset thresholds; and calculating the mutual information value between each selected new word and each company name in the corpus, and extracting the company names and new words whose mutual information value satisfies a preset condition as reference investment targets. The present invention also proposes an electronic device and a computer-readable storage medium. By extracting investment targets from new words screened out of a news corpus, the invention improves the efficiency and accuracy of investment.

Description

Method, apparatus and storage medium for discovering investment targets using new words
Technical field
The present invention relates to the field of computer technology, and in particular to a method, an electronic device and a computer-readable storage medium for discovering investment targets using new words.
Background
At present, when observing investment targets, investors lack observation of the association between investees and hot topics. Such observation can, to some extent, improve an investor's understanding of, and expectations about, an investment target's business plans, research priorities, business growth, raw material demand, team building and so on.
With the spread of the Internet, each news website publishes thousands of news items every day, and the news is updated in real time. If the current hot topics in the market, and the enterprises and industries those topics involve, can be extracted and analyzed from this massive news corpus, then an investor can learn the relevant plans, R&D directions or potential demands of target enterprises, and thereby discover and seize business opportunities. How to extract and analyze new words from a news corpus, and how to use those new words to discover investment targets, is therefore a problem that urgently needs to be solved.
Summary of the invention
The present invention provides a method, an electronic device and a computer-readable storage medium for discovering investment targets using new words. Its main purpose is to screen and analyze new words from a news corpus and to extract investment targets using the new words so obtained.
To achieve the above object, the present invention provides an electronic device comprising a memory and a processor. The memory stores a program for discovering investment targets using new words that can run on the processor, and the program, when executed by the processor, implements the following steps:
A1. Preprocess the corpus material in a corpus to obtain corpus text data, forming a corpus text set;
A2. Read one preprocessed corpus text, segment it into words and remove stop words, obtaining multiple word segments of the corpus text;
A3. Merge adjacent word segments of the corpus text, combining adjacent word segments into candidate new words and forming the candidate new word set of the corpus text;
A4. Select the true new words of the corpus text by comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text against preset thresholds; and
A5. Calculate the mutual information value between each selected new word and each company name in the corpus, and extract the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
Preferably, step A4 comprises:
A41. Calculate the word frequency of each candidate new word of the corpus text, and select the candidate new words whose word frequency exceeds a first preset threshold;
A42. Calculate the cohesion of each candidate new word selected in step A41, and from these select the candidate new words whose cohesion exceeds a second preset threshold; and
A43. Calculate the degree of freedom of each candidate new word selected in step A42, and from these select the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
Preferably, the step of "calculating the degree of freedom of each candidate new word selected in step A42" comprises:
Calculating, for each candidate new word selected in step A42, the information entropy of its left neighbor words and the information entropy of its right neighbor words; and
Taking the smaller of the left-neighbor and right-neighbor information entropies of each candidate new word as the degree of freedom of that word.
In addition, to achieve the above object, the present invention also provides a method for discovering investment targets using new words, the method comprising:
S1. Preprocess the corpus material in a corpus to obtain corpus text data, forming a corpus text set;
S2. Read one preprocessed corpus text, segment it into words and remove stop words, obtaining multiple word segments of the corpus text;
S3. Merge adjacent word segments of the corpus text, combining adjacent word segments into candidate new words and forming the candidate new word set of the corpus text;
S4. Select the true new words of the corpus text by comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text against preset thresholds; and
S5. Calculate the mutual information value between each selected new word and each company name in the corpus, and extract the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
Preferably, step S4 comprises:
S41. Calculate the word frequency of each candidate new word of the corpus text, and select the candidate new words whose word frequency exceeds a first preset threshold;
S42. Calculate the cohesion of each candidate new word selected in step S41, and from these select the candidate new words whose cohesion exceeds a second preset threshold; and
S43. Calculate the degree of freedom of each candidate new word selected in step S42, and from these select the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
Preferably, the step of "calculating the degree of freedom of each candidate new word selected in step S42" comprises:
Calculating, for each candidate new word selected in step S42, the information entropy of its left neighbor words and the information entropy of its right neighbor words; and
Taking the smaller of the left-neighbor and right-neighbor information entropies of each candidate new word as the degree of freedom of that word.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium storing a program for discovering investment targets using new words; when executed by a processor, the program implements any of the steps of the method for discovering investment targets using new words described above.
The method, electronic device and computer-readable storage medium proposed by the present invention extract candidate new words from the corpus by segmenting the corpus text and removing stop words, then select the true new words in the corpus text by calculating the word frequency, cohesion and degree of freedom of the candidate new words, and finally determine the investment targets by calculating the mutual information value between the new words and the company names in the corpus text, improving the efficiency and accuracy of investment target extraction.
Brief description of the drawings
Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method of the present invention for discovering investment targets using new words;
Fig. 2 is a module diagram of the program for discovering investment targets using new words in Fig. 1;
Fig. 3 is a flow chart of a preferred embodiment of the method of the present invention for discovering investment targets using new words;
Fig. 4 is a detailed flow chart of step S4 of the method of the present invention for discovering investment targets using new words.
The realization, functional features and advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description of the embodiments
It should be understood that the specific embodiments described here merely illustrate the present invention and are not intended to limit it.
The present invention provides a method for discovering investment targets using new words, applied to an electronic device 1. Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method.
In this embodiment, the electronic device 1 may be a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer, e-book reader or portable computer.
The electronic device 1 comprises a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 comprises at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc and so on. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card or flash card fitted to the electronic device 1. The memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed on the electronic device 1 and various kinds of data, such as the program 10 for discovering investment targets using new words and the corpus 00, but also to temporarily store data that has been output or is about to be output. Specifically, corpus material refers to material crawled from websites, such as news. The corpus 00 holds a large amount of such material; the present invention extracts new words from the material in the corpus 00 and explores investment targets based on those new words.
In some embodiments the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run program code stored in the memory 11 or to process data, for example to execute the program 10 for discovering investment targets using new words.
The communication bus 13 implements the connections and communication between these components.
The network interface 14 may optionally comprise a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used to establish a communication connection between the device and other electronic equipment.
Fig. 1 shows only the electronic device 1 with components 11-14. It should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the device may further comprise a user interface, which may include a display and an input unit such as a keyboard; the optional user interface may also include a standard wired interface and a wireless interface.
Optionally, in some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device or the like. The display may also be called a display screen or display unit, and is used to show the information processed in the electronic device 1 and to present a visual user interface.
In the device embodiment shown in Fig. 1, the memory 11 stores the program for discovering investment targets using new words. When the processor 12 executes that program stored in the memory 11, the following steps are implemented:
A1. Preprocess the corpus material in a corpus to obtain corpus text data, forming a corpus text set;
A2. Read one preprocessed corpus text, segment it into words and remove stop words, obtaining multiple word segments of the corpus text;
A3. Merge adjacent word segments of the corpus text, combining adjacent word segments into candidate new words and forming the candidate new word set of the corpus text;
A4. Select the true new words of the corpus text by comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text against preset thresholds; and
A5. Calculate the mutual information value between each selected new word and each company name in the corpus, and extract the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
Corpus material spans many different fields. This embodiment illustrates the scheme using a news corpus as an example, but the invention is not limited to the news domain. When an investor needs to know the current hot news in order to learn the relevant plans, R&D directions or potential demands of target enterprises, a web crawler is used to crawl Internet news, for example news from Sina, Baidu and Tencent, as the news corpus. Because hot news changes constantly over time, the crawled news is filtered along the time dimension so that the investor's picture of current hot news is more accurate: a preset time window is set and only news within that window is crawled, for example only news from the current day. The crawled news is then deduplicated, and the titles of the news items are stored in the corpus 00. Because the news corpus comes from diverse sources and contains many format types, it must be preprocessed to make subsequent processing easier, yielding news corpus text data and forming a news corpus text set.
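The time-window filtering and deduplication described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `filter_and_dedupe`, the `(title, published)` tuple layout and the sample headlines are hypothetical, and the crawling itself is not shown.

```python
from datetime import date

def filter_and_dedupe(items, day):
    """Keep titles published on `day`, dropping repeats (first occurrence wins)."""
    seen = set()
    kept = []
    for title, published in items:
        if published != day or title in seen:
            continue  # outside the preset time window, or a duplicate title
        seen.add(title)
        kept.append(title)
    return kept

# Hypothetical crawled items: (title, publication date).
news = [
    ("Emission rules tightened", date(2017, 11, 1)),
    ("Emission rules tightened", date(2017, 11, 1)),  # duplicate
    ("Old story", date(2017, 10, 30)),                # outside the window
    ("Chip output rises", date(2017, 11, 1)),
]
todays_titles = filter_and_dedupe(news, date(2017, 11, 1))
print(todays_titles)
```

Deduplicating on the exact title is the simplest choice; a real system might instead compare normalized text or URLs.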
In a specific implementation, the preprocessing may comprise one or more of: unifying the format of the news corpus into a text format, removing advertising noise from the news corpus, and filtering out profanity, sensitive words and stop words. When the format is unified into text, content that current technology cannot yet convert into text is filtered out.
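As a rough sketch of this preprocessing, the snippet below assumes the raw material is HTML and that advertising lines can be recognized by marker strings. Both assumptions, and the names `to_plain_text`, `drop_noise` and `ad_markers`, are illustrative and do not come from the patent.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")  # crude tag matcher, adequate for a sketch

def to_plain_text(raw):
    """Strip markup, decode HTML entities and collapse whitespace."""
    return " ".join(html.unescape(_TAG.sub(" ", raw)).split())

def drop_noise(lines, ad_markers=("Sponsored", "Advertisement")):
    """Remove lines that contain any of the (hypothetical) advertising markers."""
    return [line for line in lines if not any(m in line for m in ad_markers)]

plain = to_plain_text("<p>R&amp;D spending <b>rises</b></p>")
clean = drop_noise(["Sponsored: buy now", "Emission rules tightened"])
print(plain)
print(clean)
```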
After the news corpus text is obtained, each line of the text is segmented in turn using a word segmentation tool, such as the Stanford Chinese word segmenter or jieba. For example, segmenting the sentence "went to see a film last night" yields (in English gloss) "yesterday | evening | go | see | [particle] | film". The segmentation result is retained. To further improve the usefulness of the segmentation result, stop words are then removed from it: function words that cannot reflect the topic of the news, such as modal particles, adverbs, prepositions, conjunctions, adjectives and the like, are dropped. Such function words usually have no clear meaning of their own and only play a role within a complete sentence; common examples include "this", "that", "on", "under" and "where". In other embodiments, after the news corpus text is segmented, only the verb and/or noun segments of the segmentation result are finally retained; in the example above, only the word "film" might be kept. If the segmentation result for a line is empty after this processing, that line of text is filtered out. In other embodiments, the method for segmenting the news corpus text may also comprise one or more of: segmentation based on string matching, segmentation based on understanding, segmentation based on statistics, and segmentation based on a dictionary.
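The stop-word step can be sketched as below. To stay self-contained, the example starts from an already segmented token list rather than calling a segmenter; the Chinese tokens are an assumed reconstruction of the example sentence, and `STOP_WORDS` is a tiny hypothetical list, not the one used by the patent.

```python
# Hypothetical stop-word list; a real list would be far longer.
STOP_WORDS = {"的", "了", "这", "那", "上", "下"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and tokens that appear in the stop-word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]

# Assumed segmenter output for the example sentence
# (yesterday | evening | go | see | particle | film).
tokens = ["昨天", "晚上", "去", "看", "了", "电影"]
kept = remove_stop_words(tokens)
print(kept)
```

A line whose token list comes back empty would then be discarded, as the text describes.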
In other embodiments, to make it easier to delimit the scope of the subsequent segmentation, each news corpus text may be given a preliminary split into lines before it is segmented. The line splitting may break the text at punctuation, for example at full stops, commas, exclamation marks and question marks.
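A minimal sketch of this punctuation-based line splitting, assuming a small set of Chinese and ASCII punctuation marks as break points (the exact set and the name `split_lines` are illustrative):

```python
import re

# Break at full stops, commas, exclamation marks and question marks,
# in both full-width (Chinese) and ASCII forms.
_PUNCT = re.compile(r"[。，！？.,!?]+")

def split_lines(text):
    """Split text at punctuation and drop empty fragments."""
    return [part.strip() for part in _PUNCT.split(text) if part.strip()]

lines = split_lines("Rain today. Sunny tomorrow, right?")
print(lines)
```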
However, in the segmented news corpus text, a term that should be treated as a single word in some domain may have been split into several terms, so new word discovery is needed. To that end, adjacent word segments in the segmentation result are merged to form the candidate new words of the news corpus text.
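Merging adjacent word segments into candidates amounts to enumerating contiguous n-grams over the segment list. The sketch below assumes candidates of two up to `max_len` segments; the function name, the `max_len` limit and the `joiner` parameter are illustrative choices rather than details given in the patent.

```python
def candidate_words(segments, max_len=2, joiner=""):
    """Join every run of 2..max_len adjacent segments into a candidate new word."""
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(segments) - n + 1):
            candidates.append(joiner.join(segments[i:i + n]))
    return candidates

segments = ["pollutant", "emission", "standard"]
pairs = candidate_words(segments, joiner=" ")
print(pairs)
```

For Chinese text the default empty `joiner` concatenates the segments directly; a space is used here only because the sample segments are English.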
Next, the true new words of the news corpus text must be determined from its candidate new words. In other embodiments, step A4 specifically comprises:
A41. Calculate the word frequency of each candidate new word of the corpus text, and select the candidate new words whose word frequency exceeds a first preset threshold;
A42. Calculate the cohesion of each candidate new word selected in step A41, and from these select the candidate new words whose cohesion exceeds a second preset threshold; and
A43. Calculate the degree of freedom of each candidate new word selected in step A42, and from these select the candidate new words whose degree of freedom exceeds a third preset threshold as the true new words of the corpus text.
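Steps A41 to A43 form a cascade in which each stage scores only the survivors of the previous one. The sketch below shows that control flow with the three scores supplied as precomputed lookup tables; the function name `cascade_filter`, the sample scores and the candidate strings are all hypothetical.

```python
def cascade_filter(candidates, stages):
    """Apply (score_function, threshold) stages in order, keeping scores above threshold."""
    survivors = list(candidates)
    for score, threshold in stages:
        survivors = [c for c in survivors if score(c) > threshold]
    return survivors

# Hypothetical precomputed scores for three candidates.
freq = {"pollutant emission": 8, "emission today": 6, "go see": 2}
cohesion = {"pollutant emission": 0.05, "emission today": 0.001}
freedom = {"pollutant emission": 2.1}

stages = [
    (lambda w: freq.get(w, 0), 5),           # A41: word frequency
    (lambda w: cohesion.get(w, 0.0), 0.02),  # A42: cohesion (solidification degree)
    (lambda w: freedom.get(w, 0.0), 1.92),   # A43: degree of freedom
]
new_words = cascade_filter(freq.keys(), stages)
print(new_words)
```

Ordering the stages from cheapest to most expensive means the costlier scores are computed for fewer candidates.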
It is understood that to extract neologisms from news item language material text, it is clear that:Which type of word is just calculated One neologismsWhether enough look first at the number that this word occurs in a language material text or corpus 00, i.e. word Frequently.In the present embodiment, word frequency is embodied by reverse document-frequency (Inverse Document Frequency, IDF), IDF Characterize the frequency of a word in a document, if there is the frequency it is higher, illustrate what this word occurred in different environment Probability higher, characterizes degree of recognition of the word in different articles.General IDF is higher, illustrates that its degree of recognition is higher, is more possible to It is neologisms.But if IDF is very high, it is very common to represent this word on the contrary, is not necessarily necessary to enter new word set, especially It is to cause neologisms to pollute in order to prevent.It is in the screening step, all word frequency are (such as new at one more than the first predetermined threshold value Hear the number that occurs in language material text more than 5 times) neologisms undetermined screen.
However, a candidate selected by the above screening may not be a single word but a phrase made up of several words. Therefore, besides meeting the word frequency requirement, the cohesion of the candidate new word must also be considered, that is, the probability that each character of the candidate appears together with its other characters. For example, in one corpus text "film" occurs 389 times while "cinema" occurs only 175 times, yet we are more inclined to treat "cinema" as a word, because intuitively "film" and the suffix meaning "theater" are bound together more tightly. The words with the highest cohesion are ones such as "bat", "spider", "hesitant" and "perturbed" (each a two-character word in Chinese): each character in these words almost always appears together with the other, even in other contexts. In this screening step, all candidate new words whose cohesion exceeds the second preset threshold (for example 0.02) are selected.
Specifically, taking a two-element candidate as an example, let the probabilities that word A and word B each occur be P(A) and P(B). If the two words were independent, the probability of them occurring together would be P(A) * P(B). If they are not independent, the probability of them occurring together is much greater than P(A) * P(B), that is, P(C) >> P(A) * P(B). In other words, for the cohesion of a candidate new word to exceed the second preset threshold, the following condition must hold:
P(C) - P(A) * P(B) > m
where A and B denote the words making up the candidate new word, P(C) is the probability that A and B occur together, and m is the second preset threshold.
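The condition P(C) - P(A) * P(B) > m can be evaluated directly from counts. The sketch below assumes the probabilities are estimated as count divided by the total number of positions; that estimator, the function name and the sample counts are assumptions for illustration.

```python
def cohesion(count_ab, count_a, count_b, total):
    """Return P(C) - P(A) * P(B), with probabilities estimated as count / total."""
    p_c = count_ab / total
    return p_c - (count_a / total) * (count_b / total)

# Hypothetical counts: A occurs 50 times, B 40 times, A followed by B 30 times.
m = 0.02  # second preset threshold from the example in the text
score = cohesion(count_ab=30, count_a=50, count_b=40, total=1000)
print(score > m)
```

When A and B co-occur no more often than independence predicts, the score is close to zero and the candidate is rejected.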
In addition to meeting the word frequency and cohesion requirements above, the degree of freedom of a word must also be considered. The degree of freedom is the extent to which a word can be used freely. Looking only at the cohesion inside a word is not enough; we also need to look at how the word behaves externally as a whole. Take the two words "quilt" and "lifetime" as examples. We can say "buy a quilt", "cover with a quilt", "get into the quilt", "good quilt", "this quilt" and so on, putting all kinds of words in front of "quilt". The usage of "lifetime", however, is very fixed: apart from "a lifetime", "this lifetime", "the last lifetime" and "the next lifetime", essentially nothing else can precede it. The words that can appear to the left of "lifetime" are so limited that intuitively we feel "lifetime" does not stand as a word on its own; what really form words are wholes such as "a lifetime" and "this lifetime". The degree to which a word can be used freely is therefore also an important criterion for judging whether it is a new word. If a word can count as a new word, it should appear flexibly in many different contexts and have rich sets of left and right neighbor words. The randomness of a word's left and right neighbor sets can be measured by calculating their information entropy. For example, in the tongue twister 吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮 ("eat grapes without spitting out the grape skins; don't eat grapes, yet spit out grape skins"), the word "grape" (葡萄) occurs four times. Its left neighbors are {eat, spit, eat, spit} and its right neighbors are {not, skin, yet, skin}. By the information entropy formula, the entropy of the left neighbors of "grape" is about 0.693 and that of its right neighbors about 1.04, so in this sentence the right neighbors of "grape" are the richer set.
In this embodiment, the degree of freedom of a word is the smaller of its left-neighbor and right-neighbor information entropies. In this screening step, all candidate new words whose degree of freedom exceeds the third preset threshold (for example 1.92) are selected as the true new words of the news corpus text. Because the degree of freedom of "grape" is below the third preset threshold, it would not be selected as a new word.
Specifically, the information entropy is calculated as:
$$H = -\sum_{i=1}^{n} P_i \log P_i$$
where the logarithm is generally taken to base 2, giving units of bits; n is the number of left neighbor words or right neighbor words; and P_i is the probability with which each left or right neighbor word occurs.
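The entropy and the resulting degree of freedom can be checked with a short sketch. The function names are hypothetical, and the neighbor lists are English glosses of the "grape" example. Note that the figures quoted in the example (about 0.693 and 1.04) correspond to the natural logarithm; with base 2 the same neighbor sets give 1.0 and 1.5 bits.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors, base=math.e):
    """Information entropy of a list of neighbor words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

def degree_of_freedom(left, right, base=math.e):
    """Smaller of the left-neighbor and right-neighbor entropies."""
    return min(neighbor_entropy(left, base), neighbor_entropy(right, base))

# Neighbors of "grape" in the example sentence (English glosses).
left = ["eat", "spit", "eat", "spit"]
right = ["not", "skin", "yet", "skin"]

print(round(neighbor_entropy(left), 3))
print(round(neighbor_entropy(right), 2))
print(degree_of_freedom(left, right) > 1.92)  # compare with the example threshold
```

Taking the minimum means a candidate is rejected as soon as either side of it is too predictable, which is what excludes fragments such as "lifetime" in the earlier example.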
Further, company names are extracted from the news corpus text using word segmentation and a predetermined company name library. Extracting company names from a news corpus is an established technique and is not described further here. Suppose the new words finally extracted from the news corpus text include "pollutant emission", and the company names contained in the news corpus are Yunnan Salt Chemical, Sany Heavy Industry and PowerChina. The mutual information values between "pollutant emission" and each of "Yunnan Salt Chemical", "Sany Heavy Industry" and "PowerChina" are then calculated, and the company names whose mutual information value exceeds a fourth preset threshold (for example 0.8) are retained as reference investment targets.
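The patent does not spell out how the mutual information value is computed. One common choice, shown here purely as an assumption, is pointwise mutual information over document-level co-occurrence, log2(P(w, c) / (P(w) * P(c))). The function names, the sample documents and the use of substring matching are all illustrative.

```python
import math

def pmi(docs, word, company):
    """Pointwise mutual information of `word` and `company` over document co-occurrence."""
    n = len(docs)
    n_w = sum(word in d for d in docs)
    n_c = sum(company in d for d in docs)
    n_wc = sum(word in d and company in d for d in docs)
    if n_wc == 0:
        return float("-inf")  # never seen together
    return math.log2((n_wc / n) / ((n_w / n) * (n_c / n)))

def reference_targets(docs, word, companies, threshold=0.8):
    """Companies whose PMI with `word` exceeds the threshold."""
    return [c for c in companies if pmi(docs, word, c) > threshold]

# Hypothetical news titles.
docs = [
    "Sany Heavy Industry cuts pollutant emission",
    "pollutant emission rules hit Sany Heavy Industry",
    "PowerChina wins dam contract",
    "markets close higher",
]
targets = reference_targets(docs, "pollutant emission",
                            ["Sany Heavy Industry", "PowerChina"])
print(targets)
```

Whether a threshold such as 0.8 is meaningful depends on the scale of the measure actually used, so the value here is only carried over from the example in the text.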
It should be understood that the preset thresholds and other parameters mentioned in the above embodiments that need to be set in advance can be configured by the user according to actual conditions.
The electronic device 1 proposed in the above embodiment extracts candidate new words from the corpus by segmenting the corpus text and removing stop words, then selects the true new words in the corpus text by calculating the word frequency, cohesion and degree of freedom of the candidate new words, and finally determines the investment targets by calculating the mutual information value between the new words and the company names in the corpus text, improving the efficiency and accuracy of investment target extraction.
Optionally, in other embodiments, the program 10 for discovering investment targets using new words may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function. Fig. 2 is a module diagram of the program 10 for discovering investment targets using new words in Fig. 1. In this embodiment, the program 10 may be divided into a first processing module 110, a second processing module 120, a merging module 130, a computing module 140 and an extraction module 150. The functions or operating steps implemented by modules 110-150 are similar to those described above and are not detailed again here. By way of example:
The first processing module 110 preprocesses the corpus material in the corpus to obtain corpus text data, forming a corpus text set;
The second processing module 120 reads one preprocessed corpus text, segments it into words and removes stop words, obtaining multiple word segments of the corpus text;
The merging module 130 merges adjacent word segments of the corpus text, combining adjacent word segments into candidate new words and forming the candidate new word set of the corpus text;
The computing module 140 selects the true new words of the corpus text by comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text against preset thresholds; and
The extraction module 150 calculates the mutual information value between each selected new word and each company name in the corpus, and extracts the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
In addition, the present invention also provides a method for discovering investment targets using new words. Fig. 3 is a flowchart of a preferred embodiment of this method. The method may be performed by a device, and the device may be implemented in software and/or hardware.
In this embodiment, the method for discovering investment targets using new words includes:
S1: preprocessing the raw material in the corpus to obtain corpus text data, forming a corpus text set;
S2: reading a preprocessed corpus text, segmenting it into words and removing stop words, to obtain a plurality of word segments of the corpus text;
S3: merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, to form the candidate new word set of the corpus text;
S4: filtering out the genuine new words of the corpus text according to the result of comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
S5: computing the mutual information value, within the corpus, between the new words filtered out and the company names, and extracting the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
Corpora span many different fields. This embodiment illustrates the scheme of the present invention using a news corpus as an example, but the scheme is not limited to the news field. An investor who needs to follow current hot news in order to learn about a target enterprise's plans, R&D direction or potential demand can use a web crawler to collect online news from the Internet as the news corpus, for example by crawling online news from Sina, Baidu, Tencent and similar sites. Because hot news changes over time, the crawled news is filtered along the time dimension so that the investor gets an accurate picture of what is currently hot: a preset time window is set and only news within that window is crawled, for example only news from the current day. The crawled news is then de-duplicated, and the news titles are stored in the corpus. Because news corpora come from diverse sources, they contain many format types. To simplify subsequent processing, the news corpus must be preprocessed to obtain news corpus text data, forming a news corpus text set.
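The time-window filtering and de-duplication described above can be sketched as follows. This is only a minimal illustration: the function name `select_titles`, the `(title, published_date)` tuple layout and the sample headlines are assumptions made for the example, and the crawling itself is not shown.

```python
from datetime import date

def select_titles(items, day):
    """Keep the titles of news published on `day`, dropping duplicate titles.

    `items` is an iterable of (title, published_date) pairs; this layout is
    an assumption made for the sketch, not something the text prescribes.
    """
    seen = set()
    titles = []
    for title, published in items:
        if published != day:   # time-dimension filter: only the chosen day
            continue
        if title in seen:      # de-duplication of crawled news
            continue
        seen.add(title)
        titles.append(title)
    return titles

# Made-up sample data standing in for crawler output.
crawled = [
    ("Pollutant emission rules tightened", date(2017, 11, 1)),
    ("Pollutant emission rules tightened", date(2017, 11, 1)),
    ("Old headline", date(2017, 10, 30)),
]
corpus_titles = select_titles(crawled, date(2017, 11, 1))
```

Filtering on an exact day mirrors the "only news from the current day" example; a wider preset window would compare against a start and end date instead.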
In a specific implementation, the preprocessing may include one or more of the following: unifying the format of the news corpus into a text format, removing advertising noise from the news corpus, and filtering out profanity, sensitive words and stop words. When the format of the news corpus is unified into a text format, content that current techniques cannot yet convert into text is filtered out.
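A minimal sketch of such preprocessing is shown below, assuming the raw items are HTML-like strings. The marker list `AD_MARKERS`, the word list `BLOCKED_WORDS` and the function name `preprocess` are hypothetical; a real system would use its own noise rules, and Chinese text would need word segmentation rather than whitespace splitting.

```python
import re

# Hypothetical noise rules for illustration only.
AD_MARKERS = ("[AD]", "Sponsored")
BLOCKED_WORDS = {"damn"}

def preprocess(raw):
    """Convert one raw news item to plain-text lines with noise removed."""
    text = re.sub(r"<[^>]+>", " ", raw)          # unify to plain text: strip markup
    lines = []
    for line in text.splitlines():
        line = " ".join(line.split())            # collapse runs of whitespace
        if not line or any(m in line for m in AD_MARKERS):
            continue                             # drop empty and advertising lines
        kept = [w for w in line.split(" ") if w.lower() not in BLOCKED_WORDS]
        if kept:
            lines.append(" ".join(kept))
    return lines
```

For example, `preprocess("<p>Steel prices rise</p>\n[AD] Buy now\n")` keeps only the first line as plain text.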
After the news corpus text is obtained, each line of it is segmented into words one by one with a word segmentation tool, for example the Stanford Chinese word segmenter or jieba. For example, segmenting the sentence "went to see a film last night" yields "yesterday | evening | go | see | (particle) | film". The segmentation result is retained after segmentation. To further improve the usefulness of the segmentation result, stop words are then removed from it: modal particles, adverbs, prepositions, conjunctions and other function words that cannot reflect the theme of the news corpus are dropped. Such function words usually have no definite meaning of their own and only play a role inside a complete sentence; common examples include "this", "that", "on", "under" and "where". In other embodiments, after the news corpus text is segmented, only the verb and/or noun segments of the segmentation result are finally kept; in the example above, only the word "film" might be kept. If the segmentation result for a line is empty after this processing, that line of text is filtered out. In other embodiments, the method for segmenting the news corpus text may also include one or more of: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics, and a segmentation method based on a dictionary.
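The stop-word removal step can be sketched as below. To stay free of third-party dependencies, the sketch takes lines that are already segmented (for instance by one of the tools named above) as lists of tokens. The stop-word set and the Chinese tokens are assumptions: the tokens are a presumed reconstruction of the example sentence, not text taken from the original.

```python
# Hypothetical stop-word list; a real deployment would load a full list.
STOP_WORDS = {"了", "的", "这", "那"}

def remove_stop_words(tokens):
    """Drop function words that carry no topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def clean_lines(segmented_lines):
    """Apply stop-word removal per line and discard lines left empty."""
    cleaned = []
    for tokens in segmented_lines:
        kept = remove_stop_words(tokens)
        if kept:                      # an empty result means the line is filtered out
            cleaned.append(kept)
    return cleaned

# Assumed segmentation of the example sentence ("went to see a film last night").
tokens = ["昨天", "晚上", "去", "看", "了", "电影"]
word_segments = remove_stop_words(tokens)
```

Keeping only nouns and verbs, as in the alternative embodiment, would additionally require part-of-speech tags from the segmenter and is not shown here.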
In other embodiments, to make it easier to delimit the scope of subsequent word segmentation, the news corpus text may first be coarsely split into rows before each news corpus text is segmented. The row splitting may break the text at punctuation marks such as full stops, commas, exclamation marks and question marks.
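Row splitting at punctuation might look like the following sketch. The punctuation set covers both full-width and ASCII marks, which is an assumption; note that splitting on the ASCII full stop would also break decimal numbers, so a production rule would need to be more careful.

```python
import re

# Full stops, commas, exclamation marks and question marks,
# in both full-width (Chinese) and ASCII forms.
PUNCTUATION = r"[。，！？.,!?]"

def split_rows(text):
    """Split text into rows at punctuation, dropping empty pieces."""
    return [piece.strip() for piece in re.split(PUNCTUATION, text) if piece.strip()]
```

Each resulting row can then be passed to the word segmentation step on its own.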
However, in the segmented news corpus text, a term that should be treated as a single word in some field may have been split into several terms, so new word discovery is required. To that end, adjacent word segments in the segmentation result are merged to form the candidate new words of the news corpus text.
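Merging adjacent word segments into candidates can be sketched as an n-gram enumeration. The text does not say how many adjacent segments may be merged, so the `max_len` parameter (default 2) and the `sep` parameter are assumptions introduced for this example.

```python
def candidate_words(tokens, max_len=2, sep=""):
    """Combine runs of 2..max_len adjacent word segments into candidates.

    `sep` is the joiner: "" suits Chinese, " " suits English glosses.
    """
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(sep.join(tokens[i:i + n]))
    return candidates
```

With the segments `["pollutant", "emission", "standard"]` and `sep=" "`, the candidates are "pollutant emission" and "emission standard"; the later screening steps decide which of them are genuine new words.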
Next, the genuine new words of the news corpus text must be determined from its candidate new words. Fig. 4 is a detailed flow diagram of step S4 of the method for discovering investment targets using new words. In other embodiments, step S4 specifically includes:
S41: computing the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
S42: computing the cohesion of each candidate new word selected in step S41, and selecting from them the candidate new words whose cohesion exceeds a second preset threshold; and
S43: computing the degree of freedom of each candidate new word selected in step S42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the genuine new words of the corpus text.
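The three-stage screening of S41 to S43 can be sketched as a cascade in which each metric is evaluated only for the survivors of the previous stage. The function name, the callable parameters and the score table below are illustrative assumptions; the numbers are made up, while the thresholds 5, 0.02 and 1.92 are the example values given in this description.

```python
def screen_candidates(candidates, freq_of, cohesion_of, freedom_of, t1, t2, t3):
    """Cascade S41 -> S42 -> S43 over the candidate new words."""
    after_s41 = [w for w in candidates if freq_of(w) > t1]       # word frequency
    after_s42 = [w for w in after_s41 if cohesion_of(w) > t2]    # cohesion
    return [w for w in after_s42 if freedom_of(w) > t3]          # degree of freedom

# Made-up scores: (frequency, cohesion, degree of freedom).
scores = {
    "pollutant emission": (9, 0.05, 2.3),
    "the of": (40, 0.001, 3.0),
    "rare pair": (1, 0.2, 2.5),
    "lifetime": (12, 0.04, 0.9),
}
new_words = screen_candidates(
    list(scores),
    freq_of=lambda w: scores[w][0],
    cohesion_of=lambda w: scores[w][1],
    freedom_of=lambda w: scores[w][2],
    t1=5, t2=0.02, t3=1.92,
)
```

Passing the metrics as callables keeps the more expensive cohesion and entropy computations off candidates that already failed the frequency test.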
To extract new words from a piece of news corpus text, one must first settle what counts as a new word. The first thing to check is whether the word occurs often enough in a corpus text or in the corpus, that is, its word frequency. In this embodiment, word frequency is expressed through the inverse document frequency (IDF). As used here, IDF characterizes how frequently a word occurs across documents: the more frequently it occurs, the more likely the word is to appear in different contexts, which reflects how well the word is recognized across articles. In general, the higher this value, the higher the degree of recognition and the more likely the candidate is a new word. If the value is very high, however, the word is simply very common and need not be added to the new word set, which also guards against polluting the set. In this screening step, all candidate new words whose word frequency exceeds the first preset threshold (for example, more than 5 occurrences in one news corpus text) are selected.
However, a candidate new word that passes this screening may not be a single word at all, but a phrase made up of several words. Besides meeting the word frequency requirement, the cohesion (solidification degree) of a candidate new word must therefore also be considered, that is, the probability that each component of the candidate occurs together with its other components. For example, in one corpus text "film" occurs 389 times and "cinema" only 175 times, yet we are more inclined to treat "cinema" as a word, because intuitively "film" and "hall" (the two parts of the Chinese word for "cinema") are bound together more tightly. The words with the highest cohesion are those such as "bat", "spider", "hesitate" and "perturbed", which are two-character words in Chinese: each character in them almost always occurs together with the other, even in other contexts. In this screening step, all candidate new words whose cohesion exceeds the second preset threshold (for example 0.02) are selected.
Specifically, taking a two-component word as an example, let the probabilities that component A and component B each occur be P(A) and P(B). If the two components were independent, the probability of their occurring together would be P(A) * P(B). If they are not independent, the probability of their occurring together is greater than P(A) * P(B), that is, P(C) >> P(A) * P(B). In other words, for the cohesion of a candidate new word to exceed the second preset threshold, the following condition must hold:
P(C) - P(A) * P(B) > m
where A and B denote the components of the candidate new word, P(C) is the probability that A and B occur together, and m is the second preset threshold.
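The cohesion condition can be sketched as below. The text does not specify how the probabilities are estimated, so this example assumes simple relative frequencies over one token sequence: unigram counts divided by the number of tokens, and adjacent-pair counts divided by the number of adjacent pairs.

```python
from collections import Counter

def cohesion(tokens, a, b):
    """Return P(C) - P(A) * P(B) for the adjacent pair (a, b).

    Probabilities are estimated as relative frequencies, which is an
    assumption of this sketch rather than something the text prescribes.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_a = unigrams[a] / n
    p_b = unigrams[b] / n
    p_c = bigrams[(a, b)] / max(n - 1, 1)   # share of adjacent pairs equal to (a, b)
    return p_c - p_a * p_b

def passes_cohesion(tokens, a, b, m=0.02):
    """Check the condition P(C) - P(A) * P(B) > m with m as the second threshold."""
    return cohesion(tokens, a, b) > m

sample = ["pollutant", "emission", "rises", "pollutant", "emission", "falls"]
```

In `sample`, "pollutant" is always followed by "emission", so the pair clears the example threshold of 0.02, whereas the reversed pair never occurs and fails it.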
In addition to the word frequency and cohesion requirements above, the degree of freedom of a word must also be considered. The degree of freedom is the extent to which a word can be used freely. Looking only at the cohesion inside a word is not enough; its external behavior as a whole must be examined as well. Take the two Chinese strings glossed here as "quilt" and "lifetime". We can say "buy a quilt", "cover with a quilt", "get into the quilt", "a good quilt", "this quilt" and so on, so all kinds of characters can precede "quilt". The usage of "lifetime", by contrast, is very fixed: apart from "a lifetime", "this lifetime", "the previous lifetime" and "the next lifetime", almost nothing else can precede it. The characters that can appear to the left of "lifetime" are so limited that intuitively it does not stand as a word on its own; the actual words are wholes such as "a lifetime" and "this lifetime". The degree to which a string can be used freely is therefore also an important criterion for deciding whether it is a new word. If a string counts as a new word, it should be able to appear flexibly in many different contexts, with rich sets of left-adjacent and right-adjacent characters. The randomness of these two sets can be measured by computing information entropy. For example, in the sentence "eat grapes without spitting out the grape skins; don't eat grapes, yet spit out grape skins", the word "grape" occurs four times; its left neighbors are {eat, spit, eat, spit} and its right neighbors are {not, skin, instead, skin}. By the information entropy formula, the entropy of the left neighbors of "grape" is about 0.693 and the entropy of the right neighbors is about 1.04 (these two values correspond to the natural logarithm; in base 2 they would be 1 and 1.5). Thus, in this sentence, the right neighbors of "grape" are richer. In this embodiment, the degree of freedom of a word is the smaller of its left-neighbor entropy and its right-neighbor entropy. In this screening step, all candidate new words whose degree of freedom exceeds the third preset threshold (for example 1.92) are selected as the genuine new words of the news corpus text. Because the degree of freedom of "grape" is below the third preset threshold, it is not selected as a new word.
Specifically, the information entropy is computed as:
$$H = -\sum_{i=1}^{n} P_i \log P_i$$
where the logarithm is generally taken to base 2, in which case the unit is the bit; n is the number of left-adjacent or right-adjacent words; and P_i is the probability of occurrence of each left-adjacent or right-adjacent word.
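The neighbor entropy and the resulting degree of freedom can be sketched as follows, using the "grape" sentence as data. The tokens are English glosses of the example sentence, and the function names are assumptions. The `base` parameter defaults to the natural logarithm because that is what reproduces the quoted values of about 0.693 and 1.04; passing `base=2` gives the result in bits.

```python
import math
from collections import Counter

def entropy(neighbors, base=math.e):
    """Information entropy of a list of neighboring words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

def neighbor_lists(tokens, word):
    """Collect the left and right neighbors of every occurrence of `word`."""
    left = [tokens[i - 1] for i, t in enumerate(tokens) if t == word and i > 0]
    right = [tokens[i + 1] for i, t in enumerate(tokens)
             if t == word and i < len(tokens) - 1]
    return left, right

def degree_of_freedom(tokens, word, base=math.e):
    """Smaller of the left-neighbor and right-neighbor entropies."""
    left, right = neighbor_lists(tokens, word)
    return min(entropy(left, base), entropy(right, base))

# English glosses of the example sentence, one token per original segment.
grape_sentence = ["eat", "grape", "not", "spit", "grape", "skin",
                  "not", "eat", "grape", "instead", "spit", "grape", "skin"]
left, right = neighbor_lists(grape_sentence, "grape")
```

Here the degree of freedom of "grape" is the left-neighbor entropy, which is well below the example threshold of 1.92 in either logarithm base.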
Further, company names are extracted from the news corpus text using word segmentation and a predetermined company name database. Mature techniques already exist for extracting company names from news corpora, so they are not described further here. Suppose the new words finally extracted from the news corpus text include "pollutant emission", and the company names contained in the news corpus are Yunnan Salt Chemical (云南盐化), Sany Heavy Industry (三一重工) and PowerChina (中国电建). The mutual information values between "pollutant emission" and each of "Yunnan Salt Chemical", "Sany Heavy Industry" and "PowerChina" are then computed, and the company names whose mutual information value exceeds the fourth preset threshold (for example 0.8) are retained as reference investment targets.
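The text does not give a formula for the mutual information value. One common choice, assumed here purely for illustration, is pointwise mutual information (PMI) over document co-occurrence: log2 of P(word, company) / (P(word) * P(company)), where each probability is the share of corpus texts containing the item. The function names, the substring matching and the sample documents below are all made up for the sketch, and the company names are assumed English renderings.

```python
import math

def pmi(docs, word, company):
    """Pointwise mutual information of `word` and `company` over documents.

    PMI is an assumed stand-in for the unspecified mutual information value.
    """
    n = len(docs)
    n_w = sum(1 for d in docs if word in d)
    n_c = sum(1 for d in docs if company in d)
    n_wc = sum(1 for d in docs if word in d and company in d)
    if n_wc == 0:
        return float("-inf")            # never co-occur: no association
    return math.log2(n_wc * n / (n_w * n_c))

def reference_targets(docs, word, companies, threshold=0.8):
    """Keep the companies whose PMI with `word` exceeds the fourth threshold."""
    return [c for c in companies if pmi(docs, word, c) > threshold]

# Made-up corpus texts for illustration.
docs = [
    "Sany Heavy Industry pollutant emission retrofit",
    "Sany Heavy Industry pollutant emission fine",
    "PowerChina wins hydropower contract",
    "PowerChina signs rail deal",
    "PowerChina opens new office",
    "Yunnan Salt Chemical quarterly results",
]
companies = ["Yunnan Salt Chemical", "Sany Heavy Industry", "PowerChina"]
targets = reference_targets(docs, "pollutant emission", companies)
```

In this toy corpus only one company co-occurs with the new word, so it alone clears the example threshold of 0.8.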
It can be understood that the preset thresholds and other parameters that must be set in advance in the above embodiments may be configured by the user according to actual conditions.
The method for discovering investment targets using new words proposed in the above embodiment segments the corpus text and removes stop words, extracts candidate new words from the corpus, and then filters out the genuine new words in the corpus text by computing the word frequency, cohesion and degree of freedom of the candidate new words. Finally, it computes the mutual information value between the new words and the company names in the corpus text to determine the final investment targets, which improves the efficiency and accuracy of investment target extraction.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium storing a program for discovering investment targets using new words. When executed by a processor, the program implements the following operations:
A1: preprocessing the raw material in the corpus to obtain corpus text data, forming a corpus text set;
A2: reading a preprocessed corpus text, segmenting it into words and removing stop words, to obtain a plurality of word segments of the corpus text;
A3: merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, to form the candidate new word set of the corpus text;
A4: filtering out the genuine new words of the corpus text according to the result of comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
A5: computing the mutual information value, within the corpus, between the new words filtered out and the company names, and extracting the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
Preferably, step A4 includes:
A41: computing the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
A42: computing the cohesion of each candidate new word selected in step A41, and selecting from them the candidate new words whose cohesion exceeds a second preset threshold; and
A43: computing the degree of freedom of each candidate new word selected in step A42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the genuine new words of the corpus text.
Preferably, the step of "computing the degree of freedom of each candidate new word selected in step A42" includes:
computing the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word selected in step A42; and
taking the smaller of the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word as the degree of freedom of that word.
The specific embodiments of the computer-readable storage medium of the present invention are essentially the same as the embodiments of the above method and electronic device for discovering investment targets using new words, and are not repeated here.
It should be noted that the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element introduced by "comprising a ..." does not exclude the presence of additional identical elements in the process, device, article or method that comprises it.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (10)

  1. A method for discovering investment targets using new words, applied to an electronic device, characterized in that the method comprises:
    S1: preprocessing the raw material in a corpus to obtain corpus text data, forming a corpus text set;
    S2: reading a preprocessed corpus text, segmenting it into words and removing stop words, to obtain a plurality of word segments of the corpus text;
    S3: merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, to form a candidate new word set of the corpus text;
    S4: filtering out the genuine new words of the corpus text according to the result of comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
    S5: computing the mutual information value, within the corpus, between the new words filtered out and company names, and extracting the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
  2. The method for discovering investment targets using new words according to claim 1, characterized in that the preprocessing in step S1 comprises: unifying the format of the raw material in the corpus into a text format, and removing advertising noise from the raw material.
  3. The method for discovering investment targets using new words according to claim 1, characterized in that the method for segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics, and a segmentation method based on a dictionary.
  4. The method for discovering investment targets using new words according to claim 1, 2 or 3, characterized in that step S4 comprises:
    S41: computing the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
    S42: computing the cohesion of each candidate new word selected in step S41, and selecting from them the candidate new words whose cohesion exceeds a second preset threshold; and
    S43: computing the degree of freedom of each candidate new word selected in step S42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the genuine new words of the corpus text.
  5. The method for discovering investment targets using new words according to claim 4, characterized in that the step of "computing the degree of freedom of each candidate new word selected in step S42" comprises:
    computing the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word selected in step S42; and
    taking the smaller of the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word as the degree of freedom of that candidate new word.
  6. An electronic device, characterized in that the device comprises a memory and a processor, the memory storing a program for discovering investment targets using new words that can run on the processor, wherein the program, when executed by the processor, implements the following steps:
    A1: preprocessing the raw material in a corpus to obtain corpus text data, forming a corpus text set;
    A2: reading a preprocessed corpus text, segmenting it into words and removing stop words, to obtain a plurality of word segments of the corpus text;
    A3: merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, to form a candidate new word set of the corpus text;
    A4: filtering out the genuine new words of the corpus text according to the result of comparing the word frequency, cohesion and degree of freedom of each candidate new word in the corpus text with preset thresholds; and
    A5: computing the mutual information value, within the corpus, between the new words filtered out and company names, and extracting the company names and new words whose mutual information value satisfies a preset condition as reference investment targets.
  7. The electronic device according to claim 6, characterized in that the preprocessing in step A1 comprises: unifying the format of the raw material in the corpus into a text format, and removing advertising noise from the news corpus;
    the method for segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics, and a segmentation method based on a dictionary.
  8. The electronic device according to claim 6 or 7, characterized in that step A4 comprises:
    A41: computing the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;
    A42: computing the cohesion of each candidate new word selected in step A41, and selecting from them the candidate new words whose cohesion exceeds a second preset threshold; and
    A43: computing the degree of freedom of each candidate new word selected in step A42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the genuine new words of the corpus text.
  9. The electronic device according to claim 8, characterized in that the step of "computing the degree of freedom of each candidate new word selected in step A42" comprises:
    computing the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word selected in step A42; and
    taking the smaller of the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word as the degree of freedom of that word.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for discovering investment targets using new words, and the program, when executed by a processor, implements the steps of the method for discovering investment targets using new words according to any one of claims 1 to 5.
CN201711059221.6A 2017-11-01 2017-11-01 Utilize the method, apparatus and storage medium of new word discovery investment target Pending CN108038119A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711059221.6A CN108038119A (en) 2017-11-01 2017-11-01 Utilize the method, apparatus and storage medium of new word discovery investment target
PCT/CN2018/076174 WO2019085335A1 (en) 2017-11-01 2018-02-10 Method for discovering investment objects with new words, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711059221.6A CN108038119A (en) 2017-11-01 2017-11-01 Utilize the method, apparatus and storage medium of new word discovery investment target

Publications (1)

Publication Number Publication Date
CN108038119A true CN108038119A (en) 2018-05-15

Family

ID=62093676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711059221.6A Pending CN108038119A (en) 2017-11-01 2017-11-01 Utilize the method, apparatus and storage medium of new word discovery investment target

Country Status (2)

Country Link
CN (1) CN108038119A (en)
WO (1) WO2019085335A1 (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN111309898A (en) * 2018-11-26 2020-06-19 中移(杭州)信息技术有限公司 Text mining method and device for new word discovery
CN111339403A (en) * 2020-02-11 2020-06-26 安徽理工大学 Commodity comment-based new word extraction method
CN111626053A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 Method and device for recognizing descriptor of new case means, electronic device and storage medium
CN111626054A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN112329458A (en) * 2020-05-21 2021-02-05 北京明亿科技有限公司 New organization descriptor recognition method and device, electronic device and storage medium
CN112541057A (en) * 2019-09-04 2021-03-23 上海晶赞融宣科技有限公司 Distributed new word discovery method and device, computer equipment and storage medium
WO2021051600A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method, apparatus and device for identifying new word based on information entropy, and storage medium
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN112883725A (en) * 2020-12-29 2021-06-01 上海讯飞瑞元信息技术有限公司 File generation method and device, electronic equipment and storage medium
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
CN113536787A (en) * 2021-07-14 2021-10-22 福建亿榕信息技术有限公司 Method and equipment for establishing audit professional lexicon
WO2021217936A1 (en) * 2020-04-29 2021-11-04 深圳壹账通智能科技有限公司 Word combination processing-based new word discovery method and apparatus, and computer device
WO2021217931A1 (en) * 2020-04-30 2021-11-04 深圳壹账通智能科技有限公司 Classification model-based field extraction method and apparatus, electronic device, and medium
CN114186557A (en) * 2022-02-17 2022-03-15 阿里巴巴达摩院(杭州)科技有限公司 Method, device and storage medium for determining subject term
CN114385792A (en) * 2022-03-23 2022-04-22 北京零点远景网络科技有限公司 Method, device, equipment and storage medium for extracting words from work order data

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230416373A1 (en) 2020-11-14 2023-12-28 Biogen Ma Inc. Biphasic subcutaneous dosing regimens for anti-vla-4 antibodies

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3940656B2 (en) * 2002-09-30 2007-07-04 株式会社東芝 Dictionary refinement method and program used for text information classification
US20070265832A1 (en) * 2006-05-09 2007-11-15 Brian Bauman Updating dictionary during application installation
CN102023967A (en) * 2010-11-11 2011-04-20 清华大学 Text emotion classifying method in stock field
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105786962A (en) * 2016-01-15 2016-07-20 优品财富管理有限公司 Big data index analysis method and system based on news transmissibility
CN106934054A (en) * 2017-03-17 2017-07-07 前海梧桐(深圳)数据有限公司 Big data-based method and system for accurate analysis of enterprise industry segments
CN107292744A (en) * 2017-06-07 2017-10-24 前海梧桐(深圳)数据有限公司 Machine learning-based investment trend analysis method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786991B (en) * 2016-02-18 2019-03-15 中国科学院自动化研究所 Chinese emotion new word identification method and system combining user emotion expression patterns
CN105956158B (en) * 2016-05-17 2019-08-09 清华大学 Method for automatically extracting network neologisms based on massive microblog text and user information
CN106126606B (en) * 2016-06-21 2019-08-20 国家计算机网络与信息安全管理中心 Short text new word discovery method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
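TING GE: "互联网时代的社会语言学:基于SNS的文本数据挖掘—转自MatriX67" [Sociolinguistics in the Internet era: SNS-based text data mining, reposted from Matrix67], 《豆瓣》 (Douban) *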

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN111309898A (en) * 2018-11-26 2020-06-19 中移(杭州)信息技术有限公司 Text mining method and device for new word discovery
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110457708B (en) * 2019-08-16 2023-05-16 腾讯科技(深圳)有限公司 Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN112541057A (en) * 2019-09-04 2021-03-23 上海晶赞融宣科技有限公司 Distributed new word discovery method and device, computer equipment and storage medium
WO2021051600A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method, apparatus and device for identifying new word based on information entropy, and storage medium
CN111339403A (en) * 2020-02-11 2020-06-26 安徽理工大学 Commodity comment-based new word extraction method
CN111339403B (en) * 2020-02-11 2022-08-02 安徽理工大学 Commodity comment-based new word extraction method
WO2021217936A1 (en) * 2020-04-29 2021-11-04 深圳壹账通智能科技有限公司 Word combination processing-based new word discovery method and apparatus, and computer device
WO2021217931A1 (en) * 2020-04-30 2021-11-04 深圳壹账通智能科技有限公司 Classification model-based field extraction method and apparatus, electronic device, and medium
CN111626053A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 Method and device for recognizing descriptor of new case means, electronic device and storage medium
CN112329458B (en) * 2020-05-21 2024-05-10 北京明亿科技有限公司 New organization descriptor recognition method and device, electronic equipment and storage medium
CN112329458A (en) * 2020-05-21 2021-02-05 北京明亿科技有限公司 New organization descriptor recognition method and device, electronic device and storage medium
CN111626053B (en) * 2020-05-21 2024-04-09 北京明亿科技有限公司 New scheme means descriptor recognition method and device, electronic equipment and storage medium
CN111626054B (en) * 2020-05-21 2023-12-19 北京明亿科技有限公司 Novel illegal action descriptor recognition method and device, electronic equipment and storage medium
CN111626054A (en) * 2020-05-21 2020-09-04 北京明亿科技有限公司 New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system
CN111931491A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Domain dictionary construction method and device
CN111931491B (en) * 2020-08-14 2023-11-14 中国工商银行股份有限公司 Domain dictionary construction method and device
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN112883725A (en) * 2020-12-29 2021-06-01 上海讯飞瑞元信息技术有限公司 File generation method and device, electronic equipment and storage medium
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN112560448B (en) * 2021-02-20 2021-06-22 京华信息科技股份有限公司 New word extraction method and device
CN112560448A (en) * 2021-02-20 2021-03-26 京华信息科技股份有限公司 New word extraction method and device
CN113468317A (en) * 2021-06-26 2021-10-01 北京网聘咨询有限公司 Resume screening method, system, equipment and storage medium
CN113468317B (en) * 2021-06-26 2024-03-08 北京网聘信息技术有限公司 Resume screening method, system, equipment and storage medium
CN113536787A (en) * 2021-07-14 2021-10-22 福建亿榕信息技术有限公司 Method and equipment for establishing audit professional lexicon
CN113449082A (en) * 2021-07-16 2021-09-28 上海明略人工智能(集团)有限公司 New word discovery method, system, electronic device and medium
CN114186557A (en) * 2022-02-17 2022-03-15 阿里巴巴达摩院(杭州)科技有限公司 Method, device and storage medium for determining subject term
CN114385792B (en) * 2022-03-23 2022-06-24 北京零点远景网络科技有限公司 Method, device, equipment and storage medium for extracting words from work order data
CN114385792A (en) * 2022-03-23 2022-04-22 北京零点远景网络科技有限公司 Method, device, equipment and storage medium for extracting words from work order data

Also Published As

Publication number Publication date
WO2019085335A1 (en) 2019-05-09

Similar Documents

Publication Publication Date Title
CN108038119A (en) Method, apparatus and storage medium for discovering investment targets using new words
CN109271512B (en) Emotion analysis method, device and storage medium for public opinion comment information
CN109145216B (en) Network public opinion monitoring method, device and storage medium
CN109325165B (en) Network public opinion analysis method, device and storage medium
US9336299B2 (en) Acquisition of semantic class lexicons for query tagging
CN110163476A (en) Project intelligent recommendation method, electronic device and storage medium
CN101944109B (en) System and method for extracting picture abstract based on page partitioning
CN109062972A (en) Web page classification method, device and computer readable storage medium
KR20140131327A (en) Social media data analysis system and method
CN102314436A (en) Webpage automatic adjusting method and system
CN112650910B (en) Method, device, equipment and storage medium for determining website update information
WO2021068681A1 (en) Tag analysis method and device, and computer readable storage medium
CN105512104A (en) Dictionary dimension reducing method and device and information classifying method and device
US9064009B2 (en) Attribute cloud
CN104850617A (en) Short text processing method and apparatus
US11687647B2 (en) Method and electronic device for generating semantic representation of document to determine data security risk
CN109241392A (en) Target word recognition method, device, system and storage medium
CN104933074A (en) News ordering method and device and terminal equipment
CN107861945A (en) Finance data analysis method, application server and computer-readable recording medium
CN104462061A (en) Word extraction method and word extraction device
CN103631796A (en) Website sort management method and electronic device
Khemani et al. A review on reddit news headlines with nltk tool
CN104572874B (en) Webpage information extraction method and device
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN111639250A (en) Enterprise description information acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515