CN108038119A

CN108038119A - Method, apparatus and storage medium for discovering investment targets using new word discovery

Info

Publication number: CN108038119A
Application number: CN201711059221.6A
Authority: CN
Inventors: 汪伟; 罗傲雪; 陈恋; 陈一恋; 王晓伟
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2017-11-01
Filing date: 2017-11-01
Publication date: 2018-05-15
Also published as: WO2019085335A1

Abstract

The present invention proposes a method for discovering investment targets using new word discovery, including: preprocessing the corpus material in a corpus to obtain corpus text data; reading a preprocessed corpus text, and performing word segmentation and stop-word removal on it to obtain multiple word segments of the corpus text; merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words; screening out the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and calculating the mutual information value between each screened new word and the company names in the corpus, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets. The present invention also proposes an electronic device and a computer-readable storage medium. By extracting investment targets from the new words screened out of a news corpus, the invention improves the efficiency and accuracy of investment.

Description

Method, apparatus and storage medium for discovering investment targets using new word discovery

Technical field

The present invention relates to the field of computer technology, and in particular to a method, an electronic device and a computer-readable storage medium for discovering investment targets using new word discovery.

Background art

At present, when observing investment targets, investors lack correlated observation of the investee and hot topics. Such observation can, to some extent, improve the investor's understanding of and expectations about the investment target's business plan, research focus, business growth, raw material demand, team building and so on.

With the spread of the Internet, each news website publishes thousands of news items every day, and news is updated in real time. If the hot topics of the current market, and the enterprises and industries involved in those hot topics, can be extracted and analyzed from the massive news corpus, then from the investor's point of view it becomes possible to learn the relevant plans, R&D directions or potential needs of target enterprises, and thus to discover and seize business opportunities. Therefore, how to extract and analyze new words from a news corpus, and how to use the new words extracted from the news corpus to discover investment targets, is a problem that urgently needs to be solved.

Summary of the invention

The present invention provides a method, an electronic device and a computer-readable storage medium for discovering investment targets using new word discovery. Its main purpose is to screen and analyze new words from a news corpus and to extract investment targets using the new words screened out of the news corpus.

To achieve the above object, the present invention provides an electronic device. The device includes a memory and a processor; the memory stores a program for discovering investment targets using new word discovery that can run on the processor, and the program implements the following steps when executed by the processor:

A1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;

A2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on the corpus text to obtain multiple word segments of the corpus text;

A3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, and forming the candidate new word set of the corpus text;

A4, screening out the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and

A5, calculating the mutual information value between each screened new word and the company names in the corpus, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.

Preferably, step A4 includes:

A41, calculating the word frequency of each candidate new word of the corpus text, and screening out the candidate new words whose word frequency exceeds a first preset threshold;

A42, calculating the solidification degree of each candidate new word screened out in step A41, and screening out from them the candidate new words whose solidification degree exceeds a second preset threshold; and

A43, calculating the degree of freedom of each candidate new word screened out in step A42, and screening out from them the candidate new words whose degree of freedom exceeds a third preset threshold as the real new words of the corpus text.

Preferably, the step of "calculating the degree of freedom of each candidate new word screened out in step A42" includes:

calculating the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word screened out in step A42; and

taking the smaller of the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word as the degree of freedom of that new word.
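The cascaded screening of steps A41 to A43 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Scores` record, the function name `screen` and the threshold values used in the usage note are assumptions, and the three scores are taken as already computed.

```python
from typing import Dict, List, NamedTuple


class Scores(NamedTuple):
    """Precomputed scores for one candidate new word (hypothetical record)."""
    freq: int               # word frequency of the candidate
    solidification: float   # solidification degree
    left_entropy: float     # left-neighbor information entropy
    right_entropy: float    # right-neighbor information entropy


def screen(cands: Dict[str, Scores], t1: int, t2: float, t3: float) -> List[str]:
    """Apply the three filters in order; each stage only sees survivors."""
    # A41: keep candidates whose word frequency exceeds the first threshold.
    by_freq = {w: s for w, s in cands.items() if s.freq > t1}
    # A42: of those, keep candidates whose solidification degree exceeds
    # the second threshold.
    by_solid = {w: s for w, s in by_freq.items() if s.solidification > t2}
    # A43: degree of freedom is the smaller of the two neighbor entropies;
    # keep candidates whose degree of freedom exceeds the third threshold.
    return [w for w, s in by_solid.items()
            if min(s.left_entropy, s.right_entropy) > t3]
```

For instance, `screen(cands, 5, 0.02, 1.92)` keeps only the candidates that pass all three stages. Filtering in this order means the more expensive scores only need to be computed for candidates that survived the earlier stages.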

In addition, to achieve the above object, the present invention also provides a method for discovering investment targets using new word discovery, the method including:

S1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;

S2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on the corpus text to obtain multiple word segments of the corpus text;

S3, merging adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, and forming the candidate new word set of the corpus text;

S4, screening out the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and

S5, calculating the mutual information value between each screened new word and the company names in the corpus, and extracting the company names and new words whose mutual information value meets a preset condition as reference investment targets.

Preferably, step S4 includes:

S41, calculating the word frequency of each candidate new word of the corpus text, and screening out the candidate new words whose word frequency exceeds a first preset threshold;

S42, calculating the solidification degree of each candidate new word screened out in step S41, and screening out from them the candidate new words whose solidification degree exceeds a second preset threshold; and

S43, calculating the degree of freedom of each candidate new word screened out in step S42, and screening out from them the candidate new words whose degree of freedom exceeds a third preset threshold as the real new words of the corpus text.

Preferably, the step of "calculating the degree of freedom of each candidate new word screened out in step S42" includes:

calculating the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word screened out in step S42; and

taking the smaller of the left-neighbor information entropy and the right-neighbor information entropy of each candidate new word as the degree of freedom of that new word.

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a program for discovering investment targets using new word discovery, and when executed by a processor the program implements any of the steps of the method for discovering investment targets using new word discovery described above.

The method, electronic device and computer-readable storage medium proposed by the present invention extract candidate new words from the corpus material by applying word segmentation, stop-word removal and similar processing to the corpus text; then screen out the real new words in the corpus text by calculating the word frequency, solidification degree and degree of freedom of the candidate new words; and finally determine the final investment targets by calculating the mutual information value between the new words and the company names in the corpus text. This improves the efficiency and accuracy of investment target extraction.

Brief description of the drawings

Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method of the present invention for discovering investment targets using new word discovery;

Fig. 2 is a schematic diagram of the modules of the program for discovering investment targets using new word discovery in Fig. 1;

Fig. 3 is a flow chart of a preferred embodiment of the method of the present invention for discovering investment targets using new word discovery;

Fig. 4 is a detailed flow chart of step S4 in the method of the present invention for discovering investment targets using new word discovery.

The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

It should be understood that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it.

The present invention provides a method for discovering investment targets using new word discovery, and the method is applied to an electronic device 1. Fig. 1 is a schematic diagram of the application environment of a preferred embodiment of the method.

In this embodiment, the electronic device 1 may be a personal computer (PC), or a terminal device such as a smartphone, a tablet computer, an e-book reader or a portable computer.

The electronic device 1 includes memory 11, processor 12, communication bus 13, and network interface 14.

The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed on the electronic device 1 and various kinds of data, such as the program 10 for discovering investment targets using new word discovery and the corpus 00, but also to temporarily store data that has been output or is to be output. Specifically, corpus material refers to material crawled from various websites, such as news material. The corpus 00 holds a large amount of corpus material; the present invention extracts new words from the material in corpus 00 and explores investment targets based on those new words.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, used to run program code stored in the memory 11 or to process data, for example to run the program 10 for discovering investment targets using new word discovery.

The communication bus 13 is used to implement connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the device and other electronic equipment.

Fig. 1 shows only the electronic device 1 with components 11-14. It should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.

Optionally, the electronic device may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and the optional user interface may also include a standard wired interface and a wireless interface.

Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device or the like. The display may also be called a display screen or display unit, and is used to show the information processed in the electronic device 1 and to present a visual user interface.

In the device embodiment shown in Fig. 1, the memory 11 stores the program for discovering investment targets using new word discovery. The processor 12 implements the following steps when executing the program for discovering investment targets using new word discovery stored in the memory 11:

Corpus material covers many different fields. This embodiment illustrates the concrete scheme of the present invention using a news corpus as an example, but the invention is not limited to the news field. When an investor needs to understand current hot news in order to obtain information such as the relevant plans, R&D directions or potential needs of target enterprises, a web crawler is used to crawl Internet news as the news corpus, for example news from Sina, Baidu, Tencent and the like. It should be understood that hot news changes continuously over time. Therefore, so that the investor understands current hot news more accurately, the crawled Internet news is filtered along the time dimension: a preset time period is set and only the Internet news of that period is crawled, for example only the news of the current day. The crawled Internet news is then deduplicated, and the titles of the news items are stored in the corpus 00. Because the sources of the news corpus are diverse, the material contains many different formats. To facilitate subsequent processing, the news corpus needs to be preprocessed to obtain news corpus text data and form a news corpus text set.
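The time filtering and deduplication described above might look like the following minimal sketch. The `NewsItem` record and the function `select_titles` are hypothetical names, the crawler itself is out of scope, and treating two items with the same title as duplicates is an assumption, since the text does not say how duplicates are detected.

```python
from datetime import date
from typing import Iterable, List, NamedTuple


class NewsItem(NamedTuple):
    """One crawled news item (hypothetical record)."""
    title: str
    published: date


def select_titles(items: Iterable[NewsItem], start: date, end: date) -> List[str]:
    """Keep titles published within [start, end], dropping repeated titles."""
    seen = set()
    titles = []
    for item in items:
        # Time-dimension filter: only news inside the preset period.
        if not (start <= item.published <= end):
            continue
        # Deduplication: exact title match (an assumption of this sketch).
        if item.title in seen:
            continue
        seen.add(item.title)
        titles.append(item.title)
    return titles
```

Setting `start` and `end` to the same day corresponds to the "news of the current day" example.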

In a specific implementation, the preprocessing may include one or more of the following: unifying the format of the news corpus into text format, removing advertising noise from the news corpus, and filtering out profanity, sensitive words and stop words. When the format of the news corpus is unified into text format, content that current techniques cannot yet convert to text format is filtered out.
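A minimal sketch of such preprocessing is shown below. It assumes the raw material is HTML-like text, that advertisements can be recognized by marker strings, and that an item containing an advertisement marker is dropped entirely; none of these details are specified in the text, and the names `to_plain_text`, `preprocess`, `ad_markers` and `blocked_words` are hypothetical.

```python
import html
import re
from typing import Iterable, List

_TAG = re.compile(r"<[^>]+>")      # crude markup matcher, enough for a sketch
_SPACES = re.compile(r"\s+")


def to_plain_text(raw: str) -> str:
    """Strip tags, decode entities and collapse whitespace."""
    text = html.unescape(_TAG.sub(" ", raw))
    return _SPACES.sub(" ", text).strip()


def preprocess(raw_items: Iterable[str],
               ad_markers: Iterable[str],
               blocked_words: Iterable[str]) -> List[str]:
    """Return cleaned texts: no markup, no ad items, no blocked words."""
    ad_markers = list(ad_markers)
    blocked_words = list(blocked_words)
    texts = []
    for raw in raw_items:
        text = to_plain_text(raw)
        # Advertising noise: drop the whole item if a marker is present.
        if any(marker in text for marker in ad_markers):
            continue
        # Profanity / sensitive words: delete each blocked word.
        for word in blocked_words:
            text = text.replace(word, "")
        text = _SPACES.sub(" ", text).strip()
        if text:
            texts.append(text)
    return texts
```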

After the news corpus text is obtained, each line of the news corpus text is segmented one by one using a word segmentation tool, for example the Stanford Chinese word segmenter or jieba. For example, segmenting the sentence "went to see a film last night" yields the result "yesterday | evening | go | see | (particle) | film". The segmentation result is retained after segmentation. It should be understood that, to further improve the usefulness of the segmentation result, stop-word removal is applied to it: function words that cannot reflect the topic of the news corpus, such as modal particles, adverbs, prepositions, conjunctions and adjectives, are removed. Such function words usually have no clear meaning of their own and only play a role once placed in a complete sentence, for example the common "this", "that", "on", "under" and "where". In other embodiments, after the news corpus text is segmented, only the verb and/or noun segments of the segmentation result are finally retained; in the example above, only the word "film" may be retained. It should be understood that the result for a line may be empty after this processing, in which case the corresponding line of text is filtered out. In other embodiments, the method for segmenting the news corpus text may also include one or more of: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics and a segmentation method based on a dictionary.
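Stop-word removal on already segmented lines can be sketched as follows. The segmentation itself (for example with jieba) is assumed to have been done beforehand, English glosses stand in for the Chinese tokens, and the function name `remove_stop_words` and the stop-word set are illustrative assumptions.

```python
from typing import Iterable, List, Set


def remove_stop_words(segmented_lines: Iterable[List[str]],
                      stop_words: Set[str]) -> List[List[str]]:
    """Drop stop words and blank tokens; discard lines that become empty."""
    cleaned = []
    for tokens in segmented_lines:
        kept = [t for t in tokens if t.strip() and t not in stop_words]
        # A line whose result is empty is filtered out, as the text describes.
        if kept:
            cleaned.append(kept)
    return cleaned
```

For example, with the stop words "this", "that", "on", "under" and "where", a line consisting only of "this" and "that" disappears, while the other lines keep their content words.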

In other embodiments, to make it easier to determine the scope of the subsequent word segmentation, each news corpus text may first be roughly divided by line splitting before it is segmented. Line splitting may break the material into lines at punctuation marks, for example at full stops, commas, exclamation marks and question marks.
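The line splitting described above can be sketched with a regular expression. The set of punctuation marks (Chinese and ASCII full stop, comma, exclamation mark and question mark) and the name `split_by_punctuation` are assumptions of this sketch.

```python
import re
from typing import List

# Full stop, comma, exclamation mark and question mark, in both Chinese
# and ASCII forms (an illustrative choice of punctuation set).
_PUNCT = re.compile(r"[。，！？.,!?]+")


def split_by_punctuation(text: str) -> List[str]:
    """Split text into non-empty fragments at punctuation marks."""
    return [part.strip() for part in _PUNCT.split(text) if part.strip()]
```

Note that splitting at the ASCII full stop also breaks decimal numbers such as "3.5" apart, so a production version would need a more careful pattern.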

However, in the segmented news corpus text, a term that should be treated as a single word in a particular field may have been split into several terms, so new word discovery is needed. The adjacent word segments in the segmentation result are therefore merged to form the candidate new words of the news corpus text.

Next, the real new words of the news corpus text need to be determined from its candidate new words. In other embodiments, step A4 specifically includes:
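Merging adjacent word segments into candidates can be sketched as n-gram generation over the segment list. The text does not say how many adjacent segments may be combined, so the `max_len` parameter and the function name `candidate_new_words` are assumptions; candidates are returned as tuples of segments so the sketch works for any language.

```python
from typing import List, Tuple


def candidate_new_words(tokens: List[str], max_len: int = 2) -> List[Tuple[str, ...]]:
    """Combine each run of 2..max_len adjacent segments into one candidate."""
    candidates = []
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(tuple(tokens[i:i + n]))
    return candidates
```

With the segments "pollutant", "emission", "standard" and the default `max_len`, the candidates are ("pollutant", "emission") and ("emission", "standard"). For Chinese text the segments of a candidate would simply be concatenated.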

It should be understood that, to extract new words from a news corpus text, it must first be clear what kind of word counts as a new word. The first consideration is whether the word occurs often enough in a corpus text or in the corpus 00, that is, its word frequency. In this embodiment, word frequency is reflected by the inverse document frequency (IDF). IDF characterizes how frequently a word occurs across documents; the more frequently it occurs, the higher the probability that the word appears in different contexts, which characterizes the recognition of the word across different articles. Generally, the higher the IDF, the higher the recognition and the more likely the word is a new word. However, if the IDF is very high, this instead indicates that the word is very common, and it need not be added to the new word set, in particular to avoid polluting the new words. In this screening step, all candidate new words whose word frequency exceeds the first preset threshold (for example, occurring more than 5 times in one news corpus text) are retained.
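The frequency screening can be sketched with a simple counter. This sketch uses raw occurrence counts, following the "more than 5 times" example, rather than an IDF-based score; the function name `filter_by_frequency` and the default threshold are illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def filter_by_frequency(candidates: Iterable[Hashable],
                        min_count: int = 5) -> Dict[Hashable, int]:
    """Keep candidates that occur strictly more than min_count times."""
    counts = Counter(candidates)
    return {cand: c for cand, c in counts.items() if c > min_count}
```

A candidate that occurs exactly 5 times is dropped, since the condition is "more than" the threshold.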

However, a candidate new word screened out by the above process may not be a single word but a phrase made up of several words. Therefore, besides meeting the word frequency requirement, the solidification degree of a candidate new word must also be considered, that is, the probability that each word in the candidate occurs together with the other words in it. For example, in one corpus text "film" occurs 389 times while "cinema" occurs only 175 times, yet we are more inclined to treat "cinema" as a word because, intuitively, its two parts "film" and "hall" are bound together more tightly. The words with the highest solidification degree are words such as "bat", "spider", "hesitant" and "perturbed" (two-character words in the original Chinese): each character in them almost always occurs together with the other, even in other contexts. In this screening step, all candidate new words whose solidification degree exceeds the second preset threshold (for example 0.02) are retained.

Specifically, taking a two-word candidate as an example, let the probabilities that word A and word B occur individually be P(A) and P(B). If the two words are independent, the probability that they occur together is P(A) * P(B). If the two words are not independent, the probability that they occur together is greater than P(A) * P(B), that is, P(C) >> P(A) * P(B). In other words, for the solidification degree of a candidate new word to exceed the second preset threshold, the following condition must be met:

$$P(C) - P(A)\,P(B) > m$$

where A and B denote the words in the candidate new word, P(C) is the probability that words A and B occur together, and m is the second preset threshold.
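The condition above can be sketched for a two-word candidate as follows. How the probabilities are estimated is not specified in the text; this sketch assumes relative frequencies over one token sequence (single words over all tokens, pairs over all adjacent pairs), and the function name `solidification` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def solidification(tokens: List[str], pair: Tuple[str, str]) -> float:
    """Return P(C) - P(A) * P(B) for the adjacent pair (A, B)."""
    if not tokens:
        return 0.0
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)   # avoid dividing by zero for one token
    a, b = pair
    p_a = unigrams[a] / n_uni        # P(A): A occurs on its own
    p_b = unigrams[b] / n_uni        # P(B): B occurs on its own
    p_c = bigrams[pair] / n_bi       # P(C): A immediately followed by B
    return p_c - p_a * p_b
```

A candidate passes the screening when the returned value exceeds the second preset threshold m, for example `solidification(tokens, pair) > 0.02`.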

In addition to meeting the above word frequency and solidification degree requirements, the degree of freedom of a word must also be considered. The degree of freedom refers to how freely a word can be used. Looking only at how tightly a word is bound internally is not enough; its external behavior as a whole must also be examined. Take the two words "quilt" and "lifetime" as an example. We can say "buy a quilt", "cover with a quilt", "get into the quilt", "a good quilt", "this quilt" and so on, putting all kinds of words in front of "quilt". The usage of "lifetime", however, is very fixed: apart from "a lifetime", "this lifetime", "previous lifetime" and "next lifetime", almost no other word can be placed in front of "lifetime". The words that can appear to the left of "lifetime" are so limited that, intuitively, "lifetime" does not form a word on its own; what really forms a word is a whole such as "a lifetime" or "this lifetime". It can be seen that how freely a candidate can be used is also an important criterion for judging whether it forms a new word. If a candidate can be regarded as a new word, it should be able to appear flexibly in many different contexts and have very rich sets of left neighbors and right neighbors.

The randomness of the left-neighbor set and the right-neighbor set of a word can be measured by calculating information entropy. For example, in the sentence "eat grapes without spitting out the grape skins; don't eat grapes yet spit out the grape skins", the word "grape" occurs four times. Its left neighbors are {eat, spit, eat, spit} and its right neighbors are {not, skin, instead, skin}. According to the information entropy formula, the information entropy of the left neighbors of "grape" is about 0.693 and that of its right neighbors is about 1.04 (these values correspond to the natural logarithm). It can be seen that, in this sentence, the right neighbors of "grape" are richer.

In this embodiment, the degree of freedom of a word is the smaller of its left-neighbor information entropy and its right-neighbor information entropy. In this screening step, all candidate new words whose degree of freedom exceeds the third preset threshold (for example 1.92) are retained as the real new words of the news corpus text. Because the degree of freedom of "grape" is below the third preset threshold, it is not screened out as a new word.

Specifically, the information entropy is calculated as:

$$H = -\sum_{i=1}^{n} P_i \log_2 P_i$$

where the logarithm is generally taken to base 2, giving a result in bits; n is the number of distinct left neighbors or right neighbors; and P_i is the probability with which the i-th left neighbor or right neighbor occurs.
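The entropy and the degree of freedom can be checked with a short sketch. The function names `entropy` and `degree_of_freedom` are hypothetical. The default base is e because the figures quoted for the "grape" example (about 0.693 and 1.04) match the natural logarithm; passing `base=2` gives the result in bits.

```python
import math
from collections import Counter
from typing import Iterable


def entropy(neighbors: Iterable[str], base: float = math.e) -> float:
    """Shannon entropy of the empirical distribution of neighbor words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())


def degree_of_freedom(left: Iterable[str], right: Iterable[str],
                      base: float = math.e) -> float:
    """Smaller of the left-neighbor and right-neighbor entropies."""
    return min(entropy(left, base), entropy(right, base))
```

For the left neighbors eat, spit, eat, spit and the right neighbors not, skin, instead, skin, the two entropies come out at roughly 0.693 and 1.04 with the natural logarithm, or 1.0 and 1.5 bits with base 2. Either way the degree of freedom is below the example threshold of 1.92.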

Further, company names are extracted from the news corpus text using word segmentation and a predetermined company name library. Extracting company names from a news corpus is an existing, mature technique and is not described further here. Suppose the new words finally extracted from the news corpus text include "pollutant emission", and the company names contained in the news corpus include Yunnan Salt & Chemical, Sany Heavy Industry and PowerChina. The mutual information values between "pollutant emission" and each of "Yunnan Salt & Chemical", "Sany Heavy Industry" and "PowerChina" are then calculated, and the company names whose mutual information value exceeds the fourth preset threshold (for example 0.8) are retained as reference investment targets.
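The text does not give a formula for the mutual information value. As one possible reading, the sketch below uses pointwise mutual information estimated from document co-occurrence, with the natural logarithm and substring matching; all of these choices, the function names `pmi` and `reference_targets`, and the company names used with them are assumptions.

```python
import math
from typing import List


def pmi(docs: List[str], word: str, company: str) -> float:
    """Pointwise mutual information of word and company over documents."""
    n = len(docs)
    n_w = sum(word in d for d in docs)                    # docs with the word
    n_c = sum(company in d for d in docs)                 # docs with the company
    n_wc = sum(word in d and company in d for d in docs)  # docs with both
    if n == 0 or n_wc == 0:
        return float("-inf")   # never seen together
    return math.log((n_wc / n) / ((n_w / n) * (n_c / n)))


def reference_targets(docs: List[str], word: str,
                      companies: List[str], threshold: float = 0.8) -> List[str]:
    """Companies whose PMI with the new word exceeds the threshold."""
    return [c for c in companies if pmi(docs, word, c) > threshold]
```

With six documents in which "pollutant emission" and a hypothetical "CompanyA" each appear twice and always together, the PMI is ln 3, about 1.10, so "CompanyA" is kept at the example threshold of 0.8, while a company that never co-occurs with the new word is dropped.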

It should be understood that the preset thresholds and other parameters that need to be set in advance in the above embodiments can be set by the user according to actual conditions.

The electronic device 1 proposed in the above embodiment extracts candidate new words from the corpus material by applying word segmentation, stop-word removal and similar processing to the corpus text; then screens out the real new words in the corpus text by calculating the word frequency, solidification degree and degree of freedom of the candidate new words; and finally determines the final investment targets by calculating the mutual information value between the new words and the company names in the corpus text. This improves the efficiency and accuracy of investment target extraction.

Optionally, in other embodiments, the program 10 for discovering investment targets using new word discovery may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. A module as referred to in the present invention is a series of computer program instruction segments capable of completing a specific function. For example, Fig. 2 is a schematic diagram of the modules of the program 10 for discovering investment targets using new word discovery in Fig. 1. In this embodiment, the program 10 may be divided into a first processing module 110, a second processing module 120, a convergence module 130, a computing module 140 and an extraction module 150. The functions or operating steps implemented by modules 110-150 are similar to those described above and are not detailed again here. By way of example:

The first processing module 110 is used to preprocess the corpus material in a corpus to obtain corpus text data and form a corpus text set;

The second processing module 120 is used to read a preprocessed corpus text, and to perform word segmentation and stop-word removal on the corpus text to obtain multiple word segments of the corpus text;

The convergence module 130 is used to merge adjacent word segments of the corpus text, combine adjacent word segments into candidate new words, and form the candidate new word set of the corpus text;

The computing module 140 is used to screen out the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and

The extraction module 150 is used to calculate the mutual information value between each screened new word and the company names in the corpus, and to extract the company names and new words whose mutual information value meets a preset condition as reference investment targets.

In addition, the present invention also provides a method for discovering investment targets using new word discovery. Fig. 3 is a flow chart of a preferred embodiment of the method. The method may be performed by a device, and the device may be implemented in software and/or hardware.

In this embodiment, the method for discovering investment targets using new word discovery includes:

Corpus material covers many different fields. This embodiment illustrates the concrete scheme of the present invention using a news corpus as an example, but the invention is not limited to the news field. When an investor needs to understand current hot news in order to obtain information such as the relevant plans, R&D directions or potential needs of target enterprises, a web crawler is used to crawl Internet news as the news corpus, for example news from Sina, Baidu, Tencent and the like. It should be understood that hot news changes continuously over time. Therefore, so that the investor understands current hot news more accurately, the crawled Internet news is filtered along the time dimension: a preset time period is set and only the Internet news of that period is crawled, for example only the news of the current day. The crawled Internet news is then deduplicated, and the titles of the news items are stored in the corpus. Because the sources of the news corpus are diverse, the material contains many different formats. To facilitate subsequent processing, the news corpus needs to be preprocessed to obtain news corpus text data and form a news corpus text set.

Next, the real new words of the news corpus text need to be determined from its candidate new words. Fig. 4 is a detailed flow chart of step S4 in the method of the present invention for discovering investment targets using new word discovery. In other embodiments, step S4 specifically includes:

It should be understood that, to extract new words from a news corpus text, it must first be clear what kind of word counts as a new word. The first consideration is whether the word occurs often enough in a corpus text or in the corpus, that is, its word frequency. In this embodiment, word frequency is reflected by the inverse document frequency (IDF). IDF characterizes how frequently a word occurs across documents; the more frequently it occurs, the higher the probability that the word appears in different contexts, which characterizes the recognition of the word across different articles. Generally, the higher the IDF, the higher the recognition and the more likely the word is a new word. However, if the IDF is very high, this instead indicates that the word is very common, and it need not be added to the new word set, in particular to avoid polluting the new words. In this screening step, all candidate new words whose word frequency exceeds the first preset threshold (for example, occurring more than 5 times in one news corpus text) are retained.

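$$P(C) - P(A)\,P(B) > m$$

Specifically, the information entropy is calculated as:

$$H = -\sum_{i=1}^{n} P_i \log_2 P_i$$

where n is the number of distinct left neighbors or right neighbors and P_i is the probability with which the i-th left neighbor or right neighbor occurs.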

The method using new word discovery investment target that above-described embodiment proposes, by being segmented, being gone to language material text The processing such as stop words, extracts neologisms undetermined from language material, then by calculating word frequency, solidification degree and the freedom of neologisms undetermined Degree, filters out real neologisms in the language material text, finally calculates the association relationship of neologisms and Business Name in the language material text Determine final investment target, improve the efficiency and accuracy of investment target extraction.

In addition, an embodiment of the present invention further proposes a computer-readable storage medium on which a program for discovering investment targets using new words is stored; when executed by a processor, the program implements the following operations:

Preferably, step A4 includes:

The embodiments of the computer-readable storage medium of the present invention are essentially the same as the embodiments of the above method and electronic device for discovering investment targets using new words, and are not repeated here.

It should be noted that the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments. Moreover, the terms "comprise", "include" and any other variants thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, device, article or method that includes that element.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether used directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for discovering investment targets using new words, applied to an electronic device, characterized in that the method comprises:

S1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;

S2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on the corpus text to obtain a plurality of word segments of the corpus text;

S3, aggregating adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, and forming a candidate new word set of the corpus text;

S4, selecting the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and

S5, calculating the association relationship between the selected new words and the company names in the corpus, and extracting the company names and new words whose association relationship satisfies a preset condition as reference investment targets.
2. The method for discovering investment targets using new words according to claim 1, characterized in that the preprocessing in step S1 comprises: unifying the format of the corpus material in the corpus into a text format, and removing advertisement noise from the corpus material.
3. The method for discovering investment targets using new words according to claim 1, characterized in that the method of segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics, and a segmentation method based on a dictionary.
4. The method for discovering investment targets using new words according to claim 1, 2 or 3, characterized in that step S4 comprises:

S41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;

S42, calculating the solidification degree of each candidate new word selected in step S41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and

S43, calculating the degree of freedom of each candidate new word selected in step S42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the real new words of the corpus text.
5. The method for discovering investment targets using new words according to claim 4, characterized in that the step of "calculating the degree of freedom of each candidate new word selected in step S42" comprises:

calculating, respectively, the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word selected in step S42; and

taking the smaller of the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word as the degree of freedom of that candidate new word.
6. An electronic device, characterized in that the device comprises a memory and a processor, the memory storing a program for discovering investment targets using new words that can run on the processor, the program implementing the following steps when executed by the processor:

A1, preprocessing the corpus material in a corpus to obtain corpus text data and form a corpus text set;

A2, reading a preprocessed corpus text, and performing word segmentation and stop-word removal on the corpus text to obtain a plurality of word segments of the corpus text;

A3, aggregating adjacent word segments of the corpus text, combining adjacent word segments into candidate new words, and forming a candidate new word set of the corpus text;

A4, selecting the real new words of the corpus text according to the results of comparing the word frequency, solidification degree and degree of freedom of each candidate new word in the corpus text with preset thresholds; and

A5, calculating the association relationship between the selected new words and the company names in the corpus, and extracting the company names and new words whose association relationship satisfies a preset condition as reference investment targets.
7. The electronic device according to claim 6, characterized in that the preprocessing in step A1 comprises: unifying the format of the corpus material in the corpus into a text format, and removing advertisement noise from the news corpus material;

the method of segmenting the corpus text comprises: a segmentation method based on string matching, a segmentation method based on understanding, a segmentation method based on statistics, and a segmentation method based on a dictionary.
8. The electronic device according to claim 6 or 7, characterized in that step A4 comprises:

A41, calculating the word frequency of each candidate new word of the corpus text, and selecting the candidate new words whose word frequency exceeds a first preset threshold;

A42, calculating the solidification degree of each candidate new word selected in step A41, and selecting from them the candidate new words whose solidification degree exceeds a second preset threshold; and

A43, calculating the degree of freedom of each candidate new word selected in step A42, and selecting from them the candidate new words whose degree of freedom exceeds a third preset threshold as the real new words of the corpus text.
9. The electronic device according to claim 8, characterized in that the step of "calculating the degree of freedom of each candidate new word selected in step A42" comprises:

calculating, respectively, the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word selected in step A42; and

taking the smaller of the left-adjacent-word information entropy and the right-adjacent-word information entropy of each candidate new word as the degree of freedom of that new word.
10. A computer-readable storage medium, characterized in that a program for discovering investment targets using new words is stored on the computer-readable storage medium, and when executed by a processor the program implements the steps of the method for discovering investment targets using new words according to any one of claims 1 to 5.