CN105786961A - Data sorting treatment method based on financial information - Google Patents
- Publication number
- CN105786961A, CN201610029411.2A
- Authority
- CN
- China
- Prior art keywords
- participle
- financial information
- code
- text
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Abstract
The invention discloses a data classification processing method based on financial information. The method comprises the following steps: crawling the body text of the financial information and parsing it to obtain segmented words; obtaining the frequency of occurrence of each segmented word from the parsing result and selecting the high-frequency segmented words according to a set frequency; and retrieving with the high-frequency segmented words as keywords and matching the retrieval results against classification codes, so that the keywords are classified by their corresponding classification codes. First, parsing the body text into segmented words enables word-frequency management of the classification in each dimension. Second, because the high-frequency segmented words are retrieved as keywords and the retrieval results are matched against the classification codes, an association is established between the segmented words and the supporting data, so that the information can be applied accurately. The method enables automatic association, automatic warehousing, and related processing, and improves the efficiency of processing financial information.
Description
Technical field
The present invention relates to the classification and processing of big data, and in particular to a data classification processing method based on financial information.
Background technology
As financial markets continue to develop, the scope and volume of financial information are growing rapidly, while users demand ever more precise retrieval of that information. How to index massive amounts of financial information quickly and with reasonable accuracy has therefore become an urgent problem. Two ways of handling information data are currently common in China: in the first, a standalone program crawls the information and maps it to a specified category; in the second, a person judges and assigns the category manually. The first is efficient, but its classification is coarse and does not support multi-dimensional classification. The second supports precise, multi-dimensional cross-classification, but is inefficient.
As the volume of financial information to be processed multiplies and the information must be classified precisely before it is used, traditional information processing faces a dilemma: automatic processing by a program meets the efficiency requirement but lowers classification accuracy, whereas manual processing meets the accuracy requirement but needs a large staff. Classifying the acquired financial information quickly and accurately is therefore the key problem in the automatic classification of financial information data. Faced with the large volume of acquired financial information data, traditional manual classification can no longer meet the demand.
Summary of the invention
The technical problem to be solved by the present invention is to classify financial information data automatically, so as to better satisfy accuracy requirements while improving the efficiency of information processing.
To solve the above technical problem, the invention provides a data classification processing method based on financial information, comprising:
crawling the body text of the financial information, and parsing the body text to obtain segmented words;
obtaining the frequency of occurrence of each segmented word from the parsing result, and selecting the high-frequency segmented words according to a set frequency;
retrieving with the high-frequency segmented words as keywords, and matching the retrieval results against classification codes, so that the keywords are classified by their corresponding classification codes.
Further, the crawled body text is parsed into segmented words as follows:
preprocessing: the tags are stripped from the character string contained in the body text and only the text is kept; if the body text contains HTML tags or English characters, those HTML tags or English characters are not parsed;
then the body text is split into sentences;
next, one sentence at a time is split off from the text remaining after the previous cut, and that sentence is split again;
finally, the above steps are repeated until every sentence has been split into individual sub-units.
Further, a word-frequency management library of high-frequency segmented words is built from the segmentation results: if the frequency of a segmented word is greater than a set value, the word is determined to be a high-frequency segmented word; if its frequency is less than the set value, the word is discarded.
Further, the segmented words are classified into the following dimensions: "keyword", "region", "fund/bond", "related person", "related institution", "related industry", "related concept", and "related company".
Further, matching against the classification codes proceeds as follows: in the order of the high-frequency segmented words, an exact name match is performed in turn in the category 1 support table, category 2 support table, category 3 support table, ..., category N support table; once the name of a segmented word exactly matches a record, the corresponding classification code is returned.
Further, the body text of the financial information includes text and HTML tags.
Further, the keywords are divided into the following dimensions: non-cut class, classification extension class, special topic/meeting class, and descriptive and bulletin event class.
Further, the classification codes are divided according to the following rules: metadata ID codes, other ID codes, securities codes, institution and company codes, natural person codes, basic entity codes, information entity codes, and indicator codes.
Further, after the keywords are classified by their corresponding classification codes, an automatic association between the financial information and the classification codes is established; multiple association tables can be used to associate a news item with its classification information.
Further, a web crawler is used to crawl the financial information into a database and build a segmented-word database of financial information body text; the web crawler includes Larbin, Nutch, Heritrix, WebSPHINX, Mercator, and PolyBot.
Beneficial effects of the present invention:
1) Parsing the body text into segmented words enables word-frequency management of the classification in each dimension.
2) Because the high-frequency segmented words are retrieved as keywords and the retrieval results are matched against the classification codes, an association is established between the segmented words and the supporting data, so that the information can be applied more precisely.
3) The data classification processing method based on financial information of the present invention enables automatic association, automatic warehousing, and related processing, and improves the efficiency of processing financial information.
Brief description of the drawings
Fig. 1 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
Fig. 2 is a schematic diagram of the parsing method in Fig. 1.
Fig. 3 is a schematic diagram of the preprocessing steps in Fig. 1.
Fig. 4 is a flow chart of the high-frequency segmented word processing in Fig. 1.
Fig. 5 is a schematic diagram of the dimension classification of segmented words in Fig. 1.
Fig. 6 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
Fig. 7 is a schematic diagram of the structure of the body text of the financial information in Fig. 1.
Fig. 8 is a schematic diagram of the dimensions into which keywords, one of the segmented-word dimensions in Fig. 5, are divided.
Fig. 9 is a schematic diagram of one embodiment of the classification codes in Fig. 1.
Fig. 10 is a schematic diagram of one way of acquiring the financial information in Fig. 1.
Fig. 11 is a schematic diagram of a specific embodiment of the invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Refer to Fig. 1, a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
This embodiment comprises:
Step S101: crawl the body text of the financial information and parse the body text to obtain segmented words. The financial information includes, but is not limited to, global news, real-time futures quotes, and real-time stock quotes. The body text is obtained from information on the Internet, through terminals including but not limited to PCs, smartphones, and tablets (PADs). To parse the body text into segmented words, natural segmentation may be used, or segmentation based on a dictionary of financial information terms such as "listing", "limit-up", "silver", and "buying and selling"; following a priority-by-importance principle, forward-order segmentation may be applied first, followed by reverse-order and bidirectional segmentation in combination.
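The dictionary-based segmentation described in step S101 can be sketched as follows. This is a minimal illustration, assuming that "forward", "reverse", and "bidirectional" refer to the common maximum-matching family of algorithms; the patent does not name a specific algorithm. The dictionary `FIN_DICT` is hypothetical, and its Chinese entries are back-translations of the example terms ("listing", "limit-up", "silver", "buying and selling", "futures", "Shanghai Futures Exchange").

```python
# Hypothetical mini-dictionary; a real system would load a full financial lexicon.
FIN_DICT = {"上海期货交易所", "期货", "上市", "涨停", "白银", "买卖"}
MAX_LEN = max(len(word) for word in FIN_DICT)


def forward_max_match(text, dictionary=FIN_DICT, max_len=MAX_LEN):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary=FIN_DICT, max_len=MAX_LEN):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def bidirectional_match(text):
    """Assumed tie-break: prefer the result with fewer tokens, forward on a tie."""
    fwd, bwd = forward_max_match(text), backward_max_match(text)
    return fwd if len(fwd) <= len(bwd) else bwd
```

Preferring the segmentation with fewer tokens is a widely used heuristic, chosen here only for illustration; the priority-by-importance principle mentioned in the text could be implemented in other ways.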
Step S102: obtain the frequency of occurrence of each segmented word from the parsing result, and select the high-frequency segmented words according to a set frequency. For example, if "futures" occurs many times in a piece of financial news after segmentation, its frequency of occurrence is considered high; likewise, if "Shanghai Futures Exchange" occurs many times after segmentation, its frequency of occurrence is considered high. Those skilled in the art will understand that frequently occurring words may also be designated as high-frequency segmented words on the basis of experience. For example, given the sentence "In 2015 the A-share market was very bullish: the Shenzhen Component Index rose 14.98% over the year, leading the world's major indices, and the Shanghai Composite Index rose 9.41% over the year.", the frequently occurring words designated by experience may include "A-share", "Shenzhen Component Index", and "Shanghai Composite Index".
Step S103: retrieve with the high-frequency segmented words as keywords, and match the retrieval results against the classification codes, so that the keywords are classified by their corresponding classification codes. Those skilled in the art will understand that retrieval by keyword includes, but is not limited to, retrieval in a configured database or retrieval according to corresponding rules.
Refer to Fig. 2, a schematic diagram of the parsing method in Fig. 1.
In this embodiment, the crawled body text is parsed into segmented words as follows:
Step S201: preprocessing.
Step S202: split the body text into sentences; the text can be cut into independent sentences at a closing full stop, comma, or semicolon.
Step S203: split off one sentence at a time from the text remaining after the previous cut, and split that sentence again, so that each complete sentence is obtained one by one.
Step S204: repeat the above steps until every sentence has been split into individual sub-units. A sub-unit is a Chinese word or character that carries meaning on its own, or an English word, English name, or English abbreviation.
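Steps S202 and S203 can be sketched as a loop that repeatedly peels one sentence off the front of the remaining text. This is a minimal sketch, assuming the delimiters are the Chinese full stop, comma, and semicolon; the function name `peel_sentences` is illustrative and not from the patent.

```python
import re

# Assumed sentence-ending punctuation: Chinese full stop, comma, semicolon.
DELIMITER = re.compile("[。，；]")


def peel_sentences(text):
    """Yield sentences one at a time, each cut from the text that remains."""
    remaining = text
    while remaining:
        match = DELIMITER.search(remaining)
        end = match.end() if match else len(remaining)
        sentence, remaining = remaining[:end], remaining[end:]
        if sentence.strip():
            yield sentence.strip()
```

Each yielded sentence would then be passed to the word segmenter, which corresponds to the "split that sentence again" part of step S203.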
Fig. 3 is a schematic diagram of the preprocessing steps in Fig. 1.
Step S301: strip the tags from the character string contained in the body text and keep only the text. Taking financial information extracted from the Internet as an example, such information is generally written in HTML (HyperText Markup Language), the standard language for producing World Wide Web pages. An HTML file is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. A web browser such as Netscape Navigator or Microsoft Internet Explorer interprets the HTML file and displays the web page. The text is obtained together with its HTML tags; after the tags are removed, only the useful financial information text is kept.
For example, the tag <FONT> has the attributes:
SIZE sets the size of the text
COLOR sets the color of the text
FACE sets the typeface of the text
As another example, the tag <BASEFONT> has the attributes:
SIZE changes the default text size
COLOR changes the default text color
FACE changes the default typeface
As another example, the tag <BODY>:
Tag format: <BODY TEXT="attribute value">...</BODY>
As another example, the tag <Hn> has the attribute:
ALIGN sets the alignment of the heading.
Tag format: <Hn>...</Hn>
<Hn ALIGN="attribute value">...</Hn>
Articles of financial information have structure such as titles, subtitles, chapters, and sections, and HTML provides the corresponding heading tag <Hn>, where n is the heading level. No separate line-break tag is needed after a heading; the text automatically moves to the next line. HTML provides six heading levels in total; the smaller n is, the larger the heading font.
By recognizing the HTML tags in the examples above, the plain text can be extracted.
Step S302: determine whether the body text contains HTML tags or English characters; if so, proceed to step S304, and if not, proceed to step S303. When judging the HTML tags or English characters in the body text, the English characters do not include special names or institutional abbreviations, for example special names such as Paul A. Samuelson and N. Gregory Mankiw, or the Shanghai Futures Exchange (SHFE), the Chicago Board of Trade (CBOT), the New York Commodity Exchange (COMEX), and the London Metal Exchange (LME).
Step S303: parse. If the body text does not contain the above HTML tags or English characters (non-special), it is parsed directly.
Step S304: do not parse the HTML tags or English characters; that is, the above HTML tags or English characters (special) are not parsed.
Fig. 4 is a flow chart of the high-frequency segmented word processing in Fig. 1.
In this embodiment, the word-frequency management library of high-frequency segmented words is built from the segmentation results:
Step S401: determine whether the frequency of a segmented word is greater than the set value. Given the characteristics of the financial information industry, the set value can be defined per word as a number of occurrences; for example, the set value for "A-share" is 15 occurrences and the set value for "Shanghai Futures Exchange" is 5 occurrences. Further, as a preferred option in this embodiment, the segmentation results of the title and of the body text are merged when computing word frequency.
If not, proceed to step S402 and discard the segmented word: a word that was segmented but whose frequency of occurrence does not exceed the set value is discarded, that is, excluded from subsequent processing.
If so, proceed to step S403: the word is determined to be a high-frequency segmented word and enters the word-frequency management library, which can be mapped at call time by configuring a linked server.
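The threshold test in steps S401 to S403 can be sketched as below. The per-word set values for "A-share" and "Shanghai Futures Exchange" come from the text; `DEFAULT_THRESHOLD` is a hypothetical fallback for words without their own set value, and treating a count equal to the set value as "not high-frequency" is an assumption, since the text only covers the greater-than and less-than cases.

```python
from collections import Counter

# Per-word set values from the embodiment; the default is a hypothetical fallback.
THRESHOLDS = {"A-share": 15, "Shanghai Futures Exchange": 5}
DEFAULT_THRESHOLD = 10


def high_frequency_words(title_tokens, body_tokens,
                         thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Merge title and body counts, then keep words whose count exceeds the set value."""
    counts = Counter(title_tokens) + Counter(body_tokens)
    return {
        word: count
        for word, count in counts.items()
        if count > thresholds.get(word, default)  # S403: keep; otherwise S402: discard
    }
```

The returned mapping plays the role of the word-frequency management library for one news item; persisting it to a database is outside the scope of this sketch.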
Fig. 5 is a schematic diagram of the dimension classification of segmented words in Fig. 1.
In this embodiment, the segmented words can be classified into the following dimensions:
A "keyword" (KEY_WORD),
B "region" (PUB_AREA_CODE),
C "fund/bond" (PUB_SEC_CODE),
D "related person" (PUB_INDIV_INFO),
E "related institution" (PUB_ORG_INFO),
F "related industry" (PUB_INDU_CODE),
G "related concept" (COM_CONC_INFO),
H "related company" (STK_BASIC_INFO).
Fig. 6 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
The data classification processing method based on financial information in this embodiment comprises the following steps:
Step S601: start.
Step S602: acquire the body text of the financial information.
Step S603: parse to obtain segmented words.
Step S604: retrieve with the high-frequency segmented words as keywords. Retrieving only with high-frequency segmented words greatly reduces the amount of data to process and at the same time makes the classification more accurate.
Step S605: perform an exact name match in the category 1 support table, category 2 support table, category 3 support table, ..., category N support table; once the name of a segmented word exactly matches a record, return the corresponding classification code. The category 1 to category N support tables are preset. Further, association classifications may also come from multiple classification tables, which are used to associate a news item with its classification information. For example, if the category 1 support table corresponds to "fund/bond", the high-frequency segmented words are matched against the fund/bond keywords in the corresponding database.
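The ordered lookup in step S605 can be sketched as follows. The support tables are modeled as in-memory dictionaries for illustration; the sample names and codes are taken from the worked example later in the description, with names rendered in English. Stopping at the first table that contains an exact match is an assumption drawn from "once the name exactly matches a record, return the corresponding classification code".

```python
# Ordered support tables (category 1 .. N); rows are sample data from the worked example.
SUPPORT_TABLES = [
    ("KEY_WORD", {"e-commerce": "6060003003", "foreign trade": "6060003004"}),
    ("PUB_AREA_CODE", {"Zhejiang": "5002001016", "Yiwu": "5002001081"}),
]


def match_codes(words, tables=SUPPORT_TABLES):
    """For each word, in order, return the first exact name match across the tables."""
    matches = []
    for word in words:
        for table_name, rows in tables:
            code = rows.get(word)  # exact match on the name only
            if code is not None:
                matches.append((word, table_name, code))
                break  # assumed: stop at the first matching table
    return matches
```

Words with no exact match in any table simply produce no entry, so they do not contribute a classification code.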
Fig. 7 is a schematic diagram of the structure of the body text of the financial information in Fig. 1.
In this embodiment, the crawled body text of the financial information has the structure text + HTML tag + text, or HTML tag + text + HTML tag. For example, the source code:
<BODY TEXT=FF0000>
Finance?<BR>A-share ticket!<BR>wealth.
</BODY>
The financial information obtained by the crawl:
Finance?
A-share ticket!
Wealth
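The tag stripping illustrated by this example can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea, not the patent's implementation; the class name `TextExtractor` and the choice to start a new line at each `<BR>` are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, starting a new line at every <BR> tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes such as TEXT=FF0000 are ignored.
        if tag == "br":
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def _flush(self):
        line = "".join(self._buffer).strip()
        if line:
            self.lines.append(line)
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Calling `extract_text` on the source code shown above yields one entry per line of visible text, with the `<BODY>` and `<BR>` tags removed.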
Fig. 8 is a schematic diagram of the dimensions into which keywords, one of the segmented-word dimensions in Fig. 5, are divided.
In this embodiment, the keywords in Fig. 5 are divided into the following dimensions: non-cut class, classification extension class, special topic/meeting class, and descriptive and bulletin event class. For example, special topic/meeting class: "Guiding Opinions on Advancing the 'Internet Plus' Smart Energy (Energy Internet) Action", keywords: "guidance", "opinions"; classification extension class: "Private placement strategy: four lines of logic support a slow bull market; watch three investment highlights", keywords: "private placement", "investment".
Refer to Fig. 9, a schematic diagram of one embodiment of the classification codes in Fig. 1.
The classification codes are divided according to the following rules: metadata ID codes, other ID codes, securities codes, institution and company codes, natural person codes, basic entity codes, information entity codes, and indicator codes.
Specifically:
101: metadata ID code
102: other ID codes
1021: data source ID code
1022: ID code
1023: financing event ID
1024: major matter event ID
20: securities code
201: stock code
202: fund code
203: bond code
204: repo code
205: warrant code
206: index code
207: collective wealth-management code
208: futures code
209: precious metal code
210: Hong Kong stock code
211: international securities code
212: other stock codes
213: fiscal code
214: US stock code
30: institution and company code
40: natural person code
50: basic entity code
5001: sector code
5002: code for administrative divisions at county level and above nationwide
5003: industry code
60: information entity code
601: research report code
602: credit rating report
61: news code
65: news code
604: bulletin code
904: bulletin test code
901: news test code
601: research report code
602: credit rating report code
603: new bulletin code
605: laws and regulations code
606: keyword
70: indicator code
701: product code
702: macro industry indicator code
607: original news indicator code
608: institutional research ID
609: medical news
Test codes
904: bulletin test code
901: news test code
Calling method:
ID generation rule: auto-incrementing 12-digit numeric code.
CODE generation rule: the stored procedure FN_GET_CODE(701) uniformly returns a 10-digit code beginning with 701.
Example: 701 library A 20151127 916151 001
701 library B 20151127 096442 001
Fig. 10 is a schematic diagram of one way of acquiring the financial information in Fig. 1.
In this embodiment, the data classification processing method based on financial information comprises: crawling the body text of the financial information and parsing the body text to obtain segmented words; obtaining the frequency of occurrence of each segmented word from the parsing result and selecting the high-frequency segmented words according to a set frequency; and retrieving with the high-frequency segmented words as keywords and matching the retrieval results against the classification codes, so that the keywords are classified by their corresponding classification codes. A web crawler is used to crawl the financial information into a database and build a segmented-word database of financial information body text. The web crawler includes Larbin, Nutch, Heritrix, WebSPHINX, Mercator, and PolyBot, and the results enter the segmented-word database. Those skilled in the art will understand the following examples.
Larbin can obtain or determine all the links of a single financial news website, and can also mirror a financial news website or build a URL list group.
Nutch uses WebDB to store the link structure among the pages the crawler has fetched. WebDB stores two kinds of entity: page and link. A page entity represents an actual web page on the network by describing its characteristics. Because there are many pages to describe, WebDB indexes page entities in two ways: by the URL of the page and by the MD5 of the page content. The characteristics described by a page entity mainly include the number of links in the page, crawl-related information such as the time the page was fetched, and an importance score for the page. For data specific to the financial information industry, more useful information can be captured.
Heritrix selects one of the predetermined URIs (a URI being a character string that identifies an Internet resource), fetches and analyzes it, archives the result, selects the "financial" URIs of interest among those discovered, adds them to the predetermined queue, and then marks the URI that has been processed.
PolyBot consists of a crawl manager, one or more downloaders, and one or more Domain Name System (DNS) resolvers. Discovered URLs are added to a queue on disk and are then processed in batch mode.
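The two-way indexing of page entities described for WebDB (by URL and by the MD5 of the page content) can be sketched as below. This is an illustrative data structure only and does not use or reproduce Nutch's actual API; the names `Page` and `PageStore` and the field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    """Characteristics of one fetched page (field names are illustrative)."""
    url: str
    content: str
    link_count: int
    fetch_time: str
    score: float = 0.0


class PageStore:
    """Index page entities both by URL and by the MD5 digest of their content."""

    def __init__(self):
        self.by_url = {}
        self.by_md5 = {}

    def add(self, page):
        digest = hashlib.md5(page.content.encode("utf-8")).hexdigest()
        self.by_url[page.url] = page
        self.by_md5[digest] = page
        return digest

    def is_duplicate_content(self, content):
        """True if a page with identical content has already been stored."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        return digest in self.by_md5
```

Indexing by content digest lets the same article republished under a different URL be recognized as already fetched, which is one plausible use of the second index.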
Refer to Fig. 11, a schematic diagram of a specific embodiment of the invention.
Title: Cross-border e-commerce "unicorns" shine with the new charm of interconnection (photo)
URL: http://business.sohu.com/20151217/n431680329.shtml
Source: Xinhuanet (www.xinhuanet.com)
The content crawled from the web page in Fig. 11 is as follows:
<p>At the Second World Internet Conference now being held in Wuzhen, Zhejiang, cross-border e-commerce has repeatedly been a hot topic among the attending guests, and it is also the treasure ground where many attending "big names" are prospecting for gold.</p>
<p>Today, with the Internet developing at high speed, interconnection makes everything possible, and the interconnection brought by cross-border e-commerce has tightly linked buyers and sellers separated by vast oceans.</p>
<p>Internationally, start-ups valued at more than one billion US dollars are called "unicorns". With cross-border e-commerce developing rapidly, a large batch of China's future "unicorn" enterprises are being nurtured and growing up.</p>
<p>According to incomplete statistics, more than 200,000 foreign trade enterprises in China are engaged in cross-border e-commerce. The Ministry of Commerce predicts that in 2016 China's cross-border e-commerce import and export volume will reach 6.5 trillion yuan.</p>
<p>The China (Hangzhou) Cross-Border E-Commerce Comprehensive Pilot Zone is China's first cross-border e-commerce comprehensive pilot zone. Wang, deputy director of the Hangzhou cross-border e-commerce comprehensive pilot zone office, said that through institutional innovation, management innovation, and service innovation, the pilot zone is committed to making cross-border e-commerce free, convenient, and standardized.</p>
<p>Wang said that from the start of the pilot in July last year to its full rollout now, in Hangzhou alone more than 2,000 enterprises have transformed into cross-border e-commerce businesses, and all have achieved good trading results.</p>
<p>Hangzhou tree summer dress ornament is an online garment enterprise. After listing on the e-commerce sites Amazon and AliExpress, the company has sold its products to countries such as Japan and Russia, and its sales this year exceed 10 million yuan.</p>
<p>Besides e-commerce companies, more and more traditional enterprises also hope to achieve a "second spring" of development through cross-border e-commerce. Shellfish Betta spins is a traditional foreign trade enterprise, and its general manager, Ma, said: "Through cross-border e-commerce we have removed the intermediate links, and profits have improved a lot. Although the overall foreign trade environment has been poor in recent years, our sales have kept growing."</p>
<p>Yiwu, Zhejiang, famous for "the world's small commodities", has transformed itself into a "world supermarket" by developing cross-border e-commerce. Yiwugo is a professional B2B e-commerce platform that integrates online and offline development, and it currently "covers" 65 of China's specialized markets.</p>
<p>The development of anything new must weather wind and rain. Besides the fragmented, networked, and real-time character of cross-border e-commerce itself, it also faces many "discomforts" such as foreign exchange settlement, customs clearance, and taxation.</p>
<p>On this, Yang Guoxun, deputy director of the National Informatization Expert Advisory Committee, said that growth comes with pain; regulators should promptly define the positioning of cross-border e-commerce and establish supervision and supporting systems suited to its development, so as to promote its healthy development.</p>
<p>"I think the first step of globalization is that the whole chain of commodity circulation is unobstructed, so that residents anywhere can buy the best products from anywhere in the world through cross-border e-commerce," said Liang Jianzhang, chairman of the board of Ctrip, at the Internet conference.</p>
Retrieval is performed with the high-frequency segmented words as keywords, and the retrieval results are matched against the classification codes, so that the keywords are classified by their corresponding classification codes. Specifically, the automatic classification involves:
Support tables: these define the basic data; each CODE is unique within its whole table. The segmented words are classified by dimension as follows, so that the keywords are classified by their corresponding classification codes:
(1). "Keyword" (KEY_WORD)
6060003003 e-commerce
6060003004 foreign trade
(2). "Region" (PUB_AREA_CODE)
5002001016 Zhejiang
5002001081 Yiwu
(3). "Fund/bond" (PUB_SEC_CODE)
3000260576 Hangzhou tree summer dress ornament
(4). "Related person" (PUB_INDIV_INFO)
4000027961 Yang Guoxun
4000156043 Liang Jianzhang
(5). "Related institution" (PUB_ORG_INFO)
3000197012 National Informatization Expert Advisory Committee
(6). "Related industry" (PUB_INDU_CODE)
5003002138 specialized wholesale and retail industry
(7). "Related concept" (COM_CONC_INFO)
5001000169 cross-border e-commerce
(8). "Related company" (STK_BASIC_INFO)
3000466855 Yiwu capitals
Association tables: used to associate a news item with its classification information.
(1). "News keyword association table" (NEWS_KEY_WORD)
1 6126410945 6060003003
2 6126410945 6060003004
(2). "News region association table" (NEWS_AREA_RELA)
1 6126410945 5002001016
2 6126410945 5002001081
(3). "News fund/bond association table" (NEWS_SEC_RELA)
1 6126410945 3000260576
(4). "News person association table" (NEWS_INDIV_RELA)
1 6126410945 4000027961
2 6126410945 4000156043
(5). "News publishing institution association table" (NEWS_PUB_ORG_RELA)
1 6126410945 3000197012
(6). "News industry association table" (NEWS_INDU_RELA)
1 6126410945 5003002138
(7). "News concept/theme association table" (NEWS_CONC_RELA)
1 6126410945 5001000169
(8). "News listed company association table" (NEWS_COM_RELA)
1 6126410945 3000466855
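The association rows listed above can be produced from the matched codes with a sketch like the following. The pairing of each support table with an association table is inferred from the parallel ordering of the two lists and is an assumption; the three-column row layout (sequence number, news ID, classification code) is read off the example rows, and the function name is illustrative.

```python
# Assumed pairing of support tables with association tables, inferred from list order.
ASSOCIATION_TABLES = {
    "KEY_WORD": "NEWS_KEY_WORD",
    "PUB_AREA_CODE": "NEWS_AREA_RELA",
    "PUB_SEC_CODE": "NEWS_SEC_RELA",
    "PUB_INDIV_INFO": "NEWS_INDIV_RELA",
    "PUB_ORG_INFO": "NEWS_PUB_ORG_RELA",
    "PUB_INDU_CODE": "NEWS_INDU_RELA",
    "COM_CONC_INFO": "NEWS_CONC_RELA",
    "STK_BASIC_INFO": "NEWS_COM_RELA",
}


def build_association_rows(news_id, matches):
    """Group (support_table, code) matches into rows of (sequence, news_id, code)."""
    rows = {}
    for support_table, code in matches:
        table = ASSOCIATION_TABLES[support_table]
        bucket = rows.setdefault(table, [])
        bucket.append((len(bucket) + 1, news_id, code))  # sequence restarts per table
    return rows
```

For the news item with ID 6126410945, passing the keyword and region matches from the support tables reproduces the rows shown for NEWS_KEY_WORD and NEWS_AREA_RELA.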
Those of ordinary skill in the art should understand that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A data classification processing method based on financial information, characterized by comprising:
crawling the body text of the financial information, and parsing the body text to obtain segmented words;
obtaining the frequency of occurrence of each segmented word from the parsing result, and obtaining the segmented
words with a high frequency of occurrence according to a set frequency;
searching with the high-frequency segmented words as keywords, and matching the retrieved results against
classification codes, so that the keywords are classified according to their corresponding classification codes.
2. The data classification processing method based on financial information according to claim 1, characterized
in that the method of parsing the crawled body text to obtain segmented words is:
preprocessing: the tags are removed from the character string contained in the body text so that only the text is
retained; if the body text contains HTML tags or English characters, the HTML tags or English characters are not
parsed;
then the body text is split into sentences;
next, a sentence is split off in turn from the text remaining after the previous split, and that sentence is split
again;
finally, the above steps are repeated until every sentence has been cut into individual subunits.
3. The data classification processing method based on financial information according to claim 1, characterized
in that a word-frequency management library is built for the high-frequency segmented words from the result of the
segmentation parsing;
if the frequency of a segmented word is greater than a set value, the segmented word is defined as a high-frequency
segmented word; if the frequency of a segmented word is less than the set value, the segmented word is discarded.
4. The data classification processing method based on financial information according to claim 1, characterized
in that the segmented words are classified along the following dimensions: "keyword", "region", "fund/bond",
"related person", "related institution", "related industry", "related concept", "related company".
5. The data classification processing method based on financial information according to claim 1, characterized
in that the method of matching according to the classification code is: following the order of the high-frequency
segmented words, an exact name match is performed in turn in the category 1 support table, the category 2 support
table, the category 3 support table ... the category N support table; once the name of a segmented word has exactly
matched a record, the corresponding classification code is returned.
6. The data classification processing method based on financial information according to claim 1, characterized
in that the body text of the financial information includes: text and HTML tags.
7. The data classification processing method based on financial information according to claim 4, characterized
in that the keywords are divided along the following dimensions: non-segmented class, classification extension
class, special topic/conference class, descriptive and announcement event class.
8. The data classification processing method based on financial information according to claim 1, characterized
in that the classification codes are divided according to the following rules: metadata ID code, other ID code,
security code, institution and company code, natural person code, basic entity code, information entity code,
index code.
9. The data classification processing method based on financial information according to claim 1, characterized
in that, after the keywords are classified according to their corresponding classification codes, an automatic
association between the financial information and the classification codes is established.
10. The data classification processing method based on financial information according to claim 1, characterized
in that a web crawler is used to crawl the financial information into a database and to build a segmented-word
database of the financial information body text, the web crawler including Larbin, Nutch, Heritrix, WebSPHINX,
Mercator, PolyBot.
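For orientation only, the preprocessing and frequency steps recited in claims 1 to 3 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the regular expressions, the set of sentence delimiters and the helper names are all hypothetical, and the word segmentation itself is assumed to be done by some external segmenter whose output is passed in as a list of words.

```python
import re
from collections import Counter
from typing import Dict, List

# Naive tag pattern; an assumption for illustration, not a full HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
# Sentence delimiters assumed here: common Chinese and Western end marks.
SENTENCE_END_RE = re.compile(r"[。！？!?；;\n]+")


def strip_tags(raw: str) -> str:
    """Remove HTML tags so that only the text is kept."""
    return TAG_RE.sub("", raw)


def split_sentences(text: str) -> List[str]:
    """Cut the text into non-empty sentences at the assumed delimiters."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def high_frequency_words(words: List[str], threshold: int) -> Dict[str, int]:
    """Keep only words occurring more often than the threshold.

    Words whose count equals the threshold are discarded here; the claims
    leave that boundary case open, so this is an assumption.
    """
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > threshold}
```

The surviving words would then serve as the keywords that are matched against the support tables.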
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610029411.2A CN105786961A (en) | 2016-01-15 | 2016-01-15 | Data sorting treatment method based on financial information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105786961A (en) | 2016-07-20 |
Family
ID=56402498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610029411.2A Pending CN105786961A (en) | 2016-01-15 | 2016-01-15 | Data sorting treatment method based on financial information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105786961A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951533A (en) * | 2017-03-21 | 2017-07-14 | 为朔医学数据科技(北京)有限公司 | A kind of Research on Genetic Variation date storage method and device |
CN107273534A (en) * | 2017-06-29 | 2017-10-20 | 武汉楚鼎信息技术有限公司 | A kind of data processing method extracted based on information content, system |
CN107292744A (en) * | 2017-06-07 | 2017-10-24 | 前海梧桐(深圳)数据有限公司 | Investment Trend analysis method and its system based on machine learning |
CN107818510A (en) * | 2017-11-02 | 2018-03-20 | 长江证券股份有限公司 | A kind of distributed processing system(DPS) and method for investment decision auxiliary |
CN109684472A (en) * | 2018-12-20 | 2019-04-26 | 深圳价值在线信息科技股份有限公司 | A kind of trade classification method and system of security information |
CN110287218A (en) * | 2019-06-26 | 2019-09-27 | 浙江诺诺网络科技有限公司 | A kind of matched method of tax revenue sorting code number, system and equipment |
CN111125596A (en) * | 2019-12-11 | 2020-05-08 | 海南港澳资讯产业股份有限公司 | Short message forming method based on financial information |
CN112115263A (en) * | 2020-09-08 | 2020-12-22 | 浙江嘉兴数字城市实验室有限公司 | NLP-based social management big data monitoring and early warning method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101593200A (en) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese Web page classification method based on the keyword frequency analysis |
CN103577587A (en) * | 2013-11-08 | 2014-02-12 | 南京绿色科技研究院有限公司 | News theme classification method |
CN104063513A (en) * | 2011-09-29 | 2014-09-24 | 北京奇虎科技有限公司 | Intelligent vertical search method and system |
CN104679875A (en) * | 2015-03-10 | 2015-06-03 | 杭州凡闻科技有限公司 | Method for classifying information data based on digital newspaper |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160720 |