CN105786961A - Data sorting treatment method based on financial information - Google Patents

Data sorting treatment method based on financial information


Publication number
CN105786961A
Authority
CN
China
Prior art keywords
participle
financial information
code
text
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610029411.2A
Other languages
Chinese (zh)
Inventor
黄�俊
鄢坤
易君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Up Wealth Management Co ltd
Original Assignee
Up Wealth Management Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Up Wealth Management Co ltd
Priority to CN201610029411.2A
Publication of CN105786961A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Abstract

The invention discloses a data classification processing method based on financial information. The method comprises the following steps: capturing the body text of the financial information and parsing it to obtain segmented words; obtaining the occurrence frequency of each segmented word from the parsing result and selecting the high-frequency segmented words according to a set frequency; and retrieving with the high-frequency segmented words as keywords and matching the retrieval results against classification codes, so that the keywords are classified under the corresponding classification codes. Parsing the body text into segmented words enables word-frequency management of the classification in each dimension. Because the high-frequency segmented words are retrieved as keywords and the retrieval results are matched against classification codes, an association between the segmented words and the support data is established, so that the information can be applied precisely. The method enables automatic association, automatic warehousing, and related processing, and improves the efficiency of processing financial information.

Description

A data classification processing method based on financial information
Technical field
The present invention relates to the classification and processing of big data, and in particular to a data classification processing method based on financial information.
Background art
As financial markets continue to develop, the scope and volume of financial information are growing rapidly, while users demand ever more precise retrieval of that information. How to build an index over massive amounts of financial information quickly and with reasonable accuracy has become an urgent problem. Two ways of handling information data are currently common in China: in the first, a stand-alone program captures the information and assigns it to a specified category; in the second, a person decides the category manually. The former is efficient, but its classification is coarse and it does not support multidimensional classification. The latter classifies precisely and supports multidimensional cross-classification, but it is inefficient.
As the volume of financial information to be processed multiplies and the information must be classified precisely before it is used, traditional processing faces a dilemma: automatic processing by program is efficient but less accurate, while manual processing is accurate but requires a large staff. How to classify acquired financial information quickly and accurately is therefore the central problem in automatic classification of financial information data. Faced with the large volume of financial information being acquired, traditional manual classification can no longer meet the demand.
Summary of the invention
The technical problem to be solved by the present invention is to classify financial information data automatically in a way that better satisfies accuracy requirements while improving processing efficiency.
To solve the above technical problem, the invention provides a data classification processing method based on financial information, comprising:
capturing the body text of the financial information and parsing the body text to obtain segmented words;
obtaining the occurrence frequency of each segmented word from the parsing result, and selecting the high-frequency segmented words according to a set frequency;
retrieving with the high-frequency segmented words as keywords, and matching the retrieval results against classification codes so that each keyword is classified under the corresponding classification code.
Further, the captured body text is parsed into segmented words as follows:
preprocessing: the tags are stripped from the character string contained in the body text so that only the text remains; if the body text contains HTML tags or English characters, those HTML tags or English characters are not parsed;
the body text is then split into sentences;
next, one sentence at a time is cut in order from the text remaining after each split, and that sentence is split again;
finally, the above steps are repeated until every sentence has been cut into individual sub-units.
Further, a word-frequency management library is built from the high-frequency segmented words according to the segmentation result:
if the occurrence frequency of a segmented word is greater than a set value, the word is designated a high-frequency segmented word; if its occurrence frequency is less than the set value, the word is discarded.
Further, the segmented words are classified into the following dimensions: "keyword", "region", "fund/bond", "related person", "related institution", "related industry", "related concept", and "related company".
Further, matching against the classification codes proceeds as follows: taking the high-frequency segmented words in order, an exact name match is attempted in turn in the category 1 support table, the category 2 support table, the category 3 support table, ..., and the category N support table; once the name of a segmented word exactly matches a record, the corresponding classification code is returned.
Further, the body text of the financial information includes text and HTML tags.
Further, the keywords are divided according to the following dimensions: the "not cut out" class, the classification extension class, the special topic/meeting class, and the descriptive and bulletin event class.
Further, the classification codes are divided according to the following rule: metadata ID codes, other ID codes, securities codes, institution and company codes, natural person codes, basic entity codes, information entity codes, and indicator codes.
Further, after the keywords are classified under the corresponding classification codes, an automatic association between the financial information and the classification codes is established, and classification information related to a news item can be associated through multiple association tables.
Further, a web crawler is used to capture the financial information into a database and to build a segmented-word database of financial information body text; the web crawler includes Larbin, Nutch, Heritrix, WebSPHINX, Mercator, and PolyBot.
Beneficial effects of the present invention:
1) Segmented words are obtained by parsing the body text, which enables word-frequency management of the classification in each dimension.
2) The high-frequency segmented words are retrieved as keywords and the retrieval results are matched against classification codes, so that an association between the segmented words and the support data is established and the information can be applied more precisely.
3) The data classification processing method based on financial information of the present invention enables automatic association, automatic warehousing, and related processing, and improves the efficiency of processing financial information.
Brief description of the drawings
Fig. 1 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
Fig. 2 is a schematic diagram of the parsing method in Fig. 1.
Fig. 3 is a schematic diagram of the preprocessing step in Fig. 1.
Fig. 4 is a flow diagram of the high-frequency segmented-word processing in Fig. 1.
Fig. 5 is a schematic diagram of the dimension classification of the segmented words in Fig. 1.
Fig. 6 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
Fig. 7 is a schematic diagram of the body text structure of the financial information in Fig. 1.
Fig. 8 is a schematic diagram of the division dimensions of the keyword dimension among the segmented-word dimensions in Fig. 5.
Fig. 9 is a schematic diagram of one embodiment of the classification codes in Fig. 1.
Fig. 10 is a schematic diagram of one way of acquiring the financial information in Fig. 1.
Fig. 11 is a schematic diagram of a specific embodiment of the invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Refer to Fig. 1, a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
This embodiment comprises:
Step S101: capture the body text of the financial information and parse it to obtain segmented words. The financial information includes, but is not limited to, global news, real-time futures quotes, and real-time stock quotes. The body text is acquired from information on the internet, and the terminals include, but are not limited to, PCs, smartphones, and tablets (PAD). To parse the body text into segmented words, natural segmentation can be used, or segmentation based on a dictionary of financial information terms such as "listing", "limit-up", "silver", and "trading"; following the principle of priority by importance, forward segmentation can be applied first, followed by reverse segmentation and a combination of both directions.
Step S102: obtain the occurrence frequency of each segmented word from the parsing result, and select the high-frequency segmented words according to a set frequency. For example, if "futures" occurs many times in a piece of financial news after segmentation, its occurrence frequency is considered high; likewise, if "Shanghai Futures Exchange" occurs many times after segmentation, its occurrence frequency is considered high. Those skilled in the art will understand that frequently occurring segmented words can also be designated high-frequency on the basis of experience. For example, in the sentence "In 2015 the A-share market remained very bullish: the Shenzhen Component Index rose 14.98% for the year, leading the world's major indices, and the Shanghai Composite Index rose 9.41% for the year.", the segmented words designated by experience may include "A-share", "Shenzhen Component Index", and "Shanghai Composite Index".
Step S103: retrieve with the high-frequency segmented words as keywords, and match the retrieval results against classification codes so that each keyword is classified under the corresponding classification code. Those skilled in the art will understand that retrieving with a keyword includes, but is not limited to, searching a configured database or searching according to corresponding rules.
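The dictionary-based segmentation mentioned in step S101 (forward first, then reverse, then a combination of both) can be illustrated with a maximum-matching sketch. This is only one possible reading of that step, not the patented implementation: the function names, the `max_len` parameter, the tie-breaking rule, and the Chinese entries in `FINANCE_DICT` (back-translations of "listing", "limit-up", "silver", "trading", plus "company") are all assumptions made for illustration.

```python
# Hypothetical mini-dictionary of financial terms (assumed back-translations).
FINANCE_DICT = {"上市", "涨停", "白银", "买卖", "公司"}


def forward_max_match(text, dictionary, max_len=5):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=5):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_match(text, dictionary, max_len=5):
    """Prefer the forward result; use the backward one only if it has fewer words."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    return bwd if len(bwd) < len(fwd) else fwd


print(bidirectional_match("公司上市涨停", FINANCE_DICT))
```

Preferring the result with fewer words is a common heuristic for combining the two directions; the source only says that forward segmentation takes priority, so the fallback rule here is a guess.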
Refer to Fig. 2, a schematic diagram of the parsing method in Fig. 1.
In this embodiment, the captured body text is parsed into segmented words as follows:
Step S201: preprocess.
Step S202: split the body text into sentences; the text can be cut into independent sentences at a terminating full stop ("。"), comma (","), or semicolon (";").
Step S203: cut one sentence at a time, in order, from the text remaining after each split, and split that sentence again, so that each complete sentence is obtained one by one.
Step S204: repeat the above steps until every sentence has been cut into individual sub-units. A sub-unit is a Chinese word or character that can carry meaning on its own, or an English word, English name, or English abbreviation.
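Steps S202 to S204 can be sketched as a loop that peels one sentence at a time from the front of the remaining text and then cuts each sentence into sub-units. The sketch assumes full-width Chinese punctuation as the sentence terminators, and it uses a per-character cut as a stand-in for real word segmentation; the names `peel_sentences` and `cut_subunits` are hypothetical.

```python
import re

# Assumed sentence terminators: full-width full stop, comma, and semicolon.
SENTENCE_END = re.compile(r"[。,;]")
# A run of ASCII letters stays whole; anything else is cut per character.
SUBUNIT = re.compile(r"[A-Za-z]+|[^\sA-Za-z]")


def peel_sentences(body):
    """Yield sentences one at a time, each cut from the front of the remaining text."""
    remaining = body
    while remaining:
        match = SENTENCE_END.search(remaining)
        if match is None:
            sentence, remaining = remaining, ""
        else:
            sentence = remaining[:match.start()]
            remaining = remaining[match.end():]
        if sentence.strip():
            yield sentence.strip()


def cut_subunits(sentence):
    """Cut a sentence into sub-units, keeping English words and abbreviations intact."""
    return SUBUNIT.findall(sentence)


for sentence in peel_sentences("A股上涨,深证成指领先;沪指收高。"):
    print(sentence, cut_subunits(sentence))
```

In a fuller pipeline the per-character cut would be replaced by a dictionary-based segmenter, so that multi-character Chinese words also stay whole.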
Fig. 3 is a schematic diagram of the preprocessing step in Fig. 1.
Step S301: strip the tags from the character string contained in the body text so that only the text remains. Taking financial information extracted from the internet as an example, such information is generally written in HTML (HyperText Markup Language), the standard language for making World Wide Web pages. An HTML file is a descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. A web browser such as Netscape Navigator or Microsoft Internet Explorer can interpret an HTML file and display the web page. The text is obtained together with its HTML tags, and after the tags are removed only the useful financial information text remains.
For example, the tag <FONT> has the attributes:
SIZE sets the size of the text
COLOR sets the color of the text
FACE sets the font of the text
As another example, the tag <BASEFONT> has the attributes:
SIZE changes the default text size
COLOR changes the default text color
FACE changes the default text font
As another example, the tag <BODY>:
Tag format: <BODY TEXT="attribute value">...</BODY>
As a further example, the tag <Hn> has the attribute:
ALIGN sets how the heading is aligned.
Tag format: <Hn>...</Hn>
<Hn ALIGN="attribute value">...</Hn>. Articles of financial information have structures such as titles, subtitles, chapters, and sections, and HTML provides the corresponding heading tag <Hn>, where n is the heading level. A heading tag moves to the next line automatically, without a separate line-break tag. HTML provides six heading levels; the smaller n is, the larger the heading font.
By recognizing HTML tags such as those in the examples above, the plain text can be extracted.
Step S302: determine whether the body text contains HTML tags or English characters; if so, go to step S304, and if not, go to step S303. With regard to English characters, proper names and institution abbreviations are treated as special cases, for example personal names such as Paul A. Samuelson and N. Gregory Mankiw, and institutions such as the Shanghai Futures Exchange (SHFE), the Chicago Board of Trade (CBOT), the New York Commodity Exchange (COMEX), and the London Metal Exchange (LME).
Step S303: parse. If the body text contains none of the above HTML tags or (special) English characters, it is parsed directly.
Step S304: do not parse the HTML tags or English characters; that is, the above HTML tags and (special) English characters are left unparsed.
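The branch in steps S302 to S304 can be sketched as a pass that marks runs of English characters as protected, so that names and abbreviations are passed through whole while the surrounding Chinese text goes on to segmentation. The regular expression and the function name `split_protected` are assumptions for illustration; the source does not say how special English strings are recognized.

```python
import re

# Assumed pattern: a run of ASCII letters, optionally containing dots and
# spaces (for example "Paul A.Samuelson"), or a single letter.
ENGLISH_RUN = re.compile(r"[A-Za-z][A-Za-z.\s]*[A-Za-z.]|[A-Za-z]")


def split_protected(text):
    """Return (segment, is_protected) pairs; English runs are marked protected."""
    pieces, pos = [], 0
    for match in ENGLISH_RUN.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


for segment, protected in split_protected("上海期货交易所(SHFE)"):
    # Protected segments skip parsing (step S304); the rest is parsed (step S303).
    print("keep whole" if protected else "parse", repr(segment))
```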
Fig. 4 is a flow diagram of the high-frequency segmented-word processing in Fig. 1.
In this embodiment, a word-frequency management library is built from the high-frequency segmented words according to the segmentation result:
Step S401: determine whether the frequency of a segmented word is greater than the set value. Depending on the characteristics of the financial information industry, the set value can be chosen per word according to how often different words occur; for example, the set value for "A-share" is 15 occurrences and the set value for "Shanghai Futures Exchange" is 5 occurrences. Preferably in this embodiment, the title segmentation result and the body segmentation result are merged when computing word frequency.
If not, go to step S402 and discard the segmented word: although it was produced by segmentation, a word whose occurrence frequency does not exceed the set value is discarded and takes no part in subsequent processing.
If so, go to step S403 and designate the segmented word a high-frequency segmented word; it enters the word-frequency management library and can be mapped at call time by configuring a linked server.
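The threshold test in steps S401 to S403 can be sketched with a counter that merges title and body word counts and keeps only words whose count exceeds their set value. The per-word values for "A-share" (15) and "Shanghai Futures Exchange" (5) come from the text; the default of 10 and the function name are assumptions.

```python
from collections import Counter

# Set values from the description; DEFAULT_THRESHOLD is a hypothetical fallback.
THRESHOLDS = {"A-share": 15, "Shanghai Futures Exchange": 5}
DEFAULT_THRESHOLD = 10


def high_frequency_words(title_words, body_words,
                         thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Merge title and body counts, then keep words whose count exceeds the set value."""
    counts = Counter(title_words) + Counter(body_words)
    kept = {word: n for word, n in counts.items()
            if n > thresholds.get(word, default)}
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


title = ["A-share"]
body = ["A-share"] * 15 + ["Shanghai Futures Exchange"] * 5 + ["bond"] * 3
print(high_frequency_words(title, body))
```

Here "A-share" is kept because the merged count of 16 exceeds 15, while "Shanghai Futures Exchange" is discarded because a count of 5 does not exceed its set value of 5.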
Fig. 5 is a schematic diagram of the dimension classification of the segmented words in Fig. 1.
In this embodiment, the segmented words can be classified into the following dimensions:
A "keyword" (KEY_WORD),
B "region" (PUB_AREA_CODE),
C "fund/bond" (PUB_SEC_CODE),
D "related person" (PUB_INDIV_INFO),
E "related institution" (PUB_ORG_INFO),
F "related industry" (PUB_INDU_CODE),
G "related concept" (COM_CONC_INFO),
H "related company" (STK_BASIC_INFO).
Fig. 6 is a schematic diagram of a specific implementation of the data classification processing method based on financial information in one embodiment of the invention.
The data classification processing method based on financial information in this embodiment comprises the following steps:
Step S601: start.
Step S602: acquire the body text of the financial information.
Step S603: parse to obtain segmented words.
Step S604: retrieve with the high-frequency segmented words as keywords. Retrieving only with high-frequency segmented words greatly reduces the amount of data to process and also makes the classification more accurate.
Step S605: attempt an exact name match in the category 1 support table, the category 2 support table, the category 3 support table, ..., and the category N support table; once the name of a segmented word exactly matches a record, return the corresponding classification code. The category 1 to category N support tables are preset. Further, the associated classification can also come from multiple classification tables, which are used to associate classification information related to a news item. For example, if the category 1 support table corresponds to "fund/bond", the corresponding database is searched for fund/bond keywords using the high-frequency segmented words.
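The ordered exact-match lookup in step S605 can be sketched as a walk over the support tables that stops at the first hit. The table names and the two sample codes per table are taken from the worked example at the end of this description; the English keys, the in-memory dictionaries standing in for database tables, and the function names are assumptions.

```python
# Support tables in lookup order; each maps an exact name to a classification code.
SUPPORT_TABLES = [
    ("KEY_WORD", {"e-commerce": "6060003003", "foreign trade": "6060003004"}),
    ("PUB_AREA_CODE", {"Zhejiang": "5002001016", "Yiwu": "5002001081"}),
    ("PUB_INDIV_INFO", {"Liang Jianzhang": "4000156043"}),
]


def match_code(word, tables=SUPPORT_TABLES):
    """Return (table name, classification code) for the first exact match, else None."""
    for table_name, records in tables:
        code = records.get(word)
        if code is not None:
            return table_name, code
    return None


def classify(words, tables=SUPPORT_TABLES):
    """Look up each high-frequency word in order and keep only the matched ones."""
    results = []
    for word in words:
        hit = match_code(word, tables)
        if hit is not None:
            results.append((word, *hit))
    return results


print(classify(["e-commerce", "Yiwu", "unlisted word"]))
```

Because the first matching table wins, the order of the support tables decides the dimension of a name that happens to appear in more than one table.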
Fig. 7 is a schematic diagram of the body text structure of the financial information in Fig. 1.
In this embodiment, the captured body text of the financial information has the structure text + HTML tag + text, or HTML tag + text + HTML tag. For example, the source code:
<BODY TEXT=FF0000>
Finance?<BR>A-share ticket!<BR>wealth.
</BODY>
The financial information obtained by the capture:
Finance?
A-share ticket!
Wealth
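The extraction shown in this example can be reproduced with the `html.parser` module from the Python standard library. The sketch below is a minimal illustration rather than the method used by the inventors: it drops every tag, treats `<BR>` as a line break, and returns the remaining text lines; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text content, starting a new line at every <BR> tag."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "br":
            self.lines.append("")

    def handle_data(self, data):
        self.lines[-1] += data


def extract_lines(markup):
    """Return the non-empty text lines of an HTML fragment with all tags removed."""
    parser = BodyTextParser()
    parser.feed(markup)
    parser.close()
    return [line.strip() for line in parser.lines if line.strip()]


SOURCE = "<BODY TEXT=FF0000>\nFinance?<BR>A-share ticket!<BR>wealth.\n</BODY>"
print(extract_lines(SOURCE))
```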
Fig. 8 is a schematic diagram of the division dimensions of the keyword dimension among the segmented-word dimensions in Fig. 5.
In this embodiment, the keywords in Fig. 5 are divided according to the following dimensions: the "not cut out" class, the classification extension class, the special topic/meeting class, and the descriptive and bulletin event class. For example, for the special topic/meeting class, the title "Guiding Opinions on Advancing the "Internet Plus" Smart Energy (Energy Internet) Action" yields the keywords "guiding" and "opinions"; for the classification extension class, the title "Private fund strategy: four lines of logic support a slow bull market; watch three major investment highlights" yields the keywords "private fund" and "investment".
Refer to Fig. 9, a schematic diagram of one embodiment of the classification codes in Fig. 1.
The classification codes are divided according to the following rule: metadata ID codes, other ID codes, securities codes, institution and company codes, natural person codes, basic entity codes, information entity codes, and indicator codes.
Specifically:
101: metadata ID code
102: other ID codes
1021: data source ID code
1022: ID code
1023: financing event ID
1024: major matters event ID
20: securities code
201: stock code
202: fund code
203: bond code
204: repurchase code
205: warrant code
206: index code
207: set financing code
208: futures code
209: precious metal code
210: Hong Kong stock code
211: international securities code
212: other securities codes
213: fiscal code
214: US stock code
30: institution and company code
40: natural person code
50: basic entity code
5001: sector code
5002: code for administrative divisions at county level and above nationwide
5003: industry code
60: information entity code
601: research report code
602: credit rating report code
61: news code
65: news code
604: bulletin code
904: bulletin test code
901: news test code
601: research report code
602: credit rating report code
603: new bulletin code
605: laws and regulations code
606: keyword
70: indicator code
701: product code
702: macro industry indicator code
607: original news indicator code
608: institution survey ID
609: medical news
Test codes
904: bulletin test code
901: news test code
Calling method:
ID generation rule: auto-incrementing 12-digit code
CODE generation rule: the stored procedure FN_GET_CODE(701) returns a unified 10-digit code beginning with 701.
Example: 701 database A 20151127 916151 001
701 database B 20151127 096442 001
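Because each classification code begins with the number of its category, the category of a 10-digit code can be recovered by a longest-prefix lookup. The sketch below assumes this prefix convention, which is inferred from the FN_GET_CODE(701) rule and from the sample codes in the worked example; the subset of prefixes, the label wording, and the function name `describe_code` are illustrative only.

```python
# A subset of the category prefixes listed above, with assumed English labels.
CODE_PREFIXES = {
    "20": "securities code",
    "30": "institution and company code",
    "40": "natural person code",
    "50": "basic entity code",
    "5001": "sector code",
    "5002": "administrative division code",
    "5003": "industry code",
    "60": "information entity code",
    "606": "keyword",
    "70": "indicator code",
    "701": "product code",
}


def describe_code(code, prefixes=CODE_PREFIXES):
    """Return (prefix, label) for the longest known prefix of a code, else None."""
    for length in range(len(code), 0, -1):
        label = prefixes.get(code[:length])
        if label is not None:
            return code[:length], label
    return None


for sample in ("6060003003", "5002001016", "4000027961"):
    print(sample, describe_code(sample))
```

Trying the longest prefix first matters because the categories nest: 5002 must win over 50, and 606 over 60.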
Fig. 10 is a schematic diagram of one way of acquiring the financial information in Fig. 1.
In this embodiment, the data classification processing method based on financial information comprises: capturing the body text of the financial information and parsing it to obtain segmented words; obtaining the occurrence frequency of each segmented word from the parsing result and selecting the high-frequency segmented words according to a set frequency; and retrieving with the high-frequency segmented words as keywords and matching the retrieval results against classification codes so that each keyword is classified under the corresponding classification code. A web crawler is used to capture the financial information into a database and to build a segmented-word database of financial information body text; the web crawler includes Larbin, Nutch, Heritrix, WebSPHINX, Mercator, and PolyBot, and the results enter the segmented-word database. Those skilled in the art will understand the following examples. Larbin can obtain or determine all the links of a single financial news website, and can also mirror a financial news website or build a group of URL lists. Nutch uses WebDB to store the link-structure information between the pages the crawler has fetched. WebDB stores two kinds of entities: page and link. A page entity characterizes an actual web page by describing its features. Because there are many pages to describe, WebDB indexes the page entities in two ways, by the URL of the page and by the MD5 of the page content. The features described by a page entity mainly include the number of links in the page, fetch-related information such as the time the page was fetched, and an importance score for the page, so that more effective information can be captured for data specific to the financial information industry. Heritrix selects one of the predetermined URIs (character strings that identify the name of an internet resource), fetches it, analyzes it and archives the result, selects the discovered URIs of interest (here, "financial" ones) and adds them to the predetermined queue, and then marks the processed URI. PolyBot consists of a crawl manager, one or more downloaders, and one or more domain name system (DNS) resolvers; extracted URLs are added to a queue on disk and then processed in batch mode.
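The double indexing of page entities described for WebDB (by page URL and by the MD5 of the page content) can be sketched with two dictionaries. This is a toy illustration of the idea and not Nutch's actual API: the `Page` fields follow the features named in the text, while the class names, the sample content, and the mirror URL are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    """A page entity: the features the text lists for a fetched web page."""
    url: str
    content_md5: str
    link_count: int
    fetch_time: str
    score: float = 1.0


class PageStore:
    """Index page entities both by URL and by the MD5 of their content."""

    def __init__(self):
        self.by_url = {}
        self.by_md5 = {}

    def add(self, url, content, link_count, fetch_time, score=1.0):
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        page = Page(url, digest, link_count, fetch_time, score)
        self.by_url[url] = page
        # Pages with identical content (for example mirrors) share one MD5 key.
        self.by_md5.setdefault(digest, []).append(page)
        return page


store = PageStore()
first = store.add("http://business.sohu.com/20151217/n431680329.shtml",
                  "cross-border e-commerce", 12, "2015-12-17")
store.add("http://example.com/mirror", "cross-border e-commerce", 3, "2015-12-18")
print(len(store.by_md5[first.content_md5]))
```

Indexing by content digest makes it cheap to notice that two URLs carry the same article, which is useful before the text is segmented a second time.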
Refer to Fig. 11, a schematic diagram of a specific embodiment of the invention.
Title: Cross-border e-commerce "unicorns" shine with the new charm of interconnection (photo)
URL address: http://business.sohu.com/20151217/n431680329.shtml
Source: www.xinhuanet.com
The content captured from the web page in Fig. 11 is as follows:
<p>At the Second World Internet Conference, now being held in Wuzhen, Zhejiang, cross-border e-commerce has repeatedly become a hot topic among the attending guests, and it is also the treasure ground where many attending "big names" are prospecting for gold.</p>
<p>Today, with the internet developing at high speed, interconnection makes everything possible, and the interconnection brought by cross-border e-commerce closely links buyers and sellers separated by vast oceans.</p>
<p>Internationally, start-up companies valued at more than one billion US dollars are called "unicorns". With cross-border e-commerce developing rapidly, a large batch of China's future "unicorn" enterprises is taking shape and growing.</p>
<p>According to incomplete statistics, more than 200,000 foreign-trade enterprises in China conduct cross-border e-commerce. The Ministry of Commerce predicts that in 2016 China's cross-border e-commerce import and export volume will reach 6.5 trillion yuan.</p>
<p>The China (Hangzhou) Cross-Border E-Commerce Comprehensive Pilot Zone is China's first cross-border e-commerce comprehensive pilot zone. Wang, deputy director of the Hangzhou pilot zone office, said that through institutional innovation, management innovation, and service innovation, the zone is committed to making cross-border e-commerce liberalized, convenient, and standardized.</p>
<p>Wang said that since the pilot was fully launched in July of last year, in Hangzhou alone more than 2,000 enterprises have transformed into cross-border e-commerce businesses, and all have achieved good sales results.</p>
<p>Hangzhou Shuxia Apparel is an online garment enterprise. After the company entered the e-commerce sites Amazon and AliExpress, its products have been sold to countries such as Japan and Russia, and its sales this year exceed 10 million yuan.</p>
<p>Besides e-commerce enterprises, more and more traditional enterprises also hope to achieve a "second spring" of development through cross-border e-commerce. Shellfish Betta Spinning is a traditional foreign-trade enterprise, and its general manager, Ma, said: "By using cross-border e-commerce we cut out the intermediate links, and profits have improved a great deal. Although the overall foreign-trade environment has been poor in recent years, our sales have kept growing."</p>
<p>Yiwu, Zhejiang is famous for "global small commodities"; by developing cross-border e-commerce it has transformed itself into a "world supermarket". Yiwugo is a professional B2B e-commerce platform that integrates online and offline development, and it currently "covers" 65 specialized markets in China.</p>
<p>The development of new things has to weather wind and rain. Apart from the fragmented, networked, and instantaneous nature of cross-border e-commerce itself, it also faces many "discomforts" such as foreign-exchange settlement, customs clearance, and taxation.</p>
<p>On this point, Yang Guoxun, deputy director of the National Information Expert Advisory Committee, said that growth comes with growing pains; regulators should position cross-border e-commerce clearly and establish supervision and supporting systems suited to its development, so as to promote its sound development.</p>
<p>"I think the first step of globalization is that all commodity flows are unobstructed, so that a resident of any place can buy the best products from anywhere in the world through cross-border e-commerce," said Liang Jianzhang, chairman of the board of Ctrip, at the internet conference.</p>
Being retrieved as keyword by participle high for the described frequency of occurrences, the retrieval result obtained is according to dividing Class coding mates, and makes keyword classify according to sorting code number correspondence, and specifically classification includes automatically:
Support table: basic data defines, be that CODE is unique in whole table.Participle enters as follows Row dimension is classified, and makes keyword classify according to sorting code number correspondence:
(1). " keyword " (KEY_WORD)
6060003003 electricity business
6060003004 foreign trades
(2). " region " (PUB_AREA_CODE)
5002001016 Zhejiang
5002001081 Yiwus
(3). " fund/bond " (PUB_SEC_CODE)
3000260576 Hangzhou tree summer dress ornaments
(4). " related person " (PUB_INDIV_INFO)
4000027961 poplar state merits
4000156043 beams build chapter
(5). " associated mechanisms " (PUB_ORG_INFO)
3000197012 national information Expert Advisory Committee (EAC)s
(6). " relevant industries " (PUB_INDU_CODE)
5003002138 specialty wholesale and retail industries
(7). " related notion " (COM_CONC_INFO)
5001000169 cross-border electricity business
(8). " associated companies " (STK_BASIC_INFO)
3000466855 Yiwu capitals
Association tables: used to associate a news item with its classification information. Each row gives a sequence number, the news ID, and the classification code.
(1). "news keyword association table" (NEWS_KEY_WORD)
1 6126410945 6060003003
2 6126410945 6060003004
(2). "news region association table" (NEWS_AREA_RELA)
1 6126410945 5002001016
2 6126410945 5002001081
(3). "news fund/bond association table" (NEWS_SEC_RELA)
1 6126410945 3000260576
(4). "news person association table" (NEWS_INDIV_RELA)
1 6126410945 4000027961
2 6126410945 4000156043
(5). "news organization association table" (NEWS_PUB_ORG_RELA)
1 6126410945 3000197012
(6). "news industry association table" (NEWS_INDU_RELA)
1 6126410945 5003002138
(7). "news concept/theme association table" (NEWS_CONC_RELA)
1 6126410945 5001000169
(8). "news listed company association table" (NEWS_COM_RELA)
1 6126410945 3000466855
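The way matched classification codes end up as rows in the association tables can be sketched as follows. This is only an illustration under stated assumptions: the pairing of each support table with an association table is inferred from the order of the two lists and from the codes they share, each row is assumed to be (sequence number, news ID, classification code), and the helper `build_association_rows` is hypothetical rather than part of the described method.

```python
# Assumed pairing of support tables with association tables, inferred
# from the order of the two lists and the codes that appear in both.
ASSOCIATION_TABLE_FOR = {
    "KEY_WORD": "NEWS_KEY_WORD",
    "PUB_AREA_CODE": "NEWS_AREA_RELA",
    "PUB_SEC_CODE": "NEWS_SEC_RELA",
    "PUB_INDIV_INFO": "NEWS_INDIV_RELA",
    "PUB_ORG_INFO": "NEWS_PUB_ORG_RELA",
    "PUB_INDU_CODE": "NEWS_INDU_RELA",
    "COM_CONC_INFO": "NEWS_CONC_RELA",
    "STK_BASIC_INFO": "NEWS_COM_RELA",
}


def build_association_rows(news_id, matches):
    """Turn (support_table, code) matches into association rows.

    Returns {association_table: [(seq, news_id, code), ...]}, where seq
    restarts at 1 inside each association table (hypothetical layout).
    """
    rows = {}
    for support_table, code in matches:
        target = rows.setdefault(ASSOCIATION_TABLE_FOR[support_table], [])
        target.append((len(target) + 1, news_id, code))
    return rows


# Matches taken from the example: two keywords and one organization.
example_rows = build_association_rows(
    6126410945,
    [
        ("KEY_WORD", 6060003003),
        ("KEY_WORD", 6060003004),
        ("PUB_ORG_INFO", 3000197012),
    ],
)
```

Keeping the news ID and the classification code in a separate row per match means one news item can carry any number of codes in each dimension without changing the support tables.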
Those of ordinary skill in the art should understand that the above describes only specific embodiments of the present invention and does not limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A data classification processing method based on financial information, characterized by comprising:
capturing the text in financial information, and parsing the text to obtain participles;
obtaining the frequency of occurrence of each participle from the parsing result, and selecting the high-frequency participles according to a set frequency;
using the high-frequency participles as keywords for retrieval, and matching the retrieval results against classification codes, so that each keyword is classified under its corresponding classification code.
2. The data classification processing method based on financial information according to claim 1, characterized in that the method of parsing the captured text to obtain participles is:
preprocessing: removing the tags from the character strings contained in the text so that only the body text is retained; if the body text contains HTML tags or English characters, those HTML tags or English characters are not parsed;
then segmenting the body text into sentences;
next, cutting one sentence at a time, in order, from the text remaining after segmentation, and segmenting that sentence again;
finally, repeating the above steps until every sentence has been cut into individual sub-units.
3. The data classification processing method based on financial information according to claim 1, characterized in that a word-frequency management library is built for the high-frequency participles from the result of participle parsing;
if the frequency of a participle is greater than a set value, the participle is defined as a high-frequency participle; if the frequency of a participle is less than the set value, the participle is discarded.
4. The data classification processing method based on financial information according to claim 1, characterized in that the participles are classified along the following dimensions: "keyword", "region", "fund/bond", "related person", "related organization", "related industry", "related concept", "related company".
5. The data classification processing method based on financial information according to claim 1, characterized in that the method of matching against the classification codes is: following the order of the high-frequency participles, each participle is looked up by exact name match in the category 1 support table, the category 2 support table, the category 3 support table, ... the category N support table in turn; once the participle name exactly matches a record, the corresponding classification code is returned.
6. The data classification processing method based on financial information according to claim 1, characterized in that the text in the financial information includes: body text and HTML tags.
7. The data classification processing method based on financial information according to claim 4, characterized in that the keywords are divided along the following dimensions: non-segmentable class, classification extension class, special topic/conference class, descriptive class, and announcement event class.
8. The data classification processing method based on financial information according to claim 1, characterized in that the classification codes are divided according to the following rules: metadata ID codes, other ID codes, securities codes, organization and company codes, natural person codes, basic entity codes, information entity codes, index codes.
9. The data classification processing method based on financial information according to claim 1, characterized in that, after the keywords are classified under their corresponding classification codes, an automatic association between the financial information and the classification codes is established.
10. The data classification processing method based on financial information according to claim 1, characterized in that a web crawler is used to capture the financial information and store it in a database, establishing a participle database of financial information text; the web crawler includes Larbin, Nutch, Heritrix, WebSPHINX, Mercator, PolyBot.
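The steps recited in claims 2, 3, and 5 can be sketched end to end as below. This is a minimal illustration, not the claimed implementation: the claims do not name a segmentation algorithm, so forward maximum matching against a caller-supplied vocabulary is swapped in as an assumption; the sentence delimiters, the `max_len` limit, and every function name are hypothetical; a frequency equal to the set value is treated as discarded, which the claims leave unspecified; and the special handling of English characters is not modeled.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")              # HTML tags such as <p> or </p>
SENTENCE_END_RE = re.compile(r"[。！？!?]+")  # assumed sentence delimiters


def split_sentences(raw):
    """Claim 2: drop tags, keep the body text, and cut it into sentences."""
    body = TAG_RE.sub(" ", raw)
    return [s.strip() for s in SENTENCE_END_RE.split(body) if s.strip()]


def segment(sentence, vocabulary, max_len=4):
    """Claim 2: cut one sentence into sub-units by forward maximum matching
    (longest vocabulary entry first, a single character as the fallback)."""
    units, i = [], 0
    while i < len(sentence):
        if sentence[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocabulary:
                units.append(piece)
                i += size
                break
    return units


def high_frequency(units, set_value):
    """Claim 3: keep participles whose count exceeds the set value,
    ordered from most to least frequent."""
    return [w for w, c in Counter(units).most_common() if c > set_value]


def match_code(participle, support_tables):
    """Claim 5: try each support table in order; return (table, code) for
    the first exact name match, or None if no table holds the name."""
    for table_name, records in support_tables:
        if participle in records:
            return table_name, records[participle]
    return None


def classify(raw, vocabulary, support_tables, set_value):
    """Run the whole sketch: parse, filter by frequency, match codes."""
    units = []
    for sentence in split_sentences(raw):
        units.extend(segment(sentence, vocabulary))
    matches = []
    for word in high_frequency(units, set_value):
        hit = match_code(word, support_tables)
        if hit is not None:
            matches.append((word, *hit))
    return matches
```

Here `support_tables` is an ordered list of `(table_name, {name: code})` pairs, so the order of the list fixes which table wins when a name appears in more than one of them.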
CN201610029411.2A 2016-01-15 2016-01-15 Data sorting treatment method based on financial information Pending CN105786961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610029411.2A CN105786961A (en) 2016-01-15 2016-01-15 Data sorting treatment method based on financial information

Publications (1)

Publication Number Publication Date
CN105786961A true CN105786961A (en) 2016-07-20

Family

ID=56402498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610029411.2A Pending CN105786961A (en) 2016-01-15 2016-01-15 Data sorting treatment method based on financial information

Country Status (1)

Country Link
CN (1) CN105786961A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951533A (en) * 2017-03-21 2017-07-14 为朔医学数据科技(北京)有限公司 A kind of Research on Genetic Variation date storage method and device
CN107273534A (en) * 2017-06-29 2017-10-20 武汉楚鼎信息技术有限公司 A kind of data processing method extracted based on information content, system
CN107292744A (en) * 2017-06-07 2017-10-24 前海梧桐(深圳)数据有限公司 Investment Trend analysis method and its system based on machine learning
CN107818510A (en) * 2017-11-02 2018-03-20 长江证券股份有限公司 A kind of distributed processing system(DPS) and method for investment decision auxiliary
CN109684472A (en) * 2018-12-20 2019-04-26 深圳价值在线信息科技股份有限公司 A kind of trade classification method and system of security information
CN110287218A (en) * 2019-06-26 2019-09-27 浙江诺诺网络科技有限公司 A kind of matched method of tax revenue sorting code number, system and equipment
CN111125596A (en) * 2019-12-11 2020-05-08 海南港澳资讯产业股份有限公司 Short message forming method based on financial information
CN112115263A (en) * 2020-09-08 2020-12-22 浙江嘉兴数字城市实验室有限公司 NLP-based social management big data monitoring and early warning method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web page classification method based on the keyword frequency analysis
CN103577587A (en) * 2013-11-08 2014-02-12 南京绿色科技研究院有限公司 News theme classification method
CN104063513A (en) * 2011-09-29 2014-09-24 北京奇虎科技有限公司 Intelligent vertical search method and system
CN104679875A (en) * 2015-03-10 2015-06-03 杭州凡闻科技有限公司 Method for classifying information data based on digital newspaper

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720
