CN107066599A - Method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning - Google Patents

Method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning

Info

Publication number
CN107066599A
Authority
CN
China
Prior art keywords
data
information
company
enterprise
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710259506.8A
Other languages
Chinese (zh)
Other versions
CN107066599B (en)
Inventor
郑锦光
张梦迪
丁海星
曹辉
鲍捷
马新磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wen Jie Internet Technology Co Ltd
Original Assignee
Beijing Wen Jie Internet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wen Jie Internet Technology Co Ltd
Priority to CN201710259506.8A
Publication of CN107066599A
Application granted
Publication of CN107066599B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases
    • G06F16/285 - Clustering or classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2228 - Indexing structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24564 - Applying rules; Deductive queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases
    • G06F16/288 - Entity relationship models

Abstract

The invention discloses a method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning. The method comprises the following steps: obtaining company information, parsing and storing data, integrating and analyzing data, and building a company entity knowledge base. The system comprises a listed-company information acquisition module, an information extraction and structuring module, a keyword-optimized retrieval module, and a similarity-matrix processing and knowledge-base construction module. The invention addresses the technical problems that traditional classification schemes have incomplete coverage and that traditional benchmark-company retrieval systems have incomplete classification and low retrieval efficiency.

Description

Method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning
Technical field
The present invention relates to an information analysis and retrieval technique in the field of financial investment.
Background
In the field of financial investment, investors need to carry out detailed business model analysis, financial analysis, and reasonable valuation of a target company. Research on a target company often requires the operating data of competitors in the same industry or field as supporting reference, so that a suitable valuation model can be used to model or predict the company's expected operating data and identify potential investment targets. Conventionally, companies in the same field or industry are found mainly through existing industry classification models: investment-oriented classification systems such as the Global Industry Classification Standard (GICS), Russell Global Sectors (RGS), and the Industry Classification Benchmark (ICB), and administrative government classification systems such as the national economic industry classification and the industry classification of listed companies. As emerging technologies continue to advance, listed companies that blend multiple fields and industries keep appearing, and traditional classification schemes struggle to fully cover companies in new technology fields.
Information retrieval is the activity of obtaining, from a collection of information resources, the resources relevant to an information need. Retrieval can be based on full text or on other content-based indexes. Web search engines are the most common information retrieval application. During retrieval, each query identifies and ranks information resource objects, and the degree of association and the ranking information between different objects are organized and stored. An information object is usually a collection of content or entity data stored in a database: content is extracted from the original information resources, and the valid entities and the associations between them are organized to serve as the direct processing objects of retrieval. A mature search engine system typically scores and then ranks the entity objects stored in the system according to how well they match each query. Each query result shows the top-ranked entities for that query together with their associated entities. Traditional benchmark-company retrieval systems suffer from incomplete classification and low retrieval efficiency.
Similarity extraction is a content extraction approach that, starting from the characteristics of a document's content, retrieves documents similar or related to it. By evaluating document relevance over the constructed entity database and establishing a similarity ranking between entities, retrieval speed can be effectively improved and useful information returned. Commonly used relevance measures include the vector space model, the probabilistic model, and the inference network model. The vector space model represents each document as a keyword-based vector and ranks documents by similarity by comparing the distances between document vectors. The probabilistic model computes the probability of relevance between query keywords and documents; using different prior and posterior empirical probabilities for the domain and a Bayesian model, it derives the degree of association between keywords and documents and ranks the documents by similarity. The inference network model is a similarity search model with knowledge inference capability; based on different calculation strategies, it can provide the degree of association between a query and documents as well as the similarity ranking between documents. Specific calculation strategies include vector space and keyword weight probability.
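For illustration only, the vector space model mentioned above can be sketched as follows. The sketch assumes whitespace-tokenized text and raw term counts as vector components; the function names `cosine_similarity` and `rank_by_similarity` are hypothetical and do not come from the patent.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two bag-of-words term-count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (document name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, text)) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A smaller angle between two vectors (a cosine closer to 1) is read as a higher document similarity; a production system would typically weight the terms rather than use raw counts.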
Because traditional classification methods have the problems described above, an automatic similar-company retrieval and classification system that combines information retrieval technology with search ranking algorithms is called for.
Summary of the invention
The object of the invention is to provide a method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning, so as to solve technical problems such as the incomplete coverage of traditional classification schemes and the incomplete classification and low retrieval efficiency of traditional benchmark-company retrieval systems.
To achieve the above object, the method of the present invention for retrieving and classifying similar listed companies based on knowledge-base reasoning comprises the following steps:
1) Obtain company information: collect data on all listed companies, including prospectuses, annual reports, important announcements, financial reports, industry research reports, patent information, litigation information, bidding and tendering information, and important company news;
2) Parse and store data: the crawled data is parsed into a suitable format by a parser and stored in a database; the parser contains a type analyzer and a format analyzer, which handle the complex data types and formats and parse them into a unified format;
3) Integrate and analyze data: apply data deduplication, structured content extraction, and information classification to the existing data; for each specific company, build a company data profile that describes the company by category from the perspectives of main business composition, shareholding and holding-company relationships, and financial indicators;
4) Build the company entity knowledge base: using Chinese word segmentation, part-of-speech tagging, recognition and annotation, and rule matching, analyze the structure of the company information at paragraph and sentence level and extract entities and relations; then, using a word vector model and the steps of inverted indexing, keyword optimization, similarity ranking, and entity relation matching, build the company entity knowledge base;
5) Return the benchmark-company information related to the target company according to the search keyword.
In the parsing and storing step, the acquired listed-company operating data is parsed and extracted according to its type. The acquired data is submitted uniformly to the type parser; for data of different format types, the parser contains a corresponding data-type interface module that recognizes and parses the data. The format analyzer then analyzes the different data formats and converts the various company data into a unified format. After parsing, the data must be stored in a database.
In the integration and analysis step, the data in the unified format must be cleaned further. Deduplication comes first: because a company has a large amount of descriptive data, financial data, and news data, the existing data must be cleaned and checked after the first layer of format analysis so that duplicates are removed. The deduplicated data still contains redundant material such as useless tags and tables, so rule-based recognition is applied to the cleaned data to extract the useful data. Finally, according to the company's situation, the data is classified into categories that mainly include financial model, peer comparison, product structure, sales model, and customers and markets.
Building the company entity knowledge base comprises, first, building a full-text index over the data: distributed search engine technology is used to build a full-text index over the structured data, full-text retrieval is performed on the relevant documents, the text data is converted into space vectors, and a vector model is used to score the relevance of the text.
Second, data blocks are extracted according to the query keyword information: the distributed search engine is used to search the database and extract the relevant company data, which forms a data block, so that retrieval is optimized.
Third, a query keyword library is maintained: for a specific query keyword, the data blocks associated with it are organized into a search cache space, improving the efficiency of search queries.
Fourth, similarity is computed over the data blocks. Company information is first modeled from the data blocks, the text data is converted into vectors using a word vector model, and similarity is computed on the resulting vector matrix. The computation goes through several layers of processing: the company information is vectorized with a keyword-entity vector model, an index over the company information is built with inverted indexing, the search keywords are optimized, company similarity is optimized with similarity computation, entity relation matching is completed, and a similarity matrix is generated.
Fifth, the retrieval results whose similarity exceeds a threshold are returned according to the similarity matrix.
A system for retrieving and classifying similar listed companies based on knowledge-base reasoning comprises:
a listed-company information acquisition module, which acquires and organizes the various commonly used company information data;
a data parsing and format analysis module, which parses the crawled data into a unified format; this requires analyzing the type and format of the data, applying different parsing algorithms to different data types and formats, parsing them into a unified format, and finally storing the data in a suitable database;
an information extraction and structuring module, which performs further integration analysis on the data in the unified format, including data deduplication, information extraction, and information classification algorithms;
a keyword-optimized retrieval module, which builds a full-text index over the integrated data based on a distributed search engine, retrieves the relevant data for a query keyword to form data blocks, and scores the relevance of the data, improving retrieval efficiency;
a similarity-matrix processing and knowledge-base construction module, which computes the similarity of company data from the data blocks; it first models the data based on company information, then converts the data into vector form using a word vector model, builds the company entity knowledge base through inverted indexing, keyword optimization, similarity ranking, and entity relation matching, and performs inference matching on the input search keywords.
Advantages of the present invention:
The invention uses an inference network model based on knowledge-base reasoning. It ranks associated companies by similarity by combining the product structure, main business and services, competitors, competitive situation, and business-cycle sensitivity of the target company and related companies with statistical measures of association, in order to find benchmark companies and to provide a data basis for investment analyses such as global industry chains and upstream and downstream associations. To address problems such as the incomplete classification and low retrieval efficiency of existing benchmark-company retrieval systems, the invention proposes a method and system for retrieving and classifying similar listed companies based on knowledge-base reasoning. The invention offers complete classification coverage, well-developed classification in the benchmark-company retrieval system, and high retrieval efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the method for retrieving similar listed companies in the example.
Fig. 2 is a flow chart of parsing and storing data and of integrating and analyzing data in the example.
Fig. 3 is a flow chart of generating the company similarity matrix.
Fig. 4 is a flow chart of the system for retrieving similar listed companies in the example.
Detailed description
The present invention is described in detail below with reference to examples. The described examples are only some of the embodiments of the application. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and do not limit the application. All other examples obtained by those skilled in the art on the basis of the examples in this application fall within the scope of protection of the application.
Fig. 1 is the overall flow chart of the method and describes the operating flow of the similar-company retrieval method.
101-104 Listed-company data sources. According to the data processing needs of the invention, information that listed companies release through public channels is collected. This includes industry research reports, company announcements, financial reports, relevant important news, prospectuses, annual reports, major announcements, litigation information, patent information, and other content that reflects changes in a company's day-to-day operations;
105 Listed-company information acquisition. For the data sources above, a corresponding acquisition method is determined. Data such as industry research reports and company announcements are usually PDF documents in text form, so the specific documents must be updated and stored. Data such as financial reports are structured, labeled numeric data, so they must be updated in batches according to how the numeric data is obtained, and the identical report-structure fields of the same company must be associated and modeled. Litigation information, patent information, and the like are published on web pages, so the page structure must be recognized effectively and the useful data extracted and analyzed.
106 Parsing and storing data. The listed-company operating data obtained in 105 is parsed and extracted according to its type. The acquired data is submitted uniformly to the type parser. For data of different format types, such as text data (PDF, Word), structured data (JSON, XML), and web data (HTML), the parser contains a corresponding data-type interface module that recognizes and parses the data. The format analyzer then analyzes the different formats and converts the various company data into a unified format. After parsing, the data must be stored in a database;
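The type-dispatch idea in step 106 can be sketched as follows, as a minimal illustration rather than the patented implementation. The unified record layout (`company`, `source_type`, `fields`) and all function names are assumptions made for this example. Only JSON, XML, and HTML are covered, because the Python standard library has no PDF or Word parser.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and drops the markup tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_json(raw: str) -> dict:
    return json.loads(raw)


def parse_xml(raw: str) -> dict:
    # Flatten the direct children of the root element into tag -> text.
    root = ET.fromstring(raw)
    return {child.tag: (child.text or "").strip() for child in root}


def parse_html(raw: str) -> dict:
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return {"text": " ".join(extractor.chunks)}


# Hypothetical registry: one parsing interface per data type.
PARSERS = {"json": parse_json, "xml": parse_xml, "html": parse_html}


def to_unified(source_type: str, raw: str, company: str) -> dict:
    """Dispatch on the data type and wrap the result in one record layout."""
    parser = PARSERS.get(source_type)
    if parser is None:
        raise ValueError(f"no parser registered for type {source_type!r}")
    return {"company": company, "source_type": source_type, "fields": parser(raw)}
```

Supporting a further format in this sketch means registering one more function in `PARSERS`; the downstream storage code only ever sees the unified record.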
107 Integrating and analyzing data. The data in the unified format must be cleaned further. Deduplication comes first: because a company has a large amount of descriptive data, financial data, and news data, the existing data usually must be cleaned and checked after the first layer of format analysis so that duplicates are removed, which improves data validity and reduces the load on the storage system. The deduplicated data still contains redundant material such as useless tags and tables, so recognition algorithms are applied to the cleaned data to extract the useful data. Finally the data is classified into categories including financial model, peer comparison, product structure, sales model, and customers and markets;
108 Building a full-text index over the data. To improve the retrieval speed over the processed data, distributed search engine technology is used to build a full-text index over the structured data, full-text retrieval is performed on the relevant documents, the text data is converted into space vectors, and a vector model is used to score the relevance of the text;
109 Extracting data blocks according to the query keyword information. The distributed search engine is used to search the database and extract the relevant company data, for example the search keyword, the company data documents, and the keyword position information, which together form a data block, so that retrieval is optimized.
110 Determining the query keyword library. For a specific query keyword, the data blocks associated with it are organized into a search cache space, improving the efficiency of search queries.
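A search cache of the kind described in step 110 might look like the sketch below. It assumes an in-memory dictionary keyed by the normalized keyword; the class name `KeywordCache` and the `search_fn` callback, which stands in for the distributed search engine, are hypothetical.

```python
class KeywordCache:
    """Caches the data blocks returned for each query keyword."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # stand-in for the real search backend
        self._cache = {}
        self.misses = 0

    def get(self, keyword: str):
        # Normalizing the key is an assumption, so that "Solar" and
        # " solar " share one cache entry.
        key = keyword.strip().lower()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._search_fn(key)
        return self._cache[key]
```

Repeated queries for the same keyword then reuse the stored data blocks instead of searching the database again. A real deployment would also need an eviction and refresh policy, which the source does not describe.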
111 Computing similarity over the data blocks. Company information is first modeled from the data blocks, the text data is converted into vectors using a word vector model, and similarity is computed on the resulting vector matrix. The computation goes through several layers of processing (the keyword-entity vector model, inverted indexing, keyword optimization, similarity ranking, and entity relation matching) and generates a similarity matrix;
112 Returning the retrieval results whose similarity exceeds a threshold, according to the similarity matrix.
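Step 112 can be illustrated with a short sketch. It assumes the similarity matrix is a square list of lists whose row and column order matches a list of company names; the function name and the strict "greater than" comparison are choices made for this example.

```python
def similar_above_threshold(matrix, names, target, threshold):
    """Return (company, similarity) pairs for one target, best match first."""
    i = names.index(target)
    hits = [
        (names[j], score)
        for j, score in enumerate(matrix[i])
        if j != i and score > threshold  # skip the target itself
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

For example, with names `["A", "B", "C"]` and a row `[1.0, 0.8, 0.2]` for company A, a threshold of 0.5 keeps only company B.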
Fig. 2 describes the flow of parsing and storing data and of integrating and analyzing data in the method of the invention.
201 Acquired listed-company data. Following 101-105, the required listed-company operating data is obtained, including industry research reports, company announcements, financial reports, relevant important news, prospectuses, annual reports, major announcements, litigation information, and patent information.
202 Type analysis. The company information obtained above is analyzed by type. For document-type data (such as PDF and Word) like industry research reports and listed-company announcements, the valid content, including useful text, pictures, and tables, is extracted according to the structural features of the file. For numeric structured data such as financial reports, the original data is reprocessed according to its specific structural features and the original structure is reorganized to generate new structured data that the system of this patent can recognize and process. For web-page data such as litigation and patent information, the tags and header content are analyzed according to the specific page structure, the useful data is extracted, and the data is reorganized into structured form.
203 Format analysis. The original listed-company data of the different types from 202 is given the corresponding format structuring treatment. For text data such as PDF, useful information such as text content and charts is extracted and formatted to generate content with a unified structure. Structured content such as JSON data, for example company financial data, product information, and main business information, is reformatted. For web-page data such as company news and litigation information, the format analyzer extracts the valid data from the page in a uniform way, discards useless format tags, and filters out the useful data.
204 Data storage. The formatted, structured data is stored in a database, and a knowledge base of associated company information is built to improve data access efficiency.
205 Data deduplication. The existing formatted data is deduplicated and cleaned again: a hashing technique is used to compute a digest of the data, and duplicate data is removed, improving how efficiently the listed-company data is used.
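A minimal sketch of digest-based deduplication as in step 205 is shown below. The patent does not name a hash function; SHA-256 is used here as an assumption, and the whitespace and case normalization before hashing is likewise an illustrative choice.

```python
import hashlib


def digest(text: str) -> str:
    """Digest of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first record for each digest, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = digest(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Comparing fixed-length digests avoids comparing full documents pairwise. Note that exact hashing only catches records that are identical after normalization, not near-duplicates.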
206 Information extraction. For the structured data in the listed-company information knowledge base, the relevant content is extracted according to different needs, such as company description, product structure, main business composition, financial statements, senior executive information, and patent information.
207 Information classification. The content extracted in 206 is classified and associated with the original company information. For each specific company, a corresponding company data profile is generated that describes the company by category from multiple perspectives.
Fig. 3 describes the flow of computing the company similarity matrix in the method of the invention.
301 Following the flow described in Fig. 2, the original company data that was acquired is structured and extracted to obtain a simplified, classified company data profile;
302 Company information entity extraction. Using Chinese word segmentation, part-of-speech tagging, recognition and annotation, rule matching, and similar techniques together, the structure of the company information above is analyzed at paragraph and sentence level, and the entities and relations in it are extracted.
303 Word vector model. The company entity information obtained in 302 is converted into a text vector matrix using a word vector model. The dimension of each vector is the number of entities in the text; the vast majority of components are 0 and certain dimensions are 1. Converting text into numeric vectors in this way makes the subsequent series of computations possible;
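The 0/1 vectors described in step 303 can be sketched as follows. The sketch assumes the entity vocabulary is the sorted union of all entities extracted across companies; the function names and the sample entities are hypothetical.

```python
def build_vocabulary(company_entities: dict[str, set[str]]) -> list[str]:
    """Sorted union of all entities; its length is the vector dimension."""
    return sorted(set().union(*company_entities.values()))


def to_binary_vector(entities: set[str], vocabulary: list[str]) -> list[int]:
    """1 where the company mentions the entity, 0 elsewhere."""
    return [1 if entity in entities else 0 for entity in vocabulary]
```

With companies `{"A": {"solar", "battery"}, "B": {"battery", "retail"}}` the vocabulary is `["battery", "retail", "solar"]`, so A maps to `[1, 0, 1]` and B to `[1, 1, 0]`. With a realistic vocabulary most components are 0, as the text notes, so a sparse representation would normally be preferred.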
304 Company entity knowledge base. The extracted company entity information goes through the series of deeper processing steps 305-308 to build a company entity association knowledge base with inference capability. This provides the inference data chain required by an automatic benchmark-company recognition and classification system covering all listed companies on the Shenzhen and Shanghai stock markets and the New Third Board.
305 Inverted index. For the constructed company entity information, an inverted index structure is built in combination with the corresponding search keywords, improving how well retrieval matches the results. The inverted keyword-entity index can be viewed as an array of linked lists: the head of each list contains a keyword, and the subsequent nodes contain all entity vector models that include that keyword, along with some other information. This information can be the frequency of the word in the entity vector or the position of the word in the entity vector.
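The keyword-entity inverted index of step 305 can be sketched as below. A dictionary of Python lists stands in for the array of linked lists, and each posting stores the word frequency and positions mentioned in the text. The posting field names (`entity`, `tf`, `positions`) are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(company_keywords: dict[str, list[str]]) -> dict:
    """Map each keyword to the postings of the entities that contain it."""
    index = defaultdict(list)
    for company, keywords in company_keywords.items():
        # Record every position at which each keyword occurs.
        positions = defaultdict(list)
        for pos, keyword in enumerate(keywords):
            positions[keyword].append(pos)
        for keyword, pos_list in positions.items():
            index[keyword].append(
                {"entity": company, "tf": len(pos_list), "positions": pos_list}
            )
    return dict(index)
```

Looking up a keyword then returns its posting list directly, without scanning every company's data.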
306 Keyword optimization. Based on the keyword-entity inverted index model built in 305, the occurrence counts and weights of keywords are optimized for each retrieval result. The more often a keyword occurs in a given entity vector, the more important the word is considered to be. If a keyword appears in more entity vectors, its ability to distinguish between entities is lower, and its importance should be reduced accordingly. The higher the dimension of a company's entity vector model, the more often a given keyword is likely to occur in it and the less each keyword distinguishes that entity vector, so these keywords should be down-weighted accordingly.
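The three heuristics in step 306 correspond closely to term frequency, inverse document frequency, and length normalization. The patent gives no formula, so the TF-IDF-style weight below is only one possible reading; the function name and the specific logarithmic form are assumptions.

```python
import math


def keyword_weight(tf: int, doc_len: int, n_docs: int, doc_freq: int) -> float:
    """TF-IDF-style weight with length normalization (illustrative formula).

    tf       -- occurrences of the keyword in this entity vector
    doc_len  -- total keyword occurrences in this entity vector
    n_docs   -- number of entity vectors in the collection
    doc_freq -- number of entity vectors containing the keyword
    """
    if tf == 0 or doc_len == 0 or doc_freq == 0:
        return 0.0
    idf = math.log(1 + n_docs / doc_freq)  # rarer keywords score higher
    return (tf / doc_len) * idf  # longer vectors are down-weighted
```

The weight rises with the keyword's frequency in an entity, falls as the keyword appears in more entities, and falls as the entity vector grows, matching the three rules stated above.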
307 Similarity ranking. By continuously optimizing and correcting the weights of keywords in the different company entity vector models, the company entity vectors associated with the same keyword are scored and ranked, and a ranking map of the company entities corresponding to each keyword is built. This identifies, within the same industry and field, the benchmark companies that are most strongly associated on indicators such as products, business, and revenue.
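The per-keyword ranking in step 307 can be sketched as follows, assuming the keyword weights have already been computed and are held as a nested dictionary of entity to keyword to weight. That layout and the name `rank_entities` are hypothetical.

```python
def rank_entities(keyword: str, weights: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank the entities that contain the keyword, highest weight first."""
    scored = [
        (entity, keyword_weights[keyword])
        for entity, keyword_weights in weights.items()
        if keyword in keyword_weights
    ]
    # Ties are broken by entity name so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Running this for every keyword yields one ranked list per keyword, which is a simple stand-in for the ranking map described above.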
308 Entity relation matching. On the basis of the benchmark-company ranking map built in 307, the associated company entities are classified and matched according to different search keywords and company model indicators, such as product structure, main business market, and the region and cycle of the industry. Different relation matching is performed, and a company entity relation knowledge base is built that supports inference-based retrieval for different keywords.
309 Inference model. The processing flow 301-308 above produces a complete company entity knowledge base. For different search keywords, the knowledge base can infer the similar-company matching results for the target company on its own, and can infer the benchmark companies for a particular subdivided scenario according to the category of interest, which is of great help for industry and company analysis.
310 Benchmark matching results. The benchmark matching results are output according to the search keyword.
Fig. 4 is a flow chart of the system implemented according to the method of the invention and describes the overall operating flow of the similar-company retrieval system.
401 Listed-company information acquisition module. Acquires and organizes the various commonly used company information data;
402 Data parsing and format analysis module. Parses the crawled data into a unified format. This requires analyzing the type and format of the data, applying different parsing algorithms to different data types and formats, parsing them into a unified format, and finally storing the data in a suitable database;
403 Information extraction and structuring module. Performs further integration analysis on the data in the unified format, including algorithms for data deduplication, information extraction, and information classification;
404 Keyword-optimized retrieval module. Builds a full-text index over the integrated data based on a distributed search engine and retrieves the relevant data for a query keyword to form data blocks. This involves the vector space model, the BM25 algorithm, and similar techniques, which score the relevance of the data and improve retrieval efficiency;
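Because module 404 names BM25, a compact sketch of BM25 scoring is given below for illustration. It assumes pre-tokenized documents, the commonly used defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log((N - df + 0.5) / (df + 0.5) + 1)`. None of these parameter choices comes from the patent.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document against the tokenized query."""
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(doc) for doc in docs) / n or 1.0  # avoid division by zero
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Unlike raw term counts, BM25 saturates the contribution of repeated terms and normalizes for document length, which is why it is a common default for relevance scoring in full-text search engines.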
405 Similarity-matrix processing and knowledge-base construction module. Computes the similarity of company data from the data blocks. It first models the data based on company information, then converts the data into vector form using a word vector model, builds the company entity knowledge base through processes such as inverted indexing, keyword optimization, similarity ranking, and entity relation matching, and performs inference matching on the input search keywords.

Claims (9)

1. A method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies, comprising the following steps:
1) Obtaining company information: data are collected for all listed companies, including prospectuses, annual reports, important announcements, financial reports, industry research reports, patent information, litigation information, bidding and tendering information, and important company news;
2) Parsing and storing the data: a parser resolves the crawled data into a suitable format and stores it in a database; the parser comprises a type analyzer and a format analyzer, which handle complex data types and formats and parse them into a unified format;
3) Integrating and analyzing the data: data deduplication, structured content extraction, and information classification are performed on the existing data; for each specific enterprise, an enterprise data profile is built, and the company is described by category from the perspectives of main business composition, equity-participation and holding company relationships, and financial indicators;
4) Building the enterprise entity knowledge base: using Chinese word segmentation, part-of-speech tagging, recognition tagging, and rule matching techniques, the company information is analyzed structurally at the paragraph and sentence level, and entities and relations are extracted; then, by means of a word vector model and the steps of inverted indexing, keyword optimization, similarity ranking, and entity relationship matching, the enterprise entity knowledge base is built;
5) Returning, according to the search keyword, the benchmark company information related to the target enterprise.
2. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein parsing and storing the data comprises: parsing and extracting the acquired listed-company operating data according to its type; submitting the acquired data uniformly to the type parser, which contains a data type interface module for each format type and identifies and parses the corresponding data; then analyzing the different data formats with the format analyzer and converting the various company data into a unified format; and storing the parsed data in the database.
3. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein integrating and analyzing the data comprises further cleaning the data once it is in the unified format: the data are first deduplicated, since after the first-layer format parsing the large volume of descriptive data, financial data, and news data held for each company must still be checked and cleaned to remove duplicates; the deduplicated data still contain a large amount of redundant material such as useless tags and tables, so rule-based recognition techniques are then applied to the cleaned data to extract the useful data; finally, according to the company's circumstances, the data are classified into categories that mainly include financial model, peer-company comparison, product structure, sales model, and customers and markets.
4. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein building the enterprise entity knowledge base comprises building a full-text index over the data: distributed search engine technology is used to build a full-text index over the structured data so that related documents can be searched by full-text keyword; the text data are converted into space vectors, and the vector model is used to score the relevance of the text.
5. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein building the enterprise entity knowledge base comprises extracting data blocks according to the query keyword information: the database is searched with the distributed search engine, the data of related companies are extracted to form a data block, and the retrieval is optimized.
6. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein building the enterprise entity knowledge base comprises a query keyword library: for a specific query keyword, the data blocks associated with it are organized into a search cache space, improving the efficiency of search queries.
7. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein building the enterprise entity knowledge base comprises computing similarity over the data blocks: the data blocks are first modeled as company information, the text data are converted into vectors using a word vector model, and similarity is computed on the resulting vector matrix; the computation proceeds in multiple stages: the company information is vectorized with a keyword-entity vector model, an index over the company information is built with inverted indexing, the search keywords are optimized, enterprise similarity is optimized with similarity computation techniques, entity relationship matching is completed, and the similarity matrix is generated.
8. The method, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies according to claim 1, wherein building the enterprise entity knowledge base comprises returning, according to the similarity matrix, the retrieval results whose similarity exceeds a threshold.
9. A system, based on knowledge base reasoning, for retrieving and classifying enterprises similar to listed companies, characterized by comprising:
a company information acquisition module, which collects and organizes the various commonly used kinds of company information data;
a data parsing and format analysis module, which parses the crawled data into a unified format, analyzing the data types and formats, applying a different parsing algorithm to each type and format to bring the data into the unified format, and finally storing the data in an appropriate database;
an information extraction and structuring module, which performs further integration analysis on the data in unified format, using algorithms for data deduplication, information extraction, and information classification;
a keyword-optimized retrieval module, which builds a full-text index over the integrated data based on a distributed search engine, forms a data block from the relevant data retrieved for a query keyword, and scores the relevance of the data to improve retrieval efficiency;
a similarity matrix processing and knowledge base construction module, which computes the similarity of company data from the data blocks, first modeling the data on the basis of company information, then converting the data into vector form using a word vector model, and, through inverted indexing, keyword optimization, similarity ranking, and entity relationship matching, building the enterprise entity knowledge base and performing inference matching on the input search keywords.
CN201710259506.8A 2017-04-20 2017-04-20 Similar listed company enterprise retrieval classification method and system based on knowledge base reasoning Active CN107066599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710259506.8A CN107066599B (en) 2017-04-20 2017-04-20 Similar listed company enterprise retrieval classification method and system based on knowledge base reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710259506.8A CN107066599B (en) 2017-04-20 2017-04-20 Similar listed company enterprise retrieval classification method and system based on knowledge base reasoning

Publications (2)

Publication Number Publication Date
CN107066599A true CN107066599A (en) 2017-08-18
CN107066599B CN107066599B (en) 2021-11-30

Family

ID=59599954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710259506.8A Active CN107066599B (en) 2017-04-20 2017-04-20 Similar listed company enterprise retrieval classification method and system based on knowledge base reasoning

Country Status (1)

Country Link
CN (1) CN107066599B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096845A (en) * 2009-12-10 2011-06-15 黑龙江省森林工程与环境研究所 Knowledge base full text search engine system for classified forest management
CN104008107A (en) * 2013-02-25 2014-08-27 成都勤智数码科技股份有限公司 Implement method of knowledge base on operation and maintenance management
CN104834668A (en) * 2015-03-13 2015-08-12 浙江奇道网络科技有限公司 Position recommendation system based on knowledge base
CN106156104A (en) * 2015-04-02 2016-11-23 北京奇虎科技有限公司 Crawl the method and device of corporate intranet information
CN104834736A (en) * 2015-05-19 2015-08-12 深圳证券信息有限公司 Method and device for establishing index database and retrieval method, device and system
CN106126695A (en) * 2016-06-30 2016-11-16 张春生 A kind of similar case search method and device
CN106296312A (en) * 2016-08-30 2017-01-04 江苏名通信息科技有限公司 Online education resource recommendation system based on social media

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
赵民等: "《基于流程的知识工程与创新》", 31 January 2016 *
鲍捷: "智能金融的核心引擎_一览与前瞻", 《软件和集成技术》 *
鲍捷: "知识图谱如何助力实现智能金融", 《金卡工程》 *

Cited By (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818130A (en) * 2017-09-15 2018-03-20 深圳市电陶思创科技有限公司 The method for building up and system of a kind of search engine
CN111183421B (en) * 2017-10-06 2023-11-28 株式会社东芝 Service providing system, business analysis supporting system, method and recording medium
CN111183421A (en) * 2017-10-06 2020-05-19 株式会社东芝 Service providing system, business analysis support system, method, and program
CN107844960A (en) * 2017-11-22 2018-03-27 辅投帮(武汉)科技有限公司 A kind of investment analysis tools of automatic intelligent analysis report of business plan
CN107844960B (en) * 2017-11-22 2020-12-01 辅投帮(武汉)科技有限公司 Investment analysis tool for automatically and intelligently analyzing business plan
CN108073692B (en) * 2017-12-06 2021-09-21 国云科技股份有限公司 Method for implementing enterprise ranking system
CN110020660B (en) * 2017-12-06 2023-05-09 埃森哲环球解决方案有限公司 Integrity assessment of unstructured processes using Artificial Intelligence (AI) techniques
US11574204B2 (en) 2017-12-06 2023-02-07 Accenture Global Solutions Limited Integrity evaluation of unstructured processes using artificial intelligence (AI) techniques
CN110020660A (en) * 2017-12-06 2019-07-16 埃森哲环球解决方案有限公司 Use the integrity assessment of the unstructured process of artificial intelligence (AI) technology
CN108073692A (en) * 2017-12-06 2018-05-25 国云科技股份有限公司 A kind of enterprise's ranking system and its implementation
CN108596439A (en) * 2018-03-29 2018-09-28 北京中兴通网络科技股份有限公司 A kind of the business risk prediction technique and system of knowledge based collection of illustrative plates
CN108563783A (en) * 2018-04-25 2018-09-21 张艳 A kind of financial analysis management system and method based on big data
CN110427547A (en) * 2018-04-26 2019-11-08 观相科技(上海)有限公司 A kind of search system and searching method based on industrial characteristic
CN108615124A (en) * 2018-05-11 2018-10-02 北京窝头网络科技有限公司 Valuation of enterprise method and system based on word frequency analysis
CN108615124B (en) * 2018-05-11 2022-02-01 北京窝头网络科技有限公司 Enterprise evaluation method and system based on word frequency analysis
CN109145081A (en) * 2018-07-27 2019-01-04 安康市惠企财税服务有限公司 A kind of financial data search method and system
CN109241046A (en) * 2018-08-30 2019-01-18 天津做票君机器人科技有限公司 A kind of inventory information recognition methods of negotiation by draft robot and identifier
CN109359817A (en) * 2018-09-13 2019-02-19 江苏站企动网络科技有限公司 A kind of business information analysis management system
CN109376273B (en) * 2018-09-21 2024-02-27 平安科技(深圳)有限公司 Enterprise information map construction method, enterprise information map construction device, computer equipment and storage medium
CN109376273A (en) * 2018-09-21 2019-02-22 平安科技(深圳)有限公司 Company information map construction method, apparatus, computer equipment and storage medium
CN109558492A (en) * 2018-10-16 2019-04-02 中山大学 A kind of listed company's knowledge mapping construction method and device suitable for event attribution
CN109165337B (en) * 2018-10-17 2021-10-15 珠海市智图数研信息技术有限公司 Method and system for establishing bid and ask field association analysis based on knowledge graph
CN109165337A (en) * 2018-10-17 2019-01-08 珠海市智图数研信息技术有限公司 A kind of method and system of knowledge based map construction bidding field association analysis
CN109213867A (en) * 2018-10-26 2019-01-15 湖北大学 A kind of mass knowledge base construction method precisely predicted towards big data
CN111176650A (en) * 2018-11-09 2020-05-19 阿里巴巴集团控股有限公司 Parser generation method, search method, server, and storage medium
CN111176650B (en) * 2018-11-09 2023-04-18 阿里巴巴集团控股有限公司 Parser generation method, search method, server, and storage medium
CN109598705A (en) * 2018-11-19 2019-04-09 江苏科技大学 A kind of inspection procedure automatic generation method based on detection feature
CN109657066A (en) * 2018-11-19 2019-04-19 平安科技(深圳)有限公司 Knowledge mapping construction method, device and computer equipment based on multi-angle of view
CN109598705B (en) * 2018-11-19 2023-06-23 江苏科技大学 Automatic generation method of inspection procedure based on detection characteristics
CN109785144A (en) * 2019-01-18 2019-05-21 国家电网有限公司 A kind of assets classes method, apparatus, equipment and medium
CN110162590A (en) * 2019-02-22 2019-08-23 北京捷风数据技术有限公司 A kind of database displaying method and device thereof of calling for tenders of project text combination economic factor
CN110110044B (en) * 2019-04-11 2020-05-05 广州探迹科技有限公司 Method for enterprise information combination screening
CN110110044A (en) * 2019-04-11 2019-08-09 广州探迹科技有限公司 A kind of method of company information combined sorting
CN110532383A (en) * 2019-07-18 2019-12-03 中山大学 A kind of patent text classification method based on intensified learning
CN110737749A (en) * 2019-10-11 2020-01-31 软通动力信息技术有限公司 Entrepreneurship plan evaluation method, entrepreneurship plan evaluation device, computer equipment and storage medium
CN110737749B (en) * 2019-10-11 2022-09-27 软通智慧信息技术有限公司 Entrepreneurship plan evaluation method, entrepreneurship plan evaluation device, computer equipment and storage medium
CN110795425B (en) * 2019-10-31 2023-04-28 上海义缘网络科技有限公司 Customs data cleaning and merging method, device, equipment and medium
CN110795425A (en) * 2019-10-31 2020-02-14 上海义缘网络科技有限公司 Method, device, equipment and medium for cleaning and merging customs data
CN111125185A (en) * 2019-11-25 2020-05-08 泰康保险集团股份有限公司 Data processing method, device, medium and electronic equipment
CN110879829A (en) * 2019-11-26 2020-03-13 杭州皓智天诚信息科技有限公司 Intellectual property big data service intelligent system
CN111008265A (en) * 2019-12-03 2020-04-14 腾讯云计算(北京)有限责任公司 Enterprise information searching method and device
CN111008265B (en) * 2019-12-03 2023-03-28 腾讯云计算(北京)有限责任公司 Enterprise information searching method and device
CN111080132A (en) * 2019-12-18 2020-04-28 北京智识企业管理咨询有限公司 Industry chain analysis system and method based on big data
CN111177189B (en) * 2019-12-20 2024-04-05 北京航天云路有限公司 Client optimization system and method based on user behavior analysis
CN111177189A (en) * 2019-12-20 2020-05-19 航天云网科技发展有限责任公司 Client optimization system and method based on user behavior analysis
CN111737421A (en) * 2020-08-07 2020-10-02 杭州六棱镜知识产权科技有限公司 Intellectual property big data information retrieval system and storage medium
CN112115314A (en) * 2020-09-16 2020-12-22 江苏开拓信息与系统有限公司 General government affair big data aggregation retrieval system and construction method
CN112183090A (en) * 2020-10-09 2021-01-05 浪潮云信息技术股份公司 Method for calculating entity relevance based on word network
CN112182223A (en) * 2020-10-12 2021-01-05 浙江工业大学 Enterprise industry classification method and system based on domain ontology
CN112214572A (en) * 2020-10-20 2021-01-12 济南浪潮高新科技投资发展有限公司 Method for secondarily extracting entities in resume analysis
CN112214572B (en) * 2020-10-20 2022-11-01 山东浪潮科学研究院有限公司 Method for secondarily extracting entities in resume analysis
CN112507201A (en) * 2020-11-03 2021-03-16 国网浙江省电力有限公司台州供电公司 Search engine construction and search method based on NLP (non-line segment) retrieval analysis technology
CN112434158A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Enterprise label acquisition method and device, storage medium and computer equipment
CN112612937A (en) * 2020-12-07 2021-04-06 深圳价值在线信息科技股份有限公司 Associated information acquisition method and equipment
CN112434665A (en) * 2020-12-12 2021-03-02 广东电力信息科技有限公司 Method and device for intelligently identifying financial data in image based on machine learning
CN112650951A (en) * 2020-12-21 2021-04-13 撼地数智(重庆)科技有限公司 Enterprise similarity matching method, system and computing device
CN112734493A (en) * 2021-01-18 2021-04-30 科技谷(厦门)信息技术有限公司 Industry monitoring analysis platform
CN113742496A (en) * 2021-09-10 2021-12-03 国网江苏省电力有限公司电力科学研究院 Power knowledge learning system and method based on heterogeneous resource fusion
CN116578677A (en) * 2023-07-14 2023-08-11 高密市中医院 Retrieval system and method for medical examination information
CN116578677B (en) * 2023-07-14 2023-09-15 高密市中医院 Retrieval system and method for medical examination information
CN117057942A (en) * 2023-10-12 2023-11-14 之江实验室科技控股有限公司 Intelligent financial decision big data analysis system
CN117057942B (en) * 2023-10-12 2024-01-30 之江实验室科技控股有限公司 Intelligent financial decision big data analysis system

Also Published As

Publication number Publication date
CN107066599B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN107066599A (en) A kind of similar enterprise of the listed company searching classification method and system of knowledge based storehouse reasoning
Deng et al. A study of supervised term weighting scheme for sentiment analysis
US11663254B2 (en) System and engine for seeded clustering of news events
Noh et al. Keyword selection and processing strategy for applying text mining to patent analysis
Xie et al. A novel text mining approach for scholar information extraction from web content in Chinese
CN102609512A (en) System and method for heterogeneous information mining and visual analysis
Ahmadov et al. Towards a hybrid imputation approach using web tables
Loudcher et al. Combining OLAP and information networks for bibliographic data analysis: a survey
DE102012221251A1 (en) Semantic and contextual search of knowledge stores
CN106484813A (en) A kind of big data analysis system and method
CN111737421A (en) Intellectual property big data information retrieval system and storage medium
CN114880486A (en) Industry chain identification method and system based on NLP and knowledge graph
CN114254201A (en) Recommendation method for science and technology project review experts
De et al. An introduction to data mining in social networks
CA2956627A1 (en) System and engine for seeded clustering of news events
Bhardwaj et al. Review of text mining techniques
Dehghan et al. Mining shape of expertise: A novel approach based on convolutional neural network
Chen et al. Exploring technology opportunities and evolution of IoT-related logistics services with text mining
CN114896423A (en) Construction method and system of enterprise basic information knowledge graph
CN106909626A (en) Improved Decision Tree Algorithm realizes search engine optimization technology
Liao et al. Improving farm management optimization: Application of text data analysis and semantic networks
CN115953041A (en) Construction scheme and system of operator policy system
Panagopoulos et al. Scientometrics for success and influence in the microsoft academic graph
Nogales et al. Measuring vocabulary use in the Linked Data Cloud
CN113127650A (en) Technical map construction method and system based on map database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant