CN107357851A - Information processing method and system

Publication number
CN107357851A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710506158.XA
Other languages
Chinese (zh)
Other versions
CN107357851B (en)
Inventor
夏耘海
张斌德
王江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Application filed by Guoxin Youe Data Co Ltd
Priority to CN201710506158.XA
Publication of CN107357851A
Application granted
Publication of CN107357851B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information processing method, comprising: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which "three-new" enterprises belong; generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes; performing keyword matching between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises; crawling business-related documents of the second enterprises and calculating the similarity between the crawled business-related documents and the corpus; and determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises. The invention also discloses an information processing system. The invention can accurately screen out, from the preset enterprises, three-new enterprises that meet the requirements.

Description

An information processing method and system
Technical field
The present invention relates to an information processing method and system, and in particular to a method and system for identifying three-new enterprises.
Background
With the rapid development of China's economy, new enterprises and new economic activities keep emerging. Enterprises are the most important active agents in the social economy and play an important role in it. Collating and analysing enterprise information helps decision makers understand how enterprises are operating and discover potential business risks.
For example, three-new enterprises (enterprises with new industries, new business forms, or new business models) have emerged recently and have drawn the attention of the Party Central Committee and the State Council. Relevant personnel need to statistically monitor the scale, structure, and quality of the economic activity of such enterprises, so as to follow their development in real time and provide a reference for future decisions. The key to such statistical monitoring is knowing exactly which of the many enterprises under investigation are three-new enterprises. This requires accurate screening to pick out the three-new enterprises that meet the requirements. However, no scheme for accurately screening three-new enterprises currently exists.
Summary of the invention
The technical problem to be solved by the present invention is to provide a scheme that screens three-new enterprises accurately while saving time and effort.
One aspect of the present invention provides an information processing method for accurately and effectively screening three-new enterprises, the method comprising: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong; generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes; performing keyword matching between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises; crawling business-related documents of the second enterprises and calculating the similarity between the crawled business-related documents and the corpus; and determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
Optionally, the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software copyrights, trademarks, and patents.
Optionally, generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically comprises: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords based on the determined word frequencies using a preset algorithm, to generate the three-new enterprise keyword corpus.
Optionally, calculating the similarity between the crawled business-related documents and the corpus specifically comprises: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and calculating the similarity between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
Optionally, determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically comprises: if there is at least one class of industry code for which the similarity between the business-related document and the corresponding industry description document reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
Another embodiment of the present invention provides an information processing system, comprising: a first processing unit, configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong; a corpus generation unit, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes; a second processing unit, configured to perform keyword matching between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises; a similarity calculation unit, configured to crawl business-related documents of the second enterprises and calculate the similarity between the crawled business-related documents and the corpus; and a third processing unit, configured to determine the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
Optionally, the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software copyrights, trademarks, and patents.
Optionally, the corpus generation unit generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically comprises: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords based on the determined word frequencies using a preset algorithm, to generate the three-new enterprise keyword corpus.
Optionally, the similarity calculation unit calculating the similarity between the crawled business-related documents and the corpus specifically comprises: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and calculating the similarity between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
Optionally, the third processing unit determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically comprises: if there is at least one class of industry code for which the similarity between the business-related document and the corresponding industry description document reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
With the information processing method provided by the present invention, three-new enterprises are screened as follows. First, first enterprises whose industry classification codes match the preset industry classification codes, which characterize the industries to which three-new enterprises belong, are determined from the preset enterprises. Next, a three-new enterprise keyword corpus is generated based on the industry description documents corresponding to the industries characterized by the preset industry classification codes. Then, keyword matching is performed between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises. After that, business-related documents of the second enterprises are crawled and their similarity to the corpus is calculated. Finally, the second enterprises whose business-related documents reach the preset similarity are determined to be three-new enterprises. Through these three rounds of progressive screening, the screened enterprises are increasingly likely to be three-new enterprises, so that three-new enterprises can be screened out accurately and a reference is provided for their screening.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the information processing method of an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the information processing system of an embodiment of the present invention.
Detailed description
To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flowchart of the information processing method of an embodiment of the present invention. As shown in Fig. 1, the information processing method provided by the embodiment of the present invention includes:
S101: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong.
S102: generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes.
S103: performing keyword matching between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises.
S104: crawling business-related documents of the second enterprises, and calculating the similarity between the crawled business-related documents and the corpus.
S105: determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
In step S101, the preset industry classification codes, which characterize the industries to which three-new enterprises belong, may be extracted from the national economic industry classification based on how relevant official documents describe three-new enterprises. For example, the classification range of three-new enterprises may be derived from the statements about "three-new" activities in documents such as the "Outline of the 13th Five-Year Plan for National Economic and Social Development of the People's Republic of China", the "Notice of the State Council on Issuing 'Made in China 2025'", the "Guiding Opinions of the State Council on Actively Promoting the 'Internet Plus' Action", and the "Opinions of the State Council on Several Policies and Measures for Vigorously Promoting Mass Entrepreneurship and Innovation". The relevant industry classification codes are then selected from the national economic industry classification according to that classification range. In one example, the classification range derived from these documents may include modern agriculture, forestry, animal husbandry, and fishery; advanced manufacturing; new energy activities; energy-saving and environmental protection activities; internet and modern information technology services; new technology and mass entrepreneurship and innovation service activities; modern producer service activities; new lifestyle service activities; and modern integrated management activities. The preset industry classification codes obtained from these categories may include 278 groups.
Based on the preset industry classification codes so determined, the first enterprises whose industry classification codes match the preset industry classification codes are selected from the preset enterprises. The preset enterprises may be obtained through a designated interface from the requester who asks for three-new enterprises to be screened, or may be crawled by a web crawler according to designated keywords.
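The first-round screening in step S101 amounts to a set-membership filter on industry codes. The following is a minimal sketch under stated assumptions: the `Enterprise` record, the function name `select_first_enterprises`, and the sample codes are all hypothetical and are not taken from the patent, whose 278 preset code groups are not listed in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enterprise:
    """Hypothetical record for a preset enterprise."""
    name: str
    industry_code: str  # industry classification code registered for the enterprise


def select_first_enterprises(enterprises, preset_codes):
    """Keep the enterprises whose industry code is one of the preset codes."""
    codes = set(preset_codes)
    return [e for e in enterprises if e.industry_code in codes]


# Placeholder codes for illustration only, not the patent's preset code list.
preset_codes = {"6510", "3562"}
enterprises = [
    Enterprise("Alpha Software", "6510"),
    Enterprise("Beta Catering", "6210"),
    Enterprise("Gamma Robotics", "3562"),
]
first_enterprises = select_first_enterprises(enterprises, preset_codes)
```

Because the registered code may not reflect what an enterprise actually does, this filter is only a coarse first pass, which is why the method follows it with two further rounds.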
In step S102, generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes may specifically include:
Step 1: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words.
A preset word segmentation tool may be used to split the industry description document corresponding to each class of industry code into individual words; for example, the jieba library for Python may be used. The jieba library can split each industry description document into individual words according to custom rules.
Step 2: for each word obtained by splitting, determining the word frequency of the word.
For each word obtained in step 1, a word frequency counting tool may be used to count how often the word occurs in each industry description document, thereby obtaining the word frequency of each word. In addition, to reduce noise, words obtained in step 1 that contribute nothing in particular to screening three-new enterprises, or that carry no meaning, may be deleted, for example function words such as interjections, prepositions, and conjunctions, so as to improve the efficiency of keyword extraction in the subsequent step.
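Steps 1 and 2 can be sketched as follows. This is an illustrative sketch only: the regular-expression tokenizer is a placeholder that works on space-delimited English text, whereas the patent names the jieba library for Chinese text, and a jieba segmenter could be passed in through the hypothetical `tokenize` parameter instead. The stop-word list is likewise an assumed stand-in for the function words the text says may be deleted.

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for function words
# (interjections, prepositions, conjunctions) that add noise.
STOP_WORDS = frozenset({"the", "and", "of", "for", "in", "a", "to", "oh"})


def simple_tokenize(text):
    """Placeholder segmenter: lower-cased alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, tokenize=simple_tokenize, stop_words=STOP_WORDS):
    """Split a document into words, drop stop words, and count the rest."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens)


freqs = word_frequencies(
    "Cloud computing services and the design of cloud platforms"
)
```

Here `freqs` maps each remaining word to its count in the document, which is the word frequency used by the keyword extraction step.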
Step 3: extracting keywords based on the determined word frequencies using a preset algorithm, to generate the three-new enterprise keyword corpus.
In one example of the present invention, the TF-IDF method may be used to extract keywords based on the determined word frequencies and generate the three-new enterprise keyword corpus. The invention is not limited to this; other methods may also be used to extract keywords based on the determined word frequencies, for example mutual information, expected cross entropy, information gain, principal component analysis (PCA), or genetic algorithms.
In the present invention, the TF-IDF method is used to obtain the $t_i\text{-}idf$ value of each word in each document, and words whose $t_i\text{-}idf$ value exceeds a specific threshold are chosen as keywords. The $t_i\text{-}idf$ value of each word in each industry description document can be obtained with formula (1):
$$t_i\text{-}idf = f_i \times \log\left(\frac{N}{df_i}\right) \qquad (1)$$
Here, $f_i$ is the word frequency, the number of times the $i$-th word occurs in the industry description document; $df_i$ is the document frequency, the number of industry description documents in which the $i$-th word occurs; and $N$ is the total number of industry description documents. The specific threshold for the $t_i\text{-}idf$ value may be determined according to actual conditions, as long as the resulting keywords screen out as many qualifying three-new enterprises as possible while keeping the processing complexity as low as possible.
From the word frequency of each word in each industry description document obtained in step 2, the $t_i\text{-}idf$ value of each word can be obtained using formula (1). Words whose $t_i\text{-}idf$ value exceeds the specific threshold are then chosen as keywords, thereby generating the three-new enterprise keyword corpus.
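Formula (1) and the threshold rule can be illustrated with a short sketch. The function name `tfidf_keywords`, the input layout (industry code mapped to a token list), the sample data, and the threshold value are assumptions made for illustration. The patent does not state the base of the logarithm; the natural logarithm is assumed here.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, threshold):
    """Score each word with f_i * log(N / df_i) and keep those above threshold.

    doc_tokens maps an industry code to the token list of its industry
    description document. Returns, per code, a dict of keyword -> score.
    """
    n_docs = len(doc_tokens)
    # Document frequency: number of documents in which each word occurs.
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    corpus = {}
    for code, tokens in doc_tokens.items():
        tf = Counter(tokens)  # word frequency within this document
        scores = {w: f * math.log(n_docs / df[w]) for w, f in tf.items()}
        corpus[code] = {w: s for w, s in scores.items() if s > threshold}
    return corpus


# Hypothetical tokenized industry description documents.
docs = {
    "A": ["cloud", "cloud", "service", "platform"],
    "B": ["robot", "robot", "service", "factory"],
}
corpus = tfidf_keywords(docs, threshold=1.0)
```

In this toy data "service" occurs in every document, so its score is zero and it is never selected, while "cloud" and "robot" occur twice in a single document each and pass the threshold. This is the behaviour that makes the measure useful for picking industry-specific keywords.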
In step S103, keyword matching is performed between the business scope introduction documents of the first enterprises obtained in step S101 and the keyword corpus generated in step S102, and the second enterprises associated with the keywords are screened out. The business scope introduction documents of the first enterprises may be obtained through a designated interface from relevant personnel, for example from the requester who asks for three-new enterprises to be screened, or may be crawled by a web crawler. In one example of the present invention, the match function of the R language may be used to match the business scope introduction documents of the first enterprises against the keyword corpus generated in step S102 and automatically screen out the second enterprises associated with keywords in the corpus. The industry classification code of an enterprise may not represent the business the enterprise actually operates; that is, there may be a deviation between the code and the actual business. The first enterprises determined by industry classification code may therefore include many enterprises that are not three-new enterprises. Consequently, the second enterprises further screened out of the first enterprises by keyword matching in step S103 are considerably more likely to be three-new enterprises. In one exemplary embodiment of the invention, three-new enterprises may account for about 85% of the second enterprises screened out in step S103.
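The second-round screening can be sketched as a set intersection between the words of each business scope introduction and the keyword corpus. Note that this sketch swaps in plain Python set operations in place of the R match function the text mentions. The function name `select_second_enterprises`, the `min_hits` parameter, and the sample data are hypothetical; the patent does not say how many matching keywords are required, so a single match is assumed.

```python
def select_second_enterprises(scope_tokens, keyword_corpus, min_hits=1):
    """Keep enterprises whose business scope shares keywords with the corpus.

    scope_tokens maps an enterprise name to the words split from its
    business scope introduction document. min_hits is an assumed knob for
    the minimum number of distinct matching keywords.
    """
    keywords = set(keyword_corpus)
    selected = []
    for name, tokens in scope_tokens.items():
        hits = keywords.intersection(tokens)
        if len(hits) >= min_hits:
            selected.append(name)
    return selected


# Hypothetical tokenized business scope introductions of first enterprises.
scope_tokens = {
    "Alpha Software": ["cloud", "platform", "development"],
    "Delta Trading": ["wholesale", "retail"],
}
second_enterprises = select_second_enterprises(scope_tokens, {"cloud", "robot"})
```

Raising `min_hits` would make this round stricter at the cost of possibly dropping genuine three-new enterprises whose scope descriptions are short.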
In step S104, the business-related documents of the second enterprises may be crawled from the web in real time using crawler packages for the Python programming language such as selenium and bs4. The business-related documents may include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software copyrights, trademarks, and patents. Calculating the similarity between the crawled business-related documents and the keyword corpus obtained in step S102 may specifically include:
First step: for each crawled business-related document, splitting the business-related document into individual words.
In this step, each business-related document may be split into individual words in the same way that the industry description document corresponding to each class of industry code is split into individual words in step S102.
Second step: for each word obtained by splitting, determining the word frequency of the word.
In this step, the word frequency of each word may be determined in the same way that the word frequency of each word in the industry description documents is determined in step S102. Likewise, to reduce noise, words obtained in the first step that contribute nothing in particular to screening three-new enterprises, or that carry no meaning, may be deleted, for example function words such as interjections, prepositions, and conjunctions, so as to improve the efficiency of the subsequent processing.
Third step: calculating the similarity between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
In the present invention, the cosine of the angle between vectors may be used to calculate the similarity between each business-related document and the industry description document corresponding to each class of industry code.
Specifically, the word frequencies of the words are first arranged into a word frequency vector in a fixed order, for example the order in which the words appear in the document. For the $i$-th business-related document, the vector built from its words and word frequencies is $A_i = [x_1, x_2, \ldots, x_n]$, where $x_1, x_2, \ldots, x_n$ are the word frequencies of the $n$ keywords in that business-related document. Similarly, for the industry description document corresponding to the $i$-th class of industry code among the preset industry classification codes, the vector built from its words and word frequencies is $B_i = [y_1, y_2, \ldots, y_n]$, where $y_1, y_2, \ldots, y_n$ are the word frequencies of the $n$ keywords in that industry description document.
Then, based on the vectors built above, formula (2) is used to calculate the similarity $\cos\theta$ between each business-related document and the industry description document corresponding to each class of industry code:
$$\cos\theta = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}} \qquad (2)$$
In this way, formula (2) yields the similarity between each business-related document and the industry description document corresponding to each class of industry code.
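The cosine-of-angle calculation over word frequency vectors can be sketched as below. One assumption is made loudly: the two vectors are aligned over the sorted union of both documents' words, so that position k refers to the same word in both vectors. The text only asks for "a fixed order" and does not spell out the alignment. The function name and sample frequencies are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors.

    freq_a and freq_b map words to frequencies. Both are projected onto the
    sorted union of their words (an assumed alignment) before comparison.
    """
    vocab = sorted(set(freq_a) | set(freq_b))
    a = [freq_a.get(w, 0) for w in vocab]
    b = [freq_b.get(w, 0) for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical word frequencies of a business-related document and of an
# industry description document.
business_doc = Counter({"cloud": 2, "platform": 1})
industry_doc = Counter({"cloud": 1, "platform": 2})
similarity = cosine_similarity(business_doc, industry_doc)
```

For these two vectors the dot product is 4 and each norm is the square root of 5, so the similarity is 0.8. Because word frequencies are non-negative, the result always lies between 0 and 1.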
In step S105, determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically includes: if there is at least one class of industry code for which the similarity between the business-related document and the corresponding industry description document reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise. Specifically, if any of the similarities calculated in step S104 between a business-related document and the industry description documents corresponding to the classes of industry code reaches the preset similarity, for example 0.7, the second enterprise to which that business-related document belongs is determined to be a three-new enterprise. The business-related documents of the second enterprises screened out by keyword matching are compared for similarity with the industry description documents corresponding to the preset industry codes, which characterize the industries to which three-new enterprises belong, and the enterprises whose similarity reaches the preset similarity are chosen as three-new enterprises. Three-new enterprises determined through this similarity calculation are therefore identified with higher accuracy.
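The decision rule in step S105 can be sketched as an "at least one" test over the similarities of each business-related document. The function names, the input layout, and the sample similarity values are hypothetical; the default of 0.7 is the example preset similarity given in the text, and "reaches" is read here as greater than or equal to.

```python
def reaches_preset(similarities_by_code, preset_similarity=0.7):
    """True if at least one industry code's similarity reaches the preset."""
    return any(s >= preset_similarity for s in similarities_by_code.values())


def select_three_new(enterprise_docs, preset_similarity=0.7):
    """Pick enterprises with at least one qualifying business-related document.

    enterprise_docs maps an enterprise name to a list with one dict per
    crawled business-related document; each dict maps an industry code to
    the similarity with that code's industry description document.
    """
    return [
        name
        for name, docs in enterprise_docs.items()
        if any(reaches_preset(d, preset_similarity) for d in docs)
    ]


# Hypothetical similarity results for two second enterprises.
enterprise_docs = {
    "Alpha Software": [{"6510": 0.82, "3562": 0.10}],
    "Beta Catering": [{"6510": 0.35, "3562": 0.41}, {"6510": 0.20, "3562": 0.69}],
}
three_new = select_three_new(enterprise_docs)
```

A lower preset similarity admits more enterprises at the risk of false positives, so in practice the value would be tuned against a sample of enterprises whose status is already known.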
In summary, when screening three-new enterprises, the information processing method provided by the present invention first performs a first round of screening based on the preset industry codes that characterize the industries to which three-new enterprises belong. It then performs a second round of screening by keyword matching between the business scope introduction documents of the enterprises that passed the first round and the keyword corpus generated from the industry description documents corresponding to the industries characterized by the preset industry codes. Finally, it calculates the similarity between the business-related documents of the enterprises that passed the second round and the keyword corpus, and chooses the enterprises whose similarity reaches the preset similarity as three-new enterprises. These three rounds of progressive screening greatly improve the accuracy with which the screened enterprises are three-new enterprises.
Based on the same inventive concept, an embodiment of the present invention also provides an information processing system. Because the principle by which the system solves the problem is similar to that of the foregoing information processing method, the implementation of the system may refer to the implementation of the method, and repeated details are not described again.
An information processing system provided by an embodiment of the present invention, as shown in Fig. 2, includes:
a first processing unit 201, configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong;
a corpus generation unit 202, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes;
a second processing unit 203, configured to perform keyword matching between the business scope introduction documents of the first enterprises and the corpus to screen out second enterprises;
a similarity calculation unit 204, configured to crawl business-related documents of the second enterprises and calculate the similarity between the crawled business-related documents and the corpus;
a third processing unit 205, configured to determine the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
In one exemplary embodiment of the invention, the business relevant documentation includes following one or more documents Full text or fragment:Related product introduction, Related product operation instruction, software works, trade mark, patent.The related text of these business Shelves can carry out real-time network by the software kit such as the seleuim of python programming languages, bs4 reptile and crawl.
In one exemplary embodiment of the invention, the corpus generation unit 202 is based on the default industry point Category code characterizes industry corresponding to industry and illustrates document, generates three new spectra keyword corpus, specifically includes:For described Illustrate document per industry corresponding to class industry code in default trade classification code, it is single that the sector is illustrated into document is split into Word;For splitting obtained each word, the word frequency of the word is determined;Keyword is extracted using word frequency of the preset algorithm based on determination, Generate three new spectra keyword corpus.
In an exemplary embodiment of the invention, the similarity calculation unit 204 performs similarity calculation between the crawled business-related documents and the corpus, which specifically includes: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by the splitting, determining the word frequency of the word; and performing similarity calculation between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
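One way to compare the word and word-frequency sets is cosine similarity over term-frequency vectors, sketched below. The source does not state which similarity measure is used, so cosine similarity is an assumption, as are the industry codes and counts in the example.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two word-frequency mappings."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarities_by_code(doc_freq, industry_freqs):
    """Compare one business-related document with the industry description
    document of every industry code, returning {code: similarity}."""
    return {
        code: cosine_similarity(doc_freq, freq)
        for code, freq in industry_freqs.items()
    }


# Hypothetical word frequencies, for illustration only.
doc_freq = Counter({"cloud": 2, "data": 1})
industry_freqs = {
    "6510": Counter({"cloud": 3, "data": 2, "platforms": 1}),
    "1310": Counter({"grain": 4, "processing": 2}),
}
scores = similarities_by_code(doc_freq, industry_freqs)
print(scores)
```

Here the document shares vocabulary only with the "6510" description, so its score for "1310" is zero.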
In an exemplary embodiment of the invention, the third processing unit 205 determines as a three-new enterprise each second enterprise to which a business-related document reaching the preset similarity belongs, which specifically includes: if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
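The decision rule amounts to an "at least one" test over industry codes, sketched below. The threshold of 0.8, the reading of "reaches" as greater than or equal to, and all scores are hypothetical.

```python
def document_qualifies(similarity_by_code, threshold):
    """True if the document reaches the threshold for at least one code."""
    return any(score >= threshold for score in similarity_by_code.values())


def flag_enterprises(enterprise_scores, threshold=0.8):
    """Return the enterprises that own at least one qualifying document.

    enterprise_scores maps an enterprise name to a list with one
    {industry_code: similarity} mapping per crawled document.
    """
    return [
        name
        for name, documents in enterprise_scores.items()
        if any(document_qualifies(scores, threshold) for scores in documents)
    ]


# Hypothetical similarity scores, for illustration only.
enterprise_scores = {
    "A": [{"6510": 0.91, "6520": 0.40}],
    "B": [{"6510": 0.35, "6520": 0.20}, {"6510": 0.10, "6520": 0.79}],
}
print(flag_enterprises(enterprise_scores))  # ['A']
```

Enterprise B is not flagged because none of its documents reaches 0.8 for any code, even though one comes close.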
Finally, it should be noted that the embodiments described above are only specific embodiments of the invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the invention is not limited thereto. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention, and shall all be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

  1. An information processing method, characterized by comprising:
    determining, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industry characterized by the preset industry classification code is an industry to which three-new enterprises belong;
    generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes;
    matching the business scope introduction document of each first enterprise against the keywords of the corpus, and screening out second enterprises;
    crawling the business-related documents of each second enterprise, and performing similarity calculation between the crawled business-related documents and the corpus;
    determining as a three-new enterprise each second enterprise to which a business-related document reaching a preset similarity belongs.
  2. The method according to claim 1, characterized in that the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software works, trademarks, and patents.
  3. The method according to claim 1 or 2, characterized in that generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically includes:
    for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words;
    for each word obtained by the splitting, determining the word frequency of the word;
    extracting keywords with a preset algorithm based on the determined word frequencies, and generating the three-new enterprise keyword corpus.
  4. The method according to claim 1 or 2, characterized in that performing similarity calculation between the crawled business-related documents and the corpus specifically includes:
    for each crawled business-related document, splitting the business-related document into individual words;
    for each word obtained by the splitting, determining the word frequency of the word;
    performing similarity calculation between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
  5. The method according to claim 4, characterized in that determining as a three-new enterprise each second enterprise to which a business-related document reaching the preset similarity belongs specifically includes:
    if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
  6. An information processing system, characterized by comprising:
    a first processing unit, configured to determine, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industry characterized by the preset industry classification code is an industry to which three-new enterprises belong;
    a corpus generation unit, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes;
    a second processing unit, configured to match the business scope introduction document of each first enterprise against the keywords of the corpus, and to screen out second enterprises;
    a similarity calculation unit, configured to crawl the business-related documents of each second enterprise, and to perform similarity calculation between the crawled business-related documents and the corpus;
    a third processing unit, configured to determine as a three-new enterprise each second enterprise to which a business-related document reaching a preset similarity belongs.
  7. The system according to claim 6, characterized in that the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software works, trademarks, and patents.
  8. The system according to claim 6 or 7, characterized in that the corpus generation unit generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically includes:
    for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words;
    for each word obtained by the splitting, determining the word frequency of the word;
    extracting keywords with a preset algorithm based on the determined word frequencies, and generating the three-new enterprise keyword corpus.
  9. The system according to claim 6 or 7, characterized in that the similarity calculation unit performing similarity calculation between the crawled business-related documents and the corpus specifically includes:
    for each crawled business-related document, splitting the business-related document into individual words;
    for each word obtained by the splitting, determining the word frequency of the word;
    performing similarity calculation between the words and word frequencies obtained from the business-related document and, respectively, the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
  10. The system according to claim 9, characterized in that the third processing unit determining as a three-new enterprise each second enterprise to which a business-related document reaching the preset similarity belongs specifically includes:
    if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
CN201710506158.XA 2017-06-28 2017-06-28 information processing method and system Active CN107357851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710506158.XA CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710506158.XA CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Publications (2)

Publication Number Publication Date
CN107357851A true CN107357851A (en) 2017-11-17
CN107357851B CN107357851B (en) 2020-01-31

Family

ID=60273239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710506158.XA Active CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Country Status (1)

Country Link
CN (1) CN107357851B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4791169B2 (en) * 2005-12-12 2011-10-12 ヤフー株式会社 Related word extraction device and related word extraction method
CN101127050A (en) * 2007-07-03 2008-02-20 北京大学 Method for automatically extracting website owner administrative apanage information from web page
CN102073692A (en) * 2010-12-16 2011-05-25 北京农业信息技术研究中心 Agricultural field ontology library based semantic retrieval system and method
CN106682145A (en) * 2016-12-22 2017-05-17 北京览群智数据科技有限责任公司 Enterprise information processing method, server and client

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡芳槐: ""基于多种数据源的中文知识图谱构建方法研究"", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801118A (en) * 2018-12-24 2019-05-24 航天信息股份有限公司 Identify method, apparatus, medium and the equipment of the manufacturing business of designated trade
CN113076979A (en) * 2021-03-23 2021-07-06 广州快必妥营销策划咨询有限公司 Qualified crop screening method, crop cultivation control method, system and device
CN113076979B (en) * 2021-03-23 2024-05-17 广州快必妥营销策划咨询有限公司 Qualified crop screening method, crop cultivation control method, system and device
CN113869639A (en) * 2021-08-26 2021-12-31 中国环境科学研究院 Yangtze river basin enterprise screening method and device, electronic equipment and storage medium
WO2023025332A1 (en) * 2021-08-26 2023-03-02 中国环境科学研究院 Yangtze river basin enterprise screening method and apparatus, electronic device, and storage medium
CN113869639B (en) * 2021-08-26 2023-11-07 中国环境科学研究院 Yangtze river basin enterprise screening method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107357851B (en) 2020-01-31

Similar Documents

Publication Publication Date Title
CN111159395B (en) Chart neural network-based rumor standpoint detection method and device and electronic equipment
CN107871144A (en) Invoice trade name sorting technique, system, equipment and computer-readable recording medium
CN108009284A (en) Using the Law Text sorting technique of semi-supervised convolutional neural networks
CN106484664A (en) Similarity calculating method between a kind of short text
CN103034726B (en) Text filtering system and method
CN107357851A (en) A kind of information processing method and system
CN112001170B (en) Method and system for identifying deformed sensitive words
Li et al. A Bi-LSTM-RNN model for relation classification using low-cost sequence features
CN106599037A (en) Recommendation method based on label semantic normalization
CN108388554A (en) Text emotion identifying system based on collaborative filtering attention mechanism
CN108335210A (en) A kind of stock unusual fluctuation analysis method based on public opinion data
CN113254593A (en) Text abstract generation method and device, computer equipment and storage medium
CN109960791A (en) Judge the method and storage medium, terminal of text emotion
CN106649262B (en) Method for protecting sensitive information of enterprise hardware facilities in social media
Geetha et al. Twitter opinion mining and boosting using sentiment analysis
CN103810213B (en) A kind of searching method and system
Sabaruddin et al. Malay tweets: discovering mental health situation during covid-19 pandemic in Malaysia
CN108846128A (en) A kind of cross-domain texts classification method based on adaptive noise encoder
US11640398B2 (en) Method and system for data communication with relational database management
Amethyst et al. Data pattern single column analysis for data profiling using an open source platform
Pitchayaviwat A study on clustering customer suggestion on online social media about insurance services by using text mining techniques
CN111125486B (en) Microblog user attribute analysis method based on multiple features
JP6150664B2 (en) Mining analyzer, method and program
Monish et al. Automated topic modeling and sentiment analysis of tweets on SparkR
Alwosheel et al. Artificial neural networks as a means to accommodate decision rules in choice models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District

Patentee after: Guoxin Youyi Data Co., Ltd

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Patentee before: SIC YOUE DATA Co.,Ltd.
