CN107357851A - Information processing method and system - Google Patents
- Publication number
- CN107357851A CN201710506158.XA
- Authority
- CN
- China
- Prior art keywords
- industry
- enterprise
- word
- default
- relevant documentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Human Resources & Organizations (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Educational Administration (AREA)
- Economics (AREA)
- Data Mining & Analysis (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Game Theory and Decision Science (AREA)
- Health & Medical Sciences (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information processing method, comprising: determining, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industries represented by the preset industry classification codes are the industries to which "three-new" enterprises belong; generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; performing keyword matching between the business scope description documents of the first enterprises and the corpus to screen out second enterprises; crawling business-related documents corresponding to the second enterprises and calculating the similarity between the crawled business-related documents and the corpus; and determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises. The invention also discloses an information processing system. The invention can accurately screen out, from the preset enterprises, the three-new enterprises that meet the requirements.
Description
Technical field
The present invention relates to an information processing method and system, and in particular to a method and system for identifying "three-new" enterprises.
Background
With the rapid development of China's economy, new enterprises and new economic activities keep emerging. Enterprises are the most important active agents in the social economy and play an important role in it. Collating and analysing enterprise information helps decision makers understand how enterprises are operating and discover potential business risks.
For example, three-new enterprises (enterprises with new industries, new business forms, or new business models) have emerged recently and have drawn the attention of the Party Central Committee and the State Council. Relevant personnel need to statistically monitor the scale, structure, and quality of the economic activities of these enterprises, so as to follow their development in real time and provide a reference for future decisions. The key to such statistical monitoring is knowing exactly which of the many enterprises under investigation are three-new enterprises. This requires screening accurately so that only the enterprises that qualify as three-new enterprises are selected. However, no scheme currently exists for accurately screening three-new enterprises.
Summary of the invention
The technical problem to be solved by the present invention is to provide a scheme that screens three-new enterprises accurately while saving time and labour.
One aspect of the present invention provides an information processing method for accurately and effectively screening three-new enterprises. The method comprises: determining, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong; generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; performing keyword matching between the business scope description documents of the first enterprises and the corpus to screen out second enterprises; crawling business-related documents corresponding to the second enterprises and calculating the similarity between the crawled business-related documents and the corpus; and determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
Optionally, the business-related documents include the full text or fragments of one or more of the following: product introductions, product operating instructions, software copyrights, trademarks, and patents.
Optionally, generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes specifically comprises: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords with a preset algorithm based on the determined word frequencies to generate the three-new enterprise keyword corpus.
Optionally, calculating the similarity between the crawled business-related documents and the corpus specifically comprises: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
Optionally, determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically comprises: if there is at least one class of industry code for which the similarity between a business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
Another embodiment of the present invention provides an information processing system, comprising: a first processing unit, configured to determine, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong; a corpus generation unit, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; a second processing unit, configured to perform keyword matching between the business scope description documents of the first enterprises and the corpus to screen out second enterprises; a similarity calculation unit, configured to crawl business-related documents corresponding to the second enterprises and calculate the similarity between the crawled business-related documents and the corpus; and a third processing unit, configured to determine the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
Optionally, the business-related documents include the full text or fragments of one or more of the following: product introductions, product operating instructions, software copyrights, trademarks, and patents.
Optionally, the corpus generation unit generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes specifically comprises: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords with a preset algorithm based on the determined word frequencies to generate the three-new enterprise keyword corpus.
Optionally, the similarity calculation unit calculating the similarity between the crawled business-related documents and the corpus specifically comprises: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
Optionally, the third processing unit determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically comprises: if there is at least one class of industry code for which the similarity between a business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
When screening three-new enterprises, the information processing method provided by the present invention first determines, from the preset enterprises, the first enterprises whose industry classification code matches a preset industry classification code representing an industry to which three-new enterprises belong. It then generates a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes, and performs keyword matching between the business scope description documents of the first enterprises and the corpus to screen out the second enterprises. Next, it crawls the business-related documents corresponding to the second enterprises and calculates the similarity between the crawled documents and the corpus. Finally, it determines the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises. With these three progressive rounds of screening, the enterprises that remain are increasingly likely to be three-new enterprises, so three-new enterprises can be screened out accurately and a reference is provided for their screening.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the information processing method of an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the information processing system of an embodiment of the present invention.
Detailed description of the embodiments
To make the technical problem to be solved, the technical scheme, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic flowchart of the information processing method of an embodiment of the present invention. As shown in Fig. 1, the information processing method provided by the embodiment of the present invention comprises:
S101, determining, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong.
S102, generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes.
S103, performing keyword matching between the business scope description documents of the first enterprises and the corpus to screen out second enterprises.
S104, crawling business-related documents corresponding to the second enterprises, and calculating the similarity between the crawled business-related documents and the corpus.
S105, determining the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
In step S101, the preset industry classification codes representing the industries to which three-new enterprises belong can be extracted from the national economy industry classification, based on how relevant official documents describe three-new enterprises. For example, the classification range of three-new enterprises can be derived from the passages on "three-new" activities in documents such as the "Outline of the 13th Five-Year Plan for National Economic and Social Development of the People's Republic of China", the "Notice of the State Council on Issuing 'Made in China 2025'", the "Guiding Opinions of the State Council on Actively Promoting the 'Internet Plus' Action", and the "Opinions of the State Council on Several Policies and Measures for Vigorously Promoting Mass Entrepreneurship and Innovation". The relevant industry classification codes are then chosen from the national economy industry classification according to that range. In one example, the classification range derived from these documents may include modern agriculture, forestry, animal husbandry and fishery; advanced manufacturing; new energy activities; energy conservation and environmental protection activities; Internet and modern information technology services; modern technology services and mass entrepreneurship and innovation services; modern producer services; new lifestyle services; and modern comprehensive management activities. The preset industry classification codes obtained from these categories may comprise 278 groups.
Based on the preset industry classification codes so determined, the first enterprises whose industry classification code matches a preset industry classification code are selected from the preset enterprises. The preset enterprises may be obtained through a specified interface from the requester who asks for three-new enterprises to be screened, or crawled by a web crawler according to specified keywords.
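As an illustration only, the first-round filter of step S101 can be sketched as a set-membership test. The `Enterprise` record, the function name, and the code values below are hypothetical placeholders and are not taken from the patent or from the national economy industry classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enterprise:
    """Hypothetical record for one preset enterprise."""
    name: str
    industry_code: str


# Placeholder values standing in for the preset industry classification codes.
PRESET_CODES = {"6510", "3841"}


def select_first_enterprises(enterprises, preset_codes):
    """Keep the enterprises whose industry code is one of the preset codes."""
    return [e for e in enterprises if e.industry_code in preset_codes]


candidates = [
    Enterprise("Alpha Software", "6510"),
    Enterprise("Beta Catering", "6210"),
    Enterprise("Gamma Batteries", "3841"),
]
first_enterprises = select_first_enterprises(candidates, PRESET_CODES)
print([e.name for e in first_enterprises])
```

Holding the preset codes in a set keeps each lookup constant-time, which matters little for three records but helps when the list of preset enterprises is large.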
In step S102, generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes may specifically comprise:
Step 1: for the industry description document corresponding to each class of industry code among the preset industry classification codes, split the industry description document into individual words.
A preset word segmentation tool can be used to split the industry description document corresponding to each class of industry code into individual words. For example, the jieba library in Python can be used, which splits each industry description document into individual words according to custom rules.
Step 2: for each word obtained by splitting, determine the word frequency of the word.
For each word obtained in Step 1, a word frequency counting tool can count how often the word occurs in each industry description document, which gives the word frequency of each word. In addition, to reduce noise, words obtained in Step 1 that contribute nothing in particular to screening three-new enterprises, or that carry no meaning, can be deleted. Examples are function words in the document such as interjections, prepositions, and conjunctions. Deleting them makes keyword extraction in the subsequent step more efficient.
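Steps 1 and 2 can be sketched as follows. The patent names jieba for Chinese word segmentation. To stay self-contained, this sketch instead uses a regular-expression tokenizer on English text as a stand-in, and the stop-word list and function names are illustrative assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list standing in for the function words to delete.
STOP_WORDS = {"the", "of", "and", "in", "for", "a", "to"}


def split_words(text):
    """Stand-in tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count each remaining word after dropping the stop words."""
    return Counter(w for w in split_words(text) if w not in stop_words)


doc = "Manufacture of industrial robots and the service of industrial robots"
freq = word_frequencies(doc)
print(freq.most_common())
```

For Chinese documents, the tokenizer would be replaced by a proper segmentation tool, while the counting and filtering logic stays the same.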
Step 3: extract keywords with a preset algorithm based on the determined word frequencies, and generate the three-new enterprise keyword corpus.
In an example of the present invention, the TF-IDF method can be used to extract keywords based on the determined word frequencies and generate the three-new enterprise keyword corpus. The invention is not limited to this: other methods can also be used to extract keywords based on the determined word frequencies, for example mutual information, expected cross entropy, information gain, principal component analysis (PCA), or genetic algorithms.
In the present invention, the TF-IDF method is used to obtain the tf-idf value of each word in each document, and the words whose tf-idf value exceeds a specific threshold are chosen as keywords. The tf-idf value of each word in each industry description document can be obtained by formula (1):
$$\text{tf-idf}_i = f_i \cdot \log\left(\frac{N}{df_i}\right) \qquad (1)$$
where $f_i$ is the word frequency, the number of times the $i$-th word occurs in the industry description document; $df_i$ is the document frequency, the number of industry description documents in which the $i$-th word occurs; and $N$ is the total number of industry description documents. The threshold for the tf-idf value can be set according to the actual situation, provided that the resulting keywords screen out as many qualifying three-new enterprises as possible while keeping the processing complexity as low as possible.
With the word frequency of each word in each industry description document obtained in Step 2, formula (1) gives the tf-idf value of each word. The words whose tf-idf value exceeds the specific threshold are then chosen as keywords, which generates the three-new enterprise keyword corpus.
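A minimal sketch of formula (1) and the threshold selection follows. The patent does not state the base of the logarithm or a concrete threshold, so the natural logarithm, the threshold of 1.0, and the toy documents are assumptions made only for illustration.

```python
import math
from collections import Counter


def tf_idf_keywords(tokenized_docs, threshold):
    """Return the words whose f_i * log(N / df_i) exceeds the threshold."""
    n_docs = len(tokenized_docs)
    # df_i: number of documents in which each word occurs at least once.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    keywords = set()
    for tokens in tokenized_docs:
        # f_i: number of occurrences of the word within this document.
        for word, f_i in Counter(tokens).items():
            score = f_i * math.log(n_docs / doc_freq[word])
            if score > threshold:
                keywords.add(word)
    return keywords


# Toy industry description documents, already split into words.
docs = [
    ["robot", "robot", "manufacturing", "service"],
    ["cloud", "cloud", "software", "service"],
    ["solar", "battery", "manufacturing", "service"],
]
corpus = tf_idf_keywords(docs, threshold=1.0)
print(sorted(corpus))
```

A word that occurs in every document, such as "service" here, scores zero because log(N / df_i) is zero, so it never enters the corpus regardless of how often it appears.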
In step S103, keyword matching is performed between the business scope description documents of the first enterprises obtained in step S101 and the keyword corpus generated in step S102, and the second enterprises associated with the keywords are screened out. The business scope description documents of the first enterprises may be obtained through a specified interface from relevant personnel, for example from the requester who asks for three-new enterprises to be screened, or crawled by a web crawler. In an example of the present invention, the match function in the R language can be used to match the business scope description documents of the first enterprises against the keyword corpus generated in step S102, automatically screening out the second enterprises associated with keywords in the corpus. The industry classification code of an enterprise may not represent the business the enterprise actually runs; that is, the code may deviate from the actual business. The first enterprises determined by industry classification code may therefore include many enterprises that are not three-new enterprises. For this reason, the second enterprises further screened out from the first enterprises by keyword matching in step S103 are much more likely to be three-new enterprises. In one exemplary embodiment of the present invention, three-new enterprises may account for about 85% of the second enterprises screened out in step S103.
In step S104, the business-related documents corresponding to the second enterprises can be crawled from the web in real time by a crawler built with Python packages such as selenium and bs4. The business-related documents may include the full text or fragments of one or more of the following: product introductions, product operating instructions, software copyrights, trademarks, and patents. Calculating the similarity between the crawled business-related documents and the keyword corpus obtained in step S102 may specifically comprise:
First step: for each crawled business-related document, split the business-related document into individual words.
The business-related documents can be split into individual words in the same way as the industry description document corresponding to each class of industry code is split in step S102.
Second step: for each word obtained by splitting, determine the word frequency of the word.
The word frequency of each word can be determined in the same way as the word frequency of each word in the industry description documents is determined in step S102. Likewise, to reduce noise, words obtained in the first step that contribute nothing in particular to screening three-new enterprises, or that carry no meaning, can be deleted, for example function words such as interjections, prepositions, and conjunctions. Deleting them makes the subsequent processing more efficient.
Third step: calculate the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
In the present invention, the similarity between each business-related document and the industry description document corresponding to each class of industry code can be calculated as the cosine of the angle between their vectors.
Specifically, the word frequencies of the words are first arranged in a fixed order, for example the order in which the words occur in the document, to build a word frequency vector. For the $i$-th business-related document, the vector built from the split words and their word frequencies is $A_i = [x_1, x_2, \ldots, x_n]$, where $x_1, x_2, \ldots, x_n$ are the word frequencies of the $n$ keywords of that business-related document. Similarly, for the industry description document corresponding to the $i$-th class of industry code among the preset industry classification codes, the vector built from the split words and their word frequencies is $B_i = [y_1, y_2, \ldots, y_n]$, where $y_1, y_2, \ldots, y_n$ are the word frequencies of the $n$ keywords of that industry description document.
Then, based on the vectors so built, the similarity $\cos\theta$ between each business-related document and the industry description document corresponding to each class of industry code is calculated by formula (2):
$$\cos\theta = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}} \qquad (2)$$
Formula (2) thus gives the similarity between each business-related document and the industry description document corresponding to each class of industry code.
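The cosine calculation of formula (2) can be sketched as below. To make the two vectors comparable, this sketch aligns them on the sorted union of the words in both documents. That alignment, the zero-vector guard, and the sample frequencies are assumptions and are not prescribed by the patent.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word frequency vectors."""
    # Align both documents on one shared, ordered vocabulary.
    vocab = sorted(set(freq_a) | set(freq_b))
    a = [freq_a.get(w, 0) for w in vocab]
    b = [freq_b.get(w, 0) for w in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm_a * norm_b)


business_doc = Counter({"robot": 3, "service": 1})
industry_doc = Counter({"robot": 2, "manufacturing": 1})
similarity = cosine_similarity(business_doc, industry_doc)
print(round(similarity, 4))
```

Because word frequencies are never negative, the result lies between 0 and 1, which is what makes a fixed preset similarity such as 0.7 meaningful as a cut-off.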
In step S105, determining the second enterprises whose business-related documents reach the preset similarity to be three-new enterprises specifically comprises: if there is at least one class of industry code for which the similarity between a business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise. Specifically, among all the similarities calculated in step S104 between each business-related document and the industry description document corresponding to each class of industry code, if any similarity reaches the preset similarity, for example 0.7, the second enterprise to which that business-related document belongs is determined to be a three-new enterprise. The business-related documents of the second enterprises screened out by keyword matching are compared for similarity with the description documents of the preset industry codes, which represent the industries to which three-new enterprises belong, and only the enterprises whose similarity reaches the preset similarity are chosen. The three-new enterprises determined by this similarity calculation are therefore identified with higher accuracy.
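The decision rule of step S105 can be sketched as an "any class reaches the threshold" test. The value 0.7 is the example given in the text. The enterprise names, the code labels, and the scores are hypothetical, and "reaches" is read here as greater than or equal to.

```python
def is_three_new(similarities_by_code, preset_similarity=0.7):
    """True if the similarity for at least one class of industry code reaches the threshold."""
    return any(s >= preset_similarity for s in similarities_by_code.values())


# Hypothetical similarity scores per second enterprise and per class of industry code.
scores = {
    "Alpha Software": {"code_A": 0.82, "code_B": 0.31},
    "Epsilon Foods": {"code_A": 0.12, "code_B": 0.40},
}
three_new = [name for name, sims in scores.items() if is_three_new(sims)]
print(three_new)
```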
In summary, when screening three-new enterprises, the information processing method provided by the present invention performs a first round of screening based on the preset industry codes representing the industries to which three-new enterprises belong. It then performs a second round by keyword matching between the business scope description documents of the enterprises that pass the first round and the keyword corpus generated from the industry description documents corresponding to the industries represented by the preset industry codes. Finally, it calculates the similarity between the business-related documents of the enterprises that pass the second round and the keyword corpus, and chooses the enterprises whose similarity reaches the preset similarity as three-new enterprises. These three progressive rounds of screening greatly increase the likelihood that the enterprises screened out are three-new enterprises.
Based on the same inventive concept, an embodiment of the present invention also provides an information processing system. Because the principle by which the system solves the problem is similar to that of the foregoing information processing method, the implementation of the system may refer to the implementation of the method, and the repeated parts are not described again.
An information processing system provided by an embodiment of the present invention, as shown in Fig. 2, comprises:
a first processing unit 201, configured to determine, from preset enterprises, first enterprises whose industry classification code matches a preset industry classification code, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong;
a corpus generation unit 202, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes;
a second processing unit 203, configured to perform keyword matching between the business scope description documents of the first enterprises and the corpus to screen out second enterprises;
a similarity calculation unit 204, configured to crawl business-related documents corresponding to the second enterprises and calculate the similarity between the crawled business-related documents and the corpus;
a third processing unit 205, configured to determine the second enterprises whose business-related documents reach a preset similarity to be three-new enterprises.
In one exemplary embodiment of the present invention, the business-related documents include the full text or fragments of one or more of the following: product introductions, product operating instructions, software copyrights, trademarks, and patents. These business-related documents can be crawled from the web in real time by a crawler built with Python packages such as selenium and bs4.
In one exemplary embodiment of the present invention, the corpus generation unit 202 generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes specifically comprises: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords with a preset algorithm based on the determined word frequencies to generate the three-new enterprise keyword corpus.
In one exemplary embodiment of the present invention, the similarity calculation unit 204 calculating the similarity between the crawled business-related documents and the corpus specifically comprises: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each class of industry code among the preset industry classification codes.
In an exemplary embodiment of the invention, the third processing unit 205 determining, as a three-new enterprise, the second enterprise to which a business-related document reaching the preset similarity belongs specifically includes: if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
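The decision rule above reduces to an "at least one industry code reaches the threshold" test. The sketch below assumes that "reaches" means greater than or equal to, and that per-document similarity scores are already available as a mapping from industry code to score; the function names and the data layout are hypothetical.

```python
def reaches_preset_similarity(similarities, threshold):
    """True if at least one industry code scores at or above the threshold."""
    return any(score >= threshold for score in similarities.values())


def select_three_new_enterprises(doc_scores, threshold):
    """doc_scores: iterable of (enterprise_id, {industry_code: similarity}),
    one entry per crawled business-related document.

    Returns the enterprises owning at least one qualifying document,
    in order of first appearance and without duplicates.
    """
    selected = []
    for enterprise_id, similarities in doc_scores:
        if enterprise_id in selected:
            continue  # already determined to be a three-new enterprise
        if reaches_preset_similarity(similarities, threshold):
            selected.append(enterprise_id)
    return selected
```

An enterprise is selected as soon as one of its documents qualifies for one industry code; further documents from the same enterprise do not change the outcome.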
Finally, it should be noted that the embodiments described above are merely specific embodiments of the present invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent replacements of some of their technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
- 1. An information processing method, characterized by comprising: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong; generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes; performing keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus, to screen out second enterprises; crawling the business-related documents corresponding to the second enterprises, and performing similarity calculation between the crawled business-related documents and the corpus; and determining, as a three-new enterprise, the second enterprise to which a business-related document reaching a preset similarity belongs.
- 2. The method according to claim 1, characterized in that the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software works, trademarks, and patents.
- 3. The method according to claim 1 or 2, characterized in that generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically includes: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords based on the determined word frequencies using a preset algorithm, to generate the three-new enterprise keyword corpus.
- 4. The method according to claim 1 or 2, characterized in that performing similarity calculation between the crawled business-related documents and the corpus specifically includes: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and performing similarity calculation between the words and corresponding word frequencies obtained by splitting the business-related document and, respectively, the words and corresponding word frequencies obtained by splitting the industry description document corresponding to each class of industry code among the preset industry classification codes.
- 5. The method according to claim 4, characterized in that determining, as a three-new enterprise, the second enterprise to which a business-related document reaching the preset similarity belongs specifically includes: if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
- 6. An information processing system, characterized by comprising: a first processing unit, configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries characterized by the preset industry classification codes are the industries to which three-new enterprises belong; a corpus generation unit, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes; a second processing unit, configured to perform keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus, to screen out second enterprises; a similarity calculation unit, configured to crawl the business-related documents corresponding to the second enterprises and to perform similarity calculation between the crawled business-related documents and the corpus; and a third processing unit, configured to determine, as a three-new enterprise, the second enterprise to which a business-related document reaching a preset similarity belongs.
- 7. The system according to claim 6, characterized in that the business-related documents include the full text or fragments of one or more of the following documents: related product introductions, related product operating instructions, software works, trademarks, and patents.
- 8. The system according to claim 6 or 7, characterized in that the corpus generation unit generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries characterized by the preset industry classification codes specifically includes: for the industry description document corresponding to each class of industry code among the preset industry classification codes, splitting the industry description document into individual words; for each word obtained by splitting, determining the word frequency of the word; and extracting keywords based on the determined word frequencies using a preset algorithm, to generate the three-new enterprise keyword corpus.
- 9. The system according to claim 6 or 7, characterized in that the similarity calculation unit performing similarity calculation between the crawled business-related documents and the corpus specifically includes: for each crawled business-related document, splitting the business-related document into individual words; for each word obtained by splitting, determining the word frequency of the word; and performing similarity calculation between the words and corresponding word frequencies obtained by splitting the business-related document and, respectively, the words and corresponding word frequencies obtained by splitting the industry description document corresponding to each class of industry code among the preset industry classification codes.
- 10. The system according to claim 9, characterized in that the third processing unit determining, as a three-new enterprise, the second enterprise to which a business-related document reaching the preset similarity belongs specifically includes: if there is at least one class of industry code for which the similarity between the business-related document and the industry description document corresponding to that class of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as a three-new enterprise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710506158.XA CN107357851B (en) | 2017-06-28 | 2017-06-28 | information processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107357851A (en) | 2017-11-17 |
CN107357851B (en) | 2020-01-31 |
Family
ID=60273239
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710506158.XA (Active), CN107357851B (en) | 2017-06-28 | 2017-06-28 | information processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107357851B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101127050A (en) * | 2007-07-03 | 2008-02-20 | 北京大学 | Method for automatically extracting website owner administrative apanage information from web page |
CN102073692A (en) * | 2010-12-16 | 2011-05-25 | 北京农业信息技术研究中心 | Agricultural field ontology library based semantic retrieval system and method |
JP4791169B2 (en) * | 2005-12-12 | 2011-10-12 | ヤフー株式会社 | Related word extraction device and related word extraction method |
CN106682145A (en) * | 2016-12-22 | 2017-05-17 | 北京览群智数据科技有限责任公司 | Enterprise information processing method, server and client |
Non-Patent Citations (1)
Title |
---|
胡芳槐: ""基于多种数据源的中文知识图谱构建方法研究"", 《中国博士学位论文全文数据库 信息科技辑》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801118A (en) * | 2018-12-24 | 2019-05-24 | 航天信息股份有限公司 | Identify method, apparatus, medium and the equipment of the manufacturing business of designated trade |
CN113076979A (en) * | 2021-03-23 | 2021-07-06 | 广州快必妥营销策划咨询有限公司 | Qualified crop screening method, crop cultivation control method, system and device |
CN113076979B (en) * | 2021-03-23 | 2024-05-17 | 广州快必妥营销策划咨询有限公司 | Qualified crop screening method, crop cultivation control method, system and device |
CN113869639A (en) * | 2021-08-26 | 2021-12-31 | 中国环境科学研究院 | Yangtze river basin enterprise screening method and device, electronic equipment and storage medium |
WO2023025332A1 (en) * | 2021-08-26 | 2023-03-02 | 中国环境科学研究院 | Yangtze river basin enterprise screening method and apparatus, electronic device, and storage medium |
CN113869639B (en) * | 2021-08-26 | 2023-11-07 | 中国环境科学研究院 | Yangtze river basin enterprise screening method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107357851B (en) | 2020-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CP03 | Change of name, title or address | Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District; Patentee after: Guoxin Youyi Data Co., Ltd; Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing; Patentee before: SIC YOUE DATA Co., Ltd. |