CN107357851B - information processing method and system - Google Patents

Info

Publication number
CN107357851B
Authority
CN
China
Prior art keywords
industry
preset
enterprise
business
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710506158.XA
Other languages
Chinese (zh)
Other versions
CN107357851A (en)
Inventor
夏耘海
张斌德
王江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoxin Youe Data Co Ltd filed Critical Guoxin Youe Data Co Ltd
Priority to CN201710506158.XA priority Critical patent/CN107357851B/en
Publication of CN107357851A publication Critical patent/CN107357851A/en
Application granted granted Critical
Publication of CN107357851B publication Critical patent/CN107357851B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)

Abstract

The invention discloses an information processing method comprising: determining, from preset enterprises, a first enterprise whose industry classification code matches a preset industry classification code, wherein the industry represented by the preset industry classification code is an industry to which "three-new" enterprises belong; generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code; performing keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise; crawling a business-related document corresponding to the second enterprise and calculating the similarity between the crawled business-related document and the corpus; and determining the second enterprise to which a business-related document reaching a preset similarity belongs to be a three-new enterprise. An information processing system is also disclosed.

Description

information processing method and system
Technical Field
The invention relates to information processing methods and systems, and in particular to a method and system for identifying "three-new" enterprises.
Background
With the rapid development of China's economy, new enterprises and new economic activities continue to emerge. As the most important participants in the social economy, enterprises play an important role in it, and organizing and analyzing enterprise information helps decision makers understand how enterprises are operating and discover potential operating risks.
For example, for the recently emerged "three-new" enterprises (covering new industries, new business forms, and new business models), which have drawn the attention of the Party Central Committee and the State Council, the relevant personnel need to carry out statistical monitoring of the scale, structure, and quality of these enterprises' economic activities, so as to track them in real time and provide a reference for future decisions. The key to such statistical monitoring is knowing accurately which of the many enterprises under investigation are three-new enterprises. The three-new enterprises therefore need to be screened accurately so that those meeting the requirements are selected. However, there is currently no scheme for accurately screening three-new enterprises.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a scheme that saves time and labor and accurately screens three-new enterprises.
To screen three-new enterprises accurately and effectively, the invention provides an information processing method comprising: determining, from preset enterprises, a first enterprise whose industry classification code matches a preset industry classification code, wherein the industry represented by the preset industry classification code is an industry to which three-new enterprises belong; generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code; performing keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise; crawling a business-related document corresponding to the second enterprise and calculating the similarity between the crawled business-related document and the corpus; and determining the second enterprise to which a business-related document reaching a preset similarity belongs to be a three-new enterprise.
Optionally, the business-related documents include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents.
Optionally, generating the three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code specifically includes: for the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into individual words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequencies using a preset algorithm to generate the three-new enterprise keyword corpus.
Optionally, calculating the similarity between the crawled business-related document and the corpus specifically includes: for each crawled business-related document, splitting the business-related document into individual words; determining the word frequency of each word obtained by splitting; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each type of industry code in the preset industry classification codes.
Optionally, determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
Another embodiment of the invention provides an information processing system, which includes: a first processing unit configured to determine, from preset enterprises, a first enterprise whose industry classification code matches a preset industry classification code, where the industry represented by the preset industry classification code is an industry to which three-new enterprises belong; a corpus generating unit configured to generate a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code; a second processing unit configured to perform keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise; a similarity calculating unit configured to crawl a business-related document corresponding to the second enterprise and calculate the similarity between the crawled business-related document and the corpus; and a third processing unit configured to determine the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise.
Optionally, the business-related documents include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents.
Optionally, the corpus generating unit generating the three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code specifically includes: for the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into individual words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequencies using a preset algorithm to generate the three-new enterprise keyword corpus.
Optionally, the similarity calculating unit calculating the similarity between the crawled business-related document and the corpus specifically includes: for each crawled business-related document, splitting the business-related document into individual words; determining the word frequency of each word obtained by splitting; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each type of industry code in the preset industry classification codes.
Optionally, the third processing unit determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
When three-new enterprises are screened, first enterprises whose industry classification codes match the preset industry classification codes, which represent the industries to which three-new enterprises belong, are first determined from the preset enterprises. A three-new enterprise keyword corpus is then generated based on the industry description documents corresponding to the industries represented by the preset industry classification codes. Keyword matching is then performed between the business scope introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises. Business-related documents corresponding to the second enterprises are then crawled and their similarity to the corpus is calculated. Finally, the second enterprises whose business-related documents reach the preset similarity are determined to be three-new enterprises. Through these three rounds of progressive screening, the selected enterprises conform more and more closely to three-new enterprises, so that three-new enterprises can be screened accurately and a reference is provided for their screening.
Drawings
FIG. 1 is a flow chart of an information processing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an information processing system according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart illustrating an information processing method according to an embodiment of the present invention. As shown in FIG. 1, the information processing method provided in an embodiment of the present invention includes:
S101: determining, from preset enterprises, first enterprises whose industry classification codes match the preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong.
S102: generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes.
S103: performing keyword matching between the business scope introduction document corresponding to each first enterprise and the corpus, and screening out second enterprises.
S104: crawling the business-related documents corresponding to the second enterprises, and calculating the similarity between the crawled business-related documents and the corpus.
S105: determining the second enterprises to which the business-related documents reaching the preset similarity belong to be three-new enterprises.
For example, the classification scope of three-new enterprises can be derived from relevant documents such as the Outline of the Thirteenth Five-Year Plan for National Economic and Social Development of the People's Republic of China, the Notice of the State Council on Issuing "Made in China 2025", the Guiding Opinions of the State Council on Actively Promoting the "Internet Plus" Action, and the Opinions of the State Council on Several Policy Measures for Vigorously Promoting Mass Entrepreneurship and Innovation, and the relevant industry classification codes can then be selected from the national economic industry classification based on that scope. In one example, the classification scope of three-new enterprises derived from the relevant documents may include modern agriculture and forestry, advanced manufacturing, new energy activities, energy-saving and environmental protection service activities, internet and modern information technology services, modern technical services, modern production activities, and the like. The preset industry classification codes can then be obtained from the national economic industry classification according to this classification scope.
Based on the determined preset industry classification codes, the first enterprises whose industry classification codes match the preset industry classification codes are selected from the preset enterprises. The preset enterprises may be obtained through a specified interface from the requester who asks for the three-new enterprises to be screened, or by crawling with a web crawler according to specified keywords.
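The first round of screening can be sketched as a simple membership filter. This is only an illustration: the `Enterprise` record, the function name `select_first_enterprises`, and the code strings are hypothetical placeholders, and treating "matching" as exact equality of codes is an assumption (a prefix match on a code hierarchy would be another reasonable reading).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enterprise:
    name: str
    industry_code: str  # placeholder code string, not a real classification entry


def select_first_enterprises(enterprises, preset_codes):
    """Round one (S101): keep enterprises whose code is in the preset set."""
    preset = set(preset_codes)
    return [e for e in enterprises if e.industry_code in preset]


# Hypothetical preset enterprises and preset industry classification codes.
preset_enterprises = [
    Enterprise("Alpha Robotics", "CODE-01"),
    Enterprise("Beta Catering", "CODE-77"),
    Enterprise("Gamma Cloud", "CODE-02"),
]
first_enterprises = select_first_enterprises(
    preset_enterprises, {"CODE-01", "CODE-02"}
)
```

The filter keeps the input order, so later rounds can process the first enterprises in the order they were supplied.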
In step S102, generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code may specifically include:
Step one: for the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into individual words.
A preset word segmentation tool can be used to split the industry description document corresponding to each type of industry code into individual words; for example, the jieba library for Python can be used. The jieba library can split each industry description document into individual words according to custom rules.
Step two: determining the word frequency of each word obtained by splitting.
In addition, to reduce noise, words obtained in step one that contribute little to the screening of three-new enterprises or are meaningless may be deleted, for example function words in the document such as interjections, prepositions, and conjunctions, which improves the efficiency of keyword extraction in the subsequent steps.
Step three: extracting keywords based on the determined word frequencies using a preset algorithm to generate the three-new enterprise keyword corpus.
In one example of the present invention, the TF-IDF method may be used to extract keywords based on the determined word frequencies to generate the three-new enterprise keyword corpus, but the present invention is not limited thereto, and other methods may also be used to extract keywords based on the determined word frequencies, such as mutual information, expected cross entropy, information gain, principal component analysis, and genetic algorithms.
In the invention, the TF-IDF method is used to obtain the $t_i\text{-idf}$ value of each word in each document, and words whose $t_i\text{-idf}$ value is greater than a certain threshold are chosen as keywords. The $t_i\text{-idf}$ value of each word in each industry description document can be obtained by the following formula (1):
$$t_i\text{-idf} = f_i \cdot \log\left(\frac{N}{df_i}\right) \qquad (1)$$
where $f_i$ is the word frequency, i.e., the number of times the $i$-th word appears in the industry description document; $df_i$ is the document frequency, i.e., the number of industry description documents in which the $i$-th word appears; and $N$ is the total number of industry description documents. The specific threshold for the $t_i\text{-idf}$ value can be determined according to actual conditions, as long as the resulting keywords screen out as many qualifying three-new enterprises as possible while keeping the processing complexity as low as possible.
After the word frequency of each word in each industry description document is obtained in step two, the $t_i\text{-idf}$ value of each word is obtained using formula (1), and the words whose $t_i\text{-idf}$ value is greater than the threshold are selected as keywords, thereby generating the three-new enterprise keyword corpus.
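Steps one to three can be sketched as follows, applying formula (1) to each word. This is a minimal sketch under stated assumptions: the function names, the stop-word list, the sample documents, and the threshold of 0.5 are all illustrative; the natural logarithm is assumed because the text does not fix a base; and whitespace splitting stands in for a real segmenter (for Chinese text, a segmentation function such as `jieba.lcut` could be passed as `tokenize`).

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption); a real list is language-specific.
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}


def word_frequencies(text, tokenize=str.split, stop_words=STOP_WORDS):
    """Steps one and two: split a document into words, drop noise, count."""
    words = (w.lower() for w in tokenize(text))
    return Counter(w for w in words if w not in stop_words)


def build_keyword_corpus(industry_docs, threshold, tokenize=str.split):
    """Step three: keep words whose f_i * log(N / df_i) exceeds the threshold.

    industry_docs maps an industry code to its description text; the result
    maps each code to a {keyword: score} dictionary.
    """
    freqs = {code: word_frequencies(text, tokenize)
             for code, text in industry_docs.items()}
    n_docs = len(freqs)
    doc_freq = Counter()
    for counts in freqs.values():
        doc_freq.update(counts.keys())  # each document counts a word once
    corpus = {}
    for code, counts in freqs.items():
        scores = {w: f * math.log(n_docs / doc_freq[w])
                  for w, f in counts.items()}
        corpus[code] = {w: s for w, s in scores.items() if s > threshold}
    return corpus


# Hypothetical industry description documents keyed by placeholder codes.
docs = {
    "code_A": "industrial robot robot manufacturing",
    "code_B": "cloud computing services for manufacturing",
}
corpus = build_keyword_corpus(docs, threshold=0.5)
```

In this toy input the word "manufacturing" appears in both documents, so its score is zero and it is not selected as a keyword for either code, which is the behavior formula (1) is meant to produce for words that do not discriminate between industries.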
In step S103, keyword matching is performed between the business scope introduction document corresponding to each first enterprise obtained in step S101 and the keyword corpus generated in step S102, so as to screen out the second enterprises associated with the keywords. The business scope introduction document corresponding to a first enterprise may be obtained from a relevant party, for example the requester who asks for the three-new enterprises to be screened, or by crawling with a web crawler. In one example of the present invention, the match function of the R language may be used to match the business scope introduction document corresponding to each first enterprise against the keyword corpus generated in step S102, so that the second enterprises associated with keywords in the keyword corpus are screened out automatically. The industry classification code of an enterprise may not reflect the business it actually operates, that is, the code may deviate from the enterprise's actual business, so the first enterprises determined by industry classification code may include many enterprises that are not three-new enterprises. The second round of screening by keyword matching in step S103 can therefore improve the accuracy of three-new enterprise screening.
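The second round of screening can be sketched in Python as a set intersection between the words of a business scope introduction document and the keyword corpus. This is a stand-in for the R-based matching mentioned above, not a reproduction of it; the function names, the `min_hits` parameter, and the sample data are assumptions, and whitespace splitting again stands in for a real word segmenter.

```python
def match_keywords(scope_text, keywords, tokenize=str.split):
    """Return the corpus keywords that occur in a business scope document."""
    words = {w.lower() for w in tokenize(scope_text)}
    return words & set(keywords)


def select_second_enterprises(scope_docs, keywords, min_hits=1):
    """Round two (S103): keep enterprises with at least min_hits keyword hits.

    scope_docs maps an enterprise name to its business scope text.
    """
    return [name for name, text in scope_docs.items()
            if len(match_keywords(text, keywords)) >= min_hits]


# Hypothetical keyword corpus and business scope introduction documents.
keywords = {"robot", "cloud", "computing"}
scope_docs = {
    "Alpha Robotics": "design and sale of industrial robot systems",
    "Delta Trading": "wholesale of general merchandise",
}
second_enterprises = select_second_enterprises(scope_docs, keywords)
```

Raising `min_hits` makes the round stricter; the text does not state how many keyword hits are required, so a single hit is assumed here.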
In step S104, a crawler package of the Python programming language, such as selenium or bs4, may be used to crawl the web in real time for the business-related documents corresponding to the second enterprise. The business-related documents may include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents. Calculating the similarity between the crawled business-related documents and the keyword corpus obtained in step S102 may specifically include:
Step one: for each crawled business-related document, splitting the business-related document into individual words.
The manner of splitting each business-related document into individual words in this step may be the same as the manner of splitting the industry description document corresponding to each type of industry code into individual words in step S102.
Step two: determining the word frequency of each word obtained by splitting.
Similarly, to reduce noise, words obtained in step one that contribute little to the screening of three-new enterprises or are meaningless may be deleted, for example function words in the document such as interjections, prepositions, and conjunctions, which improves the efficiency of the subsequent steps.
Step three: calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each type of industry code in the preset industry classification codes.
In the invention, the similarity between each business related document and the industry description document corresponding to each type of industry code can be calculated by using a vector included angle cosine method.
Specifically, first, in a certain order (for example, the order in which the words appear in the document), the word frequencies corresponding to the words are assembled into a word frequency vector $A_i = [x_1, x_2, \ldots, x_n]$, where $x_1, x_2, \ldots, x_n$ are the word frequencies of the $n$ keywords of the business-related document. Similarly, for the industry description document corresponding to the $i$-th type of industry code in the preset industry classification codes, a vector can be constructed from the split words and their word frequencies: $B_i = [y_1, y_2, \ldots, y_n]$, where $y_1, y_2, \ldots, y_n$ are the word frequencies of the $n$ keywords of the industry description document corresponding to that industry code.
Then, based on the constructed vectors, the similarity $\cos\theta$ between each business-related document and the industry description document corresponding to each type of industry code is determined by the following formula (2):
$$\cos\theta = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}} \qquad (2)$$
Thus, by using the above formula (2), the similarity between each business-related document and the industry description document corresponding to each type of industry code can be obtained.
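The cosine calculation can be sketched as below. One point the text leaves open is how the two vectors are aligned to the same n positions; this sketch assumes they are built over the union of the two documents' words, with a frequency of zero for a word absent from one document. The function name and sample frequencies are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors.

    Both vectors are laid out over the sorted union of the two vocabularies
    (an assumption), so that position k refers to the same word in each.
    """
    vocab = sorted(set(freq_a) | set(freq_b))
    x = [freq_a.get(w, 0) for w in vocab]
    y = [freq_b.get(w, 0) for w in vocab]
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk * xk for xk in x))
    norm_y = math.sqrt(sum(yk * yk for yk in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_x * norm_y)


# Hypothetical word frequencies for one business-related document and one
# industry description document.
business_doc = Counter({"robot": 2, "arm": 1})
industry_doc = Counter({"robot": 2, "industrial": 1})
similarity = cosine_similarity(business_doc, industry_doc)
```

Here the only shared word is "robot", so the dot product is 4 and each vector has norm $\sqrt{5}$, giving a similarity of 0.8.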
In step S105, determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise. Specifically, if the similarity calculated in step S104 between a business-related document and the industry description document corresponding to a type of industry code reaches the preset similarity, for example 0.7, the second enterprise to which that business-related document belongs is determined to be a three-new enterprise.
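The decision rule of step S105 can be sketched as an "at least one code reaches the threshold" test. The function names and sample similarities are hypothetical, "reaches" is read as greater than or equal to, and each enterprise is assumed to be summarized by one similarity value per industry code (for example, the best value over its crawled documents).

```python
def is_three_new(similarities, preset_similarity=0.7):
    """True if at least one industry code reaches the preset similarity.

    similarities maps an industry code to the similarity between the
    enterprise's business-related document and that code's description.
    """
    return any(s >= preset_similarity for s in similarities.values())


def select_three_new(enterprise_similarities, preset_similarity=0.7):
    """Round three (S105): keep the second enterprises that pass the test."""
    return [name for name, sims in enterprise_similarities.items()
            if is_three_new(sims, preset_similarity)]


# Hypothetical similarities per second enterprise and placeholder code.
enterprise_similarities = {
    "Alpha Robotics": {"code_A": 0.8, "code_B": 0.1},
    "Epsilon Foods": {"code_A": 0.2, "code_B": 0.35},
}
three_new = select_three_new(enterprise_similarities)
```

The value 0.7 is the example threshold given in the text; in practice it would be tuned to the data.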
In summary, when three-new enterprises are screened, a first round of screening is performed based on the preset industry codes representing the industries to which three-new enterprises belong; a second round of screening is performed by keyword matching between the business scope introduction documents of the enterprises selected in the first round and the keyword corpus generated from the industry description documents corresponding to the industries represented by the preset industry codes; and the similarity between the business-related documents of the enterprises selected in the second round and the keyword corpus is then calculated, with the enterprises whose similarity reaches the preset similarity selected as three-new enterprises.
Based on the same inventive concept, an embodiment of the present invention further provides an information processing system. Since the principle by which the system solves the problem is similar to that of the aforementioned information processing method, the implementation of the system can refer to the implementation of the method, and repeated details are omitted.
As shown in FIG. 2, the information processing system provided in the embodiment of the present invention includes:
a first processing unit 201, configured to determine, from preset enterprises, first enterprises whose industry classification codes match the preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three-new enterprises belong;
a corpus generating unit 202, configured to generate a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code;
a second processing unit 203, configured to perform keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise;
a similarity calculation unit 204, configured to crawl a business-related document corresponding to the second enterprise and calculate the similarity between the crawled business-related document and the corpus;
a third processing unit 205, configured to determine the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise.
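The way the five units hand results to one another can be sketched as a small class whose fields are pluggable callables. This only illustrates the data flow between units 201 to 205; the class name, the field names, the call signatures, and the stub implementations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ScreeningSystem:
    first_processing: Callable[[List[str]], List[str]]          # unit 201
    corpus_generating: Callable[[], Dict[str, Dict[str, float]]]  # unit 202
    second_processing: Callable[[List[str], dict], List[str]]   # unit 203
    similarity_calculating: Callable[[List[str], dict], dict]   # unit 204
    third_processing: Callable[[dict], List[str]]               # unit 205

    def run(self, preset_enterprises: List[str]) -> List[str]:
        """Chain the units in the order of the three screening rounds."""
        first = self.first_processing(preset_enterprises)
        corpus = self.corpus_generating()
        second = self.second_processing(first, corpus)
        similarities = self.similarity_calculating(second, corpus)
        return self.third_processing(similarities)


# Stub units with hard-coded behavior, only to show the flow of data.
system = ScreeningSystem(
    first_processing=lambda ents: [e for e in ents if e != "Beta Catering"],
    corpus_generating=lambda: {"code_A": {"robot": 1.4}},
    second_processing=lambda first, corpus: [e for e in first if "Robot" in e],
    similarity_calculating=lambda second, corpus: {
        e: {"code_A": 0.8} for e in second
    },
    third_processing=lambda sims: [
        e for e, s in sims.items() if max(s.values()) >= 0.7
    ],
)
result = system.run(["Alpha Robotics", "Beta Catering", "Gamma Cloud"])
```

Keeping each unit as an injected callable means any one round can be replaced, for example a different similarity measure, without touching the others.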
In an exemplary embodiment of the invention, the business-related documents include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents, which are crawled in real time by a crawler package of the Python programming language such as selenium or bs4.
In an exemplary embodiment of the present invention, the corpus generating unit 202 generating the three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes specifically includes: splitting the industry description document corresponding to each type of industry code in the preset industry classification codes into individual words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequencies using a preset algorithm to generate the three-new enterprise keyword corpus.
In an exemplary embodiment of the present invention, the similarity calculation unit 204 calculating the similarity between the crawled business-related documents and the corpus specifically includes: splitting each crawled business-related document into individual words; determining the word frequency of each word obtained by splitting; and calculating the similarity between the words and word frequencies obtained from the business-related document and the words and word frequencies obtained from the industry description document corresponding to each type of industry code in the preset industry classification codes.
In an exemplary embodiment of the present invention, the third processing unit 205 determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the present invention and shall all be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. An information processing method, comprising:
determining, from among preset enterprises, first enterprises whose industry classification codes conform to preset industry classification codes, wherein the industries represented by the preset industry classification codes are industries to which three-new enterprises belong;
generating a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes;
performing keyword matching between the operation range introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises;
crawling business-related documents corresponding to the second enterprises, and performing similarity calculation on the crawled business-related documents and the corpus;
determining a second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise;
wherein performing similarity calculation on the crawled business-related documents and the corpus specifically comprises:
for each crawled business-related document, splitting the business-related document into single words;
determining the word frequency of each word obtained by splitting;
and performing similarity calculation between the words obtained by splitting the business-related document, together with their word frequencies, and the words obtained by splitting the industry description document corresponding to each type of industry code in the preset industry classification codes, together with their word frequencies.
2. The method of claim 1, wherein the business-related documents comprise full-text or fragments of related product introduction, related product usage instructions, software works, trademarks, patents.
3. The method according to claim 1 or 2, wherein generating a corpus of three new enterprise keywords based on industry description documents corresponding to industries characterized by the preset industry classification codes specifically comprises:
aiming at the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words;
determining the word frequency of each word obtained by splitting;
and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.
4. The method according to claim 1, wherein determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically comprises:
if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
5. An information processing system, comprising:
a first processing unit, configured to determine, from among preset enterprises, first enterprises whose industry classification codes conform to preset industry classification codes, wherein the industries represented by the preset industry classification codes are industries to which three-new enterprises belong;
a corpus generating unit, configured to generate a three-new enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes;
a second processing unit, configured to perform keyword matching between the operation range introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises;
a similarity calculation unit, configured to crawl business-related documents corresponding to the second enterprises and to perform similarity calculation on the crawled business-related documents and the corpus;
a third processing unit, configured to determine a second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise;
wherein the similarity calculation unit performing similarity calculation on the crawled business-related documents and the corpus specifically comprises:
for each crawled business related document, splitting the business related document into single words;
determining the word frequency of each word obtained by splitting;
and respectively carrying out similarity calculation on the words obtained by splitting the business related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry codes in the preset industry classification codes and the corresponding word frequencies.
6. The system of claim 5, wherein the business-related documents comprise full-text or fragments of related product introduction, related product usage instructions, software works, trademarks, patents.
7. The system according to claim 5 or 6, wherein the corpus generating unit generates a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code, and specifically includes:
aiming at the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words;
determining the word frequency of each word obtained by splitting;
and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.
8. The system according to claim 5, wherein the third processing unit determining the second enterprise to which a business-related document reaching the preset similarity belongs to be a three-new enterprise specifically comprises:
if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs to be a three-new enterprise.
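As an informal illustration only, and not part of the claims, the corpus generation and keyword screening steps recited in claims 1 and 3 (and their system counterparts) might be sketched as follows. The claims leave the "preset algorithm" unspecified, so taking the `top_k` most frequent words per industry description document is an assumption, as are the `min_hits` matching rule, the regex tokenizer, and all names used.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into single words (a stand-in for a real word segmenter)."""
    return re.findall(r"\w+", text.lower())


def build_keyword_corpus(industry_docs, top_k=5):
    """Collect the top_k most frequent words of each industry description
    document into one keyword corpus (assumed extraction rule)."""
    corpus = set()
    for description in industry_docs.values():
        counts = Counter(tokenize(description))
        corpus.update(word for word, _ in counts.most_common(top_k))
    return corpus


def screen_second_enterprises(operation_ranges, corpus, min_hits=1):
    """Keep enterprises whose operation range introduction document contains
    at least min_hits corpus keywords (assumed matching rule)."""
    selected = []
    for name, range_text in operation_ranges.items():
        hits = corpus.intersection(tokenize(range_text))
        if len(hits) >= min_hits:
            selected.append(name)
    return selected
```

In practice a stop-word filter or a weighting scheme would likely replace raw counts, since the most frequent words in a description document are often uninformative.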
CN201710506158.XA 2017-06-28 2017-06-28 information processing method and system Active CN107357851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710506158.XA CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710506158.XA CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Publications (2)

Publication Number Publication Date
CN107357851A (en) 2017-11-17
CN107357851B (en) 2020-01-31

Family

ID=60273239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710506158.XA Active CN107357851B (en) 2017-06-28 2017-06-28 information processing method and system

Country Status (1)

Country Link
CN (1) CN107357851B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801118A (en) * 2018-12-24 2019-05-24 航天信息股份有限公司 Identify method, apparatus, medium and the equipment of the manufacturing business of designated trade
CN113076979B (en) * 2021-03-23 2024-05-17 广州快必妥营销策划咨询有限公司 Qualified crop screening method, crop cultivation control method, system and device
CN113869639B (en) * 2021-08-26 2023-11-07 中国环境科学研究院 Yangtze river basin enterprise screening method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101127050A (en) * 2007-07-03 2008-02-20 北京大学 Method for automatically extracting website owner administrative apanage information from web page
CN102073692A (en) * 2010-12-16 2011-05-25 北京农业信息技术研究中心 Agricultural field ontology library based semantic retrieval system and method
JP4791169B2 (en) * 2005-12-12 2011-10-12 ヤフー株式会社 Related word extraction device and related word extraction method
CN106682145A (en) * 2016-12-22 2017-05-17 北京览群智数据科技有限责任公司 Enterprise information processing method, server and client

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于多种数据源的中文知识图谱构建方法研究";胡芳槐;《中国博士学位论文全文数据库 信息科技辑》;20150515(第2015年05期);I138-112 *

Also Published As

Publication number Publication date
CN107357851A (en) 2017-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District

Patentee after: Guoxin Youyi Data Co., Ltd

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Patentee before: SIC YOUE DATA Co.,Ltd.
