CN107357851B

CN107357851B - information processing method and system

Info

Publication number: CN107357851B
Application number: CN201710506158.XA
Authority: CN
Inventors: 夏耘海; 张斌德; 王江
Original assignee: Guoxin Youe Data Co Ltd
Current assignee: Guoxin Youe Data Co Ltd
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2020-01-31
Anticipated expiration: 2037-06-28
Also published as: CN107357851A

Abstract

The invention discloses an information processing method, which comprises: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong; generating a three-new-enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; performing keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises; crawling business-related documents corresponding to the second enterprises and performing similarity calculation between the crawled business-related documents and the corpus; and determining the second enterprises to which the business-related documents reaching a preset similarity belong as three new enterprises. The invention also discloses an information processing system.

Description

Information processing method and system

Technical Field

The invention relates to an information processing method and system, and in particular to a method and system for identifying three new enterprises.

Background

With the rapid development of China's economy, new enterprises and new economic activities continue to appear. As the most important actors in the social economy, enterprises play an important role in the economy, and organizing and analyzing enterprise information helps decision makers understand how enterprises are operating and discover potential operating risks.

For example, for the recently emerged three new enterprises (covering new industries, new business forms, and new business models), which are a focus of attention of the Party Central Committee and the State Council, the relevant personnel need to make statistical observations on the development scale, structure, and quality of the economic activities of such enterprises, so as to know these in real time and provide a reference basis for future decisions. The key to such statistical observation is accurately knowing which of the many enterprises under investigation are three new enterprises. Therefore, enterprises need to be accurately screened to identify the three new enterprises that meet the requirements. However, there is currently no solution for accurately screening three new enterprises.

Disclosure of Invention

The technical problem to be solved by the embodiments of the invention is to provide a scheme that saves time and labor and accurately screens three new enterprises.

The invention provides an information processing method for accurately and effectively screening three new enterprises, which comprises: determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong; generating a three-new-enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; performing keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises; crawling business-related documents corresponding to the second enterprises and performing similarity calculation between the crawled business-related documents and the corpus; and determining the second enterprises to which the business-related documents reaching a preset similarity belong as three new enterprises.

Optionally, the business-related documents include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents.

Optionally, generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code specifically includes: aiming at the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.

Optionally, the similarity calculation of the crawled business-related document and the corpus specifically includes: for each crawled business related document, splitting the business related document into single words; determining the word frequency of each word obtained by splitting; and respectively carrying out similarity calculation on the words obtained by splitting the business related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry codes in the preset industry classification codes and the corresponding word frequencies.

Optionally, determining the second enterprise to which the business-related document reaching the preset similarity belongs as three new enterprises specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as three new enterprises.

Another embodiment of the invention provides an information processing system, which includes: a first processing unit configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, where the industries represented by the preset industry classification codes are the industries to which three new enterprises belong; a corpus generating unit configured to generate a three-new-enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes; a second processing unit configured to perform keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus to screen out second enterprises; a similarity calculating unit configured to crawl business-related documents corresponding to the second enterprises and perform similarity calculation between the crawled business-related documents and the corpus; and a third processing unit configured to determine the second enterprises to which the business-related documents reaching a preset similarity belong as three new enterprises.

Optionally, the corpus generating unit generates a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code, and specifically includes: aiming at the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.

Optionally, the similarity calculation unit performs similarity calculation on the crawled business-related document and the corpus, and specifically includes: for each crawled business related document, splitting the business related document into single words; determining the word frequency of each word obtained by splitting; and respectively carrying out similarity calculation on the words obtained by splitting the business related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry codes in the preset industry classification codes and the corresponding word frequencies.

Optionally, the third processing unit determining the second enterprise to which the business-related document reaching the preset similarity belongs as three new enterprises specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as three new enterprises.

When three new enterprises are screened, first, the first enterprises whose industry classification codes match the preset industry classification codes, which represent the industries to which three new enterprises belong, are determined from the preset enterprises. Then, a three-new-enterprise keyword corpus is generated based on the industry description documents corresponding to the industries represented by the preset industry classification codes. Next, keyword matching is performed between the business scope introduction documents corresponding to the first enterprises and the corpus to screen out the second enterprises. After that, business-related documents corresponding to the second enterprises are crawled, and similarity calculation is performed between the crawled business-related documents and the corpus. Finally, the second enterprises to which the business-related documents reaching the preset similarity belong are determined as three new enterprises. Through these three rounds of progressive screening, the screened enterprises are increasingly likely to be three new enterprises, so that three new enterprises can be accurately screened and a reference basis is provided for screening them.

Drawings

FIG. 1 is a flow chart of an information processing method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an information processing system according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart illustrating an information processing method according to an embodiment of the present invention. As shown in fig. 1, an information processing method provided in an embodiment of the present invention includes:

S101, determining, from preset enterprises, first enterprises whose industry classification codes match the preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong.

S102, generating a three-new-enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes.

S103, performing keyword matching between the business scope introduction documents corresponding to the first enterprises and the corpus, and screening out second enterprises.

S104, crawling the business-related documents corresponding to the second enterprises, and performing similarity calculation between the crawled business-related documents and the corpus.

S105, determining the second enterprises to which the business-related documents reaching the preset similarity belong as three new enterprises.

For example, the classification scope of three new enterprises can be obtained from documents such as the Outline of the Thirteenth Five-Year Plan for National Economic and Social Development of the People's Republic of China, the Notice of the State Council on Issuing "Made in China 2025", the Guiding Opinions of the State Council on Actively Advancing the "Internet Plus" Action, and the Opinions of the State Council on Several Policy Measures for Vigorously Advancing Mass Entrepreneurship and Innovation. The relevant industry classification codes can then be selected from the national economic industry classification based on that scope. In one example, the classification scope of three new enterprises obtained from these documents may include modern agriculture and forestry, advanced manufacturing, new energy activities, energy-saving and environmental protection service activities, internet and modern information technology services, modern technical services, and modern production activities, among others, and the preset industry classification codes can be determined according to this classification scope.

Based on the determined preset industry classification codes, the first enterprises whose industry classification codes match the preset industry classification codes are selected from the preset enterprises. The preset enterprises can be obtained through a specified interface from a requester who requests screening of three new enterprises, or obtained by a web crawler according to specified keywords.
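The first-round selection in step S101 can be illustrated with a minimal Python sketch. The `Enterprise` record, the enterprise names, and the code values below are hypothetical, and "matching" is assumed here to mean exact membership in the preset code set; the patent does not specify the comparison rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enterprise:
    """Hypothetical enterprise record used only for illustration."""
    name: str
    industry_code: str


def select_first_enterprises(enterprises, preset_codes):
    """Keep enterprises whose industry classification code is in the preset set."""
    preset = set(preset_codes)
    return [e for e in enterprises if e.industry_code in preset]


# Illustrative data: the codes are placeholders, not real classification codes.
enterprises = [
    Enterprise("Alpha Robotics", "C3491"),
    Enterprise("Beta Bakery", "C1411"),
    Enterprise("Gamma Cloud", "I6450"),
]
preset_codes = {"C3491", "I6450"}

first_enterprises = select_first_enterprises(enterprises, preset_codes)
print([e.name for e in first_enterprises])
```

If the preset codes denote whole categories rather than leaf codes, a prefix comparison could be used instead of set membership.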

In step S102, generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code may specifically include:

Step one, for the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words.

A preset word segmentation tool can be adopted to split the industry description document corresponding to each type of industry code into single words, for example, a jieba library in python can be adopted. The jieba library can split each industry description document into individual words according to custom rules.

And step two, determining the word frequency of each word obtained by splitting.

In addition, in order to reduce noise, words obtained in step one that are meaningless or contribute little to the screening of three new enterprises may be deleted, for example, function words in the document such as interjections, prepositions, and conjunctions, so as to improve the efficiency of extracting the keywords in the subsequent steps.
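Steps one and two, together with the noise reduction just described, can be sketched as follows. This is a minimal illustration using only the Python standard library: the regular-expression tokenizer stands in for a real word segmentation tool (for Chinese text a segmenter such as jieba would be substituted), and the stop-word list is a hypothetical example.

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for interjections, prepositions,
# conjunctions, and similar function words.
STOP_WORDS = {"the", "and", "of", "for", "in", "a", "oh"}


def split_words(text):
    """Split a document into lowercase words (stand-in for a real segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each remaining word occurs after stop-word removal."""
    return Counter(w for w in split_words(text) if w not in stop_words)


doc = "Manufacture of industrial robots and the servicing of robots"
freq = word_frequencies(doc)
print(freq.most_common(1))
```

Removing function words before counting keeps the later keyword extraction from being dominated by terms that appear in every document.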

And thirdly, extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.

In one example of the present invention, the TF-IDF method may be used to extract keywords based on the determined word frequencies to generate the three-new-enterprise keyword corpus. However, the present invention is not limited thereto, and other methods may be used to extract keywords based on the determined word frequencies, such as mutual information, expected cross entropy, information gain, principal component analysis, and genetic algorithms.

In the invention, the TF-IDF method is used to obtain the $t_i$-idf value of each word in each document, and words whose $t_i$-idf value is greater than a certain threshold are chosen as keywords. The $t_i$-idf value of each word in each industry description document can be obtained by the following equation (1):

$$t_i\text{-idf} = f_i \cdot \log\left(\frac{N}{df_i}\right) \quad (1)$$

where $f_i$ is the word frequency, representing the number of times the i-th word appears in the industry description document; $df_i$ is the document frequency, representing the number of industry description documents in which the i-th word appears; and $N$ is the number of all industry description documents. The specific threshold for the $t_i$-idf value can be determined according to actual conditions, as long as the resulting keywords screen out as many qualifying three new enterprises as possible while keeping the processing complexity as low as possible.

Using the word frequency of each word of each industry description document obtained in step two, the $t_i$-idf value of each word is obtained with equation (1), and the words whose $t_i$-idf value is greater than a certain threshold are then selected as keywords, thereby generating the three-new-enterprise keyword corpus.
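Equation (1) and the threshold-based selection can be sketched in Python as below. The function name, the toy documents, and the threshold value are assumptions made for illustration, and the natural logarithm is used because the patent does not state the base.

```python
import math
from collections import Counter


def tf_idf_keywords(docs_tokens, threshold):
    """Apply equation (1) per document and keep words scoring above the threshold.

    docs_tokens maps an industry code to the word list of its description document.
    """
    n_docs = len(docs_tokens)
    # df[w]: number of industry description documents that contain word w.
    df = Counter()
    for tokens in docs_tokens.values():
        df.update(set(tokens))
    keywords = {}
    for code, tokens in docs_tokens.items():
        tf = Counter(tokens)  # f_i: occurrences of each word in this document
        scores = {w: f * math.log(n_docs / df[w]) for w, f in tf.items()}
        keywords[code] = {w: s for w, s in scores.items() if s > threshold}
    return keywords


# Toy documents keyed by hypothetical industry codes "A" and "B".
docs = {
    "A": ["robot", "robot", "service", "manufacture"],
    "B": ["cloud", "service", "data", "cloud"],
}
kw = tf_idf_keywords(docs, threshold=1.0)
corpus = set().union(*kw.values())  # merged three-new-enterprise keyword corpus
print(sorted(corpus))
```

A word such as "service" that appears in every document gets a score of zero, which is why it never enters the corpus regardless of its frequency.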

In step S103, the business scope introduction documents corresponding to the first enterprises obtained in step S101 are keyword-matched against the keyword corpus generated in step S102 to screen out the second enterprises associated with the keywords. The business scope introduction document corresponding to a first enterprise may be obtained from a related person, for example, a requester who requests screening of three new enterprises, or obtained by a web crawler. In one example of the present invention, the match function in the R language may be used to perform the keyword matching, so that the second enterprises associated with the keywords in the keyword corpus are screened out automatically. The industry classification code of an enterprise may not represent the business the enterprise actually operates, that is, the code may deviate from the actual business, so the first enterprises determined by classification code may include many enterprises that are not three new enterprises. Screening the first enterprises further by keyword matching in step S103 therefore improves the accuracy of screening three new enterprises.
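The second-round screening can be sketched as a set intersection between the words of each business scope introduction document and the keyword corpus. This Python version is only an analogue of the R `match` approach mentioned above; the `min_hits` parameter and the sample data are hypothetical.

```python
def matches_corpus(scope_words, corpus, min_hits=1):
    """Return True if at least min_hits distinct corpus keywords occur in the scope."""
    return len(set(scope_words) & set(corpus)) >= min_hits


def select_second_enterprises(scopes, corpus, min_hits=1):
    """Keep the first enterprises whose business scope mentions corpus keywords."""
    return [name for name, words in scopes.items()
            if matches_corpus(words, corpus, min_hits)]


corpus = {"robot", "cloud"}
# Business scope documents of first enterprises, already split into words.
scopes = {
    "Alpha Robotics": ["robot", "design", "sales"],
    "Gamma Trading": ["wholesale", "garments"],
}
second_enterprises = select_second_enterprises(scopes, corpus)
print(second_enterprises)
```

Raising `min_hits` makes this round stricter, at the cost of possibly discarding enterprises whose scope text is short.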

In step S104, a crawler package of the Python programming language, such as selenium or bs4, may be used to crawl the web in real time for the business-related documents corresponding to the second enterprises. The business-related documents may include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents. The similarity calculation between the crawled business-related documents and the keyword corpus obtained in step S102 may specifically include:

Step one, for each crawled business-related document, splitting the business-related document into single words.

The manner of splitting each business-related document into single words in this step may be the same as the manner of splitting the industry description document corresponding to each type of industry code into single words in the foregoing step S102.

And secondly, determining the word frequency of each word obtained by splitting.

Similarly, in order to reduce noise, words obtained in step one that are meaningless or contribute little to the screening of three new enterprises may be deleted, for example, function words in the document such as interjections, prepositions, and conjunctions, thereby improving the efficiency of the subsequent steps.

And thirdly, respectively carrying out similarity calculation on the words obtained by splitting the business related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry codes in the preset industry classification codes and the corresponding word frequencies.

In the invention, the similarity between each business related document and the industry description document corresponding to each type of industry code can be calculated by using a vector included angle cosine method.

Specifically, first, the word frequencies corresponding to the words are arranged in a certain order, for example, the order in which the words appear in the document, to construct a word frequency vector $A_i = [x_1, x_2, \ldots, x_n]$, where $x_1, x_2, \ldots, x_n$ are the word frequencies of the n keywords of the business-related document. Similarly, for the industry description document corresponding to the i-th type of industry code in the preset industry classification codes, a vector can be constructed from the split words and the corresponding word frequencies as $B_i = [y_1, y_2, \ldots, y_n]$, where $y_1, y_2, \ldots, y_n$ are the word frequencies of the n keywords of the industry description document corresponding to that industry code.

Then, based on the constructed vectors, the similarity $\cos\theta$ between each business-related document and the industry description document corresponding to each type of industry code is determined by the following formula (2):

$$\cos\theta = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}} \quad (2)$$

Thus, by using the above formula (2), the similarity between each business-related document and the industry description document corresponding to each type of industry code can be obtained.
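The vector included angle cosine calculation can be sketched in Python as follows. Aligning the two word-frequency vectors over a shared, sorted vocabulary is an assumption of this sketch (the patent only says the frequencies are arranged in a certain order), and the sample word lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors.

    Both vectors are built over the union of the two vocabularies so that
    position k refers to the same word in each vector.
    """
    vocab = sorted(set(freq_a) | set(freq_b))
    x = [freq_a.get(w, 0) for w in vocab]
    y = [freq_b.get(w, 0) for w in vocab]
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # An empty document has no direction; treat its similarity as zero.
    return dot / norm if norm else 0.0


business = Counter(["robot", "robot", "arm"])   # business-related document
industry = Counter(["robot", "arm", "arm"])     # industry description document
sim = cosine_similarity(business, industry)
print(round(sim, 3))
```

Because word frequencies are non-negative, the result always lies between 0 and 1, which makes a fixed threshold straightforward to apply.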

In step S105, determining the second enterprise to which the business-related document reaching the preset similarity belongs as three new enterprises specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as three new enterprises. Specifically, among the similarities calculated in step S104 between each business-related document and the industry description document corresponding to each type of industry code, if any similarity reaches the preset similarity, for example, 0.7, the second enterprise to which the corresponding business-related document belongs is determined as three new enterprises.

In summary, when three new enterprises are screened, a first round of screening is performed based on the preset industry codes representing the industries to which three new enterprises belong. A second round of screening is then performed by keyword matching between the business scope introduction documents corresponding to the enterprises screened in the first round and the keyword corpus generated from the industry description documents corresponding to the industries represented by the preset industry codes. Finally, similarity calculation is performed between the business-related documents of the enterprises obtained in the second round and the keyword corpus, and the enterprises whose similarity reaches the preset similarity are selected as three new enterprises.
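The decision rule of step S105 can be sketched as below, assuming "reaches" means greater than or equal to the preset similarity. The data layout (one dictionary of per-industry-code similarities per crawled document) and all names and values are hypothetical.

```python
def document_qualifies(similarities, threshold=0.7):
    """True if the document reaches the threshold for at least one industry code."""
    return any(s >= threshold for s in similarities.values())


def select_three_new(sims_by_enterprise, threshold=0.7):
    """Keep second enterprises that own at least one qualifying document."""
    return [name for name, docs in sims_by_enterprise.items()
            if any(document_qualifies(d, threshold) for d in docs)]


# For each second enterprise: one dict per crawled document, mapping a
# (hypothetical) industry code to the similarity computed in step S104.
sims_by_enterprise = {
    "Alpha Robotics": [{"A": 0.82, "B": 0.10}],
    "Delta Foods": [{"A": 0.30, "B": 0.45}],
}
three_new = select_three_new(sims_by_enterprise)
print(three_new)
```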

Based on the same inventive concept, an embodiment of the present invention further provides an information processing system. Since the principle by which the system solves the problem is similar to that of the aforementioned information processing method, the implementation of the system can refer to the implementation of the method, and repeated details are not described again.

As shown in FIG. 2, the information processing system provided in the embodiment of the present invention includes:

a first processing unit 201, configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong;

a corpus generating unit 202, configured to generate a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code;

a second processing unit 203, configured to perform keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise;

the similarity calculation unit 204 is configured to crawl a business-related document corresponding to the second enterprise, and perform similarity calculation between the crawled business-related document and the corpus;

the third processing unit 205 is configured to determine, as three new businesses, the second business to which the business-related document that reaches the preset similarity belongs.

In an exemplary embodiment of the invention, the business-related documents include all or part of documents such as related product introductions, related product instructions, software works, trademarks, and patents, which are crawled in real time by a crawler package of the Python programming language such as selenium or bs4.

In an exemplary embodiment of the present invention, the corpus generating unit 202 generating the three-new-enterprise keyword corpus based on the industry description documents corresponding to the industries represented by the preset industry classification codes specifically includes: splitting the industry description document corresponding to each type of industry code in the preset industry classification codes into single words; determining the word frequency of each word obtained by splitting; and extracting keywords based on the determined word frequencies by using a preset algorithm to generate the three-new-enterprise keyword corpus.

In an exemplary embodiment of the present invention, the similarity calculation unit 204 performing similarity calculation between the crawled business-related documents and the corpus specifically includes: splitting each crawled business-related document into single words; determining the word frequency of each word obtained by splitting; and performing similarity calculation between the words obtained by splitting the business-related document together with their word frequencies and the words obtained by splitting the industry description document corresponding to each type of industry code in the preset industry classification codes together with their word frequencies.

In an exemplary embodiment of the present invention, the third processing unit 205 determining the second enterprise to which the business-related document reaching the preset similarity belongs as three new enterprises specifically includes: if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that type of industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as three new enterprises.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the present invention, and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An information processing method, comprising:

determining, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong;

generating a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code;

performing keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise;

crawling a business related document corresponding to the second enterprise, and performing similarity calculation on the crawled business related document and the corpus;

determining a second enterprise to which the business related documents reaching the preset similarity belong as three new enterprises;

the similarity calculation of the crawled business-related documents and the corpus specifically comprises the following steps:

for each crawled business related document, splitting the business related document into single words;

determining the word frequency of each word obtained by splitting;

and respectively carrying out similarity calculation on the words obtained by splitting the business related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry codes in the preset industry classification codes and the corresponding word frequencies.

2. The method of claim 1, wherein the business-related documents comprise full-text or fragments of related product introduction, related product usage instructions, software works, trademarks, patents.

3. The method according to claim 1 or 2, wherein generating a corpus of three new enterprise keywords based on industry description documents corresponding to industries characterized by the preset industry classification codes specifically comprises:

aiming at the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words;

determining the word frequency of each word obtained by splitting;

and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.

4. The method according to claim 1, wherein the step of determining the second enterprise to which the business-related documents reaching the preset similarity belong as three new enterprises specifically comprises:

if there is at least one type of industry code for which the similarity between the business-related document and the industry description document corresponding to that industry code reaches the preset similarity, determining the second enterprise to which the business-related document belongs as three new enterprises.

5. An information processing system, comprising:

a first processing unit, configured to determine, from preset enterprises, first enterprises whose industry classification codes match preset industry classification codes, wherein the industries represented by the preset industry classification codes are the industries to which three new enterprises belong;

a corpus generating unit, configured to generate a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code;

a second processing unit, configured to perform keyword matching between the business scope introduction document corresponding to the first enterprise and the corpus to screen out a second enterprise;

the similarity calculation unit is used for crawling the business related documents corresponding to the second enterprise and performing similarity calculation on the crawled business related documents and the corpus;

the third processing unit is used for determining the second enterprise to which the business related documents reaching the preset similarity belong as three new enterprises;

the similarity calculation unit calculates similarity between the crawled business-related document and the corpus, and specifically includes:

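for each crawled business-related document, splitting the business-related document into single words;

determining the word frequency of each word obtained by splitting;

and respectively carrying out similarity calculation on the words obtained by splitting the business-related documents and the corresponding word frequencies and the words obtained by splitting the industry description documents corresponding to each type of industry code in the preset industry classification codes and the corresponding word frequencies.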

6. The system of claim 5, wherein the business-related documents comprise full-text or fragments of related product introduction, related product usage instructions, software works, trademarks, patents.

7. The system according to claim 5 or 6, wherein the corpus generating unit generates a three-new enterprise keyword corpus based on the industry description document corresponding to the industry represented by the preset industry classification code, and specifically includes:

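for the industry description document corresponding to each type of industry code in the preset industry classification codes, splitting the industry description document into single words;

determining the word frequency of each word obtained by splitting;

and extracting keywords based on the determined word frequency by adopting a preset algorithm to generate a three-new enterprise keyword corpus.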

8. The system according to claim 5, wherein the third processing unit determines the second enterprise to which the business-related document that has reached the preset similarity belongs as three new enterprises, and specifically includes: