CN106997340B - Word stock generation method and device and document classification method and device using word stock

Info

Publication number: CN106997340B
Application number: CN201610048630.5A
Authority: CN
Inventors: 肖汉平; 刘健鹏; 沈栋
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2016-01-25
Filing date: 2016-01-25
Publication date: 2020-07-31
Anticipated expiration: 2036-01-25
Also published as: CN106997340A

Abstract

The application discloses a word stock generation method and a document classification method and device using the word stock. The method for generating the word stock comprises the following steps: acquiring title information of each document under each classification category of a specified level in a specified network interaction platform; determining each keyword corresponding to each classification category and the weight distribution of each keyword under the corresponding classification category according to the title information of each document; and forming a word bank according to each classification category, each keyword corresponding to each classification category and the weight of each keyword. The word stock generation method is used for classifying the documents in the target network interaction platform to the specified network interaction platform through the word stock generated by the documents under the classification categories in the specified network interaction platform, so that a foundation is laid for resource integration among the similar network interaction platforms.

Description

Word stock generation method and device and document classification method and device using word stock

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a lexicon and classifying documents using the lexicon.

Background

With the continuous development of internet technology, more and more network interaction platforms appear, and the network interaction platforms provide great convenience for the daily life of people. For example, the advent of the internet e-commerce platform has enabled users to conveniently purchase goods on the internet; the presence of the online game platform enables users to play electronic games through the internet; the network education platform enables users to conveniently learn through a network.

However, while the large number of network interaction platforms brings convenience to people's lives, it also makes resource integration among similar network interaction platforms more difficult. For example, for two network e-commerce platforms A and B, if the commodities of platform B need to be integrated into platform A, this is usually difficult because the commodity classification standards of platform A and platform B are usually different, and even the classification categories themselves may differ. Therefore, when resource integration is performed between network interaction platforms of the same type, it is often hindered by different classification standards and even different classification categories. How to perform resource integration between similar network interaction platforms is thus an urgent problem in the industry.

Disclosure of Invention

The embodiments of the application provide a method and a device for generating a word stock and for classifying documents by using the word stock, which are used for solving the problem in the prior art that resource integration among similar network interaction platforms is usually difficult to perform due to different classification standards and even different classification categories.

The embodiment of the application provides a method for generating a word stock, wherein the word stock is used for classifying documents in a target network interaction platform to a specified network interaction platform, and the method comprises the following steps:

acquiring title information of each document under each classification category of a specified level in a specified network interaction platform;

determining each keyword corresponding to each classification category and weight distribution of each keyword under the corresponding classification category according to the title information of each document, wherein the weight distribution is used for evaluating the distinguishing degree when the classification category where the keyword is located is distinguished from other classification categories except the classification category, and the keyword refers to a specified word according to a preset rule under the classification category corresponding to the keyword;

and forming a word bank according to each classification category, each keyword corresponding to each classification category and the weight of each keyword.

Preferably, the weight distribution of the keywords under the corresponding classification category is determined by the following method:

determining the word frequency of the keywords in the corresponding classification category;

determining the anti-document frequency of the keywords according to the number of documents containing the keywords in the titles of the documents under the classification categories and the total number of the documents under the classification categories;

and determining the weight distribution of the keywords under the corresponding classification categories according to the word frequency of the keywords in the corresponding classification categories and the anti-document frequency of the keywords.

Preferably, the determining, according to the title information of each document, each keyword corresponding to each classification category specifically includes:

normalizing the title of each document;

performing word segmentation processing and entity word identification on the normalized titles of the documents to obtain entity words for describing entity things;

determining the word frequency of each entity word corresponding to each classification category, wherein the word frequency of an entity word describes the number of times or the frequency with which the entity word appears in the corresponding classification category;

and respectively taking each entity word as a current entity word, judging whether the word frequency of the current entity word is greater than a preset threshold value, and if so, determining the current entity word as the keyword of the corresponding classification category.

Preferably, the method further includes, when the word frequency of each entity word corresponding to the current classification category is smaller than a preset threshold, determining the entity word with the largest word frequency in the current classification category as the keyword in the current classification category, where the current classification category includes any one of the classification categories.

Preferably, the generating a word bank according to each classification category, each keyword corresponding to each classification category, and the weight of each keyword specifically includes:

taking each keyword as a current keyword respectively, and forming an association set according to the current keyword, a classification category corresponding to the current keyword and the weight of the current keyword under the corresponding classification category;

and generating a word bank according to each association set.

Preferably, each category of the designated level specifically includes each primary category and/or leaf category.

The embodiment of the application also provides a document classification method, which comprises the following steps:

acquiring title information of a specified document in a target network interaction platform and determining each entity word corresponding to the title of the specified document, wherein the entity words are used for describing entity things;

determining the weight distribution of each entity word under each classification category in the word stock by querying the word stock, wherein the word stock is generated according to any one of the word stock generation methods;

determining the total score of the specified document under each classification category according to the weight distribution of each entity word under each classification category;

and classifying the specified documents according to the respective total scores of the specified documents under the classification categories.

Preferably, the classifying the specified document according to the total score of the specified document under each classification category specifically includes:

generating a queue of all the classification categories ordered from high to low by the total score of the specified document, and selecting a predetermined number of classification categories from the head of the queue in queue order, wherein the predetermined number is at least two;

acquiring an initial category of the specified document in the target network interaction platform;

respectively determining a first similarity between each of the selected classification categories and the initial category, wherein the first similarity describes the similarity between the initial category and that classification category;

respectively determining a second similarity between each of the selected classification categories and the title of the specified document, wherein the second similarity describes the similarity between the title of the specified document and that classification category;

and respectively taking each of the selected classification categories as a pre-selected classification category, carrying out weighted summation on the total score, the first similarity and the second similarity corresponding to the pre-selected classification category, and classifying the specified document according to the result of the weighted summation;

or determining, according to the total score of the specified document under each classification category, the classification category with the maximum total score, and classifying the specified document into that classification category.
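The weighted-summation variant described above can be sketched as follows. This is a minimal illustration only: the text does not specify how the two similarities are computed or which weighting coefficients are used, so the function name `rerank`, the default `weights` tuple and the similarity values passed in are all assumptions.

```python
def rerank(total_scores, category_similarity, title_similarity,
           top_n=2, weights=(0.5, 0.3, 0.2)):
    """Pick a category by weighted sum over the top_n categories by total score.

    total_scores:        {category: total score of the specified document}
    category_similarity: {category: first similarity (vs. the initial category)}
    title_similarity:    {category: second similarity (vs. the document title)}
    weights:             hypothetical coefficients (score, first, second)
    """
    if top_n < 2:
        raise ValueError("at least two categories must be pre-selected")

    # Queue of categories ordered from high to low total score.
    queue = sorted(total_scores, key=total_scores.get, reverse=True)
    preselected = queue[:top_n]

    w_score, w_first, w_second = weights
    combined = {
        category: (w_score * total_scores[category]
                   + w_first * category_similarity.get(category, 0.0)
                   + w_second * title_similarity.get(category, 0.0))
        for category in preselected
    }
    best = max(combined, key=combined.get)
    return best, combined
```

Because total scores and similarities may live on different scales, a practical system would probably normalize them before summing; that step is omitted here.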

Preferably, the determining the total score of the specified document under each classification category through the weight distribution of each entity word under each classification category specifically includes:

and taking each classification category as a current classification category, and summing the weight scores of the entity words under the current classification category to determine the total score of the specified document under the current classification category.
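As a rough sketch of this summation, the following assumes the word stock has been loaded into a dictionary mapping each keyword to its per-category weights. The names `total_scores` and `best_category` and the sample weights are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def total_scores(entity_words, word_stock):
    """Sum, per category, the weights of the entity words found in a title.

    word_stock: {keyword: {category: weight}}; entity words that are not in
    the word stock contribute nothing.
    """
    scores = defaultdict(float)
    for word in entity_words:
        for category, weight in word_stock.get(word, {}).items():
            scores[category] += weight
    return dict(scores)


def best_category(scores):
    """Return the category with the maximum total score, or None if empty."""
    return max(scores, key=scores.get) if scores else None


# Illustrative weights only.
word_stock = {
    "notebook": {"computer": 7.0, "stationery": 5.0},
    "apple": {"mobile phone": 4.0, "computer": 2.0},
}
scores = total_scores(["apple", "notebook"], word_stock)
```

Here `scores` holds one total per category, and `best_category(scores)` corresponds to the simpler variant that classifies the document into the category with the maximum total score.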

Preferably, the determining of each entity word corresponding to the title of the specified document specifically includes:

normalizing the title of the specified document;

and performing word segmentation processing and entity word recognition on the normalized title of the specified document, and determining each entity word corresponding to the title of the specified document.

The embodiment of the present application further provides a device for generating a thesaurus, where the thesaurus is used to classify documents in a target network interaction platform to a specified network interaction platform, and the device includes:

an acquisition unit, a determination unit and a generation unit, wherein:

the acquisition unit is used for acquiring the title information of each document under each classification category of a specified level in a specified network interaction platform;

the determining unit is used for determining each keyword corresponding to each classification category and weight distribution of each keyword under the corresponding classification category according to the title information of each document, wherein the weight distribution is used for evaluating the distinguishing degree when the classification category where the keyword is located is distinguished from other classification categories except the classification categories, and the keyword refers to a specified word according to a preset rule under the classification category corresponding to the keyword;

and the generating unit is used for generating a word bank according to each classification category, each keyword corresponding to each classification category and the weight of each keyword.

An embodiment of the present application further provides a document classification device, where the device includes:

the device comprises an acquisition unit, a query unit, a determination unit and a classification unit, wherein:

the acquisition unit is used for acquiring title information of a specified document in a target network interaction platform and determining each entity word corresponding to the title of the specified document, wherein the entity words are used for describing entity things;

the query unit is used for determining the weight distribution of each entity word under each classification category in the word stock by querying the word stock, wherein the word stock is generated according to the generation device of the word stock;

a determining unit, configured to determine, according to the weight distribution of each entity word under each classification category, a respective total score of the designated document under each classification category;

a classification unit configured to classify the specified document by a total score of the specified document under each classification category, the classification unit including: .

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

the title information of the specified document is obtained, the entity words are determined from the title information, the weight distribution of each entity word under each classification category is then determined by querying the word stock, the total score of the specified document under each classification category is determined from the weight distribution, and the specified document is classified according to the total score. This solves the problem in the prior art that resource integration is difficult to perform among similar network interaction platforms due to different classification standards.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic diagram of a specific implementation flow of a method for generating a lexicon according to embodiment 1 of the present application;

fig. 2 is a schematic diagram of a specific implementation flow of a method for generating a lexicon in an actual application scenario, provided in embodiment 1 of the present application;

fig. 3 is a schematic flowchart of a specific implementation of a document classification method provided in embodiment 2 of the present application;

fig. 4 is a schematic diagram of a specific implementation flow of classifying a specified document by a total score in a document classification method provided in embodiment 2 of the present application;

fig. 5 is a schematic structural diagram of a word stock generation apparatus provided in embodiment 3 of the present application;

fig. 6 is a schematic structural diagram of a document classification apparatus provided in embodiment 4 of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Example 1

Embodiment 1 provides a method for generating a lexicon, wherein the lexicon generated by the method is used for classifying documents in a target network interaction platform into a specified network interaction platform, thereby laying a foundation for resource integration among similar network interaction platforms. The specific flow diagram of the method is shown in fig. 1, and the method comprises the following steps:

Step S11: acquiring the title information of each document under each classification category of the specified level in the specified network interaction platform.

The target network interaction platform is the platform whose resources are to be integrated, and the specified network interaction platform is the platform that receives the integrated resources. For example, when the resources of a first network game platform are integrated into a second network game platform, the first network game platform is the target network interaction platform, and the second network game platform is the specified network interaction platform. In addition, the types of the specified network interaction platform and the target network interaction platform are not limited herein; that is, each may be a network e-commerce platform, a network game platform, etc.

In a network interaction platform, in order to help users find the resources they need, the resources are generally divided into a plurality of classification categories according to a certain standard, and the classification categories may further be organized into several levels. For example, in a network e-commerce platform, the primary classification categories include books, audio and video; electronic appliances; home building materials; and the like. The secondary classification categories under the primary category of books, audio and video include children's books, musical instruments and the like. The classification category of the specified level may be a primary category, a secondary category, a leaf category, or a classification category of another specified level. In practical applications, the classification category of the specified level is preferably a primary category and/or a leaf category, where a leaf category is a sub-category at the end of the category hierarchy in the specified network interaction platform, that is, a category that is not further subdivided in the specified network interaction platform.

In the network interaction platform, various resources are usually embodied in a document form at a server side. For example, on the network e-commerce platform, a document is usually corresponding to each commodity at the server, and the document corresponding to each commodity contains a document title, description information of categories for classifying the commodity, and an address of the document on the network.

Through this step, the title information of each document under each classification category of the specified level in the specified network interaction platform can be obtained.

Step S12: determining each keyword corresponding to each classification category and the weight distribution of each keyword under the corresponding classification category according to the title information of each document.

Here, the weight distribution (that is, the weight score) is used to evaluate how well a keyword distinguishes the classification category in which it is located from the other classification categories, and a keyword is a word specified according to a predetermined rule under the classification category corresponding to the keyword.

In this step, the keywords corresponding to each classification category and the weight distribution of the keywords under the corresponding classification category can be determined by processing the titles of the documents obtained from the title information. In practical applications, there are multiple ways to process the document titles and determine the keywords corresponding to each classification category; one preferred scheme is shown in step 121.

Step 121: normalizing the title of each document; performing word segmentation processing and entity word identification on the normalized titles of the documents to obtain entity words for describing entity things; determining the word frequency of each entity word corresponding to each classification category, wherein the word frequency of an entity word describes the number of times or the frequency with which the entity word appears in the corresponding classification category; and respectively taking each entity word as a current entity word, judging whether the word frequency of the current entity word is greater than a preset threshold value, and if so, determining the current entity word as a keyword of the corresponding classification category.

In practical applications, the normalization processing may include letter case conversion, conversion between traditional and simplified Chinese characters, and tense conversion. Word segmentation of document titles can generally be performed with a word segmentation tool, word2vec being named as a commonly used one. Entity word recognition can typically be accomplished with an entity word library, where entity words are words that describe physical things, such as commodity names, trademarks, book names, game names, and the like. After the entity words are identified, the word frequency of each entity word under each classification category is counted; the word frequency of each entity word is then compared with the preset threshold, and the entity words whose word frequency is greater than the preset threshold are selected as the keywords of the corresponding classification category. For example, suppose there are three entity words under the electronic appliance category, "computer", "mobile phone" and "charger", with word frequencies of 300, 350 and 100 respectively. The preset threshold can be set according to actual needs; if it is set to 200, "computer" and "mobile phone" are determined as the keywords of the electronic appliance category.
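The title-processing part of step 121 can be sketched roughly as below. The sketch makes several simplifying assumptions that are not in the application: normalization is reduced to case folding and whitespace cleanup, segmentation is a plain whitespace split (adequate only for space-delimited text, not Chinese titles), and the entity word library `ENTITY_WORDS` is a tiny hypothetical set.

```python
from collections import Counter, defaultdict

# Hypothetical entity word library; a real one would be far larger.
ENTITY_WORDS = {"computer", "mobile phone", "charger"}
MAX_ENTITY_TOKENS = 2  # length in tokens of the longest library entry


def normalize(title):
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def extract_entity_words(title, entity_words=ENTITY_WORDS):
    """Split on whitespace, then greedily match the longest entity word."""
    tokens = normalize(title).split()
    found = []
    i = 0
    while i < len(tokens):
        for size in range(MAX_ENTITY_TOKENS, 0, -1):
            window = tokens[i:i + size]
            candidate = " ".join(window)
            if candidate in entity_words:
                found.append(candidate)
                i += len(window)
                break
        else:
            i += 1  # no entity word starts at this token
    return found


def count_word_frequencies(documents):
    """documents: iterable of (category, title) pairs.

    Returns {category: Counter of entity word occurrences in that category}.
    """
    frequencies = defaultdict(Counter)
    for category, title in documents:
        frequencies[category].update(extract_entity_words(title))
    return frequencies
```

The resulting per-category counters are the word frequencies that the threshold comparison then operates on.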

It should be noted that, in step 121, under some classification categories even the entity word with the highest word frequency may have a word frequency smaller than the preset threshold, in which case no keyword would be selected and the classification category would be excluded. One way to avoid this is to lower the preset threshold appropriately. In practical applications, a preferred scheme is that, when the word frequency of every entity word corresponding to the current classification category is smaller than the preset threshold, the entity word with the largest word frequency in the current classification category is determined as the keyword of the current classification category, where the current classification category may be any one of the classification categories. For example, if the entity word with the highest word frequency under the musical instrument category is "guitar", with a word frequency of 86 under that category, and the preset threshold is 100, then "guitar" may be determined as the keyword of the musical instrument category in order to prevent that category from being excluded.
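A minimal sketch of the threshold rule together with this fallback is given below. The function name `select_keywords` is hypothetical, and the "piano" frequency is invented for illustration; the other figures are the ones used in the examples above.

```python
def select_keywords(word_frequencies, threshold):
    """Choose keywords per category from {category: {entity word: frequency}}.

    An entity word becomes a keyword if its frequency is greater than the
    threshold. If no word in a category qualifies, the most frequent entity
    word is kept so that the category is not excluded.
    """
    keywords = {}
    for category, frequencies in word_frequencies.items():
        selected = [w for w, f in frequencies.items() if f > threshold]
        if not selected and frequencies:
            selected = [max(frequencies, key=frequencies.get)]
        keywords[category] = selected
    return keywords


electronics = {"electronic appliances": {"computer": 300, "mobile phone": 350, "charger": 100}}
instruments = {"musical instruments": {"guitar": 86, "piano": 40}}  # "piano" is made up

by_threshold = select_keywords(electronics, threshold=200)
by_fallback = select_keywords(instruments, threshold=100)
```

With these inputs, `by_threshold` keeps "computer" and "mobile phone", while `by_fallback` keeps only "guitar" even though its frequency is below the threshold.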

It should also be noted that, in step 121, word segmentation processing may instead be performed on the title of each document first, with normalization processing and entity word recognition then performed on the segmentation result to obtain the entity words for describing entity things.

In step S12, the weight distribution of each keyword under the corresponding classification category is also determined. The word frequency of each keyword under the corresponding classification category may be used directly as the weight distribution, or the weight distribution may be determined in other ways. In practice, a preferred scheme for determining the weight of a keyword under the corresponding classification category is shown in step 122.

Step 122: determining the word frequency of the keywords in the corresponding classification category; determining the anti-document frequency of the keywords according to the number of documents containing the keywords in the titles of the documents under the classification categories and the total number of the documents under the classification categories; and determining the weight distribution of the keywords under the corresponding classification categories according to the word frequency of the keywords in the corresponding classification categories and the anti-document frequency of the keywords.

In step 122, to determine the weight distribution of a keyword under its corresponding classification category, the term frequency (TF) of the keyword under that classification category is determined first. Next, the number n of documents whose titles contain the keyword, counted over the documents under all the classification categories, and the total number N of documents under all the classification categories are determined, and the inverse document frequency (IDF, referred to herein as the anti-document frequency) of the keyword is determined from N and n. Finally, the weight distribution of the keyword under the corresponding classification category is determined according to the word frequency of the keyword under that classification category and the anti-document frequency of the keyword.

In practical applications, the anti-document frequency of the keyword can be determined by the following formula:

IDF=log(N/n)

where N is the total number of documents under all the classification categories, and n is the number of documents, among those under all the classification categories, whose titles contain the keyword. log(N/n) denotes the logarithm of N/n; the base of the logarithm can be set according to actual needs, and in practical applications the natural logarithm ln(N/n) can generally be used.

In practical application, the weight of the keyword under the corresponding classification category can be obtained by multiplying the word frequency of the keyword under the corresponding classification category by the inverse document frequency of the keyword.
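The weight computation of step 122 can be sketched as follows, assuming the natural logarithm. The function names and the document counts used in the example are illustrative assumptions; only the word frequency of 300 echoes the earlier "computer" example.

```python
import math


def inverse_document_frequency(total_docs, docs_with_keyword):
    """IDF = ln(N / n), with N total documents and n documents whose
    titles contain the keyword."""
    if docs_with_keyword <= 0 or total_docs < docs_with_keyword:
        raise ValueError("require 0 < n <= N")
    return math.log(total_docs / docs_with_keyword)


def keyword_weight(term_frequency, total_docs, docs_with_keyword):
    """Weight of a keyword under a category: TF multiplied by IDF."""
    return term_frequency * inverse_document_frequency(total_docs, docs_with_keyword)


# Hypothetical counts: word frequency 300 in the category, and the keyword
# appears in 100 of 10,000 titles overall.
weight = keyword_weight(300, total_docs=10_000, docs_with_keyword=100)
```

One consequence worth noting is that a keyword appearing in every title has IDF = ln(1) = 0 and therefore a weight of 0, which matches the intent that the weight reflects how well the keyword distinguishes one category from the others.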

Step S13: forming a word bank according to each classification category, each keyword corresponding to each classification category and the weight of each keyword.

A word bank is formed according to each classification category, the keywords corresponding to each classification category, and the weight of each keyword under the corresponding classification category. The word bank may include three attribute items: keyword, classification category and weight.

A preferred scheme for generating the word bank is to take each keyword in turn as the current keyword and form an association set from the current keyword, the classification category corresponding to the current keyword, and the weight of the current keyword under that classification category; the word bank is then generated from the association sets.
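One possible in-memory representation of these association sets is sketched below. The `Association` tuple, the helper names and the sample weights are assumptions made for illustration; the application only requires that each entry tie together a keyword, a classification category and a weight.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """One association set: (keyword, classification category, weight)."""
    keyword: str
    category: str
    weight: float


def build_word_stock(weights_by_category):
    """Flatten {category: {keyword: weight}} into a list of association sets."""
    return [
        Association(keyword, category, weight)
        for category, keywords in weights_by_category.items()
        for keyword, weight in keywords.items()
    ]


def index_by_keyword(word_stock):
    """Build {keyword: {category: weight}} for fast lookup at query time."""
    index = {}
    for entry in word_stock:
        index.setdefault(entry.keyword, {})[entry.category] = entry.weight
    return index


# Illustrative weights only.
word_stock = build_word_stock({
    "computer": {"notebook": 7.0},
    "mobile phone": {"apple": 4.0},
    "stationery": {"notebook": 5.0},
})
lookup = index_by_keyword(word_stock)
```

Keeping the flat list mirrors the three attribute items of the word bank, while the keyword index makes it cheap to fetch every category in which a given keyword carries weight.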

With the word stock generation method provided in embodiment 1, the title information of each document under each classification category of a specified level in a specified network interaction platform is obtained, each keyword corresponding to each classification category and the weight distribution of each keyword under the corresponding classification category are determined according to the title information of each document, and a word stock is then generated according to each classification category, each keyword corresponding to each classification category and the weight distribution of each keyword. The word stock generated from the documents under each classification category in the specified network interaction platform is used to classify documents in the target network interaction platform into the specified network interaction platform, thereby laying a foundation for resource integration among similar network interaction platforms.

For ease of understanding, the embodiment of the present application further provides a schematic diagram of implementing the word stock generation method in an actual application scenario, as shown in fig. 2. In this application scenario, the specified network interaction platform is network e-commerce platform A, the target network interaction platform is network e-commerce platform B, and the classification category of the specified level is the leaf category. The method may specifically comprise the following steps:

Step S21: acquiring the title information of each document under each leaf category in the network e-commerce platform A.

Step S22: normalizing the title of each document.

Step S23: performing word segmentation processing and entity word recognition on the normalized titles of the documents to obtain entity words for describing entity things.

Step S24: determining the word frequency of each entity word corresponding to each leaf category.

Step S25: respectively taking each entity word as a current entity word, judging whether the word frequency of the current entity word is greater than a preset threshold value, and if so, executing step S26.

Step S26: determining the current entity word as a keyword of the corresponding leaf category.

Step S27: determining the anti-document frequency of each keyword according to the number of documents under the leaf categories whose titles contain the keyword and the total number of documents under the leaf categories.

Step S28: determining the weight distribution of the keywords under the corresponding leaf categories according to the word frequency of the keywords in the corresponding leaf categories and the anti-document frequency of the keywords.

Step S29: generating a word bank according to the keywords corresponding to the leaf categories and the weights of the keywords.

In practical applications, a word bank generated according to the method is shown in table 1.

Table 1. A word bank in a practical application.

Keyword	Leaf category	Weight distribution
Notebook	Computer	7
Apple	Mobile phone	4
Notebook	Stationery	5

It should be noted that the steps of the method provided in embodiment 1 may all be executed by the same device, or may be executed by different devices. For example, step S11 and step S12 may be executed by device 1; for another example, step S11 may be executed by device 1, and step S12 and step S13 may be executed by device 2; and so on.

Example 2

Embodiment 2 provides a document classification method, which classifies documents in a target network interaction platform into a specified network interaction platform by using the lexicon generated in embodiment 1, thereby solving the problem that resource integration is difficult to perform among similar network interaction platforms due to different classification standards in the prior art. The specific flow diagram of the method is shown in fig. 3, and the method comprises the following steps:

Step S31: Obtain the title information of a specified document in a target network interaction platform and determine each entity word corresponding to the title of the specified document, where the entity words describe entity things.

The target network interaction platform is the platform whose resources are to be integrated. For example, when the resources of a first network game platform are integrated into a second network game platform, the first network game platform is the target network interaction platform. The type of the target network interaction platform is not limited here; it may be a network e-commerce platform, a network game platform, and so on.

In a network interaction platform, resources are usually represented as documents on the server side. For example, on a network e-commerce platform, each commodity usually corresponds to one document on the server, and that document contains a document title, description information of the categories used to classify the commodity, and the address of the document on the network. The specified document is the document corresponding to a resource in the target network interaction platform that needs to be integrated into the specified network interaction platform.

From the title information of the specified document, each entity word corresponding to the title is determined. Entity words describe entity things, such as commodity names, trademarks, book names, and game names.

In practical applications, there are many ways to determine the entity words corresponding to the title of the specified document; a preferred scheme is shown in step S311.

Step S311: Normalize the title of the specified document, then perform word segmentation and entity word recognition on the normalized title to determine each entity word corresponding to the title of the specified document.

In practical applications, the normalization may include letter case conversion, traditional/simplified Chinese conversion, and tense conversion. Word segmentation of the title of the specified document can generally be performed with a word segmentation tool; word2vec is named as a commonly used one. Entity word recognition can typically be accomplished with an entity lexicon. After the entity words are identified, the word frequency of each entity word under each classification category is counted, and the entity words whose word frequency is greater than the preset threshold are selected as the keywords of their corresponding classification categories.
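The normalization and entity word recognition described above can be sketched as follows. The function names, the toy traditional-to-simplified mapping passed in as `trad_to_simp`, and the use of whitespace splitting in place of a real word segmentation tool are all assumptions made for illustration; tense conversion is omitted.

```python
def normalize_title(title, trad_to_simp=None):
    """Lowercase a title and map traditional characters to simplified ones.

    trad_to_simp is a hypothetical character mapping supplied by the caller.
    """
    trad_to_simp = trad_to_simp or {}
    lowered = title.lower()
    return "".join(trad_to_simp.get(ch, ch) for ch in lowered)


def extract_entity_words(title, entity_lexicon, trad_to_simp=None):
    """Return the entity words found in a title, in order of appearance."""
    normalized = normalize_title(title, trad_to_simp)
    # Whitespace splitting stands in for a real word segmentation tool.
    tokens = normalized.split()
    # Entity word recognition: keep tokens present in the entity lexicon.
    return [token for token in tokens if token in entity_lexicon]
```

For instance, calling `extract_entity_words("Apple iPhone4s new", {"apple", "iphone4s"})` keeps only the two tokens that appear in the entity lexicon.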

Because of the normalization and the entity word recognition, the entity words determined in step S311 are more accurate and more consistent.

Step S32: Determine the weight score of each entity word under each classification category in the lexicon by querying the lexicon.

The lexicon is the one generated by the lexicon generation method provided in embodiment 1. It usually comprises three attribute items: keyword, classification category, and weight score. The weight score evaluates how well the keyword distinguishes its classification category from the other classification categories, and a keyword is a word specified under its corresponding classification category according to a preset rule.

The weight score of each entity word under each classification category is obtained by querying the lexicon. In practical applications, an entity word being looked up is often absent under some classification categories; in that case its weight score under those categories may be set to a predetermined constant, which may be 0. For example, the entity word "rice" does not appear under the category "computer" of the lexicon, so the weight score of "rice" under that category may be set to the predetermined constant.
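The lookup with a fallback constant can be sketched as follows. The dictionary layout keyed by `(keyword, category)` and the names `DEFAULT_WEIGHT` and `weights_by_category` are assumptions for illustration.

```python
DEFAULT_WEIGHT = 0.0  # the predetermined constant; the text suggests 0


def weights_by_category(entity_words, lexicon, categories):
    """Return {category: {entity_word: weight}} for one document title.

    lexicon: {(keyword, category): weight} (assumed layout).
    """
    return {
        category: {
            # Entity words missing under a category get the default constant.
            word: lexicon.get((word, category), DEFAULT_WEIGHT)
            for word in entity_words
        }
        for category in categories
    }
```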

Step S33: Determine the total score of the specified document under each classification category from the weight scores of the entity words under that classification category.

This step determines the total score of the specified document under each classification category of the lexicon. The process is illustrated for a single category, referred to here as the current classification category: the total score of the specified document under the current classification category is determined from the weight scores, in that category, of the entity words in the title of the specified document.

In practical applications, this total score can be determined in various ways. For example, the weight scores of the entity words may be combined by weighted summation, where the weighting coefficient of each entity word is determined by its type; the types of entity words include names, place names, brand words, and the like. Alternatively, each classification category may be taken in turn as the current classification category, and the total score obtained by directly summing the weight scores of the entity words under that category.
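Both variants, plain summation and summation weighted by entity word type, can be sketched in one function. The names `total_scores`, `word_types`, and `type_weights`, and the example coefficient values, are hypothetical; the text does not specify how type coefficients are chosen.

```python
def total_scores(entity_words, lexicon, categories,
                 word_types=None, type_weights=None):
    """Return {category: total score} for one document title.

    lexicon: {(keyword, category): weight} (assumed layout).
    word_types: {entity_word: type}, e.g. {"apple": "brand"} (hypothetical).
    type_weights: {type: coefficient}; missing types default to 1.0.
    """
    word_types = word_types or {}
    type_weights = type_weights or {}
    scores = {}
    for category in categories:
        total = 0.0
        for word in entity_words:
            # A coefficient of 1.0 reduces the weighted sum to a plain sum.
            factor = type_weights.get(word_types.get(word), 1.0)
            total += factor * lexicon.get((word, category), 0.0)
        scores[category] = total
    return scores
```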

Step S34: Classify the specified document according to its total score under each classification category.

After the total score of the specified document under each classification category has been determined, the document is classified on the basis of those total scores.

This can be done in various ways. For example, the classification categories may be sorted by total score and the specified document classified into the category with the largest total score.
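The simplest rule, picking the category with the largest total score, is shown below as a minimal sketch; the function name is hypothetical, and ties are resolved in favor of the category that appears first in the input.

```python
def classify_by_max_score(scores):
    """Return the category whose total score is largest.

    scores: {category: total score}.
    """
    return max(scores, key=scores.get)
```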

Further, classification by total score alone may be inaccurate in practice. For example, a document titled "apple iphone4s data line" is segmented into the entity words "apple", "iphone4s" and "data line". Because "apple" and "iphone4s" both carry high weights under the category "mobile phone", the document is usually classified under "mobile phone", although the item it describes is a data line rather than a phone. Therefore, as shown in FIG. 4, a preferred scheme for classifying a specified document by its total scores under the classification categories is given in steps S341 to S345.

Step S341: Arrange the classification categories in a queue in descending order of the total score of the specified document, and select a predetermined number of classification categories from the head of the queue in queue order.

The predetermined number of classification categories includes at least two classification categories. In practical applications, as many categories as needed may be selected, for example the 100 classification categories with the highest total scores.
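Step S341 amounts to a descending sort followed by a prefix selection, as in the following sketch. The function name and the choice to raise an error for fewer than two categories are assumptions.

```python
def top_categories(scores, n=2):
    """Return the n categories with the highest total score, best first.

    scores: {category: total score}.
    """
    if n < 2:
        # The predetermined number includes at least two categories.
        raise ValueError("at least two categories must be selected")
    queue = sorted(scores, key=scores.get, reverse=True)
    return queue[:n]
```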

Step S342: Acquire the initial category of the specified document in the target network interaction platform.

The initial category to which the specified document belongs in the target network interaction platform usually also contains information useful for classifying the document, so the information of this initial category can be obtained.

In addition, document categories in a network interaction platform generally span multiple levels, so the initial category of the specified document may be taken at any of several levels. A preferred scheme is to select the initial category at the same level as the classification categories. For example, when the classification categories are primary categories, the selected initial category is also a primary category; when the classification categories are leaf categories, the selected initial category is also a leaf category.

Step S343: Determine the first similarity between each of the selected predetermined number of classification categories and the initial category, where the first similarity describes how similar the initial category is to each classification category.

The first similarity between a classification category and the initial category can be determined in various ways, for example by matching words or by other means. Preferably, the Jaccard similarity between each of the selected classification categories and the initial category is determined.

Take the words in the classification category as set A and the words in the initial category as set B. The Jaccard similarity of A and B is calculated as follows:

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $|A \cap B|$ is the number of words in the intersection of A and B, and $|A \cup B|$ is the number of words in the union of A and B. The value of Jaccard(A, B) lies in the interval [0, 1], and Jaccard(A, B) = 1 when A = B, that is, when A and B are identical.
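The Jaccard similarity can be computed directly on word sets, as in this sketch. Returning 0 when both sets are empty is an assumption, since the formula is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two word collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are given similarity 0.
        return 0.0
    return len(set_a & set_b) / len(union)
```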

Step S344: Determine the second similarity between each of the selected predetermined number of classification categories and the title of the specified document, where the second similarity describes how similar the title of the specified document is to each classification category.

The second similarity may be determined in the same way as the first similarity or in a different way. In practical applications, the preferred way is to determine the Jaccard similarity between the classification category and the title of the specified document.

Step S345: Take each of the selected predetermined number of classification categories as a pre-selected classification category, compute a weighted sum of the total score, the first similarity and the second similarity corresponding to that pre-selected classification category, and classify the specified document according to the results of the weighted summation.

The respective weights of the total score, the first similarity and the second similarity may be determined according to actual conditions; in general, each may simply be set to 1.
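Steps S343 to S345 can be sketched together as a re-ranking of the pre-selected categories. The function name `rerank`, the input layout, and the example values are hypothetical; the example also assumes that the total scores have been scaled to a range comparable with the similarities, which the text does not state.

```python
def rerank(candidates, initial_category_words, title_words,
           weights=(1.0, 1.0, 1.0)):
    """Pick the best pre-selected category by weighted summation.

    candidates: {category: (total_score, words_in_category_name)}
    (hypothetical layout).
    Returns (best_category, {category: combined score}).
    """
    def jaccard(a, b):
        set_a, set_b = set(a), set(b)
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0

    w_score, w_first, w_second = weights
    combined = {}
    for name, (total, words) in candidates.items():
        first = jaccard(words, initial_category_words)   # S343
        second = jaccard(words, title_words)             # S344
        combined[name] = w_score * total + w_first * first + w_second * second
    return max(combined, key=combined.get), combined     # S345
```

With a title such as "apple iphone4s data line" and an initial category whose words include "data" and "line", the two similarity terms can lift a "data line" category above "mobile phone" even when its total score is lower.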

In practical applications, the accuracy of the classification results may be checked by sampling, for example by sampling per primary category and verifying the prediction hit rate.

The document classification method provided in embodiment 2 classifies documents in the target network interaction platform into the specified network interaction platform by using the lexicon generated in embodiment 1, thereby solving the prior-art problem that resource integration among similar network interaction platforms is difficult because their classification standards differ.

Example 3

Based on the same inventive concept as that in embodiment 1, embodiment 3 provides a device for generating a lexicon, wherein the lexicon generated by the device is used for classifying documents in a target network interaction platform to a specified network interaction platform, thereby laying a foundation for resource integration among similar network interaction platforms. As shown in fig. 5, the apparatus 40 includes:

an obtaining unit 401, a determining unit 402, and a generating unit 403, wherein:

an obtaining unit 401, configured to obtain title information of each document under each classification category of a specified level in a specified network interaction platform;

a determining unit 402, configured to determine, according to the title information of each document, each keyword corresponding to each classification category and a weight score of each keyword under the corresponding classification category, where the weight score is used to evaluate how well the keyword distinguishes its classification category from the other classification categories, and the keyword is a word specified under its corresponding classification category according to a predetermined rule;

a generating unit 403, configured to generate a word stock according to each classification category, each keyword corresponding to each classification category, and the weight of each keyword.

With the lexicon generating device provided in embodiment 3, the obtaining unit obtains the title information of each document under each classification category of a specified level in the specified network interaction platform, the determining unit determines each keyword corresponding to each classification category and the weight score of each keyword under the corresponding classification category according to the title information of each document, and the generating unit generates the lexicon according to each classification category, each keyword corresponding to each classification category and the weight score of each keyword. The lexicon generated from the documents under each classification category in the specified network interaction platform is used for classifying the documents in the target network interaction platform into the specified network interaction platform, thereby laying a foundation for resource integration among similar network interaction platforms.

Example 4

Based on the same inventive concept as that in embodiment 2, embodiment 4 provides a document classification apparatus, which classifies documents in a target network interaction platform into a designated network interaction platform by using the lexicon generated in embodiment 3, thereby solving the problem that resource integration is difficult to perform among similar network interaction platforms due to different classification standards in the prior art. As shown in fig. 6, the apparatus 50 includes:

an obtaining unit 501, a querying unit 502, a determining unit 503 and a classifying unit 504, wherein:

an obtaining unit 501, configured to obtain title information of a specified document in a target network interaction platform and determine entity words corresponding to a title of the specified document, where the entity words are used to describe an entity object;

a query unit 502, configured to determine weight distribution of each entity word in each classification category in a thesaurus by querying the thesaurus, where the thesaurus is a thesaurus generated by a thesaurus generation apparatus provided according to an embodiment of the present specification;

a determining unit 503, configured to determine, according to the weight distribution of each entity word under each classification category, a respective total score of the designated document under each classification category;

a classifying unit 504, configured to classify the specified document according to the total score of the specified document under each classification category.

With the document classification device provided in embodiment 4, the obtaining unit obtains the title information of the specified document and determines the entity words according to the title information, the querying unit determines the weight scores of the entity words corresponding to the classification categories by querying the word bank generated in embodiment 3, the determining unit determines the total scores of the specified document according to the weight scores, and the classification unit classifies the specified document according to the total scores. Therefore, the problem that resource integration is difficult to perform among similar network interaction platforms due to different classification standards in the prior art is solved.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for generating a word stock, characterized in that the word stock is used for classifying documents in a target network interaction platform into a specified network interaction platform, the target network interaction platform being a platform whose resources are to be integrated, the method comprising the following steps:

acquiring title information of each document under each classification category of a designated level in a designated network interaction platform, wherein the designated network interaction platform is a platform for receiving integrated resources, and each classification category of the designated level specifically comprises each primary category and/or each leaf category;

determining each entity word corresponding to the title of the document according to the title information of each document, wherein the entity words are used for describing entity things, and are obtained by performing normalization processing on the title of each document, performing word segmentation processing and entity word identification on the normalized title of each document, wherein the normalization processing comprises at least one of letter case conversion, traditional and simplified conversion and temporal conversion; determining each keyword corresponding to each classification category and weight distribution of each keyword under the corresponding classification category according to the entity words, wherein the weight distribution is used for evaluating the distinguishing degree when the classification category where the keyword is located is distinguished from other classification categories except the classification category, and the keyword refers to a specified word according to a preset rule under the classification category corresponding to the keyword;

2. The method of claim 1, wherein the weight distribution of the keywords under the corresponding category is determined by:

3. The method of claim 1, wherein each entity word corresponding to the title of each document is determined according to the title information of each document, and the entity word is used for describing entity things; determining, according to the entity words, each keyword corresponding to each classification category specifically includes:

normalizing the title of each document;

4. The method as claimed in claim 3, further comprising, when the word frequency of each entity word corresponding to the current classification category is smaller than a preset threshold, determining the entity word with the largest word frequency in the current classification category as the keyword in the current classification category, where the current classification category includes any one of the classification categories.

5. The method of claim 1, wherein the generating a thesaurus according to the classification categories, the keywords respectively corresponding to the classification categories, and the weights of the keywords specifically comprises:

and generating a word bank according to each association.

6. A method of classifying a document, comprising:

determining the weight distribution of each entity word under each classification category in a word stock by querying the word stock, wherein the word stock is generated according to the method of any one of claims 1 to 5;

acquiring an initial category of the specified document in the target network interaction platform;

7. The method of claim 6, wherein said determining a respective overall score for said given document under each of said taxonomy categories by a weighted distribution of each of said entity words under each of said taxonomy categories specifically comprises:

8. The method of claim 6, wherein the determining entity words corresponding to the title of the specified document specifically comprises:

normalizing the title of the specified document;

9. A device for generating a lexicon, wherein the lexicon is used for classifying documents in a target network interaction platform into a specified network interaction platform, the target network interaction platform being a platform whose resources are to be integrated, the device comprising:

an acquisition unit, a determination unit and a generation unit, wherein:

the acquisition unit is configured to acquire title information of each document under each classification category of a specified level in a specified network interaction platform, wherein the specified network interaction platform is a platform for receiving integrated resources, and each classification category of the specified level specifically comprises each primary category and/or each leaf category;

the determining unit is used for determining each entity word corresponding to the title of each document according to the title information of each document, wherein the entity words are used for describing entity things, the entity words are obtained by performing normalization processing on the title of each document, performing word segmentation processing and entity word identification on the normalized title of each document, and the normalization processing comprises at least one of letter case conversion, traditional and simplified conversion and temporal conversion; determining each keyword corresponding to each classification category and weight distribution of each keyword under the corresponding classification category according to the entity words, wherein the weight distribution is used for evaluating the distinguishing degree when the classification category where the keyword is located is distinguished from other classification categories except the classification category, and the keyword refers to a specified word according to a preset rule under the classification category corresponding to the keyword;

10. A document classification apparatus, comprising:

a query unit, configured to determine weight distribution of each entity word under each classification category in a thesaurus by querying the thesaurus, where the thesaurus is the thesaurus generated by the thesaurus generation apparatus according to any one of claims 1 to 5;

the classification unit is used for generating a queue according to the total score of the specified document from high to low, and selecting a predetermined number of classification categories from the initial position of the queue according to the sequence in the queue, wherein the predetermined number of classification categories at least comprises two classification categories;