CN112015895A

CN112015895A - Patent text classification method and device

Info

Publication number: CN112015895A
Application number: CN202010870909.8A
Authority: CN
Inventors: 肖小清; 段新辉; 温柏坚; 周永言; 赵永发; 魏焱; 沈桂泉; 龙震岳; 沈伍强; 伍江瑶
Original assignee: Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-01

Abstract

The application provides a patent text classification method and device. The method comprises: acquiring a patent text to be classified, and extracting patent attribute features of the patent text to be classified; performing word segmentation and word vector construction on the patent text to be classified to obtain text word vector features; and inputting the patent attribute features and the text word vector features into a patent text classification model, so as to obtain a classification result of the patent text to be classified from the operation of the patent text classification model. The patent text classification model is a neural network model constructed from preset training samples in combination with the classification information sets corresponding to preset technical classification nodes. In the application, technical classification nodes are established according to the user's own technical system, a patent text classification model is trained by neural network learning on the classification information sets established for those nodes, and patent texts are classified with this model, which solves the technical problem of low efficiency in classifying patent texts in the prior art.

Description

Patent text classification method and device

Technical Field

The present application relates to the field of text classification, and in particular, to a method and an apparatus for classifying patent texts.

Background

As China vigorously develops intellectual property, many enterprises have begun to pay attention to intellectual property operations in order to better protect their interests. For technology enterprises, patent texts are an effective way to understand and keep up with their industry and its technology, and they help to provide direction for technical research and development.

However, unlike general text, patent text has very strong technical and legal characteristics and carries many additional attributes, such as classification number and applicant. Classifying patents according to an enterprise's own technical system is therefore complicated and time consuming, which leads to the technical problem of low efficiency in classifying patent texts in the prior art.

Disclosure of Invention

The application provides a patent text classification method and device, which are used for solving the technical problem that the execution efficiency of classification work of patent texts in the prior art is low.

A first aspect of the present application provides a patent text classification method, including:

acquiring a patent text to be classified, and extracting patent attribute characteristics of the patent text to be classified;

performing word segmentation processing and word vector construction processing on the patent text to be classified to obtain text word vector characteristics;

inputting the patent attribute features and the text word vector features into a patent text classification model so as to obtain a classification result of the patent text to be classified according to the operation of the patent text classification model;

wherein the patent text classification model is a neural network model constructed from preset training samples in combination with the classification information sets corresponding to preset technical classification nodes, and the training samples are specifically the patent attribute features and text word vector features obtained from preset patent text samples.

Optionally, the inputting the patent attribute features and the text word vector features into a patent text classification model, so as to obtain a classification result of the patent text to be classified according to the operation of the patent text classification model specifically includes:

inputting the patent attribute features and the text word vector features into a patent text classification model, and obtaining the matching degree of the patent text to be classified and each technical classification node through the operation of the patent text classification model;

and classifying the patent text to be classified into the technical classification node corresponding to the maximum matching degree, according to the magnitude of each matching degree, so as to obtain the classification result of the patent text to be classified.

Optionally, before classifying the patent text to be classified into the technical classification node corresponding to the maximum matching degree according to the magnitude of each matching degree, the method further includes:

if all of the matching degrees are smaller than a preset matching degree threshold, outputting classification failure as the classification result of the patent text to be classified.

Optionally, the configuration process of the patent text classification model specifically includes:

respectively obtaining patent attribute features and text word vector features of a training sample according to a preset training sample;

and inputting the patent attribute characteristics of the training sample, the text word vector characteristics of the training sample and a classification information set corresponding to a preset technical classification node into an initial neural network model for model training, and obtaining the patent text classification model after training.

Optionally, the configuration process of the training samples specifically includes:

and extracting corresponding patent texts from a patent database by using a patent search formula, and carrying out information labeling on the patent texts based on the patent search formula so as to obtain a training sample.

A second aspect of the present application provides a patent text classification device, including:

the patent attribute feature extraction unit is used for acquiring a patent text to be classified and extracting the patent attribute features of the patent text to be classified;

the word vector feature acquisition unit is used for carrying out word segmentation processing and word vector construction processing on the patent text to be classified to obtain text word vector features;

the patent text classification unit is used for inputting the patent attribute features and the text word vector features into a patent text classification model so as to obtain a classification result of the patent text to be classified according to the operation of the patent text classification model;

Optionally, the patent text classification unit specifically includes:

the technical matching degree calculation subunit is used for inputting the patent attribute features and the text word vector features into a patent text classification model, and obtaining the matching degree between the patent text to be classified and each technical classification node through the operation of the patent text classification model;

and the classification subunit is used for classifying the patent text to be classified into the technical classification node corresponding to the maximum matching degree, according to the magnitude of each matching degree, so as to obtain the classification result of the patent text to be classified.

Optionally, the patent text classification unit further includes:

and the matching degree judging subunit is used for outputting classification failure as the classification result of the patent text to be classified if all of the matching degrees are smaller than a preset matching degree threshold.

Optionally, the device further comprises:

the sample feature acquisition unit is used for respectively acquiring the patent attribute features and the text word vector features of a training sample according to a preset training sample;

and the model training unit is used for inputting the patent attribute characteristics of the training samples, the text word vector characteristics of the training samples and the classification information set corresponding to the preset technical classification node into an initial neural network model for model training, and obtaining the patent text classification model after training is completed.

Optionally, the device further comprises:

and the sample labeling unit is used for extracting corresponding patent texts from a patent database by using a patent search formula and labeling the patent texts based on the patent search formula so as to obtain training samples.

According to the technical scheme, the embodiment of the application has the following advantages:

The patent text classification method provided by the application comprises: acquiring a patent text to be classified, and extracting patent attribute features of the patent text to be classified; performing word segmentation and word vector construction on the patent text to be classified to obtain text word vector features; and inputting the patent attribute features and the text word vector features into a patent text classification model, so as to obtain a classification result of the patent text to be classified from the operation of the patent text classification model. The patent text classification model is a neural network model constructed from preset training samples in combination with the classification information sets corresponding to preset technical classification nodes, and the training samples are specifically the patent attribute features and text word vector features obtained from preset patent text samples.

In the present application, a plurality of technical classification nodes are established according to the user's own technical system, and a patent text classification model is trained by neural network learning on the classification information sets established for those nodes, with patent texts serving as training samples. Patent texts are then classified with this model, which solves the technical problem of low efficiency in classifying patent texts in the prior art.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a patent text classification method according to a first embodiment of the present application;

Fig. 2 is a schematic flowchart of a patent text classification method according to a second embodiment of the present application;

Fig. 3 is a schematic structural diagram of a patent text classification device according to a first embodiment of the present application.

Detailed Description

The embodiment of the application provides a method and a device for classifying patent texts, which are used for solving the technical problem that the execution efficiency of the classification work of the patent texts in the prior art is low.

In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, a first embodiment of the present application provides a method for classifying patent texts, including:

Step 101, acquiring a patent text to be classified, and extracting patent attribute features of the patent text to be classified.

First, after the patent text to be classified is obtained, the special fields in the patent text can be identified, and the corresponding patent attribute features are extracted from those fields.

Step 102, performing word segmentation and word vector construction on the patent text to be classified to obtain text word vector features.

Then, word segmentation is performed on the patent text to be classified to divide the text into words, and the resulting words are converted into the corresponding text word vector features through word vector construction.

Step 103, inputting the patent attribute features and the text word vector features into a patent text classification model, so as to obtain a classification result of the patent text to be classified from the operation of the patent text classification model.

Next, the patent attribute features obtained in step 101 and the text word vector features obtained in step 102 are input into a pre-trained patent text classification model, and the classification result of the patent text to be classified is obtained from the operation of the model.

The patent text classification model mentioned in this embodiment is a neural network model constructed from preset training samples in combination with the classification information sets corresponding to preset technical classification nodes, and the training samples are specifically the patent attribute features and text word vector features obtained from preset patent text samples.
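Steps 101 and 102 can be sketched as follows. This is only a minimal illustration under stated assumptions: the field names (`ipc`, `applicant`), the regular-expression tokenizer, the toy embedding table `EMBEDDINGS` and the sample record are all hypothetical and are not specified by the application. A real system for Chinese patent text would need a dedicated word segmenter and trained word vectors.

```python
import re

# Hypothetical toy embedding table: word -> 3-dimensional vector.
EMBEDDINGS = {
    "transformer": [0.9, 0.1, 0.0],
    "winding": [0.8, 0.2, 0.1],
    "turbine": [0.1, 0.9, 0.2],
}
DIM = 3


def extract_attribute_features(patent):
    """Step 101: read the special fields (assumed names) of a patent record."""
    return {
        "ipc": patent.get("ipc", ""),
        "applicant": patent.get("applicant", ""),
    }


def segment(text):
    """Step 102, part 1: split text into lowercase words (stand-in segmenter)."""
    return re.findall(r"[a-z]+", text.lower())


def text_word_vector(text):
    """Step 102, part 2: average the vectors of known words into one vector."""
    vectors = [EMBEDDINGS[w] for w in segment(text) if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known word: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Illustrative record; the values are made up for this example.
patent = {
    "ipc": "H01F27/28",
    "applicant": "Guangdong Power Grid Co Ltd",
    "text": "A transformer winding with improved cooling",
}
attribute_features = extract_attribute_features(patent)
word_vector_features = text_word_vector(patent["text"])
print(attribute_features)
print(word_vector_features)
```

Averaging word vectors is chosen here only because it is the simplest way to get a fixed-length input; the application does not state how the word vectors are combined.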

In this embodiment, a plurality of technical classification nodes are established according to the user's own technical system, and a patent text classification model is trained by neural network learning on the classification information sets established for those nodes, with patent texts serving as training samples. Patent texts are then classified with this model, which solves the technical problem of low efficiency in classifying patent texts in the prior art.

The above is a detailed description of a first embodiment of a patent text classification method provided by the present application, and the following is a detailed description of a second embodiment of a patent text classification method provided by the present application.

Referring to fig. 2, a second embodiment of the present application provides a method for classifying patent texts.

On the basis of the first embodiment of the present application, further, step 103 specifically includes:

Step 1031, inputting the patent attribute features and the text word vector features into a patent text classification model, and obtaining the matching degree between the patent text to be classified and each technical classification node through the operation of the patent text classification model;

Step 1032, classifying the patent text to be classified into the technical classification node corresponding to the maximum matching degree, according to the magnitude of each matching degree, so as to obtain a classification result of the patent text to be classified.

It should be noted that, after the patent attribute features and the text word vector features are input into the patent text classification model, the model compares them with the classification information set corresponding to each technical classification node, and outputs the matching degree between the features of the patent text to be classified and each technical classification node. Taking the electric power field as an example, the technical classification nodes may include four classification nodes: power transmission, power transformation, power utilization and power generation. Each technical classification node may be further subdivided into a plurality of sub-classification nodes, which may be set by the user and are not described in detail here.

Based on the output of the patent text classification model, the technical classification node corresponding to the maximum matching degree is determined, and the patent text to be classified is classified into that node, thereby obtaining the classification result of the patent text to be classified. For example, suppose the matching degree of a certain patent with the power generation classification is 0.80, and its matching degrees with the other three classifications are 0.50, 0.40 and 0.30. The system then determines that this patent belongs to the power generation classification.

Further, the present embodiment further includes, before step 1032:

Step 10311, judging whether every matching degree is smaller than a preset matching degree threshold; if so, outputting classification failure as the classification result of the patent text to be classified; if not, continuing to execute step 1032.

It should be noted that, when the matching degrees corresponding to all technical classification nodes are smaller than the preset matching degree threshold, for example when every matching degree output by the model is lower than 0.5, the likelihood that the patent text belongs to any of the technical classification nodes is very low, and a classification result of classification failure is output.
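The decision logic of steps 10311 and 1032 can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `classify` and the use of `None` to signal classification failure are assumptions, and the assignment of 0.50, 0.40 and 0.30 to particular nodes in the sample data is arbitrary.

```python
def classify(matching_degrees, threshold=0.5):
    """Return the best-matching node, or None when classification fails.

    matching_degrees maps each technical classification node to the
    matching degree output by the model for the patent text.
    """
    # Step 10311: fail when every matching degree is below the threshold.
    if all(degree < threshold for degree in matching_degrees.values()):
        return None
    # Step 1032: choose the node with the maximum matching degree.
    return max(matching_degrees, key=matching_degrees.get)


# Sample matching degrees; which node receives 0.50, 0.40 or 0.30 is assumed.
degrees = {
    "power generation": 0.80,
    "power transmission": 0.50,
    "power transformation": 0.40,
    "power utilization": 0.30,
}
print(classify(degrees))
print(classify({"power generation": 0.45, "power transmission": 0.20}))
```

The first call selects the power generation node; the second returns `None` because both matching degrees fall below the 0.5 threshold.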

Further, the configuration process of the patent text classification model of the embodiment specifically includes:

step 1001, according to a preset training sample, respectively obtaining patent attribute features and text word vector features of the training sample.

Step 1002, inputting the patent attribute features of the training samples, the text word vector features of the training samples and the classification information sets corresponding to the preset technical classification nodes into an initial neural network model for model training, and obtaining the patent text classification model after the training is completed. The classification information set mentioned in this embodiment may include technical keywords commonly used for the technical classification node, patent classification numbers, and the like.
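Steps 1001 and 1002 can be sketched with a very small stand-in model. Everything specific here is an assumption: the application does not disclose the network architecture, so a single-layer softmax network trained by plain gradient descent is used; the classification information sets in `NODE_INFO`, the IPC-prefix encoding in `build_input`, and the training data are hypothetical. Note also that softmax outputs sum to 1, whereas the example matching degrees given in this embodiment do not, so the actual output layer may well differ.

```python
import math
import random

# Hypothetical classification information sets for two nodes.
NODE_INFO = {
    "power transformation": {"ipc_prefixes": ["H01F"]},
    "power generation": {"ipc_prefixes": ["H02K"]},
}
NODES = list(NODE_INFO)


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_input(word_vector, ipc, node_info):
    """Concatenate text features with one IPC-match flag per node."""
    flags = [
        1.0 if any(ipc.startswith(p) for p in info["ipc_prefixes"]) else 0.0
        for info in node_info.values()
    ]
    return list(word_vector) + flags


def forward(weights, biases, x):
    """Compute one matching degree per node for input vector x."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, biases)])


def train(samples, labels, n_classes, epochs=200, lr=0.5, seed=0):
    """Train the single-layer network with stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
               for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = forward(weights, biases, x)
            for k in range(n_classes):
                # Gradient of cross-entropy loss with respect to logit k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j in range(n_features):
                    weights[k][j] -= lr * grad * x[j]
    return weights, biases


# Made-up training samples: (text word vector, IPC code) with node index labels.
samples = [
    build_input([0.9, 0.1], "H01F27/28", NODE_INFO),
    build_input([0.8, 0.2], "H01F27/08", NODE_INFO),
    build_input([0.1, 0.9], "H02K7/18", NODE_INFO),
    build_input([0.2, 0.8], "H02K1/12", NODE_INFO),
]
labels = [0, 0, 1, 1]
weights, biases = train(samples, labels, n_classes=len(NODES))
degrees = dict(zip(NODES, forward(weights, biases, samples[0])))
print(degrees)
```

Encoding the classification information set as per-node match flags is just one way to let the set influence training; the application leaves this encoding open.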

Further, the configuration process of the training samples in this embodiment specifically includes:

Step 1000, extracting corresponding patent texts from a patent database by using a patent search formula, and labeling the patent texts with information based on the patent search formula to obtain training samples.

It should be noted that, in this embodiment, training samples are generated by patent search formula retrieval: the corresponding patent texts are extracted from a patent database according to an input patent search formula and are labeled based on the key information in that search formula, so as to obtain the training samples and improve the efficiency of configuring training samples.
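Step 1000 can be sketched as follows. This is a simplified illustration under stated assumptions: a real patent search formula is far richer than the hypothetical structure used here (a list of IPC prefixes plus a list of keywords, combined with OR), and the in-memory `PATENT_DATABASE`, its records and the function names are all made up for the example.

```python
# Hypothetical in-memory stand-in for a patent database.
PATENT_DATABASE = [
    {"id": "P1", "ipc": "H01F27/28", "text": "transformer winding cooling structure"},
    {"id": "P2", "ipc": "H02K7/18", "text": "wind turbine generator set"},
    {"id": "P3", "ipc": "G06F16/35", "text": "text clustering method"},
]

# Hypothetical search formulas, one per technical classification node.
# A patent matches when its IPC code starts with one of the prefixes
# OR its text contains one of the keywords.
SEARCH_FORMULAS = {
    "power transformation": {"ipc_prefixes": ["H01F"], "keywords": ["transformer"]},
    "power generation": {"ipc_prefixes": ["H02K"], "keywords": ["generator"]},
}


def matches(patent, formula):
    """Evaluate the simplified search formula against one patent record."""
    by_ipc = any(patent["ipc"].startswith(p) for p in formula["ipc_prefixes"])
    by_keyword = any(k in patent["text"] for k in formula["keywords"])
    return by_ipc or by_keyword


def build_training_samples(database, formulas):
    """Retrieve patents per formula and label each hit with the formula's node."""
    samples = []
    for node, formula in formulas.items():
        for patent in database:
            if matches(patent, formula):
                samples.append({"patent_id": patent["id"], "label": node})
    return samples


training_samples = build_training_samples(PATENT_DATABASE, SEARCH_FORMULAS)
for sample in training_samples:
    print(sample)
```

Because the label comes directly from the formula that retrieved the patent, no manual annotation is needed, which is the efficiency gain this step describes. A patent retrieved by more than one formula would receive several labels in this sketch; how such conflicts are resolved is not stated in the application.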

In this embodiment, a plurality of technical classification nodes are established according to the user's own technical system, and a patent text classification model is trained by neural network learning on the classification information sets established for those nodes, with patent texts serving as training samples. Patent texts are then classified with this model. In addition, the training samples are configured automatically by retrieval, which further improves model training efficiency. Together these solve the technical problem of low efficiency in classifying patent texts in the prior art.

The above is a detailed description of the second embodiment of the patent text classification method provided by the present application, and the following is a detailed description of the first embodiment of the patent text classification device provided by the present application.

Referring to fig. 3, a first embodiment of the patent text classification device provided by the second aspect of the present application includes:

the patent attribute feature extraction unit 301 is configured to acquire a patent text to be classified and extract a patent attribute feature of the patent text to be classified;

the word vector feature obtaining unit 302 is configured to perform word segmentation processing and word vector construction processing on the patent text to be classified to obtain text word vector features;

the patent text classification unit 303 is configured to input the patent attribute features and the text word vector features into a patent text classification model, so as to obtain a classification result of the patent text to be classified according to an operation of the patent text classification model;

the patent text classification model is a neural network model which is constructed according to a preset training sample and by combining a classification information set corresponding to a preset technical classification node, and the training sample specifically obtains patent attribute features and text word vector features according to the preset patent text sample.

Further, the patent text classification unit 303 specifically includes:

the technical matching degree calculation subunit 3031 is configured to input the patent attribute features and the text word vector features into a patent text classification model, and obtain the matching degree between the patent text to be classified and each technical classification node through the operation of the patent text classification model;

and the classification subunit 3032 is configured to classify the patent text to be classified into the technical classification node corresponding to the maximum matching degree, according to the magnitude of each matching degree, so as to obtain a classification result of the patent text to be classified.

Further, the patent text classification unit 303 further includes:

the matching degree judging subunit 30311 is configured to output classification failure as the classification result of the patent text to be classified if all of the matching degrees are smaller than a preset matching degree threshold.

Further, the device also includes:

a sample feature obtaining unit 3001, configured to obtain, according to a preset training sample, patent attribute features and text word vector features of the training sample respectively;

the model training unit 3002 is configured to input the patent attribute features of the training samples, the text word vector features of the training samples, and the classification information sets corresponding to the preset technical classification nodes to the initial neural network model for model training, and obtain a patent text classification model after training.

Further, the device also includes:

the sample labeling unit 3000 is configured to extract a corresponding patent text from the patent database by using a patent search formula, and label the patent text with information based on the patent search formula to obtain a training sample.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division into units is merely a logical division, and an actual implementation may use another division; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A patent text classification method is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the inputting the patent attribute features and the text word vector features into a patent text classification model to obtain the classification result of the patent text to be classified according to the operation of the patent text classification model specifically comprises:

3. The method as claimed in claim 2, wherein before classifying the patent texts to be classified by the technical classification node corresponding to the maximum matching degree according to the matching degree, obtaining the classification result of the patent texts to be classified, the method further comprises:

4. The method for classifying patent texts according to claim 1, wherein the configuration process of the patent text classification model specifically includes:

5. The method for classifying patent texts according to claim 4, wherein the configuration process of the training samples specifically includes:

6. A patent text classification apparatus, comprising:

7. The patent text classification device according to claim 6, wherein the patent text classification unit specifically comprises:

8. The patent text classification device according to claim 7, wherein the patent text classification unit further comprises:

9. The patent text classification device according to claim 6, further comprising:

10. The patent text classification device according to claim 9, further comprising: