CN114461801A - Patent text classification number identification method and device, electronic equipment and storage medium - Google Patents

Patent text classification number identification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114461801A
Authority
CN
China
Prior art keywords
text
neural network
classification
classification number
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210120391.5A
Other languages
Chinese (zh)
Inventor
杨海涛
王超超
王为磊
屠昶旸
张济徽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smart Bud Information Technology Suzhou Co ltd
Original Assignee
Smart Bud Information Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smart Bud Information Technology Suzhou Co ltd filed Critical Smart Bud Information Technology Suzhou Co ltd
Priority to CN202210120391.5A priority Critical patent/CN114461801A/en
Publication of CN114461801A publication Critical patent/CN114461801A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a patent text classification number identification method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring an input text corresponding to a target patent text; and inputting the input text into a recognition neural network, and outputting a classification number determination result of the target patent text after the input text is processed by the recognition neural network. With the implementations provided by the embodiments of the application, the recognition neural network can be trained with a constructed class-balanced data set, and the trained recognition neural network can then identify the classification number of a target patent text whose classification number is to be identified from its input text. The efficiency and accuracy of classification number identification are thereby effectively improved, and labor cost is reduced.

Description

Patent text classification number identification method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of intelligent identification, in particular to a patent text classification number identification method and device, electronic equipment and a storage medium.
Background
The patent classification number is a number assigned to a patent document according to a specific classification rule to represent its classification. Commonly used patent classification numbers currently include the IPC classification number, the CPC classification number, and the like. Patent classification numbers can be used to assist the retrieval of patent documents, so classifying patent documents and identifying their classification numbers is of great significance for patent search, patent examination, and the like. In the prior art, classification numbers of patent texts are generally identified manually. However, with the rapid increase in patent application volume, the low efficiency and high cost of manual identification have become increasingly prominent. In addition, owing to the complexity and diversity of patent texts, the accuracy of identifying classification numbers of patent texts by machine is currently low. It is therefore necessary to provide an efficient and accurate identification method for patent text classification numbers.
Disclosure of Invention
The application provides a patent text classification number identification method and device, electronic equipment and a storage medium, so as to improve the accuracy and efficiency of patent text classification number identification and reduce labor cost.
According to a first aspect of the present application, there is provided a patent text classification number identification method, the method comprising: acquiring an input text corresponding to a target patent text; and inputting the input text into the recognition neural network, and outputting a classification number determination result of the target patent text after the input text is processed by the recognition neural network.
In one possible implementation, the recognition neural network includes a semantic feature extraction neural network, a long-range dependency capture neural network, and a classification neural network. Inputting the input text into the recognition neural network and outputting the classification number determination result of the target patent text after processing by the recognition neural network includes: inputting the input text into the semantic feature extraction neural network, and outputting a word vector and a sentence vector after processing by the semantic feature extraction neural network; inputting the word vector and the sentence vector into the long-range dependency capture neural network, and outputting a residual feature vector after processing by the long-range dependency capture neural network; and inputting the residual feature vector into the classification neural network, and obtaining the classification number determination result after processing by the classification neural network.
In one possible implementation, obtaining the input text corresponding to the target patent text includes: preprocessing a specific sub-text of the target patent text to obtain the input text.
In one possible implementation, the specific sub-texts include the title text, abstract text, and claim text of the target patent text. Preprocessing the specific sub-texts of the target patent text to obtain the input text includes: performing data cleaning and keyword extraction on the claim text to obtain a keyword text corresponding to the claim text; and combining the keyword text, the title text, and the abstract text to obtain the input text with a fixed text length.
In a possible implementation, inputting the residual feature vector into the classification neural network and obtaining the classification number determination result after processing by the classification neural network includes: inputting the residual feature vector into the classification neural network, and outputting the probability that the target patent text belongs to each classification number after processing by the classification neural network; and ranking the classification numbers by their probability values from high to low, and determining the top-N classification numbers as the classification number determination result, where N is greater than or equal to 1.
In a possible implementation, the classification numbers include subclass numbers and subgroup numbers, and the probability value of each classification number includes the probability value of each subclass number and the probability value of each subgroup number. The N classification numbers include the subclass numbers ranked in the top M by their probability values and the subgroup numbers ranked in the top L by their probability values, where M is greater than or equal to 1 and L is greater than or equal to 1.
In one possible implementation, the semantic feature extraction neural network includes a Bert semantic feature extraction model, and the long-range dependency capture neural network includes a bidirectional long short-term memory network (BiLSTM) long-range dependency capture model.
In one possible implementation, the recognition neural network is a trained neural network. Training the recognition neural network includes: constructing a patent text training sample set, wherein each patent text sample in the patent text training sample set corresponds to one or more classification number labels; inputting the input texts corresponding to the patent text samples in the patent text training sample set into the recognition neural network, and outputting predicted classification number determination results after processing by the recognition neural network; determining the loss of the processing result of the recognition neural network according to the predicted classification number determination results and the classification number labels corresponding to the patent text samples; and back-propagating the loss through the recognition neural network to adjust the network parameters of the recognition neural network.
In one possible implementation, constructing the patent text training sample set includes: obtaining a sorted list of classification number labels according to the classification number label corresponding to each patent text sample in the original data; traversing the sorted list of classification number labels to obtain the number of samples corresponding to each classification number label; if the number of samples corresponding to a classification number label is smaller than a preset value, distributing the samples corresponding to that label to the patent text training sample set and a patent text test sample set according to a preset ratio; and if the number of samples corresponding to a classification number label is greater than or equal to the preset value, adding a first fixed number of those samples to the training sample set and a second fixed number of those samples to the test sample set.
In one possible implementation, the classification number labels include subgroup numbers and subclass numbers.
According to another aspect of the present application, there is provided a patent text classification number recognition apparatus, including: a data preprocessing module configured to acquire an input text corresponding to a target patent text; and a recognition module configured to input the input text into a recognition neural network and output a classification number determination result of the target patent text after processing by the recognition neural network.
In one possible implementation, the recognition module includes: a semantic feature extraction unit configured to input the input text into a semantic feature extraction neural network and output a word vector and a sentence vector after processing by the semantic feature extraction neural network; a long-range dependency capture unit configured to input the word vector and the sentence vector into a long-range dependency capture neural network and output a residual feature vector after processing by the long-range dependency capture neural network; and a classification unit configured to input the residual feature vector into a classification neural network and obtain the classification number determination result after processing by the classification neural network.
According to a third aspect of the present application, there is provided an electronic device comprising: a processor and a memory for storing executable instructions, the processor implementing the above method by invoking the executable instructions.
According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions to be executed by a processor to implement the above method.
According to the implementations of the above aspects of the application, the constructed patent text training sample set is used to train the recognition neural network, and the trained recognition neural network identifies the classification number of a target patent text whose classification number is to be identified from the input text of that target patent text. The efficiency and accuracy of classification number identification are thereby effectively improved, and labor cost is reduced.
Furthermore, the recognition neural network adopts an architecture of a semantic feature extraction neural network (such as a Bert model) plus a long-range dependency capture neural network (such as a BiLSTM), which can effectively mitigate problems such as vanishing and exploding gradients, further improving the accuracy of classification number identification.
Drawings
Fig. 1 is an exemplary flowchart of a patent text classification number identification method provided in the present application.
Fig. 2 is a schematic diagram of a module structure of a device for identifying a patent text classification number provided by the present application.
Fig. 3 is a schematic diagram of an exemplary application scenario of a recognition neural network model provided in the present application.
Fig. 4 is a schematic diagram of an exemplary acquisition process of a training sample set provided in the present application.
Fig. 5 is a block diagram of an exemplary module structure of an electronic device provided in the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Fig. 1 is an exemplary flowchart of a patent text classification number identification method provided in the present application. As shown in fig. 1, according to some embodiments of the present application, a patent text classification number identification method may include:
s110: and acquiring an input text corresponding to the target patent text.
The target patent text refers to the patent text whose classification number is to be identified. In an embodiment of the present application, the type of the target patent text may be an application text of a Chinese patent (e.g., an invention patent application text or a utility model text), a granted text of a Chinese patent, an application text of a foreign (e.g., English, Japanese, Korean, etc.) patent, a granted text of a foreign patent, etc., or a combination thereof. In some embodiments, the type of the target patent text may be an invention patent text, a utility model patent text, a design patent text, a journal paper, or the like, or a combination thereof. The application is not limited to a particular language or type of target patent text.
In one embodiment of the present application, the input text may be one or more parts extracted from the target patent text. For example only, the input text may be a specific part of the target patent text, a text in which specific parts extracted from the target patent text are combined according to a specific rule, the entire target patent text, or the like. For example, the input text may be the title portion of the target patent text. For another example, the input text may be a portion extracted from the abstract portion of the target patent text. For another example, the input text may be a text obtained by combining extracted portions such as the title portion, abstract portion, and claim portion of the target patent text.
In some embodiments, obtaining the input text corresponding to the target patent text may include preprocessing a specific sub-text of the target patent text to obtain the input text. For details of obtaining the input text from the sub-text, reference may be made to fig. 3 and its associated description.
In an embodiment of the present application, the input text may be automatically, semi-automatically, or manually obtained text, and the present application is not limited thereto.
S120: and inputting the input text into the recognition neural network, and outputting a classification number determination result of the target patent text after the input text is processed by the recognition neural network.
The classification number may include an IPC classification number, a CPC classification number, a European patent classification number (ECLA), a US patent classification number (CCL), a Japanese classification (FI/F-term), and the like, or combinations thereof. For convenience of description, the IPC classification number is used as an example. An IPC classification number contains four levels: in A01B33/08, the first letter A indicates the section to which the patent belongs, the first three characters A01 indicate the class, the first four characters A01B indicate the subclass, and the entire classification number A01B33/08 indicates the group. At present, there are 654 subclasses and 77850 subgroups in the Chinese-language IPC table.
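To make this hierarchy concrete, the following minimal Python sketch (a hypothetical helper, not part of the claimed method) splits an IPC symbol such as A01B33/08 into its section, class, subclass, and group levels:

```python
def parse_ipc(symbol: str) -> dict:
    """Split an IPC symbol like 'A01B33/08' into its hierarchy levels."""
    return {
        "section": symbol[0],    # 'A'         - section
        "class": symbol[:3],     # 'A01'       - class
        "subclass": symbol[:4],  # 'A01B'      - subclass
        "group": symbol,         # 'A01B33/08' - group (main group/subgroup)
    }

print(parse_ipc("A01B33/08"))
# {'section': 'A', 'class': 'A01', 'subclass': 'A01B', 'group': 'A01B33/08'}
```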
In one embodiment of the present application, the recognition neural network is a mathematical or computational model that mimics the structure and function of a biological neural network. The recognition neural network may include an input layer, intermediate layers, and an output layer. The input layer is responsible for receiving input data from the outside and passing it to the intermediate layers. The intermediate layers are responsible for information exchange, and can be designed as a single hidden layer or multiple hidden layers according to the required information transformation capability. The intermediate layers pass their results to the output layer for further processing to obtain the output of the recognition neural network. The input layer, intermediate layers, and output layer may each comprise a number of neurons, and the connections between neurons may be directed connections with variable weights. By repeatedly learning and training on known information and gradually adjusting the connection weights of the neurons, the recognition neural network builds a model of the relationship between input and output. The trained recognition neural network can then process input information using this learned input-output relationship and give the output corresponding to the input. For example, the recognition neural network may include convolutional layers, pooling layers, fully-connected layers, and the like.
In some embodiments, the recognition neural network may include a semantic feature extraction neural network, a long-range dependency capture neural network, a classification neural network, and the like, or combinations thereof. In some embodiments, the recognition neural network is a trained neural network.
In an embodiment of the present application, the training of the recognition neural network may include the following steps (a minimal training-loop sketch is given after this list):
constructing a patent text training sample set, wherein each patent text sample in the patent text training sample set corresponds to one or more classification number labels;
inputting the input texts corresponding to the samples in the patent text training sample set into the recognition neural network, and outputting predicted classification number determination results after processing by the recognition neural network;
determining the loss of the processing result of the neural network according to the predicted classification number determination result and the classification number label corresponding to the sample;
back-propagating the loss through the recognition neural network to adjust the network parameters of the recognition neural network.
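As a rough illustration only, the following PyTorch-style sketch shows what such a training loop could look like; the model interface, the batch format, and the binary cross-entropy loss are assumptions, since the application does not specify them:

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cpu"):
    """One epoch over the patent text training sample set.

    Each batch is assumed to carry tokenized input text and a multi-hot label
    vector (one or more classification number labels per sample); the loss
    function is an assumption, since the application does not name one.
    """
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    for input_ids, attention_mask, labels in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)
        logits = model(input_ids, attention_mask)   # predicted classification number scores
        loss = criterion(logits, labels.float())    # compare with the classification number labels
        optimizer.zero_grad()
        loss.backward()                             # back-propagate the loss
        optimizer.step()                            # adjust the network parameters
```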
In an embodiment of the present application, a method for constructing the patent text training sample set may include: obtaining a sorted list of classification number labels according to the classification number label corresponding to each patent text sample in the original data; traversing the sorted list of classification number labels to obtain the number of samples corresponding to each classification number label; if the number of samples corresponding to a classification number label is smaller than a preset value, distributing the samples corresponding to that label to the patent text training sample set and a patent text test sample set according to a preset ratio; and if the number of samples corresponding to a classification number label is greater than or equal to the preset value, adding a first fixed number of those samples to the training sample set and a second fixed number of those samples to the test sample set. For details of constructing the patent text sample sets, reference may be made to fig. 4 and its related description.
In an embodiment of the present application, inputting the input text into the recognition neural network, and outputting the classification number determination result of the target patent text after the processing by the recognition neural network may include:
s121: the input text is input into a semantic feature extraction neural network, and one or more of a word vector, a sentence vector and the like are output after being processed by the semantic feature extraction neural network.
S122: one or more of the word vector, the sentence vector and the like are input into a long-range dependency relationship capture neural network, and the residual error feature vector is output after the long-range dependency relationship capture neural network processing.
S123: and inputting the residual error feature vector into a classification neural network, and obtaining a classification number determination result after the processing of the classification neural network.
The semantic feature extraction neural network may be a Bert semantic feature extraction model, and the long-range dependency capture neural network may be a BiLSTM long-range dependency capture model. A long-range dependency (long-term dependency) refers to a dependency established with information that appears a long distance earlier in the sequence, and may also be called a long-distance dependency.
In some embodiments, inputting the residual feature vector into the classification neural network and obtaining the classification number determination result after processing by the classification neural network may include:
inputting the residual feature vector into the classification neural network, and outputting the probability that the target patent text belongs to each classification number after processing by the classification neural network;
and determining the top-N classification numbers, ranked by their probability values from high to low, as the classification number determination result, where N is greater than or equal to 1. In this example, the value of N may be determined according to the actual classification number identification requirements and the actual implementation scenario, which is not limited in this application.
In another embodiment of the present application, the classification number may include an IPC classification number, which is composed of a subclass number and a subgroup number. Correspondingly, the probability value of each classification number includes the probability value of each subclass number and the probability value of each subgroup number, and the N classification numbers include the subclass numbers ranked in the top M by their probability values and the subgroup numbers ranked in the top L by their probability values, where M is greater than or equal to 1 and L is greater than or equal to 1. In theory, M and L may be any natural numbers greater than or equal to 1; specifically, their values may be determined according to the actual classification number identification requirements and the actual implementation scenario, which is not limited in this application. In some embodiments of the present application, the classification number determination result may be multiple classification numbers of multiple categories. For example, the classification number determination result may be 5 subclass numbers and 5 subgroup numbers, 4 subclass numbers and 4 subgroup numbers, or any other numbers of subclass numbers and subgroup numbers. In some embodiments of the present application, the classification number determination result may also be multiple classification numbers of a single category, such as 5 subclass numbers or 5 subgroup numbers, or any number of subclass numbers or subgroup numbers. Of course, in other embodiments of the present application, the classification number determination result may be a single classification number of a single category. The type and number of classification numbers in the determination result may be decided according to the actual classification number identification requirements and the actual implementation scenario, which is not limited in the present application. Further, in an embodiment of the present application, the classification number with the highest probability value in the classification number determination result may also be used as the main classification number.
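The following sketch illustrates, under the assumption that the classification neural network outputs separate probability vectors for subclasses and subgroups, how the top-M subclass numbers and top-L subgroup numbers could be selected; M = L = 5 mirrors the example above, and the function and its arguments are hypothetical:

```python
import torch

def select_classification_numbers(subclass_probs, subgroup_probs,
                                  subclass_labels, subgroup_labels, m=5, l=5):
    """Pick the top-M subclass numbers and top-L subgroup numbers by probability.

    `*_probs` are 1-D tensors of per-label probabilities and `*_labels` the
    corresponding label strings; M = L = 5 mirrors the example in the text.
    """
    top_m = torch.topk(subclass_probs, m).indices.tolist()
    top_l = torch.topk(subgroup_probs, l).indices.tolist()
    result = {
        "subclasses": [subclass_labels[i] for i in top_m],
        "subgroups": [subgroup_labels[i] for i in top_l],
    }
    # The highest-probability classification number may serve as the main classification number.
    result["main"] = result["subgroups"][0]
    return result
```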
Fig. 3 is a schematic diagram of an exemplary application scenario of the recognition neural network model provided in the present application. As shown in fig. 3, the recognition neural network model may adopt a network architecture of a Bert semantic feature extraction model plus a BiLSTM long-range dependency capture model.
In this example, the base model of the Bert semantic feature extraction model may be a Bert model from Smart Bud that is pre-trained on patent data. This pre-trained model is trained in an unsupervised manner on the Transformer framework (an attention-based neural network framework for natural language processing) using tens of millions of patent documents; in this example, patent semantic features can be extracted by fine-tuning this model. Semantic feature extraction with the Bert model can better distinguish ambiguous words and obtain corresponding semantic representations through unsupervised bidirectional semantic learning on a large amount of data, and works better than resolving ambiguous words with a fixed vocabulary. For example, for the two sentences "the apple harvest this year is good" and "Apple phone sales this year are good", the Bert model can distinguish the different meanings of the word "apple" in the two sentences according to the semantic information learned during pre-training, and accordingly obtain vector representations of the two occurrences of "apple" in their respective sentences.
As shown in fig. 3, since the Bert semantic feature extraction model operates entirely on numbers, the input text first needs to be mapped to numbers; this step can be completed using the mapping table of the Bert semantic feature extraction model. In this example, Bert maps the input text to two outputs: 512 word vectors of 768 dimensions each (word vectors 1, 2, ..., n) and a 768-dimensional sentence vector, where 512 is the input text length (the longest text length accepted by the Bert semantic feature extraction model) and 768 is the number of features output by the model. The sentence vector and the word vectors are passed to the downstream long-range dependency capture model for further feature extraction.
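The shapes described above can be illustrated with the following sketch. It uses the public Hugging Face transformers API and a generic Chinese Bert checkpoint as a stand-in; the application's own model is a proprietary Bert pre-trained on patent data, so the checkpoint name and the example text are assumptions:

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Placeholder checkpoint: the application's model is a proprietary Bert pre-trained
# on patent data, so "bert-base-chinese" stands in here only to show tensor shapes.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

input_text = "A patent text classification number identification method ..."  # hypothetical input text
enc = tokenizer(input_text, max_length=512, padding="max_length",
                truncation=True, return_tensors="pt")      # map the text to numbers
with torch.no_grad():
    out = bert(**enc)

word_vectors = out.last_hidden_state   # shape (1, 512, 768): 512 word vectors of 768 dimensions
sentence_vector = out.pooler_output    # shape (1, 768): one 768-dimensional sentence vector
```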
In this example, as shown in fig. 3, the long-range dependency features between the word vectors generated by the Bert semantic feature extraction model can be extracted using the BiLSTM long-range dependency capture model. The BiLSTM is a type of recurrent neural network (RNN); through its memory gate, forget gate, and output gate it learns during training to selectively retain key information in a long sequence and forget useless information, and when processing time-series data it accumulates this key information into a representation vector for the whole sequence. Unlike the ordinary LSTM model, the BiLSTM model can capture the bidirectional long-range dependencies of a sequence, and is particularly suitable for processing long text sequences such as patent data.
In an embodiment of the present application, as shown in fig. 3, the capture model may adopt a two-layer BiLSTM, and the hidden state vector length may be set to 768 dimensions. The first BiLSTM layer receives the word vectors from the upstream Bert semantic feature extraction model and produces 512 vectors of 768 × 2 dimensions, which serve as the input to the second BiLSTM layer for further capture of long-range dependencies; the second BiLSTM layer likewise produces 512 hidden state results of 768 × 2 dimensions. Since the hidden state of the last step carries the richest information in an LSTM, the 768-dimensional long-range dependency feature vectors of the final hidden state in each direction of the bidirectional LSTM are taken and concatenated into a 1536-dimensional BiLSTM feature vector. Because the model may face problems such as vanishing gradients, exploding gradients, and network degradation, and because the patent groups have many categories, a large sample volume, and slow convergence, in this example the 768-dimensional sentence vector generated by the Bert semantic feature extraction model is, after the two-layer BiLSTM, doubled in length and added element-wise to the BiLSTM feature vector to obtain the residual feature vector. This residual structure makes the model converge faster and recognize more accurately.
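A minimal PyTorch sketch of the two-layer BiLSTM and the residual connection described above is given below; the layer and vector sizes follow the text, while the module name and other implementation details are assumptions:

```python
import torch
import torch.nn as nn

class LongRangeDependencyCapture(nn.Module):
    """Two-layer BiLSTM over the Bert word vectors, with the residual connection described above."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=dim, hidden_size=dim, num_layers=2,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vectors, sentence_vector):
        # word_vectors: (batch, 512, 768); sentence_vector: (batch, 768)
        outputs, _ = self.bilstm(word_vectors)   # (batch, 512, 1536)
        # Concatenate the final hidden state of each direction (forward at the last
        # position, backward at the first position): 768 + 768 = 1536 dimensions.
        bilstm_feat = torch.cat([outputs[:, -1, :768], outputs[:, 0, 768:]], dim=-1)
        # Double the 768-dimensional sentence vector to 1536 dimensions and add element-wise.
        residual = torch.cat([sentence_vector, sentence_vector], dim=-1)
        return bilstm_feat + residual            # residual feature vector, (batch, 1536)
```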
In one embodiment of the present application, classification prediction may be performed by the classification neural network. The classification neural network can comprise a fully-connected classifier into which the residual feature vector obtained upstream is input for classification. Specifically, a fully-connected network is designed whose input nodes can be 1536-dimensional, corresponding to the dimensions of the residual features, and whose output nodes can correspond to the numbers of patent subclasses and subgroups, namely 653 and 76214. The resulting output vector is a set of probability values representing the probability that the patent belongs to each category, and the classification number determination result is determined by the classification numbers whose probability values rank in the top N (N ≥ 1). For example, the categories with probability values ranked in the top 5, i.e., 5 subclass numbers and 5 subgroup numbers, may be selected as the subclass and subgroup numbers recommended for the patent.
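The classifier described above might look like the following sketch; splitting it into two linear heads (one for the 653 subclasses, one for the 76214 subgroups) and using a sigmoid to obtain per-label probabilities are assumptions about details the text leaves open:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Fully-connected classifier over the 1536-dimensional residual feature vector."""

    def __init__(self, in_dim: int = 1536, num_subclasses: int = 653, num_subgroups: int = 76214):
        super().__init__()
        self.subclass_head = nn.Linear(in_dim, num_subclasses)
        self.subgroup_head = nn.Linear(in_dim, num_subgroups)

    def forward(self, residual_features):
        # Sigmoid turns the logits into per-label probabilities for the multi-label
        # task; the top-N labels then form the classification number determination result.
        subclass_probs = torch.sigmoid(self.subclass_head(residual_features))
        subgroup_probs = torch.sigmoid(self.subgroup_head(residual_features))
        return subclass_probs, subgroup_probs
```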
In this example, the whole recognition process is not only fast but also accurate. In one application example, the recognition time per target patent text is 0.105 seconds (subclass) and 0.176 seconds (subgroup), the top-1 subclass accuracy is above 84%, the top-3 subclass accuracy is above 94%, the top-1 subgroup accuracy is above 48%, and the top-3 subgroup accuracy is above 68%. The recognition efficiency is far higher than that of manual recognition, and the recognition accuracy is effectively improved compared with existing machine recognition.
Of course, the types and architectures of the various types of neural networks in the embodiments described above are exemplary. In other embodiments of the present application, other types and architectures of text feature extraction neural networks, long-range dependency relationship capture networks, and classification neural networks may be adopted, as long as the corresponding functions can be implemented, which is not limited in the present application.
In an embodiment of the present application, obtaining the input text corresponding to the target patent text may include: preprocessing a specific sub-text of the target patent text to obtain the input text.
In this example, the specific sub-text may include a title text, an abstract text, a claim text, and the like of the target patent text, or a combination thereof. Preprocessing the specific sub-texts of the target patent text to obtain the input text may include:
carrying out data cleaning processing and keyword extraction processing on the claim text to obtain a keyword text corresponding to the claim text;
combining the keyword text, the title text, the abstract text and the like to obtain the input text with the text length of a fixed value.
Specifically, the title text, abstract text, and claim text of the target patent text can be selected as the data basis, and corresponding preprocessing is performed on these sub-texts. In this example, the data preprocessing stage is primarily directed at the claim text. Since many claim fields are obtained from external sources, the following problems are common:
the right text contains a number of html tags such as: the myopia prevention lamp according to claim 1, wherein the dimming circuit comprises a four-position switch SA, the first position of the switch SA is light-off, the second position of the switch SA is connected with the red light source (6) and the green light source (7) through a capacitor C, the third position of the switch SA is connected with the red light source (6) and the green light source (7) through a diode, and the fourth position of the switch SA is connected with the red light source (6) and the green light source (7). </seg-refi > </div > ". The html tags do not have semantic information, influence judgment of the model and belong to noise data.
In addition, the claim text contains a large number of redundant phrases. For example, one claim text reads: "0005.5. A dynamic reactive power compensator with slow-falling grid voltage overcurrent protection according to claim 1, characterized in that the input contactor (KM) is connected with a soft start resistor (R) in parallel. 0006.6. A dynamic reactive power compensator using slow buck gate voltage overcurrent protection as claimed in claim 1, wherein the filter is a high pass filter. 0007.7. The dynamic reactive power compensation device with slow grid voltage reduction and overcurrent protection according to claim 1, wherein the inverter is a three-phase half-bridge inverter. 0008.8. The dynamic reactive power compensation device adopting slow-falling gate voltage overcurrent protection according to claim 1, wherein the harmonic current detection circuit comprises an FFT module and an IFFT module. 0009.9. A dynamic reactive power compensation device with slow buck gate voltage overcurrent protection according to claim 1, wherein the frequency of the total current of the device is 64.8 KHz. 0010.10. A dynamic reactive power compensation device with slow buck gate voltage over current protection as claimed in claim 9, wherein the compensation response time of the device is less than 10ms". Here the phrase "the dynamic reactive power compensation device adopting slow grid voltage reduction overcurrent protection according to claim 1" is repeated six times, and such repeated statements affect the recognition performance of the subsequent recognition neural network.
Some claim texts also contain many reference numerals tied to the drawings, such as: "a bearing device (1) for a sanitary fixture, comprising a bearing element (2) for mounting the sanitary fixture; the base part (3) is connectable to the bearing element (2), wherein the base part (3) comprises a support portion (4) with a support surface (5), the support surface (5) being connected with the base part (3)". These numerals carry logical relationships that can only be understood by referring to the patent drawings, so they are not suitable for a natural language processing method.
Moreover, the claim fields are long, and their information and wording are not concise, which increases the difficulty of semantic feature extraction for the subsequent recognition model.
In view of the above problems, in this example the claim text may be processed with data cleaning and keyword extraction techniques; for example, the keyword text corresponding to the claim text may be obtained with regular-expression cleaning, deduplication, and short-text keyword extraction (phrase extraction) algorithms.
Further, in this example, the title text, the abstract text, the keyword text, and the like may be spliced to obtain the final text. In some embodiments of the present application, the length of the input text must be a fixed value due to the input limitations of the downstream model; the final text may therefore be truncated or padded to the fixed length of 512 to serve as the input text.
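The cleaning and splicing steps described above could be sketched as follows; the regular expressions, the sentence-level deduplication, and the simple truncation are stand-ins for the application's actual cleaning and phrase-extraction algorithms, which are not spelled out:

```python
import re

def build_input_text(title: str, abstract: str, claims: str, max_len: int = 512) -> str:
    """Sketch of the pre-processing described above (assumptions noted in the lead-in)."""
    cleaned = re.sub(r"<[^>]+>", " ", claims)            # strip HTML tags such as </div>
    cleaned = re.sub(r"\(\s*\d+\s*\)", " ", cleaned)     # drop drawing reference numerals like "(6)"
    sentences, seen = [], set()
    for s in re.split(r"[.。]", cleaned):                # crude de-duplication of repeated claim phrases
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            sentences.append(s)
    keyword_text = " ".join(sentences)                   # stand-in for the phrase-extraction step
    text = " ".join([title, abstract, keyword_text])     # splice title + abstract + keyword text
    # Truncate to the fixed length; padding up to 512 tokens can be left to the
    # downstream Bert tokenizer, which accepts at most 512 tokens.
    return text[:max_len]
```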
Fig. 4 is a schematic diagram of an exemplary acquisition process of the training sample set provided in the present application. In this example, for IPC classification numbers, each patent text may correspond to multiple classification numbers, so this is a multi-label classification task, and data imbalance in a multi-label task degrades the classification effect. How to construct class-balanced training and test sample sets is therefore a difficulty and a key point of the training step. Specifically, as shown in fig. 4, some groups are rarely used while others are very common, so the numbers of patents owned by different groups differ greatly. To ensure that the model learns every group number sufficiently, the numbers of samples across groups need to be kept balanced. Therefore, in this example, a strategy of preferentially pulling patents of the rare groups into the data set is adopted, keeping the number of samples for each group number at about 50 where possible, of which 40 are patent text training samples and 10 are patent text test samples. The specific steps are as follows:
In the original data, each patent number corresponds to multiple subgroup numbers. Taking the subgroup numbers as the targets, key-value conversion is performed on the original data, the number of patents under each subgroup number is counted, and a subgroup number list sorted from smallest to largest count is obtained;
the subgroup number list is traversed, and for each subgroup number it is judged whether the number of patents it owns exceeds 50. If not, all its patents are taken and distributed to the patent text training sample set and the patent text test sample set at a 4:1 ratio; if it exceeds 50, the patent text training set is first filled to 40 samples and the patent text test set is then filled to 10 samples, after which data pulling proceeds to the next subgroup number;
since the multi-label task allows data reuse, when a patent number is added to the data set of the current subgroup number it is also added to the data sets of its other subgroup numbers, so the same data can be used multiple times. The fact that one piece of data (a patent number) carries multiple classification number labels (subgroup numbers) is the direct cause of data imbalance in the multi-label task, and this data reuse can alleviate the imbalance to some extent: samples of rare labels are pulled preferentially, which indirectly increases the counts of the common labels and avoids pulling the common labels directly. The training and test sample sets of all subgroup numbers are then merged to form the final training sample set and test sample set. Of course, the above numbers and ratios are exemplary; in other embodiments of the present application, the sizes and ratios of the sample sets may be determined according to the actual data volume and training requirements, which is not limited in the present application.
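The balancing strategy described above could be sketched as follows; the data structures and the exact filling logic are simplified assumptions (for instance, patents already present through other labels are not tracked separately here):

```python
from collections import defaultdict

def build_balanced_sets(samples, train_cap=40, test_cap=10, threshold=50):
    """Sketch of the class-balancing strategy; `samples` maps a patent number to its subgroup labels."""
    by_group = defaultdict(list)
    for patent_id, groups in samples.items():
        for g in groups:
            by_group[g].append(patent_id)        # a patent is reused under each of its labels

    train_set, test_set = defaultdict(list), defaultdict(list)
    # Process the rarest subgroup numbers first, as described above.
    for g, patents in sorted(by_group.items(), key=lambda kv: len(kv[1])):
        if len(patents) <= threshold:
            split = int(len(patents) * 0.8)                        # 4:1 train/test split
            train_set[g], test_set[g] = patents[:split], patents[split:]
        else:
            train_set[g] = patents[:train_cap]                     # fill the training set to 40
            test_set[g] = patents[train_cap:train_cap + test_cap]  # then fill the test set to 10
    return train_set, test_set
```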
In an application example of the application, 77850 subgroup numbers are recorded in a commercial patent database, of which 1636 subgroup numbers and 1 subclass number have no patent data and are therefore removed. The remaining 76214 subgroup numbers (653 subclass numbers in total) yield 1315558 patent records, of which 1001821 form the patent text training sample set and 313737 form the patent text test sample set. On average there are 69 samples under each subgroup number, with most subgroup numbers having between 20 and 69 samples.
With the implementations of the methods provided by the above embodiments, the classification number of a target patent text can be identified from the input text of the target patent text, so the efficiency and accuracy of classification number identification can be effectively improved and labor cost reduced.
Furthermore, the recognition neural network adopts an architecture of a semantic feature extraction neural network (such as a Bert model) plus a long-range dependency capture neural network (such as a BiLSTM), which can effectively mitigate problems such as vanishing and exploding gradients, further improving the accuracy of classification number identification.
Based on the methods provided by the above embodiments, the present application also provides a device for identifying a patent text classification number. Fig. 2 is a schematic diagram of a module structure of a device for identifying a patent text classification number provided by the present application. As shown in fig. 2, the apparatus may include:
the data preprocessing module 101 may be configured to obtain an input text corresponding to a target patent text;
the recognition module 102 may be configured to input the input text into the recognition neural network, and output the classification number determination result of the target patent text after being processed by the recognition neural network.
In one embodiment of the present application, the recognition module 102 may include:
a semantic feature extraction unit, which may be configured to input the input text into a semantic feature extraction neural network and output a word vector and a sentence vector after processing by the semantic feature extraction neural network;
a long-range dependency capture unit, which may be configured to input the word vector and the sentence vector into a long-range dependency capture neural network and output a residual feature vector after processing by the long-range dependency capture neural network;
and a classification unit, which may be configured to input the residual feature vector into a classification neural network and obtain the classification number determination result after processing by the classification neural network.
It will be clear to those skilled in the art that the embodiments of the present application may be referred to each other, for example, for convenience and brevity of description, the specific working procedures of the units or modules in the above-described apparatus and devices may be described with reference to the corresponding procedures in the foregoing method embodiments.
Fig. 5 is a block diagram of an exemplary module structure of an electronic device provided in the present application. In one embodiment of the present application, the electronic device 1900 may be provided as a server. Referring to fig. 5, electronic device 1900 may include a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the methods described herein.
The present application may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA) can execute computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to implement aspects of the present application.
Aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), electronic devices and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present application, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It should be understood that the above embodiments are exemplary and are not intended to cover every implementation falling within the claims. Various modifications and changes may also be made on the basis of the above embodiments without departing from the scope of the present application. Likewise, various features of the above embodiments may be combined to form additional embodiments of the present application that are not explicitly described. Therefore, the above examples only represent some embodiments of the present application and do not limit its scope.

Claims (14)

1. A patent text classification number identification method is characterized by comprising the following steps:
acquiring an input text corresponding to a target patent text;
and inputting the input text into a recognition neural network, and outputting a classification number determination result of the target patent text after the input text is processed by the recognition neural network.
2. The method of claim 1, wherein the recognition neural network comprises a semantic feature extraction neural network, a long-range dependency capture neural network and a classification neural network, and wherein inputting the input text into the recognition neural network and outputting the classification number determination result of the target patent text after processing by the recognition neural network comprises:
inputting the input text into the semantic feature extraction neural network, and outputting a word vector and a sentence vector after the input text is processed by the semantic feature extraction neural network;
inputting the word vector and the sentence vector into the long-range dependency capture neural network, and outputting a residual feature vector after processing by the long-range dependency capture neural network;
and inputting the residual feature vector into the classification neural network, and obtaining the classification number determination result after the residual feature vector is processed by the classification neural network.
3. The method according to claim 1, wherein the obtaining of the input text corresponding to the target patent text comprises:
and preprocessing the specific subfile of the target patent text to obtain the input text.
4. The method of claim 3, wherein the specific sub-text comprises a title text, an abstract text, and a claim text of the target patent text, and wherein the preprocessing the specific sub-text of the target patent text to obtain the input text comprises:
carrying out data cleaning processing and keyword extraction processing on the claim text to obtain a keyword text corresponding to the claim text;
and combining the keyword text, the title text and the abstract text to obtain the input text with the text length being a fixed value.
5. The method of claim 2, wherein inputting the residual feature vector into the classification neural network, and obtaining the classification number determination result after the processing by the classification neural network comprises:
inputting the residual feature vector into the classification neural network, and outputting probability values of the target patent text belonging to the respective classification numbers after the residual feature vector is processed by the classification neural network;
and arranging the classification numbers according to their corresponding probability values from high to low, and determining the classification numbers whose probability values rank in the top N as the classification number determination result, wherein N is greater than or equal to 1.
6. The method of claim 5, wherein the classification numbers comprise minor class numbers and minor group numbers, the probability values of the classification numbers comprise probability values of the minor class numbers and probability values of the minor group numbers, and the top N classification numbers comprise the first M minor class numbers and the first L minor group numbers ranked by their corresponding probability values, wherein M is greater than or equal to 1 and L is greater than or equal to 1.
7. The method of claim 2, wherein the semantic feature extraction neural network comprises a BERT semantic feature extraction model, and wherein the long-range dependency capture neural network comprises a bidirectional long-range dependency capture model.
8. The method of claim 1, wherein the recognition neural network is a trained neural network, and the training of the recognition neural network comprises:
constructing a patent text training sample set, wherein each patent text sample in the patent text training sample set corresponds to one or more classification number labels;
inputting an input text corresponding to the patent text sample in the patent text training sample set into the recognition neural network, and outputting a predicted classification number determination result after the input text is processed by the recognition neural network;
determining the loss of the processing result of the recognition neural network according to the predicted classification number determination result and the classification number label corresponding to the sample;
back-propagating the loss to the recognition neural network to adjust network parameters of the recognition neural network.
9. The method of claim 8, wherein the constructing the patent text training sample set comprises:
obtaining a classification number label sorting list according to the classification number label corresponding to each patent text sample in the original data;
traversing the sorted list of the classification number labels to obtain the number of samples corresponding to each classification number label;
if the number of samples corresponding to a classification number label is smaller than a preset value, distributing the samples corresponding to that classification number label to the patent text training sample set and the patent text testing sample set according to a preset proportion;
and if the number of samples corresponding to the classification number label is greater than or equal to the preset value, filling a first fixed number of samples into the patent text training sample set and a second fixed number of samples into the patent text testing sample set.
10. The method of claim 8 or 9, wherein the classification number label comprises a minor group number and a minor class number.
11. A patent text classification number recognition apparatus, comprising:
the data preprocessing module is configured to acquire an input text corresponding to a target patent text;
and the recognition module is configured to input the input text into a recognition neural network, and output a classification number determination result of the target patent text after the input text is processed by the recognition neural network.
12. The apparatus of claim 11, wherein the identification module comprises:
the semantic feature extraction unit is configured to input the input text into a semantic feature extraction neural network, and output a word vector and a sentence vector after being processed by the semantic feature extraction neural network;
the long-range dependency capture unit is configured to input the word vector and the sentence vector into a long-range dependency capture neural network, and output a residual feature vector after processing by the long-range dependency capture neural network;
and the classification unit is configured to input the residual feature vector into a classification neural network, and obtain the classification number determination result after the residual feature vector is processed by the classification neural network.
13. An electronic device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor implements the method of any one of claims 1 to 10 by invoking the executable instructions.
14. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 10.
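
Claims 2 and 7 describe a three-stage recognition neural network: a BERT-based semantic feature extraction stage that outputs word vectors and a sentence vector, a bidirectional long-range dependency capture stage that outputs a residual feature vector, and a classification stage that maps that vector to classification numbers. The PyTorch sketch below only illustrates that arrangement under stated assumptions: the "bert-base-chinese" checkpoint, the use of a bidirectional LSTM as the dependency capture model, the mean-pooling-plus-sentence-vector form of the residual feature vector, and the label count are all assumed here and are not specified by the claims.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class ClassificationNumberNet(nn.Module):
        def __init__(self, num_labels: int, hidden: int = 768):
            super().__init__()
            # Semantic feature extraction neural network (assumed checkpoint).
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            # Bidirectional LSTM as one possible long-range dependency capture model.
            self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
            # Classification neural network: one logit per classification number.
            self.classifier = nn.Linear(hidden, num_labels)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            word_vectors = out.last_hidden_state      # per-token ("word") vectors
            sentence_vector = out.pooler_output       # pooled sentence vector
            lstm_out, _ = self.bilstm(word_vectors)
            # Assumed form of the residual feature vector: pooled BiLSTM features
            # plus the sentence vector as a residual connection.
            residual_feature_vector = lstm_out.mean(dim=1) + sentence_vector
            return self.classifier(residual_feature_vector)

    # Usage sketch on a single (hypothetical) patent title.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    enc = tokenizer("一种专利文本分类号识别方法", return_tensors="pt",
                    padding="max_length", truncation=True, max_length=64)
    model = ClassificationNumberNet(num_labels=1000)
    logits = model(enc["input_ids"], enc["attention_mask"])   # shape: (1, 1000)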
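
Claims 3 and 4 describe obtaining the input text by cleaning the claim text, extracting keywords from it, and combining the resulting keyword text with the title text and abstract text into an input text of fixed length. A minimal Python sketch follows; the cleaning rules, the frequency-based keyword extraction, the stop-word list, and the 512-character limit are illustrative assumptions rather than details taken from the claims.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "said", "wherein", "claim"}  # illustrative only

    def clean_text(text: str) -> str:
        # Basic data cleaning: drop markup-like residue, keep letters, digits and CJK characters.
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]+", " ", text)
        return re.sub(r"\s+", " ", text).strip().lower()

    def extract_keywords(text: str, top_k: int = 30) -> list[str]:
        # Toy frequency-based keyword extraction standing in for the claimed step.
        tokens = [t for t in clean_text(text).split() if t not in STOPWORDS and len(t) > 1]
        return [w for w, _ in Counter(tokens).most_common(top_k)]

    def build_input_text(title: str, abstract: str, claims: str, max_len: int = 512) -> str:
        # Combine keyword text, title text and abstract text into a fixed-length input text.
        keyword_text = " ".join(extract_keywords(claims))
        combined = " ".join([clean_text(title), clean_text(abstract), keyword_text])
        return combined[:max_len].ljust(max_len)      # truncate or pad to the fixed value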
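
Claims 5 and 6 describe ranking the classification numbers by their predicted probability values and keeping the top M minor class numbers and the top L minor group numbers. The sketch below assumes, purely for illustration, that the classifier emits one logit vector for a single patent text in which the first entries correspond to minor class numbers and the remaining entries to minor group numbers, and that sigmoid probabilities are used for the multi-label setting.

    import torch

    def top_n_classification_numbers(logits: torch.Tensor,
                                     minor_class_ids: list[str],
                                     minor_group_ids: list[str],
                                     m: int = 3, l: int = 5) -> list[str]:
        # logits: 1-D tensor of length len(minor_class_ids) + len(minor_group_ids)
        # (assumed layout). Returns the top M minor class numbers and the top L
        # minor group numbers ranked by probability value, with M >= 1 and L >= 1.
        probs = torch.sigmoid(logits)
        n_class = len(minor_class_ids)
        class_probs, group_probs = probs[:n_class], probs[n_class:]
        top_class = [minor_class_ids[i] for i in torch.topk(class_probs, m).indices.tolist()]
        top_group = [minor_group_ids[i] for i in torch.topk(group_probs, l).indices.tolist()]
        return top_class + top_group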
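
Claim 8 describes training the recognition neural network by comparing the predicted classification number determination result with the classification number labels of each sample, computing a loss, and back-propagating it to adjust the network parameters. A minimal PyTorch training-loop sketch follows; the choice of BCEWithLogitsLoss for the multi-label case and the (input_ids, attention_mask, labels) data-loader layout are assumptions.

    import torch.nn as nn

    def train_one_epoch(model, loader, optimizer, device="cpu"):
        # One pass over the patent text training sample set: predict, compute the
        # loss against the multi-hot classification number labels, back-propagate,
        # and adjust the network parameters.
        criterion = nn.BCEWithLogitsLoss()      # each sample may carry several labels
        model.train()
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            logits = model(input_ids.to(device), attention_mask.to(device))
            loss = criterion(logits, labels.float().to(device))
            loss.backward()                     # back-propagate the loss
            optimizer.step()                    # adjust the network parameters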
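
Claim 9 describes constructing the training and testing sample sets label by label: classification number labels with fewer samples than a preset value are split between the two sets by a preset proportion, while labels with at least that many samples contribute a first fixed number of samples to the training set and a second fixed number to the testing set. The sketch below illustrates that procedure; the concrete values of preset_value, ratio, first_fixed and second_fixed, and the shuffling step, are illustrative assumptions.

    import random
    from collections import defaultdict

    def build_sample_sets(raw_data, preset_value=1000, ratio=0.9,
                          first_fixed=900, second_fixed=100):
        # raw_data: iterable of (sample, [classification number labels]).
        by_label = defaultdict(list)
        for sample, labels in raw_data:
            for label in labels:
                by_label[label].append(sample)

        train_set, test_set = [], []
        # Traverse a sorted list of labels (here: by ascending sample count).
        for label, samples in sorted(by_label.items(), key=lambda kv: len(kv[1])):
            random.shuffle(samples)
            if len(samples) < preset_value:     # rare label: split by the preset proportion
                cut = int(len(samples) * ratio)
                train_set += samples[:cut]
                test_set += samples[cut:]
            else:                               # frequent label: take fixed numbers of samples
                train_set += samples[:first_fixed]
                test_set += samples[first_fixed:first_fixed + second_fixed]
        return train_set, test_set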
CN202210120391.5A 2022-02-07 2022-02-07 Patent text classification number identification method and device, electronic equipment and storage medium Pending CN114461801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210120391.5A CN114461801A (en) 2022-02-07 2022-02-07 Patent text classification number identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210120391.5A CN114461801A (en) 2022-02-07 2022-02-07 Patent text classification number identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114461801A true CN114461801A (en) 2022-05-10

Family

ID=81412692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210120391.5A Pending CN114461801A (en) 2022-02-07 2022-02-07 Patent text classification number identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114461801A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115827875A (en) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Text data processing terminal searching method
CN115827875B (en) * 2023-01-09 2023-04-25 无锡容智技术有限公司 Text data processing terminal searching method
CN115858793A (en) * 2023-02-20 2023-03-28 知呱呱(天津)大数据技术有限公司 Patent multi-level classification method based on graphic attention mechanism and computer equipment
CN117591676A (en) * 2024-01-19 2024-02-23 数据空间研究院 Method for identifying enterprise on industrial chain of Coarse-to-fine
CN117591676B (en) * 2024-01-19 2024-04-05 数据空间研究院 Method for identifying enterprise on industrial chain of Coarse-to-fine

Similar Documents

Publication Publication Date Title
CN114461801A (en) Patent text classification number identification method and device, electronic equipment and storage medium
US11645517B2 (en) Information processing method and terminal, and computer storage medium
CN107291840B (en) User attribute prediction model construction method and device
CN107145485B (en) Method and apparatus for compressing topic models
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN106294344A (en) Video retrieval method and device
CN110377727B (en) Multi-label text classification method and device based on multi-task learning
CN108537257B (en) Zero sample image classification method based on discriminant dictionary matrix pair
CN110245232A (en) File classification method, device, medium and calculating equipment
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
CN110232128A (en) Topic file classification method and device
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN110046279A (en) Prediction technique, medium, device and the calculating equipment of video file feature
CN114741581A (en) Image classification method and device, computer equipment and medium
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN116186268A (en) Multi-document abstract extraction method and system based on Capsule-BiGRU network and event automatic classification
CN110968664A (en) Document retrieval method, device, equipment and medium
CN116881462A (en) Text data processing, text representation and text clustering method and equipment
US20210166016A1 (en) Product baseline information extraction
CN116383517A (en) Dynamic propagation feature enhanced multi-modal rumor detection method and system
CN114970666A (en) Spoken language processing method and device, electronic equipment and storage medium
CN110781283B (en) Chain brand word stock generation method and device and electronic equipment
CN112308453B (en) Risk identification model training method, user risk identification method and related devices
CN112949313A (en) Information processing model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination