CN115168590A - Text feature extraction method, model training method, device, equipment and medium - Google Patents

Text feature extraction method, model training method, device, equipment and medium Download PDF

Info

Publication number
CN115168590A
Authority
CN
China
Prior art keywords
word
feature vector
patent text
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210921235.9A
Other languages
Chinese (zh)
Inventor
郑侃
齐家驹
侯璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jizhigan Technology Co ltd
Original Assignee
Beijing Jizhigan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jizhigan Technology Co ltd filed Critical Beijing Jizhigan Technology Co ltd
Priority to CN202210921235.9A priority Critical patent/CN115168590A/en
Publication of CN115168590A publication Critical patent/CN115168590A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure provide a text feature extraction method, a model training method, a device, equipment, and a medium. The text feature extraction method includes: obtaining a patent text, where the patent text includes a plurality of words; determining a first feature vector of each of the plurality of words; determining, for each word, the similarity between the word and each of the plurality of words; weighting the first feature vectors based on the similarities to obtain a second feature vector of each word; and processing the second feature vectors through a convolutional layer to obtain patent text features.

Description

Text feature extraction method, model training method, device, equipment and medium
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a text feature extraction method, a model training method, an apparatus, a device, and a medium.
Background
As a measure of scientific and technological development in modern society, patents serve to some extent as a weather vane. By studying information such as the development context, filing volume, and research directions of patents, one can learn about current technical hot spots, gaps, and barriers, and anticipate important information such as the next major technological breakthrough. Patent research and decision-making require collating effective information from a huge volume of patent texts.
Vectorized representation is an important method for studying patents: a text processing algorithm extracts a patent feature vector as a representation of the patent, so that patent analysts can further analyze it with mathematical methods. Existing feature vector extraction methods of this kind mainly count the occurrence frequency of each noun and verb in the patent text and construct the patent's feature vector from the nouns and verbs with the highest word frequencies. However, this approach loses considerable information, degrading subsequent analysis.
Disclosure of Invention
In order to solve the problems in the related art, embodiments of the present disclosure provide a text feature extraction method, a model training method, a device, an apparatus, and a medium.
One aspect of the present disclosure provides a text feature extraction method, including: obtaining a patent text, where the patent text includes a plurality of words; determining a first feature vector of each of the plurality of words; determining, for each word, the similarity between the word and each of the plurality of words; weighting the first feature vectors based on the similarities to obtain a second feature vector of each word; and processing the second feature vectors through a convolutional layer to obtain patent text features.
Another aspect of the present disclosure provides a model training method, the model including an encoder and a classifier, the method including: obtaining a plurality of patent texts and classification labels of the patent texts, processing the patent texts through an encoder to obtain patent text characteristics, processing the patent text characteristics through a classifier to obtain classification prediction results, and training the encoder and the classifier based on the classification labels and the classification prediction results.
Another aspect of the present disclosure provides a text feature extraction apparatus, including a first obtaining module, a first determining module, a second determining module, a first processing module, and a second processing module. The first obtaining module is configured to obtain a patent text, the patent text including a plurality of words. The first determining module is configured to determine a first feature vector of each of the plurality of words. The second determining module is configured to determine, for each word, the similarity between the word and each of the plurality of words. The first processing module is configured to weight the first feature vectors based on the similarities to obtain a second feature vector of each word. The second processing module is configured to process the second feature vectors through a convolutional layer to obtain patent text features.
Another aspect of the present disclosure provides a model training apparatus, where the model includes an encoder and a classifier, and the apparatus includes a second obtaining module, a feature extraction module, a third processing module, and a training module. The second obtaining module is configured to obtain a plurality of patent texts and classification labels of the patent texts. The feature extraction module is configured to process the patent texts through the encoder to obtain patent text features. The third processing module is configured to process the patent text features through the classifier to obtain a classification prediction result. The training module is configured to train the encoder and the classifier based on the classification labels and the classification prediction results.
Another aspect of the present disclosure provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
Another aspect of the disclosure provides a computer-readable storage medium storing computer-readable instructions for implementing the method as described above when executed by a processor.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
According to the embodiments of the disclosure, the similarity between each word and the plurality of words in the patent text is obtained, the second feature vector of each word is obtained through weighting, and the second feature vectors are processed through a convolutional layer to obtain the patent text features. Each word in a sentence is thus embedded with context information, which greatly enhances the perception range and global analysis capability of the convolution kernel, preserves the word and word-order information of the entire text, and yields patent text features that can be represented in a continuous vector space. The method can accurately extract a feature vector representing the patent text, aiding patent researchers in decision analysis.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
fig. 1 schematically shows a system architecture diagram to which the text feature extraction method of the embodiment of the present disclosure is applied;
FIG. 2 schematically illustrates a flow diagram of a text feature extraction method of an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart for determining a first feature vector according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of processing a second feature vector of an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a text feature extraction method of another embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a model training method of an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart for obtaining category labels according to an embodiment of the present disclosure;
fig. 8 schematically shows a block diagram of a text feature extraction apparatus of an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a model training apparatus of an embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of a computer system suitable for implementing the methods and apparatus of embodiments of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The embodiment of the disclosure provides a text feature extraction method, which includes obtaining a patent text, where the patent text includes a plurality of words; determining a first feature vector of each of the plurality of words; determining, for each word, the similarity between the word and each of the plurality of words; weighting the first feature vectors based on the similarities to obtain a second feature vector of each word; and processing the second feature vectors through a convolutional layer to obtain patent text features. The method embeds context information into each word in a sentence, greatly enhancing the perception range and global analysis capability of the convolution kernel, thereby preserving the word and word-order information of the entire text; the resulting patent text features can be expressed in a continuous vector space. The method can accurately extract a feature vector representing the patent text, aiding patent researchers in decision analysis.
Technical solutions provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 schematically shows a system architecture diagram to which the text feature extraction method according to the embodiment of the present disclosure is applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. Various client applications may be installed on the terminal devices 101, 102, 103. Such as browser-type applications, search-type applications, instant messaging-type tools, and so forth.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various special purpose or general purpose electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module.
The server 105 may be a server that provides various services, such as a backend server that provides services for client applications installed on the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module.
The method provided by the embodiment of the present disclosure may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105, for example. Alternatively, the method of the embodiments of the present disclosure may be performed in part by the terminal devices 101, 102, 103, and in other parts by the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flowchart of a text feature extraction method according to an embodiment of the present disclosure.
As shown in fig. 2, the text feature extraction method includes operations S202, S204, S206, S208, and S210.
In operation S202, a patent text including a plurality of words is acquired.
In operation S204, a first feature vector of each of the plurality of words is determined.
In operation S206, for each word, a similarity between the word and each of the plurality of words is determined.
In operation S208, the first feature vector is weighted based on the similarity, and a second feature vector of each word is obtained.
In operation S210, the second feature vector is processed by the convolutional layer to obtain patent text features.
According to embodiments of the present disclosure, the patent text may include some or all of the abstract, the independent claims, the dependent claims, and the detailed description. Extracting feature vectors from the whole patent text, or from its important parts such as the abstract, detailed description, and claims, preserves as much of the patent's important information as possible. The original patent text is used directly as the input text to train the neural network, retaining all word and word-order information of the original.
According to the embodiment of the disclosure, the patent text can be segmented with any of various existing word segmentation tools, turning the patent text into an ordered sequence of words.
According to embodiments of the present disclosure, the first feature vector of each word may be determined using an existing word embedding algorithm, such as the word2vec algorithm provided by Google Inc. Word embedding refers to representing each word as a word vector. A whole sentence can then be converted into a matrix that retains the word-order information of the original sentence and contains the semantic information of each word.
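For illustration, a minimal sketch of producing such embeddings with the open-source gensim library follows (an assumption; the disclosure does not name an implementation, and the corpus, vector size, and window below are illustrative):

```python
from gensim.models import Word2Vec

# Each patent text is pre-segmented into an ordered list of words.
corpus = [
    ["text", "feature", "extraction", "method"],
    ["model", "training", "method", "device"],
]

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# First feature vector of a word: its word2vec embedding (numpy array of shape (100,)).
q = w2v.wv["feature"]
```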
Fig. 3 schematically illustrates a flow chart for determining a first feature vector according to an embodiment of the present disclosure.
As shown in fig. 3, operation S204 may include operations S302, S304, and S306.
In operation S302, a plurality of words in the patent text are processed based on an associated topic model algorithm to obtain a plurality of topics.
In operation S304, a probability that each word belongs to each topic is determined.
In operation S306, a first feature vector is constructed based on the probability.
The associated topic model algorithm may be, for example, Latent Dirichlet Allocation (LDA). Using LDA's unsupervised learning, the implicit semantic structure of the patent text can be clustered and semantically analyzed to mine the plurality of topics contained in the text. The algorithm also gives, for each word, a probability distribution over these topics. For example, if there are 8 topics, the algorithm gives 8 values representing the probabilities that the word belongs to each of the 8 topics. These probabilities can form the first feature vector, used as the word embedding vector of the word.
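One plausible way to compute these per-word topic probabilities, sketched with gensim (an assumption; the disclosure does not name an implementation, and the corpus and K = 8 topics are illustrative):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["patent", "text", "feature", "extraction"],
    ["neural", "network", "model", "training"],
]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow_corpus, num_topics=8, id2word=dictionary)

def lda_word_embedding(word):
    """First feature vector: one probability per topic for the given word."""
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_term_topics(dictionary.token2id[word],
                                              minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

q = lda_word_embedding("feature")   # LDA word embedding, shape (8,)
```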
A plurality of topics can be extracted from the patent text using the Latent Dirichlet Allocation algorithm or another associated topic model, and the generated topic model can vectorize the effective text extracted from each patent in the data set. With the topic model as prior information, a corresponding LDA word embedding can be computed for each word in the patent text and used as input data for the subsequent text learning neural network.
For patents, certain "functional" terms play a crucial role in semantic analysis because they carry important information such as the technical field to which the patent belongs. In the topic model, the topic probability distributions of these key words are close to one another. In particular, if the LDA algorithm is used to generate the topic model, the LDA word embeddings of these key words are very similar and strongly correlated, so they can describe the technical terms in a patent more accurately.
According to embodiments of the present disclosure, determining a first feature vector for each of the plurality of words may further include determining a first feature vector for each word based on word2vec. The word embedding vector may be determined as the first feature vector by the LDA algorithm alone, by the word2vec algorithm alone, or by both algorithms separately. In embodiments that use both algorithms, the two resulting word embedding vectors may be processed as two separate first feature vectors or concatenated into one feature vector, which is not limited herein.
Reference is made back to fig. 2. The similarity between words may be computed using, for example, the vector inner product, Euclidean distance, cosine similarity, or a perceptron.
Some related technologies apply convolutional neural networks to mine abstract features in patent texts. Their greatest disadvantage, however, is that the width of the convolution kernel is limited and the receptive field is so small that only local information can be captured. In long texts there may be an important relationship between two words far apart, and the words between them can dilute that relationship, making it difficult for the neural network to mine this information.
The method of the disclosed embodiments uses an attention mechanism to adapt the word embeddings, achieving a new text representation. Through the attention computation, the context information of each word is embedded into that word's embedding, preserving as much as possible the semantically associated context that lies far away.
According to an embodiment of the present disclosure, let the first feature vector of word i be $q_i$, the first feature vector of word j be $q_j$, the similarity between word i and word j be $a_{ij}$, and the number of words be N. For each word i, all of the word's similarities may be normalized to obtain $\hat{a}_{ij}$; for example, the softmax function may be used to normalize the weights to probability values between 0 and 1, that is:

$$\hat{a}_{ij} = \frac{\exp(a_{ij})}{\sum_{k=1}^{N} \exp(a_{ik})}$$

When the second feature vector is determined, the normalized similarities and the first feature vectors of the corresponding words are weighted and summed to obtain the attention word embedding of the current word, i.e., the second feature vector:

$$v_i = \sum_{j=1}^{N} \hat{a}_{ij}\, q_j$$
therefore, even if the window length of the convolution kernel is not changed, each vocabulary still carries context information in the convolution process, so that the perception range of the convolution neural network is greatly enhanced, and the convolution neural network has the capability of capturing global information.
For patents, some keywords play a crucial role, which means the words in a patent text are not equally important. Based on the associated topic model, the convolutional layer can be further refined to focus more on these important words.
Fig. 4 schematically illustrates a flow chart of processing a second feature vector of an embodiment of the present disclosure.
As shown in fig. 4, the operation S210 may include operations S402 and S404.
In operation S402, the second feature vectors are filtered: for each word, if no value in its second feature vector is greater than a threshold, the word is discarded.
In operation S404, the patent text feature is obtained by convolution layer processing the filtered second feature vector.
According to the embodiment of the disclosure, the LDA topic model described above can serve as prior information for deciding whether a given word in a sentence is important. Specifically, a threshold is determined, and a word is considered important only if some value in its LDA word embedding is greater than the threshold. The new convolutional layer focuses more on mining the features of the retained words, preserving the semantics of the text to the greatest extent and further enlarging the perception range of the convolution kernel.
For example, suppose the patent text includes N words, yielding N second feature vectors through LDA word embedding; if the word embedding dimension is K (i.e., the number of topics), the patent text is represented as an N × K matrix. The method of the disclosed embodiments checks the matrix row by row: if a row contains a value greater than the threshold, the row is retained; otherwise it is discarded. Processing yields an M × K matrix, where M ≤ N. This allows important technical terms to be characterized better.
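A corresponding numpy sketch of this row-wise filtering, with an assumed (not disclosed) threshold value:

```python
import numpy as np

def filter_words(V, threshold=0.3):
    """V: (N, K) second feature vectors from LDA word embedding.
    Keeps only rows containing a value above the threshold; returns (M, K), M <= N."""
    keep = (V > threshold).any(axis=1)   # important if any topic probability exceeds the threshold
    return V[keep]

V = np.random.rand(10, 8)      # 10 words, 8 topics (illustrative)
V_filtered = filter_words(V)   # M x K matrix of retained words
```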
The method illustrated in fig. 4 may be applied only to the branch whose first feature vectors are obtained through the LDA algorithm, and not to the branch whose first feature vectors are obtained through the word2vec algorithm.
Fig. 5 schematically illustrates a flowchart of a text feature extraction method according to another embodiment of the present disclosure.
As shown in fig. 5, the method includes operations S502, S504, S506, S508, S510, S512, and S514.
According to the embodiment of the disclosure, after the patent text is obtained, the process proceeds to S502 and S508, and the word embedding process is performed in two ways to convert the patent text into vectorized data.
In operation S502, the patent text is processed by means of LDA word embedding, which is described with reference to fig. 3 and is not described herein again;
In operation S504, attention-based LDA word embedding is performed, with reference to the computation of the second feature vector $v_i = \sum_{j=1}^{N} \hat{a}_{ij}\, q_j$ described above.
In operation S506, the second feature vector is filtered based on a threshold, referring to the method described above with reference to fig. 4.
In operation S508, the patent text is processed by means of Word2vec Word embedding.
In operation S510, attention-based word2vec word embedding is performed, also with reference to the computation of the second feature vector $v_i = \sum_{j=1}^{N} \hat{a}_{ij}\, q_j$ described above.
In operation S512, the outputs of the two branches are processed by the convolutional neural network model to obtain a feature vector a and a feature vector B. The neural network can extract high-level features of the text through operations of convolution, pooling and the like.
In operation S514, the feature vector A and the feature vector B may be merged to obtain the patent text features.
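As a hedged sketch of this two-branch structure (the disclosure does not prescribe a framework; PyTorch, the layer sizes, and the pooling choice below are assumptions):

```python
import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    """One branch of fig. 5: convolution over attention word embeddings, then pooling."""
    def __init__(self, embed_dim, out_dim=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size)

    def forward(self, x):                    # x: (batch, seq_len, embed_dim)
        h = self.conv(x.transpose(1, 2))     # (batch, out_dim, seq_len - kernel_size + 1)
        return h.max(dim=2).values           # global max pooling -> (batch, out_dim)

lda_branch = BranchEncoder(embed_dim=8)      # attention LDA embeddings (K = 8 topics)
w2v_branch = BranchEncoder(embed_dim=100)    # attention word2vec embeddings

lda_in = torch.rand(1, 50, 8)                # filtered second feature vectors (M = 50)
w2v_in = torch.rand(1, 120, 100)             # unfiltered second feature vectors (N = 120)

# Feature vector A and feature vector B, merged into the patent text features.
patent_features = torch.cat([lda_branch(lda_in), w2v_branch(w2v_in)], dim=1)
```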
In the above method, how the model parameters are trained to obtain more effective feature vectors is the key to accurate patent analysis.
The present disclosure also provides a model training method for training a machine learning model including an encoder and a classifier.
FIG. 6 schematically illustrates a flow chart of a model training method of an embodiment of the present disclosure.
As shown in fig. 6, the method includes operations S602, S604, S606, and S608.
In operation S602, a plurality of patent texts and classification labels of the patent texts are acquired.
In operation S604, the patent text is processed by the encoder to obtain patent text features. The encoder is configured to perform a method as described in any of the embodiments of fig. 2-5.
In operation S606, the patent text features are processed by the classifier to obtain a classification prediction result.
In operation S608, the encoder and the classifier are trained based on the classification labels and the classification prediction results.
According to the disclosed embodiment, the classification label of a patent text can be obtained from its patent classification number, such as the International Patent Classification (IPC) number. The patent data set with classification labels can then serve as the training and validation sets for training the encoder.
According to the embodiment of the disclosure, the patent data set formed by the patent texts can be preprocessed, and the effective texts of each patent, such as the full patent texts or part of important characters, are extracted to form a single sample in the patent data set.
Fine-grained classification numbers, such as the "main group" and "subgroup", often do not meet actual needs. According to the embodiment of the present disclosure, only a coarse-grained classification number at the "section" or "class" level may be used as a reference, and specific topics under that classification may be obtained by means of a topic algorithm.
Fig. 7 schematically illustrates a flowchart of acquiring a category label according to an embodiment of the present disclosure.
As shown in fig. 7, operation S602 may include operations S702, S704, and S706.
In operation S702, a primary category of the patent text is determined based on its classification number, such as the "section" or "class" in the IPC classification number.
In operation S704, a plurality of secondary categories under the same primary category and a probability that the patent text belongs to each secondary category are determined by the association topic model algorithm. The associated topic model algorithm may be, for example, the LDA algorithm described above.
In operation S706, the secondary category with the highest probability is selected as the classification label of the patent text.
The topics implicit in the patent texts are extracted using the associated topic model algorithm, and the probability of each topic for each patent is calculated, completing the labeling of the category to which the patent belongs.
According to the embodiment of the disclosure, the patent data set can be clustered into several topics according to a number of extracted keywords or subject words. Calculating the probability of each topic for each patent sample determines the category it belongs to, so that all patent samples are divided into several categories.
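A sketch of this labeling under stated assumptions (gensim as the LDA implementation; the tiny corpus and the number of secondary categories are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Effective texts of patents sharing one coarse primary category (e.g., IPC section G).
patents_in_class = [
    ["text", "feature", "extraction", "method"],
    ["image", "recognition", "neural", "network"],
]
dictionary = Dictionary(patents_in_class)
bows = [dictionary.doc2bow(p) for p in patents_in_class]
lda = LdaModel(bows, num_topics=4, id2word=dictionary)  # 4 secondary categories

# Classification label: the secondary category (topic) with the highest probability.
labels = [max(lda.get_document_topics(bow), key=lambda t: t[1])[0] for bow in bows]
```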
Reference is made back to fig. 6. The encoder may have, for example, the structure illustrated in fig. 5 for performing the text feature extraction method, but it must be trained to achieve good results. The classifier integrates the extracted high-level features, maps them to the sample label space, and then performs explicit classification; its output is the predicted probability of the patent belonging to each category. The classifier can use a fully-connected layer whose input is the output of the encoder and whose output is the patent classification prediction. The classification labels can be used to supervise and train the whole machine learning model, including the encoder and the classifier. After training, the text feature extraction method can be realized using the encoder alone: a patent text can be input directly into the encoder to obtain the vectorized patent text features, which reflect the topical characteristics of the patent.
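A minimal training sketch under stated assumptions (PyTorch; the stand-in encoder, dimensions, and optimizer settings are illustrative, not prescribed by the disclosure):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU())  # stand-in for the fig. 5 encoder
classifier = nn.Linear(128, 4)                         # fully-connected layer, 4 categories

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(16, 8)             # a batch of vectorized patent texts (illustrative)
y = torch.randint(0, 4, (16,))    # their classification labels

for _ in range(10):                       # training loop
    logits = classifier(encoder(x))       # classification prediction result
    loss = loss_fn(logits, y)             # compare prediction with labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the encoder alone maps a patent text to its feature vector.
```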
The method of the disclosed embodiments extracts patent text features based on a probability-statistical analysis algorithm, an attention mechanism, and an associated topic model, combined with a text learning neural network. It can accurately extract a feature vector representing the patent text, enabling patent researchers to perform better-grounded patent decision analysis and to discover patent hot spots, blank fields, and technical barriers.
Owing to the universality of neural networks, the method can adapt to changes in patent categories and data sets. The feature vectors extracted by the neural network take continuous values and can be represented in a continuous vector space. Compared with existing extraction methods that use isolated, discontinuous values of statistics such as word frequency, the feature vectors obtained by the technical solution of the disclosed embodiments reflect the quantitative relationships between patents more accurately and better support further analysis with mathematical methods.
For example, the patent text features obtained by processing the patent text by the method of the embodiment of the present disclosure may be used in various application scenarios, including classifying patents, drawing patent maps to find popular or blank fields, and the like.
Based on the same inventive concept, the present disclosure also provides a text feature extraction device and a model training device, and the device of the embodiment of the present disclosure is described below with reference to fig. 8 and 9.
Fig. 8 schematically illustrates a block diagram of a text feature extraction apparatus 800 according to an embodiment of the present disclosure. The apparatus 800 may be implemented as part or all of an electronic device through software, hardware, or a combination of both.
As shown in fig. 8, the text feature extraction apparatus 800 includes a first obtaining module 802, a first determining module 804, a second determining module 806, a first processing module 808, and a second processing module 810. The text feature extraction apparatus 800 may perform the various text feature extraction methods described above.
A first obtaining module 802 configured to obtain patent text, the patent text including a plurality of words.
A first determination module 804 configured to determine a first feature vector for each of the plurality of terms.
A second determining module 806 configured to determine, for each word, a similarity between the word and each word of the plurality of words.
A first processing module 808 configured to process the first feature vector based on the similarity weighting to obtain a second feature vector of each word.
A second processing module 810 configured to process the second feature vector by a convolution layer to obtain patent text features.
According to the embodiment of the disclosure, the first determining module 804 is further configured to process a plurality of words in the patent text based on an associated topic model algorithm to obtain a plurality of topics, determine a probability that each word belongs to each topic, and construct a first feature vector based on the probabilities.
According to an embodiment of the disclosure, the first determining module is further configured to determine a first feature vector for each of the plurality of words based on word2 vec.
According to an embodiment of the present disclosure, the first feature vector of word i is $q_i$, the first feature vector of word j is $q_j$, the similarity between word i and word j is $a_{ij}$, and the number of words is N. Weighting the first feature vectors based on the similarities to obtain the second feature vector of each word includes: normalizing all the similarities of each word i to obtain $\hat{a}_{ij} = \exp(a_{ij}) / \sum_{k=1}^{N} \exp(a_{ik})$, and determining the second feature vector $v_i = \sum_{j=1}^{N} \hat{a}_{ij}\, q_j$.
According to an embodiment of the present disclosure, the second processing module 810 is further configured to filter the second feature vectors (for each word, if no value in its second feature vector is greater than a threshold, the word is discarded) and to process the filtered second feature vectors through a convolutional layer to obtain the patent text features.
According to the disclosed embodiment, the patent text comprises part or all of the abstract of the specification, the independent claims, the dependent claims and the detailed description.
Fig. 9 schematically illustrates a block diagram of a model training apparatus 900 according to an embodiment of the present disclosure. The apparatus 900 may be implemented as part or all of an electronic device through software, hardware, or a combination of both.
As shown in fig. 9, the model training apparatus 900 is used for training a model including an encoder and a classifier, and the apparatus 900 includes a second obtaining module 902, a feature extracting module 904, a third processing module 906, and a training module 908. The model training apparatus 900 may perform the various model training methods described above.
A second obtaining module 902 configured to obtain a plurality of patent texts and classification labels of the patent texts.
A feature extraction module 904 configured to process the patent text through an encoder to obtain patent text features.
A third processing module 906, configured to process the patent text features through the classifier to obtain a classification prediction result.
A training module 908 configured to train the encoder and classifier based on the classification labels and classification predictions.
According to an embodiment of the present disclosure, the second obtaining module 902 is further configured to determine a primary category of the patent text based on the classification number of the patent text, determine a plurality of secondary categories under the same primary category through an associated topic model algorithm, determine a probability that the patent text belongs to each secondary category, and select the secondary category with the highest probability as the classification label of the patent text.
The present disclosure also discloses an electronic device comprising a memory for storing a program enabling the electronic device to perform the method in any of the above embodiments and a processor configured to execute the program stored in the memory to implement the method as described in any of the embodiments of fig. 2-7 above.
FIG. 10 schematically illustrates a block diagram of a computer system suitable for implementing the methods and apparatus of embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a processing unit 1001 that can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the system 1000 are also stored. The processing unit 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input portion 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary. The processing unit 1001 may be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or other processing units.
In particular, the above described methods may be implemented as computer software programs according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a medium readable thereby, the computer program comprising program code for performing the above-described method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or by programmable hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the electronic device or the computer system in the above embodiments; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept. For example, the above features and the technical features disclosed in the present disclosure (but not limited to) having similar functions are replaced with each other to form the technical solution.

Claims (12)

1. A text feature extraction method comprises the following steps:
acquiring a patent text, wherein the patent text comprises a plurality of words;
determining a first feature vector for each of the plurality of terms;
for each term, determining a similarity between the term and each term of the plurality of terms;
weighting the first feature vector based on the similarity to obtain a second feature vector of each word;
and processing the second feature vector through a convolution layer to obtain patent text features.
2. The method of claim 1, wherein the determining a first feature vector for each of the plurality of words comprises:
processing a plurality of words in the patent text based on an associated topic model algorithm to obtain a plurality of topics;
determining a probability that each term belongs to each topic;
a first feature vector is constructed based on the probabilities.
3. The method of claim 2, wherein the determining a first feature vector for each of the plurality of words further comprises:
determining a first feature vector for each of the plurality of words based on word2 vec.
4. The method of any of claims 1-3, wherein the first feature vector of word i is $q_i$, the first feature vector of word j is $q_j$, the similarity between word i and word j is $a_{ij}$, and the number of words is N, and wherein weighting the first feature vectors based on the similarities to obtain the second feature vector of each word comprises:
normalizing all the similarities of each word i to obtain $\hat{a}_{ij} = \exp(a_{ij}) / \sum_{k=1}^{N} \exp(a_{ik})$; and
determining the second feature vector $v_i = \sum_{j=1}^{N} \hat{a}_{ij}\, q_j$.
5. The method of claim 2, wherein said processing the second feature vector through convolutional layers to obtain patent text features comprises:
filtering the second feature vectors, wherein, for each word, if no value in its second feature vector is greater than a threshold, the word is eliminated;
and processing the screened second feature vector through the convolution layer to obtain the patent text feature.
6. The method of claim 1, wherein the patent text comprises some or all of a specification abstract, an independent claim, a dependent claim, a detailed description.
7. A method of model training, wherein the model comprises an encoder and a classifier, the method comprising:
acquiring a plurality of patent texts and classification labels of the patent texts;
processing the patent texts by performing, with an encoder, the method of any one of claims 1-6 to obtain patent text features;
processing the patent text features through a classifier to obtain a classification prediction result;
training the encoder and classifier based on the classification labels and classification prediction results.
8. The method of claim 7, wherein obtaining the classification label of the patent text comprises:
determining a primary category of the patent text based on the classification number of the patent text;
determining a plurality of secondary categories under the same primary category and the probability of the patent text belonging to each secondary category through a correlation topic model algorithm;
and selecting the secondary category with the highest probability as a classification label of the patent text.
9. A text feature extraction apparatus comprising:
a first obtaining module configured to obtain a patent text, the patent text including a plurality of words;
a first determination module configured to determine a first feature vector for each of the plurality of terms;
a second determination module configured to determine, for each word, a similarity between the word and each word of the plurality of words;
a first processing module configured to perform weighted processing on the first feature vector based on the similarity to obtain a second feature vector of each word;
and the second processing module is configured to process the second feature vector through a convolution layer to obtain patent text features.
10. A model training apparatus, wherein the model comprises an encoder and a classifier, the apparatus comprising:
the second acquisition module is configured to acquire a plurality of patent texts and classification labels of the patent texts;
a feature extraction module configured to process the patent texts through an encoder performing the method of any one of claims 1-6 to obtain patent text features;
the third processing module is configured to process the patent text features through the classifier to obtain a classification prediction result;
a training module configured to train the encoder and classifier based on the classification labels and classification predictions.
11. An electronic device, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
12. A computer-readable storage medium having computer-readable instructions stored thereon, which when executed by a processor, cause the processor to perform the method of any one of claims 1-8.
CN202210921235.9A 2022-08-02 2022-08-02 Text feature extraction method, model training method, device, equipment and medium Pending CN115168590A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210921235.9A CN115168590A (en) 2022-08-02 2022-08-02 Text feature extraction method, model training method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210921235.9A CN115168590A (en) 2022-08-02 2022-08-02 Text feature extraction method, model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115168590A (en) 2022-10-11

Family

ID=83477880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210921235.9A Pending CN115168590A (en) 2022-08-02 2022-08-02 Text feature extraction method, model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115168590A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842932A (en) * 2023-08-30 2023-10-03 腾讯科技(深圳)有限公司 Text feature decoding method and device, storage medium and electronic equipment
CN116842932B (en) * 2023-08-30 2023-11-14 腾讯科技(深圳)有限公司 Text feature decoding method and device, storage medium and electronic equipment
CN117271653A (en) * 2023-10-12 2023-12-22 四川大学 Multi-dimensional patent map construction method and system
CN117271653B (en) * 2023-10-12 2024-06-04 四川大学 Multi-dimensional patent map construction method and system

Similar Documents

Publication Publication Date Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN107679039B (en) Method and device for determining statement intention
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN111709240A (en) Entity relationship extraction method, device, equipment and storage medium thereof
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN111126396A (en) Image recognition method and device, computer equipment and storage medium
CN112188312B (en) Method and device for determining video material of news
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN111177367A (en) Case classification method, classification model training method and related products
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN113704623A (en) Data recommendation method, device, equipment and storage medium
CN116958622A (en) Data classification method, device, equipment, medium and program product
CN114372532A (en) Method, device, equipment, medium and product for determining label marking quality
CN111783425B (en) Intention identification method based on syntactic analysis model and related device
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
CN113139751A (en) Method for determining micro-service user service type based on big data
CN112925983A (en) Recommendation method and system for power grid information
CN116450943A (en) Artificial intelligence-based speaking recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination