CN116975068A - Metadata-based patent document data storage method, device and storage medium

Info

Publication number: CN116975068A
Application number: CN202311234829.3A
Authority: CN
Inventors: 孙广芝; 王淑敏; 隋媛; 李岭岭
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2023-09-25
Filing date: 2023-09-25
Publication date: 2023-10-31

Abstract

The application provides a metadata-based patent document data storage method, device and storage medium, relates to the technical field of metadata, and addresses the problem that conventional methods cannot manage patent document data in a standardized way. The method comprises the following steps: extracting a plurality of pieces of data from a target patent document according to a patent document metadata template; determining the category of each piece of extracted data based on the document structure; traversing each piece of extracted data, performing deep-learning-based semantic similarity calculation on data of the same category, determining the relationship between data of the same category, and merging data that are the same or similar; and importing the merged data into a storage table generated according to the patent document metadata template. The method standardizes patent document data through metadata, facilitates unified management and application, and lets a user make full use of all data in the table, thereby supporting data analysis.

Description

Metadata-based patent document data storage method, device and storage medium

Technical Field

The application relates to the technical field of metadata, and in particular to a metadata-based patent document data storage technology.

Background

Competition among modern enterprises is increasingly fierce and takes many forms, and competition over intellectual property is one of its important aspects. At present, most enterprises manage patent information in technical fields related or similar to their own, usually by manually recording the patent information in software such as spreadsheets. However, because the amount of general patent information to be managed is large, differences in manual recording and in the retrieval websites used can lead to inconsistent data that are easily tampered with, lost or recorded incorrectly; this management mode depends excessively on manual work and involves various uncertain factors. Some management software products exist on the market, but their functions are complex, which does not help enterprises manage patent information. A simple, easy-to-use intelligent patent information management scheme is therefore needed to overcome these defects, reduce the cost of enterprise intellectual property management and improve its efficiency.

Disclosure of Invention

In order to solve the above technical defects, embodiments of the application provide a metadata-based patent document data storage method, a patent document data storage device, electronic equipment and a storage medium.

An embodiment of a first aspect of the present application provides a method for storing patent document data based on metadata, including the steps of: extracting a plurality of pieces of data in a target patent document according to a patent document metadata template; determining the category of each piece of extracted data according to the document structure of the target patent document where each piece of extracted data is located; traversing each piece of extracted data, and performing cosine similarity calculation based on deep learning on a first feature vector and a second feature vector corresponding to the first data and the second data; comparing the cosine similarity calculation result with a preset threshold value, determining the relation between the first data and the second data, and combining the first data and the second data with the same or similar relation; the first data and the second data are the same class of data, and the relationship between the first data and the second data comprises the same, similar or irrelevant; and importing the combined data into a storage table generated according to the patent document metadata template.

In one possible implementation, the category includes one or more of the following: name, designer, applicant, application number, application date, class number, technical problem information, design intent information, design demonstration information, design scheme information, and advantage and disadvantage information.

In one possible implementation manner, determining the category of each piece of extracted data further includes: performing semantic analysis on data whose category cannot be determined from the document structure, and determining the category of the corresponding data according to the semantic analysis result.

In one possible implementation manner, performing cosine similarity calculation based on deep learning on the first feature vector and the second feature vector corresponding to the first data and the second data includes: performing cosine similarity calculation on the first feature vector V_t1 and the second feature vector V_t2 by the following expression:

$$\mathrm{Sim}(T_1, T_2) = \cos(\theta) = \frac{V_{t1} \cdot V_{t2}}{\lVert V_{t1} \rVert \, \lVert V_{t2} \rVert} = \frac{\sum_{i=1}^{n} V_{t1i} V_{t2i}}{\sqrt{\sum_{i=1}^{n} V_{t1i}^{2}} \, \sqrt{\sum_{i=1}^{n} V_{t2i}^{2}}}$$

wherein θ is the included angle between the first feature vector V_t1 and the second feature vector V_t2, V_t1i and V_t2i are respectively the i-th elements of the first feature vector V_t1 and the second feature vector V_t2, T1 and T2 are the texts corresponding to the first data and the second data respectively, n is the number of feature vector elements, and ||V_t1|| and ||V_t2|| are respectively the norms of the first feature vector V_t1 and the second feature vector V_t2.

In one possible implementation manner, the first feature vector V_t1 and the second feature vector V_t2 are obtained by the following steps: extracting texts from the first data and the second data to obtain a first text T1 and a second text T2, preprocessing the first text T1 and the second text T2 respectively, and pooling the preprocessed data respectively to obtain a first feature vector V_t1 and a second feature vector V_t2 each comprising n elements.

In one possible implementation manner, the method further includes: for the data in the storage table, supplementing the missing data according to a preset rule and in combination with partial data already in the storage table, or receiving a user's supplement to or modification of the data in the storage table.

In one possible implementation manner, the method further includes: marking the supplemented or modified data.

An embodiment of the second aspect of the present application further provides a metadata-based patent document data storage device, including: a data extraction module for extracting a plurality of pieces of data in the target patent document according to the patent document metadata template; the data classification module is used for determining the category of each piece of extracted data according to the document structure of the target patent document where each piece of extracted data is located; the data fusion module is used for traversing each piece of extracted data, and performing cosine similarity calculation based on deep learning on a first feature vector and a second feature vector corresponding to the first data and the second data; comparing the cosine similarity calculation result with a preset threshold value, determining the relation between the first data and the second data, and combining the first data and the second data with the same or similar relation; the first data and the second data are the same class of data, and the relationship between the first data and the second data comprises the same, similar or irrelevant; and the data storage module is used for importing the combined data into a storage table generated according to the patent document metadata template.

An embodiment of the third aspect of the present application further provides an electronic device, including: a memory; a processor; a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method described above.

The fourth aspect embodiment of the present application also provides a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the method described above.

According to the metadata-based patent document data storage method and device provided by the embodiment of the application, the metadata is used for standardizing the patent document data, so that unified management application is facilitated, a user can fully utilize all data in the table, and powerful support is provided for data analysis.

Drawings

FIG. 1 is a schematic diagram of an electronic device 100 according to one embodiment of the application;

FIG. 2 is a flow chart of a metadata-based patent document data storage method 200 according to one embodiment of the present application;

FIG. 3 is a schematic structural view of a metadata-based patent document data storage device 300 according to one embodiment of the present application.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is provided in conjunction with the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application and not exhaustive of all embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.

Fig. 1 shows a schematic diagram of an electronic device 100 according to an embodiment of the application. Note that, the electronic device 100 shown in fig. 1 is only an example, and in practice, the electronic device used to implement the metadata-based patent document data storage method of the present application may be any type of device, and the hardware configuration may be the same as the electronic device 100 shown in fig. 1 or may be different from the electronic device 100 shown in fig. 1. In practice, the electronic device used to implement the metadata-based patent document data storage method of the present application may add or delete hardware components of the electronic device 100 shown in fig. 1, and the present application is not limited to the specific hardware configuration of the electronic device.

As shown in fig. 1, electronic device 100 typically includes one or more processors 110, and memory 120. Memory bus 130 may be used for communication between processor 110 and memory 120.

The memory 120 has stored therein operating system program instructions 121 and application program instructions 122, the application running on top of the operating system. When the electronic device 100 starts up, the processor 110 reads the operating system program instructions 121 from the memory 120 and executes them. When a user launches an application, the processor 110 reads and executes the application instructions 122 from the memory 120. The memory 120 also stores application data 123, where the application data 123 is data that may be used during the running process of the application, such as a table.

In the electronic device 100 according to the present application, the application program instructions 122 include computer program instructions for performing the metadata-based patent document data storage method 200 of the present application, which may instruct the processor 110 to perform the method 200.

Fig. 2 shows a flowchart of a metadata-based patent document data storage method 200 according to one embodiment of the present application, the method 200 being performed in an electronic device (e.g., the aforementioned electronic device 100). As shown in fig. 2, the method 200 begins at step S210.

S210, extracting a plurality of pieces of data in the target patent document according to the patent document metadata template.

S220, determining the category of each piece of extracted data according to the document structure of the target patent document where each piece of extracted data is located.

S230, traversing each piece of extracted data, performing deep-learning-based semantic similarity calculation on data of the same category, determining the relationship between data of the same category, and merging data that are the same or similar. Specifically, cosine similarity calculation based on deep learning is performed on a first feature vector and a second feature vector corresponding to first data and second data; the cosine similarity calculation result is compared with a preset threshold to determine the relationship between the first data and the second data, and first data and second data whose relationship is same or similar are merged. The first data and the second data are data of the same category, and the relationship between them is one of same, similar or irrelevant.

S240, importing the combined data into a storage table generated according to the patent document metadata template.
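As a rough illustration of step S210 for bibliographic fields only, the sketch below pulls "Label: value" lines out of a text and maps them to template categories. The application does not specify an extraction technique; the label names, the `LABEL_TO_FIELD` mapping and the function `extract_fields` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical mapping from labels found in the text to template categories.
LABEL_TO_FIELD = {
    "Application number": "application_number",
    "Filing date": "application_date",
    "Original assignee": "applicant",
}

# One "Label: value" pair per line.
LINE_PATTERN = re.compile(r"^(?P<label>[^:\n]+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text):
    """Return (category, value) pairs for the labels named in the template mapping."""
    extracted = []
    for match in LINE_PATTERN.finditer(text):
        field = LABEL_TO_FIELD.get(match.group("label").strip())
        if field is not None:  # labels outside the template are ignored
            extracted.append((field, match.group("value").strip()))
    return extracted


sample = """Application number: CN202311234829.3A
Authority: CN
Filing date: 2023-09-25"""
pieces = extract_fields(sample)
```

Free-text categories such as technical problem information would need a different extraction approach; this sketch covers only labeled fields.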

The metadata template in the embodiment of the application is a template in a form of a table, and the template contains data of a plurality of categories in patent documents. The categories include, but are not limited to: name, designer, applicant, application number, application date, class number, technical problem information, design intent information, design demonstration information, design scheme information, and advantage and disadvantage information. In practice, the metadata template may further include information such as the number of pictures, abstract drawings, and the like. After a user requests to store patent document data, the metadata-based patent document data storage method 200 in the embodiment of the present application generates a storage table according to the patent document metadata template.
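One way to picture generating a storage table from the metadata template and importing data into it (step S240) is the following minimal sketch using an in-memory SQLite database. The column names, the table name `patent_metadata` and both helper functions are assumptions made for illustration; the application does not name a storage technology.

```python
import sqlite3

# Hypothetical column names standing in for the categories listed in the template.
TEMPLATE_FIELDS = [
    "name", "designer", "applicant", "application_number", "application_date",
    "class_number", "technical_problem", "design_intent",
    "design_demonstration", "design_scheme", "advantages_disadvantages",
]


def create_storage_table(conn, fields=TEMPLATE_FIELDS, table="patent_metadata"):
    """Create a table with one TEXT column per template category."""
    columns = ", ".join(f'"{f}" TEXT' for f in fields)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    return table


def import_record(conn, record, fields=TEMPLATE_FIELDS, table="patent_metadata"):
    """Insert merged data; categories missing from `record` are stored as NULL."""
    placeholders = ", ".join("?" for _ in fields)
    values = [record.get(f) for f in fields]
    conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', values)


conn = sqlite3.connect(":memory:")
create_storage_table(conn)
import_record(conn, {"name": "Example title", "applicant": "Example applicant"})
row = conn.execute("SELECT name, designer FROM patent_metadata").fetchone()
```

Storing missing categories as NULL keeps vacant cells visible, so they can be supplemented later.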

A plurality of pieces of relevant data are extracted from the patent document according to the patent document metadata template. Each structural part of a patent document usually describes only one core theme. For example, the background section describes only the problems existing in the prior art and does not involve other themes; it may contain both positive and negative descriptions, all of which can be regarded as detailed descriptions of problems with the prior art. As another example, the final part of the summary generally demonstrates the technical effects of the patent document, including design demonstration information and advantage and disadvantage information. In step S220 of the embodiment of the present application, the category of the extracted data is determined based on these document structure characteristics of the patent document.

In the embodiment of the present application, step S220 further includes, after determining the category of the extracted data according to the document structure of the target patent document where the extracted data is located: performing semantic analysis on data whose category cannot be determined from the document structure, and determining the category of the corresponding data according to the semantic analysis result.

For patent documents, in the detailed description section, a plurality of subjects or categories may be described, and when the category of the corresponding data cannot be determined according to the document structure, the category of the corresponding data may be determined by using a semantic analysis method.
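The two-stage classification described above (document structure first, semantic analysis as a fallback) might be sketched as follows. The section names, the section-to-category mapping and the keyword cues are all hypothetical, and the keyword count is only a crude stand-in for the semantic analysis, whose method the application does not specify.

```python
# Hypothetical mapping from document structure to category.
SECTION_TO_CATEGORY = {
    "title": "name",
    "background": "technical_problem",
}

# Hypothetical cue words; a crude stand-in for real semantic analysis.
KEYWORD_CUES = {
    "technical_problem": ("problem", "defect", "cannot"),
    "advantages_disadvantages": ("advantage", "benefit", "drawback"),
    "design_scheme": ("comprising", "step", "module"),
}


def classify(section, text):
    """Return a category from the document structure, else from the text itself."""
    category = SECTION_TO_CATEGORY.get(section)
    if category is not None:
        return category
    # Fallback: pick the category whose cue words occur most often in the text.
    lowered = text.lower()
    scores = {
        cat: sum(cue in lowered for cue in cues)
        for cat, cues in KEYWORD_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `classify("background", ...)` is decided by structure alone, while text from the detailed description falls through to the fallback and may remain unclassified (`None`).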

Entities and relationships are often expressed repeatedly within the same patent document, usually in different wording: different sentence patterns, grammar or synonyms are used to express the same meaning. These multiple natural-language expressions do not contradict one another in semantic content and do not confuse designers conceptually. In the embodiment of the application, however, data of the same category are extracted from different document structures and may repeat the same meaning in different sentence patterns, grammar or synonyms, so the extracted data need to be fused. In the embodiment of the application, performing deep-learning-based semantic similarity calculation on data of the same category in step S230 comprises: calculating the cosine similarity of a first feature vector V_t1 and a second feature vector V_t2 corresponding to first data and second data of the same category; and comparing the cosine similarity calculation result with a preset threshold to determine the relationship between the first data and the second data.

In the embodiment of the application, for first data A1 and second data A2 of the same category, text is extracted from the first data A1 and the second data A2 to obtain a first text T1 and a second text T2; the first text T1 and the second text T2 are each processed by a deep-learning-based model, and a pooling operation is performed on each processed result to obtain a first feature vector V_t1 and a second feature vector V_t2 of a preset length; cosine similarity is then calculated for the first feature vector V_t1 and the second feature vector V_t2, and the result is compared with a preset threshold to determine the relationship between the data of the same category.
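The pooling step can be illustrated with mean pooling, which averages per-token vectors into a single vector of fixed length n. This is a sketch under assumptions: the application does not state which pooling operation is used, and the token vectors below are toy values rather than the output of a real deep learning model.

```python
def mean_pool(token_vectors):
    """Average token vectors element-wise into one vector of the same length n."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(n)]


# Toy stand-in for the per-token output of a model applied to text T1.
token_vectors_t1 = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
v_t1 = mean_pool(token_vectors_t1)  # [2.0, 1.0, 1.0]
```

Whatever the text length, the pooled vector has n elements, which is what allows two texts of different lengths to be compared element by element.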

In the embodiment of the application, the relationship between data is one of same, similar or irrelevant. The deep-learning-based model is trained as follows: a preset number of patent documents are acquired; data of the same category within the same patent document are compared and analyzed pairwise; a sentence transformer neural network algorithm based on a deep learning model converts the same-category data pairs in the patent documents into feature vector pairs; and the similarity coefficients of the deep-learning-based model and the corresponding threshold are determined using cosine similarity calculation of text semantics.
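A minimal sketch of the threshold comparison and merging is given below. The two threshold values, the greedy grouping strategy and the function names are assumptions for illustration; the application only states that a preset threshold is used and that same or similar data are merged. The similarity function is passed in as a parameter, and the toy scores stand in for real cosine similarities.

```python
SAME_THRESHOLD = 0.95     # hypothetical value
SIMILAR_THRESHOLD = 0.80  # hypothetical value


def relation(similarity):
    """Map a similarity score to one of the three relationships."""
    if similarity >= SAME_THRESHOLD:
        return "same"
    if similarity >= SIMILAR_THRESHOLD:
        return "similar"
    return "irrelevant"


def merge_items(items, similarity_fn):
    """Greedy merge: an item joins the first group whose representative it matches."""
    groups = []
    for item in items:
        for group in groups:
            if relation(similarity_fn(group[0], item)) != "irrelevant":
                group.append(item)
                break
        else:
            groups.append([item])  # no match found: start a new group
    return groups


# Toy pairwise scores standing in for cosine similarities of feature vectors.
toy_scores = {
    frozenset({"a", "b"}): 0.97,
    frozenset({"a", "c"}): 0.10,
    frozenset({"b", "c"}): 0.12,
}


def toy_similarity(x, y):
    return toy_scores[frozenset({x, y})]


groups = merge_items(["a", "b", "c"], toy_similarity)  # [["a", "b"], ["c"]]
```

Each resulting group would then be reduced to a single entry before import; how the merged entry is chosen (for example, keeping the longest text) is left open here.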

In the embodiment of the application, calculating the cosine similarity of the first feature vector V_t1 and the second feature vector V_t2 corresponding to the first data and the second data of the same category includes:

performing cosine similarity calculation on the first feature vector V_t1 and the second feature vector V_t2 by the following expression:

$$\mathrm{Sim}(T_1, T_2) = \cos(\theta) = \frac{V_{t1} \cdot V_{t2}}{\lVert V_{t1} \rVert \, \lVert V_{t2} \rVert} = \frac{\sum_{i=1}^{n} V_{t1i} V_{t2i}}{\sqrt{\sum_{i=1}^{n} V_{t1i}^{2}} \, \sqrt{\sum_{i=1}^{n} V_{t2i}^{2}}}$$

wherein θ is the included angle between the first feature vector V_t1 and the second feature vector V_t2, V_t1i and V_t2i are respectively the i-th elements of the first feature vector V_t1 and the second feature vector V_t2, T1 and T2 are the texts corresponding to the first data and the second data respectively, n is the number of feature vector elements, and ||V_t1|| and ||V_t2|| are respectively the norms of the first feature vector V_t1 and the second feature vector V_t2.
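The cosine similarity of two feature vectors with n elements, that is, their dot product divided by the product of their norms, can be computed as in the sketch below. The function name and the error handling are illustrative choices, not part of the application.

```python
import math


def cosine_similarity(v_t1, v_t2):
    """Dot product of the two vectors divided by the product of their norms."""
    if len(v_t1) != len(v_t2):
        raise ValueError("feature vectors must have the same number of elements")
    dot = sum(a * b for a, b in zip(v_t1, v_t2))
    norm_t1 = math.sqrt(sum(a * a for a in v_t1))
    norm_t2 = math.sqrt(sum(b * b for b in v_t2))
    if norm_t1 == 0 or norm_t2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_t1 * norm_t2)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0
close = cosine_similarity([3.0, 4.0], [4.0, 3.0])       # 24 / 25 = 0.96
```

The result lies between -1 and 1 and depends only on the angle θ between the vectors, not on their lengths, so scaling either vector leaves the similarity unchanged.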

The metadata-based patent document data storage method 200 provided by the embodiment of the application may further include: for the data in the storage table, supplementing the missing data according to a preset rule and in combination with partial data already in the storage table, or receiving a user's supplement to or modification of the data in the storage table.

In an embodiment of the present application, the metadata-based patent document data storage method 200 further includes: marking the supplemented or modified data.

For various reasons, data of some categories may not be extracted, for example design demonstration information, or the abstract drawing may be empty in the table. To facilitate subsequent data analysis, such vacant data may be supplemented in embodiments of the present application. Specifically, the following rules may be used.

1. Data such as design intent information, design demonstration information and advantage and disadvantage information are supplemented in combination with other patent documents of the designer and/or the applicant. For example, the first step of the scheme of the current patent document is to acquire information A and the later steps process or analyze information A, while the scheme for acquiring information A is set forth in detail in an earlier patent document filed by the same designer and applicant; the vacant data are supplemented from that earlier document.

2. The technical problem information is supplemented using citation relationships in the patent document. For example, the background section cites another prior-art patent document and gives corresponding information such as its application number and title; the technical problem information is supplemented according to the content cited in the current patent document.

3. Information such as the abstract drawing is supplemented with default values. For example, when a patent document has no drawings, or has drawings but no abstract drawing, the vacant data are supplemented with default values.

Supplemented data are less reliable than data actually extracted, so the supplemented data in the table may be marked to distinguish them from other information.
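Rule 3 together with the marking step might look like the following sketch, which fills vacant cells with default values and records which fields were filled so they can be told apart from extracted data. The field names, the default value and the function name are hypothetical.

```python
def supplement_with_defaults(row, defaults):
    """Fill vacant fields from `defaults`; return the new row and the marked fields."""
    filled = dict(row)  # leave the caller's row untouched
    marked = set()
    for field, default in defaults.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
            marked.add(field)  # remember that this value was not extracted
    return filled, marked


row = {"name": "Example title", "abstract_drawing": None}
filled, marked = supplement_with_defaults(row, {"abstract_drawing": "none"})
```

Keeping the marks separate from the values is one option; a storage table could equally carry a flag column per field.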

Embodiments of the present application also provide a metadata-based patent document data storage apparatus 300 capable of performing the respective step processes of the metadata-based patent document data storage method 200 as described above. The above-described apparatus 300 is described below in connection with fig. 3.

As shown in fig. 3, the apparatus 300 includes a data extraction module 310, a data classification module 320, a data fusion module 330, and a data storage module 340.

A data extraction module 310 for extracting pieces of data in the target patent document according to the patent document metadata template; a data classification module 320, configured to determine a category of each piece of extracted data according to a document structure of the target patent document where each piece of extracted data is located; the data fusion module 330 is configured to traverse each piece of extracted data, and perform cosine similarity calculation based on deep learning on a first feature vector and a second feature vector corresponding to the first data and the second data; comparing the cosine similarity calculation result with a preset threshold value, determining the relation between the first data and the second data, and combining the first data and the second data with the same or similar relation; the first data and the second data are the same class of data, and the relationship between the first data and the second data comprises the same, similar or irrelevant; the data storage module 340 is configured to import the merged data into a storage table generated according to the patent document metadata template.

As a preferred embodiment of the present application, the categories in the data classification module 320 include one or more of the following: name, designer, applicant, application number, application date, class number, technical problem information, design intent information, design demonstration information, design scheme information, and advantage and disadvantage information.

As a preferred embodiment of the present application, the data classification module 320 is further configured to: perform semantic analysis on data whose category cannot be determined from the document structure, and determine the category of the corresponding data according to the semantic analysis result.

As a preferred embodiment of the present application, the data fusion module 330 performing semantic similarity calculation based on deep learning on the data of the same category includes: calculating the cosine similarity of a first feature vector V_t1 and a second feature vector V_t2 corresponding to the first data and the second data of the same category; and comparing the cosine similarity calculation result with a preset threshold to determine the relationship between the first data and the second data.

As a preferred embodiment of the present application, the data fusion module 330 calculating the cosine similarity of the first feature vector V_t1 and the second feature vector V_t2 corresponding to the first data and the second data of the same category includes: performing cosine similarity calculation on the first feature vector V_t1 and the second feature vector V_t2 by the following expression:

$$\mathrm{Sim}(T_1, T_2) = \cos(\theta) = \frac{V_{t1} \cdot V_{t2}}{\lVert V_{t1} \rVert \, \lVert V_{t2} \rVert} = \frac{\sum_{i=1}^{n} V_{t1i} V_{t2i}}{\sqrt{\sum_{i=1}^{n} V_{t1i}^{2}} \, \sqrt{\sum_{i=1}^{n} V_{t2i}^{2}}}$$

As a preferred embodiment of the present application, the metadata-based patent document data storage apparatus 300 further includes: a data editing module for supplementing the missing data in the storage table according to a preset rule and in combination with partial data already in the storage table, or for receiving a user's supplement to or modification of the data in the storage table.

As a preferred embodiment of the present application, the metadata-based patent document data storage device 300 further includes a data marking module for marking supplementary or modified data.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods, apparatus and devices of the present application.

While the application has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments are contemplated within the scope of the application as described herein. In addition, various modifications and alterations of this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A patent document data storage method based on metadata, characterized by comprising the steps of:

extracting a plurality of pieces of data in a target patent document according to a patent document metadata template;

determining the category of each piece of extracted data according to the document structure of the target patent document where each piece of extracted data is located;

traversing each piece of extracted data, and performing cosine similarity calculation based on deep learning on a first feature vector and a second feature vector corresponding to the first data and the second data; comparing the cosine similarity calculation result with a preset threshold value, determining the relation between the first data and the second data, and combining the first data and the second data with the same or similar relation; the first data and the second data are the same class of data, and the relationship between the first data and the second data comprises the same, similar or irrelevant;

and importing the combined data into a storage table generated according to the patent document metadata template.

2. The method of claim 1, wherein the categories include one or more of: name, designer, applicant, application number, application date, class number, technical problem information, design intent information, design demonstration information, design scheme information, and advantage and disadvantage information.

3. The method of claim 2, wherein determining the category of each piece of extracted data based on the document structure of the target patent document in which each piece of extracted data is located further comprises:

performing semantic analysis on data whose category cannot be determined from the document structure, and determining the category of the corresponding data according to the semantic analysis result.

4. The method of claim 1, wherein performing deep learning based cosine similarity calculation on the first feature vector and the second feature vector corresponding to the first data and the second data comprises:

performing cosine similarity calculation on the first feature vector V_t1 and the second feature vector V_t2 by the following expression:

$$\mathrm{Sim}(T_1, T_2) = \cos(\theta) = \frac{V_{t1} \cdot V_{t2}}{\lVert V_{t1} \rVert \, \lVert V_{t2} \rVert} = \frac{\sum_{i=1}^{n} V_{t1i} V_{t2i}}{\sqrt{\sum_{i=1}^{n} V_{t1i}^{2}} \, \sqrt{\sum_{i=1}^{n} V_{t2i}^{2}}}$$

wherein θ is the included angle between the first feature vector V_t1 and the second feature vector V_t2, V_t1i and V_t2i are respectively the i-th elements of the first feature vector V_t1 and the second feature vector V_t2, T1 and T2 are the texts corresponding to the first data and the second data respectively, n is the number of feature vector elements, and ||V_t1|| and ||V_t2|| are respectively the norms of the first feature vector V_t1 and the second feature vector V_t2.

5. The method of claim 4, wherein the first feature vector V_t1 and the second feature vector V_t2 are obtained by the following steps:

extracting texts from the first data and the second data to obtain a first text T1 and a second text T2, preprocessing the first text T1 and the second text T2 respectively, and pooling the preprocessed data respectively to obtain a first feature vector V_t1 and a second feature vector V_t2 each comprising n elements.

6. The method of any one of claims 1 to 5, wherein the method further comprises:

for the data in the storage table, supplementing the missing data according to a preset rule and in combination with partial data already in the storage table, or,

receiving a user's supplement to or modification of the data in the storage table.

7. The method of claim 6, wherein the method further comprises:

marking the supplemented or modified data.

8. A metadata-based patent document data storage device, comprising:

a data extraction module for extracting a plurality of pieces of data in the target patent document according to the patent document metadata template;

the data classification module is used for determining the category of each piece of extracted data according to the document structure of the target patent document where each piece of extracted data is located;

the data fusion module is used for traversing each piece of extracted data, and performing cosine similarity calculation based on deep learning on a first feature vector and a second feature vector corresponding to the first data and the second data; comparing the cosine similarity calculation result with a preset threshold value, determining the relation between the first data and the second data, and combining the first data and the second data with the same or similar relation; the first data and the second data are the same class of data, and the relationship between the first data and the second data comprises the same, similar or irrelevant;

and the data storage module is used for importing the combined data into a storage table generated according to the patent document metadata template.

9. An electronic device, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized by a computer program stored thereon; the computer program being executed by a processor to implement the method of any one of claims 1 to 7.