CN113886574A - Patent topographic map drawing method and device based on structural text clustering - Google Patents

Patent topographic map drawing method and device based on structural text clustering

Info

Publication number
CN113886574A
Authority
CN
China
Prior art keywords
weight
clustering
text
determining
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111025719.7A
Other languages
Chinese (zh)
Inventor
朱欣昱
程序
刘琦
孔文娟
李艳
陈亚鑫
张素兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongzhi Zhihui Technology Co ltd
Original Assignee
Beijing Zhongzhi Zhihui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongzhi Zhihui Technology Co ltd
Priority to CN202111025719.7A
Publication of CN113886574A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for drawing a patent topographic map based on structural text clustering. The method comprises the following steps: acquiring all target patent texts; extracting key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field; determining the intra-document weight of each key feature word within its patent text; determining the inter-document weight of each key feature word across all patent texts; determining, according to the intra-document weight and the inter-document weight, the key feature words to be added to the clustering set; clustering the target patent texts according to the key feature words added to the clustering set to obtain a clustering result; and drawing the patent topographic map according to the clustering result. The invention allows the patent topographic map to be drawn accurately on the basis of structural text clustering, so that information such as the degree of technical association and the technology-dense points of the patented technology is reflected accurately.

Description

Patent topographic map drawing method and device based on structural text clustering
Technical Field
The invention relates to the technical field of big data, in particular to a method and a device for drawing a patent topographic map based on structural text clustering.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
A patent topographic map differs from a patent map in the general sense of a statistical chart: patents and technologies are clustered and placed, as coordinate points, on a topographic map that has three-dimensional coordinates and elements such as contour lines. The result intuitively reflects information such as the degree of technical association and the technology-dense points of the patented technology. Existing methods for drawing patent topographic maps suffer from low drawing precision, so this information cannot be reflected accurately.
Disclosure of Invention
An embodiment of the invention provides a method for drawing a patent topographic map based on structural text clustering, so that the patent topographic map can be drawn accurately. The method comprises the following steps:
acquiring all target patent texts;
extracting key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field;
determining the intra-document weight of each key feature word within its patent text, and determining the inter-document weight of each key feature word across all patent texts;
determining, according to the intra-document weight and the inter-document weight, the key feature words to be added to the clustering set, and clustering the target patent texts according to the key feature words added to the clustering set to obtain a clustering result;
and drawing the patent topographic map according to the clustering result.
An embodiment of the invention further provides a device for drawing a patent topographic map based on structural text clustering, so that the patent topographic map can be drawn accurately. The device comprises the following units:
an acquisition unit, used for acquiring all target patent texts;
an extraction unit, used for extracting key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field;
a weight determining unit, used for determining the intra-document weight of each key feature word within its patent text, and for determining the inter-document weight of each key feature word across all patent texts;
a processing unit, used for determining, according to the intra-document weight and the inter-document weight, the key feature words to be added to the clustering set, and for clustering the target patent texts according to the key feature words added to the clustering set to obtain a clustering result;
and a drawing unit, used for drawing the patent topographic map according to the clustering result.
An embodiment of the invention further provides computer equipment comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor; when the processor executes the computer program, the above method for drawing a patent topographic map based on structural text clustering is implemented.
An embodiment of the invention further provides a computer-readable storage medium storing a computer program for executing the above method for drawing a patent topographic map based on structural text clustering.
In the embodiment of the invention, the scheme for drawing a patent topographic map based on structural text clustering acquires all target patent texts; extracts key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field; determines the intra-document weight of each key feature word within its patent text; determines the inter-document weight of each key feature word across all patent texts; determines, according to the intra-document weight and the inter-document weight, the key feature words to be added to the clustering set; clusters the target patent texts according to the key feature words added to the clustering set to obtain a clustering result; and draws the patent topographic map according to the clustering result. The patent topographic map can thereby be drawn accurately on the basis of structural text clustering, so that information such as the degree of technical association and the technology-dense points of the patented technology is reflected accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a schematic flow chart of a method for drawing a topographic patent map based on structured text clustering according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a key feature word extraction process in an embodiment of the present invention;
FIG. 3 is a schematic diagram of polar transformation in an embodiment of the present invention;
FIG. 4 is a diagram illustrating keyword extraction settings according to an embodiment of the present invention;
FIG. 5 is an exemplary diagram of a patent vector in an embodiment of the present invention;
FIG. 6 is a diagram illustrating a patent clustering result according to an embodiment of the present invention;
FIG. 7 is a topographic map of a patent including a patent drawing point according to an embodiment of the present invention;
FIG. 8 is a topographical view of only a center plot in accordance with an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a device for drawing a topographic patent map based on structural text clustering in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
The embodiment of the invention provides a scheme for drawing a patent topographic map based on structural text clustering. The scheme studies patent texts that have structure, clusters the patent texts on that basis, and further studies the drawing algorithm of the cluster topographic map, so that the map accurately expresses the corresponding physical meaning and can serve as a basis for further patent analysis. The scheme is described in detail below.
FIG. 1 is a schematic flow chart of a method for drawing a patent topographic map based on structural text clustering according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step 101: acquiring all target patent texts;
Step 102: extracting key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field;
Step 103: determining the intra-document weight of each key feature word within its patent text, and determining the inter-document weight of each key feature word across all patent texts;
Step 104: determining, according to the intra-document weight and the inter-document weight, the key feature words to be added to the clustering set, and clustering the target patent texts according to the key feature words added to the clustering set to obtain a clustering result;
Step 105: drawing the patent topographic map according to the clustering result.
The patent topographic map drawing method for the structural text clustering provided by the embodiment of the invention can realize the accurate drawing of the patent topographic map based on the structural text clustering, thereby accurately reflecting the information such as the technical association degree, the technical dense points and the like of the patent technology. This is described in detail below with reference to fig. 2 to 8.
First, step 101 is described.
In a specific implementation, a target patent text is a structured text that is to be analyzed and then clustered; all target patent texts together form a document set.
Second, step 102, i.e. extracting the feature information of the structured patent documents, is described.
In a specific implementation, as an experiment, the embodiment of the invention extracts keywords from 256 Chinese patents in the field of industrial robots. Patent text differs from general informational text such as news: because the patent application process is comparatively standardized, the writing form and the document structure of a patent text are relatively fixed. A patent contains a large amount of fixed bibliographic information; the textual information that can take part in text analysis is shown in Table 1 below.
Table 1: Patent text information types (field types) and the information content of each
[Table 1 is available only as an image in the original publication: Figure BDA0003243286810000041]
These text fields follow the primary processing and indexing of Chinese patent data. The information content of each field is relatively fixed and differs from field to field.
In an embodiment, as shown in FIG. 2, extracting key feature words from each target patent text according to the different types of fields and the preset weight corresponding to each type of field may include extracting the key feature words of one target patent text as follows:
extracting candidate feature words from the target patent text according to the different types of fields and the preset weight corresponding to each type of field (the step of collecting candidate subject words in FIG. 2);
calculating co-occurrence factors among the candidate feature words extracted from the target patent text (the step of calculating co-occurrence factors in FIG. 2);
determining the weight of each candidate feature word in the full text of the target patent text according to the co-occurrence factors (the step of calculating weights in FIG. 2);
extracting the key feature words of the target patent text according to the weights of the candidate feature words in the full text (the step of normalizing the weights and taking the top 20 in FIG. 2); that is, after weight normalization, the 20 candidate feature words with the highest weights are taken as the key feature words.
In a specific implementation, in the step of collecting candidate subject words in FIG. 2, the field types may be those listed in the information field column of Table 1 above. Specifically, the embodiment of the invention extracts key feature words from four fields (four field types): title, abstract, main claim, and full text. The weight setting (preset weight) of each field is shown in FIG. 4.
In a specific implementation, in the step of calculating co-occurrence factors in FIG. 2, the co-occurrence factor between candidate feature words extracted from one target patent text may be calculated according to formula (4) below. In the formulas, $w_{ip}$ is the weight of the candidate feature word within each paragraph, $w_p$ is the weight of a paragraph, $w_{ipf}$ is the word-frequency weight of the candidate feature word, and $w_{ipd}$ is the co-occurrence factor.
In a specific implementation, in the step of calculating weights in FIG. 2, the weight of a candidate feature word in the full text of the target patent text may be calculated according to formula (1) below; $w_{ip}$ in formula (1) is calculated as shown in formula (2).
In a specific implementation, in the final word-selection step (the step of normalizing the weights and taking the top 20 in FIG. 2), words are selected after the weights have been normalized, which can improve the precision and efficiency of word selection.
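As an illustration of this final selection step, the following is a minimal Python sketch. The description does not state how the weights are normalized, so dividing each weight by the sum of all weights is an assumption made here, and the function name `top_keywords` and the sample words are hypothetical.

```python
def top_keywords(weights, k=20):
    """Normalize candidate weights to sum to 1 and keep the k heaviest words.

    weights: dict mapping candidate feature word -> full-text weight.
    Sum-to-one normalization is an assumption of this sketch.
    """
    total = sum(weights.values())
    if total == 0:
        return []
    normalized = {word: w / total for word, w in weights.items()}
    # Sort by descending weight; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(normalized.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical candidate weights for one patent text.
candidates = {"robot": 6.0, "arm": 3.0, "servo": 1.0}
selected = top_keywords(candidates, k=2)
print(selected)
```

With the default `k=20` the function returns at most 20 words, matching the top-20 rule in FIG. 2.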
In specific implementation, the embodiment of extracting feature words shown in fig. 2 can improve the accuracy of extracting features, thereby improving the accuracy of subsequently drawing a topographic map of a patent. An example of a patent vector after extracting keywords may be as shown in fig. 5.
Third, step 103 is described.
The main idea of step 103 in the embodiment of the invention is to divide the weight of a feature word into two parts: the intra-document weight ($w_l$) and the inter-document weight ($w_g$). The intra-document weight is calculated from the distribution of the feature word inside the document, and the inter-document weight is calculated mainly from how the feature word appears across the document set. The final weight is the product of the two: $w = w_l \times w_g$.
1) The determinants of the weights within the document are: word frequency (frequency) + co-occurrence distance (co-location) + paragraph position (opportunity) + concept hierarchy (Similarity).
Since a patent text has a definite paragraph structure and different paragraphs differ in importance, the embodiment of the invention assigns each paragraph a subjectively evaluated weight, so that the weight of a feature word in the whole text may be the sum of its weights in the individual paragraphs:
$w_i = \sum_{p} w_{ip}$ (1)
where $w_i$ is the weight of a feature word (candidate feature word or key feature word) in the whole text, and $w_{ip}$ is the weight of that feature word in paragraph $p$.
From the above, in one embodiment, determining the intra-document weight of each key feature word in the patent text may include:
determining the weight of each key feature word in each paragraph;
and determining the weight of each key characteristic word in the document in the patent text according to the weight of each key characteristic word in each paragraph.
The embodiment of the invention mainly studies the weight distribution scheme within a paragraph. Assuming that the weight of a paragraph is $w_p$, the weight of a feature word within the paragraph can be expressed as:
$w_{ip} = w_{ipf} \times (1 + w_{ipd}) \times w_p$ (2)
where $w_{ip}$ is the weight of the key feature word (or candidate feature word) within each paragraph, $w_p$ is the weight of the paragraph, $w_{ipf}$ is the word-frequency weight of the key feature word (or candidate feature word), and $w_{ipd}$ is the co-occurrence factor.
As can be seen from the above, in one embodiment, determining the weight of the key feature word in each paragraph may include determining the weight of the key feature word in each paragraph according to the above formula (2).
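The paragraph-level weight of formula (2), and the full-text weight obtained by summing it over paragraphs, can be sketched in Python as follows. This is only an illustration: the function names and the numeric values are hypothetical, and the word-frequency weight and co-occurrence factor are taken as given inputs rather than computed.

```python
def paragraph_weight(w_ipf, w_ipd, w_p):
    """Formula (2): w_ip = w_ipf * (1 + w_ipd) * w_p."""
    return w_ipf * (1.0 + w_ipd) * w_p


def full_text_weight(per_paragraph):
    """Sum a word's paragraph weights over all paragraphs of one text.

    per_paragraph: list of (w_ipf, w_ipd, w_p) tuples, one per paragraph.
    """
    return sum(paragraph_weight(f, d, p) for f, d, p in per_paragraph)


# Hypothetical values per paragraph:
# (word-frequency weight, co-occurrence factor, paragraph weight)
example = [(0.5, 0.2, 2.0), (0.25, 0.0, 1.0)]
w_i = full_text_weight(example)
print(w_i)
```

A co-occurrence factor of 0 leaves the word-frequency weight unchanged, so the factor can only raise a word's weight within a paragraph, never lower it.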
Within a paragraph, the frequency of a word indicates its weight: the higher the frequency, the greater the weight. That is, in one embodiment, the above method for drawing a patent topographic map based on structural text clustering may further include calculating the word-frequency weight according to the following formula:
[Formula (3) is available only as an image in the original publication: Figure BDA0003243286810000062]
where $w_{ipf}$ is the word-frequency weight of the key feature word, $f_{ip}$ is the number of occurrences of the key feature word in one paragraph, $n$ is the total number of key feature words, and $j$ is the index of a key feature word.
The embodiment of the invention also evaluates the degree of co-occurrence of words within a paragraph. Suppose that the co-occurrence distances of two feature words are $d_1, d_2, d_3, \ldots, d_m$.
The co-occurrence factor of the two words can then be defined as:
[Formula (4) is available only as an image in the original publication: Figure BDA0003243286810000071]
where $w_{ipd}$ is the co-occurrence factor, $d_j$ is the co-occurrence distance, $m$ is the total number of feature words, and $j$ is the index of a feature word.
As can be seen from the above, in one embodiment, the above method for drawing a patent topographic map based on structural text clustering may further include calculating the co-occurrence factor according to formula (4).
2) The decision factors for the inter-document weight are: document rate (concurrence).
In one embodiment, determining the inter-document weight of each key feature word in all patent texts may include:
determining the distribution condition of each key characteristic word in all patent texts;
and determining the inter-document weight of each key characteristic word in all patent texts according to the distribution condition of each key characteristic word in all patent texts.
In a specific implementation, the inter-document weight has the following meaning: if a feature word is uniformly distributed over the document set, it appears in many texts, so its ability to represent any particular text is considered weak and its inter-document weight is 0; if the feature word appears in only one text, its ability to represent that text is considered strong and its inter-document weight is the largest. That is, in one embodiment, determining the inter-document weight of each key feature word across all patent texts according to its distribution may include: the inter-document weight of each key feature word decreases as the number of patent texts in which that key feature word appears increases.
In a specific implementation, the mean square error can be used to evaluate how a feature word is distributed over the documents.
Suppose the weights of a feature word T in the documents of the document set are $w_k$ ($k = 1, 2, \ldots, |D|$). The aim is to evaluate whether these weights are evenly distributed among the documents, and the properties of the mean square error are used to measure their distribution:
[This formula is available only as an image in the original publication: Figure BDA0003243286810000072]
That is, the larger $w_g$ is, the more the weights of the feature word differ from document to document. If the feature word is uniformly distributed over the documents, then $w_g = 0$ and the feature word is excluded from clustering (i.e., it is not added to the clustering set used for cluster analysis in step 104). Considering the sparsity of the feature-word space, the formula can be simplified as:
[This formula is available only as an image in the original publication: Figure BDA0003243286810000081]
where $w_g$ is the inter-document weight, $D$ is the intra-document weight (i.e., the weight of the feature word in the $k$-th document), $k$ is the identifier (order) of the document, the symbol shown in Figure BDA0003243286810000084 is the weight average, and $i$ is the identifier (order) of the weight within the document.
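The behavior described for the inter-document weight can be illustrated with a short Python sketch. The published formulas are not reproduced in this text, so the sketch assumes that "mean square error" means the population variance of a word's weights over all documents; the function name and the sample weights are hypothetical.

```python
def inter_document_weight(doc_weights):
    """Population variance of one word's weights across all documents.

    doc_weights: the word's intra-document weight in every document of the
    set, with 0.0 for documents that do not contain the word.
    Using the variance is an assumption of this sketch.
    """
    n = len(doc_weights)
    if n == 0:
        return 0.0
    mean = sum(doc_weights) / n
    return sum((w - mean) ** 2 for w in doc_weights) / n


# A word spread evenly over four documents versus a word found in one only.
uniform = inter_document_weight([0.5, 0.5, 0.5, 0.5])
concentrated = inter_document_weight([2.0, 0.0, 0.0, 0.0])
print(uniform, concentrated)
```

Under this assumption an evenly distributed word receives a weight of 0 and would be left out of the clustering set, while a word concentrated in one document receives a clearly positive weight.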
Fourth, steps 104 and 105 are described together for ease of understanding.
In step 104, the K-means text clustering algorithm may be used; the patent clustering result may be as shown in FIG. 6.
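For readers unfamiliar with K-means, a minimal pure-Python sketch is given below. It is not the patent's implementation: the Euclidean distance, the choice of the first k vectors as initial centers, and the sample vectors are all assumptions made for illustration.

```python
import math


def kmeans(vectors, k, max_iter=100):
    """Plain K-means on equal-length numeric vectors.

    The initial centers are the first k vectors (a deterministic choice made
    for this sketch only). Returns (labels, centers).
    """
    centers = [list(v) for v in vectors[:k]]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 2-dimensional patent vectors (weights of two key feature words).
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centers = kmeans(docs, k=2)
print(labels)
```

The `max_iter` argument plays the role of a maximum number of iterations, and the unchanged-assignment check plays the role of a termination condition.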
In one embodiment, in the step 105, drawing a patent topographic map according to the clustering result may include:
mapping the feature vector corresponding to each key feature word in the clustering result onto pre-established polar coordinate axes at the corresponding angles, and calculating the polar coordinates corresponding to each feature vector;
converting the polar coordinates corresponding to each feature vector into Cartesian coordinates to obtain the centroid of the polygon enclosed by each feature vector, the centroid being the plane coordinate of that feature vector in the Cartesian coordinate system;
calculating the similarity of the cluster in which each feature vector is located, the similarity being the Z coordinate of the corresponding feature vector;
and obtaining the patent topographic map according to the plane coordinate and the Z coordinate of each feature vector.
In a specific implementation, the patent topographic map drawing algorithm may include the following.
A polar transformation is used to map the N-dimensional data space onto a plane for display, as shown in FIG. 3.
The N dimensions are distributed at equal angles around the circumference ($2\pi$), and each axis is scaled according to the actual value range of its dimension.
For any vector $V_k = \{v_i\}$ ($i = 0, 1, 2, \ldots, N-1$), the value of each dimension is mapped onto the coordinate axis at the corresponding angle, and the polar coordinate of the point is calculated:
[This formula is available only as an image in the original publication: Figure BDA0003243286810000082]
Converted to Cartesian coordinates, the point is:
$(v_i \cos\theta_i,\ v_i \sin\theta_i)$;
The centroid of the polygon enclosed by the vector $V_k$ is then:
[These formulas are available only as images in the original publication: Figure BDA0003243286810000083 and Figure BDA0003243286810000091]
This centroid is the plane coordinate of the vector $V_k$ in the Cartesian coordinate system, and the similarity of the cluster in which the vector is located is taken as the Z coordinate of the point. This completes the design of the drop point of the vector on the patent map. The drawing result of the patent topographic map may be as shown in FIG. 7 and FIG. 8.
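The mapping from an N-dimensional vector to a plane point can be sketched in Python as follows. Two assumptions are made because the published formulas are not reproduced in this text: the axis of dimension i is placed at angle span * i / N, and the centroid is taken as the mean of the N vertices. The function name `plane_point` and the `span` parameter are hypothetical.

```python
import math


def plane_point(vector, span=2 * math.pi):
    """Map an N-dimensional vector to a 2-D point.

    Dimension i lies on the axis at angle span * i / N. The returned point is
    the mean of the N vertices (v_i cos(theta_i), v_i sin(theta_i)), which is
    this sketch's stand-in for the polygon centroid.
    """
    n = len(vector)
    if n == 0:
        raise ValueError("vector must have at least one dimension")
    xs = [v * math.cos(span * i / n) for i, v in enumerate(vector)]
    ys = [v * math.sin(span * i / n) for i, v in enumerate(vector)]
    return sum(xs) / n, sum(ys) / n


# Axes spread over the full circle: equal values on opposite axes cancel.
full_circle = plane_point([1.0, 1.0, 1.0, 1.0])
# Axes confined to a quarter circle: the same vector keeps a non-zero point.
quarter = plane_point([1.0, 1.0, 1.0, 1.0], span=math.pi / 2)
print(full_circle, quarter)
```

The Z coordinate is not computed here; per the description it is the similarity of the cluster in which the vector is located.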
In a specific implementation, the detailed procedure for drawing the patent map may include:
1) To prevent data on opposing dimensions, such as 0° and 180° or 90° and 270°, from offsetting each other in the centroid when the feature vectors are distributed, 90° is chosen as the entire vector coordinate space.
2) Calculating the cluster coordinates (a cluster is one group in the clustering result):
a) Calculate the cluster coordinates by the polar coordinate transformation, with the origin as the center.
b) Calculate the distance of each cluster coordinate from the origin.
c) Shrink all cluster coordinates by the same scale (the reciprocal of the farthest distance among all clusters), so that they all lie within the unit circle.
3) Calculating the patent coordinates:
a) Calculate the coverage radius of each cluster: 1/2 of the distance to the nearest adjacent cluster.
b) Calculate the patent coordinates by the polar coordinate transformation, with the coordinates of the cluster containing the patent as the center.
c) Contract all patent coordinates to within the coverage radius of the cluster according to the similarity between each patent and the cluster, namely:
[This formula is available only as an image in the original publication: Figure BDA0003243286810000092]
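The scaling of cluster coordinates into the unit circle and the coverage radius of each cluster can be sketched in Python as follows. The function names and sample coordinates are hypothetical, and the contraction of patent coordinates is left out because its formula is not reproduced in this text.

```python
import math


def scale_to_unit_circle(points):
    """Shrink all points by 1 / (largest distance from the origin)."""
    farthest = max(math.hypot(x, y) for x, y in points)
    if farthest == 0:
        return list(points)
    return [(x / farthest, y / farthest) for x, y in points]


def coverage_radii(points):
    """Half the distance from each point to its nearest other point.

    Requires at least two points.
    """
    radii = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(nearest / 2)
    return radii


# Hypothetical cluster coordinates before scaling.
clusters = scale_to_unit_circle([(3.0, 4.0), (0.0, 2.5), (1.0, 0.0)])
radii = coverage_radii(clusters)
print(clusters, radii)
```

Because each radius is half the distance to the nearest neighbor, the coverage circles of two clusters can touch but never overlap.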
Fifth, to aid overall understanding, the main interface design of the patent clustering and mapping program is introduced below.
a) Extracting subject words
Function: analyze the text content, extract the subject words of the patent, and evaluate the contribution weight of each subject word to the subject of the full text.
Input: the title, abstract, main claim, body text, and other contents of the patent document, and the weight of each section.
Output: the keywords of the patent with their respective weights and concept groupings.
b) Clustering
Function: automatically group a collection of patent documents by topic similarity.
Input: the ID of each patent document, its subject words, the weight and concept group of each subject word, a reference word list, the number of clusters, whether to calculate coordinates, the maximum number of iterations, the cluster termination condition, and the number of worker threads.
Output: the subject word vector of each cluster, the patent documents contained in each cluster, and the distance between each patent document and the center of its cluster.
c) Comparing similarity
Function: compare the similarity of two word vectors.
Input: the two vectors to be compared.
Output: the similarity between the vectors.
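The description does not say which similarity measure the comparison interface uses. As one common possibility, the Python sketch below computes the cosine similarity of two sparse word vectors; the choice of cosine similarity, the function name, and the sample vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse word vectors given as {word: weight}."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical subject-word vectors of two patents.
patent_a = {"robot": 0.6, "arm": 0.3}
patent_b = {"robot": 0.6, "arm": 0.3}
patent_c = {"battery": 1.0}
print(cosine_similarity(patent_a, patent_b), cosine_similarity(patent_a, patent_c))
```

Vectors that share no words score 0, and vectors with the same direction score 1 regardless of their lengths.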
The method for drawing a patent topographic map based on structural text clustering provided by the embodiment of the invention therefore achieves the following:
1) Field-segmented extraction of patent subject words and vectorized representation of patents. This is the basis of patent text clustering. Because a segmented extraction method and a professional lexicon are used, the amount of non-technical vocabulary in the patent vector is greatly reduced.
2) Text clustering of patents. Thanks to the vectorization tailored to patents, the result of patent text clustering is closer to the result of patent technology classification.
3) Drawing of the patent topographic map. The drawing handles well the calculation of three types of distance: between category center points, between category center points and patent points, and between patent points. With the center points distributed uniformly overall, the first two of the three types of distance relation reflect text similarity as far as possible. At the same time, the density of the patent points faithfully reflects the distribution of technical research.
The embodiment of the invention also provides a device for drawing the patent topographic map based on the structural text clustering, which is described in the following embodiment. Because the principle of the device for solving the problems is similar to the method for drawing the patent topographic map based on the structural text clustering, the implementation of the device can refer to the implementation of the method for drawing the patent topographic map based on the structural text clustering, and repeated parts are not repeated.
Fig. 9 is a schematic structural diagram of a device for drawing a topographic patent map based on structured text clustering according to an embodiment of the present invention, as shown in fig. 9, the device includes:
the acquiring unit 01 is used for acquiring all target patent texts;
the extracting unit 02 is used for extracting key feature words from each target patent text according to different types of fields and preset weights corresponding to each type of field;
the weight determining unit 03 is used for determining the weight of each key feature word in the document in the patent text; determining the inter-document weight of each key characteristic word in all patent texts;
the processing unit 04 is configured to determine a key feature word added to the cluster set according to the intra-document weight and the inter-document weight; clustering the target patent text according to the key feature words added into the clustering set to obtain a clustering result;
and the drawing unit 05 is used for drawing the patent topographic map according to the clustering processing result.
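The way the five units hand data to one another can be sketched as below. This is only an illustrative skeleton: the class name, the callable signatures and the data shapes are assumptions made for the example, since the embodiment names the units but does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical alias: maps a key feature word to a weight.
Weights = Dict[str, float]


@dataclass
class TopographicMapDevice:
    """Hypothetical composition of units 01 to 05 as plain callables."""
    acquire: Callable[[], List[str]]                     # acquiring unit 01
    extract: Callable[[str], List[str]]                  # extracting unit 02
    determine_weights: Callable[
        [List[List[str]]], Tuple[List[Weights], Weights]
    ]                                                    # weight determining unit 03
    process: Callable[
        [List[List[str]], List[Weights], Weights], List[int]
    ]                                                    # processing unit 04
    draw: Callable[[List[int]], object]                  # drawing unit 05

    def run(self) -> object:
        texts = self.acquire()
        keywords = [self.extract(text) for text in texts]
        # intra: one weight table per patent text; inter: one table for the corpus.
        intra, inter = self.determine_weights(keywords)
        labels = self.process(keywords, intra, inter)
        return self.draw(labels)
```

Keeping each unit as an injected callable mirrors the description, in which every unit has a single stated responsibility, and lets any one stage be replaced without touching the others.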
In an embodiment, the extracting unit may be specifically configured to extract the key feature words corresponding to a target patent text as follows:
extracting candidate feature words from a target patent text according to different types of fields and preset weights corresponding to the fields of each type;
calculating co-occurrence factors among candidate characteristic words extracted from a target patent text;
determining the weight of the candidate characteristic words in the target patent text full text according to the co-occurrence factor;
and extracting a key feature word corresponding to the target patent text according to the weight of the candidate feature word in the target patent text full text.
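The four extraction steps above can be illustrated with a small sketch. The field names, the preset field weights and the particular co-occurrence factor (the share of the other candidate words that a word shares a sentence with) are all assumptions chosen for the example; this passage does not fix their values or definitions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical preset field weights; the source does not give the values.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5, "description": 1.0}


def extract_key_feature_words(fields, top_k=5):
    """fields maps a field name to a list of sentences (each a list of tokens)."""
    weighted_tf = Counter()
    pairs = set()
    for name, sentences in fields.items():
        field_weight = FIELD_WEIGHTS.get(name, 1.0)
        for tokens in sentences:
            # Step 1: candidate words, counted with the weight of their field.
            for token in tokens:
                weighted_tf[token] += field_weight
            # Step 2: record which distinct words share a sentence.
            pairs.update(combinations(sorted(set(tokens)), 2))

    candidates = list(weighted_tf)
    n_others = max(len(candidates) - 1, 1)
    partners = Counter()
    for a, b in pairs:
        partners[a] += 1
        partners[b] += 1

    # Step 3: full-text weight, boosted by the assumed co-occurrence factor.
    full_text_weight = {
        w: weighted_tf[w] * (1.0 + partners[w] / n_others) for w in candidates
    }
    # Step 4: keep the top-ranked candidates as key feature words.
    ranked = sorted(candidates, key=lambda w: (-full_text_weight[w], w))
    return ranked[:top_k], full_text_weight
```

Ties are broken alphabetically so that the ranking is deterministic.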
In an embodiment, the weight determining unit may be specifically configured to:
determining the weight of each key feature word in each paragraph;
and determining the intra-document weight of each key feature word in the patent text according to the weight of that key feature word in each paragraph.
In an embodiment, the weight determining unit may be specifically configured to determine the weight of the key feature word in each paragraph according to the following formula:
$w_{ip} = w_{ipf} \times (1 + w_{ipd}) \times w_{p}$
wherein $w_{ip}$ is the weight of the key feature word within each paragraph, $w_{p}$ is the weight of the paragraph, $w_{ipf}$ is the word-frequency weight of the key feature word, and $w_{ipd}$ is the co-occurrence factor.
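As a worked illustration of the paragraph-weight formula, the sketch below computes the weight of a key feature word in one paragraph and then combines the paragraph weights into an intra-document weight. The function names are hypothetical, and summing over paragraphs is an assumed aggregation rule; the embodiment only states that the intra-document weight is determined from the per-paragraph weights.

```python
def paragraph_weight(w_ipf, w_ipd, w_p):
    """Paragraph weight: word-frequency weight * (1 + co-occurrence factor) * paragraph weight."""
    return w_ipf * (1.0 + w_ipd) * w_p


def intra_document_weight(paragraph_terms):
    """Assumed aggregation: sum of the paragraph weights.

    paragraph_terms is an iterable of (w_ipf, w_ipd, w_p) tuples,
    one tuple per paragraph in which the key feature word occurs.
    """
    return sum(paragraph_weight(f, d, p) for f, d, p in paragraph_terms)
```

For example, a word with word-frequency weight 0.5 and co-occurrence factor 0.5 in a paragraph of weight 2.0 receives 0.5 × 1.5 × 2.0 = 1.5; if it also appears in a second paragraph with values (0.25, 0.0, 1.0), the assumed sum gives 1.5 + 0.25 = 1.75.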
In an embodiment, the weight determining unit may be specifically configured to:
determining the distribution condition of each key characteristic word in all patent texts;
and determining the inter-document weight of each key characteristic word in all patent texts according to the distribution condition of each key characteristic word in all patent texts.
In one embodiment, the weight determining unit may be specifically configured such that the inter-document weight of each key feature word across all patent texts decreases as the number of patent texts in which that key feature word appears increases.
In an embodiment, the drawing unit may be specifically configured to:
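One weighting with this decreasing behaviour is a smoothed inverse document frequency, sketched below. The logarithmic form is an assumption borrowed from common TF-IDF practice and is not stated in the embodiment; any function that falls as the document count of a word rises would satisfy the description.

```python
import math


def inter_document_weights(docs_keywords):
    """docs_keywords holds one collection of key feature words per patent text."""
    n_docs = len(docs_keywords)
    doc_freq = {}
    for words in docs_keywords:
        # Count each word at most once per patent text.
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumed smoothed IDF: the weight shrinks as the word appears in more texts.
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1.0
        for word, df in doc_freq.items()
    }
```

A word that occurs in every patent text receives the minimum weight of 1.0 under this form, while a word confined to a single text receives the largest weight.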
mapping the feature vector corresponding to each key feature word in the clustering processing result to a pre-established polar coordinate axis of a corresponding angle, and calculating to obtain a polar coordinate corresponding to each feature vector;
converting the polar coordinates corresponding to each feature vector into Cartesian coordinates to obtain the centroid of the polygon enclosed by each feature vector; the centroid is the plane coordinate of that feature vector in the Cartesian coordinate system;
calculating the similarity of the cluster where each feature vector is located; the similarity is a Z coordinate of the corresponding feature vector;
and obtaining the patent topographic map according to the plane coordinate and the Z coordinate of each feature vector.
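The coordinate steps above can be sketched as follows. Several details are assumptions made for the example: each vector component is given its own axis at evenly spaced angles, the polygon centroid is computed with the shoelace formula, and the similarity used as the Z coordinate is the cosine similarity between a feature vector and its cluster center. The embodiment does not specify these choices in this passage.

```python
import math


def plane_coordinate(vector):
    """Place component k on the axis at angle 2*pi*k/n and return the polygon centroid."""
    n = len(vector)
    # Polar (r, theta) to Cartesian (x, y) for every component.
    pts = [
        (r * math.cos(2 * math.pi * k / n), r * math.sin(2 * math.pi * k / n))
        for k, r in enumerate(vector)
    ]
    # Shoelace formula: twice the signed area plus the centroid accumulators.
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if abs(area2) < 1e-12:
        # Degenerate polygon: fall back to the mean of the vertices.
        return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_point(vector, cluster_center):
    """Return (x, y, z): the plane coordinate plus the assumed similarity as height."""
    x, y = plane_coordinate(vector)
    return (x, y, cosine_similarity(vector, cluster_center))
```

With this construction a vector whose components are all equal lands at the origin, and a vector dominated by one component is pulled toward that component's axis, so texts with similar dominant feature words end up near one another on the plane.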
An embodiment of the invention also provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above patent topographic map drawing method based on structural text clustering when executing the computer program.
An embodiment of the invention also provides a computer-readable storage medium storing a computer program for executing the above patent topographic map drawing method based on structural text clustering.
In the embodiment of the invention, the patent topographic map drawing scheme based on structural text clustering comprises: acquiring all target patent texts; extracting key feature words from each target patent text according to different types of fields and preset weights corresponding to each type of field; determining the intra-document weight of each key feature word in the patent text; determining the inter-document weight of each key feature word across all patent texts; determining the key feature words to be added to the cluster set according to the intra-document weight and the inter-document weight; clustering the target patent texts according to the key feature words added to the cluster set to obtain a clustering result; and drawing the patent topographic map according to the clustering result. This enables accurate drawing of the patent topographic map based on structural text clustering, so that information such as the degree of technical association and the technically dense points of the patented technology is accurately reflected.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A patent topographic map drawing method based on structural text clustering is characterized by comprising the following steps:
acquiring all target patent texts;
extracting key feature words from each target patent text according to different types of fields and preset weights corresponding to each type of field;
determining the intra-document weight of each key feature word in the patent text; determining the inter-document weight of each key feature word across all patent texts;
determining key feature words added into the clustering set according to the intra-document weight and the inter-document weight; clustering the target patent text according to the key feature words added into the clustering set to obtain a clustering result;
and drawing a patent topographic map according to the clustering result.
2. The method for drawing a patent topographic map based on structural text clustering as claimed in claim 1, wherein the extracting key feature words from each of the target patent texts according to different types of fields and preset weights corresponding to each type of field comprises: extracting a key feature word corresponding to a target patent text according to the following method:
extracting candidate feature words from a target patent text according to different types of fields and preset weights corresponding to the fields of each type;
calculating co-occurrence factors among candidate characteristic words extracted from a target patent text;
determining the weight of the candidate characteristic words in the target patent text full text according to the co-occurrence factor;
and extracting a key feature word corresponding to the target patent text according to the weight of the candidate feature word in the target patent text full text.
3. The method for drawing a patent topographic map based on structural text clustering as claimed in claim 1, wherein the determining the intra-document weight of each key feature word in the patent text comprises:
determining the weight of each key feature word in each paragraph;
and determining the intra-document weight of each key feature word in the patent text according to the weight of that key feature word in each paragraph.
4. The method for drawing a patent topographic map based on structural text clustering as claimed in claim 3, wherein determining the weight of each key feature word in each paragraph comprises determining the weight according to the following formula:
$w_{ip} = w_{ipf} \times (1 + w_{ipd}) \times w_{p}$
wherein $w_{ip}$ is the weight of the key feature word within each paragraph, $w_{p}$ is the weight of the paragraph, $w_{ipf}$ is the word-frequency weight of the key feature word, and $w_{ipd}$ is the co-occurrence factor.
5. The method for drawing a patent topographic map based on structural text clustering as claimed in claim 1, wherein the determining the inter-document weight of each key characteristic word in all patent texts comprises:
determining the distribution condition of each key characteristic word in all patent texts;
and determining the inter-document weight of each key characteristic word in all patent texts according to the distribution condition of each key characteristic word in all patent texts.
6. The method for drawing a patent topographic map based on structural text clustering as claimed in claim 5, wherein the determining the inter-document weight of each key feature word in all patent texts according to the distribution of each key feature word in all patent texts comprises: the inter-document weight of each key feature word across all patent texts decreases as the number of patent texts in which that key feature word appears increases.
7. The method for drawing the patent topographic map based on the structural text clustering as claimed in claim 1, wherein drawing the patent topographic map according to the clustering processing result comprises:
mapping the feature vector corresponding to each key feature word in the clustering processing result to a pre-established polar coordinate axis of a corresponding angle, and calculating to obtain a polar coordinate corresponding to each feature vector;
converting the polar coordinates corresponding to each feature vector into Cartesian coordinates to obtain the centroid of the polygon enclosed by each feature vector; the centroid is the plane coordinate of that feature vector in the Cartesian coordinate system;
calculating the similarity of the cluster where each feature vector is located; the similarity is a Z coordinate of the corresponding feature vector;
and obtaining the patent topographic map according to the plane coordinate and the Z coordinate of each feature vector.
8. The patent topographic map drawing device based on structural text clustering is characterized by comprising the following components:
the acquisition unit is used for acquiring all target patent texts;
the extraction unit is used for extracting key feature words from each target patent text according to different types of fields and preset weights corresponding to the fields;
the weight determining unit is used for determining the intra-document weight of each key feature word in the patent text, and determining the inter-document weight of each key feature word across all patent texts;
the processing unit is used for determining key characteristic words added into the clustering set according to the intra-document weight and the inter-document weight; clustering the target patent text according to the key feature words added into the clustering set to obtain a clustering result;
and the drawing unit is used for drawing the patent topographic map according to the clustering processing result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 7.
CN202111025719.7A 2021-09-02 2021-09-02 Patent topographic map drawing method and device based on structural text clustering Pending CN113886574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111025719.7A CN113886574A (en) 2021-09-02 2021-09-02 Patent topographic map drawing method and device based on structural text clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111025719.7A CN113886574A (en) 2021-09-02 2021-09-02 Patent topographic map drawing method and device based on structural text clustering

Publications (1)

Publication Number Publication Date
CN113886574A (en) 2022-01-04

Family

ID=79012108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111025719.7A Pending CN113886574A (en) 2021-09-02 2021-09-02 Patent topographic map drawing method and device based on structural text clustering

Country Status (1)

Country Link
CN (1) CN113886574A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN102136039A (en) * 2011-03-30 2011-07-27 保定市大为计算机软件开发有限公司 Method and equipment for establishing map model
CN106372051A (en) * 2016-10-20 2017-02-01 长城计算机软件与系统有限公司 Patent map visualization method and system
CN107766318A (en) * 2016-08-17 2018-03-06 北京金山安全软件有限公司 Keyword extraction method and device and electronic equipment
CN108932239A (en) * 2017-05-24 2018-12-04 西安科技大市场创新云服务股份有限公司 A kind of patent map modeling method and device
EP3611468A1 (en) * 2018-08-17 2020-02-19 Ordnance Survey Limited Vector tile pyramiding

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN107122352B (en) Method for extracting keywords based on K-MEANS and WORD2VEC
CN109947904B (en) Preference space Skyline query processing method based on Spark environment
Agathos et al. 3D articulated object retrieval using a graph-based representation
CN111368891B (en) K-Means text classification method based on immune clone gray wolf optimization algorithm
Gonçalves et al. The Impact of Pre-processing on the Classification of MEDLINE Documents
JP5094830B2 (en) Image search apparatus, image search method and program
CN107291895B (en) Quick hierarchical document query method
CN105740378B (en) Digital pathology full-section image retrieval method
WO2021028505A1 (en) Information retrieval and/or visualization method
CN113282756B (en) Text clustering intelligent evaluation method based on hybrid clustering
CN107784110A (en) A kind of index establishing method and device
CN105843925A (en) Similar image searching method based on improvement of BOW algorithm
CN111797267A (en) Medical image retrieval method and system, electronic device and storage medium
CN106919658B (en) A kind of large-scale image words tree search method and system accelerated based on GPU
CN111125329B (en) Text information screening method, device and equipment
CN110209895B (en) Vector retrieval method, device and equipment
CN112579783B (en) Short text clustering method based on Laplace atlas
CN113886574A (en) Patent topographic map drawing method and device based on structural text clustering
CN116089639A (en) Auxiliary three-dimensional modeling method, system, device and medium
Wu et al. Similar image retrieval in large-scale trademark databases based on regional and boundary fusion feature
Fan et al. Application of K-means algorithm to web text mining based on average density optimization
CN110717015A (en) Neural network-based polysemous word recognition method
Zaw et al. Web document clustering using Gauss distribution based cuckoo search clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination