CN111930953B - Text attribute feature identification, classification and structure analysis method and device - Google Patents

Text attribute feature identification, classification and structure analysis method and device

Info

Publication number
CN111930953B
Authority
CN
China
Prior art keywords
text
text attribute
classification
phrase
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010992100.2A
Other languages
Chinese (zh)
Other versions
CN111930953A (en)
Inventor
姜庭欣
陈伟然
李静毅
郭永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hexiang Wisdom Technology Co ltd
Original Assignee
Beijing Hexiang Wisdom Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hexiang Wisdom Technology Co ltd filed Critical Beijing Hexiang Wisdom Technology Co ltd
Priority to CN202011632896.7A priority Critical patent/CN112632286A/en
Priority to CN202010992100.2A priority patent/CN111930953B/en
Publication of CN111930953A publication Critical patent/CN111930953A/en
Application granted granted Critical
Publication of CN111930953B publication Critical patent/CN111930953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique

Abstract

The invention discloses a method and a device for identifying, classifying and structurally analyzing text attribute features. The method for identifying text attribute features comprises the following steps: generating a grammar structure according to the sentences in a target text; generating a data structure according to the node relations in the grammar structure; generating a first input vector from the data structure; determining, according to the first input vector and a preset text attribute feature classification model, the probability that each sentence contains attribute feature text; and identifying text attribute features in the target text according to the probability. By implementing the method and the device, the text attribute features in the target text can be accurately identified, the meaning of the text content can be recognized, deeper meanings of the text can be mined, the content of text recognition can be enriched, and more comprehensive data and content support can be provided for subsequent processes such as analysis based on the recognized text content.

Description

Text attribute feature identification, classification and structure analysis method and device
Technical Field
The invention relates to the technical field of data mining, in particular to a method and a device for recognizing, classifying and structurally analyzing text attribute features.
Background
With the development of the intellectual property field, the great value of patent data has gradually drawn attention, and how to effectively mine the value of such data is very important. However, existing text recognition and analysis methods only go as far as recognizing and analyzing the part of speech of the text in order to analyze its basic structure, and cannot uncover the deeper meaning of the text. When a user wants to search using, as keywords, a category of words that represent certain characteristics of the text (e.g., words representing effects, or commendatory and derogatory words), existing text recognition and analysis methods cannot effectively recognize such words and thus cannot meet the user's search requirements. Based on existing text recognition and analysis methods, the development and utilization of text content therefore remains greatly limited, and a technology capable of deeply mining text content is needed.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for identifying, classifying and analyzing text attribute features, so as to solve the problem that development and utilization of text content are still greatly limited based on the existing text identification and analysis method.
According to a first aspect, an embodiment of the present invention provides a method for identifying text attribute features, including: generating a grammar structure according to the sentences in the target text; generating a data structure according to the node relation in the grammar structure; generating a first input vector from the data structure; determining the probability that each statement contains an attribute feature text according to the first input vector and a preset text attribute feature classification model; and identifying text attribute features in the target text according to the probability.
With reference to the first aspect, in a first implementation manner of the first aspect, the generating a grammar structure according to a sentence in a target text includes: respectively identifying words of each sentence in the target text, and constructing a word directed graph; calculating the shortest path from the first node to the last node in the word directed graph as the word segmentation result of each statement; constructing a word sequence according to the word segmentation result of each sentence; generating an input vector according to adjacent words in the word sequence; obtaining an output vector according to a preset neural network model and the input vector; calculating the cosine value of the included angle between the input vector and the output vector; constructing a combination node by two adjacent words with the largest included angle cosine value until a root node of the word sequence is generated; and determining a syntactic structure of the word sequence according to the combination node and the root node.
With reference to the first aspect, in a second implementation manner of the first aspect, the preset text attribute feature classification model is an efficacy sentence classification model, and the efficacy sentence classification model is constructed through the following processes: acquiring an efficacy statement sample, wherein the efficacy statement sample comprises a preset efficacy mark and a sample statement; generating a first grammar structure according to the efficacy statement sample; generating a first feature list according to the node relation in the first syntactic structure; generating a first classification input vector according to the feature list, and generating a first classification output vector according to the preset efficacy mark and the sample statement; and training a preset classification model according to the first classification input vector and the first classification output vector to generate the efficacy sentence classification model.
With reference to the first aspect, in a third implementation manner of the first aspect, the preset text attribute feature classification model is an efficacy phrase classification model, and the efficacy phrase classification model is constructed through the following processes: acquiring an efficacy phrase sample, wherein the efficacy phrase sample comprises preset efficacy marks and sample phrases; generating a second grammar structure according to the efficacy phrase sample; generating a second feature list according to the node relation in the second syntactic structure; generating a second data structure according to a preset efficacy phrase; generating a second classification input vector according to the second feature list and a second data structure, and generating a second classification output vector according to the preset efficacy mark and the sample phrase; and training a preset classification model according to the second classification input vector and the second classification output vector to generate the preset efficacy phrase classification model.
With reference to the first aspect, in a fourth implementation manner of the first aspect, determining a probability that each sentence includes an attribute feature text according to the first input vector and a preset text attribute feature classification model includes at least one of the following steps: determining a first probability that each statement contains attribute features according to the first input vector and a preset text attribute feature classification model; determining a second probability that the phrase in each sentence contains the attribute features according to the first input vector and a preset text attribute feature classification model; and determining a third probability that each paragraph in each sentence contains attribute features according to the first input vector and a preset text attribute feature classification model.
According to a second aspect, an embodiment of the present invention provides a method for classifying text attribute features, including: identifying text attribute features in a target text according to sentences in the target text by the text attribute feature identification method according to the first aspect or any embodiment of the first aspect; constructing a text attribute feature set according to the recognition result of the target text; constructing a structure tree according to the text attribute feature set; respectively determining the class center points of all leaf nodes in the structure tree, and forming a class center set by a plurality of class center points; generating class center vectors according to the class center points, and generating leaf node vectors according to leaf nodes of the structure tree; and determining the classification node of each leaf node according to the class center vectors and the leaf node vectors.
With reference to the second aspect, in a first embodiment of the second aspect, the separately determining class center points of leaf nodes in the structure tree includes: respectively taking one leaf node in the structure tree as a central node, and calculating the average distance from the leaf node in the structure tree to the central node; determining the leaf node with the farthest average distance as the class center point.
With reference to the second aspect, in a second implementation manner of the second aspect, the determining a classification node of each leaf node according to the class center vectors and the leaf node vectors includes: calculating the Euclidean distance from the leaf node vector to each class center vector; and for each leaf node, taking the class center point corresponding to the class center vector with the minimum Euclidean distance as the classification node of that leaf node.
According to a third aspect, an embodiment of the present invention provides a method for constructing a classification model of text attribute features, including: acquiring an engineering parameter corpus sample and a text attribute feature sample, wherein the engineering parameter corpus sample comprises an engineering parameter text attribute feature and a text attribute feature type mark, and the text attribute feature sample is generated by identification according to the first aspect or the identification method of the text attribute feature of any one of the embodiments of the first aspect; generating an engineering parameter input vector according to the text attribute feature sample, and generating an engineering parameter output vector according to the engineering parameter corpus sample; and training the SVM model according to the engineering parameter input vector and the engineering parameter output vector to construct an engineering parameter classification model.
According to a fourth aspect, an embodiment of the present invention provides a method for classifying text attribute features, where the method includes: acquiring a target text, and identifying text attribute features of the target text by using a text attribute feature identification method according to the first aspect or any embodiment of the first aspect; and generating an engineering parameter classification result corresponding to the text attribute feature according to the text attribute feature and a preset engineering parameter classification model.
With reference to the fourth aspect, in a first implementation manner of the fourth aspect, the engineering parameter classification model is constructed according to the method for constructing a classification model of text attribute features of the third aspect.
According to a fifth aspect, an embodiment of the present invention provides a text structure analysis method, including: identifying phrases in each paragraph of the target text, and respectively forming a phrase set according to each paragraph; carrying out structured analysis on each phrase set to generate a structured set; identifying text attribute features in paragraphs of the structured collection according to the first aspect or the text attribute feature identification method according to any embodiment of the first aspect; and establishing the incidence relation between the structured set and the phrase set according to the identified text attribute characteristics.
With reference to the fifth aspect, in the first embodiment of the fifth aspect, if the paragraphs of the structured collection do not contain text attribute features, a first phrase collection closest to the current phrase and behind the current phrase is selected, and an association relationship is established between the first phrase collection and the structured collection.
With reference to the first implementation manner of the fifth aspect, in a second implementation manner of the fifth aspect, if the first phrase set is not found, a second phrase set closest to the current phrase and preceding the current phrase is selected, and an association relationship is established between the second phrase set and the structured set.
With reference to the second embodiment of the fifth aspect, in the third embodiment of the fifth aspect, if neither the first phrase set nor the second phrase set exists, the current phrase is ignored.
According to a sixth aspect, an embodiment of the present invention provides an apparatus for recognizing text attribute features, including: the grammar structure generating module is used for generating grammar structures according to sentences in the target text; the data structure generating module is used for generating a data structure according to the node relation in the grammar structure; a first input vector generation module for generating a first input vector according to the data structure; the probability determination module is used for determining the probability that each statement contains the attribute feature text according to the first input vector and a preset text attribute feature classification model; and the text recognition module is used for recognizing the text attribute characteristics in the target text according to the probability.
According to a seventh aspect, an embodiment of the present invention provides a device for classifying text attribute features, including: a text recognition module, configured to recognize, according to a sentence in a target text, a text attribute feature in the target text by using the text attribute feature recognition method according to the first aspect or any embodiment of the first aspect; the text attribute feature set construction module is used for constructing a text attribute feature set according to the recognition result of the target text; the structure tree construction module is used for constructing a structure tree according to the text attribute feature set; the class center set building module is used for respectively determining class center points of all leaf nodes in the structure tree and forming a class center set by a plurality of class center points; the vector generation module is used for generating class center vectors according to the class center points and generating leaf node vectors according to leaf nodes of the structure tree; and the classification node determining module is used for determining the classification node of each leaf node according to the class center vectors and the leaf node vectors.
According to an eighth aspect, an embodiment of the present invention provides a device for constructing a classification model of text attribute features, including: a sample obtaining module, configured to obtain an engineering parameter corpus sample and a text attribute feature sample, where the engineering parameter corpus sample includes an engineering parameter text attribute feature and a text attribute feature type flag, and the text attribute feature sample is generated by identification according to the first aspect or the text attribute feature identification method described in any one of the embodiments of the first aspect; the vector generation module is used for generating an engineering parameter input vector according to the text attribute feature sample and generating an engineering parameter output vector according to the engineering parameter corpus sample; and the model construction module is used for training the SVM model according to the engineering parameter input vector and the engineering parameter output vector to construct an engineering parameter classification model.
According to a ninth aspect, an embodiment of the present invention provides a text attribute feature classification apparatus, including: the text recognition module is used for acquiring a target text and recognizing an efficacy phrase of the target text by using a text attribute feature recognition method according to the first aspect or any embodiment of the first aspect; and the text classification module is used for generating an engineering parameter classification result corresponding to the efficacy phrase according to the text attribute characteristics and a preset engineering parameter classification model.
According to a tenth aspect, an embodiment of the present invention provides a text structure analysis apparatus, including: the phrase set generating module is used for identifying phrases in each paragraph of the target text and respectively forming a phrase set according to each paragraph; the structured set generation module is used for carrying out structured analysis on each phrase set to generate a structured set; a text recognition module, configured to recognize text attribute features in paragraphs of the structured collection according to the first aspect or the recognition method for text attribute features according to any embodiment of the first aspect; and the text relation building module is used for building the incidence relation between the structured set and the phrase set according to the identified text attribute characteristics.
According to an eleventh aspect, an embodiment of the present invention provides a computer apparatus, including: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing therein a computer instruction, and the processor executing the computer instruction to perform the method for identifying a text attribute feature according to the first aspect or any one of the embodiments of the first aspect, or perform the method for classifying a text attribute feature according to the second aspect or any one of the embodiments of the second aspect, or perform the method for constructing a classification model of a text attribute feature according to the third aspect, or perform the method for classifying a text attribute feature according to any one of the fourth aspect or the embodiments of the fourth aspect, or perform the method for analyzing a text structure according to any one of the fifth aspect or the embodiments of the fifth aspect.
According to a twelfth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the method for recognizing the text attribute features described in the first aspect or any one of the implementations of the first aspect, or execute the method for classifying the text attribute features described in the second aspect or any one of the implementations of the second aspect, or execute the method for constructing the classification model of the text attribute features described in the third aspect, or execute the method for classifying the text attribute features described in the fourth aspect or any one of the implementations of the fourth aspect, or execute the method for analyzing the text structure described in the fifth aspect or any one of the implementations of the fifth aspect.
By implementing the text recognition method and device of the invention, the text attribute features in the target text can be accurately recognized and the meaning of the text content can be understood. Compared with the prior art, beyond merely recognizing content such as the segmentation and part of speech of the text, the characters, words and phrases that represent text attribute features such as efficacy and effect can also be accurately recognized, so that the deeper meaning of the text can be mined, the content of text recognition can be enriched, and more comprehensive data and content support can be provided for subsequent processes such as analysis based on the recognized text content.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a flow chart of a text attribute feature identification method according to an embodiment of the present invention;
FIG. 2 shows a flowchart of step S10 of an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a directed graph G of an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a directed graph G of another embodiment of the present invention;
FIG. 5 shows a schematic structural diagram of a neural network model N according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a grammar structure in accordance with an embodiment of the invention;
FIG. 7 is a flowchart of constructing the efficacy sentence classification model according to an embodiment of the present invention;
FIG. 8 illustrates a flowchart for constructing the efficacy phrase classification model according to an embodiment of the present invention;
fig. 9 illustrates a structural diagram of a syntax tree T of an embodiment of the present invention;
FIG. 10 is a flowchart illustrating a method for classifying text attribute features according to an embodiment of the present invention;
fig. 11 shows a schematic structural diagram of a structure Tree of an embodiment of the present invention;
FIG. 12 is a schematic flow chart illustrating the construction of the engineering parameter classification model according to an embodiment of the present invention;
FIG. 13 is a flow chart illustrating a text structure analysis method according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a text attribute feature recognition apparatus according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a text attribute feature classification apparatus according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a text attribute feature classification apparatus according to another embodiment of the present invention;
FIG. 17 is a schematic structural diagram of an apparatus for constructing a classification model of text attribute features according to an embodiment of the present invention;
fig. 18 is a schematic structural diagram showing a text structure analysis apparatus according to an embodiment of the present invention;
fig. 19 is a hardware configuration diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background art, existing text recognition and analysis methods only operate at the level of the text itself, analyzing its basic structure, and cannot uncover the deeper meaning of the text. When a user wants to search using, as keywords, a category of words that represent certain characteristics of the text (e.g., words representing effects, or commendatory and derogatory words), existing text recognition and analysis methods cannot effectively recognize such words and thus cannot meet the user's search requirements. Moreover, language expression is diverse, and keyword search alone cannot quickly and effectively find the corresponding specific vocabulary.
Based on the above, the embodiment of the invention provides a method for mining and displaying attribute characteristics of text content. For any given text content, such as a patent or a patent set, the attribute feature content included in the text content, such as an efficacy sentence, an efficacy phrase and the like representing the efficacy, is mined, and the actual value of each text content can be analyzed.
The embodiment of the invention provides a text attribute feature identification method, as shown in fig. 1, the identification method mainly comprises the following steps:
step S10: generating a grammar structure according to the sentences in the target text;
optionally, in some embodiments of the present invention, as shown in fig. 2, in step S10, the process of generating a grammar structure according to the sentence in the target text mainly includes:
step S11: and respectively identifying words of each sentence in the target text, and constructing a word directed graph. In the embodiment of the invention, a Chinese dictionary Dict is predefined and consists of a common Chinese word stock and a special word stock. For the sentence sen in the target text, each single word and each word in Dict are firstly identified, and then the words are placed in a directed graph G.
For example, the target text is "artificial intelligence technology". The directed graph G constructed according to the target text is shown in fig. 3, and each word of the sentence sen corresponds to one edge in G.
Step S12: and calculating the shortest path from the first node to the last node in the word directed graph as the word segmentation result of each statement.
After the directed graph G shown in fig. 3 is constructed, with S1 as the start node and S7 as the end node, the route corresponding to the possible word combination in the sentence is constructed.
For example, suppose the dictionary contains the four words "artificial", "intelligent", "technology" and "artificial intelligence". The directed graph G then becomes as shown in fig. 4: the edge "artificial" points from S1 to S3, "intelligent" points from S3 to S5, "artificial intelligence" points from S1 to S5, and "technology" points from S5 to S7. If all edges are given the same weight, the shortest path from S1 to S7 is S1-S5-S7, containing only two edges, so the word segmentation result is "artificial intelligence" + "technology".
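The directed-graph construction can be sketched as follows in Python; this is an illustrative sketch under a toy dictionary, not code from the patent, and the helper names are assumptions.

```python
# Illustrative sketch (not the patent's code): build the word directed graph G.
# Nodes correspond to the positions between characters of the sentence; every single
# character and every dictionary word spanning positions i..j adds an edge i -> j.
from collections import defaultdict

def build_word_graph(sentence, dictionary):
    graph = defaultdict(list)              # graph[i] = [(j, word), ...] meaning an edge i -> j
    n = len(sentence)
    for i in range(n):
        graph[i].append((i + 1, sentence[i]))          # each single character
        for j in range(i + 2, n + 1):
            word = sentence[i:j]
            if word in dictionary:                     # each word found in Dict
                graph[i].append((j, word))
    return graph

dict_words = {"人工", "智能", "技术", "人工智能"}        # toy dictionary Dict
g = build_word_graph("人工智能技术", dict_words)
# g[0] contains (1, '人'), (2, '人工') and (4, '人工智能'); g[4] contains (6, '技术'), ...
```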
In the embodiment of the present invention, the method for finding the shortest path may specifically be as follows. arcs is defined as the adjacency matrix of the directed graph G, and matrix a holds the path lengths between pairs of vertices.
Initial conditions:
a) If there is an edge between node i and node j, arcs[i][j] = length of the edge; otherwise arcs[i][j] = infinity.
b) For the 0-th iteration of matrix a, a[i][j] = arcs[i][j].
An iterative algorithm:
a[i][j] = min( a[i][j], a[i][k] + a[k][j] )
where k = 1, 2, 3, …, n. The main idea is to update the path length from node i to node j whenever the length from node i to node k plus the length from node k to node j is smaller than the length of the direct path from node i to node j.
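This iteration is essentially the Floyd-Warshall all-pairs shortest-path algorithm; a minimal Python sketch, assuming arcs is already filled according to the initial conditions above:

```python
# Sketch of the iteration above (Floyd-style all-pairs shortest paths); arcs is the
# adjacency matrix of G with float('inf') where no edge exists.
def shortest_paths(arcs):
    n = len(arcs)
    a = [row[:] for row in arcs]                 # 0-th iteration: a[i][j] = arcs[i][j]
    for k in range(n):                           # k = 1, 2, ..., n
        for i in range(n):
            for j in range(n):
                # update if going through node k is shorter than the current path i -> j
                if a[i][k] + a[k][j] < a[i][j]:
                    a[i][j] = a[i][k] + a[k][j]
    return a

# With all edge weights equal to 1, a[0][n-1] is the minimal number of words covering
# the sentence; recovering the actual segmentation additionally needs a predecessor table.
```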
Step S13: constructing a word sequence according to the word segmentation result of each sentence; step S14: an input vector is generated from adjacent words in the sequence of words.
In the embodiment of the invention, after word segmentation is carried out on each sentence in the target text, a segmented word sequence S is obtained. The word segmentation results are input into a word2vec model for training, with 300 hidden layer nodes, and a 300-dimensional vector of each word is obtained through training. A neural network N is defined, in which the first (input) layer is a 600-dimensional vector, the second (hidden) layer is a 300-dimensional vector, and the third (output) layer is a 600-dimensional vector.
For every two adjacent words term1 and term2 in the word sequence S, the corresponding vectors Vec1 and Vec2 are obtained and spliced into a 600-dimensional vector Vec. Vec is used as both the input layer and the output layer of the neural network model N, which is trained by reverse iteration (back-propagation); the structure of the neural network model N is shown in fig. 5.
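A sketch of this setup, assuming gensim for the 300-dimensional word vectors and PyTorch for the 600-300-600 network N; the layer sizes follow the text, while the toy sentences, optimizer and epoch count are assumptions:

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

sentences = [["人工智能", "技术"], ["提高", "运行", "速度"]]      # segmented word sequences S (toy)
w2v = Word2Vec(sentences, vector_size=300, min_count=1)         # 300 hidden-layer nodes

class NetN(nn.Module):                                          # 600 -> 300 -> 600 network N
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(600, 300)
        self.output = nn.Linear(300, 600)
    def forward(self, x):
        return self.output(torch.tanh(self.hidden(x)))

net, loss_fn = NetN(), nn.MSELoss()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# every two adjacent words term1, term2 -> spliced 600-dim vector Vec
pairs = [torch.tensor(np.concatenate([w2v.wv[t1], w2v.wv[t2]]))
         for sent in sentences for t1, t2 in zip(sent, sent[1:])]

for _ in range(200):                                            # back-propagation iterations
    for vec in pairs:
        opt.zero_grad()
        loss = loss_fn(net(vec), vec)                           # Vec is both input and output layer
        loss.backward()
        opt.step()
```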
Step S15: obtaining an output vector according to a preset neural network model and an input vector; step S16: and calculating the cosine value of the included angle between the input vector and the output vector.
In the embodiment of the invention, a syntax tree T is generated for each sentence sen in the word sequence S, with each word as a leaf node. The process of generating the syntax tree T mainly includes: obtaining the vectors Vec1 and Vec2 corresponding to every two adjacent words term1 and term2 in sen, splicing them into a 600-dimensional vector Vec_in, feeding Vec_in into the preset neural network model N as its input layer to obtain the output layer vector Vec_out, and calculating the cosine value cos1 of the angle between Vec_in and Vec_out.
Step S17: constructing a combination node by two adjacent words with the largest included angle cosine value until a root node of a word sequence is generated; step S18: and determining the grammar structure of the word sequence according to the combination nodes and the root nodes.
Through the above process, the cosine value of the angle is calculated for every 2 adjacent words in the target text, the 2 adjacent words with the largest cosine value are determined and merged into an intermediate node of the syntax tree T, and the vector of that node is v = vec1 + vec2. The process is repeated until the root node root is obtained, thereby forming the grammar structure of the target text sen.
For example, the target text sen contains 5 words, term1 to term5, each corresponding to a leaf node. The first iteration merges term1 and term2 into a combined node node1, the second iteration merges term3 and term4 into node2, and the third iteration merges node2 and term5 into node3; in the last round, node1 and node3 are merged into the root node, finally forming the grammar structure shown in fig. 6.
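The greedy merging loop can be sketched as follows; net_n is assumed to wrap the trained network N (accepting and returning numpy arrays) and vec_of(word) to return a word's 300-dimensional vector:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_syntax_tree(words, vec_of, net_n):
    # each node is (label, vector, children); initially one leaf node per word
    nodes = [(w, vec_of(w), []) for w in words]
    step = 0
    while len(nodes) > 1:
        best_i, best_cos = 0, -2.0
        for i in range(len(nodes) - 1):                    # every 2 adjacent nodes
            vec_in = np.concatenate([nodes[i][1], nodes[i + 1][1]])
            vec_out = net_n(vec_in)                        # output layer of network N
            c = cosine(vec_in, vec_out)
            if c > best_cos:
                best_i, best_cos = i, c
        step += 1
        left, right = nodes[best_i], nodes[best_i + 1]
        merged = (f"node{step}", left[1] + right[1], [left, right])   # v = vec1 + vec2
        nodes[best_i:best_i + 2] = [merged]
    return nodes[0]                                        # the final merged node is the root
```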
Step S20: generating a data structure according to the node relation in the grammar structure;
for the target text, through the above step S10, the corresponding syntax tree T (i.e. syntax structure) is determined, and the root node is root, the intermediate nodes n1, n2, …, nk, the leaf nodes L1, L2, Lm. Generating a data structure by using 2 nodes with parent-child relationship, for example: [ node1 → node2], thereby composing a feature list. For the syntactic structure shown in fig. 6, the corresponding feature list fea1 is: [ root → node1], [ root → node3], [ node1 → term1], [ node1 → term2], [ node3 → node2], [ node3 → term5], [ node2 → term3], [ node2 → term4 ].
Step S30: a first input vector is generated from the data structure. In the embodiment of the invention, the identification of the text attribute features of the target text is mainly carried out by combining the pre-trained classification model, so that the input vector of the classification model is formed according to the data structure after the data structure of the target text is obtained.
Each entry in the data structure also expresses the association relation between two words. Therefore, the process of forming the input vector of the classification model based on the data structure mainly comprises the following steps: the words contained in the data structure are input into a word2vec model for training, with 300 hidden layer nodes, and a 300-dimensional vector of each word is obtained through training. The first (input) layer of the classification model is defined as a 600-dimensional vector, the second (hidden) layer as a 300-dimensional vector, and the third (output) layer as a 600-dimensional vector.
For two related words in the data structure (e.g., [node2 → term4] as described above), the corresponding vectors Vec2 and Vec4 are obtained and spliced into a 600-dimensional vector Vec as the input vector of the classification model.
Step S40: and determining the probability of each sentence containing the attribute feature text according to the first input vector and a preset text attribute feature classification model.
After the input vector is obtained in step S30, the input vector is input to the text attribute feature classification model constructed by training, so as to determine the probability that each sentence in the target text contains the attribute feature text.
Step S50: and identifying text attribute features in the target text according to the probability.
After the probability that each sentence in the target text contains attribute feature text is determined, whether text attribute features exist in the target text can be identified according to this probability. For example, the sentences whose probability of containing attribute feature text is greater than a preset threshold are determined, the sentences are ranked by this probability, and the top 10% of the ranked sentences are regarded as containing attribute feature text.
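A sketch of this selection rule; the 0.5 threshold is an assumed example value, and sentence_prob is an assumed mapping from sentence to model probability:

```python
def select_attribute_sentences(sentence_prob, threshold=0.5, top_ratio=0.10):
    ranked = sorted(sentence_prob.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, int(len(ranked) * top_ratio))       # keep the top 10% by probability
    return [sent for sent, p in ranked[:cutoff] if p > threshold]
```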
The text attribute feature recognition method of the embodiment of the invention can accurately recognize the text attribute features in the target text and realize recognition of the meaning of the text content. Compared with the prior art, beyond recognizing content such as the segmentation and part of speech of the text, it can also accurately recognize the characters, words and phrases that represent text attribute features such as efficacy and effect, so that deeper meanings of the text can be mined, the content of text recognition can be enriched, and more comprehensive data and content support can be provided for subsequent processes such as analysis based on the recognized text content.
Optionally, in some embodiments of the present invention, the target text may be a patent document or a patent set. For example, the text attribute feature corresponding to the patent document or patent set may be an efficacy word representing the beneficial effect and technical effect of the patent: for the sentence "achieving the effect of increasing the operation speed" in the target text, the efficacy word is "increasing", and other efficacy words may be, for example, "convenient to use", "improving work efficiency", "simple in structure", and the like. However, the efficacy word is merely one kind of text attribute feature, used here only for illustration and not intended to limit the invention.
Hereinafter, the embodiments of the present invention will be specifically described with reference to patent documents or patent sets as the target text, and with reference to functional words as the attribute features of the text, but the embodiments are merely illustrative and are not intended to limit the present invention.
Optionally, in some embodiments of the present invention, for the classification of efficacy sentences, the preset text attribute feature classification model may be an efficacy sentence classification model, and as shown in fig. 7, the efficacy sentence classification model may be constructed through the following process:
step S71: acquiring an efficacy statement sample, wherein the efficacy statement sample comprises a preset efficacy mark and a sample statement; the efficacy sentence sample contains sen (sentences) and efficacy flag (0/1 value, whether efficacy sentence or not). For example, the sample: helping the user to visualize the structure obscured in the text, 1.
Step S72: generating a first grammar structure according to the efficacy sentence sample;
step S73: generating a first feature list according to the node relation in the first syntactic structure;
step S74: generating a first classification input vector according to the feature list, and generating a first classification output vector according to a preset efficacy mark and a sample statement;
step S75: and training a preset classification model according to the first classification input vector and the first classification output vector to generate an efficacy sentence classification model.
Specifically, a syntax tree T (i.e., a grammar structure) is extracted for each efficacy sentence sample, with root node root, intermediate nodes n1, n2, …, nk, and leaf nodes L1, L2, …, Lm. Every 2 nodes with a parent-child relationship generate a data structure, e.g. [node1 → node2], and these compose a feature list, which is used as the input of the classification model; whether the sample is an efficacy sentence is used as the output of the classification model. In the embodiment of the present invention, the classification model may be an SVM, and a Gaussian kernel is used for training to obtain the efficacy sentence classification Model1.
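A sketch of this training step using scikit-learn's SVC with an RBF (Gaussian) kernel; the random arrays stand in for the spliced 600-dimensional pair vectors and the 0/1 efficacy flags, which would be built as described in step S30:

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 600)          # toy stand-in for the 600-dim [parent -> child] pair vectors
y = np.random.randint(0, 2, 200)       # toy stand-in for the 0/1 efficacy flags

model1 = SVC(kernel="rbf", probability=True)   # Gaussian-kernel SVM
model1.fit(X, y)

# model1.predict_proba(new_vectors)[:, 1] later gives the probability r1 that a
# sentence is an efficacy sentence.
```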
Optionally, in some embodiments of the present invention, for the classification of efficacy phrases, the preset text attribute feature classification model may be an efficacy phrase classification model, which is constructed through the following process, as shown in fig. 8:
step S81: acquiring an efficacy phrase sample, wherein the efficacy phrase sample comprises preset efficacy marks and sample phrases;
step S82: generating a second grammar structure from the efficacy phrase samples;
step S83: generating a second feature list according to the node relation in the second syntactic structure;
step S84: generating a second classification input vector according to the feature list, and generating a second classification output vector according to a preset efficacy mark and a sample phrase;
step S85: and training a preset classification model according to the second classification input vector and the second classification output vector to generate an efficacy phrase classification model.
The first part of an efficacy phrase sample is the feature list of the grammar structure of the sentence sen. The second part is the composition of the phrase: qualifier + central word + efficacy word. For example, for "increase the operation speed", the central word is "speed", the qualifier is "operation", and the efficacy word is "increase", so the second part of the efficacy phrase sample has the form speed|central word + operation|qualifier + increase|efficacy word. The third part of the efficacy phrase sample is the efficacy flag, a 0/1 value indicating whether the sample is an efficacy phrase.
For example, an efficacy phrase sample may be: "achieving the effect of increasing the operation speed"; speed|central word + operation|qualifier + increase|efficacy word; 1.
A syntax tree T (grammar structure) is extracted for the efficacy phrase samples, as shown in fig. 9, with root node root, intermediate nodes n1, n2, …, nk, and leaf nodes L1, L2, …, Lm. Every 2 nodes with a parent-child relationship generate a data structure, e.g. [node1 → node2], and these compose a feature list. If term1, term2 and term3 form an efficacy phrase, the 3 words constitute the data structure [term1, term2, term3], where the first position is the central word, the second position is the qualifier, and the third position is the efficacy word. The two parts are combined together as the input of the classification model, and whether the sample is an efficacy phrase is the output; the classification model may adopt an SVM, and a Gaussian kernel is used for training to obtain the efficacy phrase classification Model2.
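A sketch of composing the combined Model2 input and training the classifier; w2v is assumed to be the word2vec model from the earlier step, and the random arrays are toy stand-ins for real samples:

```python
import numpy as np
from sklearn.svm import SVC

def phrase_input_vector(pair_vec, central, qualifier, efficacy, w2v):
    # part 1: a spliced grammar-structure pair vector; part 2: the word vectors of the
    # phrase triple [central word, qualifier, efficacy word]
    triple = np.concatenate([w2v.wv[central], w2v.wv[qualifier], w2v.wv[efficacy]])
    return np.concatenate([pair_vec, triple])

# In practice each row of phrase_X would be built with phrase_input_vector; the toy
# arrays below only stand in for real samples.
phrase_X = np.random.randn(200, 600 + 900)
phrase_y = np.random.randint(0, 2, 200)                  # 0/1 efficacy phrase flags
model2 = SVC(kernel="rbf", probability=True).fit(phrase_X, phrase_y)
```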
Optionally, in some embodiments of the present invention, the step S40: the process of determining the probability that each sentence contains the attribute feature text according to the first input vector and the preset text attribute feature classification model may specifically include at least one of the following steps:
determining a first probability that each sentence contains attribute features according to the first input vector and a preset text attribute feature classification model;
determining a second probability that the phrase in each sentence contains the attribute features according to the first input vector and a preset text attribute feature classification model;
and determining a third probability that the paragraph in each sentence contains the attribute features according to the first input vector and a preset text attribute feature classification model.
In the embodiment of the present invention, the classification of the text attribute features of the target text may be not limited to the recognition of sentences, but may be extended to the recognition of phrases and the recognition of paragraphs. Of course, those skilled in the art should understand that the method for identifying text attribute features according to the embodiments of the present invention is not limited to only the sentences, phrases and paragraphs listed above, but also includes other text forms, and the present invention is not limited thereto.
Specifically, for a patent document, based on the efficacy sentence Model1, efficacy sentence judgment can be performed on each sentence of the patent document as the target text, obtaining the probability r1 that the corresponding sentence is an efficacy sentence; the efficacy phrase Model2 can be used to extract efficacy phrases from each sentence of the patent document as the target text, obtaining the probability r2 that a phrase is an efficacy phrase.
For each phrase of the target text, the probability R that the phrase is an efficacy phrase is obtained as R = r1*w1 + r2*w2 + …, where w1 and w2 are the weights of r1 and r2 and can be set as required. The weighted terms used to calculate the efficacy phrase probability R are not limited to r1 and r2 and can be expanded to include the efficacy paragraph probability r3 as required.
From the above calculation, if R is greater than a certain threshold and belongs to the top 10 of all phrases, the phrase is an efficacy phrase. Those skilled in the art will appreciate that "R is greater than a certain threshold" and "top 10 of all phrases" are merely example criteria; in practice the criteria may be adjusted as needed.
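A sketch of the weighted combination and the example selection criteria; the weights, the 0.5 threshold and top_k = 10 are illustrative values only:

```python
def phrase_score(r1, r2, w1=0.5, w2=0.5, r3=None, w3=0.0):
    R = r1 * w1 + r2 * w2                      # R = r1*w1 + r2*w2 (+ r3*w3 if used)
    if r3 is not None:
        R += r3 * w3
    return R

def is_efficacy_phrase(R, all_scores, threshold=0.5, top_k=10):
    top = sorted(all_scores, reverse=True)[:top_k]
    return R > threshold and R >= top[-1]      # above the threshold and within the top 10
```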
For example, in a patent document the effect descriptions essentially concern efficacy, whether they appear in the summary of the invention or in the detailed description; by recognizing the corresponding efficacy phrase descriptions, the effect portions can be located quickly, and the beneficial effects of the solution of the patent document can be understood more accurately. In addition, in practical application, index labels can be added to the patent document based on the recognized efficacy phrases, so that users searching the patent document gain one more dimension for understanding it.
For a patent set, efficacy sentence judgment can be performed on each sentence of each patent in the set, obtaining the probability r1 that the sentence is an efficacy sentence, and efficacy phrases can be extracted from each sentence of the patent, obtaining the probability r2 that a phrase is an efficacy phrase.
For each phrase of the target text, the probability R that the phrase is an efficacy phrase is obtained as R = r1*w1 + r2*w2, where w1 and w2 are the weights of r1 and r2 and can be set as needed. The weighted terms used to calculate R are not limited to r1 and r2 and may be extended to include the efficacy paragraph probability r3 as needed.
Through the above calculation, if R is greater than a certain threshold and belongs to the top 10% of all phrases, the phrase is an efficacy phrase. Those skilled in the art should understand that these judgment criteria are only examples and can be adjusted as needed in practical applications.
And after identifying the corresponding efficacy phrases, integrating the efficacy phrases appearing in the patent set to obtain the efficacy phrase set of the patent set. Based on the efficacy phrase set, a macroscopic understanding of the overall situation of the patent set can be achieved. In practical applications, a corresponding patent set is usually constructed based on fields, technical directions, keywords, and the like as a core, or is presented based on contents retrieved by a user. For the patent sets, the method for identifying the text attribute features can be used for effectively identifying the effect phrases in the patent sets, and subsequent statistical analysis and other processing can be performed based on the identification result.
As described above, in the embodiment of the present invention, the target text is not limited to the patent document or the patent set, but the corresponding content may be identified for various texts, and the identified content is not limited to the sentence representing the efficacy class, and may be other types of words, sentences, and the like that represent the attribute feature of the text, and the present invention is not limited thereto.
The method for identifying text attribute features according to the embodiments of the present invention is further described below with reference to some specific application examples.
In practical applications, when a user wants to search text contents such as patent documents, the user often searches through a search formula composed of key words, technical fields, classification numbers and other related core words, but due to the limitation of the search formula, the obtained search results are often large in batch, and the search range is large. In this case, if the specific effects and beneficial effects of the patent documents can be further screened, the range of the searched contents can be effectively further narrowed, and more accurate search results can be provided to the user.
Therefore, the method for identifying text attribute features provided by the embodiment of the invention can further identify the efficacy content of text such as patent documents, so that users can screen text content based on it. For example, suppose the content retrieved by a user includes a functional description of "electric furnace + rise + water + temperature". This functional description existing in a patent document can be recognized by the text attribute feature recognition method described in any of the above embodiments, where "water" is a central word (i.e., the action object); "rise" is an efficacy word (namely, the action applied to the action object), and since it also indicates a positive or negative effect it can further be defined as a direction word; "temperature" is a parameter; and "electric furnace" is a qualifier (in this example, the electric furnace is the carrier of the function). Here, the qualifier is a word that qualifies the core concept of the efficacy; the central word refers to the core concept of the technical efficacy; and the direction word is a word describing the direction in which the technical effect is optimized.
Therefore, the user can select the literature content related to "electric furnace + rise + water + temperature" based on the screening of further function descriptions on the basis of the basic search. Of course, in practical applications, the functional description existing in the literature may not be so specific, and may be more generalized, for example, "raise + water + temperature", "heat water", "raise temperature", "electric furnace heat water", or only include a part of the functional description, but no matter how the form of the functional description changes, the form of the central word + efficacy word included in the functional description does not change, and therefore, for the functional description, the method for identifying the text attribute features according to the embodiment of the present invention may be identified, so as to provide the user with processes such as selection, screening, and the like.
The embodiment of the present invention further provides a method for classifying text attribute features, as shown in fig. 10, the method mainly includes:
step S101: and identifying text attribute features in the target text according to the sentences in the target text. In the embodiment of the present invention, the classification method for text attribute features is based on the premise that the text attribute features of the sentences of the target text are identified, and thus classification is performed based on the text attribute features. Optionally, in some embodiments of the present invention, the marked words, phrases, and the like may be identified by indexing, highlighting, or the like, so as to obtain corresponding text attribute features. In some embodiments of the present invention, the target text may also be identified by the text attribute feature identification method described in any of the above embodiments, so as to obtain an identification result of the text attribute feature.
Step S102: and constructing a text attribute feature set according to the recognition result of the target text. For the recognized text attribute features, corresponding central words are usually included, for example, as described above, the resulting word combinations include not only efficacy words, but also central words and qualifiers, and for such word combinations, a text attribute feature set T is defined, which includes n elements. For example: t1= run + speed, there being n such combinations T1 throughout, thus forming the set T.
Step S103: constructing the structure tree according to the text attribute feature set. Based on the text attribute feature set T, a structure Tree is built, as shown in fig. 11: root is the root node, each element in T is a leaf node of the Tree, and the leaf nodes form a set L including L1, L2, …, Ln. If any 2 leaf nodes Li and Lj have co-occurred in S, Li and Lj are placed under one intermediate node. The nodes on the layer above the leaf nodes are Node nodes, including Node1, Node2, …, Nodek. All Node nodes are connected to the root.
Step S104: and respectively determining the class center points of all leaf nodes in the structure tree, and forming a class center set by a plurality of class center points.
In the embodiment of the invention, in the process, one leaf node in the structure tree is respectively used as a central node, and the average distance from the leaf node to the central node in the structure tree is calculated; the leaf node with the largest average distance is determined as the class center point.
Step S105: generating class center vectors according to the class center points, and generating leaf node vectors according to the leaf nodes of the structure tree; step S106: determining the classification node of each leaf node according to the class center vectors and the leaf node vectors.
In the embodiment of the present invention, the process of determining the classification node of each leaf node according to the class center vectors and the leaf node vectors mainly includes: calculating the Euclidean distance from the leaf node vector to each class center vector; and for each leaf node, taking the class center point corresponding to the class center vector with the minimum Euclidean distance as the classification node of that leaf node.
The above process is to classify the above n elements (i.e. the identified text attribute features) into m clusters (for example, m =10000 may be defined), specifically, the classification process is as follows:
(1) initializing a class center set C, wherein the initial state C is null;
(2) randomly selecting a leaf Li, and adding the leaf Li into the set C;
(3) For the remaining nodes, the leaf Lj with the largest average distance to the leaf nodes Li already in set C is calculated, where the distance equals the number of tree edges traveled from one node to the other. The leaf Lj with the largest average distance is added to set C.
(4) Step (3) is repeated until set C contains m elements, i.e., m class center points; cluster_i denotes one such center point (see the sketch below).
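A sketch of steps (1)-(4), with the structure Tree given as an adjacency list and the distance measured as the number of tree edges between two nodes (computed by BFS); m and the random starting leaf are free choices:

```python
import random
from collections import deque

def tree_distances(adj, src):
    # BFS distances (number of edges) from src to every node of the tree
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pick_class_centers(adj, leaves, m):
    centers = [random.choice(leaves)]                     # (1)(2) start from a random leaf
    dist_maps = [tree_distances(adj, centers[0])]
    while len(centers) < m and len(centers) < len(leaves):   # (3)(4) repeat until m centers
        candidates = [l for l in leaves if l not in centers]
        farthest = max(candidates,
                       key=lambda l: sum(d[l] for d in dist_maps) / len(dist_maps))
        centers.append(farthest)                          # leaf with the largest average distance
        dist_maps.append(tree_distances(adj, farthest))
    return centers
```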
Each cluster_i consists of a qualifier and a central word; its qualifier vector and central word vector are obtained and spliced into a 600-dimensional vector_i.
Each leaf node leaf likewise obtains a 600-dimensional vector vec formed by splicing its qualifier vector and central word vector. The Euclidean distance r from vec to each cluster_i vector is calculated, and the class center point with the smallest r is taken as the classification of the leaf.
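A sketch of the assignment step; leaf_vec and each class-center vector are the 600-dimensional splices of a qualifier vector and a central word vector described above:

```python
import numpy as np

def classify_leaf(leaf_vec, center_vecs):
    dists = [np.linalg.norm(leaf_vec - c) for c in center_vecs]   # Euclidean distance r
    return int(np.argmin(dists))                                  # index of the nearest class center
```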
Through the above process, for each leaf, the corresponding class center can be found. The class-centered phrase is a classification result of each text attribute feature in the class.
For example, leaf L1 "running speed" is a class center obtained through the above process. Leaf L2 "sliding speed" is closest to L1, so L1 is the classification result of L2.
Further, in some embodiments of the present invention, the cosine of the angle between the phrase vectors of different class centers may also be calculated (for example, through the processes of step S11 to step S17). If the cosine value between two class-center phrases is greater than a preset threshold (for example, 0.95), the meanings expressed by the two class-center phrases can be regarded as synonymous or near-synonymous, and the two class centers can be further grouped together to obtain a further classification result.
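A possible sketch of this further merging step is shown below; grouping each center with the first sufficiently similar group representative is a simplification assumed here:

```python
import numpy as np

def merge_similar_centers(center_vectors, threshold=0.95):
    """Merge class centers whose phrase vectors have a cosine similarity
    above the threshold, treating them as synonymous or near-synonymous."""
    groups = []                              # each group is a list of center indices
    for i, vec in enumerate(center_vectors):
        for group in groups:
            rep = center_vectors[group[0]]   # compare against the group representative
            cos = np.dot(vec, rep) / (np.linalg.norm(vec) * np.linalg.norm(rep))
            if cos > threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```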
In practical applications, the functional descriptions contained in the content retrieved by a user may take various forms while their meanings belong to the same type of function or achieve the same type of efficacy. For example, "electric furnace + raise + water + temperature", "heating water", "raised temperature" and "electric furnace heating water" all have essentially the same content, namely raising the temperature of water, and differ only slightly in form of expression. Such functional descriptions can therefore be classified by the text attribute feature classification method of the embodiment of the invention, so that text contents that differ only in expression form are grouped together, descriptions whose substantive content differs are separated, and a corresponding set of efficacy phrases is induced. At the same time, functional descriptions with the same essential content are prevented from being split into separate categories, which would increase the complexity of the classification result.
Based on the above classification process, at least two problems can be addressed. On the one hand, in practical applications a user may also search on the basis of an efficacy description; from the user's perspective, the aim is to quickly retrieve all expression forms related to that efficacy rather than to enumerate every expression form with the same substantive content. As the examples above show, a functional description with one actual meaning may have many different expression forms; the user may be unable to list them all, and even if the user could, the search process would become very cumbersome. Simplifying the search process is therefore a key concern for any retrieval scheme based on text content. With the above classification method, functional descriptions with substantially the same meaning are grouped together; based on the classification result, the user only needs to input one of the expression forms when searching, and the database or document library can identify, from the classification to which that expression form belongs, the functional description category the user intends to search and extract the corresponding results. This processing mode therefore greatly simplifies the user's operations and yields more comprehensive retrieval results that meet the user's needs.
On the other hand, the classification result obtained through the above process can be applied to databases, document libraries and similar scenarios. After the user retrieves results based on keywords or other search expressions, those primary results can be classified according to the functional descriptions they contain, so that the main types of functional description in the primary results are extracted. A further screening condition can then be added to the filters of the database or document library, with the classified functional descriptions offered as screening items. This gives the user search filters with more dimensions and helps the user quickly and accurately locate the range or content to be searched.
At this point, by the text attribute feature recognition method of any of the above embodiments, the above functional description present in a patent document can be recognized, in which water is the central word (i.e. the action object), raise is the efficacy word (i.e. the action applied to the action object), temperature is a parameter, and electric furnace is the qualifier (in this example the electric furnace is the function carrier). The user can therefore, on the basis of a basic search, further filter for literature related to "electric furnace + raise + water + temperature" through the functional description. Of course, in practical applications the functional description in a document may not be so specific; it may be more general, or only part of it may be present. But however the form of the functional description changes, the central word + efficacy word structure it contains does not, so such functional descriptions can still be recognized by the text attribute feature identification method of the embodiment of the present invention and offered to the user for selection, filtering and other processing.
Alternatively, in some embodiments of the present invention, when classifying text attribute features, a situation may arise in which one type of word expresses a "positive" meaning and another expresses a "negative" meaning, while the actual meanings of the two are substantially the same. For example, "raise + correct rate" and "lower + error rate" have the same actual meaning. However, "raise + correct rate" is a "positive" expression and "lower + error rate" is a "negative" one: if only one type of expression is classified, part of the words is obviously lost, and if the two are classified separately, the result does not match their actual meaning. Therefore, in the embodiment of the present invention, to avoid this situation, words with the same actual meaning may also be subjected to normalization during classification.
For the m class-center phrases, a cognitive normalization rule base can be established in the form of an index, covering both the central word and the text attribute feature (in this embodiment the efficacy word is taken as the example). Antonym lexicons are established separately. For central words, for example, error rate - correct rate: for each central word in the m phrases, the central word with the opposite meaning is obtained, and each negative central word is replaced by its positive counterpart. For efficacy words, for example, decrease - increase: for each efficacy word in the m phrases, the efficacy word with the opposite meaning is obtained, and each negative efficacy word is replaced by its positive counterpart.
Through this processing mode, the two types of words with the same actual meaning are unified, so that when the target text is identified, classified and otherwise processed, the corresponding results are more accurate and better match the actual meaning expressed by the text content.
In practical application, the normalization process may use intelligent semantic analysis techniques from artificial intelligence to learn and train semantic relations such as positive and negative efficacy words with the same or opposite meanings, forming a semantic analysis model that can classify the input words; a large number of words can then be normalized quickly and efficiently through this model. The normalization process combined with intelligent semantic analysis mainly includes:
1. Identifying the efficacy categories in the target text through the semantic analysis model, and classifying them into positive efficacy phrases (containing words such as increase and raise) and negative efficacy phrases (containing words such as decrease and reduce).
The process of training the semantic analysis model based on the efficacy category words mainly comprises the following steps:
A) For example, 1 million positive corpora and 1 million negative corpora can be selected to establish the corpus. Each corpus entry has the format qualifier + central word + efficacy word. For example, for "increase the operation speed", the central word is speed, the qualifier is operation, and the efficacy word is raise, so the corresponding corpus structure is: speed | central word + operation | qualifier + raise | efficacy word. The category is marked with a 0/1 value, 0 being negative and 1 being positive. Thus, for the positive efficacy phrase "increase the operation speed" and the negative efficacy phrase "decrease the operation speed", the corresponding corpus entries are:
speed | central word + operation | qualifier + raise | efficacy word, 1;
speed | central word + operation | qualifier + reduce | efficacy word, 0.
B) Each efficacy phrase is classified into a category using an SVM algorithm; training with a Gaussian kernel on the corpus yields the semantic analysis model.
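A minimal sketch of this training step, assuming scikit-learn and pre-computed phrase vectors, might look as follows:

```python
import numpy as np
from sklearn.svm import SVC

def train_polarity_model(phrase_vectors, labels):
    """Sketch of step B): an SVM with a Gaussian (RBF) kernel that labels an
    efficacy phrase vector as positive (1) or negative (0).

    phrase_vectors -- array of shape (n_samples, dim), e.g. spliced
                      qualifier / central-word / efficacy-word embeddings
    labels         -- 0/1 polarity marks taken from the corpus
    """
    model = SVC(kernel="rbf")    # Gaussian kernel, as described above
    model.fit(np.asarray(phrase_vectors), np.asarray(labels))
    return model                 # model.predict(...) then yields 0/1 polarity labels
```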
2. Converting the identified negative phrases into positive phrases:
A) Establish an antonym lexicon for central words, e.g. error rate - correct rate. For each central word in the m phrases, obtain the central word with the opposite meaning, and replace each negative central word with its positive counterpart.
B) Establish an antonym lexicon for efficacy words, e.g. reduce - raise. For each efficacy word in the m phrases, obtain the efficacy word with the opposite meaning, and replace each negative efficacy word with its positive counterpart.
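One possible reading of steps A) and B), in which the central word and the efficacy word of a negative phrase are flipped together so that the overall meaning is preserved, can be sketched as follows (the lexicon entries are illustrative):

```python
def normalize_phrase(qualifier, central_word, efficacy_word,
                     central_antonyms, efficacy_antonyms):
    """Rewrite a negative phrase in its positive form using two antonym
    lexicons, e.g. {"error rate": "correct rate"} for central words and
    {"reduce": "raise"} for efficacy words."""
    if central_word in central_antonyms and efficacy_word in efficacy_antonyms:
        central_word = central_antonyms[central_word]
        efficacy_word = efficacy_antonyms[efficacy_word]
    return qualifier, central_word, efficacy_word

# "detection" + "error rate" + "reduce"  ->  ("detection", "correct rate", "raise")
print(normalize_phrase("detection", "error rate", "reduce",
                       {"error rate": "correct rate"}, {"reduce": "raise"}))
```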
Meanwhile, the above example only describes the normalization of positive and negative efficacy words. In an alternative embodiment of the present invention, normalization is not limited to efficacy words with opposite semantics: it may also be applied to words with similar semantics (for example, synonyms or near-synonyms). Words such as "raise" and "increase" are synonyms or near-synonyms expressing the same meaning, and the other synonyms or near-synonyms can be normalized onto one of these efficacy words. This normalization further improves the accuracy of classification based on efficacy words; moreover, in subsequent retrieval, classification or screening, the results the user obtains after selecting among the normalized screening words better meet the user's needs and improve the user experience.
In practical application combining the various modes of the embodiment of the present invention, it has been found that the text attribute features of the target text addressed by the identification and classification methods of the embodiment also accord with the widely used TRIZ theory (the theory of inventive problem solving; TRIZ reveals the internal laws and principles underlying inventions, focuses on clarifying and emphasizing the contradictions existing in a system, and aims to resolve those contradictions completely and obtain the final ideal solution).
The TRIZ definition of the engineering parameter set is shown in Table 1:
TABLE 1
(Table 1 is reproduced as an image in the original document; its content is not shown here.)
As can be seen from Table 1, the TRIZ theory provides a classification of efficacy parameters, which accords with the concept of efficacy parameters used as examples in the embodiment of the present invention; the TRIZ theory likewise provides a division of physical parameters, which accords with the text attribute features of the target text addressed in the embodiment of the present invention.
Therefore, an embodiment of the present invention provides a method for classifying text attribute features, which mainly includes:
firstly, a target text is obtained, and text attribute characteristics of the target text are identified. In the embodiment of the present invention, the classification method for text attribute features is based on the premise that the text attribute features of the sentences of the target text are identified, and thus classification is performed based on the text attribute features. Optionally, in some embodiments of the present invention, the marked words, phrases, and the like may be identified by indexing, highlighting, or the like, so as to obtain corresponding text attribute features. In some embodiments of the present invention, the target text may also be identified by the text attribute feature identification method described in any of the above embodiments, so as to obtain an identification result of the text attribute feature.
And then, an engineering parameter classification result corresponding to the text attribute feature is generated according to the text attribute feature and a preset engineering parameter classification model. The preset engineering parameter classification model is a neural network model constructed by pre-training according to the engineering parameter classification of the TRIZ theory, so that, based on the trained classification model, a result corresponding to the engineering parameter classification can be obtained from the text attribute feature.
Alternatively, in some embodiments of the present invention, as shown in fig. 12, the engineering parameter classification model may be trained and constructed through the following process:
step S121: acquiring an engineering parameter corpus sample and a text attribute feature sample, wherein the engineering parameter corpus sample comprises an engineering parameter text attribute feature and a text attribute feature type mark. For example, for "add weight, 1", "add weight" is the engineering parameter text attribute feature and the corresponding type mark 1 indicates that it belongs to the engineering parameter text attribute features; a mark of 0 would indicate that the phrase does not belong to them. In an alternative embodiment, the text attribute feature sample described herein may be generated by the text attribute feature recognition of any of the above embodiments;
step S122: and generating an engineering parameter input vector according to the text attribute feature sample, and generating an engineering parameter output vector according to the engineering parameter corpus sample. The input vector and the output vector required for training the model are constructed based on the samples.
Step S123: and training the SVM model according to the engineering parameter input vector and the engineering parameter output vector to construct an engineering parameter classification model. In the embodiment of the invention, an SVM algorithm can be used for classifying each text attribute feature phrase into an engineering parameter text attribute feature, and a Gaussian kernel is used for training a neural network model so as to construct a required engineering parameter classification model.
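As a sketch only, corpus samples of the form "phrase, mark" can be turned into input and output vectors and fed to a Gaussian-kernel SVM as follows; the encode function standing in for the phrase-to-vector step is an assumption of the sketch:

```python
import numpy as np
from sklearn.svm import SVC

def train_engineering_parameter_model(corpus_samples, encode):
    """Sketch of steps S121-S123. Each corpus sample is a (phrase, mark)
    pair such as ("add weight", 1), where the mark states whether the phrase
    is an engineering parameter text attribute feature."""
    X = np.array([encode(phrase) for phrase, _ in corpus_samples])  # S122: input vectors
    y = np.array([mark for _, mark in corpus_samples])              # S122: output vectors
    model = SVC(kernel="rbf")                                       # S123: Gaussian kernel
    model.fit(X, y)
    return model
```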
Through the above method, words representing text attribute features are classified with reference to the relatively mature TRIZ theory. Based on this classification mode, the target text can be further analyzed and mined, and classifying it by the engineering parameters of the TRIZ theory gives the target text a more systematic classification scheme. Moreover, as the TRIZ theory is commonly applied across many fields, the content presented by the combined text classification results better satisfies user needs, and the results can serve as important reference content for mining, analysis and solution extension in the corresponding field.
Optionally, in some embodiments of the present invention, in combination with the text attribute feature classification method described in the above embodiments, the classification of text contents need not rely only on the engineering parameters of the TRIZ theory; it may also use the functional classification of engineering parameters in the TRIZ theory. In the TRIZ theory, the words corresponding to the engineering parameters can be divided into useful functions, harmful functions, insufficient functions and excessive functions. This classification can be considered on two levels: on the one hand, classification may be performed only according to which of these functional categories a word belongs to, without relying on the engineering parameter classification described above; on the other hand, on the basis of the engineering parameter classification, the words may be further divided according to which kind of function they belong to. The specific classification mode can be set according to actual needs. For example, different classification modes can be offered for the user to select: classification may be performed directly by function, or, for words the user chooses to divide by different classification levels, further classification may be performed on top of the engineering parameter classification; the invention is not limited in this respect.
In addition, in practical application, the text attribute feature analysis method based on the engineering parameters of the TRIZ theory and the text attribute feature analysis methods of other embodiments of the present invention may be used in combination. For the content the user wants to search, instead of listing only the analysis results of one aspect, the retrieval results classified by general efficacy words and the retrieval results classified by the TRIZ engineering parameters can be displayed at the same time, so that the corresponding retrieval results are presented in different dimensions and at different levels, giving the user a more comprehensive and detailed set of results. For such a retrieval scheme, an "automatic efficacy word composite search function" may, for example, be added to a database or document library. When the user selects this function and enters a keyword to be searched, such as "loose-proof", in the corresponding input field, the system of the database or document library searches, based on that keyword, for words described under the same kind of function, and looks up the related keywords, technical efficacy sentences and technical efficacy phrases within the corresponding classifications of technical efficacy and TRIZ engineering parameters. The user then selects, from the technical efficacy sentences and technical efficacy phrases found, the efficacy sentence, efficacy phrase or efficacy word related to the efficacy being searched. Correspondingly, if the keyword entered by the user is itself a related efficacy sentence, efficacy phrase or efficacy word, the system of the database or document library can reversely look up the corresponding technical efficacy levels 1-3 and TRIZ parameters from that content.
For the retrieval results obtained in the above process, the user can further screen according to their own information needs and the screening items listed by the system, so as to find the text content actually wanted. In practical application, the user can also combine the retrieved keywords into a compound search formula using OR statements, so as to search the efficacy-related content comprehensively.
An embodiment of the present invention further provides a text structure analysis method, as shown in fig. 13, the text structure analysis method mainly includes:
step S131: phrases in paragraphs of the target text are recognized, and a phrase set is formed according to the paragraphs. For any target text (for example, patent document d), paragraphs in the target text are identified, and phrases appearing in the same paragraph constitute a phrase set S. For example, if there are k paragraphs, p _1 to p _ k, in patent document d, the phrase set S is composed of k phrase sets S1, S2, …, Sk.
Step S132: carrying out structural analysis on each phrase set to generate a structured set. Each paragraph of patent document d is subjected to structural analysis, yielding a set struct that contains k elements, each element being the structured result of one paragraph.
Step S133: text attribute features in paragraphs of the structured collection are identified. In the embodiment of the present invention, the classification method for text attribute features is based on the premise that the text attribute features of the sentences of the target text are identified, and thus classification is performed based on the text attribute features. Optionally, in some embodiments of the present invention, the marked words, phrases, and the like may be identified by indexing, highlighting, or the like, so as to obtain corresponding text attribute features. In some embodiments of the present invention, the target text may also be identified by the text attribute feature identification method described in any of the above embodiments, so as to obtain an identification result of the text attribute feature.
Step S134: and establishing an incidence relation between the structured set and the phrase set according to the identified text attribute characteristics.
In the embodiment of the invention, if a paragraph p _ i corresponding to struct _ i contains a text attribute feature phrase set S _ i, a corresponding relation between struct _ i and S _ i is established.
And if the paragraph p _ i corresponding to struct _ i does not contain the text attribute feature phrase, selecting a phrase set S _ j which is closest to p _ i and appears behind p _ i.
If S_j does not exist, a phrase set S_m which is closest to p_i and appears before p_i is selected, and the corresponding relation is established.
If S _ m is still not present, p _ i is ignored.
Through the above process, the correspondence between the extracted structures and the efficacy phrases of patent document d is determined.
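A minimal sketch of this association step, with one structured result and one (possibly empty) efficacy phrase set per paragraph, might be:

```python
def link_structures_to_phrases(structs, phrase_sets):
    """Sketch of step S134: pair the structured result of paragraph p_i with
    an efficacy phrase set, preferring the paragraph's own set, then the
    nearest following set, then the nearest preceding set, else skipping it.

    structs     -- list of structured results, one per paragraph
    phrase_sets -- list of efficacy phrase sets, one per paragraph; an empty
                   list means the paragraph contains no efficacy phrase
    Returns a dict mapping paragraph index i to the index of its phrase set.
    """
    links = {}
    n = len(structs)
    for i in range(n):
        if phrase_sets[i]:                                   # p_i has its own phrases
            links[i] = i
            continue
        following = next((j for j in range(i + 1, n) if phrase_sets[j]), None)
        if following is not None:                            # nearest S_j after p_i
            links[i] = following
            continue
        preceding = next((j for j in range(i - 1, -1, -1) if phrase_sets[j]), None)
        if preceding is not None:                            # nearest S_m before p_i
            links[i] = preceding
        # otherwise p_i is ignored
    return links
```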
For example, for a target text:
(p1) The gantry moves longitudinally through the lead screw and the connecting block, driving the cutter to move longitudinally; this matching mode operates stably, and the engraving precision is improved;
(p2) Secondly, a plurality of centers are provided on the center mount, and through the cooperation of the guide sleeve and the guide post, a plurality of workpieces can be clamped simultaneously simply by pushing the center mount; tightening the fastening bolt on the guide sleeve fixes the center mount, which is convenient to operate and also improves efficiency.
As can be seen, the target text has 2 paragraphs p1 and p 2.
For paragraph p 1:
1, obtaining a structured struct _1 through the steps S131 and S132: gantry-connecting-screw, gantry-driving-tool;
2. Based on steps S131 and S132, in combination with step S133, the efficacy phrase set S_1 is obtained: stable operation, improved engraving precision;
3. Based on steps S131 to S133, in combination with step S134, the association relationship between struct_1 and S_1 is determined: S_1 is the efficacy achieved through struct_1.
For paragraph p 2:
1, obtaining a structured struct _2 through the steps S131 and S132: center-containing-center, center-connecting-guide sleeve, center-connecting-guide post, center-pushing-workpiece;
2. Based on steps S131 and S132, in combination with step S133, the efficacy phrase set S_2 is obtained: convenient operation, improved efficiency;
3. Based on steps S131 to S133, in combination with step S134, the association relationship between struct_2 and S_2 is determined: S_2 is the efficacy achieved through struct_2.
Further, for a target text set containing a plurality of target texts (for example, a patent set D containing a plurality of patent documents), the phrase sets of each patent are first obtained through the above process and structured, and the correspondence between each struct_i and the text attribute feature phrases is determined. Then, for each phrase set s, all the corresponding structures in the patent set D are obtained; and for each structure struct, all the corresponding efficacy phrases in the patent set D are obtained. In this way, the correspondence between the extracted structures and the efficacies across the patent set D is completed.
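The aggregation over a document set can be sketched as follows, assuming the per-paragraph links have already been produced (for example by a routine like the one above) and that each structure is represented by a hashable value such as a triple string:

```python
from collections import defaultdict

def aggregate_over_document_set(linked_pairs):
    """Collect, across all documents in the set D, every efficacy phrase seen
    with a given structure and every structure seen with a given phrase.

    linked_pairs -- iterable of (struct, phrases) pairs, one per linked
                    paragraph, e.g. ("gantry-driving-tool",
                    ["improved engraving precision"])
    """
    phrases_by_struct = defaultdict(set)
    structs_by_phrase = defaultdict(set)
    for struct, phrases in linked_pairs:
        for phrase in phrases:
            phrases_by_struct[struct].add(phrase)
            structs_by_phrase[phrase].add(struct)
    return phrases_by_struct, structs_by_phrase
```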
Through the above process, a structured technical scheme can be recommended on the basis of the analysis of text attribute features, or the text attribute feature content of a structured technical scheme can be predicted. Certain limiting conditions can also be combined to give more accurate results. For example, combining an efficacy with a technical field or keywords yields a targeted technical solution; or, given a structured technical solution and its key components, a more accurate prediction of the efficacy can be given (multiple prediction results can be returned, with probabilities expressed as percentages).
In the above method embodiments, text attribute features are mined, analyzed, identified and clustered from the perspectives of a text attribute feature identification method, a text attribute feature classification method, a text structure analysis method and so on. In practical application, each of these can be implemented on its own to obtain the corresponding analysis and processing results, which are then provided to users. Optionally, in some embodiments of the present invention, the methods of the above method embodiments may also be combined to form an overall processing procedure for text mining and analysis, so that the different results obtained from the text are presented to the user from multiple angles and at multiple levels.
In the embodiment of the present invention, the solutions corresponding to the different method embodiments may together form an analysis and processing architecture for text, and this architecture may include processing results at several different levels, for example: a language layer, a perception layer, a cognitive layer, a standard parameter layer and the like.
The language layer is used for representing the labeling and splitting of elements such as efficacy words and efficacy phrases; it embodies the process of identifying efficacy words and efficacy phrases as text attribute features, and may be a combination comprising a qualifier, a central word and a directional word. The language layer can be obtained by the text attribute feature identification method described in the above method embodiments.
The perception layer is used for representing the result of classifying the efficacy words in the text, namely the combination of words with the same meaning and their standardized form. The perception layer can be obtained by the text attribute feature classification method of the above method embodiments; the inductive ability from the language layer to the perception layer corresponds to the computer achieving perception of technical efficacy.
The cognitive layer is used for representing the result of normalizing the expression of the efficacy words in the text: on the basis of the technical efficacy phrases of the perception layer, the qualifier, the central word and the directional word are taken as a whole and unified and normalized according to the standard of human cognitive consistency with respect to the essential content of the language expression. The cognitive layer can be obtained through the normalization process in the text attribute feature classification method of the above method embodiments. The inductive ability from the perception layer to the cognitive layer corresponds to the computer achieving cognition of the essential content of technical efficacy.
The standard parameter layer is used for representing the clustering result of the technical efficacies of the cognitive layer, matched to the engineering parameters of the technical contradiction matrix of the TRIZ theory to form efficacy TRIZ parameters. The standard parameter layer can be obtained by the text attribute feature classification method combined with the TRIZ theory in the above method embodiments. The inductive ability from the cognitive layer to the TRIZ parameter layer corresponds to the computer summarizing technical efficacy in the way an invention expert would.
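Purely as an illustration of how the four layers fit together, one efficacy phrase can be carried through the architecture in a simple data structure like the following; the field names are assumptions of this sketch, and the values follow Example 1 below:

```python
def layered_representation():
    """One efficacy phrase ("detection error rate reduction") represented at
    each layer of the analysis architecture."""
    return {
        "language_layer": {"qualifier": "detection", "central_word": "error rate",
                           "directional_word": "reduce",
                           "phrase": "detection error rate reduction"},
        "perception_layer": {"qualifier": "detection", "central_word": "error rate",
                             "directional_word": "reduce",
                             "phrase": "detection error rate reduction"},
        "cognitive_layer": {"qualifier": "detection", "central_word": "accuracy",
                            "directional_word": "improve",
                            "phrase": "detection accuracy improvement"},
        "standard_parameter_layer": "28 - measurement accuracy",
    }
```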
In practical application, based on the processing procedures of the above method embodiments, the corresponding analysis and processing results can be generated for the text a user wishes to retrieve and offered for the user's selection, or different analysis results can be displayed according to the user's screening, so that the retrieval and analysis results for the retrieved text are presented from different angles and in different ways. The processing results of the analysis architecture constructed from the above method embodiments are described below with reference to several specific examples.
Example 1
Text 1: the device has the advantages of being convenient to install and reducing the detection error rate.
Two efficacy phrases are extracted for text 1.
Efficacy phrase 1 of text 1: convenient to install
Layer | Technical efficacy qualifier | Technical efficacy central word | Technical efficacy directional word | Technical efficacy phrase
Language layer | mounting | convenience | - | convenient to install
Perception layer | mounting | convenience | improvement | improvement in mounting convenience
Cognitive layer | mounting | convenience | improvement | improvement in mounting convenience
Standard parameter layer | 33 - convenience of the operating procedure
Efficacy phrase 2 of text 1: detection error rate reduction
Layer | Technical efficacy qualifier | Technical efficacy central word | Technical efficacy directional word | Technical efficacy phrase
Language layer | detection | error rate | reduce | detection error rate reduction
Perception layer | detection | error rate | reduce | detection error rate reduction
Cognitive layer | detection | accuracy | improvement | detection accuracy improvement
Standard parameter layer | 28 - measurement accuracy
Text 2: the device is convenient to install and high in detection accuracy.
Efficacy phrase 1 of text 2: convenient installation
Layer | Technical efficacy qualifier | Technical efficacy central word | Technical efficacy directional word | Technical efficacy phrase
Language layer | mounting | convenience | - | convenient installation
Perception layer | mounting | convenience | improvement | improvement in mounting convenience
Cognitive layer | mounting | convenience | improvement | improvement in mounting convenience
Standard parameter layer | 33 - convenience of the operating procedure
Efficacy phrase 2 of text 2: the detection accuracy is high
Layer | Technical efficacy qualifier | Technical efficacy central word | Technical efficacy directional word | Technical efficacy phrase
Language layer | detection | accuracy | high | the detection accuracy is high
Perception layer | detection | accuracy | improvement | detection accuracy improvement
Cognitive layer | detection | accuracy | improvement | detection accuracy improvement
Standard parameter layer | 28 - measurement accuracy
Example 2 is an example of the results of normalizing the expression of efficacy words in text, which is specifically achieved by the cognitive layer described above:
text 1: the device has the advantages of being convenient to install and reducing the detection error rate.
Text 2: the device is convenient to install and high in detection accuracy.
It can be seen that "detection error rate is reduced" in text 1 and "detection accuracy is high" in text 2 are two efficacy expressions with the same actual meaning. When a user searches for a text that can improve detection accuracy, the input may be "how to improve detection accuracy", "a technique with a low detection error rate", or the like. Without normalization of the technical efficacy, text 1 and text 2 cannot both be matched successfully.
After the efficacy extraction by the method of the embodiment of the invention, the results of the cognitive layer are as follows:
user input 1 User input 2 Text 1 Text 2
Original content How to improve the detection accuracy Techniques for detecting low error rates Detection error rate is reduced The detection accuracy is high
Phrase of action Improve the detection accuracy Low detection error rate Detection error rate reduction The detection accuracy is high
Cognitive layer efficacy Detection-accuracy-improvement Detection-accuracy-improvement Detection-accuracy-improvement Detection-accuracy-improvement
It can be seen that, after the user input requirements are normalized, both questions correspond to "detection accuracy improvement" at the cognitive layer, and the essential content of text 1 and text 2 at the cognitive layer is completely consistent. Both text 1 and text 2 are matched successfully, and compared with the case where the technical efficacy is not normalized, the comprehensiveness of the search is improved.
Example 3 is an example of the results of normalizing the expression of efficacy words in text, specifically achieved by the cognitive layer described above:
text 4: the water dispenser is convenient to install.
Text 5: a drinking machine with convenient use.
The user inputs the question: a water dispenser with convenient installation.
Judged by sentence pattern and textual similarity alone, text 5 is more similar to the user input than text 4, and a conventional search engine might give text 5 a higher similarity score. From a human understanding, however, the question in text 4 and in the user input restricts the efficacy description to the convenience of installation, as opposed to the convenience of the "use" process in text 5.
After the efficacy extraction is carried out by the method of the embodiment of the invention, the cognitive layer efficacy and matching result are as follows:
user input Text 4 Text 5
Original content Water dispenser convenient to mount The water dispenser is convenient to install Convenient drinking machine
Phrase of action Convenient to install Convenient installation Convenient to use
Cognitive layer efficacy Mounting-convenience-enhancement Mounting-convenience-enhancement Use-convenience-improvement
Portion perfectly matching user input Technical effect limiting word, central word and direction word Technical effect central word and directional word
Parts not exactly matching with user input Is free of Technical efficacy qualifier
Therefore, based on the results obtained by the method of the embodiment of the present invention, the cognitive layer technical efficacy of text 4 matches the cognitive layer technical efficacy of the user input better than that of text 5.
An embodiment of the present invention further provides an apparatus for identifying text attribute features, as shown in fig. 14, the apparatus for identifying text attribute features mainly includes:
a grammar structure generating module 141, configured to generate a grammar structure according to the sentence in the target text; for details, reference may be made to the related description of step S10 in the above method embodiment, and details are not repeated herein;
a data structure generating module 142, configured to generate a data structure according to the node relationship in the syntax structure; for details, reference may be made to the related description of step S20 in the above method embodiment, and details are not repeated herein;
a first input vector generation module 143 configured to generate a first input vector according to the data structure; for details, reference may be made to the related description of step S30 in the above method embodiment, and details are not repeated herein;
a probability determining module 144, configured to determine, according to the first input vector and a preset text attribute feature classification model, a probability that each sentence contains an attribute feature text; for details, reference may be made to the related description of step S40 in the above method embodiment, and details are not repeated herein;
the text recognition module 145 is used for recognizing text attribute features in the target text according to the probability; for details, reference may be made to the related description of step S50 in the above method embodiment, and details are not repeated herein.
The text attribute feature recognition device can accurately recognize the text attribute features in the target text and realize recognition of the meaning of the text content. Compared with the prior art, in addition to recognizing content such as the word segmentation and parts of speech of the text, it can also accurately recognize the characters, words and phrases that represent text attribute features such as the function and effect described in the text, can mine the deeper meaning of the text, can enrich the content of text recognition, and can provide more comprehensive data and content support for subsequent processes such as analysis based on the recognized text content.
An embodiment of the present invention further provides a device for classifying text attribute features, as shown in fig. 15, the device for classifying text attribute features mainly includes:
a text recognition module 151, configured to recognize a text attribute feature in a target text according to a sentence in the target text; for details, reference may be made to the related description of step S101 in the above method embodiment, which is not described herein again;
a text attribute feature set construction module 152, configured to construct a text attribute feature set according to the recognition result of the target text; for details, reference may be made to the related description of step S102 in the above method embodiment, which is not described herein again;
the structure tree building module 153 is configured to build a structure tree according to the text attribute feature set; for details, reference may be made to the related description of step S103 in the above method embodiment, which is not described herein again;
a class center set constructing module 154, configured to determine class center points of each leaf node in the structure tree, respectively, and form a class center set from a plurality of class center points; for details, reference may be made to the related description of step S104 in the above method embodiment, which is not described herein again;
a vector generating module 155, configured to generate class center vectors according to the class center points, and generate leaf node vectors according to leaf nodes of the structure tree; for details, reference may be made to the related description of step S105 in the above method embodiment, which is not described herein again;
a classification node determining module 156, configured to determine the classification node of each leaf node according to the class center vectors and the leaf node vectors; for details, reference may be made to the related description of step S106 in the above method embodiment, and details are not repeated herein.
The text attribute feature classification device of the embodiment of the invention classifies functional descriptions, so that text contents that differ only in expression form are grouped together, descriptions of functions and efficacies with different substantive content are separated, and the corresponding sets of efficacy phrases are induced. At the same time, functional descriptions with the same essential content are prevented from being split into separate categories, which would increase the complexity of the classification result.
An embodiment of the present invention further provides a device for classifying text attribute features, as shown in fig. 16, the device for classifying text attribute features mainly includes:
the text recognition module 161 is configured to obtain a target text and recognize an efficacy phrase of the target text; in the embodiment of the present invention, the classification method for text attribute features is based on the premise that the text attribute features of the sentences of the target text are identified, and thus classification is performed based on the text attribute features. Optionally, in some embodiments of the present invention, the marked words, phrases, and the like may be identified by indexing, highlighting, or the like, so as to obtain corresponding text attribute features. In some embodiments of the present invention, the target text may also be identified by the text attribute feature identification method described in any of the above embodiments, so as to obtain an identification result of the text attribute feature.
And the text classification module 162 is configured to generate an engineering parameter classification result corresponding to the efficacy phrase according to the text attribute feature and a preset engineering parameter classification model. The preset engineering parameter classification model is a neural network model constructed by pre-training according to the classification of the engineering parameters of the TRIZ theory, so that based on the trained classification model, a result corresponding to the engineering parameter classification can be obtained based on the result of the text attribute feature.
Optionally, in some embodiments of the present invention, the engineering parameter classification model may be trained and constructed by a device for constructing a classification model of text attribute features as shown in fig. 17, where the device for constructing a classification model of text attribute features mainly includes:
the sample obtaining module 171 is configured to obtain an engineering parameter corpus sample and a text attribute feature sample, wherein the engineering parameter corpus sample comprises an engineering parameter text attribute feature and a text attribute feature type mark. For example, for "add weight, 1", "add weight" is the engineering parameter text attribute feature and the corresponding type mark 1 indicates that it belongs to the engineering parameter text attribute features; a mark of 0 would indicate that the phrase does not belong to them. In an alternative embodiment, the text attribute feature sample described herein may be generated by the text attribute feature recognition of any of the above embodiments;
and the vector generation module 172 is configured to generate an engineering parameter input vector according to the text attribute feature sample, and generate an engineering parameter output vector according to the engineering parameter corpus sample. The input vector and the output vector required for training the model are constructed based on the samples.
The model building module 173 is configured to train the SVM model according to the engineering parameter input vector and the engineering parameter output vector, and build an engineering parameter classification model. In the embodiment of the invention, an SVM algorithm can be used for classifying each text attribute feature phrase into an engineering parameter text attribute feature, and a Gaussian kernel is used for training a neural network model so as to construct a required engineering parameter classification model.
An embodiment of the present invention further provides a text structure analysis device, as shown in fig. 18, the text structure analysis device mainly includes:
a phrase set generating module 181, configured to identify phrases in paragraphs of the target text, and form a phrase set according to the paragraphs; for details, reference may be made to the related description of step S131 in the above method embodiment, which is not described herein again;
a structured set generation module 182, configured to perform structured parsing on each phrase set to generate a structured set; for details, reference may be made to the related description of step S132 in the above method embodiment, which is not described herein again;
a text recognition module 183 for recognizing text attribute features in paragraphs of the structured collection; for details, reference may be made to the related description of step S133 of the above method embodiment, which is not described herein again;
the text relationship building module 184 is configured to build an association relationship between the structured collection and the phrase collection according to the identified text attribute features; for details, reference may be made to the related description of step S134 of the above method embodiment, and details are not repeated herein.
Through the above process, a structured technical scheme can be recommended on the basis of the analysis of text attribute features, or the text attribute feature content of a structured technical scheme can be predicted. Certain limiting conditions can also be combined to give more accurate results. For example, combining an efficacy with a technical field or keywords yields a targeted technical solution; or, given a structured technical solution and its key components, a more accurate prediction of the efficacy can be given (multiple prediction results can be returned, with probabilities expressed as percentages).
An embodiment of the present invention further provides a computer device, as shown in fig. 19, the computer device may include a processor 191 and a memory 192, where the processor 191 and the memory 192 may be connected by a bus or in another manner, and fig. 19 takes the example of being connected by a bus.
The processor 191 may be a Central Processing Unit (CPU). The Processor 191 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or a combination thereof.
The memory 192 is a non-transitory computer readable storage medium, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for identifying text attribute features, or the method for classifying text attribute features, or the method for constructing a classification model of text attribute features in the embodiments of the present invention. The processor 191 executes the non-transitory software programs, instructions and modules stored in the memory 192 to execute various functional applications and data processing of the processor, that is, to implement the identification method of the text attribute feature, or the classification method of the text attribute feature, or the construction method of the classification model of the text attribute feature in the above-described method embodiments.
The memory 192 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 191, and the like. Further, the memory 192 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 192 may optionally include memory located remotely from the processor 191, and such remote memory may be coupled to the processor 191 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 192 and, when executed by the processor 191, perform a method of identifying text attribute features, or a method of classifying text attribute features, or a method of constructing a classification model of text attribute features as in the embodiments shown in fig. 1-13.
The details of the computer device can be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to 13, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (21)

1. A text attribute feature recognition method is characterized by comprising the following steps:
generating a grammar structure according to the sentences in the target text;
generating a data structure according to the node relation in the grammar structure;
generating a first input vector from the data structure;
determining the probability that each statement contains an attribute feature text according to the first input vector and a preset text attribute feature classification model;
identifying text attribute features in the target text according to the probability;
determining the probability that each sentence contains an attribute feature text according to the first input vector and a preset text attribute feature classification model, wherein the probability at least comprises one of the following steps:
determining a first probability that each statement contains attribute features according to the first input vector and a preset text attribute feature classification model;
determining a second probability that the phrase in each sentence contains the attribute features according to the first input vector and a preset text attribute feature classification model;
and determining a third probability that each paragraph in each sentence contains attribute features according to the first input vector and a preset text attribute feature classification model.
2. The method for recognizing text attribute features according to claim 1, wherein the generating a grammar structure according to sentences in the target text comprises:
respectively identifying words of each sentence in the target text, and constructing a word directed graph;
calculating the shortest path from the first node to the last node in the word directed graph as the word segmentation result of each statement;
constructing a word sequence according to the word segmentation result of each sentence;
generating an input vector according to adjacent words in the word sequence;
obtaining an output vector according to a preset neural network model and the input vector;
calculating the cosine value of the included angle between the input vector and the output vector;
constructing a combination node by two adjacent words with the largest included angle cosine value until a root node of the word sequence is generated;
and determining a syntactic structure of the word sequence according to the combination node and the root node.
3. The method for recognizing the text attribute features according to claim 1, wherein the preset text attribute feature classification model is an efficacy sentence classification model, and the efficacy sentence classification model is constructed by the following processes:
acquiring an efficacy statement sample, wherein the efficacy statement sample comprises a preset efficacy mark and a sample statement;
generating a first grammar structure according to the efficacy statement sample;
generating a first feature list according to the node relation in the first syntactic structure;
generating a first classification input vector according to the feature list, and generating a first classification output vector according to the preset efficacy mark and the sample statement;
and training a preset classification model according to the first classification input vector and the first classification output vector to generate the efficacy sentence classification model.
4. The method for recognizing the text attribute features according to claim 1, wherein the preset text attribute feature classification model is an efficacy phrase classification model, and the efficacy phrase classification model is constructed by the following processes:
acquiring an efficacy phrase sample, wherein the efficacy phrase sample comprises preset efficacy marks and sample phrases;
generating a second grammar structure according to the efficacy phrase sample;
generating a second feature list according to the node relation in the second syntactic structure;
generating a second data structure according to a preset efficacy phrase;
generating a second classification input vector according to the second feature list and a second data structure, and generating a second classification output vector according to the preset efficacy mark and the sample phrase;
and training a preset classification model according to the second classification input vector and the second classification output vector to generate the preset efficacy phrase classification model.
5. A method for classifying text attribute features is characterized by comprising the following steps:
identifying text attribute features in a target text according to sentences in the target text by the identification method of text attribute features as claimed in any one of claims 1-4;
constructing a text attribute feature set according to the recognition result of the target text;
constructing a structure tree according to the text attribute feature set;
respectively determining the class center points of all leaf nodes in the structure tree, and forming a class center set by a plurality of class center points;
generating class center vectors according to the class center points, and generating leaf node vectors according to leaf nodes of the structure tree;
and determining the classification node of each leaf node according to the class center vectors and the leaf node vectors.
6. The method for classifying text attribute features according to claim 5, wherein the determining the class center point of each leaf node in the structure tree respectively comprises:
respectively taking one leaf node in the structure tree as a central node, and calculating the average distance from the leaf node in the structure tree to the central node;
determining the leaf node with the farthest average distance as the class center point.
7. The method of classifying text attribute features according to claim 5, wherein the determining the classification node of each leaf node according to the class center vectors and the leaf node vectors comprises:
calculating Euclidean distance from the leaf node vector to the class center vector;
and for each leaf node, taking the class center point corresponding to the class center vector with the minimum Euclidean distance as the classification node of that leaf node.
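A minimal sketch of the clustering step of claims 5-7 follows, assuming the leaf nodes of the structure tree have already been turned into vectors. Treating "the leaf node with the largest average distance" as a top-k selection with a caller-chosen k is an assumption of this sketch; the claims do not fix how many class center points are kept.

```python
import numpy as np

def pick_class_centers(leaf_vectors, k):
    """Claim 6 (sketch): take each leaf node in turn as the central node, compute
    the average distance from the leaf nodes to it, and keep the k nodes with the
    largest average distance as class center points."""
    leaf_vectors = np.asarray(leaf_vectors, dtype=float)
    dists = np.linalg.norm(leaf_vectors[:, None, :] - leaf_vectors[None, :, :], axis=-1)
    avg = dists.mean(axis=1)
    return leaf_vectors[np.argsort(avg)[-k:]]

def assign_to_centers(leaf_vectors, centers):
    """Claim 7 (sketch): each leaf node is assigned the class center point whose
    class center vector lies at the minimum Euclidean distance."""
    leaf_vectors = np.asarray(leaf_vectors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d = np.linalg.norm(leaf_vectors[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)  # index of the classification node for each leaf node
```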
8. A method for constructing a classification model of text attribute features is characterized by comprising the following steps:
acquiring an engineering parameter corpus sample and a text attribute feature sample, wherein the engineering parameter corpus sample comprises engineering parameter text attribute features and text attribute feature type marks, and the text attribute feature sample is generated by identification according to the identification method of the text attribute features in any one of claims 1 to 4;
generating an engineering parameter input vector according to the text attribute feature sample, and generating an engineering parameter output vector according to the engineering parameter corpus sample;
and training the SVM model according to the engineering parameter input vector and the engineering parameter output vector to construct an engineering parameter classification model.
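A minimal sketch of the model construction of claim 8 follows. Representing the text attribute feature samples with TF-IDF vectors is an assumption made only so the example runs end to end; the claim itself only requires an engineering parameter input vector, an engineering parameter output vector, and an SVM model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def build_engineering_parameter_model(feature_texts, type_marks):
    """feature_texts: text attribute features recognized from the corpus sample;
    type_marks: the corresponding engineering parameter type marks."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(feature_texts)        # engineering parameter input vectors
    model = SVC(probability=True).fit(X, type_marks)   # train the SVM model
    return vectorizer, model

# Usage (claims 9-10): classify a newly recognized text attribute feature.
# vectorizer, model = build_engineering_parameter_model(train_texts, train_marks)
# label = model.predict(vectorizer.transform(["some recognized text attribute feature"]))[0]
```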
9. A method for classifying text attribute features is characterized by comprising the following steps:
acquiring a target text, and identifying the text attribute feature of the target text by the text attribute feature identification method according to any one of claims 1 to 4;
and generating an engineering parameter classification result corresponding to the text attribute feature according to the text attribute feature and a preset engineering parameter classification model.
10. The method for classifying text attribute features according to claim 9, wherein the engineering parameter classification model is constructed according to the method for constructing the classification model of text attribute features according to claim 8.
11. A text structure analysis method is characterized by comprising the following steps:
identifying phrases in each paragraph of the target text, and respectively forming a phrase set according to each paragraph;
carrying out structured analysis on each phrase set to generate a structured set;
identifying text attribute features in paragraphs of the structured set by the identification method of text attribute features according to any one of claims 1-4;
and establishing the association relation between the structured set and the phrase set according to the identified text attribute features.
12. The text structure analysis method according to claim 11,
and if the paragraphs of the structured set do not contain text attribute features, selecting a first phrase set which is closest to the current phrase and located after the current phrase, and establishing an association relation with the structured set.
13. The text structure analysis method according to claim 12,
and if the first phrase set is not found, selecting a second phrase set which is closest to the current phrase and located before the current phrase, and establishing an association relation with the structured set.
14. The text structure analysis method according to claim 13,
and if neither the first phrase set nor the second phrase set exists, ignoring the current phrase.
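A minimal sketch of the fallback rules of claims 11-14 follows. Representing the structured set as (paragraph position, has feature) pairs and the phrase sets by their paragraph positions is purely an assumption used to keep the example short; the claims do not prescribe these data structures.

```python
def associate(structured_items, phrase_positions):
    """structured_items: list of (position, has_feature) pairs for the structured set;
    phrase_positions: paragraph positions that have a phrase set.
    Returns a mapping from structured-set position to associated phrase-set position."""
    links = {}
    for pos, has_feature in structured_items:
        if has_feature and pos in phrase_positions:
            links[pos] = pos          # claim 11: associate via the recognized feature
            continue
        after = [p for p in phrase_positions if p > pos]
        if after:                     # claim 12: nearest phrase set after the current phrase
            links[pos] = min(after)
            continue
        before = [p for p in phrase_positions if p < pos]
        if before:                    # claim 13: nearest phrase set before the current phrase
            links[pos] = max(before)
            continue
        # claim 14: neither exists, so the current phrase is ignored
    return links
```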
15. An apparatus for recognizing text attribute features, comprising:
the grammar structure generating module is used for generating grammar structures according to sentences in the target text;
the data structure generating module is used for generating a data structure according to the node relation in the grammar structure;
a first input vector generation module for generating a first input vector according to the data structure;
the probability determination module is used for determining the probability that each sentence contains an attribute feature text according to the first input vector and a preset text attribute feature classification model;
wherein the determining the probability that each sentence contains an attribute feature text according to the first input vector and the preset text attribute feature classification model comprises at least one of the following:
determining a first probability that each sentence contains attribute features according to the first input vector and a preset text attribute feature classification model;
determining a second probability that the phrase in each sentence contains the attribute features according to the first input vector and a preset text attribute feature classification model;
determining a third probability that each paragraph in each sentence contains attribute features according to the first input vector and a preset text attribute feature classification model;
and the text recognition module is used for recognizing the text attribute characteristics in the target text according to the probability.
16. A classification device for text attribute features is characterized by comprising:
a text recognition module, configured to recognize a text attribute feature in a target text according to a sentence in the target text by the recognition method of the text attribute feature according to any one of claims 1 to 4;
the text attribute feature set construction module is used for constructing a text attribute feature set according to the recognition result of the target text;
the structure tree construction module is used for constructing a structure tree according to the text attribute feature set;
the class center set building module is used for respectively determining class center points of all leaf nodes in the structure tree and forming a class center set by a plurality of class center points;
the vector generation module is used for generating class center vectors according to the class center points and generating leaf node vectors according to leaf nodes of the structure tree;
and the classification node determining module is used for determining the classification node of each leaf node according to the class center vectors and the leaf node vectors.
17. A device for constructing a classification model of text attribute features is characterized by comprising:
a sample obtaining module, configured to obtain an engineering parameter corpus sample and a text attribute feature sample, where the engineering parameter corpus sample includes an engineering parameter text attribute feature and a text attribute feature type flag, and the text attribute feature sample is generated by identification according to the text attribute feature identification method according to any one of claims 1 to 4;
the vector generation module is used for generating an engineering parameter input vector according to the text attribute feature sample and generating an engineering parameter output vector according to the engineering parameter corpus sample;
and the model construction module is used for training the SVM model according to the engineering parameter input vector and the engineering parameter output vector to construct an engineering parameter classification model.
18. A classification device for text attribute features is characterized by comprising:
a text recognition module for acquiring a target text and recognizing the text attribute feature of the target text by the text attribute feature recognition method according to any one of claims 1 to 4;
and the text classification module is used for generating an engineering parameter classification result corresponding to the text attribute feature according to the text attribute feature and a preset engineering parameter classification model.
19. A text structure analysis apparatus, comprising:
the phrase set generating module is used for identifying phrases in each paragraph of the target text and respectively forming a phrase set according to each paragraph;
the structured set generation module is used for carrying out structured analysis on each phrase set to generate a structured set;
a text recognition module, configured to recognize text attribute features in paragraphs of the structured set according to the recognition method for text attribute features of any one of claims 1 to 4;
and the text relation building module is used for establishing the association relation between the structured set and the phrase set according to the identified text attribute features.
20. A computer device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method for identifying text attribute features according to any one of claims 1 to 4, or to perform the method for classifying text attribute features according to any one of claims 5 to 7, or to perform the method for constructing a classification model of text attribute features according to claim 8, or to perform the method for classifying text attribute features according to claim 9 or 10, or to perform the method for analyzing a text structure according to any one of claims 11 to 14.
21. A computer-readable storage medium storing computer instructions for causing a computer to execute the method for identifying text attribute features according to any one of claims 1 to 4, or the method for classifying text attribute features according to any one of claims 5 to 7, or the method for constructing a classification model of text attribute features according to claim 8, or the method for classifying text attribute features according to claim 9, or the method for analyzing a text structure according to claim 11.
CN202010992100.2A 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device Active CN111930953B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011632896.7A CN112632286A (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device
CN202010992100.2A CN111930953B (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010992100.2A CN111930953B (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202011632896.7A Division CN112632286A (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device

Publications (2)

Publication Number Publication Date
CN111930953A (en) 2020-11-13
CN111930953B (en) 2021-02-02

Family

ID=73335257

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011632896.7A Pending CN112632286A (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device
CN202010992100.2A Active CN111930953B (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202011632896.7A Pending CN112632286A (en) 2020-09-21 2020-09-21 Text attribute feature identification, classification and structure analysis method and device

Country Status (1)

Country Link
CN (2) CN112632286A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723096A (en) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 Text recognition method and device, computer-readable storage medium and electronic equipment
CN113361275A (en) * 2021-08-10 2021-09-07 北京优幕科技有限责任公司 Speech draft logic structure evaluation method and device
CN116341521B (en) * 2023-05-22 2023-07-28 环球数科集团有限公司 AIGC article identification system based on text features

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7822608B2 (en) * 2007-02-27 2010-10-26 Nuance Communications, Inc. Disambiguating a speech recognition grammar in a multimodal application
CN110852095A (en) * 2018-08-02 2020-02-28 中国银联股份有限公司 Statement hot spot extraction method and system

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5878386A (en) * 1996-06-28 1999-03-02 Microsoft Corporation Natural language parser with dictionary-based part-of-speech probabilities
JP5346841B2 (en) * 2010-02-22 2013-11-20 株式会社野村総合研究所 Document classification system, document classification program, and document classification method
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
CN106776548B (en) * 2016-12-06 2019-12-13 上海智臻智能网络科技股份有限公司 Text similarity calculation method and device
WO2018208979A1 (en) * 2017-05-10 2018-11-15 Oracle International Corporation Enabling rhetorical analysis via the use of communicative discourse trees
US10853724B2 (en) * 2017-06-02 2020-12-01 Xerox Corporation Symbolic priors for recurrent neural network based semantic parsing
CN108073569B (en) * 2017-06-21 2021-08-27 北京华宇元典信息服务有限公司 Law cognition method, device and medium based on multi-level multi-dimensional semantic understanding
US20190236135A1 (en) * 2018-01-30 2019-08-01 Accenture Global Solutions Limited Cross-lingual text classification
CN109033078B (en) * 2018-07-03 2019-10-25 龙马智芯(珠海横琴)科技有限公司 The recognition methods of sentence classification and device, storage medium, processor
CN109299252A (en) * 2018-08-17 2019-02-01 北京奇虎科技有限公司 The viewpoint polarity classification method and device of stock comment based on machine learning
CN109522555A (en) * 2018-11-16 2019-03-26 中国民航大学 A kind of land sky call based on BiLSTM is rehearsed semantic automatic Verification method
CN111444334B (en) * 2019-01-16 2023-04-25 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111523315B (en) * 2019-01-16 2023-04-18 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN109902159A (en) * 2019-01-29 2019-06-18 华融融通(北京)科技有限公司 A kind of intelligent O&M statement similarity matching process based on natural language processing
CN110008323B (en) * 2019-03-27 2021-04-23 北京百分点科技集团股份有限公司 Problem equivalence judgment method combining semi-supervised learning and ensemble learning
CN110321563B (en) * 2019-06-28 2021-05-11 浙江大学 Text emotion analysis method based on hybrid supervision model
CN110515838A (en) * 2019-07-31 2019-11-29 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Method and system for detecting software defects based on topic model
CN110516236B (en) * 2019-08-09 2022-10-28 安徽工程大学 Social short text fine-grained emotion acquisition method
CN111414476A (en) * 2020-03-06 2020-07-14 哈尔滨工业大学 Attribute-level emotion analysis method based on multi-task learning

Also Published As

Publication number Publication date
CN111930953A (en) 2020-11-13
CN112632286A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
US10990767B1 (en) Applied artificial intelligence technology for adaptive natural language understanding
CN110399457B (en) Intelligent question answering method and system
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN111353030B (en) Knowledge question and answer retrieval method and device based on knowledge graph in travel field
CN111930953B (en) Text attribute feature identification, classification and structure analysis method and device
CN110222160B (en) Intelligent semantic document recommendation method and device and computer readable storage medium
CN110968699B (en) Logic map construction and early warning method and device based on fact recommendation
TWI662425B (en) A method of automatically generating semantic similar sentence samples
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN112800170A (en) Question matching method and device and question reply method and device
CN109614620B (en) HowNet-based graph model word sense disambiguation method and system
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN108038099B (en) Low-frequency keyword identification method based on word clustering
US20200089756A1 (en) Preserving and processing ambiguity in natural language
CN110309504B (en) Text processing method, device, equipment and storage medium based on word segmentation
CN113505209A (en) Intelligent question-answering system for automobile field
CN112883165B (en) Intelligent full-text retrieval method and system based on semantic understanding
CN113821605A (en) Event extraction method
CN114997288A (en) Design resource association method
CN113157887A (en) Knowledge question-answering intention identification method and device and computer equipment
CN106407332B (en) Search method and device based on artificial intelligence
CN110750632B (en) Improved Chinese ALICE intelligent question-answering method and system
CN112307364A (en) Character representation-oriented news text place extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant