CN110543630B - Method and device for generating text structured representation and computer storage medium - Google Patents


Publication number
CN110543630B
Authority
CN
China
Prior art keywords
composite structure
text data
word
layer
structured
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910772846.XA
Other languages
Chinese (zh)
Other versions
CN110543630A (en)
Inventor
杨光信
杨宏超
徐凯旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chengrui Technology Co ltd
Original Assignee
Beijing Chengrui Technology Co ltd
Application filed by Beijing Chengrui Technology Co ltd
Priority to CN201910772846.XA
Publication of CN110543630A
Application granted
Publication of CN110543630B

Abstract

A method and a device for generating a text structured representation, a computer storage medium, and an electronic device are provided. The method comprises the following steps: determining text data to be structured; and recognizing, by using a pre-constructed text structured recognition model, the structure of the text data and the role of each element of the text data in its previous-layer structure, to obtain a structured representation result; wherein the structured representation result comprises: a composite structure type labeled on each composite structure included in the text data, and, labeled on each element in the text data, the role of that element in the previous-layer composite structure in which it is located. With this scheme, the internal structure of each element of the text data and the role of each element in its previous-layer composite structure can be represented, which provides a foundation for subsequent data applications.

Description

Method and device for generating text structured representation and computer storage medium
Technical Field
The present application relates to data processing technologies, and in particular, to a method and an apparatus for generating a text structured representation, a computer storage medium, and an electronic device.
Background
Text data presented in natural language form is one of the main sources of data in the information age. It typically lacks a fixed structure. As a result, the useful information contained in free-form text data is difficult to extract automatically, comprehensively, and accurately, which presents a number of obstacles to deep-level applications based on text data.
At present, dependency parsing (DP) is generally used for structural analysis of text. This technology takes the sentence as the basic unit and, on the basis of recognizing the grammatical components in the sentence, analyzes the dependency relationship between every two components, such as subject-predicate, verb-object, and attributive-head relationships, to reveal the syntactic structure of the text.
For most natural languages, there are typically several tens of different types of dependency relationships. In addition, since a dependency relationship essentially describes the relationship between two text elements, these numerous relationships must undergo complex interpretation before the overall or local structure of the text, and the role of these structures in the sentence, can be understood and reconstructed. This complexity is a considerable hurdle to exploiting dependency-based structured results.
The above information disclosed in this background section is only for enhancement of understanding of the background of the application and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The embodiments of the application provide a method and a device for generating a text structured representation, a computer storage medium, and an electronic device, so as to solve the above problems.
According to a first aspect of the embodiments of the present application, there is provided a method for generating a text structured representation, including the following steps:
determining text data to be structured;
recognizing, by using a pre-constructed text structured recognition model, the structure of the text data and the role of each element of the text data in its previous-layer structure, to obtain a structured representation result;
wherein the structured representation result comprises: a composite structure type labeled on each composite structure included in the text data, and, labeled on each element in the text data, the role of that element in the previous-layer composite structure in which it is located.
According to a second aspect of the embodiments of the present application, there is provided an apparatus for generating a text structured representation, including:
the text determining module is used for determining text data to be structured;
the structured processing module is used for recognizing, by using a pre-constructed text structured recognition model, the structure of the text data and the role of each element of the text data in its previous-layer structure, to obtain a structured representation result;
wherein the structured representation result comprises: a composite structure type labeled on each composite structure included in the text data, and, labeled on each element in the text data, the role of that element in the previous-layer composite structure in which it is located.
According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.
According to a fourth aspect of embodiments herein, there is provided an electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method as described above.
According to the method and device for generating a text structured representation, the computer storage medium, and the electronic device provided by the embodiments of the present application, the text data to be processed can be structurally represented by using the pre-constructed structured recognition model, generating a structured data result labeled with the composite structure type of each composite structure of the text data and the role of each element of the text data in its upper-layer composite structure. Unstructured text data is thereby converted into structured data, and because the generated structured data result includes both the composite structures of the text data and the role of each element in its composite structure, the text data can be used more conveniently in subsequent deep-level applications.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart illustrating an implementation of a method for generating a structured representation of text in accordance with an embodiment of the present application;
FIG. 2 is a schematic structural diagram illustrating a structured representation of text data according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a device for generating a text structured representation in the second embodiment of the present application;
FIG. 4 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application;
FIG. 5 is a diagram illustrating a process of generating a structured representation of text in accordance with an embodiment of the present application;
FIG. 6 shows a schematic diagram of tree storage of a structured representation result in the fifth embodiment of the present application.
Detailed Description
In view of the problems in the prior art, embodiments of the present application provide a method and an apparatus for generating a text structured representation, a computer storage medium, and an electronic device, which are used to represent an internal structure of a text element and its role in a sentence.
The scheme in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only some of the embodiments of the present application, not an exhaustive list of all embodiments. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other where no conflict arises.
Example one
Fig. 1 shows a flowchart of an implementation of a method for generating a text structured representation in an embodiment of the present application.
As shown in the figure, the method for generating the text structured representation includes:
step 101, determining text data to be structured;
step 102, recognizing, by using a pre-constructed text structured recognition model, the structure of the text data and the role of each element of the text data in its previous-layer structure, to obtain a structured representation result;
wherein the structured representation result comprises: a composite structure type labeled on each composite structure included in the text data, and, labeled on each element in the text data, the role of that element in the previous-layer composite structure in which it is located.
In specific implementation, the text data to be structured may be text in a language such as Chinese or English; the present application does not limit the language of the text data.
The structured data result identifies the composite structure type of each composite structure of the text data and the role of each element of the text data in its previous-layer composite structure, wherein,
the composite structure may include:
a centering phrase, a phrase in a form, a major-minor phrase, an interjacent phrase, a minor phrase, a parallel phrase, a major-minor phrase, a clause, etc.
The role of each element in the text data in the composite structure may be represented by a sentence component, and the sentence component may include:
subjects, predicates, objects, attributives, adverbials, complements, and the like.
The text data is structurally represented by using the pre-constructed structured recognition model to obtain a structured data result. For example, if the text data to be processed is "I am a student from Chenguang School, I will go to Beijing to take part in an English speech contest in July.", the result after structured processing of the text data is "[ I/subject am/predicate [ a student from Chenguang School ]/medium phrase/object ]/clause ,/punctuation [ I/subject [ will go to ]/predicate Beijing/object [ to take part in an English speech contest ]/medium phrase/complement [ in July ]/medium phrase/object ]/clause ./punctuation".
The method for generating a text structured representation provided by the embodiment of the application can use a pre-constructed structured recognition model to structurally represent the text data to be processed, generating a structured data result labeled with the composite structure type of each composite structure of the text data and the role of each element of the text data in its previous-layer composite structure. Unstructured text data is thereby converted into structured data, and because the generated structured data result includes both the composite structures of the text data and the role of each element in its composite structure, the text data can be used more conveniently in subsequent deep-level applications.
In one embodiment, each element in the text data is identified as a word, and the structured representation further includes:
and marking the part of speech of each word in the text data.
In specific implementation, the parts of speech of each word in the text data may be labeled, then the composite structure type of the composite structure of the text data may be labeled, and then the role of each element in the text data in the composite structure may be labeled.
In general, a word may have different parts of speech in different contexts, and the part of speech of a word should be determined comprehensively from its ability to combine with other words, its ability to serve as a sentence component, and its form. The embodiment of the application may perform part-of-speech tagging with an existing part-of-speech tagging method (such as a rule-based method, a statistics-based method, or a hybrid method), or may construct a part-of-speech tagging model that takes, from a corpus, the most frequent part of speech of each word as that word's default part of speech.
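The default-part-of-speech idea above can be sketched as follows. This is a minimal illustration, not the patented model: the function names `build_default_pos` and `tag`, the `fallback` parameter, and the toy corpus are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def build_default_pos(tagged_corpus):
    """Map each word to the part of speech it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, pos in sentence:
            counts[word][pos] += 1
    # most_common(1) yields [(pos, count)] for the most frequent part of speech
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, default_pos, fallback="noun"):
    """Tag each word with its default part of speech; unseen words get `fallback`."""
    return [(w, default_pos.get(w, fallback)) for w in words]

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    [("I", "pronoun"), ("am", "verb"), ("a", "article"), ("student", "noun")],
    [("I", "pronoun"), ("study", "verb")],
    [("a", "article"), ("study", "noun")],
    [("they", "pronoun"), ("study", "verb")],
]
default_pos = build_default_pos(corpus)
tagged = tag(["I", "study", "Beijing"], default_pos)
```

Here "study" occurs twice as a verb and once as a noun, so it defaults to verb. A context-sensitive tagger would be needed to recover the noun reading.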
After determining the text data to be processed, the embodiment of the application may use a regular expression to judge whether the characters are Chinese characters. If they are, a directed acyclic graph is constructed according to a prefix dictionary, the maximum-probability path is calculated over the directed acyclic graph, and the part of speech of each word segmented out along that path is looked up in the prefix dictionary. If they are not Chinese characters, type judgment may continue with regular expressions to distinguish numbers from English.
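A minimal sketch of this flow is given below, assuming a tiny hand-made lexicon. The lexicon entries, their frequencies, the function names, and the regular expression are all hypothetical; a real system would load a large dictionary and might smooth probabilities differently.

```python
import math
import re

# Hypothetical toy lexicon: word -> (frequency, part of speech). Values are invented.
LEXICON = {
    "研究": (50, "verb"),
    "研究生": (20, "noun"),
    "生命": (40, "noun"),
    "的": (100, "auxiliary"),
    "起源": (30, "noun"),
}

def build_prefix_dict(lexicon):
    """Store every word plus all of its proper prefixes (prefixes get frequency 0)."""
    prefix = {}
    for word, entry in lexicon.items():
        for k in range(1, len(word)):
            prefix.setdefault(word[:k], (0, None))
        prefix[word] = entry
    return prefix

def build_dag(sentence, prefix):
    """dag[i] lists every end index j such that sentence[i:j+1] is a known word."""
    dag, n = {}, len(sentence)
    for i in range(n):
        ends, j, frag = [], i, sentence[i]
        while j < n and frag in prefix:
            if prefix[frag][0] > 0:
                ends.append(j)
            j += 1
            frag = sentence[i:j + 1]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag

def segment_chinese(sentence, prefix, total):
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    n, dag, log_total = len(sentence), build_dag(sentence, prefix), math.log(total)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # right to left
        route[i] = max(
            (math.log(prefix.get(sentence[i:j + 1], (0, None))[0] or 1)
             - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        word = sentence[i:j + 1]
        # The part of speech is looked up in the same prefix dictionary.
        words.append((word, prefix.get(word, (0, None))[1] or "unknown"))
        i = j + 1
    return words

# Regular expression that separates Chinese runs, numbers, English words, and the rest.
TOKEN = re.compile(
    r"(?P<han>[\u4e00-\u9fff]+)|(?P<num>\d+(?:\.\d+)?)|(?P<eng>[A-Za-z]+)|(?P<other>\S)"
)

def tokenize(text, prefix, total):
    out = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "han":
            out.extend(segment_chinese(m.group(), prefix, total))
        else:
            kind = {"num": "numeral", "eng": "english", "other": "punctuation"}[m.lastgroup]
            out.append((m.group(), kind))
    return out

PREFIX = build_prefix_dict(LEXICON)
TOTAL = sum(freq for freq, _ in LEXICON.values())
result = tokenize("研究生命的起源2019AI。", PREFIX, TOTAL)
```

With these invented frequencies, the path 研究 / 生命 / 的 / 起源 scores higher than the path that starts with 研究生, because the latter is forced to treat 命 as an unknown one-character word.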
In specific implementation, the parts of speech may include:
content words: nouns, verbs, adjectives, numerals, quantifiers, pronouns, and the like;
function words: adverbs, prepositions, conjunctions, auxiliary words, interjections, onomatopoeic words, and the like.
Those skilled in the art may adopt other parts of speech according to actual needs, and the present application is not limited thereto.
For example, the text data to be structured is "I am a student from Chenguang School, I will go to Beijing to take part in an English speech contest in July."
First, the part of speech of each word in the text data is labeled as follows:
"I/pronoun am/verb a/indefinite article student/noun from/preposition Chenguang School/noun ,/punctuation I/pronoun will go to/verb Beijing/noun to/preposition take part in/verb an/indefinite article English speech contest/noun in/preposition July/noun ./punctuation"
Next, the composite structures of the text data are labeled as follows:
"[ I/pronoun am/verb [ a/indefinite article student/noun from/preposition Chenguang School/noun ]/medium phrase ]/clause , [ I/pronoun will go to/verb Beijing/noun [ to/preposition take part in/verb an/indefinite article English speech contest/noun ]/medium phrase [ in/preposition July/noun ]/medium phrase ]/clause ."
Then, the role of each element of the text data in its composite structure is labeled as follows:
"[ I/pronoun/subject am/verb/predicate [ a/indefinite article student/noun from/preposition Chenguang School/noun ]/medium phrase/object ]/clause ,/punctuation [ I/pronoun/subject [ will go to ]/verb/predicate Beijing/noun/object [ to/preposition take part in/verb an/indefinite article English speech contest/noun ]/medium phrase/complement [ in/preposition July/noun ]/medium phrase/object ]/clause ./punctuation"
In specific implementation, the part of speech, the composite structure, or the role of the element in the composite structure in the embodiment of the present application may be marked by a preset character, for example: the centering phrases in the composite structure may be identified by dz, nouns by n, subjects by sbj, and so on.
The skilled person can choose at will which character to use to mark the part of speech, the composite structure, or the role of the element in the composite structure, and this application is not limited to this.
In one embodiment, the identifying the structure of the text data and the role of each element of the text data in its respective previous-layer structure to obtain the structured representation result includes:
combining a plurality of words in the text data to obtain the previous-layer composite structure of the words;
and labeling the composite structure type of the previous-layer composite structure of the words, and labeling the role of each word in that previous-layer composite structure.
In specific implementation, the text data can be structured from the bottom up: word segmentation and phrase recognition are first performed on the text data to obtain a plurality of words; the words are then combined to obtain one or more composite structures, and association relationships are established among the composite structures; the composite structure type of each composite structure of the text data is then labeled, and the role of each word of the text data in the composite structure in which it is located is labeled.
Continuing with the above example, the text data to be structured is "I am a student from Chenguang School, I will go to Beijing to take part in an English speech contest in July."
First, the composite structures of the text data are labeled as follows:
"[ I am [ a student from Chenguang School ]/medium phrase ]/clause , [ I will go to Beijing [ to take part in an English speech contest ]/medium phrase [ in July ]/medium phrase ]/clause ."
Then, the role of each word of the text data in the composite structure in which it is located is labeled as follows:
"[ I/subject am/predicate [ a student from Chenguang School ]/medium phrase/object ]/clause ,/punctuation [ I/subject [ will go to ]/predicate Beijing/object [ to take part in an English speech contest ]/medium phrase/complement [ in July ]/medium phrase/object ]/clause ./punctuation"
In practice, composite structures may nest to form more complex composite structures.
In one embodiment, after obtaining the previous layer composite structure of the plurality of words, the method further includes:
combining the i-th layer composite structures and/or words to obtain the combined (i+1)-th layer composite structure, where i is an integer greater than or equal to 1 and the 1st layer composite structure is the previous-layer composite structure of the words;
and adding 1 to the value of i and repeating the previous step until no further composite structures can be combined, and labeling, for the composite structure of each layer obtained by combination, its composite structure type and the role of each lower-layer composite structure in its upper-layer composite structure.
In specific implementation, suppose the text data includes five words a, b, c, d, and e, where b and c can be combined into composite structure A, d and e can be combined into composite structure B, and a and composite structure A can be combined into composite structure AA. The embodiment of the present application then labels the composite structure type of each of the combined composite structures A, B, and AA, and labels the roles of words b and c in composite structure A, the roles of words d and e in composite structure B, and the roles of word a and composite structure A in composite structure AA.
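The layer-by-layer combination in this example can be sketched as below. The rule table, the placeholder role names "role-1" and "role-2", and the names `Node`, `combine_once`, and `build` are assumptions made for illustration; the source does not say how the recognition model decides which items combine.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                     # word text or composite structure type
    children: list = field(default_factory=list)   # (role, Node) pairs; empty for words
    layer: int = 0                                 # 0 for words, i for an i-th layer structure

# Hypothetical combination rules: child labels -> (composite type, role of each child).
RULES = {
    ("b", "c"): ("A", ("role-1", "role-2")),
    ("d", "e"): ("B", ("role-1", "role-2")),
    ("a", "A"): ("AA", ("role-1", "role-2")),
}

def combine_once(seq, layer):
    """One left-to-right pass: replace each matching run of items by a new composite."""
    out, i, changed = [], 0, False
    while i < len(seq):
        for pattern, (ctype, roles) in RULES.items():
            window = seq[i:i + len(pattern)]
            if tuple(item.label for item in window) == pattern:
                out.append(Node(ctype, list(zip(roles, window)), layer))
                i += len(pattern)
                changed = True
                break
        else:  # no rule matched at position i
            out.append(seq[i])
            i += 1
    return out, changed

def build(words):
    """Repeat the combination step, adding 1 to the layer, until nothing combines."""
    seq, layer = [Node(w) for w in words], 1
    while True:
        seq, changed = combine_once(seq, layer)
        if not changed:
            return seq
        layer += 1

top = build(["a", "b", "c", "d", "e"])
```

After the first pass the sequence is a, A, B (A and B in layer 1); the second pass merges a and A into AA (layer 2); the third pass finds nothing to combine and the loop stops.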
Considering that text in languages such as Chinese, Japanese, and Korean is written as a continuous sequence of characters rather than as separated words, the present application may also be implemented in the following manner.
In one embodiment, before combining the plurality of words in the text data, the method further comprises:
and performing word segmentation and phrase identification on the text data to obtain a plurality of words.
In specific implementation, after the text data to be structured is determined, the text data may be segmented to obtain one or more words; the composite structures obtained by combining multiple words are then marked, the composite structure type of each composite structure of the text data is labeled, and the role of each word of the text data in its composite structure is labeled.
In an embodiment, the identifying the structure of the text data and the role of each element of the text data in the respective sentence to obtain the structured representation result further includes:
identifying parts of speech of each word in the text data;
and labeling the part of speech of the word for each word.
For example, the text data to be structured is a Chinese sentence meaning "Perseverance is the cohesion of every grain of sand and every stone in a thousand-li levee."
Firstly, word segmentation is performed on the text data to obtain the following result:
[ perseverance, is, thousand-li, levee, one, sand, one, stone, of, cohesion, "." ]
Then, the part of speech of each word in the text data is labeled to obtain the following result:
[ perseverance/noun is/verb thousand-li/adjective levee/noun one/numeral sand/noun one/numeral stone/noun of/auxiliary word cohesion/noun ./punctuation ]
Next, the composite structure type of each composite structure of the text data is labeled to obtain the following result:
[ perseverance/noun is/verb [ [ thousand-li/adjective levee/noun ]/centering phrase [ one/numeral sand/noun one/numeral stone/noun ]/parallel phrase of/auxiliary word cohesion/noun ]/centering phrase . ]/sentence
And finally, the role of each word of the text data in its composite structure is labeled to obtain the following result:
[ perseverance/noun/subject is/verb/predicate [ [ thousand-li/adjective levee/noun ]/centering phrase [ one/numeral sand/noun one/numeral stone/noun ]/parallel phrase of/auxiliary word cohesion/noun ]/centering phrase/object ./punctuation ]/sentence
For another example, the text data to be structured is a Chinese sentence meaning "In the fields, the heavy ears of rice let the farmers enjoy the joy of the harvest."
Firstly, word segmentation is performed on the text data to obtain the following result:
[ in-the-fields, ",", heavy, ears-of-rice, let, farmers, enjoy, -ing, harvest, of, joy, "." ]
Then, the part of speech of each word in the text data is labeled to obtain the following result:
[ in-the-fields/adverb ,/punctuation heavy/adjective ears-of-rice/noun let/verb farmers/noun enjoy/verb -ing/auxiliary word harvest/noun of/auxiliary word joy/noun ./punctuation ]
Next, the composite structure type of each composite structure of the text data is labeled to obtain the following result:
[ in-the-fields/adverb , [ heavy/adjective ears-of-rice/noun ]/centering phrase let/verb farmers/noun [ enjoy/verb -ing/auxiliary word [ harvest/noun of/auxiliary word joy/noun ]/centering phrase ]/verb-object phrase . ]
Finally, the role of each element of the text data in its composite structure is labeled to obtain the following result:
[ in-the-fields/adverb ,/punctuation [ heavy/adjective ears-of-rice/noun ]/centering phrase/subject let/verb/predicate farmers/noun/object [ enjoy/verb -ing/auxiliary word [ harvest/noun of/auxiliary word joy/noun ]/centering phrase ]/verb-object phrase/complement ./punctuation ]
The word segmentation in the embodiment of the present application may refer to the process of recombining a continuous character sequence into a word sequence, and may include Chinese word segmentation, mixed Chinese-English word segmentation, and the like.
Specifically, word segmentation may be rule-based (dictionary-based), statistics-based, semantics-based, understanding-based, and so on.
In specific implementation, the text data may be segmented with a word segmentation algorithm from the prior art, for example: the maximum matching method, the hidden Markov model, the matrix constraint method, the integrated neural-network and expert-system segmentation method, and the like, which are not described in detail here.
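As one illustration of the prior-art algorithms named above, forward maximum matching can be sketched as follows. The function name, the `max_len` parameter, and the six-word dictionary are assumptions made here for demonstration only.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max_len or max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)  # single characters are always accepted
                i += size
                break
    return words

# Hypothetical toy dictionary, invented for illustration only.
dictionary = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
words = forward_max_match("南京市长江大桥", dictionary)
```

Because the method is greedy, it never backtracks; here it takes 南京市 first and then 长江大桥, which happens to be the intended reading.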
In the embodiment of the application, there is no required order among 1) labeling the part of speech of each word, 2) labeling the composite structure type of each composite structure, and 3) labeling the role of each word in its composite structure. For example, the labeling may proceed in the order of part of speech, composite structure type, and role of the word in the composite structure, or in the order of composite structure type, role of the word in the composite structure, and part of speech; the application does not limit the labeling order.
In one embodiment, the identifying the structure of the text data and the role of each element of the text data in its respective previous-layer structure to obtain the structured representation result includes:
identifying a plurality of composite structures of the text data;
each composite structure is labeled with a composite structure type, and the role of each term in each composite structure in the previous layer of composite structure on which it is located is labeled.
In specific implementation, the embodiment of the present application may be structured from top to bottom, that is, a composite structure of the text data is obtained by recognition, and then a plurality of words in the composite structure are recognized and the role of each word in the composite structure is determined, so as to complete labeling of the type of the composite structure and the role of the word in the composite structure.
In one embodiment, after the identifying obtains the plurality of composite structures of the text data, the method further includes:
layering the plurality of composite structures according to their inclusion relationships to obtain composite structures of multiple levels, wherein an upper-layer composite structure comprises lower-layer composite structures and/or words;
and labeling the role of each lower-layer composite structure in its upper-layer composite structure.
In specific implementation, suppose the composite structures obtained by recognition from the text data are composite structure AA, composite structure A, and composite structure B, where composite structure AA includes composite structure A and word a, and composite structure B includes words b and c. In this embodiment of the present application, composite structures AA, A, and B may be layered according to the inclusion relationship: the first layer consists of composite structure A and composite structure B, and the second layer consists of composite structure AA. The roles of composite structure A and word a in composite structure AA, the roles of words b and c in composite structure B, the roles of composite structure AA and composite structure B in the text data, and so on are then labeled.
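A minimal sketch of layering by inclusion follows. It assumes each recognized composite structure is given as a half-open range of word positions; that representation, the function names, and the words x and y placed inside composite structure A (whose contents the example does not specify) are all assumptions.

```python
def contains(outer, inner):
    """True if span `outer` strictly includes span `inner` (half-open word ranges)."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

def assign_layers(spans):
    """A structure sits one layer above the highest structure it includes."""
    layers = {}
    # Shorter spans first, so every included structure is layered before its container.
    for name, span in sorted(spans.items(), key=lambda kv: kv[1][1] - kv[1][0]):
        inner = [layers[n] for n, s in spans.items() if n in layers and contains(span, s)]
        layers[name] = 1 + max(inner, default=0)
    return layers

def direct_parent(name, spans, root="TEXT"):
    """The smallest structure that includes `name`, or the text data itself."""
    outers = [(s[1] - s[0], n) for n, s in spans.items() if contains(s, spans[name])]
    return min(outers)[1] if outers else root

# Word positions: a=0, x=1, y=2, b=3, c=4 (x and y are hypothetical contents of A).
spans = {"AA": (0, 3), "A": (1, 3), "B": (3, 5)}
layers = assign_layers(spans)
parents = {name: direct_parent(name, spans) for name in spans}
```

Once each structure knows its direct parent, the role of the lower-layer structure in the upper-layer one can be attached to that parent link.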
In the embodiment of the present application, the text data may be regarded as one of the composite structures, belonging to the composite structure of the highest layer, and the word may be regarded as an atomic structure.
In one embodiment, the identifying the structure of the text data and the role of each element of the text data in its respective previous-layer structure to obtain the structured representation result includes:
identifying a part-of-speech of each word in the text data;
each term is tagged with a part-of-speech of the term.
In one embodiment, the structured representation is specifically:
a connection relation is established between the node where each word in the text data is located and the node where the previous-layer composite structure to which the word belongs is located; the connection relation is labeled with the role of the word in the previous-layer composite structure to which it belongs; each word node comprises the word;
the text data comprises composite structures of layers 1 to N; a connection relation is established between the node where the i-th layer composite structure is located and the node where the (i+1)-th layer composite structure is located, where i is more than 1 and less than N; a connection relation is established between the node where the text data is located and the node where the N-th layer composite structure of the text data is located;
each node where a composite structure is located comprises the composite structure and its composite structure type; the connection relation between the node where the i-th layer composite structure is located and the node where the (i+1)-th layer composite structure is located is labeled with the role of the i-th layer composite structure in the (i+1)-th layer composite structure.
In specific implementation, the embodiment of the present application may represent the structured result as a tree. For example: the text data is taken as the root node, and the most complex composite structures of the text data are taken as child nodes of the root node; each complex composite structure is further divided into one or more simpler composite structures, which are taken as child nodes of that complex composite structure and connected layer by layer; the words in the simplest composite structures are taken as leaf nodes. The node where a composite structure is located is labeled with the composite structure and its composite structure type, and the leaf node where a word is located is labeled with the word.
The connection relationship between the lower node and the upper node is marked with the role of the lower node in the upper node, such as: the lower layer node is a word, the upper layer node is a composite structure, and the connection relation between the word and the composite structure is marked with the function of the word in the composite structure; for another example: the lower layer node is a composite structure A, the upper layer node is a composite structure AA, and the function of the composite structure A in the composite structure AA is marked on the connection relationship between the composite structure A and the composite structure AA; and if the upper-layer node of the composite structure AA is the text data, marking the role of the composite structure AA in the text data on the connection relationship between the composite structure AA and the node where the text data is located.
In one embodiment, the node where each word in the composite structure is located further includes a part of speech of the word.
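The tree form described above might be held in memory as sketched below. The class and field names are assumptions; the labels are taken from the first clause of the English example given earlier, and roles that the example does not state (the clause within the text data, the words within the phrase) are left as `None` rather than invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    text: str
    ctype: Optional[str] = None   # composite structure type; None for a word node
    pos: Optional[str] = None     # part of speech; used only on word nodes
    edges: List[Tuple[Optional[str], "Node"]] = field(default_factory=list)

    def add(self, role, child):
        """Connect `child` below this node; the edge carries the child's role here."""
        self.edges.append((role, child))
        return child

    def is_word(self):
        return self.ctype is None

def words_with_roles(node):
    """Depth-first list of (word, part of speech, role in its parent) for all leaves."""
    out = []
    for role, child in node.edges:
        if child.is_word():
            out.append((child.text, child.pos, role))
        else:
            out.extend(words_with_roles(child))
    return out

sentence = "I am a student from Chenguang School"
root = Node(sentence, ctype="text data")              # the text data is the root node
clause = root.add(None, Node(sentence, ctype="clause"))
clause.add("subject", Node("I", pos="pronoun"))
clause.add("predicate", Node("am", pos="verb"))
phrase = clause.add(
    "object", Node("a student from Chenguang School", ctype="medium phrase")
)
for word, pos in [("a", "indefinite article"), ("student", "noun"),
                  ("from", "preposition"), ("Chenguang School", "noun")]:
    phrase.add(None, Node(word, pos=pos))

leaves = words_with_roles(root)
```

Storing the role on the edge rather than in the child node keeps a node reusable, since the same structure could play different roles under different parents.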
FIG. 2 is a schematic structural diagram illustrating a structured representation of text data according to an embodiment of the present application.
The text data may include two complex composite structure types (in the figure, the composite structure type is abbreviated as type two), namely a composite structure a and a composite structure B, the composite structure a may include a composite structure a1 and a composite structure a2, the composite structure B may include a composite structure B1, a composite structure B2 and a composite structure B3, the composite structure a1 may include words 1 and 2, and the subordinate nodes of other composite structures are similar (not shown in the figure).
The role of the term in the composite structure may be labeled in a term node, or may be labeled in a connection relationship between the term and the composite structure (a line for connecting the term and the composite structure in the figure).
In specific implementation, the part of speech, the role of a word in the composite structure, the composite structure type and the like may be labeled with "/" or with another separator such as "-", for example "word/part of speech" or "composite structure A1/composite structure type", which is not limited in this application.
In one embodiment, the structured representation is specifically:
each composite structure of the text data is marked off as a whole by a character string, and the composite structure type of the composite structure and the role of the composite structure in its upper-layer composite structure are labeled by character strings after the marked-off composite structure;
each word in the composite structure is marked off as a whole by a character string, and the role of the word in the upper-layer composite structure to which it belongs is labeled by a character string after the marked-off word.
In particular, each composite structure of the text data may be marked off as a whole with the string "[ ]". For example, for the composite structure "happy mood" in the text data "Mr. Wang is in a happy mood after exercising", the text may be marked as "Mr. Wang after exercising [happy mood]", and the composite structure type is then labeled with a character string: "Mr. Wang after exercising [happy mood]/subject-predicate phrase".
Each word in the composite structure may likewise be marked off as a whole word with a character string, for example "Mr. Wang after exercising [[mood] [comfort]]/subject-predicate phrase", and the role of each word in the subject-predicate phrase is then labeled with a character string: "Mr. Wang after exercising [[mood]/subject [comfort]/predicate]/subject-predicate phrase".
In one embodiment, the part of speech of each word is further labeled after the marked-off word.
For example: "Mr. Wang after exercising [[mood]/subject/noun [comfort]/predicate/adjective]/subject-predicate phrase", or "Mr. Wang after exercising [[mood]/noun/subject [comfort]/adjective/predicate]/subject-predicate phrase".
The order of the part-of-speech label and the role label of a word, and the position of the labels, are not limited in this application.
Other storage means may be used by those skilled in the art, and the present application is not limited thereto.
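The character-string form described above can be produced from a nested structure with a short recursive function. The following is a minimal sketch under stated assumptions: the dictionary keys (`word`, `pos`, `role`, `type`, `children`) and the function name `to_bracket` are hypothetical, and the label order (role before part of speech) is only one of the orders the description permits.

```python
def to_bracket(node):
    """Serialize a nested dict into the "[ ]" plus "/" label notation.

    Word nodes carry a "word" key; composite nodes carry a "children" list.
    Labels that are missing (None) are simply left out.
    """
    if "word" in node:
        body = "[" + node["word"] + "]"
        labels = [node.get("role"), node.get("pos")]
    else:
        body = "[" + " ".join(to_bracket(child) for child in node["children"]) + "]"
        labels = [node.get("type"), node.get("role")]
    return body + "".join("/" + label for label in labels if label)


# Hypothetical input mirroring the "mood" / "comfort" example in the text.
phrase = {
    "type": "subject-predicate phrase",
    "children": [
        {"word": "mood", "role": "subject", "pos": "noun"},
        {"word": "comfort", "role": "predicate", "pos": "adjective"},
    ],
}

print(to_bracket(phrase))
# [[mood]/subject/noun [comfort]/predicate/adjective]/subject-predicate phrase
```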
In one embodiment, the building process of the structured model may include:
determining a plurality of corpus samples;
marking the part of speech of the words in each sample;
labeling the composite structures present in each sample;
marking the action of each element in each sample in the composite structure to obtain structured sample data;
training a plurality of structured sample data to obtain the structured model.
In specific implementation, a plurality of corpus samples may be obtained or collected in advance, and the corpus samples may be text data in various languages. Each sample is labeled with parts of speech, with its composite structures, and with the role of each element in its composite structure, so as to obtain structured sample data; the structured sample data are then used for training to finally obtain the structured model.
Example two
Based on the same inventive concept, the embodiment of the present application provides a device for generating a text structured representation. The principle by which the device solves the technical problem is similar to that of the method for generating a text structured representation, so repeated descriptions are omitted.
Fig. 3 shows a schematic structural diagram of a generating apparatus for text structured representation in the second embodiment of the present application.
As shown in the figure, the apparatus for generating the text structured representation includes:
a text determining module 301, configured to determine text data to be structured;
a structured representation generation module 302, configured to identify, by using a pre-constructed text structured recognition model, a structure of the text data and an effect of each element in the text data in a previous layer structure where the element is located, so as to obtain a structured representation result;
wherein the structured representation comprises: and marking the composite structure included in the text data with a composite structure type, and marking each element in the text data with the function of the element in the previous layer of the composite structure where the element is positioned.
The device for generating the text structured representation provided by the embodiment of the present application can perform structured representation on the text data to be processed by using the pre-constructed structured model, and generate a structured data result labeled with the composite structures of the text data and the role of each element of the text data in its composite structure, thereby converting unstructured text data into structured data. Because the generated structured data result includes both the composite structures of the text data and the role of each element in the previous-layer composite structure where the element is located, the text data can be used more conveniently in subsequent deeper-level data applications.
In one embodiment, each element in the text data is identified as a word, and the structured representation further includes:
and marking the part of speech of each word in the text data.
In one embodiment, the structured representation generation module includes:
the first composite structure determining unit is used for combining a plurality of words in the text data to obtain a previous layer composite structure of the words;
the first labeling unit is used for labeling the type of the composite structure on the upper layer of the composite structure of the word;
and the second labeling unit is used for labeling the function of each word in the composite structure on the upper layer.
In one embodiment, the structured representation generation module further comprises:
the combination unit is used for combining the ith layer of composite structure and/or words to obtain an (i + 1) th layer of composite structure after combination; i is an integer greater than or equal to 1; the 1 st layer of composite structure is the previous layer of composite structure of the words; adding 1 to the value of i, and repeating the previous step until the composite structure can not be combined;
and the third labeling unit is used for labeling the type of the composite structure and the function of the lower-layer composite structure in the upper-layer composite structure of each combined layer of composite structure.
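The layer-by-layer combination performed by the combination unit can be illustrated with a toy loop. This is a hedged sketch only: the application uses a pre-constructed recognition model to decide what can be combined, whereas the `RULES` table below is a hypothetical stand-in, and the names `combine_once` and `combine_layers` are invented for the example. The loop counts combination passes, which only loosely corresponds to the layer index i in the text.

```python
# Hypothetical rules: (pattern of labels, composite type, role of each element).
RULES = [
    (("n", "n"), "dz", ("att", None)),
    (("r", "shi", "dz"), "ss", ("sbj", "pre", "obj")),
]


def combine_once(items):
    """Combine the first run of adjacent items that matches a rule.

    Returns (new_items, changed).
    """
    for pattern, ctype, roles in RULES:
        n = len(pattern)
        for k in range(len(items) - n + 1):
            window = items[k:k + n]
            if tuple(item["label"] for item in window) == pattern:
                # Label each lower-layer element with its role in the new structure.
                children = [dict(item, role=role) for item, role in zip(window, roles)]
                node = {"label": ctype, "children": children}
                return items[:k] + [node] + items[k + n:], True
    return items, False


def combine_layers(items):
    """Repeat the combination step until nothing more can be combined."""
    passes = 0
    while True:
        items, changed = combine_once(items)
        if not changed:
            return items, passes
        passes += 1


words = [
    {"label": "r", "word": "we"},
    {"label": "shi", "word": "are"},
    {"label": "n", "word": "China"},
    {"label": "n", "word": "people"},
]
result, passes = combine_layers(words)
```

With these assumed rules, the two nouns are first combined into a "dz" structure, and the pronoun, the verb and that structure are then combined into an "ss" structure, after which no rule applies and the loop stops.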
In one embodiment, the structured representation generation module further comprises:
and the word segmentation unit is used for performing word segmentation and phrase recognition on the text data to obtain a plurality of words before combining the plurality of words in the text data.
In one embodiment, the structured representation generation module further comprises:
a part-of-speech recognition unit for identifying the part of speech of each word in the text data;
and the part-of-speech tagging unit is used for tagging the part of speech of the word to each word.
In one embodiment, the structured representation generation module includes:
a second composite structure determination unit configured to identify a plurality of composite structures of the obtained text data;
and the fourth labeling unit is used for labeling the type of the composite structure of each composite structure and labeling the action of each word in each composite structure in the upper layer of the composite structure.
In one embodiment, the structured representation generation module further comprises:
the layering unit is used for layering the multiple composite structures according to the inclusion relationship to obtain multiple levels of composite structures; the upper layer composite structure comprises a lower layer composite structure and/or words;
and the fifth marking unit is used for marking the function of the lower-layer composite structure in the upper-layer composite structure.
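Layering already identified composite structures by their inclusion relationship can be sketched as follows. The representation of a composite structure as a `(start, end, type)` span over word positions, with `end` exclusive, and the function name `layer_by_inclusion` are assumptions made for this example; the application does not prescribe them.

```python
def layer_by_inclusion(spans):
    """Layer composite structures according to which span contains which.

    spans: list of (start, end, ctype) with end exclusive.
    Returns (parents, levels): parents[i] is the index of the smallest span
    that strictly contains span i (or None), and levels[i] is 1 for a span
    containing no other span, 2 for a span whose deepest child is level 1, etc.
    """
    def contains(outer, inner):
        return (outer[0] <= inner[0] and inner[1] <= outer[1]
                and (outer[0], outer[1]) != (inner[0], inner[1]))

    parents = []
    for span in spans:
        candidates = [j for j, other in enumerate(spans) if contains(other, span)]
        if candidates:
            parents.append(min(candidates, key=lambda j: spans[j][1] - spans[j][0]))
        else:
            parents.append(None)

    levels = [1] * len(spans)
    # Visit shorter spans first so a child's level is final before its parent uses it.
    for i in sorted(range(len(spans)), key=lambda i: spans[i][1] - spans[i][0]):
        parent = parents[i]
        if parent is not None:
            levels[parent] = max(levels[parent], levels[i] + 1)
    return parents, levels


# Hypothetical spans over the words "we are China people" (positions 0..3).
spans = [(0, 4, "ss"), (2, 4, "dz")]
parents, levels = layer_by_inclusion(spans)
```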
In one embodiment, the structured representation is specifically:
a connection relation is established between a node where each word in the text data is located and a node where the previous layer of composite structure where the word belongs; the connection relation is marked with the function of the words in the upper layer composite structure to which the words belong; each term node comprises the term;
the text data comprises composite structures of layers 1 to N, a connection relation is established between the node where the ith-layer composite structure is located and the node where the (i + 1)th-layer composite structure is located, where i is more than 1 and less than N; and a connection relation is established between the node where the text data is located and the node where the Nth-layer composite structure of the text data is located;
each node where the composite structure is located comprises the composite structure and the composite structure type thereof; the connection relation between the node where the ith layer of composite structure is located and the node where the (i + 1) th layer of composite structure is located is marked with the function of the ith layer of composite structure in the (i + 1) th layer of composite structure.
In one embodiment, the node where each word in the composite structure is located further includes a part of speech of the word.
In one embodiment, the structured representation is specifically:
each composite structure of the text data is marked off as a whole by a character string, and the composite structure type of the composite structure and the role of the composite structure in its upper-layer composite structure are labeled by character strings after the marked-off composite structure;
each word in the composite structure is marked off as a whole by a character string, and the role of the word in the upper-layer composite structure to which it belongs is labeled by a character string after the marked-off word.
In one embodiment, each word is further tagged with a part of speech of the word after the word is in its entirety.
In one embodiment, further comprising:
a model building module, the model building module comprising:
the sample determining unit is used for determining a plurality of corpus samples;
the first labeling unit is used for labeling the part of speech of the word in each sample;
a second labeling unit for labeling the composite structure present in each sample;
the third labeling unit is used for labeling the action of each element in each sample in the composite structure to obtain structured sample data;
and the training unit is used for training a plurality of structured sample data to obtain the structured model.
Those skilled in the art will appreciate that the modules described above may be distributed in an apparatus as described in the embodiments of the present application, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments of the present application. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Example three
Based on the same inventive concept, embodiments of the present application further provide a computer storage medium, which is described below.
The computer storage medium has a computer program stored thereon which, when executed by a processor, implements the steps of the method for generating a text structured representation according to embodiment one.
In particular implementations, a computer storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer storage media may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer storage medium provided by the embodiment of the present application can perform structured representation on the text data to be processed by using the pre-constructed structured model, and generate a structured data result labeled with the composite structures of the text data and the role of each element of the text data in its composite structure, thereby converting unstructured text data into structured data. Because the generated structured data result includes both the composite structures of the text data and the role of each element in its composite structure, the text data can be used more conveniently in subsequent deeper-level data applications.
Example four
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which is described below.
Fig. 4 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and applicable scope of the embodiments of the present application.
As shown, the electronic device includes memory 401 for storing one or more programs, and one or more processors 402; the one or more programs, when executed by the one or more processors, implement a method for generating a textual structured representation according to embodiment one.
In particular, the electronic device may further include a bus, a display unit, and the like for connecting different system components.
The memory may include readable media in the form of volatile memory units, such as: a random access memory unit RAM, or a cache memory unit, and may also include a read-only memory unit ROM.
The memory may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus may be any of several types of bus structures including a memory bus or controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output I/O interface. Also, the electronic device may communicate with one or more networks through network adapters, such as: a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network. The network adapter may communicate with other modules of the electronic device over the bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The electronic device provided by the embodiment of the present application can perform structured representation on the text data to be processed by using the pre-constructed structured model, and generate a structured data result labeled with the composite structures of the text data and the role of each element of the text data in its composite structure, thereby converting unstructured text data into structured data. Because the generated structured data result includes both the composite structures of the text data and the role of each element in its composite structure, the text data can be used more conveniently in subsequent deeper-level data applications.
Example five
In order to facilitate the implementation of the present application, the embodiments of the present application are described with a specific example.
Fig. 5 is a schematic diagram illustrating a generation process of a text structured representation in the fifth embodiment of the present application.
As shown in the figure, the generation process of the text structured representation provided by the embodiment of the present application includes the following steps:
Step 501, extracting text data to be structured from a data source.
Assume that the text data to be structured is:
We are Chinese, and we love our own motherland.
Step 502, identifying the structure of the text data and the role of each element in the text data in the structure.
The embodiment of the application firstly carries out word segmentation processing on the sentence, and the result after word segmentation is obtained as follows:
we
are
China
people
we
love
own
of
motherland
Then, the parts of speech are labeled on the word segmentation result. In the embodiment of the present application, r denotes a pronoun, n a noun, v a verb, del an auxiliary word, wp a punctuation mark, and so on, and the following result is finally obtained:
we/r
are/shi
China/n
people/n
,/wp
we/r
love/v
own/n
of/del
motherland/n
。/wp
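The "word/tag" lines above are easy to read back programmatically. The sketch below is illustrative only: the function names `parse_tagged` and `describe` and the `TAG_NAMES` table are assumptions, and the tag "shi" is deliberately left unmapped because the text does not state what it stands for.

```python
# Tag meanings as stated in the text; "shi" is not defined there, so it is not mapped.
TAG_NAMES = {
    "r": "pronoun",
    "n": "noun",
    "v": "verb",
    "del": "auxiliary word",
    "wp": "punctuation mark",
}


def parse_tagged(token):
    """Split a "word/tag" token at its last "/" into (word, tag)."""
    word, sep, tag = token.rpartition("/")
    if not sep or not word:
        raise ValueError("expected word/tag, got: " + repr(token))
    return word, tag


def describe(token):
    """Return (word, readable tag name); unknown tags are returned unchanged."""
    word, tag = parse_tagged(token)
    return word, TAG_NAMES.get(tag, tag)


tagged = ["we/r", "are/shi", "China/n", "people/n", ",/wp"]
pairs = [parse_tagged(token) for token in tagged]
```

Splitting at the last "/" rather than the first keeps a word that itself contains a slash intact.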
Then, the composite structures are labeled. The embodiment of the present application sets the following labels for 9 common composite structures:
dz: attributive-head phrase, zz: adverbial-head phrase, zw: subject-predicate phrase, db: verb-object phrase;
jb: preposition-object phrase, dbu: verb-complement phrase, lh: coordinate phrase, zc: master-slave phrase, ss: clause.
Labeling the composite structure gave the following results:
[
we/r
are/shi
[ China/n people/n ]/dz
]/ss
,/wp
[
we/r
love/v
[ own/n of/del motherland/n ]/dz
]/ss
。/wp
Finally, the role of each element in the composite structure is labeled. The embodiment of the present application sets the following labels for six roles of elements in the composite structure:
sbj: subject, obj: object, att: attributive, adv: adverbial, pre: predicate, cmp: complement.
Labeling the role of each element in the composite structure gives the following results:
[
we/r/sbj
are/shi/pre
[ China/n/att people/n ]/dz/obj
]/ss
,/wp
[
we/r/sbj
love/v/pre
[ own/n of/del motherland/n ]/dz/obj
]/ss
。/wp
It can be seen that the two complementary dimensions, the composite structure type (nine types) and the role type (six types), together describe the role of each element of the text data in the whole sentence.
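A labeled result of this kind can be read back into a nested structure with a small stack-based parser. This is a sketch under assumptions: it expects the result as whitespace-separated tokens ("[", "word/pos/role", "]/type/role"), with the part of speech before the role as in this example; the function name `parse_labeled` and the dictionary keys are hypothetical.

```python
def parse_labeled(text):
    """Parse bracket notation into nested dicts.

    "[" opens a composite structure, "]/type/role" closes it and attaches its
    labels, and any other token is read as "word/pos/role" (labels optional).
    """
    root = {"type": "text", "role": None, "children": []}
    stack = [root]
    for token in text.split():
        if token == "[":
            node = {"type": None, "role": None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif token.startswith("]"):
            if len(stack) == 1:
                raise ValueError("closing bracket without opening bracket")
            node = stack.pop()
            labels = token[1:].split("/")[1:]  # "]/dz/obj" -> ["dz", "obj"]
            node["type"] = labels[0] if labels else None
            node["role"] = labels[1] if len(labels) > 1 else None
        else:
            word, *labels = token.split("/")
            stack[-1]["children"].append({
                "word": word,
                "pos": labels[0] if labels else None,
                "role": labels[1] if len(labels) > 1 else None,
            })
    if len(stack) != 1:
        raise ValueError("opening bracket without closing bracket")
    return root


tree = parse_labeled(
    "[ we/r/sbj are/shi/pre [ China/n/att people/n ]/dz/obj ]/ss ,/wp"
)
clause = tree["children"][0]
```

Here `clause` is the "ss" structure, whose third child is the "dz" structure acting as object.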
Step 503, displaying the structured representation result on a display interface.
In the embodiment of the present application, the structured representation result may be displayed on a display interface of a computer in a tree manner, and specifically, may be as shown in fig. 6.
Step 504, storing the text data according to the structured representation result.
The embodiment of the application can store the text data in a server or a database according to the structured representation result.
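As one possible way of storing the result in a database, the nested structure could be serialized to JSON and kept next to the original text. The sketch below uses an in-memory SQLite database from the Python standard library; the table name `structured_text`, its columns, and the functions `store` and `load` are assumptions for illustration and are not specified by the application.

```python
import json
import sqlite3


def store(conn, text, structure):
    """Store the text and its structured representation; return the row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS structured_text "
        "(id INTEGER PRIMARY KEY, text TEXT, structure TEXT)"
    )
    cursor = conn.execute(
        "INSERT INTO structured_text (text, structure) VALUES (?, ?)",
        # ensure_ascii=False keeps non-ASCII words readable in the stored JSON.
        (text, json.dumps(structure, ensure_ascii=False)),
    )
    conn.commit()
    return cursor.lastrowid


def load(conn, row_id):
    """Load (text, structure) for a stored row."""
    row = conn.execute(
        "SELECT text, structure FROM structured_text WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0], json.loads(row[1])


conn = sqlite3.connect(":memory:")
structure = {
    "type": "ss",
    "children": [
        {"word": "we", "pos": "r", "role": "sbj"},
        {"word": "love", "pos": "v", "role": "pre"},
    ],
}
row_id = store(conn, "we love", structure)
```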
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A method for generating a structured representation of text, comprising:
determining text data to be structured;
recognizing the structure of the text data and the action of each element in the text data in the previous layer structure by using a pre-constructed text structured recognition model to obtain a structured representation result;
wherein the structured representation comprises: marking a composite structure type on a composite structure included in the text data, and marking the effect of each element in the text data on the element in the previous layer of the composite structure;
a connection relation is established between a node where each word in the text data is located and a node where the previous layer of composite structure where the word belongs; the connection relation is marked with the function of the words in the upper layer composite structure to which the words belong; the node where each word is located comprises the word.
2. The method of claim 1, wherein the structured representation further comprises:
and marking the part of speech of each word in the text data.
3. The method according to claim 1, wherein the identifying the structure of the text data and the role of each element in the text data in the respective previous layer structure to obtain the structured representation result comprises:
combining a plurality of words in the text data to obtain a previous layer composite structure of the words;
and marking the type of the composite structure of the upper layer of the composite structure of the word, and marking the action of each word in the upper layer of the composite structure of the word.
4. The method of claim 3, wherein after said obtaining the previous layer composite structure of the plurality of words, further comprising:
combining the ith layer of composite structure and/or words to obtain an (i + 1) th layer of composite structure after combination; i is an integer greater than or equal to 1; the 1 st layer of composite structure is the previous layer of composite structure of the words;
and adding 1 to the value of i, repeating the previous step until the composite structures can not be combined, and marking the type of the composite structure and the function of the lower-layer composite structure in the upper-layer composite structure of each combined layer of composite structure.
5. The method of claim 3, wherein prior to said combining the plurality of words in the text data, further comprising:
and performing word segmentation and phrase identification on the text data to obtain a plurality of words.
6. The method according to claim 1, wherein the identifying the structure of the text data and the role of each element in the text data in the respective previous layer structure to obtain the structured representation result comprises:
identifying a plurality of composite structures of the text data;
each composite structure is labeled with a composite structure type, and the role of each term in each composite structure in the previous layer of composite structure on which it is located is labeled.
7. The method of claim 6, wherein after the identifying results in a plurality of composite structures for the text data, further comprising:
layering the multiple composite structures according to the inclusion relationship to obtain multiple-level composite structures; the upper layer composite structure comprises a lower layer composite structure and/or words;
the role of the lower composite structure in the upper composite structure is noted.
8. The method according to claim 1, characterized in that the structured representation is in particular:
the text data comprises composite structures of layers 1 to N, a connection relation is established between the node where the ith-layer composite structure is located and the node where the (i + 1)th-layer composite structure is located, where i is more than 1 and less than N; and a connection relation is established between the node where the text data is located and the node where the Nth-layer composite structure of the text data is located;
each node where the composite structure is located comprises the composite structure and the composite structure type thereof; the connection relation between the node where the ith layer of composite structure is located and the node where the (i + 1) th layer of composite structure is located is marked with the function of the ith layer of composite structure in the (i + 1) th layer of composite structure.
9. The method according to claim 1, characterized in that the structured representation is in particular:
each composite structure of the text data is marked with a character string as a composite structure whole, and the composite structure type of the composite structure and the effect of the composite structure in the composite structure on the upper layer are marked with the character string after each composite structure whole;
each word in the composite structure is marked as a word whole by a character string, and the function of the word in the upper layer composite structure to which the word belongs is marked by the character string after each word whole.
10. The method according to any one of claims 1 to 9, wherein the composite structure type comprises one or more of:
an attributive-head phrase, an adverbial-head phrase, a subject-predicate phrase, a verb-object phrase, a preposition-object phrase, a verb-complement phrase, a coordinate phrase, a master-slave phrase, and a clause.
11. A method according to any one of claims 1 to 9, wherein the effect of the element in the composite structure comprises one or more of:
subject, predicate, object, complement.
12. An apparatus for generating a structured representation of text, comprising:
the text determining module is used for determining text data to be structured;
the structured representation module is used for identifying the structure of the text data and the action of each element in the text data in the structure by utilizing a pre-constructed text structured identification model to obtain a structured representation result;
wherein the structured representation comprises: marking a composite structure type on a composite structure included in the text data, and marking the effect of each element in the text data on the composite structure where the element is positioned;
a connection relation is established between a node where each word in the text data is located and a node where the previous layer of composite structure where the word belongs; the connection relation is marked with the function of the words in the upper layer composite structure to which the words belong; the node where each word is located comprises the word.
13. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
14. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 11.
CN201910772846.XA 2019-08-21 2019-08-21 Method and device for generating text structured representation and computer storage medium Active CN110543630B (en)


Publications (2)

Publication Number Publication Date
CN110543630A CN110543630A (en) 2019-12-06
CN110543630B true CN110543630B (en) 2020-06-09

Family

ID=68711715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910772846.XA Active CN110543630B (en) 2019-08-21 2019-08-21 Method and device for generating text structured representation and computer storage medium

Country Status (1)

Country Link
CN (1) CN110543630B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260058B (en) * 2020-01-21 2023-09-26 北京百度网讯科技有限公司 Feature generation method, device, electronic equipment and storage medium
CN113515927B (en) * 2021-09-14 2021-12-03 北京欧应信息技术有限公司 Method, computing device and storage medium for generating structured text

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150303A (en) * 2013-03-08 2013-06-12 Beijing Institute of Technology Chinese semantic case layering identification method

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2007072528A (en) * 2005-09-02 2007-03-22 Internatl Business Mach Corp <Ibm> Method, program and device for analyzing document structure
CN108984683B (en) * 2018-06-29 2021-06-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, system, equipment and storage medium for extracting structured data
CN109101583A (en) * 2018-07-23 2018-12-28 Shanghai Feixun Data Communication Technology Co., Ltd. Knowledge graph construction method and system for unstructured text
CN109271626B (en) * 2018-08-31 2023-09-26 Beijing University of Technology Text semantic analysis method
CN109241538B (en) * 2018-09-26 2022-12-20 Shanghai Detuo Information Technology Co., Ltd. Chinese entity relation extraction method based on dependency of keywords and verbs

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN103150303A (en) * 2013-03-08 2013-06-12 Beijing Institute of Technology Chinese semantic case layering identification method

Non-Patent Citations (1)

Title
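Xu Fan. "Research on Key Issues in English Discourse Structure Analysis." China Doctoral Dissertations Full-text Database, Information Science and Technology Series. 2014, (No. 11), I138-48. *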

Also Published As

Publication number Publication date
CN110543630A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
Androutsopoulos et al. Generating natural language descriptions from OWL ontologies: the NaturalOWL system
Holzinger et al. Combining HCI, natural language processing, and knowledge discovery-potential of IBM content analytics as an assistive technology in the biomedical field
Constant et al. MWU-aware part-of-speech tagging with a CRF model and lexical resources
US20180046705A1 (en) Providing question and answers with deferred type evaluation using text with limited structure
Berman Principles of big data: preparing, sharing, and analyzing complex information
CN112487202B (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN109271626A (en) Text semantic analysis method
US20160357731A1 (en) Method for Automatically Detecting Meaning and Measuring the Univocality of Text
Khan et al. Extracting Spatial Information From Place Descriptions
CN102272755A (en) Method for semantic processing of natural language using graphical interlingua
CN110543630B (en) Method and device for generating text structured representation and computer storage medium
CN116805013A (en) Traditional Chinese medicine video retrieval model based on knowledge graph
Nath et al. The quest for better clinical word vectors: Ontology based and lexical vector augmentation versus clinical contextual embeddings
CN113779993A (en) Medical entity identification method based on multi-granularity text embedding
Ke et al. Medical entity recognition and knowledge map relationship analysis of Chinese EMRs based on improved BiLSTM-CRF
Li et al. Distributed representation for traditional Chinese medicine herb via deep learning models
Moussallem et al. RDF2PT: Generating Brazilian Portuguese texts from RDF data
Ding et al. Research on question answering system for COVID-19 based on knowledge graph
Sharipbay et al. ONTOLOGICAL KNOWLEDGE BASED MODELS REPRESENTING MEDICINE TERMINOLOGY
AlAgha Using linguistic analysis to translate arabic natural language queries to SPARQL
Khazani et al. Semantic graph knowledge representation for Al-Quran verses based on word dependencies
Padma et al. Development of morphological stemmer, analyzer and generator for Kannada nouns
Chiarcos et al. Developing and using the ontologies of linguistic annotation (2006-2016)
Yan et al. Research on Named Entity Recognition in Chinese EMR Based on Semi-Supervised Learning with Dual Selected Strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant