CN108733636B - Method and device for extracting multiple tuples from characters
- Publication number
- CN108733636B CN108733636B CN201710280347.XA CN201710280347A CN108733636B CN 108733636 B CN108733636 B CN 108733636B CN 201710280347 A CN201710280347 A CN 201710280347A CN 108733636 B CN108733636 B CN 108733636B
- Authority
- CN
- China
- Prior art keywords
- text
- entity
- network
- vectors
- sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a method and a device for extracting tuples from text, and relates to the field of text processing. The method comprises: inputting training data labeled as legal or illegal into a recurrent neural network to obtain network parameters; identifying the multi-element entities in a text to be tested and segmenting the rest of the text into words; inputting the word vectors of the multi-element entities and of the segmented words into the sub-networks one by one, in their order of arrangement in the text, and obtaining the hidden vector output by each sub-network using the network parameters, where the hidden vector output by one sub-network is an input of the next; integrating the hidden vectors corresponding to the multi-element entities that form a tuple to obtain a judgment vector; classifying the judgment vector with the network parameters; and extracting the tuples classified as legal as legal tuples. The accuracy of tuple extraction can thus be improved.
Description
Technical Field
The invention relates to the field of text processing, and in particular to a method and a device for extracting tuples from text.
Background
Daily work, study and life involve many texts such as reports, statements and documents, and important information in these texts can be represented as tuples. For example, from the text "the company's management cost in 2013 was 2306 ten thousand yuan", the triple [2013, company management cost, 2306 ten thousand yuan] can be extracted; the important information of the text is contained in this triple.
At present, tuples are extracted from text by rules constructed in advance. Taking triple extraction as an example: first, the attribute entity, time entity and value entity of the triple are initialized to null; the text is scanned front to back to obtain the entities it contains; if an entity is an attribute entity, it becomes the latest attribute entity; if it is a value entity, it is appended to the value-entity queue; if it is a time entity, it is appended to the time-entity queue; whenever the value-entity queue and the time-entity queue have equal length and the attribute entity is not empty, the triple [i-th time entity, latest attribute entity, i-th value entity] is extracted. However, when a text contains several attribute entities, this rule extracts erroneous triples. For example, for the text "the publisher's revenue, gross profit and gross profit margin in 2013 were 99340.49 ten thousand yuan, 64478.58 ten thousand yuan and 64.91% respectively", the rule can only extract the triple [2013, gross profit margin, 99340.49 ten thousand yuan], which the text's content shows to be wrong. The accuracy of tuple extraction in the prior art is therefore low.
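The rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `(kind, text)` entity representation and all names are assumptions. Run on the failing example, it reproduces the error the background describes — with several attribute entities, only one wrongly paired triple is produced.

```python
def extract_triples_by_rule(entities):
    """Scan (kind, text) entities front to back, per the prior-art rule."""
    latest_attr = None
    times, values, triples = [], [], []
    emitted = 0
    for kind, text in entities:
        if kind == "attribute":
            latest_attr = text          # latest attribute entity wins
        elif kind == "time":
            times.append(text)          # time-entity queue
        elif kind == "value":
            values.append(text)         # value-entity queue
        # Rule: equal queue lengths and a non-empty attribute entity =>
        # pair the i-th time entity with the latest attribute entity
        # and the i-th value entity.
        if latest_attr is not None and len(times) == len(values):
            while emitted < len(times):
                triples.append((times[emitted], latest_attr, values[emitted]))
                emitted += 1
    return triples
```

On the single-attribute example the rule works; on the multi-attribute example it pairs "gross profit margin" with the revenue figure, exactly the failure mode motivating the invention.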
Disclosure of Invention
An embodiment of the invention provides a method and a device for extracting tuples from text, which can improve the accuracy of tuple extraction.
In a first aspect, an embodiment of the invention provides a method for extracting tuples from text, including: inputting training data with class identifiers into a recurrent neural network and training to obtain the network parameters of the recurrent neural network, where the training data includes legal training data and illegal training data and the recurrent neural network includes a plurality of sub-networks; identifying the multi-element entities in a text to be tested, which consists of characters, and segmenting the rest of the text into words; inputting the word vectors of the multi-element entities and of the segmented words into the sub-networks in one-to-one correspondence, in their order of arrangement in the text, and obtaining the hidden vector output by each sub-network using the network parameters, where the hidden vector output by one sub-network is used as an input of the next; integrating the hidden vectors corresponding to the multi-element entities that form a tuple to obtain a judgment vector; classifying the judgment vector with the network parameters to obtain a classification result, which is either legal or illegal; and extracting the tuples whose classification result is legal as legal tuples.
In a second aspect, an embodiment of the invention provides an apparatus for extracting tuples from text, including: a training module configured to input training data with class identifiers into a recurrent neural network and train to obtain its network parameters, where the training data includes legal training data and illegal training data and the recurrent neural network includes a plurality of sub-networks; a splitting module configured to identify the multi-element entities in a text to be tested, which consists of characters, and segment the rest of the text into words; a first calculation module configured to input the word vectors of the multi-element entities and of the segmented words into the sub-networks in one-to-one correspondence, in their order of arrangement in the text, and obtain the hidden vector output by each sub-network using the network parameters, where the hidden vector output by one sub-network is used as an input of the next; a second calculation module configured to integrate the hidden vectors corresponding to the multi-element entities that form a tuple to obtain a judgment vector; a classification module configured to classify the judgment vector with the network parameters to obtain a classification result, which is either legal or illegal; and a first extraction module configured to extract the tuples whose classification result is legal as legal tuples.
An embodiment of the invention provides a method and a device for extracting tuples from text. A recurrent neural network is trained with class-labeled training data to obtain its network parameters. The word vectors of the multi-element entities and of the words obtained by segmenting the rest of the text to be tested are input into the sub-networks of the recurrent neural network one by one, in their order of arrangement in the text, and the hidden vector output by each sub-network is obtained using the network parameters. A judgment vector is obtained by integrating the hidden vectors and is classified as legal or illegal, and the tuples classified as legal are extracted as legal tuples. Compared with the prior art of extracting tuples by pre-constructed rules, the embodiment uses a recurrent neural network in which the hidden vector output by one sub-network is an input of the next, connecting the multi-element entities with the other parts of the text to be tested. Training the recurrent neural network therefore captures the legality rules of tuples in many types of text, so that legal tuples can be identified and extracted from more types of text and the accuracy of tuple extraction is improved.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
FIG. 1 is a flowchart illustrating a method for extracting tuples from text according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an application architecture of a method for extracting triples from text according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for extracting tuples from text according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for extracting tuples from text according to another embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus for extracting tuples according to another embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention. The present invention is in no way limited to any specific configuration and algorithm set forth below, but rather covers any modification, replacement or improvement of elements, components or algorithms without departing from the spirit of the invention. In the drawings and the following description, well-known structures and techniques are not shown in order to avoid unnecessarily obscuring the present invention.
An embodiment of the invention provides a method and a device for extracting tuples from text. A recurrent neural network is trained in advance with class-labeled training data. The multi-element entities in the text to be tested are identified, and the entities and the other parts of the text are input into the sub-networks of the trained recurrent neural network. A judgment vector is obtained from the hidden vectors output by the sub-networks and is classified using the network parameters of the recurrent neural network, and the tuples classified as legal are extracted as legal tuples. By using a recurrent neural network, the embodiment establishes the connection between the multi-element entities and the other parts of the text to be tested, and thereby captures the legality rules of tuples in many types of text. Legal tuples can thus be identified and extracted from more types of text, improving the accuracy of tuple extraction.
FIG. 1 is a flowchart illustrating a method for extracting tuples from text according to an embodiment of the present invention. As shown in fig. 1, the method for extracting multiple tuples from text includes steps 101 to 106.
In step 101, training data with class identifiers are input into a recurrent neural network, and network parameters of the recurrent neural network are obtained through training.
The training data with class identifiers includes legal training data and illegal training data. The class identifier is either legal, meaning correct here, or illegal, meaning wrong here. In one example, the legal training data may include correct tuples and the illegal training data may include incorrect tuples.
The recurrent neural network includes a plurality of sub-networks. In one example, the recurrent neural network may be an RNN (Recurrent Neural Network). The sub-networks are connected in sequence, with the output of each sub-network connected to the input of the next; that is, the output of the previous sub-network can serve as an input of the next sub-network and participate in its computation. In one example, the sub-networks share the network parameters of the recurrent neural network, i.e., the network parameters of all sub-networks are the same. In one illustrative example, the network parameters may be the parameters of an LSTM (Long Short-Term Memory) unit.
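The chaining of sub-networks with shared parameters can be sketched in pure Python. This is a simplification under stated assumptions: the patent's LSTM unit is replaced by a plain tanh cell for brevity, and all names are illustrative. What it shows is the structure the paragraph describes — the same weights at every position, with the hidden vector output by one sub-network fed into the next.

```python
import math

def rnn_cell(x, h_prev, W_xh, W_hh):
    # One sub-network step: the SAME weight matrices are used at every
    # position (the sub-networks share the network parameters).
    return [math.tanh(sum(a * b for a, b in zip(row_x, x)) +
                      sum(a * b for a, b in zip(row_h, h_prev)))
            for row_x, row_h in zip(W_xh, W_hh)]

def unroll(xs, W_xh, W_hh):
    # The hidden vector output by each sub-network is an input of the next.
    h = [0.0] * len(W_xh)
    hs = []
    for x in xs:
        h = rnn_cell(x, h, W_xh, W_hh)
        hs.append(h)
    return hs
```

An LSTM cell would add input, forget and output gates around the same unrolling loop; the parameter sharing and hidden-state chaining are unchanged.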
In step 102, a plurality of entities in the text to be tested are identified, and the other parts except the plurality of entities in the text to be tested are segmented.
The text to be tested consists of characters, which here include Chinese characters, letters, numbers, punctuation marks and words of other languages. The multi-element entities correspond to the tuples that the embodiment aims to extract. In one example, the multi-element entities include at least two of a time entity, an attribute entity, a value entity and a qualifier entity. In one example, a qualifier entity modifies an attribute entity, but it is not limited to this.
The multi-element entities in the text to be tested are identified, and the other parts of the text are segmented into words. For example, let the text to be tested be "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The time entities identified in the text are "2012" and "2013", the attribute entity is "sales", and the value entities are "100 ten thousand yuan" and "200 ten thousand yuan"; segmenting the remaining parts of the text yields words such as "the company's", "and" and "respectively".
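A toy version of this split can be sketched as below. The patent does not specify the entity recognizer, so the regular-expression patterns are assumptions over the English rendering of the example sentence, and attribute entities such as "sales" (which would need a lexicon) are omitted.

```python
import re

# Illustrative patterns only — the patent leaves the recognizer open.
TIME = re.compile(r"(?:19|20)\d{2}")
VALUE = re.compile(r"\d+(?:\.\d+)? ten thousand yuan")

def recognize(text):
    """Return (kind, surface) pairs for the entities found in the text."""
    ents = [("time", m.group()) for m in TIME.finditer(text)]
    ents += [("value", m.group()) for m in VALUE.finditer(text)]
    return ents

def segment_rest(text, ents):
    """Blank out entity spans, then segment the remainder into words."""
    for _, surface in ents:
        text = text.replace(surface, " ")
    return text.split()
```

For Chinese text the whitespace split would be replaced by a proper word segmenter; the two-stage structure (entities first, segmentation of the rest) is what step 102 describes.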
In step 103, according to the arrangement sequence in the text to be tested, the word vectors of the multi-element entities and the word vectors of the words obtained after word segmentation are input into the multiple subnetworks in a one-to-one correspondence manner, and the hidden vectors output by each subnetwork are obtained by combining network parameters.
The word vectors of the multi-element entities and of the segmented words are input into the sub-networks one by one, in the order in which the entities and words are arranged in the text to be tested. In one example, triples are extracted; FIG. 2 is a schematic diagram of the application architecture of the triple-extraction method in this example. As shown in FIG. 2, the word vectors w_1, …, w_i, …, w_j, …, w_k, …, w_n are the word vectors of the multi-element entities and of the segmented words, arranged in their order in the text to be tested. As can be seen from FIG. 2, these word vectors correspond one to one to the sub-networks of the recurrent neural network.
Here w_i, w_j and w_k are the word vectors of the multi-element entities. It should be noted that the network parameters obtained by training the recurrent neural network include the word vectors. In one example, each multi-element entity and each segmented word has a position identifier in a preset dictionary, and each position identifier corresponds to a word vector; that is, every multi-element entity and segmented word maps to a word vector. If a multi-element entity or segmented word matches no word in the dictionary, its word vector is set to that of an unknown identifier, for example "unknown"; of course, the unknown identifier may be expressed in other ways, which are not limited here.
Specifically, the position identifier in the dictionary may be a position number. For example, let the text to be tested be "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The time entities are "2012" and "2013", the attribute entity is "sales", and the value entities are "100 ten thousand yuan" and "200 ten thousand yuan", as shown in Table 1 below:
Table 1
The identifiers in Table 1 correspond to word vectors, and different identifiers may correspond to different word vectors. The word vectors corresponding to the identifiers in Table 1 are input into the sub-networks to obtain the hidden vector output by each sub-network. It should be noted that the hidden vector output by one sub-network serves as an input of the next. As shown in FIG. 2, for example, the input of sub-network i includes the word vector of the i-th word in the text to be tested and the hidden vector h_{i-1} output by sub-network i-1.
In step 104, the hidden vectors corresponding to the multi-element entities that form a tuple are integrated to obtain a judgment vector.
For example, if the tuple to be extracted includes a time entity, an attribute entity and a value entity, the hidden vectors used in the integration calculation are those corresponding to the time entity, the attribute entity and the value entity. As shown in FIG. 2, the word vectors of the multi-element entities are w_i, w_j and w_k, so the hidden vectors used in the integration are h_i, h_j and h_k; integrating h_i, h_j and h_k yields the judgment vector d. An integration calculation combines two or more vectors into one vector.
In step 105, the judgment vectors are classified by using the network parameters to obtain a classification result.
The network parameters obtained by training in step 101 can serve as the basis for classification in step 105: the judgment vector is classified to obtain a classification result, which is either legal or illegal. That is, the classification reveals whether the tuple formed by the multi-element entities whose hidden vectors participated in the integration is a correct tuple: a legal result indicates a correct tuple, and an illegal result indicates an erroneous tuple. In one example, the judgment vector may be classified with a multi-dimensional classification model such as a softmax classification model; alternatively, it may be classified with a machine-learning model such as an SVM (Support Vector Machine). In one example, the classification result may be represented by a number, with 1 representing legal and 0 representing illegal; the classification result may also be expressed in other ways, which are not limited here.
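The softmax variant of this classification step can be sketched as below. The weights `W` and `b` stand in for the trained network parameters and the label convention (index 1 = legal, index 0 = illegal) follows the text's numeric example; all of this is illustrative, not the patent's exact model.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def classify(d, W, b):
    """Score judgment vector d; return 1 (legal) or 0 (illegal)."""
    logits = [sum(w * x for w, x in zip(row, d)) + bi
              for row, bi in zip(W, b)]
    probs = softmax(logits)
    return 1 if probs[1] > probs[0] else 0
```

An SVM would replace the softmax with a sign test against a decision surface; the interface — judgment vector in, legal/illegal out — is the same.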
In step 106, the tuples whose classification result is legal are extracted as legal tuples.
An embodiment of the invention provides a method for extracting tuples from text. A recurrent neural network is trained with class-labeled training data to obtain its network parameters. The word vectors of the multi-element entities and of the words obtained by segmenting the rest of the text to be tested are input into the sub-networks of the recurrent neural network one by one, in their order of arrangement in the text, and the hidden vector output by each sub-network is obtained using the network parameters. A judgment vector is obtained by integrating the hidden vectors and is classified as legal or illegal, and the tuples classified as legal are extracted as legal tuples. Compared with the prior art of extracting tuples by pre-constructed rules, the embodiment uses a recurrent neural network in which the hidden vector output by one sub-network is an input of the next, connecting the multi-element entities with the other parts of the text to be tested. Training the recurrent neural network therefore captures the legality rules of tuples in many types of text, so that legal tuples can be identified and extracted from more types of text and the accuracy of tuple extraction is improved.
It should be noted that the extracted legal tuples can also be input into the recurrent neural network as legal training data to train and update parameters of the recurrent neural network, so that legal rules of the tuples in various types of texts are further enriched, and the accuracy of extracting the tuples is further improved.
FIG. 3 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention. Fig. 3 is different from fig. 1 in that the method for extracting tuples from text may further include step 107, and step 104 in fig. 1 may be specifically detailed as step 1041 or step 1042.
In step 107, the multi-element entities are permuted and combined to generate at least one tuple.
The text to be tested generally contains more than two types of multi-element entities, and permuting and combining the entities of different types generates at least one tuple. For example, let the text to be tested be "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The three types of multi-element entities in this text are time entities ("2012" and "2013"), an attribute entity ("sales") and value entities ("100 ten thousand yuan" and "200 ten thousand yuan"). Combining the three types of entities therefore yields 2 × 1 × 2 = 4 triples: [2012, sales, 100 ten thousand yuan], [2012, sales, 200 ten thousand yuan], [2013, sales, 100 ten thousand yuan] and [2013, sales, 200 ten thousand yuan]. The hidden vectors corresponding to the multi-element entities of each of the 4 triples can be integrated to obtain judgment vectors, the judgment vectors classified, and the triples whose classification result is legal extracted as legal triples.
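The candidate generation above is a plain Cartesian product over the entity types; a minimal sketch (names are illustrative):

```python
from itertools import product

def candidate_triples(times, attrs, values):
    # Every permutation-combination of the recognized entity types;
    # each candidate is later scored legal/illegal by the network.
    return [(t, a, v) for t, a, v in product(times, attrs, values)]
```

For the example text this yields exactly the 2 × 1 × 2 = 4 candidate triples listed above.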
In step 1041, a mean vector of the hidden vectors corresponding to the multi-element entities forming the multi-element group is calculated, and the mean vector is used as a judgment vector.
In one example, let the extracted tuple be a triple, h_i, h_j and h_k the hidden vectors corresponding to the multi-element entities forming the triple, and d the judgment vector. The judgment vector is then calculated by equation (1):

d = (h_i + h_j + h_k) / 3    (1)
the decision vector d in this example is calculated using a mean algorithm.
In step 1042, a weight calculation is performed on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and the vectors obtained by the weight calculation are used as judgment vectors.
In one example, let the extracted tuple be a triple, h_i, h_j and h_k the hidden vectors corresponding to the multi-element entities forming the triple, and d the judgment vector. The judgment vector is then calculated by equation (2):

d = h_i × m_i + h_j × m_j + h_k × m_k    (2)

where m_i, m_j and m_k are weighting coefficients with m_i + m_j + m_k = 1. The judgment vector d in this example is calculated using a weighting algorithm.
However, the integration calculation used to obtain the judgment vector includes, but is not limited to, the two methods above. Any algorithm that integrates two or more hidden vectors into one judgment vector falls within the protection scope of the embodiments of the invention.
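Both integration variants can be sketched element-wise over plain Python lists; this mirrors equations (1) and (2) directly (function names are illustrative):

```python
def mean_integrate(hs):
    """Equation (1): d = (h_i + h_j + h_k) / 3, for any number of vectors."""
    n = len(hs)
    return [sum(col) / n for col in zip(*hs)]

def weighted_integrate(hs, ms):
    """Equation (2): d = sum_t m_t * h_t, with the weights summing to 1."""
    assert abs(sum(ms) - 1.0) < 1e-9
    return [sum(m * x for m, x in zip(ms, col)) for col in zip(*hs)]
```

`mean_integrate` is the special case of `weighted_integrate` with equal weights 1/n.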
FIG. 4 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention. Fig. 4 is different from fig. 1 in that the method for extracting tuples from text further includes steps 108 to 110.
In step 108, tuples are extracted from the table.
The text to be tested includes a table that matches the characters. All tuples described in the table are extracted. Since techniques for extracting tuples from tables are mature, they are not described here.
In step 109, the multi-tuple extracted from the table is compared with the legitimate multi-tuple extracted from the text.
Through the steps of the above embodiments, legal tuples can be extracted from the text. The tuples extracted from the table in step 108 are compared with the legal tuples extracted from the text to determine whether they match.
In step 110, if the tuple extracted from the table does not match the valid tuple extracted from the text, the presentation information is generated.
If a tuple extracted from the table does not match the legal tuples extracted from the text, then either the tuple described in the table or the one described in the text is erroneous. In that case, prompt information is generated to report the error, realizing an error-reporting function for tuples extracted from tables and from text.
The difference between the tuples extracted from the table and the legal tuples extracted from the text may also be output as the error in the table or text, thereby realizing an error-correction function for tuples in the table or in the text.
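The comparison in steps 109–110 can be sketched as a two-way set difference (a minimal illustration; the tuple representation is assumed):

```python
def check_consistency(table_tuples, text_tuples):
    """Compare table tuples with the legal tuples extracted from the text.
    Non-empty differences mean prompt information should be generated."""
    table_set, text_set = set(table_tuples), set(text_tuples)
    return {"table_only": table_set - text_set,   # in table, not in text
            "text_only": text_set - table_set}    # in text, not in table
```

The returned differences are exactly what the error-correction variant would output as the suspected error in the table or the text.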
FIG. 5 is a block diagram illustrating an apparatus 200 for extracting tuples from text according to an embodiment of the present invention. As shown in fig. 5, the apparatus 200 for extracting tuples from text includes a training module 201, a splitting module 202, a first calculating module 203, a second calculating module 204, a classifying module 205, and a first extracting module 206.
The training module 201 is configured to input training data with category identifiers into a recurrent neural network, and train to obtain network parameters of the recurrent neural network, where the training data with category identifiers includes legal training data and illegal training data, and the recurrent neural network includes multiple sub-networks.
The splitting module 202 is configured to identify a multi-element entity in the text to be tested, and perform word segmentation on other parts of the text to be tested except the multi-element entity, where the text to be tested includes characters.
The first calculation module 203 is configured to correspondingly input word vectors of the multi-element entities and word vectors of words obtained after word segmentation into multiple subnetworks one by one according to an arrangement sequence in the text to be tested, and obtain hidden vectors output by each subnetwork by combining network parameters, wherein the hidden vectors output by a previous subnetwork are used as inputs of a next subnetwork.
The second calculating module 204 is configured to perform integration calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector.
The classification module 205 is configured to classify the judgment vector by using the network parameter to obtain a classification result, where the classification result includes legal and illegal.
A first extraction module 206 configured to extract the tuple whose classification result is legal as a legal tuple.
An embodiment of the invention provides an apparatus 200 for extracting tuples from characters. The training module 201 trains the recurrent neural network with training data carrying category identifiers to obtain the network parameters of the recurrent neural network. The first calculation module 203 inputs the word vectors of the multi-element entities identified by the splitting module 202, together with the word vectors of the words obtained by segmenting the remaining parts of the text to be tested, into the sub-networks of the recurrent neural network in one-to-one correspondence, following their order of appearance in the text, and obtains the hidden vector output by each sub-network by combining the network parameters. The second calculating module 204 integrates the hidden vectors to obtain a judgment vector. The classification module 205 classifies the judgment vector as legal or illegal, and the first extraction module 206 extracts the tuples classified as legal. Compared with the prior art, which extracts tuples from characters according to pre-constructed extraction rules, this embodiment uses a recurrent neural network in which the hidden vector output by the previous sub-network serves as the input of the next sub-network, so that the multi-element entities are connected with the other parts of the text to be tested. When the recurrent neural network is trained to obtain the network parameters, it can therefore learn the legality rules of tuples in many types of text, identify and extract legal tuples from more types of text, and improve the accuracy of tuple extraction.
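As a concrete illustration of this data flow, the following is a minimal pure-Python sketch. The toy dimensions, the tanh cell, the hand-set weights, and the linear classifier are all assumptions for illustration; the patent does not fix the sub-network architecture or the classifier.

```python
import math

DIM = 3  # toy word-vector / hidden-vector size; the patent does not fix it

def rnn_step(x, h_prev, W_xh, W_hh):
    # One sub-network: combine the current word vector x with the hidden
    # vector h_prev output by the previous sub-network (tanh, bias omitted).
    return [math.tanh(sum(W_xh[i][j] * xj for j, xj in enumerate(x)) +
                      sum(W_hh[i][j] * hj for j, hj in enumerate(h_prev)))
            for i in range(DIM)]

def hidden_vectors(word_vectors, W_xh, W_hh):
    # Feed the entity word vectors and the segmented-word vectors, in their
    # order of appearance in the text, through the chained sub-networks and
    # collect the hidden vector output by each sub-network.
    h, out = [0.0] * DIM, []
    for x in word_vectors:
        h = rnn_step(x, h, W_xh, W_hh)
        out.append(h)
    return out

def judge(hiddens, entity_positions, w_cls):
    # Integrate (here: element-wise mean) the hidden vectors at the positions
    # of the entities forming the candidate tuple, then apply a toy linear
    # classifier to the resulting judgment vector.
    n = len(entity_positions)
    judgment = [sum(hiddens[p][i] for p in entity_positions) / n
                for i in range(DIM)]
    score = sum(w * v for w, v in zip(w_cls, judgment))
    return "legal" if score >= 0 else "illegal"
```

With identity input weights and zero recurrent weights, each hidden vector is simply the tanh of the word vector, which makes the chaining easy to trace; in the trained network, W_xh, W_hh, and w_cls would be the learned network parameters.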
In one example, the multi-element entity includes at least two of a time entity, an attribute entity, a value entity, and a qualifier entity.
FIG. 6 is a block diagram of an apparatus 200 for extracting tuples from text according to another embodiment of the present invention. FIG. 6 differs from FIG. 5 in that the apparatus 200 for extracting tuples from text further includes a tuple generating module 207.
The multi-element group generating module 207 is configured to arrange and combine the multi-element entities to generate at least one multi-element group.
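The arrangement-and-combination step can be sketched as follows. The dict-based entity representation and the `generate_tuples` name are assumptions; the patent only requires that the identified entities be permuted and combined into candidate tuples for the classifier to judge.

```python
from itertools import product

def generate_tuples(entities):
    # entities: hypothetical mapping from entity type to the entities of
    # that type found in the text, e.g. {"time": [...], "value": [...]}.
    # Every cross-type combination becomes one candidate tuple; the
    # classifier then keeps only the legal ones.
    types = sorted(entities)
    return [dict(zip(types, combo))
            for combo in product(*(entities[t] for t in types))]
```

For example, one time entity, one attribute entity, and two value entities yield two candidate tuples, of which at most one would normally be classified as legal.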
It should be noted that the second computing module 204 in the above embodiments may include the first computing unit 2041 or the second computing unit 2042.
The first calculating unit 2041 is configured to calculate a mean vector of hidden vectors corresponding to the multiple entities forming the multiple elements, and use the mean vector as a judgment vector.
The second calculating unit 2042 is configured to perform weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and use the vectors obtained through the weighted calculation as the judgment vectors.
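Both integration options can be sketched in a few lines. Hidden vectors are represented as plain lists; the source of the weights (learned or preset) is left open here, as the patent does not fix it.

```python
def mean_judgment(hiddens):
    # First option (unit 2041): element-wise mean of the hidden vectors
    # corresponding to the entities that form the tuple.
    n = len(hiddens)
    return [sum(col) / n for col in zip(*hiddens)]

def weighted_judgment(hiddens, weights):
    # Second option (unit 2042): element-wise weighted sum, one weight
    # per entity hidden vector.
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*hiddens)]
```

The mean is the special case of the weighted calculation with all weights equal to 1/n.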
FIG. 7 is a block diagram of an apparatus 200 for extracting tuples from text according to yet another embodiment of the present invention. FIG. 7 differs from FIG. 5 in that the apparatus 200 for extracting tuples from text further includes a second extraction module 208, a comparison module 209, and an error reporting module 210.
In this embodiment, the text to be tested further includes a table matching the characters. The second extraction module 208 is configured to extract tuples from the table.
The comparison module 209 is configured to compare the tuple extracted from the table with the legal tuple extracted from the characters.
The error reporting module 210 is configured to generate prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the characters.
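The behaviour of modules 208-210 might look like the sketch below. Tuples are modelled as dicts and the prompt wording is an assumption; the patent only requires that an inconsistency produce prompt information.

```python
def check_table_against_text(table_tuples, legal_text_tuples):
    # Compare the tuples extracted from the table with the legal tuples
    # extracted from the running text; every tuple present on only one
    # side is inconsistent and produces one prompt message.
    table_set = {tuple(sorted(t.items())) for t in table_tuples}
    text_set = {tuple(sorted(t.items())) for t in legal_text_tuples}
    return ["inconsistent tuple: %s" % dict(d)
            for d in sorted(table_set ^ text_set)]
```

Reporting the symmetric difference rather than a single boolean also supports the error-correction use: each prompt names a concrete tuple that appears in only the table or only the characters.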
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may refer to one another, and each embodiment focuses on its differences from the others. For the apparatus embodiments, reference may be made to the description of the corresponding method embodiments. The present invention is not limited to the specific steps and structures described above and shown in the drawings. Those skilled in the art may make changes, modifications, and additions, or change the order of steps, after appreciating the spirit of the invention. A detailed description of known processing techniques is omitted here for brevity.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
Claims (8)
1. A method for extracting tuples from characters is characterized by comprising the following steps:
inputting training data with class identification into a recurrent neural network, training to obtain network parameters of the recurrent neural network, wherein the training data with the class identification comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks;
identifying a multi-element entity in a text to be detected, and segmenting words of other parts except the multi-element entity in the text to be detected, wherein the text to be detected comprises characters;
according to the arrangement sequence in the text to be tested, correspondingly inputting the word vectors of the multi-element entities and the word vectors of the words obtained after word segmentation into the plurality of sub-networks one by one, and combining the network parameters to obtain the hidden vectors output by each sub-network, wherein the hidden vector output by the previous sub-network is used as the input of the next sub-network;
performing integration calculation on hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector;
classifying the judgment vector by using the network parameters to obtain a classification result, wherein the classification result includes legal and illegal;
extracting the multi-tuple of which the classification result is legal as a legal multi-tuple;
the integrating calculation of the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain the judgment vector comprises the following steps:
calculating a mean vector of hidden vectors corresponding to the multi-element entities forming the multi-element group, and taking the mean vector as the judgment vector;
or,
performing weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and taking the vector obtained through the weighted calculation as the judgment vector.
2. The method of claim 1, wherein the multi-element entity comprises at least two of a time entity, an attribute entity, a value entity, and a qualifier entity.
3. The method of claim 1, wherein before the integrating the hidden vectors corresponding to the multi-element entities forming the multi-element set to obtain the judgment vector, the method further comprises:
and arranging and combining the multiple entities to generate at least one multiple group.
4. The method of claim 1, wherein the text to be tested further comprises a table matching the characters, the method further comprising:
extracting tuples from the table;
comparing the tuple extracted from the table with the legal tuple extracted from the characters;
generating prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the characters.
5. An apparatus for extracting tuples from words, comprising:
the training module is configured to input training data with class identification into a recurrent neural network, and train to obtain network parameters of the recurrent neural network, wherein the training data with the class identification comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks;
the splitting module is configured to identify a multi-element entity in a text to be tested, and perform word segmentation on other parts except the multi-element entity in the text to be tested, wherein the text to be tested comprises characters;
the first calculation module is configured to correspondingly input word vectors of the multi-element entities and word vectors of words obtained after word segmentation into the plurality of sub-networks one by one according to the arrangement sequence in the text to be detected, and obtain hidden vectors output by each sub-network by combining the network parameters, wherein the hidden vector output by the previous sub-network is used as the input of the next sub-network;
the second calculation module is configured to perform integrated calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector;
the classification module is configured to classify the judgment vector by using the network parameters to obtain a classification result, wherein the classification result includes legal and illegal;
a first extraction module configured to extract the tuples of which the classification results are legal as legal tuples;
the second computing module, comprising:
a first calculation unit configured to calculate a mean vector of the hidden vectors corresponding to the multi-element entities forming the multi-element group, and take the mean vector as the judgment vector;
or,
a second calculation unit configured to perform weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and take the vector obtained through the weighted calculation as the judgment vector.
6. The apparatus of claim 5, wherein the multi-element entity comprises at least two of a time entity, an attribute entity, a value entity, and a qualifier entity.
7. The apparatus of claim 5, further comprising:
and the multi-element group generating module is configured to arrange and combine the multi-element entities to generate at least one multi-element group.
8. The apparatus of claim 5, wherein the text to be tested further comprises a table matching the characters, the apparatus further comprising:
a second extraction module configured to extract tuples from the table;
a comparison module configured to compare the tuple extracted from the table with a legal tuple extracted from the text;
an error reporting module configured to generate prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710280347.XA CN108733636B (en) | 2017-04-25 | 2017-04-25 | Method and device for extracting multiple tuples from characters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108733636A CN108733636A (en) | 2018-11-02 |
CN108733636B true CN108733636B (en) | 2021-07-13 |
Family
ID=63934675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710280347.XA Active CN108733636B (en) | 2017-04-25 | 2017-04-25 | Method and device for extracting multiple tuples from characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733636B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701888B (en) * | 2023-08-09 | 2023-10-17 | 国网浙江省电力有限公司丽水供电公司 | Auxiliary model data processing method and system for clean energy enterprises |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105261358A (en) * | 2014-07-17 | 2016-01-20 | 中国科学院声学研究所 | N-gram grammar model constructing method for voice identification and voice identification system |
CN106294325A (en) * | 2016-08-11 | 2017-01-04 | 海信集团有限公司 | The optimization method and device of spatial term statement |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9195656B2 (en) * | 2013-12-30 | 2015-11-24 | Google Inc. | Multilingual prosody generation |
Non-Patent Citations (1)
Title |
---|
A tuple-based method for identifying the language of a text; Liu Min et al.; Computer Applications (《计算机应用》); 2005-12-31; Vol. 25; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||