CN108733636B - Method and device for extracting multiple tuples from characters - Google Patents

Publication number
CN108733636B
Authority
CN
China
Prior art keywords
text
entity
network
vectors
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710280347.XA
Other languages
Chinese (zh)
Other versions
CN108733636A (en)
Inventor
林得苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pai Tech Co ltd
Original Assignee
Pai Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pai Tech Co ltd
Priority to CN201710280347.XA
Publication of CN108733636A
Application granted
Publication of CN108733636B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a method and a device for extracting tuples from text, and relates to the field of text processing. The method comprises: inputting training data labeled as legal or illegal into a recurrent neural network to obtain network parameters; identifying the multi-element entities in the text to be tested, and segmenting the remaining parts of the text into words; inputting, in the order in which they appear in the text to be tested, the word vectors of the multi-element entities and the word vectors of the segmented words into a plurality of sub-networks in one-to-one correspondence, and obtaining the hidden vector output by each sub-network using the network parameters, wherein the hidden vector output by the previous sub-network is an input of the next sub-network; performing an integration calculation on the hidden vectors corresponding to the multi-element entities forming a tuple to obtain a judgment vector; classifying the judgment vector using the network parameters to obtain a classification result; and extracting each tuple whose classification result is legal as a legal tuple. The accuracy of tuple extraction can thereby be improved.

Description

Method and device for extracting multiple tuples from characters
Technical Field
The invention relates to the field of text processing, in particular to a method and a device for extracting multiple tuples from characters.
Background
In daily work, study and life, people deal with many texts such as reports, statements and documents, and the important information in these texts can be represented as tuples. For example, from the text "the company's management cost in 2013 was 2306 ten thousand yuan", the triple [2013, company management cost, 2306 ten thousand yuan] may be extracted, and this triple captures the important information of the text.
At present, tuples are extracted from text according to rules constructed in advance. Taking triple extraction as an example: first, the attribute entity, time entity and value entity of the triple are initialized to null; the text is then scanned from front to back to obtain the entities it contains; if an entity is an attribute entity, it is recorded as the latest attribute entity; if it is a value entity, it is added to the queue of value entities; if it is a time entity, it is added to the queue of time entities; and if the queue of value entities and the queue of time entities have the same length and the attribute entity is not null, the triple [the i-th time entity, the latest attribute entity, the i-th value entity] is extracted. However, when a text contains several attribute entities, extracting triples according to this rule produces errors. For example, for the text "in 2013 the publisher's revenue, gross profit and gross profit margin were 99340.49 ten thousand yuan, 64478.58 ten thousand yuan and 64.91% respectively", only the triple [2013, gross profit margin, 99340.49 ten thousand yuan] can be extracted according to the rule, and the text itself shows that this triple is wrong. The accuracy of tuple extraction in the prior art is therefore low.
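The failure of the rule described above can be reproduced with a minimal sketch. The function name `rule_based_triples`, the `(kind, text)` input format, the choice to run the length check after every entity, and the reset of the queues after a triple is emitted are all assumptions made for illustration; the text does not specify them.

```python
def rule_based_triples(entities):
    """entities: list of (kind, text) pairs in reading order,
    where kind is 'time', 'attribute', or 'value'."""
    latest_attribute = None
    times, values = [], []
    triples = []
    for kind, text in entities:
        if kind == "attribute":
            latest_attribute = text      # only the latest attribute is remembered
        elif kind == "value":
            values.append(text)
        elif kind == "time":
            times.append(text)
        # Assumption: the queue-length check runs after every scanned entity.
        if times and len(times) == len(values) and latest_attribute is not None:
            triples.extend((t, latest_attribute, v) for t, v in zip(times, values))
            times, values = [], []       # assumption: queues reset after emitting
    return triples


entities = [
    ("time", "2013"),
    ("attribute", "revenue"),
    ("attribute", "gross profit"),
    ("attribute", "gross profit margin"),
    ("value", "99340.49 ten thousand yuan"),
    ("value", "64478.58 ten thousand yuan"),
    ("value", "64.91%"),
]
print(rule_based_triples(entities))
```

Because only the latest attribute entity is remembered, the first value is paired with "gross profit margin" and the two remaining values are never matched, which is the erroneous triple described above.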
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting tuples from text, which can improve the accuracy of tuple extraction.
In a first aspect, an embodiment of the present invention provides a method for extracting tuples from text, including: inputting training data with class identifiers into a recurrent neural network and training it to obtain network parameters of the recurrent neural network, wherein the training data with class identifiers comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks; identifying the multi-element entities in the text to be tested, and segmenting the parts of the text other than the multi-element entities into words, wherein the text to be tested comprises characters; inputting, in the order in which they appear in the text to be tested, the word vectors of the multi-element entities and the word vectors of the segmented words into the plurality of sub-networks in one-to-one correspondence, and obtaining the hidden vector output by each sub-network using the network parameters, wherein the hidden vector output by the previous sub-network is used as an input of the next sub-network; performing an integration calculation on the hidden vectors corresponding to the multi-element entities forming a tuple to obtain a judgment vector; classifying the judgment vector using the network parameters to obtain a classification result, wherein the classification result is either legal or illegal; and extracting each tuple whose classification result is legal as a legal tuple.
In a second aspect, an embodiment of the present invention provides an apparatus for extracting tuples from text, including: a training module configured to input training data with class identifiers into a recurrent neural network and train it to obtain network parameters of the recurrent neural network, wherein the training data with class identifiers comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks; a splitting module configured to identify the multi-element entities in the text to be tested and segment the parts of the text other than the multi-element entities into words, wherein the text to be tested comprises characters; a first calculation module configured to input, in the order in which they appear in the text to be tested, the word vectors of the multi-element entities and the word vectors of the segmented words into the plurality of sub-networks in one-to-one correspondence, and to obtain the hidden vector output by each sub-network using the network parameters, wherein the hidden vector output by the previous sub-network is used as an input of the next sub-network; a second calculation module configured to perform an integration calculation on the hidden vectors corresponding to the multi-element entities forming a tuple to obtain a judgment vector; a classification module configured to classify the judgment vector using the network parameters to obtain a classification result, wherein the classification result is either legal or illegal; and a first extraction module configured to extract each tuple whose classification result is legal as a legal tuple.
The embodiment of the invention provides a method and a device for extracting tuples from text, in which a recurrent neural network is trained with training data carrying class identifiers to obtain the network parameters of the recurrent neural network. The word vectors of the multi-element entities and the word vectors of the words obtained by segmenting the other parts of the text to be tested are input into the sub-networks of the recurrent neural network in one-to-one correspondence, following the order in which the entities and words appear in the text, and the hidden vector output by each sub-network is obtained using the network parameters. A judgment vector is obtained by an integration calculation on the hidden vectors and is classified to obtain a classification result of legal or illegal. Each tuple whose classification result is legal is extracted as a legal tuple. In contrast to the prior art, which extracts tuples from text according to pre-constructed extraction rules, the embodiment of the invention uses a recurrent neural network in which the hidden vector output by the previous sub-network is used as an input of the next sub-network, so that the multi-element entities are linked with the other parts of the text to be tested. Consequently, the network parameters obtained by training the recurrent neural network capture the rules governing legal tuples in various types of text. Legal tuples can therefore be identified and extracted from more types of text, and the accuracy of tuple extraction is improved.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
FIG. 1 is a flowchart illustrating a method for extracting tuples from text according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating an application architecture of a method for extracting triples from text according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for extracting tuples from text according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for extracting tuples in text according to another embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus for extracting tuples according to another embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention. The present invention is in no way limited to any specific configuration and algorithm set forth below, but rather covers any modification, replacement or improvement of elements, components or algorithms without departing from the spirit of the invention. In the drawings and the following description, well-known structures and techniques are not shown in order to avoid unnecessarily obscuring the present invention.
The embodiment of the invention provides a method and a device for extracting tuples from text. A recurrent neural network is trained in advance with training data carrying class identifiers. The multi-element entities in the text to be tested are identified, and the multi-element entities and the other parts of the text to be tested are input into the sub-networks of the trained recurrent neural network. A judgment vector is obtained from the hidden vectors output by the sub-networks. The judgment vector is classified using the network parameters of the recurrent neural network, and each tuple whose classification result is legal is extracted as a legal tuple. The embodiment of the invention uses the recurrent neural network to link the multi-element entities with the other parts of the text to be tested, and thereby learns the rules governing legal tuples in various types of text. Legal tuples can therefore be identified and extracted from more types of text, and the accuracy of tuple extraction is improved.
FIG. 1 is a flowchart illustrating a method for extracting tuples from text according to an embodiment of the present invention. As shown in fig. 1, the method for extracting multiple tuples from text includes steps 101 to 106.
In step 101, training data with class identifiers are input into a recurrent neural network, and network parameters of the recurrent neural network are obtained through training.
The training data with the class identification comprises legal training data and illegal training data. The category identification includes legal, which means correct here, and illegal, which means wrong here. In one example, the legitimate training data may include the correct tuple and the illegitimate training data may include the incorrect tuple.
The recurrent neural network (RNN) includes a plurality of sub-networks. The sub-networks are connected in sequence, with the output of each sub-network connected to the input of the next. That is, the output of the previous sub-network serves as an input of the next sub-network and participates in its computation. In one example, the sub-networks share the network parameters of the recurrent neural network, that is, all sub-networks have the same network parameters. In one illustrative example, the network parameters may be the parameters of an LSTM (Long Short-Term Memory) unit.
In step 102, the multi-element entities in the text to be tested are identified, and the parts of the text other than the multi-element entities are segmented into words.
The text to be tested comprises characters. The characters here include Chinese characters, letters, numbers, punctuation marks and characters of other languages. The multi-element entities correspond to the tuples that the embodiment of the invention aims to extract. In one example, the multi-element entities include at least two of a time entity, an attribute entity, a value entity and a qualifier entity. In one example, the qualifier entity modifies the attribute entity, but it is not limited thereto.
The multi-element entities in the text to be tested are identified, and the other parts of the text are segmented into words. For example, the text to be tested is "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The time entities identified in the text are "2012" and "2013", the attribute entity is "sales", and the value entities are "100 ten thousand yuan" and "200 ten thousand yuan". Segmenting the other parts of the text yields the remaining words and punctuation marks, such as "and", "company" and ".".
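Step 102 can be pictured as turning the text into one ordered sequence in which each multi-element entity stays whole and everything else is segmented. The sketch below is only illustrative: it assumes the entity spans are already known, uses a simple regular expression in place of a real word segmenter, and works on an English paraphrase of the example; the names `find_span` and `split_text` are hypothetical.

```python
import re

WORD_PATTERN = r"\w+|[^\w\s]"  # assumption: a stand-in for a real word segmenter


def find_span(text, surface, kind):
    """Locate an entity by its surface form and return (start, end, kind)."""
    start = text.index(surface)
    return (start, start + len(surface), kind)


def split_text(text, entity_spans):
    """Return (token, kind) pairs in reading order.
    kind is None for ordinary segmented words."""
    sequence = []
    cursor = 0
    for start, end, kind in sorted(entity_spans):
        # Segment the non-entity text that precedes this entity.
        sequence += [(w, None) for w in re.findall(WORD_PATTERN, text[cursor:start])]
        sequence.append((text[start:end], kind))   # the entity is kept whole
        cursor = end
    sequence += [(w, None) for w in re.findall(WORD_PATTERN, text[cursor:])]
    return sequence


text = "In 2012 and 2013, the sales of the company were 100 and 200."
spans = [
    find_span(text, "2012", "time"),
    find_span(text, "2013", "time"),
    find_span(text, "sales", "attribute"),
    find_span(text, "100", "value"),
    find_span(text, "200", "value"),
]
sequence = split_text(text, spans)
```

The resulting sequence preserves the arrangement order of the text, which is what the later steps rely on when assigning one sub-network per item.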
In step 103, according to the arrangement sequence in the text to be tested, the word vectors of the multi-element entities and the word vectors of the words obtained after word segmentation are input into the multiple subnetworks in a one-to-one correspondence manner, and the hidden vectors output by each subnetwork are obtained by combining network parameters.
The word vectors of the multi-element entities and the word vectors of the segmented words are input into the plurality of sub-networks in one-to-one correspondence, following the order in which the multi-element entities and the segmented words appear in the text to be tested. In one example, assume that the tuples to be extracted are triples; FIG. 2 is a schematic diagram of the application architecture of the triple extraction method in this example. As shown in FIG. 2, the word vectors $w_1, \dots, w_i, \dots, w_j, \dots, w_k, \dots, w_n$ are the word vectors of the multi-element entities and of the segmented words, arranged in the order in which they appear in the text to be tested. As can be seen from FIG. 2, these word vectors correspond one-to-one to the sub-networks in the recurrent neural network.
Here, $w_i$, $w_j$ and $w_k$ are the word vectors of the multi-element entities. It should be noted that the network parameters obtained by training the recurrent neural network include the word vectors. In one example, each multi-element entity and each segmented word has a position identifier in a preset dictionary, and each position identifier in the dictionary corresponds to a word vector. That is, every multi-element entity and every segmented word corresponds to a word vector. If a multi-element entity or a segmented word matches none of the words in the dictionary, it is represented by an unknown identifier. The unknown identifier may be expressed in any suitable way, which is not limited here.
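The dictionary lookup with an unknown identifier can be sketched as follows. The identifier value `UNKNOWN_ID = 0`, the toy dictionary and the two-dimensional embedding table are assumptions made for illustration; in the described method the word vectors themselves are part of the trained network parameters.

```python
UNKNOWN_ID = 0  # hypothetical position reserved for the unknown identifier


def to_ids(tokens, dictionary):
    """Map each token to its position identifier, or to UNKNOWN_ID if absent."""
    return [dictionary.get(token, UNKNOWN_ID) for token in tokens]


def to_vectors(ids, embedding_table):
    """Look up the word vector stored at each position identifier."""
    return [embedding_table[i] for i in ids]


dictionary = {"2012": 1, "sales": 2}          # toy dictionary (assumption)
embedding_table = [
    [0.0, 0.0],   # row 0: vector of the unknown identifier
    [0.1, 0.2],   # row 1: "2012"
    [0.3, 0.4],   # row 2: "sales"
]

ids = to_ids(["2012", "profit", "sales"], dictionary)
vectors = to_vectors(ids, embedding_table)
```

Here "profit" is not in the toy dictionary, so it falls back to the unknown identifier and shares that identifier's vector.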
Specifically, the position identifier in the dictionary may be a position number. For example, the text to be tested is "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The time entities are "2012" and "2013", the attribute entity is "sales", and the value entities are "100 ten thousand yuan" and "200 ten thousand yuan". The position identifiers are shown in Table 1 below:
Table 1
[Table 1 is an image in the original publication and is not reproduced here.]
The identifiers in Table 1 correspond to word vectors, and different identifiers may correspond to different word vectors. The word vectors corresponding to the identifiers in Table 1 are input into the respective sub-networks to obtain the hidden vector output by each sub-network. It should be noted that the hidden vector output by the previous sub-network is used as an input of the next sub-network. As shown in FIG. 2, for example, the input of sub-network i includes the word vector $w_i$ of the i-th word in the text to be tested and the hidden vector $h_{i-1}$ output by sub-network i-1.
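The chaining of sub-networks can be illustrated with a minimal sketch in which every step applies the same parameters and receives the previous hidden vector. For brevity this uses a plain tanh recurrent cell rather than the LSTM unit mentioned earlier; the parameter names `W_x`, `W_h` and `b`, the zero initial hidden vector, and the toy values are assumptions.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One sub-network: h = tanh(W_x x + W_h h_prev + b)."""
    h = []
    for r in range(len(b)):
        total = b[r]
        total += sum(W_x[r][c] * x[c] for c in range(len(x)))
        total += sum(W_h[r][c] * h_prev[c] for c in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_chain(word_vectors, W_x, W_h, b):
    """Feed the word vectors in text order; all steps share W_x, W_h and b."""
    h = [0.0] * len(b)          # assumption: zero initial hidden vector
    hidden_vectors = []
    for x in word_vectors:
        h = rnn_step(x, h, W_x, W_h, b)   # previous hidden vector is an input
        hidden_vectors.append(h)
    return hidden_vectors


W_x = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
hidden = run_chain([[1.0, 0.0], [0.0, 1.0]], W_x, W_h, b)
```

The list returned by `run_chain` holds one hidden vector per input position, so the hidden vectors of the entity positions can later be picked out by index.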
In step 104, the implicit vectors corresponding to the multi-element entities forming the multi-element group are integrated and calculated to obtain a judgment vector.
For example, if the tuple to be extracted by the method of the embodiment of the present invention includes a time entity, an attribute entity and a value entity, the hidden vectors used in the integration calculation are the hidden vector corresponding to the time entity, the hidden vector corresponding to the attribute entity and the hidden vector corresponding to the value entity. For example, as shown in FIG. 2, the word vectors of the multi-element entities are $w_i$, $w_j$ and $w_k$, so the hidden vectors used in the integration calculation are $h_i$, $h_j$ and $h_k$; integrating $h_i$, $h_j$ and $h_k$ yields the judgment vector $d$. The integration calculation is a calculation that integrates two or more vectors into one vector.
In step 105, the judgment vectors are classified by using the network parameters to obtain a classification result.
The network parameters obtained by training in step 101 may be used as a basis for classification in step 105 to classify the determination vectors, thereby obtaining a classification result. Wherein, the classification result includes legal and illegal. That is, through classification, it can be known whether the tuple consisting of the multi-element entities corresponding to the hidden vectors participating in the integration calculation is the correct tuple. And if the classification result is legal, indicating that the multi-element group formed by the multi-element entities corresponding to the hidden vectors participating in the integration calculation is a correct multi-element group. And if the classification result is illegal, indicating that the multi-element group formed by the multi-element entities corresponding to the hidden vectors participating in the integration calculation is an error multi-element group. In one example, the decision vector may be classified using a multi-dimensional classification model, such as a softmax classification model. Alternatively, the judgment vectors may be classified by using a Machine learning model, for example, an SVM (Support Vector Machine) model. In one example, the classification result may be represented by a number, with a number 1 representing legal and a number 0 representing illegal. The classification result may be expressed in other manners, and is not limited herein.
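As one possible reading of the softmax option, the judgment vector can be scored against two classes and the more probable class returned as 1 (legal) or 0 (illegal). The weight matrix `W`, bias `b` and their toy values are assumptions; in the described method they would be part of the trained network parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(d, W, b):
    """Return 1 (legal) or 0 (illegal) for judgment vector d.
    Row 0 of W scores 'illegal', row 1 scores 'legal' (assumed layout)."""
    scores = [sum(w * x for w, x in zip(row, d)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(scores)
    return 1 if probs[1] > probs[0] else 0


W = [[-1.0, 0.0], [1.0, 0.0]]   # toy parameters (assumption)
b = [0.0, 0.0]
print(classify([2.0, 0.5], W, b))    # scores are [-2.0, 2.0]
print(classify([-2.0, 0.5], W, b))   # scores are [2.0, -2.0]
```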
In step 106, the tuples whose classification result is legal are extracted as legal tuples.
The embodiment of the invention provides a method for extracting tuples from text, in which a recurrent neural network is trained with training data carrying class identifiers to obtain the network parameters of the recurrent neural network. The word vectors of the multi-element entities and the word vectors of the words obtained by segmenting the other parts of the text to be tested are input into the sub-networks of the recurrent neural network in one-to-one correspondence, following the order in which the entities and words appear in the text, and the hidden vector output by each sub-network is obtained using the network parameters. A judgment vector is obtained by an integration calculation on the hidden vectors and is classified to obtain a classification result of legal or illegal. Each tuple whose classification result is legal is extracted as a legal tuple. In contrast to the prior art, which extracts tuples from text according to pre-constructed extraction rules, the embodiment of the invention uses a recurrent neural network in which the hidden vector output by the previous sub-network is used as an input of the next sub-network, so that the multi-element entities are linked with the other parts of the text to be tested. Consequently, the network parameters obtained by training the recurrent neural network capture the rules governing legal tuples in various types of text. Legal tuples can therefore be identified and extracted from more types of text, and the accuracy of tuple extraction is improved.
It should be noted that the extracted legal tuples can also be input into the recurrent neural network as legal training data to train and update parameters of the recurrent neural network, so that legal rules of the tuples in various types of texts are further enriched, and the accuracy of extracting the tuples is further improved.
FIG. 3 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention. Fig. 3 is different from fig. 1 in that the method for extracting tuples from text may further include step 107, and step 104 in fig. 1 may be specifically detailed as step 1041 or step 1042.
In step 107, the multi-element entities are permuted and combined to generate at least one tuple.
The multi-element entities identified in the text to be tested generally belong to two or more types. Multi-element entities of different types are permuted and combined to generate at least one tuple. For example, the text to be tested is "In 2012 and 2013, the company's sales were 100 ten thousand yuan and 200 ten thousand yuan respectively." The text contains three types of multi-element entity: time entities, attribute entities and value entities. The time entities are "2012" and "2013", the attribute entity is "sales", and the value entities are "100 ten thousand yuan" and "200 ten thousand yuan". Combining the three types of multi-element entity therefore yields 2 × 1 × 2 = 4 triples, namely [2012, sales, 100 ten thousand yuan], [2012, sales, 200 ten thousand yuan], [2013, sales, 100 ten thousand yuan] and [2013, sales, 200 ten thousand yuan]. For each of the 4 triples, the hidden vectors corresponding to its multi-element entities are integrated to obtain a judgment vector. The judgment vectors are then classified, and the triples whose classification result is legal are extracted as legal triples.
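The candidate generation in step 107 amounts to a Cartesian product over the entity types. A minimal sketch using the standard library follows; the function name `candidate_tuples` and the dictionary layout are assumptions.

```python
from itertools import product


def candidate_tuples(entities_by_type, type_order):
    """Combine one entity of each type, in the given type order."""
    return list(product(*(entities_by_type[t] for t in type_order)))


entities_by_type = {
    "time": ["2012", "2013"],
    "attribute": ["sales"],
    "value": ["100 ten thousand yuan", "200 ten thousand yuan"],
}
candidates = candidate_tuples(entities_by_type, ["time", "attribute", "value"])
for triple in candidates:
    print(triple)
```

This produces the 2 × 1 × 2 = 4 candidate triples of the example; each candidate would then be scored by the classifier rather than accepted outright.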
In step 1041, a mean vector of the hidden vectors corresponding to the multi-element entities forming the multi-element group is calculated, and the mean vector is used as a judgment vector.
In one example, assume that the extracted tuples are triples, $h_i$, $h_j$ and $h_k$ are the hidden vectors corresponding to the multi-element entities forming a triple, and $d$ is the judgment vector. The judgment vector is then calculated by equation (1):
$$d = \frac{h_i + h_j + h_k}{3} \qquad (1)$$
The judgment vector $d$ in this example is calculated using a mean algorithm.
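Equation (1) is an element-wise mean. A small sketch, with the hypothetical name `mean_vector` and toy two-dimensional hidden vectors, shows the computation:

```python
def mean_vector(hidden_vectors):
    """Element-wise mean of equally sized hidden vectors."""
    n = len(hidden_vectors)
    return [sum(components) / n for components in zip(*hidden_vectors)]


h_i, h_j, h_k = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]   # toy hidden vectors
d = mean_vector([h_i, h_j, h_k])
print(d)  # [3.0, 4.0]
```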
In step 1042, a weight calculation is performed on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and the vectors obtained by the weight calculation are used as judgment vectors.
In one example, assume that the extracted tuples are triples, $h_i$, $h_j$ and $h_k$ are the hidden vectors corresponding to the multi-element entities forming a triple, and $d$ is the judgment vector. The judgment vector is then calculated by equation (2):
$$d = h_i \times m_i + h_j \times m_j + h_k \times m_k \qquad (2)$$
where $m_i$, $m_j$ and $m_k$ are weighting coefficients and $m_i + m_j + m_k = 1$. The judgment vector $d$ in this example is calculated using a weighting algorithm.
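Equation (2) can be sketched in the same way. The function name `weighted_vector` and the toy weights are assumptions; the check that the weights sum to 1 mirrors the constraint stated above.

```python
import math


def weighted_vector(hidden_vectors, weights):
    """Element-wise weighted sum of hidden vectors; weights must sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weighting coefficients must sum to 1")
    return [sum(m * x for m, x in zip(weights, components))
            for components in zip(*hidden_vectors)]


h_i, h_j, h_k = [2.0, 0.0], [4.0, 4.0], [0.0, 8.0]   # toy hidden vectors
d = weighted_vector([h_i, h_j, h_k], [0.5, 0.25, 0.25])
print(d)  # [2.0, 3.0]
```

With equal weights of 1/3 each, this reduces to the mean of equation (1).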
The integration calculation used to obtain the judgment vector includes, but is not limited to, the two methods described above. Any algorithm capable of integrating two or more hidden vectors into one judgment vector falls within the protection scope of the embodiment of the invention.
FIG. 4 is a flowchart illustrating a method for extracting tuples from text according to another embodiment of the present invention. Fig. 4 is different from fig. 1 in that the method for extracting tuples from text further includes steps 108 to 110.
In step 108, tuples are extracted from the table.
The text to be tested includes a table that accompanies the characters. All the tuples recorded in the table are extracted. Since techniques for extracting tuples from tables are mature, they are not described here.
In step 109, the multi-tuple extracted from the table is compared with the legitimate multi-tuple extracted from the text.
Legal tuples can be extracted from the characters by the steps of the above embodiments. The tuples extracted from the table in step 108 are compared with the legal tuples extracted from the characters to determine whether they are consistent.
In step 110, if the tuples extracted from the table are inconsistent with the legal tuples extracted from the characters, prompt information is generated.
If the tuples extracted from the table are inconsistent with the legal tuples extracted from the characters, either the tuples extracted from the table or the tuples extracted from the characters are erroneous, which means that the content recorded in the table or in the characters contains an error. In this case, prompt information is generated to report the error, thereby realizing an error-reporting function for the tuples in the table or in the characters.
The difference between the tuples extracted from the table and the legal tuples extracted from the characters may also be output as the error in the table or in the characters, thereby realizing an error-correction function for the tuples in the table or in the characters.
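Steps 109 and 110 can be sketched as a set comparison that returns prompt information only when the two sources disagree. The function name `compare_tuples`, the returned dictionary layout and the message text are assumptions made for illustration.

```python
def compare_tuples(table_tuples, text_tuples):
    """Return None when both sources agree, otherwise prompt information
    listing the tuples found in only one of the two sources."""
    table_set, text_set = set(table_tuples), set(text_tuples)
    only_in_table = sorted(table_set - text_set)
    only_in_text = sorted(text_set - table_set)
    if not only_in_table and not only_in_text:
        return None
    return {
        "message": "tuples in the table and in the text are inconsistent",
        "only_in_table": only_in_table,
        "only_in_text": only_in_text,
    }


table = [("2012", "sales", "100"), ("2013", "sales", "200")]
text = [("2012", "sales", "100"), ("2013", "sales", "250")]
prompt = compare_tuples(table, text)
```

The two difference lists correspond to the "difference" that may be output as the error, while a return value of `None` means no prompt is needed.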
FIG. 5 is a block diagram illustrating an apparatus 200 for extracting tuples from text according to an embodiment of the present invention. As shown in fig. 5, the apparatus 200 for extracting tuples from text includes a training module 201, a splitting module 202, a first calculating module 203, a second calculating module 204, a classifying module 205, and a first extracting module 206.
The training module 201 is configured to input training data with category identifiers into a recurrent neural network, and train to obtain network parameters of the recurrent neural network, where the training data with category identifiers includes legal training data and illegal training data, and the recurrent neural network includes multiple sub-networks.
The splitting module 202 is configured to identify a multi-element entity in the text to be tested, and perform word segmentation on other parts of the text to be tested except the multi-element entity, where the text to be tested includes characters.
The first calculation module 203 is configured to correspondingly input word vectors of the multi-element entities and word vectors of words obtained after word segmentation into multiple subnetworks one by one according to an arrangement sequence in the text to be tested, and obtain hidden vectors output by each subnetwork by combining network parameters, wherein the hidden vectors output by a previous subnetwork are used as inputs of a next subnetwork.
The second calculating module 204 is configured to perform integration calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector.
The classification module 205 is configured to classify the judgment vector by using the network parameter to obtain a classification result, where the classification result includes legal and illegal.
The first extraction module 206 is configured to extract each tuple whose classification result is legal as a legal tuple.
The embodiment of the invention provides an apparatus 200 for extracting tuples from text. The training module 201 trains a recurrent neural network with training data carrying class identifiers to obtain the network parameters of the recurrent neural network. The splitting module 202 identifies the multi-element entities in the text to be tested and segments the remaining parts into words. The first calculation module 203 inputs the word vectors of the multi-element entities and the word vectors of the segmented words into the sub-networks of the recurrent neural network in one-to-one correspondence, following their order in the text to be tested, and obtains the hidden vector output by each sub-network by combining the network parameters. The second calculating module 204 integrates the hidden vectors to obtain the judgment vector. The classification module 205 classifies the judgment vector as legal or illegal. The first extraction module 206 extracts each tuple whose classification result is legal as a legal tuple. In contrast to the prior art, which extracts tuples from text according to pre-constructed extraction rules, the embodiment of the invention uses a recurrent neural network in which the hidden vector output by the previous sub-network serves as an input of the next sub-network, so that each multi-element entity is linked to the other parts of the text to be tested. Consequently, training the recurrent neural network to obtain the network parameters captures the rules that govern legal tuples in various types of text. Legal tuples can therefore be identified and extracted from more types of text, and the accuracy of tuple extraction is improved.
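The data flow through modules 203 to 205 can be sketched in plain Python as follows. This is a minimal sketch, not the patented implementation: it assumes a simple tanh recurrent cell, uses random numbers as stand-ins for trained network parameters and word vectors, and all names (`rnn_step`, `hidden_vectors`, `classify`, `entity_positions`) are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rnn_step(x: Vector, h_prev: Vector, w_x: Matrix, w_h: Matrix, b: Vector) -> Vector:
    """One sub-network (assumed tanh cell): h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w_x[i][j] * x[j] for j in range(len(x)))
            + sum(w_h[i][k] * h_prev[k] for k in range(len(h_prev)))
            + b[i]
        )
        for i in range(len(b))
    ]


def hidden_vectors(inputs: List[Vector], w_x: Matrix, w_h: Matrix, b: Vector) -> List[Vector]:
    """Feed the word vectors in text order; each hidden vector feeds the next step."""
    h = [0.0] * len(b)
    outputs = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        outputs.append(h)
    return outputs


def classify(judgment: Vector, w_out: Vector, b_out: float) -> str:
    """Logistic classifier over the judgment vector (an assumed output layer)."""
    score = sum(w * v for w, v in zip(w_out, judgment)) + b_out
    prob_legal = 1.0 / (1.0 + math.exp(-score))
    return "legal" if prob_legal >= 0.5 else "illegal"


rng = random.Random(0)
EMBED, HIDDEN = 4, 3


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Random stand-ins for the network parameters that training would produce.
w_x, w_h, b = rand_matrix(HIDDEN, EMBED), rand_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN
w_out, b_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)], 0.0

tokens = ["2016", "revenue", "was", "100"]
entity_positions = [0, 1, 3]  # positions of the multi-element entities (assumed)
word_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED)] for _ in tokens]

hs = hidden_vectors(word_vectors, w_x, w_h, b)
entity_hs = [hs[i] for i in entity_positions]
# Integrate the entity hidden vectors by averaging to get the judgment vector.
judgment = [sum(col) / len(entity_hs) for col in zip(*entity_hs)]
label = classify(judgment, w_out, b_out)
print(label)
```

Because the parameters here are random, the printed label carries no meaning; the sketch only shows how non-entity words influence the hidden vectors of the entities through the recurrence.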
In one example, the multi-element entity includes at least two of a time entity, an attribute entity, a value entity, and a qualifier entity.
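As an illustration of how the splitting module might tag such entities while segmenting the rest of the text, the sketch below uses regular expressions on an English sentence. The patterns, the entity vocabulary, and the function name `split_text` are hypothetical; a real system for Chinese text would likely rely on a trained recognizer and a proper word segmenter, and the qualifier entity is omitted here for brevity.

```python
import re
from typing import List, Tuple

# Hypothetical patterns; they only cover the example sentence below.
ENTITY_PATTERNS = [
    ("time", r"\b(?:19|20)\d{2}\b"),
    ("attribute", r"\b(?:net profit|revenue)\b"),
    ("value", r"\b\d+(?:\.\d+)? (?:million|billion)\b"),
]
ENTITY_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in ENTITY_PATTERNS))


def split_text(text: str) -> List[Tuple[str, str]]:
    """Return (token, kind) pairs in text order; kind is an entity type or 'word'."""
    tokens: List[Tuple[str, str]] = []
    cursor = 0
    for match in ENTITY_RE.finditer(text):
        # Segment the non-entity stretch before this entity into plain words.
        tokens += [(w, "word") for w in re.findall(r"\w+", text[cursor:match.start()])]
        # Keep the entity as one token, labelled with the name of the matching group.
        tokens.append((match.group(), match.lastgroup))
        cursor = match.end()
    tokens += [(w, "word") for w in re.findall(r"\w+", text[cursor:])]
    return tokens


tokens = split_text("In 2016 the revenue was 100 million and net profit rose.")
for token, kind in tokens:
    print(f"{kind:>9}: {token}")
```

Keeping each entity as a single token preserves the one-to-one correspondence between tokens and sub-networks that the first calculation module relies on.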
FIG. 6 is a block diagram of an apparatus 200 for extracting tuples from text according to another embodiment of the present invention. FIG. 6 differs from FIG. 5 in that the apparatus 200 for extracting tuples from text further includes a multi-element group generating module 207.
The multi-element group generating module 207 is configured to arrange and combine the multi-element entities to generate at least one multi-element group.
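One plausible reading of "arrange and combine" is to take one entity of each type and enumerate every combination as a candidate tuple, leaving it to the classifier to decide which candidates are legal. The sketch below follows that reading; the dictionary layout and the function name `candidate_tuples` are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def candidate_tuples(entities_by_type: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """Enumerate every combination that takes one entity from each entity type."""
    types = list(entities_by_type)
    return list(product(*(entities_by_type[t] for t in types)))


# Entities recognized in a hypothetical sentence, grouped by type.
entities = {
    "time": ["2015", "2016"],
    "attribute": ["revenue"],
    "value": ["80 million", "100 million"],
}
candidates = candidate_tuples(entities)
for candidate in candidates:
    print(candidate)
```

With two time entities, one attribute entity, and two value entities, this yields four candidates, of which typically only some would be classified as legal.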
It should be noted that the second computing module 204 in the above embodiments may include the first computing unit 2041 or the second computing unit 2042.
The first calculating unit 2041 is configured to calculate a mean vector of the hidden vectors corresponding to the multi-element entities forming the multi-element group, and use the mean vector as the judgment vector.
The second calculating unit 2042 is configured to perform weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and use the vectors obtained through the weighted calculation as the judgment vectors.
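The two integration options can be sketched as follows. The function names and the example weights are hypothetical; how the weights are chosen is an assumption of this sketch.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(hidden: Sequence[Vector]) -> Vector:
    """First calculating unit: element-wise mean of the entity hidden vectors."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]


def weighted_vector(hidden: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Second calculating unit: element-wise weighted sum of the entity hidden vectors."""
    if len(hidden) != len(weights):
        raise ValueError("one weight is required per hidden vector")
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*hidden)]


# Hidden vectors of three entities forming one tuple (made-up numbers).
entity_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_judgment = mean_vector(entity_hidden)
weighted_judgment = weighted_vector(entity_hidden, [0.5, 0.25, 0.25])
print(mean_judgment, weighted_judgment)
```

With equal weights of 1/n the weighted calculation reduces to the mean, so the first unit can be viewed as a special case of the second.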
FIG. 7 is a block diagram illustrating an apparatus 200 for extracting tuples from text according to yet another embodiment of the present invention. FIG. 7 differs from FIG. 5 in that the apparatus 200 for extracting tuples from text further includes a second extracting module 208, a comparing module 209, and an error reporting module 210.
The second extraction module 208 is configured to extract tuples from the table.
Here, the text to be tested further includes a table matching the characters.
The comparing module 209 is configured to compare the tuple extracted from the table with the legal tuple extracted from the text.
The error reporting module 210 is configured to generate prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the text.
It should be clear that the embodiments in this specification are described in a progressive manner: identical or similar parts of the embodiments may be understood by cross-reference, and each embodiment focuses on its differences from the other embodiments. For the apparatus embodiments, reference may be made to the relevant parts of the description of the method embodiments. The present invention is not limited to the specific steps and structures described above and shown in the drawings. Those skilled in the art may make various changes, modifications, and additions, or change the order of the steps, after appreciating the spirit of the invention. A detailed description of known techniques is omitted herein for the sake of brevity.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.

Claims (8)

1. A method for extracting tuples from characters is characterized by comprising the following steps:
inputting training data with class identification into a recurrent neural network, training to obtain network parameters of the recurrent neural network, wherein the training data with the class identification comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks;
identifying a multi-element entity in a text to be detected, and segmenting words of other parts except the multi-element entity in the text to be detected, wherein the text to be detected comprises characters;
according to the arrangement sequence in the text to be tested, correspondingly inputting the word vectors of the multi-element entities and the word vectors of the words obtained after word segmentation into the plurality of sub-networks one by one, and combining the network parameters to obtain the hidden vectors output by each sub-network, wherein the hidden vector output by the previous sub-network is used as the input of the next sub-network;
performing integration calculation on hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector;
classifying the judgment vector by using the network parameters to obtain a classification result, wherein the classification result is either legal or illegal;
extracting the multi-tuple of which the classification result is legal as a legal multi-tuple;
the integrating calculation of the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain the judgment vector comprises the following steps:
calculating a mean vector of hidden vectors corresponding to the multi-element entities forming the multi-element group, and taking the mean vector as the judgment vector;
or,
and performing weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and taking the vectors obtained through weighted calculation as the judgment vectors.
2. The method of claim 1, wherein the multi-element entity comprises at least two of a time entity, an attribute entity, a value entity, and a custom entity.
3. The method of claim 1, wherein before the integrating the hidden vectors corresponding to the multi-element entities forming the multi-element set to obtain the judgment vector, the method further comprises:
and arranging and combining the multiple entities to generate at least one multiple group.
4. The method of claim 1, wherein the text to be tested further comprises a table matching words, the method further comprising:
extracting tuples from the table;
comparing the tuple extracted from the table with the legal tuple extracted from the characters;
generating prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the text.
5. An apparatus for extracting tuples from words, comprising:
the training module is configured to input training data with class identification into a recurrent neural network, and train to obtain network parameters of the recurrent neural network, wherein the training data with the class identification comprises legal training data and illegal training data, and the recurrent neural network comprises a plurality of sub-networks;
the splitting module is configured to identify a multi-element entity in a text to be tested, and perform word segmentation on other parts except the multi-element entity in the text to be tested, wherein the text to be tested comprises characters;
the first calculation module is configured to correspondingly input word vectors of the multi-element entities and word vectors of words obtained after word segmentation into the plurality of sub-networks one by one according to the arrangement sequence in the text to be detected, and obtain hidden vectors output by each sub-network by combining the network parameters, wherein the hidden vector output by the previous sub-network is used as the input of the next sub-network;
the second calculation module is configured to perform integrated calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group to obtain a judgment vector;
the classification module is configured to classify the judgment vector by using the network parameters to obtain a classification result, wherein the classification result is either legal or illegal;
a first extraction module configured to extract the tuples of which the classification results are legal as legal tuples;
the second computing module, comprising:
a first calculation unit configured to calculate a mean vector of hidden vectors corresponding to the multi-element entities constituting the multi-element group, the mean vector being the determination vector;
or,
and the second calculation unit is configured to perform weighted calculation on the hidden vectors corresponding to the multi-element entities forming the multi-element group, and take the vectors obtained through weighted calculation as the judgment vectors.
6. The apparatus of claim 5, wherein the multi-element entity comprises at least two of a time entity, an attribute entity, a value entity, and a custom entity.
7. The apparatus of claim 5, further comprising:
and the multi-element group generating module is configured to arrange and combine the multi-element entities to generate at least one multi-element group.
8. The apparatus of claim 5, wherein the text to be tested further comprises a table matching words, the apparatus further comprising:
a second extraction module configured to extract tuples from the table;
a comparison module configured to compare the tuple extracted from the table with a legal tuple extracted from the text;
an error reporting module configured to generate prompt information if the tuple extracted from the table is inconsistent with the legal tuple extracted from the text.
CN201710280347.XA 2017-04-25 2017-04-25 Method and device for extracting multiple tuples from characters Active CN108733636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710280347.XA CN108733636B (en) 2017-04-25 2017-04-25 Method and device for extracting multiple tuples from characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710280347.XA CN108733636B (en) 2017-04-25 2017-04-25 Method and device for extracting multiple tuples from characters

Publications (2)

Publication Number Publication Date
CN108733636A CN108733636A (en) 2018-11-02
CN108733636B true CN108733636B (en) 2021-07-13

Family

ID=63934675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710280347.XA Active CN108733636B (en) 2017-04-25 2017-04-25 Method and device for extracting multiple tuples from characters

Country Status (1)

Country Link
CN (1) CN108733636B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701888B (en) * 2023-08-09 2023-10-17 国网浙江省电力有限公司丽水供电公司 Auxiliary model data processing method and system for clean energy enterprises

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261358A (en) * 2014-07-17 2016-01-20 中国科学院声学研究所 N-gram grammar model constructing method for voice identification and voice identification system
CN106294325A (en) * 2016-08-11 2017-01-04 海信集团有限公司 The optimization method and device of spatial term statement

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9195656B2 (en) * 2013-12-30 2015-11-24 Google Inc. Multilingual prosody generation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种基于多元组鉴别文本语种的方法;刘敏等;《计算机应用》;20051231;第25卷;全文 *

Also Published As

Publication number Publication date
CN108733636A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
US20230013306A1 (en) Sensitive Data Classification
CN109325691B (en) Abnormal behavior analysis method, electronic device and computer program product
JP7266674B2 (en) Image classification model training method, image processing method and apparatus
CN110147823B (en) Wind control model training method, device and equipment
CN106815194A (en) Model training method and device and keyword recognition method and device
WO2021031825A1 (en) Network fraud identification method and device, computer device, and storage medium
US20200065573A1 (en) Generating variations of a known shred
Ting et al. Towards the detection of cyberbullying based on social network mining techniques
WO2019179010A1 (en) Data set acquisition method, classification method and device, apparatus, and storage medium
CN110929203B (en) Abnormal user identification method, device, equipment and storage medium
CN111666761A (en) Fine-grained emotion analysis model training method and device
CN110175851A (en) A kind of cheating detection method and device
CN113449725B (en) Object classification method, device, equipment and storage medium
CN112926045B (en) Group control equipment identification method based on logistic regression model
US20130151239A1 (en) Orthographical variant detection apparatus and orthographical variant detection program
CN109783805B (en) Network community user identification method and device and readable storage medium
CN109933648A (en) A kind of differentiating method and discriminating device of real user comment
CN111553241A (en) Method, device and equipment for rejecting mismatching points of palm print and storage medium
JP6962123B2 (en) Label estimation device and label estimation program
CN106888201A (en) A kind of method of calibration and device
JP6146209B2 (en) Information processing apparatus, character recognition method, and program
CN108733636B (en) Method and device for extracting multiple tuples from characters
CN117093698B (en) Knowledge base-based dialogue generation method and device, electronic equipment and storage medium
Ishitani Model matching based on association graph for form image understanding
US20170039484A1 (en) Generating negative classifier data based on positive classifier data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant