WO2024087963A1 - Text processing method and apparatus, and electronic device and storage medium - Google Patents


Info

Publication number
WO2024087963A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
pair
pair data
vector
data
Prior art date
Application number
PCT/CN2023/120521
Other languages
French (fr)
Chinese (zh)
Inventor
程昊熠
Original Assignee
中移(苏州)软件技术有限公司
中国移动通信集团有限公司
Priority date
Filing date
Publication date
Application filed by 中移(苏州)软件技术有限公司 and 中国移动通信集团有限公司
Publication of WO2024087963A1


Definitions

  • the present disclosure relates to data processing technology, and in particular to a text processing method, device, electronic device and storage medium.
  • Event co-reference resolution is to determine whether event sentences with different description methods refer to the same event in real life, which mainly depends on the similarity between the two.
  • the difficulty lies in how to accurately calculate the similarity value between two event sentences and how to improve the accuracy of similarity calculation. There is currently no effective solution to this problem.
  • the main purpose of the present disclosure is to provide a text processing method and apparatus, an electronic device, and a storage medium.
  • the present disclosure provides a text processing method, including:
  • the confidence of the event pair data is determined based on the event pair data, the event phrase pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence represents the degree to which the event pair data has a coreference relationship.
  • the process of using a dependency syntax analysis tool to process the event pair data to obtain event short sentence pair data corresponding to the event pair data includes:
  • the event pair data is intercepted based on the start word and the end word to obtain the event short sentence pair data.
  • determining the confidence of the event pair data in the first text based on the event pair data, the event phrase pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity includes:
  • the first similarity is determined based on the event pair data, the event phrase pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity.
  • the confidence vector is processed based on a fully connected classifier to obtain the confidence of the event pair data.
  • the method further includes:
  • a pre-trained model (Bidirectional Encoder Representations from Transformers, BERT) is used to predict the event pair data to obtain the word vector pairs corresponding to the event pair data.
  • the event pair data includes a plurality of word pair data; the method further includes:
  • the first information pair represents a part-of-speech information pair of the word pair data
  • the second information pair represents a position information pair of the word pair data
  • a first event vector pair corresponding to the event pair data is determined.
  • the method further includes:
  • the first event vector pair is extracted using a bidirectional long short-term memory network (Bi-LSTM) to obtain a global information pair corresponding to the first event vector pair;
  • the first event vector pair is extracted using a convolutional neural network (CNN) to obtain a local information pair corresponding to the first event vector pair; the global information pair and the local information pair are fused to obtain a fused vector pair corresponding to the first event vector pair; and the fused vector pair is processed by a first global maximum pooling layer to obtain a second event vector pair corresponding to the first event vector pair.
  • determining the first linear similarity and the first non-linear similarity of the event pair data includes:
  • the first linear similarity includes a first cosine distance; and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.
  • the method further includes:
  • a first event short sentence vector pair corresponding to the event short sentence pair data is determined based on the word vector pair and the event short sentence pair data;
  • the first event short sentence vector pair is processed by a second global maximum pooling layer to obtain a second event short sentence vector pair corresponding to the first event short sentence vector pair.
  • determining the second linear similarity and the second non-linear similarity of the event phrase pair data includes:
  • the second linear similarity includes a second cosine distance; and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.
  • the present disclosure provides a text processing device, including:
  • a first acquisition module configured to acquire event pair data included in the first text
  • a first processing module is configured to process the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data;
  • a first determination module is configured to determine a first linear similarity and a first non-linear similarity of the event pair data and to determine a second linear similarity and a second non-linear similarity of the event phrase pair data;
  • the second determination module is configured to determine the confidence of the event pair data based on the event pair data, the event phrase pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence represents the degree to which the event pair data has a coreference relationship.
  • An embodiment of the present disclosure provides a text processing device, including a memory and a processor, wherein the memory stores a computer program that can be run on the processor, and when the processor executes the program, any of the above-mentioned methods is implemented.
  • An embodiment of the present disclosure provides a storage medium, wherein the storage medium stores executable instructions.
  • the executable instructions are executed by a processor, any of the above methods is implemented.
  • the disclosed embodiment provides a text processing method, device, electronic device and storage medium.
  • the method includes: obtaining event pair data included in a first text; processing the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data; determining the first linear similarity and the first non-linear similarity of the event pair data and determining the second linear similarity and the second non-linear similarity of the event short sentence pair data; determining the confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence characterizes the degree to which the event pair data has a co-referential relationship.
  • FIG. 1 is a schematic flowchart of a text processing method according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of the technical flow of the BNN system of the text processing method according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram of the structure of a text processing device according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of a hardware entity structure of a text processing device according to an embodiment of the present disclosure.
  • In machine learning methods, relevant researchers have introduced a series of event attribute features, such as whether the trigger words, tense, and polarity of two events are consistent.
  • Relevant scholar 2 designed a maximum entropy classifier and introduced more than 100 features for experiments.
  • Relevant scholar 3 proposed a joint reasoning model based on Markov chain to correct the erroneous results produced by the classifier.
  • Relevant scholar 4 designed a graph-based model classifier to merge events into an undirected graph, and then remove non-co-referential events from the graph.
  • Relevant scholar 6 first used a convolutional pooling network to extract feature information of the event sentence and the trigger word context, and then introduced event pair matching features to assist in determining whether there is a coreference relationship between event pairs.
  • Relevant scholar 7 first used a fully connected layer to perform a dimensionality change operation on the two event sentences, then calculated the cosine distance and Euclidean distance of the two event sentences, and finally used an activation function to derive a confidence level to determine the coreference relationship.
  • Fang Jie mainly used the attention mechanism to extract important information from event sentences, and combined the linear similarity between event sentences with event pair matching features to determine whether there is a coreference relationship between event pairs.
  • Probability-based or graph-based machine learning methods require a lot of feature engineering to extract features, which incurs high labor costs, low accuracy, and poor portability.
  • the method proposed by relevant scholar 6 uses a convolutional neural network to extract the contextual feature information of words in event sentences. It only considers the local information between words in the event sentences, does not consider the relationship between a pair of event sentences, and does not deeply extract the features in the event sentences, resulting in low performance of event co-reference resolution.
  • the method proposed by relevant scholar 7 simply performs dimensionality transformation on the event sentences and does not extract features in depth, so the calculated cosine distance and Euclidean distance between the event sentences are not very accurate, which affects the final classification performance.
  • In addition, the input information of the neural network methods is not rich enough and contains certain errors: they basically only combine the event sentence with the relative distance between each word and the trigger word, and take three words before and after the trigger word to form an event short sentence. Event short sentences extracted using such fixed rules contain certain errors, which affects the discriminative performance of the model.
  • this application proposes a text processing method, device, electronic device and storage medium. It aims to pre-train accurate word vectors to represent event sentences, deeply extract useful feature information from event sentences with high dimensions, complex semantic information and complex sentence structures, and assist in distinguishing the same reference relationship by calculating the similarity between event short sentences.
  • the disclosed embodiment proposes a text processing method, the functions implemented by the method can be implemented by calling program codes by a processor in a text processing device.
  • the program codes can be stored in a computer storage medium.
  • the computing device at least includes a processor and a storage medium.
  • FIG. 1 is a schematic flowchart of a text processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
  • Step 101: acquire event pair data included in a first text;
  • Step 102: use a dependency syntax analysis tool to process the event pair data to obtain event short sentence pair data corresponding to the event pair data;
  • Step 103: determine a first linear similarity and a first non-linear similarity of the event pair data and determine a second linear similarity and a second non-linear similarity of the event short sentence pair data;
  • Step 104: determine the confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence represents the degree to which the event pair data has a coreference relationship.
  • the text processing method can be determined according to actual conditions and is not limited here.
  • the text processing method can be an event coreference resolution method based on BERT pre-training.
  • the first text may be determined according to actual conditions, and is not limited here.
  • the first text may be an event sentence.
  • the obtaining of the first text may be determining the event sentences based on the corpus data in a preset corpus.
  • the preset corpus may be determined according to actual conditions, and is not limited here.
  • the preset corpus may be the Knowledge Base Population (KBP) contest corpus and the 2005 Automatic Content Extraction (ACE2005) contest corpus.
  • the step of acquiring the event pair data included in the first text may include acquiring the first text; and preprocessing the first text to obtain the event pair data included in the first text.
  • two event sentences in the first text whose coreference relationship needs to be judged are used as the event pair data included in the first text.
  • the preprocessing of the first text may include: performing data cleaning on the first text in combination with a regular expression and a stop word list; filtering special symbols and stop words in the first text; and restoring words in the first text to their original forms.
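As a rough illustration of this preprocessing, the sketch below uses Python's re module together with NLTK's English stop-word list and WordNet lemmatizer; the concrete regular expression and stop-word list are assumptions, since the embodiment does not specify them.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(sentence: str) -> str:
    # Data cleaning with a regular expression: drop special symbols, keep letters/digits/spaces.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    # Filter stop words and restore the remaining words to their original (base) forms.
    tokens = [
        LEMMATIZER.lemmatize(tok.lower())
        for tok in cleaned.split()
        if tok.lower() not in STOP_WORDS
    ]
    return " ".join(tokens)

# An event pair is simply the two event sentences whose coreference is to be judged.
event_pair = (
    preprocess("Zhang Junxiong, the newly appointed Executive President, was invited!"),
    preprocess("The newly appointed president was also invited to attend."),
)
```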
  • step 102 the event pair data is processed using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data. This can be done by respectively processing each of the two event sentences in the event pair data using a dependency syntax analysis tool to obtain two event short sentences corresponding to the two event sentences, and using the two event short sentences as the event short sentence pair data corresponding to the event pair data.
  • the first linear similarity can be determined according to actual conditions, which is not limited here.
  • the first linear similarity can be the first cosine distance of the event pair data;
  • the second linear similarity can be determined according to actual conditions, which is not limited here.
  • the second linear similarity can be the second cosine distance of the event phrase pair data.
  • the first nonlinear similarity can be determined according to actual conditions and is not limited here.
  • the first nonlinear similarity can be the first bilinear distance and the first single-layer network distance of the event pair data;
  • the second nonlinear similarity can be determined according to actual conditions and is not limited here.
  • the second nonlinear similarity can be the second bilinear distance and the second single-layer network distance of the event phrase pair data.
  • In step 104, after determining the confidence of the event pair data, the method further includes: judging whether the confidence is greater than a preset threshold; if the confidence is greater than the preset threshold, determining that the event pair data has a coreference relationship, which indicates that the degree to which the event pair data has a coreference relationship is high; if the confidence is less than or equal to the preset threshold, determining that the event pair data does not have a coreference relationship, which indicates that the degree to which the event pair data has a coreference relationship is low.
  • the preset threshold can be determined according to actual conditions and is not limited here. As an example, the confidence level can be a value between 0 and 1, and the preset threshold can be 0.5.
  • the disclosed embodiment provides a text processing method, which obtains event pair data included in a first text; uses a dependency syntax analysis tool to process the event pair data to obtain event short sentence pair data corresponding to the event pair data; determines the first linear similarity and the first non-linear similarity of the event pair data and determines the second linear similarity and the second non-linear similarity of the event short sentence pair data; determines the confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence characterizes the degree to which the event pair data has a co-referential relationship.
  • This embodiment proposes a method combining linear similarity and non-linear similarity, and uses non-linear similarity to calculate the similarity between words to make up for the shortcoming that linear similarity can only measure similarity between entire event sentences.
  • the process of using a dependency syntax analysis tool to process the event pair data to obtain event short sentence pair data corresponding to the event pair data includes:
  • the event pair data is intercepted based on the start word and the end word to obtain the event short sentence pair data.
  • the dependency syntax analysis tool can be determined according to actual conditions, and is not limited here.
  • the dependency syntax analysis tool can be a Stanford natural language processing tool.
  • the trigger word can be determined according to actual conditions and is not limited here.
  • the trigger word can be a word in the event sentence that starts a process or action process.
  • the argument can be determined according to actual conditions and is not limited here.
  • the argument can be the agent, the patient, the time and place of the event in the event sentence, etc.
  • the dependent words can be determined according to actual conditions and are not limited here.
  • the dependent words can be the subject and object in the event sentence.
  • the method of sorting the first distance and the second distance can be determined according to actual conditions and is not limited here.
  • the first distance and the second distance are arranged in order from small to large to obtain the sorting result.
  • This embodiment uses a dependency syntax analysis tool to obtain the dependent words of the trigger word, and then uses the trigger word, the dependent words, and the arguments together to determine the start and end positions of the event short sentence in the sentence, thereby extracting the event short sentence.
  • determining the confidence of the event pair data in the first text based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity includes:
  • the confidence vector is processed based on a fully connected classifier to obtain the confidence of the event pair data.
  • the processing of the confidence vector based on the fully connected classifier may be as follows: the confidence vector is processed using a rectified linear unit (ReLU) activation function in the fully connected classifier to obtain a processed confidence vector; the processed confidence vector is then processed by a sigmoid activation function to obtain the confidence of the event pair data.
  • the method further includes:
  • the pre-trained model BERT is used to predict the event pair data to obtain the word vector pair corresponding to the event pair data.
  • this may be done as follows: BERT masks, with mask characters, the words of each of the two event sentences in the event pair data, or the sentences of the text in which each event sentence is located, and predicts the masked words or sentences to obtain two word vectors corresponding to the two event sentences; the two word vectors are used as the word vector pair corresponding to the event pair data.
  • This embodiment no longer uses fixed word vectors, but instead uses the BERT pre-trained model for training to obtain accurate word vector expressions.
  • the event pair data includes a plurality of word pair data; the method further includes:
  • the first information pair represents a part-of-speech information pair of the word pair data
  • the second information pair represents a position information pair of the word pair data
  • a first event vector pair corresponding to the event pair data is determined.
  • the word data can be determined according to actual conditions, and no limitation is made here.
  • the word data can be a word in the event sentence.
  • the event pair data may include a plurality of word pair data, and each of the two event sentences in the event pair data may include a plurality of word data respectively, and the plurality of word data respectively included in each of the two event sentences may be used as the plurality of word pair data included in the event pair data.
  • the method of obtaining the first information pair and the second information pair of the plurality of word pair data in the event pair data may be as follows: respectively obtaining the first information and the second information of the plurality of word data of each of the two event sentences in the event pair data, and using the first information and the second information of the plurality of word data of each of the two event sentences as the first information pairs and the second information pairs of the plurality of word pair data in the event pair data.
  • the method of obtaining the first information pair and the second information pair of the multiple word pair data in the event pair data may be: using the Stanford natural language processing tool to determine the first information pair of the multiple word pair data in the event pair data; and determining the second information pair of the multiple word pair data in the event pair data based on the relative distance between each word pair data in the multiple word pair data and the trigger word of the event pair data.
  • the determination of the first event vector pair corresponding to the event pair data based on the word vector pair, the event pair data, the first information pair and the second information pair may be as follows: encoding the event pair data based on the word vector pair to obtain a first dimension vector pair; encoding the first information pair based on the word vector pair to obtain a second dimension vector pair; determining a third dimension vector pair based on the second information pair; and determining the first event vector pair based on the first dimension vector pair, the second dimension vector pair and the third dimension vector pair.
  • the first dimension vector pair may be an event vector pair of the first dimension; the second dimension vector pair may be a part-of-speech vector pair of the second dimension; the third dimension vector pair may be a position vector pair of the third dimension; and the first event vector pair may be an event vector pair of the fourth dimension.
  • This embodiment concatenates the event sentence, the position information of each word in the event sentence, and the part-of-speech information of each word, thereby enriching the feature information of the input data.
  • the method further includes:
  • the first event vector pair is extracted using a bidirectional long short-term memory network (Bi-LSTM) to obtain a global information pair corresponding to the first event vector pair;
  • the first event vector pair is extracted using a convolutional neural network (CNN) to obtain a local information pair corresponding to the first event vector pair;
  • the global information pair and the local information pair are fused to obtain a fused vector pair corresponding to the first event vector pair;
  • the fused vector pair is processed by a first global maximum pooling layer to obtain a second event vector pair corresponding to the first event vector pair.
  • extracting the first event vector pair using the Bi-LSTM to obtain the global information pair corresponding to the first event vector pair may be performed as follows: the Bi-LSTM transmits the word information of each of the two event sentences of the first event vector pair first in front-to-back order and then in back-to-front order, to obtain the global information of each of the two event sentences of the first event vector pair; the global information of each of the two event sentences is used as the global information pair corresponding to the first event vector pair.
  • the number of neurons of the Bi-LSTM can be determined according to actual conditions, and is not limited here.
  • the number of neurons of the Bi-LSTM can be 150.
  • the global information pair can be determined according to actual conditions, and is not limited here.
  • the global information pair can be a global vector pair.
  • the global information pair can be a global vector pair of the fifth dimension.
  • the use of the convolutional neural network CNN to extract the first event vector pair to obtain the local information pair corresponding to the first event vector pair can be to use the CNN to extract the local information of each of the two event sentences of the first event vector; and use the local information of each of the two event sentences as the local information pair corresponding to the first event vector pair.
  • the number of convolution kernels and the convolution kernel window size of the CNN can be determined according to actual conditions and are not limited here. As an example, the number of convolution kernels of the CNN is set to 300 and the convolution kernel window size is 2.
  • the local information between two adjacent words in each of the two event sentences of the first event vector is obtained by using the CNN; the local information between two adjacent words in each of the two event sentences is used as the local information pair corresponding to the first event vector pair.
  • the local information pair can be determined according to actual conditions and is not limited here.
  • the local information pair can be a local vector pair.
  • the local information pair can be a local vector pair of the sixth dimension.
  • the fusing of the global information pair and the local information pair to obtain the fused vector pair corresponding to the first event vector pair may be performed by bitwise addition of the global information pair and the local information pair to obtain the fused vector pair corresponding to the first event vector pair.
  • the fused vector pair may be a fused vector pair of the seventh dimension.
  • the second event vector pair can be determined according to actual conditions and is not limited here.
  • the second event vector pair can be an event vector pair of the eighth dimension.
  • determining the first linear similarity and the first non-linear similarity of the event pair data includes:
  • the first linear similarity includes a first cosine distance; and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.
  • determining the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair may be determining the first linear similarity and the first non-linear similarity of the event pair data according to two second event vectors in the second event vector pair.
  • the method further includes:
  • the first event short sentence vector pair is processed by a second global maximum pooling layer to obtain a second event short sentence vector pair corresponding to the first event short sentence vector pair.
  • the determination of the first event short sentence vector pair corresponding to the event short sentence pair data based on the word vector pair and the event short sentence pair data can be performed by encoding the event short sentence pair data based on the word vector pair to obtain the first event short sentence vector pair corresponding to the event short sentence pair data; the first event short sentence vector pair can be an event short sentence vector pair of the ninth dimension.
  • the second event phrase vector pair can be determined according to actual conditions and is not limited here.
  • the second event phrase vector pair can be an event phrase vector pair of the tenth dimension.
  • determining the second linear similarity and the second non-linear similarity of the event phrase pair data includes:
  • the second linear similarity includes a second cosine distance; and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.
  • determining the second linear similarity and the second non-linear similarity of the event phrase pair data according to the second event phrase vector pair may be determining the second linear similarity and the second non-linear similarity according to the two second event phrase vectors in the second event phrase vector pair.
  • the confidence of the event pair data in the first text is determined based on the second event vector pair, the second event phrase vector pair, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity.
  • the company's current intelligent customer service system still relies heavily on manual customer service to answer customers' questions.
  • the method proposed in this embodiment can automatically obtain the answer that best matches the question raised by the customer, thereby reducing labor costs and improving user experience.
  • This embodiment effectively enriches the feature information of the input data, and performs one-to-one splicing of words, word position information and word part-of-speech information; uses the BERT pre-training model for training to obtain accurate word vector expressions; uses Bi-LSTM to encode event sentences to obtain global vectors, and uses CNN to encode event sentences to obtain local vectors, and combines the two; uses the dependency words, trigger words and arguments of trigger words to extract event short sentences, rather than fixedly extracting three words before and after the trigger word to form an event short sentence; combines linear similarity with nonlinear similarity, and does not only calculate linear similarity, but also calculates nonlinear similarity to make up for the shortcomings of linear similarity; compared with the methods of related technologies, the performance is improved.
  • FIG. 2 is a technical flow diagram of the BNN system of the text processing method of the embodiment of the present disclosure. As shown in FIG. 2, the method includes the following steps:
  • Step 1 Preprocess the event sentences.
  • the corpus was determined using the KBP and ACE2005 corpora.
  • the KBP corpus contains 6538 event sentences, and the ACE2005 corpus contains 5349 event sentences.
  • the event sentences provided by the corpus were input into the preprocessing module of the BNN system.
  • the event sentences provided by the corpus are news texts directly crawled from web pages. Since there are a large number of special symbols, stop words and other irrelevant information in the crawled text data, the text data needs to be processed in the preprocessing module.
  • the preprocessing module mainly uses regular expressions and stop word lists to clean the text data, filter out special symbols and stop words, and restore the words in the sentences to their original forms.
  • the processed sentences are used as the input event sentences (Sentence, Sen).
  • the Stanford natural language processing tool is used to obtain the part-of-speech information (Pos) of each word in the event sentence, and then the location information (Location, Loc) of each word in the sentence is assigned.
  • the location information is the relative distance between each word and the trigger word of the event sentence.
  • the preprocessing module uses the two event sentences that need to be judged as the event pair data.
  • Step 2 Perform BERT prediction on event sentences.
  • the BERT pre-training model masks the words in the event sentence, or the sentences in the text where the event sentence is located, with mask characters, and predicts the masked words or sentences, thereby obtaining the vector representation BM of each word. Because there are strong correlations between the words in a sentence, and strong contextual connectivity and logic between the sentences in a text, the quality of these vector representations has a great impact on the experimental results.
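As a minimal sketch of how such contextual word vectors can be obtained, the following uses the Hugging Face transformers library and the bert-base-uncased checkpoint; both are assumptions, since the embodiment only refers to a BERT pre-training model without naming an implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; any pre-trained BERT model would play the same role.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def word_vectors(sentence: str) -> torch.Tensor:
    """Return one contextual vector per token: the BM representation, shape (a, b)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Drop the [CLS] and [SEP] positions so each row corresponds to a word piece.
    return outputs.last_hidden_state[0, 1:-1]

bm = word_vectors("Zhang Junxiong the newly appointed Executive President was also invited")
print(bm.shape)  # (number of word pieces a, hidden size b = 768 for bert-base)
```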
  • Step 3 Use word vectors to encode event sentences.
  • the word vector BM trained by the BERT pre-training model is used to encode the event sentence Sen and the part-of-speech information Pos to obtain the event sentence vector SEN with a dimension of a ⁇ b and the part-of-speech vector POS with a dimension of a ⁇ b. Then, the event sentence vector, the part-of-speech vector and the position information with a dimension of a ⁇ 1 are horizontally concatenated to form an event vector EB with a dimension of a ⁇ (2b+1).
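The horizontal concatenation of step 3 can be sketched as follows; the sizes a and b, the trigger position, and the random SEN/POS tensors are placeholders standing in for the actual encoded event sentence and part-of-speech vectors.

```python
import torch

a, b = 12, 768                 # a words per event sentence, b-dimensional word vectors (assumed)
sen = torch.randn(a, b)        # event sentence vector SEN, encoded with the word vector BM
pos = torch.randn(a, b)        # part-of-speech vector POS, encoded with the word vector BM
trigger_idx = 4                # position of the trigger word (assumed)
loc = (torch.arange(a) - trigger_idx).float().unsqueeze(1)  # Loc: relative distance to the trigger

# Horizontal concatenation yields the event vector EB of dimension a x (2b + 1).
eb = torch.cat([sen, pos, loc], dim=1)
assert eb.shape == (a, 2 * b + 1)
```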
  • Step 4 Extract global and local information of event sentences.
  • this embodiment first uses Bi-LSTM to extract the global information of the event vector EB, and sets the number of Bi-LSTM neurons to 150.
  • Bi-LSTM will pass the information of the previous words in the event sentence to the back in sequence, and then pass the information from the back to the front in reverse, observing an event sentence from a global perspective.
  • A CNN is used to extract the local information of the event vector EB; the number of CNN convolution kernels is set to 300, the convolution kernel window size is set to 2, and the dimension is kept unchanged. Since the convolution kernel window size is 2, local information between two adjacent words in the event sentence is extracted.
  • the two networks obtain a global vector GE with a dimension of a × 300 and a local vector LE with a dimension of a × 300, respectively, as shown in formulas (3) and (4): GE = Bi-LSTM(EB) (3); LE = CNN(EB) (4).
  • this embodiment adds the global vector GE and the local vector LE bit by bit to obtain a vector GL with a dimension of a ⁇ 300, which is equivalent to fusing the global information and local information of each word in the event sentence together.
  • Formula (5) is shown as follows: GL = GE + LE (5).
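A sketch of step 4 in PyTorch is shown below; the padding choice and the final pooling call are implementation assumptions consistent with the stated sizes (150 Bi-LSTM units per direction, 300 convolution kernels of window size 2).

```python
import torch
import torch.nn as nn

class GlobalLocalEncoder(nn.Module):
    """Bi-LSTM for global information GE, CNN for local information LE, fused into GL."""

    def __init__(self, in_dim: int):
        super().__init__()
        # 150 units per direction -> 300-dimensional global vectors GE, formula (3).
        self.bilstm = nn.LSTM(in_dim, 150, bidirectional=True, batch_first=True)
        # 300 kernels of window size 2 extract information between adjacent words, formula (4).
        self.cnn = nn.Conv1d(in_dim, 300, kernel_size=2, padding=1)

    def forward(self, eb: torch.Tensor) -> torch.Tensor:  # eb: (batch, a, 2b + 1)
        ge, _ = self.bilstm(eb)                       # GE: (batch, a, 300)
        le = self.cnn(eb.transpose(1, 2))             # (batch, 300, a + 1) after padding
        le = le[:, :, : eb.size(1)].transpose(1, 2)   # trim to LE: (batch, a, 300)
        gl = ge + le                                  # bit-by-bit addition, formula (5)
        # First global max pooling over the word axis yields the second event vector.
        return gl.max(dim=1).values                   # (batch, 300)

encoder = GlobalLocalEncoder(in_dim=2 * 768 + 1)
event_vec = encoder(torch.randn(2, 12, 2 * 768 + 1))
print(event_vec.shape)  # torch.Size([2, 300])
```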
  • Step 5 Extract event short sentences from event sentences.
  • This embodiment optimizes the extraction method, and the steps of event short sentence extraction are as follows:
  • Step (5.1) uses the Stanford natural language processing tool to obtain the arguments in the event sentence.
  • the arguments mainly include: agent, patient, time and place of the event, etc.
  • Step (5.2) uses a dependency word analysis tool to generate dependency words for the trigger word in the sentence.
  • Step (5.3) calculates the distance between each argument and each dependency word and the trigger word, determines the two words farthest from the trigger word before and after the trigger word, and uses these two words as the start and end positions of the event phrase.
  • Step (5.4) extracts the sentence from the starting position to the ending position as the event short sentence.
  • the trigger word is "appointed";
  • the dependent words of the trigger word are "Zhang Junxiong", "President", and "invited";
  • the distances between the three dependent words and the trigger word are -3, 2, and 5, respectively;
  • the arguments in the event sentence are "Zhang Junxiong" and "invited", and the distances between the two arguments and the trigger word are -3 and 5, respectively;
  • the event short sentence extracted according to the fixed method of extracting short sentences is "Junxiong the newly appointed Executive President was", which shows that the short sentence is incomplete;
  • if the dependent word or argument "Zhang Junxiong" farthest before the trigger word is taken as the starting position and the dependent word or argument "invited" farthest after the trigger word is taken as the ending position, then the complete event short sentence "Zhang Junxiong the newly appointed Executive President was also invited" can be extracted.
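Steps (5.1) to (5.4) reduce to picking the farthest candidate before and after the trigger word. A plain-Python sketch is given below; it assumes the signed distances of the arguments and dependent words to the trigger word have already been produced by the parsing tools, and it treats "Zhang Junxiong" as a single token for simplicity.

```python
def extract_short_sentence(words, trigger_idx, offsets):
    """Pick the farthest argument/dependent word before and after the trigger word as
    start and end positions, then slice out the event short sentence (steps 5.3-5.4)."""
    before = [d for d in offsets if d < 0]
    after = [d for d in offsets if d > 0]
    start = trigger_idx + (min(before) if before else 0)
    end = trigger_idx + (max(after) if after else 0)
    return words[start : end + 1]

words = ["Zhang Junxiong", "the", "newly", "appointed", "Executive",
         "President", "was", "also", "invited"]
# Trigger "appointed" at index 3; arguments/dependent words at offsets -3, 2 and 5.
print(" ".join(extract_short_sentence(words, 3, [-3, 2, 5])))
# -> Zhang Junxiong the newly appointed Executive President was also invited
```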
  • After the event short sentence is obtained, the word vector BM is used to encode it to obtain the event short sentence vector ES with a dimension of a × b. The event short sentence vector ES is then passed through the global maximum pooling layer to obtain the event short sentence vector SX with a dimension of a × 1.
  • Step 6 Calculate the similarity between two event sentences.
  • the key to determining whether there is a coreference relationship between event sentences is to calculate the similarity between the two.
  • the accuracy and comprehensiveness of the similarity calculation has a great impact on the performance results of the model.
  • In related work, researchers have only used the cosine distance calculation method to obtain the linear similarity between event sentences.
  • Linear similarity considers the relationship between two event sentences from a holistic perspective; if the structural gap between the two is too large, they will be misjudged as having a non-coreference relationship. Non-linear similarity can calculate the relationship between the words of a pair of event sentences, thereby making up for this shortcoming of linear similarity.
  • This embodiment proposes three similarity calculation methods, namely: cosine distance C, bilinear distance S and single-layer network distance L.
  • the formulas are shown in (8), (9), (10), (11), (12) and (13):
  • C 1 represents the cosine distance corresponding to the event sentence vector.
  • C 2 represents the cosine distance corresponding to the event short sentence vector.
  • formula (10) represents the weight used to calculate the bilinear distance corresponding to the event sentence vector.
  • formula (11) represents the weight used to calculate the bilinear distance corresponding to the event short sentence vector.
  • formula (12) contains the weight used to calculate the single-layer network distance corresponding to the event sentence vector and the offset vector used to calculate the single-layer network distance corresponding to the event sentence vector.
  • formula (13) contains the weight used to calculate the single-layer network distance corresponding to the event short sentence vector and the offset vector used to calculate the single-layer network distance corresponding to the event short sentence vector.
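The three distances can be sketched as follows for one pair of event (or short-sentence) vectors; the randomly initialized weight matrix, the tanh activation, and the tensor shapes stand in for the learned parameters of formulas (8) to (13), which are not reproduced in this text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """Cosine distance C (linear), bilinear distance S and single-layer network
    distance L (both non-linear) for a pair of vectors."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_bilinear = nn.Parameter(torch.randn(dim, dim) * 0.01)  # weight in (10)/(11)
        self.single = nn.Linear(2 * dim, 1)                           # weight and offset in (12)/(13)

    def forward(self, v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
        c = F.cosine_similarity(v1, v2, dim=-1)                       # cosine distance C
        s = torch.einsum("bd,de,be->b", v1, self.w_bilinear, v2)      # bilinear distance S
        l = torch.tanh(self.single(torch.cat([v1, v2], dim=-1)))      # single-layer network distance L
        return torch.stack([c, s, l.squeeze(-1)], dim=-1)             # (batch, 3)

head = SimilarityHead(dim=300)
sims = head(torch.randn(4, 300), torch.randn(4, 300))
print(sims.shape)  # torch.Size([4, 3])
```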
  • Step 7 Output confidence.
  • V_h = relu(W_h * P + b_h)
  • where W_h represents the weight of the activation function corresponding to the vector P, and b_h represents the offset vector of the activation function corresponding to the vector P.
  • confidence = sigmoid(W_0 * V_h + b_0)
  • where W_0 represents the weight used to calculate the confidence, and b_0 represents the offset vector used to calculate the confidence.
  • the confidence score is a value between 0 and 1. If the score is greater than 0.5, it is determined to be a co-referential relationship. Otherwise, it is determined to be a non-co-referential relationship.
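Step 7 can be sketched as a small classifier head; the hidden size and the exact composition of the input vector P (here assumed to concatenate the two pooled event vectors and the six similarity scores) are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Fully connected classifier: V_h = relu(W_h * P + b_h), then sigmoid(W_0 * V_h + b_0)."""

    def __init__(self, in_dim: int, hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),         # dropout value stated in this embodiment
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        return self.classifier(p).squeeze(-1)  # confidence in (0, 1)

in_dim = 300 * 2 + 6                  # assumed layout of the concatenated vector P
head = ConfidenceHead(in_dim)
confidence = head(torch.randn(2, in_dim))
is_coreferent = confidence > 0.5      # preset threshold of 0.5
```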
  • this embodiment uses Dropout, which is a strategy widely used in deep learning to solve the problem of model overfitting. The value is set to 0.2.
  • Through BERT pre-training and the extraction of global and local information, the BNN system mines the semantic information of the text content accurately and comprehensively and converts it into vector expressions, thereby assisting the model in identifying coreference relationships.
  • the system has achieved good results in actual tests, and its performance is improved compared with the methods in related technologies.
  • Table 1 shows the KBP performance results, and Table 2 shows the ACE performance results.
  • MUC, B3, BLANC, CEAFe, and Links are performance evaluation methods, and KBP and ACE are test sets.
  • the BNN system is greatly improved compared with the neural network method of relevant scholar 6 and the KBP-TOP method, and is improved by 0.6% on average compared with the machine learning method of relevant scholar 4. Although the improvement is only 0.6%, the neural network method has the advantages of low labor cost, high efficiency, and strong portability compared with the machine learning method.
  • FIG. 3 is a schematic diagram of the structure of the text processing device according to an embodiment of the present disclosure. As shown in FIG. 3, the device 300 includes:
  • a first acquisition module 301 is configured to acquire event pair data included in a first text
  • a first processing module 302 is configured to process the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data;
  • a first determination module 303 is configured to determine a first linear similarity and a first non-linear similarity of the event pair data and to determine a second linear similarity and a second non-linear similarity of the event phrase pair data;
  • the second determination module 304 is configured to determine the confidence of the event pair data based on the event pair data, the event phrase pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; the confidence represents the degree to which the event pair data has a coreference relationship.
  • the first processing module 302 is further configured to use the dependency syntax analysis tool to determine the arguments and dependent words of the trigger word in the event pair data; determine a first distance between the argument and the trigger word, and determine a second distance between the dependent word and the trigger word; sort the first distance and the second distance to obtain a sorting result; determine the two arguments or dependent words corresponding to the maximum distances in the sorting result, and use them as the starting word and ending word of the event short sentence pair data; and intercept the event pair data based on the starting word and the ending word to obtain the event short sentence pair data.
  • the second determination module 304 is further configured to determine a confidence vector of the event pair data in the first text based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; and process the confidence vector based on a fully connected classifier to obtain the confidence of the event pair data.
  • the device 300 further includes: a prediction module, configured to use a pre-trained model BERT to predict the event pair data to obtain a word vector pair corresponding to the event pair data.
  • the event pair data includes a plurality of word pair data; the device 300 further includes: a second acquisition module and a third determination module; wherein,
  • the second acquisition module is configured to acquire a first information pair and a second information pair of a plurality of word pair data in the event pair data; the first information pair represents a part-of-speech information pair of the word pair data; the second information pair represents a position information pair of the word pair data;
  • the third determination module is configured to determine a first event vector pair corresponding to the event pair data based on the word vector pair, the event pair data, the first information pair and the second information pair.
  • the device 300 further includes: a first extraction module, a second extraction module, a fusion module, and a second processing module; wherein,
  • the first extraction module is configured to extract the first event vector pair using a long short-term memory network Bi-LSTM to obtain a global information pair corresponding to the first event vector pair;
  • the second extraction module is configured to extract the first event vector pair using a convolutional neural network (CNN) to obtain a local information pair corresponding to the first event vector pair;
  • the fusion module is configured to fuse the global information pair and the local information pair to obtain a fusion vector pair corresponding to the first event vector pair;
  • the second processing module is configured to perform a first global maximum pooling layer processing on the fused vector pair to obtain a second event vector pair corresponding to the first event vector pair.
  • the first determination module 303 is further configured to determine a first linear similarity and a first non-linear similarity of the event pair data based on the second event vector pair; wherein the first linear similarity includes a first cosine distance; and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.
  • the device 300 further includes: a fourth determining module and a third processing module; wherein,
  • the fourth determination module is configured to determine a first event short sentence vector pair corresponding to the event short sentence pair data based on the word vector pair and the event short sentence pair data;
  • the third processing module is configured to perform a second global maximum pooling layer processing on the first event short sentence vector pair to obtain a second event short sentence vector pair corresponding to the first event short sentence vector pair.
  • the first determination module 303 is further configured to determine a second linear similarity and a second non-linear similarity of the event phrase pair data based on the second event phrase vector pair; wherein the second linear similarity includes a second cosine distance; and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.
  • If the above-mentioned text processing method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part thereof that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes a number of instructions for enabling a text processing device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the method described in each embodiment of the present disclosure.
  • the aforementioned storage medium includes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a magnetic disk or an optical disk, and other media that can store program codes.
  • the embodiments of the present disclosure are not limited to any specific combination of hardware and software.
  • an embodiment of the present disclosure further provides a text processing device, including a memory and a processor, wherein the memory stores a computer program that can be executed on the processor, and when the processor executes the program, any step in the above-mentioned method is implemented.
  • an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any step in the above-mentioned method is implemented.
  • Figure 4 is a schematic diagram of a hardware entity structure of a text processing device according to an embodiment of the present disclosure.
  • the hardware entity of the text processing device 400 includes: a processor 401 and a memory 403.
  • the text processing device 400 may also include a communication interface 402.
  • the memory 403 can be a volatile memory or a non-volatile memory, and can also include both volatile and non-volatile memories.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM); the magnetic surface memory can be a disk memory or a tape memory.
  • the volatile memory can be a random access memory (RAM), which is used as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the method disclosed in the above embodiment of the present disclosure can be applied to the processor 401, or implemented by the processor 401.
  • the processor 401 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 401 or the instruction in the form of software.
  • the above processor 401 can be a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the processor 401 can implement or execute the methods, steps and logic block diagrams disclosed in the embodiment of the present disclosure.
  • the general processor can be a microprocessor or any conventional processor, etc.
  • the steps of the method disclosed in the embodiment of the present disclosure can be directly embodied as a hardware decoding processor to execute, or a combination of hardware and software modules in the decoding processor to execute.
  • the software module can be located in a storage medium, which is located in the memory 403.
  • the processor 401 reads the information in the memory 403 and completes the steps of the above method in combination with its hardware.
  • the text processing device can be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components to execute the aforementioned method.
  • the disclosed methods and devices can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • the coupling or communication connection between the components shown or discussed may be through some interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of this embodiment.
  • the above-mentioned integrated unit of the embodiment of the present disclosure is implemented in the form of a software functional unit and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the technical embodiment of the embodiment of the present disclosure is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions for a text processing device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in each embodiment of the present disclosure.
  • the aforementioned storage medium includes: various media that can store program codes, such as mobile storage devices, ROMs, magnetic disks, or optical disks.
  • references to "one embodiment” or “an embodiment” throughout the specification mean that a particular feature, structure, or characteristic associated with the embodiment is included in at least one embodiment of the present disclosure.
  • the phrases “in one embodiment” or “in an embodiment” that appear in various places in the specification do not necessarily refer to the same embodiment.
  • these specific features, structures, or characteristics may be combined in one or more embodiments in any suitable manner.
  • the size of the serial numbers of the above-mentioned processes does not mean the order of execution. The order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
  • the serial numbers of the embodiments of the present disclosure are for description only and do not represent the advantages and disadvantages of the embodiments.

Abstract

Provided in the embodiments of the present disclosure are a text processing method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring event pair data comprised in a first text; processing the event pair data by using a dependency syntactic parsing tool, so as to obtain event short-sentence pair data corresponding to the event pair data; determining a first linear similarity and a first nonlinear similarity of the event pair data and a second linear similarity and a second nonlinear similarity of the event short-sentence pair data; and determining a confidence coefficient of the event pair data on the basis of the event pair data, the event short-sentence pair data, the first linear similarity, the first nonlinear similarity, the second linear similarity and the second nonlinear similarity, wherein the confidence coefficient represents the degree to which the event pair data has a coreference relationship.

Description

Text processing method and apparatus, electronic device and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This disclosure is based on, and claims priority to, the Chinese patent application with application number 202211320876.5 filed on October 26, 2022, the entire content of which is hereby incorporated into this application by reference.

Technical Field

The present disclosure relates to data processing technology, and in particular to a text processing method and apparatus, an electronic device, and a storage medium.

Background

Nowadays, with the rapid development of Internet technology, the interactive information that people generate on the Internet is growing rapidly, and people can obtain the information they want through the Internet anytime and anywhere. Although the Internet provides increasingly fast and diverse information, it also produces a large amount of junk information, which forces people to spend a great deal of effort looking for the information they need, sometimes in vain. In the era of big data, how to process big data and filter out valuable information has become an important topic. Event extraction can help machines find valuable event information in texts and group semantically co-referential text content into one category, thereby performing event co-reference resolution.

Event co-reference resolution determines whether event sentences with different descriptions refer to the same real-life event, which mainly depends on the similarity between the two. The difficulty lies in how to accurately calculate the similarity value between two event sentences and how to improve the accuracy of the similarity calculation. There is currently no effective solution to this problem.
Summary

In view of this, the main purpose of the present disclosure is to provide a text processing method and apparatus, an electronic device, and a storage medium.

To achieve the above objective, the technical solutions of the present disclosure are implemented as follows:

An embodiment of the present disclosure provides a text processing method, including:

acquiring event pair data included in a first text;

processing the event pair data using a dependency syntax analysis tool to obtain event short-sentence pair data corresponding to the event pair data;

determining a first linear similarity and a first non-linear similarity of the event pair data, and determining a second linear similarity and a second non-linear similarity of the event short-sentence pair data;

determining a confidence of the event pair data based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity, where the confidence represents the degree to which the event pair data has a co-reference relationship.
In the above solution, processing the event pair data using the dependency syntax analysis tool to obtain the event short-sentence pair data corresponding to the event pair data includes:

determining arguments and dependent words of trigger words in the event pair data using the dependency syntax analysis tool;

determining a first distance between each argument and the trigger word, and determining a second distance between each dependent word and the trigger word;

sorting the first distances and the second distances to obtain a sorting result;

determining the two arguments or dependent words corresponding to the maximum distances in the sorting result, and using them as the start word and the end word of the event short-sentence pair data;

intercepting the event pair data based on the start word and the end word to obtain the event short-sentence pair data.
In the above solution, determining the confidence of the event pair data in the first text based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity includes:

determining a confidence vector of the event pair data in the first text based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity;

processing the confidence vector based on a fully connected classifier to obtain the confidence of the event pair data.
In the above solution, the method further includes:

using a pre-trained model (Bidirectional Encoder Representation from Transformers, BERT) to predict the event pair data to obtain word vector pairs corresponding to the event pair data.

In the above solution, the event pair data includes a plurality of word pair data, and the method further includes:

acquiring first information pairs and second information pairs of the plurality of word pair data in the event pair data, where a first information pair represents a part-of-speech information pair of word pair data and a second information pair represents a position information pair of the word pair data;

determining a first event vector pair corresponding to the event pair data based on the word vector pairs, the event pair data, the first information pairs, and the second information pairs.
In the above solution, the method further includes:

extracting from the first event vector pair, using a bi-directional long short-term memory network (Bi-directional Long Short-Term Memory, Bi-LSTM), a global information pair corresponding to the first event vector pair;

extracting from the first event vector pair, using a convolutional neural network (Convolutional Neural Network, CNN), a local information pair corresponding to the first event vector pair;

fusing the global information pair and the local information pair to obtain a fused vector pair corresponding to the first event vector pair;

processing the fused vector pair with a first global max pooling layer to obtain a second event vector pair corresponding to the first event vector pair.

In the above solution, determining the first linear similarity and the first non-linear similarity of the event pair data includes:

determining the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair;

where the first linear similarity includes a first cosine distance, and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.
In the above solution, the method further includes:

determining, based on the word vector pairs and the event short-sentence pair data, a first event short-sentence vector pair corresponding to the event short-sentence pair data;

processing the first event short-sentence vector pair with a second global max pooling layer to obtain a second event short-sentence vector pair corresponding to the first event short-sentence vector pair.

In the above solution, determining the second linear similarity and the second non-linear similarity of the event short-sentence pair data includes:

determining the second linear similarity and the second non-linear similarity of the event short-sentence pair data according to the second event short-sentence vector pair;

where the second linear similarity includes a second cosine distance, and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.
An embodiment of the present disclosure provides a text processing apparatus, including:

a first acquisition module, configured to acquire event pair data included in a first text;

a first processing module, configured to process the event pair data using a dependency syntax analysis tool to obtain event short-sentence pair data corresponding to the event pair data;

a first determination module, configured to determine a first linear similarity and a first non-linear similarity of the event pair data and to determine a second linear similarity and a second non-linear similarity of the event short-sentence pair data;

a second determination module, configured to determine a confidence of the event pair data based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity, where the confidence represents the degree to which the event pair data has a co-reference relationship.

An embodiment of the present disclosure provides a text processing device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor, when executing the program, implements any one of the methods described above.

An embodiment of the present disclosure provides a storage medium storing executable instructions which, when executed by a processor, implement any one of the methods described above.

Embodiments of the present disclosure provide a text processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring event pair data included in a first text; processing the event pair data using a dependency syntax analysis tool to obtain event short-sentence pair data corresponding to the event pair data; determining a first linear similarity and a first non-linear similarity of the event pair data, and determining a second linear similarity and a second non-linear similarity of the event short-sentence pair data; and determining a confidence of the event pair data based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity, where the confidence represents the degree to which the event pair data has a co-reference relationship. Combining the linear and non-linear similarities of the event pair data with those of the event short-sentence pair data to determine the confidence makes up for the defect that, when the confidence is determined by linear similarity alone, the event pair data is considered only as a whole.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a text processing method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of the technical flow of the BNN system of the text processing method according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the structure of a text processing apparatus according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a hardware entity structure of a text processing device according to an embodiment of the present disclosure.

Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the disclosure are described in further detail below in conjunction with the drawings of the embodiments. The following embodiments are used to illustrate the present disclosure, but not to limit its scope.

In the related art, there are two main approaches to event co-reference resolution. One uses probability- or graph-based machine learning methods, which require extensive feature engineering to manually extract features from event sentences, and then combine machine learning methods to identify co-reference relationships. The other uses mainstream neural network methods to design a similarity model that calculates the similarity between two event sentences and thereby identifies co-reference relationships.

Among the machine learning methods, a first researcher introduced into the event-pair co-reference resolution classifier a series of event-pair attributes, such as whether the trigger words, tense, polarity, and so on are consistent. A second researcher designed a maximum entropy classifier and introduced more than 100 features for experiments. A third researcher proposed a joint reasoning model based on Markov chains to correct erroneous results produced by the classifier. A fourth researcher designed a graph-based model classifier that merges events into an undirected graph and then removes non-co-referential events from the graph. A fifth researcher first used a clustering algorithm to generate an undirected graph of event co-reference relationships, then used an optimal cutting algorithm to optimize the graph and delete erroneous edges from it, thereby refining the event co-reference resolution. Teng Jiayue used a maximum entropy classifier model combined with a large number of tool-extracted features.

Among the neural network methods, a sixth researcher first used a convolutional pooling network to extract feature information from the event sentences and the trigger-word context, and then introduced event-pair matching features to help determine whether a co-reference relationship exists between event pairs. A seventh researcher first used a fully connected layer to change the dimensionality of the two event sentences, then calculated the cosine distance and the Euclidean distance between them, and finally derived a confidence through an activation function to judge the co-reference relationship. Fang Jie mainly used an attention mechanism to extract important information from event sentences and combined the linear similarity between event sentences with event-pair matching features to determine whether a co-reference relationship exists between event pairs.
The above related art has the following disadvantages:

First, probability- or graph-based machine learning methods require extensive feature engineering to extract features, which is labor-intensive, not very accurate, and not very portable.

Second, the sixth researcher's method uses a convolutional neural network on the event sentences to extract contextual feature information of words; it only considers local word-to-word information within an event sentence, does not consider the relationship between a pair of event sentences, and does not extract deep features from the event sentences, so the performance of event co-reference resolution is not high.

Third, the seventh researcher's method simply performs a dimensionality-change operation on the event sentences and does not extract deep features, so the calculated cosine and Euclidean distances between event sentences are not very accurate, which affects the final classification performance.

Fourth, the input information of the neural network methods is not rich enough and contains certain errors; they basically combine only the event sentence with the relative distance of each word to the trigger word, and take the three words before and after the trigger word to form an event short sentence. However, event short sentences extracted with such a fixed rule contain certain errors, which in turn affect the discriminative performance of the model.

To address the above shortcomings, the present application proposes a text processing method and apparatus, an electronic device, and a storage medium. The aim is to pre-train accurate word vectors to represent event sentences, to extract useful feature information in depth from event sentences with high dimensionality, complex semantic information, and complex sentence structure, and to assist in judging co-reference relationships by calculating the similarity between event short sentences.
An embodiment of the present disclosure proposes a text processing method. The functions implemented by the method can be realized by a processor in a text processing device calling program code; the program code can, of course, be stored in a computer storage medium, so the device includes at least a processor and a storage medium.

FIG. 1 is a schematic flowchart of the text processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:

Step 101: acquiring event pair data included in a first text;

Step 102: processing the event pair data using a dependency syntax analysis tool to obtain event short-sentence pair data corresponding to the event pair data;

Step 103: determining a first linear similarity and a first non-linear similarity of the event pair data, and determining a second linear similarity and a second non-linear similarity of the event short-sentence pair data;

Step 104: determining a confidence of the event pair data based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity, where the confidence represents the degree to which the event pair data has a co-reference relationship.
In step 101, the text processing method can be determined according to the actual situation and is not limited here. As an example, the text processing method may be an event co-reference resolution method based on BERT pre-training.

The first text can be determined according to the actual situation and is not limited here. As an example, the first text may be event sentences. Acquiring the first text may be determining the event sentences based on a preset corpus. The preset corpus can be determined according to the actual situation and is not limited here. As an example, the preset corpus may be one or more of the Knowledge Base Population (KBP) corpus and the 2005 Automatic Content Extraction (ACE) corpus.

Acquiring the event pair data included in the first text may include: acquiring the first text, and preprocessing the first text to obtain the event pair data included in the first text.

In some embodiments, two event sentences in the first text whose co-reference relationship needs to be judged are used as the event pair data included in the first text.

In some embodiments, preprocessing the first text may include: cleaning the first text using regular expressions and a stop-word list; filtering out special symbols and stop words in the first text; and restoring the words in the first text to their base forms.

In step 102, processing the event pair data using the dependency syntax analysis tool to obtain the event short-sentence pair data corresponding to the event pair data may include: processing each of the two event sentences in the event pair data with the dependency syntax analysis tool to obtain two event short sentences corresponding to the two event sentences, and using the two event short sentences as the event short-sentence pair data corresponding to the event pair data.

In step 103, the first linear similarity can be determined according to the actual situation and is not limited here. As an example, the first linear similarity may be a first cosine distance of the event pair data; the second linear similarity may be a second cosine distance of the event short-sentence pair data.

The first non-linear similarity can be determined according to the actual situation and is not limited here. As an example, the first non-linear similarity may be a first bilinear distance and a first single-layer network distance of the event pair data; the second non-linear similarity may be a second bilinear distance and a second single-layer network distance of the event short-sentence pair data.

In step 104, after the confidence of the event pair data is determined, the method further includes: judging whether the confidence is greater than a preset threshold; if the confidence is greater than the preset threshold, determining that the event pair data has a co-reference relationship, which indicates a high degree of co-reference; if the confidence is less than or equal to the preset threshold, determining that the event pair data does not have a co-reference relationship, which indicates a low degree of co-reference. The preset threshold can be determined according to the actual situation and is not limited here. As an example, the confidence may be a value between 0 and 1, and the preset threshold may be 0.5.
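As a minimal sketch of this decision rule (assuming, as in the example above, a confidence in [0, 1] and a threshold of 0.5), the final judgment could be written as:

```python
def is_coreferent(confidence: float, threshold: float = 0.5) -> bool:
    # A confidence above the threshold is taken to mean that the two
    # event sentences in the pair refer to the same real-world event.
    return confidence > threshold
```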
An embodiment of the present disclosure thus provides a text processing method: acquiring event pair data included in a first text; processing the event pair data using a dependency syntax analysis tool to obtain event short-sentence pair data corresponding to the event pair data; determining a first linear similarity and a first non-linear similarity of the event pair data, and a second linear similarity and a second non-linear similarity of the event short-sentence pair data; and determining a confidence of the event pair data based on all of the above, where the confidence represents the degree to which the event pair data has a co-reference relationship. Combining the linear and non-linear similarities of the event pair data with those of the event short-sentence pair data to determine the confidence makes up for the defect that confidence determined by linear similarity alone considers the event pair data only as a whole.

This embodiment proposes a method combining linear similarity and non-linear similarity, using non-linear similarity to compute word-to-word similarity to compensate for the fact that linear similarity can only measure similarity between whole event sentences.
In an optional embodiment of the present disclosure, processing the event pair data using the dependency syntax analysis tool to obtain the event short-sentence pair data corresponding to the event pair data includes:

determining arguments and dependent words of trigger words in the event pair data using the dependency syntax analysis tool;

determining a first distance between each argument and the trigger word, and a second distance between each dependent word and the trigger word;

sorting the first distances and the second distances to obtain a sorting result;

determining the two arguments or dependent words corresponding to the maximum distances in the sorting result, and using them as the start word and the end word of the event short-sentence pair data;

intercepting the event pair data based on the start word and the end word to obtain the event short-sentence pair data.

In this embodiment, the dependency syntax analysis tool can be determined according to the actual situation and is not limited here. As an example, it may be the Stanford natural language processing toolkit.

The trigger word can be determined according to the actual situation and is not limited here. As an example, the trigger word may be the word in the event sentence that initiates a process or course of action.

The arguments can be determined according to the actual situation and are not limited here. As an example, the arguments may be the agent, the patient, and the time and place of the event in the event sentence.

The dependent words can be determined according to the actual situation and are not limited here. As an example, the dependent words may be the subject and object in the event sentence.

The way of sorting the first distances and the second distances can be determined according to the actual situation and is not limited here. As an example, the first distances and the second distances are arranged in ascending order to obtain the sorting result.

This embodiment uses a dependency analysis tool to obtain the dependent words of the trigger word, and then uses the trigger word, the dependent words, and the arguments together to determine the start and end positions of the event short sentence within the sentence, thereby extracting the event short sentence.
In an optional embodiment of the present disclosure, determining the confidence of the event pair data in the first text based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity includes:

determining a confidence vector of the event pair data in the first text based on the event pair data, the event short-sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity;

processing the confidence vector based on a fully connected classifier to obtain the confidence of the event pair data.

In this embodiment, processing the confidence vector based on the fully connected classifier to obtain the confidence of the event pair data may include: processing the confidence vector in the fully connected classifier using a rectified linear unit (ReLU) activation function to obtain a processed confidence vector, and processing the processed confidence vector through a sigmoid activation function to obtain the confidence of the event pair data.
In an optional embodiment of the present disclosure, the method further includes:

using the pre-trained BERT model to predict the event pair data to obtain word vector pairs corresponding to the event pair data.

In this embodiment, using the pre-trained BERT model to predict the event pair data may include: using BERT to predict masked words or sentences, by masking with special characters the words of each of the two event sentences in the event pair data, or the sentences of the text in which each event sentence is located, thereby obtaining two word vectors corresponding to the two event sentences, and using the two word vectors as the word vector pair corresponding to the event pair data.

This embodiment no longer uses fixed word vectors; instead, the BERT pre-trained model is used for training to obtain accurate word vector representations.
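As a rough sketch of obtaining contextual word vectors from a pre-trained BERT model, here using the Hugging Face transformers library and the bert-base-uncased checkpoint as illustrative assumptions (the disclosure does not prescribe a toolkit or checkpoint):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vectors(sentence: str) -> torch.Tensor:
    # Tokenize the event sentence and run it through BERT; the last
    # hidden state provides one contextual vector per token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (tokens, 768)
```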
In an optional embodiment of the present disclosure, the event pair data includes a plurality of word pair data, and the method further includes:

acquiring first information pairs and second information pairs of the plurality of word pair data in the event pair data, where a first information pair represents a part-of-speech information pair of word pair data and a second information pair represents a position information pair of the word pair data;

determining a first event vector pair corresponding to the event pair data based on the word vector pairs, the event pair data, the first information pairs, and the second information pairs.

In this embodiment, the word data can be determined according to the actual situation and is not limited here. As an example, the word data may be the words in the event sentences.

That the event pair data includes a plurality of word pair data may mean that each of the two event sentences in the event pair data includes a plurality of word data, and the plurality of word data in each of the two event sentences are used as the plurality of word pair data included in the event pair data.

Acquiring the first information pairs and the second information pairs of the plurality of word pair data in the event pair data may include: respectively acquiring the first information and the second information of the plurality of word data of each of the two event sentences in the event pair data, and using them as the first information pairs and the second information pairs of the plurality of word pair data in the event pair data.

Acquiring the first information pairs and the second information pairs may also include: determining the first information pairs of the plurality of word pair data using the Stanford natural language processing toolkit, and determining the second information pairs based on the relative distance of each word pair data to the trigger words of the event pair data.

Determining the first event vector pair corresponding to the event pair data based on the word vector pairs, the event pair data, the first information pairs, and the second information pairs may include: encoding the event pair data based on the word vector pairs to obtain a first-dimension vector pair; encoding the first information pairs based on the word vector pairs to obtain a second-dimension vector pair; determining a third-dimension vector pair based on the second information pairs; and determining the first event vector pair based on the first-dimension, second-dimension, and third-dimension vector pairs. The first-dimension vector pair may be an event vector pair of a first dimension; the second-dimension vector pair may be a part-of-speech vector pair of a second dimension; the third-dimension vector pair may be a position vector pair of a third dimension; and the first event vector pair may be an event vector pair of a fourth dimension.

This embodiment concatenates the event sentence, the position information of each word in the event sentence, and the part-of-speech information of each word, thereby enriching the feature information of the input data.
In an optional embodiment of the present disclosure, the method further includes:

extracting from the first event vector pair, using a Bi-LSTM, a global information pair corresponding to the first event vector pair;

extracting from the first event vector pair, using a CNN, a local information pair corresponding to the first event vector pair;

fusing the global information pair and the local information pair to obtain a fused vector pair corresponding to the first event vector pair;

processing the fused vector pair with a first global max pooling layer to obtain a second event vector pair corresponding to the first event vector pair.

In this embodiment, extracting the global information pair from the first event vector pair using the Bi-LSTM may include: using the Bi-LSTM to pass the word information of each of the two event sentences of the first event vector pair from front to back and then from back to front, obtaining the global information of each of the two event sentences, and using the global information of each of the two event sentences as the global information pair corresponding to the first event vector pair. The number of Bi-LSTM neurons can be determined according to the actual situation and is not limited here; as an example, it may be 150. The global information pair can be determined according to the actual situation and is not limited here; as an example, it may be a global vector pair, for instance a global vector pair of a fifth dimension.

Extracting the local information pair from the first event vector pair using the CNN may include: using the CNN to extract the local information of each of the two event sentences of the first event vector pair, and using the local information of each of the two event sentences as the local information pair corresponding to the first event vector pair. The number of convolution kernels and the kernel window size of the CNN can be determined according to the actual situation and are not limited here; as an example, the number of kernels may be set to 300 and the window size to 2.

When the kernel window size is 2, the CNN extracts the local information between every two adjacent words of each of the two event sentences of the first event vector pair, and this is used as the local information pair corresponding to the first event vector pair. The local information pair can be determined according to the actual situation and is not limited here; as an example, it may be a local vector pair, for instance a local vector pair of a sixth dimension.

Fusing the global information pair and the local information pair to obtain the fused vector pair may include: adding the global information pair and the local information pair element-wise to obtain the fused vector pair corresponding to the first event vector pair. The fused vector pair may be a fused vector pair of a seventh dimension.

The second event vector pair can be determined according to the actual situation and is not limited here. As an example, the second event vector pair may be an event vector pair of an eighth dimension.
In an optional embodiment of the present disclosure, determining the first linear similarity and the first non-linear similarity of the event pair data includes:

determining the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair;

where the first linear similarity includes a first cosine distance, and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.

In this embodiment, determining the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair may be determining them according to the two second event vectors in the second event vector pair.
In an optional embodiment of the present disclosure, the method further includes:

determining, based on the word vector pairs and the event short-sentence pair data, a first event short-sentence vector pair corresponding to the event short-sentence pair data;

processing the first event short-sentence vector pair with a second global max pooling layer to obtain a second event short-sentence vector pair corresponding to the first event short-sentence vector pair.

In this embodiment, determining the first event short-sentence vector pair may include: encoding the event short-sentence pair data based on the word vector pairs to obtain the first event short-sentence vector pair corresponding to the event short-sentence pair data; the first event short-sentence vector pair may be an event short-sentence vector pair of a ninth dimension.

The second event short-sentence vector pair can be determined according to the actual situation and is not limited here. As an example, it may be an event short-sentence vector pair of a tenth dimension.
In an optional embodiment of the present disclosure, determining the second linear similarity and the second non-linear similarity of the event short-sentence pair data includes:

determining the second linear similarity and the second non-linear similarity of the event short-sentence pair data according to the second event short-sentence vector pair;

where the second linear similarity includes a second cosine distance, and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.

In this embodiment, determining the second linear similarity and the second non-linear similarity according to the second event short-sentence vector pair may be determining them according to the two second event short-sentence vectors in the second event short-sentence vector pair.

In some embodiments, the confidence of the event pair data in the first text is determined based on the second event vector pair, the second event short-sentence vector pair, the first linear similarity, the first non-linear similarity, the second linear similarity, and the second non-linear similarity.

In some embodiments, a company's intelligent customer service system still relies heavily on human agents to answer customers' questions; the method proposed in this embodiment can automatically obtain the answer that best matches the question raised by a customer, thereby reducing labor costs and improving the user experience.

This embodiment effectively enriches the feature information of the input data by concatenating, one to one, each word with its position information and part-of-speech information; uses the BERT pre-trained model for training to obtain accurate word vector representations; encodes the event sentences with a Bi-LSTM to obtain global vectors and with a CNN to obtain local vectors, and combines the two; extracts event short sentences using the trigger word's dependent words, the trigger word, and the arguments, instead of rigidly taking the three words before and after the trigger word; and combines linear similarity with non-linear similarity, computing the latter to compensate for the shortcomings of the former. Compared with the methods of the related art, the performance is improved.
For ease of understanding, an event co-reference resolution method based on BERT pre-training is illustrated here. The method is applied to an event co-reference resolution system based on BERT, Bi-LSTM, and CNN (the BNN system). FIG. 2 is a schematic diagram of the technical flow of the BNN system of the text processing method of this embodiment of the present disclosure. As shown in FIG. 2, the method includes the following steps:

Step 1: preprocess the event sentences.

The corpus is determined using the KBP and ACE2005 corpora; the KBP corpus has 6538 event sentences, and the ACE2005 corpus has 5349 event sentences. The event sentences provided by the corpus are input into the preprocessing module of the BNN system. These event sentences are news texts crawled directly from web pages; since the crawled text contains a large amount of irrelevant information such as special symbols and stop words, the text data is processed in the preprocessing module. The preprocessing module mainly uses regular expressions and a stop-word list to clean the text data, filtering out special symbols and stop words and restoring the words in each sentence to their base forms. The processed sentences are used as the input event sentences (Sentence, Sen). The Stanford natural language processing toolkit is applied to the input Sen to obtain the part-of-speech information (Parts-of-speech, Pos) of each word in the event sentence, and each word is then given its own position information (Location, Loc), taken as the relative distance from the word to the trigger word of the event sentence. The preprocessing module treats the two event sentences whose co-reference relationship needs to be judged as the event pair data.
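A minimal preprocessing sketch along these lines; the regular expression, stop-word list, and lemmatization step are illustrative assumptions rather than the exact rules used by the BNN system:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and"}  # illustrative subset

def clean_sentence(text: str) -> list[str]:
    # Data cleaning: strip special symbols, keeping word characters and spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Filter stop words; a full pipeline would also restore each word to its
    # base form (e.g. with NLTK's WordNetLemmatizer).
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
```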
Step 2: perform BERT prediction on the event sentences.

Since the accuracy of the information input into the BNN system largely determines the accuracy of event co-reference, and most previous experiments used fixed word vectors to represent the input information, which does not represent event sentences accurately enough, this embodiment uses the BERT pre-trained model to obtain the vector representations of words.

The BERT pre-trained model predicts masked words or sentences by masking, with special characters, words in the event sentence or sentences in the text where the event sentence is located, thereby obtaining the vector representation BM of each word. As a result, the words in a sentence are strongly associated with one another, and the sentences in a text have strong contextual connectivity and logical coherence, which has a large impact on the experimental results. The formula is shown in (1):
BM_i = BERT(Sen_i) (i = 1, 2)    (1)
Step 3: encode the event sentences with the word vectors.

The word vectors BM trained by the BERT pre-trained model are used to encode the event sentence Sen and the part-of-speech information Pos, yielding an event sentence vector SEN of dimension a×b and a part-of-speech vector POS of dimension a×b. The event sentence vector, the part-of-speech vector, and the position information of dimension a×1 are then concatenated horizontally to form an event vector EB of dimension a×(2b+1). The formula is shown in (2):
EB_i = Concat(SEN_i, POS_i, Loc_i) (i = 1, 2)    (2)
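A sketch of the horizontal concatenation in formula (2), using NumPy; the dimensions a and b are illustrative (e.g. a tokens per sentence and b-dimensional BERT vectors):

```python
import numpy as np

a, b = 32, 768                       # illustrative: 32 tokens, 768-dim vectors
SEN = np.random.rand(a, b)           # event sentence vectors (a x b)
POS = np.random.rand(a, b)           # part-of-speech vectors (a x b)
Loc = np.random.rand(a, 1)           # relative distance to the trigger (a x 1)

EB = np.concatenate([SEN, POS, Loc], axis=1)   # event vector, a x (2b + 1)
assert EB.shape == (a, 2 * b + 1)
```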
Step 4: extract the global and local information of the event sentences.

When judging whether two event sentences are co-referential, one can first observe from the overall structure whether the two event sentences are similar; if the similarity is not high, a co-reference relationship is still possible, so a word-by-word comparison is needed to find the co-reference relationship between the two.

To this end, this embodiment first uses a Bi-LSTM to extract the global information of the event vector EB, with the number of Bi-LSTM neurons set to 150. The Bi-LSTM passes the word information of the event sentence forward in sequence and then backward in reverse, observing an event sentence from a global perspective. A CNN is also used to extract the local information of the event vector EB, with the number of convolution kernels set to 300, the kernel window size set to 2, and the dimensionality kept unchanged. Since the kernel window size is 2, local information between every two adjacent words in the event sentence is extracted. The two networks respectively yield a global vector GE of dimension a×300 and a local vector LE of dimension a×300, as shown in formulas (3) and (4):
LEi=Conv(EBi)(i=1,2)        (4)
To this end, this embodiment first uses Bi-LSTM to extract the global information of the event vector EB, and sets the number of Bi-LSTM neurons to 150. Bi-LSTM will pass the information of the previous words in the event sentence to the back in sequence, and then pass the information from the back to the front in reverse, observing an event sentence from a global perspective. Use CNN to extract local information of event vector EB, set the number of CNN convolution kernels to 300, the convolution kernel window size to 2, and keep the dimension unchanged. Since the convolution kernel window size is 2, local information between two adjacent words in the event sentence will be extracted. The two networks obtain a global vector GE with a dimension of a×300 and a local vector LE with a dimension of a×300, respectively, as shown in formulas (3) and (4):

LE i =Conv(EB i )(i=1,2) (4)
Since the global vector GE and the local vector LE have the same dimension, this embodiment adds GE and LE element-wise to obtain a vector GL of dimension a×300, which is equivalent to fusing the global and local information of each word in the event sentence. The formula is shown in (5):

GL_i = GE_i + LE_i (i = 1, 2)    (5)
Finally, the vector GL is passed through a global max pooling layer to obtain a vector EX of dimension a×1, as shown in formula (6):
EX_i = GlobalMax(GL_i) (i = 1, 2)    (6)
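A compact PyTorch sketch of step 4 under the stated settings (150 Bi-LSTM units per direction and 300 convolution kernels of window size 2); the padding, batch handling, and pooling over the 300 channels (to match the stated a×1 shape of EX) are simplifying assumptions:

```python
import torch
import torch.nn as nn

class GlobalLocalEncoder(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        # 150 units per direction yield a 300-dim global vector GE per word.
        self.bilstm = nn.LSTM(in_dim, 150, bidirectional=True, batch_first=True)
        # 300 kernels of window size 2; padding keeps the sentence length a.
        self.conv = nn.Conv1d(in_dim, 300, kernel_size=2, padding=1)

    def forward(self, eb: torch.Tensor) -> torch.Tensor:   # eb: (1, a, in_dim)
        ge, _ = self.bilstm(eb)                             # GE: (1, a, 300), formula (3)
        le = self.conv(eb.transpose(1, 2))                  # LE, formula (4)
        le = le[..., : eb.size(1)].transpose(1, 2)          # trim pad: (1, a, 300)
        gl = ge + le                                        # element-wise fusion, formula (5)
        return gl.max(dim=2).values                         # max over channels: a values, formula (6)
```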
第五步:在事件句中抽取事件短句。Step 5: Extract event short sentences from event sentences.
在相关技术中,研究人员会固定的抽取触发词前后各三个单词作为事件短句,来简要的描述事件句。这种方法可能会抽取出一个结构信息不完整的语句,从而会错误的表示原句的意思。为此,本实施例优化了该抽取方法,事件短句抽取步骤如下:In related technologies, researchers will extract three words before and after the trigger word as event short sentences to briefly describe the event sentence. This method may extract a sentence with incomplete structural information, thereby incorrectly representing the meaning of the original sentence. To this end, this embodiment optimizes the extraction method, and the steps of event short sentence extraction are as follows:
步骤(5.1)使用斯坦福自然语言处理工具获得事件句中的论元,论元主要包括:施事者、受事者、事件发生的时间地点等。Step (5.1) uses the Stanford natural language processing tool to obtain the arguments in the event sentence. The arguments mainly include: agent, patient, time and place of the event, etc.
步骤(5.2)使用依存词分析工具生成句中触发词的依存词。Step (5.2) uses a dependency word analysis tool to generate dependency words for the trigger word in the sentence.
步骤(5.3)计算各论元和各依存词距离触发词的距离,在触发词前后确定距其最远的2个词,将这2个词作为事件短句的起始与结束位置。Step (5.3) calculates the distance between each argument and each dependency word and the trigger word, determines the two words farthest from the trigger word before and after the trigger word, and uses these two words as the start and end positions of the event phrase.
Step (5.4): intercept the span from the start position to the end position as the event short sentence (a code sketch follows the worked example below).
Example: consider the sentence "Zhang Junxiong, the newly appointed Executive President, was also invited to attend the inauguration ceremony and delivered a speech." In this sentence, the trigger word is "appointed", and its dependency words are "Zhang Junxiong", "President", and "invited", at distances of -3, 2, and 5 from the trigger. The arguments of the event sentence are "Zhang Junxiong" and "invited", at distances of -3 and 5 from the trigger.
The event short sentence extracted by the fixed-window method is "Junxiong the newly appointed Executive President was", which is clearly incomplete. With the optimized method of this embodiment, the dependency word or argument farthest before the trigger, "Zhang Junxiong", is taken as the start position, and the dependency word or argument farthest after the trigger, "invited", as the end position, yielding the event short sentence "Zhang Junxiong the newly appointed Executive President was also invited".
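A minimal sketch of steps (5.1)-(5.4) on the worked example above. The token indices standing in for the parser outputs are written in by hand (a real system would obtain them from a tool such as Stanford CoreNLP), and the multi-word argument "Zhang Junxiong" is represented by the index of its first token so that the extracted span matches the text.

```python
def extract_event_phrase(tokens, trigger_idx, argument_idxs, dependent_idxs):
    # Steps (5.3)-(5.4): span from the candidate farthest before the trigger
    # to the candidate farthest after it, trigger included.
    candidates = set(argument_idxs) | set(dependent_idxs) | {trigger_idx}
    start, end = min(candidates), max(candidates)
    return tokens[start:end + 1]

tokens = ("Zhang Junxiong the newly appointed Executive President was also "
          "invited to attend the inauguration ceremony and delivered a speech").split()
# Trigger "appointed" is at index 4; "Junxiong" sits at offset -3 from it,
# "President" at +2, "invited" at +5, matching the distances in the text.
# The argument "Zhang Junxiong" is anchored at its first token (index 0).
phrase = extract_event_phrase(tokens, trigger_idx=4,
                              argument_idxs=[0, 9], dependent_idxs=[0, 6, 9])
print(" ".join(phrase))
# -> Zhang Junxiong the newly appointed Executive President was also invited
```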
After the event short sentence is obtained as above, it is encoded with the word vector BM to obtain an event short sentence vector ES of dimension a×b, which is then passed through a global max pooling layer to obtain an event short sentence vector SX of dimension a×1, as shown in formula (7):

SXi = GlobalMax(ESi) (i=1,2)        (7)
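For completeness, a sketch of formula (7) with a random stand-in for the BM-encoded phrase matrix ES; as with formula (6), pooling over the feature axis follows the a×1 shape stated in the text and is an interpretive assumption.

```python
import torch

a, b = 10, 300                    # illustrative phrase length and embedding size
ES = torch.randn(1, a, b)         # BM-encoded event short sentence
SX = ES.max(dim=-1).values        # formula (7): global max pool -> (1, a)
```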
Step 6: Calculate the similarity between the two event sentences.
The key to judging whether event sentences corefer is computing their similarity; the accuracy and comprehensiveness of this computation strongly affect the model's performance. In the related art, researchers use only the cosine distance to obtain the linear similarity between event sentences. Linear similarity considers the relationship between two event sentences as a whole, so if their structures differ too much, a coreferent pair may be misjudged as non-coreferent. Non-linear similarity, by contrast, can capture the word-to-word relationships within an event pair, compensating for this shortcoming of linear similarity.
This embodiment uses three similarity measures: the cosine distance C, the bilinear distance S, and the single-layer network distance L, as shown in formulas (8) to (13):

C1 = cos(EX1, EX2)        (8)

C2 = cos(SX1, SX2)        (9)

S1 = EX1^T · Ws1 · EX2        (10)

S2 = SX1^T · Ws2 · SX2        (11)

L1 = α(Wl1 · [EX1; EX2] + bl1)        (12)

L2 = α(Wl2 · [SX1; SX2] + bl2)        (13)
In formula (8), C1 is the cosine distance for the event sentence vectors. In formula (9), C2 is the cosine distance for the event short sentence vectors. In formula (10), Ws1 is the weight used to compute the bilinear distance for the event sentence vectors; in formula (11), Ws2 is the corresponding weight for the event short sentence vectors. In formula (12), Wl1 and bl1 are the weight and offset vector used to compute the single-layer network distance for the event sentence vectors, and [·;·] denotes concatenation; in formula (13), Wl2 and bl2 are the corresponding weight and offset vector for the event short sentence vectors, and α is an activation function.
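The three distances for one pair can be sketched as follows; nn.Bilinear and nn.Linear carry the trainable parameters standing in for Ws1 and (Wl1, bl1), and using relu for α mirrors formula (15) but is an assumption here, since the text does not fix the activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

a = 12
EX1, EX2 = torch.randn(1, a), torch.randn(1, a)    # pooled event-sentence vectors

# Formula (8): cosine distance
C1 = F.cosine_similarity(EX1, EX2, dim=1)

# Formula (10): bilinear distance; nn.Bilinear computes x1^T W x2 + bias
bilinear = nn.Bilinear(a, a, 1)
S1 = bilinear(EX1, EX2)

# Formula (12): single-layer network distance over the concatenated pair
single = nn.Linear(2 * a, 1)
L1 = F.relu(single(torch.cat([EX1, EX2], dim=1)))

# C2, S2 and L2 (formulas (9), (11), (13)) are computed the same way from
# the event short sentence vectors SX1 and SX2.
```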
Step 7: Output the confidence.
The event sentence vectors EX, the event short sentence vectors SX, the cosine distances C, the bilinear distances S, and the single-layer network distances L are concatenated to generate a vector P, as shown in formula (14):

P = Concat(EX1, EX2, SX1, SX2, C1, C2, S1, S2, L1, L2)        (14)
The vector P is fed into a fully connected classifier that uses the relu activation function, as shown in formula (15):

Vh = α(Wh * P + bh)        (15)
In formula (15), Wh denotes the weight and bh the offset vector of the fully connected layer applied to the vector P.
The confidence that the events corefer is then obtained through a sigmoid layer, as shown in formula (16):

score = sigmoid(W0 * Vh + b0)        (16)
In formula (16), W0 denotes the weight and b0 the offset vector of the confidence output layer.
The confidence score is a value between 0 and 1: if the score is greater than 0.5, the pair is judged coreferent; otherwise, it is judged non-coreferent. To prevent overfitting, this embodiment uses Dropout, a strategy widely applied in deep learning to mitigate model overfitting, with the rate set to 0.2.
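Putting step 7 together, a self-contained sketch of formulas (14)-(16) with random stand-ins for the upstream features; the hidden width of 128 is an assumption, since the text does not state the classifier's size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

a = 12
EX1, EX2, SX1, SX2 = (torch.randn(1, a) for _ in range(4))      # pooled vectors
C1, C2, S1, S2, L1, L2 = (torch.randn(1, 1) for _ in range(6))  # pair distances

# Formula (14): concatenate all features into P
P = torch.cat([EX1, EX2, SX1, SX2, C1, C2, S1, S2, L1, L2], dim=1)

# Formula (15): fully connected layer with relu activation
hidden = nn.Linear(P.shape[1], 128)
Vh = F.relu(hidden(P))

# Dropout 0.2 against overfitting, then formula (16): sigmoid confidence
Vh = nn.Dropout(p=0.2)(Vh)
score = torch.sigmoid(nn.Linear(128, 1)(Vh))
coreferent = bool(score.item() > 0.5)   # coreferent if score > 0.5
```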
Through BERT pre-training and the extraction of global and local information, the BNN system mines the semantic information of the text accurately and comprehensively and converts it into vector representations. Event short sentence extraction and similarity distance computation further help the model discriminate coreference relationships. The system achieved good results in practical tests, improving performance over the methods of the related art and existing techniques; Table 1 gives the KBP performance results and Table 2 the ACE performance results.
Table 1: KBP performance result data
In Table 1, MUC, B3, BLANC, CEAFe, and Links are performance evaluation metrics; KBP and ACE are test sets.
As Table 1 shows, the BNN system improves substantially on the neural network methods of related scholar 6 and KBP-TOP, and improves on the machine learning method of related scholar 4 by 0.6% on average. Although the gain is only 0.6%, the neural network approach offers lower labor cost, higher efficiency, and better portability than the machine learning approach.
An embodiment of the present disclosure provides a text processing apparatus. FIG. 3 is a schematic diagram of the composition structure of the text processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus 300 includes:
a first acquisition module 301, configured to acquire event pair data included in a first text;
a first processing module 302, configured to process the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data;
a first determination module 303, configured to determine a first linear similarity and a first non-linear similarity of the event pair data, and to determine a second linear similarity and a second non-linear similarity of the event short sentence pair data; and
a second determination module 304, configured to determine a confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity, the confidence representing the degree to which the event pair data have a coreference relationship.
In other embodiments, the first processing module 302 is further configured to: determine, using the dependency syntax analysis tool, the arguments and dependency words of the trigger word in the event pair data; determine a first distance between each argument and the trigger word and a second distance between each dependency word and the trigger word; sort the first distances and the second distances to obtain a sorting result; determine the two arguments or dependency words corresponding to the maximum distances in the sorting result, taking them as the start word and end word of the event short sentence pair data; and intercept the event pair data based on the start word and the end word to obtain the event short sentence pair data.
In other embodiments, the second determination module 304 is further configured to determine a confidence vector of the event pair data in the first text based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity, and to process the confidence vector using a fully connected classifier to obtain the confidence of the event pair data.
In other embodiments, the apparatus 300 further includes a prediction module configured to predict the event pair data using the pre-trained model BERT to obtain word vector pairs corresponding to the event pair data.
In other embodiments, the event pair data includes a plurality of word pair data, and the apparatus 300 further includes a second acquisition module and a third determination module, wherein:
the second acquisition module is configured to acquire a first information pair and a second information pair of the plurality of word pair data in the event pair data, the first information pair representing a part-of-speech information pair of the word pair data and the second information pair representing a position information pair of the word pair data; and
the third determination module is configured to determine, based on the word vector pair, the event pair data, the first information pair and the second information pair, a first event vector pair corresponding to the event pair data.
In other embodiments, the apparatus 300 further includes a first extraction module, a second extraction module, a fusion module and a second processing module, wherein:
the first extraction module is configured to extract the first event vector pair using the bidirectional long short-term memory network Bi-LSTM to obtain a global information pair corresponding to the first event vector pair;
the second extraction module is configured to extract the first event vector pair using the convolutional neural network CNN to obtain a local information pair corresponding to the first event vector pair;
the fusion module is configured to fuse the global information pair and the local information pair to obtain a fused vector pair corresponding to the first event vector pair; and
the second processing module is configured to process the fused vector pair with a first global max pooling layer to obtain a second event vector pair corresponding to the first event vector pair.
In other embodiments, the first determination module 303 is further configured to determine the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair, wherein the first linear similarity includes a first cosine distance, and the first non-linear similarity includes at least one of a first bilinear distance and a first single-layer network distance.
In other embodiments, the apparatus 300 further includes a fourth determination module and a third processing module, wherein:
the fourth determination module is configured to determine, based on the word vector pair and the event short sentence pair data, a first event short sentence vector pair corresponding to the event short sentence pair data; and
the third processing module is configured to process the first event short sentence vector pair with a second global max pooling layer to obtain a second event short sentence vector pair corresponding to the first event short sentence vector pair.
In other embodiments, the first determination module 303 is further configured to determine the second linear similarity and the second non-linear similarity of the event short sentence pair data according to the second event short sentence vector pair, wherein the second linear similarity includes a second cosine distance, and the second non-linear similarity includes at least one of a second bilinear distance and a second single-layer network distance.
The description of the above apparatus embodiments is similar to that of the method embodiments and has similar beneficial effects. For technical details not disclosed in the apparatus embodiments of the present disclosure, refer to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the above text processing method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a text processing device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present disclosure further provides a text processing device, including a memory and a processor, the memory storing a computer program executable on the processor, and the processor implementing any step of the above method when executing the program.
Correspondingly, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing any step of the above method when executed by a processor.
It should be pointed out here that the description of the above storage medium and device embodiments is similar to that of the method embodiments and has similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of the present disclosure, refer to the description of the method embodiments of the present disclosure.
It should be noted that FIG. 4 is a schematic diagram of a hardware entity structure of a text processing device according to an embodiment of the present disclosure. As shown in FIG. 4, the hardware entity of the text processing device 400 includes a processor 401 and a memory 403; optionally, the text processing device 400 may further include a communication interface 402.
It can be understood that the memory 403 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory 403 described in the embodiments of the present disclosure is intended to include, but is not limited to, these and any other suitable types of memory.
The methods disclosed in the above embodiments of the present disclosure may be applied to, or implemented by, the processor 401. The processor 401 may be an integrated circuit chip with signal processing capability. In an implementation process, each step of the above methods may be completed by an integrated logic circuit of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 401 may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of the present disclosure may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium; the storage medium is located in the memory 403, and the processor 401 reads the information in the memory 403 and completes the steps of the foregoing methods in combination with its hardware.
In an exemplary embodiment, the text processing device may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontroller units (MCUs), microprocessors, or other electronic components, to execute the foregoing methods.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed methods and apparatuses may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a division by logical function; other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings or communication connections between the components shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments.
A person of ordinary skill in the art may understand that all or part of the steps implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the embodiments of the present disclosure is implemented in the form of a software function unit and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a text processing device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The text processing method, apparatus and computer storage medium described in the examples of the present disclosure only take the embodiments described herein as examples, but are not limited thereto; anything involving this text processing method, apparatus and computer storage medium falls within the protection scope of the present disclosure.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Therefore, occurrences of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures or characteristics may be combined in one or more embodiments in any suitable manner. It should also be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The sequence numbers of the above embodiments of the present disclosure are for description only and do not represent the superiority or inferiority of the embodiments.
It should be noted that, herein, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or apparatus including the element.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and such changes or substitutions shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

  1. A text processing method, comprising:
    acquiring event pair data included in a first text;
    processing the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data;
    determining a first linear similarity and a first non-linear similarity of the event pair data, and determining a second linear similarity and a second non-linear similarity of the event short sentence pair data; and
    determining a confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity, the confidence representing the degree to which the event pair data have a coreference relationship.
  2. The method according to claim 1, wherein processing the event pair data using the dependency syntax analysis tool to obtain the event short sentence pair data corresponding to the event pair data comprises:
    determining, using the dependency syntax analysis tool, arguments and dependency words of a trigger word in the event pair data;
    determining a first distance between each argument and the trigger word, and determining a second distance between each dependency word and the trigger word;
    sorting the first distances and the second distances to obtain a sorting result;
    determining the two arguments or dependency words corresponding to the maximum distances in the sorting result, and taking them as a start word and an end word of the event short sentence pair data; and
    intercepting the event pair data based on the start word and the end word to obtain the event short sentence pair data.
  3. The method according to claim 1, wherein determining the confidence of the event pair data in the first text based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity comprises:
    determining a confidence vector of the event pair data in the first text based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity; and
    processing the confidence vector using a fully connected classifier to obtain the confidence of the event pair data.
  4. The method according to claim 1, further comprising:
    predicting the event pair data using a pre-trained model BERT to obtain word vector pairs corresponding to the event pair data.
  5. The method according to claim 4, wherein the event pair data comprises a plurality of word pair data, and the method further comprises:
    acquiring a first information pair and a second information pair of the plurality of word pair data in the event pair data, the first information pair representing a part-of-speech information pair of the word pair data, and the second information pair representing a position information pair of the word pair data; and
    determining, based on the word vector pair, the event pair data, the first information pair and the second information pair, a first event vector pair corresponding to the event pair data.
  6. The method according to claim 5, further comprising:
    extracting the first event vector pair using a bidirectional long short-term memory network Bi-LSTM to obtain a global information pair corresponding to the first event vector pair;
    extracting the first event vector pair using a convolutional neural network CNN to obtain a local information pair corresponding to the first event vector pair;
    fusing the global information pair and the local information pair to obtain a fused vector pair corresponding to the first event vector pair; and
    processing the fused vector pair with a first global max pooling layer to obtain a second event vector pair corresponding to the first event vector pair.
  7. The method according to claim 6, wherein determining the first linear similarity and the first non-linear similarity of the event pair data comprises:
    determining the first linear similarity and the first non-linear similarity of the event pair data according to the second event vector pair,
    wherein the first linear similarity comprises a first cosine distance, and the first non-linear similarity comprises at least one of a first bilinear distance and a first single-layer network distance.
  8. The method according to claim 4, further comprising:
    determining, based on the word vector pair and the event short sentence pair data, a first event short sentence vector pair corresponding to the event short sentence pair data; and
    processing the first event short sentence vector pair with a second global max pooling layer to obtain a second event short sentence vector pair corresponding to the first event short sentence vector pair.
  9. The method according to claim 8, wherein determining the second linear similarity and the second non-linear similarity of the event short sentence pair data comprises:
    determining the second linear similarity and the second non-linear similarity of the event short sentence pair data according to the second event short sentence vector pair,
    wherein the second linear similarity comprises a second cosine distance, and the second non-linear similarity comprises at least one of a second bilinear distance and a second single-layer network distance.
  10. A text processing apparatus, comprising:
    a first acquisition module, configured to acquire event pair data included in a first text;
    a first processing module, configured to process the event pair data using a dependency syntax analysis tool to obtain event short sentence pair data corresponding to the event pair data;
    a first determination module, configured to determine a first linear similarity and a first non-linear similarity of the event pair data, and to determine a second linear similarity and a second non-linear similarity of the event short sentence pair data; and
    a second determination module, configured to determine a confidence of the event pair data based on the event pair data, the event short sentence pair data, the first linear similarity, the first non-linear similarity, the second linear similarity and the second non-linear similarity, the confidence representing the degree to which the event pair data have a coreference relationship.
  11. A text processing device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the program.
  12. A storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 9.
PCT/CN2023/120521 2022-10-26 2023-09-21 Text processing method and apparatus, and electronic device and storage medium WO2024087963A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211320876.5A CN116821276A (en) 2022-10-26 2022-10-26 Text processing method, device, electronic equipment and storage medium
CN202211320876.5 2022-10-26

Publications (1)

Publication Number Publication Date
WO2024087963A1 true WO2024087963A1 (en) 2024-05-02

Family

ID=88141677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/120521 WO2024087963A1 (en) 2022-10-26 2023-09-21 Text processing method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116821276A (en)
WO (1) WO2024087963A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302794A (en) * 2015-10-30 2016-02-03 苏州大学 Chinese homodigital event recognition method and system
CN114996414A (en) * 2022-08-05 2022-09-02 中科雨辰科技有限公司 Data processing system for determining similar events
US20220318505A1 (en) * 2021-04-06 2022-10-06 Adobe Inc. Inducing rich interaction structures between words for document-level event argument extraction

Also Published As

Publication number Publication date
CN116821276A (en) 2023-09-29
