WO2022078348A1 - Mail content extraction method and apparatus, and electronic device and storage medium - Google Patents

Mail content extraction method and apparatus, and electronic device and storage medium Download PDF

Info

Publication number
WO2022078348A1
WO2022078348A1 · PCT/CN2021/123362
Authority
WO
WIPO (PCT)
Prior art keywords
model
data set
email
mail
bert
Prior art date
Application number
PCT/CN2021/123362
Other languages
French (fr)
Chinese (zh)
Inventor
徐国诚
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022078348A1 publication Critical patent/WO2022078348A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/107Computer-aided management of electronic mailing [e-mailing]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the present application relates to the technical field of text content extraction in artificial intelligence, and in particular to a method, device, electronic device and storage medium for extracting email content.
  • In the prior art, email content extraction mostly relies on regular-expression-based content extraction methods; the inventor realized that such methods usually require a large amount of work and apply only to limited scenarios.
  • a first aspect of the present application provides a method for extracting email content, and the method for extracting email content includes:
  • a second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • a third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • a fourth aspect of the present application provides an apparatus for extracting mail content, and the apparatus for extracting mail content includes:
  • a first data labeling module configured to obtain a first mail data set, and label each mail in the first mail data set to obtain a first training data set;
  • a first model training module, configured to train the BERT model using the first training data set to obtain a first model;
  • a second data labeling module configured to obtain a second mail data set, and label each mail in the second mail data set to obtain a second training data set;
  • a second model training module, configured to train the BERT-LSTM-CRF model using the second training data set to obtain a second model;
  • a receiving module for receiving email paragraphs and questions
  • an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
  • the first training data set is obtained by obtaining the first mail data set and marking the first mail data set;
  • the first model is obtained by training the BERT model by using the first training data set.
  • The second mail data set is obtained and marked to obtain the second training data set; the BERT-LSTM-CRF model is trained using the second training data set to obtain the second model; an email paragraph and a question are received; the first model and the second model are loaded, and the email paragraph and the question are input into the first model or the second model to obtain an extraction result.
  • The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement; it can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for extracting email content of the present application.
  • FIG. 2 is a functional block diagram of a preferred embodiment of the mail content extraction device of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the method for extracting email content according to the present application.
  • the method for extracting email content of the present application is applied in one or more electronic devices.
  • The electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, and a cloud server.
  • the device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-activated device.
  • FIG. 1 is a flowchart of a method for extracting email content in an embodiment of the present application. According to different requirements, the order of the steps in the flow chart can be changed, and some steps can be omitted.
  • the method for extracting email content specifically includes the following steps:
  • Step S11 acquiring a first mail data set, and marking each mail in the first mail data set to obtain a first training data set.
  • the marking each email in the first email data set includes:
  • The email content of each mail in the first mail data set is marked as a first mark, the preset question is marked as a second mark, and the answer corresponding to the preset question in the email content is marked as a third mark.
  • Specifically, the email content of each email is marked as "paragraph", the preset question is marked as "question", and the answer text corresponding to the preset question in the email content is marked as "answer", as illustrated in the sketch below.
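As an illustration of this labeling scheme, one labeled sample of the first training data set could be stored as a SQuAD-style record. This is a hedged sketch of a plausible layout; the patent fixes the three marks "paragraph", "question" and "answer" but not an on-disk format, and the optional start offset reflects the span-based output variant mentioned below.

```python
# Hypothetical layout of one labeled sample in the first training data set.
sample = {
    "paragraph": "Please reply soon. Sender: 123@45.com",  # first mark: email content
    "question": "Who is the sender",                        # second mark: preset question
    "answer": {
        "text": "123@45.com",   # third mark: answer text in the email content
        "answer_start": 27,     # assumed field: start position of the answer,
                                # matching the span-output variant of the BERT model
    },
}
```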
  • Step S12 using the first training data set to train a BERT model to obtain a first model.
  • In at least one embodiment, using the first training data set to train the BERT model to obtain the first model includes: using the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, using the answer corresponding to the third mark as the expected output of the BERT model, and optimizing the BERT model to obtain the first model.
  • In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.
  • The BERT model is a Bidirectional Encoder Representations from Transformers model.
  • Further, using the first training data set to train the BERT model to obtain the first model may specifically include: obtaining a preset-question token sequence and an email-content token sequence from the preset question and the email content; splicing the two token sequences, adding a separator characterizing the question, such as [CLS], before the preset-question token sequence and a separator characterizing the content, such as [SEP], before the email-content token sequence, and using the spliced sequences with the added separators as the input data of the BERT model; encoding the input data with the encoding layer of the BERT model; and training the prediction layer of the BERT model with the answer corresponding to the third mark as the expected output until the prediction layer converges, so that the converged prediction layer can predict the answer corresponding to a question, yielding the first model. A minimal input-construction sketch follows below.
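A minimal sketch of this input construction and span prediction is shown below, using the Hugging Face transformers library for illustration. The library choice and the bert-base-chinese checkpoint are assumptions (in practice the checkpoint would be the fine-tuned first model); the patent itself only prescribes the spliced token sequences with their separators and the span-predicting layer.

```python
# Hedged sketch: build the "[CLS] question [SEP] paragraph ..." input and
# predict the answer span with a BERT question-answering head.
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")     # assumed checkpoint
model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")  # stand-in for the first model

question = "Who is the sender"
paragraph = "Please reply soon. Sender: 123@45.com"

# The tokenizer splices the two token sequences and inserts the separators.
inputs = tokenizer(question, paragraph, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The prediction layer scores every token as a possible start/end of the answer.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```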
  • Step S13 Obtain a second mail data set, and mark each mail in the second mail data set to obtain a second training data set.
  • In at least one embodiment, labeling each email in the second email data set includes: labeling each character of each email in the second email data set using the BIO labeling method to obtain a tag sequence, where each email and its corresponding tag sequence form the second training data set.
  • Specifically, using the BIO labeling method to label each character in each email in the second email data set includes:
  • Use B to mark the starting position of a named entity (for example, when labeling the "sender", the starting character of the named entity corresponding to the sender is labeled "B-SENDER"), use I to mark the remaining characters inside a named entity (for example, the subsequent characters of the sender entity are labeled "I-SENDER"), and use O to mark characters that do not belong to any named entity, as in the sketch below.
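A minimal sketch of this BIO scheme follows; the helper function and its span-based input are illustrative assumptions rather than the patent's labeling tooling.

```python
# Hedged sketch of BIO labeling: given the email text and the character span of
# one named entity, emit one tag per character (B- for the entity start, I- for
# its remaining characters, O for everything else).
def bio_tags(text: str, start: int, end: int, entity_type: str) -> list[str]:
    tags = ["O"] * len(text)
    tags[start] = f"B-{entity_type}"    # starting position of the named entity
    for i in range(start + 1, end):     # inside of the named entity
        tags[i] = f"I-{entity_type}"
    return tags

text = "请速回。发件人:123@45.com"
# the sender entity "123@45.com" occupies characters 8..18 (end-exclusive)
print(list(zip(text, bio_tags(text, 8, 18, "SENDER"))))
```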
  • Step S14 using the second training data set to train the BERT-LSTM-CRF model to obtain a second model.
  • In at least one embodiment, using the second training data set to train the BERT-LSTM-CRF model to obtain the second model includes: using the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, using the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  • LSTM denotes a long short-term memory network, and CRF denotes a conditional random field.
  • In at least one embodiment, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
  • Segment the mail text content of each email in the second training data set into tokens to obtain an email-content token sequence, and add identifiers at its starting and end positions; for example, add the identifier [CLS] at the start of the email-content token sequence and the identifier [SEP] at its end;
  • Input the email-content token sequence with the added identifiers into the BERT model to output the vector representation of each character: the input sequence is obtained by summing the token embedding, segment embedding and position embedding of each character, and the transformer in the BERT model is called to encode the input sequence to obtain a vector sequence in which each vector corresponds to the token with the same index, i.e., the word embedding sequence of the email content;
  • Inputting the word embedding sequence obtained by the BERT model into the long short-term memory network model to obtain the semantic representation of the input includes: the word embedding sequence $(x_1, x_2, \dots, x_n)$ is used as the input of each time step of the long short-term memory network; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \dots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \dots, \overleftarrow{h_n})$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \dots, h_n) \in \mathbb{R}^{n \times m}$; after dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e., the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \dots, p_n) \in \mathbb{R}^{n \times k}$, where each entry $p_{ij}$ is the score of classifying the character $x_i$ into the j-th tag;
  • Decoding the input semantic representation with the conditional random field model to obtain the tag sequence with the highest probability includes: the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the i-th tag to the j-th tag, representing the probability of a one-step transition between tag states within the sentence $x$, so that tags already assigned can be exploited when labeling a position; of the $k+2$ states, $k$ is the number of tags, and the extra 2 correspond to the start state at the head of the email content and the end state at its tail. For a tag sequence $y = (y_1, y_2, \dots, y_n)$ whose length equals the email length, the model scores the tag sequence $y$ for the email content $x$ as $\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$, and softmax yields the normalized probability $p(y \mid x) = e^{\mathrm{score}(x, y)} / \sum_{\tilde{y}} e^{\mathrm{score}(x, \tilde{y})}$ that the tag sequence of the email content $x$ equals $y$. The model is trained by maximizing the log-likelihood; during decoding, the Viterbi dynamic programming algorithm finds the optimal path, i.e., the character tag sequence with the highest probability. A model sketch follows below.
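Putting the pieces above together, a compact PyTorch sketch of the BERT-LSTM-CRF architecture is shown below. It is a hedged sketch under stated assumptions, not the patent's implementation: the Hugging Face transformers library, the third-party pytorch-crf package, the bert-base-chinese checkpoint, and the hidden size are all choices made here for illustration.

```python
# Hedged sketch of BERT-LSTM-CRF: BERT yields the word embedding sequence, a
# bidirectional LSTM performs semantic encoding, a linear layer maps the
# m-dimensional hidden states to k tag scores (the matrix P), and a CRF layer
# (holding the transition matrix A) decodes the most probable tag sequence.
import torch.nn as nn
from torchcrf import CRF                # from the pytorch-crf package
from transformers import BertModel

class BertLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        self.lstm = nn.LSTM(
            self.bert.config.hidden_size, lstm_hidden,
            batch_first=True, bidirectional=True,  # forward/backward states are concatenated
        )
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(2 * lstm_hidden, num_tags)  # m-dim -> k-dim mapping
        self.crf = CRF(num_tags, batch_first=True)          # transition matrix A lives here

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)                       # hidden state sequence (h_1, ..., h_n)
        emissions = self.linear(self.dropout(h))  # matrix P of per-tag scores
        mask = attention_mask.bool()
        if tags is not None:
            # training: maximize the log-likelihood (return its negative as the loss)
            return -self.crf(emissions, tags, mask=mask)
        # inference: Viterbi decoding of the tag sequence with the highest probability
        return self.crf.decode(emissions, mask=mask)
```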
  • Step S15 receiving the email paragraph and question.
  • The email paragraph and the question are received using a Django server.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. A minimal receiving-endpoint sketch follows below.
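Below is a minimal sketch of a Django view that could receive the two inputs; the view name, URL routing, and JSON field names are assumptions, since the patent only states that a Django server receives the email paragraph and the question.

```python
# views.py — hedged sketch of receiving the email paragraph and the question
# as a JSON POST body; field names and routing are assumptions.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

@csrf_exempt
@require_POST
def receive_extraction_request(request):
    payload = json.loads(request.body)
    paragraph = payload["paragraph"]   # the email paragraph
    question = payload["question"]     # the question
    # ... hand paragraph and question to the loaded models (step S16) ...
    return JsonResponse({"paragraph": paragraph, "question": question})
```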
  • Step S16, load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
  • the above-mentioned extraction results can also be stored in a node of a blockchain.
  • In at least one embodiment, loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
  • the first output is obtained by inputting the email paragraph and the question into the first model, including:
  • the first model obtains a first output through calculation.
  • For example, the email paragraph and the question are converted into data that meets the format requirements of the first model, and that data is input into the first model; when the first model predicts the answer "123@45.com" to the question, it outputs the start position and end position of the answer in the email paragraph, and "From: '123@45.com'" is taken as the extraction result.
  • the second output is obtained by inputting the email paragraph and the question into the second model, including:
  • the second model obtains a second output through calculation.
  • the email paragraph "Please reply. Sender: 123@45.com” is input into the second model, and the second model outputs Answer "123@45.com” with “From: '123@45.com'” as the extraction result.
  • In summary, the present application obtains the first mail data set and marks it to obtain the first training data set; trains the BERT model using the first training data set to obtain the first model; obtains the second mail data set and marks it to obtain the second training data set; trains the BERT-LSTM-CRF model using the second training data set to obtain the second model; receives the email paragraph and the question; and loads the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result.
  • The first model, which performs better, is used as the main extraction method, and the second model is used as a supplement; the method can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 2 is a structural diagram of a mail content extraction apparatus 30 in an embodiment of the present application.
  • the email content extraction apparatus 30 is executed in an electronic device.
  • the mail content extraction apparatus 30 may include a plurality of functional modules composed of program code segments.
  • the program codes of each program segment in the mail content extraction apparatus 30 may be stored in a memory and executed by at least one processor to perform the function of mail content extraction.
  • the mail content extraction apparatus 30 can be divided into a plurality of functional modules according to the functions it performs.
  • the email content extraction apparatus 30 may include a first data labeling module 301 , a first model training module 302 , a second data labeling module 303 , a second model training module 304 , a receiving module 305 and an extraction module 306 .
  • a module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can perform fixed functions, and are stored in a memory. In some embodiments, the functions of each module will be described in detail in subsequent embodiments.
  • the first data labeling module 301 obtains a first email data set, and labels each email in the first email data set to obtain a first training data set.
  • the first data labeling module 301 labeling each email in the first email data set includes:
  • The first data labeling module 301 marks the email content of each email in the first email data set as a first mark, marks a preset question as a second mark, and marks the answer corresponding to the preset question in the email content as a third mark.
  • Specifically, the first data labeling module 301 labels the email content of each email as "paragraph", labels the preset question as "question", and labels the answer text corresponding to the preset question in the email content as "answer".
  • the first model training module 302 uses the first training data set to train a BERT model to obtain a first model.
  • the first model training module 302 uses the first training data set to train a BERT model, and obtaining the first model includes:
  • The first model training module 302 uses the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, uses the answer corresponding to the third mark in the first training data set as the expected output of the BERT model, and optimizes the BERT model to obtain the first model.
  • In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.
  • the first model training module 302 uses the first training data set to train the BERT model, and obtaining the first model may specifically include:
  • The first model training module 302 obtains a preset-question token sequence and an email-content token sequence from the preset question and the email content, splices the two token sequences, adds a separator characterizing the question, such as [CLS], before the preset-question token sequence, adds a separator characterizing the content, such as [SEP], before the email-content token sequence, and uses the spliced preset-question token sequence and email-content token sequence with the added separators as the input data of the BERT model;
  • the first model training module 302 uses the encoding layer of the BERT model to encode the input data
  • The first model training module 302 takes the answer corresponding to the third mark as the expected output and trains the prediction layer of the BERT model until the prediction layer converges; the converged prediction layer can predict the answer corresponding to a question to be answered, yielding the first model.
  • the second data labeling module 303 obtains a second email data set, and labels each email in the second email data set to obtain a second training data set.
  • the second data labeling module 303 labeling each email in the second email data set includes:
  • the second data labeling module 303 uses the BIO labeling method to label each character in each email in the second email data set to obtain a label sequence, and each email and the corresponding label sequence form the second training data set.
  • Using the BIO labeling method to label each character in each email in the second email data set is as described above for step S13: B marks the starting position of a named entity, I marks the characters inside a named entity, and O marks characters outside any named entity.
  • the second model training module 304 uses the second training data set to train a BERT-LSTM-CRF model to obtain a second model.
  • the second model training module 304 uses the second training data set to train a BERT-LSTM-CRF model, and obtaining the second model includes:
  • The second model training module 304 uses the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, uses the label sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizes the BERT-LSTM-CRF model to obtain the second model.
  • In at least one embodiment, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
  • Segment the mail text content of each email in the second training data set into tokens to obtain an email-content token sequence, and add identifiers at its starting and end positions; for example, add the identifier [CLS] at the start of the email-content token sequence and the identifier [SEP] at its end;
  • Input the email-content token sequence with the added identifiers into the BERT model to output the vector representation of each character: the input sequence is obtained by summing the token embedding, segment embedding and position embedding of each character, and the transformer in the BERT model is called to encode the input sequence to obtain a vector sequence in which each vector corresponds to the token with the same index, i.e., the word embedding sequence of the email content;
  • Inputting the word embedding sequence obtained by the BERT model into the long short-term memory network model to obtain the semantic representation of the input includes: the word embedding sequence $(x_1, x_2, \dots, x_n)$ is used as the input of each time step of the long short-term memory network; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \dots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \dots, \overleftarrow{h_n})$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \dots, h_n) \in \mathbb{R}^{n \times m}$; after dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e., the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \dots, p_n) \in \mathbb{R}^{n \times k}$, where each entry $p_{ij}$ is the score of classifying the character $x_i$ into the j-th tag;
  • Decoding the input semantic representation with the conditional random field model to obtain the tag sequence with the highest probability includes: the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the i-th tag to the j-th tag, representing the probability of a one-step transition between tag states within the sentence $x$, so that tags already assigned can be exploited when labeling a position; of the $k+2$ states, $k$ is the number of tags, and the extra 2 correspond to the start state at the head of the email content and the end state at its tail. For a tag sequence $y = (y_1, y_2, \dots, y_n)$ whose length equals the email length, the model scores the tag sequence $y$ for the email content $x$ as $\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$, and softmax yields the normalized probability $p(y \mid x) = e^{\mathrm{score}(x, y)} / \sum_{\tilde{y}} e^{\mathrm{score}(x, \tilde{y})}$ that the tag sequence of the email content $x$ equals $y$. The model is trained by maximizing the log-likelihood; during decoding, the Viterbi dynamic programming algorithm finds the optimal path, i.e., the character tag sequence with the highest probability.
  • the receiving module 305 receives email paragraphs and questions.
  • The receiving module 305 uses a Django server to receive the email paragraph and the question.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
  • The extraction module 306 loads the first model and the second model, inputs the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputs the extraction result.
  • the above-mentioned extraction results can also be stored in a node of a blockchain.
  • The extraction module 306 loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
  • the extraction module 306 inputs the email paragraph and the question into the first model to obtain a first output
  • the extraction module 306 uses the first output as an extraction result
  • the extraction module 306 loads the second model, inputs the email paragraph and the question into the second model to obtain a second output, and uses the second output as the extraction result.
  • the extraction module 306 inputs the email paragraph and the question into the first model to obtain a first output, including:
  • the extraction module 306 converts the email paragraph and the question into data that meets the format requirements of the first model
  • the extraction module 306 inputs the data conforming to the format requirements of the first model into the first model
  • the first model obtains a first output through calculation.
  • the extraction module 306 inputs the email paragraph and the question into the second model to obtain a second output, including:
  • the extraction module 306 inputs the email paragraph into the second model
  • the second model obtains a second output through calculation.
  • In summary, the present application obtains the first mail data set and marks it to obtain the first training data set; trains the BERT model using the first training data set to obtain the first model; obtains the second mail data set and marks it to obtain the second training data set; trains the BERT-LSTM-CRF model using the second training data set to obtain the second model; receives the email paragraph and the question; and loads the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result.
  • The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement; it can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 3 is a schematic diagram of an electronic device 6 in an embodiment of the present application.
  • the electronic device 6 includes a memory 61 , a processor 62 and computer readable instructions 63 stored in the memory 61 and executable on the processor 62 .
  • When the processor 62 executes the computer-readable instructions 63, the steps in the above embodiments of the mail content extraction method are implemented, for example, steps S11 to S16 shown in FIG. 1.
  • Alternatively, when the processor 62 executes the computer-readable instructions 63, the functions of each module/unit in the above embodiment of the mail content extraction apparatus are implemented, for example, modules 301 to 306 in FIG. 2.
  • The computer-readable instructions 63 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 62 to complete the method of the present application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 63 in the electronic device 6 .
  • The computer-readable instructions 63 can be divided into the first data labeling module 301, the first model training module 302, the second data labeling module 303, the second model training module 304, the receiving module 305, and the extraction module 306 in FIG. 2; see Embodiment 2 for the specific functions of each module.
  • the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server, and a cloud terminal device.
  • The schematic diagram is only an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; the electronic device 6 may include more or fewer components than shown, combine some components, or use different components; for example, it may also include input and output devices, network access devices, buses, and the like.
  • The so-called processor 62 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor 62 may be any conventional processor, etc.
  • The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 using various interfaces and lines.
  • The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units; the processor 62 runs or executes the computer-readable instructions and/or modules/units stored in the memory 61 and calls the data stored in the memory 61 to realize various functions of the electronic device 6.
  • The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the electronic device 6, and the like.
  • The memory 61 may include volatile memory, and may also include non-volatile memory such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic disk storage device, flash memory device, or other storage device.
  • If the modules/units integrated in the electronic device 6 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium; the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), etc.
  • The memory 61 in the electronic device 6 stores computer-readable instructions implementing the mail content extraction method, and the processor 62 can execute the computer-readable instructions to implement the steps of the method described above.
  • The computer-readable storage medium stores computer-readable instructions 63, wherein the computer-readable instructions 63, when executed by the processor 62, implement the steps of the mail content extraction method described above.
  • each functional module in each embodiment of the present application may be integrated in the same processing module, or each module may exist physically alone, or two or more modules may be integrated in the same module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application relates to a mail content extraction method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring a first mail data set, and labeling the first mail data set to obtain a first training data set; obtaining a first model by using the first training data set; acquiring a second mail data set, and labeling the second mail data set to obtain a second training data set; obtaining a second model by using the second training data set; receiving a mail paragraph and a question; and loading the first model and the second model, and inputting the mail paragraph and the question into the first model or the second model, so as to obtain an extraction result. In the present application, a first model with a better effect is mainly used and a second model is used as a supplement, which can be applied to the extraction of different mail content, thereby improving the practicability of a content extraction function to a great extent. In addition, the extraction result can be stored in a blockchain.

Description

Mail content extraction method and apparatus, electronic device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 14, 2020 with application number 202011095137.1, entitled "Mail content extraction method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of text content extraction in artificial intelligence, and in particular to a mail content extraction method, apparatus, electronic device and storage medium.

Background

In the prior art, email content extraction mostly relies on regular-expression-based content extraction methods; the inventor realized that such methods usually require a large amount of work and apply only to limited scenarios.

Summary of the Invention

In view of the above, it is necessary to provide a mail content extraction method, apparatus, electronic device and storage medium to realize rapid extraction of information in emails.
A first aspect of the present application provides a mail content extraction method, including:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A second aspect of the present application provides an electronic device, the electronic device including a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, the at least one computer-readable instruction being executed by a processor to implement the following steps:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A fourth aspect of the present application provides a mail content extraction apparatus, including:

a first data labeling module, configured to acquire a first mail data set and label each mail in the first mail data set to obtain a first training data set;

a first model training module, configured to train a BERT model using the first training data set to obtain a first model;

a second data labeling module, configured to acquire a second mail data set and label each mail in the second mail data set to obtain a second training data set;

a second model training module, configured to train a BERT-LSTM-CRF model using the second training data set to obtain a second model;

a receiving module, configured to receive an email paragraph and a question; and

an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and output the extraction result.
It can be seen from the above technical solutions that, in the present application, a first mail data set is acquired and labeled to obtain a first training data set; a BERT model is trained using the first training data set to obtain a first model; a second mail data set is acquired and labeled to obtain a second training data set; a BERT-LSTM-CRF model is trained using the second training data set to obtain a second model; an email paragraph and a question are received; and the first model and the second model are loaded, and the email paragraph and the question are input into the first model or the second model to obtain an extraction result. The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement, so it can be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
Brief Description of the Drawings

FIG. 1 is a flowchart of a preferred embodiment of the mail content extraction method of the present application.

FIG. 2 is a functional block diagram of a preferred embodiment of the mail content extraction apparatus of the present application.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the mail content extraction method of the present application.
Detailed Description

In order to more clearly understand the above objects, features and advantages of the present application, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present application; the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present application belongs. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application.

Preferably, the mail content extraction method of the present application is applied in one or more electronic devices. The electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.

The electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, or a cloud server. The device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-control device.
Embodiment 1

FIG. 1 is a flowchart of the mail content extraction method in an embodiment of the present application. The order of the steps in the flowchart may be changed and some steps may be omitted according to different requirements.

Referring to FIG. 1, the mail content extraction method specifically includes the following steps.
Step S11: acquire a first mail data set, and label each mail in the first mail data set to obtain a first training data set.

In at least one embodiment of the present application, labeling each mail in the first mail data set includes: marking the email content of each mail in the first mail data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.

Specifically, for each mail in the first mail data set, the email content of the mail is marked as "paragraph", the preset question is marked as "question", and the answer text corresponding to the preset question in the email content is marked as "answer".

For example, when the email content is "请速回。发件人:123@45.com" ("Please reply soon. Sender: 123@45.com") and the question is "发件人是谁" ("Who is the sender"), "请速回。发件人:123@45.com" is marked as "paragraph", "发件人是谁" is marked as "question", and "123@45.com" is marked as "answer".
Step S12: train a BERT model using the first training data set to obtain a first model.

In at least one embodiment of the present application, training the BERT model using the first training data set to obtain the first model includes: using the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, using the answer corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model. In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.

The BERT model is a Bidirectional Encoder Representations from Transformers model.

Further, training the BERT model using the first training data set to obtain the first model may specifically include:

obtaining a preset-question token sequence and an email-content token sequence from the preset question and the email content, splicing the two token sequences, adding a separator characterizing the question, such as [CLS], before the preset-question token sequence, adding a separator characterizing the content, such as [SEP], before the email-content token sequence, and using the preset-question token sequence and the email-content token sequence with the added separators as the input data of the BERT model;

encoding the input data using the encoding layer of the BERT model; and

taking the answer corresponding to the third mark as the expected output and training the prediction layer of the BERT model until the prediction layer converges; the converged prediction layer can predict the answer corresponding to a question to be answered, yielding the first model.
Step S13: acquire a second mail data set, and label each mail in the second mail data set to obtain a second training data set.

In at least one embodiment of the present application, labeling each mail in the second mail data set includes: labeling each character of each mail in the second mail data set using the BIO labeling method to obtain a tag sequence; each mail and its corresponding tag sequence form the second training data set.

Specifically, labeling each character of each mail in the second mail data set using the BIO labeling method includes: using B to mark the starting position of a named entity (for example, when labeling the "sender", the starting character of the named entity corresponding to the sender is labeled "B-SENDER"), using I to mark the remaining characters inside a named entity (for example, the subsequent characters of the sender entity are labeled "I-SENDER"), and using O to mark characters that do not belong to any named entity.

For example, when the email content is "请速回。发件人:123@45.com." and the preset keyword is the sender, the character "1" can be labeled "B-SENDER"; the characters "2", "3", "@", "4", "5", ".", "c", "o", "m" are labeled "I-SENDER"; and the characters "请", "速", "回", "。", "发", "件", "人", ":" are labeled "O". The correspondence between the text content and the labeling result is shown in the following table.
Character    Tag
请           O
速           O
回           O
。           O
发           O
件           O
人           O
:            O
1            B-SENDER
2            I-SENDER
3            I-SENDER
@            I-SENDER
4            I-SENDER
5            I-SENDER
.            I-SENDER
c            I-SENDER
o            I-SENDER
m            I-SENDER
Step S14: train a BERT-LSTM-CRF model using the second training data set to obtain a second model.

In at least one embodiment of the present application, training the BERT-LSTM-CRF model using the second training data set to obtain the second model includes: using the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, using the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.

LSTM denotes a long short-term memory network, and CRF denotes a conditional random field.

In at least one embodiment of the present application, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
将所述第二训练数据集中的每个邮件的邮件文本内容进行分词得到邮件内容token序列;Perform word segmentation on the mail text content of each mail in the second training data set to obtain the mail content token sequence;
在所述邮件内容token序列起始位置和末尾位置分别增加标识符号,例如,在所述邮件内容token序列起始位置增加标识符号[CLS],在所述邮件内容token序列末尾位置表示符号[SEP];Add an identifier to the starting position and the end of the email content token sequence, for example, add an identifier [CLS] to the start of the email content token sequence, and add a symbol [SEP] to the end of the email content token sequence. ];
将增加标识符后的所述邮件内容token序列输入BERT模型,输出所述邮件内容中每个字符的向量表示,即每个邮件内容的字嵌入序列(x 1,x 2,…,x n),包括: Input the token sequence of the email content after adding the identifier into the BERT model, and output the vector representation of each character in the email content, that is, the word embedding sequence of each email content (x 1 ,x 2 ,...,x n ) ,include:
对每个字符的token嵌入、分割嵌入和位置嵌入进行求和得到输入的序列,调用所述BERT模型中的变换器对所述输入的序列进行编解码得到向量序列,所述向量序列中的每个向量对应具有相同索引的token,即所述邮件内容的字嵌入序列;The input sequence is obtained by summing the token embedding, segmentation embedding and position embedding of each character, and the transformer in the BERT model is called to encode and decode the input sequence to obtain a vector sequence. The vectors correspond to tokens with the same index, that is, the word embedding sequence of the email content;
feeding the word-embedding sequence obtained from the BERT model into the long short-term memory network to obtain the semantic representation of the input, which includes:
taking the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of the LSTM at each time step, and concatenating, position by position, the hidden-state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden-state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden-state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$; after applying dropout, a linear layer maps each hidden-state vector from $m$ dimensions to $k$ dimensions, where $k$ is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e. the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, in which the $j$-th component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability value of classifying character $x_i$ into the $j$-th tag;
decoding the semantic representation of the input with the conditional random field model to obtain the tag sequence with the highest probability, which includes:
the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, in which $A_{ij}$ is the score of transitioning from the $i$-th tag to the $j$-th tag, representing the one-step transition probabilities between all tag states in the sentence $x$, so that tags already assigned can be exploited when tagging a new position; in $k+2$, $k$ is the number of tags, and the additional 2 correspond to the start state at the head of the email content and the end state at its tail. Writing a tag sequence whose length equals the email length as $y = (y_1, y_2, \ldots, y_n)$, and letting $y_0$ and $y_{n+1}$ denote the start and end states, the model scores the tagging $y$ of the email content $x$ as

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i},$$

and softmax gives the normalized probability that the tagging of the email content $x$ equals $y$:

$$P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)}.$$

Training maximizes the log-likelihood of the correct tag sequences, and during decoding the Viterbi dynamic-programming algorithm solves for the optimal path, outputting the tag sequence $y = (y_1, y_2, \ldots, y_n)$ with the highest probability as the prediction. A condensed sketch of this forward pass is given below.
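The following PyTorch sketch condenses the pipeline above, assuming a pretrained BERT encoder from the `transformers` library; the class name, hidden size, and the simplified Viterbi decoder (which omits the explicit start/end states of the $(k+2) \times (k+2)$ transition matrix) are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertLstmCrf(nn.Module):
    """Sketch: BERT word embeddings -> BiLSTM semantic encoding -> linear
    layer to k tag scores (matrix P) -> Viterbi decoding with transitions A."""
    def __init__(self, num_tags, lstm_hidden=128, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.1)                  # dropout before the linear layer
        self.fc = nn.Linear(2 * lstm_hidden, num_tags)  # maps m dims to k dims
        self.transitions = nn.Parameter(torch.zeros(num_tags, num_tags))  # A[i][j]

    def emissions(self, input_ids, attention_mask):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)              # forward/backward hidden states, concatenated
        return self.fc(self.dropout(h))  # P: (batch, n, k) per-tag scores

    @torch.no_grad()
    def viterbi_decode(self, emissions):  # emissions: (n, k) for one sequence
        score = emissions[0]
        backpointers = []
        for t in range(1, emissions.size(0)):
            # total[i, j] = best score ending in tag i, then i -> j, then emit at t
            total = score.unsqueeze(1) + self.transitions + emissions[t]
            score, idx = total.max(dim=0)
            backpointers.append(idx)
        best = [int(score.argmax())]
        for idx in reversed(backpointers):
            best.append(int(idx[best[-1]]))
        return best[::-1]                # most probable tag sequence y
```

Training would maximize the log-likelihood $\log P(y \mid x)$, whose partition function can be computed with the standard forward algorithm (or a library such as pytorch-crf); only decoding is shown here.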
Step S15: receive an email paragraph and a question.
In at least one embodiment of the present application, a django server is used to receive the email paragraph and the question.
In other embodiments of the present application, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain-name services, security services, CDN (Content Delivery Network), and big-data and artificial-intelligence platforms.
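For the django embodiment above, a minimal view might look like the sketch below; the view name, JSON field names, and the placeholder cascade call are assumptions, not part of the original disclosure.

```python
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def extract_view(request):
    """Receive the email paragraph and the question as a JSON POST body."""
    payload = json.loads(request.body)
    paragraph = payload["paragraph"]   # the email paragraph
    question = payload["question"]     # the question to be answered
    # result = run_cascade(paragraph, question)  # placeholder for step S16
    return JsonResponse({"paragraph": paragraph, "question": question})
```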
Step S16: load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
It should be emphasized that, to further ensure the privacy and security of the above extraction result, the extraction result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the corresponding extraction result includes the following steps (sketched in code after this list):
inputting the email paragraph and the question into the first model to obtain a first output;
when the first output is non-empty, taking the first output as the extraction result;
when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
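A compact sketch of this fallback logic, assuming `first_model` and `second_model` are callables wrapping the two trained models:

```python
def extract_content(paragraph, question, first_model, second_model):
    """Try the first (QA) model; fall back to the second (sequence-labeling)
    model when the first output is empty."""
    first_output = first_model(paragraph, question)
    if first_output:                  # non-empty: use it as the extraction result
        return first_output
    return second_model(paragraph)    # the second model only needs the paragraph
```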
In at least one embodiment of the present application, inputting the email paragraph and the question into the first model to obtain the first output includes:
converting the email paragraph and the question into data that meets the format requirements of the first model;
inputting the data that meets the format requirements of the first model into the first model;
the first model computing the first output.
For example, when the email paragraph received by the server is "请速回。发件人：123@45.com" and the question is "发件人是谁" (who is the sender), the email paragraph and the question are converted into data that meets the format requirements of the first model and fed into the first model; when the first model can predict the answer "123@45.com" to the question, it outputs the start and end positions of the answer within the email paragraph, and "发件人：'123@45.com'" is taken as the extraction result.
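One way to exercise a trained extractive QA model of this kind is the `transformers` question-answering pipeline, sketched below; the checkpoint name is a placeholder assumption, not the first model actually trained in this disclosure.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute the fine-tuned first model in practice.
qa = pipeline("question-answering", model="bert-base-chinese")
result = qa(question="发件人是谁", context="请速回。发件人：123@45.com")
# `result` holds the answer text plus its start/end character positions
print(result["answer"], result["start"], result["end"])
```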
In at least one embodiment of the present application, inputting the email paragraph and the question into the second model to obtain the second output includes:
inputting the email paragraph into the second model;
the second model computing the second output.
For example, when the first model cannot predict the answer to the question, the email paragraph "请速回。发件人：123@45.com" is input into the second model, which outputs the answer "123@45.com", and "发件人：'123@45.com'" is taken as the extraction result.
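Recovering the extraction result from the second model's character-level tags amounts to collecting the B-/I- span, as in this sketch (the tag list is hard-coded to match the example above):

```python
def tags_to_answer(chars, tags, label="SENDER"):
    """Join the characters tagged B-<label>/I-<label> into the answer span."""
    return "".join(c for c, t in zip(chars, tags)
                   if t in ("B-" + label, "I-" + label))

chars = list("请速回。发件人：123@45.com")
tags = ["O"] * 8 + ["B-SENDER"] + ["I-SENDER"] * 9
print(tags_to_answer(chars, tags))  # -> 123@45.com
```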
In the present application, a first email data set is obtained and labeled to produce a first training data set; a BERT model is trained with the first training data set to obtain a first model; a second email data set is obtained and labeled to produce a second training data set; a BERT-LSTM-CRF model is trained with the second training data set to obtain a second model; an email paragraph and a question are received; the first model is loaded, and the email paragraph and the question are input into the first model or the second model to obtain the extraction result. The present application uses the better-performing first model as the primary extraction method and the second model as a supplement, so it can be applied to extracting varied email contents, which greatly improves the practicality of the content-extraction function.
Embodiment 2
FIG. 2 is a structural diagram of an email content extraction apparatus 30 according to an embodiment of the present application.
In some embodiments, the email content extraction apparatus 30 runs in an electronic device. The email content extraction apparatus 30 may include a plurality of functional modules composed of program code segments. The program code of each segment in the email content extraction apparatus 30 may be stored in a memory and executed by at least one processor to perform the email content extraction function.
In this embodiment, the email content extraction apparatus 30 may be divided into a plurality of functional modules according to the functions it performs. Referring to FIG. 2, the email content extraction apparatus 30 may include a first data labeling module 301, a first model training module 302, a second data labeling module 303, a second model training module 304, a receiving module 305, and an extraction module 306. A module in this application refers to a series of computer-readable instruction segments, stored in a memory, that can be executed by at least one processor to perform a fixed function. In some embodiments, the functions of each module are detailed in the following description.
The first data labeling module 301 obtains a first email data set and labels each email in the first email data set to obtain a first training data set.
In at least one embodiment of the present application, the first data labeling module 301 labeling each email in the first email data set includes:
the first data labeling module 301 marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
Specifically, for each email in the first email data set, the first data labeling module 301 labels the email content as "paragraph", labels the preset question as "question", and labels the answer text corresponding to the preset question in the email content as "answer".
The first model training module 302 trains a BERT model with the first training data set to obtain a first model.
In at least one embodiment of the present application, the first model training module 302 training the BERT model with the first training data set to obtain the first model includes:
the first model training module 302 taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model. In other embodiments, the start and end positions, within the email content, of the answer corresponding to the third mark may instead be used as the output of the BERT model.
Further, the first model training module 302 training the BERT model with the first training data set to obtain the first model may specifically include the following steps (a feature-construction sketch follows this list):
the first model training module 302 deriving a preset-question token sequence and an email-content token sequence from the preset question and the email content, concatenating the two sequences, adding a separator characterizing the question, for example [CLS], before the preset-question token sequence and a separator characterizing the content, for example [SEP], before the email-content token sequence, and taking the concatenated sequence with separators as the input data of the BERT model;
the first model training module 302 encoding the input data with the encoding layer of the BERT model;
the first model training module 302 taking the answer corresponding to the third mark as the expected output and training the prediction layer of the BERT model until the prediction layer converges, after which the converged prediction layer can predict the answer to a given question, yielding the first model.
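As a sketch of this input construction, a fast tokenizer from `transformers` can build the concatenated question/paragraph sequence (with its [CLS]/[SEP] separators) and map the labeled answer to start/end token positions; the checkpoint name and variable names are assumptions for illustration.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder checkpoint
question = "发件人是谁"
paragraph = "请速回。发件人：123@45.com"
answer = "123@45.com"

enc = tok(question, paragraph, return_offsets_mapping=True)
char_start = paragraph.find(answer)
char_end = char_start + len(answer)

# Map the character-level answer span to token positions in the paragraph
# segment (sequence id 1); these become the expected start/end outputs.
seq_ids = enc.sequence_ids()
start_tok = end_tok = None
for i, (s, e) in enumerate(enc["offset_mapping"]):
    if seq_ids[i] != 1:
        continue
    if start_tok is None and s <= char_start < e:
        start_tok = i
    if s < char_end <= e:
        end_tok = i
print(start_tok, end_tok)
```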
The second data labeling module 303 obtains a second email data set and labels each email in the second email data set to obtain a second training data set.
In at least one embodiment of the present application, the second data labeling module 303 labeling each email in the second email data set includes:
the second data labeling module 303 labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence together forming the second training data set.
Specifically, labeling each character of each email in the second email data set with the BIO scheme includes:
using B to mark the first character of a named entity, I to mark the remaining characters of a named entity, and O to mark characters that belong to no named entity.
The second model training module 304 trains a BERT-LSTM-CRF model with the second training data set to obtain a second model.
In at least one embodiment of the present application, the second model training module 304 training the BERT-LSTM-CRF model with the second training data set to obtain the second model includes:
the second model training module 304 taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
In at least one embodiment of the present application, the BERT-LSTM-CRF model first obtains the word-embedding sequence of the email content through the BERT model, then feeds the word-embedding sequence into the long short-term memory network for semantic encoding, and finally decodes through the conditional random field model to output the tag sequence with the highest probability.
Specifically, this process includes:
segmenting the email text content of each email in the second training data set into an email-content token sequence;
adding identifier tokens at the beginning and end of the email-content token sequence, for example [CLS] at the beginning and [SEP] at the end;
feeding the token sequence with identifiers into the BERT model and outputting a vector representation for each character of the email content, i.e. the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ of each email content, which includes:
summing the token embedding, segment embedding, and position embedding of each character to obtain the input sequence, and calling the transformer in the BERT model to encode the input sequence into a vector sequence in which each vector corresponds to the token with the same index, i.e. the word-embedding sequence of the email content;
feeding the word-embedding sequence obtained from the BERT model into the long short-term memory network to obtain the semantic representation of the input, which includes:
taking the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of the LSTM at each time step, and concatenating, position by position, the hidden-state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden-state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden-state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$; after applying dropout, a linear layer maps each hidden-state vector from $m$ dimensions to $k$ dimensions, where $k$ is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e. the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, in which the $j$-th component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability value of classifying character $x_i$ into the $j$-th tag;
decoding the semantic representation of the input with the conditional random field model to obtain the tag sequence with the highest probability, which includes:
the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, in which $A_{ij}$ is the score of transitioning from the $i$-th tag to the $j$-th tag, representing the one-step transition probabilities between all tag states in the sentence $x$, so that tags already assigned can be exploited when tagging a new position; in $k+2$, $k$ is the number of tags, and the additional 2 correspond to the start state at the head of the email content and the end state at its tail. Writing a tag sequence whose length equals the email length as $y = (y_1, y_2, \ldots, y_n)$, and letting $y_0$ and $y_{n+1}$ denote the start and end states, the model scores the tagging $y$ of the email content $x$ as

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i},$$

and softmax gives the normalized probability that the tagging of the email content $x$ equals $y$:

$$P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)}.$$

Training maximizes the log-likelihood of the correct tag sequences, and during decoding the Viterbi dynamic-programming algorithm solves for the optimal path, outputting the tag sequence $y = (y_1, y_2, \ldots, y_n)$ with the highest probability as the prediction, as sketched in code in Embodiment 1.
The receiving module 305 receives an email paragraph and a question.
In at least one embodiment of the present application, the receiving module 305 uses a django server to receive the email paragraph and the question.
In other embodiments of the present application, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain-name services, security services, CDN (Content Delivery Network), and big-data and artificial-intelligence platforms.
The extraction module 306 loads the first model and the second model, inputs the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputs the extraction result.
It should be emphasized that, to further ensure the privacy and security of the above extraction result, the extraction result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the extraction module 306 loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
the extraction module 306 inputting the email paragraph and the question into the first model to obtain a first output;
when the first output is non-empty, the extraction module 306 taking the first output as the extraction result;
when the first output is empty, the extraction module 306 loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
In at least one embodiment of the present application, the extraction module 306 inputting the email paragraph and the question into the first model to obtain the first output includes:
the extraction module 306 converting the email paragraph and the question into data that meets the format requirements of the first model;
the extraction module 306 inputting the data that meets the format requirements of the first model into the first model;
the first model computing the first output.
In at least one embodiment of the present application, the extraction module 306 inputting the email paragraph and the question into the second model to obtain the second output includes:
the extraction module 306 inputting the email paragraph into the second model;
the second model computing the second output.
In the present application, a first email data set is obtained and labeled to produce a first training data set; a BERT model is trained with the first training data set to obtain a first model; a second email data set is obtained and labeled to produce a second training data set; a BERT-LSTM-CRF model is trained with the second training data set to obtain a second model; an email paragraph and a question are received; the first model is loaded, and the email paragraph and the question are input into the first model or the second model to obtain the extraction result. The present application uses the better-performing first model as the primary extraction method and the second model as a supplement, so it can be applied to extracting varied email contents, which greatly improves the practicality of the content-extraction function.
Embodiment 3
FIG. 3 is a schematic diagram of an electronic device 6 according to an embodiment of the present application.
The electronic device 6 includes a memory 61, a processor 62, and computer-readable instructions 63 stored in the memory 61 and executable on the processor 62. When the processor 62 executes the computer-readable instructions 63, the steps of the above email content extraction method embodiment are implemented, for example steps S11 to S16 shown in FIG. 1. Alternatively, when the processor 62 executes the computer-readable instructions 63, the functions of the modules/units in the above email content extraction apparatus embodiment are implemented, for example modules 301 to 306 in FIG. 2.
Exemplarily, the computer-readable instructions 63 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 62 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer-readable instructions 63 in the electronic device 6. For example, the computer-readable instructions 63 may be divided into the first data labeling module 301, the first model training module 302, the second data labeling module 303, the second model training module 304, the receiving module 305, and the extraction module 306 in FIG. 2; see Embodiment 2 for the specific functions of each module.
In this embodiment, the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server, or a cloud terminal device. Those skilled in the art will understand that the schematic diagram is only an example of the electronic device 6 and does not constitute a limitation: the device may include more or fewer components than illustrated, combine certain components, or use different components; for example, the electronic device 6 may also include input/output devices, network access devices, buses, and the like.
The processor 62 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor. The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 through various interfaces and lines.
The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units. The processor 62 implements the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and calling the data stored in the memory 61. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 6. In addition, the memory 61 may include volatile memory and may also include non-volatile memory, such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic-disk storage device, flash memory device, or other storage device.
If the modules/units integrated in the electronic device 6 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed by a processor, the computer-readable instructions implement the steps of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source-code form, object-code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), and the like.
With reference to FIG. 1, the memory 61 in the electronic device 6 stores computer-readable instructions implementing an email content extraction method, and the processor 62 may execute the computer-readable instructions to implement:
obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
training a BERT model with the first training data set to obtain a first model;
obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
receiving an email paragraph and a question;
loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
Specifically, for the implementation of the computer-readable instructions by the processor 62, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The computer-readable storage medium stores computer-readable instructions 63, which, when executed by the processor 62, implement the following steps:
obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
training a BERT model with the first training data set to obtain a first model;
obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
receiving an email paragraph and a question;
loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
In addition, the functional modules in the embodiments of the present application may be integrated in the same processing module, may each exist physically separately, or two or more modules may be integrated in the same module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is apparent to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application. The embodiments should therefore be regarded in all respects as exemplary and non-limiting, and the scope of the application is defined by the appended claims rather than by the foregoing description; all changes that fall within the meaning and scope of equivalents of the claims are intended to be embraced by the application. No reference sign in the claims shall be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices recited in the device claims may also be implemented by the same module or device through software or hardware. Words such as "first" and "second" denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or equivalently substituted without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. An email content extraction method, wherein the email content extraction method comprises:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  2. The email content extraction method according to claim 1, wherein labeling each email in the first email data set comprises:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  3. The email content extraction method according to claim 2, wherein training the BERT model with the first training data set to obtain the first model comprises:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  4. The email content extraction method according to claim 1, wherein labeling each email in the second email data set comprises:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  5. The email content extraction method according to claim 4, wherein training the BERT-LSTM-CRF model with the second training data set to obtain the second model comprises:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  6. The email content extraction method according to claim 5, wherein the BERT-LSTM-CRF model first obtains the word-embedding sequence of the email content through the BERT model, then feeds the word-embedding sequence into a long short-term memory network for semantic encoding, and finally decodes through a conditional random field model to output the tag sequence with the highest probability.
  7. The email content extraction method according to claim 1, wherein loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question comprises:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, taking the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
  8. An email content extraction apparatus, wherein the email content extraction apparatus comprises:
    a first data labeling module, configured to obtain a first email data set and label each email in the first email data set to obtain a first training data set;
    a first model training module, configured to train a BERT model with the first training data set to obtain a first model;
    a second data labeling module, configured to obtain a second email data set and label each email in the second email data set to obtain a second training data set;
    a second model training module, configured to train a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    a receiving module, configured to receive an email paragraph and a question;
    an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and output the extraction result.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  10. The electronic device according to claim 9, wherein, in labeling each email in the first email data set, the processor executes the at least one computer-readable instruction to implement the following step:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  11. The electronic device according to claim 10, wherein, in training the BERT model with the first training data set to obtain the first model, the processor executes the at least one computer-readable instruction to implement the following step:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  12. The electronic device according to claim 9, wherein, in labeling each email in the second email data set, the processor executes the at least one computer-readable instruction to implement the following step:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  13. The electronic device according to claim 12, wherein, in training the BERT-LSTM-CRF model with the second training data set to obtain the second model, the processor executes the at least one computer-readable instruction to implement the following step:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  14. The electronic device according to claim 9, wherein, in loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, the processor executes the at least one computer-readable instruction to implement the following steps:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, taking the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction that, when executed by a processor, implements the following steps:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  16. The storage medium according to claim 15, wherein, in labeling each email in the first email data set, the at least one computer-readable instruction is executed by the processor to implement the following step:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  17. The storage medium according to claim 16, wherein, in training the BERT model with the first training data set to obtain the first model, the at least one computer-readable instruction is executed by the processor to implement the following step:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  18. The storage medium according to claim 15, wherein, in labeling each email in the second email data set, the at least one computer-readable instruction is executed by the processor to implement the following step:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  19. The storage medium according to claim 18, wherein, in training the BERT-LSTM-CRF model with the second training data set to obtain the second model, the at least one computer-readable instruction is executed by the processor to implement the following step:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  20. The storage medium according to claim 15, wherein, when loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, using the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and using the second output as the extraction result.
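Read as code, the cascade of claim 20 could be sketched as follows, reusing the hypothetical models from the earlier sketches; answer_from_qa and the ner_extract fall-back are assumed inference wrappers, not names from the patent.

```python
# Sketch of the two-stage extraction cascade: try the QA model first,
# fall back to the sequence tagger only when it yields nothing.
import torch

def answer_from_qa(model, tokenizer, paragraph: str, question: str) -> str:
    enc = tokenizer(question, paragraph, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    if end < start:  # no consistent answer span: treat as an empty first output
        return ""
    return tokenizer.decode(enc["input_ids"][0][start:end + 1])

def extract(paragraph: str, question: str, qa_model, tokenizer, ner_extract):
    first_output = answer_from_qa(qa_model, tokenizer, paragraph, question)
    if first_output:        # non-empty: the first output is the extraction result
        return first_output
    # empty: apply the second model instead (ner_extract is a hypothetical wrapper)
    return ner_extract(paragraph, question)
```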
PCT/CN2021/123362 2020-10-14 2021-10-12 Mail content extraction method and apparatus, and electronic device and storage medium WO2022078348A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011095137.1A CN112184178A (en) 2020-10-14 2020-10-14 Mail content extraction method and device, electronic equipment and storage medium
CN202011095137.1 2020-10-14

Publications (1)

Publication Number Publication Date
WO2022078348A1 true WO2022078348A1 (en) 2022-04-21

Family

ID=73949920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123362 WO2022078348A1 (en) 2020-10-14 2021-10-12 Mail content extraction method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112184178A (en)
WO (1) WO2022078348A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184178A (en) * 2020-10-14 2021-01-05 深圳壹账通智能科技有限公司 Mail content extraction method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188175A * 2019-04-29 2019-08-30 厦门快商通信息咨询有限公司 Question-answer pair extraction method, system and storage medium based on a BiLSTM-CRF model
CN110287334A * 2019-06-13 2019-09-27 淮阴工学院 School-domain knowledge graph construction method based on entity recognition and attribute extraction models
US20190362020A1 * 2018-05-22 2019-11-28 Salesforce.Com, Inc. Abstraction of text summarization
WO2020174826A1 (en) * 2019-02-25 2020-09-03 日本電信電話株式会社 Answer generating device, answer learning device, answer generating method, and answer generating program
CN112184178A (en) * 2020-10-14 2021-01-05 深圳壹账通智能科技有限公司 Mail content extraction method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157177B2 (en) * 2016-10-28 2018-12-18 Kira Inc. System and method for extracting entities in electronic documents
CN109460551B (en) * 2018-10-29 2023-04-18 北京知道创宇信息技术股份有限公司 Signature information extraction method and device
CN111368542A (en) * 2018-12-26 2020-07-03 北京大学 Text language association extraction method and system based on recurrent neural network
CN110516256A * 2019-08-30 2019-11-29 的卢技术有限公司 Chinese named entity extraction method and system
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111598550A (en) * 2020-05-22 2020-08-28 深圳市小满科技有限公司 Mail signature information extraction method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112184178A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
WO2020232882A1 (en) Named entity recognition method and apparatus, device, and computer readable storage medium
JP6909832B2 (en) Methods, devices, equipment and media for recognizing important words in audio
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
US11551437B2 (en) Collaborative information extraction
CN111241209B (en) Method and device for generating information
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
WO2021143206A1 (en) Single-statement natural language processing method and apparatus, computer device, and readable storage medium
WO2021174871A1 (en) Data query method and system, computer device, and storage medium
CN112287095A (en) Method and device for determining answers to questions, computer equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
JP2023539470A (en) Automatic knowledge graph configuration
WO2022078348A1 (en) Mail content extraction method and apparatus, and electronic device and storage medium
CN111160026A (en) Model training method and device, and method and device for realizing text processing
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN116978028A (en) Video processing method, device, electronic equipment and storage medium
WO2023061441A1 (en) Text quantum circuit determination method, text classification method, and related apparatus
CN115033683A (en) Abstract generation method, device, equipment and storage medium
WO2021135103A1 (en) Method and apparatus for semantic analysis, computer device, and storage medium
US20210295036A1 (en) Systematic language to enable natural language processing on technical diagrams
CN113065354A (en) Method for identifying geographic position in corpus and related equipment thereof
CN110909541A (en) Instruction generation method, system, device and medium
CN114385779B (en) Emergency scheduling instruction execution method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879392

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21879392

Country of ref document: EP

Kind code of ref document: A1