WO2022078348A1 - Mail content extraction method and apparatus, and electronic device and storage medium - Google Patents

Mail content extraction method and apparatus, and electronic device and storage medium Download PDF

Info

Publication number
WO2022078348A1
WO2022078348A1 · PCT/CN2021/123362
Authority
WO
WIPO (PCT)
Prior art keywords
model
data set
email
mail
bert
Prior art date
Application number
PCT/CN2021/123362
Other languages
French (fr)
Chinese (zh)
Inventor
徐国诚
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022078348A1 publication Critical patent/WO2022078348A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/107Computer-aided management of electronic mailing [e-mailing]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the present application relates to the technical field of text content extraction in artificial intelligence, and in particular to a method, device, electronic device and storage medium for extracting email content.
  • In the prior art, email content extraction mostly relies on regular-expression-based content extraction methods; the inventor realized that such methods usually require a large amount of work and apply only to limited scenarios.
  • a first aspect of the present application provides a method for extracting email content, and the method for extracting email content includes:
  • a second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • a third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • a fourth aspect of the present application provides an apparatus for extracting mail content, and the apparatus for extracting mail content includes:
  • a first data labeling module configured to obtain a first mail data set, and label each mail in the first mail data set to obtain a first training data set;
  • a first model training module, configured to train the BERT model using the first training data set to obtain a first model;
  • a second data labeling module configured to obtain a second mail data set, and label each mail in the second mail data set to obtain a second training data set;
  • a second model training module, configured to train the BERT-LSTM-CRF model using the second training data set to obtain a second model;
  • a receiving module for receiving email paragraphs and questions
  • an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
  • the first training data set is obtained by obtaining the first mail data set and marking the first mail data set;
  • the first model is obtained by training the BERT model by using the first training data set.
  • The second mail data set is obtained and marked to obtain the second training data set; the BERT-LSTM-CRF model is trained using the second training data set to obtain the second model; an email paragraph and a question are received; the first model and the second model are loaded, and the email paragraph and the question are input into the first model or the second model to obtain an extraction result.
  • The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement; it can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for extracting email content of the present application.
  • FIG. 2 is a functional block diagram of a preferred embodiment of the mail content extraction device of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the method for extracting email content according to the present application.
  • the method for extracting email content of the present application is applied in one or more electronic devices.
  • The electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, and a cloud server.
  • the device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-activated device.
  • FIG. 1 is a flowchart of a method for extracting email content in an embodiment of the present application. According to different requirements, the order of the steps in the flow chart can be changed, and some steps can be omitted.
  • the method for extracting email content specifically includes the following steps:
  • Step S11 acquiring a first mail data set, and marking each mail in the first mail data set to obtain a first training data set.
  • the marking each email in the first email data set includes:
  • The email content of each mail in the first mail data set is marked as a first mark, the preset question is marked as a second mark, and the answer corresponding to the preset question in the email content is marked as a third mark.
  • Specifically, the email content of each email is marked as "paragraph", the preset question is marked as "question", and the answer text corresponding to the preset question in the email content is marked as "answer", as illustrated in the sketch below.
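As an illustration of this labeling scheme, one labeled sample of the first training data set could be stored as a SQuAD-style record. This is a hedged sketch of a plausible layout; the patent fixes the three marks "paragraph", "question" and "answer" but not an on-disk format, and the optional start offset reflects the span-based output variant mentioned below.

```python
# Hypothetical layout of one labeled sample in the first training data set.
sample = {
    "paragraph": "Please reply soon. Sender: 123@45.com",  # first mark: email content
    "question": "Who is the sender",                        # second mark: preset question
    "answer": {
        "text": "123@45.com",   # third mark: answer text in the email content
        "answer_start": 27,     # assumed field: start position of the answer,
                                # matching the span-output variant of the BERT model
    },
}
```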
  • Step S12 using the first training data set to train a BERT model to obtain a first model.
  • In at least one embodiment, using the first training data set to train the BERT model to obtain the first model includes: using the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, using the answer corresponding to the third mark as the expected output of the BERT model, and optimizing the BERT model to obtain the first model.
  • In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.
  • The BERT model is a Bidirectional Encoder Representations from Transformers model.
  • Further, using the first training data set to train the BERT model to obtain the first model may specifically include: obtaining a preset-question token sequence and an email-content token sequence from the preset question and the email content; splicing the two token sequences, adding a separator characterizing the question, such as [CLS], before the preset-question token sequence and a separator characterizing the content, such as [SEP], before the email-content token sequence, and using the spliced sequences with the added separators as the input data of the BERT model; encoding the input data with the encoding layer of the BERT model; and training the prediction layer of the BERT model with the answer corresponding to the third mark as the expected output until the prediction layer converges, so that the converged prediction layer can predict the answer corresponding to a question, yielding the first model. A minimal input-construction sketch follows below.
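A minimal sketch of this input construction and span prediction is shown below, using the Hugging Face transformers library for illustration. The library choice and the bert-base-chinese checkpoint are assumptions (in practice the checkpoint would be the fine-tuned first model); the patent itself only prescribes the spliced token sequences with their separators and the span-predicting layer.

```python
# Hedged sketch: build the "[CLS] question [SEP] paragraph ..." input and
# predict the answer span with a BERT question-answering head.
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")     # assumed checkpoint
model = BertForQuestionAnswering.from_pretrained("bert-base-chinese")  # stand-in for the first model

question = "Who is the sender"
paragraph = "Please reply soon. Sender: 123@45.com"

# The tokenizer splices the two token sequences and inserts the separators.
inputs = tokenizer(question, paragraph, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The prediction layer scores every token as a possible start/end of the answer.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```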
  • Step S13 Obtain a second mail data set, and mark each mail in the second mail data set to obtain a second training data set.
  • In at least one embodiment, labeling each email in the second email data set includes: labeling each character of each email in the second email data set using the BIO labeling method to obtain a tag sequence, where each email and its corresponding tag sequence form the second training data set.
  • Specifically, using the BIO labeling method to label each character in each email in the second email data set includes:
  • Use B to mark the starting position of a named entity (for example, when labeling the "sender", the starting character of the named entity corresponding to the sender is labeled "B-SENDER"), use I to mark the remaining characters inside a named entity (for example, the subsequent characters of the sender entity are labeled "I-SENDER"), and use O to mark characters that do not belong to any named entity, as in the sketch below.
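A minimal sketch of this BIO scheme follows; the helper function and its span-based input are illustrative assumptions rather than the patent's labeling tooling.

```python
# Hedged sketch of BIO labeling: given the email text and the character span of
# one named entity, emit one tag per character (B- for the entity start, I- for
# its remaining characters, O for everything else).
def bio_tags(text: str, start: int, end: int, entity_type: str) -> list[str]:
    tags = ["O"] * len(text)
    tags[start] = f"B-{entity_type}"    # starting position of the named entity
    for i in range(start + 1, end):     # inside of the named entity
        tags[i] = f"I-{entity_type}"
    return tags

text = "请速回。发件人:123@45.com"
# the sender entity "123@45.com" occupies characters 8..18 (end-exclusive)
print(list(zip(text, bio_tags(text, 8, 18, "SENDER"))))
```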
  • Step S14 using the second training data set to train the BERT-LSTM-CRF model to obtain a second model.
  • In at least one embodiment, using the second training data set to train the BERT-LSTM-CRF model to obtain the second model includes: using the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, using the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  • LSTM denotes a long short-term memory network, and CRF denotes a conditional random field.
  • In at least one embodiment, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
  • Segment the mail text content of each email in the second training data set into tokens to obtain an email-content token sequence, and add identifiers at its starting and end positions; for example, add the identifier [CLS] at the start of the email-content token sequence and the identifier [SEP] at its end;
  • Input the email-content token sequence with the added identifiers into the BERT model to output the vector representation of each character: the input sequence is obtained by summing the token embedding, segment embedding and position embedding of each character, and the transformer in the BERT model is called to encode the input sequence to obtain a vector sequence in which each vector corresponds to the token with the same index, i.e., the word embedding sequence of the email content;
  • Inputting the word embedding sequence obtained by the BERT model into the long short-term memory network model to obtain the semantic representation of the input includes: the word embedding sequence $(x_1, x_2, \dots, x_n)$ is used as the input of each time step of the long short-term memory network; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \dots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \dots, \overleftarrow{h_n})$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \dots, h_n) \in \mathbb{R}^{n \times m}$; after dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e., the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \dots, p_n) \in \mathbb{R}^{n \times k}$, where each entry $p_{ij}$ is the score of classifying the character $x_i$ into the j-th tag;
  • Decoding the input semantic representation with the conditional random field model to obtain the tag sequence with the highest probability includes: the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the i-th tag to the j-th tag, representing the probability of a one-step transition between tag states within the sentence $x$, so that tags already assigned can be exploited when labeling a position; of the $k+2$ states, $k$ is the number of tags, and the extra 2 correspond to the start state at the head of the email content and the end state at its tail. For a tag sequence $y = (y_1, y_2, \dots, y_n)$ whose length equals the email length, the model scores the tag sequence $y$ for the email content $x$ as $\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$, and softmax yields the normalized probability $p(y \mid x) = e^{\mathrm{score}(x, y)} / \sum_{\tilde{y}} e^{\mathrm{score}(x, \tilde{y})}$ that the tag sequence of the email content $x$ equals $y$. The model is trained by maximizing the log-likelihood; during decoding, the Viterbi dynamic programming algorithm finds the optimal path, i.e., the character tag sequence with the highest probability. A model sketch follows below.
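Putting the pieces above together, a compact PyTorch sketch of the BERT-LSTM-CRF architecture is shown below. It is a hedged sketch under stated assumptions, not the patent's implementation: the Hugging Face transformers library, the third-party pytorch-crf package, the bert-base-chinese checkpoint, and the hidden size are all choices made here for illustration.

```python
# Hedged sketch of BERT-LSTM-CRF: BERT yields the word embedding sequence, a
# bidirectional LSTM performs semantic encoding, a linear layer maps the
# m-dimensional hidden states to k tag scores (the matrix P), and a CRF layer
# (holding the transition matrix A) decodes the most probable tag sequence.
import torch.nn as nn
from torchcrf import CRF                # from the pytorch-crf package
from transformers import BertModel

class BertLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        self.lstm = nn.LSTM(
            self.bert.config.hidden_size, lstm_hidden,
            batch_first=True, bidirectional=True,  # forward/backward states are concatenated
        )
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(2 * lstm_hidden, num_tags)  # m-dim -> k-dim mapping
        self.crf = CRF(num_tags, batch_first=True)          # transition matrix A lives here

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)                       # hidden state sequence (h_1, ..., h_n)
        emissions = self.linear(self.dropout(h))  # matrix P of per-tag scores
        mask = attention_mask.bool()
        if tags is not None:
            # training: maximize the log-likelihood (return its negative as the loss)
            return -self.crf(emissions, tags, mask=mask)
        # inference: Viterbi decoding of the tag sequence with the highest probability
        return self.crf.decode(emissions, mask=mask)
```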
  • Step S15 receiving the email paragraph and question.
  • The email paragraph and the question are received using a Django server.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. A minimal receiving-endpoint sketch follows below.
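Below is a minimal sketch of a Django view that could receive the two inputs; the view name, URL routing, and JSON field names are assumptions, since the patent only states that a Django server receives the email paragraph and the question.

```python
# views.py — hedged sketch of receiving the email paragraph and the question
# as a JSON POST body; field names and routing are assumptions.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

@csrf_exempt
@require_POST
def receive_extraction_request(request):
    payload = json.loads(request.body)
    paragraph = payload["paragraph"]   # the email paragraph
    question = payload["question"]     # the question
    # ... hand paragraph and question to the loaded models (step S16) ...
    return JsonResponse({"paragraph": paragraph, "question": question})
```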
  • Step S16, load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
  • the above-mentioned extraction results can also be stored in a node of a blockchain.
  • In at least one embodiment, loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
  • the first output is obtained by inputting the email paragraph and the question into the first model, including:
  • the first model obtains a first output through calculation.
  • For example, the email paragraph and the question are converted into data that meets the format requirements of the first model, and that data is input into the first model; when the first model predicts the answer "123@45.com" to the question, it outputs the start position and end position of the answer in the email paragraph, and "From: '123@45.com'" is taken as the extraction result.
  • the second output is obtained by inputting the email paragraph and the question into the second model, including:
  • the second model obtains a second output through calculation.
  • the email paragraph "Please reply. Sender: 123@45.com” is input into the second model, and the second model outputs Answer "123@45.com” with “From: '123@45.com'” as the extraction result.
  • In summary, the present application obtains the first mail data set and marks it to obtain the first training data set; trains the BERT model using the first training data set to obtain the first model; obtains the second mail data set and marks it to obtain the second training data set; trains the BERT-LSTM-CRF model using the second training data set to obtain the second model; receives the email paragraph and the question; and loads the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result.
  • The first model, which performs better, is used as the main extraction method, and the second model is used as a supplement; the method can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 2 is a structural diagram of a mail content extraction apparatus 30 in an embodiment of the present application.
  • the email content extraction apparatus 30 is executed in an electronic device.
  • the mail content extraction apparatus 30 may include a plurality of functional modules composed of program code segments.
  • the program codes of each program segment in the mail content extraction apparatus 30 may be stored in a memory and executed by at least one processor to perform the function of mail content extraction.
  • the mail content extraction apparatus 30 can be divided into a plurality of functional modules according to the functions it performs.
  • the email content extraction apparatus 30 may include a first data labeling module 301 , a first model training module 302 , a second data labeling module 303 , a second model training module 304 , a receiving module 305 and an extraction module 306 .
  • a module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can perform fixed functions, and are stored in a memory. In some embodiments, the functions of each module will be described in detail in subsequent embodiments.
  • the first data labeling module 301 obtains a first email data set, and labels each email in the first email data set to obtain a first training data set.
  • the first data labeling module 301 labeling each email in the first email data set includes:
  • The first data labeling module 301 marks the email content of each email in the first email data set as a first mark, marks a preset question as a second mark, and marks the answer corresponding to the preset question in the email content as a third mark.
  • Specifically, the first data labeling module 301 labels the email content of each email as "paragraph", labels the preset question as "question", and labels the answer text corresponding to the preset question in the email content as "answer".
  • the first model training module 302 uses the first training data set to train a BERT model to obtain a first model.
  • the first model training module 302 uses the first training data set to train a BERT model, and obtaining the first model includes:
  • The first model training module 302 uses the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, uses the answer corresponding to the third mark in the first training data set as the expected output of the BERT model, and optimizes the BERT model to obtain the first model.
  • In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.
  • the first model training module 302 uses the first training data set to train the BERT model, and obtaining the first model may specifically include:
  • The first model training module 302 obtains a preset-question token sequence and an email-content token sequence from the preset question and the email content, splices the two token sequences, adds a separator characterizing the question, such as [CLS], before the preset-question token sequence, adds a separator characterizing the content, such as [SEP], before the email-content token sequence, and uses the spliced preset-question token sequence and email-content token sequence with the added separators as the input data of the BERT model;
  • the first model training module 302 uses the encoding layer of the BERT model to encode the input data
  • The first model training module 302 takes the answer corresponding to the third mark as the expected output and trains the prediction layer of the BERT model until the prediction layer converges; the converged prediction layer can predict the answer corresponding to a question to be answered, yielding the first model.
  • the second data labeling module 303 obtains a second email data set, and labels each email in the second email data set to obtain a second training data set.
  • the second data labeling module 303 labeling each email in the second email data set includes:
  • the second data labeling module 303 uses the BIO labeling method to label each character in each email in the second email data set to obtain a label sequence, and each email and the corresponding label sequence form the second training data set.
  • Using the BIO labeling method to label each character in each email in the second email data set is as described above for step S13: B marks the starting position of a named entity, I marks the characters inside a named entity, and O marks characters outside any named entity.
  • the second model training module 304 uses the second training data set to train a BERT-LSTM-CRF model to obtain a second model.
  • the second model training module 304 uses the second training data set to train a BERT-LSTM-CRF model, and obtaining the second model includes:
  • The second model training module 304 uses the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, uses the label sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizes the BERT-LSTM-CRF model to obtain the second model.
  • In at least one embodiment, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
  • Segment the mail text content of each email in the second training data set into tokens to obtain an email-content token sequence, and add identifiers at its starting and end positions; for example, add the identifier [CLS] at the start of the email-content token sequence and the identifier [SEP] at its end;
  • Input the email-content token sequence with the added identifiers into the BERT model to output the vector representation of each character: the input sequence is obtained by summing the token embedding, segment embedding and position embedding of each character, and the transformer in the BERT model is called to encode the input sequence to obtain a vector sequence in which each vector corresponds to the token with the same index, i.e., the word embedding sequence of the email content;
  • Inputting the word embedding sequence obtained by the BERT model into the long short-term memory network model to obtain the semantic representation of the input includes: the word embedding sequence $(x_1, x_2, \dots, x_n)$ is used as the input of each time step of the long short-term memory network; the hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \dots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \dots, \overleftarrow{h_n})$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \dots, h_n) \in \mathbb{R}^{n \times m}$; after dropout, a linear layer maps the hidden state vectors from m dimensions to k dimensions, where k is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e., the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \dots, p_n) \in \mathbb{R}^{n \times k}$, where each entry $p_{ij}$ is the score of classifying the character $x_i$ into the j-th tag;
  • Decoding the input semantic representation with the conditional random field model to obtain the tag sequence with the highest probability includes: the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the i-th tag to the j-th tag, representing the probability of a one-step transition between tag states within the sentence $x$, so that tags already assigned can be exploited when labeling a position; of the $k+2$ states, $k$ is the number of tags, and the extra 2 correspond to the start state at the head of the email content and the end state at its tail. For a tag sequence $y = (y_1, y_2, \dots, y_n)$ whose length equals the email length, the model scores the tag sequence $y$ for the email content $x$ as $\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$, and softmax yields the normalized probability $p(y \mid x) = e^{\mathrm{score}(x, y)} / \sum_{\tilde{y}} e^{\mathrm{score}(x, \tilde{y})}$ that the tag sequence of the email content $x$ equals $y$. The model is trained by maximizing the log-likelihood; during decoding, the Viterbi dynamic programming algorithm finds the optimal path, i.e., the character tag sequence with the highest probability.
  • the receiving module 305 receives email paragraphs and questions.
  • The receiving module 305 uses a Django server to receive the email paragraph and the question.
  • The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
  • The extraction module 306 loads the first model and the second model, inputs the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputs the extraction result.
  • the above-mentioned extraction results can also be stored in a node of a blockchain.
  • The extraction module 306 loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
  • the extraction module 306 inputs the email paragraph and the question into the first model to obtain a first output
  • the extraction module 306 uses the first output as an extraction result
  • the extraction module 306 loads the second model, inputs the email paragraph and the question into the second model to obtain a second output, and uses the second output as the extraction result.
  • the extraction module 306 inputs the email paragraph and the question into the first model to obtain a first output, including:
  • the extraction module 306 converts the email paragraph and the question into data that meets the format requirements of the first model
  • the extraction module 306 inputs the data conforming to the format requirements of the first model into the first model
  • the first model obtains a first output through calculation.
  • the extraction module 306 inputs the email paragraph and the question into the second model to obtain a second output, including:
  • the extraction module 306 inputs the email paragraph into the second model
  • the second model obtains a second output through calculation.
  • In summary, the present application obtains the first mail data set and marks it to obtain the first training data set; trains the BERT model using the first training data set to obtain the first model; obtains the second mail data set and marks it to obtain the second training data set; trains the BERT-LSTM-CRF model using the second training data set to obtain the second model; receives the email paragraph and the question; and loads the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result.
  • The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement; it can thus be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
  • FIG. 3 is a schematic diagram of an electronic device 6 in an embodiment of the present application.
  • the electronic device 6 includes a memory 61 , a processor 62 and computer readable instructions 63 stored in the memory 61 and executable on the processor 62 .
  • When the processor 62 executes the computer-readable instructions 63, the steps in the above embodiments of the mail content extraction method are implemented, for example, steps S11 to S16 shown in FIG. 1.
  • Alternatively, when the processor 62 executes the computer-readable instructions 63, the functions of each module/unit in the above embodiment of the mail content extraction apparatus are implemented, for example, modules 301 to 306 in FIG. 2.
  • The computer-readable instructions 63 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 62 to complete the method of the present application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 63 in the electronic device 6 .
  • The computer-readable instructions 63 can be divided into the first data labeling module 301, the first model training module 302, the second data labeling module 303, the second model training module 304, the receiving module 305, and the extraction module 306 in FIG. 2; see Embodiment 2 for the specific functions of each module.
  • the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server, and a cloud terminal device.
  • The schematic diagram is only an example of the electronic device 6 and does not constitute a limitation on the electronic device 6; the electronic device 6 may include more or fewer components than shown, combine some components, or use different components; for example, it may also include input and output devices, network access devices, buses, and the like.
  • The so-called processor 62 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor 62 may be any conventional processor, etc.
  • The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 using various interfaces and lines.
  • The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units; the processor 62 runs or executes the computer-readable instructions and/or modules/units stored in the memory 61 and calls the data stored in the memory 61 to realize various functions of the electronic device 6.
  • The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the electronic device 6, and the like.
  • The memory 61 may include volatile memory, and may also include non-volatile memory such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic disk storage device, flash memory device, or other storage device.
  • If the modules/units integrated in the electronic device 6 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium; the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), etc.
  • The memory 61 in the electronic device 6 stores computer-readable instructions implementing the mail content extraction method, and the processor 62 can execute the computer-readable instructions to implement the steps of the method described above.
  • The computer-readable storage medium stores computer-readable instructions 63, wherein the computer-readable instructions 63, when executed by the processor 62, implement the steps of the mail content extraction method described above.
  • each functional module in each embodiment of the present application may be integrated in the same processing module, or each module may exist physically alone, or two or more modules may be integrated in the same module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application relates to a mail content extraction method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring a first mail data set, and labeling the first mail data set to obtain a first training data set; obtaining a first model by using the first training data set; acquiring a second mail data set, and labeling the second mail data set to obtain a second training data set; obtaining a second model by using the second training data set; receiving a mail paragraph and a question; and loading the first model and the second model, and inputting the mail paragraph and the question into the first model or the second model, so as to obtain an extraction result. In the present application, a first model with a better effect is mainly used and a second model is used as a supplement, which can be applied to the extraction of different mail content, thereby improving the practicability of a content extraction function to a great extent. In addition, the extraction result can be stored in a blockchain.

Description

Mail content extraction method and apparatus, electronic device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 14, 2020 with application number 202011095137.1, entitled "Mail content extraction method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of text content extraction in artificial intelligence, and in particular to a mail content extraction method, apparatus, electronic device and storage medium.

Background

In the prior art, email content extraction mostly relies on regular-expression-based content extraction methods; the inventor realized that such methods usually require a large amount of work and apply only to limited scenarios.

Summary of the Invention

In view of the above, it is necessary to provide a mail content extraction method, apparatus, electronic device and storage medium to realize rapid extraction of information in emails.
A first aspect of the present application provides a mail content extraction method, including:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A second aspect of the present application provides an electronic device, the electronic device including a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, the at least one computer-readable instruction being executed by a processor to implement the following steps:

acquiring a first mail data set, and labeling each mail in the first mail data set to obtain a first training data set;

training a BERT model using the first training data set to obtain a first model;

acquiring a second mail data set, and labeling each mail in the second mail data set to obtain a second training data set;

training a BERT-LSTM-CRF model using the second training data set to obtain a second model;

receiving an email paragraph and a question; and

loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
A fourth aspect of the present application provides a mail content extraction apparatus, including:

a first data labeling module, configured to acquire a first mail data set and label each mail in the first mail data set to obtain a first training data set;

a first model training module, configured to train a BERT model using the first training data set to obtain a first model;

a second data labeling module, configured to acquire a second mail data set and label each mail in the second mail data set to obtain a second training data set;

a second model training module, configured to train a BERT-LSTM-CRF model using the second training data set to obtain a second model;

a receiving module, configured to receive an email paragraph and a question; and

an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and output the extraction result.
It can be seen from the above technical solutions that, in the present application, a first mail data set is acquired and labeled to obtain a first training data set; a BERT model is trained using the first training data set to obtain a first model; a second mail data set is acquired and labeled to obtain a second training data set; a BERT-LSTM-CRF model is trained using the second training data set to obtain a second model; an email paragraph and a question are received; and the first model and the second model are loaded, and the email paragraph and the question are input into the first model or the second model to obtain an extraction result. The present application uses the first model, which performs better, as the main extraction method and the second model as a supplement, so it can be applied to extracting different email contents, greatly improving the practicability of the content extraction function.
Brief Description of the Drawings

FIG. 1 is a flowchart of a preferred embodiment of the mail content extraction method of the present application.

FIG. 2 is a functional block diagram of a preferred embodiment of the mail content extraction apparatus of the present application.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the mail content extraction method of the present application.
Detailed Description

In order to more clearly understand the above objects, features and advantages of the present application, the present application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

Many specific details are set forth in the following description to facilitate a full understanding of the present application; the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present application belongs. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application.

Preferably, the mail content extraction method of the present application is applied in one or more electronic devices. The electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.

The electronic device may be a computing device such as a desktop computer, a notebook computer, a tablet computer, or a cloud server. The device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad, or a voice-control device.
Embodiment 1

FIG. 1 is a flowchart of the mail content extraction method in an embodiment of the present application. The order of the steps in the flowchart may be changed and some steps may be omitted according to different requirements.

Referring to FIG. 1, the mail content extraction method specifically includes the following steps.
Step S11: acquire a first mail data set, and label each mail in the first mail data set to obtain a first training data set.

In at least one embodiment of the present application, labeling each mail in the first mail data set includes: marking the email content of each mail in the first mail data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.

Specifically, for each mail in the first mail data set, the email content of the mail is marked as "paragraph", the preset question is marked as "question", and the answer text corresponding to the preset question in the email content is marked as "answer".

For example, when the email content is "请速回。发件人:123@45.com" ("Please reply soon. Sender: 123@45.com") and the question is "发件人是谁" ("Who is the sender"), "请速回。发件人:123@45.com" is marked as "paragraph", "发件人是谁" is marked as "question", and "123@45.com" is marked as "answer".
Step S12: train a BERT model using the first training data set to obtain a first model.

In at least one embodiment of the present application, training the BERT model using the first training data set to obtain the first model includes: using the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, using the answer corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model. In other embodiments, the start position and end position, within the email content, of the answer corresponding to the third mark in the first training data set may instead be used as the output of the BERT model.

The BERT model is a Bidirectional Encoder Representations from Transformers model.

Further, training the BERT model using the first training data set to obtain the first model may specifically include:

obtaining a preset-question token sequence and an email-content token sequence from the preset question and the email content, splicing the two token sequences, adding a separator characterizing the question, such as [CLS], before the preset-question token sequence, adding a separator characterizing the content, such as [SEP], before the email-content token sequence, and using the preset-question token sequence and the email-content token sequence with the added separators as the input data of the BERT model;

encoding the input data using the encoding layer of the BERT model; and

taking the answer corresponding to the third mark as the expected output and training the prediction layer of the BERT model until the prediction layer converges; the converged prediction layer can predict the answer corresponding to a question to be answered, yielding the first model.
Step S13: acquire a second mail data set, and label each mail in the second mail data set to obtain a second training data set.

In at least one embodiment of the present application, labeling each mail in the second mail data set includes: labeling each character of each mail in the second mail data set using the BIO labeling method to obtain a tag sequence; each mail and its corresponding tag sequence form the second training data set.

Specifically, labeling each character of each mail in the second mail data set using the BIO labeling method includes: using B to mark the starting position of a named entity (for example, when labeling the "sender", the starting character of the named entity corresponding to the sender is labeled "B-SENDER"), using I to mark the remaining characters inside a named entity (for example, the subsequent characters of the sender entity are labeled "I-SENDER"), and using O to mark characters that do not belong to any named entity.

For example, when the email content is "请速回。发件人:123@45.com." and the preset keyword is the sender, the character "1" can be labeled "B-SENDER"; the characters "2", "3", "@", "4", "5", ".", "c", "o", "m" are labeled "I-SENDER"; and the characters "请", "速", "回", "。", "发", "件", "人", ":" are labeled "O". The correspondence between the text content and the labeling result is shown in the following table.
Character    Tag
请           O
速           O
回           O
。           O
发           O
件           O
人           O
:            O
1            B-SENDER
2            I-SENDER
3            I-SENDER
@            I-SENDER
4            I-SENDER
5            I-SENDER
.            I-SENDER
c            I-SENDER
o            I-SENDER
m            I-SENDER
Step S14: train a BERT-LSTM-CRF model using the second training data set to obtain a second model.

In at least one embodiment of the present application, training the BERT-LSTM-CRF model using the second training data set to obtain the second model includes: using the mail text content in the second training data set as the input data of the BERT-LSTM-CRF model, using the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.

LSTM denotes a long short-term memory network, and CRF denotes a conditional random field.

In at least one embodiment of the present application, the BERT-LSTM-CRF model first obtains the word embedding sequence of the email content through the BERT model, then inputs the word embedding sequence into a long short-term memory network model for semantic encoding, and finally decodes through the conditional random field model and outputs the tag sequence with the highest probability. Specifically, this includes:
将所述第二训练数据集中的每个邮件的邮件文本内容进行分词得到邮件内容token序列;Perform word segmentation on the mail text content of each mail in the second training data set to obtain the mail content token sequence;
在所述邮件内容token序列起始位置和末尾位置分别增加标识符号,例如,在所述邮件内容token序列起始位置增加标识符号[CLS],在所述邮件内容token序列末尾位置表示符号[SEP];Add an identifier to the starting position and the end of the email content token sequence, for example, add an identifier [CLS] to the start of the email content token sequence, and add a symbol [SEP] to the end of the email content token sequence. ];
将增加标识符后的所述邮件内容token序列输入BERT模型,输出所述邮件内容中每个字符的向量表示,即每个邮件内容的字嵌入序列(x 1,x 2,…,x n),包括: Input the token sequence of the email content after adding the identifier into the BERT model, and output the vector representation of each character in the email content, that is, the word embedding sequence of each email content (x 1 ,x 2 ,...,x n ) ,include:
对每个字符的token嵌入、分割嵌入和位置嵌入进行求和得到输入的序列,调用所述BERT模型中的变换器对所述输入的序列进行编解码得到向量序列,所述向量序列中的每个向量对应具有相同索引的token,即所述邮件内容的字嵌入序列;The input sequence is obtained by summing the token embedding, segmentation embedding and position embedding of each character, and the transformer in the BERT model is called to encode and decode the input sequence to obtain a vector sequence. The vectors correspond to tokens with the same index, that is, the word embedding sequence of the email content;
feeding the word-embedding sequence obtained from the BERT model into the long short-term memory network to obtain the semantic representation of the input, which includes:
taking the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of the LSTM at each time step, and concatenating, position by position, the hidden-state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden-state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden-state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$; after applying dropout, a linear layer maps each hidden-state vector from $m$ dimensions to $k$ dimensions, where $k$ is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e. the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, in which the $j$-th component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability value of classifying character $x_i$ into the $j$-th tag;
decoding the semantic representation of the input with the conditional random field model to obtain the tag sequence with the highest probability, which includes:
the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, in which $A_{ij}$ is the score of transitioning from the $i$-th tag to the $j$-th tag, representing the one-step transition probabilities between all tag states in the sentence $x$, so that tags already assigned can be exploited when tagging a new position; in $k+2$, $k$ is the number of tags, and the additional 2 correspond to the start state at the head of the email content and the end state at its tail. Writing a tag sequence whose length equals the email length as $y = (y_1, y_2, \ldots, y_n)$, and letting $y_0$ and $y_{n+1}$ denote the start and end states, the model scores the tagging $y$ of the email content $x$ as

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i},$$

and softmax gives the normalized probability that the tagging of the email content $x$ equals $y$:

$$P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)}.$$

Training maximizes the log-likelihood of the correct tag sequences, and during decoding the Viterbi dynamic-programming algorithm solves for the optimal path, outputting the tag sequence $y = (y_1, y_2, \ldots, y_n)$ with the highest probability as the prediction. A condensed sketch of this forward pass is given below.
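The following PyTorch sketch condenses the pipeline above, assuming a pretrained BERT encoder from the `transformers` library; the class name, hidden size, and the simplified Viterbi decoder (which omits the explicit start/end states of the $(k+2) \times (k+2)$ transition matrix) are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertLstmCrf(nn.Module):
    """Sketch: BERT word embeddings -> BiLSTM semantic encoding -> linear
    layer to k tag scores (matrix P) -> Viterbi decoding with transitions A."""
    def __init__(self, num_tags, lstm_hidden=128, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.1)                  # dropout before the linear layer
        self.fc = nn.Linear(2 * lstm_hidden, num_tags)  # maps m dims to k dims
        self.transitions = nn.Parameter(torch.zeros(num_tags, num_tags))  # A[i][j]

    def emissions(self, input_ids, attention_mask):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(x)              # forward/backward hidden states, concatenated
        return self.fc(self.dropout(h))  # P: (batch, n, k) per-tag scores

    @torch.no_grad()
    def viterbi_decode(self, emissions):  # emissions: (n, k) for one sequence
        score = emissions[0]
        backpointers = []
        for t in range(1, emissions.size(0)):
            # total[i, j] = best score ending in tag i, then i -> j, then emit at t
            total = score.unsqueeze(1) + self.transitions + emissions[t]
            score, idx = total.max(dim=0)
            backpointers.append(idx)
        best = [int(score.argmax())]
        for idx in reversed(backpointers):
            best.append(int(idx[best[-1]]))
        return best[::-1]                # most probable tag sequence y
```

Training would maximize the log-likelihood $\log P(y \mid x)$, whose partition function can be computed with the standard forward algorithm (or a library such as pytorch-crf); only decoding is shown here.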
Step S15: receive an email paragraph and a question.
In at least one embodiment of the present application, a django server is used to receive the email paragraph and the question.
In other embodiments of the present application, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain-name services, security services, CDN (Content Delivery Network), and big-data and artificial-intelligence platforms.
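For the django embodiment above, a minimal view might look like the sketch below; the view name, JSON field names, and the placeholder cascade call are assumptions, not part of the original disclosure.

```python
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def extract_view(request):
    """Receive the email paragraph and the question as a JSON POST body."""
    payload = json.loads(request.body)
    paragraph = payload["paragraph"]   # the email paragraph
    question = payload["question"]     # the question to be answered
    # result = run_cascade(paragraph, question)  # placeholder for step S16
    return JsonResponse({"paragraph": paragraph, "question": question})
```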
Step S16: load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and output the extraction result.
It should be emphasized that, to further ensure the privacy and security of the above extraction result, the extraction result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the corresponding extraction result includes the following steps (sketched in code after this list):
inputting the email paragraph and the question into the first model to obtain a first output;
when the first output is non-empty, taking the first output as the extraction result;
when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
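A compact sketch of this fallback logic, assuming `first_model` and `second_model` are callables wrapping the two trained models:

```python
def extract_content(paragraph, question, first_model, second_model):
    """Try the first (QA) model; fall back to the second (sequence-labeling)
    model when the first output is empty."""
    first_output = first_model(paragraph, question)
    if first_output:                  # non-empty: use it as the extraction result
        return first_output
    return second_model(paragraph)    # the second model only needs the paragraph
```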
In at least one embodiment of the present application, inputting the email paragraph and the question into the first model to obtain the first output includes:
converting the email paragraph and the question into data that meets the format requirements of the first model;
inputting the data that meets the format requirements of the first model into the first model;
the first model computing the first output.
For example, when the email paragraph received by the server is "请速回。发件人：123@45.com" and the question is "发件人是谁" (who is the sender), the email paragraph and the question are converted into data that meets the format requirements of the first model and fed into the first model; when the first model can predict the answer "123@45.com" to the question, it outputs the start and end positions of the answer within the email paragraph, and "发件人：'123@45.com'" is taken as the extraction result.
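One way to exercise a trained extractive QA model of this kind is the `transformers` question-answering pipeline, sketched below; the checkpoint name is a placeholder assumption, not the first model actually trained in this disclosure.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute the fine-tuned first model in practice.
qa = pipeline("question-answering", model="bert-base-chinese")
result = qa(question="发件人是谁", context="请速回。发件人：123@45.com")
# `result` holds the answer text plus its start/end character positions
print(result["answer"], result["start"], result["end"])
```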
In at least one embodiment of the present application, inputting the email paragraph and the question into the second model to obtain the second output includes:
inputting the email paragraph into the second model;
the second model computing the second output.
For example, when the first model cannot predict the answer to the question, the email paragraph "请速回。发件人：123@45.com" is input into the second model, which outputs the answer "123@45.com", and "发件人：'123@45.com'" is taken as the extraction result.
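Recovering the extraction result from the second model's character-level tags amounts to collecting the B-/I- span, as in this sketch (the tag list is hard-coded to match the example above):

```python
def tags_to_answer(chars, tags, label="SENDER"):
    """Join the characters tagged B-<label>/I-<label> into the answer span."""
    return "".join(c for c, t in zip(chars, tags)
                   if t in ("B-" + label, "I-" + label))

chars = list("请速回。发件人：123@45.com")
tags = ["O"] * 8 + ["B-SENDER"] + ["I-SENDER"] * 9
print(tags_to_answer(chars, tags))  # -> 123@45.com
```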
In the present application, a first email data set is obtained and labeled to produce a first training data set; a BERT model is trained with the first training data set to obtain a first model; a second email data set is obtained and labeled to produce a second training data set; a BERT-LSTM-CRF model is trained with the second training data set to obtain a second model; an email paragraph and a question are received; the first model is loaded, and the email paragraph and the question are input into the first model or the second model to obtain the extraction result. The present application uses the better-performing first model as the primary extraction method and the second model as a supplement, so it can be applied to extracting varied email contents, which greatly improves the practicality of the content-extraction function.
Embodiment 2
FIG. 2 is a structural diagram of an email content extraction apparatus 30 according to an embodiment of the present application.
In some embodiments, the email content extraction apparatus 30 runs in an electronic device. The email content extraction apparatus 30 may include a plurality of functional modules composed of program code segments. The program code of each segment in the email content extraction apparatus 30 may be stored in a memory and executed by at least one processor to perform the email content extraction function.
In this embodiment, the email content extraction apparatus 30 may be divided into a plurality of functional modules according to the functions it performs. Referring to FIG. 2, the email content extraction apparatus 30 may include a first data labeling module 301, a first model training module 302, a second data labeling module 303, a second model training module 304, a receiving module 305, and an extraction module 306. A module in this application refers to a series of computer-readable instruction segments, stored in a memory, that can be executed by at least one processor to perform a fixed function. In some embodiments, the functions of each module are detailed in the following description.
The first data labeling module 301 obtains a first email data set and labels each email in the first email data set to obtain a first training data set.
In at least one embodiment of the present application, the first data labeling module 301 labeling each email in the first email data set includes:
the first data labeling module 301 marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
Specifically, for each email in the first email data set, the first data labeling module 301 labels the email content as "paragraph", labels the preset question as "question", and labels the answer text corresponding to the preset question in the email content as "answer".
The first model training module 302 trains a BERT model with the first training data set to obtain a first model.
In at least one embodiment of the present application, the first model training module 302 training the BERT model with the first training data set to obtain the first model includes:
the first model training module 302 taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model. In other embodiments, the start and end positions, within the email content, of the answer corresponding to the third mark may instead be used as the output of the BERT model.
Further, the first model training module 302 training the BERT model with the first training data set to obtain the first model may specifically include the following steps (a feature-construction sketch follows this list):
the first model training module 302 deriving a preset-question token sequence and an email-content token sequence from the preset question and the email content, concatenating the two sequences, adding a separator characterizing the question, for example [CLS], before the preset-question token sequence and a separator characterizing the content, for example [SEP], before the email-content token sequence, and taking the concatenated sequence with separators as the input data of the BERT model;
the first model training module 302 encoding the input data with the encoding layer of the BERT model;
the first model training module 302 taking the answer corresponding to the third mark as the expected output and training the prediction layer of the BERT model until the prediction layer converges, after which the converged prediction layer can predict the answer to a given question, yielding the first model.
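As a sketch of this input construction, a fast tokenizer from `transformers` can build the concatenated question/paragraph sequence (with its [CLS]/[SEP] separators) and map the labeled answer to start/end token positions; the checkpoint name and variable names are assumptions for illustration.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder checkpoint
question = "发件人是谁"
paragraph = "请速回。发件人：123@45.com"
answer = "123@45.com"

enc = tok(question, paragraph, return_offsets_mapping=True)
char_start = paragraph.find(answer)
char_end = char_start + len(answer)

# Map the character-level answer span to token positions in the paragraph
# segment (sequence id 1); these become the expected start/end outputs.
seq_ids = enc.sequence_ids()
start_tok = end_tok = None
for i, (s, e) in enumerate(enc["offset_mapping"]):
    if seq_ids[i] != 1:
        continue
    if start_tok is None and s <= char_start < e:
        start_tok = i
    if s < char_end <= e:
        end_tok = i
print(start_tok, end_tok)
```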
The second data labeling module 303 obtains a second email data set and labels each email in the second email data set to obtain a second training data set.
In at least one embodiment of the present application, the second data labeling module 303 labeling each email in the second email data set includes:
the second data labeling module 303 labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence together forming the second training data set.
Specifically, labeling each character of each email in the second email data set with the BIO scheme includes:
using B to mark the first character of a named entity, I to mark the remaining characters of a named entity, and O to mark characters that belong to no named entity.
The second model training module 304 trains a BERT-LSTM-CRF model with the second training data set to obtain a second model.
In at least one embodiment of the present application, the second model training module 304 training the BERT-LSTM-CRF model with the second training data set to obtain the second model includes:
the second model training module 304 taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
In at least one embodiment of the present application, the BERT-LSTM-CRF model first obtains the word-embedding sequence of the email content through the BERT model, then feeds the word-embedding sequence into the long short-term memory network for semantic encoding, and finally decodes through the conditional random field model to output the tag sequence with the highest probability.
Specifically, this process includes:
segmenting the email text content of each email in the second training data set into an email-content token sequence;
adding identifier tokens at the beginning and end of the email-content token sequence, for example [CLS] at the beginning and [SEP] at the end;
feeding the token sequence with identifiers into the BERT model and outputting a vector representation for each character of the email content, i.e. the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ of each email content, which includes:
summing the token embedding, segment embedding, and position embedding of each character to obtain the input sequence, and calling the transformer in the BERT model to encode the input sequence into a vector sequence in which each vector corresponds to the token with the same index, i.e. the word-embedding sequence of the email content;
feeding the word-embedding sequence obtained from the BERT model into the long short-term memory network to obtain the semantic representation of the input, which includes:
taking the word-embedding sequence $(x_1, x_2, \ldots, x_n)$ as the input of the LSTM at each time step, and concatenating, position by position, the hidden-state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM with the hidden-state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden-state sequence $(h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$; after applying dropout, a linear layer maps each hidden-state vector from $m$ dimensions to $k$ dimensions, where $k$ is the number of tags in the labeling scheme, yielding the automatically extracted sentence features, i.e. the semantic representation of the input, recorded as the matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, in which the $j$-th component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is the probability value of classifying character $x_i$ into the $j$-th tag;
decoding the semantic representation of the input with the conditional random field model to obtain the tag sequence with the highest probability, which includes:
the parameter of the conditional random field model is a $(k+2) \times (k+2)$ matrix $A$, in which $A_{ij}$ is the score of transitioning from the $i$-th tag to the $j$-th tag, representing the one-step transition probabilities between all tag states in the sentence $x$, so that tags already assigned can be exploited when tagging a new position; in $k+2$, $k$ is the number of tags, and the additional 2 correspond to the start state at the head of the email content and the end state at its tail. Writing a tag sequence whose length equals the email length as $y = (y_1, y_2, \ldots, y_n)$, and letting $y_0$ and $y_{n+1}$ denote the start and end states, the model scores the tagging $y$ of the email content $x$ as

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i},$$

and softmax gives the normalized probability that the tagging of the email content $x$ equals $y$:

$$P(y \mid x) = \frac{\exp\bigl(\mathrm{score}(x, y)\bigr)}{\sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)}.$$

Training maximizes the log-likelihood of the correct tag sequences, and during decoding the Viterbi dynamic-programming algorithm solves for the optimal path, outputting the tag sequence $y = (y_1, y_2, \ldots, y_n)$ with the highest probability as the prediction, as sketched in code in Embodiment 1.
The receiving module 305 receives an email paragraph and a question.
In at least one embodiment of the present application, the receiving module 305 uses a django server to receive the email paragraph and the question.
In other embodiments of the present application, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud-computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain-name services, security services, CDN (Content Delivery Network), and big-data and artificial-intelligence platforms.
The extraction module 306 loads the first model and the second model, inputs the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputs the extraction result.
It should be emphasized that, to further ensure the privacy and security of the above extraction result, the extraction result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the extraction module 306 loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question includes:
the extraction module 306 inputting the email paragraph and the question into the first model to obtain a first output;
when the first output is non-empty, the extraction module 306 taking the first output as the extraction result;
when the first output is empty, the extraction module 306 loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
In at least one embodiment of the present application, the extraction module 306 inputting the email paragraph and the question into the first model to obtain the first output includes:
the extraction module 306 converting the email paragraph and the question into data that meets the format requirements of the first model;
the extraction module 306 inputting the data that meets the format requirements of the first model into the first model;
the first model computing the first output.
In at least one embodiment of the present application, the extraction module 306 inputting the email paragraph and the question into the second model to obtain the second output includes:
the extraction module 306 inputting the email paragraph into the second model;
the second model computing the second output.
In the present application, a first email data set is obtained and labeled to produce a first training data set; a BERT model is trained with the first training data set to obtain a first model; a second email data set is obtained and labeled to produce a second training data set; a BERT-LSTM-CRF model is trained with the second training data set to obtain a second model; an email paragraph and a question are received; the first model is loaded, and the email paragraph and the question are input into the first model or the second model to obtain the extraction result. The present application uses the better-performing first model as the primary extraction method and the second model as a supplement, so it can be applied to extracting varied email contents, which greatly improves the practicality of the content-extraction function.
Embodiment 3
FIG. 3 is a schematic diagram of an electronic device 6 according to an embodiment of the present application.
The electronic device 6 includes a memory 61, a processor 62, and computer-readable instructions 63 stored in the memory 61 and executable on the processor 62. When the processor 62 executes the computer-readable instructions 63, the steps of the above email content extraction method embodiment are implemented, for example steps S11 to S16 shown in FIG. 1. Alternatively, when the processor 62 executes the computer-readable instructions 63, the functions of the modules/units in the above email content extraction apparatus embodiment are implemented, for example modules 301 to 306 in FIG. 2.
Exemplarily, the computer-readable instructions 63 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 62 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer-readable instructions 63 in the electronic device 6. For example, the computer-readable instructions 63 may be divided into the first data labeling module 301, the first model training module 302, the second data labeling module 303, the second model training module 304, the receiving module 305, and the extraction module 306 in FIG. 2; see Embodiment 2 for the specific functions of each module.
In this embodiment, the electronic device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, a server, or a cloud terminal device. Those skilled in the art will understand that the schematic diagram is only an example of the electronic device 6 and does not constitute a limitation: the device may include more or fewer components than illustrated, combine certain components, or use different components; for example, the electronic device 6 may also include input/output devices, network access devices, buses, and the like.
The processor 62 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor. The processor 62 is the control center of the electronic device 6 and connects all parts of the entire electronic device 6 through various interfaces and lines.
The memory 61 may be used to store the computer-readable instructions 63 and/or modules/units. The processor 62 implements the various functions of the electronic device 6 by running or executing the computer-readable instructions and/or modules/units stored in the memory 61 and calling the data stored in the memory 61. The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the electronic device 6. In addition, the memory 61 may include volatile memory and may also include non-volatile memory, such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic-disk storage device, flash memory device, or other storage device.
If the modules/units integrated in the electronic device 6 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through computer-readable instructions, which may be stored in a computer-readable storage medium; when executed by a processor, the computer-readable instructions implement the steps of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source-code form, object-code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), and the like.
With reference to FIG. 1, the memory 61 in the electronic device 6 stores computer-readable instructions implementing an email content extraction method, and the processor 62 may execute the computer-readable instructions to implement:
obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
training a BERT model with the first training data set to obtain a first model;
obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
receiving an email paragraph and a question;
loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
Specifically, for the implementation of the computer-readable instructions by the processor 62, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The computer-readable storage medium stores computer-readable instructions 63, which, when executed by the processor 62, implement the following steps:
obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
training a BERT model with the first training data set to obtain a first model;
obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
receiving an email paragraph and a question;
loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
In addition, the functional modules in the embodiments of the present application may be integrated in the same processing module, may each exist physically separately, or two or more modules may be integrated in the same module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is apparent to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application. The embodiments should therefore be regarded in all respects as exemplary and non-limiting, and the scope of the application is defined by the appended claims rather than by the foregoing description; all changes that fall within the meaning and scope of equivalents of the claims are intended to be embraced by the application. No reference sign in the claims shall be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices recited in the device claims may also be implemented by the same module or device through software or hardware. Words such as "first" and "second" denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or equivalently substituted without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. An email content extraction method, wherein the email content extraction method comprises:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  2. The email content extraction method according to claim 1, wherein labeling each email in the first email data set comprises:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  3. The email content extraction method according to claim 2, wherein training the BERT model with the first training data set to obtain the first model comprises:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  4. The email content extraction method according to claim 1, wherein labeling each email in the second email data set comprises:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  5. The email content extraction method according to claim 4, wherein training the BERT-LSTM-CRF model with the second training data set to obtain the second model comprises:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  6. The email content extraction method according to claim 5, wherein the BERT-LSTM-CRF model first obtains the word-embedding sequence of the email content through the BERT model, then feeds the word-embedding sequence into a long short-term memory network for semantic encoding, and finally decodes through a conditional random field model to output the tag sequence with the highest probability.
  7. The email content extraction method according to claim 1, wherein loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question comprises:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, taking the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
  8. An email content extraction apparatus, wherein the email content extraction apparatus comprises:
    a first data labeling module, configured to obtain a first email data set and label each email in the first email data set to obtain a first training data set;
    a first model training module, configured to train a BERT model with the first training data set to obtain a first model;
    a second data labeling module, configured to obtain a second email data set and label each email in the second email data set to obtain a second training data set;
    a second model training module, configured to train a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    a receiving module, configured to receive an email paragraph and a question;
    an extraction module, configured to load the first model and the second model, input the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and output the extraction result.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  10. The electronic device according to claim 9, wherein, in labeling each email in the first email data set, the processor executes the at least one computer-readable instruction to implement the following step:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  11. The electronic device according to claim 10, wherein, in training the BERT model with the first training data set to obtain the first model, the processor executes the at least one computer-readable instruction to implement the following step:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  12. The electronic device according to claim 9, wherein, in labeling each email in the second email data set, the processor executes the at least one computer-readable instruction to implement the following step:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  13. The electronic device according to claim 12, wherein, in training the BERT-LSTM-CRF model with the second training data set to obtain the second model, the processor executes the at least one computer-readable instruction to implement the following step:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  14. The electronic device according to claim 9, wherein, in loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, the processor executes the at least one computer-readable instruction to implement the following steps:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, taking the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and taking the second output as the extraction result.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction that, when executed by a processor, implements the following steps:
    obtaining a first email data set, and labeling each email in the first email data set to obtain a first training data set;
    training a BERT model with the first training data set to obtain a first model;
    obtaining a second email data set, and labeling each email in the second email data set to obtain a second training data set;
    training a BERT-LSTM-CRF model with the second training data set to obtain a second model;
    receiving an email paragraph and a question;
    loading the first model and the second model, inputting the email paragraph and the question into the first model or the second model to obtain an extraction result corresponding to the email paragraph and the question, and outputting the extraction result.
  16. The storage medium according to claim 15, wherein, in labeling each email in the first email data set, the at least one computer-readable instruction is executed by the processor to implement the following step:
    marking the email content of each email in the first email data set as a first mark, marking a preset question as a second mark, and marking the answer corresponding to the preset question in the email content as a third mark.
  17. The storage medium according to claim 16, wherein, in training the BERT model with the first training data set to obtain the first model, the at least one computer-readable instruction is executed by the processor to implement the following step:
    taking the data corresponding to the first mark and the data corresponding to the second mark in the first training data set as the input of the BERT model, taking the answers corresponding to the third mark in the first training data set as the output of the BERT model, and optimizing the BERT model to obtain the first model.
  18. The storage medium according to claim 15, wherein, in labeling each email in the second email data set, the at least one computer-readable instruction is executed by the processor to implement the following step:
    labeling each character of each email in the second email data set with the BIO scheme to obtain a tag sequence, each email and its corresponding tag sequence forming the second training data set.
  19. The storage medium according to claim 18, wherein, in training the BERT-LSTM-CRF model with the second training data set to obtain the second model, the at least one computer-readable instruction is executed by the processor to implement the following step:
    taking the email text content in the second training data set as the input data of the BERT-LSTM-CRF model, taking the tag sequences in the second training data set as the expected output of the BERT-LSTM-CRF model, and optimizing the BERT-LSTM-CRF model to obtain the second model.
  20. The storage medium according to claim 15, wherein, when loading the first model and the second model and inputting the email paragraph and the question into the first model or the second model to obtain the extraction result corresponding to the email paragraph and the question, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    inputting the email paragraph and the question into the first model to obtain a first output;
    when the first output is non-empty, using the first output as the extraction result;
    when the first output is empty, loading the second model, inputting the email paragraph and the question into the second model to obtain a second output, and using the second output as the extraction result.
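Read as code, the cascade of claim 20 could be sketched as follows, reusing the hypothetical models from the earlier sketches; answer_from_qa and the ner_extract fall-back are assumed inference wrappers, not names from the patent.

```python
# Sketch of the two-stage extraction cascade: try the QA model first,
# fall back to the sequence tagger only when it yields nothing.
import torch

def answer_from_qa(model, tokenizer, paragraph: str, question: str) -> str:
    enc = tokenizer(question, paragraph, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    if end < start:  # no consistent answer span: treat as an empty first output
        return ""
    return tokenizer.decode(enc["input_ids"][0][start:end + 1])

def extract(paragraph: str, question: str, qa_model, tokenizer, ner_extract):
    first_output = answer_from_qa(qa_model, tokenizer, paragraph, question)
    if first_output:        # non-empty: the first output is the extraction result
        return first_output
    # empty: apply the second model instead (ner_extract is a hypothetical wrapper)
    return ner_extract(paragraph, question)
```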
PCT/CN2021/123362 2020-10-14 2021-10-12 Mail content extraction method and apparatus, and electronic device and storage medium WO2022078348A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011095137.1A CN112184178A (en) 2020-10-14 2020-10-14 Mail content extraction method and device, electronic equipment and storage medium
CN202011095137.1 2020-10-14

Publications (1)

Publication Number Publication Date
WO2022078348A1 true WO2022078348A1 (en) 2022-04-21

Family

ID=73949920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123362 WO2022078348A1 (en) 2020-10-14 2021-10-12 Mail content extraction method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112184178A (en)
WO (1) WO2022078348A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184178A (en) * 2020-10-14 2021-01-05 深圳壹账通智能科技有限公司 Mail content extraction method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188175A * 2019-04-29 2019-08-30 厦门快商通信息咨询有限公司 Question-answer pair extraction method, system and storage medium based on a BiLSTM-CRF model
CN110287334A * 2019-06-13 2019-09-27 淮阴工学院 School-domain knowledge graph construction method based on entity recognition and attribute extraction models
US20190362020A1 * 2018-05-22 2019-11-28 Salesforce.Com, Inc. Abstraction of text summarization
WO2020174826A1 (en) * 2019-02-25 2020-09-03 日本電信電話株式会社 Answer generating device, answer learning device, answer generating method, and answer generating program
CN112184178A (en) * 2020-10-14 2021-01-05 深圳壹账通智能科技有限公司 Mail content extraction method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157177B2 (en) * 2016-10-28 2018-12-18 Kira Inc. System and method for extracting entities in electronic documents
CN109460551B (en) * 2018-10-29 2023-04-18 北京知道创宇信息技术股份有限公司 Signature information extraction method and device
CN111368542A (en) * 2018-12-26 2020-07-03 北京大学 Text language association extraction method and system based on recurrent neural network
CN110516256A * 2019-08-30 2019-11-29 的卢技术有限公司 Chinese named entity extraction method and system
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN111598550A (en) * 2020-05-22 2020-08-28 深圳市小满科技有限公司 Mail signature information extraction method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112184178A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
WO2020232882A1 (en) Named entity recognition method and apparatus, device, and computer readable storage medium
JP6909832B2 (en) Methods, devices, equipment and media for recognizing important words in audio
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
US11551437B2 (en) Collaborative information extraction
CN111241209B (en) Method and device for generating information
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
WO2021143206A1 (en) Single-statement natural language processing method and apparatus, computer device, and readable storage medium
WO2021174871A1 (en) Data query method and system, computer device, and storage medium
CN112287095A (en) Method and device for determining answers to questions, computer equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
JP2023539470A (en) Automatic knowledge graph configuration
WO2022078348A1 (en) Mail content extraction method and apparatus, and electronic device and storage medium
CN111160026A (en) Model training method and device, and method and device for realizing text processing
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN116978028A (en) Video processing method, device, electronic equipment and storage medium
WO2023061441A1 (en) Text quantum circuit determination method, text classification method, and related apparatus
CN115033683A (en) Abstract generation method, device, equipment and storage medium
WO2021135103A1 (en) Method and apparatus for semantic analysis, computer device, and storage medium
US20210295036A1 (en) Systematic language to enable natural language processing on technical diagrams
CN113065354A (en) Method for identifying geographic position in corpus and related equipment thereof
CN110909541A (en) Instruction generation method, system, device and medium
CN114385779B (en) Emergency scheduling instruction execution method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879392

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21879392

Country of ref document: EP

Kind code of ref document: A1