CN116821552A - Mail information extraction method and device and electronic equipment - Google Patents

Mail information extraction method and device and electronic equipment

Info

Publication number
CN116821552A
Authority
CN
China
Prior art keywords
mail
target
content
preset
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310907244.7A
Other languages
Chinese (zh)
Inventor
吴培浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huanju Shidai Information Technology Co Ltd
Original Assignee
Guangzhou Huanju Shidai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huanju Shidai Information Technology Co Ltd
Priority to CN202310907244.7A
Publication of CN116821552A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • G06V30/424 Postal images, e.g. labels or addresses on parcels or postal envelopes

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a mail information extraction method, a device and electronic equipment, wherein the mail information extraction method comprises the following steps: determining a target mail page of mail information to be extracted; analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements; determining first element content corresponding to the first element according to the attribute information; determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element; and combining the first element with the first element content and combining the second element with the second element content to obtain the extracted mail information. The method can improve the accuracy of mail information extraction.

Description

Mail information extraction method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting mail information, and an electronic device.
Background
With the continuous development of internet communication, more and more information in people's daily work, study and life is transmitted by email, and email has become an indispensable mode of communication for most people. Faced with massive volumes of email, how to analyze email information quickly and effectively has become a focus of email analysis in the big data era.
However, in the related art, relevant security protocols are typically used only to help the mail recipient filter illegal mail and junk mail; the content of the mail is not further understood, which results in low accuracy of mail information extraction.
Disclosure of Invention
The embodiment of the application provides a mail information extraction method, a device and electronic equipment, which are used for solving the problem of low accuracy of mail information extraction.
An embodiment of the present application provides a method for extracting mail information, where the method includes: determining a target mail page of mail information to be extracted; analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements; determining first element content corresponding to the first element according to the attribute information; determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element; and combining the first element with the first element content and combining the second element with the second element content to obtain the extracted mail information.
Further, in the above method provided by the embodiment of the present application, the determining the target mail page of the mail information to be extracted includes: acquiring an initial mail set; selecting a target mail set associated with a preset element keyword from the initial mail set; and carrying out page rendering on each target mail in the target mail set according to the rendering instruction to obtain a target mail page.
Further, in the above method provided by the embodiment of the present application, the analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements includes: acquiring a first element associated with the preset element keyword according to the preset element keyword; determining an element position of the first element in the target mail page; determining the element size corresponding to the first element according to the element position; and combining the element position and the element size to obtain attribute information of the first element.
Further, in the above method provided by the embodiment of the present application, the determining, according to the attribute information, the first element content corresponding to the first element includes: determining the element position corresponding to the first element according to the attribute information; acquiring a plurality of data blocks in the neighborhood corresponding to the element position; and selecting a target data block from the plurality of data blocks, and taking the content corresponding to the target data block as the first element content corresponding to the first element.
Further, in the above method provided by the embodiment of the present application, the selecting a target data block from among the plurality of data blocks includes: analyzing each data block in the neighborhood to obtain the length and the data type of the data block corresponding to each data block; and selecting a target data block from the plurality of data blocks in the neighborhood according to the data block length and the data type.
Further, in the above method provided by the embodiment of the present application, the determining the target mail text corresponding to the target mail page includes: acquiring an initial mail text corresponding to the target mail page; and deleting the first element and the first element content from the initial mail text to obtain a target mail text.
Further, in the above method provided by the embodiment of the present application, the inputting the target mail text into a preset element recognition model to obtain a second element and a second element content corresponding to the second element includes: inputting the target mail text to an embedding layer of the preset element recognition model to obtain a text feature vector sequence; inputting the text feature vector sequence to a feature coding layer of the preset element recognition model to obtain a character-level feature vector sequence; inputting the character-level feature vector sequence into a label prediction network layer of the preset element recognition model to obtain labels corresponding to each character-level feature vector; and determining a second element and second element content corresponding to the second element according to the label.
Further, in the above method provided by the embodiment of the present application, after the combining the first element and the first element content, and the combining the second element and the second element content, the method further includes: determining a first preset additional content corresponding to the first element and a second preset additional content corresponding to the second element; updating the first element content according to the first preset additional content to obtain target first element content; updating the second element content according to the second preset additional content to obtain target second element content; and sending the first element and the target first element content, and the second element and the target second element content to a terminal device.
The second aspect of the embodiment of the present application also provides a mail information extraction device, where the device includes: the page determining module is used for determining a target mail page of the mail information to be extracted; the page analysis module is used for analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements; the content determining module is used for determining first element content corresponding to the first element according to the attribute information; the model processing module is used for determining a target mail text corresponding to the target mail page, inputting the target mail text into a preset element identification model and obtaining a second element and second element content corresponding to the second element; and the content combination module is used for combining the first element and the first element content and combining the second element and the second element content to obtain the extracted mail information.
The third aspect of the embodiment of the present application further provides an electronic device, where the electronic device includes a controller and a memory, and the controller is configured to implement the mail information extraction method according to any one of the foregoing when executing a computer program stored in the memory.
The fourth aspect of the embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, and when the computer program is executed by a controller, the mail information extraction method is implemented.
According to the mail information extraction method provided by the embodiment of the application, the target mail page is analyzed to obtain attribute information corresponding to a plurality of first elements; determining first element content corresponding to the first element according to the attribute information; determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element; and combining the first element with the first element content and combining the second element with the second element content to obtain the extracted mail information. According to the embodiment of the application, the first element content corresponding to the first element in the mail is determined according to the attribute information, the second element and the second element content in the mail are determined by inputting the target mail text into the preset element identification model, and the mail information is extracted by combining the attribute analysis and the model processing, so that the accuracy of mail information extraction can be improved.
Drawings
Fig. 1 is an application scenario diagram of a mail information extraction method provided by an embodiment of the present application.
Fig. 2 is a flowchart of a mail information extraction method according to an embodiment of the present application.
Fig. 3 is a flowchart for determining a target mail page according to an embodiment of the present application.
Fig. 4 is a flowchart for determining attribute information according to an embodiment of the present application.
Fig. 5 is a flowchart for determining the content of the first element according to an embodiment of the present application.
Fig. 6 is a flowchart of selecting a target data block according to an embodiment of the present application.
Fig. 7 is a flowchart for determining a target mail text according to an embodiment of the present application.
Fig. 8 is a process flow diagram of a preset element identification model according to an embodiment of the present application.
Fig. 9 is a display flowchart applied to a terminal device according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a mail information extraction device according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that the terms "first" and "second" in the description and claims of the present application and the accompanying drawings are used to distinguish similar objects, and are not used to describe a specific order or sequence.
It should be further noted that, in the method disclosed in the embodiment of the present application or the method shown in the flowchart, one or more steps for implementing the method are included, and the execution order of the steps may be interchanged with each other, where some steps may be deleted without departing from the scope of the claims.
Some embodiments will be described below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
With the continuous development of internet communication, email plays an increasingly important role in various fields, such as case detection involving network security, electronic commerce, and education. In the embodiment of the application, email is applied to the field of e-commerce, where a merchant pushes commodity information, preferential activities, order information, logistics information and the like to customers in the form of emails. If a customer overlooks such a mail, or the mail is treated as junk mail, the merchant loses a potential order; mails carrying order information and logistics information are easily buried under other mails or forgotten, and are difficult to find the next time the customer wants to check them. It is therefore necessary to perform content analysis on received emails.
However, in the related art, the receiver of the mail is often helped to filter the illegal mail and the junk mail through the related security protocol, but further understanding of the mail content cannot be realized, which results in low accuracy of extracting the mail information.
Based on the above problems, the embodiment of the application provides a mail information extraction method, which can improve the accuracy of mail information extraction.
An application scenario of a mail information extraction method provided by an embodiment of the present application is described with reference to fig. 1. As shown in fig. 1, the electronic device is communicatively coupled to a terminal device that is communicatively coupled to a mail service provider. The terminal device may implement man-machine interaction or network communication, for example, the terminal device may be an electronic product such as a personal computer, a tablet computer, a smart phone, a digital camera, and the like. The terminal device may receive and store a plurality of different types of mails, for example, mails about merchandise offers, order information mails, logistics information mails, news push mails, etc. sent by merchants. The electronic device is used for implementing the mail information extraction method, and for example, the electronic device may be a computer, a notebook computer, a server and the like. The mail service provider may be a mail server for providing mail services, e.g. the mail service provider may store a plurality of different types of mail for different users.
In an embodiment of the present application, the terminal device acts as the authorizing party and the electronic device acts as the authorized party. For example, the terminal device may, by way of authorization, allow the electronic device to establish a communication connection with the terminal device and to extract all mails sent to the terminal device; then, the electronic device can retrieve the mails in the terminal device in response to operations on an input device (such as a keyboard, a mouse, a remote controller, a touch pad or a sound control device, etc.); then, the electronic device performs mail information extraction on the retrieved mails to obtain the mail information; and finally the mail information is output to the terminal device so as to be displayed on the terminal device. In another embodiment of the present application, the terminal device may allow the electronic device to establish a communication connection with the terminal device by means of user authorization, and the electronic device then obtains the multiple different types of mails corresponding to the user of the terminal device from the mail service provider.
Fig. 2 is a flowchart of a mail information extraction method according to an embodiment of the present application, where the mail information extraction method is applied to an electronic device. As shown in FIG. 2, the order of the steps in the flow chart may be changed and some may be omitted, depending on different needs.
S11, determining a target mail page of the mail information to be extracted.
In some embodiments, in order to extract mail information, keywords for identifying the corresponding elements of a mail are preset. The elements may include order elements, logistics elements, preferential elements, discount elements, and the like, and the corresponding preset element keywords may include "order", "logistics", "preferential", "discount", and the like. By performing a preliminary screening of a large number of mails against the preset element keywords, the target mails from which mail information is to be extracted can be obtained. The target mail page refers to the rendered page of a target mail associated with the preset element keywords. The electronic device extracts an initial mail set from the terminal device and searches each initial mail in the set for the preset element keywords; if an initial mail includes a preset element keyword, that mail is determined to be a target mail. The electronic device then performs page rendering on the target mail to obtain the target mail page. The initial mail set refers to the set of all mails obtained from the terminal device, and may include both target mails containing preset element keywords and irrelevant mails containing none.
S12, analyzing the target mail page to obtain attribute information corresponding to the plurality of first elements.
In some embodiments, the first element refers to a preset element that has a positional relationship with its element content. Take as an example a set of elements comprising an order element, a logistics element, a preferential element, a merchant element and a commodity element. The element content corresponding to the order element is an order number, and there is a positional relationship between the order element and the order number; for example, the order number is arranged in the same column or the same row as the order element. The element content corresponding to the logistics element is a logistics number, and there is a positional relationship between the logistics element and the logistics number; for example, the logistics number is arranged in the same column or the same row as the logistics element. The element content corresponding to the preferential element is commodity preferential information, which is not fixedly arranged at any particular position relative to the preferential element; that is, the preferential element and the commodity preferential information have no positional relationship. Similarly, the contents of the merchant element and the commodity element are not fixedly arranged at any particular position. Accordingly, the order element and the logistics element, which have a positional relationship with their content, are taken as first elements. The target mail page is analyzed to obtain the attribute information corresponding to each first element.
The attribute information may include a preset element name, an element type, an element position, an element size and the like, which are used to identify the data block corresponding to the first element. The element type may include a text type and a picture type, the element position may refer to the coordinates of the data block in the target mail page, and the element size may refer to the width and height of the data block. Taking the order element as the first element, the text box of the order element is located in the target mail page and taken as a data block; the element name corresponding to the data block is "order", the element type is "text", the element position is {"x": 100, "y": 200}, and the element size is {"width": 300, "height": 100}.
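The attribute information described above could be held in a small record type. The following is a minimal sketch in Python; the class name `ElementAttributes` and its field names are illustrative assumptions and do not appear in the application, while the values are those of the order-element example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ElementAttributes:
    """Hypothetical container for the attribute information of a first element."""
    name: str          # preset element name, e.g. "order"
    element_type: str  # "text" or "picture"
    x: int             # element position: coordinates in the target mail page
    y: int
    width: int         # element size: width and height of the data block
    height: int


# The order-element example from the description.
order_attrs = ElementAttributes(
    name="order", element_type="text", x=100, y=200, width=300, height=100
)
attrs_dict = asdict(order_attrs)  # plain dict form, convenient for serialization
```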
S13, determining the first element content corresponding to the first element according to the attribute information.
In some embodiments, there is a positional relationship between the first element and the first element content, so the first element content corresponding to the first element can be determined from the element position in the attribute information. First, the neighborhood corresponding to the element position of the first element is determined, and the data blocks within that neighborhood are obtained, yielding a plurality of data blocks; then, the data type and the data block length of each data block in the neighborhood are determined; then, a target data block whose data type and data block length meet the preset data requirement is selected from the plurality of data blocks in the neighborhood; finally, the content corresponding to the target data block is determined as the first element content corresponding to the first element. The neighborhood corresponding to the element position of the first element may include, but is not limited to, the left and right positions in the same row and the upper and lower positions in the same column; the data blocks in the neighborhood may include at least one data block at the left position, at least one data block at the right position, at least one data block at the upper position, and at least one data block at the lower position. The data types of the data blocks may include, but are not limited to, a number type, a letter type, a number and letter combination type, a text type, and a picture type. The data block length may be the width of the data block. In one embodiment, the width may be the total number of bytes occupied by the elements in the data block; for example, if a data block contains 10 integers, each of which occupies 4 bytes, the width of the data block is 40 bytes. In another embodiment, the width of a data block may refer to the width of its content area.
In an embodiment, the preset data requirement refers to the preset data type and preset data block length that the data block corresponding to the first element content must satisfy. Taking the order element as the first element, the corresponding first element content is an order number, so the preset data requirement corresponding to the order element may be: the data type is a number type, a letter type, or a number and letter combination type, and the data block length is 8 to 32 characters. The plurality of data blocks in the neighborhood corresponding to the order element are checked against the preset data requirement; if, for example, the data block at the right position in the same row meets the requirement, that data block is taken as the target data block, and its content is taken as the order number corresponding to the order element.
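The neighborhood search and the check against the preset data requirement can be sketched as follows. This is an illustrative sketch only: the `DataBlock` type, the overlap-based definition of "same row" and "same column", and the function names are assumptions, and the length is measured in characters as in the order-number example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class DataBlock:
    """Hypothetical data block: its text plus position and size in the page."""
    text: str
    x: int
    y: int
    width: int
    height: int


def data_type(text: str) -> str:
    """Classify content as number, letter, number_letter, or text."""
    if not text.isascii():
        return "text"
    if text.isdigit():
        return "number"
    if text.isalpha():
        return "letter"
    if text.isalnum():
        return "number_letter"
    return "text"


def in_neighborhood(anchor: DataBlock, block: DataBlock) -> bool:
    """Assumed neighborhood: the block overlaps the anchor's row or column."""
    same_row = block.y < anchor.y + anchor.height and anchor.y < block.y + block.height
    same_col = block.x < anchor.x + anchor.width and anchor.x < block.x + block.width
    return block is not anchor and (same_row or same_col)


def select_target_block(
    anchor: DataBlock,
    blocks: List[DataBlock],
    allowed_types: Sequence[str] = ("number", "letter", "number_letter"),
    min_len: int = 8,
    max_len: int = 32,
) -> Optional[DataBlock]:
    """Return the first neighboring block meeting the preset data requirement."""
    for block in blocks:
        if not in_neighborhood(anchor, block):
            continue
        if data_type(block.text) in allowed_types and min_len <= len(block.text) <= max_len:
            return block
    return None


order_label = DataBlock("Order", 100, 200, 300, 100)
page_blocks = [
    DataBlock("99999999", 900, 900, 50, 50),              # outside the neighborhood
    order_label,
    DataBlock("Thanks for shopping", 100, 50, 300, 100),  # same column, wrong type
    DataBlock("AB12345678", 450, 200, 200, 100),          # same row, meets requirement
]
target = select_target_block(order_label, page_blocks)
```

Returning the first match is a simplification; an implementation could instead rank candidates, for example by preferring the right-hand block in the same row.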
S14, determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element.
In some embodiments, the second element is an element that has no positional relationship with its element content. Where the elements include an order element, a logistics element, a preferential element, a merchant element, and a commodity element, the preferential element, the merchant element, and the commodity element are all second elements. The element content corresponding to the preferential element can be commodity preferential information, the element content corresponding to the merchant element can be merchant name information, and the element content corresponding to the commodity element can be information such as the commodity name, commodity model and commodity price. The target mail text may include the mail content remaining in the target mail page after the first element and the first element content are removed.
In some embodiments, the preset element recognition model is a preset model that performs sequence labeling on the target mail text to obtain elements and element contents. The preset element recognition model may be a hidden Markov model, a maximum entropy model, a conditional random field or a deep learning model, which is not limited herein. The input data of the preset element recognition model is mail text, and the output data is the elements and element contents. In an embodiment, mail text annotated with element labels and element content labels is collected as the input data of the training data, the elements corresponding to the element labels and the element contents corresponding to the element content labels are used as the output data of the training data, and an element recognition model is trained on the input data and the output data until the model converges to within a preset range, thereby obtaining the preset element recognition model.
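Deriving the target mail text from the initial mail text can be sketched as a simple string operation. This is a simplified illustration under the assumption that the first element and its content appear verbatim in the initial text; the function name `build_target_mail_text` is hypothetical.

```python
from typing import Iterable, Tuple


def build_target_mail_text(initial_text: str, first_pairs: Iterable[Tuple[str, str]]) -> str:
    """Delete each first element label and its content, then tidy whitespace."""
    text = initial_text
    for element_label, content in first_pairs:
        text = text.replace(element_label, "").replace(content, "")
    return " ".join(text.split())


target_text = build_target_mail_text(
    "Order: AB12345678 Acme offers 20% off",
    [("Order:", "AB12345678")],
)
```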
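Once the model has assigned a label to each character, the labels have to be turned into second elements and their contents. The sketch below assumes a BIO labeling scheme ("B-" begins an element span, "I-" continues it, "O" is outside any span); the application does not specify the label scheme, so this scheme and the function name `decode_labels` are assumptions for illustration.

```python
from typing import List, Tuple


def decode_labels(chars: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group per-character BIO labels into (element, element content) pairs."""
    results: List[Tuple[str, str]] = []
    current_element = None
    current_chars: List[str] = []
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            # A new span starts; flush any span that was still open.
            if current_element is not None:
                results.append((current_element, "".join(current_chars)))
            current_element = label[2:]
            current_chars = [ch]
        elif label.startswith("I-") and current_element == label[2:]:
            current_chars.append(ch)
        else:
            # "O", or an "I-" label that does not continue the open span.
            if current_element is not None:
                results.append((current_element, "".join(current_chars)))
            current_element = None
            current_chars = []
    if current_element is not None:
        results.append((current_element, "".join(current_chars)))
    return results


text = "Acme: 20% off"
labels = (
    ["B-merchant"] + ["I-merchant"] * 3
    + ["O", "O"]
    + ["B-preferential"] + ["I-preferential"] * 6
)
second_pairs = decode_labels(list(text), labels)
```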
And S15, combining the first element and the first element content, and combining the second element and the second element content to obtain the extracted mail information.
In some embodiments, the electronic device takes the combined first element and first element content, together with the combined second element and second element content, as the extracted mail information; the electronic device then transmits the extracted mail information to the terminal device, where it is displayed. For example, a mail tag is constructed in the terminal device for each target mail from which mail information has been extracted, and the content of the mail tag includes the first element and the first element content, and the second element and the second element content. Then, when a large number of mails exist in the terminal device, the user can quickly grasp what a mail contains by looking up its mail tag, which prevents the user from missing important mail information.
According to the mail information extraction method provided by the embodiment of the application, the target mail page is analyzed to obtain attribute information corresponding to a plurality of first elements; determining first element content corresponding to the first element according to the attribute information; determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element; and combining the first element with the first element content and combining the second element with the second element content to obtain the extracted mail information. According to the embodiment of the application, the first element content corresponding to the first element in the mail is determined according to the attribute information, and the second element and the second element content in the mail are determined by inputting the target mail text into a preset element identification model. The mail information is extracted by combining the attribute analysis and the model processing, so that the accuracy of extracting the mail information can be improved.
Fig. 3 is a flowchart for determining a target mail page according to an embodiment of the present application. The flow for determining the target mail page is applied to the electronic device. As shown in fig. 3, the method comprises the following steps:
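The combination step and the construction of a mail tag can be sketched as below. The dictionary layout, the tag string format, and the function names are illustrative assumptions; the application only requires that each element be paired with its content.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def combine_mail_information(first_pairs: Iterable[Pair], second_pairs: Iterable[Pair]) -> Dict[str, List[str]]:
    """Merge (element, content) pairs from both extraction paths into one record."""
    info: Dict[str, List[str]] = {}
    for element, content in list(first_pairs) + list(second_pairs):
        info.setdefault(element, []).append(content)
    return info


def build_mail_tag(info: Dict[str, List[str]]) -> str:
    """Render the extracted mail information as a short, searchable tag string."""
    return "; ".join(f"{element}: {', '.join(contents)}" for element, contents in info.items())


mail_info = combine_mail_information(
    [("order", "AB12345678")],
    [("merchant", "Acme"), ("preferential", "20% off")],
)
mail_tag = build_mail_tag(mail_info)
```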
S21, acquiring an initial mail set.
In some embodiments, the initial mail set refers to a set of all mails of different types stored in the terminal device, for example, the initial mail set may include mails about merchandise offers issued by merchants, order information mails, logistics information mails, news push mails, and the like. In another embodiment, the initial mail set may also be a set of all mails of the customer collected from the mail server upon authorization of the terminal device. In an embodiment, the initial mail set may be acquired from the terminal device or the mail server according to a preset acquisition period, for example, the preset acquisition period may be 1 week, which is not limited herein.
S22, selecting a target mail set associated with the preset element keywords from the initial mail set.
In some embodiments, the preset element keywords refer to preset keywords for identifying elements, for example, the preset element keywords may include keywords such as "order", "logistics", "discount", and the like. Searching preset element keywords for each initial mail in the initial mail set, and determining the mail as a target mail if the initial mail comprises the preset element keywords; if the initial mail does not include the preset element keywords, determining that the mail is not the target mail.
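The keyword screening of step S22 can be sketched as a simple filter. This sketch assumes each mail is available as a plain string and uses case-insensitive substring matching; both choices, and the function name `select_target_mails`, are assumptions.

```python
from typing import List, Sequence

PRESET_ELEMENT_KEYWORDS = ("order", "logistics", "discount")


def select_target_mails(initial_mails: List[str], keywords: Sequence[str] = PRESET_ELEMENT_KEYWORDS) -> List[str]:
    """Keep only the mails that contain at least one preset element keyword."""
    lowered = [k.lower() for k in keywords]
    return [mail for mail in initial_mails if any(k in mail.lower() for k in lowered)]


initial_mail_set = [
    "Your order AB12345678 has shipped",
    "Weekly news digest",
    "Logistics update: parcel arrived",
]
target_mail_set = select_target_mails(initial_mail_set)
```

Substring matching is deliberately loose (for example, "border" would match "order"); word-level matching would be stricter at the cost of missing inflected forms.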
S23, carrying out page rendering on each target mail in the target mail set according to the rendering instruction to obtain a target mail page.
In some embodiments, the rendering instruction refers to a preset instruction for rendering the initial mail format into the target mail format, where the initial mail format may be html format or xml format, and the target mail format may be plain text format, which is not limited herein. Because the position of an element and the position of the element content are difficult to locate in a mail in the initial mail format, rendering the initial mail format into the target mail format reduces the difficulty of determining these positions and thereby improves the mail information extraction rate.
According to the embodiment of the application, the target mail set associated with the preset element keywords is selected from the initial mail set, so that a large number of irrelevant mails can be filtered, the number of mails to be extracted with information is reduced, the operating pressure of electronic equipment is reduced, and the mail information extraction rate is improved; in addition, the embodiment of the application obtains the target mail page by carrying out page rendering on the target mail, so that the target mail page can reduce the difficulty of determining the position of the element and the position of the element content, and further improve the mail information extraction rate.
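Steps S22 and S23 can be illustrated with a minimal Python sketch. It is not the implementation of the application: the keyword tuple reuses only the examples given above, the function names `select_target_mails` and `render_to_text` are hypothetical, and "page rendering" is approximated by collecting the text nodes of an HTML mail with the standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Assumed keyword list; the description only names these three as examples.
PRESET_ELEMENT_KEYWORDS = ("order", "logistics", "discount")


class _TextExtractor(HTMLParser):
    """Collects non-empty text nodes; a stand-in for the page-rendering step."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def render_to_text(html_mail):
    """S23 (simplified): render an HTML mail into plain text, one node per line."""
    parser = _TextExtractor()
    parser.feed(html_mail)
    parser.close()
    return "\n".join(parser.parts)


def select_target_mails(initial_mails, keywords=PRESET_ELEMENT_KEYWORDS):
    """S22: keep only mails that contain at least one preset element keyword.

    Naive case-insensitive substring search on the raw mail; a real system
    would likely search the visible text only.
    """
    return [m for m in initial_mails if any(k in m.lower() for k in keywords)]


initial_mail_set = [
    "<html><body><p>Order number: A1234567</p></body></html>",
    "<html><body><p>Weekly news digest</p></body></html>",
]
target_mail_set = select_target_mails(initial_mail_set)
target_mail_pages = [render_to_text(m) for m in target_mail_set]
```

In this sketch the news digest is filtered out before rendering, so only one mail reaches the later, more expensive steps.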
Fig. 4 is a flowchart for determining attribute information according to an embodiment of the present application. The determination procedure of the attribute information is applied to the electronic device. As shown in fig. 4, the method comprises the following steps:
S31, acquiring a first element associated with the preset element keyword according to the preset element keyword.
In some embodiments, a correspondence, namely a first mapping relationship, exists between preset element keywords and elements. For example, if the preset element keyword is "order", the corresponding element is an order element; if the preset element keyword is "logistics", the corresponding element is a logistics element. The first mapping relationship is queried to obtain the first element corresponding to the preset element keyword. In other embodiments, the first element associated with the preset element keyword may also be obtained directly through the preset element keyword by associating the preset element keyword with the element.
S32, determining the element position of the first element in the target mail page.
In some embodiments, the element position may refer to the coordinates of a data block within the target mail page, for example, the element position corresponding to the data block where the first element is located is "x": 100, "y": 200.
S33, determining the element size corresponding to the first element according to the element position.
In some embodiments, the data block at the element position is obtained, and the width and height of the data block are taken as the element size corresponding to the first element, e.g., the element size corresponding to the first element is "width": 300, "height": 100.
And S34, combining the element position and the element size to obtain attribute information of the first element.
In some embodiments, the element position and the element size are combined according to a preset data format to obtain the attribute information of the first element, where the preset data format is a data format configured in advance, for example, {element position, element size}. In other embodiments, the attribute information may further include a name and a type of the first element, where the type may include a text type and a picture type.
According to the embodiment of the application, the attribute information is obtained by analyzing the element position and the element size of the first element in the target mail page, so that the first element content can be conveniently determined according to the attribute information, the accuracy and the speed of determining the first element content can be improved, and the accuracy and the speed of extracting the mail information are further improved.
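As a minimal sketch of steps S32 to S34, the following Python fragment combines an element position and an element size into one attribute record. The `DataBlock` class, the `build_attribute_info` function and the dictionary keys "name", "type", "position" and "size" are assumptions made for illustration; the description itself only specifies the general form {element position, element size}.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataBlock:
    """A rendered block of the target mail page (hypothetical structure)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def build_attribute_info(name, block, element_type="text"):
    """S34: combine element position and element size into attribute information."""
    return {
        "name": name,            # optional, per the "other embodiments" above
        "type": element_type,    # "text" or "picture"
        "position": {"x": block.x, "y": block.y},
        "size": {"width": block.width, "height": block.height},
    }


# Values taken from the examples in the description.
order_block = DataBlock(text="Order number", x=100, y=200, width=300, height=100)
attribute_info = build_attribute_info("order", order_block)
```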
Fig. 5 is a flowchart for determining the first element content according to an embodiment of the present application. The flow for determining the first element content is applied to the electronic device. As shown in fig. 5, the method comprises the following steps:
S41, determining the element position corresponding to the first element according to the attribute information.
In some embodiments, the attribute information may include information such as an element name, an element type, an element position, an element size, and the like of the first element, and for each attribute information, there is a unique attribute identifier. For example, the attribute identifier corresponding to the element name may be "content", the attribute identifier corresponding to the element type may be "type", the attribute identifier corresponding to the element position may be "X", "Y", and the attribute identifier corresponding to the element size may be "width", "height". By querying the attribute identifiers "X" and "Y", the element position corresponding to the first element can be obtained.
S42, acquiring a plurality of data blocks in the neighborhood corresponding to the element position.
In some embodiments, the neighborhood corresponding to the element position may include a left side position and a right side position of the same row, and an upper side position and a lower side position of the same column; the data blocks in the neighborhood may include at least one data block corresponding to the left side position, at least one data block corresponding to the right side position, at least one data block corresponding to the upper side position, and at least one data block corresponding to the lower side position. The number of data blocks corresponding to each neighborhood position may be 1 or more, which is not limited herein. The closer the neighborhood position is to the element position, the greater the probability that the data block corresponding to the neighborhood position can be used as the element content. The embodiment of the application is illustrated by taking one data block per neighborhood position as an example.
S43, selecting a target data block from the plurality of data blocks, and taking the content corresponding to the target data block as the first element content corresponding to the first element.
In some embodiments, the target data block refers to the data block selected as the first element content. The data type and the data block length corresponding to the target data block need to meet the preset data requirement corresponding to the first element, where the preset data requirement refers to the preset requirement that the data type and the data block length of the first element content need to meet. In an embodiment, a correspondence exists between the first element and the preset data requirement, and the preset data requirement corresponding to the first element can be obtained by querying this correspondence. Each of the plurality of data blocks in the neighborhood is then checked against the preset data requirement, the data block meeting the preset data requirement is taken as the target data block, and the content corresponding to the target data block is taken as the first element content corresponding to the first element.
According to the embodiment of the application, after the element position of the first element in the mail is determined according to the attribute information, a data block meeting the preset data requirement corresponding to the first element is selected from the plurality of data blocks in the neighborhood corresponding to the element position as the target data block, and the content corresponding to the target data block is taken as the first element content corresponding to the first element. Because the first element content is determined through the attribute information, a first element and first element content that have a positional relationship can be extracted quickly, which improves the mail information extraction efficiency.
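The neighborhood lookup of step S42 can be sketched as follows, taking one data block per neighborhood position as in the example above. This is an illustrative simplification under stated assumptions: the `Block` class and `neighborhood_blocks` function are hypothetical, "same row" is modelled as equal "y", "same column" as equal "x", and "y" is assumed to grow downward as in screen coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A data block of the target mail page (hypothetical structure)."""
    text: str
    x: int
    y: int


def neighborhood_blocks(element, blocks):
    """S42: return the nearest block on each side of the element, if any."""
    sides = {
        "left": [b for b in blocks if b.y == element.y and b.x < element.x],
        "right": [b for b in blocks if b.y == element.y and b.x > element.x],
        "above": [b for b in blocks if b.x == element.x and b.y < element.y],
        "below": [b for b in blocks if b.x == element.x and b.y > element.y],
    }
    nearest = {}
    for side, candidates in sides.items():
        if candidates:
            # The closer a block is, the more likely it holds the element content.
            nearest[side] = min(
                candidates,
                key=lambda b: abs(b.x - element.x) + abs(b.y - element.y),
            )
    return nearest


element = Block("Order number", 100, 200)
page_blocks = [
    Block("A1234567", 420, 200),
    Block("Far right", 900, 200),
    Block("Dear customer", 100, 80),
    Block("Footer", 100, 500),
]
neighbors = neighborhood_blocks(element, page_blocks)
```

The resulting candidates are then checked against the preset data requirement to pick the target data block (step S43).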
Fig. 6 is a flowchart of selecting a target data block according to an embodiment of the present application. The selection flow of the target data block is applied to the electronic equipment. As shown in fig. 6, the method comprises the following steps:
S51, analyzing each data block in the neighborhood to obtain the data block length and the data type corresponding to each data block.
In some embodiments, the data block length may be obtained by parsing a data block size corresponding to the data block, and taking a data block width within the data block size as the data block length, and the data types may include, but are not limited to, a numeric type, a letter type, a numeric and letter combination type, a text type, and a picture type.
S52, selecting a target data block from the plurality of data blocks in the neighborhood according to the data block length and the data type.
In some embodiments, following on from the foregoing embodiments, the preset data requirement corresponding to the first element is determined, and the data type and the data block length required of the data block corresponding to the first element content are obtained from the preset data requirement. The data block length and the data type of each data block are matched against the data block length and the data type in the preset data requirement, and a data block whose data block length and data type both meet the preset data requirement is selected as the target data block.
According to the embodiment of the application, the target data block is determined according to the data block length and the data type corresponding to the data block, so that the accuracy of determining the target data block can be improved, and the accuracy of extracting the mail information can be further improved.
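Steps S51 and S52 can be sketched in Python as below. The sketch makes several assumptions that are not stated in the description: the requirement table `PRESET_DATA_REQUIREMENTS` and its values are invented for illustration, the data block length requirement is modelled as a closed range, and each candidate is given as a (text, width) pair with the block width used as the data block length. The picture type is omitted.

```python
import re

# Hypothetical requirement table: element -> required data type and length range.
PRESET_DATA_REQUIREMENTS = {
    "order": {"data_type": "alphanumeric", "min_length": 60, "max_length": 200},
}


def classify_data_type(text):
    """S51 (simplified): classify block text into one of the listed data types."""
    if text.isdigit():
        return "numeric"
    if text.isascii() and text.isalpha():
        return "letter"
    if re.fullmatch(r"[A-Za-z0-9]+", text):
        return "alphanumeric"
    return "text"


def select_target_block(element, candidates):
    """S52: return the first candidate meeting both the type and length rules."""
    requirement = PRESET_DATA_REQUIREMENTS[element]
    for text, width in candidates:
        type_ok = classify_data_type(text) == requirement["data_type"]
        length_ok = requirement["min_length"] <= width <= requirement["max_length"]
        if type_ok and length_ok:
            return text
    return None  # no neighborhood block satisfies the preset data requirement


# (text, block width) pairs for the blocks in the neighborhood.
candidates = [("Dear customer", 120), ("A1234567", 80), ("2023", 40)]
first_element_content = select_target_block("order", candidates)
```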
Fig. 7 is a flowchart for determining a target mail text according to an embodiment of the present application. The flow for determining the target mail text is applied to the electronic device. As shown in fig. 7, the method comprises the following steps:
S61, acquiring an initial mail text corresponding to the target mail page.
In some embodiments, all content in the target mail page (e.g., mail recipient, mail sender, mail subject, mail body, mail attachment, mail signature) constitutes the initial mail text. In an embodiment, if a mail attachment exists, the attachment text can be obtained by parsing the mail attachment; if the mail attachment is compressed, the attachment text can be obtained by decompressing it. The content of the target mail page may include only text content, only picture content, or a combination of text content and picture content, which is not limited herein. If the content of the target mail page contains picture content, the picture content can be processed through a preset picture coding model to obtain text data, or the text data can be obtained from the picture through optical character recognition (Optical Character Recognition, OCR) technology. The input data of the preset picture coding model is a picture, and the output data is the text data corresponding to the picture. The preset picture coding model may be a convolutional neural network model, a recurrent neural network model, or an adversarial network model, which is not limited herein.
And S62, deleting the first element and the first element content from the initial mail text to obtain a target mail text.
In some embodiments, since the first element and the first element content can be used as the extracted mail information, in order to reduce the processing amount of the subsequent model, the first element and the first element content are deleted from the initial mail text, so as to obtain the target mail text.
According to the embodiment of the application, the determined first element and first element content are deleted from the initial mail text to obtain the target mail text, so that the amount of text the subsequent model has to process is reduced, the model processing efficiency is improved, and the mail extraction efficiency is further improved.
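Step S62 amounts to removing text that has already been extracted. A minimal sketch, assuming the first element appears in the initial mail text as a literal label and that only the first occurrence of each label and content needs to be removed (the function name `build_target_mail_text` is hypothetical):

```python
def build_target_mail_text(initial_text, first_elements):
    """S62: delete each first element label and its content from the text.

    first_elements maps the element label as it appears in the mail to the
    first element content already extracted for it.
    """
    target = initial_text
    for label, content in first_elements.items():
        target = target.replace(label, "", 1).replace(content, "", 1)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(target.split())


initial_mail_text = "Order number A1234567\nThank you for shopping at Blue Lake Books."
target_mail_text = build_target_mail_text(
    initial_mail_text, {"Order number": "A1234567"}
)
```

Only the remaining sentence is passed to the preset element identification model.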
Fig. 8 is a process flow diagram of a preset element identification model according to an embodiment of the present application. The processing flow of the preset element identification model is applied to the electronic equipment. As shown in fig. 8, the method comprises the following steps:
S71, inputting the target mail text into an embedding layer of the preset element recognition model to obtain a text feature vector sequence.
In some embodiments, the embedding layer of the preset element recognition model maps high-dimensional input data into a low-dimensional vector space by converting each word of the target mail text into a text feature vector. That is, the text data is converted into vector data (the text feature vectors) so that features can be extracted from it. After the text is converted into vector data, operations such as classification and clustering can be performed on it.
S72, inputting the text feature vector sequence to a feature coding layer of the preset element recognition model to obtain a character-level feature vector sequence.
In some embodiments, the feature encoding layer of the preset element recognition model may be a bidirectional LSTM network, a Text-CNN network, a Transformer network, a BERT network, or a stack of multiple encoding networks, which is not limited herein. The text feature vector sequence is input into the feature encoding layer of the preset element recognition model to obtain a character-level feature vector sequence that incorporates context information.
S73, inputting the character-level feature vector sequence into a label prediction network layer of the preset element recognition model to obtain labels corresponding to each character-level feature vector.
In some embodiments, the tag may refer to a preset tag for identifying the element name and the element content position, and for example, the tag corresponding to each character-level feature vector may be "Business Name Begin", "Business Name Middle" or "Business Name End", where "Business Name Begin" is used to identify that the character-level feature vector corresponding to the tag is a beginning portion of the merchant name, "Business Name Middle" is used to identify that the character-level feature vector corresponding to the tag is a middle portion of the merchant name, and "Business Name End" is used to identify that the character-level feature vector corresponding to the tag is an ending portion of the merchant name.
And S74, determining a second element and second element content corresponding to the second element according to the label.
In some embodiments, continuing the above embodiment, parsing the "Business Name Begin" tag shows that the second element is a merchant element. By locating the "Business Name Begin" tag and the "Business Name End" tag, the complete merchant name can be obtained: the text from the "Business Name Begin" tag through the "Business Name End" tag is used as the merchant name, namely, the element content corresponding to the merchant element.
According to the embodiment of the application, the target mail text is input into the preset element identification model for sequence labeling processing, the label for identifying the element name and the element content position can be obtained, and the second element content can be obtained by identifying and converting the label. The accuracy and efficiency of determining the second element and the content of the second element can be improved by means of model processing, and the accuracy and efficiency of mail extraction are further improved.
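Step S74 can be illustrated by decoding a tag sequence into (element, content) pairs. The sketch below covers only this decoding step, not the model itself, and rests on assumptions: tokens are whole words rather than characters for readability, tokens outside any element carry a hypothetical "O" tag (the description does not name one), and a span counts only if it runs from a Begin tag to a matching End tag.

```python
def decode_tags(tokens, tags):
    """S74: turn per-token tags into (element name, element content) pairs."""
    results = []
    current_name, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "Business Name Begin" -> name "Business Name", part "Begin".
        name, _, part = tag.rpartition(" ")
        if part == "Begin":
            current_name, current_tokens = name, [token]
        elif part == "Middle" and name == current_name:
            current_tokens.append(token)
        elif part == "End" and name == current_name:
            current_tokens.append(token)
            results.append((current_name, " ".join(current_tokens)))
            current_name, current_tokens = None, []
        else:
            # Outside tag or inconsistent sequence: drop any open span.
            current_name, current_tokens = None, []
    return results


tokens = ["Thank", "you", "for", "shopping", "at", "Blue", "Lake", "Books"]
tags = ["O"] * 5 + [
    "Business Name Begin",
    "Business Name Middle",
    "Business Name End",
]
second_elements = decode_tags(tokens, tags)
```

Here the element name "Business Name" identifies the merchant element, and the joined tokens form the second element content.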
Fig. 9 is a display flowchart applied to a terminal device according to an embodiment of the present application. The display flow of the terminal equipment is applied to the electronic equipment. As shown in fig. 9, the method comprises the following steps:
S81, determining a first preset additional content corresponding to the first element and a second preset additional content corresponding to the second element.
In some embodiments, the preset additional content may be additional content preset for the first element, or additional content preset for the second element. The preset additional content is not stored in the target mail page and needs to be retrieved from the associated application (e.g., an e-commerce platform) by invoking a query interface. Continuing the above embodiments, the additional content corresponding to the order element may be an order link, and the order link may include information such as the order placing time, the ordered merchandise, and the order address; the additional content corresponding to the logistics element may be the current logistics address; the additional content corresponding to the merchant element may be information such as the merchant address; and the additional content corresponding to the commodity element may be information such as the sales volume of the commodity.
S82, updating the first element content according to the first preset additional content to obtain target first element content.
In some embodiments, the first preset additional content is added at a specified position of the first element content to obtain the target first element content. For example, the element content corresponding to the order element is an order number, the first preset additional content is an order link, and the order link may be appended after the order number to obtain the target first element content.
And S83, updating the second element content according to the second preset additional content to obtain target second element content.
In some embodiments, the second preset additional content is added at a specified position of the second element content to obtain the target second element content. For example, the element content corresponding to the merchant element is a merchant name, the second preset additional content is a merchant address, and the merchant address may be appended after the merchant name to obtain the target second element content.
S84, the first element and the target first element content and the second element and the target second element content are sent to a terminal device.
According to the embodiment of the application, the first element content is updated according to the first preset additional content to obtain the target first element content, the second element content is updated according to the second preset additional content to obtain the target second element content, and the integrity of mail information extraction can be improved by displaying the first element and the target first element content and the second element and the target second element content on the terminal equipment, so that a user can quickly know mails through the displayed mail information when looking up a large number of mails.
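Steps S81 to S84 can be sketched as a lookup followed by an append. Everything concrete here is assumed for illustration: the in-memory table `ADDITIONAL_CONTENT` stands in for the query interface of the associated application, the link and address values are invented, and appending the additional content in parentheses is just one possible "specified position".

```python
# Hypothetical stand-in for the query interface of an associated application.
ADDITIONAL_CONTENT = {
    ("order", "A1234567"): "https://shop.example/orders/A1234567",
    ("merchant", "Blue Lake Books"): "12 Example Road",
}


def query_additional_content(element, content):
    """S81: look up preset additional content; None if nothing is available."""
    return ADDITIONAL_CONTENT.get((element, content))


def update_element_content(element, content):
    """S82/S83: append the additional content after the element content."""
    extra = query_additional_content(element, content)
    return content if extra is None else f"{content} ({extra})"


extracted = {
    "order": "A1234567",            # first element and first element content
    "merchant": "Blue Lake Books",  # second element and second element content
    "discount": "10% off",          # no additional content available
}
# S84: the payload that would be sent to the terminal device.
payload = {e: update_element_content(e, c) for e, c in extracted.items()}
```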
Referring to fig. 10, fig. 10 is a schematic structural diagram of a mail information extraction device according to an embodiment of the present application. In some embodiments, the mail information extraction device 20 may include a plurality of functional modules composed of computer program segments. The computer program segments of the mail information extraction device 20 may be stored in a memory of the electronic device 30 and executed by at least one controller to perform the functions of mail information extraction (described in detail with reference to fig. 2).
In the present embodiment, the mail information extraction device 20 may be divided into a plurality of functional modules according to the functions it performs. When the mail information extraction apparatus 20 is applied to a host device, the functional modules may include: a page determination module 201, a page parsing module 202, a content determination module 203, a model processing module 204, and a content combining module 205. The module referred to in the present application refers to a series of computer program segments capable of being executed by at least one controller and of performing a fixed function, which are stored in a memory. In the present embodiment, the functions of the respective modules will be described in detail in the following embodiments.
The page determining module 201 is configured to determine a target mail page of the mail information to be extracted.
The page parsing module 202 is configured to parse the target mail page to obtain attribute information corresponding to the plurality of first elements.
The content determining module 203 is configured to determine, according to the attribute information, first element content corresponding to the first element.
The model processing module 204 is configured to determine a target mail text corresponding to the target mail page, and input the target mail text into a preset element identification model to obtain a second element and second element content corresponding to the second element.
The content combination module 205 is configured to combine the first element with the first element content and combine the second element with the second element content to obtain the extracted mail information.
In some embodiments, the page determination module 201 further includes: the mail acquisition sub-module is used for acquiring an initial mail set; the keyword query sub-module is used for selecting a target mail set associated with a preset element keyword from the initial mail set; and the mail page rendering sub-module is used for performing page rendering on each target mail in the target mail set according to the rendering instruction to obtain a target mail page.
In some embodiments, the page parsing module 202 further includes: the first element acquisition sub-module is used for acquiring a first element associated with the preset element keyword according to the preset element keyword; an element position determining sub-module, configured to determine an element position of the first element in the target mail page; the element size determining submodule is used for determining the element size corresponding to the first element according to the element position; and the attribute information determining submodule is used for combining the element position and the element size to obtain the attribute information of the first element.
In some embodiments, the content determination module 203 further comprises: the attribute information analysis sub-module is used for determining the element position corresponding to the first element according to the attribute information; a neighborhood data block obtaining sub-module, configured to obtain a plurality of data blocks in a neighborhood corresponding to the element position; and the target data block selecting sub-module is used for selecting a target data block from the plurality of data blocks, and taking the content corresponding to the target data block as the first element content corresponding to the first element.
In some embodiments, the content determination module 203 further comprises: the data block analysis submodule is used for analyzing each data block in the neighborhood to obtain the data block length and the data type corresponding to each data block; and the target data block selecting sub-module is used for selecting a target data block from the plurality of data blocks in the neighborhood according to the data block length and the data type.
In some embodiments, the model processing module 204 further comprises: the initial mail text acquisition sub-module is used for acquiring an initial mail text corresponding to the target mail page; and the target mail text acquisition sub-module is used for deleting the first element and the first element content from the initial mail text to obtain a target mail text.
In some embodiments, the model processing module 204 further comprises: the embedding layer processing sub-module is used for inputting the target mail text into the embedding layer of the preset element recognition model to obtain a text feature vector sequence; the coding layer processing submodule is used for inputting the text feature vector sequence to a feature coding layer of the preset element identification model to obtain a character-level feature vector sequence; the label prediction layer processing sub-module is used for inputting the character-level feature vector sequence into a label prediction network layer of the preset element recognition model to obtain labels corresponding to each character-level feature vector; and the label identification sub-module is used for determining a second element and second element content corresponding to the second element according to the label.
In some embodiments, the content combining module 205 further comprises: a second mapping relation traversing submodule, configured to traverse a second mapping relation between a preset element and a preset additional content to obtain a first preset additional content corresponding to the first element and a second preset additional content corresponding to the second element; a first element content updating sub-module, configured to update the first element content according to the first preset additional content, to obtain a target first element content; a second element content updating sub-module, configured to update the second element content according to the second preset additional content, to obtain a target second element content; and the client display sub-module is used for sending the first element and the target first element content, and sending the second element and the target second element content to the terminal equipment.
It can be understood that the mail information extraction device 20 belongs to the same inventive concept as the mail information extraction method in the above embodiment, and the specific implementation manner of each module of the mail information extraction device 20 corresponds to each step of the mail information extraction method in the above embodiment, which is not repeated herein.
The above-described module division is a logic function division, and there may be another division manner in actual implementation. In addition, each functional module in the embodiments of the present application may be integrated in the same processing unit, or each module may exist alone physically, or two or more modules may be integrated in the same unit. The integrated modules may be implemented in hardware or in hardware plus software functional modules.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic device 30 includes a memory 31, at least one controller 32 for implementing a mail information extraction method when executing a computer program stored in the memory 31, and at least one communication bus 33 provided to implement connection communication between the memory 31 and the at least one controller 32 and the like.
The configuration of the electronic device shown in fig. 11 is not limiting of the embodiments of the present application, and electronic device 30 may include more or less other hardware or software than shown, or a different arrangement of components.
In some embodiments of the present application, the electronic device 30 may also be connected to a client device, where the client device includes, but is not limited to, any electronic product that can interact with a user by way of a keyboard, a mouse, a remote control, a touch pad, or a voice-controlled device, such as a personal computer, a tablet, a smart phone, a digital camera, etc.
It should be noted that the electronic device 30 is only an example, and other electronic products that may be present in the present application or may be present in the future are also included in the scope of the present application by way of reference.
In some embodiments, the electronic device 30 may also include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described in detail herein.
The memory 31 stores a computer program which, when executed by the at least one controller 32, performs all or part of the steps in the mail information extraction method. The memory 31 includes Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device 30, or the like.
In some embodiments, the at least one controller 32 is the control unit (Control Unit) of the electronic device 30. It connects the various components of the entire electronic device 30 using various interfaces and lines, and performs the various functions of the electronic device 30 and processes data by running or executing programs or modules stored in the memory 31 and invoking data stored in the memory 31. For example, the at least one controller 32, when executing a computer program stored in the memory, implements all or part of the steps of the mail information extraction method in the embodiments of the present application, or implements all or part of the functions of the mail information extraction device. The at least one controller 32 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microcontrollers, digital processing chips, graphics processors, combinations of various control chips, and the like.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium that includes instructions for causing an electronic device (which may be a personal computer, an electronic device, or a network device, etc.) or a controller (processor) to perform portions of the methods of various embodiments of the application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of modules is merely a logical function division, and other manners of division may be implemented in practice.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware combined with software functional modules.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the term "comprising" does not exclude other elements, and the singular does not exclude the plural. Several of the elements or devices recited in the specification may be embodied by one and the same item of software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application without departing from the spirit and scope of the technical solution of the present application.

Claims (10)

1. A mail information extraction method, characterized in that the method comprises:
determining a target mail page of mail information to be extracted;
analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements;
determining first element content corresponding to the first element according to the attribute information;
determining a target mail text corresponding to the target mail page, and inputting the target mail text into a preset element recognition model to obtain a second element and second element content corresponding to the second element;
and combining the first element with the first element content and combining the second element with the second element content to obtain the extracted mail information.
2. The method of claim 1, wherein the determining the target mail page of the mail information to be extracted comprises:
acquiring an initial mail set;
selecting a target mail set associated with a preset element keyword from the initial mail set;
and carrying out page rendering on each target mail in the target mail set according to a rendering instruction to obtain a target mail page.
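By way of illustration only, the selection step of claim 2 might be sketched as follows. The `Mail` class, the `select_target_mails` function, and the sample keywords are hypothetical names introduced for this sketch, and the association test is assumed to be a simple case-insensitive substring match; the claim does not fix how the association is determined. The page rendering step is not shown.

```python
from dataclasses import dataclass


@dataclass
class Mail:
    # Hypothetical minimal mail record; real mails carry many more fields.
    subject: str
    body: str


def select_target_mails(initial_mails, element_keywords):
    """Return the mails that mention at least one preset element keyword."""
    keywords = [k.lower() for k in element_keywords]
    target_mails = []
    for mail in initial_mails:
        text = f"{mail.subject}\n{mail.body}".lower()
        # Assumption: "associated with" means the keyword occurs in the mail.
        if any(k in text for k in keywords):
            target_mails.append(mail)
    return target_mails


initial_mails = [
    Mail("Your order has shipped", "Order number: A1234"),
    Mail("Weekly newsletter", "Here is what happened this week."),
]
target_mails = select_target_mails(
    initial_mails, ["order number", "tracking number"]
)
```

In this sketch only the first mail is retained, because it is the only one containing a preset element keyword.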
3. The method of claim 2, wherein the analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements includes:
acquiring a first element associated with the preset element keyword according to the preset element keyword;
determining an element position of the first element in the target mail page;
determining the element size corresponding to the first element according to the element position;
and combining the element position and the element size to obtain attribute information of the first element.
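As a non-limiting sketch of claim 3, the attribute information can be pictured as a position and a size derived from the bounding box of a rendered node. The `RenderedNode` class and `first_element_attributes` function are hypothetical, and the sketch assumes the rendered page has already been reduced to a list of text nodes with box coordinates; how the page is rendered and traversed is outside this example.

```python
from dataclasses import dataclass


@dataclass
class RenderedNode:
    # Hypothetical text node of a rendered mail page with its bounding box.
    text: str
    left: float
    top: float
    right: float
    bottom: float


def first_element_attributes(nodes, element_keywords):
    """Map each matched keyword to its position and size on the page."""
    attributes = {}
    for node in nodes:
        for keyword in element_keywords:
            matched = keyword.lower() in node.text.lower()
            if matched and keyword not in attributes:
                # Assumption: position is the top-left corner of the box,
                # and size is derived from the box corners.
                position = (node.left, node.top)
                size = (node.right - node.left, node.bottom - node.top)
                attributes[keyword] = {"position": position, "size": size}
    return attributes


nodes = [
    RenderedNode("Order number", 10, 20, 110, 40),
    RenderedNode("A1234", 120, 20, 170, 40),
]
attrs = first_element_attributes(nodes, ["order number"])
```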
4. The method of claim 3, wherein the determining the first element content corresponding to the first element according to the attribute information includes:
determining the element position corresponding to the first element according to the attribute information;
acquiring a plurality of data blocks in the neighborhood corresponding to the element position;
and selecting a target data block from the plurality of data blocks, and taking the content corresponding to the target data block as the first element content corresponding to the first element.
5. The method of claim 4, wherein selecting the target data block from among the plurality of data blocks comprises:
analyzing each data block in the neighborhood to obtain the data block length and the data type corresponding to each data block;
and selecting a target data block from the plurality of data blocks in the neighborhood according to the data block length and the data type.
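The neighborhood search of claims 4 and 5 might be sketched as below, purely as an illustration. The `DataBlock` class, the coarse `data_type` classifier, the circular neighborhood with a `radius` parameter, and the nearest-candidate tie-break are all assumptions of this sketch; the claims only require that the target data block be selected according to data block length and data type.

```python
import re
from dataclasses import dataclass


@dataclass
class DataBlock:
    # Hypothetical block of page text located at coordinates (x, y).
    text: str
    x: float
    y: float


def data_type(text):
    """Classify a block with a deliberately coarse, hypothetical scheme."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    if re.fullmatch(r"[\d.,]+", text):
        return "number"
    return "text"


def select_target_block(position, blocks, expected_type, max_length,
                        radius=150.0):
    """Pick the nearest block in the neighborhood that fits type and length."""
    px, py = position
    candidates = []
    for block in blocks:
        distance = ((block.x - px) ** 2 + (block.y - py) ** 2) ** 0.5
        if distance > radius:
            continue  # outside the neighborhood of the element position
        if len(block.text) > max_length:
            continue  # too long to be the element content
        if data_type(block.text) != expected_type:
            continue  # wrong data type for this element
        candidates.append((distance, block))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]


blocks = [
    DataBlock("Thank you for shopping with us", 10, 50),
    DataBlock("2023-07-21", 120, 20),
    DataBlock("2023-09-29", 900, 700),
]
target = select_target_block(
    (10, 20), blocks, expected_type="date", max_length=10
)
```

Here the first block is rejected on length and type, the third lies outside the neighborhood, and the second is returned as the target data block.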
6. The method of claim 1, wherein the determining the target mail text corresponding to the target mail page comprises:
acquiring an initial mail text corresponding to the target mail page;
and deleting the first element and the first element content from the initial mail text to obtain a target mail text.
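For illustration, the deletion step of claim 6 could be approximated with plain string replacement. The function name `build_target_text` and the dictionary layout (element label mapped to element content) are hypothetical, and the sketch assumes the label and content appear verbatim in the initial mail text; note that `str.replace` removes every occurrence of a fragment, which may be broader than intended in practice.

```python
def build_target_text(initial_text, first_elements):
    """Drop each known element label and its content from the mail text."""
    target_text = initial_text
    for element, content in first_elements.items():
        for fragment in (element, content):
            target_text = target_text.replace(fragment, "")
    # Collapse the whitespace left behind by the removed fragments.
    return " ".join(target_text.split())


initial_text = (
    "Order number A1234\n"
    "Please deliver to 5 Main Street before Friday."
)
target_text = build_target_text(initial_text, {"Order number": "A1234"})
```

The remaining text, which no longer contains the rule-extracted element, is what would then be passed to the preset element recognition model.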
7. The method of claim 1, wherein the inputting the target mail text into a preset element recognition model to obtain a second element and a second element content corresponding to the second element includes:
inputting the target mail text to an embedding layer of the preset element recognition model to obtain a text feature vector sequence;
inputting the text feature vector sequence to a feature coding layer of the preset element recognition model to obtain a character-level feature vector sequence;
inputting the character-level feature vector sequence into a label prediction network layer of the preset element recognition model to obtain labels corresponding to each character-level feature vector;
and determining a second element and second element content corresponding to the second element according to the label.
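The final step of claim 7, turning per-character labels into elements and element content, might look like the sketch below. It assumes a BIO labeling scheme (`B-X` opens element `X`, `I-X` continues it, `O` is outside any element); the claim does not name a scheme, and the `RECIPIENT` label is a made-up example. The embedding, feature coding, and label prediction layers are not modeled here; the labels are supplied by hand.

```python
def decode_labels(text, labels):
    """Group per-character BIO labels into (element, content) pairs."""
    results = []
    current_element, current_chars = None, []
    for char, label in zip(text, labels):
        if label.startswith("B-"):
            # A new element begins; flush any element still open.
            if current_element is not None:
                results.append((current_element, "".join(current_chars)))
            current_element, current_chars = label[2:], [char]
        elif label.startswith("I-") and current_element == label[2:]:
            current_chars.append(char)
        else:
            # "O" or an inconsistent "I-" label closes the open element.
            if current_element is not None:
                results.append((current_element, "".join(current_chars)))
            current_element, current_chars = None, []
    if current_element is not None:
        results.append((current_element, "".join(current_chars)))
    return results


text = "Ship to Bob"
labels = ["O"] * 8 + ["B-RECIPIENT", "I-RECIPIENT", "I-RECIPIENT"]
pairs = decode_labels(text, labels)
```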
8. The method of claim 1, wherein after said combining the first element with the first element content and combining the second element with the second element content, the method further comprises:
determining a first preset additional content corresponding to the first element and a second preset additional content corresponding to the second element;
updating the first element content according to the first preset additional content to obtain target first element content;
updating the second element content according to the second preset additional content to obtain target second element content;
and sending the first element and the target first element content, and the second element and the target second element content to a terminal device.
9. A mail information extraction apparatus, characterized in that the apparatus comprises:
the page determining module is used for determining a target mail page of the mail information to be extracted;
the page analysis module is used for analyzing the target mail page to obtain attribute information corresponding to a plurality of first elements;
the content determining module is used for determining first element content corresponding to the first element according to the attribute information;
the model processing module is used for determining a target mail text corresponding to the target mail page, inputting the target mail text into a preset element recognition model and obtaining a second element and second element content corresponding to the second element;
and the content combination module is used for combining the first element and the first element content and combining the second element and the second element content to obtain the extracted mail information.
10. An electronic device comprising a controller and a memory, wherein the controller is configured to implement the mail information extraction method according to any one of claims 1 to 8 when executing a computer program stored in the memory.
CN202310907244.7A 2023-07-21 2023-07-21 Mail information extraction method and device and electronic equipment Pending CN116821552A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310907244.7A CN116821552A (en) 2023-07-21 2023-07-21 Mail information extraction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116821552A true CN116821552A (en) 2023-09-29

Family

ID=88120334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310907244.7A Pending CN116821552A (en) 2023-07-21 2023-07-21 Mail information extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116821552A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination