CN114154480A - Information extraction method, device, equipment and storage medium - Google Patents

Publication number
CN114154480A
Authority
CN
China
Prior art keywords
information
text
order data
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111520172.8A
Other languages
Chinese (zh)
Inventor
简仁贤
李梦雄
马永宁
王海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Emotibot Technologies Ltd
Original Assignee
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Emotibot Technologies Ltd filed Critical Emotibot Technologies Ltd
Priority to CN202111520172.8A priority Critical patent/CN114154480A/en
Publication of CN114154480A publication Critical patent/CN114154480A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Development Economics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides an information extraction method, apparatus, device and storage medium. The method comprises: acquiring order data corresponding to a query instruction; inputting the order data into a preset recognition model, which outputs subject matter information in the order data; verifying the subject matter information against a standard lexicon to obtain verified subject matter information; and generating triple information of the order data based on the verified subject matter information. By combining recognition by an artificial intelligence model with rule-based verification against a standard lexicon, the application extracts order information with improved precision.

Description

Information extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to an information extraction method, apparatus, device, and storage medium.
Background
With the development of internet technology, more and more commodities are purchased through online orders, and the order information is often delivered by mail. For example, after a user places an order for a batch of commodities on a platform, the order information is sent by mail.
The commodity information and the arrival date are among the most important data in the order information. When a user wants to check the commodity information and arrival date of a commodity mentioned in a mail, the user often has to open the mail and search it manually, which is inconvenient. Techniques for automatically extracting information from mail content have therefore been developed.
Existing mail extraction methods mainly extract information with hand-written rules. The information extracted this way is limited and of low precision, and because mail content is diverse, such rules cannot handle information in arbitrary forms. How to improve the precision of information extraction from mail content has therefore become an urgent problem.
Disclosure of Invention
The embodiments of the present application aim to provide an information extraction method, apparatus, device and storage medium that extract order information by combining model recognition with rule-based verification against a standard lexicon, thereby improving extraction precision.
A first aspect of the embodiments of the present application provides an information extraction method, including: acquiring order data corresponding to a query instruction; inputting the order data into a preset recognition model, which outputs subject matter information in the order data; verifying the subject matter information based on a standard lexicon to obtain verified subject matter information; and generating triple information of the order data based on the verified subject matter information.
In an embodiment, the query instruction carries identification information of the target order; the obtaining of the order data corresponding to the query instruction includes: when a query instruction is received, extracting order content corresponding to the identification information from a preset order library; and analyzing the content of the order to obtain text data of the target order, and taking the text data as the order data.
In one embodiment, the step of establishing the predetermined recognition model includes: obtaining a sample order data set; converting the sample order data set to a predetermined standard format; labeling sample target information in the sample order data set in a standard format; and training a neural network model by adopting the marked sample order data set to obtain the preset identification model.
In one embodiment, the subject matter information includes an identification text of the subject matter and the text position of the identification text in the order data. Verifying the subject matter information based on the standard lexicon to obtain verified subject matter information includes: determining whether target standard data identical to the identification text exists in the standard lexicon; and, when no such target standard data exists, correcting the identification text based on the text position to obtain the verified subject matter information.
In an embodiment, before determining whether target standard data identical to the identification text exists in the standard lexicon, the method further includes: detecting the characters at the boundary of the identification text and deleting any non-text symbols there to obtain a corrected identification text.
In an embodiment, correcting the identification text based on the text position to obtain the verified subject matter information includes: when the target standard data does not exist in the standard lexicon, selecting from the standard lexicon target candidate data whose similarity to the identification text is greater than a preset threshold; determining whether the spelling sequence of the target candidate data is the same as the spelling sequence of the designated interval at the text position in the order data; and, if so, taking the target candidate data as the verified subject matter information.
In an embodiment, correcting the identification text based on the text position to obtain the verified subject matter information further includes: when the spelling sequence of the target candidate data differs from the spelling sequence of the designated interval at the text position in the order data, expanding the text content along the boundary of the text position in the order data until a space is encountered, and taking the expanded text content together with its new text position as the verified subject matter information.
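The correction procedure described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `correct_span`, the use of `difflib.SequenceMatcher` as the similarity measure, the threshold of 0.8, half-open character offsets, and the reading of "designated interval" as the order text starting at the recognized start position are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def correct_span(source: str, start: int, end: int,
                 lexicon: Iterable[str],
                 threshold: float = 0.8) -> Tuple[str, int, int]:
    """Correct a recognized span source[start:end] that has no exact lexicon match.

    Returns (text, start, end) with a half-open character interval.
    """
    recognized = source[start:end]
    # Sort candidates so the result does not depend on set ordering.
    for candidate in sorted(lexicon):
        similarity = SequenceMatcher(None, recognized, candidate).ratio()
        # Assumed check: the candidate must literally occur in the order
        # text at the recognized start position.
        if similarity > threshold and source.startswith(candidate, start):
            return candidate, start, start + len(candidate)
    # No usable candidate: grow the span in both directions until whitespace.
    while start > 0 and not source[start - 1].isspace():
        start -= 1
    while end < len(source) and not source[end].isspace():
        end += 1
    return source[start:end], start, end
```

For instance, with the order text "CSSA MONTE VLCC LOAD" and a lexicon containing "CSSA MONTE", a truncated recognition "CSSA MON" at (0, 8) is replaced by the lexicon entry, while a truncated "VL" at (11, 13) has no similar candidate and is expanded to the whole word "VLCC".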
In one embodiment, the method further comprises: updating the standard lexicon with the verified subject matter information.
In one embodiment, the subject matter information includes a target article identifier and the date information corresponding to the target article. Generating the triple information of the order data based on the verified subject matter information includes: taking the target article identifier and the date information as two entities, and taking the type label of the target article and the date label as the relationship between the two entities, to generate the triple information of the order data.
A second aspect of the embodiments of the present application provides an information extraction apparatus, including: the acquisition module is used for acquiring order data corresponding to the query instruction; the recognition module is used for inputting the order data into a preset recognition model and outputting the object information in the order data; the verification module is used for verifying the subject matter information based on the standard word stock to obtain verified subject matter information; and the generating module is used for generating the triple information of the order data based on the verified object information.
In an embodiment, the query instruction carries identification information of the target order; the acquisition module is configured to: when a query instruction is received, extracting order content corresponding to the identification information from a preset order library; and analyzing the content of the order to obtain text data of the target order, and taking the text data as the order data.
In one embodiment, the apparatus further comprises an establishment module configured to: obtain a sample order data set; convert the sample order data set to a predetermined standard format; label the sample subject matter information in the standard-format sample order data set; and train a neural network model with the labeled sample order data set to obtain the preset recognition model.
In one embodiment, the subject information includes: a target object identification text and a text position of the identification text in the order data; the check module is used for: judging whether the standard word bank has target standard data which is the same as the identification text or not; and when the target standard data does not exist in the standard word bank, correcting the identification text based on the text position to obtain the verified object information.
In an embodiment, before the determining whether the target standard data identical to the identification text exists in the standard lexicon, the method further includes: and detecting character information at the boundary of the identification text, and deleting the non-text symbols at the boundary of the identification text to obtain the corrected identification text.
In an embodiment, the correcting the identification text based on the text position to obtain the verified subject matter information includes: when the target standard data does not exist in the standard word bank, target candidate data with the similarity between the target candidate data and the identification text larger than a preset threshold value are selected from the standard word bank; judging whether the spelling sequence of the target candidate data is the same as the spelling sequence of the text position designated interval in the order data; and when the spelling sequence of the target candidate data is the same as the spelling sequence of the text position designated interval in the order data, taking the target candidate data as the verified object information.
In an embodiment, the correcting the identification text based on the text position to obtain the verified subject matter information further includes: and when the spelling sequence of the target candidate data is different from the spelling sequence of the text position appointed interval in the order data, expanding text content along the text position boundary in the order data until a space symbol is met, and taking the expanded text content and a new text position corresponding to the text content as the verified object information.
In one embodiment, the apparatus further comprises an updating module configured to update the standard lexicon with the verified subject matter information.
In one embodiment, the subject information includes: the target article identification and the date information corresponding to the target article; the generation module is configured to: and respectively taking the target article identifier and the date information as two entities, and taking the type label and the date label of the target article as the relationship between the two entities to generate the triple information of the order data.
A third aspect of embodiments of the present application provides an electronic device, including: a memory to store a computer program; a processor configured to perform the method of the first aspect of the embodiments of the present application and any of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides a non-transitory electronic device-readable storage medium, including: a program which, when run by an electronic device, causes the electronic device to perform the method of the first aspect of an embodiment of the present application and any embodiment thereof.
With the information extraction method, apparatus, device and storage medium provided by the present application, the order data corresponding to the query instruction is processed by the recognition model to obtain subject matter information in a unified format; the subject matter information output by the recognition model is then verified based on the standard lexicon; and the triple information of the order data is generated based on the verified subject matter information. Order information is thus extracted by combining model recognition with rule-based verification against a standard lexicon, which improves the precision of information extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 2A is a schematic flowchart of an information extraction method according to an embodiment of the present application;
Fig. 2B is a diagram illustrating mail content parsing according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of an information extraction method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
As shown in fig. 1, the present embodiment provides an electronic apparatus 1 including: at least one processor 11 and a memory 12, one processor being exemplified in fig. 1. The processor 11 and the memory 12 are connected by a bus 10. The memory 12 stores instructions executable by the processor 11, and the instructions are executed by the processor 11, so that the electronic device 1 may execute all or part of the processes of the method in the following embodiments, so as to extract the order information by combining the model identification and the standard thesaurus rule verification, thereby improving the information extraction precision.
In an embodiment, the electronic device 1 may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, or a large computing system formed by a plurality of computer devices.
Fig. 2A shows an information extraction method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in Fig. 1 and may be applied to scenarios in which information is extracted from order data carried by mail, so that order information is extracted by combining model recognition with rule-based verification against a standard lexicon, improving extraction precision. The method comprises the following steps:
Step 201: Acquire the order data corresponding to the query instruction.
In this step, the query instruction may be input by the user, for example, the query instruction may be entered by the user through an interactive interface of the terminal, and the query instruction may include identification information, such as a name, of the specified order data. The order data can be carried in various ways, for example, the order data can be data transmitted via email, and in this case, the query instruction may include the name of the email to be queried. The order data may be pre-stored in a pre-set order repository.
In one embodiment, step 201 may include: when a query instruction is received, extracting the order content corresponding to the identification information from a preset order library; and parsing the order content to obtain the text data of the target order, which is taken as the order data.
In this step, a large amount of queryable order data is stored in advance in the preset order library, and the order data may take the form of mails or documents. Suppose the query instruction entered by the user carries identification information of a target order, such as the mail name of the target order. The order content corresponding to the mail name is first looked up in the preset order library, and the order content is then parsed to obtain the text data of the order, which is taken as the order data of the target order. Whatever form the order content takes, it can thus be processed into unified text data, so that information extraction is not limited by the form of the order.
In an embodiment, as shown in Fig. 2B, take an order in the form of a mail as an example. The user inputs the name of the mail to be queried, and the order content is the mail content containing the target order. When the order content is parsed, the content type of the mail is determined first, and the content is then parsed according to that type. The content types of a mail may include the following:
1. If the mail contains only text, the mail content is input to a text parsing module to obtain the corresponding text content.
2. If the mail body contains a table, the table part is input to a table parsing module, which uniformly converts the table into text format.
3. If the mail has an attachment, the type of the attachment is determined: a Word document is parsed with a Word parser; an Excel document with an Excel parser; a PDF document with a PDF parser; a picture is processed with an OCR (Optical Character Recognition) picture parsing tool to extract its information; and a text file is passed directly to the text parsing module. In this way, mail content in different formats is uniformly converted into text content, so that mail content extraction is no longer limited by the mail type and the range of use of mail information extraction is expanded.
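The type-based dispatch in the list above might be organized as a lookup table from file extension to parser. The sketch below is only illustrative: the names `PARSERS`, `attachment_to_text` and `mail_to_text` are hypothetical, the extension list is an assumption, and the Word, Excel, PDF and OCR parsers are placeholder stubs standing in for real parsing libraries.

```python
from pathlib import Path
from typing import Callable, Dict


def parse_text(payload: bytes) -> str:
    """Text files go straight to the text parsing module."""
    return payload.decode("utf-8", errors="replace")


def _placeholder(kind: str) -> Callable[[bytes], str]:
    """Build a stub parser; a real system would call a Word/Excel/PDF/OCR tool."""
    def parser(payload: bytes) -> str:
        return f"[{kind} content, {len(payload)} bytes]"
    return parser


# Assumed mapping from attachment extension to parser.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".txt": parse_text,
    ".doc": _placeholder("word"),
    ".docx": _placeholder("word"),
    ".xls": _placeholder("excel"),
    ".xlsx": _placeholder("excel"),
    ".pdf": _placeholder("pdf"),
    ".png": _placeholder("ocr"),
    ".jpg": _placeholder("ocr"),
}


def attachment_to_text(filename: str, payload: bytes) -> str:
    """Pick a parser by file extension and return the attachment as text."""
    suffix = Path(filename).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"unsupported attachment type: {suffix!r}")
    return parser(payload)


def mail_to_text(body: str, attachments: Dict[str, bytes]) -> str:
    """Concatenate the mail body and all parsed attachments into one text."""
    parts = [body]
    for name, payload in attachments.items():
        parts.append(attachment_to_text(name, payload))
    return "\n".join(parts)
```

Keeping the mapping in a table means a new attachment type only needs one new entry, which matches the stated goal of not being limited by the mail type.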
Step 202: Input the order data into the preset recognition model, which outputs the subject matter information in the order data.
In this step, the subject matter may be a commodity, a general article, or the like, and different subject matters may be selected for different scenarios. In practice, a mail concerning a subject matter generally contains various attributes of that subject matter, such as its name and category. If the subject matter is a commodity, the order data may be commodity order information delivered as mail content, such as a mail order form for a batch of articles, which generally includes the category, name, quantity and arrival date of the commodities.
The preset recognition model may be a neural network-based recognition model. Assuming that the object is a commodity, taking order data in the form of a mail as an example, because the commodity information has strong correlation with the arrival date thereof, the neural network model can be trained based on regularity of the commodity information and the arrival date in the past mail, so as to obtain a preset identification model, the mail content is firstly converted into a text format in a unified manner through step 201, and then the mail content in the text format is input into the preset identification model, so that information such as the commodity information and the arrival date in the mail can be output.
In an embodiment, before the mail content in the text format is input into the preset recognition model, the mail content in the text format may be standardized and converted into a predetermined standard format, so that the preset recognition model can extract the object information more accurately.
In an embodiment, before step 202, the method may further include establishing the preset recognition model, which comprises: obtaining a sample order data set; converting the sample order data set to a predetermined standard format; labeling the sample subject matter information in the standard-format sample order data set; and training a neural network model with the labeled sample order data set to obtain the preset recognition model.
In this step, the sample order data set may be a set of text-format order data obtained by passing the order mail contents of a plurality of subject matters through the parser. The data set is first standardized into a predetermined standard format, such as a uniform text layout, and is cleaned and deduplicated. The standard-format sample order data set is then labeled, for example by marking the subject matter information appearing in each text and the label corresponding to that information, where the label may represent the category of the subject matter. Finally, the neural network model is trained with the labeled sample order data set to obtain the preset recognition model.
Taking a commodity as the subject matter, the input of the preset recognition model during training is text together with the labels of the commodity information, and the output of the preset recognition model is the text and label of the queried subject matter information, together with the position of that text in the input text.
In an embodiment, the model may be trained with a bidirectional LSTM + CRF network architecture. Taking commodity information as an example, the preprocessed commodity information and its labels are used as the input data of the network, and training iterates over the data to minimize the loss function; training ends once the set F1 value is reached on the test set. During prediction, after text is input, the model takes the label with the maximum predicted probability and, after some post-processing of those labels, determines whether the input text contains commodity information and, if so, the corresponding label and the position of the commodity information.
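The post-processing that turns per-token labels into text, label and position can be illustrated with a small decoder. The source does not specify the tagging scheme or tokenization, so this sketch assumes whitespace tokens, BIO-style tags such as `B-PRODUCT` / `I-PRODUCT` / `O`, and half-open character offsets; `decode_bio` is a hypothetical helper and the network itself is not shown.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, str, int, int]  # (text, label, start, end), end exclusive


def decode_bio(text: str, tags: List[str]) -> List[Span]:
    """Merge per-token BIO tags into labeled spans with character offsets."""
    # Locate each whitespace-separated token in the original text.
    offsets = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    if len(offsets) != len(tags):
        raise ValueError("one tag per token is required")

    spans: List[Tuple[str, int, int]] = []
    current: Optional[Tuple[str, int, int]] = None  # (label, start, end)
    for (start, end), tag in zip(offsets, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any open one first.
            if current is not None:
                spans.append(current)
            current = (tag[2:], start, end)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # Continue the open entity up to the end of this token.
            current = (current[0], current[1], end)
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [(text[s:e], label, s, e) for label, s, e in spans]
```

An empty result from `decode_bio` corresponds to the case where the input text contains no commodity information.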
Step 203: Verify the subject matter information based on the standard lexicon to obtain the verified subject matter information.
In this step, the subject matter information output by the preset recognition model may be inaccurate. For example, the mail content may be segmented incorrectly, so that the recognition result is incomplete, which in turn affects the precision of the final extraction. Rule-based verification may therefore be performed on top of model recognition: the subject matter information is verified based on the standard lexicon to obtain the verified subject matter information. The standard lexicon holds standard-format information about the subject matter. Taking a commodity as the subject matter, the standard lexicon may include the name, order number, arrival date and similar information of commodities that have been ordered, and it may be compiled statistically from completed order data. Verifying the recognition result against the standard lexicon thus further ensures the accuracy of information extraction.
Step 204: Generate the triple information of the order data based on the verified subject matter information.
In this step, the corresponding triple information is extracted based on the verified subject matter information, so that the user can see the content of interest in the mail at a glance instead of reading a long mail from beginning to end to find it. Taking a commodity as an example, in practice an order mail for a commodity may contain a great deal of content and may even carry many attachments. If the user only wants to look up the commodity name and arrival date, reading the entire mail wastes time and effort. If the user only needs to enter the name of the mail to be queried and the triple information relating the commodity information to the arrival date is returned automatically, the user's query time is greatly reduced.
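A triple of this kind can be assembled from the verified entities as sketched below. The text states only that the article identifier and the date are the two entities and that their labels form the relationship; the `Entity` type, the label names `PRODUCT` and `DATE`, the `PRODUCT-DATE` relation string, and the rule of pairing each product with the nearest date in the text are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int
    end: int


Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_triples(entities: List[Entity],
                  product_label: str = "PRODUCT",
                  date_label: str = "DATE") -> List[Triple]:
    """Pair each product entity with a date entity and emit one triple per pair."""
    products = [e for e in entities if e.label == product_label]
    dates = [e for e in entities if e.label == date_label]
    triples: List[Triple] = []
    if not dates:
        return triples
    for product in products:
        # Assumed pairing rule: take the date closest to the product in the text.
        nearest = min(dates, key=lambda d: abs(d.start - product.start))
        # The relation is formed from the two labels.
        triples.append((product.text, f"{product.label}-{nearest.label}", nearest.text))
    return triples
```

With one product and one date recognized in a mail, the result is a single triple such as `("CSSA MONTE", "PRODUCT-DATE", "20-21/08")`, which is what would be returned to the user in place of the full mail.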
According to the information extraction method, mail content corresponding to the query instruction is converted into a uniform text format through an analyzer, then order data in the text format is processed through an identification model to obtain object information in the uniform format, then the object information output by the identification model is verified based on a standard lexicon, and triple information of the order data is generated based on the verified object information.
Fig. 3 shows an information extraction method according to an embodiment of the present application. The method may be executed by the electronic device 1 shown in Fig. 1 and may be applied to scenarios in which information is extracted from order data carried by mail, so that order information is extracted by combining model recognition with rule-based verification against a standard lexicon, improving extraction precision. The method comprises the following steps:
Step 301: Acquire the order data corresponding to the query instruction. See the description of step 201 in the above embodiment for details.
Step 302: Input the order data into the preset recognition model, which outputs the subject matter information in the order data. See the description of step 202 in the above embodiment for details.
Step 303: Determine whether target standard data identical to the identification text exists in the standard lexicon. If yes, go to step 308; otherwise, go to step 304.
In this step, the subject matter information output by the preset recognition model may include the identification text of the subject matter and the text position of the identification text in the order data. Taking a commodity as the subject matter and assuming the identifier is the commodity name, the subject matter information output by the preset recognition model also includes the text position of the commodity name in the input text, which may be a coordinate position. When the subject matter information is verified based on the standard lexicon, it may first be determined whether target standard data identical to the identification text exists in the standard lexicon. Taking a commodity as an example, the identification text is the text of the commodity name; each kind of commodity may have its own standard lexicon, and the different lexicons can be maintained and selected by label matching. For the determination, the text of the commodity name is looked up in the standard lexicon for commodity information. If the standard lexicon contains no target standard data exactly identical to the text of the commodity name, the process goes to step 304; otherwise it goes to step 308.
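The branch in step 303 amounts to an exact-match lookup in the lexicon selected by label. In the sketch below, keying the lexicons by label name, the sample contents of `LEXICONS`, and the function name `next_step` are assumptions; only the step numbers 304 and 308 come from the description.

```python
from typing import Dict, Set

# Assumed layout: one standard lexicon per label.
LEXICONS: Dict[str, Set[str]] = {
    "PRODUCT": {"CSSA MONTE"},
    "DATE": set(),
}


def next_step(identification_text: str, label: str,
              lexicons: Dict[str, Set[str]]) -> int:
    """Return 308 on an exact lexicon match, otherwise 304 (correction)."""
    lexicon = lexicons.get(label, set())
    return 308 if identification_text in lexicon else 304
```

A set gives constant-time exact matching, so the more expensive similarity search is only needed on the 304 branch.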
In an embodiment, before step 303, the method may further include: detecting the character information at the boundary of the identification text, and deleting non-text symbols at that boundary to obtain a corrected identification text.
In an actual scenario, the identification text recognized by the preset recognition model may not be accurate enough. For example, adjacent non-text symbols may be included in the identification text, which makes the verification against the standard lexicon inaccurate. Consider scenario 1:
The user inputs the name of a mail, and the mail content obtained by parsing is as follows:
-CSSA MONTE VLCC LOAD MONGSTAD/DISCHARGE INDIA LAYCAN 20-21/08-REMARK RPLC
The parser converts the mail content into order data in text format. The preset recognition model recognizes -CSSA MONTE as the commodity name text (i.e., the identification text) with text position (0, 12), and 20-21/08 as the arrival date text with text position (57, 65). The label of -CSSA MONTE is PRD and the label of 20-21/08 is DAT, and the output of the preset recognition model is: (-CSSA MONTE, PRODUCT, 0, 12) and (20-21/08, DATA, 57, 65). The commodity name text -CSSA MONTE carries a non-text symbol "-". If -CSSA MONTE were matched directly against the standard lexicon, no identical target standard data would be found, because the real commodity name stored in the standard lexicon is CSSA MONTE.
This error originates in the output of the preset recognition model. To keep it from propagating to the final extraction result, the output of the preset recognition model can be preprocessed before it is matched against the standard lexicon: based on the text position of the identification text, the character information at its boundary is detected, and non-text symbols at the boundary are deleted to obtain the corrected identification text. For example, the commodity name is truncated according to its position information and the "-" is removed to obtain CSSA MONTE, so that the tuple of the commodity name finally becomes (CSSA MONTE, PRODUCT, 2, 12). Since CSSA MONTE is already in the lexicon table, i.e., it is target standard data in the standard lexicon, the method can proceed to step 308.
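The boundary correction described for scenario 1 can be sketched as follows. This is an illustrative sketch only: the function name `trim_boundary_symbols` is hypothetical, "non-text symbol" is assumed to mean any character that is not a letter or digit, and positions are Python zero-based half-open offsets, which may differ from the coordinate convention used in the example above.

```python
from typing import Tuple


def trim_boundary_symbols(order_text: str, start: int, end: int) -> Tuple[str, int, int]:
    """Shrink the span [start, end) until both of its ends are letters or digits."""
    # Drop non-text symbols at the left boundary.
    while start < end and not order_text[start].isalnum():
        start += 1
    # Drop non-text symbols at the right boundary.
    while end > start and not order_text[end - 1].isalnum():
        end -= 1
    return order_text[start:end], start, end


order = "-CSSA MONTE VLCC LOAD MONGSTAD/DISCHARGE INDIA LAYCAN 20-21/08-REMARK RPLC"
corrected_text, new_start, new_end = trim_boundary_symbols(order, 0, 11)
```

With these offsets the span shrinks by one character on the left, and the corrected text can then be looked up in the standard lexicon.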
Step 304: Select, from the standard lexicon, target candidate data whose similarity to the identification text is greater than a preset threshold.
In this step, when the target standard data does not exist in the standard lexicon, the identification text needs to be corrected based on the text position to obtain verified object information. Consider scenario 2:
The user inputs the name of the mail to be queried, and the corresponding mail content is parsed as follows:
09-12 DELTA APOLLONIA 319 15 22.52 VADINAR 17-11 DELTA
First, the mail content is parsed into order data in text format. The preset recognition model then recognizes DELTA APOLLON as the commodity name text with text position (9, 22), and 17-11 as the arrival date text with text position (61, 66). The label of DELTA APOLLON is PRD and the label of 17-11 is DAT, and the output of the preset recognition model is recorded as: (DELTA APOLLON, PRODUCT, 9, 22) and (17-11, DATA, 61, 66). The first item of each tuple is matched against the standard lexicon. If DELTA APOLLON is not in the standard lexicon, the lexicon is queried further for target candidate data whose similarity to the commodity name text DELTA APOLLON is greater than the preset threshold. Several candidates may exceed the threshold; in that case, the candidate with the greatest similarity is selected as the target candidate data.
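The candidate selection of step 304 can be sketched as follows. The application does not name a similarity measure, so this sketch assumes the ratio computed by Python's `difflib.SequenceMatcher`; the function name `select_target_candidate` and the threshold value 0.8 are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def select_target_candidate(
    identification_text: str,
    lexicon: Iterable[str],
    threshold: float = 0.8,  # assumed value; the application only says "preset"
) -> Optional[str]:
    """Return the lexicon entry with the highest similarity above the threshold."""
    best_entry: Optional[str] = None
    best_score = threshold
    for entry in lexicon:
        score = SequenceMatcher(None, identification_text, entry).ratio()
        # Strictly greater: the entry must exceed the threshold and any earlier best.
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry
```

When no entry exceeds the threshold the function returns `None`, which corresponds to the case handled by the expansion in step 307.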
Step 305: Judge whether the spelling order of the target candidate data is the same as the spelling order of the text in the interval specified by the text position in the order data. If yes, go to step 306; otherwise, go to step 307.
In this step, although the similarity between the selected target candidate data and the commodity name text DELTA APOLLON is greater than the preset threshold, their spelling orders may differ. For English-like text in particular, the same letters spelled in a different order can denote a different meaning. The target candidate data is therefore taken from the words in the standard lexicon; assuming that it is DELTA APOLLONIA, it is slid over the interval between the start position and the end position of the commodity name text DELTA APOLLON, and the letters within that specified interval are compared with those of the target candidate data one by one. Here the spelling order of the target candidate data DELTA APOLLONIA is found to be the same as that of the commodity name within the specified interval, so the method can proceed to step 306.
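One possible reading of the sliding, letter-by-letter comparison in step 305 is sketched below. It assumes that the spelling orders are "the same" when the text in the specified interval occurs as a contiguous run inside the target candidate data; this interpretation, the function name `spelling_order_matches`, and the zero-based half-open offsets are assumptions rather than details given by the application.

```python
def spelling_order_matches(candidate: str, order_text: str, start: int, end: int) -> bool:
    """Slide the recognized span over the candidate and compare letter by letter."""
    span = order_text[start:end]
    if not span or len(span) > len(candidate):
        return False
    # Try every alignment of the span inside the candidate.
    for offset in range(len(candidate) - len(span) + 1):
        if all(candidate[offset + i] == ch for i, ch in enumerate(span)):
            return True
    return False
```

A candidate with the same letters in a different order, such as a transposed spelling, fails this check even though its similarity score may still be above the threshold.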
Step 306: Take the target candidate data as the verified object information, and then go to step 308.
In this step, when the spelling order of the target candidate data is the same as the spelling order of the text in the interval specified by the text position in the order data, the target candidate data is used as the verified object information. In scenario 2 above, for example, the spelling order of the target candidate data DELTA APOLLONIA is the same as that of the commodity name text DELTA APOLLON within the specified interval, so DELTA APOLLONIA can be used as the verified object information.
Step 307: Expand the text content along the text position boundary in the order data until a space symbol is met, and take the expanded text content together with its new text position as the verified object information. Then go to step 308.
In this step, either the spelling order of the target candidate data differs from the spelling order of the text in the interval specified by the text position in the order data, or no target candidate data meeting the threshold was found in the standard lexicon in step 304. Both cases indicate that the standard lexicon may not yet store standard data for the object queried this time. The identification text can then be corrected based on its text position and the original order data. For example, the identification text output by the preset recognition model may be incomplete, in which case it can be completed.
Taking scenario 2 as an example, assume that the identification text of the object is the commodity name text DELTA APOLLON with text position (9, 22). The character at boundary 22 of the text position in the original order data is located as N. The text content is expanded backward along this boundary: the next character is the letter I, so the commodity name is extended by one character, and the expansion is repeated in a loop. The space symbol serves as the delimiter, and the expansion stops when a space symbol is met. The finally expanded text content DELTA APOLLONIA and its new text position (9, 24) are taken as the verified object information, namely (DELTA APOLLONIA, PRODUCT, 9, 24).
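The boundary expansion of step 307 can be sketched as follows. The function name `expand_to_spaces` is hypothetical and offsets are Python zero-based half-open positions, so the numbers differ from the coordinates quoted above. The scenario only describes expansion past the end boundary; extending the start boundary in the same way is an assumption added here for symmetry.

```python
from typing import Tuple


def expand_to_spaces(order_text: str, start: int, end: int) -> Tuple[str, int, int]:
    """Grow the span [start, end) in both directions until whitespace is met."""
    # Extend the end boundary one character at a time, as in scenario 2.
    while end < len(order_text) and not order_text[end].isspace():
        end += 1
    # Assumed symmetric handling of the start boundary.
    while start > 0 and not order_text[start - 1].isspace():
        start -= 1
    return order_text[start:end], start, end


order = "09-12 DELTA APOLLONIA 319 15 22.52 VADINAR 17-11 DELTA"
completed_text, new_start, new_end = expand_to_spaces(order, 6, 19)
```

Here the incomplete span grows by two characters at its end, which matches the two-character extension described for scenario 2.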
In one embodiment, the object information includes the target item identification and the date information corresponding to the target item. Note that scenarios 1 and 2 above describe the verification process using the tuple of the commodity name, i.e., the target item identification, as an example. The date information corresponding to the target item can be verified in the same way; for example, the tuple corresponding to the commodity arrival date can be checked likewise. Because the original mail sentence may contain several dates, a rule that searches for dates directly may pick the correct date only about half of the time within one sentence. The preset recognition model identifies the approximate position of the date, and the standard lexicon verification then yields a more accurate result. The details are not repeated here.
Step 308: Take the target item identification and the date information in the object information as two entities, and take the type label of the target item together with the date label as the relationship between the two entities, to generate the triple information of the order data.
In this step, it is assumed that either the identification text of the object in each tuple has been matched in the standard lexicon and exactly matching target standard data has been found (for example, CSSA MONTE in scenario 1 is already in the lexicon table), or complete object information has been obtained through the verification processing of step 306 or step 307. The object information includes the target item identification and the date information corresponding to the target item. Taking goods as the object, the object information may include the identification of the goods and the arrival date of the goods, and the triple information may be (goods identification, goods-date, arrival date), so that the user can clearly see the content of interest in the mail.
In one embodiment, the date information may be formatted, i.e., converted into a unified preset date format. In scenario 1, for example, the date information is recognized correctly and needs no correction; the original date 20-21/08 is in fact an abbreviation of two dates and only needs to be converted into the standard format 2021-08-20/2021-08-21. The final output triple is (CSSA MONTE, PRD-DAT, 2021-08-20/2021-08-21).
Similarly, in scenario 2, the date 17-11 is converted to the standard format 2021-11-17, and the final output triple is (DELTA APOLLONIA, PRD-DAT, 2021-11-17).
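The date formatting and triple generation of step 308 can be sketched as follows. The abbreviated dates in the mails carry no year, so the year is passed in explicitly here (the examples above use 2021). The function names `normalize_date` and `build_triple` are hypothetical, and only the two date shapes that occur in scenarios 1 and 2 are handled.

```python
import re
from datetime import date
from typing import Tuple


def normalize_date(raw: str, year: int) -> str:
    """Convert 'DD-DD/MM' or 'DD-MM' into ISO dates for the given year."""
    day_range = re.fullmatch(r"(\d{1,2})-(\d{1,2})/(\d{1,2})", raw)
    if day_range:
        first_day, last_day, month = (int(g) for g in day_range.groups())
        first = date(year, month, first_day).isoformat()
        last = date(year, month, last_day).isoformat()
        return f"{first}/{last}"
    single = re.fullmatch(r"(\d{1,2})-(\d{1,2})", raw)
    if single:
        day, month = (int(g) for g in single.groups())
        return date(year, month, day).isoformat()
    raise ValueError(f"unsupported date format: {raw!r}")


def build_triple(
    item_text: str, item_label: str, date_text: str, date_label: str, year: int
) -> Tuple[str, str, str]:
    """Entities are the item and the date; the relation joins the two labels."""
    relation = f"{item_label}-{date_label}"
    return (item_text, relation, normalize_date(date_text, year))
```

Using `datetime.date` means that an impossible day or month raises an error instead of silently producing a malformed triple.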
Step 309: Update the verified object information into the standard lexicon.
In this step, object information that matched neither target standard data nor target candidate data in the standard lexicon has evidently not yet been recorded there. To enrich the standard lexicon, such object information can be added to the corresponding standard lexicon, which updates the lexicon and helps improve the accuracy of subsequent information extraction.
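The lexicon update of step 309 can be sketched as follows, assuming the same in-memory layout of one set of names per label; the function name `update_lexicon` and this storage layout are hypothetical, and a real system would more likely persist the entry to a database.

```python
from typing import Dict, Set


def update_lexicon(lexicons: Dict[str, Set[str]], label: str, verified_text: str) -> bool:
    """Add verified text to the lexicon for its label; return True if it was new."""
    entries = lexicons.setdefault(label, set())
    if verified_text in entries:
        return False
    entries.add(verified_text)
    return True
```

The boolean result lets the caller log only genuinely new entries, such as a name that was completed by boundary expansion.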
Referring to Fig. 4, an information extraction apparatus 400 according to an embodiment of the present application is shown. The apparatus can be applied to the electronic device 1 shown in Fig. 1 and to information extraction from order data that uses mail as the information carrier, so that order information is extracted by combining model recognition with rule-based verification against a standard lexicon, which improves extraction accuracy. The apparatus includes an acquisition module 401, a recognition module 402, a verification module 403, and a generation module 404. The modules operate as follows:
The acquisition module 401 is configured to acquire order data corresponding to the query instruction.
The recognition module 402 is configured to input the order data into a preset recognition model and output the object information in the order data.
The verification module 403 is configured to verify the object information based on the standard lexicon to obtain verified object information.
The generation module 404 is configured to generate triple information of the order data based on the verified object information.
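The cooperation of the four modules can be sketched as a simple pipeline. This is only a structural illustration: the class name `InformationExtractionApparatus`, the callable signatures, and the tuple shapes are assumptions, and each module is injected as a plain function so that the sketch stays independent of any particular model or lexicon.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Entity = Tuple[str, str, int, int]  # (text, label, start, end)
Triple = Tuple[str, str, str]       # (entity, relation, entity)


@dataclass
class InformationExtractionApparatus:
    acquire: Callable[[str], str]                   # role of module 401
    recognize: Callable[[str], List[Entity]]        # role of module 402
    verify: Callable[[str, Entity], Entity]         # role of module 403
    generate: Callable[[List[Entity]], List[Triple]]  # role of module 404

    def run(self, query: str) -> List[Triple]:
        """Chain acquisition, recognition, verification and generation."""
        order_data = self.acquire(query)
        entities = self.recognize(order_data)
        verified = [self.verify(order_data, entity) for entity in entities]
        return self.generate(verified)
```

Passing the order data to the verification callable reflects that verification needs the original text when it corrects positions.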
In one embodiment, the query instruction carries identification information of the target order. The acquisition module 401 is configured to: when a query instruction is received, extract the order content corresponding to the identification information from a preset order library; and parse the order content to obtain text data of the target order, and take the text data as the order data.
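The acquisition performed by module 401 can be sketched as follows, assuming that the order library is a mapping from a mail name to the raw mail text and that the mail has a plain-text body. The function name `get_order_data` and the library layout are hypothetical; the parsing relies on Python's standard `email` package.

```python
from email import message_from_string
from email.policy import default
from typing import Dict


def get_order_data(order_id: str, order_library: Dict[str, str]) -> str:
    """Look up a mail by its identification and return its body as one text line."""
    raw_mail = order_library[order_id]
    message = message_from_string(raw_mail, policy=default)
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # Collapse line breaks and repeated spaces so offsets refer to a single line.
    return " ".join(text.split())
```

Text positions reported later refer to the normalized text returned here, not to the raw mail.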
In one embodiment, the apparatus further includes a setup module 405 configured to: obtain a sample order data set; convert the sample order data set into a preset standard format; label the sample object information in the sample order data set in the standard format; and train a neural network model with the labeled sample order data set to obtain the preset recognition model.
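The labeling of sample data can be sketched as follows. The application does not specify the standard format, so this sketch assumes whitespace tokens with BIO tags, a format commonly used for training sequence labeling models; the function name `to_bio` and the span layout `(start, end, label)` are hypothetical.

```python
import re
from typing import List, Tuple


def to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Turn labeled character spans into (token, BIO tag) pairs."""
    rows: List[Tuple[str, str]] = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when it lies completely inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows
```

Each labeled mail then yields one tagged token sequence that a neural sequence labeling model could be trained on.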
In one embodiment, the object information includes the identification text of the object and the text position of the identification text in the order data. The verification module 403 is configured to: judge whether target standard data identical to the identification text exists in the standard lexicon; and, when the target standard data does not exist in the standard lexicon, correct the identification text based on the text position to obtain the verified object information.
In an embodiment, before judging whether target standard data identical to the identification text exists in the standard lexicon, the verification module 403 is further configured to: detect the character information at the boundary of the identification text, and delete non-text symbols at that boundary to obtain the corrected identification text.
In an embodiment, correcting the identification text based on the text position to obtain the verified object information includes: when the target standard data does not exist in the standard lexicon, selecting from the standard lexicon target candidate data whose similarity to the identification text is greater than a preset threshold; judging whether the spelling order of the target candidate data is the same as the spelling order of the text in the interval specified by the text position in the order data; and, when the two spelling orders are the same, taking the target candidate data as the verified object information.
In an embodiment, correcting the identification text based on the text position to obtain the verified object information further includes: when the spelling order of the target candidate data differs from the spelling order of the text in the interval specified by the text position in the order data, expanding the text content along the text position boundary in the order data until a space symbol is met, and taking the expanded text content and its new text position as the verified object information.
In one embodiment, the apparatus further includes an updating module 406 configured to update the verified object information into the standard lexicon.
In one embodiment, the object information includes the target item identification and the date information corresponding to the target item. The generation module 404 is configured to: take the target item identification and the date information as two entities, and take the type label of the target item together with the date label as the relationship between the two entities, to generate the triple information of the order data.
For a detailed description of the information extraction apparatus 400, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present invention further provides a non-transitory storage medium readable by an electronic device, including a program that, when run on an electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above-described embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like. The storage medium may also comprise a combination of memories of the kinds described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (12)

1. An information extraction method, comprising:
acquiring order data corresponding to the query instruction;
inputting the order data into a preset identification model, and outputting object information in the order data;
verifying the subject matter information based on a standard word bank to obtain verified subject matter information;
generating triple information of the order data based on the verified object information.
2. The method according to claim 1, wherein the query instruction carries identification information of the target order; the obtaining of the order data corresponding to the query instruction includes:
when a query instruction is received, extracting order content corresponding to the identification information from a preset order library;
and analyzing the content of the order to obtain text data of the target order, and taking the text data as the order data.
3. The method of claim 1, wherein the step of building the predetermined recognition model comprises:
obtaining a sample order data set;
converting the sample order data set to a predetermined standard format;
labeling sample target information in the sample order data set in a standard format;
and training a neural network model by adopting the marked sample order data set to obtain the preset identification model.
4. The method of claim 1, wherein the subject matter information comprises: a target object identification text and a text position of the identification text in the order data; the standard word bank-based verification processing of the subject matter information to obtain verified subject matter information includes:
judging whether the standard word bank has target standard data which is the same as the identification text or not;
and when the target standard data does not exist in the standard word bank, correcting the identification text based on the text position to obtain the verified object information.
5. The method according to claim 4, before said determining whether the target standard data identical to the identification text exists in the standard thesaurus, further comprising:
and detecting character information at the boundary of the identification text, and deleting the non-text symbols at the boundary of the identification text to obtain the corrected identification text.
6. The method according to claim 4, wherein the performing correction processing on the identification text based on the text position to obtain the verified subject matter information comprises:
when the target standard data does not exist in the standard word bank, target candidate data with the similarity between the target candidate data and the identification text larger than a preset threshold value are selected from the standard word bank;
judging whether the spelling sequence of the target candidate data is the same as the spelling sequence of the text position designated interval in the order data;
and when the spelling sequence of the target candidate data is the same as the spelling sequence of the text position designated interval in the order data, taking the target candidate data as the verified object information.
7. The method according to claim 6, wherein the correcting the identification text based on the text position to obtain the verified object information further comprises:
and when the spelling sequence of the target candidate data is different from the spelling sequence of the text position appointed interval in the order data, expanding text content along the text position boundary in the order data until a space symbol is met, and taking the expanded text content and a new text position corresponding to the text content as the verified object information.
8. The method of claim 7, further comprising:
and updating the verified object information into the standard word stock.
9. The method of claim 1, wherein the subject matter information comprises: the target article identification and the date information corresponding to the target article; the generating of the triple information of the order data based on the verified object information includes:
and respectively taking the target article identifier and the date information as two entities, and taking the type label and the date label of the target article as the relationship between the two entities to generate the triple information of the order data.
10. An information extraction apparatus characterized by comprising:
the acquisition module is used for acquiring order data corresponding to the query instruction;
the recognition module is used for inputting the order data into a preset recognition model and outputting the object information in the order data;
the verification module is used for verifying the subject matter information based on the standard word stock to obtain verified subject matter information;
and the generating module is used for generating the triple information of the order data based on the verified object information.
11. An electronic device, comprising:
a memory to store a computer program;
a processor to execute the computer program to implement the method of any one of claims 1 to 9.
12. A non-transitory electronic device readable storage medium, comprising: a program which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 9.
CN202111520172.8A 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium Pending CN114154480A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111520172.8A CN114154480A (en) 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114154480A true CN114154480A (en) 2022-03-08

Family

ID=80450513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111520172.8A Pending CN114154480A (en) 2021-12-13 2021-12-13 Information extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114154480A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468255A (en) * 2023-06-15 2023-07-21 国网信通亿力科技有限责任公司 Configurable main data management system
CN116468255B (en) * 2023-06-15 2023-09-08 国网信通亿力科技有限责任公司 Configurable main data management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination