CN114281953A - Information extraction method, device and equipment and computer readable storage medium - Google Patents

Publication number
CN114281953A
Authority
CN
China
Prior art keywords
question
document
test
information
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110962863.7A
Other languages
Chinese (zh)
Inventor
李淼
冯河
包恒耀
赵学敏
曹云波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an information extraction method, an information extraction device, information extraction equipment and a computer readable storage medium; relates to artificial intelligence technology; the method comprises the following steps: responding to the information extraction triggering operation, and acquiring a document to be extracted, which is corresponding to the information extraction triggering operation and waits for information extraction; identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content; and extracting the document to be extracted corresponding to the target organization type to obtain the extraction information of the document to be extracted. Through the method and the device, the intelligent degree of information extraction can be improved.

Description

Information extraction method, device and equipment and computer readable storage medium
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to an information extraction method, apparatus, device, and computer readable storage medium.
Background
Information extraction refers to extracting, from an input document, the information that is to be archived and stored; for example, extracting questions and answer analyses from an input test paper document, or extracting a summary of the knowledge points related to a question from an input teaching material image. In the related art, information is generally extracted by template-based extraction or rule-based extraction; with either approach, both the efficiency and the effectiveness of extracting information from the input document are low, so the degree of intelligence of information extraction is low.
Disclosure of Invention
The embodiment of the application provides an information extraction method, an information extraction device, information extraction equipment and a computer readable storage medium, which can improve the intelligence degree of information extraction.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an information extraction method, which comprises the following steps:
responding to an information extraction triggering operation, and acquiring a document to be extracted, which is corresponding to the information extraction triggering operation and waits for information extraction;
identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content;
and extracting the document to be extracted corresponding to the target organization type to obtain the extraction information of the document to be extracted.
An embodiment of the present application provides an information extraction device, including:
the document acquisition module is used for responding to the information extraction triggering operation and acquiring the document to be extracted that corresponds to the information extraction triggering operation and awaits information extraction;
the type identification module is used for identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content;
and the extraction processing module is used for performing extraction processing corresponding to the target organization type aiming at the document to be extracted to obtain the extraction information of the document to be extracted.
In some embodiments of the present application, the document to be extracted includes: a test paper document; the type identification module is further configured to extract rich text data from the test paper document to obtain data to be identified; and predict the organization type of the data to be identified to obtain the target organization type of the test paper document; the target organization type represents that the test paper document includes: any one or more of a test question text, a first parsed text that follows an individual test question, and a second parsed text that is at the end of the test paper document.
In some embodiments of the present application, the type identification module is further configured to perform sentence segmentation on the data to be identified according to line feed information to obtain a plurality of pieces of sentence information; perform feature encoding on the pieces of sentence information to obtain sentence features; aggregate the sentence features to obtain document features of the test paper document; and perform organization type classification on the document features to obtain the target organization type of the test paper document.
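The four steps above (segment by line feed, encode each sentence, aggregate, classify) can be sketched as follows. This is only an illustrative sketch: the regular-expression cues, the two-element feature vectors, the aggregation rule, and the three type names are assumptions made for the example and stand in for the trained encoder and classifier that an actual embodiment would use.

```python
import re
from typing import List

# Hypothetical surface cues; a real system would use a learned sentence encoder.
QUESTION_CUE = re.compile(r"^\d+[.)]\s")
ANSWER_CUE = re.compile(r"^(Answer|Analysis)\b")


def split_sentences(text: str) -> List[str]:
    """Segment the data to be identified into sentences using line feeds."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def encode_sentence(sentence: str) -> List[float]:
    """Toy sentence feature: [looks like a question start, looks like an answer]."""
    return [
        1.0 if QUESTION_CUE.match(sentence) else 0.0,
        1.0 if ANSWER_CUE.match(sentence) else 0.0,
    ]


def aggregate(features: List[List[float]]) -> List[float]:
    """Document feature: [answer line present, question starts after the first answer line]."""
    answer_positions = [i for i, f in enumerate(features) if f[1] > 0]
    if not answer_positions:
        return [0.0, 0.0]
    first_answer = answer_positions[0]
    later_questions = sum(f[0] for f in features[first_answer + 1:])
    return [1.0, later_questions]


def classify(document_feature: List[float]) -> str:
    """Map the document feature to one of three assumed organization types."""
    has_answer, later_questions = document_feature
    if has_answer == 0.0:
        return "questions_only"
    if later_questions > 0:
        return "analysis_after_each_question"
    return "analysis_at_end"


def identify_organization_type(text: str) -> str:
    sentences = split_sentences(text)
    features = [encode_sentence(s) for s in sentences]
    return classify(aggregate(features))
```

Keeping the encoder, the aggregation, and the classifier as separate functions mirrors the separation described in the text, so any one of them could be replaced by a trained model without touching the others.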
In some embodiments of the present application, the target organization type represents that the test paper document includes: the test question text; the extraction processing module is further configured to perform question segmentation on the test question text in the test paper document to obtain the independent test questions in the test paper document; perform text classification on each test question to obtain the question type category of the test question; and integrate the question type categories and the test questions into the extraction information corresponding to the test paper document; the extraction information organizes, in a structured manner, the test questions in the test paper document and the question type category corresponding to each test question.
In some embodiments of the present application, the extraction processing module is further configured to perform sentence segmentation on the test question text in the test paper document according to line feed information to obtain a plurality of question sentences; perform text classification on each of the plurality of question sentences to obtain a classification category that represents whether the question sentence is a question boundary; and segment the test question text based on the segmentation boundaries determined from the classification categories to obtain the test questions in the test paper document.
In some embodiments of the present application, the extraction processing module is further configured to screen out, from the plurality of classification categories, the candidate categories that represent a question sentence being a question boundary; determine the question sentences corresponding to the candidate categories as the segmentation boundaries; and segment the test question text according to the segmentation boundaries to obtain the test questions in the test paper document.
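As a minimal sketch of the boundary-based segmentation described above, each line can be classified as a boundary or a non-boundary and a new question opened at every boundary line. The function `looks_like_boundary` is a hypothetical stand-in for the text classifier (here a simple check for a leading question number), not the model an embodiment would train.

```python
import re
from typing import Callable, List


def looks_like_boundary(sentence: str) -> bool:
    """Hypothetical classifier stand-in: a leading "3." or "3)" marks a boundary."""
    return re.match(r"^\d+[.)]\s", sentence) is not None


def segment_questions(
    text: str, is_boundary: Callable[[str], bool] = looks_like_boundary
) -> List[str]:
    """Split test question text into independent questions at boundary lines."""
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    # One classification category per question sentence.
    labels = [is_boundary(s) for s in sentences]

    questions: List[List[str]] = []
    for sentence, boundary in zip(sentences, labels):
        if boundary or not questions:
            questions.append([])  # a boundary sentence opens a new question
        questions[-1].append(sentence)
    return ["\n".join(q) for q in questions]
```

Because the classifier is passed in as a parameter, a learned model that also recognizes unnumbered boundaries could be substituted without changing the segmentation loop.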
In some embodiments of the present application, the extraction processing module is further configured to perform line-granularity labeling on each of the plurality of question sentences to obtain a label; the label includes: a question stem label or an option label; and integrate the question sentences corresponding to the question stem label and the question sentences corresponding to the option label into the test question of the test paper document.
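The line-level labeling and integration step might look like the sketch below. The `OPTION_CUE` pattern and the `Question` container are assumptions introduced for illustration; in an embodiment the label would come from a labeling model rather than a regular expression.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical cue standing in for a trained labeling model.
OPTION_CUE = re.compile(r"^[A-D][.)]\s")


def label_line(sentence: str) -> str:
    """Assign a question stem label or an option label to one line."""
    return "option" if OPTION_CUE.match(sentence) else "stem"


@dataclass
class Question:
    stem: List[str] = field(default_factory=list)
    options: List[str] = field(default_factory=list)


def integrate(sentences: List[str]) -> Question:
    """Group the labeled lines of one test question into stem and options."""
    question = Question()
    for sentence in sentences:
        if label_line(sentence) == "option":
            question.options.append(sentence)
        else:
            question.stem.append(sentence)
    return question
```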
In some embodiments of the present application, the target organization type further represents that the test paper document includes: a first parsed text following an individual test question; the extraction processing module is further configured to extract the answer analysis corresponding to the test question from the first parsed text that follows the test question; and integrate the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
In some embodiments of the present application, the target organization type further represents that the test paper document includes: a second parsed text at the end of the test paper document; the extraction processing module is further configured to segment the parsed content in the second parsed text to obtain independent parsed sub-contents; align the parsed sub-contents with the test questions, and extract the answer analysis corresponding to each test question from the parsed sub-content aligned with that test question; and integrate the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
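One way to picture the segmentation and alignment of an end-of-paper analysis section is the sketch below. It assumes, purely for illustration, that both the questions and the analysis items carry a leading sequence number and that alignment is done by matching those numbers; the text above does not prescribe this alignment criterion.

```python
import re
from typing import Dict, List, Optional

NUMBER = re.compile(r"^(\d+)[.)]\s*(.*)$")


def split_numbered(text: str) -> Dict[int, str]:
    """Split numbered text into {sequence number: content}."""
    items: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = NUMBER.match(line)
        if match:
            current = int(match.group(1))
            items[current] = [match.group(2)]
        elif current is not None:
            items[current].append(line)  # continuation of the current item
    return {number: "\n".join(parts) for number, parts in items.items()}


def align(questions: Dict[int, str], analyses: Dict[int, str]) -> List[dict]:
    """Pair each question with the analysis item that has the same number."""
    return [
        {"number": n, "question": questions[n], "analysis": analyses.get(n)}
        for n in sorted(questions)
    ]
```

A question with no matching analysis item keeps `None` in its record, so missing answers remain visible instead of being silently dropped.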
In some embodiments of the present application, the information extraction apparatus further includes: a subject identification module;
the subject identification module is used for performing subject identification on the document features to obtain subject information of the test paper document; the subject information is used at least for screening the classification model used in text classification and the labeling model used in line-granularity labeling.
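Screening models by subject can be illustrated with a simple registry lookup. The registry contents, the subject names, and the fallback to a default subject are all hypothetical; the lambdas merely stand in for subject-specific trained models.

```python
from typing import Callable, Dict, NamedTuple


class ModelPair(NamedTuple):
    classifier: Callable[[str], str]  # model used in text classification
    labeler: Callable[[str], str]     # model used in line-granularity labeling


# Hypothetical registry; each lambda stands in for a trained model.
MODELS: Dict[str, ModelPair] = {
    "math": ModelPair(lambda s: "calculation", lambda s: "stem"),
    "english": ModelPair(lambda s: "reading", lambda s: "stem"),
}


def select_models(subject: str, default: str = "math") -> ModelPair:
    """Screen the models to use according to the identified subject."""
    return MODELS.get(subject, MODELS[default])
```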
In some embodiments of the present application, the document obtaining module is further configured to, in response to an information extraction trigger operation for an extraction trigger identifier in a document uploading interface, obtain, from a document adding area of the document uploading interface, the document to be extracted corresponding to the information extraction trigger operation.
In some embodiments of the present application, the information extraction apparatus further includes: an information display module;
the information display module is used for parsing the extraction information to obtain the test questions of the test paper document, the question sequence number corresponding to each test question, and the answer analysis corresponding to each test question; and displaying the test questions in a test question display area of an information display interface, the question sequence numbers in a sequence number display area of the information display interface, and the answer analyses in an analysis content area of the information display interface.
An embodiment of the present application provides an information extraction device, including:
a memory for storing executable information extraction instructions;
and the processor is used for realizing the information extraction method provided by the embodiment of the application when executing the executable information extraction instruction stored in the memory.
The embodiment of the application provides a computer-readable storage medium, which stores executable information extraction instructions and is used for causing a processor to execute the executable information extraction instructions so as to realize the information extraction method provided by the embodiment of the application.
The embodiments of the present application have the following beneficial effects: the information extraction device first acquires the document to be extracted in response to the information extraction triggering operation, and then identifies the target organization type of the document to be extracted, thereby determining what content the document contains and how that content is laid out. It then performs, on the document to be extracted, the extraction processing corresponding to the target organization type, so that information is extracted in a manner suited to the document and the information to be archived can be extracted successfully. The whole process requires no manual participation and is fast, which improves both the effectiveness and the efficiency of information extraction and, ultimately, its degree of intelligence.
Drawings
FIG. 1 is a first schematic diagram of template-based information extraction;
FIG. 2 is a second schematic diagram of template-based information extraction;
FIG. 3 is a schematic diagram of rule-based extraction;
FIG. 4 is a schematic illustration of answers after a test question;
FIG. 5 is a schematic view of an answer at the end of a test paper;
FIG. 6 is a block diagram of an alternative architecture of an information extraction system provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an information extraction device provided in an embodiment of the present application;
FIG. 8 is a schematic flow chart of an alternative information extraction method provided in the embodiments of the present application;
FIG. 9 is a schematic diagram of predicting a target organization type provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a process of performing line-granularity labeling on each question sentence according to an embodiment of the present application;
FIG. 11 is a schematic diagram of the labels of question sentences provided in an embodiment of the present application;
FIG. 12 is a diagram illustrating extracted answer parsing according to an embodiment of the present application;
FIG. 13 is a schematic illustration of subject identification provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a classification model provided in an embodiment of the present application;
FIG. 15 is a first schematic structural diagram of an encoder according to an embodiment of the present disclosure;
FIG. 16 is a second schematic structural diagram of an encoder according to an embodiment of the present application;
FIG. 17 is a third schematic structural diagram of an encoder provided in the embodiment of the present application;
FIG. 18 is a schematic diagram of a document upload interface provided by an embodiment of the present application;
FIG. 19 is a schematic diagram of an information presentation interface provided by an embodiment of the present application;
FIG. 20 is a schematic flowchart of extracting questions and answers from a test paper according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first", "second", and the like, are only to distinguish similar objects and do not denote a particular order, but rather the terms "first", "second", and the like may be used interchangeably with the order specified, where permissible, to enable embodiments of the present application described herein to be practiced otherwise than as specifically illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
2) Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs image processing so that the result is an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation, and the like, and also include common biometric technologies such as face recognition and fingerprint recognition.
3) Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or realizes human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
4) Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that contain convolution computations and have a deep structure. A CNN has representation learning capability and can perform shift-invariant classification of input information according to its hierarchical structure. CNNs can be used for both supervised and unsupervised learning.
5) Text CNN is a convolutional neural network applied to text encoding and is commonly used to encode the sentences of a text.
6) A Recurrent Neural Network (RNN) is a class of neural networks that takes sequence data as input, recurses along the direction in which the sequence evolves, and connects all nodes in a chain. RNNs can be used to encode sequence data; common models include the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU).
7) A Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification on data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples. SVM is a traditional machine learning classification method.
8) Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent; it is a traditional machine learning classification method.
9) The Transformer model is a deep neural network based on the self-attention mechanism and is commonly used to encode character strings and images.
10) The BERT model is a model pre-trained on massive amounts of text based on the Transformer model.
11) Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field therefore involves natural language, that is, the language people use every day, and is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
12) Optical Character Recognition (OCR) refers to the process of examining printed characters, determining their shapes by detecting dark and light patterns, and then recognizing the characters in the image as character strings by a character recognition method.
13) Sequence labeling requires predicting a category for each element in a sequence and is widely applied; for example, word segmentation, part-of-speech tagging (POS tagging), Named Entity Recognition (NER), keyword extraction, semantic role labeling, slot filling, and the like are all sequence labeling tasks in nature.
Information extraction refers to extracting, from an input document, the information that is to be archived and stored; for example, extracting questions and answer analyses from an input test paper document, or extracting a summary of the knowledge points related to a question from an input teaching material image.
In the related art, information is generally extracted from an input document or image by template-based extraction or rule-based extraction.
Template-based extraction refers to pre-editing the input document according to the requirements of a template format, that is, adding a specific tag to each part of the document, and then extracting the different types of information according to the tags. Illustratively, FIG. 1 is a first schematic diagram of template-based information extraction. Referring to FIG. 1, the input document is a math test question 1-1, from which the question text, the answer, the analysis, and the end of the test question are to be extracted. Before extraction, a staff member needs to tag each part of the math test question 1-1 according to the template format requirements 1-2 (including the (topic) tag, which indicates the question stem and options 1-21; the (answer) tag, which indicates the answer 1-22; the (analysis) tag, which indicates the analysis content 1-23; and the (end) tag, which indicates the end 1-24 of the test question), that is, add the corresponding tags to content 1-11, content 1-12, content 1-13, and so on, so that the computer can extract the different parts of the test question according to the tags and store them in the test question library.
FIG. 2 is a second schematic diagram of template-based information extraction. Referring to FIG. 2, the input document is an English test question 2-1. A staff member needs to tag each part of the English test question 2-1 according to the template format requirements in FIG. 1, that is, add the question text tag to content 2-11 (including question 1, question 2, question 3, question 4, and so on), add the answer tag to content 2-12 (including question 1, question 2, question 3, question 4, and so on), add the analysis tag to content 2-13, and so on; the computer can then extract the different parts of the English test question according to the tags and store them in the test question library.
As can be seen from FIG. 1 and FIG. 2, template-based extraction requires a large amount of manual editing, so the efficiency of information extraction is low.
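To make the template-based approach concrete, the sketch below parses a document that has already been tagged by hand. The bracketed tag spellings `[topic]`, `[answer]`, `[analysis]` and `[end]` are hypothetical placeholders for the markers required by the template format; the point is that the parser is trivial only because a person has inserted every tag beforehand.

```python
import re
from typing import Dict, List, Optional

# Hypothetical tag spellings; the actual template format defines its own markers.
TAG = re.compile(r"^\[(topic|answer|analysis|end)\]\s*(.*)$")


def extract_by_template(document: str) -> List[Dict[str, str]]:
    """Collect the tagged parts of each test question from a pre-edited document."""
    questions: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    active: Optional[str] = None  # tag whose content is currently being read
    for raw_line in document.splitlines():
        line = raw_line.strip()
        match = TAG.match(line)
        if match:
            active, rest = match.group(1), match.group(2)
            if active == "end":
                questions.append(current)  # the end tag closes one test question
                current, active = {}, None
            else:
                current[active] = rest
        elif active is not None and line:
            current[active] += "\n" + line  # untagged line continues the active part
    return questions
```

Text that carries no tags yields no records at all, which is why this approach depends entirely on manual editing.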
Rule-based extraction relies on ordering regularities in the document. For example, when information is extracted from a test paper, the ranges of the different test questions are segmented according to the question number features of the test questions, and the test question information is then extracted directly. However, this approach generally requires the organization of the document to be regular, for example, requires the question numbers of the test questions to increase continuously. Question numbers in science test papers usually do increase continuously, but question numbers in liberal arts test papers may be organized in many ways, so the ranges of the test questions cannot be divided effectively and the effectiveness of information extraction is low.
Illustratively, FIG. 3 is a schematic diagram of rule-based extraction. In the liberal arts test question 3-1, the poetry appreciation question 3-11 includes two sub-questions, namely sub-question 6 and sub-question 7. If the test question ranges are divided according to question numbers, it is very likely that the ranges of the two sub-questions cannot be divided correctly (for example, sub-question 7 is extracted as an independent question and loses part of its question text), and thus information cannot be extracted effectively.
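The failure mode described for FIG. 3 can be reproduced with a small sketch of a question-number rule. The rule used here (a line beginning with digits followed by "." or ")" opens a new question) is an assumed example of such a rule, not one taken from the related art verbatim.

```python
import re
from typing import Dict, List, Optional

QUESTION_NUMBER = re.compile(r"^(\d+)[.)]\s")


def split_by_question_number(text: str) -> Dict[int, str]:
    """Rule-based split: every line that starts with a number opens a new question."""
    questions: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for line in text.splitlines():
        match = QUESTION_NUMBER.match(line)
        if match:
            current = int(match.group(1))
            questions[current] = []
        if current is not None:
            questions[current].append(line)
    return {n: "\n".join(lines) for n, lines in questions.items()}
```

When a shared passage (for example, a poem) precedes sub-questions 6 and 7, this rule extracts sub-question 7 as a standalone question without the passage it depends on, which is the loss of question text noted above.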
Further, the organization of the content may vary from input document to input document. For example, for a test paper, the test paper mainly consists of test questions and answers, and the organization form of the test questions and the answers in different test papers is different, and can be divided into three types, namely, answers after the test questions, answers at the end of the test paper and no answers.
Illustratively, FIG. 4 is a schematic diagram of answers after a test question. Referring to FIG. 4, in the test paper 4-1, the test question 4-11 is followed by the answer analysis 4-12 and the test point summary 4-13 for that question. FIG. 5 is a schematic diagram of answers at the end of a test paper. In the first half 5-11 of the test paper 5-1, a plurality of test questions, namely test questions 5-111 to 5-114, are set; in the second half 5-12 of the test paper 5-1, the answers corresponding to the test questions, namely answers 5-121 to 5-124, are set.
Because different input documents are organized differently, rule-based extraction cannot effectively extract every part of each test question; that is, the efficiency of information extraction is low.
In summary, in the related art, the efficiency and effectiveness of extracting information from the input document are low, and finally the intelligence degree of information extraction is low.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, Internet of Vehicles, and smart traffic.
The scheme provided by the embodiments of the present application involves artificial intelligence technologies such as computer vision, and is specifically explained by the following embodiments.
The embodiment of the application provides an information extraction method, an information extraction device, information extraction equipment and a computer readable storage medium, which can improve the intelligence degree of information extraction. An exemplary application of the information extraction device provided in the embodiment of the present application is described below, and the information extraction device provided in the embodiment of the present application may be implemented as various types of terminals such as a notebook computer, a tablet computer, a desktop computer, and the like, and may also be implemented as a server. Next, an exemplary application when the information extraction device is implemented as a server will be described.
Referring to fig. 6, fig. 6 is an alternative architecture diagram of an information extraction system provided in the embodiment of the present application. In order to support an information extraction application, in the information extraction system 100, the terminals 400 (the terminal 400-1 and the terminal 400-2 are exemplarily shown) are connected to the server 200 through the network 300, the terminals 400 can be regarded as the front end of the server 200, and the network 300 can be a wide area network or a local area network, or a combination of the two.
The terminal 400-1 is configured to receive an information extraction triggering operation of a user on the graphical interface 400-11, obtain a document to be extracted specified by the user on the graphical interface 400-11, and send the information extraction triggering operation and the document to be extracted to the server 200 through the network 300.
The server 200 is configured to respond to the information extraction triggering operation, and acquire a to-be-extracted document to be subjected to information extraction, which corresponds to the information extraction triggering operation; identifying the organization type of the document to be extracted to obtain a target organization type of the document to be extracted, wherein the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content; and extracting the file to be extracted according to the target organization type to obtain the extraction information of the file to be extracted. Thus, the server 200 completes the information extraction.
The terminal 400-2 sends an information viewing request to the server 200, the server 200 responds to the information viewing request of the terminal 400-2 and sends the extracted information to the terminal 400-2, and the terminal 400-2 displays the extracted information on the graphical interface 400-21 for viewing by a user.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present invention.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an information extraction device according to an embodiment of the present application, and the information extraction device 500 shown in fig. 7 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the information extraction device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 7.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor, a Digital Signal Processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like, wherein the general purpose processor may be a microprocessor or any conventional processor.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Fidelity (Wi-Fi), and Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. Fig. 7 illustrates an information extraction apparatus 555 stored in the memory 550, which may be software in the form of programs and plug-ins, and which includes the following software modules: the document acquisition module 5551, the type identification module 5552, the extraction processing module 5553, the subject identification module 5554, and the information presentation module 5555. These modules are logical, and thus may be arbitrarily combined or further divided according to the functions implemented.
The functions of the respective modules will be explained below.
In other embodiments, the information extraction device provided in this embodiment of the present application may be implemented in hardware. For example, the information extraction device may be a processor in the form of a hardware decoding processor, which is programmed to execute the information extraction method provided in this embodiment of the present application; such a processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Illustratively, an embodiment of the present application provides an information extraction device, including:
a memory for storing executable information extraction instructions;
and the processor is used for realizing the information extraction method provided by the embodiment of the application when executing the executable information extraction instruction stored in the memory.
In the following, the information extraction method provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the information extraction device provided by the embodiment of the present application.
Referring to fig. 8, fig. 8 is an alternative flow chart of an information extraction method provided in the embodiment of the present application, and will be described with reference to the steps shown in fig. 8.
S101, responding to the information extraction triggering operation, and acquiring a document to be extracted, which is waiting for information extraction and corresponds to the extraction triggering operation.
The embodiment of the application is implemented in a scenario of extracting information from documents, for example, extracting information from a real test paper uploaded by a user, or extracting information from teaching-aid materials in a storage space. In the embodiment of the application, the information extraction device first judges whether the information extraction triggering operation of the user is received. When the information extraction device receives the information extraction triggering operation, it determines that the information extraction process needs to be started, and obtains the document to be extracted corresponding to the information extraction triggering operation.
It can be understood that the document to be extracted corresponding to the information extraction triggering operation is the document acquired by the information extraction device after the information extraction triggering operation is triggered by the user; therefore, the document to be extracted is the document that the user has selected for information extraction and that is waiting for information extraction. The document to be extracted may be a test paper document, a teaching-aid material document, or another type of document, and the application is not limited herein.
It should be noted that the document to be extracted may be uploaded by the user, or may be selected from the storage space of the information extraction device itself, and the application is not limited herein.
The information extraction triggering operation may be triggered by a user. For example, when the information extraction device is implemented as a terminal, a document uploading interface can be provided for a user, and the user can upload a document to be extracted on the document uploading interface and trigger information extraction triggering operation; for another example, when the information extraction device is implemented as a server, a front end of the server may provide a document uploading interface for a user, and the user may upload a document to be extracted and trigger the information extraction triggering operation on the document uploading interface of the front end.
The information extraction triggering operation may also be triggered by the information extraction device itself. For example, the information extraction device may periodically screen out, from its own storage space, a document on which information extraction has not yet been performed, and use it as the document to be extracted.
S102, identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted.
The information extraction device identifies the organization type of the document to be extracted, so as to determine what content the document to be extracted contains and in what structural form the different contents are laid out, thereby obtaining the target organization type. That is to say, in the embodiment of the present application, the target organization type characterizes the contents included in the document to be extracted and the layout structure between the included contents.
It can be understood that the information extraction device may identify the rich text data extracted from the document to be extracted to obtain the target organization type, or may directly compare the document to be extracted with the document template with the set structure type to determine the target organization type, which is not limited herein.
In some embodiments, the document to be extracted is a test paper document, and at this time, the target organization type may describe the composition structure of the test paper document, that is, whether the analysis of the test questions is included, which part of the test paper document the analysis of the test questions is in, and the like. For example, the target organization type may characterize the test paper document as containing any one or more of test question text, first parsed text in the middle of the test question text, and second parsed text at the end of the test paper document.
In other embodiments, the document to be extracted is a teaching-aid material document, and at this time, the target organization type may specify the composition structure of the teaching-aid material, that is, whether the teaching-aid material document contains knowledge point explanations, key point introductions, and practice problems, as well as the structural relationship between the contained contents.
S103, extracting the document to be extracted according to the target organization type to obtain the extraction information of the document to be extracted.
Different documents have different organization types, which means that the information contained in different documents and the layout structure of that information differ, and the information to be filed can be accurately and effectively extracted only by using a suitable extraction mode. Therefore, in the embodiment of the application, the information extraction device determines the processing mode required by the document to be extracted according to the target organization type of the document to be extracted, and then extracts the document to be extracted by using that processing mode, so that information extraction can be performed effectively without manual marking, and the extraction information is obtained.
Taking test paper documents as an example, the structure types of different test paper documents are different, for example, some test paper documents have answer analyses following test question questions, which are suitable for directly extracting texts behind the test question questions as answer analyses, and some test paper documents have answer analyses uniformly arranged at the tail of the documents and are suitable for segmenting the texts at the tail of the documents to extract the answer analyses.
In the embodiment of the application, the information extraction device first responds to the information extraction triggering operation to obtain the document to be extracted, and then identifies the target organization type of the document to be extracted, so as to determine the content contained in the document to be extracted and the structural layout between the contained content. The information extraction device then performs, on the document to be extracted, the extraction processing corresponding to the target organization type, so that information extraction is performed in a more appropriate mode and the information to be filed can be successfully extracted from the document to be extracted. The whole information extraction process requires no manual participation and is fast, so that both the effectiveness and the efficiency of information extraction are improved, and the degree of intelligence of information extraction is finally improved.
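By way of illustration only, the relationship between S102 and S103 can be sketched as a lookup from the identified organization type to a processing mode. The type labels, the handler functions, and the "ANSWERS" marker line below are hypothetical and are not taken from the embodiment; the type identification step is passed in as a callable because the embodiment realizes it with a trained model.

```python
from typing import Callable, Dict

# Hypothetical organization-type labels (assumed for this sketch only).
TEXT_ONLY = "text_only"
TAIL_ANALYSIS = "tail_analysis"


def extract_text_only(document: str) -> dict:
    # The document holds only question text, so no analysis is extracted.
    return {"questions": document, "analyses": ""}


def extract_tail_analysis(document: str) -> dict:
    # Toy rule: everything after a hypothetical "ANSWERS" marker line is
    # treated as the analysis placed at the end of the document.
    head, _, tail = document.partition("\nANSWERS\n")
    return {"questions": head, "analyses": tail}


# One processing mode per organization type.
HANDLERS: Dict[str, Callable[[str], dict]] = {
    TEXT_ONLY: extract_text_only,
    TAIL_ANALYSIS: extract_tail_analysis,
}


def extract(document: str, identify_type: Callable[[str], str]) -> dict:
    target_type = identify_type(document)   # S102: identify the organization type
    handler = HANDLERS[target_type]         # choose the matching processing mode
    return handler(document)                # S103: extraction processing
```

Keeping the handlers in a table means that a new organization type only requires a new entry, without changing the dispatch logic.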
In some embodiments of the present application, the document to be extracted includes a test paper document. At this time, identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted, that is, the specific implementation process of S102, may include S1021-S1022, as follows:
S1021, extracting rich text data of the test paper document to obtain data to be identified.
The test paper document can be in a picture format or a text format. When the test paper document is in a picture format, the information extraction device can perform character recognition on the test paper document by using optical character recognition (OCR), and the extracted data is the data to be identified; when the test paper document is in a text format, the information extraction device may extract the data to be identified from the test paper document by using a dedicated document parsing method.
That is to say, in the embodiment of the present application, the data to be recognized is data in a rich text format, which may be data in a json format or data in an rtf format, and the present application is not limited herein.
S1022, predicting the organization type of the data to be identified to obtain the target organization type of the test paper document.
After the information extraction device obtains the data to be identified, it identifies the content contained in the data to be identified and the layout structure among the different contents, so as to identify the organization type; the identified organization type is the target organization type.
It should be noted that the target organization type characterizes the test paper document as including any one or more of the test question text, a first parsed text that follows individual test questions, and a second parsed text that is at the end of the test paper document.
In some embodiments, the target organization type may characterize the test paper document as including only the test question text, in which case the test paper document may be understood as a plain test paper; in other embodiments, the target organization type may characterize the test paper document as including the test question text and a first parsed text that follows individual test questions, that is, the test paper document may be understood as a test paper in which the answer analysis immediately follows each test question; in other embodiments, the target organization type may also characterize the test paper document as including the test question text and a second parsed text at the end of the test paper document, that is, the test paper document may be understood as a test paper in which the answer analyses are collected at the end. In still other embodiments, the target organization type may characterize the test paper document as including a first parsed text following individual test questions and a second parsed text at the end of the test paper document; in this case the test paper document may be viewed as material that only explains test questions, where the first parsed text may give a short answer and the second parsed text gives a detailed analysis.
It should be noted that, as long as the parsed text is located after the test question, the parsed text is the first parsed text, and thus, in some cases, the information extraction device may obtain a plurality of first parsed texts.
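Since the target organization type may cover any one or more of the three kinds of content, one possible way to represent it, given here only as a sketch, is a set of combinable flags. The class and member names below are assumptions made for illustration.

```python
from enum import Flag, auto


class OrganizationType(Flag):
    """Hypothetical flags for the content a test paper document may contain."""
    QUESTION_TEXT = auto()      # test question text
    INLINE_ANALYSIS = auto()    # first parsed text, placed after a test question
    TAIL_ANALYSIS = auto()      # second parsed text, placed at the end of the document


def describe(org_type: OrganizationType) -> str:
    # List every kind of content that the combined flag value contains.
    parts = []
    if OrganizationType.QUESTION_TEXT in org_type:
        parts.append("question text")
    if OrganizationType.INLINE_ANALYSIS in org_type:
        parts.append("analysis after each question")
    if OrganizationType.TAIL_ANALYSIS in org_type:
        parts.append("analysis at the end of the document")
    return ", ".join(parts)


# A test paper whose answer analyses are collected at the end.
paper_with_tail = OrganizationType.QUESTION_TEXT | OrganizationType.TAIL_ANALYSIS
```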
In the embodiment of the application, the information extraction device first extracts rich text data from the test paper document to obtain the data to be identified, and then predicts the organization type of the data to be identified to obtain the target organization type, so as to determine what kind of extraction processing needs to be performed on the test paper document subsequently.
In some embodiments of the present application, predicting the organization type of the data to be identified to obtain the target organization type of the test paper document, that is, the specific implementation process of S1022, may include S1022a-S1022d, as follows:
S1022a, performing sentence segmentation on the data to be identified according to the line feed information to obtain a plurality of sentence information.
When the information extraction device identifies the target organization type, the line feed information can be determined from the data to be identified, so that the independent sentence information in the data to be identified can be divided according to the line feed information. Since a test paper document necessarily contains more than one sentence, the information extraction device obtains a plurality of sentence information.
It is to be understood that the line feed information may be a line break, that is, the information extraction device may cut out a plurality of sentence information according to the line breaks in the data to be identified. The line feed information may also be a line number, so that the information extraction device can cut out a plurality of sentence information according to different line numbers.
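The two forms of line feed information can be sketched as follows. This is a minimal illustration: the function names are hypothetical, dropping blank lines is an assumption, and the line-number variant assumes that the data to be identified arrives as ordered (line number, text fragment) pairs, which the embodiment does not specify.

```python
from itertools import groupby
from typing import List, Tuple


def split_by_line_break(data: str) -> List[str]:
    # Cut at line breaks; blank lines are dropped (an assumption of this sketch).
    return [line.strip() for line in data.splitlines() if line.strip()]


def split_by_line_number(fragments: List[Tuple[int, str]]) -> List[str]:
    # Fragments are assumed to be ordered, so fragments that share a line
    # number are adjacent and can be merged into one sentence.
    return [
        "".join(text for _, text in group)
        for _, group in groupby(fragments, key=lambda fragment: fragment[0])
    ]
```

For example, split_by_line_number could be applied to recognition output in which each recognized text fragment carries the number of the line it was found on.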
S1022b, performing feature coding on the plurality of sentence information to obtain a plurality of sentence features.
The information extraction device performs feature coding on each sentence information, so as to obtain the sentence feature corresponding to each sentence information. When the information extraction device completes feature coding on the plurality of sentence information, a plurality of sentence features corresponding to the plurality of sentence information one to one are obtained.
In some embodiments, the information extraction device may directly utilize a plurality of single-layer encoders to encode the plurality of sentence information in parallel, and after each sentence information is encoded once, the sentence feature corresponding to each sentence information is obtained. The number of single-layer encoders may be the same as or different from the number of sentence information.
In other embodiments, the information extraction device may further use a multi-level encoder to encode the plurality of sentence information. Specifically, the information extraction device first uses a sentence encoder in the multi-level encoder to encode the plurality of sentence information once to obtain a primary encoding result corresponding to the plurality of sentence information, and then uses a line granularity encoder in the multi-level encoder to encode the primary encoding result to obtain the plurality of sentence features.
Further, the sentence encoder may be a TextCNN encoding network, an RNN encoding network, a Transformer encoding network, or a BERT encoding network, and the present application is not limited thereto. The line granularity encoder may be an LSTM encoding network or a Transformer encoding network, and the application is not limited herein.
S1022c, aggregating the sentence features to obtain the document feature of the test paper document.
Then, the information extraction device aggregates the plurality of sentence characteristics to obtain the document characteristics capable of representing the whole test paper document. It can be understood that the information extraction device may obtain the document feature by using a plurality of sentence features to form a feature matrix, and then performing a pooling operation on the feature matrix, or may obtain the document feature by splicing a plurality of sentence features end to end, which is not limited herein.
It is understood that the pooling operation in the embodiment of the present application may be maximum pooling, mean pooling, or other commonly used pooling, and the present application is not limited thereto.
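The aggregation options mentioned above can be sketched in plain Python, treating the feature matrix as a list of equal-length sentence feature vectors. The function names are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def max_pool(sentence_features: List[Vector]) -> Vector:
    # Column-wise maximum over the feature matrix (one row per sentence).
    return [max(column) for column in zip(*sentence_features)]


def mean_pool(sentence_features: List[Vector]) -> Vector:
    # Column-wise mean over the feature matrix.
    return [sum(column) / len(column) for column in zip(*sentence_features)]


def concatenate(sentence_features: List[Vector]) -> Vector:
    # Splice the sentence features end to end.
    return [value for row in sentence_features for value in row]
```

Note that pooling yields a document feature whose length equals the sentence feature dimension regardless of how many sentences there are, whereas end-to-end splicing yields a length that grows with the number of sentences.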
S1022d, classifying the organization type of the document features to obtain the target organization type of the test paper document.
After obtaining the document features, the information extraction device inputs the document features into a classifier for classifying the organization types, and the output of the classifier is the target organization type of the test paper document.
Illustratively, fig. 9 is a schematic diagram of predicting a target organization type according to an embodiment of the present application. As shown in fig. 9, the information extraction device first inputs a plurality of sentence information 9-1 (including sentences 9-11 to 9-1n), obtained by dividing the data to be identified, into n sentence encoders 9-2, then inputs the encoding results of the n sentence encoders 9-2 into a line granularity encoder 9-3 to obtain a plurality of sentence features 9-4 (i.e., h1 to hn), then splices the plurality of sentence features 9-4 by rows to obtain a feature matrix, inputs the feature matrix into a pooling layer 9-5 to obtain the document features, and finally inputs the document features into a classifier 9-6 for classifying the organization types, so as to obtain the target organization type of the test paper document, which in this example characterizes the test paper document as including the test question text 9-7.
In the embodiment of the application, the information extraction device first segments the data to be identified, then performs feature coding on the segmented sentence information to obtain a plurality of sentence features, then aggregates the sentence features into document features, and classifies the organization type of the document features to obtain the final target organization type, so that the test paper document can subsequently be processed according to the target organization type.
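The data flow of S1022a-S1022d can be sketched with pluggable stages. In the embodiment the encoders and the classifier are trained networks; the toy stand-ins below (a two-dimensional hand-written sentence feature, an identity line granularity encoder, and a threshold rule in place of the classifier) are assumptions used only to make the sketch runnable, as are all function names and type labels.

```python
from typing import Callable, List

Vector = List[float]


def predict_organization_type(
    data: str,
    sentence_encoder: Callable[[str], Vector],
    line_encoder: Callable[[List[Vector]], List[Vector]],
    pool: Callable[[List[Vector]], Vector],
    classifier: Callable[[Vector], str],
) -> str:
    sentences = [s for s in data.splitlines() if s.strip()]   # S1022a
    primary = [sentence_encoder(s) for s in sentences]        # S1022b, sentence encoder
    features = line_encoder(primary)                          # S1022b, line granularity encoder
    document_feature = pool(features)                         # S1022c
    return classifier(document_feature)                       # S1022d


def toy_sentence_encoder(sentence: str) -> Vector:
    # Toy feature: [starts with a digit, starts with the word "Analysis"].
    s = sentence.strip()
    return [float(s[:1].isdigit()), float(s.startswith("Analysis"))]


def identity_line_encoder(vectors: List[Vector]) -> List[Vector]:
    # Placeholder for a contextual encoder such as an LSTM.
    return vectors


def max_pool(vectors: List[Vector]) -> Vector:
    return [max(column) for column in zip(*vectors)]


def toy_classifier(document_feature: Vector) -> str:
    # Toy rule in place of a trained classifier.
    return "questions_with_analysis" if document_feature[1] > 0 else "questions_only"
```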
In some embodiments of the present application, the target organization type characterizes the test paper document as including the test question text. At this time, performing the extraction processing corresponding to the target organization type on the document to be extracted to obtain the extraction information of the document to be extracted, that is, the specific implementation process of S103, may include S1031-S1033, as follows:
S1031, performing question segmentation on the test question text in the test paper document to obtain independent test questions in the test paper document.
When the test paper document comprises the test question text, the information extraction equipment can determine the boundary of each independent test question in the test question text, and segment the test question text according to the boundaries of different test question questions to obtain different test question questions in the test paper document.
It can be understood that the information extraction device may determine the boundaries of different test questions by performing text classification on the test question text line by line, by performing line-granularity sequence labeling on the test question text, or by using the consecutive numbering labels appearing in the test question text, which is not limited herein.
Further, the test question text may include some questions with a complicated structure, for example, a reading comprehension question in which several small questions are asked about one article, or a free-response question with several small questions. In this case, when the information extraction device performs question segmentation, the big questions need to be segmented first, and then the small questions. The information extraction device may first segment the big questions according to question numbers of the same type, for example, by a rule based on capitalized question numbers, and then segment the small questions by line-by-line text classification or line-granularity sequence labeling. The information extraction device may also process the whole test question text through line-by-line text classification or line-granularity sequence labeling, so that after the type of each sentence information is determined, the big questions and the small questions are respectively segmented based on the type of each sentence information.
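The two-level segmentation can be sketched as below. As a simplification, both levels are driven by numbering rules here, whereas the embodiment may use line-by-line text classification or line-granularity sequence labeling for the small questions. The two numbering patterns (Roman numerals for big questions, Arabic numerals for small questions) are hypothetical examples, not the rules of the embodiment.

```python
import re
from typing import Dict, List

# Hypothetical numbering rules, assumed for this sketch only.
BIG_QUESTION = re.compile(r"^[IVX]+\.\s")   # e.g. "I. Reading"
SMALL_QUESTION = re.compile(r"^\d+\.\s")    # e.g. "1. Who is the author?"


def segment(lines: List[str]) -> List[Dict[str, list]]:
    big_questions: List[Dict[str, list]] = []
    for line in lines:
        if BIG_QUESTION.match(line):
            # A new big question starts; its stem begins with this line.
            big_questions.append({"stem": [line], "small": []})
        elif not big_questions:
            # Text before the first big question (e.g. a title) is skipped.
            continue
        elif SMALL_QUESTION.match(line):
            # A new small question starts inside the current big question.
            big_questions[-1]["small"].append([line])
        elif big_questions[-1]["small"]:
            # Continuation line of the current small question (e.g. an option).
            big_questions[-1]["small"][-1].append(line)
        else:
            # Continuation line of the big question stem (e.g. the article).
            big_questions[-1]["stem"].append(line)
    return big_questions
```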
S1032, text classification is carried out on the test question to obtain the question type category of the test question.
After the test questions are segmented, the information extraction device performs text classification on each test question to identify its question type, so as to obtain the question type category of the test question, for example, determining whether the test question is a fill-in-the-blank question, a multiple-choice question, a free-response question, or the like.
It can be understood that the information extraction device can realize the text classification of a test question by inputting the test question into a trained text classification model. The text classification model may be a conventional text classification model, such as a support vector machine (SVM) model or a naive Bayes (NB) model, or a text classification model based on deep learning, such as a TextCNN model, an RNN model, or a BERT model, which is not limited herein.
S1033, integrating the question type categories and the question questions into extraction information corresponding to the test paper documents.
The information extraction device arranges each test question together with its corresponding question type category, so that an integration result corresponding to each test question is obtained, and then combines the integration results of all the test questions according to the order of the test questions, so that the extraction information of the test paper document is obtained. In this way, the different test questions and their corresponding question type categories are stored in the extraction information in the order of the test questions and have a certain structure, so that the extraction information structures the test questions and the corresponding question type categories in the test paper document.
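The integration step can be sketched as follows. The record fields ("index", "type", "question") and the keyword rules standing in for the trained text classification model are assumptions made for illustration; the embodiment does not prescribe an output format.

```python
import json
from typing import Callable, Dict, List


def toy_question_type(question: str) -> str:
    # Keyword rules used only as a stand-in for a trained classifier.
    if "____" in question:
        return "fill_in_the_blank"
    if "\nA." in question:
        return "multiple_choice"
    return "free_response"


def integrate(questions: List[str], classify: Callable[[str], str]) -> List[Dict]:
    # Pair each test question with its question type category, keeping the
    # original order of the test questions.
    return [
        {"index": i, "type": classify(question), "question": question}
        for i, question in enumerate(questions, start=1)
    ]


questions = ["1. 2+2=____", "2. Pick one.\nA. x\nB. y", "3. Prove the claim."]
extraction_info = integrate(questions, toy_question_type)
serialized = json.dumps(extraction_info, ensure_ascii=False, indent=2)
```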
In the embodiment of the application, the information extraction equipment firstly extracts independent test questions from test question texts, then performs text classification on the test questions to identify the question type categories of the test questions, and finally integrates the question type categories and the test questions to obtain structured extraction information, so that the information extraction process of the document to be extracted is realized.
In some embodiments of the present application, topic segmentation is performed on a test question text in a test paper document to obtain an independent test question in the test paper document, that is, a specific implementation process of S1031 may include: s1031a-S1031c, as follows:
S1031a, performing sentence segmentation on the test question text in the test paper document according to the line feed information to obtain a plurality of question sentences.
It is understood that the process is similar to the specific implementation process of S1022a, except that the execution object is a test question text, and thus the specific implementation process of this step is not described herein again.
S1031b, performing text classification on each question sentence in the plurality of question sentences to obtain a classification category representing whether each question sentence is a question boundary.
The information extraction device classifies the plurality of question sentences in the text dimension, so as to judge whether each question sentence is a question boundary and obtain the classification category corresponding to each question sentence. That is to say, in the embodiment of the present application, the classification category represents whether the corresponding question sentence is a boundary between different questions.
It is understood that the text classification model used here may be a conventional text classification model, such as a support vector machine (SVM) model or a naive Bayes (NB) model, or may be a text classification model based on deep learning, such as a TextCNN model, an RNN model, or a BERT model, and the present application is not limited thereto.
S1031c, segmenting the test question text based on the segmentation boundaries determined by the classification categories to obtain the test questions in the test paper document.
According to the classification category indicating whether each question sentence is a question boundary, the information extraction device can determine which question sentences can be used as segmentation boundaries during segmentation. Then, the information extraction device segments the test question text according to the determined segmentation boundaries, so that independent test questions are obtained.
In the embodiment of the application, the information extraction equipment cuts out different question sentences from the test question text, then performs text classification on the different question sentences to determine whether the question sentences are question boundaries or not, and finally cuts out the test question text according to the determined question boundaries to obtain independent test question questions so as to facilitate subsequent generation of the extraction information.
In some embodiments of the present application, the segmenting the test question text based on the segmentation boundary determined by the classification category to obtain the test question in the test paper document, that is, the specific implementation process of S1031c may include: S201-S203, as follows:
S201, screening out, from the plurality of classification categories, candidate categories representing that the question sentence is a question boundary.
The information extraction equipment analyzes each classification category to determine what each classification category represents, and then screens out the category which represents the topic sentence as the topic boundary from a plurality of classification categories to serve as a candidate category.
S202, determining the question sentences corresponding to the candidate categories as segmentation boundaries.
The classification category and the question sentence have a corresponding relationship, and the information extraction device can find the question sentence corresponding to the candidate category from the multiple question sentences by using the corresponding relationship, and use the found question sentences as segmentation boundaries so as to perform subsequent segmentation of the question.
S203, segmenting the test question text according to the segmentation boundary to obtain the test question in the test paper document.
After the information extraction device obtains the segmentation boundary, the test question text is segmented into different sub-texts by taking the segmentation boundary as a boundary, and characters and images in the sub-texts are identified, so that an independent test question is obtained.
In the embodiment of the application, the information extraction equipment can screen the candidate categories from the multiple categories, then determine question sentences corresponding to the candidate categories as the segmentation boundaries when the questions are segmented, and finally segment the test question texts according to the segmentation boundaries to obtain the individual test question questions so as to facilitate the subsequent question type identification.
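The flow of S1031a to S1031c (with S201 to S203) can be illustrated by a minimal sketch. It rests on assumptions that the application does not state: a boundary sentence is taken to be the first sentence of a new test question, and the trained text classification model is replaced by a hypothetical regular-expression stand-in, `looks_like_question_start`. All function names are illustrative only.

```python
import re
from typing import Callable, List


def split_into_sentences(test_question_text: str) -> List[str]:
    """S1031a: split the test question text on line feeds, dropping empty lines."""
    return [line for line in test_question_text.split("\n") if line.strip()]


def looks_like_question_start(sentence: str) -> bool:
    """Hypothetical stand-in for the text classifier of S1031b.

    A sentence that starts with "<number>." is treated as a question boundary.
    """
    return re.match(r"\s*\d+\s*[.、]", sentence) is not None


def segment_by_boundaries(
    sentences: List[str], is_boundary: Callable[[str], bool]
) -> List[List[str]]:
    """S201-S203: open a new test question at every boundary sentence."""
    questions: List[List[str]] = []
    for sentence in sentences:
        if is_boundary(sentence) or not questions:
            questions.append([sentence])
        else:
            questions[-1].append(sentence)
    return questions


text = "1. 2+3=?\nA. 4\nB. 5\n2. Capital of France?\nA. Paris\nB. Rome"
questions = segment_by_boundaries(split_into_sentences(text), looks_like_question_start)
```

In a real system the `is_boundary` callable would wrap whichever classification model is chosen (SVM, TextCNN, Bert and so on); the segmentation logic itself does not depend on that choice.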
In some embodiments of the present application, after performing sentence segmentation on the test question text in the test paper document according to the linefeed information to obtain a plurality of question sentences, that is, after S1031a, the method may further include: s1031d-S1031e, as follows:
S1031d, performing line-granularity labeling on each question sentence in the plurality of question sentences to obtain labeling labels.
The information extraction equipment inputs each of the plurality of question sentences into a trained labeling model for line-granularity labeling, so that the labeling model predicts a labeling label for each question sentence. It should be noted that the labeling labels include at least a question stem label and an option label; that is, for each question sentence, the information extraction device can at least predict whether the sentence is a question stem or an option of a multiple-choice question. Of course, in some embodiments, the labeling labels may further include a title label and a blank-area label, which is not limited herein.
It should be noted that the labeling model may be a common text labeling model, for example, a Bert model, or a labeling model constructed based on a multi-level encoder, and the labeling model is further provided with a line-granularity output layer in addition to the multi-level encoder, for outputting a labeling label of each topic sentence.
For example, fig. 10 is a schematic diagram of a process for performing line-granularity labeling on each question sentence according to an embodiment of the present application. Referring to fig. 10, the multi-level encoder 10-1 of the labeling model includes a sentence encoder 10-11 and a line granularity encoder 10-12. The information extraction device inputs the plurality of question sentences 10-3 (i.e., sentences 10-31 to 10-3n) into the encoders 10-111 to 10-11n of the sentence encoder 10-11, respectively, then inputs the encoding results of the sentence encoder into the line granularity encoder 10-12 to obtain the feature corresponding to each question sentence, i.e., f1 to fn, and finally predicts the labeling label of each question sentence, for example, whether each question sentence is a question stem 10-A or an option 10-B, by using the line granularity output layer 10-4 of the labeling model. Thus, the line-granularity labeling is completed. Further, the multi-level encoder of fig. 10 and the multi-level encoder of fig. 9 (i.e., encoder 9-2 and line granularity encoder 9-3) may have the same structure and may use the same parameters during training. That is, the information extraction device can use the same multi-level encoder to extract features from both the sentence information and the question sentences. Furthermore, since the question sentences are part of the sentence information, the information extraction equipment may encode only the sentence information to obtain the sentence features, and then feed the sentence features into both the pooling layer (including the splicing process) and the line granularity output layer, thereby obtaining the labeling labels of the question sentences and the target organization type of the test paper document in one pass.
It will be appreciated that the line-granularity output layer may use a Softmax layer or a CRF (conditional random field) layer. The Softmax layer directly models the probability of the label of each line, whereas the CRF layer can add constraints to the finally predicted labels to ensure that the predicted label sequence is legal, thereby better modeling the transition probabilities between line-granularity labels. These constraints can be learned automatically at training time.
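The difference between the two kinds of output layer can be sketched with a small decoding example. This is an illustration under assumptions, not the application's model: the per-line scores, the start scores and the transition scores are hand-written toy values (a real CRF layer would learn them during training), and the single constraint used here, that a label sequence cannot start with an option, is hypothetical.

```python
from typing import Dict, List

LABELS = ["stem", "option"]
NEG_INF = float("-inf")


def argmax_decode(scores: List[Dict[str, float]]) -> List[str]:
    """Softmax-style decoding: pick the best label for each line independently."""
    return [max(LABELS, key=lambda label: row[label]) for row in scores]


def viterbi_decode(
    scores: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    start: Dict[str, float],
) -> List[str]:
    """CRF-style decoding: find the best label sequence under transition scores."""
    best = {label: start[label] + scores[0][label] for label in LABELS}
    backpointers = []
    for row in scores[1:]:
        new_best, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[p] + transitions[p][cur])
            new_best[cur] = best[prev] + transitions[prev][cur] + row[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(LABELS, key=lambda label: best[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy log-scores for three lines; the first line slightly prefers "option".
scores = [
    {"stem": -0.9, "option": -0.5},
    {"stem": -2.0, "option": -0.1},
    {"stem": -1.5, "option": -0.2},
]
# Hypothetical constraint: a sequence may not start with an option.
start = {"stem": 0.0, "option": NEG_INF}
transitions = {p: {c: 0.0 for c in LABELS} for p in LABELS}

independent = argmax_decode(scores)
constrained = viterbi_decode(scores, transitions, start)
```

With these values the independent decoder labels the first line as an option, while the constrained decoder is forced to begin with a question stem, which is the kind of legality guarantee the CRF layer is described as providing.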
S1031e, integrating the question sentences corresponding to the question stem labels and the question sentences corresponding to the option labels into the test questions of the test paper document.
After the information extraction equipment obtains the label tag of each question sentence, the question sentence corresponding to one question stem tag and the question sentence corresponding to one option tag form a test question, so that after the questions to which the question sentences belong are determined sequentially according to the sequence, all independent test question questions of the test paper document can be integrated.
For example, fig. 11 is a schematic diagram of a labeling tag of a topic sentence provided in an embodiment of the present application. In the test paper document 11-1, the information extraction device labels a corresponding label, for example, the question stem 11-2 and the option 11-3, for each question sentence 11-11. Therefore, the information extraction equipment can screen out the components of each test question according to the sequence, so that the test question of the test paper document is obtained.
In the embodiment of the application, after the information extraction device obtains a plurality of question sentences, granularity marking is performed on each question sentence, so that whether each question sentence belongs to a question stem or an option is determined, and then independent test question component parts are screened out according to question stem labels and option labels.
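The integration of S1031e can be sketched as a single pass over the labeled question sentences. The grouping rule below is an assumption made for illustration: a question stem line that follows an option line (or that is the first line) opens a new test question, and consecutive stem lines are treated as one multi-line stem. Lines with other labels, such as a title or a blank area, are skipped.

```python
from typing import Dict, List


def assemble_questions(sentences: List[str], labels: List[str]) -> List[Dict[str, List[str]]]:
    """Group labeled lines into test questions made of a stem and its options."""
    questions: List[Dict[str, List[str]]] = []
    for sentence, label in zip(sentences, labels):
        if label == "stem":
            if not questions or questions[-1]["options"]:
                # The previous question already has options, so this stem starts a new one.
                questions.append({"stem": [sentence], "options": []})
            else:
                # Consecutive stem lines are treated as one multi-line stem.
                questions[-1]["stem"].append(sentence)
        elif label == "option" and questions:
            questions[-1]["options"].append(sentence)
        # Any other label (for example a title) is ignored in this sketch.
    return questions


sentences = ["Unit test", "1. 2+3=?", "A. 4", "B. 5",
             "2. Which city is", "the capital of France?", "A. Paris"]
labels = ["title", "stem", "option", "option", "stem", "stem", "option"]
assembled = assemble_questions(sentences, labels)
```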
In some embodiments of the present application, the target organization type represents that the test paper document further includes a first analysis text following each individual test question. In this case, after text classification is performed on the test question to obtain the question type category of the test question, that is, after S1032, the method may further include S1034-S1035, as follows:
S1034, extracting the answer analysis corresponding to the test question from the first analysis text following the individual test question.
The first analysis text refers to the analysis part of a test question. When the first analysis text follows an individual test question, the analysis content of the preceding test question is inserted between different test questions. The information extraction equipment performs text classification or line-granularity labeling, using different labels, on the first analysis text following the test question, so as to determine the content type of each line of text in the first analysis text, such as answer, parsing, comment and analysis. Then, the information extraction device extracts the answer and the parsing from the first analysis text and integrates them into the answer analysis.
For example, fig. 12 is a schematic diagram of extracting the answer analysis provided in the embodiment of the present application. The information extraction apparatus identifies the content type of each line of text in the first analysis text 12-1 immediately following the question, i.e., text 12-11 to text 12-13, thereby labeling each line of text with its content type, e.g., answer 12-2, analysis 12-3 and parsing 12-4. Text that is displayed on two lines but contains no line break, such as text 12-13, is processed by the information extraction apparatus as a single line during extraction. The information extraction apparatus can then extract the text labeled as answer and parsing to constitute the answer analysis of the question.
S1035, integrating the question type category, the test question and the answer analysis into the extraction information corresponding to the test paper document.
And finally, the information extraction equipment correspondingly integrates the question type category, the test question and the answer analysis to obtain the extraction information of the test paper document.
In the embodiment of the application, when the test paper document further includes a first analysis text after the test question, the information extraction device may extract an answer analysis corresponding to the test question from the first analysis text, so as to integrate the structured extraction information by using the test question, the corresponding question type and the answer analysis. Thus, the information extraction equipment can complete the information extraction of the test paper of which the analysis content is next to the test question.
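A minimal sketch of S1034 follows. It assumes, purely for illustration, that each line of the first analysis text begins with a bracketed keyword, and it uses a hypothetical prefix lookup, `classify_line`, in place of the trained classification or labeling model. The label names mirror the content types mentioned above.

```python
from typing import Dict, List

# Hypothetical keyword prefixes standing in for a trained line classifier.
PREFIX_TO_TYPE = {
    "[Answer]": "answer",
    "[Parsing]": "parsing",
    "[Analysis]": "analysis",
    "[Comment]": "comment",
}


def classify_line(line: str) -> str:
    """Return the content type of one line, or "other" when no prefix matches."""
    for prefix, content_type in PREFIX_TO_TYPE.items():
        if line.startswith(prefix):
            return content_type
    return "other"


def extract_answer_analysis(analysis_lines: List[str]) -> Dict[str, List[str]]:
    """Keep only the lines labeled as answer or parsing."""
    kept: Dict[str, List[str]] = {"answer": [], "parsing": []}
    for line in analysis_lines:
        content_type = classify_line(line)
        if content_type in kept:
            kept[content_type].append(line)
    return kept


first_analysis_text = [
    "[Answer] B",
    "[Analysis] This question examines integer addition.",
    "[Parsing] 2 + 3 = 5, so option B is correct.",
]
answer_analysis = extract_answer_analysis(first_analysis_text)
```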
In some embodiments of the present application, the target organization type represents that the test paper document further includes a second analysis text at the end of the test paper document. In this case, after text classification is performed on the test question to obtain the question type category of the test question, that is, after S1032, the method may further include S1036-S1038, as follows:
S1036, segmenting the analysis content in the second analysis text to obtain independent analysis sub-contents.
When the second parsing text is at the end of the test paper document, it indicates that the parsing content in the second parsing text does not follow the corresponding test question, but the parsing contents corresponding to all the test question appear as a whole. At this time, the information extraction device may segment the parsing content in the second parsing text first, so as to obtain mutually independent parsing sub-contents, that is, non-overlapping parsing sub-contents.
It is to be understood that the information extraction device may perform segmentation according to the numbers of the analysis content in the second analysis text, for example, by segmenting out the content between two consecutive numbers as one analysis sub-content. The information extraction device may also perform text classification or line-granularity labeling on each line of analysis content in the second analysis text to identify the content type of each line, and segment the second analysis text based on the content types, for example, by taking the content between an analysis start line and an analysis end line as one analysis sub-content.
S1037, aligning the analysis sub-contents with the test questions, and extracting the answer analysis corresponding to each test question from the analysis sub-content corresponding to that test question.
After obtaining the separate analysis sub-content, the information extraction device needs to align the analysis sub-content with the test question so that each test question has its corresponding analysis sub-content, which is the answer analysis corresponding to the test question.
It can be understood that the information extraction device may align the parsed sub-content and the test question through parsing the number in the sub-content and the number of the test question, or may align the parsed sub-content and the test question through some heuristic alignment algorithms, which is not limited herein.
S1038, integrating the question type category, the test question and the answer analysis into the extraction information corresponding to the test paper document.
And finally, the information extraction equipment correspondingly integrates the question type category, the test question and the answer analysis to obtain the extraction information of the test paper document.
In the embodiment of the application, when the test paper document further includes a second analysis text at the end of the test paper document, the information extraction device can segment the analysis content contained in the second analysis text, align the analysis sub-contents obtained by segmentation with the test questions to obtain the answer analysis corresponding to each test question, and integrate the question type category, the test question and the answer analysis. In this way, the information extraction device can complete the information extraction of a test paper whose analysis content is at the end.
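The number-based variant of S1036 and S1037 can be sketched as follows. The sketch assumes that every analysis sub-content begins with a line of the form "<number>." and that the test questions carry the same numbers; the regular expression and the function names are illustrative and not taken from the application. Test questions without a matching number are aligned to `None`.

```python
import re
from typing import Dict, List, Optional, Tuple

NUMBER_PREFIX = re.compile(r"^\s*(\d+)\s*[.、]\s*")


def split_by_number(analysis_lines: List[str]) -> Dict[int, List[str]]:
    """S1036: cut the trailing analysis text into sub-contents keyed by number."""
    chunks: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for line in analysis_lines:
        match = NUMBER_PREFIX.match(line)
        if match:
            current = int(match.group(1))
            chunks[current] = [line[match.end():]]
        elif current is not None:
            # A line without a number continues the current sub-content.
            chunks[current].append(line)
    return chunks


def align_by_number(
    questions: List[Tuple[int, str]], chunks: Dict[int, List[str]]
) -> List[Tuple[int, str, Optional[List[str]]]]:
    """S1037: pair each numbered test question with the sub-content of the same number."""
    return [(number, text, chunks.get(number)) for number, text in questions]


second_analysis_text = ["1. B. 2+3 equals 5.", "2. A.", "Paris is the capital."]
questions = [(1, "2+3=?"), (2, "Capital of France?"), (3, "Unanswered question")]
chunks = split_by_number(second_analysis_text)
aligned = align_by_number(questions, chunks)
```

Where the numbering is missing or inconsistent, a heuristic alignment as mentioned above would be needed in place of the direct lookup.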
In some embodiments of the present application, after aggregating the plurality of sentence characteristics to obtain the document characteristics of the test paper document, that is, after S1022c, the method may further include: s1022e, the following:
S1022e, performing subject identification on the document features to obtain subject information of the test paper document.
Test papers of different subjects are suited to different models for text classification or line-granularity labeling. For example, a mathematics test paper is suited to text classification and line-granularity labeling by a model trained on data containing more characters and numbers, whereas a Chinese test paper is suited to a model trained on text data with richer semantics. In view of this, in the embodiment of the present application, after obtaining the document features, the information extraction device may further perform subject identification on the document features to determine which subject the test paper document belongs to, thereby obtaining the subject information of the test paper document. The subject information helps select a suitable classification model and a suitable labeling model for the subsequent extraction processing corresponding to the target organization type. That is, the subject information is used at least to select the classification model used for text classification and the labeling model used for line-granularity labeling. Furthermore, the subject information can also be used to select the classification models and labeling models of similar subjects, so that, when only a small amount of labeled data is available, a classification model and a labeling model applicable to the subject can be obtained by transfer learning.
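The use of subject information for model selection can be sketched as a lookup with a fallback to a similar subject. Everything in the sketch is hypothetical: the registry contents, the model names and the similarity table are placeholders, and the returned flag only signals that the selected model belongs to a similar subject and would still need transfer learning on the small labeled set.

```python
from typing import Dict, Tuple

# Hypothetical registry: subject -> name of a labeling model trained for it.
MODEL_REGISTRY: Dict[str, str] = {
    "mathematics": "math_labeling_model",
    "chinese": "chinese_labeling_model",
}
# Hypothetical table of similar subjects used as a starting point for transfer learning.
SIMILAR_SUBJECT: Dict[str, str] = {"physics": "mathematics"}


def select_model(subject: str) -> Tuple[str, bool]:
    """Return (model name, needs_transfer_learning) for the identified subject."""
    if subject in MODEL_REGISTRY:
        return MODEL_REGISTRY[subject], False
    similar = SIMILAR_SUBJECT.get(subject)
    if similar in MODEL_REGISTRY:
        # Start from the similar subject's model and fine-tune it later.
        return MODEL_REGISTRY[similar], True
    raise KeyError(f"no model available for subject {subject!r}")


direct = select_model("mathematics")
fallback = select_model("physics")
```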
Based on fig. 9, referring to fig. 13, fig. 13 is a schematic diagram of subject identification provided by the embodiment of the present application. After inputting the feature matrix into the pooling layer 9-5 to obtain the document feature, the information extraction apparatus may input the document feature not only into the classifier 9-6 for classifying the organization type, but also into a classifier 13-1 for classifying subjects, so as to obtain the subject information of the test paper document, such as mathematics 13-2.
It should be noted that, in some embodiments, the classification model and the annotation model may also be constructed based on different encoders, so that the classification model may have different model structures, and the annotation model may also have different model structures. The information extraction equipment can also select the most appropriate model structure from different model structures of the classification model to realize text classification according to subject information, and select the most appropriate model structure from different model structures of the labeling model to realize line-granularity labeling.
In the following, the classification model is taken as an example to illustrate the differences between different model structures.
The classification model used in text classification generally includes two parts, an encoder and an output layer.
For example, fig. 14 is a schematic structural diagram of a classification model provided in an embodiment of the present application. The classification model 14-1 is composed of an encoder 14-11 and an output layer 14-12. The encoder 14-11 is responsible for encoding the text content 14-2 into a feature vector of fixed length and passing the feature vector to the output layer 14-12, and the output layer 14-12 calculates the class probability 14-3 of each class from the feature vector. For binary classification, the output layer often uses a logistic regression layer; for multi-class classification, the output layer often uses a Softmax layer.
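The two output layers mentioned above can be illustrated with their standard formulas. The sketch takes the pre-activation scores (logits) as given; computing them from the encoder's feature vector with learned weights is omitted, and the numeric values are arbitrary.

```python
import math
from typing import List


def sigmoid(logit: float) -> float:
    """Logistic regression output for binary classification."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits: List[float]) -> List[float]:
    """Softmax output for multi-class classification (numerically stabilized)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


boundary_probability = sigmoid(0.0)            # binary case, e.g. "is a question boundary"
class_probabilities = softmax([2.0, 1.0, 0.1])  # multi-class case, e.g. question types
```

For two classes the two layers coincide: a Softmax over the logits `[z, 0]` assigns the first class the same probability as `sigmoid(z)`.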
The encoder may have a variety of structural options. The information extraction device may use the TextCNN model as an encoder, may use the RNN model as an encoder, and may use the Bert model as an encoder.
Fig. 15 is a schematic structural diagram of an encoder according to an embodiment of the present application. The encoder is based on the TextCNN model. The encoder performs word vector extraction 15-A on the input sentence 15-1 and arranges the extracted word vectors into an n × d matrix 15-2, where n is the number of words in the sentence and d is the dimension of a single word vector. Then, the encoder performs one-dimensional convolution 15-B on the matrix 15-2 with k convolution kernels 15-3; the region covered by each convolution kernel is converted into a feature value 15-4, and after the one-dimensional convolution is completed, k feature vectors 15-5 are obtained and spliced into an n × k matrix. The encoder then performs max pooling 15-C on the n × k matrix along the sentence dimension to obtain a k-dimensional feature vector 15-6, which is delivered to the output layer.
Fig. 16 is a schematic structural diagram of an encoder according to an embodiment of the present application. The encoder is based on the RNN model. The encoder inputs the word vectors e1 to e3 of the words of the input sentence, i.e., today 16-1, weather 16-2 and good 16-3, into the hidden layer 16-4 in order from left to right, and calculates the feature vectors h1 to h3 according to equation (1). The feature vector of the last word is then passed to the output layer as the vector representation of the entire sentence.
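The shapes involved in the TextCNN-based encoder can be traced with a small pure-Python sketch. It makes assumptions that fig. 15 does not spell out: the kernels have an odd width and the input is zero-padded so that the convolution output keeps n rows (matching the n × k matrix), no bias or activation is applied, and the kernel weights are hand-written toy values rather than learned ones.

```python
from typing import List

Matrix = List[List[float]]


def conv1d_same(word_matrix: Matrix, kernels: List[Matrix]) -> Matrix:
    """Slide each (width x d) kernel over the n x d word matrix; returns n x k."""
    n, d = len(word_matrix), len(word_matrix[0])
    width = len(kernels[0])  # assumed odd so that zero padding keeps n rows
    pad = width // 2
    zero_row = [0.0] * d
    padded = [zero_row] * pad + word_matrix + [zero_row] * pad
    features: Matrix = []
    for i in range(n):
        window = padded[i:i + width]
        row = [
            sum(kernel[r][c] * window[r][c] for r in range(width) for c in range(d))
            for kernel in kernels
        ]
        features.append(row)
    return features


def max_pool_over_sentence(features: Matrix) -> List[float]:
    """Take the maximum of each of the k columns over the n positions."""
    return [max(column) for column in zip(*features)]


word_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, d = 2
kernels = [
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],             # sums everything in the window
    [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]],            # looks only at the center word
]                                                      # k = 2 kernels of width 3
feature_matrix = conv1d_same(word_matrix, kernels)    # n x k
sentence_vector = max_pool_over_sentence(feature_matrix)  # k-dimensional
```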
$$h_t = f_h(e_t, h_{t-1}) \qquad (1)$$
where $e_t$ is a word vector, $h_{t-1}$ is the feature vector output by the hidden layer at the previous step, $f_h$ is the hidden layer calculation function, and $h_t$ is the calculated feature vector.
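The recurrence of equation (1) can be made concrete with a minimal sketch. The application leaves the hidden layer function unspecified; the sketch assumes the common choice tanh(W_e e_t + W_h h_{t-1} + b), starts from a zero hidden state, and uses toy weights. The names `W_e`, `W_h` and `b` are introduced here for illustration only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def hidden_step(e_t: Vector, h_prev: Vector, W_e: Matrix, W_h: Matrix, b: Vector) -> Vector:
    """One application of the hidden layer function: h_t = tanh(W_e e_t + W_h h_prev + b)."""
    return [
        math.tanh(
            sum(W_e[i][j] * e_t[j] for j in range(len(e_t)))
            + sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + b[i]
        )
        for i in range(len(b))
    ]


def encode_sentence(word_vectors: List[Vector], W_e: Matrix, W_h: Matrix, b: Vector) -> Vector:
    """Feed word vectors left to right and return the last hidden state."""
    h = [0.0] * len(b)  # assumed zero initial state
    for e_t in word_vectors:
        h = hidden_step(e_t, h, W_e, W_h, b)
    return h


# Toy example with hidden size 1 and word-vector size 1.
sentence_vector = encode_sentence([[0.5], [1.0]], W_e=[[1.0]], W_h=[[0.0]], b=[0.0])
```

Because `W_h` is zero in this toy setting, the result depends only on the last word; with a non-zero `W_h` the earlier words influence the final state through the carried hidden vector.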
The encoder may also have variants. For example, in addition to left-to-right order, encoding may also be performed in right-to-left order to form a bidirectional recurrent neural network; alternatively, depending on the choice of the function $f_h$, variant networks such as LSTM or GRU can be obtained and used as new encoders.
Fig. 17 is a schematic structural diagram three of an encoder according to an embodiment of the present application. The encoder is implemented based on the Bert model. The encoder inputs the sentence information, i.e., today the weather is good 17-1, into the Bert model 17-2 to obtain the feature vector 17-3 to be passed to the output layer. The [CLS] flag in fig. 17 marks the beginning of a sentence, and the [SEP] flag is used to separate two different sentences.
In some embodiments, the encoder may also be based on a Transformer model, and the structure of the Transformer-based encoder is not described herein.
In the embodiment of the application, the information extraction equipment can also identify the subject of the document features to obtain subject information to which the test paper document belongs, so that when text classification and line granularity labeling are carried out subsequently, a proper classification model and a labeling model are screened, a more accurate classification type and a more accurate labeling label are obtained, and the effectiveness of information extraction is further improved.
In some embodiments of the present application, in response to the information extraction triggering operation, acquiring a to-be-extracted document waiting for information extraction corresponding to the information extraction triggering operation, that is, a specific implementation process of S101 may include: s1011, as follows:
S1011, responding to the information extraction triggering operation aiming at the extraction triggering identification in the document uploading interface, and acquiring the document to be extracted corresponding to the information extraction triggering operation from the document adding area of the document uploading interface.
The information extraction device can detect whether the user operates the extraction trigger in the displayed document uploading interface. When the user operates the extraction trigger, the operation is the information extraction trigger operation, and at this time, the information extraction device acquires the document specified by the user from the document adding area of the document uploading interface in response to the information extraction trigger operation, so that the document is the document to be extracted.
It is understood that the form of the extraction trigger may be set according to actual requirements, for example, a button for starting extraction or a switch for starting extraction is provided, and the application is not limited herein. The size and position of the document adding area can be set according to actual requirements, and the application is not limited herein.
Illustratively, fig. 18 is a schematic diagram of a document uploading interface provided in an embodiment of the present application. In the lower part of the document uploading interface 18-1, a document adding area 18-11 is provided, and the user can add the document 18-2 designated for information extraction to the document adding area 18-11. The lower right corner of the document uploading interface 18-1 is provided with an extraction trigger, namely an extraction starting button 18-12; when the user clicks the button, the information extraction device takes the document in the document adding area 18-11 as the document to be extracted and starts the information extraction processing.
In some embodiments of the present application, after performing extraction processing corresponding to a target organization type for a document to be extracted and obtaining extraction information of the document to be extracted, that is, after S103, the method may further include: S104-S105, as follows:
S104, parsing the extraction information to obtain the test questions of the test paper document, the question serial number corresponding to each test question and the answer analysis corresponding to each test question.
S105, displaying the test question in a test question display area of the information display interface, displaying the question serial number in a serial number display area of the information display interface, and displaying the answer analysis in an analysis content area of the information display interface.
In the embodiment of the application, after obtaining the extraction information, the information extraction equipment can further parse it to obtain the question serial number and the answer analysis corresponding to each test question, pop up an information display interface, and display the test questions, the question serial numbers and the answer analyses respectively, so that a user can clearly view the information related to the test questions.
In some embodiments, the information extraction device may first display only the question serial numbers in the serial number display area, leaving the test question display area and the analysis content area empty for the time being. When a user triggers a question serial number in the serial number display area, the test question corresponding to that serial number is displayed in the test question display area, and its answer analysis is displayed in the analysis content area.
It can be understood that the sizes and positions of the test question display area, the serial number display area and the analysis content area can be set according to actual conditions. When the test question is a choice question, the test question display area can be divided into different sub-areas, a sub-area displays the question stem, and the remaining sub-areas respectively display one choice of the test question.
For example, fig. 19 is a schematic diagram of an information presentation interface provided in an embodiment of the present application. In the serial number display area 19-11 of the information display interface 19-1, the question serial numbers of the test questions, i.e., the serial numbers 1 to 28, are displayed. When the user clicks the serial number 2, the question stem 19-2 and the options A to D of the test question corresponding to the serial number 2 are respectively displayed in the five sub-areas of the test question display area 19-12, namely the sub-areas 19-121 to 19-125, and meanwhile, the answer analysis 19-3 of the test question is displayed in the analysis content area 19-13.
In the embodiment of the application, the information extraction equipment also provides a document uploading interface and an information display interface for a user, so that the user can trigger the information extraction triggering operation on the document uploading interface and check the extracted information on the information display interface, and the user can use the information extraction equipment conveniently.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application is realized in a scene in which a server (information extraction equipment) extracts questions and answers from a test paper (document to be extracted). Fig. 20 is a schematic flowchart of extracting questions and answers from a test paper according to an embodiment of the present application. Referring to fig. 20, the process includes:
S301, inputting a test paper.
S302, preprocessing.
S303, obtaining rich text data (data to be identified).
The input test paper needs to be preprocessed through a preprocessing module, and rich text data is extracted. The rich text data contains information such as texts, pictures, tables and the like in the test paper, and the preprocessing methods required by the test papers with different formats may be different.
S304, identifying the test paper style (identifying the organization type to obtain the target organization type).
This step identifies the answer organization style of the input test paper. There are three common styles: answer after the test question 20-A (a first analysis text following each individual test question), answer at the end of the test paper 20-B (a second analysis text at the end of the test paper document), and no answer 20-C (test question text only). The server may implement this process with a scenario-based text classification model, with a deep learning model, or with the task architecture shown in fig. 9. In the latter case, the test paper is split into a plurality of sentences (a plurality of pieces of sentence information) according to the line feed information in the test paper; the sentences are encoded by a two-level hierarchical neural network encoder (feature encoding is performed on the sentence information to obtain a plurality of sentence features) to obtain a global vector representation of each sentence; the vector representations of the sentences are aggregated to obtain a vector representation of the document (the document feature); and document-level classification prediction is performed to output the test paper style category.
Different test paper styles are subsequently processed with different processing flows. For a test paper with the answer at the end of the test paper, the boundary between the test question area and the answer area needs to be found; for a test paper without answers or with the answer after each test question, the whole test paper can be regarded as the test question area.
S306, performing extraction.
Aiming at the answer 20-A behind the test question, the processes of test question segmentation 20-1, question type identification 20-2, answer analysis and extraction 20-5 and the like are required to be carried out; aiming at the answer at the tail 20-B, the processes of test question segmentation 20-1, question type identification 20-2, answer segmentation 20-3, test question answer mapping 20-4, answer analysis and extraction 20-5 and the like are required to be carried out; for the no answer 20-C, only two processes of question segmentation 20-1 and question type identification 20-2 are needed.
Dividing test questions into 20-1: the boundaries of the individual test questions in the test question area need to be found for segmentation (to obtain the individual test question). The process can be realized through a continuous labeling rule, can also be converted into a line-by-line classification task (text classification is carried out on each topic sentence to obtain a classification category which represents whether each topic sentence is a topic boundary), and can also be converted into a line-granularity sequence labeling task, namely, a category label (labeling label), such as a question stem or an option, is predicted for each text line of the test paper through a model, so that segmentation is carried out. For some complex topics, editing of segmenting large topics and small topics is needed, and at the moment, the task can be converted into a line-by-line classification task or a line-granularity sequence labeling task, so that prediction is carried out in a topic range or a test paper range, and segmentation is carried out.
Topic type identification 20-2: the question types of the test questions, such as selection questions and blank questions (question type categories), need to be identified. The answer types may differ in answer form from question type to question type, which may assist in the answer extraction process of answer parsing extraction 20-5.
Answer segmentation 20-3: for the test paper with the answer at the tail 20-B of the test paper, the answer needs to be split according to the boundaries of different questions (the analysis content in the second analysis text is split to obtain the independent analysis sub-content), at this time, the answer can be split through the question number, or the answer can be split in a model-based mode, that is, the answer is split in a mode similar to the test paper splitting 20-1.
Test question answer mapping 20-4: for a test paper whose answers are at the end of the test paper (20-B), the test questions need to be mapped to the answers (the analysis sub-content is aligned with the test questions). Alignment can be based on the question numbers of the test questions and the question numbers of the answers, and heuristic alignment algorithms can also be introduced.
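As a non-limiting illustration, answer segmentation by question number (20-3) and alignment by question number (20-4) can be sketched together. The helper names `split_by_number` and `align` are hypothetical, and the sketch assumes that both questions and answers carry a leading number followed by a period; heuristic alignment for unnumbered answers is not covered.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed convention: "<number>." (or "、") opens a question or an answer.
_NUMBERED = re.compile(r"^\s*(\d+)\s*[\.、．]\s*(.*)$")


def split_by_number(lines: List[str]) -> Dict[int, str]:
    """Group lines into blocks keyed by their leading number."""
    blocks: Dict[int, List[str]] = {}
    current: Optional[int] = None
    for line in lines:
        match = _NUMBERED.match(line)
        if match:
            current = int(match.group(1))
            blocks[current] = [match.group(2)]
        elif current is not None and line.strip():
            blocks[current].append(line.strip())  # continuation line
    return {number: "\n".join(parts) for number, parts in blocks.items()}


def align(
    questions: Dict[int, str], answers: Dict[int, str]
) -> List[Tuple[int, str, Optional[str]]]:
    """Pair each question with the answer of the same number, if any."""
    return [(n, q, answers.get(n)) for n, q in sorted(questions.items())]
```

A question whose number has no counterpart in the answer area is paired with `None`, which leaves room for a later heuristic pass.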
Answer analysis and extraction 20-5: the answer part of a test question may have rich content, including the answer, analysis, comments, and the like. Following the question segmentation method, the server can divide this content into categories through a line-granularity sequence labeling task; where the answer is contained in the analysis text, the answer can be extracted through keyword features or by introducing a word-granularity sequence labeling model.
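As a non-limiting illustration, the keyword-feature variant can be sketched as follows. The English keywords "Answer", "Analysis", and "Comment", the phrase "the answer is", and the function names are all assumptions made for this sketch; they stand in for whatever keywords or labeling model an embodiment actually uses.

```python
import re
from typing import Dict, List, Optional

# Assumed keywords marking the parts of an answer block.
_FIELD = re.compile(r"^\s*(Answer|Analysis|Comment)\s*[:：]\s*(.*)$", re.IGNORECASE)
# Assumed phrase revealing an option letter embedded in analysis text.
_EMBEDDED = re.compile(r"the answer is\s+([A-D])\b", re.IGNORECASE)


def parse_answer_block(block: str) -> Dict[str, str]:
    """Divide an answer block into answer, analysis, and comment parts."""
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in block.splitlines():
        match = _FIELD.match(line)
        if match:
            current = match.group(1).lower()
            fields.setdefault(current, []).append(match.group(2))
        elif current is not None and line.strip():
            fields[current].append(line.strip())  # continuation line
    return {key: " ".join(p for p in parts if p) for key, parts in fields.items()}


def answer_from_analysis(analysis: str) -> Optional[str]:
    """Extract an option letter that is only stated inside the analysis."""
    match = _EMBEDDED.search(analysis)
    return match.group(1).upper() if match else None
```

For example, `answer_from_analysis("Since 2+2=4, the answer is B.")` returns `"B"`, while text without the assumed phrase returns `None`.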
S307, obtaining a structured test question list.
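As a non-limiting illustration, one possible shape for an entry of the structured test question list is sketched below. The class `ExamQuestion`, its field names, and the JSON serialization are assumptions made for this sketch; the present application does not prescribe a storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExamQuestion:
    """One entry of the structured test question list (hypothetical schema)."""
    number: int
    question_type: str
    stem: str
    options: List[str] = field(default_factory=list)
    answer: Optional[str] = None    # absent for papers without answers
    analysis: Optional[str] = None  # absent for papers without analysis


def to_json(questions: List[ExamQuestion]) -> str:
    """Serialize the structured list, keeping non-ASCII text readable."""
    return json.dumps([asdict(q) for q in questions], ensure_ascii=False, indent=2)
```

Optional fields allow the same structure to hold questions from papers with answers after each question, answers at the end, or no answers.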
In some cases, the server also proceeds to S308 after obtaining the rich text data in S303.
S308, subject identification.
This step identifies the subject type (subject information) of the input test paper, for example, Chinese, mathematics, English, or chemistry. The server may implement this process with a text classification model for this scenario, and may also use the architecture shown in FIG. 9.
The identified subject can be used to help select the models used by the processing module in S306, such as the text classification model for line-by-line classification and the labeling model for the line-granularity sequence labeling task, so that more suitable models perform the classification and labeling tasks and yield more accurate classification results and labeling labels. This benefits processes such as question segmentation, question type identification, and answer segmentation.
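As a non-limiting illustration, selecting a model according to the identified subject can be sketched as a registry lookup with a generic fallback. The registry contents, the model names, and the function `select_model` are hypothetical placeholders, not models disclosed by the present application.

```python
from typing import Dict, Tuple

# Hypothetical registry: (subject, task) -> model identifier.
MODEL_REGISTRY: Dict[Tuple[str, str], str] = {
    ("mathematics", "line_classification"): "math-line-classifier",
    ("english", "line_labeling"): "english-line-labeler",
    ("default", "line_classification"): "generic-line-classifier",
    ("default", "line_labeling"): "generic-line-labeler",
}


def select_model(subject: str, task: str) -> str:
    """Prefer a subject-specific model; otherwise use the generic one."""
    key = (subject.strip().lower(), task)
    if key in MODEL_REGISTRY:
        return MODEL_REGISTRY[key]
    # Raises KeyError if the task itself is unknown.
    return MODEL_REGISTRY[("default", task)]
```

With this registry, `select_model("Mathematics", "line_classification")` returns the subject-specific entry, while a subject without its own model falls back to the generic one.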
Through the above process, test questions can be extracted from a test paper automatically, without manual marking, and test papers of different styles and different subjects can all be processed effectively. This improves the validity and efficiency of test question extraction, and thus also its degree of intelligence.
Continuing with the exemplary structure of the information extraction device 555 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in FIG. 7, the software modules of the information extraction device 555 stored in the memory 550 may include:
the document obtaining module 5551 is configured to, in response to an information extraction triggering operation, obtain a to-be-extracted document waiting for information extraction corresponding to the information extraction triggering operation;
the type identification module 5552 is configured to identify an organization type of the document to be extracted, so as to obtain a target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content;
the extraction processing module 5553 is configured to perform extraction processing corresponding to the target organization type on the document to be extracted, so as to obtain extraction information of the document to be extracted.
In some embodiments of the present application, the document to be extracted includes a test paper document; the type identification module 5552 is further configured to extract rich text data from the test paper document to obtain data to be identified, and to predict the organization type of the data to be identified to obtain the target organization type of the test paper document; the target organization type indicates that the test paper document includes any one or more of: a test question text, a first analysis text that follows an individual test question, and a second analysis text that is at the end of the test paper document.
In some embodiments of the present application, the type identification module 5552 is further configured to perform sentence segmentation on the data to be identified according to line break information to obtain a plurality of pieces of sentence information; perform feature encoding on the plurality of pieces of sentence information to obtain a plurality of sentence features; aggregate the plurality of sentence features to obtain the document features of the test paper document; and classify the organization type based on the document features to obtain the target organization type of the test paper document.
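As a non-limiting illustration, the sequence of sentence segmentation, feature encoding, aggregation, and classification can be sketched with hand-crafted features. Everything in this sketch is an assumption: the three regular-expression features stand in for a learned sentence encoder, mean pooling stands in for the aggregation, threshold rules stand in for a trained classifier, and the sketch returns a single label although a document may combine several kinds of content.

```python
import re
from typing import List

_QUESTION = re.compile(r"^\s*\d+\s*[\.、．]")
_ANSWER = re.compile(r"^\s*(answer|analysis)\s*[:：]", re.IGNORECASE)
_ANSWER_SECTION = re.compile(
    r"^\s*(answers|answer key|reference answers)\s*$", re.IGNORECASE
)


def encode_line(line: str) -> List[float]:
    """Toy sentence feature: [question start, answer marker, section header]."""
    return [
        float(bool(_QUESTION.match(line))),
        float(bool(_ANSWER.match(line))),
        float(bool(_ANSWER_SECTION.match(line))),
    ]


def aggregate(features: List[List[float]]) -> List[float]:
    """Mean-pool the sentence features into one document feature."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def classify_document(text: str) -> str:
    """Rule-based stand-in for the organization type classifier."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "no_answer"
    _, answer_ratio, section_ratio = aggregate([encode_line(l) for l in lines])
    if section_ratio > 0:
        return "answer_at_end"
    if answer_ratio > 0:
        return "answer_after_question"
    return "no_answer"
```

In a learned embodiment, `encode_line` and the final rules would be replaced by model components, while the overall split, encode, aggregate, classify flow would stay the same.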
In some embodiments of the present application, the target organization type indicates that the test paper document includes the test question text; the extraction processing module 5553 is further configured to perform question segmentation on the test question text in the test paper document to obtain the individual test questions in the test paper document; perform text classification on each test question to obtain the question type category of the test question; and integrate the question type category and the test question into the extraction information corresponding to the test paper document; the extraction information organizes, in structured form, the test questions in the test paper document and the question type category corresponding to each test question.
In some embodiments of the present application, the extraction processing module 5553 is further configured to perform sentence segmentation on the test question text in the test paper document according to line break information to obtain a plurality of question sentences; perform text classification on each question sentence of the plurality of question sentences to obtain a classification category indicating whether that question sentence is a question boundary; and segment the test question text based on the segmentation boundaries determined by the classification categories to obtain the test questions in the test paper document.
In some embodiments of the present application, the extraction processing module 5553 is further configured to select, from the plurality of classification categories, the candidate categories indicating that a question sentence is a question boundary; determine the question sentences corresponding to the candidate categories as the segmentation boundaries; and segment the test question text according to the segmentation boundaries to obtain the test questions in the test paper document.
In some embodiments of the present application, the extraction processing module 5553 is further configured to perform line-granularity labeling on each question sentence of the plurality of question sentences to obtain a labeling label; the labeling label includes a question stem label or an option label; and integrate the question sentences corresponding to the question stem label and the question sentences corresponding to the option label into the test questions of the test paper document.
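As a non-limiting illustration, integrating labeled question sentences into test questions can be sketched as follows. The labels are taken as given (in an embodiment they would come from the labeling model), and the grouping rule, under which a stem line that does not follow another stem line opens a new question, is an assumption of this sketch. Under that rule, consecutive questions without options would be merged, so a further signal such as a boundary label would be needed for them.

```python
from typing import Dict, List, Tuple

STEM, OPTION = "stem", "option"


def group_labeled_lines(
    labeled: List[Tuple[str, str]]
) -> List[Dict[str, List[str]]]:
    """Group (line, label) pairs into questions with stem and option lines."""
    questions: List[Dict[str, List[str]]] = []
    previous = None
    for line, label in labeled:
        if label not in (STEM, OPTION):
            raise ValueError(f"unknown label: {label!r}")
        # Assumed rule: a stem line after a non-stem line opens a question.
        starts_new = label == STEM and previous != STEM
        if starts_new or not questions:
            questions.append({"stem": [], "options": []})
        questions[-1]["stem" if label == STEM else "options"].append(line)
        previous = label
    return questions
```

Multi-line stems are kept together because a stem line that directly follows another stem line extends the current question instead of opening a new one.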
In some embodiments of the present application, the target organization type further indicates that the test paper document includes a first analysis text following the individual test questions; the extraction processing module 5553 is further configured to extract the answer analysis corresponding to each test question from the first analysis text following that test question; and integrate the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
In some embodiments of the present application, the target organization type further indicates that the test paper document includes a second analysis text at the end of the test paper document; the extraction processing module 5553 is further configured to segment the analysis content in the second analysis text to obtain independent analysis sub-content; align the analysis sub-content with the test questions, and extract the answer analysis corresponding to each test question from the analysis sub-content corresponding to that test question; and integrate the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
In some embodiments of the present application, the information extraction device 555 further includes: a subject identification module 5554;
the subject identification module 5554 is configured to perform subject identification on the document features to obtain subject information of the test paper document; the subject information is at least used for selecting the classification model used in text classification and the labeling model used in line-granularity labeling.
In some embodiments of the present application, the document obtaining module 5551 is further configured to, in response to an information extraction triggering operation for an extraction triggering identifier in a document uploading interface, obtain, from a document adding area of the document uploading interface, the document to be extracted corresponding to the information extraction triggering operation.
In some embodiments of the present application, the information extraction device 555 further includes: an information presentation module 5555;
the information presentation module 5555 is configured to parse the extraction information to obtain the test questions of the test paper document, the question sequence numbers corresponding to the test questions, and the answer analysis corresponding to the test questions; and to display the test questions in a test question display area of an information display interface, display the question sequence numbers in a sequence number display area of the information display interface, and display the answer analysis in an analysis content area of the information display interface.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the information extraction method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the information extraction method provided by the embodiments of the present application, for example, the method shown in FIG. 8.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the executable information extraction instructions may be in the form of a program, software module, script, or code, written in any form of programming language (including compiled or interpreted languages), and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, the executable information extraction instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, the executable information extraction instructions may be deployed to be executed on one information extraction device, or on multiple information extraction devices located at one site, or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the present application, the information extraction device first obtains the document to be extracted in response to the information extraction triggering operation, then identifies the target organization type of the document to be extracted so as to determine the content contained in the document and the layout structure of that content, and then performs the extraction processing corresponding to the target organization type on the document. Information is thereby extracted in a manner suited to the document, which ensures that the required information can be successfully extracted from the document to be extracted.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An information extraction method, comprising:
responding to an information extraction triggering operation, and acquiring a document to be extracted, which is corresponding to the information extraction triggering operation and waits for information extraction;
identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content;
and performing, on the document to be extracted, extraction processing corresponding to the target organization type to obtain the extraction information of the document to be extracted.
2. The method according to claim 1, wherein the document to be extracted comprises: a test paper document; the identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted comprises:
extracting rich text data from the test paper document to obtain data to be identified;
predicting the organization type of the data to be identified to obtain the target organization type of the test paper document; the target organization type indicates that the test paper document includes any one or more of: a test question text, a first analysis text that follows an individual test question, and a second analysis text that is at the end of the test paper document.
3. The method according to claim 2, wherein the predicting the organization type of the data to be identified to obtain the target organization type of the test paper document comprises:
performing sentence segmentation on the data to be identified according to line break information to obtain a plurality of pieces of sentence information;
performing feature encoding on the plurality of pieces of sentence information to obtain a plurality of sentence features;
aggregating the plurality of sentence features to obtain document features of the test paper document;
and classifying the organization type based on the document features to obtain the target organization type of the test paper document.
4. The method of claim 2 or 3, wherein the target organization type indicates that the test paper document includes the test question text; the performing, on the document to be extracted, extraction processing corresponding to the target organization type to obtain the extraction information of the document to be extracted comprises:
performing question segmentation on the test question text in the test paper document to obtain the individual test questions in the test paper document;
performing text classification on each test question to obtain the question type category of the test question;
integrating the question type category and the test question into the extraction information corresponding to the test paper document; the extraction information organizes, in structured form, the test questions in the test paper document and the question type category corresponding to each test question.
5. The method of claim 4, wherein the performing question segmentation on the test question text in the test paper document to obtain the individual test questions in the test paper document comprises:
performing sentence segmentation on the test question text in the test paper document according to line break information to obtain a plurality of question sentences;
performing text classification on each question sentence of the plurality of question sentences to obtain a classification category indicating whether that question sentence is a question boundary;
and segmenting the test question text based on the segmentation boundaries determined by the classification categories to obtain the test questions in the test paper document.
6. The method of claim 5, wherein the segmenting the test question text based on the segmentation boundaries determined by the classification categories to obtain the test questions in the test paper document comprises:
selecting, from the plurality of classification categories, the candidate categories indicating that a question sentence is a question boundary;
determining the question sentences corresponding to the candidate categories as the segmentation boundaries;
and segmenting the test question text according to the segmentation boundaries to obtain the test questions in the test paper document.
7. The method of claim 5, wherein after the performing sentence segmentation on the test question text in the test paper document according to line break information to obtain a plurality of question sentences, the method further comprises:
performing line-granularity labeling on each question sentence of the plurality of question sentences to obtain a labeling label; the labeling label includes: a question stem label or an option label;
integrating the question sentences corresponding to the question stem label and the question sentences corresponding to the option label into the test questions of the test paper document.
8. The method of claim 4, wherein the target organization type further indicates that the test paper document includes a first analysis text following the individual test questions; after the performing text classification on each test question to obtain the question type category of the test question, the method further comprises:
extracting the answer analysis corresponding to each test question from the first analysis text following that test question;
and integrating the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
9. The method of claim 4, wherein the target organization type further indicates that the test paper document includes a second analysis text at the end of the test paper document; after the performing text classification on each test question to obtain the question type category of the test question, the method further comprises:
segmenting the analysis content in the second analysis text to obtain independent analysis sub-content;
aligning the analysis sub-content with the test questions, and extracting the answer analysis corresponding to each test question from the analysis sub-content corresponding to that test question;
and integrating the question type category, the test question, and the answer analysis into the extraction information corresponding to the test paper document.
10. The method of claim 3, wherein after aggregating the plurality of sentence features to obtain the document features of the test paper document, the method further comprises:
performing subject identification on the document features to obtain subject information of the test paper document;
the subject information is at least used for selecting the classification model used in text classification and the labeling model used in line-granularity labeling.
11. The method according to any one of claims 1 to 3 and 10, wherein the obtaining of the document to be extracted, which is waiting for information extraction and corresponds to the information extraction triggering operation, in response to the information extraction triggering operation comprises:
and responding to an information extraction triggering operation aiming at an extraction triggering identifier in a document uploading interface, and acquiring the document to be extracted corresponding to the information extraction triggering operation from a document adding area of the document uploading interface.
12. The method according to any one of claims 1 to 3 and 10, wherein after the performing, on the document to be extracted, extraction processing corresponding to the target organization type to obtain the extraction information of the document to be extracted, the method further comprises:
parsing the extraction information to obtain the test questions of the test paper document, the question sequence numbers corresponding to the test questions, and the answer analysis corresponding to the test questions;
displaying the test questions in a test question display area of an information display interface, displaying the question sequence numbers in a sequence number display area of the information display interface, and displaying the answer analysis in an analysis content area of the information display interface.
13. An information extraction apparatus, characterized by comprising:
the document acquisition module is used for responding to the information extraction triggering operation and acquiring a document to be extracted, which is corresponding to the information extraction triggering operation and waits for information extraction;
the type identification module is used for identifying the organization type of the document to be extracted to obtain the target organization type of the document to be extracted; the target organization type represents the content contained in the document to be extracted and the layout structure of the contained content;
and the extraction processing module is used for performing extraction processing corresponding to the target organization type aiming at the document to be extracted to obtain the extraction information of the document to be extracted.
14. An information extraction device characterized by comprising:
a memory for storing executable information extraction instructions;
a processor for implementing the method of any one of claims 1 to 12 when executing the executable information extraction instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable information extraction instructions for, when executed by a processor, implementing the method of any one of claims 1 to 12.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110962863.7A CN114281953A (en) 2021-08-20 2021-08-20 Information extraction method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110962863.7A CN114281953A (en) 2021-08-20 2021-08-20 Information extraction method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114281953A true CN114281953A (en) 2022-04-05

Family

ID=80868413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110962863.7A Pending CN114281953A (en) 2021-08-20 2021-08-20 Information extraction method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114281953A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997137A (en) * 2022-06-16 2022-09-02 壹沓科技(上海)有限公司 Document information extraction method, device and equipment and readable storage medium
CN116108144A (en) * 2023-04-10 2023-05-12 恒生电子股份有限公司 Information extraction method and device
CN116108144B (en) * 2023-04-10 2023-07-25 恒生电子股份有限公司 Information extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination