WO2021135469A1 - Machine learning-based information extraction method, apparatus, computer device, and medium - Google Patents


Info

Publication number
WO2021135469A1
WO2021135469A1 PCT/CN2020/118951 CN2020118951W WO2021135469A1 WO 2021135469 A1 WO2021135469 A1 WO 2021135469A1 CN 2020118951 W CN2020118951 W CN 2020118951W WO 2021135469 A1 WO2021135469 A1 WO 2021135469A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
candidate set
information
rct
article
Prior art date
Application number
PCT/CN2020/118951
Other languages
French (fr)
Chinese (zh)
Inventor
黎旭东
丁佳佳
林桂
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135469A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an information extraction method, device, computer equipment and medium based on machine learning.
  • The RCT literature is highly targeted, and it already contains many completed RCT experimental designs and their data.
  • Key information about experimental design can be distilled from these RCT articles for the convenience of later researchers. At present, experimental standards, intervention methods, and key results are extracted from the RCT medical literature mainly through simple keyword or classification searches.
  • Sentences and information extracted this way are not accurate enough. For the extracted key information of an RCT article to be useful to medical researchers, the extraction results of the extraction system must be reliable and accurate. A method that can extract high-quality key sentence information from RCT articles has therefore become an urgent need.
  • the embodiments of the present application provide an information extraction method, device, computer equipment, and storage medium based on machine learning to improve the accuracy of RCT article information extraction.
  • an embodiment of the present application provides an information extraction method based on machine learning, including:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • an embodiment of the present application further provides an information extraction device based on machine learning, including:
  • the article acquisition module is used to acquire a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article;
  • the content extraction module is used to extract the title, abstract and body of the RCT article
  • a data preprocessing module configured to perform data preprocessing on the main text to obtain processed text information, wherein the text information includes a text short sentence and a position corresponding to the text short sentence;
  • the information extraction module is used to use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain coarse-grained key information Candidate set, taking the coarse-grained key information candidate set as the initial candidate set;
  • the information determining module is configured to filter the initial candidate set according to preset filtering conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • An embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the following machine learning-based information extraction method are implemented:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • Embodiments of the present application also provide a computer-readable storage medium that stores computer-readable instructions. When the computer-readable instructions are executed by a processor, the steps of the following machine learning-based information extraction method are implemented:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • The machine learning-based information extraction method, device, computer equipment, and storage medium provided in the embodiments of this application obtain a preset classification identifier and search the search database based on it to obtain an RCT article; extract the title, abstract, and main text of the RCT article; preprocess the main text to obtain processed text information; use the title, abstract, and text information as fusion features; input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information; and use that candidate set as the initial candidate set. The extracted initial candidate set is thus strongly correlated with the title and abstract, which ensures the accuracy of the extracted content.
  • The initial candidate set is then screened according to preset filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. The initial candidate set can therefore be screened as needed to obtain more accurate key information, which improves the accuracy of information extraction.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of the information extraction method based on machine learning of the present application;
  • FIG. 3 is a schematic structural diagram of an embodiment of an information extraction device based on machine learning according to the present application;
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for extracting information based on machine learning provided by the embodiments of the present application is executed by a server, and accordingly, the device for extracting information based on machine learning is set in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices, networks, and servers.
  • the terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.
  • FIG. 2 shows a machine learning-based information extraction method provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description. The details are as follows:
  • S201 Obtain a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article.
  • the classification identifiers of RCT articles are different.
  • the classification identification preset in the search database is obtained, and then based on the classification identification, the search is performed in the search database to obtain the RCT article.
  • RCT (research clinical trials, that is, randomized controlled trials).
  • Key information about completed RCT experimental designs can be distilled from published RCT articles for the convenience of later researchers.
  • The industry currently has no system for extracting summary sentences, such as experimental standards, intervention methods, and key results, from the RCT medical literature with accuracy that meets doctors' requirements. For the extracted key information of RCT articles to be helpful to medical researchers, the extraction results of the extraction system must be reliable and accurate.
  • search databases refer to digital libraries, databases, academic libraries, etc. containing medical RCT articles.
  • the category identification refers to the identification of the retrieval category corresponding to each category of document data in the search database, and the document information of a certain category can be quickly found through the classification identification.
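The retrieval in step S201 can be pictured with a minimal Python sketch. It assumes the search database is a simple in-memory list and that each record carries its classification identifier as a field; the `Document` class, the field names, and the sample records are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    category_id: str  # classification identifier of the retrieval category
    text: str


def search_by_category(database, category_id):
    """Return every document whose classification identifier matches."""
    return [doc for doc in database if doc.category_id == category_id]


# Hypothetical stand-in for a digital library or academic database.
database = [
    Document("a1", "RCT", "placeholder text of a first RCT article"),
    Document("a2", "REVIEW", "placeholder text of a review article"),
    Document("a3", "RCT", "placeholder text of a second RCT article"),
]

rct_articles = search_by_category(database, "RCT")
```

A real search database would expose its own query interface; the point of the sketch is only that the classification identifier acts as the lookup key.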
  • S202 Extract the title, abstract, and body of the RCT article.
  • The medical RCT article is parsed through a preset script file to obtain the title, abstract, and body of the medical RCT article.
  • the preset script file can be defined according to actual needs, and there is no limitation here.
  • The preset script types include, but are not limited to: shell scripts, JavaScript scripts, Lua scripts, and Python scripts. This embodiment uses a Python script.
  • the way of parsing includes, but is not limited to: regular matching, format parsing, template matching, etc.
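As an illustration of the regular matching option, the sketch below parses an article with a regular expression in Python. It assumes, purely for the example, that the article is plain text with the labels `Title:`, `Abstract:`, and `Body:`; that layout, the function name `parse_article`, and the sample article are hypothetical.

```python
import re

# Assumed layout: "Title: ... Abstract: ... Body: ..." in this order.
ARTICLE_PATTERN = re.compile(
    r"Title:\s*(?P<title>.*?)\s*Abstract:\s*(?P<abstract>.*?)\s*Body:\s*(?P<body>.*)",
    re.DOTALL,
)


def parse_article(raw):
    """Split a raw article into its title, abstract, and body."""
    match = ARTICLE_PATTERN.search(raw)
    if match is None:
        raise ValueError("article does not follow the assumed layout")
    return {name: value.strip() for name, value in match.groupdict().items()}


# Made-up sample article, for illustration only.
raw_article = """Title: Aspirin versus placebo for headache
Abstract: We compared aspirin with placebo. Pain scores were lower with aspirin.
Body: Patients were randomly assigned to two groups. The primary outcome was pain at two hours."""

parsed = parse_article(raw_article)
```

Real articles usually arrive as PDF, XML, or HTML, in which case format parsing or template matching would replace the regular expression.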
  • S203 Perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.
  • the obtained text is subjected to data preprocessing, including text segmentation, punctuation removal, etc., to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.
  • The position corresponding to a text short sentence is obtained by numbering the short sentences produced by data preprocessing in their order of appearance, which gives the position of each short sentence relative to the others.
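The preprocessing in step S203 can be sketched in Python as follows. The choice of delimiters, the numbering from 1, and the function name `preprocess` are assumptions made for the example; the application does not specify them.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(body):
    """Split the body into short sentences, strip punctuation, number them."""
    text_info = []
    # Assumed delimiters: sentence-ending marks plus commas and semicolons.
    for piece in re.split(r"[.;!?,]+", body):
        cleaned = piece.translate(_PUNCTUATION_TABLE).strip()
        if cleaned:
            position = len(text_info) + 1  # position in order of appearance
            text_info.append((position, cleaned))
    return text_info


text_info = preprocess(
    'Patients were randomized; the "control" group received placebo. '
    "Pain was measured, at two hours!"
)
```

Each entry pairs a short sentence with its position, which matches the description of the text information as short sentences plus their positions.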
  • S204 Use the title, abstract, and text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set.
  • the title, abstract, and text information are used as fusion features, and then the fusion features and RCT articles are input into a preset language representation model for training, and a candidate set of coarse-grained key information in the RCT article is obtained as the initial candidate set.
  • The language representation model includes, but is not limited to: the deep semantic representation algorithm ELMo (Embeddings from Language Models), OpenAI GPT, and the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.
  • In this embodiment, the BERT model is used as the language representation model.
  • The goal of the BERT model is to train on a large-scale unlabeled corpus to obtain a representation of the text that contains rich semantic information, that is, a semantic representation of the text, then fine-tune that semantic representation on a specific NLP task, and finally apply it to that task.
  • The word segments in the title are used as annotated key vocabulary features, and the short sentences in the abstract are used as annotated key short sentence features.
  • The BERT model is then used to obtain, from the main text, the short sentences associated with these annotated features as the candidate set.
  • coarse-grained key information refers to a collection of information containing key information, that is, the coarse-grained key information contains not only key information, but also other less important information, and therefore, further screening is required in the future.
  • S205 Perform screening processing on the initial candidate set according to the preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • RCT articles have their fixed characteristics.
  • In this embodiment, some general characteristics of the key information in RCT articles are obtained and used as preset filter conditions. The initial candidate set is screened according to these filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • The preset filter conditions in this embodiment can be determined according to actual conditions, but all include the following features: (1) the features contained in the key information sentences of RCT articles; (2) the interdependencies that exist within each type of key information sentence to be extracted. Using these two features as a basis, the initial candidate set output by the BERT model is screened and non-key information sentences are excluded, so as to obtain the target candidate set for each type of information to be extracted.
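One possible reading of feature (1) is a rule-based filter over cue words, sketched below in Python. The cue words, the two information types, and the sample candidates are all hypothetical, and the sketch does not attempt to model the interdependencies of feature (2).

```python
# Hypothetical cue words per type of key information; a real system would
# derive its filter conditions from the characteristics of RCT articles.
CUE_WORDS = {
    "intervention": ("received", "administered", "dose"),
    "result": ("significantly", "reduced", "increased"),
}


def filter_candidates(candidates, cue_words=CUE_WORDS):
    """Group candidate sentences by information type; drop the rest."""
    target = {info_type: [] for info_type in cue_words}
    for position, sentence in candidates:
        lowered = sentence.lower()
        for info_type, cues in cue_words.items():
            if any(cue in lowered for cue in cues):
                target[info_type].append((position, sentence))
    return target


# Made-up initial candidate set of (position, short sentence) pairs.
initial_candidates = [
    (3, "Patients received 500 mg aspirin"),
    (7, "Pain was significantly reduced"),
    (9, "The trial was registered in 2015"),
]
target_candidates = filter_candidates(initial_candidates)
```

Sentences that match no cue word are treated as non-key information and excluded from every target candidate set.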
  • In this embodiment, the RCT article is obtained, its title, abstract, and main text are extracted, and the main text is preprocessed to obtain the processed text information. The initial candidate set is then screened to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. The initial candidate set can thus be screened as needed to obtain more accurate key information, which helps improve the accuracy of information extraction.
  • The key information of each RCT article is stored in a blockchain network node. Blockchain storage allows the data to be shared between different platforms and prevents it from being tampered with.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
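The tamper-evidence property described above can be illustrated with a toy hash chain in Python. This is a single-process sketch only: it has no network nodes, no consensus mechanism, and no platform layers, and the block fields and function names are assumptions made for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(index, payload, previous_hash):
    """SHA-256 over a canonical JSON rendering of the block contents."""
    record = json.dumps(
        {"index": index, "payload": payload, "previous_hash": previous_hash},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Link a new block to the hash of the block before it."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "payload": payload,
        "previous_hash": previous_hash,
        "hash": block_hash(index, payload, previous_hash),
    })


def is_valid(chain):
    """Recompute every hash and check that each block points at its parent."""
    expected_previous = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(
            block["index"], block["payload"], block["previous_hash"]
        ):
            return False
        expected_previous = block["hash"]
    return True


chain = []
append_block(chain, {"article": "a1", "key_info": ["Pain was significantly reduced"]})
append_block(chain, {"article": "a3", "key_info": ["Patients received 500 mg aspirin"]})
```

Changing the payload of any stored block makes `is_valid` return `False`, because the recomputed hash no longer matches the stored one.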
  • step S204 using the title, abstract, and text information as fusion features includes:
  • the target word segmentation, summary sentence and text information are marked according to the source type, and the marked target word segmentation, the marked summary sentence and the marked text information are used as the fusion features of the input BERT model.
  • Word segmentation is performed on the title using a preset word segmentation method to obtain the target word segments, short sentence extraction is performed on the abstract to obtain the abstract short sentences, and the target word segments, the abstract short sentences, and the text information are then each marked according to their source type to obtain the fusion features.
  • preset word segmentation methods include, but are not limited to: third-party word segmentation tools or word segmentation algorithms, etc.
  • common third-party word segmentation tools include but are not limited to: Stanford NLP word segmentation, ICTCLAS word segmentation system, ansj word segmentation tool and HanLP Chinese word segmentation tool, etc.
  • Word segmentation algorithms include, but are not limited to: the forward Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Maximum Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.
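Of the algorithms listed, forward maximum matching is the simplest to show. The Python sketch below scans left to right and, at each position, takes the longest dictionary word that matches, falling back to a single character. The vocabulary and the sample title are made up for the example.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_len is None:
        max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a word matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary and title.
vocabulary = {"随机", "对照", "试验", "随机对照", "研究"}
tokens = forward_max_match("随机对照试验研究", vocabulary)
```

Reverse maximum matching applies the same idea from the end of the text, and the bi-directional variant compares the two results.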
  • the short sentence extraction of the abstract may specifically adopt the TextRank algorithm, or may adopt the method of natural language processing for semantic recognition.
  • the TextRank algorithm divides the text into several constituent units (words, sentences) and establishes a graph model, uses a voting mechanism to rank important components in the text, and uses only the information of the abstract itself to achieve key short sentences extraction.
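A compact way to see the voting mechanism is the Python sketch below, which ranks sentences with a TextRank-style iteration. It assumes whitespace tokenization and a word-overlap similarity normalized by the logarithms of the sentence lengths; the damping factor, the iteration count, and the sample sentences are illustrative choices rather than values from the application.

```python
import math


def sentence_similarity(a, b):
    """Word overlap normalized by the log lengths of both sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating votes over the similarity graph."""
    n = len(sentences)
    weights = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Made-up abstract sentences.
sentences = [
    "aspirin reduced pain scores in adults",
    "pain scores were lower with aspirin",
    "the weather was sunny",
]
scores = textrank(sentences)
top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2]
```

The two sentences that share vocabulary vote for each other and rank highest, while the unrelated sentence receives only the baseline score.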
  • Natural Language Processing is a method based on machine learning, especially statistical machine learning, to enable effective communication between humans and computers in natural language, generally applied to corpora and Markov models.
  • The target word segments, abstract short sentences, and text information are marked according to their source type. Specifically, an attribute may be added to each of them, with different identifiers marking which type of source it comes from. For example, the identifier "FC" marks the source as a target word segment, the identifier "ZY" marks the source as an abstract short sentence, and the identifier "WB" marks the source as text information.
  • the title, abstract, and text information are processed and marked as fusion features, which is beneficial to the accuracy of subsequent recognition through the BERT model.
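The marking step can be sketched in Python with a small record type that carries the source identifier. The identifiers "FC", "ZY", and "WB" are the ones given above; the `MarkedFeature` class, its fields, and the sample inputs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MarkedFeature:
    source: str                     # "FC", "ZY", or "WB"
    text: str
    position: Optional[int] = None  # only set for text information


def build_fusion_features(title_tokens, abstract_sentences, text_info):
    """Tag every item with its source type and merge into one feature list."""
    features = [MarkedFeature("FC", token) for token in title_tokens]
    features += [MarkedFeature("ZY", sentence) for sentence in abstract_sentences]
    features += [MarkedFeature("WB", sentence, pos) for pos, sentence in text_info]
    return features


# Made-up inputs for illustration.
fusion_features = build_fusion_features(
    title_tokens=["aspirin", "placebo", "headache"],
    abstract_sentences=["Pain scores were lower with aspirin"],
    text_info=[(1, "Patients were randomized"), (2, "Pain was measured")],
)
```

How the marked features are serialized into BERT input tokens is not described in the text, so the sketch stops at the tagged list.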
  • the preset BERT model includes an encoding layer and a Transformer layer.
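  • Inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and using that candidate set as the initial candidate set, includes:
  • encoding the fusion features through the encoding layer of the preset BERT model to obtain an initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;
  • performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;
  • calculating the similarity value between the third feature and the second feature, and using the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;
  • using the text information corresponding to the feature to be screened as the initial candidate set.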
  • The fusion features and the RCT article are input into the preset BERT model, and the fusion features are encoded through the encoding layer of the preset BERT model to obtain the initial code. The initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information.
  • Feature extraction is then performed on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code. For each third feature, the similarity between the third feature and the second feature is calculated; if the similarity is less than the first preset threshold, that third feature is taken as a feature to be screened.
  • the preset BERT model is a pre-trained BERT model, and its training samples are derived from pre-selected and labeled data features from RCT articles.
  • the calculation method of similarity includes, but is not limited to: Manhattan Distance, Euclidean Distance, Cosine Similarity, Minkowski Distance, etc.
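The four measures named above can be written in a few lines of plain Python. The function names are illustrative; note that the three distances shrink as vectors get closer, whereas cosine similarity grows.

```python
import math


def minkowski_distance(u, v, p):
    """General form; p=1 gives Manhattan and p=2 gives Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero vectors have no direction
```

Because the measures point in different directions, the comparison against a threshold has to be chosen to match the measure in use.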
  • The Transformer layer is built with the Transformer framework, a classic natural language processing architecture proposed by the Google team.
  • A Transformer can be stacked to a very large depth and uses the attention mechanism to achieve fast parallel computation. Compared with the usual convolutional neural network or recurrent neural network, the Transformer framework therefore trains quickly and achieves a high recognition rate.
  • the first preset threshold can be set according to actual conditions, for example, set to 0.6, which is not specifically limited here.
  • In this embodiment, the fusion features are encoded and their features are extracted, and the set of text information associated with the abstract is then determined as the initial candidate set. This narrows the scope of key information extraction and helps improve its efficiency.
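The threshold comparison can be sketched in Python as below. Two assumptions are made that the text leaves open: the similarity value is computed as a Euclidean distance (so that "less than the threshold" means "close"), and when the abstract yields several second features, the closest one is used. The vectors are made up and stand in for the output of the Transformer layer; 0.6 is the example threshold given above.

```python
import math

FIRST_THRESHOLD = 0.6  # example value given for the first preset threshold


def select_features_to_screen(third_features, second_features,
                              threshold=FIRST_THRESHOLD):
    """Keep the text features that lie close to at least one abstract feature."""
    selected = []
    for position, vector in third_features.items():
        # Assumption: compare against the closest abstract feature.
        closest = min(math.dist(vector, s) for s in second_features)
        if closest < threshold:
            selected.append(position)
    return selected


# Made-up feature vectors keyed by the position of the short sentence.
second_features = [[0.0, 0.0]]
third_features = {1: [0.3, 0.4], 2: [3.0, 4.0]}
to_screen = select_features_to_screen(third_features, second_features)
```

The positions returned identify the short sentences whose text information forms the initial candidate set.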
  • After the similarity value between the third feature and the second feature is calculated and the third feature whose similarity value with the second feature is less than the first preset threshold is used as the feature to be screened, the method further includes:
  • calculating the Euclidean distance between the feature to be screened and the first code;
  • using the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;
  • using the text information corresponding to the updated feature to be screened as the initial candidate set.
  • The first code corresponding to the title is used as a reference dimension, and the Euclidean distance between each feature to be screened and the first code is calculated. The smaller the distance, the more closely the text information corresponding to the feature to be screened is related to the title. The features to be screened are filtered according to the second preset threshold: a feature whose Euclidean distance from the first code is less than or equal to the second preset threshold is retained as an updated feature to be screened, while a feature whose Euclidean distance from the first code is greater than the second preset threshold is regarded as not closely related to the title and is eliminated.
  • the second preset threshold can be set according to actual needs, for example, set to 8, which is not specifically limited here.
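The title-based refinement can be sketched in Python as follows, using the example threshold of 8. The vectors are made up and two-dimensional for readability, and the function name `refine_by_title` is hypothetical.

```python
import math

SECOND_THRESHOLD = 8  # example value given for the second preset threshold


def refine_by_title(features_to_screen, first_code, threshold=SECOND_THRESHOLD):
    """Keep the features whose Euclidean distance to the title code is small."""
    return {
        position: vector
        for position, vector in features_to_screen.items()
        if math.dist(vector, first_code) <= threshold
    }


# Made-up vectors; distances to the title code are 6, 8, and 10.
first_code = [0.0, 0.0]
features_to_screen = {1: [6.0, 0.0], 2: [0.0, 8.0], 3: [6.0, 8.0]}
updated_features = refine_by_title(features_to_screen, first_code)
```

A feature exactly at the threshold is kept, matching the "less than or equal to" condition in the text.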
  • Euclidean distance, also known as the Euclidean metric, is a commonly used definition of distance. It refers to the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In this embodiment, it specifically refers to the distance between the space vector corresponding to the feature to be screened and the space vector corresponding to the first code.
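Written out, the standard Euclidean distance in m-dimensional space is shown below. The symbols are introduced here for illustration: x denotes the vector of a feature to be screened and y the vector of the first code.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2},
\qquad
\lVert \mathbf{x} \rVert = d(\mathbf{x}, \mathbf{0}) = \sqrt{\sum_{i=1}^{m} x_i^2}
```

The second expression is the natural length of a vector, that is, its distance to the origin.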
  • In this embodiment, the Euclidean distance between the first code and each feature to be screened is used to keep the features that are more closely related to the title (smaller Euclidean distance) as the updated features to be screened, which helps improve the accuracy of the initial candidate set.
  • the machine learning-based information extraction method further includes:
  • Sentence reconstruction is performed on the key information of the RCT article, and the updated key information is obtained.
  • the key information obtained may be derived from multiple paragraphs of the RCT article, that is, the extraction result may have poor readability because the position of the sentence in the full text is not continuous. At this time, it is necessary to reconstruct the extracted key information to obtain updated key information with clear sentence meaning and strong readability, and to enhance the reliability of extracting key information.
  • sentence reconstruction refers to the use of preset grammatical rules to check and correct the sentence pattern, and to supplement the missing parts of the sentence pattern according to the semantics to achieve the completeness of the sentence.
  • the preset grammar rules can be selected according to the actual language, and the corresponding grammar can be selected to formulate the corresponding rule script.
  • sentence reconstruction is performed on the key information of the RCT article to avoid problems such as grammatical incompatibility and semantic disconnection in the key information, so that the expression of the updated key information is more accurate.
  • Fig. 3 shows a principle block diagram of a machine learning-based information extraction device corresponding to the above-mentioned embodiment of the machine learning-based information extraction method one-to-one.
  • the information extraction device based on machine learning includes an article acquisition module 31, a content extraction module 32, a data preprocessing module 33, an information extraction module 34, and an information determination module 35.
  • the detailed description of each functional module is as follows:
  • the article obtaining module 31 is used to obtain a preset classification mark, and based on the classification mark, perform a search in the search database to obtain an RCT article;
  • the content extraction module 32 is used to extract the title, abstract and body of the RCT article
  • the data preprocessing module 33 is configured to perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;
  • the information extraction module 34 is used to use the title, abstract, and text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set;
  • the information determining module 35 is configured to filter the initial candidate set according to preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • the information extraction module 34 includes:
  • the word segmentation processing unit is used to perform word segmentation processing on the title to obtain the target word segmentation
  • the short sentence extraction unit is used to extract short sentences from the abstract to obtain the abstract short sentences;
  • the information marking unit is used to mark the target word segments, abstract short sentences, and text information according to their source type, and to use the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
  • the information extraction module 34 further includes:
  • the coding unit is used to input the fusion features and the RCT article into the preset BERT model and encode the fusion features through the encoding layer of the preset BERT model to obtain the initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;
  • the feature extraction unit is configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;
  • the similarity calculation unit is configured to calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;
  • the candidate set determining unit is used to use the text information corresponding to the feature to be screened as the initial candidate set.
  • The machine learning-based information extraction device further includes:
  • the distance calculation module is used to calculate the Euclidean distance between the feature to be screened and the first code
  • the feature screening module is configured to use the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the feature to be candidate after update;
  • the candidate set acquisition module is used to use the updated text information corresponding to the feature to be screened as the initial candidate set.
  • The machine learning-based information extraction device further includes:
  • a sentence reconstruction module, configured to reconstruct the key information of the RCT article to obtain updated key information.
  • The machine learning-based information extraction device further includes:
  • a storage module, configured to store the key information of the RCT article in a blockchain network node.
  • Each module in the above machine learning-based information extraction device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or be independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that the figure shows only a computer device 4 with the memory 41, the processor 42, and the network interface 43; not all of the illustrated components are required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), and embedded devices.
  • The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
  • The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, and optical disc.
  • The memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • The memory 41 may also be an external storage device of the computer device 4, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 4.
  • The memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • The memory 41 is generally used to store the operating system and the application software installed on the computer device 4, such as program codes for controlling electronic files.
  • The memory 41 can also be used to temporarily store data that has been output or will be output.
  • In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • The processor 42 is generally used to control the overall operation of the computer device 4.
  • The processor 42 is configured to run program codes or process data stored in the memory 41, for example to run program codes for controlling electronic files.
  • The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • This application also provides another implementation, namely a computer-readable storage medium that stores an interface display program executable by at least one processor, so that the at least one processor executes the steps of the machine learning-based information extraction method described above.
  • The technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods described in the embodiments of this application.
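
The two-stage screening performed by the similarity calculation unit and the distance calculation module described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: cosine similarity is an assumed choice (the text does not name the similarity measure), the abstract is reduced to a single feature vector for simplicity, and the names `screen_candidates`, `sim_threshold`, and `dist_threshold` are hypothetical. The comparison directions follow the wording of the text.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed measure, not named in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def screen_candidates(
    title_code: Vector,
    abstract_feature: Vector,
    text_features: List[Tuple[int, Vector]],
    sim_threshold: float,
    dist_threshold: float,
) -> List[int]:
    """Return the positions of the body sentences that pass both stages."""
    kept = []
    for position, feature in text_features:
        # Stage 1: as worded in the text, a feature goes on to screening when
        # its similarity to the abstract feature is below the first threshold.
        if cosine_similarity(feature, abstract_feature) >= sim_threshold:
            continue
        # Stage 2: keep it when its Euclidean distance to the title code
        # is less than or equal to the second threshold.
        if euclidean_distance(feature, title_code) <= dist_threshold:
            kept.append(position)
    return kept
```

The returned positions identify the text short sentences that would form the initial candidate set; the thresholds would have to be tuned on real data.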

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a machine learning-based information extraction method, apparatus, computer device, and medium, relating to the field of artificial intelligence, said method comprising: extracting the title, abstract, and main text of an RCT article (S202); performing data pre-processing of the main text to obtain processed text information; taking the title, abstract, and text information as fusion features, and inputting the fusion features and RCT article into a preset BERT model for training; obtaining a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set (S204); according to preset filter conditions, screening the initial candidate set to obtain a target candidate set, the text information corresponding to the target candidate set, and taking it as the key information of the RCT article (S205); the method also relates to blockchain technology; the key information of the obtained RCT article is stored in a blockchain network; the method improves the accuracy of information extraction.

Description

Information extraction method, apparatus, computer device, and medium based on machine learning
This application claims priority to Chinese patent application No. 202010554248.8, filed with the Chinese Patent Office on June 17, 2020 and entitled "Information extraction method, apparatus, computer device, and medium based on machine learning", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to an information extraction method, apparatus, computer device, and medium based on machine learning.
Background
With the development of medical concepts, the medical model has shifted from experience-based medicine to evidence-based medicine (EBM). Evidence-based medicine, which holds that "all clinical decisions should start from clinical evidence", provides the strongest evidence support for clinical work and rigorous guidance for clinical research design, and is of great significance to clinical practice and research. The main carrier of evidence in evidence-based medicine is the systematic review, whose writing requirements are extremely strict: for a clearly defined clinical question, researchers must perform a systematic search and literature screening to find the current best clinical evidence, assess the risk of bias of that evidence, and integrate the results. The steps involve systematic retrieval, literature screening, information extraction, risk-of-bias assessment, and data synthesis. To control the risk of bias in the included literature itself, the current best clinical evidence that systematic review authors look for is generally the randomized controlled clinical trial (RCT), the most rigorous study design.
RCT literature is highly targeted and contains many completed RCT experimental design methods and data. The key information on experimental design can be distilled from these RCT articles for the convenience of later researchers. At present, systems that extract summary sentences such as the experimental criteria, interventions, and key results from RCT medical literature rely mainly on simple keyword or category searches. The sentences obtained in this way are not precise enough, and the accuracy of the extracted information is biased. For the key information extracted from RCT articles to be useful to medical researchers, the results of the extraction system must be reliable and accurate. Finding a method that can extract high-quality key-sentence information from RCT articles has therefore become an urgent problem.
Summary of the invention
The embodiments of this application provide an information extraction method, apparatus, computer device, and storage medium based on machine learning, to improve the accuracy of information extraction from RCT articles.
To solve the above technical problem, an embodiment of this application provides an information extraction method based on machine learning, including:
obtaining a preset classification identifier and, based on the classification identifier, searching a search database to obtain an RCT article;
extracting the title, abstract, and body of the RCT article;
performing data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences;
using the title, the abstract, and the text information as a fusion feature, inputting the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set;
screening the initial candidate set according to preset filter conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
To solve the above technical problem, an embodiment of this application further provides an information extraction apparatus based on machine learning, including:
an article acquisition module, configured to obtain a preset classification identifier and, based on the classification identifier, search a search database to obtain an RCT article;
a content extraction module, configured to extract the title, abstract, and body of the RCT article;
a data preprocessing module, configured to perform data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences;
an information extraction module, configured to use the title, the abstract, and the text information as a fusion feature, input the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as an initial candidate set;
an information determining module, configured to screen the initial candidate set according to preset filter conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps of the information extraction method based on machine learning:
obtaining a preset classification identifier and, based on the classification identifier, searching a search database to obtain an RCT article;
extracting the title, abstract, and body of the RCT article;
performing data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences;
using the title, the abstract, and the text information as a fusion feature, inputting the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set;
screening the initial candidate set according to preset filter conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps of the information extraction method based on machine learning:
obtaining a preset classification identifier and, based on the classification identifier, searching a search database to obtain an RCT article;
extracting the title, abstract, and body of the RCT article;
performing data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences;
using the title, the abstract, and the text information as a fusion feature, inputting the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set;
screening the initial candidate set according to preset filter conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
The machine learning-based information extraction method, apparatus, computer device, and storage medium provided by the embodiments of this application obtain a preset classification identifier and, based on it, search a search database to obtain an RCT article; extract the title, abstract, and body of the RCT article; preprocess the body to obtain processed text information; use the title, abstract, and text information as a fusion feature; and input the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, which is used as the initial candidate set. The extracted initial candidate set is thus strongly correlated with the title and abstract, which helps ensure the accuracy of the extracted content. The initial candidate set is then screened according to preset filter conditions to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. Screening the initial candidate set as needed yields more accurate key information and helps improve the accuracy of information extraction.
Description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an exemplary system architecture diagram to which this application can be applied;
FIG. 2 is a flowchart of an embodiment of the machine learning-based information extraction method of this application;
FIG. 3 is a schematic structural diagram of an embodiment of the machine learning-based information extraction apparatus according to this application;
FIG. 4 is a schematic structural diagram of an embodiment of the computer device according to this application.
Detailed description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit this application. The terms "including" and "having" and any variations thereof in the specification, the claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and so on in the specification, the claims, or the above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
Referring to FIG. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages.
The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example a background server that supports the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the machine learning-based information extraction method provided by the embodiments of this application is executed by the server; accordingly, the machine learning-based information extraction apparatus is deployed in the server.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation. The terminal devices 101, 102, and 103 in the embodiments of this application may correspond to application systems in actual production.
Referring to FIG. 2, FIG. 2 shows a machine learning-based information extraction method provided by an embodiment of this application. The method is described by taking its application to the server in FIG. 1 as an example, as follows:
S201: Obtain a preset classification identifier and, based on the classification identifier, search a search database to obtain an RCT article.
Specifically, different search databases use different classification identifiers for RCT articles. The classification identifier preset for the search database is obtained first, and the search database is then searched with that identifier to obtain the RCT article.
Here, an RCT (randomized controlled trial) article is a type of medical article. To study the actual effect of a drug or an intervention, medical researchers define recruitment criteria and recruit volunteers for an experiment, and beforehand they also draw on completed RCT experimental designs. The key information on experimental design can be distilled from published RCT articles for the convenience of later researchers. At present, the industry has no system that extracts summary sentences such as the experimental criteria, interventions, and key results from RCT medical literature with the accuracy that doctors require. For the key information extracted from RCT articles to be useful to medical researchers, the results of the extraction system must be reliable and accurate.
Here, a search database refers to a digital library, database, academic repository, or similar resource that contains medical RCT articles.
Here, a classification identifier is the identifier of the retrieval category that corresponds to each category of literature in the search database; literature of a given category can be found quickly through its classification identifier.
S202: Extract the title, abstract, and body of the RCT article.
Specifically, the medical RCT article is parsed with a preset script file to obtain its title, abstract, and body.
The preset script file can be defined according to actual needs and is not limited here. Preset script types include, but are not limited to, shell scripts, JavaScript scripts, Lua scripts, and Python scripts; this embodiment preferably uses a Python script.
The parsing methods include, but are not limited to, regular-expression matching, format parsing, and template matching.
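As an illustration of regular-expression parsing in a Python script, the following minimal sketch splits an article into title, abstract, and body. The plain-text layout with `Title:`, `Abstract:`, and `Body:` markers and the function name `split_article` are assumptions made for the example; real articles from a search database would need a parser matched to their actual format.

```python
import re
from typing import Dict

# Assumed layout: the three sections appear in this order, each introduced
# by a literal marker. This layout is hypothetical, not taken from the text.
SECTION_PATTERN = re.compile(
    r"Title:\s*(?P<title>.*?)\s*Abstract:\s*(?P<abstract>.*?)\s*Body:\s*(?P<body>.*)",
    re.DOTALL,
)


def split_article(raw: str) -> Dict[str, str]:
    """Return the title, abstract, and body of an article in the assumed layout."""
    match = SECTION_PATTERN.search(raw)
    if match is None:
        raise ValueError("article does not follow the assumed Title/Abstract/Body layout")
    return {name: value.strip() for name, value in match.groupdict().items()}
```

Raising an error on an unmatched layout, rather than returning empty fields, keeps malformed articles from silently entering the later steps.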
S203: Perform data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences.
Specifically, data preprocessing, including text segmentation and punctuation removal, is performed on the obtained body to obtain the processed text information, which includes the text short sentences and their corresponding positions.
The position corresponding to a text short sentence is obtained by numbering the short sentences produced by preprocessing in their order of appearance, which gives the position of each short sentence relative to the others.
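The preprocessing step can be sketched in Python as follows. This is a simplified illustration, not the applicant's implementation: the delimiter set, the use of `string.punctuation`, and the function name `preprocess_body` are assumptions.

```python
import re
import string
from typing import List, Tuple

# Assumed delimiters that end a short sentence.
_SPLIT = re.compile(r"[.;!?\n]+")
# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def preprocess_body(body: str) -> List[Tuple[int, str]]:
    """Split the body into short sentences, strip punctuation, and number them."""
    sentences = []
    for chunk in _SPLIT.split(body):
        # Remove remaining punctuation and collapse runs of whitespace.
        cleaned = " ".join(chunk.translate(_PUNCT).split())
        if cleaned:
            sentences.append(cleaned)
    # The position is the index of the short sentence in order of appearance.
    return list(enumerate(sentences))
```

Splitting on every period is crude: decimals such as "0.05" and abbreviations would be broken apart, so a production pipeline would likely use a proper sentence tokenizer.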
S204: Use the title, abstract, and text information as a fusion feature, input the fusion feature and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set.
Specifically, the title, abstract, and text information are used as a fusion feature, and the fusion feature and the RCT article are input into a preset language representation model for training to obtain a candidate set of coarse-grained key information in the RCT article, which serves as the initial candidate set.
Language representation models include, but are not limited to, the ELMo (Embeddings from Language Models) algorithm, OpenAI GPT, and the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. This embodiment preferably uses the BERT model as the language representation model.
The goal of the BERT model is to train on a large unlabeled corpus to obtain a representation of text that carries rich semantic information, that is, a semantic representation of the text, and then to fine-tune that representation for a specific NLP task and apply it to the task. In this embodiment, the word segments of the title serve as labeled key-word features and the short sentences of the abstract serve as labeled key-sentence features. Based on these labeled features, the BERT model retrieves from the body the short sentences most closely related to them as the candidate set.
The process of fusing the title, abstract, and text information into the fusion feature is described in subsequent embodiments and is not repeated here.
Coarse-grained key information refers to a set of information that contains the key information; that is, it contains not only the key information but also other, less important information, and therefore requires further screening.
S205: Screen the initial candidate set according to preset filter conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
Specifically, RCT articles have fixed characteristics. By analyzing a number of RCT articles in advance, some general features of the key information in RCT articles are obtained and used as the preset filter conditions. The initial candidate set is screened according to these filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
The preset filter conditions in this embodiment may be determined according to the actual situation, but they all include the following features: (1) features contained in key-information sentences of RCT articles; (2) the dependency relations that exist within the sentences of each category of key information to be extracted. Using these two features as the basis, the initial candidate set output by the BERT algorithm is screened and non-key-information sentences are excluded, so that the target candidate set for each category of information to be extracted can be obtained.
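The two-part filter condition can be illustrated with a minimal Python sketch. The text does not list concrete features, so everything specific here is an assumption made for illustration: the category ("sample size"), the cue words, and the regular expression that stands in for a dependency-relation check. A real system would presumably derive the cues from analyzed RCT articles and use a dependency parser.

```python
import re
from typing import List

# Hypothetical cue words for one category of key information ("sample size").
# These are illustrative only; the source does not enumerate them.
CUE_WORDS = ("randomized", "randomised", "participants", "patients")

# Stand-in for the dependency-relation check: a number directly followed by a
# population noun. This regex is an assumption, not the patent's method.
NUMBER_NOUN = re.compile(r"\b\d+\s+(participants|patients|subjects)\b", re.I)


def has_cue_word(sentence: str) -> bool:
    """Condition (1): the sentence contains a feature typical of key sentences."""
    lowered = sentence.lower()
    return any(word in lowered for word in CUE_WORDS)


def has_required_relation(sentence: str) -> bool:
    """Condition (2): the sentence shows the expected internal relation."""
    return NUMBER_NOUN.search(sentence) is not None


def filter_candidates(initial_candidates: List[str]) -> List[str]:
    """Keep only sentences satisfying both conditions (the target candidate set)."""
    return [
        s for s in initial_candidates
        if has_cue_word(s) and has_required_relation(s)
    ]
```

Sentences that mention a cue word but lack the number-noun relation are dropped, which mirrors the idea of excluding non-key-information sentences from the initial candidate set.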
In this embodiment, a preset classification identifier is obtained and used to search a retrieval database for RCT articles; the title, abstract, and body of each RCT article are extracted; and the body is preprocessed to obtain processed text information. The title, the abstract, and the text information are used as fusion features, and the fusion features and the RCT article are input into the preset BERT model for training to obtain a candidate set of coarse-grained key information, which serves as the initial candidate set. The extracted initial candidate set is thus strongly correlated with the title and the abstract, which helps ensure the accuracy of the extracted content. The initial candidate set is then screened according to the preset filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. Screening the initial candidate set as needed yields more accurate key information and helps improve the accuracy of information extraction.
In one embodiment, after the key information of the RCT articles is obtained, the key information of each RCT article is stored in a node of a blockchain network. Blockchain storage allows the data to be shared across different platforms and also protects it against tampering.
Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
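The tamper-evidence property described above can be sketched in Python as a simple hash chain. This is not a blockchain network (there is no distribution or consensus), and the block layout and the names make_block and verify_chain are assumptions introduced for illustration; the sketch only shows how linking each block to the hash of its predecessor makes later modification detectable.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


def make_block(key_info: Dict[str, str], prev_hash: str) -> Dict[str, object]:
    """Create a block whose hash covers both its data and the previous hash."""
    payload = json.dumps(
        {"data": key_info, "prev_hash": prev_hash},
        sort_keys=True,
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": key_info, "prev_hash": prev_hash, "hash": digest}


def verify_chain(chain: List[Dict[str, object]]) -> bool:
    """Recompute every hash and check each link to the preceding block."""
    prev_hash = GENESIS_HASH
    for block in chain:
        expected = make_block(block["data"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

Changing the stored key information of any article after the fact changes its recomputed hash, so verify_chain returns False for the altered chain.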
In some optional implementations of this embodiment, using the title, the abstract, and the text information as fusion features in step S204 includes:
performing word segmentation on the title to obtain target word segments;
extracting short sentences from the abstract to obtain abstract short sentences;
marking the target word segments, the abstract short sentences, and the text information according to their source type, and using the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
Specifically, the title is segmented using a preset word segmentation method to obtain the target word segments, short sentences are extracted from the abstract to obtain the abstract short sentences, and the target word segments, abstract short sentences, and text information are then each marked according to source type to obtain the fusion features.
Further, the preset word segmentation method includes, but is not limited to, a third-party word segmentation tool or a word segmentation algorithm.
Common third-party word segmentation tools include, but are not limited to, the Stanford NLP segmenter, the ICTCLAS word segmentation system, the ansj word segmentation tool, and the HanLP Chinese word segmentation tool.
Word segmentation algorithms include, but are not limited to, the forward Maximum Matching (MM) algorithm, the Reverse Direction Maximum Matching (RMM) algorithm, the Bi-directional Maximum Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.
Understandably, extracting word segments from the title filters out some meaningless words, which helps to subsequently limit the scope of key information extraction based on these word segments.
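Of the algorithms listed, forward maximum matching is the simplest to illustrate. The Python sketch below assumes a small in-memory dictionary and a maximum word length of four characters; both are illustrative choices, and the function name forward_max_match is not taken from the source.

```python
from typing import List, Set


def forward_max_match(text: str, vocab: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches.
        # A single character is always accepted so the loop makes progress.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Illustrative dictionary: "random", "controlled", "trial", "randomized controlled".
vocab = {"随机", "对照", "试验", "随机对照"}
segments = forward_max_match("随机对照试验", vocab)
```

With this dictionary the title fragment is split into the longest available entries, so "随机对照" is preferred over the shorter "随机" followed by "对照". Reverse maximum matching applies the same idea from the end of the string.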
Further, short sentences may be extracted from the abstract using the TextRank algorithm, or by semantic recognition using natural language processing.
The TextRank algorithm divides the text into constituent units (words, sentences), builds a graph model, and ranks the important components of the text using a voting mechanism, so that key short sentences can be extracted using only the information in the abstract itself.
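The graph-and-voting idea behind TextRank can be sketched in plain Python as follows. The sketch treats sentences as nodes, weights edges by word overlap normalized by log sentence length, and iterates a PageRank-style update. The tokenizer, the damping factor of 0.85, the fixed 50 iterations, and all function names are assumptions for illustration rather than details given in the source.

```python
import math
import re
from typing import List, Set


def _tokens(sentence: str) -> Set[str]:
    """Lowercased alphanumeric word set of a sentence (simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def _overlap(a: Set[str], b: Set[str]) -> float:
    """Shared words normalized by the log lengths of both sentences."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences: List[str], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    """Return one score per sentence; higher means more central."""
    toks = [_tokens(s) for s in sentences]
    n = len(sentences)
    weights = [
        [_overlap(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives "votes" from its neighbours, weighted by edge
        # strength and by the neighbour's own current score.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(sentences: List[str], k: int = 2) -> List[str]:
    """Pick the k highest-scoring sentences, returned in original order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

A sentence that shares no vocabulary with the rest of the abstract receives no votes and ends up with the minimum score, so it is the first to be dropped when selecting key short sentences.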
Natural Language Processing refers to methods, based on machine learning and especially statistical machine learning, that enable effective communication between humans and computers in natural language; it generally makes use of corpora and Markov models.
Further, in this embodiment, marking the target word segments, abstract short sentences, and text information according to source type may specifically be done by adding an attribute to each of them, with different identifiers indicating the category each comes from. For example, the identifier "FC" marks the source as a target word segment, the identifier "ZY" marks the source as an abstract short sentence, and the identifier "WB" marks the source as text information.
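The marking step can be sketched in Python as attaching a source attribute to each item. The identifiers "FC", "ZY", and "WB" come from the text above; the TaggedFeature class and the build_fusion_features function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaggedFeature:
    source: str  # "FC" = title word segment, "ZY" = abstract sentence, "WB" = body text
    text: str


def build_fusion_features(title_segments: List[str],
                          abstract_sentences: List[str],
                          body_sentences: List[str]) -> List[TaggedFeature]:
    """Mark every item with its source type and concatenate into one feature list."""
    features = [TaggedFeature("FC", t) for t in title_segments]
    features += [TaggedFeature("ZY", s) for s in abstract_sentences]
    features += [TaggedFeature("WB", s) for s in body_sentences]
    return features
```

Keeping the source identifier alongside the text lets later stages treat title, abstract, and body items differently without having to infer where each item came from.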
In this embodiment, processing and marking the title, the abstract, and the text information to form the fusion features helps improve the accuracy of subsequent recognition by the BERT model.
In some optional implementations of this embodiment, the preset BERT model includes an encoding layer and a Transformer layer, and in step S204, inputting the fusion features and the RCT article into the preset BERT model for training to obtain the candidate set of coarse-grained key information, and using that candidate set as the initial candidate set, includes:
inputting the fusion features and the RCT article into the preset BERT model, and encoding the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
calculating a similarity value between the third feature and the second feature, and using each third feature whose similarity value with the second feature is less than a first preset threshold as a feature to be screened;
using the text information corresponding to the features to be screened as the initial candidate set.
Specifically, the fusion features and the RCT article are input into the preset BERT model, and the encoding layer of the preset BERT model encodes the fusion features to obtain the initial codes, which include the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information. The Transformer layer of the preset BERT model then performs feature extraction on the second code and the third code to obtain the second feature corresponding to the second code and the third feature corresponding to the third code. For each third feature, its similarity to the second feature is calculated; if the similarity is less than the first preset threshold, that third feature is used as a feature to be screened.
It should be noted that the preset BERT model is a BERT model trained in advance, whose training samples come from data features selected and labeled in advance from RCT articles.
The similarity may be calculated using, without limitation, Manhattan distance, Euclidean distance, cosine similarity, or Minkowski distance.
The Transformer layer is built using the Transformer framework, a classic natural language processing architecture proposed by a Google team. A Transformer can be stacked to a great depth and uses the attention mechanism to achieve fast parallel computation; compared with typical convolutional or recurrent neural networks, it therefore trains faster and achieves a higher recognition rate.
The first preset threshold may be set according to the actual situation, for example to 0.6; no specific limitation is imposed here.
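The listed measures and the threshold comparison can be sketched in Python as follows. The sketch keeps the comparison as written in the text (a value less than the first preset threshold is retained), which reads naturally when the value is a distance; for that reason cosine similarity is converted to a cosine distance here, which is an assumption of this sketch and not something the source states. The feature vectors are plain lists standing in for Transformer-layer outputs, a single second feature is assumed, and the function names are illustrative.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def minkowski(a: Vector, b: Vector, p: float = 2.0) -> float:
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a: Vector, b: Vector) -> float:
    return minkowski(a, b, 1.0)


def euclidean(a: Vector, b: Vector) -> float:
    return minkowski(a, b, 2.0)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_distance(a: Vector, b: Vector) -> float:
    """Assumed conversion so that smaller always means more alike."""
    return 1.0 - cosine_similarity(a, b)


def select_features(third_features: List[Vector], second_feature: Vector,
                    score: Callable[[Vector, Vector], float] = cosine_distance,
                    threshold: float = 0.6) -> List[int]:
    """Return indices of third features whose score is below the threshold."""
    return [
        i for i, feature in enumerate(third_features)
        if score(feature, second_feature) < threshold
    ]
```

The returned indices identify the body sentences whose text would form the initial candidate set; swapping the score argument switches between the measures without changing the selection logic.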
In this embodiment, the preset BERT model encodes the fusion features and extracts features from them, so that the set of text information associated with the abstract is determined and used as the initial candidate set. This narrows the scope of key information extraction and helps improve its efficiency.
In some optional implementations of this embodiment, after calculating the similarity value between the third feature and the second feature and using each third feature whose similarity value with the second feature is less than the first preset threshold as a feature to be screened, the method further includes:
calculating the Euclidean distance between each feature to be screened and the first code;
using the features to be screened whose Euclidean distance is less than or equal to a second preset threshold as updated features to be screened;
using the text information corresponding to the updated features to be screened as the initial candidate set.
Specifically, after the features to be screened are obtained, the first code corresponding to the title is used as a reference dimension in order to better screen out important information: the Euclidean distance between each feature to be screened and the first code is calculated. Understandably, the smaller the distance, the more closely the text information corresponding to that feature is associated with the title. The features to be screened are filtered according to the second preset threshold: those whose Euclidean distance from the first code is less than or equal to the second preset threshold are retained as updated features to be screened, and those whose Euclidean distance from the first code is greater than the second preset threshold are regarded as not closely associated with the title and are removed.
The second preset threshold may be set according to actual needs, for example to 8; no specific limitation is imposed here.
Euclidean distance, also called the Euclidean metric, is a commonly used definition of distance: the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In this embodiment, it specifically refers to the distance between the space vector corresponding to a feature to be screened and the space vector corresponding to the first code.
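This second screening pass can be sketched in Python as a filter over candidate vectors. The example threshold of 8 comes from the text; the dictionary layout mapping candidate text to its vector, the three-dimensional toy vectors, and the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_by_title(candidates: Dict[str, Vector], title_code: Vector,
                    max_distance: float = 8.0) -> List[str]:
    """Keep candidates whose vector lies within max_distance of the title code."""
    return [
        text for text, vector in candidates.items()
        if euclidean_distance(vector, title_code) <= max_distance
    ]
```

Because the comparison is "less than or equal to", a candidate exactly at the threshold distance is retained, matching the wording of the step above.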
In this embodiment, the Euclidean distance between the first code and each feature to be screened is used to select the features more closely associated with the title (smaller Euclidean distance) as the updated features to be screened, which helps improve the accuracy of the initial candidate set.
In some optional implementations of this embodiment, after step S205, the machine learning-based information extraction method further includes:
performing sentence reconstruction on the key information of the RCT article to obtain updated key information.
Specifically, the key information obtained may come from several passages of the RCT article; that is, because the extracted sentences are not contiguous in the full text, the extraction result may be hard to read. In that case, sentence reconstruction is performed on the extracted key information to obtain updated key information with clear meaning and good readability, which makes the extracted key information more reliable.
In this embodiment, sentence reconstruction means checking and correcting the sentence pattern using preset grammar rules and completing the missing parts of the sentence according to its semantics, so that the sentence is complete.
The preset grammar rules may be formulated as a rule script by selecting the grammar corresponding to the actual language.
Completing a sentence according to its semantics may specifically be done by first performing semantic recognition on it and then supplying the corresponding keywords for the parts missing from the sentence pattern. Semantic recognition may be implemented using natural language processing; for the specific process, refer to the description of the foregoing embodiments, which is not repeated here.
In this embodiment, performing sentence reconstruction on the key information of the RCT article avoids problems such as ungrammatical phrasing and semantic disconnection between adjacent sentences, so that the updated key information is expressed more accurately.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 3 shows a schematic block diagram of a machine learning-based information extraction apparatus corresponding one-to-one to the machine learning-based information extraction method of the foregoing embodiments. As shown in Fig. 3, the apparatus includes an article acquisition module 31, a content extraction module 32, a data preprocessing module 33, an information extraction module 34, and an information determination module 35. The functional modules are described in detail as follows:
the article acquisition module 31, configured to obtain a preset classification identifier and, based on the classification identifier, search a retrieval database to obtain RCT articles;
the content extraction module 32, configured to extract the title, abstract, and body of the RCT article;
the data preprocessing module 33, configured to perform data preprocessing on the body to obtain processed text information, where the text information includes text short sentences and the positions corresponding to the text short sentences;
the information extraction module 34, configured to use the title, the abstract, and the text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use that candidate set as the initial candidate set;
the information determination module 35, configured to screen the initial candidate set according to preset filter conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
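The way the five modules hand data to one another can be sketched in Python as a pipeline of pluggable callables. The class and field names below are hypothetical, the comments map each field to a module number from Fig. 3, and the callables are left abstract because the source does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A text short sentence together with its position in the body.
Positioned = Tuple[int, str]


@dataclass
class Article:
    # Stands in for the output of content extraction module 32.
    title: str
    abstract: str
    body: str


@dataclass
class ExtractionPipeline:
    acquire: Callable[[str], List[Article]]                              # module 31
    preprocess: Callable[[str], List[Positioned]]                        # module 33
    extract: Callable[[str, str, List[Positioned]], List[Positioned]]    # module 34
    determine: Callable[[List[Positioned]], List[str]]                   # module 35

    def run(self, classification_id: str) -> List[List[str]]:
        """Return the key information of every article found for the identifier."""
        results = []
        for article in self.acquire(classification_id):
            text_info = self.preprocess(article.body)
            initial = self.extract(article.title, article.abstract, text_info)
            results.append(self.determine(initial))
        return results
```

Passing the stages in as callables keeps the sketch independent of any particular retrieval database or model, so each module can be replaced by a stub when testing the others.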
Optionally, the information extraction module 34 includes:
a word segmentation unit, configured to perform word segmentation on the title to obtain target word segments;
a short sentence extraction unit, configured to extract short sentences from the abstract to obtain abstract short sentences;
an information marking unit, configured to mark the target word segments, the abstract short sentences, and the text information according to their source type, and to use the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
Optionally, the information extraction module 34 further includes:
an encoding unit, configured to input the fusion features and the RCT article into the preset BERT model and encode the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
a feature extraction unit, configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
a similarity calculation unit, configured to calculate a similarity value between the third feature and the second feature, and to use each third feature whose similarity value with the second feature is less than the first preset threshold as a feature to be screened;
a candidate set determination unit, configured to use the text information corresponding to the features to be screened as the initial candidate set.
Optionally, the machine learning-based RCT article information extraction apparatus further includes:
a distance calculation module, configured to calculate the Euclidean distance between each feature to be screened and the first code;
a feature screening module, configured to use the features to be screened whose Euclidean distance is less than or equal to the second preset threshold as updated features to be screened;
a candidate set acquisition module, configured to use the text information corresponding to the updated features to be screened as the initial candidate set.
Optionally, the machine learning-based information extraction apparatus further includes:
a sentence reconstruction module, configured to perform sentence reconstruction on the key information of the RCT article to obtain updated key information.
Optionally, the machine learning-based information extraction apparatus further includes:
a storage module, configured to store the key information of the RCT article in a node of a blockchain network.
For specific limitations on the machine learning-based information extraction apparatus, refer to the limitations on the machine learning-based information extraction method above, which are not repeated here. Each module in the apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
To solve the above technical problems, the embodiments of the present application also provide a computer device. Refer to Fig. 4, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to one another via a system bus. It should be noted that the figure shows only the computer device 4 with the memory 41, the processor 42, and the network interface 43, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch panel, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. The memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is generally used to store the operating system and the various application software installed on the computer device 4, such as program code for controlling electronic files. In addition, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run program code for controlling electronic files.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
The present application also provides another implementation, namely a computer-readable storage medium storing an interface display program that can be executed by at least one processor, so that the at least one processor performs the steps of the machine learning-based information extraction method described above.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务 器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above implementation manners, those skilled in the art can clearly understand that the above-mentioned embodiment method can be implemented by means of software plus the necessary general hardware platform, of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, The optical disc) includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.
Obviously, the embodiments described above are only some of the embodiments of this application, not all of them. The drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; the purpose of providing these embodiments is to make the disclosure of this application more thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or equivalently replace some of their technical features. Any equivalent structure made using the contents of the description and drawings of this application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. A machine learning-based information extraction method, applied to key information extraction from RCT articles, characterized in that the machine learning-based information extraction method comprises:
    obtaining a preset classification identifier and, based on the classification identifier, searching a retrieval database to obtain an RCT article;
    extracting the title, abstract, and body of the RCT article;
    performing data preprocessing on the body to obtain processed text information, wherein the text information includes short text sentences and the positions corresponding to the short text sentences;
    using the title, the abstract, and the text information as fusion features, and inputting the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set;
    filtering the initial candidate set according to preset filtering conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
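As an illustration of the data preprocessing step in claim 1, the following is a minimal sketch of splitting a body text into short sentences while recording each one's position. The function name `split_short_sentences`, the choice of delimiters, and the use of a character offset as the "position" are all assumptions made for this example; the claim does not specify them.

```python
import re
from typing import List, Tuple


def split_short_sentences(body: str) -> List[Tuple[str, int]]:
    """Split body text into short sentences with their start offsets.

    Assumption: a short sentence ends at common Chinese or English
    sentence/clause punctuation or at a newline. This naive rule also
    splits decimals such as "0.05", so real text would need more care.
    """
    results: List[Tuple[str, int]] = []
    for match in re.finditer(r"[^。!?;.!?;\n]+", body):
        raw = match.group()
        text = raw.strip()
        if not text:
            continue  # skip fragments that are only whitespace
        # Offset of the first non-space character of this fragment.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        results.append((text, start))
    return results


if __name__ == "__main__":
    sample = "Patients were randomized. The primary outcome was mortality."
    for sentence, position in split_short_sentences(sample):
        print(position, sentence)
```

Each returned pair keeps the sentence text together with where it starts in the body, so a later filtering stage can map a selected candidate back to its location in the article.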
  2. The machine learning-based information extraction method according to claim 1, characterized in that using the title, the abstract, and the text information as fusion features comprises:
    performing word segmentation on the title to obtain target word segments;
    performing short sentence extraction on the abstract to obtain abstract short sentences;
    marking the target word segments, the abstract short sentences, and the text information according to their source type, and using the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
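The source-type marking in claim 2 can be sketched as follows. This is only an illustration under stated assumptions: the tag strings `"title"`, `"abstract"`, and `"body"`, the dictionary layout, and the function name `build_fusion_features` are hypothetical, and the claim does not say how the marks are encoded before they reach the BERT model.

```python
from typing import Dict, List, Tuple


def build_fusion_features(
    title_tokens: List[str],
    abstract_sentences: List[str],
    body_sentences: List[Tuple[str, int]],
) -> List[Dict[str, object]]:
    """Tag every item with the part of the article it came from."""
    features: List[Dict[str, object]] = []
    for token in title_tokens:
        features.append({"source": "title", "text": token, "position": None})
    for sentence in abstract_sentences:
        features.append({"source": "abstract", "text": sentence, "position": None})
    # Body items carry the position produced by preprocessing.
    for sentence, position in body_sentences:
        features.append({"source": "body", "text": sentence, "position": position})
    return features


if __name__ == "__main__":
    fused = build_fusion_features(
        ["aspirin", "stroke"],
        ["We randomized 200 patients"],
        [("Mortality was the primary outcome", 40)],
    )
    for item in fused:
        print(item)
```

Keeping the source tag next to each text item lets later steps treat title, abstract, and body encodings separately, which is what claims 3 and 4 rely on.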
  3. The machine learning-based information extraction method according to claim 1, characterized in that the preset BERT model includes an encoding layer and a Transformer layer, and inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set, comprises:
    inputting the fusion features and the RCT article into the preset BERT model, and encoding the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
    performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
    calculating a similarity value between the third feature and the second feature, and using each third feature whose similarity value with the second feature is less than a first preset threshold as a feature to be screened;
    using the text information corresponding to the features to be screened as the initial candidate set.
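The similarity screening in claim 3 might look like the sketch below. Several things here are assumptions rather than details from the claim: cosine similarity stands in for the unspecified "similarity value", the abstract is represented by a single feature vector, and the features are plain Python lists instead of BERT outputs. The comparison is written as "less than" only to mirror the claim wording.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_features_to_screen(
    body_features: List[Tuple[str, Sequence[float]]],
    abstract_feature: Sequence[float],
    first_threshold: float,
) -> List[Tuple[str, Sequence[float]]]:
    """Keep body features whose similarity value is below the threshold.

    The direction of the comparison follows the claim text literally.
    """
    return [
        (text, vector)
        for text, vector in body_features
        if cosine_similarity(vector, abstract_feature) < first_threshold
    ]


if __name__ == "__main__":
    abstract_vec = [1.0, 0.0]
    body = [("sentence A", [1.0, 0.0]), ("sentence B", [0.0, 1.0])]
    print(select_features_to_screen(body, abstract_vec, 0.5))
```

The text paired with each surviving vector is what would form the initial candidate set.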
  4. The machine learning-based information extraction method according to claim 3, characterized in that, after calculating the similarity value between the third feature and the second feature and using each third feature whose similarity value with the second feature is less than the first preset threshold as a feature to be screened, the method further comprises:
    calculating the Euclidean distance between the feature to be screened and the first code;
    using each feature to be screened whose Euclidean distance is less than or equal to a second preset threshold as an updated feature to be screened;
    using the text information corresponding to the updated features to be screened as the initial candidate set.
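The distance-based refinement in claim 4 can be illustrated with a short sketch. It assumes that the features to be screened and the title's first code are vectors of the same dimension; the function names and the list-based representation are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_by_title_distance(
    features_to_screen: List[Tuple[str, Sequence[float]]],
    title_code: Sequence[float],
    second_threshold: float,
) -> List[Tuple[str, Sequence[float]]]:
    """Keep features whose distance to the title code is <= the threshold."""
    return [
        (text, vector)
        for text, vector in features_to_screen
        if euclidean_distance(vector, title_code) <= second_threshold
    ]


if __name__ == "__main__":
    title_vec = [0.0, 0.0]
    candidates = [("near", [3.0, 4.0]), ("far", [6.0, 8.0])]
    print(refine_by_title_distance(candidates, title_vec, 5.0))
```

Because the comparison is "less than or equal to", a feature lying exactly at the threshold distance is retained.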
  5. The machine learning-based information extraction method according to claim 1, characterized in that, after filtering the initial candidate set according to the preset filtering conditions to obtain the target candidate set and using the text information corresponding to the target candidate set as the key information of the RCT article, the machine learning-based information extraction method further comprises:
    performing sentence reconstruction on the key information of the RCT article to obtain updated key information.
  6. The machine learning-based information extraction method according to claim 1, characterized in that, after filtering the initial candidate set according to the preset filtering conditions to obtain the target candidate set and using the text information corresponding to the target candidate set as the key information of the RCT article, the method further comprises:
    storing the key information of the RCT article in a blockchain network node.
  7. A machine learning-based information extraction device, applied to key information extraction from RCT articles, characterized in that the machine learning-based information extraction device comprises:
    an article acquisition module, configured to obtain a preset classification identifier and, based on the classification identifier, search a retrieval database to obtain an RCT article;
    a content extraction module, configured to extract the title, abstract, and body of the RCT article;
    a data preprocessing module, configured to perform data preprocessing on the body to obtain processed text information, wherein the text information includes short text sentences and the positions corresponding to the short text sentences;
    an information extraction module, configured to use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set;
    an information determination module, configured to filter the initial candidate set according to preset filtering conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  8. The machine learning-based information extraction device according to claim 7, characterized in that the information extraction module comprises:
    an encoding unit, configured to input the fusion features and the RCT article into the preset BERT model, and encode the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
    a feature extraction unit, configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
    a similarity calculation unit, configured to calculate a similarity value between the third feature and the second feature, and use each third feature whose similarity value with the second feature is less than a first preset threshold as a feature to be screened;
    a candidate set determination unit, configured to use the text information corresponding to the features to be screened as the initial candidate set.
  9. The machine learning-based information extraction device according to claim 7, characterized in that the machine learning-based RCT article information extraction device further comprises:
    a distance calculation module, configured to calculate the Euclidean distance between the feature to be screened and the first code;
    a feature screening module, configured to use each feature to be screened whose Euclidean distance is less than or equal to a second preset threshold as an updated feature to be screened;
    a candidate set acquisition module, configured to use the text information corresponding to the updated features to be screened as the initial candidate set.
  10. The machine learning-based information extraction device according to claim 7, characterized in that the machine learning-based RCT article information extraction device further comprises:
    a sentence reconstruction module, configured to perform sentence reconstruction on the key information of the RCT article to obtain updated key information.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the following steps of a machine learning-based information extraction method:
    obtaining a preset classification identifier and, based on the classification identifier, searching a retrieval database to obtain an RCT article;
    extracting the title, abstract, and body of the RCT article;
    performing data preprocessing on the body to obtain processed text information, wherein the text information includes short text sentences and the positions corresponding to the short text sentences;
    using the title, the abstract, and the text information as fusion features, and inputting the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set;
    filtering the initial candidate set according to preset filtering conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
  12. The computer device according to claim 11, characterized in that using the title, the abstract, and the text information as fusion features comprises:
    performing word segmentation on the title to obtain target word segments;
    performing short sentence extraction on the abstract to obtain abstract short sentences;
    marking the target word segments, the abstract short sentences, and the text information according to their source type, and using the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
  13. The computer device according to claim 11, characterized in that the preset BERT model includes an encoding layer and a Transformer layer, and inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set, comprises:
    inputting the fusion features and the RCT article into the preset BERT model, and encoding the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
    performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
    calculating a similarity value between the third feature and the second feature, and using each third feature whose similarity value with the second feature is less than a first preset threshold as a feature to be screened;
    using the text information corresponding to the features to be screened as the initial candidate set.
  14. The computer device according to claim 13, characterized in that, after calculating the similarity value between the third feature and the second feature and using each third feature whose similarity value with the second feature is less than the first preset threshold as a feature to be screened, the processor, when executing the computer-readable instructions, further implements the following steps of the machine learning-based information extraction method:
    calculating the Euclidean distance between the feature to be screened and the first code;
    using each feature to be screened whose Euclidean distance is less than or equal to a second preset threshold as an updated feature to be screened;
    using the text information corresponding to the updated features to be screened as the initial candidate set.
  15. The computer device according to claim 11, characterized in that, after filtering the initial candidate set according to the preset filtering conditions to obtain the target candidate set and using the text information corresponding to the target candidate set as the key information of the RCT article, the processor, when executing the computer-readable instructions, further implements the following step of the machine learning-based information extraction method:
    performing sentence reconstruction on the key information of the RCT article to obtain updated key information.
  16. A computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following machine learning-based information extraction method:
    obtaining a preset classification identifier and, based on the classification identifier, searching a retrieval database to obtain an RCT article;
    extracting the title, abstract, and body of the RCT article;
    performing data preprocessing on the body to obtain processed text information, wherein the text information includes short text sentences and the positions corresponding to the short text sentences;
    using the title, the abstract, and the text information as fusion features, and inputting the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set;
    filtering the initial candidate set according to preset filtering conditions to obtain a target candidate set, and using the text information corresponding to the target candidate set as the key information of the RCT article.
  17. The computer-readable storage medium according to claim 16, characterized in that using the title, the abstract, and the text information as fusion features comprises:
    performing word segmentation on the title to obtain target word segments;
    performing short sentence extraction on the abstract to obtain abstract short sentences;
    marking the target word segments, the abstract short sentences, and the text information according to their source type, and using the marked target word segments, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
  18. The computer-readable storage medium according to claim 16, characterized in that the preset BERT model includes an encoding layer and a Transformer layer, and inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, the candidate set of coarse-grained key information being used as an initial candidate set, comprises:
    inputting the fusion features and the RCT article into the preset BERT model, and encoding the fusion features through the encoding layer of the preset BERT model to obtain initial codes, the initial codes including a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
    performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;
    calculating a similarity value between the third feature and the second feature, and using each third feature whose similarity value with the second feature is less than a first preset threshold as a feature to be screened;
    using the text information corresponding to the features to be screened as the initial candidate set.
  19. The computer-readable storage medium according to claim 18, characterized in that, after calculating the similarity value between the third feature and the second feature and using each third feature whose similarity value with the second feature is less than the first preset threshold as a feature to be screened, the computer-readable instructions, when executed by the processor, further implement the following steps of the machine learning-based information extraction method:
    calculating the Euclidean distance between the feature to be screened and the first code;
    using each feature to be screened whose Euclidean distance is less than or equal to a second preset threshold as an updated feature to be screened;
    using the text information corresponding to the updated features to be screened as the initial candidate set.
  20. The computer-readable storage medium according to claim 16, characterized in that, after filtering the initial candidate set according to the preset filtering conditions to obtain the target candidate set and using the text information corresponding to the target candidate set as the key information of the RCT article, the computer-readable instructions, when executed by the processor, further implement the following step of the machine learning-based information extraction method:
    performing sentence reconstruction on the key information of the RCT article to obtain updated key information.
PCT/CN2020/118951 2020-06-17 2020-09-29 Machine learning-based information extraction method, apparatus, computer device, and medium WO2021135469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010554248.8 2020-06-17
CN202010554248.8A CN111814465A (en) 2020-06-17 2020-06-17 Information extraction method and device based on machine learning, computer equipment and medium

Publications (1)

Publication Number Publication Date
WO2021135469A1 true WO2021135469A1 (en) 2021-07-08

Family

ID=72845811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118951 WO2021135469A1 (en) 2020-06-17 2020-09-29 Machine learning-based information extraction method, apparatus, computer device, and medium

Country Status (2)

Country Link
CN (1) CN111814465A (en)
WO (1) WO2021135469A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282528A (en) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium
CN115879450A (en) * 2023-01-06 2023-03-31 广东爱因智能科技有限公司 Step-by-step text generation method, system, computer equipment and storage medium
CN116501861A (en) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration
CN117093717A (en) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof
CN117875268A (en) * 2024-03-13 2024-04-12 山东科技大学 Extraction type text abstract generation method based on clause coding

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347753B (en) * 2020-11-12 2022-05-27 山西大学 Abstract generation method and system applied to reading robot
CN112800465A (en) * 2021-02-09 2021-05-14 第四范式(北京)技术有限公司 Method and device for processing text data to be labeled, electronic equipment and medium
CN113378024B (en) * 2021-05-24 2023-09-01 哈尔滨工业大学 Deep learning-oriented public inspection method field-based related event identification method
CN113626582B (en) * 2021-07-08 2023-07-28 中国人民解放军战略支援部队信息工程大学 Two-stage abstract generation method and system based on content selection and fusion
CN114510560A (en) * 2022-01-27 2022-05-17 福建博思软件股份有限公司 Commodity key information extraction method based on deep learning and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294320A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 A kind of terminology extraction method and system towards scientific paper
CN106570191A (en) * 2016-11-11 2017-04-19 浙江大学 Wikipedia-based Chinese and English cross-language entity matching method
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN110413994A (en) * 2019-06-28 2019-11-05 宁波深擎信息科技有限公司 Hot topic generation method, device, computer equipment and storage medium
CN110427482A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 A kind of abstracting method and relevant device of object content
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN106294320A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 A kind of terminology extraction method and system towards scientific paper
CN106570191A (en) * 2016-11-11 2017-04-19 浙江大学 Wikipedia-based Chinese and English cross-language entity matching method
CN110413994A (en) * 2019-06-28 2019-11-05 宁波深擎信息科技有限公司 Hot topic generation method, device, computer equipment and storage medium
CN110427482A (en) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 A kind of abstracting method and relevant device of object content
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI YUAN-BING;ZHOU JUN;WEI ZHONG: "TextRank-based Chinese Automatic Summarization Method", COMMUNICATIONS TECHNOLOGY, vol. 52, no. 9, 10 September 2019 (2019-09-10), pages 2233 - 2239, XP055826776, ISSN: 1002-0802, DOI: 10.3969/j.issn.1002-0802.2019.09.029 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282528A (en) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium
CN115879450A (en) * 2023-01-06 2023-03-31 广东爱因智能科技有限公司 Step-by-step text generation method, system, computer equipment and storage medium
CN115879450B (en) * 2023-01-06 2023-09-01 广东爱因智能科技有限公司 Gradual text generation method, system, computer equipment and storage medium
CN116501861A (en) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration
CN116501861B (en) * 2023-06-25 2023-09-22 知呱呱(天津)大数据技术有限公司 Long text abstract generation method based on hierarchical BERT model and label migration
CN117093717A (en) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof
CN117093717B (en) * 2023-10-20 2024-01-30 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof
CN117875268A (en) * 2024-03-13 2024-04-12 山东科技大学 Extraction type text abstract generation method based on clause coding
CN117875268B (en) * 2024-03-13 2024-05-31 山东科技大学 Extraction type text abstract generation method based on clause coding

Also Published As

Publication number Publication date
CN111814465A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN108804423B (en) Medical text feature extraction and automatic matching method and system
US11361002B2 (en) Method and apparatus for recognizing entity word, and storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110532381B (en) Text vector acquisition method and device, computer equipment and storage medium
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112328761B (en) Method and device for setting intention label, computer equipment and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN112860919B (en) Data labeling method, device, equipment and storage medium based on generation model
CN110852106A (en) Named entity processing method and device based on artificial intelligence and electronic equipment
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN111353311A (en) Named entity identification method and device, computer equipment and storage medium
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN115983271A (en) Named entity recognition method and named entity recognition model training method
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN112084779A (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN114220505A (en) Information extraction method of medical record data, terminal equipment and readable storage medium
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
CN112417875B (en) Configuration information updating method and device, computer equipment and medium
CN116542246A (en) Keyword quality inspection text-based method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910788

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20910788

Country of ref document: EP

Kind code of ref document: A1