WO2021135469A1 - Machine learning-based information extraction method, apparatus, computer device, and medium

Info

Publication number: WO2021135469A1
Application number: PCT/CN2020/118951
Authority: WO
Inventors: 黎旭东; 丁佳佳; 林桂
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-06-17
Filing date: 2020-09-29
Publication date: 2021-07-08
Also published as: CN111814465A

Abstract

Provided are a machine learning-based information extraction method, apparatus, computer device, and medium, relating to the field of artificial intelligence. The method comprises: extracting the title, abstract, and main text of an RCT article (S202); performing data pre-processing on the main text to obtain processed text information; taking the title, abstract, and text information as fusion features, inputting the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and using that candidate set as an initial candidate set (S204); and screening the initial candidate set according to preset filter conditions to obtain a target candidate set, and taking the text information corresponding to the target candidate set as the key information of the RCT article (S205). The method also relates to blockchain technology: the obtained key information of the RCT article is stored in a blockchain network. The method improves the accuracy of information extraction.

Description

Information extraction method, device, computer equipment and medium based on machine learning

This application claims priority to Chinese patent application No. 202010554248.8, filed with the Chinese Patent Office on June 17, 2020 and titled "Machine learning-based information extraction method, device, computer equipment and medium", the entire content of which is incorporated in this application by reference.

Technical field

This application relates to the field of artificial intelligence, and in particular to an information extraction method, device, computer equipment and medium based on machine learning.

Background technique

With the development of medical concepts, the medical model has changed from empirical medicine to evidence-based medicine (EBM). Evidence-based medicine, which upholds the principle that "all clinical decision-making should be based on clinical evidence", can provide the most powerful evidence support and rigorous clinical research design guidance for clinical work, and has important guiding significance for clinical practice and scientific research. The main evidence carrier of evidence-based medicine is the systematic review, whose writing requirements are extremely strict. For a clearly defined clinical problem, researchers need to conduct systematic searches and document screening to find the best current clinical evidence, and then perform bias risk assessment and result integration on this evidence. The steps involve systematic retrieval, document screening, information extraction, bias risk evaluation, and data synthesis. To control the risk of bias in the included literature, the best current clinical evidence that a systematic review writer needs to find is generally the most rigorously designed type of study, the randomized controlled clinical trial (RCT).

The RCT literature is highly targeted. It contains many completed RCT experimental design methods and data, and the key information of the experimental design can be refined from these RCT articles to provide convenience for later researchers. At present, experimental standards, intervention methods, and key results are mainly extracted from the RCT medical literature through simple keyword or classification search. However, the sentences and information extracted in this way are not sufficiently accurate. If the key information extracted from RCT articles is to be helpful to medical researchers, the extraction results of the extraction system need to be reliable and accurate. Finding a method that can extract high-quality key sentence information from RCT articles has therefore become an urgent problem.

Summary of the invention

The embodiments of the present application provide an information extraction method, device, computer equipment, and storage medium based on machine learning to improve the accuracy of RCT article information extraction.

In order to solve the above technical problems, an embodiment of the present application provides an information extraction method based on machine learning, including:

Obtain a preset classification identifier, and based on the classification identifier, perform a search in a search database to obtain an RCT article;

Extract the title, abstract and body of the RCT article;

Performing data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;

Use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information. The candidate set of coarse-grained key information is used as the initial candidate set;

According to preset filtering conditions, the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.

In order to solve the above technical problems, an embodiment of the present application further provides an information extraction device based on machine learning, including:

The article acquisition module is used to acquire a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article;

The content extraction module is used to extract the title, abstract and body of the RCT article;

A data preprocessing module, configured to perform data preprocessing on the main text to obtain processed text information, wherein the text information includes a text short sentence and a position corresponding to the text short sentence;

The information extraction module is used to take the title, the abstract, and the text information as fusion features, input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set;

The information determining module is configured to filter the initial candidate set according to preset filtering conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.

In order to solve the above technical problems, an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps of the machine learning-based information extraction method when executing the computer-readable instructions:

Extract the title, abstract and body of the RCT article;

In order to solve the above technical problems, embodiments of the present application also provide a computer-readable storage medium that stores computer-readable instructions which, when executed by a processor, implement the following steps of the machine learning-based information extraction method:

Extract the title, abstract and body of the RCT article;

The machine learning-based information extraction method, device, computer equipment, and storage medium provided in the embodiments of the application obtain a preset classification identifier, search the search database based on the classification identifier to obtain an RCT article, extract the title, abstract, and main text of the RCT article, and preprocess the main text to obtain the processed text information. The title, abstract, and text information are used as fusion features, and the fusion features and the RCT article are input into the preset BERT model for training to obtain a candidate set of coarse-grained key information, which is used as the initial candidate set. As a result, the extracted initial candidate set is strongly correlated with the title and abstract, which ensures the accuracy of the extracted content. The initial candidate set is then screened according to the preset filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. The initial candidate set can thus be screened as needed to obtain more accurate key information, which improves the accuracy of information extraction.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings that need to be used in the description of the embodiments of the present application. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

Figure 1 is an exemplary system architecture diagram to which the present application can be applied;

Fig. 2 is a flowchart of an embodiment of the information extraction method based on machine learning of the present application;

Fig. 3 is a schematic structural diagram of an embodiment of an information extraction device based on machine learning according to the present application;

Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.

Detailed description

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the application. The terms used in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the application. The terms "including" and "having" in the specification and claims of the application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusions. The terms "first", "second", etc. in the specification and claims of the present application or the above-mentioned drawings are used to distinguish different objects, rather than to describe a specific sequence.

The reference to "embodiments" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiments may be included in at least one embodiment of the present application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Please refer to FIG. 1. As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 provides a medium for communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

The user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.

The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.

The server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.

It should be noted that the method for extracting information based on machine learning provided by the embodiments of the present application is executed by a server, and accordingly, the device for extracting information based on machine learning is set in the server.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices, networks, and servers. The terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.

Please refer to FIG. 2. FIG. 2 shows a machine learning-based information extraction method provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description. The details are as follows:

S201: Obtain a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article.

Specifically, in different search databases, the classification identifiers of RCT articles are different. First, the classification identification preset in the search database is obtained, and then based on the classification identification, the search is performed in the search database to obtain the RCT article.

Among them, RCT (randomized controlled trial) articles are a type of medical article. To study the actual effect of a certain drug or intervention method, medical researchers formulate recruitment standards and recruit volunteers for experiments. Published RCT articles describe completed RCT experimental design methods, and the key information of the experimental design can be refined from them to provide convenience for later researchers. At present, the industry has no system that extracts summary sentences such as experimental standards, intervention methods, and key results from RCT medical literature with accuracy that meets doctors' requirements. If the key information extracted from RCT articles is to be helpful to medical researchers, the extraction results of the extraction system need to be reliable and accurate.

Among them, search databases refer to digital libraries, databases, academic libraries, etc. containing medical RCT articles.

Among them, the classification identifier refers to the identifier of the retrieval category corresponding to each category of document data in the search database. Document information of a given category can be found quickly through its classification identifier.
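The lookup in S201 can be sketched as follows. This is only an illustration under stated assumptions: the search database is modeled as an in-memory list, and the `Document` type, the database names `db_a` and `db_b`, and the identifier strings are all hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    category: str  # classification identifier assigned by the search database
    text: str


# Hypothetical presets: each search database labels RCT articles differently.
RCT_IDENTIFIERS = {"db_a": "RCT", "db_b": "randomized-controlled-trial"}


def fetch_rct_articles(db_name, records):
    """Return the records whose category equals the RCT identifier preset for db_name."""
    identifier = RCT_IDENTIFIERS[db_name]
    return [doc for doc in records if doc.category == identifier]


records = [
    Document("1", "RCT", "Title: ..."),
    Document("2", "review", "Title: ..."),
]
rct_articles = fetch_rct_articles("db_a", records)
```

A real digital library would expose its own query interface; the point here is only that the identifier is looked up per database before the search is run.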

S202: Extract the title, abstract and body of the RCT article.

Specifically, the medical RCT article is analyzed through a preset script file to obtain the title, abstract, and body of the medical RCT article.

Among them, the preset script file can be defined according to actual needs and is not limited here. The preset script types include but are not limited to shell scripts, JavaScript scripts, Lua scripts, and Python scripts. Preferably, this embodiment uses a Python script.

Among them, the way of parsing includes, but is not limited to: regular matching, format parsing, template matching, etc.
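As an illustration of the regular-matching option, the following Python sketch splits an article into title, abstract, and body. It assumes a hypothetical plain-text layout in which the three parts are introduced by the labels `Title:`, `Abstract:`, and `Body:`; the application does not specify the actual layout or script.

```python
import re

# Assumed (hypothetical) layout: "Title:", "Abstract:" and "Body:" labels in that order.
ARTICLE_PATTERN = re.compile(
    r"Title:\s*(?P<title>.+?)\s*Abstract:\s*(?P<abstract>.+?)\s*Body:\s*(?P<body>.+)",
    re.DOTALL,
)


def parse_article(raw):
    """Return a dict with the title, abstract, and body of one article."""
    match = ARTICLE_PATTERN.search(raw)
    if match is None:
        raise ValueError("article does not follow the expected layout")
    return {name: value.strip() for name, value in match.groupdict().items()}


raw_article = (
    "Title: Aspirin trial\n"
    "Abstract: We randomized 100 patients.\n"
    "Body: Patients were enrolled. Pain decreased."
)
parts = parse_article(raw_article)
```

Format parsing or template matching would replace the regular expression with a parser suited to the source format, while keeping the same three-part output.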

S203: Perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.

Specifically, the obtained text is subjected to data preprocessing, including text segmentation, punctuation removal, etc., to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.

Among them, the position corresponding to a text short sentence is obtained by numbering the text short sentences produced by data preprocessing in their order of appearance, which gives the position of each text short sentence relative to the others.
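A minimal sketch of this preprocessing step is shown below. It assumes that short sentences are delimited by common English punctuation and that positions are zero-based sequence numbers; both choices are illustrative assumptions rather than details given in the application.

```python
import re
import string


def preprocess(body):
    """Split the body into short sentences, strip punctuation, and number them in order."""
    # Assumption: sentence-ending and clause-level punctuation delimit short sentences.
    pieces = re.split(r"[.;!?,]", body)
    strip_table = str.maketrans("", "", string.punctuation)
    text_info = []
    for piece in pieces:
        cleaned = piece.translate(strip_table).strip()
        if cleaned:
            # The position is the sequence number of the short sentence.
            text_info.append({"position": len(text_info), "text": cleaned})
    return text_info


text_info = preprocess("Patients were enrolled. Pain decreased, and no harm was seen!")
```

Splitting on every period would also break decimal numbers such as dosage values, so a production version would need a more careful sentence splitter.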

S204: Use the title, abstract, and text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set.

Specifically, the title, abstract, and text information are used as fusion features, and then the fusion features and RCT articles are input into a preset language representation model for training, and a candidate set of coarse-grained key information in the RCT article is obtained as the initial candidate set.

Among them, the language representation model includes but is not limited to: the deep contextualized representation algorithm ELMo (Embeddings from Language Models), OpenAI GPT, and the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. Preferably, this embodiment uses the BERT model as the language representation model.

Among them, the goal of the BERT model is to train on a large-scale unlabeled corpus to obtain a representation of the text that contains rich semantic information, that is, a semantic representation of the text, then to fine-tune this semantic representation on a specific NLP task, and finally to apply it to that NLP task. In this embodiment, the word segments in the title are used as annotated key-vocabulary features, and the short sentences in the abstract are used as annotated key-short-sentence features. Based on these annotated features, the BERT model obtains from the main text the short sentences associated with them as the candidate set.

Among them, the process of fusing the title, abstract, and text information as the fusion feature can be referred to the description of the subsequent embodiments. To avoid repetition, it will not be repeated here.

Among them, coarse-grained key information refers to a collection of information containing key information, that is, the coarse-grained key information contains not only key information, but also other less important information, and therefore, further screening is required in the future.

S205: Perform screening processing on the initial candidate set according to the preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.

Specifically, RCT articles have fixed characteristics. By analyzing some RCT articles in advance, some general characteristics of the key information in RCT articles are obtained and used as the preset filter conditions. The initial candidate set is screened according to these filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.

Among them, the preset filter conditions in this embodiment can be determined according to actual conditions, but they all include the following features: (1) features contained in the key information sentences of RCT articles; (2) the interdependence that exists within the sentences of each type of key information to be extracted. Using these two features as a basis, the initial candidate set output by the BERT model is screened and non-key information sentences are excluded, thereby obtaining the target candidate set for each type of information to be extracted.
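The screening in S205 might look like the following sketch. The application does not list the actual filter conditions, so the cue patterns and category names below are purely hypothetical placeholders, and the sketch covers only feature (1), that is, surface features of key information sentences.

```python
import re

# Hypothetical cue patterns per information type; not taken from the application.
FILTER_RULES = {
    "inclusion_criteria": re.compile(r"\b(eligible|inclusion|aged)\b", re.IGNORECASE),
    "intervention": re.compile(r"\b(received|administered|placebo)\b", re.IGNORECASE),
    "outcome": re.compile(r"\b(primary outcome|significantly|reduced)\b", re.IGNORECASE),
}


def filter_candidates(initial_candidates):
    """Group candidate sentences by the first rule they match and drop the rest."""
    target = {name: [] for name in FILTER_RULES}
    for sentence in initial_candidates:
        for name, pattern in FILTER_RULES.items():
            if pattern.search(sentence):
                target[name].append(sentence)
                break  # each sentence is assigned to one type only
    return target


target_candidates = filter_candidates([
    "Patients aged 18 to 65 were eligible",
    "The control group received placebo",
    "We thank the nursing staff",
])
```

Sentences that match no rule, such as the acknowledgement in the example, are treated as non-key information and excluded.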

In this embodiment, a preset classification identifier is obtained, the search database is searched based on the classification identifier to obtain an RCT article, the title, abstract, and main text of the RCT article are extracted, and the main text is preprocessed to obtain the processed text information. The title, abstract, and text information are used as fusion features, and the fusion features and the RCT article are input into the preset BERT model for training to obtain a candidate set of coarse-grained key information, which is used as the initial candidate set. As a result, the extracted initial candidate set is strongly correlated with the title and abstract, which ensures the accuracy of the extracted content. The initial candidate set is then screened according to the preset filter conditions to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article. The initial candidate set can thus be screened as needed to obtain more accurate key information, which helps improve the accuracy of information extraction.

In one embodiment, after the key information of the RCT article is obtained, the key information of each RCT article is stored in a blockchain network node. Storing the data on the blockchain allows data information to be shared between different platforms and prevents the data from being tampered with.

Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
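The tamper-evidence property described above can be illustrated with a simple hash chain. This is not a blockchain network (there is no distribution or consensus), and the function names and block layout are assumptions made for the example; it only shows how linking blocks by cryptographic hashes makes later modification detectable.

```python
import hashlib
import json


def _digest(data, previous_hash):
    """SHA-256 over the block content and the hash of the previous block."""
    payload = json.dumps({"data": data, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(key_information, previous_hash):
    """Create a block that stores key information and links to the previous block."""
    return {
        "data": key_information,
        "prev": previous_hash,
        "hash": _digest(key_information, previous_hash),
    }


def verify_chain(chain):
    """Return True only if every block's hash and back-link are intact."""
    for index, block in enumerate(chain):
        if _digest(block["data"], block["prev"]) != block["hash"]:
            return False
        if index > 0 and block["prev"] != chain[index - 1]["hash"]:
            return False
    return True


chain = [make_block(["key sentence of article 1"], "0" * 64)]
chain.append(make_block(["key sentence of article 2"], chain[-1]["hash"]))
```

Changing the stored key information of any block after the fact alters its recomputed hash, so `verify_chain` returns `False`.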

In some optional implementation manners of this embodiment, in step S204, using the title, abstract, and text information as fusion features includes:

Perform word segmentation on the title to get the target word segmentation;

Extract short sentences from the abstract to get the abstract short sentences;

Mark the target word segmentation, the abstract short sentences, and the text information according to their source type, and use the marked target word segmentation, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.

Specifically, word segmentation is performed on the title using a preset word segmentation method to obtain the target word segmentation; short sentence extraction is then performed on the abstract to obtain the abstract short sentences; finally, the target word segmentation, the abstract short sentences, and the text information are each marked according to their source type to obtain the fusion features.

Further, the preset word segmentation methods include, but are not limited to: third-party word segmentation tools or word segmentation algorithms, etc.

Among them, common third-party word segmentation tools include but are not limited to: Stanford NLP word segmentation, ICTCLAS word segmentation system, ansj word segmentation tool and HanLP Chinese word segmentation tool, etc.

Among them, word segmentation algorithms include but are not limited to: the forward Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.

It is easy to understand that segmenting the title into words allows some meaningless words to be filtered out, which helps the subsequent steps narrow the scope of key information extraction based on these words.
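Of the algorithms listed, forward maximum matching is the simplest to sketch. The version below is a minimal illustration: the vocabulary and the `max_len` window are assumptions chosen for the example, and unknown characters fall back to single-character tokens.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: take the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a vocabulary entry is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary for a Chinese title fragment ("randomized controlled trial").
vocabulary = {"随机", "对照", "试验", "随机对照"}
tokens = forward_max_match("随机对照试验", vocabulary)
```

Reverse maximum matching applies the same idea from the end of the string, and bi-directional matching compares the two results.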

Further, the short sentence extraction of the abstract may specifically adopt the TextRank algorithm, or may adopt the method of natural language processing for semantic recognition.

Among them, the TextRank algorithm divides the text into constituent units (words or sentences), builds a graph model over them, ranks the important components of the text with a voting mechanism, and can extract key short sentences using only the information in the abstract itself.
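A compact sketch of sentence-level TextRank follows. It assumes whitespace tokenization, a word-overlap similarity normalized by log sentence lengths, and a damping factor of 0.85; these are common choices for the algorithm but are not specified in the application.

```python
import math


def sentence_similarity(a, b):
    """Word overlap between two sentences, normalized by their log lengths."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating the weighted voting update on the similarity graph."""
    n = len(sentences)
    weights = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


abstract_sentences = [
    "aspirin reduced pain in adult patients",
    "adult patients received aspirin daily",
    "the weather was sunny that week",
]
key_sentences = top_sentences(abstract_sentences, k=2)
```

In the example, the third sentence shares no words with the others, so it receives no votes and is ranked last.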

Among them, natural language processing (Natural Language Processing) is a method based on machine learning, especially statistical machine learning, to enable effective communication between humans and computers in natural language, generally applied to corpora and Markov models.

Further, in this embodiment, the target word segmentation, the abstract short sentences, and the text information are marked according to their source type. Specifically, an attribute may be added to each of them, and different identifiers are used to mark which type of source each comes from. For example, the identifier "FC" marks the source as the target word segmentation, the identifier "ZY" marks the source as an abstract short sentence, and the identifier "WB" marks the source as text information.
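The marking step can be sketched as follows, using the identifiers "FC", "ZY", and "WB" from the example above. The dictionary layout and the function name `build_fusion_features` are assumptions for illustration; the application only states that an attribute carrying the identifier is added.

```python
def build_fusion_features(title_tokens, abstract_sentences, text_info):
    """Attach a source-type identifier to each item and merge everything into one list."""
    features = [{"source": "FC", "text": token} for token in title_tokens]
    features += [{"source": "ZY", "text": sentence} for sentence in abstract_sentences]
    features += [
        {"source": "WB", "text": item["text"], "position": item["position"]}
        for item in text_info
    ]
    return features


fusion_features = build_fusion_features(
    ["aspirin", "trial"],
    ["We randomized 100 patients"],
    [{"position": 0, "text": "Patients were enrolled"}],
)
```

Keeping the position on the "WB" items preserves the ordering information produced in S203.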

In this embodiment, the title, abstract, and text information are processed and marked as fusion features, which is beneficial to the accuracy of subsequent recognition through the BERT model.

In some optional implementations of this embodiment, the preset BERT model includes an encoding layer and a Transformer layer. In step S204, inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as the initial candidate set, includes:

Input the fusion features and the RCT article into the preset BERT model, and encode the fusion features through the encoding layer of the preset BERT model to obtain the initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

Perform feature extraction on the second code and the third code through the preset Transformer layer of the BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;

Calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;

The text information corresponding to the features to be filtered is used as the initial candidate set.

Specifically, the fusion features and the RCT article are input into the preset BERT model, and the fusion features are encoded by the encoding layer of the preset BERT model to obtain the initial code, which includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information. Feature extraction is then performed on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code. Next, for each third feature, the similarity value between the third feature and the second feature is calculated; if the similarity value is less than the first preset threshold, the third feature is taken as a feature to be screened.

It should be noted that the preset BERT model is a pre-trained BERT model, and its training samples are derived from pre-selected and labeled data features from RCT articles.

Among them, the calculation method of similarity includes, but is not limited to: Manhattan Distance, Euclidean Distance, Cosine Similarity, Minkowski Distance, etc.

Among them, the Transformer layer is constructed using the Transformer framework, a classic natural language processing architecture proposed by a Google team. A Transformer can be stacked to considerable depth and uses the attention mechanism to achieve fast parallel computation. Compared with the usual convolutional or recurrent neural networks, the Transformer framework therefore trains quickly and achieves a high recognition rate.

Among them, the first preset threshold can be set according to actual conditions, for example, set to 0.6, which is not specifically limited here.
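The thresholding step can be sketched as below. Several assumptions are made for illustration: the features are plain lists of floats standing in for Transformer outputs, the similarity value is computed as the Manhattan distance (one of the measures listed above, where a smaller value means a closer match), and when there are several second features the smallest value is used. None of these choices is fixed by the application.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def select_features_to_screen(third_features, second_features, first_threshold=0.6):
    """Return indices of body features whose value against the abstract is below the threshold."""
    selected = []
    for index, third in enumerate(third_features):
        # Assumption: compare against the closest abstract feature.
        value = min(manhattan_distance(third, second) for second in second_features)
        if value < first_threshold:
            selected.append(index)
    return selected


second_features = [[0.1, 0.2], [0.9, 0.9]]             # stand-ins for abstract features
third_features = [[0.2, 0.2], [0.5, 0.5], [1.0, 0.8]]  # stand-ins for body features
to_screen = select_features_to_screen(third_features, second_features)
```

The text information at the selected indices then forms the initial candidate set.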

In this embodiment, the preset BERT model is used to encode the fusion features and extract features from them, and the set of text information associated with the abstract is then determined as the initial candidate set. This narrows the scope of key information extraction and helps improve the efficiency of key information extraction.

In some optional implementations of this embodiment, after the similarity value between the third feature and the second feature is calculated and the third feature whose similarity value with the second feature is less than the first preset threshold is used as the feature to be screened, the method further includes:

Calculate the Euclidean distance between the feature to be filtered and the first code;

Use the features to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated features to be screened;

The text information corresponding to the updated features to be screened is used as the initial candidate set.

Specifically, after the features to be screened are obtained, in order to better screen out important information, the first code corresponding to the title is used as a reference dimension, and the Euclidean distance between each feature to be screened and the first code is calculated. It is easy to understand that the smaller the distance, the more closely the text information corresponding to the feature to be screened is related to the title. The features to be screened are then filtered according to the second preset threshold: those whose Euclidean distance from the first code is less than or equal to the second preset threshold are retained as the updated features to be screened, and those whose Euclidean distance from the first code is greater than the second preset threshold are regarded as not closely related to the title and are eliminated.

Wherein, the second preset threshold can be set according to actual needs, for example, set to 8, which is not specifically limited here.

Among them, Euclidean Distance (Euclidean Distance), also known as Euclidean metric, is a commonly used distance definition, which refers to the true distance between two points in m-dimensional space, or the natural length of the vector (that is, the point The distance to the origin). In this embodiment, it specifically refers to the distance between the space vector corresponding to the feature to be screened and the space vector corresponding to the first code.
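The title-based refinement can be sketched as follows, using the example threshold of 8 mentioned above. The vectors are short lists of floats used as stand-ins for the first code and the features to be screened, and the function names are assumptions for the example.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def refine_by_title(features_to_screen, first_code, second_threshold=8.0):
    """Keep the features whose distance to the title code is at most the threshold."""
    return [
        feature for feature in features_to_screen
        if euclidean_distance(feature, first_code) <= second_threshold
    ]


first_code = [0.0, 0.0]  # stand-in for the title encoding
features_to_screen = [[3.0, 4.0], [6.0, 8.0], [8.0, 0.0]]
updated_features = refine_by_title(features_to_screen, first_code)
```

In the example, the distances are 5, 10, and 8, so the second feature is eliminated and the other two are retained.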

In this embodiment, the Euclidean distance between the first code and the feature to be screened is used to select the features to be screened that are more closely related to the title (smaller Euclidean distance) as the updated features to be screened, which helps improve the accuracy of the initial candidate set.
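The distance-based screening described above might be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `screen_by_title_distance`, the representation of each feature as a (text, vector) pair, and the use of plain Python lists for vectors are all hypothetical; the default threshold of 8 simply reuses the example value mentioned above.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: each feature to be screened is a
# (text, vector) pair; the first code (title) is a single vector.
Feature = Tuple[str, Sequence[float]]


def screen_by_title_distance(
    first_code: Sequence[float],
    features_to_screen: List[Feature],
    second_threshold: float = 8.0,  # example value from the text, not a fixed choice
) -> List[Feature]:
    """Keep features whose Euclidean distance to the first code is <= threshold."""
    updated = []
    for text, vector in features_to_screen:
        # math.dist computes the Euclidean distance (Python 3.8+).
        if math.dist(vector, first_code) <= second_threshold:
            updated.append((text, vector))
    return updated


if __name__ == "__main__":
    title_vec = [0.0, 0.0, 0.0]
    candidates = [
        ("sentence A", [3.0, 4.0, 0.0]),   # distance 5, retained
        ("sentence B", [6.0, 8.0, 0.0]),   # distance 10, eliminated
        ("sentence C", [0.0, 0.0, 8.0]),   # distance 8, retained (boundary)
    ]
    for text, _ in screen_by_title_distance(title_vec, candidates):
        print(text)
```

The text information of the retained pairs would then serve as the initial candidate set; how the vectors are actually produced by the BERT model is outside the scope of this sketch.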

In some optional implementation manners of this embodiment, after step S205, the machine learning-based information extraction method further includes:

Sentence reconstruction is performed on the key information of the RCT article, and the updated key information is obtained.

Specifically, the key information obtained may be derived from multiple paragraphs of the RCT article; that is, the extraction result may have poor readability because the extracted sentences are not contiguous in the full text. In this case, the extracted key information needs to be reconstructed to obtain updated key information with clear meaning and good readability, which enhances the reliability of the extracted key information.

In this embodiment, sentence reconstruction refers to the use of preset grammatical rules to check and correct the sentence pattern, and to supplement the missing parts of the sentence pattern according to the semantics to achieve the completeness of the sentence.

The preset grammatical rules can be chosen according to the actual language: the grammar of that language is selected and a corresponding rule script is formulated.

Supplementing the missing parts according to the semantics may specifically involve first performing semantic recognition and then adding the corresponding keywords for the parts missing from the sentence pattern, so that the sentence is complete. The semantic recognition can adopt natural language processing. For the specific process, refer to the description of the foregoing embodiment. To avoid repetition, details are not described herein again.

In this embodiment, sentence reconstruction is performed on the key information of the RCT article to avoid problems such as grammatical incompatibility and semantic disconnection in the key information, so that the expression of the updated key information is more accurate.
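As a rough illustration of what a rule script for sentence reconstruction could look like, the sketch below restores document order and applies two trivial surface rules. It is an assumption-laden toy, not the method of this application: the function name `reconstruct_sentences`, the (position, text) input format, and the two rules (capitalize the first letter, ensure terminal punctuation) are hypothetical, and the semantic completion step described above is not modeled.

```python
from typing import List, Tuple


def reconstruct_sentences(fragments: List[Tuple[int, str]]) -> str:
    """Join extracted fragments into readable text.

    fragments: (position in the full text, extracted short sentence) pairs.
    Hypothetical rules applied here:
      1. restore the original order using the position,
      2. collapse repeated whitespace,
      3. capitalize the first character,
      4. append a period if terminal punctuation is missing.
    """
    repaired = []
    for _, text in sorted(fragments, key=lambda item: item[0]):
        sentence = " ".join(text.split())
        if not sentence:
            continue  # skip empty fragments
        sentence = sentence[0].upper() + sentence[1:]
        if sentence[-1] not in ".!?":
            sentence += "."
        repaired.append(sentence)
    return " ".join(repaired)


if __name__ == "__main__":
    key_information = [
        (12, "blood pressure fell by 8 mmHg"),
        (3, "patients were  randomised to two arms."),
    ]
    print(reconstruct_sentences(key_information))
```

A real rule script would depend on the grammar of the target language and on the semantic recognition step, both of which are left open by the description.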

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

FIG. 3 shows a schematic block diagram of a machine learning-based information extraction device in one-to-one correspondence with the machine learning-based information extraction method of the above embodiment. As shown in FIG. 3, the machine learning-based information extraction device includes an article acquisition module 31, a content extraction module 32, a data preprocessing module 33, an information extraction module 34, and an information determination module 35. The detailed description of each functional module is as follows:

The article obtaining module 31 is used to obtain a preset classification mark, and based on the classification mark, perform a search in the search database to obtain an RCT article;

The content extraction module 32 is used to extract the title, abstract and body of the RCT article;

The data preprocessing module 33 is configured to perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;

The information extraction module 34 is used to take the title, the abstract, and the text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set;

The information determining module 35 is configured to filter the initial candidate set according to preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.

Optionally, the information extraction module 34 includes:

The word segmentation processing unit is used to perform word segmentation processing on the title to obtain the target word segmentation;

The short sentence extraction unit is used to extract short sentences from the abstract to obtain the abstract short sentences;

The information marking unit is used to mark the target word segmentation, the abstract short sentences, and the text information separately according to source type, and to use the marked target word segmentation, the marked abstract short sentences, and the marked text information as the fusion features input to the BERT model.
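The three units above could be sketched roughly as follows. This is a simplified illustration only: the function name `build_fusion_features`, the source-type labels "title", "abstract", and "body", whitespace splitting as the word segmentation step, and punctuation-based splitting as the short sentence extraction step are all assumptions made for the example; a real implementation would use a proper tokenizer for the language of the article.

```python
import re
from typing import List, Tuple


def build_fusion_features(
    title: str,
    abstract: str,
    text_information: List[Tuple[int, str]],
) -> List[Tuple[str, str]]:
    """Return (source_type, text) pairs marking where each piece came from.

    text_information: (position, short sentence) pairs produced by the
    data preprocessing step on the main text.
    """
    features: List[Tuple[str, str]] = []

    # Word segmentation of the title (assumed: split on whitespace).
    for token in title.split():
        features.append(("title", token))

    # Short sentence extraction from the abstract (assumed: split on punctuation).
    for clause in re.split(r"[.;,!?]\s*", abstract):
        clause = clause.strip()
        if clause:
            features.append(("abstract", clause))

    # Text information from the main body is marked with its own source type.
    for _position, sentence in text_information:
        features.append(("body", sentence))

    return features


if __name__ == "__main__":
    for source, text in build_fusion_features(
        "Aspirin trial",
        "We enrolled 100 patients. Pain decreased.",
        [(0, "Patients were randomised")],
    ):
        print(source, "|", text)
```

Keeping the source type alongside each piece of text is what allows the later encoding step to distinguish the first, second, and third codes.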

Optionally, the information extraction module 34 further includes:

The coding unit is used to input the fusion features and the RCT article into the preset BERT model, and to encode the fusion features through the coding layer of the preset BERT model to obtain the initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

The feature extraction unit is configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;

The similarity calculation unit is configured to calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;

The candidate set determining unit is used to use the text information corresponding to the feature to be screened as the initial candidate set.
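The similarity calculation and candidate set determination could be sketched as below. Several points are assumptions rather than statements of the original method: the description does not name the similarity measure, so cosine similarity is used purely as an example; the abstract is represented by a single pooled vector; and the names `cosine_similarity` and `select_features_to_screen` are hypothetical. The comparison direction (similarity value less than the first preset threshold) follows the wording of the text.

```python
import math
from typing import List, Sequence, Tuple

Feature = Tuple[str, Sequence[float]]  # (text information, third feature vector)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_features_to_screen(
    second_feature: Sequence[float],
    third_features: List[Feature],
    first_threshold: float,
) -> List[Feature]:
    """Keep third features whose similarity to the second feature is below the threshold."""
    return [
        (text, vector)
        for text, vector in third_features
        if cosine_similarity(vector, second_feature) < first_threshold
    ]


if __name__ == "__main__":
    abstract_vec = [1.0, 0.0]
    body = [("s1", [1.0, 0.0]), ("s2", [0.0, 1.0]), ("s3", [1.0, 1.0])]
    for text, _ in select_features_to_screen(abstract_vec, body, 0.8):
        print(text)
```

The text information of the selected pairs would then be used as the initial candidate set.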

Optionally, the machine learning-based information extraction device further includes:

The distance calculation module is used to calculate the Euclidean distance between the feature to be screened and the first code;

The feature screening module is configured to use the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;

The candidate set acquisition module is used to use the text information corresponding to the updated feature to be screened as the initial candidate set.

Optionally, the device for extracting information based on machine learning further includes:

The sentence reconstruction module is used to reconstruct the key information of the RCT article to obtain the updated key information.

The storage module is used to store the key information of the RCT article in the blockchain network node.

For the specific limitations of the machine learning-based information extraction device, refer to the above limitations of the machine learning-based information extraction method, which will not be repeated here. Each module in the above machine learning-based information extraction device can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In order to solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 4 for details. FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.

The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43; however, it should be understood that it is not required to implement all of the components shown, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, etc.

The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.

The memory 41 includes at least one type of readable storage medium, the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or D interface display memory, etc.), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, etc. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as program codes for controlling electronic files. In addition, the memory 41 can also be used to temporarily store various types of data that have been output or will be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run program codes or process data stored in the memory 41, for example, run program codes for controlling electronic files.

The network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.

This application also provides another implementation manner, namely a computer-readable storage medium that stores an interface display program, where the interface display program can be executed by at least one processor to cause the at least one processor to execute the steps of the machine learning-based information extraction method described above.

Through the description of the above implementation manners, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present application.

Obviously, the embodiments described above are only a part of the embodiments of the present application, rather than all of the embodiments. The drawings show preferred embodiments of the present application, but do not limit the patent scope of the present application. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the present application is understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in each of the foregoing specific embodiments, or equivalently replace some of the technical features. All equivalent structures made by using the contents of the description and drawings of this application, whether used directly or indirectly in other related technical fields, are likewise within the scope of patent protection of this application.

Claims

An information extraction method based on machine learning, applied to key information extraction of RCT articles, characterized in that, the information extraction method based on machine learning includes:

Obtain a preset classification identifier, and based on the classification identifier, perform a search in a search database to obtain an RCT article;

Extract the title, abstract and body of the RCT article;

Performing data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;

Use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information. The candidate set of coarse-grained key information is used as the initial candidate set;

According to preset filtering conditions, the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
The method for extracting information based on machine learning according to claim 1, wherein said using said title, said abstract and said text information as fusion features comprises:

Perform word segmentation processing on the title to obtain the target word segmentation;

Short sentence extraction is performed on the abstract to obtain abstract short sentences;

Separately mark the target word segmentation, the abstract short sentences and the text information according to the source type, and use the marked target word segmentation, the marked abstract short sentences and the marked text information as the fusion features input to the BERT model.
The method for extracting information based on machine learning according to claim 1, wherein the preset BERT model includes an encoding layer and a Transformer layer, and said inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set, includes:

The fusion features and the RCT article are input into the preset BERT model, and the fusion features are encoded through the encoding layer of the preset BERT model to obtain an initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

Performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;

Calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than a first preset threshold as the feature to be screened;

The text information corresponding to the features to be screened is used as the initial candidate set.
The method for extracting information based on machine learning according to claim 3, characterized in that, after the similarity value between the third feature and the second feature is calculated and the third feature whose similarity value with the second feature is less than the first preset threshold is used as the feature to be screened, the method further includes:

Calculating the Euclidean distance between the feature to be screened and the first code;

Taking the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;

The text information corresponding to the updated feature to be screened is used as the initial candidate set.
The method for extracting information based on machine learning according to claim 1, wherein after the initial candidate set is filtered according to preset filtering conditions to obtain a target candidate set and the text information corresponding to the target candidate set is used as the key information of the RCT article, the machine learning-based information extraction method further includes:

Sentence reconstruction is performed on the key information of the RCT article to obtain updated key information.
The method for extracting information based on machine learning according to claim 1, wherein after the initial candidate set is filtered according to preset filtering conditions to obtain a target candidate set and the text information corresponding to the target candidate set is used as the key information of the RCT article, the method further includes:

The key information of the RCT article is stored in the blockchain network node.
A machine learning-based information extraction device applied to key information extraction of RCT articles, characterized in that the machine learning-based information extraction device includes:

The article acquisition module is used to acquire a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article;

The content extraction module is used to extract the title, abstract and body of the RCT article;

A data preprocessing module, configured to perform data preprocessing on the main text to obtain processed text information, wherein the text information includes a text short sentence and a position corresponding to the text short sentence;

The information extraction module is used to take the title, the abstract, and the text information as fusion features, input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set;

The information determining module is configured to filter the initial candidate set according to preset filtering conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
The information extraction device based on machine learning according to claim 7, wherein the information extraction module comprises:

The coding unit is used to input the fusion features and the RCT article into a preset BERT model, and encode the fusion features through the coding layer of the preset BERT model to obtain an initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

The feature extraction unit is configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;

The similarity calculation unit is configured to calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;

The candidate set determining unit is configured to use the text information corresponding to the feature to be screened as an initial candidate set.
The device for extracting information based on machine learning according to claim 7, wherein the machine learning-based information extraction device further comprises:

The distance calculation module is used to calculate the Euclidean distance between the feature to be screened and the first code;

The feature screening module is configured to use the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;

The candidate set acquisition module is used to use the text information corresponding to the updated feature to be screened as the initial candidate set.
The machine learning-based information extraction device according to claim 7, wherein the machine learning-based information extraction device further comprises:

The sentence reconstruction module is used to reconstruct the key information of the RCT article to obtain the updated key information.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps of the machine learning-based information extraction method:

Obtain a preset classification identifier, and based on the classification identifier, perform a search in a search database to obtain an RCT article;

Extract the title, abstract and body of the RCT article;

Performing data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;

Use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information. The candidate set of coarse-grained key information is used as the initial candidate set;

According to preset filtering conditions, the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
The computer device according to claim 11, wherein said using said title, said abstract and said text information as a fusion feature comprises:

Perform word segmentation processing on the title to obtain the target word segmentation;

Short sentence extraction is performed on the abstract to obtain abstract short sentences;

Separately mark the target word segmentation, the abstract short sentences and the text information according to the source type, and use the marked target word segmentation, the marked abstract short sentences and the marked text information as the fusion features input to the BERT model.
The computer device according to claim 11, wherein the preset BERT model includes an encoding layer and a Transformer layer, and said inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set, includes:

The fusion features and the RCT article are input into the preset BERT model, and the fusion features are encoded through the encoding layer of the preset BERT model to obtain an initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

Performing feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain a second feature corresponding to the second code and a third feature corresponding to the third code;

Calculate a similarity value between the third feature and the second feature, and use a third feature whose similarity value with the second feature is less than a first preset threshold as the feature to be screened;

The text information corresponding to the features to be screened is used as the initial candidate set.
The computer device according to claim 13, wherein after the similarity value between the third feature and the second feature is calculated and the third feature whose similarity value with the second feature is less than the first preset threshold is used as the feature to be screened, the processor, when executing the computer-readable instructions, further implements the following steps of the machine learning-based information extraction method:

Calculating the Euclidean distance between the feature to be screened and the first code;

Taking the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;

The text information corresponding to the updated feature to be screened is used as the initial candidate set.
The computer device according to claim 11, wherein after the initial candidate set is filtered according to preset filtering conditions to obtain a target candidate set and the text information corresponding to the target candidate set is used as the key information of the RCT article, the processor, when executing the computer-readable instructions, further implements the following step of the machine learning-based information extraction method:

Sentence reconstruction is performed on the key information of the RCT article to obtain updated key information.
A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions are executed by a processor to implement the following information extraction method based on machine learning:

Obtain a preset classification identifier, and based on the classification identifier, perform a search in a search database to obtain an RCT article;

Extract the title, abstract and body of the RCT article;

Performing data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;

Use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a candidate set of coarse-grained key information. The candidate set of coarse-grained key information is used as the initial candidate set;

According to preset filtering conditions, the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
The computer-readable storage medium according to claim 16, wherein said using said title, said abstract and said text information as a fusion feature comprises:

Perform word segmentation processing on the title to obtain the target word segmentation;

Short sentence extraction is performed on the abstract to obtain abstract short sentences;

Separately mark the target word segmentation, the abstract short sentences and the text information according to the source type, and use the marked target word segmentation, the marked abstract short sentences and the marked text information as the fusion features input to the BERT model.
The computer-readable storage medium of claim 16, wherein the preset BERT model includes an encoding layer and a Transformer layer, and said inputting the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and using the candidate set of coarse-grained key information as an initial candidate set, includes:

The fusion features and the RCT article are input into the preset BERT model, and the fusion features are encoded through the encoding layer of the preset BERT model to obtain an initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;

Perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;

Calculate a similarity value between the third feature and the second feature, and use a third feature whose similarity value with the second feature is less than a first preset threshold as the feature to be screened;

The text information corresponding to the features to be screened is used as the initial candidate set.
The computer-readable storage medium of claim 18, wherein after the similarity value between the third feature and the second feature is calculated and the third feature whose similarity value with the second feature is less than the first preset threshold is used as the feature to be screened, the computer-readable instructions, when executed by the processor, further implement the following steps of the machine learning-based information extraction method:

Calculating the Euclidean distance between the feature to be screened and the first code;

Taking the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;

The text information corresponding to the updated feature to be screened is used as the initial candidate set.
The computer-readable storage medium according to claim 16, wherein after the initial candidate set is filtered according to preset filtering conditions to obtain a target candidate set and the text information corresponding to the target candidate set is used as the key information of the RCT article, the computer-readable instructions, when executed by the processor, further implement the following step of the machine learning-based information extraction method:

Sentence reconstruction is performed on the key information of the RCT article to obtain updated key information.