WO2021135469A1 - Method, apparatus, computer device and medium for information extraction based on machine learning - Google Patents

Method, apparatus, computer device and medium for information extraction based on machine learning Download PDF

Info

Publication number
WO2021135469A1
WO2021135469A1 PCT/CN2020/118951 CN2020118951W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
candidate set
information
rct
article
Prior art date
Application number
PCT/CN2020/118951
Other languages
English (en)
Chinese (zh)
Inventor
黎旭东
丁佳佳
林桂
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021135469A1 publication Critical patent/WO2021135469A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an information extraction method, device, computer equipment and medium based on machine learning.
  • the RCT literature is highly targeted. At present, there are many completed RCT experimental design methods and data in the RCT literature.
  • the key information of the experimental design can be refined from these RCT articles to provide convenience for later researchers. At present, experimental standards, intervention methods, and key results are mainly extracted from the RCT medical literature through simple keyword or classification search.
  • however, this method of extraction yields sentences and extracted information of insufficient accuracy, and the results are biased. If the extracted key information of an RCT article is to be helpful to medical researchers, the extraction results of the extraction system need to be reliable and accurate. For this reason, finding a method that can extract high-quality key sentence information from RCT articles has become a problem that urgently needs to be solved.
  • the embodiments of the present application provide an information extraction method, device, computer equipment, and storage medium based on machine learning to improve the accuracy of RCT article information extraction.
  • an embodiment of the present application provides an information extraction method based on machine learning, including:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • an embodiment of the present application further provides an information extraction device based on machine learning, including:
  • the article acquisition module is used to acquire a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article;
  • the content extraction module is used to extract the title, abstract and body of the RCT article
  • a data preprocessing module configured to perform data preprocessing on the main text to obtain processed text information, wherein the text information includes a text short sentence and a position corresponding to the text short sentence;
  • the information extraction module is used to use the title, the abstract, and the text information as fusion features, and input the fusion features and the RCT article into a preset BERT model for training to obtain a coarse-grained key information candidate set, taking the coarse-grained key information candidate set as the initial candidate set;
  • the information determining module is configured to filter the initial candidate set according to preset filtering conditions to obtain a target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein when the processor executes the computer-readable instructions, the steps of the following information extraction method based on machine learning are implemented:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the following information extraction method based on machine learning are implemented:
  • the candidate set of coarse-grained key information is used as the initial candidate set;
  • the initial candidate set is filtered to obtain a target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • the machine learning-based information extraction method, device, computer equipment, and storage medium provided in the embodiments of the application obtain a preset classification identifier, search the search database based on the classification identifier to obtain an RCT article, extract the title, abstract and main text of the RCT article, and perform data preprocessing on the main text to obtain the processed text information.
  • the title, abstract and text information are used as fusion features, and the fusion features and the RCT article are input into the preset BERT model for training to obtain a coarse-grained key information candidate set, which is used as the initial candidate set, so that the extracted initial candidate set has a strong correlation with the title and abstract, ensuring the accuracy of the extracted content. Then, according to the preset filter conditions, the initial candidate set is screened to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article, so that the initial candidate set can be screened as needed to obtain more accurate key information and improve the accuracy of information extraction.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of the information extraction method based on machine learning of the present application
  • Fig. 3 is a schematic structural diagram of an embodiment of an information extraction device based on machine learning according to the present application
  • Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smart phones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, etc.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for extracting information based on machine learning provided by the embodiments of the present application is executed by a server, and accordingly, the device for extracting information based on machine learning is set in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices, networks, and servers.
  • the terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.
  • FIG. 2 shows a machine learning-based information extraction method provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description. The details are as follows:
  • S201 Obtain a preset classification mark, and based on the classification mark, perform a search in a search database to obtain an RCT article.
  • the classification identifiers of RCT articles are different.
  • the classification identification preset in the search database is obtained, and then based on the classification identification, the search is performed in the search database to obtain the RCT article.
  • RCT refers to randomized controlled clinical trials, and RCT research literature reports such clinical trials.
  • since the published RCT articles contain many completed RCT experimental design methods and data, the key information of the experimental design can be refined from these articles to provide convenience for later researchers.
  • however, there is currently no system in the industry for extracting summary sentences such as experimental standards, intervention methods, and key results from the RCT medical literature, and the accuracy of existing approaches does not meet doctors' requirements. If the extracted key information of RCT articles is to be helpful to medical researchers, the extraction results of the extraction system need to be reliable and accurate.
  • search databases refer to digital libraries, databases, academic libraries, etc. containing medical RCT articles.
  • the classification identifier refers to the identifier of the retrieval category corresponding to each category of document data in the search database; through the classification identifier, the document information of a certain category can be quickly found.
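  • The embodiments do not name a specific search database or the concrete form of the classification identifier. A minimal Python sketch, assuming PubMed as the search database and its "Randomized Controlled Trial[pt]" publication-type filter as a stand-in for the classification identifier, of how RCT articles could be retrieved:

        import requests

        ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
        EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

        def search_rct_articles(topic, retmax=20):
            """Return PubMed IDs of RCT articles about `topic`."""
            params = {
                "db": "pubmed",
                # The publication-type tag plays the role of the classification identifier.
                "term": f"{topic} AND Randomized Controlled Trial[pt]",
                "retmode": "json",
                "retmax": retmax,
            }
            resp = requests.get(ESEARCH, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()["esearchresult"]["idlist"]

        def fetch_articles_xml(pmids):
            """Fetch the article records (XML) for the given PubMed IDs."""
            params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
            resp = requests.get(EFETCH, params=params, timeout=30)
            resp.raise_for_status()
            return resp.text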
  • the medical RCT article is analyzed through a preset script file to obtain the title, abstract, and body of the medical RCT article.
  • the preset script file can be defined according to actual needs, and there is no limitation here.
  • the preset script types include but are not limited to: shell script, JavaScript script, Lua script, python script, etc.
  • in this embodiment, a python script is used.
  • the way of parsing includes, but is not limited to: regular matching, format parsing, template matching, etc.
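  • A minimal regular-matching sketch of such a preset python script, assuming (purely for illustration) that each article has already been exported to plain text with explicit "Title:", "Abstract:" and "Body:" markers; a real script would target the actual export format of the search database:

        import re

        SECTION_RE = re.compile(
            r"Title:\s*(?P<title>.+?)\s*"
            r"Abstract:\s*(?P<abstract>.+?)\s*"
            r"Body:\s*(?P<body>.+)",
            re.DOTALL,
        )

        def parse_rct_article(raw_text):
            """Split a raw RCT article into title, abstract and body by regular matching."""
            match = SECTION_RE.search(raw_text)
            if match is None:
                raise ValueError("article does not match the expected template")
            return match.group("title"), match.group("abstract"), match.group("body")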
  • S203 Perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.
  • the obtained text is subjected to data preprocessing, including text segmentation, punctuation removal, etc., to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences.
  • the position corresponding to a text short sentence is obtained by numbering the text short sentences obtained after data preprocessing in sequential order, which gives the position of each text short sentence relative to the other text short sentences.
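  • A minimal sketch of this preprocessing step (sentence splitting, punctuation removal and position numbering); the exact splitting rules are an assumption, since the embodiment does not fix them:

        import re

        def preprocess_body(body_text):
            """Split the body into short sentences, strip punctuation and record positions."""
            raw_sentences = re.split(r"[.!?。！？;；]+", body_text)
            text_info = []
            position = 0
            for raw in raw_sentences:
                # Remove remaining punctuation and redundant whitespace.
                cleaned = re.sub(r"[^\w\s]", " ", raw)
                cleaned = re.sub(r"\s+", " ", cleaned).strip()
                if not cleaned:
                    continue
                # Each text short sentence keeps its position relative to the others.
                text_info.append({"position": position, "sentence": cleaned})
                position += 1
            return text_info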
  • S204 Use the title, abstract, and text information as fusion features, and input the fusion features and the RCT article into the preset BERT model for training to obtain a candidate set of coarse-grained key information, and use the candidate set of coarse-grained key information as the initial candidate set.
  • the title, abstract, and text information are used as fusion features, and then the fusion features and RCT articles are input into a preset language representation model for training, and a candidate set of coarse-grained key information in the RCT article is obtained as the initial candidate set.
  • the language representation model includes but is not limited to: the deep semantic representation (Embedding from Language Model, ELMo) algorithm, OpenAI GPT, and the pre-trained bidirectional encoder representation (Bidirectional Encoder Representations from Transformers, BERT) model.
  • the BERT model is used as the language representation model.
  • the goal of the BERT model is to use large-scale unlabeled corpus training to obtain a representation of the text that contains rich semantic information, that is, the semantic representation of the text, then fine-tune the semantic representation of the text in a specific NLP task, and finally apply it to that NLP task.
  • the word segmentation in the title is used as the annotated key vocabulary feature, and the short sentences in the abstract are used as the annotated key short sentence features; the BERT model is then used to obtain, from the body text, the short sentences associated with these annotation features as the candidate set.
  • coarse-grained key information refers to a collection of information containing key information, that is, the coarse-grained key information contains not only key information, but also other less important information, and therefore, further screening is required in the future.
  • S205 Perform screening processing on the initial candidate set according to the preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • RCT articles have their fixed characteristics.
  • some general characteristics of the key information in RCT articles are obtained and used as preset filter conditions; according to these filter conditions, the initial candidate set is screened to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article.
  • the preset filter conditions in this embodiment can be determined according to actual conditions, but all include the following features: (1) the features contained in the key information sentences of RCT articles; (2) the interdependence that exists between the sentences of each type of key information to be extracted. Using these two kinds of features as a basis, the initial candidate set output by the BERT model is screened and non-key-information sentences are excluded, so as to obtain the target candidate set for each type of information to be extracted.
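  • The embodiment describes these two kinds of filter conditions only abstractly. The sketch below uses hypothetical keyword patterns as stand-ins for the "features contained in key information sentences" and a simple positional check as a stand-in for the interdependence between sentence types; it reuses the {"position", "sentence"} items produced by the preprocessing sketch above:

        import re

        # Hypothetical surface patterns, one per type of information to be extracted.
        TYPE_PATTERNS = {
            "criteria": re.compile(r"\b(inclusion|exclusion|eligib\w*)\b", re.I),
            "intervention": re.compile(r"\b(randomi[sz]ed|placebo|dose|mg)\b", re.I),
            "outcome": re.compile(r"\b(primary outcome|endpoint|hazard ratio|confidence interval)\b", re.I),
        }

        def screen_candidates(initial_candidates):
            """Apply the preset filter conditions to the initial candidate set."""
            target = {info_type: [] for info_type in TYPE_PATTERNS}
            for item in initial_candidates:          # item: {"position": int, "sentence": str}
                for info_type, pattern in TYPE_PATTERNS.items():
                    if pattern.search(item["sentence"]):
                        target[info_type].append(item)
            # Interdependence between sentence types: here, assume an outcome sentence
            # is expected to appear after the intervention sentences.
            if target["intervention"] and target["outcome"]:
                last_intervention = max(i["position"] for i in target["intervention"])
                target["outcome"] = [i for i in target["outcome"] if i["position"] > last_intervention]
            return target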
  • the RCT article is obtained, the title, abstract, and main text of the RCT article are extracted, and the main text is preprocessed to obtain the processed data.
  • the initial candidate set is screened to obtain the target candidate set, and the text information corresponding to the target candidate set is used as the key information of the RCT article; the initial candidate set can be screened as needed to obtain more accurate key information, which is beneficial to improving the accuracy of information extraction.
  • the key information of each RCT article is stored in the blockchain network node, and the data information is shared between different platforms through the blockchain storage. Can prevent data from being tampered with.
  • Blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain, essentially a decentralized database, is a series of data blocks associated using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • step S204 using the title, abstract, and text information as fusion features includes:
  • the target word segmentation, summary sentence and text information are marked according to the source type, and the marked target word segmentation, the marked summary sentence and the marked text information are used as the fusion features input into the BERT model.
  • specifically, word segmentation is performed on the title using a preset word segmentation method to obtain the target word segmentation, short sentence extraction is then performed on the abstract to obtain the abstract short sentences, and the target word segmentation, the abstract short sentences and the text information are respectively marked according to their source type to obtain the fusion features.
  • preset word segmentation methods include, but are not limited to: third-party word segmentation tools or word segmentation algorithms, etc.
  • common third-party word segmentation tools include but are not limited to: Stanford NLP word segmentation, ICTCLAS word segmentation system, ansj word segmentation tool and HanLP Chinese word segmentation tool, etc.
  • word segmentation algorithms include but are not limited to: the Forward Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Matching (BM) algorithm, the Hidden Markov Model (HMM), N-gram models, etc.
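  • Of the algorithms listed above, forward maximum matching is the simplest to illustrate; a minimal sketch with a toy user dictionary (illustrative only):

        def forward_maximum_matching(text, dictionary, max_word_len=5):
            """Segment `text` greedily, always taking the longest dictionary word first."""
            words = []
            i = 0
            while i < len(text):
                matched = None
                # Try the longest candidate first, shrinking the window on failure.
                for length in range(min(max_word_len, len(text) - i), 0, -1):
                    candidate = text[i:i + length]
                    if candidate in dictionary or length == 1:
                        matched = candidate
                        break
                words.append(matched)
                i += len(matched)
            return words

        vocab = {"随机", "对照", "试验"}
        print(forward_maximum_matching("随机对照试验", vocab))   # ['随机', '对照', '试验']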
  • the short sentence extraction of the abstract may specifically adopt the TextRank algorithm, or may adopt the method of natural language processing for semantic recognition.
  • the TextRank algorithm divides the text into several constituent units (words or sentences) and builds a graph model, uses a voting mechanism to rank the important components of the text, and uses only the information of the abstract itself to achieve key short sentence extraction.
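  • A minimal TextRank sketch over the abstract, using word overlap as the sentence similarity and PageRank from networkx for the voting mechanism; the embodiment does not prescribe these particular choices:

        import itertools
        import re
        import networkx as nx

        def textrank_short_sentences(abstract, top_k=3):
            """Rank abstract sentences by TextRank and return the top_k as key short sentences."""
            sentences = [s.strip() for s in re.split(r"[.!?。！？]+", abstract) if s.strip()]
            token_sets = [set(s.lower().split()) for s in sentences]

            graph = nx.Graph()
            graph.add_nodes_from(range(len(sentences)))
            for i, j in itertools.combinations(range(len(sentences)), 2):
                # Edge weight = word overlap between the two sentences.
                overlap = len(token_sets[i] & token_sets[j])
                if overlap:
                    graph.add_edge(i, j, weight=overlap)

            scores = nx.pagerank(graph, weight="weight")
            ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
            return [sentences[i] for i in sorted(ranked)]   # keep document order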
  • Natural Language Processing is a method based on machine learning, especially statistical machine learning, to enable effective communication between humans and computers in natural language, generally applied to corpora and Markov models.
  • the target word segmentation, summary sentence, and text information are marked according to the source type.
  • an attribute may be added to the target word segmentation, summary sentence, and text information, and different identifiers are used to mark which type of source each item comes from; for example, the identifier "FC" is used to identify the source as the target word segmentation, the identifier "ZY" is used to identify the source as a summary sentence, and the identifier "WB" is used to identify the source as text information.
  • in this way, the title, abstract, and text information are processed and marked as fusion features, which is beneficial to improving the accuracy of subsequent recognition by the BERT model.
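  • One possible representation of the marked fusion features, reusing the text_info items from the preprocessing sketch above; the identifiers follow the "FC"/"ZY"/"WB" example of this embodiment, while the data structure itself is an assumption:

        def build_fusion_features(target_words, abstract_sentences, text_info):
            """Mark each item with an identifier for its source before feeding the BERT model."""
            fusion = []
            fusion += [{"source": "FC", "content": w} for w in target_words]          # title word segments
            fusion += [{"source": "ZY", "content": s} for s in abstract_sentences]    # abstract short sentences
            fusion += [{"source": "WB", "content": item["sentence"],
                        "position": item["position"]} for item in text_info]          # body text information
            return fusion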
  • the preset BERT model includes an encoding layer and a Transformer layer.
  • inputting the fusion features and the RCT article into the preset BERT model for training to obtain the coarse-grained key information candidate set, and taking the coarse-grained key information candidate set as the initial candidate set, includes:
  • the initial code includes a first code corresponding to the title, a second code corresponding to the abstract, and a third code corresponding to the text information;
  • the text information corresponding to the features to be screened is used as the initial candidate set.
  • the fusion feature and RCT article are input into the preset BERT model, and the fusion feature is encoded through the coding layer of the preset BERT model to obtain the initial code.
  • the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information. Feature extraction is performed on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code. Then, for each third feature, the similarity between the third feature and the second feature is calculated; if the similarity is less than the first preset threshold, the third feature corresponding to that similarity is taken as a feature to be screened.
  • the preset BERT model is a pre-trained BERT model, and its training samples are derived from pre-selected and labeled data features from RCT articles.
  • the calculation method of similarity includes, but is not limited to: Manhattan Distance, Euclidean Distance, Cosine Similarity, Minkowski Distance, etc.
  • the Transformer layer is constructed through the Transformer framework.
  • the Transformer framework is a classic natural language processing framework proposed by the Google team.
  • the Transformer can be stacked to a very deep depth and uses the attention mechanism to achieve fast parallel computation; therefore, compared with the usual convolutional neural networks or recurrent neural networks, the Transformer framework has the characteristics of fast training speed and high recognition rate.
  • the first preset threshold can be set according to actual conditions, for example, set to 0.6, which is not specifically limited here.
  • in this way, the fusion features are encoded and subjected to feature extraction, and the set of text information associated with the abstract is determined as the initial candidate set, which narrows the scope of key information extraction and is beneficial to improving the efficiency of key information extraction.
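  • A sketch of this encode-compare-threshold flow using a public BERT checkpoint from the transformers library and cosine similarity; the preset BERT model of the embodiment is a separately trained model, so this only illustrates the mechanics, and the "< threshold" comparison follows the embodiment's wording:

        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        encoder = AutoModel.from_pretrained("bert-base-uncased")

        def encode(texts):
            """Return one [CLS] vector per input text as its feature."""
            inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                outputs = encoder(**inputs)
            return outputs.last_hidden_state[:, 0]              # (n, hidden_size)

        def first_threshold_screening(abstract, body_sentences, threshold=0.6):
            """Keep body sentences whose similarity to the abstract feature is below the threshold."""
            second_feature = encode([abstract])                  # second code / feature (abstract)
            third_features = encode(body_sentences)              # third codes / features (body text)
            sims = torch.nn.functional.cosine_similarity(third_features, second_feature)
            # Per the embodiment, a third feature whose similarity is less than the first
            # preset threshold becomes a "feature to be screened" (initial candidate).
            return [s for s, sim in zip(body_sentences, sims.tolist()) if sim < threshold]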
  • in an embodiment, after calculating the similarity value between the third feature and the second feature and taking the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened, the method further includes:
  • the text information corresponding to the updated features to be screened is used as the initial candidate set.
  • the first code corresponding to the title is used as the reference dimension to calculate the Euclidean distance between each feature to be screened and the first code. It is easy to understand that the smaller the distance, the more closely the text information corresponding to the feature to be screened is related to the title. The features to be screened are then filtered according to the preset second threshold: a feature to be screened whose Euclidean distance from the first code is less than or equal to the second preset threshold is retained as an updated feature to be screened, while a feature to be screened whose Euclidean distance from the first code is greater than the second preset threshold is regarded as not closely related to the title and is eliminated.
  • the second preset threshold can be set according to actual needs, for example, set to 8, which is not specifically limited here.
  • Euclidean distance, also known as the Euclidean metric, is a commonly used distance definition, referring to the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In this embodiment, it specifically refers to the distance between the space vector corresponding to the feature to be screened and the space vector corresponding to the first code.
  • in this way, the Euclidean distance between the first code and each feature to be screened is used to select the features to be screened that are more closely related to the title (smaller Euclidean distance) as the updated features to be screened, which is beneficial to improving the accuracy of the initial candidate set.
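  • A sketch of this second screening step, reusing the encode helper from the previous sketch; the threshold value 8 is only the example given above:

        import torch

        def second_threshold_screening(title, candidate_sentences, threshold=8.0):
            """Keep candidates whose Euclidean distance to the title code is at most the threshold."""
            first_code = encode([title])                         # reference dimension: the title
            candidate_features = encode(candidate_sentences)
            distances = torch.cdist(candidate_features, first_code).squeeze(1)
            # A candidate farther from the first code than the second preset threshold is
            # regarded as weakly related to the title and eliminated.
            return [s for s, d in zip(candidate_sentences, distances.tolist()) if d <= threshold]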
  • the machine learning-based information extraction method further includes:
  • Sentence reconstruction is performed on the key information of the RCT article, and the updated key information is obtained.
  • the key information obtained may be derived from multiple paragraphs of the RCT article, that is, the extraction result may have poor readability because the position of the sentence in the full text is not continuous. At this time, it is necessary to reconstruct the extracted key information to obtain updated key information with clear sentence meaning and strong readability, and to enhance the reliability of extracting key information.
  • sentence reconstruction refers to the use of preset grammatical rules to check and correct the sentence pattern, and to supplement the missing parts of the sentence pattern according to the semantics to achieve the completeness of the sentence.
  • the preset grammar rules can be selected according to the actual language, and the corresponding grammar can be selected to formulate the corresponding rule script.
  • sentence reconstruction is performed on the key information of the RCT article to avoid problems such as grammatical incompatibility and semantic disconnection in the key information, so that the expression of the updated key information is more accurate.
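  • The preset grammatical rules are language-specific and not spelled out in the embodiment; as a much-simplified, assumed stand-in, the sketch below merges verb-less fragments into the preceding sentence and normalises capitalisation and punctuation:

        import re

        # Very rough fragment test: a sentence without any of these verb forms is treated
        # as incomplete (an assumed rule, not the embodiment's actual grammar rules).
        VERB_HINT = re.compile(r"\b(is|are|was|were|has|have|had|showed|shows|reduced|increased)\b", re.I)

        def reconstruct(key_sentences):
            """Merge fragments into their neighbours and normalise sentence punctuation."""
            rebuilt = []
            for sentence in key_sentences:
                sentence = sentence.strip().rstrip(".")
                if not sentence:
                    continue
                if rebuilt and not VERB_HINT.search(sentence):
                    # Supplement the missing part by attaching the fragment to the previous sentence.
                    rebuilt[-1] = rebuilt[-1] + ", " + sentence[0].lower() + sentence[1:]
                else:
                    rebuilt.append(sentence[0].upper() + sentence[1:])
            return [s + "." for s in rebuilt]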
  • Fig. 3 shows a principle block diagram of a machine learning-based information extraction device corresponding one-to-one to the above-mentioned embodiment of the machine learning-based information extraction method.
  • the information extraction device based on machine learning includes an article acquisition module 31, a content extraction module 32, a data preprocessing module 33, an information extraction module 34, and an information determination module 35.
  • the detailed description of each functional module is as follows:
  • the article obtaining module 31 is used to obtain a preset classification mark, and based on the classification mark, perform a search in the search database to obtain an RCT article;
  • the content extraction module 32 is used to extract the title, abstract and body of the RCT article
  • the data preprocessing module 33 is configured to perform data preprocessing on the main text to obtain processed text information, where the text information includes text short sentences and positions corresponding to the text short sentences;
  • the information extraction module 34 is used to use the title, abstract, and text information as fusion features, input the fusion features and the RCT article into the preset BERT model for training to obtain a coarse-grained key information candidate set, and use the coarse-grained key information candidate set as the initial candidate set;
  • the information determining module 35 is configured to filter the initial candidate set according to preset filtering conditions to obtain the target candidate set, and use the text information corresponding to the target candidate set as the key information of the RCT article.
  • the information extraction module 34 includes:
  • the word segmentation processing unit is used to perform word segmentation processing on the title to obtain the target word segmentation
  • the short sentence extraction unit is used to extract short sentences from the abstract to obtain the abstract short sentences;
  • the information marking unit is used to respectively mark the target word segmentation, summary sentence and text information according to the source type, and to use the marked target word segmentation, the marked summary sentence and the marked text information as the fusion features input into the BERT model.
  • the information extraction module 34 further includes:
  • the coding unit is used to input the fusion features and the RCT article into the preset BERT model, and encode the fusion features through the coding layer of the preset BERT model to obtain the initial code, where the initial code includes the first code corresponding to the title, the second code corresponding to the abstract, and the third code corresponding to the text information;
  • the feature extraction unit is configured to perform feature extraction on the second code and the third code through the Transformer layer of the preset BERT model to obtain the second feature corresponding to the second code and the third feature corresponding to the third code;
  • the similarity calculation unit is configured to calculate the similarity value between the third feature and the second feature, and use the third feature whose similarity value with the second feature is less than the first preset threshold as the feature to be screened;
  • the candidate set determining unit is used to use the text information corresponding to the feature to be screened as the initial candidate set.
  • the machine learning-based information extraction device further includes:
  • the distance calculation module is used to calculate the Euclidean distance between the feature to be screened and the first code
  • the feature screening module is configured to use the feature to be screened whose Euclidean distance is less than or equal to the second preset threshold as the updated feature to be screened;
  • the candidate set acquisition module is used to use the text information corresponding to the updated features to be screened as the initial candidate set.
  • the device for extracting information based on machine learning further includes:
  • the sentence reconstruction module is used to reconstruct the key information of the RCT article to obtain the updated key information.
  • the device for extracting information based on machine learning further includes:
  • the storage module is used to store the key information of the RCT article in the blockchain network node.
  • Each module in the above-mentioned machine learning-based information extraction device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43; however, it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions.
  • its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded equipment, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 41 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as program codes for controlling electronic files.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run program codes or process data stored in the memory 41, for example, run program codes for controlling electronic files.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • this application also provides another implementation manner, that is, a computer-readable storage medium that stores an interface display program, where the interface display program can be executed by at least one processor to cause the at least one processor to execute the steps of the information extraction method based on machine learning as described above.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method, apparatus, computer device and medium for information extraction based on machine learning, in the field of artificial intelligence, the method comprising the steps of: extracting the title, abstract and main text of an RCT article (S202); performing data preprocessing on the main text to obtain processed text information; taking the title, abstract and text information as fusion features, and inputting the fusion features and the RCT article into a preset BERT model for training; obtaining a coarse-grained key information candidate set, and using the coarse-grained key information candidate set as the initial candidate set (S204); according to preset filter conditions, screening the initial candidate set to obtain a target candidate set, and taking the text information corresponding to the target candidate set as the key information of the RCT article (S205). The method also relates to blockchain technology: the obtained key information of the RCT article is stored in a blockchain network. The method improves the accuracy of information extraction.
PCT/CN2020/118951 2020-06-17 2020-09-29 Method, apparatus, computer device and medium for information extraction based on machine learning WO2021135469A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010554248.8 2020-06-17
CN202010554248.8A CN111814465A (zh) 2020-06-17 Information extraction method and apparatus based on machine learning, computer device, and medium

Publications (1)

Publication Number Publication Date
WO2021135469A1 true WO2021135469A1 (fr) 2021-07-08

Family

ID=72845811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118951 WO2021135469A1 (fr) 2020-06-17 2020-09-29 Method, apparatus, computer device and medium for information extraction based on machine learning

Country Status (2)

Country Link
CN (1) CN111814465A (fr)
WO (1) WO2021135469A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282528A (zh) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 一种关键词提取方法、装置、设备及存储介质
CN115879450A (zh) * 2023-01-06 2023-03-31 广东爱因智能科技有限公司 一种逐步文本生成方法、系统、计算机设备及存储介质
CN116501861A (zh) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 基于层级bert模型与标签迁移的长文本摘要生成方法
CN117093717A (zh) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 一种相似文本聚合方法、装置、设备及其存储介质
CN117875268A (zh) * 2024-03-13 2024-04-12 山东科技大学 一种基于分句编码的抽取式文本摘要生成方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347753B (zh) * 2020-11-12 2022-05-27 山西大学 一种应用于阅读机器人的摘要生成方法及系统
CN112800465A (zh) * 2021-02-09 2021-05-14 第四范式(北京)技术有限公司 待标注文本数据的处理方法、装置、电子设备及介质
CN113378024B (zh) * 2021-05-24 2023-09-01 哈尔滨工业大学 一种基于深度学习面向公检法领域的相关事件识别方法
CN113626582B (zh) * 2021-07-08 2023-07-28 中国人民解放军战略支援部队信息工程大学 基于内容选择和融合的两阶段摘要生成方法及系统
CN114510560A (zh) * 2022-01-27 2022-05-17 福建博思软件股份有限公司 一种基于深度学习的商品关键信息抽取方法及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294320A (zh) * 2016-08-04 2017-01-04 武汉数为科技有限公司 一种面向学术论文的术语抽取方法及系统
CN106570191A (zh) * 2016-11-11 2017-04-19 浙江大学 基于维基百科的中英文跨语言实体匹配方法
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN110413994A (zh) * 2019-06-28 2019-11-05 宁波深擎信息科技有限公司 热点话题生成方法、装置、计算机设备和存储介质
CN110427482A (zh) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 一种目标内容的抽取方法及相关设备
CN110598213A (zh) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 一种关键词提取方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN106294320A (zh) * 2016-08-04 2017-01-04 武汉数为科技有限公司 一种面向学术论文的术语抽取方法及系统
CN106570191A (zh) * 2016-11-11 2017-04-19 浙江大学 基于维基百科的中英文跨语言实体匹配方法
CN110413994A (zh) * 2019-06-28 2019-11-05 宁波深擎信息科技有限公司 热点话题生成方法、装置、计算机设备和存储介质
CN110427482A (zh) * 2019-07-31 2019-11-08 腾讯科技(深圳)有限公司 一种目标内容的抽取方法及相关设备
CN110598213A (zh) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 一种关键词提取方法、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI Yuan-bing, ZHOU Jun, WEI Zhong, "TextRank-based Chinese Automatic Summarization Method", Communications Technology, vol. 52, no. 9, 10 September 2019 (2019-09-10), pp. 2233-2239, XP055826776, ISSN: 1002-0802, DOI: 10.3969/j.issn.1002-0802.2019.09.029 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282528A (zh) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 一种关键词提取方法、装置、设备及存储介质
CN115879450A (zh) * 2023-01-06 2023-03-31 广东爱因智能科技有限公司 一种逐步文本生成方法、系统、计算机设备及存储介质
CN115879450B (zh) * 2023-01-06 2023-09-01 广东爱因智能科技有限公司 一种逐步文本生成方法、系统、计算机设备及存储介质
CN116501861A (zh) * 2023-06-25 2023-07-28 知呱呱(天津)大数据技术有限公司 基于层级bert模型与标签迁移的长文本摘要生成方法
CN116501861B (zh) * 2023-06-25 2023-09-22 知呱呱(天津)大数据技术有限公司 基于层级bert模型与标签迁移的长文本摘要生成方法
CN117093717A (zh) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 一种相似文本聚合方法、装置、设备及其存储介质
CN117093717B (zh) * 2023-10-20 2024-01-30 湖南财信数字科技有限公司 一种相似文本聚合方法、装置、设备及其存储介质
CN117875268A (zh) * 2024-03-13 2024-04-12 山东科技大学 一种基于分句编码的抽取式文本摘要生成方法
CN117875268B (zh) * 2024-03-13 2024-05-31 山东科技大学 一种基于分句编码的抽取式文本摘要生成方法

Also Published As

Publication number Publication date
CN111814465A (zh) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2021135469A1 (fr) Procédé, appareil, dispositif informatique et support d'extraction d'informations basée sur l'apprentissage automatique
CN112101041B (zh) 基于语义相似度的实体关系抽取方法、装置、设备及介质
CN111241237B (zh) 一种基于运维业务的智能问答数据处理方法及装置
CN108804423B (zh) 医疗文本特征提取与自动匹配方法和系统
US11361002B2 (en) Method and apparatus for recognizing entity word, and storage medium
CN110276023B (zh) Poi变迁事件发现方法、装置、计算设备和介质
CN110532381B (zh) 一种文本向量获取方法、装置、计算机设备及存储介质
CN112818093B (zh) 基于语义匹配的证据文档检索方法、系统及存储介质
CN112328761B (zh) 一种意图标签设置方法、装置、计算机设备及存储介质
CN112287069B (zh) 基于语音语义的信息检索方法、装置及计算机设备
CN113434636B (zh) 基于语义的近似文本搜索方法、装置、计算机设备及介质
CN111783471B (zh) 自然语言的语义识别方法、装置、设备及存储介质
CN112215008A (zh) 基于语义理解的实体识别方法、装置、计算机设备和介质
CN112860919B (zh) 基于生成模型的数据标注方法、装置、设备及存储介质
CN110852106A (zh) 基于人工智能的命名实体处理方法、装置及电子设备
CN113051356A (zh) 开放关系抽取方法、装置、电子设备及存储介质
CN111353311A (zh) 一种命名实体识别方法、装置、计算机设备及存储介质
CN113657105A (zh) 基于词汇增强的医学实体抽取方法、装置、设备及介质
CN115983271A (zh) 命名实体的识别方法和命名实体识别模型的训练方法
CN115438149A (zh) 一种端到端模型训练方法、装置、计算机设备及存储介质
CN112084779A (zh) 用于语义识别的实体获取方法、装置、设备及存储介质
CN114220505A (zh) 病历数据的信息抽取方法、终端设备及可读存储介质
WO2022073341A1 (fr) Procédé et appareil de mise en correspondance d'entités de maladie fondés sur la sémantique vocale, et dispositif informatique
CN112417875B (zh) 配置信息的更新方法、装置、计算机设备及介质
CN116542246A (zh) 基于关键词质检文本的方法、装置和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910788

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20910788

Country of ref document: EP

Kind code of ref document: A1