WO2021174864A1 - Information extraction method and device based on a small number of training samples - Google Patents

Information extraction method and device based on a small number of training samples

Info

Publication number
WO2021174864A1
Authority
WO
WIPO (PCT)
Prior art keywords
extracted
training
model
lsi
text
Prior art date
Application number
PCT/CN2020/121886
Other languages
English (en)
French (fr)
Inventor
谭莹
黄麟越
许开河
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174864A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Definitions

  • This application relates to the technical field of data processing, and in particular to an information extraction method and device based on a small number of training samples.
  • Information extraction structures the information contained in unstructured text and outputs information points in a fixed format, thereby helping users classify, extract, and reconstruct massive amounts of content.
  • Information extraction tags usually include entities, relationships, and events, such as extracting times, locations, and key figures. Information extraction is significant: because it can pull the information frames and content that users care about out of large amounts of text, it can serve information retrieval and information integration, and it has rich application scenarios in sentiment analysis and text mining.
  • the inventor realized that the current approach is to obtain a general text extraction model, then obtain a small number of training samples, train on the training sample data in the general text extraction model to obtain the training standard fields that the model extracts from the training samples, and then adjust the parameters of the general text extraction model according to the training standard fields and the target standard fields until the convergence conditions are met, thereby obtaining the target text extraction model.
  • finally, the text to be extracted is input into the target text extraction model, which is used to obtain the target text information from the text to be extracted.
  • with this approach, because the training samples are few, the target text information may be inconsistent with the training label fields, so effective target text information of the text to be extracted cannot be obtained.
  • in view of this, the present application provides an information extraction method and device based on a small number of training samples, whose main purpose is to solve the prior-art problem that effective target text information of the text to be extracted cannot be obtained.
  • an information extraction method based on a small number of training samples includes: obtaining training samples, the training samples being texts labeled with key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted.
  • an information extraction device based on a small number of training samples includes: an acquisition module for acquiring training samples, the training samples being labeled texts of key information to be extracted; an extraction module for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to generate a text prediction model; and an extraction module for extracting, according to the text prediction model, the extraction information of the text to be extracted.
  • a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the following steps: obtaining training samples, the training samples being labeled texts of key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted.
  • a computer device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is used to store at least one executable instruction, the executable instruction causing the processor to perform the following steps: obtain training samples, the training samples being labeled texts of key information to be extracted; extract, according to the BERT language model, the sample feature vector of each sentence in the training samples; train an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extract, according to the text prediction model, the extraction information of the text to be extracted.
  • the embodiments of the application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
  • Fig. 1 shows a flow chart of a method for extracting information based on a small number of training samples provided by an embodiment of the present application.
  • Fig. 2 shows a flowchart of another method for extracting information based on a small number of training samples provided by an embodiment of the present application.
  • Fig. 3 shows a block diagram of an information extraction device based on a small number of training samples provided by an embodiment of the present application.
  • Fig. 4 shows a block diagram of another information extraction device based on a small number of training samples provided by an embodiment of the present application.
  • Fig. 5 shows a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the technical solution of the present application can be applied to the fields of artificial intelligence, blockchain and/or big data technology, and can achieve effective extraction of information through predictive analysis.
  • the data involved in this application such as training samples, can be stored in a database, or can be stored in a blockchain, which is not limited in this application.
  • the embodiments of the application need only a small amount of annotation to quickly train the required text prediction model, so they can be applied to multiple document types, such as contract texts, resumes, and insurance documents.
  • the embodiment of the present application provides an information extraction method based on a small number of training samples. As shown in FIG. 1, the method includes the following steps.
  • Training samples are texts that have been labeled with key information to be extracted. In the embodiments of the present application, even a small number of training samples can achieve effective extraction of similar information from the texts to be extracted. For example, to extract "rent-free period information" in batches, a "rent-free period" label is set. If the rent-free period in a training sample runs from January 1, 2018 to June 1, 2018, then "January 1, 2018 to June 1, 2018" is marked with the "rent-free period" label; this marked span is the key information to be extracted. The training samples include multiple documents, such as 30 documents marked with the "rent-free period" label.
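  • As an illustration, one labeled training sample of this kind could be represented as follows; the field names and the record structure are hypothetical, since the patent does not prescribe a storage format for annotations.

```python
# Hypothetical record for one labeled training document; the field
# names are illustrative only, not prescribed by the patent.
labeled_sample = {
    "doc_id": "lease_contract_001",
    "text": "...The rent-free period is January 1, 2018 to June 1, 2018...",
    "annotations": [
        {
            "label": "rent-free period",           # the user-defined tag
            "span": "January 1, 2018 to June 1, 2018",
            "tag_type": "phrase",                  # phrase tag vs. paragraph tag
        }
    ],
}
```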
  • Users can annotate the initial texts through an online annotation tool to generate training samples. Annotating through an online tool allows the annotation content to be updated and improved online at any time, meeting individualized needs and flexible extraction requirements, and ensuring that the extracted information suits the information extraction needs of many document types.
  • the training samples and the labeled key information to be extracted together serve as the basis of model training.
  • multiple labels can be set according to actual needs, such as multiple labels such as Party A, Party B, lease time, lease address, and rent-free period.
  • the number of labels is not limited in the embodiment of the present application.
  • the BERT language model is pre-trained on a large-scale corpus, which compensates for the small number of training samples.
  • the BERT language model can be used as a text semantic feature extractor to learn the vector representation of Chinese words.
  • the training corpus in the BERT language model includes a series of natural language texts such as Chinese wikis, news texts, and novels.
  • according to the BERT language model, the sample feature vector extracted for each sentence in the training samples is a vector representation of that sentence, representing its word-level, sentence-level, and context-bearing mapping results.
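  • A minimal sketch of how such per-sentence feature vectors might be obtained with the HuggingFace transformers library; the checkpoint name (bert-base-chinese) and the mean pooling over token states are assumptions of this sketch, not choices stated in the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed public checkpoint; the patent only requires a BERT model
# pre-trained on large-scale Chinese corpora (wiki, news, novels).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def sentence_feature_vector(sentence: str) -> torch.Tensor:
    """Return one feature vector per sentence (mean-pooled token states)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # shape (768,)

sample_vectors = [sentence_feature_vector(s) for s in
                  ["甲方为张三。", "免租期为2018年1月1日到2018年6月1日。"]]
```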
  • during training, the training samples and their corresponding sample feature vectors are input into the initial model, and the training sentence predicted by the initial model is compared with the sentence containing the key information to be extracted. If the two are the same, the training of the initial model is complete; if they differ, the model parameters of the initial model need to be changed and training continued. When training ends, the initial model and its model parameters together constitute the text prediction model.
  • the extraction information corresponds to the sample feature vector of the key information to be extracted in the training samples. If the sample feature vector of the information to be extracted corresponds to the "rent-free period", then the extraction information is the text related to the "rent-free period" in the text to be extracted.
  • This application provides an information extraction method based on a small number of training samples: first obtain training samples, then extract the sample feature vector of each sentence in the training samples according to the BERT language model, then train the text prediction model according to the training samples, the key information to be extracted, and the sample feature vectors, and finally extract the extraction information of the text to be extracted according to the text prediction model.
  • the embodiments of the present application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
  • the embodiment of the present application provides another information extraction method based on a small number of training samples. As shown in FIG. 2, the method includes the following steps.
  • Training samples are texts that have been labeled with key information to be extracted. In the embodiments of the present application, using a small number of training samples can also achieve effective extraction of similar information in the texts to be extracted.
  • the training samples and the labeled key information to be extracted together serve as the basis of model training.
  • the tag types of the key information to be extracted include phrase tags and paragraph tags. For example, for a certain type of lease contract text, users can set multiple labels as needed, such as Party A, Party B, lease time, lease address, and rent-free period.
  • phrase tags annotate shorter information, such as Party A and Party B; paragraph tags annotate longer information, such as breach-of-contract clauses.
  • the BERT language model is pre-trained on a large-scale corpus, which compensates for the small number of training samples.
  • the BERT language model can be used as a text semantic feature extractor to learn the vector representation of Chinese words.
  • Pre-trained in advance on a large-scale corpus and transferred to actual sentences, it serves as a text semantic feature extractor and breaks through the current technical bottleneck of requiring large training samples.
  • the training corpus in the BERT language model includes a series of natural language texts such as Chinese wikis, news texts, and novels.
  • according to the BERT language model, the sample feature vector extracted for each sentence in the training samples is a vector representation of that sentence, representing its word-level, sentence-level, and context-bearing mapping results.
  • the word-level, sentence-level, and context-bearing mapping results refer to three data features covered by the vector representation; three vector components identify these sentence features within the same vector representation.
  • for example, if the sample text is "After completing the first transaction, Party A Zhang San and Party B Li Si sign an agreement in Shanghai" and the user marks "Zhang San, Li Si", then the feature vector may be "[0, Party A, Party B]", where 0 indicates that the marked text is word-level text, Party A is the context mapping result of the "Zhang San" mark, i.e. the superordinate feature of that marked text, and Party B is the context mapping result of the "Li Si" mark, i.e. the superordinate feature of that marked text.
  • the initial model includes a latent semantic index LSI initial model and a conditional random field CRF initial model.
  • Text prediction models include LSI prediction models and CRF prediction models.
  • the tag types of the key information to be extracted include phrase tags and paragraph tags. There is a one-to-one correspondence between the two different text prediction models and the label types of the key information to be extracted.
  • on this basis, training the initial model specifically includes: judging the tag type of the key information to be extracted; if the annotation tag is a phrase tag, determining that the text prediction model is the LSI model; if the annotation tag is a paragraph tag, determining that the text prediction model is the CRF model; and, according to the training samples, the key information to be extracted, and the sample feature vectors, training the LSI initial model to obtain the LSI prediction model and/or training the CRF initial model to obtain the CRF prediction model.
  • training the LSI initial model to obtain the LSI prediction model according to the training samples, the key information to be extracted, and the sample feature vectors includes: using the LSI initial model to calculate feature similarity, where feature similarity is the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted; finding the training sentence in the training samples with the highest feature similarity; if that training sentence contains the key information to be extracted, ending the training of the LSI initial model to obtain the LSI prediction model; and if it does not, updating the LSI parameters and recalculating the feature similarity.
  • the LSI initial model is used to learn the semantics of the vocabulary of the key information to be extracted, so as to extract, from the text to be extracted, vocabulary semantically related to the key information to be extracted.
  • the basic idea of the LSI initial model is that the words in a text are not isolated; certain latent semantic relationships exist among them. Through statistical analysis of the training samples, the latent semantic relationships are mined automatically and expressed as a model the computer can understand, and synonymy and polysemy can likewise be learned while mining the semantic relationships. In training the LSI initial model, LSI parameters such as the minimum error of the low-rank approximation and the number of topics need to be set.
  • if the training sentence does not contain the key information to be extracted, the LSI parameters are updated according to preset rules, and the feature similarity is recalculated based on the updated LSI parameters.
  • the preset rules for LSI parameter updates cover two update trends, increasing or decreasing the minimum error and the number of topics by a fixed step. Each update changes one LSI parameter along one trend, and the feature similarity is then recalculated with the updated LSI parameters. If the feature similarity increases, that update trend favors convergence of the LSI initial model's training, and if the LSI parameter needs updating again, it is updated again along the same trend.
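  • One way to realize the described train-and-update loop, assuming the LSI step is a truncated-SVD (latent semantic indexing) projection of the BERT sentence vectors and that only the number of topics is searched with a fixed step; the convergence check is supplied by the caller, and the step size and round limit are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def train_lsi(sentence_vecs, key_idx, contains_key_info,
              n_topics=2, step=1, max_rounds=20):
    """sentence_vecs: (n_sentences, dim) BERT sentence features.
    key_idx: index of the sentence holding the labeled key information.
    contains_key_info: callable(sentence_index) -> bool, supplied by the caller."""
    X = np.asarray(sentence_vecs)
    svd = None
    for _ in range(max_rounds):
        svd = TruncatedSVD(n_components=n_topics).fit(X)
        Z = svd.transform(X)                          # low-rank projection
        sims = cosine_similarity(Z[key_idx:key_idx + 1], Z).ravel()
        sims[key_idx] = -1.0                          # skip the labeled sentence itself
        best = int(np.argmax(sims))                   # most similar training sentence
        if contains_key_info(best):
            break                                     # end training: LSI prediction model
        n_topics += step                              # update an LSI parameter, recompute
    return svd
```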
  • in the embodiments of the present application, the sample feature vectors obtained through the BERT language model are used to overcome the problem of having few training samples.
  • training the CRF initial model to obtain the CRF prediction model according to the training samples, the key information to be extracted, and the sample feature vectors includes: splicing the sample feature vectors corresponding to each clause in the training samples; and, taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input, training the CRF initial model to obtain the CRF prediction model.
  • training the CRF initial model means training the model parameters of the CRF initial model to obtain the CRF prediction model.
  • each clause in the training samples is sequence-labeled, and the sample feature vectors of the clauses are distinguished by the sequence labels during training. After the sample feature vectors corresponding to the clauses are spliced, the splicing result also carries the sequence labels. A CRF toolkit can be downloaded in the programming environment to train the CRF initial model.
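  • A sketch of this step with the sklearn-crfsuite toolkit, one publicly available CRF toolkit; the per-clause feature dictionaries (numeric features standing in for spliced vector components) and the BIO-style sequence labels are assumptions, since the patent only states that spliced, sequence-labeled sample feature vectors are the input.

```python
import sklearn_crfsuite

# Each training document is a sequence of clauses; each clause contributes
# a feature dict and a sequence label in a BIO-style scheme.
X_train = [
    [{"dim0": 0.12, "dim1": -0.40}, {"dim0": 0.88, "dim1": 0.05}],
]
y_train = [
    ["O", "B-BREACH_CLAUSE"],   # a paragraph tag, e.g. a breach-of-contract clause
]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",          # fits the CRF initial model's parameters
    c1=0.1, c2=0.1,             # L1/L2 regularization (illustrative values)
    max_iterations=100,
)
crf.fit(X_train, y_train)
predicted = crf.predict(X_train)  # sequence labels for each clause
```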
  • if the tag types include both the phrase tag and the paragraph tag, the LSI initial model and the CRF initial model are trained in parallel according to the training samples, the key information to be extracted, and the sample feature vectors. Following the specific methods for training the LSI initial model and the CRF initial model, both trainings are started at the same time. This greatly reduces the magnitude of the model parameters to be trained and ensures that 90% accuracy can be reached on labeled data on the order of 10 to 20 documents, achieving training that needs few samples, has high accuracy, and is fast. An example of launching the two trainings side by side follows.
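  • A sketch of the parallel launch with Python's standard concurrent.futures, reusing the hypothetical names from the sketches above.

```python
from concurrent.futures import ThreadPoolExecutor

# train_lsi, sample_vectors, crf, X_train, y_train reuse the hypothetical
# names from the earlier sketches; contains_key_info is the caller's check.
with ThreadPoolExecutor(max_workers=2) as pool:
    lsi_future = pool.submit(train_lsi, sample_vectors, 0, contains_key_info)
    crf_future = pool.submit(crf.fit, X_train, y_train)
    lsi_model = lsi_future.result()   # LSI prediction model (phrase tags)
    crf_model = crf_future.result()   # CRF prediction model (paragraph tags)
```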
  • since the text prediction model includes the LSI prediction model and the CRF prediction model, both models are used to extract the extraction information of the text to be extracted, which specifically includes: using the LSI prediction model to extract the LSI information of the text to be extracted; using the CRF prediction model to extract the CRF information of the text to be extracted; and merging the LSI information and the CRF information to generate the extraction information.
  • Different algorithm models are adopted according to the labeling type, which can ensure the highest accuracy and facilitate the user's label management.
  • the extracted information is displayed in tabular form, which is intuitive and clear for users to view.
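  • A sketch of the merge-and-display step; pandas is one convenient way to render the merged extraction information as a table, and the field values shown are placeholders.

```python
import pandas as pd

lsi_info = {"Party A": "Zhang San", "Party B": "Li Si"}      # phrase-tag results
crf_info = {"breach clause": "...full paragraph text..."}    # paragraph-tag results

extraction_info = {**lsi_info, **crf_info}   # merge LSI and CRF information
table = pd.DataFrame([extraction_info])      # one row per extracted document
print(table.to_string(index=False))
```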
  • This application provides an information extraction method based on a small number of training samples: first obtain training samples, then extract the sample feature vector of each sentence in the training samples according to the BERT language model, then train the text prediction model according to the training samples, the key information to be extracted, and the sample feature vectors, and finally extract the extraction information of the text to be extracted according to the text prediction model.
  • the embodiments of the present application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
  • an embodiment of the present application provides an information extraction device based on a small number of training samples.
  • the device includes: an acquisition module 31 for acquiring training samples, the training samples being texts that have been labeled with key information to be extracted; an extraction module 32 for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module 33 for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and an extraction module 34 configured to extract, according to the text prediction model, the extraction information of the text to be extracted.
  • This application provides an information extraction device based on a small number of training samples.
  • training samples are first obtained; then the sample feature vector of each sentence in the training samples is extracted according to the BERT language model; then the text prediction model is trained based on the training samples, the key information to be extracted, and the sample feature vectors; and finally the extraction information of the text to be extracted is extracted according to the text prediction model.
  • the embodiments of the present application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
  • an embodiment of the present application provides another information extraction device based on a small number of training samples.
  • the device includes: an acquisition module 41 for acquiring training samples, the training samples being texts that have been labeled with key information to be extracted; an extraction module 42 for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module 43 for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and an extraction module 44 configured to extract, according to the text prediction model, the extraction information of the text to be extracted.
  • the tag types of the key information to be extracted include phrase tags and paragraph tags;
  • the initial model includes a latent semantic index LSI initial model and a conditional random field CRF initial model, and the text prediction model includes an LSI prediction model and a CRF prediction Model.
  • the training module 43 includes: a judging unit 431 for judging the tag type of the key information to be extracted; a determining unit 432 for determining that the text prediction model is the LSI model if the annotation tag is a phrase tag, and for determining that the text prediction model is the CRF model if the annotation tag is a paragraph tag; and a training unit 433 for training, based on the training samples, the key information to be extracted, and the sample feature vectors, the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model.
  • the training unit 433 includes: a calculation subunit 4331 configured to use the LSI initial model to calculate feature similarity, where feature similarity is the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted; a searching subunit 4332 for finding the training sentence in the training samples with the highest feature similarity; an ending subunit 4333 for ending the training of the LSI initial model to obtain the LSI prediction model if the training sentence contains the key information to be extracted; and an updating subunit 4334 for updating the LSI parameters and recalculating the feature similarity if the training sentence does not contain the key information to be extracted.
  • the training unit 433 further includes: a splicing subunit 4335 for splicing the sample feature vectors corresponding to each clause in the training samples; and a training subunit 4336 for taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input and training the CRF initial model to obtain the CRF prediction model.
  • the training unit 433 is configured to train the LSI initial model and the CRF initial model in parallel, based on the training samples, the key information to be extracted, and the sample feature vectors, if the tag types include both the phrase tag and the paragraph tag.
  • the extraction module 44 includes: an extraction unit 441 configured to use the LSI prediction model to extract the LSI information of the text to be extracted, and further configured to use the CRF prediction model to extract the CRF information of the text to be extracted; and a merging unit 442 configured to merge the LSI information and the CRF information to generate the extraction information.
  • the device further includes: a display module 45 for displaying the extracted information in the form of a table after extracting the extraction information of the text to be extracted according to the text prediction model.
  • This application provides an information extraction device based on a small number of training samples.
  • training samples are first obtained; then the sample feature vector of each sentence in the training samples is extracted according to the BERT language model; then the text prediction model is trained based on the training samples, the key information to be extracted, and the sample feature vectors; and finally the extraction information of the text to be extracted is extracted according to the text prediction model.
  • the embodiments of the present application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
  • a storage medium stores at least one executable instruction, and the executable instruction enables a processor to execute the information extraction method based on a small number of training samples in any of the foregoing method embodiments.
  • the storage medium involved in this application may be a computer (readable) storage medium, and the storage medium, such as a computer storage medium, may be non-volatile or volatile.
  • FIG. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present application, and the specific embodiment of the present application does not limit the specific implementation of the computer device.
  • the computer device may include: a processor (processor) 502, a communication interface (Communications Interface) 504, a memory (memory) 506, and a communication bus 508.
  • the processor 502, the communication interface 504, and the memory 506 communicate with each other through the communication bus 508.
  • the communication interface 504 is used to communicate with other devices, such as network elements such as clients or other servers.
  • the processor 502 is configured to execute the program 510, and specifically can execute the relevant steps in the foregoing embodiment of the information extraction method based on a small number of training samples.
  • the program 510 may include program code, and the program code includes a computer operation instruction.
  • the processor 502 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the computer device may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 506 is used to store the program 510.
  • the memory 506 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), for example, at least one magnetic disk memory.
  • the program 510 can specifically be used to cause the processor 502 to perform the following operations: obtain training samples, which are labeled texts of key information to be extracted; extract the sample feature vector of each sentence in the training samples according to the BERT language model; train an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extract, according to the text prediction model, the extraction information of the text to be extracted.
  • the modules or steps of this application can be implemented by a general-purpose computing device, and they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices.
  • they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that given here, or made into individual integrated circuit modules, or multiple of the modules or steps can be made into a single integrated circuit module. Thus, this application is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses an information extraction method and device based on a small number of training samples, relating to the technical field of data processing, devised to solve the prior-art problem that effective target text information of the text to be extracted cannot be obtained. The method mainly includes: obtaining training samples, the training samples being texts labeled with key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted. This application is mainly applied in the process of information extraction.

Description

Information extraction method and device based on a small number of training samples
This application claims priority to the Chinese patent application filed with the China Patent Office on March 3, 2020 under application number 202010138072.8 and entitled "Information extraction method and device based on a small number of training samples", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data processing, and in particular to an information extraction method and device based on a small number of training samples.
Background
Information extraction structures the information contained in unstructured text and outputs information points in a fixed format, thereby helping users classify, extract, and reconstruct massive amounts of content. Information extraction tags usually include entities, relationships, and events, such as extracting times, locations, and key figures. Information extraction is significant: because it can pull the information frames and content that users care about out of large amounts of text, it can serve information retrieval and information integration, and it has rich application scenarios in sentiment analysis and text mining.
The inventor realized that the current approach is to obtain a general text extraction model, then obtain a small number of training samples, train on the training sample data in the general text extraction model to obtain the training standard fields that the model extracts from the training samples, then adjust the parameters of the general text extraction model according to the training standard fields and the target standard fields until the convergence conditions are met to obtain the target text extraction model, and finally input the text to be extracted into the target text extraction model, which obtains the target text information from the text to be extracted.
With this approach, because the training samples are few, the target text information may be inconsistent with the training label fields, so effective target text information of the text to be extracted cannot be obtained.
Technical Problem
In view of this, this application provides an information extraction method and device based on a small number of training samples, whose main purpose is to solve the prior-art problem that effective target text information of the text to be extracted cannot be obtained.
Technical Solution
According to one aspect of this application, an information extraction method based on a small number of training samples is provided, including: obtaining training samples, the training samples being texts labeled with key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted.
According to another aspect of this application, an information extraction device based on a small number of training samples is provided, including: an acquisition module for acquiring training samples, the training samples being texts labeled with key information to be extracted; an extraction module for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to generate a text prediction model; and an extraction module for extracting, according to the text prediction model, the extraction information of the text to be extracted.
According to yet another aspect of this application, a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the following steps: obtaining training samples, the training samples being texts labeled with key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted.
According to still another aspect of this application, a computer device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is used to store at least one executable instruction, the executable instruction causing the processor to perform the following steps: obtaining training samples, the training samples being texts labeled with key information to be extracted; extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extracting, according to the text prediction model, the extraction information of the text to be extracted.
Beneficial Effects
The embodiments of this application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
The above description is only an overview of the technical solution of this application. To understand the technical means of this application more clearly, it can be implemented according to the contents of the specification; and to make the above and other objectives, features, and advantages of this application more obvious and understandable, specific embodiments of this application are set out below.
Brief Description of the Drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be regarded as limiting this application. Throughout the drawings, the same reference symbols denote the same components.
Fig. 1 shows a flowchart of an information extraction method based on a small number of training samples provided by an embodiment of this application.
Fig. 2 shows a flowchart of another information extraction method based on a small number of training samples provided by an embodiment of this application.
Fig. 3 shows a block diagram of an information extraction device based on a small number of training samples provided by an embodiment of this application.
Fig. 4 shows a block diagram of another information extraction device based on a small number of training samples provided by an embodiment of this application.
Fig. 5 shows a schematic structural diagram of a computer device provided by an embodiment of this application.
Embodiments of the Invention
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be conveyed completely to those skilled in the art.
The technical solution of this application can be applied to the fields of artificial intelligence, blockchain, and/or big data technology, and can achieve effective extraction of information through predictive analysis. Optionally, the data involved in this application, such as training samples, can be stored in a database or in a blockchain, which this application does not limit.
The embodiments of this application need only a small amount of annotation to quickly train the required text prediction model, so they can be applied to multiple document types, such as contract texts, resumes, and insurance documents. An embodiment of this application provides an information extraction method based on a small number of training samples. As shown in Fig. 1, the method includes the following steps.
101. Obtain training samples.
Training samples are texts that have been labeled with key information to be extracted. In the embodiments of this application, even a small number of training samples can achieve effective extraction of similar information from the texts to be extracted. For example, to extract "rent-free period information" in batches, a "rent-free period" label is set. If the rent-free period in a training sample runs from January 1, 2018 to June 1, 2018, then "January 1, 2018 to June 1, 2018" is marked with the "rent-free period" label; this marked span is the key information to be extracted. The training samples include multiple documents, such as 30 documents marked with the "rent-free period" label.
Users can annotate the initial texts through an online annotation tool to generate training samples. Annotating through an online tool allows the annotation content to be updated and improved online at any time, meeting individualized needs and flexible extraction requirements, and ensuring that the extracted information suits the information extraction needs of many document types.
The training samples and the labeled key information to be extracted together serve as the basis of model training. During annotation, multiple labels can be set according to actual needs, such as Party A, Party B, lease time, lease address, and rent-free period; the number of labels is not limited in the embodiments of this application.
102. Extract, according to the BERT language model, the sample feature vector of each sentence in the training samples.
The BERT language model is pre-trained on a large-scale corpus, which compensates for the small number of training samples. The BERT language model can serve as a text semantic feature extractor to learn vector representations of Chinese words. The training corpus of the BERT language model includes a range of natural language texts such as the Chinese wiki, news texts, and novels. According to the BERT language model, the sample feature vector extracted for each sentence in the training samples is a vector representation of that sentence, representing its word-level, sentence-level, and context-bearing mapping results.
103. Train an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model.
During training, the training samples and their corresponding sample feature vectors are input into the initial model, and the training sentence predicted by the initial model is compared with the sentence containing the key information to be extracted. If the two are the same, the training of the initial model is complete; if they differ, the model parameters of the initial model need to be changed and training continued. When training ends, the initial model and its model parameters together constitute the text prediction model.
104. Extract, according to the text prediction model, the extraction information of the text to be extracted.
The extraction information corresponds to the sample feature vector of the key information to be extracted in the training samples. If the sample feature vector of the information to be extracted corresponds to the "rent-free period", then the extraction information is the text related to the "rent-free period" in the text to be extracted.
This application provides an information extraction method based on a small number of training samples: first obtain training samples, then extract the sample feature vector of each sentence in the training samples according to the BERT language model, then train the text prediction model according to the training samples, the key information to be extracted, and the sample feature vectors, and finally extract the extraction information of the text to be extracted according to the text prediction model. Compared with the prior art, the embodiments of this application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
An embodiment of this application provides another information extraction method based on a small number of training samples. As shown in Fig. 2, the method includes the following steps.
201. Obtain training samples.
Training samples are texts that have been labeled with key information to be extracted. In the embodiments of this application, even a small number of training samples can achieve effective extraction of similar information from the texts to be extracted. The training samples and the labeled key information to be extracted together serve as the basis of model training. The tag types of the key information to be extracted include phrase tags and paragraph tags. For example, for a certain type of lease contract text, users can set multiple labels as needed, such as Party A, Party B, lease time, lease address, and rent-free period. Phrase tags annotate shorter information, such as Party A and Party B; paragraph tags annotate longer information, such as breach-of-contract clauses.
202. Extract, according to the BERT language model, the sample feature vector of each sentence in the training samples.
The BERT language model is pre-trained on a large-scale corpus, which compensates for the small number of training samples. The BERT language model can serve as a text semantic feature extractor to learn vector representations of Chinese words. In the early stage, a large-scale training corpus was innovatively pre-trained and transferred to actual sentences, so that, serving as a text semantic feature extractor, it breaks through the current technical bottleneck of requiring large training samples. The training corpus of the BERT language model includes a range of natural language texts such as the Chinese wiki, news texts, and novels. According to the BERT language model, the sample feature vector extracted for each sentence in the training samples is a vector representation of that sentence, representing its word-level, sentence-level, and context-bearing mapping results.
Here, the "word-level, sentence-level, and context-bearing mapping results" refer to three data features covered by the vector representation; three vector components identify these sentence features within the same vector representation. For example, if the sample text is "After completing the first transaction, Party A Zhang San and Party B Li Si sign an agreement in Shanghai" and the user marks "Zhang San, Li Si", then the feature vector may be "[0, Party A, Party B]", where 0 indicates that the marked text is word-level text, Party A is the context mapping result of the "Zhang San" mark, i.e. the superordinate feature of that marked text, and Party B is the context mapping result of the "Li Si" mark, i.e. the superordinate feature of that marked text.
203. Train an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model.
The initial model includes a latent semantic indexing (LSI) initial model and a conditional random field (CRF) initial model. The text prediction model includes an LSI prediction model and a CRF prediction model. The tag types of the key information to be extracted include phrase tags and paragraph tags, and the two different text prediction models correspond one-to-one with the tag types of the key information to be extracted. On this basis, training the initial model specifically includes: judging the tag type of the key information to be extracted; if the annotation tag is a phrase tag, determining that the text prediction model is the LSI model; if the annotation tag is a paragraph tag, determining that the text prediction model is the CRF model; and, according to the training samples, the key information to be extracted, and the sample feature vectors, training the LSI initial model to obtain the LSI prediction model and/or training the CRF initial model to obtain the CRF prediction model.
Training the LSI initial model to obtain the LSI prediction model according to the training samples, the key information to be extracted, and the sample feature vectors includes: using the LSI initial model to calculate feature similarity, where feature similarity is the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted; finding the training sentence in the training samples with the highest feature similarity; if that training sentence contains the key information to be extracted, ending the training of the LSI initial model to obtain the LSI prediction model; and if it does not, updating the LSI parameters and recalculating the feature similarity.
The LSI initial model is used to learn the semantics of the vocabulary of the key information to be extracted, so as to extract, from the text to be extracted, vocabulary semantically related to the key information to be extracted. The basic idea of the LSI initial model is that the words in a text are not isolated; certain latent semantic relationships exist among them. Through statistical analysis of the training samples, the latent semantic relationships are mined automatically and expressed as a model the computer can understand, and synonymy and polysemy can likewise be learned while mining the semantic relationships. In training the LSI initial model, LSI parameters such as the minimum error of the low-rank approximation and the number of topics need to be set. If the training sentence does not contain the key information to be extracted, the LSI parameters are updated according to preset rules, and the feature similarity is recalculated based on the updated LSI parameters. The preset rules for LSI parameter updates cover two update trends, increasing or decreasing the minimum error and the number of topics by a fixed step. Each update changes one LSI parameter along one trend, and the feature similarity is then recalculated with the updated LSI parameters. If the feature similarity increases, that update trend favors convergence of the LSI initial model's training, and if the LSI parameter needs updating again, it is updated again along the same trend. In the embodiments of this application, the sample feature vectors obtained through the BERT language model are used to overcome the problem of having few training samples.
Training the CRF initial model to obtain the CRF prediction model according to the training samples, the key information to be extracted, and the sample feature vectors includes: splicing the sample feature vectors corresponding to each clause in the training samples; and, taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input, training the CRF initial model to obtain the CRF prediction model. Training the CRF initial model means training the model parameters of the CRF initial model to obtain the CRF prediction model. Each clause in the training samples is sequence-labeled, and the sample feature vectors of the clauses are distinguished by the sequence labels during training. After the sample feature vectors corresponding to the clauses are spliced, the splicing result also carries the sequence labels. A CRF toolkit can be downloaded in the programming environment to train the CRF initial model.
If the tag types include both the phrase tag and the paragraph tag, the LSI initial model and the CRF initial model are trained in parallel according to the training samples, the key information to be extracted, and the sample feature vectors. According to the annotation tag types, the LSI initial model and the CRF initial model are trained in parallel: following the specific methods for training the LSI initial model and the CRF initial model, both trainings are started at the same time. This greatly reduces the magnitude of the model parameters to be trained and ensures that 90% accuracy can be reached on labeled data on the order of 10 to 20 documents, achieving training that needs few samples, has high accuracy, and is fast.
204. Extract, according to the text prediction model, the extraction information of the text to be extracted.
Since the text prediction model includes the LSI prediction model and the CRF prediction model, both models are used to extract the extraction information of the text to be extracted, which specifically includes: using the LSI prediction model to extract the LSI information of the text to be extracted; using the CRF prediction model to extract the CRF information of the text to be extracted; and merging the LSI information and the CRF information to generate the extraction information. Adopting different algorithm models according to the annotation type ensures the highest accuracy and also facilitates the user's label management.
205. Display the extraction information in tabular form.
The extraction information is displayed in tabular form, which is intuitive and clear and convenient for users to view.
This application provides an information extraction method based on a small number of training samples: first obtain training samples, then extract the sample feature vector of each sentence in the training samples according to the BERT language model, then train the text prediction model according to the training samples, the key information to be extracted, and the sample feature vectors, and finally extract the extraction information of the text to be extracted according to the text prediction model. Compared with the prior art, the embodiments of this application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
Further, as an implementation of the method shown in Fig. 1 above, an embodiment of this application provides an information extraction device based on a small number of training samples. As shown in Fig. 3, the device includes: an acquisition module 31 for acquiring training samples, the training samples being texts labeled with key information to be extracted; an extraction module 32 for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module 33 for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and an extraction module 34 for extracting, according to the text prediction model, the extraction information of the text to be extracted.
This application provides an information extraction device based on a small number of training samples: training samples are first obtained; then the sample feature vector of each sentence in the training samples is extracted according to the BERT language model; then the text prediction model is trained based on the training samples, the key information to be extracted, and the sample feature vectors; and finally the extraction information of the text to be extracted is extracted according to the text prediction model. Compared with the prior art, the embodiments of this application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
Further, as an implementation of the method shown in Fig. 2 above, an embodiment of this application provides another information extraction device based on a small number of training samples. As shown in Fig. 4, the device includes: an acquisition module 41 for acquiring training samples, the training samples being texts labeled with key information to be extracted; an extraction module 42 for extracting, according to the BERT language model, the sample feature vector of each sentence in the training samples; a training module 43 for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and an extraction module 44 for extracting, according to the text prediction model, the extraction information of the text to be extracted.
Further, the tag types of the key information to be extracted include phrase tags and paragraph tags; the initial model includes a latent semantic indexing (LSI) initial model and a conditional random field (CRF) initial model; and the text prediction model includes an LSI prediction model and a CRF prediction model.
The training module 43 includes: a judging unit 431 for judging the tag type of the key information to be extracted; a determining unit 432 for determining that the text prediction model is the LSI model if the annotation tag is a phrase tag, and for determining that the text prediction model is the CRF model if the annotation tag is a paragraph tag; and a training unit 433 for training, according to the training samples, the key information to be extracted, and the sample feature vectors, the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model.
Further, the training unit 433 includes: a calculation subunit 4331 for using the LSI initial model to calculate feature similarity, the feature similarity being the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted; a searching subunit 4332 for finding the training sentence in the training samples with the highest feature similarity; an ending subunit 4333 for ending the training of the LSI initial model to obtain the LSI prediction model if the training sentence contains the key information to be extracted; and an updating subunit 4334 for updating the LSI parameters and recalculating the feature similarity if the training sentence does not contain the key information to be extracted.
Further, the training unit 433 includes: a splicing subunit 4335 for splicing the sample feature vectors corresponding to each clause in the training samples; and a training subunit 4336 for taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input and training the CRF initial model to obtain the CRF prediction model.
Further, the training unit 433 is configured to train the LSI initial model and the CRF initial model in parallel according to the training samples, the key information to be extracted, and the sample feature vectors if the tag types include both the phrase tag and the paragraph tag.
Further, the extraction module 44 includes: an extraction unit 441 for using the LSI prediction model to extract the LSI information of the text to be extracted, and further for using the CRF prediction model to extract the CRF information of the text to be extracted; and a merging unit 442 for merging the LSI information and the CRF information to generate the extraction information.
Further, the device further includes a display module 45 for displaying the extraction information in tabular form after the extraction information of the text to be extracted is extracted according to the text prediction model.
This application provides an information extraction device based on a small number of training samples: training samples are first obtained; then the sample feature vector of each sentence in the training samples is extracted according to the BERT language model; then the text prediction model is trained based on the training samples, the key information to be extracted, and the sample feature vectors; and finally the extraction information of the text to be extracted is extracted according to the text prediction model. Compared with the prior art, the embodiments of this application extract sample feature vectors through a BERT language model based on a large-scale training corpus. Even with a small number of training samples, feature vectors that capture the key information to be extracted fairly comprehensively can be learned, so that the trained text prediction model can extract extraction information similar to the key information to be extracted and thereby obtain effective extraction information.
According to an embodiment of this application, a storage medium is provided. The storage medium stores at least one executable instruction, and the executable instruction can cause a processor to execute the information extraction method based on a small number of training samples in any of the foregoing method embodiments.
Optionally, the storage medium involved in this application may be a computer-readable storage medium, and the storage medium, such as a computer storage medium, may be non-volatile or volatile.
Fig. 5 shows a schematic structural diagram of a computer device provided according to an embodiment of this application; the specific embodiments of this application do not limit the specific implementation of the computer device.
As shown in Fig. 5, the computer device may include: a processor 502, a communication interface 504, a memory 506, and a communication bus 508.
The processor 502, the communication interface 504, and the memory 506 communicate with one another through the communication bus 508.
The communication interface 504 is used to communicate with network elements of other devices, such as clients or other servers.
The processor 502 is configured to execute the program 510, and specifically can execute the relevant steps in the foregoing embodiments of the information extraction method based on a small number of training samples.
Specifically, the program 510 may include program code, and the program code includes computer operation instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The one or more processors included in the computer device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 506 is used to store the program 510. The memory 506 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 510 can specifically be used to cause the processor 502 to perform the following operations: obtain training samples, the training samples being texts labeled with key information to be extracted; extract, according to the BERT language model, the sample feature vector of each sentence in the training samples; train an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and extract, according to the text prediction model, the extraction information of the text to be extracted.
Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that given here, or the modules or steps can be made into individual integrated circuit modules, or multiple of them can be made into a single integrated circuit module. Thus, this application is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of this application and are not intended to limit this application; for those skilled in the art, this application may have various changes and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall be included in the scope of protection of this application.

Claims (20)

  1. An information extraction method based on a small number of training samples, comprising:
    obtaining training samples, the training samples being texts labeled with key information to be extracted;
    extracting, according to a BERT language model, a sample feature vector of each sentence in the training samples;
    training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and
    extracting, according to the text prediction model, extraction information of a text to be extracted.
  2. The method according to claim 1, wherein the tag types of the key information to be extracted include phrase tags and paragraph tags; the initial model includes a latent semantic indexing (LSI) initial model and a conditional random field (CRF) initial model; and the text prediction model includes an LSI prediction model and a CRF prediction model;
    the training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model comprises:
    judging the tag type of the key information to be extracted;
    if the annotation tag is a phrase tag, determining that the text prediction model is the LSI model;
    if the annotation tag is a paragraph tag, determining that the text prediction model is the CRF model; and
    training, according to the training samples, the key information to be extracted, and the sample feature vectors, the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model.
  3. The method according to claim 2, wherein the training the LSI initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the LSI prediction model comprises:
    using the LSI initial model to calculate feature similarity, the feature similarity being the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted;
    finding the training sentence in the training samples with the highest feature similarity;
    if the training sentence contains the key information to be extracted, ending the training of the LSI initial model to obtain the LSI prediction model; and
    if the training sentence does not contain the key information to be extracted, updating the LSI parameters and recalculating the feature similarity.
  4. The method according to claim 2, wherein the training the CRF initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the CRF prediction model comprises:
    splicing the sample feature vectors corresponding to each clause in the training samples; and
    taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input, training the CRF initial model to obtain the CRF prediction model.
  5. The method according to claim 2, wherein the training the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model according to the training samples, the key information to be extracted, and the sample feature vectors comprises:
    if the tag types include the phrase tag and the paragraph tag, training the LSI initial model and the CRF initial model in parallel according to the training samples, the key information to be extracted, and the sample feature vectors.
  6. The method according to claim 2, wherein the extracting, according to the text prediction model, extraction information of the text to be extracted comprises:
    using the LSI prediction model to extract LSI information of the text to be extracted;
    using the CRF prediction model to extract CRF information of the text to be extracted; and
    merging the LSI information and the CRF information to generate the extraction information.
  7. The method according to any one of claims 1-6, wherein after the extracting, according to the text prediction model, extraction information of the text to be extracted, the method further comprises:
    displaying the extraction information in tabular form.
  8. An information extraction device based on a small number of training samples, comprising:
    an acquisition module for acquiring training samples, the training samples being texts labeled with key information to be extracted;
    an extraction module for extracting, according to a BERT language model, a sample feature vector of each sentence in the training samples;
    a training module for training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and
    an extraction module for extracting, according to the text prediction model, extraction information of a text to be extracted.
  9. A computer storage medium, wherein at least one executable instruction is stored in the computer storage medium, the executable instruction causing a processor to perform the following steps:
    obtaining training samples, the training samples being texts labeled with key information to be extracted;
    extracting, according to a BERT language model, a sample feature vector of each sentence in the training samples;
    training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and
    extracting, according to the text prediction model, extraction information of a text to be extracted.
  10. The computer storage medium according to claim 9, wherein the tag types of the key information to be extracted include phrase tags and paragraph tags; the initial model includes a latent semantic indexing (LSI) initial model and a conditional random field (CRF) initial model; and the text prediction model includes an LSI prediction model and a CRF prediction model;
    when the initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the text prediction model, the following is specifically performed:
    judging the tag type of the key information to be extracted;
    if the annotation tag is a phrase tag, determining that the text prediction model is the LSI model;
    if the annotation tag is a paragraph tag, determining that the text prediction model is the CRF model; and
    training, according to the training samples, the key information to be extracted, and the sample feature vectors, the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model.
  11. The computer storage medium according to claim 10, wherein when the LSI initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the LSI prediction model, the following is specifically performed:
    using the LSI initial model to calculate feature similarity, the feature similarity being the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted;
    finding the training sentence in the training samples with the highest feature similarity;
    if the training sentence contains the key information to be extracted, ending the training of the LSI initial model to obtain the LSI prediction model; and
    if the training sentence does not contain the key information to be extracted, updating the LSI parameters and recalculating the feature similarity.
  12. The computer storage medium according to claim 10, wherein when the CRF initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the CRF prediction model, the following is specifically performed:
    splicing the sample feature vectors corresponding to each clause in the training samples; and
    taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input, training the CRF initial model to obtain the CRF prediction model.
  13. The computer storage medium according to claim 10, wherein when the extraction information of the text to be extracted is extracted according to the text prediction model, the following is specifically performed:
    using the LSI prediction model to extract LSI information of the text to be extracted;
    using the CRF prediction model to extract CRF information of the text to be extracted; and
    merging the LSI information and the CRF information to generate the extraction information.
  14. The computer storage medium according to any one of claims 9-13, wherein after the extraction information of the text to be extracted is extracted according to the text prediction model, the executable instruction further causes the processor to perform the following step:
    displaying the extraction information in tabular form.
  15. A computer device, comprising: a processor, a memory, a communication interface, and a communication bus, the processor, the memory, and the communication interface communicating with one another through the communication bus;
    the memory being used to store at least one executable instruction, the executable instruction causing the processor to perform the following steps:
    obtaining training samples, the training samples being texts labeled with key information to be extracted;
    extracting, according to a BERT language model, a sample feature vector of each sentence in the training samples;
    training an initial model according to the training samples, the key information to be extracted, and the sample feature vectors to obtain a text prediction model; and
    extracting, according to the text prediction model, extraction information of a text to be extracted.
  16. The computer device according to claim 15, wherein the tag types of the key information to be extracted include phrase tags and paragraph tags; the initial model includes a latent semantic indexing (LSI) initial model and a conditional random field (CRF) initial model; and the text prediction model includes an LSI prediction model and a CRF prediction model;
    when the initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the text prediction model, the following is specifically performed:
    judging the tag type of the key information to be extracted;
    if the annotation tag is a phrase tag, determining that the text prediction model is the LSI model;
    if the annotation tag is a paragraph tag, determining that the text prediction model is the CRF model; and
    training, according to the training samples, the key information to be extracted, and the sample feature vectors, the LSI initial model to obtain the LSI prediction model and/or the CRF initial model to obtain the CRF prediction model.
  17. The computer device according to claim 16, wherein when the LSI initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the LSI prediction model, the following is specifically performed:
    using the LSI initial model to calculate feature similarity, the feature similarity being the similarity between the sample feature vector of each sentence in the training samples and the sample feature vector of the sentence containing the key information to be extracted;
    finding the training sentence in the training samples with the highest feature similarity;
    if the training sentence contains the key information to be extracted, ending the training of the LSI initial model to obtain the LSI prediction model; and
    if the training sentence does not contain the key information to be extracted, updating the LSI parameters and recalculating the feature similarity.
  18. The computer device according to claim 16, wherein when the CRF initial model is trained according to the training samples, the key information to be extracted, and the sample feature vectors to obtain the CRF prediction model, the following is specifically performed:
    splicing the sample feature vectors corresponding to each clause in the training samples; and
    taking the splicing result and the sample feature vector corresponding to the key information to be extracted as input, training the CRF initial model to obtain the CRF prediction model.
  19. The computer device according to claim 16, wherein when the extraction information of the text to be extracted is extracted according to the text prediction model, the following is specifically performed:
    using the LSI prediction model to extract LSI information of the text to be extracted;
    using the CRF prediction model to extract CRF information of the text to be extracted; and
    merging the LSI information and the CRF information to generate the extraction information.
  20. The computer device according to any one of claims 15-19, wherein after the extraction information of the text to be extracted is extracted according to the text prediction model, the executable instruction further causes the processor to perform the following step:
    displaying the extraction information in tabular form.
PCT/CN2020/121886 2020-03-03 2020-10-19 Information extraction method and device based on a small number of training samples WO2021174864A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010138072.8 2020-03-03
CN202010138072.8A CN111506696A (zh) Information extraction method and device based on a small number of training samples

Publications (1)

Publication Number Publication Date
WO2021174864A1 true WO2021174864A1 (zh) 2021-09-10

Family

ID=71877420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121886 WO2021174864A1 (zh) Information extraction method and device based on a small number of training samples

Country Status (2)

Country Link
CN (1) CN111506696A (zh)
WO (1) WO2021174864A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506696A (zh) 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on a small number of training samples
CN112668316A (zh) * 2020-11-17 2021-04-16 国家计算机网络与信息安全管理中心 Key information extraction method for Word documents
CN115600602B (zh) * 2022-12-13 2023-02-28 中南大学 Key element extraction method and system for long texts, and terminal device


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874432B2 (en) * 2010-04-28 2014-10-28 Nec Laboratories America, Inc. Systems and methods for semi-supervised relationship extraction
CN109241521B (zh) * 2018-07-27 2023-06-20 中山大学 Method for extracting high-attention sentences from scientific literature based on citation relations
CN109145089B (zh) * 2018-08-30 2021-07-30 中国科学院遥感与数字地球研究所 Hierarchical thematic attribute extraction method based on natural language processing
CN109871451B (zh) * 2019-01-25 2021-03-19 中译语通科技股份有限公司 Relation extraction method and system incorporating dynamic word vectors
CN110598213A (zh) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, apparatus, device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008472A (zh) * 2019-03-29 2019-07-12 北京明略软件系统有限公司 Entity extraction method, apparatus, device, and computer-readable storage medium
CN110083836A (zh) * 2019-04-24 2019-08-02 哈尔滨工业大学 Key evidence extraction method for text prediction results
CN110532563A (zh) * 2019-09-02 2019-12-03 苏州美能华智能科技有限公司 Method and device for detecting key paragraphs in text
CN110781276A (zh) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Text extraction method, apparatus, device, and storage medium
CN110851596A (zh) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 Text classification method, device, and computer-readable storage medium
CN111506696A (zh) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on a small number of training samples

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806565A (zh) * 2021-11-18 2021-12-17 中科雨辰科技有限公司 Data processing system for text annotation
CN113806565B (zh) * 2021-11-18 2022-03-25 中科雨辰科技有限公司 Data processing system for text annotation
CN114417974A (zh) * 2021-12-22 2022-04-29 北京百度网讯科技有限公司 Model training method, information processing method, apparatus, electronic device, and medium
CN114417974B (zh) * 2021-12-22 2023-06-20 北京百度网讯科技有限公司 Model training method, information processing method, apparatus, electronic device, and medium
CN114357144A (zh) * 2022-03-09 2022-04-15 北京大学 Few-shot-based medical numerical value extraction and understanding method and device
CN114357144B (zh) * 2022-03-09 2022-08-09 北京大学 Few-shot-based medical numerical value extraction and understanding method and device
CN114970955A (zh) * 2022-04-15 2022-08-30 黑龙江省网络空间研究中心 Short-video popularity prediction method and device based on a multimodal pre-trained model
CN114970955B (zh) * 2022-04-15 2023-12-15 黑龙江省网络空间研究中心 Short-video popularity prediction method and device based on a multimodal pre-trained model
CN114841274A (zh) * 2022-05-12 2022-08-02 百度在线网络技术(北京)有限公司 Language model training method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN111506696A (zh) 2020-08-07

Similar Documents

Publication Publication Date Title
WO2021174864A1 (zh) Information extraction method and device based on a small number of training samples
US11501182B2 (en) Method and apparatus for generating model
CN109493977B (zh) 文本数据处理方法、装置、电子设备及计算机可读介质
WO2020215457A1 (zh) 一种基于对抗学习的文本标注方法和设备
CN110287480B (zh) 一种命名实体识别方法、装置、存储介质及终端设备
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
JP7301922B2 (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN108460011B (zh) 一种实体概念标注方法及系统
CN107861954B (zh) 基于人工智能的信息输出方法和装置
EP3879427A2 (en) Information extraction method, extraction model training method, apparatus and electronic device
WO2020215456A1 (zh) 一种基于教师监督的文本标注方法和设备
CN113220836A (zh) 序列标注模型的训练方法、装置、电子设备和存储介质
US11651015B2 (en) Method and apparatus for presenting information
US20120158742A1 (en) Managing documents using weighted prevalence data for statements
US20220129448A1 (en) Intelligent dialogue method and apparatus, and storage medium
WO2022174496A1 (zh) 基于生成模型的数据标注方法、装置、设备及存储介质
CN113051356A (zh) 开放关系抽取方法、装置、电子设备及存储介质
CN113836925A (zh) 预训练语言模型的训练方法、装置、电子设备及存储介质
CN114218951B (zh) 实体识别模型的训练方法、实体识别方法及装置
CN112711943A (zh) 一种维吾尔文语种识别方法、装置及存储介质
CN112015866A (zh) 用于生成同义文本的方法、装置、电子设备及存储介质
US20230139642A1 (en) Method and apparatus for extracting skill label
CN111328416B (zh) 用于自然语言处理中的模糊匹配的语音模式
US20230111052A1 (en) Self-learning annotations to generate rules to be utilized by rule-based system
CN115510247A (zh) 一种电碳政策知识图谱构建方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20922976

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20922976

Country of ref document: EP

Kind code of ref document: A1