WO2021203581A1 - Key information extraction method, device and storage medium based on finely annotated text - Google Patents

Key information extraction method, device and storage medium based on finely annotated text

Info

Publication number
WO2021203581A1
WO2021203581A1 (PCT/CN2020/103933)
Authority
WO
WIPO (PCT)
Prior art keywords
key information
information extraction
text data
text
extraction model
Prior art date
Application number
PCT/CN2020/103933
Other languages
English (en)
French (fr)
Inventor
曹辰捷
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021203581A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a key information extraction method, system, device and storage medium based on finely annotated text.
  • Machine reading comprehension refers to allowing machines to answer content-related questions by reading text.
  • At present, applications of artificial-intelligence reading comprehension, in which the question to be answered and the related reading material are input into a trained reading comprehension model, are becoming increasingly widespread.
  • However, the annotation of key fragments cannot cover many domains and is therefore one-sided, and outsourcing the manual annotation of key sentences/paragraphs would greatly increase time and monetary costs.
  • This application provides a key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text, which mainly solve the problem of automatically annotating text spans through a BERT pre-training model and a key information extraction model.
  • To this end, the present application provides a key information extraction method based on finely annotated text, applied to an electronic device.
  • The method includes: S110, pre-training text data through the BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets the set selection rule.
  • In addition, this application provides a key information extraction system based on finely annotated text, including a pre-training unit, a key information acquisition unit and a key information output unit; the pre-training unit is used to pre-train text data to obtain word vectors of the text data and combine the obtained word vectors into matrix-form text data; the key information acquisition unit is used to input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; the key information output unit is used to sort the obtained key information according to a preset sorting rule and output the key information that meets the set selection rule.
  • Furthermore, the present application provides an electronic device, which includes a memory and a processor.
  • The memory stores a key information extraction program based on finely annotated text, and when the program is executed by the processor, the following steps are implemented: S110, pre-training the text data through the BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets the set selection rule.
  • In addition, the present application also provides a computer-readable storage medium storing a computer program, and the computer program includes a key information extraction program based on finely annotated text.
  • When the key information extraction program based on finely annotated text is executed by a processor, the steps of the key information extraction method based on finely annotated text are realized.
  • The key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text proposed in this application change the input of the reading comprehension model to a long text plus an empty string (that is, the question is replaced with an empty string),
  • and train the reading comprehension model to learn the characteristics of the standard answers, so that the corresponding span of the text is output as the answer, completely changing the previous mode of inputting text plus a question and outputting an answer. The beneficial effects are as follows: 1) it is an improvement taking the reading comprehension model as the basic idea,
  • converting key information previously annotated at the word, sentence or paragraph level into the annotation of a continuous span; 2) it solves the problem of automatically annotating text spans; 3) it greatly reduces annotation cost and provides strong support for downstream tasks.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for extracting key information based on fine-labeled text according to this application;
  • FIG. 2 is a flowchart of a preferred embodiment of the method for obtaining key information by the key information extraction model of the application
  • FIG. 3 is a schematic structural diagram of a preferred embodiment of a key information extraction system based on fine-labeled text of this application;
  • A reading comprehension model is conditioned on the question (that is, the input is a long text plus a question) and marks the correct answer in the text (the output is the span of the text corresponding to the answer); existing reading comprehension models take pre-annotated key sentences/paragraphs as the model input, and manually annotating key fragments has the drawback of one-sidedness.
  • This application uses the BERT (Bidirectional Encoder Representation from Transformer, a bidirectional attention neural network model) pre-training model to pre-train the text data, which is then fed into the key information extraction model, so that the key information in the text data is output as the response.
  • Specifically, unlike the traditional reading comprehension model's mode of inputting text plus a question and outputting an answer, the key information extraction model of this application takes text as the input and key information as the output; the key information here serves as answer candidates, that is, the key information is to some extent a subset of the answers output by reading comprehension. It should be noted that, because the key information extraction method based on finely annotated text of this application is unsupervised and does not require a question as input, the output key information covers a wider range than answers.
  • FIG. 1 shows the flow of a preferred embodiment of a method for extracting key information based on fine-labeled text according to the present application.
  • the method may be executed by an apparatus, and the apparatus may be realized by software and/or hardware.
  • With the key information extraction model of this application, the input is: "The champion of the 2018 dota2 World Invitational is the OG team"; the key information output is: "2018, the dota2 World Invitational, the OG team".
  • the method for extracting key information based on fine-labeled text includes: step S110-step S130.
  • S110 Pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix text data.
  • Specifically, BERT (Bidirectional Encoder Representation from Transformer, a bidirectional attention neural network model) is a sentence-level language model. Unlike the ELMo model, which needs per-layer weights for global pooling when connecting to specific downstream NLP tasks, BERT can directly obtain a unique vector representation of an entire sentence. It adds a special token [CLS] in front of each input and lets the Transformer deeply encode [CLS]. Because the Transformer can encode global information into every position regardless of space and distance, and the highest hidden layer of [CLS] is connected directly to the softmax output layer as the representation of the sentence/sentence pair, [CLS] acts as a "checkpoint" on the gradient back-propagation path and can learn the upper-layer features of the entire input. The BERT model can therefore further increase the generalization ability of the word vector model and fully describe character-level, word-level, sentence-level and even inter-sentence relationship features.
  • It should be noted that the process by which the BERT pre-training model obtains word vectors is to perform word segmentation first and then pre-train the segmented documents to generate the word vectors. That is, the low-dimensional vector representation of all characters is obtained first, and the low-dimensional vectors are then combined into a two-dimensional vector to obtain the matrix representation of a sentence.
  • Taking the sentence "15岁以下的学生" (students under the age of 15) as an example: each of the characters "1", "5", "岁", "以", "下", "的", "学", "生" can be represented by a vector, and these vectors are then combined into a two-dimensional vector to obtain the matrix representation of the sentence.
  • Specifically, the BERT model generates the d-dimensional word vectors corresponding to the above 8 characters, and the eight vectors are concatenated into an 8*d matrix that uniquely represents the above text, i.e., the matrix-form text data.
  • In general, the BERT pre-training model represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form the two-dimensional matrix-form text data.
  • In a specific embodiment, before step S110, a step of preprocessing the text data is further included, and the preprocessing includes a cleaning process.
  • Specifically, the cleaning refers to the preprocessing of the vertical-domain data corpus.
  • As mentioned above, the BERT pre-training model is applied to vertical-domain data, and such data (law, medicine, news, etc.) is not tidy enough, so it needs to be processed to fit the model's input (the cleaning is applied to the test data, not the training data).
  • Cleaning includes segmentation, removing too-short corpus and removing erroneous corpus. Segmentation: as mentioned earlier, the BERT pre-training model takes a piece of text as input and outputs its keywords. The maximum length of the input text must be limited so that all input text is standardized to that length (this parameter is set to 512; if a segment has fewer than 512 tokens, it is padded with blanks, so that all inputs are standardized to the same length). Clearly, most vertical-domain documents exceed 512 in length, so a document is split and recombined by paragraph, ensuring that the length of each segment stays within the prescribed limit while preserving the semantic coherence of the context as much as possible. Removing too-short corpus: for various possible reasons, a very small portion of the data may be empty or very short; such data does not help downstream work, so it is filtered out directly at this step.
  • The sample set held out during model training is used to tune the model's hyperparameters and evaluate the model's ability; it is used to assess the performance of the final model and helps compare multiple candidate models and make a choice. Evaluating the model's ability on held-out samples yields a smaller bias.
  • The training samples are divided into a training set and a validation set; the model is fitted on the training set, the fitted model is used to predict the data samples retained in the validation set, and the model's validation error is computed quantitatively. MSE is usually used to evaluate the error rate, and the resulting validation-set error rate serves as an estimate of the test error rate.
  • In a specific embodiment, a test set is used to test the trained key information extraction model to obtain the em value; that is, the key information extraction model is tested on the test set, and a key information extraction model whose em value is greater than the set threshold is selected as the trained key information extraction model;
  • where em = n'/n; n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model. That is, if a text has n standard answers and the model's topK outputs for this text form a set K, and n' of the n answers appear in K, then em = n'/n.
  • It should be noted that the CMRC data set is used to train the key information extraction model.
  • Each document in the CMRC data set includes multiple question-answer pairs, and the questions in the question-answer pairs are removed, leaving only the documents and the answers; this document-answer-only CMRC is used as the data set to train the above key information extraction model. That is, the question-stripped CMRC data set serves as the training set of the key information extraction model.
  • Fig. 2 shows a flowchart of a preferred embodiment of the method for obtaining key information according to the key information extraction model of the present application;
  • the key information extraction model includes a fully connected layer, a sigmoid layer, and a cross-entropy optimization layer.
  • the method for obtaining key information by the key information extraction model includes step S210-step S230:
  • The fully connected layer includes a starting-point fully connected network and an end-point fully connected network; the starting-point fully connected network is used to convert the matrix-form text data into a starting-point vector, and the end-point fully connected network is used to convert the matrix-form text data into an end-point vector.
  • Specifically, the matrix of the generated text is passed through a fully connected layer that represents the starting point (i.e., the starting-point fully connected network) to obtain a vector of length l, named start_logits; in the same way, it is passed through a fully connected layer that represents the end point (i.e., the end-point fully connected network) to obtain another vector named end_logits; that is, the starting point and end point of a keyword are predicted through the two fully connected networks, the starting-point fully connected network and the end-point fully connected network.
  • After passing through the two fully connected layers, the matrix-form text data (d*length) is transformed into two (1*length) one-dimensional vectors. That is, in one vector, each character of the sentence corresponds to a value representing the possibility that it can serve as a starting point; in the other vector, each character corresponds to a value representing the possibility that it can serve as an end point.
  • If the key information extraction model is denoted LM_Model, then X = LM_Model(P), where the input P is the original text; before the text is input, tokens are added before and after it as markers: P = [<CLS>, passage, <SEP>].
  • Token marking means that, after word segmentation, <CLS> is added in front and <SEP> is added at the end; they can be regarded as markers for the beginning and end of the text.
  • The output X obtained through the above formula can be regarded as a matrix of length p_length and dimension d_im; the starting point and end point of a keyword are then predicted through the two fully connected networks: startLogits = FC_start(X), endLogits = FC_end(X).
  • S220 Pass the multiple sets of keywords through the sigmoid layer of the key information extraction model to output preliminary key information
  • Through step S210, the start value and end value of each group of keywords in the text data (that is, the possibilities of serving as the starting point and end point), e.g. s and e, are obtained; the first character and the last character are thus confirmed, which determines the result text. To control the length of the result text, the sum of the first character's start value and the last character's end value, denoted C, is computed for all combinations within a certain length range (max_answer_length = 64), and the keywords are then ranked by their C scores.
  • In a specific embodiment, for a fragment c_i appearing in the text, with starting point s_i and end point e_i, the score of c_i is judged to be s_logits[s_i] + e_logits[e_i].
  • In the specific implementation process, because the key information to be screened is part of the text data, it includes a starting point s and an end point e, with 0 < s < e ≤ l; a new vector of length l in which position s is set to 1 and the rest to 0 is created as start_position, and another vector of length l in which position e is set to 1 is created as end_position; the sparse cross entropy start_loss of start_logits and start_position and the sparse cross entropy end_loss of end_logits and end_position are computed, and the loss is the average of start_loss and end_loss, which is used to optimize the preliminary key information. Where:
  • Loss_start = -y*log(sigmoid(start_logits)) - (1-y)*log(1-sigmoid(start_logits))
  • Loss_end = -y*log(sigmoid(end_logits)) - (1-y)*log(1-sigmoid(end_logits))
  • In short, using sigmoid as the activation function and screening the key information with the cross-entropy loss function allows the network parameters of the key information extraction model to learn quickly from errors, so that the network result can be obtained relatively quickly.
  • d is the word embedding dimension
  • l is the maximum length of the text
  • s is the starting point of the key information
  • e is the end point of the key information.
  • M_c is the representation of the text and is a matrix of size d*l; V_s and V_e are the 1*d vectors of the two fully connected layers.
  • s_logits, s_position, e_logits and e_position are each vectors of length l, where: s_position[i] = 1 if i = s, otherwise 0 (i = 0, 1, …, l-1); e_position[i] = 1 if i = e, otherwise 0; s_logits = (V_s * M_c)^T; e_logits = (V_e * M_c)^T; loss = (loss_s + loss_e)/2 = (H(s_logits, s_position) + H(e_logits, e_position))/2.
  • H(p, q) is the cross entropy of the two vectors p and q; that is, when p is s_logits, q is s_position, and when p is e_logits, q is e_position.
  • The sums of the start value of the first character and the end value of the last character of the obtained keywords are arranged in descending order. In other words, because the value of k is relatively small, the values corresponding to the topK results are traversed and then sorted.
  • In a specific embodiment, for a fragment c appearing in the text, with starting point s and end point e, the score of c is judged to be s_logits[s] + e_logits[e]; in subsequent steps, the maximum length of c is controlled to increase the possibilities of different start-end pairs.
  • After this descending sort, the key information that meets the set selection rule is output.
  • It should be noted that, in the specific implementation, the selection rule adopts the topK method: after the sums of the first character's start value and the last character's end value of the keywords are arranged in descending order, the top-ranked keywords are selected;
  • the top K keywords are taken as the final keyword answers.
  • In a specific embodiment, k is set to 10; the top10 of start_logits and the top10 of end_logits are selected and then cross-added, yielding about 100 groups of start+end values. Finally, these are sorted in descending order and the top20 are chosen as the final keyword answers.
  • In general, the training target of the key information extraction model of this application is indeed the answers of the CMRC data set, but the "question" information in the CMRC data set is not used, and the final output is not a unique answer but the top20 key information; that is, without questions, all candidate answers with answer potential are output as key information.
  • Fig. 3 shows the structure of a preferred embodiment of the neural network model of the present application; referring to Fig. 3, the present application provides a key information extraction system 300 based on finely annotated text, including a pre-training unit 310, a key information acquisition unit 320 and a key information output unit 330.
  • the pre-training unit 310 is configured to pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix text data;
  • the key information obtaining unit 320 Used to input the matrix text data into a key information extraction model, the key information extraction model uses CMRC data set for training, and obtains key information according to the matrix text data;
  • the key information output unit 330 is used for Sort the obtained key information according to a preset sorting rule, and output the key information that meets the set selection rule.
  • The key information acquisition unit 320 includes a multi-group keyword acquisition module 321, a preliminary key information acquisition module 322 and a key information acquisition module 323; the multi-group keyword acquisition module 321 is used to obtain multiple groups of keywords through the fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point;
  • the preliminary key information acquisition module 322 is used to pass the multiple sets of keywords through the sigmoid layer of the key information extraction model Output preliminary key information;
  • the key information acquisition module 323 is configured to optimize the output preliminary key information using the cross-entropy optimization layer of the key information extraction model to obtain key information.
  • The key information acquisition unit 320 also includes a key information extraction model testing module, which is used to test the key information extraction model on a test set and select a key information extraction model whose em value is greater than the set threshold as the trained key information extraction model; where em = n'/n, n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model.
  • the BERT pre-training model in the pre-training unit 310 performs a one-dimensional vector representation of the characters of the text data, and forms the two-dimensional vector matrix text data according to the character arrangement sequence.
  • In a specific embodiment, the key information extraction system further includes a text data cleaning unit, which is used to segment the text data, remove too-short corpus and remove erroneous corpus.
  • Specifically, segmenting the text data includes splitting and recombining the input text by paragraph, with the length of each recombined segment less than or equal to the standardized length.
  • In summary, the key information extraction system based on finely annotated text of this application, through the BERT pre-training model and the key information extraction model, takes text as input and outputs the corresponding spans (key information) of the text, completely changing the previous mode of inputting text plus a question and outputting an answer.
  • This application provides a method for extracting key information based on fine-labeled text, which is applied to an electronic device 4.
  • Fig. 4 shows the application environment of the preferred embodiment of the method for extracting key information based on fine-labeled text according to the present application.
  • the electronic device 4 may be a terminal device with arithmetic function, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like.
  • the electronic device 4 includes a processor 42, a memory 41, a communication bus 43 and a network interface 44.
  • the memory 41 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory 41, and the like.
  • the readable storage medium may be an internal storage unit of the electronic device 4, such as a hard disk of the electronic device 4.
  • In other embodiments, the readable storage medium may also be an external memory 41 of the electronic device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 4.
  • the readable storage medium of the memory 41 is generally used to store the key information extraction program 40 based on the fine-labeled text installed in the electronic device 4 and the like.
  • the memory 41 can also be used to temporarily store data that has been output or will be output.
  • The processor 42 may in some embodiments be a central processing unit (CPU), a microprocessor or another data processing chip, used to run the program code stored in the memory 41 or process data, for example, to execute the key information extraction program 40 based on finely annotated text.
  • the communication bus 43 is used to realize connection and communication between these components.
  • the network interface 44 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 4 and other electronic devices.
  • FIG. 4 only shows the electronic device 4 with the components 41-44, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • Optionally, the electronic device 4 may also include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as speakers or earphones;
  • optionally, the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 4 may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display is used for displaying information processed in the electronic device 4 and for displaying a visualized user interface.
  • Optionally, the electronic device 4 may also include a radio frequency (RF) circuit, sensors, an audio circuit, etc., which will not be repeated here.
  • In the device embodiment shown in FIG. 4, the memory 41, as a computer storage medium, may include an operating system and the key information extraction program 40 based on finely annotated text; when the processor 42 executes the key information extraction program 40 based on finely annotated text stored in the memory 41,
  • the following steps are implemented: S110, pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix-form text data; S120, input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sort the obtained key information according to the preset sorting rule, and output the key information that meets the set selection rule.
  • the key information extraction program 40 based on finely labeled text can also be divided into one or more modules, and the one or more modules are stored in the memory 41 and executed by the processor 42 to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
  • the key information extraction program 40 based on finely labeled text may include a pre-training unit 310, a key information obtaining unit 320, and a key information output unit 330.
  • The computer-readable storage medium may be non-volatile or volatile, and mainly includes a storage data area and a storage program area, where the storage data area can store data created according to the use of blockchain nodes, and the storage program area can store the operating system and at least one application program required by a function; the computer-readable storage medium includes a key information extraction program based on finely annotated text,
  • and when the key information extraction program based on finely annotated text is executed by a processor, the following operations are implemented: S110, pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix-form text data; S120, input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sort the obtained key information according to the preset sorting rule, and output the key information that meets the set selection rule.
  • the specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the key information extraction method and the electronic device based on the fine-labeled text, and will not be repeated here.
  • In general, the key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text of this application are an improvement taking the reading comprehension model as the basic idea,
  • converting key information previously annotated at the word, sentence or paragraph level into the annotation of a continuous span; the problem of automatically annotating text spans is solved; annotation cost is greatly reduced, achieving the technical effect of providing strong support for downstream tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A key information extraction method, device and storage medium based on finely annotated text, the method comprising: S110, pre-training text data through a BERT pre-training model to obtain word vectors, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets a set selection rule. The method solves the problem of automatically annotating text spans, greatly reduces annotation cost, and achieves the technical effect of providing strong support for downstream tasks.

Description

Key information extraction method, device and storage medium based on finely annotated text
This application claims priority to the Chinese invention patent application filed with the Chinese Patent Office on April 10, 2020, with application number 202010280586.7 and the invention title "Key information extraction method, system, device and storage medium based on finely annotated text", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a key information extraction method, system, device and storage medium based on finely annotated text.
Background
Machine reading comprehension means having a machine answer content-related questions by reading text. At present, applications in which the question to be answered and the related reading material are input into a trained reading comprehension model for AI-style reading comprehension are becoming increasingly widespread.
The applicant has realized that existing reading comprehension models are conditioned on the question, mark the correct answer in the text, and take pre-annotated key sentences/paragraphs as the model input; however, the annotation of key fragments cannot cover many domains and is therefore one-sided, and outsourcing the manual annotation of key sentences/paragraphs would greatly increase time and monetary costs.
To automatically annotate the spans of a long text that can serve as answers, the common industry solution is to annotate key fragments with unsupervised or supervised methods, but the following drawbacks remain:
1) Unsupervised annotation of key fragments can only mark words and cannot mark spans; 2) supervised annotation of key fragments also extracts content at the word level and cannot mark spans.
Therefore, a key information extraction method that can annotate spans is urgently needed.
Summary
This application provides a key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text, which mainly solve the problem of automatically annotating text spans through a BERT pre-training model and a key information extraction model.
To achieve the above objective, this application provides a key information extraction method based on finely annotated text, applied to an electronic device, the method comprising: S110, pre-training text data through a BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets a set selection rule.
To achieve the above objective, this application provides a key information extraction system based on finely annotated text, including a pre-training unit, a key information acquisition unit and a key information output unit; the pre-training unit is used to pre-train text data through a BERT pre-training model to obtain word vectors of the text data and combine the obtained word vectors into matrix-form text data; the key information acquisition unit is used to input the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; the key information output unit is used to sort the obtained key information according to a preset sorting rule and output the key information that meets a set selection rule.
To achieve the above objective, this application provides an electronic device, which includes a memory and a processor, the memory storing a key information extraction program based on finely annotated text, and when the key information extraction program based on finely annotated text is executed by the processor, the following steps are implemented: S110, pre-training text data through a BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets a set selection rule.
In addition, to achieve the above objective, this application also provides a computer-readable storage medium storing a computer program, the computer program including a key information extraction program based on finely annotated text; when the key information extraction program based on finely annotated text is executed by a processor, the steps of the above key information extraction method based on finely annotated text are implemented.
The key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text proposed in this application change the input of the reading comprehension model to a long text plus an empty string (that is, the question is replaced with an empty string) and train the reading comprehension model to learn the characteristics of the standard answers, so that the corresponding span of the text is output as the answer, completely changing the previous mode of inputting text plus a question and outputting an answer. The beneficial effects are as follows: 1) it is an improvement taking the reading comprehension model as its basic idea, converting key information previously annotated at the word, sentence or paragraph level into the annotation of a continuous span; 2) it solves the problem of automatically annotating text spans; 3) it greatly reduces annotation cost and provides strong support for downstream tasks.
Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the key information extraction method based on finely annotated text of this application;
FIG. 2 is a flowchart of a preferred embodiment of the method by which the key information extraction model of this application obtains key information;
FIG. 3 is a schematic structural diagram of a preferred embodiment of the key information extraction system based on finely annotated text of this application;
FIG. 4 is a schematic structural diagram of a preferred embodiment of the electronic device of this application;
The realization of the objectives, functional features and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
A reading comprehension model is conditioned on the question (that is, the input is a long text plus a question) and marks the correct answer in the text (the output is the span of the text that serves as the answer); existing reading comprehension models take pre-annotated key sentences/paragraphs as the model input, and manually annotating key fragments has the drawback of considerable one-sidedness.
This application uses the BERT (Bidirectional Encoder Representation from Transformer, a bidirectional attention neural network model) pre-training model to pre-train the text data, which is then fed into the key information extraction model, so that the key information in the text data is output as the response.
Specifically, unlike the traditional reading comprehension model's mode of inputting text plus a question and outputting an answer, the key information extraction model of this application takes text as input and outputs key information; the key information here serves as answer candidates, that is, the key information is to some extent a subset of the answers output by reading comprehension. It should be noted that, because the key information extraction method based on finely annotated text of this application is unsupervised and does not require a question as input, the output key information covers a wider range than answers.
This application provides a key information extraction method based on finely annotated text. FIG. 1 shows the flow of a preferred embodiment of the key information extraction method based on finely annotated text of this application. Referring to FIG. 1, the method may be executed by an apparatus, and the apparatus may be implemented by software and/or hardware.
Take "The champion of the 2018 dota2 World Invitational is the OG team" as an example. With a traditional reading comprehension model, the input is: text, "The champion of the 2018 dota2 World Invitational is the OG team", plus question, "Who was the champion in 2018?"; the output is: answer, "the OG team".
With the key information extraction model of this application, the input is: "The champion of the 2018 dota2 World Invitational is the OG team"; the key information output is: "2018, the dota2 World Invitational, the OG team".
In this embodiment, the key information extraction method based on finely annotated text includes steps S110 to S130.
S110: Pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix-form text data.
Specifically, BERT (Bidirectional Encoder Representation from Transformer, a bidirectional attention neural network model) is a sentence-level language model. Unlike the ELMo model, which needs per-layer weights for global pooling when connecting to specific downstream NLP tasks, BERT can directly obtain a unique vector representation of an entire sentence. It adds a special token [CLS] in front of each input and lets the Transformer deeply encode [CLS]. Because the Transformer can encode global information into every position regardless of space and distance, and the highest hidden layer of [CLS] is connected directly to the softmax output layer as the representation of the sentence/sentence pair, [CLS] acts as a "checkpoint" on the gradient back-propagation path and can learn the upper-layer features of the entire input. The BERT model can therefore further increase the generalization ability of the word vector model and fully describe character-level, word-level, sentence-level and even inter-sentence relationship features.
It should be noted that the process by which the BERT pre-training model obtains word vectors is to perform word segmentation first and then pre-train the segmented documents to generate the word vectors. That is, the low-dimensional vector representation of all characters is obtained first, and the low-dimensional vectors are then combined into a two-dimensional vector to obtain the matrix representation of a sentence.
The sentence "15岁以下的学生" (students under the age of 15) is used below as an example for detailed explanation.
First, each of the characters "1", "5", "岁", "以", "下", "的", "学", "生" can be represented by a vector, and these vectors are combined into a two-dimensional vector to obtain the matrix representation of the sentence. Specifically, the BERT model can generate the d-dimensional word vectors corresponding to the above 8 characters, and the eight vectors are concatenated into an 8*d matrix that uniquely represents the above text, i.e., the matrix-form text data.
In general, the BERT pre-training model represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form the two-dimensional matrix-form text data.
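As an illustration of step S110, the sketch below obtains per-character vectors from BERT and stacks them in character order into the matrix-form text data. It assumes the Hugging Face transformers library and the public bert-base-chinese checkpoint; this application does not name a specific library or checkpoint, so any BERT pre-training model would serve.

```python
# A minimal sketch of step S110 (assumption: Hugging Face `transformers`
# and the public `bert-base-chinese` checkpoint, neither named in the text).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "15岁以下的学生"
inputs = tokenizer(text, return_tensors="pt")  # adds the [CLS]/[SEP] boundary tokens

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, p_length, d): one d-dimensional vector per
# character, stacked in character order -- the "matrix-form text data".
matrix_text_data = outputs.last_hidden_state.squeeze(0)
print(matrix_text_data.shape)  # torch.Size([10, 768]): 8 characters + [CLS] + [SEP]
```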
In a specific embodiment, before step S110, a step of preprocessing the text data is further included, and the preprocessing includes a cleaning process. Specifically, the cleaning here refers to the preprocessing of the vertical-domain data corpus. As mentioned above, the BERT pre-training model is applied to vertical-domain data, and such data (law, medicine, news, etc.) is not tidy enough, so it needs to be processed to fit the model's input (the cleaning is applied to the test data, not the training data).
Cleaning includes segmentation, removing too-short corpus and removing erroneous corpus. Segmentation: as mentioned above, the BERT pre-training model takes a piece of text as input and outputs its keywords; the maximum length of the input text must be limited so that all input text is standardized to that length (this parameter is set to 512; if a segment has fewer than 512 tokens, it is padded with blanks, so that all inputs are standardized to the same length). Clearly, the vast majority of these vertical-domain documents exceed 512 in length, so a document is split and recombined by paragraph, ensuring that the length of each segment stays within the prescribed limit while preserving the semantic coherence of the context as much as possible. Removing too-short corpus: for various possible reasons, a very small portion of the data may be empty or extremely short; such data does not help downstream work, so it is filtered out directly at this step.
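A minimal sketch of this cleaning step is given below, using the parameters stated in the text (maximum length 512, blank padding, dropping near-empty corpus); the paragraph-splitting heuristic and the minimum-length threshold are illustrative assumptions.

```python
# A minimal sketch of the cleaning step; MIN_LEN and the splitting heuristic
# are assumptions, while the 512 limit and blank padding follow the text.
MAX_LEN = 512
MIN_LEN = 10  # hypothetical threshold for "too short" corpus

def clean_corpus(documents):
    segments = []
    for doc in documents:
        if len(doc.strip()) < MIN_LEN:        # drop empty / very short data
            continue
        buf = ""
        for para in doc.split("\n"):          # split and recombine by paragraph
            if len(buf) + len(para) <= MAX_LEN:
                buf += para
            else:
                if buf:
                    segments.append(buf)
                # truncating a single over-long paragraph is an assumption
                buf = para[:MAX_LEN]
        if buf:
            segments.append(buf)
    # pad every segment with blanks so all inputs share one standardized length
    return [seg.ljust(MAX_LEN) for seg in segments]
```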
S120: Input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data.
The sample set held out during model training is used to tune the model's hyperparameters and evaluate the model's ability; it is used to assess the performance of the final model and to help compare multiple candidate models and make a choice. Evaluating the model's ability on held-out samples yields a smaller bias. The training samples are divided into a training set and a validation set; the model is fitted on the training set, the fitted model is used to predict the data samples retained in the validation set, and the model's validation error is computed quantitatively. MSE is usually used to evaluate the error rate, and the resulting validation-set error rate serves as an estimate of the test error rate.
In a specific embodiment, a test set is used to test the trained key information extraction model to obtain the em value; that is, the key information extraction model is tested on the test set, and a key information extraction model whose em value is greater than the set threshold is selected as the trained key information extraction model;
where em = n'/n;
n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model. That is, if a text has n standard answers, and after this text is input the model's topK answers form a set K, and n' of the n answers appear in K, then em = n'/n.
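The em test can be expressed in a few lines; the sketch below follows the definition above directly (the model's topK outputs form a set K, and em = n'/n).

```python
# A minimal sketch of the em metric: em = n'/n, where n' counts the standard
# answers that appear among the model's topK outputs.
def em_score(standard_answers, topk_predictions):
    k_set = set(topk_predictions)
    n = len(standard_answers)
    n_prime = sum(1 for answer in standard_answers if answer in k_set)
    return n_prime / n

# e.g. both standard answers found among the topK outputs -> em = 1.0
print(em_score(["2018年", "OG战队"], ["2018年", "dota2世界邀请赛", "OG战队"]))
```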
It should be noted that the CMRC data set is used to train the key information extraction model. Each document in the CMRC data set includes multiple question-answer pairs; the questions in the question-answer pairs are removed, leaving only the documents and the answers, and this document-answer-only CMRC is used as the data set to train the above key information extraction model. That is, the question-stripped CMRC data set serves as the training set of the key information extraction model.
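A sketch of this data preparation is shown below; it assumes the public CMRC 2018 release in its SQuAD-style JSON layout (documents carrying lists of question-answer pairs), which is not spelled out in this application. Only the document and the answer spans are kept; the questions are discarded.

```python
# A minimal sketch of building the document-answer training set from CMRC
# (assumption: the SQuAD-style JSON layout of the public CMRC 2018 release).
import json

def build_document_answer_pairs(cmrc_json_path):
    with open(cmrc_json_path, encoding="utf-8") as f:
        data = json.load(f)
    pairs = []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # qa["question"] is deliberately dropped; only the document
                # and the answer span are kept.
                for answer in qa["answers"]:
                    pairs.append((context, answer["answer_start"], answer["text"]))
    return pairs
```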
FIG. 2 shows a flowchart of a preferred embodiment of the method by which the key information extraction model of this application obtains key information; the key information extraction model includes a fully connected layer, a sigmoid layer and a cross-entropy optimization layer. Referring to FIG. 2, the method by which the key information extraction model obtains key information includes steps S210 to S230:
S210: Obtain multiple groups of keywords through the fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point.
The fully connected layer includes a starting-point fully connected network and an end-point fully connected network; the starting-point fully connected network is used to convert the matrix-form text data into a starting-point vector, and the end-point fully connected network is used to convert the matrix-form text data into an end-point vector.
Specifically, the matrix of the generated text is passed through a fully connected layer that represents the starting point (i.e., the starting-point fully connected network) to obtain a vector of length l, named start_logits; in the same way, it is passed through a fully connected layer that represents the end point (i.e., the end-point fully connected network) to obtain another vector named end_logits; that is, the starting point and end point of a keyword are predicted through the two fully connected networks, the starting-point fully connected network and the end-point fully connected network.
After passing through the two fully connected layers, the matrix-form text data (d*length) is transformed into two (1*length) one-dimensional vectors. That is, in one vector, each character of the sentence corresponds to a value representing the possibility that it can serve as a starting point; in the other vector, each character corresponds to a value representing the possibility that it can serve as an end point.
In a specific embodiment, if the key information extraction model is denoted LM_Model,
then: X = LM_Model(P);
the input P is the original text, and before the text is input, tokens are added before and after the text as markers;
P = [<CLS>, passage, <SEP>]
It should be further explained that token marking means that, after word segmentation, <CLS> is added in front and <SEP> is added at the end; they can be regarded as markers for the beginning and end of the text.
Continuing with "15岁以下的学生" as the example:
P = [<CLS>, passage, <SEP>] denotes ["<CLS>", "1", "5", "岁", "以", "下", "的", "学", "生", "。", "<SEP>"], where passage refers to the whole text.
Through the above formula, the output X can be regarded as a matrix of length p_length and dimension d_im;
if the starting point and end point of a keyword are predicted through the two fully connected networks producing start_logits and end_logits, this is expressed as:
startLogits = FC_start(X)
endLogits = FC_end(X)
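A minimal PyTorch sketch of these two prediction heads is given below, matching the formulas above (startLogits = FC_start(X), endLogits = FC_end(X)); the class and parameter names are illustrative, not taken from this application.

```python
# A minimal sketch of the starting-point / end-point fully connected networks.
import torch
import torch.nn as nn

class SpanHeads(nn.Module):
    def __init__(self, d_im: int):
        super().__init__()
        self.fc_start = nn.Linear(d_im, 1)  # the starting-point fully connected network
        self.fc_end = nn.Linear(d_im, 1)    # the end-point fully connected network

    def forward(self, x: torch.Tensor):
        # x: (batch, p_length, d_im) -> two (batch, p_length) vectors; each
        # character gets one value for "possibility of being a starting point"
        # and one for "possibility of being an end point".
        start_logits = self.fc_start(x).squeeze(-1)
        end_logits = self.fc_end(x).squeeze(-1)
        return start_logits, end_logits
```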
S220: Pass the multiple groups of keywords through the sigmoid layer of the key information extraction model to output preliminary key information.
Regarding the starting and end points of keywords, the two vectors s_logits and e_logits are obtained through step S210. It should be noted that step S210 yields the start value and end value of each group of keywords in the text data (that is, the possibilities of serving as the starting point and end point), e.g. s and e; the first character and the last character are thus confirmed, which determines the result text. To control the length of the result text, the sum of the first character's start value and the last character's end value, denoted C, is computed for all combinations within a certain length range (max_answer_length = 64), and the keywords are then ranked by their C scores.
In a specific embodiment, for a fragment c_i appearing in the text, with starting point s_i and end point e_i, the score of c_i is judged to be s_logits[s_i] + e_logits[e_i].
S230: Optimize the output preliminary key information using the cross-entropy optimization layer of the key information extraction model to obtain the key information.
In the specific implementation process, because the key information to be screened is part of the text data, it includes a starting point s and an end point e, with 0 < s < e ≤ l;
a new vector of length l in which position s is set to 1 and the rest to 0 is created as start_position;
another vector of length l in which position e is set to 1 and the rest to 0 is created as end_position; the sparse cross entropy start_loss of start_logits and start_position is computed,
as well as the sparse cross entropy end_loss of end_logits and end_position; the loss is set to the average of the sparse cross entropies start_loss and end_loss, and the preliminary key information is optimized by training. Where:
Loss_start = -y*log(sigmoid(start_logits)) - (1-y)*log(1-sigmoid(start_logits))
Loss_end = -y*log(sigmoid(end_logits)) - (1-y)*log(1-sigmoid(end_logits))
In short, using sigmoid as the activation function and screening the key information with the cross-entropy loss function allows the network parameters of the key information extraction model to learn quickly from errors, so that the network result can be obtained relatively quickly.
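The loss above is exactly binary cross entropy applied after a sigmoid, so it can be sketched as follows; start_position and end_position are the one-hot vectors defined in the text, passed as float tensors.

```python
# A minimal sketch of the loss: sigmoid + cross entropy on the start and end
# positions, averaged. binary_cross_entropy_with_logits applies the sigmoid
# internally and computes -y*log(p) - (1-y)*log(1-p), i.e. Loss_start/Loss_end.
import torch.nn.functional as F

def span_loss(start_logits, end_logits, start_position, end_position):
    loss_start = F.binary_cross_entropy_with_logits(start_logits, start_position)
    loss_end = F.binary_cross_entropy_with_logits(end_logits, end_position)
    return (loss_start + loss_end) / 2
```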
The key information extraction flow of the key information extraction model is explained below with formulas through a specific embodiment:
Suppose d is the word embedding dimension, l is the maximum length of the text, s is the starting point of the key information, and e is the end point of the key information.
M_c is the representation of the text and is a matrix of size d*l; V_s and V_e are the 1*d vectors of the two fully connected layers.
s_logits, s_position, e_logits and e_position are each vectors of length l, where:
s_position[i] = 1 if i = s, otherwise 0 (i = 0, 1, …, l-1);
e_position[i] = 1 if i = e, otherwise 0 (i = 0, 1, …, l-1);
s_logits = (V_s * M_c)^T; e_logits = (V_e * M_c)^T
loss = (loss_s + loss_e)/2
= (H(s_logits, s_position) + H(e_logits, e_position))/2
where H(p, q) is the cross entropy of the two vectors p and q; that is,
when p is s_logits, q is s_position;
when p is e_logits, q is e_position.
S130: Sort the obtained key information according to the preset sorting rule, and output the key information that meets the set selection rule.
The preceding network has shown that the start value and end value of every keyword in the text (that is, the possibilities of serving as the starting point and end point) have been obtained. Clearly, confirming the first character and the last character determines the result text. To control the length of the result, the sum of the first character's start value and the last character's end value is computed for all combinations within a certain length range (max_answer_length = 64).
The sums of the first character's start value and the last character's end value of the obtained keywords are then sorted according to the preset sorting rule; it should be noted that, in the specific implementation, the preset sorting rule arranges these sums in descending order from largest to smallest. In other words, because the value of k is relatively small, the values corresponding to the topK results are traversed and then sorted.
In a specific embodiment, for a fragment c appearing in the text, with starting point s and end point e, the score of c is judged to be s_logits[s] + e_logits[e]; in subsequent steps, the maximum length of c is controlled to increase the possibilities of different start-end pairs.
After the sums of the first character's start value and the last character's end value of the obtained keywords are arranged in descending order, the key information that meets the set selection rule is output. It should be noted that, in the specific implementation, the selection rule adopts the topK method; that is, after the descending sort, the top K keywords are selected as the final keyword answers.
In a specific embodiment, k is set to 10; the top10 of start_logits and the top10 of end_logits are selected and then cross-added, yielding about 100 groups of start+end values. Finally, these are sorted from largest to smallest and the top20 are chosen as the final keyword answers.
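The selection rule of this embodiment can be sketched as below: take the top-10 start values and top-10 end values, cross-add them into roughly 100 start+end sums, keep only valid spans within the maximum answer length, and output the top-20 by score.

```python
# A minimal sketch of the topK selection rule (k = 10, final top20).
import numpy as np

def topk_spans(start_logits, end_logits, k=10, final_k=20, max_answer_length=64):
    top_starts = np.argsort(start_logits)[::-1][:k]   # top-10 start values
    top_ends = np.argsort(end_logits)[::-1][:k]       # top-10 end values
    candidates = []
    for s in top_starts:
        for e in top_ends:                            # cross-add: ~100 pairs
            if s <= e < s + max_answer_length:        # control the span length
                score = start_logits[s] + end_logits[e]
                candidates.append((score, int(s), int(e)))
    candidates.sort(reverse=True)                     # descending by score
    return candidates[:final_k]                       # the top20 key information
```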
In general, the training target of the key information extraction model of this application is indeed the answers of the CMRC data set, but the "question" information in the CMRC data set is not used, and the final output is not a unique answer but the top20 key information; that is, without questions, all candidate answers with answer potential are output as key information.
FIG. 3 shows the structure of a preferred embodiment of the neural network model of this application. Referring to FIG. 3, this application provides a key information extraction system 300 based on finely annotated text, including a pre-training unit 310, a key information acquisition unit 320 and a key information output unit 330.
The pre-training unit 310 is used to pre-train text data through the BERT pre-training model to obtain word vectors of the text data and combine the obtained word vectors into matrix-form text data; the key information acquisition unit 320 is used to input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; the key information output unit 330 is used to sort the obtained key information according to the preset sorting rule and output the key information that meets the set selection rule. The key information acquisition unit 320 includes a multi-group keyword acquisition module 321, a preliminary key information acquisition module 322 and a key information acquisition module 323; the multi-group keyword acquisition module 321 is used to obtain multiple groups of keywords through the fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point; the preliminary key information acquisition module 322 is used to pass the multiple groups of keywords through the sigmoid layer of the key information extraction model to output preliminary key information; the key information acquisition module 323 is used to optimize the output preliminary key information using the cross-entropy optimization layer of the key information extraction model to obtain the key information.
The key information acquisition unit 320 also includes a key information extraction model testing module, which is used to test the key information extraction model on a test set and select a key information extraction model whose em value is greater than the set threshold as the trained key information extraction model; where em = n'/n, n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model.
The BERT pre-training model in the pre-training unit 310 represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form the two-dimensional matrix-form text data.
In a specific embodiment, the key information extraction system further includes a text data cleaning unit, which is used to segment the text data, remove too-short corpus and remove erroneous corpus.
Specifically, segmenting the text data includes splitting and recombining the input text by paragraph, with the length of each recombined segment less than or equal to the standardized length.
In summary, the key information extraction system based on finely annotated text of this application, through the BERT pre-training model and the key information extraction model, takes text as input and outputs the corresponding spans (key information) of the text, completely changing the previous mode of inputting text plus a question and outputting an answer.
This application provides a key information extraction method based on finely annotated text, applied to an electronic device 4.
FIG. 4 shows the application environment of a preferred embodiment of the key information extraction method based on finely annotated text of this application.
Referring to FIG. 4, in this embodiment, the electronic device 4 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer or a desktop computer.
The electronic device 4 includes a processor 42, a memory 41, a communication bus 43 and a network interface 44.
The memory 41 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card or a card-type memory 41. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 4, such as a hard disk of the electronic device 4. In other embodiments, the readable storage medium may also be an external memory 41 of the electronic device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 4.
In this embodiment, the readable storage medium of the memory 41 is generally used to store the key information extraction program 40 based on finely annotated text installed on the electronic device 4, and the like. The memory 41 can also be used to temporarily store data that has been output or will be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a microprocessor or another data processing chip, used to run the program code stored in the memory 41 or process data, for example, to execute the key information extraction program 40 based on finely annotated text.
The communication bus 43 is used to realize connection and communication between these components.
The network interface 44 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 4 and other electronic devices.
FIG. 4 only shows the electronic device 4 with components 41-44, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
Optionally, the electronic device 4 may also include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as speakers or earphones; optionally, the user interface may also include a standard wired interface and a wireless interface.
Optionally, the electronic device 4 may also include a display, which may also be called a display screen or display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display is used to display the information processed in the electronic device 4 and to display a visualized user interface.
Optionally, the electronic device 4 may also include a radio frequency (RF) circuit, sensors, an audio circuit, etc., which will not be repeated here.
In the device embodiment shown in FIG. 4, the memory 41, as a computer storage medium, may include an operating system and the key information extraction program 40 based on finely annotated text; when the processor 42 executes the key information extraction program 40 based on finely annotated text stored in the memory 41, the following steps are implemented: S110, pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix-form text data; S120, input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sort the obtained key information according to the preset sorting rule, and output the key information that meets the set selection rule.
In other embodiments, the key information extraction program 40 based on finely annotated text may also be divided into one or more modules, which are stored in the memory 41 and executed by the processor 42 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function. The key information extraction program 40 based on finely annotated text may include the pre-training unit 310, the key information acquisition unit 320 and the key information output unit 330.
In addition, this application also provides a computer-readable storage medium, which may be non-volatile or volatile and mainly includes a storage data area and a storage program area, where the storage data area can store data created according to the use of blockchain nodes, and the storage program area can store the operating system and at least one application program required by a function. The computer-readable storage medium includes a key information extraction program based on finely annotated text, and when the key information extraction program based on finely annotated text is executed by a processor, the following operations are implemented: S110, pre-train the text data through the BERT pre-training model to obtain word vectors of the text data, and combine the obtained word vectors into matrix-form text data; S120, input the matrix-form text data into the key information extraction model, where the key information extraction model is trained on the CMRC data set and obtains key information from the matrix-form text data; S130, sort the obtained key information according to the preset sorting rule, and output the key information that meets the set selection rule.
The specific implementation of the computer-readable storage medium of this application is substantially the same as the specific implementations of the above key information extraction method based on finely annotated text and the electronic device, and will not be repeated here.
In general, the key information extraction method, system, electronic device and computer-readable storage medium based on finely annotated text of this application are an improvement taking the reading comprehension model as the basic idea: key information previously annotated at the word, sentence or paragraph level is converted into the annotation of a continuous span; the problem of automatically annotating text spans is solved; annotation cost is greatly reduced, achieving the technical effect of providing strong support for downstream tasks.
It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article or method that includes the element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments. Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A key information extraction method based on finely annotated text, applied to an electronic device, wherein the method comprises: S110, pre-training text data through a BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets a set selection rule.
  2. The key information extraction method based on finely annotated text according to claim 1, wherein in step S120, the method by which the key information extraction model obtains key information from the matrix-form text data comprises: S210, obtaining multiple groups of keywords through a fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point; S220, passing the multiple groups of keywords through a sigmoid layer of the key information extraction model to output preliminary key information; S230, optimizing the output preliminary key information using a cross-entropy optimization layer of the key information extraction model to obtain the key information.
  3. The key information extraction method based on finely annotated text according to claim 2, wherein the fully connected layer comprises a starting-point fully connected network and an end-point fully connected network; the starting-point fully connected network is used to convert the matrix-form text data into a starting-point vector, and the end-point fully connected network is used to convert the matrix-form text data into an end-point vector.
  4. The key information extraction method based on finely annotated text according to claim 1, wherein the key information extraction model is tested on a test set, and a key information extraction model whose em value is greater than a set threshold is selected as the trained key information extraction model; where em = n'/n, n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model.
  5. The key information extraction method based on finely annotated text according to claim 1, wherein the BERT pre-training model represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form two-dimensional matrix-form text data.
  6. The key information extraction method based on finely annotated text according to claim 1, wherein before pre-training the text data through the BERT pre-training model to obtain word vectors, the method further comprises a step of cleaning the text data, the cleaning step including segmenting the text data, removing too-short corpus and removing erroneous corpus.
  7. The key information extraction method based on finely annotated text according to claim 6, wherein segmenting the text data includes splitting and recombining the input text by paragraph, with the length of each recombined segment less than or equal to the standardized length.
  8. A key information extraction system based on finely annotated text, comprising a pre-training unit, a key information acquisition unit and a key information output unit; the pre-training unit is used to pre-train text data through a BERT pre-training model to obtain word vectors of the text data and combine the obtained word vectors into matrix-form text data; the key information acquisition unit is used to input the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; the key information output unit is used to sort the obtained key information according to a preset sorting rule and output the key information that meets a set selection rule.
  9. The key information extraction system based on finely annotated text according to claim 8, wherein the key information acquisition unit comprises a multi-group keyword acquisition module, a preliminary key information acquisition module and a key information acquisition module; the multi-group keyword acquisition module is used to obtain multiple groups of keywords through a fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point; the preliminary key information acquisition module is used to pass the multiple groups of keywords through a sigmoid layer of the key information extraction model to output preliminary key information; the key information acquisition module is used to optimize the output preliminary key information using a cross-entropy optimization layer of the key information extraction model to obtain the key information.
  10. The key information extraction system based on finely annotated text according to claim 8, wherein the fully connected layer comprises a starting-point fully connected network and an end-point fully connected network; the starting-point fully connected network is used to convert the matrix-form text data into a starting-point vector, and the end-point fully connected network is used to convert the matrix-form text data into an end-point vector.
  11. The key information extraction system based on finely annotated text according to claim 8, wherein the key information acquisition unit further comprises a key information extraction model testing module, which is used to test the key information extraction model on a test set and select a key information extraction model whose em value is greater than a set threshold as the trained key information extraction model;
    where em = n'/n;
    n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model.
  12. The key information extraction system based on finely annotated text according to claim 8, wherein the BERT pre-training model in the pre-training unit represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form two-dimensional matrix-form text data.
  13. The key information extraction system based on finely annotated text according to claim 8, wherein the key information extraction system further comprises a text data cleaning unit, which is used to segment the text data, remove too-short corpus and remove erroneous corpus.
  14. The key information extraction system based on finely annotated text according to claim 13, wherein segmenting the text data includes splitting and recombining the input text by paragraph, with the length of each recombined segment less than or equal to the standardized length.
  15. An electronic device, wherein the electronic device comprises a memory and a processor, the memory storing a key information extraction program based on finely annotated text, and when the key information extraction program based on finely annotated text is executed by the processor, the following steps are implemented: S110, pre-training text data through a BERT pre-training model to obtain word vectors of the text data, and combining the obtained word vectors into matrix-form text data; S120, inputting the matrix-form text data into a key information extraction model, the key information extraction model being trained on the CMRC data set and obtaining key information from the matrix-form text data; S130, sorting the obtained key information according to a preset sorting rule, and outputting the key information that meets a set selection rule.
  16. The electronic device according to claim 15, wherein in step S120, the method by which the key information extraction model obtains key information from the matrix-form text data comprises: S210, obtaining multiple groups of keywords through a fully connected layer of the key information extraction model, each group of keywords including a keyword starting point and a keyword end point; S220, passing the multiple groups of keywords through a sigmoid layer of the key information extraction model to output preliminary key information; S230, optimizing the output preliminary key information using a cross-entropy optimization layer of the key information extraction model to obtain the key information.
  17. The electronic device according to claim 15, wherein the fully connected layer comprises a starting-point fully connected network and an end-point fully connected network; the starting-point fully connected network is used to convert the matrix-form text data into a starting-point vector, and the end-point fully connected network is used to convert the matrix-form text data into an end-point vector.
  18. The electronic device according to claim 15, wherein the key information extraction model is tested on a test set, and a key information extraction model whose em value is greater than a set threshold is selected as the trained key information extraction model; where em = n'/n, n is the number of standard answers, and n' is the number of standard answers contained in the key information obtained by the key information extraction model.
  19. The electronic device according to claim 15, wherein the BERT pre-training model represents the characters of the text data as one-dimensional vectors and arranges the one-dimensional vectors in character order to form two-dimensional matrix-form text data.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program including a key information extraction program based on finely annotated text; when the key information extraction program based on finely annotated text is executed by a processor, the steps of the key information extraction method based on finely annotated text according to any one of claims 1 to 7 are implemented.
PCT/CN2020/103933 2020-04-10 2020-07-24 Key information extraction method, device and storage medium based on finely annotated text WO2021203581A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010280586.7A CN111177326B (zh) 2020-04-10 2020-04-10 Key information extraction method, device and storage medium based on finely annotated text
CN202010280586.7 2020-04-10

Publications (1)

Publication Number Publication Date
WO2021203581A1 true WO2021203581A1 (zh) 2021-10-14

Family

ID=70645903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103933 WO2021203581A1 (zh) 2020-04-10 2020-07-24 Key information extraction method, device and storage medium based on finely annotated text

Country Status (2)

Country Link
CN (1) CN111177326B (zh)
WO (1) WO2021203581A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779182A (zh) * 2021-11-12 2021-12-10 航天宏康智能科技(北京)有限公司 Method and apparatus for extracting events from text data
CN113806548A (zh) * 2021-11-19 2021-12-17 北京北大软件工程股份有限公司 Petition element extraction method and extraction system based on a deep learning model
CN114067256A (zh) * 2021-11-24 2022-02-18 西安交通大学 Human key point detection method and system based on Wi-Fi signals
CN114239566A (zh) * 2021-12-14 2022-03-25 公安部第三研究所 Method, apparatus, processor and computer-readable storage medium for accurate two-step Chinese event detection based on information enhancement
CN114818685A (zh) * 2022-04-21 2022-07-29 平安科技(深圳)有限公司 Keyword extraction method and apparatus, electronic device and storage medium
CN115292469A (zh) * 2022-09-28 2022-11-04 之江实验室 Question answering method combining paragraph search and machine reading comprehension
CN115809665A (zh) * 2022-12-13 2023-03-17 杭州电子科技大学 Unsupervised keyword extraction method based on a bidirectional multi-granularity attention mechanism

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177326B (zh) * 2020-04-10 2020-08-04 深圳壹账通智能科技有限公司 Key information extraction method, device and storage medium based on finely annotated text
CN111753546B (zh) * 2020-06-23 2024-03-26 深圳市华云中盛科技股份有限公司 Document information extraction method and apparatus, computer device and storage medium
CN111723182B (zh) * 2020-07-10 2023-12-08 云南电网有限责任公司曲靖供电局 Key information extraction method and apparatus for vulnerability text
CN112182141A (zh) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Key information extraction method, apparatus, device and readable storage medium
CN114586038A (zh) * 2020-09-28 2022-06-03 京东方科技集团股份有限公司 Method and apparatus, device and medium for event extraction and extraction model training
CN112329477A (zh) * 2020-11-27 2021-02-05 上海浦东发展银行股份有限公司 Information extraction method, apparatus, device and storage medium based on a pre-trained model
CN113361261B (zh) * 2021-05-19 2022-09-09 重庆邮电大学 Method and apparatus for selecting candidate paragraphs of legal cases based on an enhance matrix
CN113505207B (zh) * 2021-07-02 2024-02-20 中科苏州智能计算技术研究院 Machine reading comprehension method and system for financial public-opinion research reports
CN113536735B (zh) * 2021-09-17 2021-12-31 杭州费尔斯通科技有限公司 Keyword-based text annotation method, system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436900A (zh) * 2016-05-26 2017-12-05 北京搜狗科技发展有限公司 Search-engine-based information processing method and apparatus
CN110442691A (zh) * 2019-07-04 2019-11-12 平安科技(深圳)有限公司 Method, apparatus and computer device for machine reading comprehension of Chinese
CN110888966A (zh) * 2018-09-06 2020-03-17 微软技术许可有限责任公司 Natural language question answering
EP3627398A1 (en) * 2018-09-19 2020-03-25 42 Maru Inc. Method, system, and computer program for artificial intelligence answer
CN111177326A (zh) * 2020-04-10 2020-05-19 深圳壹账通智能科技有限公司 Key information extraction method, device and storage medium based on finely annotated text

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389005A (zh) * 2017-08-05 2019-02-26 富泰华工业(深圳)有限公司 Intelligent robot and human-computer interaction method
CN108519890B (zh) * 2018-04-08 2021-07-20 武汉大学 Robust code summary generation method based on a self-attention mechanism
CN108536678B (zh) * 2018-04-12 2023-04-07 腾讯科技(深圳)有限公司 Text key information extraction method and apparatus, computer device and storage medium
CN108664473A (zh) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition method for key information of text, electronic device and readable storage medium
CN109614614B (zh) * 2018-12-03 2021-04-02 焦点科技股份有限公司 Self-attention-based BiLSTM-CRF product name recognition method
CN110263123B (zh) * 2019-06-05 2023-10-31 腾讯科技(深圳)有限公司 Method, apparatus and computer device for predicting abbreviations of institution names
CN110390108B (zh) * 2019-07-29 2023-11-21 中国工商银行股份有限公司 Task-oriented interaction method and system based on deep reinforcement learning
CN110413743B (zh) * 2019-08-09 2022-05-06 安徽科大讯飞医疗信息技术有限公司 Key information extraction method, apparatus, device and storage medium
CN110929094B (zh) * 2019-11-20 2023-05-16 北京香侬慧语科技有限责任公司 Video title processing method and apparatus
CN110968667B (zh) * 2019-11-27 2023-04-18 广西大学 Method for extracting tables from journal literature based on text state features


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779182A (zh) * 2021-11-12 2021-12-10 航天宏康智能科技(北京)有限公司 Method and apparatus for extracting events from text data
CN113806548A (zh) * 2021-11-19 2021-12-17 北京北大软件工程股份有限公司 Petition element extraction method and extraction system based on a deep learning model
CN114067256A (zh) * 2021-11-24 2022-02-18 西安交通大学 Human key point detection method and system based on Wi-Fi signals
CN114067256B (zh) * 2021-11-24 2023-09-12 西安交通大学 Human key point detection method and system based on Wi-Fi signals
CN114239566A (zh) * 2021-12-14 2022-03-25 公安部第三研究所 Method, apparatus, processor and computer-readable storage medium for accurate two-step Chinese event detection based on information enhancement
CN114239566B (zh) * 2021-12-14 2024-04-23 公安部第三研究所 Method, apparatus, processor and computer-readable storage medium for accurate two-step Chinese event detection based on information enhancement
CN114818685A (zh) * 2022-04-21 2022-07-29 平安科技(深圳)有限公司 Keyword extraction method and apparatus, electronic device and storage medium
CN114818685B (zh) * 2022-04-21 2023-06-20 平安科技(深圳)有限公司 Keyword extraction method and apparatus, electronic device and storage medium
CN115292469A (zh) * 2022-09-28 2022-11-04 之江实验室 Question answering method combining paragraph search and machine reading comprehension
CN115809665A (zh) * 2022-12-13 2023-03-17 杭州电子科技大学 Unsupervised keyword extraction method based on a bidirectional multi-granularity attention mechanism
CN115809665B (zh) * 2022-12-13 2023-07-11 杭州电子科技大学 Unsupervised keyword extraction method based on a bidirectional multi-granularity attention mechanism

Also Published As

Publication number Publication date
CN111177326A (zh) 2020-05-19
CN111177326B (zh) 2020-08-04

Similar Documents

Publication Publication Date Title
WO2021203581A1 (zh) Key information extraction method, device and storage medium based on finely annotated text
CN112685565B (zh) Text classification method based on multimodal information fusion, and related equipment
CN111897970B (zh) Knowledge-graph-based text comparison method, apparatus, device and storage medium
CN112784578B (zh) Legal element extraction method and apparatus, and electronic device
CN112101041B (zh) Entity relation extraction method, apparatus, device and medium based on semantic similarity
Abdullah et al. Fake news classification bimodal using convolutional neural network and long short-term memory
CN112818093B (zh) Evidence document retrieval method, system and storage medium based on semantic matching
WO2021135469A1 (zh) Machine-learning-based information extraction method, apparatus, computer device and medium
CN112149421A (zh) Entity recognition method for the software programming field based on BERT embedding
CN113392209B (zh) Artificial-intelligence-based text clustering method, related equipment and storage medium
CN103678684A (zh) Chinese word segmentation method based on navigation information retrieval
CN114491018A (zh) Construction method of a sensitive information detection model, and sensitive information detection method and apparatus
CN113934909A (zh) Financial event extraction method based on a pre-trained language model combined with a deep learning model
CN114357204B (zh) Media information processing method and related equipment
CN111199151A (zh) Data processing method and data processing apparatus
CN113204956B (zh) Multi-model training method, abstract segmentation method, text segmentation method and apparatus
Wei et al. Online education recommendation model based on user behavior data analysis
CN111950265A (zh) Method and apparatus for constructing a domain lexicon
CN115344668A (zh) Multi-domain and multi-disciplinary science and technology policy resource retrieval method and apparatus
CN117009503A (zh) Text classification method and apparatus
CN114722832A (zh) Abstract extraction method, apparatus, device and storage medium
CN114595309A (zh) Training device implementation method and system
CN114021004A (zh) Science similar-question recommendation method, apparatus, device and readable storage medium
CN110442759B (zh) Knowledge retrieval method and system, computer device and readable storage medium
CN114067343A (zh) Data set construction method, model training method and corresponding apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929759

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 250123)

122 Ep: pct application non-entry in european phase

Ref document number: 20929759

Country of ref document: EP

Kind code of ref document: A1