WO2023029354A1 - Text information extraction method and apparatus, and storage medium and computer device - Google Patents

Info

Publication number
WO2023029354A1
WO2023029354A1 PCT/CN2022/071444 CN2022071444W WO2023029354A1 WO 2023029354 A1 WO2023029354 A1 WO 2023029354A1 CN 2022071444 W CN2022071444 W CN 2022071444W WO 2023029354 A1 WO2023029354 A1 WO 2023029354A1
Authority
WO
WIPO (PCT)
Prior art keywords
word vector
initial
determined
text
word
Prior art date
Application number
PCT/CN2022/071444
Other languages
French (fr)
Chinese (zh)
Inventor
谯轶轩
陈浩
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023029354A1 publication Critical patent/WO2023029354A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a text information extraction method and device, a storage medium, and a computer device.
  • With the development of artificial intelligence and related disciplines, text information extraction is becoming increasingly digital, intelligent and semantic, and it plays a growing role in social knowledge management.
  • Existing text information extraction methods include: regular rule methods, which extract text using regular expressions and manually written filtering or matching rules; methods that use a named entity recognition (NER) model configured for a given extraction task; and other mainstream methods that predict individual words in the text.
  • The inventors realized that the regular rule method depends on manually written rules and cannot completely extract text information from complex sentences or from text with incomplete semantics; that NER model recognition is prone to overfitting, so extraction accuracy drops significantly on text containing new corpus information; and that methods which extract words in isolation yield low text information extraction accuracy.
  • The present application provides a text information extraction method and device, a storage medium, and a computer device.
  • a method for extracting text information comprising:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • a text information extraction device comprising:
  • the sentence recognition module is used to perform sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text paragraph to be extracted;
  • the first position prediction module is used to use the pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors representing the start position of text extraction;
  • the second position prediction module is used to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group;
  • the determination module is configured to determine the target text to be extracted according to the multiple to-be-determined start word vectors obtained through prediction, and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • a storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the above text information extraction method is implemented, including:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • a computer device including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor; when the processor executes the computer-readable instructions, the above text information extraction method is implemented, including:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • FIG. 1 shows a schematic flow chart of a method for extracting text information provided by an embodiment of the present application
  • FIG. 2 shows a schematic flow diagram of another text information extraction method provided by the embodiment of the present application
  • FIG. 3 shows a schematic diagram of the text extraction network model architecture in the training phase provided by the embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of a text information extraction device provided by an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of another apparatus for extracting text information provided by an embodiment of the present application.
  • Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • This embodiment provides a method for extracting text information, as shown in FIG. 1. The method is described taking its application to a computer device such as a server as an example. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms, for example an intelligent medical system or a digital medical platform.
  • the above method comprises the following steps:
  • Step S101 Obtain an initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.
  • Word segmentation processing is performed on the text paragraph to be extracted, the segmented text paragraph is divided according to the preset sequence length to obtain one or more initial data sequences containing complete sentences, and word vector conversion processing is performed on the initial data sequences to obtain the initial word vector group.
  • The text paragraph is divided in units of sentences, and any resulting sequence shorter than the preset sequence length is padded.
  • Dividing the segmented text paragraph into sequences of 512 words enhances the ability of the text extraction network model to extract long texts. Further, dividing the text paragraph in units of sentences prevents a complete sentence from being split across different data sequences, which would otherwise reduce the accuracy with which the text extraction network model extracts contextual semantics.
  • A splicing process is performed to obtain a text paragraph to be extracted that contains the question information, and sentence recognition is performed on that paragraph to obtain the initial word vector group, so that the positions of the start word vector and the end word vector in the text paragraph can be further predicted. Because the user's question information is added, the extracted text information is more accurate.
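  • The division in step S101 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the sentence splitter, the whitespace-based `tokenize` placeholder, the `[PAD]` token and the function names are all assumptions, and a real system would use a proper word segmenter and then map tokens to word vectors.

```python
import re

PAD = "[PAD]"  # hypothetical padding token, not named in the source


def split_sentences(paragraph):
    """Split after common sentence-ending punctuation (simplified assumption)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", paragraph)
    return [p for p in parts if p]


def tokenize(sentence):
    """Placeholder word segmentation: whitespace split (an assumption)."""
    return sentence.split()


def build_sequences(paragraph, max_len=512):
    """Pack whole sentences into sequences of at most max_len tokens, then pad."""
    sequences, current = [], []
    for sentence in split_sentences(paragraph):
        tokens = tokenize(sentence)[:max_len]  # truncate an over-long sentence
        # Start a new sequence rather than splitting a sentence across two.
        if current and len(current) + len(tokens) > max_len:
            sequences.append(current)
            current = []
        current.extend(tokens)
    if current:
        sequences.append(current)
    # Pad every sequence that is shorter than the preset length.
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]


seqs = build_sequences("I like tea. You like coffee. We all drink water.", max_len=6)
```

  • With `max_len=6`, the first two sentences fit into one sequence and the third starts a new, padded sequence, so no sentence is split across sequences.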
  • Step S102 using the pre-trained text extraction network model, predict a plurality of to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group.
  • The pre-training module (GPT: Generative Pre-Training) in the pre-trained text extraction network model is used so that each word vector in the initial word vector group learns the semantic information of the other word vectors, yielding a first word vector group containing contextual semantic information. Further, the first position prediction module is used to obtain the start position prediction probability value of each word vector in the first word vector group, and the K to-be-determined start word vectors with the largest start position prediction probability values are determined by traversing the first word vector group.
  • The pre-training model GPT adopts a multi-layer Transformer architecture. Through the self-attention mechanism, each word vector extracts, after multi-layer learning, grammatical, syntactic and other deep semantic information beyond its own features, which establishes the contextual connection between the word vectors in the initial word vector group and thereby improves the accuracy of text information extraction by the text extraction network model.
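  • Selecting the K start candidates with the largest predicted probabilities can be sketched as below. The probability values and the function name `top_k_positions` are illustrative assumptions; in the described method the probabilities would come from the first position prediction module.

```python
def top_k_positions(probs, k):
    """Return the k (index, probability) pairs with the largest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up start position probabilities for a five-word sequence.
start_probs = [0.05, 0.70, 0.10, 0.60, 0.02]
candidates = top_k_positions(start_probs, k=2)
```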
  • Step S103 According to the plurality of to-be-determined start word vectors and the initial word vector group, predict a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.
  • The end position of the word vector is predicted using the predicted start position information.
  • The K to-be-determined start word vectors are each spliced with the initial word vector group to obtain K spliced word vector groups, which are input into the second position prediction module.
  • The second position prediction module is used to obtain the end position prediction probability values for each to-be-determined start word vector in its spliced word vector group, and the N to-be-determined end word vectors with the largest end position prediction probability values are determined for each to-be-determined start word vector by traversing the spliced word vector group.
  • K and N may be set to be equal or unequal according to requirements of actual application scenarios.
  • Step S104 Determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • K*N initial extracted text combinations are obtained, and the to-be-determined extracted text combinations that meet the preset conditions are selected from these K*N combinations. For each to-be-determined extracted text combination, the start position prediction probability value of its to-be-determined start word vector is multiplied by the end position prediction probability value of its to-be-determined end word vector. The start word vector of the combination with the maximum product is taken as the target start word vector, and its end word vector as the target end word vector, thereby obtaining the target extracted text.
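  • Step S104 can be sketched as a search over the K*N start/end pairs. The data layout (a list of start candidates plus a dictionary of end candidates per start), the function name `select_span` and the default threshold of 2 are assumptions made for illustration; the filter keeps pairs whose end position exceeds the start position by more than the threshold.

```python
def select_span(start_candidates, end_candidates, min_gap=2):
    """Pick the (start, end) pair with the largest probability product.

    start_candidates: list of (start_idx, p_start)
    end_candidates:   dict mapping start_idx -> list of (end_idx, p_end)
    Pairs with end_idx - start_idx <= min_gap are discarded.
    """
    best = None
    for start_idx, p_start in start_candidates:
        for end_idx, p_end in end_candidates[start_idx]:
            if end_idx - start_idx <= min_gap:
                continue  # fails the preset condition
            score = p_start * p_end
            if best is None or score > best[2]:
                best = (start_idx, end_idx, score)
    return best


# K = 2 start candidates, N = 2 end candidates each (made-up values).
starts = [(1, 0.7), (3, 0.6)]
ends = {1: [(2, 0.9), (6, 0.5)], 3: [(7, 0.8), (4, 0.9)]}
best_span = select_span(starts, ends)
```

  • Here the pairs (1, 2) and (3, 4) are filtered out, and (3, 7) wins over (1, 6) because 0.6 × 0.8 exceeds 0.7 × 0.5.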
  • In this embodiment, the initial word vector group of the text paragraph to be extracted, obtained by sentence recognition, is input into the pre-trained text extraction network model, and a plurality of to-be-determined start word vectors in the initial word vector group are predicted.
  • According to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted, and the target text to be extracted is determined from the predicted to-be-determined start word vectors and their corresponding to-be-determined end word vectors.
  • Because the end word vectors are predicted from both the to-be-determined start word vectors and the initial word vector group, this embodiment can improve the extraction accuracy of the text extraction network model and extract the target text in the text paragraph more accurately. Further, as a refinement and extension of the above embodiment, and to fully describe its specific implementation, another text information extraction method is provided, as shown in FIG. 2. The method includes:
  • Step S201 training an initial text extraction network model.
  • The constructed initial text extraction network model includes a first position prediction module and a second position prediction module connected in series (the span module), which predict the start position and end position of the target extracted text. A pre-training module GPT is added at the input of the first position prediction module to obtain the contextual semantic information of each word vector. In the model training stage, a correction module is added so that the updates of the model parameters during training are more stable.
  • Step S201 may specifically include: training the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples; when it is monitored that the current loss value of the first loss function in the initial text extraction network model has dropped to a preset percentage of the initial loss value, obtaining the first-stage text extraction network model; and using the first loss function and the second loss function corresponding to the preset correction module to perform secondary training on the first-stage text extraction network model according to the training samples with the position labels ignored, to obtain a trained text extraction network model.
  • the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
  • the specific steps for training the initial text extraction network model constructed include:
  • Use the GPT module in the initial text extraction network model to extract the semantic features of each word in the initial word vector group and obtain the first word vector group containing contextual semantics, expressed as [h1, h2, ... h512].
  • The GPT model adopts a multi-layer Transformer architecture, and each Transformer layer contains a self-attention mechanism, which lets each word in [w1, w2, ...] extract feature information from the other words and use it to update its own vector, capturing the deep relationship between the other words and itself. That is, after multi-layer learning, each word vector in the initial word vector group contains grammatical, syntactic and other deep semantic information from all other positions in the initial word vector group, which yields the first word vector group [h1, h2, ... h512] containing contextual semantics.
  • The first position prediction module and the second position prediction module may be two separate position prediction modules, or a single position prediction module that outputs the position prediction probability values of the start word vector and the end word vector in sequence; the position prediction module is not specifically limited here.
  • The training target is to maximize the product of the position prediction probability values of the target start position and the target end position. If it is monitored that the current loss value of the first loss function L1 of the position prediction module has dropped to 30% of the initial loss value, the target start position and target end position are set to empty, and a multi-task learning framework is used for secondary training to obtain a trained text extraction network model.
  • In the multi-task learning framework, the correction module assists the training of the position prediction module. That is, when the current loss value L_m of the first loss function L1 has dropped to 30% of the initial loss value, the first-stage text extraction network model is obtained. Training of the first-stage model then continues according to the second loss function L2 corresponding to the correction module and the first loss function L1, to obtain the second-stage text extraction network model as the trained text extraction network model. Specifically:
  • the position prediction module is used to calculate the position prediction probability value of the word vector, and its calculation formula is:
  • W is the weight
  • b is the bias value
  • s represents the sigmoid function
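  • One form consistent with the symbols listed above is P = s(W·h + b), where h is the word vector being scored. The sketch below assumes that form; the vector h, the dot-product layout and the function names are assumptions for illustration rather than the formula of the application.

```python
import math


def sigmoid(x):
    """The sigmoid function s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def position_probability(h, W, b):
    """Assumed form P = s(W . h + b) for one word vector h (a list of floats)."""
    return sigmoid(sum(w_i * h_i for w_i, h_i in zip(W, h)) + b)


# A zero input gives probability 0.5; larger weighted sums push P towards 1.
p_neutral = position_probability([0.0, 0.0], [1.0, 2.0], 0.0)
p_high = position_probability([1.0, 1.0], [1.0, 2.0], 0.5)
```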
  • the output results of 1 and 2 are iteratively trained until the training ends, and the trained text extraction network model is obtained.
  • the maximum number of iterations for model training is N rounds, and N is 10000 by default, which can be customized by the user.
  • the loss function L1 is used to calculate the negative logarithm of the target start position and target end position. The formula is as follows:
  • P_start represents the position prediction probability value of the word vector corresponding to the target start position, output by the first position prediction module
  • P_end represents the position prediction probability value of the word vector corresponding to the target end position, output by the second position prediction module
  • M is the size of the preset vocabulary, which is set to 50000 word vectors
  • y_hc indicates that the value at index c of the current word vector h is 1 and all other values are 0, where c ranges over the M indices of the vocabulary
  • p_hc indicates the probability of the current word vector h at index c, that is, the value in the c-th dimension of the vector after the above-mentioned softmax layer processing.
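  • The two loss functions can be sketched as follows. L1 is written as the negative logarithm of the product of the start and end probabilities, which matches the stated goal of maximizing that product. The form of L2 is an assumption: the definitions of y_hc and p_hc suggest a standard cross-entropy over the M vocabulary entries, which is what the sketch uses. Function names are hypothetical.

```python
import math


def loss_l1(p_start, p_end):
    """Negative log of the start/end probability product: -(log P_start + log P_end)."""
    return -(math.log(p_start) + math.log(p_end))


def loss_l2(y_h, p_h):
    """Assumed cross-entropy between a one-hot label y_h and softmax output p_h."""
    return -sum(y * math.log(p) for y, p in zip(y_h, p_h) if y > 0)


l1_value = loss_l1(0.5, 0.5)                        # 2 * ln 2
l2_value = loss_l2([0, 1, 0], [0.2, 0.5, 0.3])      # ln 2
```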
  • Multi-task training is realized by adding a correction module, which can be closer to the actual scene of text extraction.
  • Emptying the vector marks at the text extraction positions during training causes a sudden increase in the model's loss value and makes learning more difficult.
  • The correction module is therefore added to assist the first loss function of the model, making the updates of the model parameters more stable.
  • The trained text extraction network model does not include the correction module; the correction module is only used to further optimize the model parameters in the position prediction module.
  • The stochastic gradient descent (SGD) algorithm is used to iteratively update the network model parameters W and b in the initial text extraction network model to obtain a trained text extraction network model. Specifically, during training, if the difference between the loss values L_m and L_{m+1} obtained from two adjacent training rounds is less than the set value, that is, L_m - L_{m+1} < 0.01, the model is considered to have converged, training ends, and the trained text extraction network model is obtained.
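  • The two-stage schedule and the convergence test can be sketched over a recorded sequence of loss values. This is a simplified illustration under stated assumptions: the function name is hypothetical, the convergence test is applied only after the second stage has started, and the absolute difference between adjacent losses is used so that the jump in loss at the stage switch is not mistaken for convergence.

```python
def training_schedule(losses, switch_ratio=0.30, tol=0.01):
    """Return (round where stage two starts, round where training stops).

    losses: loss value recorded at each training round.
    Stage two starts once the loss has dropped to switch_ratio of the
    initial loss; training stops when two adjacent losses differ by
    less than tol (assumption: checked only during stage two).
    """
    initial = losses[0]
    stage_two_at = None
    for m in range(len(losses) - 1):
        if stage_two_at is None and losses[m] <= switch_ratio * initial:
            stage_two_at = m
        if stage_two_at is not None and abs(losses[m] - losses[m + 1]) < tol:
            return stage_two_at, m + 1
    return stage_two_at, None


# Made-up loss curve: the loss rises after the switch, then settles.
schedule = training_schedule([10.0, 6.0, 2.9, 3.5, 3.1, 3.095])
```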
  • Step S202 perform word segmentation processing on the text paragraph to be extracted, and obtain a text paragraph after word segmentation processing.
  • Step S203 According to the preset sequence length, an initial data sequence including complete sentences is obtained.
  • Step S204 performing word vector conversion processing on the initial data sequence to obtain an initial word vector group.
  • Step S205 according to the initial word vector group, use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information.
  • Step S206 using the first position prediction module in the pre-trained text extraction network model, obtain, according to the start position prediction probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
  • Step S207 for each to-be-determined start word vector, splice the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group.
  • Step S207 may specifically include: splicing the to-be-determined start word vector with each word vector in the initial word vector group to obtain a spliced word vector group.
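  • The splicing in step S207 can be sketched as a per-position concatenation. The order of concatenation (word vector first, start vector second) and the function name `splice` are assumptions; the source only states that the start word vector is spliced with each word vector in the group.

```python
def splice(start_vector, word_vectors):
    """Concatenate the start word vector onto every word vector in the group."""
    return [vec + start_vector for vec in word_vectors]


# Two 2-dimensional word vectors spliced with one 2-dimensional start vector.
spliced_group = splice([9.0, 9.0], [[1.0, 2.0], [3.0, 4.0]])
```

  • Each spliced vector has twice the original dimension, so the second position prediction module sees both the candidate end word and the chosen start word.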
  • Step S208 using the second position prediction module in the pre-trained text extraction network model, obtain, according to the end position prediction probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the end position of text extraction.
  • Step S209 Determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • Step S210 obtaining the to-be-determined extracted text combinations satisfying the preset conditions among the initial extracted text combinations.
  • the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start-position number of the to-be-determined start word vector is greater than a set threshold.
  • The to-be-determined extracted text combinations are selected from the K*N initial extracted text combinations according to the preset condition.
  • The preset condition is that, within a combination, the end position number corresponding to the to-be-determined end word vector is greater than the start position number of the to-be-determined start word vector, and the difference between the end position number and the start position number is greater than a set threshold (for example, 2). The preset condition is not specifically limited here.
  • Step S211 according to the product of the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end position prediction probability value of each of its corresponding to-be-determined end word vectors, determine the target start word vector and its corresponding target end word vector.
  • The start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations is multiplied by the end position prediction probability value of each of its N corresponding to-be-determined end word vectors. By traversing the probability products, the extracted text combination with the largest product is found, which determines the target start word vector and its corresponding target end word vector.
  • Step S212 according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector, to obtain the target extracted text.
  • In this embodiment, the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model, and a plurality of to-be-determined start word vectors in the initial word vector group are predicted. According to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted, and the target text to be extracted is determined from the predicted to-be-determined start word vectors and their corresponding to-be-determined end word vectors.
  • The pre-trained text extraction network model avoids the technical problems that the existing regular rule method depends heavily on manually written rules and cannot completely extract complex or incomplete text information, that NER model recognition is prone to overfitting and extracts text inaccurately when faced with text containing new corpus information, and that other mainstream methods extract words in isolation and therefore have low accuracy. It thereby effectively improves the accuracy of text information extraction.
  • An embodiment of the present application provides a text information extraction device, as shown in FIG. 4. The device includes: a sentence recognition module 32, a first position prediction module 33, a second position prediction module 34, and a determining module 35.
  • the sentence recognition module 32 can be used to obtain the initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.
  • The first position prediction module 33 may be configured to use a pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group.
  • The second position prediction module 34 can be used to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.
  • the determining module 35 may be configured to determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • a model training module 31 is also included.
  • the sentence recognition module 32 includes a word segmentation processing unit 321 , a grouping division unit 322 , and a word vector conversion unit 323 .
  • the word segmentation processing unit 321 may be configured to perform word segmentation processing on the to-be-extracted text paragraph to obtain a text paragraph after word segmentation processing.
  • the grouping unit 322 can be configured to obtain an initial data sequence including a complete sentence according to a preset sequence length.
  • the word vector conversion unit 323 may be configured to perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
  • the first position prediction module 33 includes a pre-training unit 331 and a starting position prediction unit 332 .
  • The pre-training unit 331 may be configured to use the pre-training module in the pre-trained text extraction network model to obtain, according to the initial word vector group, a first word vector group containing contextual semantic information.
  • The starting position prediction unit 332 can be used to use the first position prediction module in the pre-trained text extraction network model to obtain, according to the start position prediction probability value of each word vector in the first word vector group, a plurality of to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
  • the second position prediction module 34 includes a vector splicing unit 341 and an end position prediction unit 342 .
  • The vector splicing unit 341 may be configured to splice, for each to-be-determined start word vector, the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group.
  • The end position prediction unit 342 can be used to use the second position prediction module in the pre-trained text extraction network model to obtain, according to the end position prediction probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent the end position of text extraction.
  • the determination module 35 includes a combination determination unit 351 , a preset condition unit 352 , a probability determination unit 353 , and a text extraction unit 354 .
  • the combination determining unit 351 may be configured to determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • The preset condition unit 352 can be used to obtain the to-be-determined extracted text combinations that satisfy the preset condition among the initial extracted text combinations, wherein the preset condition includes at least: the difference between the end position number corresponding to the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
  • The probability determination unit 353 can be used to determine the target start word vector and its corresponding target end word vector according to the product of the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end position prediction probability value of each of its corresponding to-be-determined end word vectors.
  • the text extraction unit 354 may be configured to obtain the target extracted text according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector.
  • the model training module 31 can be used to train the initial text extraction network model.
  • the model training module 31 includes a first-stage training unit 311 , a training monitoring unit 312 , and a second-stage training unit 313 .
  • the first-stage training unit 311 may be configured to train the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples.
  • the training monitoring unit 312 can be used to obtain the first-stage text extraction network model when it is monitored that the current loss value of the first loss function in the initial text extraction network model drops to a preset percentage of the initial loss value.
  • the second-stage training unit 313 may be configured to use the first loss function and the second loss function corresponding to the preset correction module to perform a second round of training on the first-stage text extraction network model, according to the training samples with the position labels ignored, to obtain a trained text extraction network model.
  • the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
  • the embodiment of the present application also provides a storage medium on which computer-readable instructions are stored; when the instructions are executed by a processor, the text information extraction method shown in Figure 1 and Figure 2 is implemented.
  • the technical solution of the present application can be embodied in the form of a software product, which can be stored in a storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in each implementation scenario of this application.
  • the embodiment of this application also provides a computer device, which may be a personal computer, a server, a network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the text information extraction method shown in Figure 1 and Figure 2.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the like, and optional user interfaces may also include a USB interface, a card reader interface, and the like.
  • the network interface may include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface) and the like.
  • the computer device structure described in this embodiment does not limit the physical device, which may include more or fewer components, combine some components, or arrange the components differently.
  • the storage medium may also include an operating system and a network communication module.
  • An operating system is a program that manages the hardware and software resources of a computer device and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication between various components inside the storage medium, and communicate with other hardware and software in the physical device.
  • this embodiment can use the trained text extraction network model to avoid the low accuracy and low efficiency of existing technical solutions that rely on manual rules, while also solving the problem that predicting only whether each word in the article is a citation fails to establish the necessary connections between words, thereby improving the flexibility and adaptability of text extraction and effectively improving the accuracy of text information extraction.
  • the accompanying drawing is only a schematic diagram of a preferred implementation scenario, and the modules or processes in the accompanying drawings are not necessarily necessary for implementing the present application.
  • the modules of the apparatus in an implementation scenario can be distributed in that apparatus as described for the scenario, or can be located, with corresponding changes, in one or more apparatuses different from that of the scenario.
  • the modules of the above implementation scenarios can be combined into one module, or can be further split into multiple sub-modules.

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed are a text information extraction method and apparatus, and a storage medium and a computer device, which can improve the accuracy of text information extraction. The method comprises: performing sentence recognition on a text paragraph to be subjected to extraction, so as to obtain an initial word vector group of said text paragraph; predicting, by using a pre-trained text extraction network model, a plurality of start word vectors to be determined, which are used for representing a text extraction start position, in the initial word vector group; according to the plurality of said start word vectors and the initial word vector group, predicting a plurality of end word vectors to be determined which correspond to said start word vectors in the initial word vector group; and according to the plurality of said start word vectors obtained by means of prediction, and the plurality of said end word vectors corresponding to said start word vectors, determining target text to be extracted. The present application is applicable to the extraction of target text in a data set.

Description

Text information extraction method, apparatus, storage medium, and computer device
This application claims priority to Chinese patent application No. 202111007458.6, entitled "Text information extraction method, apparatus, computer device, and storage medium", filed with the China Patent Office on August 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a text information extraction method and apparatus, a storage medium, and a computer device.
Background
Text information extraction is a technique for extracting specific information from text data. With the development of artificial intelligence and related disciplines, it is advancing toward digitization, intelligence, and semantic processing, and is playing a growing role in social knowledge management. Widely used approaches today include: regular-rule methods, which extract text using regular expressions and manually defined filtering or matching rules; methods that use a named entity recognition (NER) model and process the text by defining an extraction task; and other mainstream methods that make predictions on individual words in the text.
The inventors realized that, in the prior art, regular-rule methods depend on manual rules and cannot fully extract text information when faced with complex sentence contexts or semantically incomplete text; NER models are prone to overfitting, so extraction accuracy drops sharply on text containing new corpus information; and the other methods extract isolated words from the text. As a result, the accuracy of text information extraction is low.
Summary of the Invention
In view of this, the present application provides a text information extraction method and apparatus, a storage medium, and a computer device.
According to one aspect of the present application, a text information extraction method is provided, the method comprising:
performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph;
using a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors that represent the start position of text extraction;
predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
determining the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
According to another aspect of the present application, a text information extraction apparatus is provided, the apparatus comprising:
a sentence recognition module, configured to perform sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph;
a first position prediction module, configured to use a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors that represent the start position of text extraction;
a second position prediction module, configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
a determination module, configured to determine the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
According to yet another aspect of the present application, a storage medium is provided on which computer-readable instructions are stored; when the instructions are executed by a processor, the above text information extraction method is implemented, including:
performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph;
using a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors that represent the start position of text extraction;
predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
determining the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
According to still another aspect of the present application, a computer device is provided, including a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor; when the processor executes the instructions, the above text information extraction method is implemented, including:
performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph;
using a pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors that represent the start position of text extraction;
predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
determining the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
By means of the above technical solution, the accuracy of text information extraction is effectively improved.
Brief Description of the Drawings
The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
FIG. 1 is a schematic flowchart of a text information extraction method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another text information extraction method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of the text extraction network model in the training stage, provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text information extraction apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another text information extraction apparatus provided by an embodiment of the present application.
Detailed Description
The present application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The regular-rule method, the NER model recognition method, and other mainstream methods in the prior art all suffer from low accuracy of text information extraction. Take the regular-rule method as an example: in the context of dataset citations, references to other text information frequently contain words such as "Survey", "Data", "Study", "Database", and "Statistics", and the words used begin with a capital letter. The regular-rule method extracts text information by filtering the matched citation information, but it is too simple, its extraction performance depends on the manually specified rules, and its extraction results are relatively poor. On this basis, this embodiment provides a text information extraction method, shown in FIG. 1 and described here as applied to a computer device such as a server. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms, for example an intelligent medical system or a digital medical platform. The method includes the following steps:
Step S101: perform sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text paragraph.
In this embodiment, to make the text information easier for the text extraction network model to process, the words of the text paragraph to be extracted are first segmented. The segmented paragraph is then divided according to a preset sequence length into one or more initial data sequences that contain complete sentences, and each initial data sequence is converted into word vectors to obtain an initial word vector group. Specifically, the text paragraph is divided in units of sentences, and text paragraphs shorter than the preset sequence length are padded.
In an exemplary embodiment of the present application, the segmented text paragraph is divided into sequences of 512 words, which strengthens the ability of the text extraction network model to extract from long text. Dividing in units of sentences also avoids splitting one complete sentence across different data sequences during division, which would reduce the accuracy with which the model extracts contextual semantics.
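The sentence-preserving division described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function name `chunk_by_sentence`, the `[PAD]` token, and the choice to raise an error for a single sentence longer than the sequence length are all assumptions made for illustration.

```python
def chunk_by_sentence(sentences, max_len=512, pad_token="[PAD]"):
    """Group whole sentences into fixed-length sequences and pad each one.

    `sentences` is a list of token lists, one list per sentence.
    The pad token and the error for over-long sentences are assumptions.
    """
    sequences, current = [], []
    for sent in sentences:
        if len(sent) > max_len:
            raise ValueError("a single sentence exceeds max_len")
        # Close the current sequence rather than split a sentence in two.
        if len(current) + len(sent) > max_len:
            sequences.append(current + [pad_token] * (max_len - len(current)))
            current = []
        current = current + sent
    if current:
        sequences.append(current + [pad_token] * (max_len - len(current)))
    return sequences
```

With a toy sequence length of 5, three sentences of 3, 3, and 2 tokens produce two sequences: the first holds one sentence plus two padding tokens, and the second holds the remaining two sentences exactly.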
Depending on the needs of the actual application scenario, for example a response event in an encyclopedia question-answering service, the question entered by the user is concatenated with the retrieved target text paragraph to obtain a text paragraph to be extracted that contains the question information. Sentence recognition is performed on this paragraph to obtain its initial word vector group, from which the start word vector position and the end word vector position in the paragraph are then predicted. Because the user's question information is included, the extracted text information is more accurate.
Step S102: use the pre-trained text extraction network model to predict, in the initial word vector group, a plurality of to-be-determined start word vectors that represent the start position of text extraction.
In this embodiment, the pre-training module (GPT: Generative Pre-training) in the pre-trained text extraction network model lets each word vector in the initial word vector group learn the semantic information of the other word vectors, producing a first word vector group that contains contextual semantic information. The first position prediction module then obtains the predicted start-position probability value of each word vector in the first word vector group, and a traversal determines the K to-be-determined start word vectors with the largest predicted start-position probability values in the first word vector group.
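Selecting the K most probable start positions can be sketched in a few lines of Python. The helper name `top_k_positions` is hypothetical, and the probabilities are assumed to be already available as a plain list, one value per position; how the first position prediction module computes them is not shown here.

```python
import heapq


def top_k_positions(probs, k):
    """Return the k (index, probability) pairs with the highest probability.

    `probs` is assumed to hold one start-position probability per word vector.
    """
    return heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
```

The result is ordered from most to least probable, so the first entry is the single best start candidate.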
In an exemplary embodiment of the present application, the pre-training model GPT uses a multi-layer Transformer architecture. Its self-attention mechanism enables each word vector, after multiple layers of learning, to capture grammatical, syntactic, and other deep semantic information beyond its own features, and establishes the contextual connections of each word vector within the initial word vector group, thereby improving the accuracy with which the text extraction network model extracts text information.
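For reference, the self-attention used in standard Transformer layers is commonly written as scaled dot-product attention. The application does not state this formula, so the form below is the usual textbook one and is only assumed to apply here. The symbols Q, K, V (query, key, and value matrices derived from the word vectors) and d_k (key dimension) are not from the application, and this K is unrelated to the number K of start candidates used elsewhere in the description.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

Each output row is a weighted average of the value vectors, which is how a word vector comes to carry information from other positions in the sequence.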
Step S103: predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors.
In this embodiment, to further improve the accuracy of start and end position prediction, the end position is predicted with the help of the predicted start position information. Specifically, each of the K to-be-determined start word vectors is concatenated with the initial word vector group, producing K concatenated word vector groups that are input to the second position prediction module. The second position prediction module obtains, within each concatenated word vector group, the predicted end-position probability values corresponding to the respective to-be-determined start word vector, and a traversal determines, for each to-be-determined start word vector, the N to-be-determined end word vectors with the largest predicted end-position probability values. K and N may be set equal or unequal according to the needs of the actual application scenario.
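The concatenation and end-position selection for one start candidate might look like the following sketch. The names `concat_with_start` and `top_n_ends` are hypothetical, and `score_end` is a stand-in callable for the second position prediction module, whose internals are not modelled here.

```python
import heapq


def concat_with_start(start_vec, vectors):
    """Concatenate the start vector with every vector in the group."""
    return [list(start_vec) + list(v) for v in vectors]


def top_n_ends(start_vec, vectors, score_end, n):
    """Score each concatenated vector and keep the n best end positions.

    `score_end` is an assumed callable mapping one concatenated vector
    to an end-position score; it stands in for the trained module.
    """
    concatenated = concat_with_start(start_vec, vectors)
    scores = [score_end(c) for c in concatenated]
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])
```

Calling `top_n_ends` once per start candidate yields the N end candidates for each of the K starts; concatenating rather than adding keeps the start information separate from each position's own features.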
Step S104: determine the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
In this embodiment, each to-be-determined start word vector is paired with its N to-be-determined end word vectors with the largest predicted end-position probability values, giving K*N initial extracted text combinations. The to-be-determined extracted text combinations that satisfy the preset condition are then selected from these K*N combinations. For each of them, the predicted start-position probability value of the to-be-determined start word vector is multiplied by the predicted end-position probability value of the corresponding to-be-determined end word vector. The start word vector of the combination with the largest product is taken as the target start word vector and its end word vector as the target end word vector, which yields the target extracted text.
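The selection over the K*N combinations can be sketched as below. This is an illustration under stated assumptions: the function name `select_span`, the data layout, and the default threshold of 0 are hypothetical, and the preset condition is taken to be that the end position number minus the start position number exceeds a set threshold.

```python
def select_span(starts, ends_by_start, min_gap=0):
    """Pick the (start, end) pair with the highest probability product.

    starts: list of (start_index, start_prob) pairs.
    ends_by_start: dict mapping start_index -> list of (end_index, end_prob).
    min_gap: assumed threshold; a pair is kept only if end - start > min_gap.
    Returns (start_index, end_index, product), or None if no pair qualifies.
    """
    best = None
    for s_idx, s_prob in starts:
        for e_idx, e_prob in ends_by_start.get(s_idx, []):
            if e_idx - s_idx <= min_gap:
                continue  # fails the preset condition
            score = s_prob * e_prob
            if best is None or score > best[2]:
                best = (s_idx, e_idx, score)
    return best
```

The returned indices are the start and end position numbers from which the target extracted text is read; an end candidate that precedes its start is discarded even if its probability is high.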
Following the above scheme, this embodiment inputs the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, into the pre-trained text extraction network model and predicts a plurality of to-be-determined start word vectors in the initial word vector group. According to these start word vectors and the initial word vector group, it predicts a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector, and then determines the target extracted text from the predicted start word vectors and their corresponding end word vectors. Compared with the existing regular-rule method, NER model recognition method, and other mainstream technical solutions, this embodiment improves the extraction accuracy of the text extraction network model on the basis of the start word vectors and the initial word vector group, and therefore extracts the target text in the text paragraph more accurately.
Further, as a refinement and extension of the specific implementation of the above embodiment, and to describe the implementation process of this embodiment in full, another text information extraction method is provided. As shown in FIG. 2, the method includes:
Step S201: train the initial text extraction network model.
Specifically, as shown in FIG. 3, to improve the accuracy of text information extraction, the constructed initial text extraction network model includes a first position prediction module and a second position prediction module connected in series (the span modules), which predict the start position and the end position of the target extracted text, respectively. A pre-training module, GPT, is added at the input of the first position prediction module to obtain the contextual semantic information of each word vector. In the model training stage, a correction module is added so that the updates to the model parameters during training of the text extraction network model converge more stably.
To illustrate a specific implementation of step S201, as a preferred embodiment, step S201 may include: training the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples; when the current loss value of the first loss function in the initial text extraction network model is observed to have dropped to a preset percentage of the initial loss value, obtaining a first-stage text extraction network model; and using the first loss function and a second loss function corresponding to a preset correction module to train the first-stage text extraction network model a second time on the training samples with the position labels ignored, obtaining the trained text extraction network model. The correction module assists in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
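The switch from the first training stage to the second can be sketched as a small monitor on the first loss value. The class name `StageMonitor` and the example fraction of 0.5 are assumptions; the application only says "a preset percentage". How the first and second loss functions are combined in stage two is not specified in this passage and is deliberately left out.

```python
class StageMonitor:
    """Tracks the first loss and reports which training stage applies."""

    def __init__(self, preset_fraction):
        # preset_fraction is the assumed "preset percentage", e.g. 0.5 for 50%.
        self.preset_fraction = preset_fraction
        self.initial_loss = None
        self.stage = 1

    def update(self, first_loss):
        """Record the latest first-loss value and return the current stage."""
        if self.initial_loss is None:
            self.initial_loss = first_loss  # first observation is the baseline
        elif self.stage == 1 and first_loss <= self.preset_fraction * self.initial_loss:
            self.stage = 2  # stays in stage two even if the loss rises again
        return self.stage
```

A training loop would call `update` after each step and, once it returns 2, start adding the correction module's loss and stop using the position labels.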
The specific steps for training the constructed initial text extraction network model include:
1) Obtain a text paragraph as a training sample, and preset the labeled text sequence to be extracted from the training sample as [w100, w101, w102, w103], where w100 corresponds to the target start position of the quotation and w103 corresponds to the target end position of the quotation.
2) Perform word segmentation on the training sample to obtain the segmented text paragraph: English is segmented on spaces, and Chinese is segmented with jieba, the word segmentation tool made public by Baidu.
3) Divide the segmented text paragraph according to a preset sequence length to obtain one or more initial data sequences containing complete sentences. Specifically, the sequence length is set to 512 words. A text paragraph shorter than 512 words is padded to form one initial data sequence containing complete sentences. A text paragraph longer than 512 words is truncated at a complete-sentence boundary, and the truncated part, if shorter than 512 words, is padded to form one initial data sequence containing complete sentences; the text remaining after truncation is treated as a new text paragraph and divided in the same way until division is complete, yielding multiple initial data sequences containing complete sentences.
4) Use a trained word2vec or GloVe word vector module to convert each word in the initial data sequence into a word vector, obtaining the initial word vector group, denoted [w1, w2, …, w512].
5) Use the GPT module in the initial text extraction network model to extract semantic features for each word in the initial word vector group, obtaining a first word vector group containing contextual semantics, denoted [h1, h2, …, h512]. Specifically, the GPT model uses a multi-layer Transformer architecture, and every Transformer layer contains a self-attention mechanism. This mechanism lets each word in [w1, w2, …, w512] extract feature information from the words at other positions and use that information to update its own vector, thereby capturing the deep relationships between the other words and itself. That is, after multiple layers of learning, each word vector in the initial word vector group becomes a word vector that carries the grammatical, syntactic, and other deep semantic information of the words at all other positions in the group, yielding the first word vector group [h1, h2, …, h512] containing contextual semantics.
6) Input the first word vector group [h1, h2, …, h512] into the first position prediction module (span module) of the initial text extraction network model, and output the start position prediction probability value of the start word vector h100 at the target start position.
7) Concatenate the first word vector h100, which represents the target start position, with each word vector in the first word vector group to obtain a concatenated word vector group, denoted [h100+h1, h100+h2, …, h100+h512].
8) Input the concatenated word vector group into the second position prediction module of the initial text extraction network model, and output the end position prediction probability value of the end word vector h103 at the target end position, where the first position prediction module and the second position prediction module are identical. Optionally, the first position prediction module and the second position prediction module may be two separate position prediction modules, or a single position prediction module that outputs the position prediction probability values of the start word vector and the end word vector in turn; the position prediction module is not specifically limited here.
9) During training, the objective is to maximize the product of the position prediction probability values of the target start position and the target end position. If the current loss value of the first loss function L1 of the position prediction module is detected to have dropped to 30% of the initial loss value, the target start position and the target end position are set to empty, and secondary training is performed with a multi-task learning framework to obtain the trained text extraction network model. In the multi-task learning framework, the correction module assists the training of the position prediction module: when the current loss value L_m of the first loss function L1 has dropped to 30% of the initial loss value, the first-stage text extraction network model is obtained; training of the first-stage text extraction network model then continues with the second loss function L2 corresponding to the correction module together with the first loss function L1, yielding the second-stage text extraction network model, which serves as the trained text extraction network model. Specifically:
① Re-input the first word vector group [h1, h2, …, h512] containing contextual semantics into the first position prediction module of the first-stage text extraction network model, and output the start position prediction probability value of each first word vector; take the K word vectors with the largest start position prediction probability values as the to-be-determined start word vectors. For each to-be-determined start word vector, concatenate it with each word vector in the first word vector group to obtain a concatenated word vector group; through the second position prediction module of the first-stage text extraction network model, obtain the end position prediction probability value of each word vector for that to-be-determined start word vector, and take the N word vectors with the largest end position prediction probability values as the to-be-determined end word vectors. From the K to-be-determined start word vectors and the N to-be-determined end word vectors corresponding to each of them, build K*N initial extracted text combinations.
The position prediction module is used to calculate the position prediction probability value of a word vector, using the formula:
p=S(Wx+b)
where W is the weight and b is the bias, both of which are network model parameters continuously updated through model training and learning, and S denotes the sigmoid function, whose expression is:
S(x)=1/(1+e^(-x))
② Simultaneously input the first word vector group [h1, h2, …, h512] containing contextual semantics into the correction module, which extracts text information directly, that is, performs long-text extraction by repeatedly predicting the word at the next position. Its structure is one fully connected layer followed by one softmax layer, that is, P=softmax(wx+b), where x is the word vector group [h1, h2, …, h512] containing contextual semantics; the softmax layer predicts the probability of each candidate word for the next position and outputs a numeric vector whose entries sum to 1.
③ Iteratively train on the outputs of ① and ② according to the target loss function L until training ends, obtaining the trained text extraction network model. Specifically, the maximum number of training iterations is N rounds, where N defaults to 10000 and can be customized by the user. The target loss function is defined as L=L1+L2. The loss function L1 computes the negative logarithm for the target start position and the target end position, and is given by:
Figure PCTCN2022071444-appb-000002
Figure PCTCN2022071444-appb-000003
where P_start denotes the position prediction probability value, output by the first position prediction module, of the word vector at the target start position; P_end denotes the position prediction probability value, output by the second position prediction module, of the word vector at the target end position; M is the preset vocabulary size, set to 50000 word vectors; y_hc indicates that the dimension at index c of the current word vector h is 1 and all others are 0, with 0<c<M; and p_hc denotes the probability that the current word vector h is c, that is, the value of the c-th dimension of the numeric vector produced by the softmax layer above.
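The expressions for L1 and L2 appear only in the figures referenced above and are not reproduced in this text. As a hedged illustration, the following Python sketch assumes the forms suggested by the surrounding definitions: L1 = -(log P_start + log P_end), and L2 as the cross-entropy -Σ_h Σ_c y_hc·log(p_hc) with one-hot y. Whether the application sums or averages over positions is not stated, so the summation used here, like every function name in the sketch, is an assumption.

```python
import math


def sigmoid(z):
    """S(z) = 1 / (1 + e^(-z)), used by the position prediction module."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_loss(p_start, p_end):
    """Assumed L1: negative log of the start and end probabilities
    predicted at the target positions."""
    return -(math.log(p_start) + math.log(p_end))


def next_word_loss(prob_rows, target_indices):
    """Assumed L2: cross-entropy with one-hot targets, so only the
    probability at the target index c of each row contributes."""
    return -sum(math.log(row[c]) for row, c in zip(prob_rows, target_indices))


def total_loss(p_start, p_end, prob_rows, target_indices):
    """L = L1 + L2, as defined for the target loss function."""
    return span_loss(p_start, p_end) + next_word_loss(prob_rows, target_indices)


# Toy values: two positions, a three-word vocabulary.
rows = [softmax([2.0, 0.5, 0.1]), softmax([0.2, 1.5, 0.3])]
print(total_loss(sigmoid(1.2), sigmoid(0.8), rows, [0, 1]))
```

Because the correction module contributes only through L2, dropping it at inference time leaves the position prediction path unchanged.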
Implementing multi-task training by adding the correction module brings training closer to the actual text extraction scenario. In addition, blanking the vector labels at the text extraction positions during training causes the loss value of the model to rise sharply, which makes learning harder and can ultimately prevent the model from converging; adding the correction module provides auxiliary correction to the first loss function of the model, so that the model parameters are updated more stably. It should be noted that, in actual text extraction, the trained text extraction network model does not include the correction module; the correction module is used only to further optimize the model parameters of the position prediction module.
In the PyTorch framework, with minimization of the loss function L as the objective, the stochastic gradient descent (SGD) algorithm is used to iteratively update the network model parameters W and b of the initial text extraction network model, obtaining the trained text extraction network model. Specifically, during training, if the difference between the loss values L_m and L_{m+1} obtained in two consecutive training rounds is less than a set value, that is, L_m-L_{m+1}<0.01, the model is considered to have converged, training is judged to be complete, and the trained text extraction network model is obtained.
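The stopping rule can be sketched independently of any framework. In the following Python sketch, `step_fn` is a hypothetical callable that performs one parameter update and returns the current loss; the defaults of 10000 rounds and 0.01 are taken from the values stated in this embodiment, and everything else is an assumption. Note that the criterion as written is signed, so it is also met when the loss rises between two rounds.

```python
def train_until_converged(step_fn, max_rounds=10000, tolerance=0.01):
    """Call step_fn until L_m - L_{m+1} < tolerance or max_rounds is reached.

    Returns the last loss value and the number of rounds performed.
    """
    previous = step_fn()
    rounds = 1
    while rounds < max_rounds:
        current = step_fn()
        rounds += 1
        if previous - current < tolerance:
            return current, rounds  # considered converged
        previous = current
    return previous, rounds  # stopped by the round limit


# Hypothetical loss values standing in for real training rounds.
losses = iter([1.0, 0.6, 0.4, 0.395, 0.39])
final, rounds = train_until_converged(lambda: next(losses))
print(final, rounds)
```

In a real training loop, `step_fn` would wrap the forward pass, the loss computation, and the optimizer update.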
Step S202, perform word segmentation on the text paragraph to be extracted to obtain the segmented text paragraph.
Step S203, obtain an initial data sequence containing complete sentences according to the preset sequence length.
Step S204, perform word vector conversion on the initial data sequence to obtain the initial word vector group.
Step S205, based on the initial word vector group, use the pre-training module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information.
Step S206, use the first position prediction module in the pre-trained text extraction network model to obtain, according to the start position prediction probability value of each word vector in the first word vector group, multiple to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
Step S207, for each to-be-determined start word vector, concatenate the to-be-determined start word vector with the initial word vector group to obtain a concatenated word vector group.
To illustrate a specific implementation of step 207, as a preferred embodiment, step 207 may specifically include: concatenating the to-be-determined start word vector with each word vector in the initial word vector group to obtain the concatenated word vector group.
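A minimal Python sketch of this concatenation is given below. It assumes that each word vector is a plain list of floats and that "concatenation" means joining the two vectors end to end, so that the result has twice the dimension of a single vector; the function name `concat_with_start` is hypothetical.

```python
def concat_with_start(start_vec, word_vectors):
    """Return one concatenated vector [start_vec ; h_i] for every h_i."""
    return [list(start_vec) + list(h) for h in word_vectors]


# Toy example: a 2-dimensional start vector joined with three word vectors.
start = [0.1, 0.2]
group = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
for vec in concat_with_start(start, group):
    print(vec)
```

Each concatenated vector would then be passed to the second position prediction module to score the corresponding position as a possible end position.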
Step S208, use the second position prediction module in the pre-trained text extraction network model to obtain, according to the end position prediction probability value of each word vector in the concatenated word vector group, multiple to-be-determined end word vectors in the concatenated word vector group that represent the end position of text extraction.
Step S209, determine the initial extracted text combinations according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector.
Step S210, obtain, from the initial extracted text combinations, the to-be-determined extracted text combinations that satisfy a preset condition. The preset condition at least includes: the difference between the end position number of the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
In this embodiment, unlike in the training process of the text extraction network model, after the K*N initial extracted text combinations are obtained, those that satisfy the preset condition are selected as the to-be-determined extracted text combinations. The preset condition is that, in a to-be-determined extracted text combination, the end position number of the to-be-determined end word vector is greater than the start position number of the to-be-determined start word vector, and the difference between the end position number and the start position number is greater than a set threshold (for example, 2); the preset condition is not specifically limited here.
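The construction of the K*N combinations and the filtering by the preset condition can be sketched in Python as follows. The tuple layout `(start, end, p_start, p_end)`, the callable `end_probs_for` that stands in for the second position prediction module, and all function names are assumptions made for this sketch; the default threshold of 2 is the example value given above.

```python
def top_indices(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def build_candidates(start_probs, end_probs_for, k, n):
    """Build K*N combinations as (start, end, p_start, p_end) tuples.

    end_probs_for(s) returns the end position probabilities predicted
    for start index s.
    """
    candidates = []
    for s in top_indices(start_probs, k):
        end_probs = end_probs_for(s)
        for e in top_indices(end_probs, n):
            candidates.append((s, e, start_probs[s], end_probs[e]))
    return candidates


def filter_candidates(candidates, threshold=2):
    """Keep combinations whose end index exceeds the start index by more
    than the threshold (which, for threshold >= 0, implies end > start)."""
    return [c for c in candidates if c[1] - c[0] > threshold]


start_probs = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3]
fixed_end_probs = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8]
candidates = build_candidates(start_probs, lambda s: fixed_end_probs, k=2, n=2)
kept = filter_candidates(candidates)
print(len(candidates), len(kept))
```

In this toy input the start indices 1 and 3 and the end indices 4 and 5 are selected, and only the two combinations starting at index 1 satisfy the condition.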
Step S211, determine the target start word vector and its corresponding target end word vector according to the probability product values obtained by multiplying the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations by the end position prediction probability values of the multiple to-be-determined end word vectors corresponding to that to-be-determined start word vector.
In this embodiment, the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations is multiplied by the end position prediction probability value of each of its N corresponding to-be-determined end word vectors. All probability product values are traversed, and the extracted text combination with the largest probability product value determines the target start word vector and its corresponding target end word vector.
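A short Python sketch of this selection follows. It assumes each remaining combination is a tuple `(start, end, p_start, p_end)` and that the end position is inclusive, consistent with the training example in which the labeled sequence runs from w100 to w103; the function names and the token list are hypothetical.

```python
def select_best_span(candidates):
    """Return the (start, end) pair with the largest p_start * p_end,
    or None when no combination satisfied the preset condition."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c[2] * c[3])
    return best[0], best[1]


def extract_text(tokens, span, joiner=" "):
    """Join the tokens from the start index to the end index inclusive."""
    start, end = span
    return joiner.join(tokens[start:end + 1])


candidates = [(1, 4, 0.9, 0.5), (1, 5, 0.9, 0.8), (3, 7, 0.6, 0.9)]
tokens = ["He", "said", "that", "the", "model", "works", "very", "well"]
span = select_best_span(candidates)
print(span, extract_text(tokens, span))
```

For Chinese text, an empty `joiner` would be the natural choice, since segmented words are not separated by spaces.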
Step S212, obtain the target extracted text according to the start position number of the target start word vector and the end position number of the target end word vector.
By applying the technical solution of this embodiment, the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model to predict multiple to-be-determined start word vectors in the initial word vector group; multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted according to the multiple to-be-determined start word vectors and the initial word vector group; and the target extracted text is determined according to the predicted to-be-determined start word vectors and the to-be-determined end word vectors corresponding to each of them. The pre-trained text information extraction network model thus effectively avoids the following technical problems: the existing regular-expression rule approach depends heavily on hand-written rules and cannot fully extract complex or incomplete text information; NER model recognition is prone to overfitting and extracts text with low accuracy when faced with text containing new corpus information; and other mainstream approaches extract words from the text in isolation, which lowers extraction accuracy. The accuracy of text information extraction is thereby effectively improved.
Further, as a specific implementation of the method of FIG. 1, an embodiment of the present application provides a text information extraction apparatus. As shown in FIG. 4, the apparatus includes: a sentence recognition module 32, a first position prediction module 33, a second position prediction module 34, and a determination module 35.
The sentence recognition module 32 may be configured to obtain the initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.
The first position prediction module 33 may be configured to use the pre-trained text extraction network model to predict multiple to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
The second position prediction module 34 may be configured to predict, according to the multiple to-be-determined start word vectors and the initial word vector group, multiple to-be-determined end word vectors in the initial word vector group corresponding to each to-be-determined start word vector.
The determination module 35 may be configured to determine the target extracted text according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector.
In a specific application scenario, as shown in FIG. 5, the apparatus further includes a model training module 31.
In a specific application scenario, the sentence recognition module 32 includes a word segmentation unit 321, a grouping unit 322, and a word vector conversion unit 323.
The word segmentation unit 321 may be configured to perform word segmentation on the text paragraph to be extracted to obtain the segmented text paragraph.
The grouping unit 322 may be configured to obtain an initial data sequence containing complete sentences according to the preset sequence length.
The word vector conversion unit 323 may be configured to perform word vector conversion on the initial data sequence to obtain the initial word vector group.
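The first two units can be sketched in Python as follows, using the sequence length of 512 given in the training steps. Several details are assumptions made only for this sketch: sentence ends are detected from a small set of punctuation marks, the padding token is `"<pad>"`, a single sentence longer than the sequence length is simply cut, and only whitespace segmentation is shown (for Chinese, the embodiment names jieba, for example its `lcut` function, which is not imported here). Word vector conversion is omitted.

```python
def segment(text):
    """Whitespace segmentation, as described for English text."""
    return text.split()


def split_sentences(tokens, enders=(".", "!", "?", "。", "!", "?")):
    """Group tokens into sentences; a token ending in an ender closes one."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith(enders):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def chunk(tokens, seq_len=512, pad="<pad>"):
    """Pack whole sentences into sequences of exactly seq_len tokens,
    padding each sequence that falls short."""
    sequences, current = [], []
    for sent in split_sentences(tokens):
        sent = sent[:seq_len]  # assumption: an over-long sentence is cut
        if current and len(current) + len(sent) > seq_len:
            sequences.append(current + [pad] * (seq_len - len(current)))
            current = []
        current = current + sent
    if current:
        sequences.append(current + [pad] * (seq_len - len(current)))
    return sequences


# A small sequence length makes the behaviour easy to inspect.
for seq in chunk(segment("A b c. D e. F g h i."), seq_len=5):
    print(seq)
```

With a sequence length of 5, the first two sentences fill one sequence exactly and the third sentence starts a new sequence that receives one padding token.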
In a specific application scenario, the first position prediction module 33 includes a pre-training unit 331 and a start position prediction unit 332.
The pre-training unit 331 may be configured to use, based on the initial word vector group, the pre-training module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information.
The start position prediction unit 332 may be configured to use the first position prediction module in the pre-trained text extraction network model to obtain, according to the start position prediction probability value of each word vector in the first word vector group, multiple to-be-determined start word vectors in the initial word vector group that represent the start position of text extraction.
In a specific application scenario, the second position prediction module 34 includes a vector concatenation unit 341 and an end position prediction unit 342.
The vector concatenation unit 341 may be configured to, for each to-be-determined start word vector, concatenate the to-be-determined start word vector with the initial word vector group to obtain a concatenated word vector group.
The end position prediction unit 342 may be configured to use the second position prediction module in the pre-trained text extraction network model to obtain, according to the end position prediction probability value of each word vector in the concatenated word vector group, multiple to-be-determined end word vectors in the concatenated word vector group that represent the end position of text extraction.
In a specific application scenario, the determination module 35 includes a combination determination unit 351, a preset condition unit 352, a probability determination unit 353, and a text extraction unit 354.
The combination determination unit 351 may be configured to determine the initial extracted text combinations according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector.
The preset condition unit 352 may be configured to obtain, from the initial extracted text combinations, the to-be-determined extracted text combinations that satisfy a preset condition, where the preset condition at least includes: the difference between the end position number of the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
The probability determination unit 353 may be configured to determine the target start word vector and its corresponding target end word vector according to the probability product values obtained by multiplying the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations by the end position prediction probability values of the multiple to-be-determined end word vectors corresponding to that to-be-determined start word vector.
The text extraction unit 354 may be configured to obtain the target extracted text according to the start position number of the target start word vector and the end position number of the target end word vector.
In a specific application scenario, the model training module 31 may be configured to train the initial text extraction network model. The model training module 31 includes a first-stage training unit 311, a training monitoring unit 312, and a second-stage training unit 313.
The first-stage training unit 311 may be configured to train the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples.
The training monitoring unit 312 may be configured to obtain the first-stage text extraction network model when the current loss value of the first loss function in the initial text extraction network model is detected to have dropped to a preset percentage of the initial loss value.
The second-stage training unit 313 may be configured to perform secondary training on the first-stage text extraction network model using the first loss function and the second loss function corresponding to the preset correction module, based on training samples in which the position labels are ignored, to obtain the trained text extraction network model.
In a specific application scenario, the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
It should be noted that, for other descriptions of the functional units of the text information extraction apparatus provided in this embodiment of the present application, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2, which are not repeated here.
Based on the methods shown in FIG. 1 and FIG. 2 above, an embodiment of the present application correspondingly further provides a storage medium storing computer-readable instructions which, when executed by a processor, implement the text information extraction method shown in FIG. 1 and FIG. 2.
Based on this understanding, the technical solution of the present application may be embodied in the form of a software product. The software product may be stored in a storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, or the like) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the implementation scenarios of the present application.
Based on the methods shown in FIG. 1 and FIG. 2 and the virtual apparatus embodiments shown in FIG. 4 and FIG. 5, and to achieve the above purpose, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, or the like. The physical device includes a storage medium and a processor; the storage medium is configured to store computer-readable instructions, and the processor is configured to execute the computer-readable instructions to implement the text information extraction method shown in FIG. 1 and FIG. 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like. The user interface may include a display and an input unit such as a keyboard; optionally, the user interface may also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a WI-FI interface), and the like.
Those skilled in the art will understand that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or use a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the computer device and supports the running of information processing programs and other software and/or programs. The network communication module is used for communication among the components within the storage medium, and for communication with other hardware and software in the physical device.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware. By applying the technical solution of the present application, and compared with existing text information extraction schemes based on regular-expression rules, this embodiment uses the trained text extraction network model to avoid the technical problems of existing schemes, namely reliance on hand-written rules, low accuracy, and low efficiency. It also solves the problem that merely predicting whether each character in an article belongs to a quotation fails to establish the necessary connections between characters. This improves the flexibility and adaptability of text extraction and effectively improves the accuracy of text information extraction.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required to implement the present application. Those skilled in the art will also understand that the modules of the apparatus in an implementation scenario may be distributed in the apparatus as described for that scenario, or may be changed accordingly and located in one or more apparatuses different from that of the scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules.
The serial numbers used in the present application are for description only and do not indicate the relative merits of the implementation scenarios. The above discloses only several specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the protection scope of the present application.

Claims (20)

  1. A text information extraction method, comprising:
    performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
    predicting, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction;
    predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
    determining target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  2. The method according to claim 1, wherein performing sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text to be extracted specifically comprises:
    performing word segmentation on the text paragraph to be extracted to obtain a word-segmented text paragraph;
    obtaining, according to a preset sequence length, an initial data sequence containing complete sentences;
    performing word vector conversion on the initial data sequence to obtain the initial word vector group.
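For illustration only, the sequence-building step of claim 2 can be sketched as follows. The greedy packing strategy, the punctuation set used to split sentences, the default character-level tokenizer, and the names `split_sentences` and `pack_sentences` are all assumptions made for this sketch and do not appear in the application; the word vector conversion step is omitted.

```python
import re


def split_sentences(passage):
    """Split a passage after Chinese or English sentence-final punctuation.

    The punctuation set is an assumption made for this sketch.
    """
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", passage)
    return [p.strip() for p in parts if p.strip()]


def pack_sentences(sentences, max_len, tokenize=list):
    """Greedily pack whole sentences into token sequences of at most max_len.

    A sentence is never split. A single sentence longer than max_len
    therefore still forms its own (over-length) sequence.
    """
    sequences, current = [], []
    for sentence in sentences:
        tokens = tokenize(sentence)
        # Close the current sequence if this sentence would overflow it.
        if current and len(current) + len(tokens) > max_len:
            sequences.append(current)
            current = []
        current.extend(tokens)
    if current:
        sequences.append(current)
    return sequences
```

With `tokenize=str.split` and `max_len=5`, the sentences `"a b c"`, `"d e"`, and `"f g h i"` are packed into two sequences, the first holding the first two sentences in full.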
  3. The method according to claim 1, wherein predicting, by using the pre-trained text extraction network model, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction specifically comprises:
    obtaining, according to the initial word vector group and by using a pre-training module in the pre-trained text extraction network model, a first word vector group containing contextual semantic information;
    obtaining, by using a first position prediction module in the pre-trained text extraction network model and according to a start-position predicted probability value of each word vector in the first word vector group, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction.
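The claim does not state how the to-be-determined start word vectors are chosen from the predicted probability values. One plausible reading, assumed here purely for illustration, is to keep the k positions with the highest start-position probability. The function name `top_k_positions` and the parameter `k` are hypothetical.

```python
def top_k_positions(probs, k):
    """Return the indices of the k largest probabilities, best first.

    Keeping the top k is an assumption of this sketch; a fixed
    probability cut-off would be an equally plausible selection rule.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]
```

The same helper could be reused for end positions, applied to the end-position probabilities computed for each start candidate.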
  4. The method according to claim 1 or 3, wherein predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, the plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors specifically comprises:
    for each to-be-determined start word vector, splicing the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group;
    obtaining, by using a second position prediction module in the pre-trained text extraction network model and according to an end-position predicted probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent end positions of text extraction;
    wherein splicing the to-be-determined start word vector with the initial word vector group to obtain the spliced word vector group specifically comprises:
    splicing the to-be-determined start word vector with each word vector in the initial word vector group respectively to obtain the spliced word vector group.
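As a minimal sketch of the splicing step, the start word vector can be concatenated with every word vector in the group, so that each spliced vector has twice the original dimension. Placing the start vector first is an assumption of this sketch, and the name `splice_with_start` is hypothetical; vectors are plain Python lists here rather than tensors.

```python
def splice_with_start(start_vec, word_vectors):
    """Concatenate one start word vector with each word vector in the group.

    The start vector is placed first; the claim does not fix the order.
    """
    return [list(start_vec) + list(vec) for vec in word_vectors]
```

Because every position then carries the start candidate alongside its own representation, the end-position predictor can condition on the chosen start when scoring each position.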
  5. The method according to claim 1 or 4, wherein determining the target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors specifically comprises:
    determining initial extracted text combinations according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors;
    obtaining, from the initial extracted text combinations, to-be-determined extracted text combinations that satisfy a preset condition;
    determining a target start word vector and its corresponding target end word vector according to the product of the start-position predicted probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end-position predicted probability value of each of the plurality of to-be-determined end word vectors corresponding to that to-be-determined start word vector;
    obtaining the target extracted text according to a start position number corresponding to the target start word vector and an end position number corresponding to the target end word vector;
    wherein the preset condition at least comprises: the difference between the end position number corresponding to the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
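The selection described in claim 5 can be sketched as a search over (start, end) pairs: pairs whose position difference does not exceed the threshold are discarded, and the remaining pair with the largest probability product is kept. The dictionary layout of the inputs, the inclusive end position in `extract`, and the names `select_span`, `extract`, and `min_gap` are assumptions made for this sketch.

```python
def select_span(start_probs, end_probs, min_gap):
    """Pick the (start, end) pair with the largest probability product.

    start_probs: {start_index: probability}
    end_probs:   {start_index: {end_index: probability}}
    Only pairs with end - start > min_gap are considered.
    Returns (start, end, score), or None if no pair qualifies.
    """
    best = None
    for start, p_start in start_probs.items():
        for end, p_end in end_probs.get(start, {}).items():
            if end - start <= min_gap:
                continue  # preset condition not met
            score = p_start * p_end
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


def extract(tokens, start, end):
    """Return the tokens from start to end, treating end as inclusive."""
    return tokens[start:end + 1]
```

For example, with start probabilities `{2: 0.75, 5: 0.5}`, end probabilities `{2: {3: 1.0, 8: 0.5}, 5: {9: 1.0}}`, and `min_gap=2`, the pair (2, 3) is filtered out by the length condition and the pair (5, 9) is selected over (2, 8).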
  6. The method according to claim 1, further comprising:
    training an initial text extraction network model, specifically comprising:
    training the initial text extraction network model according to position labels corresponding to start position numbers and end position numbers in training samples;
    obtaining a first-stage text extraction network model when it is detected that a current loss value of a first loss function in the initial text extraction network model has dropped to a preset percentage of an initial loss value;
    performing secondary training on the first-stage text extraction network model by using the first loss function and a second loss function corresponding to a preset correction module, according to training samples in which the position labels are ignored, to obtain the trained text extraction network model.
  7. The method according to claim 6, wherein the correction module is configured to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
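The switch from the first to the second training stage in claims 6 and 7 depends on the first loss falling to a preset percentage of its initial value. A minimal sketch of such a monitor is given below. The class name `StageMonitor`, the use of the first observed loss as the initial loss value, and the choice to never fall back to stage 1 once stage 2 has started are assumptions of this sketch; how the two loss functions are combined in the second stage is not covered.

```python
class StageMonitor:
    """Track the first loss and report which training stage is active."""

    def __init__(self, ratio):
        self.ratio = ratio          # preset percentage, e.g. 0.25 for 25%
        self.initial_loss = None    # set from the first observed loss
        self.stage = 1

    def update(self, loss):
        """Record a loss value and return the current stage (1 or 2)."""
        if self.initial_loss is None:
            self.initial_loss = loss
        elif self.stage == 1 and loss <= self.ratio * self.initial_loss:
            self.stage = 2          # latch: later loss spikes do not revert
        return self.stage
```

A training loop would call `update` once per step or epoch and enable the correction module's second loss as soon as the returned stage becomes 2.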
  8. A text information extraction apparatus, comprising:
    a sentence recognition module, configured to perform sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
    a first position prediction module, configured to predict, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction;
    a second position prediction module, configured to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
    a determination module, configured to determine target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  9. The apparatus according to claim 8, wherein the sentence recognition module specifically comprises:
    a word segmentation unit, configured to perform word segmentation on the text paragraph to be extracted to obtain a word-segmented text paragraph;
    a grouping unit, configured to obtain, according to a preset sequence length, an initial data sequence containing complete sentences;
    a word vector conversion unit, configured to perform word vector conversion on the initial data sequence to obtain the initial word vector group.
  10. The apparatus according to claim 8, wherein the first position prediction module specifically comprises:
    a pre-training unit, configured to obtain, according to the initial word vector group and by using a pre-training module in the pre-trained text extraction network model, a first word vector group containing contextual semantic information;
    a start position prediction unit, configured to obtain, by using a first position prediction module in the pre-trained text extraction network model and according to a start-position predicted probability value of each word vector in the first word vector group, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction.
  11. The apparatus according to claim 8 or 10, wherein the second position prediction module specifically comprises:
    a vector splicing unit, configured to, for each to-be-determined start word vector, splice the to-be-determined start word vector with the initial word vector group to obtain a spliced word vector group;
    an end position prediction unit, configured to obtain, by using a second position prediction module in the pre-trained text extraction network model and according to an end-position predicted probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors in the spliced word vector group that represent end positions of text extraction;
    wherein splicing the to-be-determined start word vector with the initial word vector group to obtain the spliced word vector group specifically comprises:
    splicing the to-be-determined start word vector with each word vector in the initial word vector group respectively to obtain the spliced word vector group.
  12. The apparatus according to claim 8 or 11, wherein the determination module specifically comprises:
    a combination determination unit, configured to determine initial extracted text combinations according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors;
    a preset condition unit, configured to obtain, from the initial extracted text combinations, to-be-determined extracted text combinations that satisfy a preset condition;
    a probability value determination unit, configured to determine a target start word vector and its corresponding target end word vector according to the product of the start-position predicted probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end-position predicted probability value of each of the plurality of to-be-determined end word vectors corresponding to that to-be-determined start word vector;
    a text extraction unit, configured to obtain the target extracted text according to a start position number corresponding to the target start word vector and an end position number corresponding to the target end word vector;
    wherein the preset condition at least comprises: the difference between the end position number corresponding to the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
  13. The apparatus according to claim 8, further comprising:
    a model training module, configured to train an initial text extraction network model, specifically comprising:
    a first-stage training unit, configured to train the initial text extraction network model according to position labels corresponding to start position numbers and end position numbers in training samples;
    a training monitoring unit, configured to obtain a first-stage text extraction network model when it is detected that a current loss value of a first loss function in the initial text extraction network model has dropped to a preset percentage of an initial loss value;
    a second-stage training unit, configured to perform secondary training on the first-stage text extraction network model by using the first loss function and a second loss function corresponding to a preset correction module, according to training samples in which the position labels are ignored, to obtain the trained text extraction network model.
  14. The apparatus according to claim 13, wherein the correction module is configured to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
  15. A storage medium storing computer-readable instructions, wherein the readable instructions, when executed by a processor, implement a text information extraction method comprising:
    performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
    predicting, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction;
    predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
    determining target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  16. The storage medium according to claim 15, wherein performing sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text to be extracted specifically comprises:
    performing word segmentation on the text paragraph to be extracted to obtain a word-segmented text paragraph;
    obtaining, according to a preset sequence length, an initial data sequence containing complete sentences;
    performing word vector conversion on the initial data sequence to obtain the initial word vector group.
  17. The storage medium according to claim 15, wherein predicting, by using the pre-trained text extraction network model, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction specifically comprises:
    obtaining, according to the initial word vector group and by using a pre-training module in the pre-trained text extraction network model, a first word vector group containing contextual semantic information;
    obtaining, by using a first position prediction module in the pre-trained text extraction network model and according to a start-position predicted probability value of each word vector in the first word vector group, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction.
  18. A computer device, comprising a storage medium, a processor, and computer-readable instructions stored on the storage medium and executable on the processor, wherein the processor, when executing the readable instructions, implements a text information extraction method comprising:
    performing sentence recognition on a text paragraph to be extracted to obtain an initial word vector group of the text paragraph to be extracted;
    predicting, by using a pre-trained text extraction network model, a plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction;
    predicting, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors in the initial word vector group corresponding to each of the to-be-determined start word vectors;
    determining target extracted text according to the predicted plurality of to-be-determined start word vectors and the plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  19. The computer device according to claim 18, wherein performing sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text to be extracted specifically comprises:
    performing word segmentation on the text paragraph to be extracted to obtain a word-segmented text paragraph;
    obtaining, according to a preset sequence length, an initial data sequence containing complete sentences;
    performing word vector conversion on the initial data sequence to obtain the initial word vector group.
  20. The computer device according to claim 18, wherein predicting, by using the pre-trained text extraction network model, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction specifically comprises:
    obtaining, according to the initial word vector group and by using a pre-training module in the pre-trained text extraction network model, a first word vector group containing contextual semantic information;
    obtaining, by using a first position prediction module in the pre-trained text extraction network model and according to a start-position predicted probability value of each word vector in the first word vector group, the plurality of to-be-determined start word vectors in the initial word vector group that represent start positions of text extraction.
PCT/CN2022/071444 2021-08-30 2022-01-11 Text information extraction method and apparatus, and storage medium and computer device WO2023029354A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111007458.6A CN113722436A (en) 2021-08-30 2021-08-30 Text information extraction method and device, computer equipment and storage medium
CN202111007458.6 2021-08-30

Publications (1)

Publication Number Publication Date
WO2023029354A1 true WO2023029354A1 (en) 2023-03-09

Family

ID=78679376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071444 WO2023029354A1 (en) 2021-08-30 2022-01-11 Text information extraction method and apparatus, and storage medium and computer device

Country Status (2)

Country Link
CN (1) CN113722436A (en)
WO (1) WO2023029354A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016416A (en) * 2023-03-24 2023-04-25 深圳市明源云科技有限公司 Junk mail identification method, device, equipment and computer readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722436A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Text information extraction method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464641A (en) * 2020-10-29 2021-03-09 平安科技(深圳)有限公司 BERT-based machine reading understanding method, device, equipment and storage medium
CN113051926A (en) * 2021-03-01 2021-06-29 北京百度网讯科技有限公司 Text extraction method, equipment and storage medium
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
CN113255327A (en) * 2021-06-10 2021-08-13 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113268571A (en) * 2021-07-21 2021-08-17 北京明略软件系统有限公司 Method, device, equipment and medium for determining correct answer position in paragraph
CN113722436A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Text information extraction method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674271B (en) * 2019-08-27 2023-01-06 腾讯科技(深圳)有限公司 Question and answer processing method and device
CN111597314B (en) * 2020-04-20 2023-01-17 科大讯飞股份有限公司 Reasoning question-answering method, device and equipment
CN112464656B (en) * 2020-11-30 2024-02-13 中国科学技术大学 Keyword extraction method, keyword extraction device, electronic equipment and storage medium
CN112685548B (en) * 2020-12-31 2023-09-08 科大讯飞(北京)有限公司 Question answering method, electronic device and storage device
CN112446216B (en) * 2021-02-01 2021-05-04 华东交通大学 Method and device for identifying nested named entities fusing with core word information

Also Published As

Publication number Publication date
CN113722436A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
KR102401942B1 (en) Method and apparatus for evaluating translation quality
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN106777013B (en) Conversation management method and device
CN109086303A (en) The Intelligent dialogue method, apparatus understood, terminal are read based on machine
WO2023029354A1 (en) Text information extraction method and apparatus, and storage medium and computer device
CN109271493A (en) A kind of language text processing method, device and storage medium
CN111144120A (en) Training sentence acquisition method and device, storage medium and electronic equipment
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
JP7430820B2 (en) Sorting model training method and device, electronic equipment, computer readable storage medium, computer program
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN110808032A (en) Voice recognition method and device, computer equipment and storage medium
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
JP7181999B2 (en) SEARCH METHOD AND SEARCH DEVICE, STORAGE MEDIUM
CN116109732A (en) Image labeling method, device, processing equipment and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN112906368B (en) Industry text increment method, related device and computer program product
CN110807097A (en) Method and device for analyzing data
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
WO2023040545A1 (en) Data processing method and apparatus, device, storage medium, and program product
CN111931503B (en) Information extraction method and device, equipment and computer readable storage medium
CN116821327A (en) Text data processing method, apparatus, device, readable storage medium and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862505

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE