WO2023029354A1 - Method and apparatus for extracting text information, storage medium and computer device - Google Patents

Method and apparatus for extracting text information, storage medium and computer device

Info

Publication number
WO2023029354A1
Authority
WO
WIPO (PCT)
Prior art keywords
word vector
initial
determined
text
word
Prior art date
Application number
PCT/CN2022/071444
Other languages
English (en)
Chinese (zh)
Inventor
谯轶轩
陈浩
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023029354A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a text information extraction method, device, storage medium and computer equipment.
  • text information extraction is developing in the direction of digitization, intelligence, and semantics along with artificial intelligence and related disciplines, and it plays an increasingly important role in social knowledge management.
  • text information extraction methods include: regular-rule methods that perform text extraction based on regular expressions and manually defined filtering or matching rules; the use of named entity recognition (NER) models that process text by setting extraction tasks; and other mainstream approaches that predict individual words in text.
  • the inventor realized that the regular-rule method depends on manually defined rules and cannot completely extract text information when faced with complex sentence environments and text with incomplete semantics; that NER model recognition is prone to overfitting, and its extraction accuracy drops significantly when faced with text containing new corpus information; and that extracting words from text in isolation results in low accuracy of text information extraction.
  • the present application provides a text information extraction method, device, storage medium and computer equipment.
  • a method for extracting text information comprising:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • a text information extraction device comprising:
  • the sentence recognition module is used to perform sentence recognition on the text paragraph to be extracted to obtain the initial word vector group of the text paragraph to be extracted;
  • the first position prediction module is used to use the pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group;
  • the second position prediction module is used to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group;
  • the determination module is configured to determine the target text to be extracted according to the multiple to-be-determined start word vectors obtained through prediction, and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • a storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the above text information extraction method is implemented, including:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • a computer device including a storage medium, a processor, and computer-readable instructions stored on the storage medium and operable on the processor; when the processor executes the computer-readable instructions, the above-mentioned text information extraction method is implemented, including:
  • the initial word vector group of the text paragraph to be extracted is obtained;
  • the target text to be extracted is determined.
  • FIG. 1 shows a schematic flow chart of a method for extracting text information provided by an embodiment of the present application
  • FIG. 2 shows a schematic flow diagram of another text information extraction method provided by the embodiment of the present application
  • FIG. 3 shows a schematic diagram of the text extraction network model architecture in the training phase provided by the embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of a text information extraction device provided by an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of another apparatus for extracting text information provided by an embodiment of the present application.
  • AI Artificial Intelligence
  • AI is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • this embodiment provides a method for extracting text information, as shown in Figure 1, taking the application of this method to a computer device such as a server as an example for illustration. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN: Content Delivery Network), and big data and artificial intelligence platforms, for example in intelligent medical systems and digital medical platforms.
  • the above method comprises the following steps:
  • Step S101 Obtain an initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.
  • word segmentation processing is performed on the words of the text paragraph to be extracted, the text paragraph after word segmentation is divided according to a preset sequence length into one or more initial data sequences containing complete sentences, and word vector conversion processing is performed on the initial data sequences to obtain the initial word vector group.
  • the text paragraphs are divided in units of sentences, and text paragraphs shorter than the preset sequence length are padded to that length.
  • dividing the text paragraphs after word segmentation into sequences of 512 words can enhance the ability of the text extraction network model to extract long texts. Further, dividing the text paragraphs in units of sentences effectively avoids splitting a complete sentence across different data sequences during division, which would otherwise reduce the accuracy of the text extraction network model's contextual semantic extraction.
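As an illustration of the division described above, the following sketch packs whole sentences into fixed-length sequences; the naive punctuation-based segmentation, whitespace tokenization, `MAX_LEN`, and the `[PAD]` token are simplifying assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch: split a paragraph into sentences, then pack whole
# sentences into sequences of at most MAX_LEN tokens, padding short ones.
import re

MAX_LEN = 512
PAD = "[PAD]"

def split_sentences(paragraph):
    # Naive sentence segmentation on common end-of-sentence punctuation.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def chunk_paragraph(paragraph):
    """Pack whole sentences into sequences of <= MAX_LEN tokens
    (assumes no single sentence exceeds MAX_LEN)."""
    sequences, current = [], []
    for sentence in split_sentences(paragraph):
        tokens = sentence.split()  # stand-in for real word segmentation
        if current and len(current) + len(tokens) > MAX_LEN:
            sequences.append(current)
            current = []
        current = current + tokens
    if current:
        sequences.append(current)
    # Pad each sequence to the fixed length expected by the model.
    return [seq + [PAD] * (MAX_LEN - len(seq)) for seq in sequences]

chunks = chunk_paragraph("The model reads text. It predicts spans. Results are merged.")
```

Because the packing only breaks between sentences, no complete sentence is ever split across two data sequences, matching the motivation given above.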
  • the user's question information and the text paragraph to be extracted are spliced to obtain a text paragraph to be extracted that contains the question information, and sentence recognition is performed on this paragraph to obtain the initial word vector group of the text paragraph to be extracted, so as to further predict the position of the start word vector and the position of the end word vector in the text paragraph to be extracted. Because the user's question information is added, the extracted text information is more accurate.
  • Step S102 Using the pre-trained text extraction network model, predict a plurality of to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group.
  • the pre-training module (GPT: Generative Pre-Training) in the pre-trained text extraction network model is used to enable each word vector in the initial word vector group to learn the semantic information of the other word vectors, obtaining a first word vector group containing contextual semantic information; further, the first position prediction module is used to obtain the start position prediction probability value of each word vector in the first word vector group, and the K to-be-determined start word vectors with the largest start position prediction probability values in the first word vector group are determined by traversal.
  • GPT Generative Pre-training
  • the pre-training model GPT adopts a multi-layer Transformer architecture, and its self-attention mechanism enables each word vector, after multi-layer learning, to extract deep semantic information such as grammar and syntax beyond its own features, establishing the contextual connection of each word vector in the initial word vector group and thereby improving the accuracy of text information extraction by the text extraction network model.
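The traversal that keeps the K start positions with the largest prediction probability values can be sketched as follows; the probability list is invented for illustration, and in the described method it would come from the first position prediction module.

```python
# Illustrative sketch of picking the K start positions with the highest
# predicted probabilities (step S102). Names are not from the patent.

def top_k_starts(start_probs, k):
    """Return the k (index, probability) pairs with the largest
    start-position prediction probability."""
    ranked = sorted(enumerate(start_probs), key=lambda ip: ip[1], reverse=True)
    return ranked[:k]

# Toy probabilities for a 6-token sequence.
start_probs = [0.05, 0.62, 0.10, 0.81, 0.02, 0.33]
candidates = top_k_starts(start_probs, k=2)  # [(3, 0.81), (1, 0.62)]
```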
  • Step S103 According to the plurality of to-be-determined start word vectors and the initial word vector group, predict a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.
  • the word vector end position is predicted by using the predicted word vector start position information.
  • the K to-be-determined start word vectors are each subjected to vector splicing with the initial word vector group to obtain K spliced word vector groups, which are input into the second position prediction module.
  • the second position prediction module is used to obtain the end position prediction probability value corresponding to each to-be-determined start word vector in the spliced word vector groups, and the N to-be-determined end word vectors with the largest end position prediction probability values corresponding to each to-be-determined start word vector are determined by traversal.
  • K and N may be set to be equal or unequal according to requirements of actual application scenarios.
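A minimal sketch of the splicing and end-position steps above: each candidate start vector is concatenated with every vector in the initial word vector group, and the N largest end-position probabilities are kept. The vector values and function names are illustrative assumptions.

```python
# Sketch of the splicing step (S103): a candidate start vector is
# concatenated with each word vector in the initial group, producing the
# input that a (hypothetical) second predictor scores for end positions.

def splice(start_vec, word_vectors):
    """Concatenate the candidate start vector with each word vector."""
    return [start_vec + wv for wv in word_vectors]

def top_n_ends(end_probs, n):
    """Keep the n (index, probability) pairs with the largest values."""
    ranked = sorted(enumerate(end_probs), key=lambda ip: ip[1], reverse=True)
    return ranked[:n]

word_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
start_vec = [0.9, 0.8]
spliced = splice(start_vec, word_vectors)  # 3 vectors of dimension 4
```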
  • Step S104 Determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • K*N initially extracted text combinations are obtained, and the to-be-determined extracted text combinations that meet the preset conditions are then determined among the K*N initially extracted text combinations. According to the product of the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end position prediction probability values of its corresponding to-be-determined end word vectors, the start word vector corresponding to the maximum product value is determined as the target start word vector and its corresponding end word vector as the target end word vector, thereby obtaining the target extracted text.
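The scoring of the K*N combinations can be sketched as follows: filter combinations by the preset condition (end index sufficiently after the start index) and pick the pair with the largest probability product. The threshold and probability values are illustrative.

```python
# Sketch of step S104: form K*N (start, end) combinations, keep those
# whose end index exceeds the start index by more than a threshold, and
# pick the pair whose probability product is largest.

def best_span(starts, ends, min_gap=2):
    """starts/ends: lists of (index, probability) candidate pairs."""
    best, best_score = None, -1.0
    for s_idx, s_p in starts:
        for e_idx, e_p in ends:
            if e_idx - s_idx <= min_gap:  # preset condition
                continue
            score = s_p * e_p             # probability product
            if score > best_score:
                best, best_score = (s_idx, e_idx), score
    return best, best_score

starts = [(3, 0.81), (1, 0.62)]
ends = [(4, 0.70), (9, 0.55)]
span, score = best_span(starts, ends)  # (3, 9) wins despite a weaker end
```

Note how the filter discards the superficially strongest pairing (start 3, end 4) because its span is too short, which is exactly the role of the preset condition described above.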
  • the initial word vector group of the text paragraph to be extracted, obtained by sentence recognition, is input into the pre-trained text extraction network model, and a plurality of to-be-determined start word vectors in the initial word vector group are predicted; according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted, and the target text to be extracted is then determined from the predicted to-be-determined start word vectors and their corresponding to-be-determined end word vectors.
  • this embodiment can improve the extraction accuracy of the text extraction network model based on the to-be-determined start word vectors and the initial word vector group, thereby extracting the target text in the text paragraph more accurately. Further, as a refinement and extension of the specific implementation of the above embodiment, in order to fully describe the specific implementation process of this embodiment, another text information extraction method is provided, as shown in Figure 2; the method includes:
  • Step S201 training an initial text extraction network model.
  • the constructed initial text extraction network model includes a first position prediction module and a second position prediction module in series, namely the span modules, which are used to predict the start position and end position of the target extracted text; a pre-training module GPT is added at the input of the first position prediction module to obtain the contextual semantic information of each word vector; and in the model training stage, a correction module is added so that the updates of the model parameters obtained from training the text extraction network model are more stable.
  • step S201 may specifically include: training the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples; when it is monitored that the current loss value of the first loss function in the initial text extraction network model drops to a preset percentage of the initial loss value, obtaining the first-stage text extraction network model; and using the first loss function and the second loss function corresponding to the preset correction module to perform secondary training on the first-stage text extraction network model according to the training samples with the position labels ignored, to obtain a trained text extraction network model.
  • the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
  • the specific steps for training the initial text extraction network model constructed include:
  • the GPT module in the initial text extraction network model is used to extract the semantic features of each word in the initial word vector group, obtaining the first word vector group containing contextual semantics, expressed as [h1, h2, ... h512].
  • the GPT model adopts a multi-layer Transformer architecture, and each Transformer layer contains a self-attention mechanism, which enables each word in [w1, w2, ... w512] to extract feature information from the other words and use the extracted feature information to update its own vector, obtaining the deep relationship between the other words and itself. That is, after multi-layer learning, each word vector in the initial word vector group obtains deep semantic information, such as grammar and syntax, from all other positions in the group, yielding the first word vector group [h1, h2, ... h512] containing contextual semantics.
  • the first position prediction module and the second position prediction module can be two position prediction modules, or one position prediction module that sequentially outputs the position prediction probability values of the start word vector and the end word vector; the position prediction module is not specifically limited here.
  • the training target is to maximize the product of the position prediction probability values of the target start position and the target end position. If it is monitored that the current loss value of the first loss function L1 of the position prediction module has dropped to 30% of the initial loss value, the target start position and target end position are set to empty, and a multi-task learning framework is used for secondary training to obtain a trained text extraction network model.
  • the multi-task learning framework uses the correction module to assist in training the position prediction modules: when the current loss value L_m of the first loss function L1 has dropped to 30% of the initial loss value, the first-stage text extraction network model is obtained, and the first-stage text extraction network model is then trained further according to the second loss function L2 corresponding to the correction module and the first loss function L1, to obtain the second-stage text extraction network model as the trained text extraction network model. Specifically:
  • the position prediction module is used to calculate the position prediction probability value of a word vector; its calculation formula is p = s(W·h + b), where W is the weight, b is the bias value, and s represents the sigmoid function.
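The variable definitions above (weight W, bias b, sigmoid s) suggest a position score of the form s(W·h + b); the following is a minimal sketch under that assumption, with illustrative values for W, h, and b.

```python
# Sketch of the position prediction probability for a single word
# vector h: sigmoid(W . h + b). Values are invented for illustration.
import math

def position_probability(W, h, b):
    """Return sigmoid(W . h + b) for one word vector."""
    z = sum(w * x for w, x in zip(W, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

p = position_probability(W=[0.5, -0.2], h=[1.0, 2.0], b=0.1)
```

Because the sigmoid squashes the score into (0, 1), each position receives an independent probability of being the start (or end), which is what the traversal for the top-K/top-N candidates ranks.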
  • the output results of the first and second position prediction modules are used for iterative training until the training ends, and the trained text extraction network model is obtained.
  • the maximum number of iterations for model training is N rounds, and N is 10000 by default, which can be customized by the user.
  • the loss function L1 is the negative logarithm of the position prediction probability values of the target start position and the target end position: L1 = -(log P_start + log P_end), where:
  • P_start represents the position prediction probability value of the word vector corresponding to the target start position output by the first position prediction module;
  • P_end represents the position prediction probability value of the word vector corresponding to the target end position output by the second position prediction module;
  • M is the size of the preset vocabulary, which is set to 50,000 word vectors;
  • y_hc indicates that the dimension value at index c of the current word vector h is 1 and the other values are 0, 0 ≤ c ≤ M;
  • p_hc indicates the probability of the current word vector h at index c, that is, the value of the c-th dimension of the digital vector after the above-mentioned softmax layer processing.
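The two training losses can be sketched from the definitions above: L1 as the negative log of the start and end position probability values, and, for the correction module, a one-hot cross-entropy over the M-word vocabulary. The exact form of the second loss is an assumption inferred from the y_hc and p_hc definitions, and the numeric values are illustrative.

```python
# Sketch of the two losses: L1 = -(log P_start + log P_end), and a
# one-hot cross-entropy standing in for the correction-module loss L2
# (an assumption based on the y_hc / p_hc definitions above).
import math

def loss_l1(p_start, p_end):
    """Negative log of the target start and end position probabilities."""
    return -(math.log(p_start) + math.log(p_end))

def loss_l2(y, p):
    """Cross-entropy -sum_c y_c * log p_c with a one-hot target y."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc > 0)

l1 = loss_l1(0.8, 0.5)
l2 = loss_l2(y=[0, 1, 0], p=[0.2, 0.7, 0.1])
```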
  • Multi-task training is realized by adding a correction module, which brings training closer to actual text extraction scenarios.
  • emptying the vector mark at the position of text extraction during the training process will lead to a sudden increase in the loss value of the model and increase the difficulty of learning.
  • a correction module was added to assist the model's first loss function, making the updates of the model parameters more stable.
  • the correction module is not included in the trained text extraction network model; it is only used to further optimize the model parameters in the position prediction modules.
  • the stochastic gradient descent algorithm SGD is used to iteratively update the network model parameters W and b in the initial text extraction network model to obtain a trained text extraction network model. Specifically, during the model training process, if the difference between the loss values L_m and L_m+1 obtained from two adjacent training rounds is less than the set value, that is, L_m - L_m+1 < 0.01, the model is considered to have converged, the training is determined to be over, and the trained text extraction network model is obtained.
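The convergence rule described above (stop when the loss drop between adjacent rounds falls below 0.01, up to a maximum of N rounds) can be sketched as follows; the loss values are simulated and the function name is illustrative.

```python
# Training-loop sketch of the convergence check: iterate until the loss
# drop between adjacent rounds is below eps, or max_rounds is reached.

def train_until_converged(losses, max_rounds=10000, eps=0.01):
    """Return the round at which L_m - L_{m+1} < eps first holds."""
    prev = losses[0]
    for m in range(1, min(len(losses), max_rounds)):
        cur = losses[m]
        if prev - cur < eps:  # L_m - L_{m+1} < 0.01 -> converged
            return m
        prev = cur
    return min(len(losses), max_rounds) - 1

# Simulated monotone loss values standing in for real SGD training.
simulated = [1.0, 0.6, 0.35, 0.2, 0.195, 0.193]
stop_round = train_until_converged(simulated)  # converges at round 4
```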
  • Step S202 perform word segmentation processing on the text paragraph to be extracted, and obtain a text paragraph after word segmentation processing.
  • Step S203 According to the preset sequence length, an initial data sequence including complete sentences is obtained.
  • Step S204 performing word vector conversion processing on the initial data sequence to obtain an initial word vector group.
  • Step S205 according to the initial word vector group, use the pre-trained module in the pre-trained text extraction network model to obtain the first word vector group containing contextual semantic information.
  • Step S206 Using the first position prediction module in the pre-trained text extraction network model, obtain, according to the start position prediction probability value of each word vector in the first word vector group, the multiple to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group.
  • Step S207 For each to-be-determined start word vector, concatenate the to-be-determined start word vector and the initial word vector group to obtain a concatenated word vector group.
  • step S207 may specifically include: splicing the to-be-determined start word vector with each word vector in the initial word vector group to obtain a concatenated word vector group.
  • Step S208 Using the second position prediction module in the pre-trained text extraction network model, obtain, according to the end position prediction probability value of each word vector in the spliced word vector group, a plurality of to-be-determined end word vectors used to represent the end position of text extraction in the spliced word vector group.
  • Step S209 Determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • Step S210 obtaining the to-be-determined extracted text combinations satisfying the preset conditions among the initial extracted text combinations.
  • the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start-position number of the to-be-determined start word vector is greater than a set threshold.
  • the to-be-determined extracted text combinations that meet the preset conditions among the K*N initially extracted text combinations are determined according to the preset conditions.
  • the preset condition is that the end position number corresponding to the to-be-determined end word vector in the to-be-determined text combination is greater than the start position number of the to-be-determined start word vector, and that the difference between the end position number and the start position number is greater than a set threshold (for example, 2); the preset condition is not specifically limited here.
  • Step S211 Determine the target start word vector and its corresponding target end word vector according to the product of the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end position prediction probability values of the multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector.
  • the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations is multiplied by each of the end position prediction probability values of its corresponding N to-be-determined end word vectors; by traversing each probability product value, the extracted text combination with the largest probability product determines the target start word vector and its corresponding target end word vector.
  • Step S212 according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector, to obtain the target extracted text.
  • the initial word vector group of the text paragraph to be extracted, obtained through sentence recognition, is input into the pre-trained text extraction network model, and multiple to-be-determined start word vectors in the initial word vector group are predicted; according to the multiple to-be-determined start word vectors and the initial word vector group, multiple to-be-determined end word vectors corresponding to each to-be-determined start word vector are predicted, and the target text to be extracted is then determined from the predicted to-be-determined start word vectors and their corresponding to-be-determined end word vectors.
  • the pre-trained text information extraction network model can effectively avoid the technical problems of existing methods: the regular-rule method depends strongly on manually defined rules and cannot completely extract complex or semantically incomplete text information; NER model recognition is prone to overfitting, and its extraction accuracy is low when faced with text containing new corpus information; and other mainstream methods extract words from text in isolation, resulting in low text extraction accuracy. The present method thereby effectively improves the accuracy of text information extraction.
  • an embodiment of the present application provides a text information extraction device, as shown in FIG. 4 , the device includes: a sentence recognition module 32, a first position prediction module 33, a second position prediction module Module 34 , determining module 35 .
  • the sentence recognition module 32 can be used to obtain the initial word vector group of the text paragraph to be extracted by performing sentence recognition on the text paragraph to be extracted.
  • the first position prediction module 33 may be configured to use a pre-trained text extraction network model to predict a plurality of to-be-determined start word vectors used to represent the start position of text extraction in the initial word vector group.
  • the second position prediction module 34 can be used to predict, according to the plurality of to-be-determined start word vectors and the initial word vector group, a plurality of to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors in the initial word vector group.
  • the determining module 35 may be configured to determine the target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • a model training module 31 is also included.
  • the sentence recognition module 32 includes a word segmentation processing unit 321 , a grouping division unit 322 , and a word vector conversion unit 323 .
  • the word segmentation processing unit 321 may be configured to perform word segmentation processing on the to-be-extracted text paragraph to obtain a text paragraph after word segmentation processing.
  • the grouping unit 322 can be configured to obtain an initial data sequence including a complete sentence according to a preset sequence length.
  • the word vector conversion unit 323 may be configured to perform word vector conversion processing on the initial data sequence to obtain an initial word vector group.
  • the first position prediction module 33 includes a pre-training unit 331 and a starting position prediction unit 332 .
  • the pre-training unit 331 may be configured to use the pre-trained module in the pre-trained text extraction network model according to the initial word vector group to obtain a first word vector group containing contextual semantic information.
  • the starting position prediction unit 332 can be used to use the first position prediction module in the pre-trained text extraction network model to predict the probability value according to the starting position of each word vector in the first word vector group, A plurality of to-be-determined initial word vectors used to characterize the starting position of text extraction in the initial word vector group are obtained.
  • the second position prediction module 34 includes a vector splicing unit 341 and an end position prediction unit 342 .
  • the vector concatenating unit 341 may be configured to concatenate the to-be-determined initial word vector and the initial word vector group for each to-be-determined initial word vector to obtain a concatenated word vector group.
  • the end position prediction unit 342 can be used to use the second position prediction module in the pre-trained text extraction network model to obtain the spliced word according to the predicted probability value of the end position of each word vector in the spliced word vector group. Multiple to-be-determined end word vectors used to represent the end position of text extraction in the vector group.
  • the determination module 35 includes a combination determination unit 351 , a preset condition unit 352 , a probability determination unit 353 , and a text extraction unit 354 .
  • the combination determining unit 351 may be configured to determine an initial extracted text combination according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to each of the to-be-determined start word vectors.
  • the preset condition unit 352 can be used to obtain the to-be-determined extracted text combinations that satisfy the preset condition among the initial extracted text combinations; wherein the preset condition at least includes: the difference between the end position number corresponding to the to-be-determined end word vector and the start position number of the to-be-determined start word vector is greater than a set threshold.
  • the probability value determination unit 353 can be used to determine the target start word vector and its corresponding target end word vector according to the product of the start position prediction probability value of each to-be-determined start word vector in the to-be-determined extracted text combinations and the end position prediction probability values of the plurality of to-be-determined end word vectors corresponding to each to-be-determined start word vector.
  • the text extraction unit 354 may be configured to obtain the target extracted text according to the start position number corresponding to the target start word vector and the end position number corresponding to the target end word vector.
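The selection flow of units 351 to 354 (enumerate start/end combinations, filter by the preset length condition, rank by probability product, slice out the target text) can be sketched as follows. The container shapes (`start_cands`, `end_cands`) are assumptions for illustration, not the patent's actual data structures:

```python
def select_target_span(start_cands, end_cands, tokens, min_len=1):
    """start_cands: {start_idx: p_start}; end_cands: {start_idx: list of
    (end_idx, p_end)} as produced by the two prediction modules.  Keeps
    only combinations where end_idx - start_idx exceeds min_len (the
    preset condition), then picks the combination with the largest
    probability product value."""
    best, best_score = None, -1.0
    for s, p_s in start_cands.items():
        for e, p_e in end_cands.get(s, []):
            if e - s <= min_len:      # preset condition filter
                continue
            score = p_s * p_e         # probability product value
            if score > best_score:
                best, best_score = (s, e), score
    if best is None:
        return None
    s, e = best
    return " ".join(tokens[s:e + 1])  # target extracted text
```

For example, with start candidate 2 (p=0.9) and its end candidates (3, 0.2) and (5, 0.7), the pair (2, 3) is discarded by the length condition and (2, 5) wins with product 0.63, yielding the tokens between positions 2 and 5.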
  • the model training module 31 can be used to train the initial text extraction network model.
  • the model training module 31 includes a first-stage training unit 311 , a training monitoring unit 312 , and a second-stage training unit 313 .
  • the first-stage training unit 311 may be configured to train the initial text extraction network model according to the position labels corresponding to the start position number and the end position number in the training samples.
  • the training monitoring unit 312 can be used to obtain the first-stage text extraction network model when it is monitored that the current loss value of the first loss function in the initial text extraction network model drops to a preset percentage of the initial loss value.
  • the second-stage training unit 313 may be configured to perform a second training of the first-stage text extraction network model, according to training samples with the position labels ignored, using the first loss function and the second loss function corresponding to the preset correction module, to obtain the trained text extraction network model.
  • the correction module is used to assist in training the first position prediction module and the second position prediction module in the first-stage text extraction network model.
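The two-stage schedule monitored by units 311 to 313 can be sketched as a simple loop. The callables `step_fn` and `stage2_fn` are placeholders for real optimizer steps over the two loss configurations; the stopping rule (first-stage loss falling to a preset percentage of the initial loss) follows the description above:

```python
def two_stage_training(step_fn, stage2_fn, preset_pct=0.1, max_steps=1000):
    """Stage one runs step_fn (returns the current first-loss value) until
    the loss drops to preset_pct of the initial loss value; stage two then
    runs stage2_fn, which is assumed to train with the first loss plus the
    correction module's second loss on samples ignoring position labels."""
    initial_loss = step_fn()      # loss value at the start of training
    loss, steps = initial_loss, 1
    while loss > preset_pct * initial_loss and steps < max_steps:
        loss = step_fn()
        steps += 1
    # Second stage: both losses, position labels ignored.
    final_loss = stage2_fn()
    return steps, final_loss
```

With a loss trace of 10.0, 5.0, 2.0, 0.9 and `preset_pct=0.1`, stage one stops after the fourth step, since 0.9 has fallen below 10 percent of the initial 10.0.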
  • the embodiment of the present application also provides a storage medium on which computer-readable instructions are stored; when the readable instructions are executed by a processor, the text information extraction method shown in Figure 1 and Figure 2 can be realized.
  • the technical solution of the present application can be embodied in the form of a software product, which can be stored in a storage medium (which can be a CD-ROM, a U disk, a mobile hard disk, etc.) and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in each implementation scenario of this application.
  • the embodiment of this application also provides a computer device, which can be a personal computer, a server, a network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the text information extraction method shown in Figure 1 and Figure 2.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (Radio Frequency, RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the like, and optional user interfaces may also include a USB interface, a card reader interface, and the like.
  • the network interface may include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface) and the like.
  • the structure of the computer device described above does not constitute a limitation on the physical device, which may include more or fewer components, combine some components, or use a different arrangement of components.
  • the storage medium may also include an operating system and a network communication module.
  • An operating system is a program that manages the hardware and software resources of a computer device and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication between various components inside the storage medium, and communicate with other hardware and software in the physical device.
  • this embodiment can use the trained text extraction network model to effectively avoid the low accuracy and low efficiency of existing technical schemes that rely on manual rules, while also solving the problem that such schemes only predict whether each word in the article is a quotation and cannot establish the necessary connections between words, thereby improving the flexibility and adaptability of text extraction and effectively improving the accuracy of text information extraction.
  • the accompanying drawing is only a schematic diagram of a preferred implementation scenario, and the modules or processes in the accompanying drawings are not necessarily required for implementing the present application.
  • the modules in the devices of the implementation scenario can be distributed among the devices of the implementation scenario as described, or can be located, with corresponding changes, in one or more devices different from those of the present implementation scenario.
  • the modules of the above implementation scenarios can be combined into one module, or can be further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed are a text information extraction method and apparatus, a storage medium, and a computer device, which can improve the accuracy of text information extraction. The method comprises: performing sentence recognition on a text paragraph to be extracted so as to obtain an initial word vector group of the text paragraph; predicting, by using a pre-trained text extraction network model, multiple to-be-determined start word vectors in the initial word vector group that are used to represent a start position of text extraction; predicting, according to the multiple to-be-determined start word vectors and the initial word vector group, multiple to-be-determined end word vectors in the initial word vector group that correspond to the to-be-determined start word vectors; and determining a target text to be extracted according to the predicted multiple to-be-determined start word vectors and the multiple to-be-determined end word vectors corresponding to the start word vectors. The present application is applicable to extracting a target text from a data set.
PCT/CN2022/071444 2021-08-30 2022-01-11 Procédé et appareil d'extraction d'informations textuelles, support de stockage et dispositif informatique WO2023029354A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111007458.6 2021-08-30
CN202111007458.6A CN113722436A (zh) 2021-08-30 2021-08-30 文本信息提取方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023029354A1 true WO2023029354A1 (fr) 2023-03-09

Family

ID=78679376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071444 WO2023029354A1 (fr) 2021-08-30 2022-01-11 Procédé et appareil d'extraction d'informations textuelles, support de stockage et dispositif informatique

Country Status (2)

Country Link
CN (1) CN113722436A (fr)
WO (1) WO2023029354A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016416A (zh) * 2023-03-24 2023-04-25 深圳市明源云科技有限公司 垃圾邮件识别方法、装置、设备及计算机可读存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722436A (zh) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 文本信息提取方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464641A (zh) * 2020-10-29 2021-03-09 平安科技(深圳)有限公司 基于bert的机器阅读理解方法、装置、设备及存储介质
CN113051926A (zh) * 2021-03-01 2021-06-29 北京百度网讯科技有限公司 文本抽取方法、设备和存储介质
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
CN113255327A (zh) * 2021-06-10 2021-08-13 腾讯科技(深圳)有限公司 文本处理方法、装置、电子设备及计算机可读存储介质
CN113268571A (zh) * 2021-07-21 2021-08-17 北京明略软件系统有限公司 一种确定段落中正确答案位置的方法、装置、设备及介质
CN113722436A (zh) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 文本信息提取方法、装置、计算机设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674271B (zh) * 2019-08-27 2023-01-06 腾讯科技(深圳)有限公司 一种问答处理方法及装置
CN111597314B (zh) * 2020-04-20 2023-01-17 科大讯飞股份有限公司 推理问答方法、装置以及设备
CN112464656B (zh) * 2020-11-30 2024-02-13 中国科学技术大学 关键词抽取方法、装置、电子设备和存储介质
CN112685548B (zh) * 2020-12-31 2023-09-08 科大讯飞(北京)有限公司 问题回答方法以及电子设备、存储装置
CN112446216B (zh) * 2021-02-01 2021-05-04 华东交通大学 一种融合中心词信息的嵌套命名实体识别方法与装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
CN112464641A (zh) * 2020-10-29 2021-03-09 平安科技(深圳)有限公司 基于bert的机器阅读理解方法、装置、设备及存储介质
CN113051926A (zh) * 2021-03-01 2021-06-29 北京百度网讯科技有限公司 文本抽取方法、设备和存储介质
CN113255327A (zh) * 2021-06-10 2021-08-13 腾讯科技(深圳)有限公司 文本处理方法、装置、电子设备及计算机可读存储介质
CN113268571A (zh) * 2021-07-21 2021-08-17 北京明略软件系统有限公司 一种确定段落中正确答案位置的方法、装置、设备及介质
CN113722436A (zh) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 文本信息提取方法、装置、计算机设备及存储介质

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016416A (zh) * 2023-03-24 2023-04-25 深圳市明源云科技有限公司 垃圾邮件识别方法、装置、设备及计算机可读存储介质

Also Published As

Publication number Publication date
CN113722436A (zh) 2021-11-30

Similar Documents

Publication Publication Date Title
KR102401942B1 (ko) 번역품질 평가 방법 및 장치
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
CN106777013B (zh) 对话管理方法和装置
WO2023029354A1 (fr) Procédé et appareil d'extraction d'informations textuelles, support de stockage et dispositif informatique
CN109271493A (zh) 一种语言文本处理方法、装置和存储介质
CN111144120A (zh) 一种训练语句的获取方法、装置、存储介质及电子设备
EP4113357A1 (fr) Procédé et appareil de reconnaissance d'entité, dispositif électronique et support d'enregistrement
CN111930792B (zh) 数据资源的标注方法、装置、存储介质及电子设备
JP7430820B2 (ja) ソートモデルのトレーニング方法及び装置、電子機器、コンピュータ可読記憶媒体、コンピュータプログラム
WO2024098533A1 (fr) Procédé, appareil et dispositif de recherche bidirectionnelle d'image-texte, et support de stockage lisible non volatil
WO2024098623A1 (fr) Procédé et appareil de récupération inter-média, procédé et appareil d'apprentissage de modèle de récupération inter-média, dispositif et système de récupération de recette
CN113158687B (zh) 语义的消歧方法及装置、存储介质、电子装置
CN112926308B (zh) 匹配正文的方法、装置、设备、存储介质以及程序产品
CN110808032A (zh) 一种语音识别方法、装置、计算机设备及存储介质
CN111563158A (zh) 文本排序方法、排序装置、服务器和计算机可读存储介质
CN110347802A (zh) 一种文本分析方法及装置
JP2023002690A (ja) セマンティックス認識方法、装置、電子機器及び記憶媒体
JP7181999B2 (ja) 検索方法及び検索装置、記憶媒体
CN116109732A (zh) 图像标注方法、装置、处理设备及存储介质
CN114817478A (zh) 基于文本的问答方法、装置、计算机设备及存储介质
CN114861758A (zh) 多模态数据处理方法、装置、电子设备及可读存储介质
CN114120166A (zh) 视频问答方法、装置、电子设备及存储介质
CN112906368B (zh) 行业文本增量方法、相关装置及计算机程序产品
CN110807097A (zh) 分析数据的方法和装置
CN111931503B (zh) 信息抽取方法及装置、设备、计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862505

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE