CN116882406A - Information extraction method, information extraction device, electronic apparatus, and readable storage medium

Info

Publication number: CN116882406A
Application number: CN202310943971.9A
Authority: CN
Inventors: 卢健
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-07-28
Filing date: 2023-07-28
Publication date: 2023-10-13

Abstract

The disclosure provides an information extraction method, an information extraction device, electronic equipment and a readable storage medium, which are applied to the technical fields of information extraction and finance. The method comprises the following steps: acquiring target text information to be extracted and key prompt information to be extracted; inputting target text information and key prompt information into an information extraction model, and outputting an extracted information probability combination corresponding to the key prompt information, wherein the extracted information probability combination comprises a starting point probability sequence and an end point probability sequence, the starting point probability sequence comprises probabilities of different starting point positions of the extracted information, and the end point probability sequence comprises probabilities of different end point positions of the extracted information; processing the extracted information probability combination by using a double pointer method to determine target position information based on the start probability sequence and the end probability sequence, wherein the target position information comprises a target start position and a target end position; and extracting target extraction text content corresponding to the key prompt information from the target text information based on the target position information.

Description

Information extraction method, information extraction device, electronic apparatus, and readable storage medium

Technical Field

The present disclosure relates to the field of information extraction and finance, and in particular, to an information extraction method, an information extraction apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Text entity extraction is an important component of information extraction methods and task-oriented dialog systems. It aims to extract text entities from a given text automatically, helping people quickly find what they need in a vast amount of information and coping with the challenges presented by the information explosion. Among these techniques, information entity extraction is one of the most valuable in information extraction; its main task is to identify proper names and meaningful numeric phrases appearing in text and to categorize them.

However, when information extraction is performed using the nearest-selection approach of the related art, accurate start and end positions cannot be obtained. As a result, the finally extracted content differs considerably from the actual content in the original text, and the accuracy is low.

Disclosure of Invention

In view of the above, the present disclosure provides an information extraction method, an information extraction apparatus, an electronic device, a computer-readable storage medium, and a computer program product that improve extraction accuracy.

According to a first aspect of the present disclosure, there is provided an information extraction method, including:

acquiring target text information to be extracted and key prompt information to be extracted;

inputting the target text information and the key prompt information into an information extraction model, and outputting an extracted information probability combination corresponding to the key prompt information, wherein the extracted information probability combination comprises a starting point probability sequence and an end point probability sequence, the starting point probability sequence comprises probabilities of different starting point positions of the extracted information, and the end point probability sequence comprises probabilities of different end point positions of the extracted information;

processing the extracted information probability combination by using a double pointer method to determine target position information based on the starting point probability sequence and the end point probability sequence, wherein the target position information comprises a target starting point position and a target end point position;

and extracting target extraction text content corresponding to the key prompt information from the target text information based on the target position information.

According to an embodiment of the present disclosure, processing the above-mentioned extracted information probability combination using a double pointer method to determine target position information based on the above-mentioned start probability sequence and the above-mentioned end probability sequence includes:

traversing the starting point probability sequence and the ending point probability sequence by using the double pointer method to obtain a plurality of first combinations and a plurality of second combinations, wherein the first combinations comprise a plurality of initial starting point position probabilities, and the second combinations comprise a plurality of initial ending point position probabilities;

selecting a third combination from the first combinations and the second combinations based on a preset selection rule, wherein the third combination comprises a plurality of transition starting point position probabilities and a plurality of transition ending point position probabilities;

and determining the largest probability values among the plurality of transition start point position probabilities and the plurality of transition end point position probabilities as the probabilities of the target start point position and the target end point position.

According to an embodiment of the present disclosure, the start probability sequence includes a plurality of start probabilities, and the end probability sequence includes a plurality of end probabilities;

the method for traversing the starting point probability sequence and the ending point probability sequence by using the double pointer method to obtain a plurality of first combinations and a plurality of second combinations comprises the following steps:

traversing the starting point probability sequence and the end point probability sequence by using a pointer respectively to obtain a first array and a second array, wherein the first array comprises a plurality of first starting point probabilities, and the second array comprises a plurality of first end point probabilities;

acquiring a starting point position variable and an end point position variable corresponding to a current pointer under the condition that pointer information of the starting point probability sequence and pointer information of the end point probability sequence meet a preset length condition, wherein the starting point position variable comprises a starting point position and a starting point probability, and the end point position variable comprises an end point position and an end point probability;

under the condition that the starting point position is the same as the ending point position and the state variable is not a preset value, determining the first starting point probability with the maximum value in the first array and the first ending point probability with the maximum value in the second array as the second starting point probability and the second ending point probability, respectively;

generating a first combination and a second combination according to a plurality of the second start probabilities and a plurality of the second end probabilities;

and adjusting the pointer information and the state variable of the two pointers, so as to iteratively acquire a start position variable and an end position variable using the adjusted pointer information and state variable and iteratively determine a first combination and a second combination.

According to an embodiment of the present disclosure, before the adjustment, further comprising:

and storing the first starting point probability and the first end point probability corresponding to the current pointer in a first array and a second array respectively under the condition that the starting point position is the same as the end point position and the state variable is a preset value.

According to an embodiment of the present disclosure, the information extraction method further includes:

storing a first starting point probability corresponding to the current pointer in a first array under the condition that the starting point position is smaller than the end point position and the state variable is a preset value;

under the condition that the starting point position is smaller than the end point position and the state variable is not a preset value, determining the first starting point probability with the maximum value in the first array and the first end point probability with the maximum value in the second array as the second starting point probability and the second end point probability, respectively;

adjusting pointer information and state variables corresponding to the starting point positions to iteratively acquire starting point position variables and ending point position variables by using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination;

storing a first end probability corresponding to the current pointer in a second array under the condition that the starting point position is larger than the end point position;

and adjusting pointer information and state variables corresponding to the end positions to iteratively acquire the start position variables and the end position variables by using the adjusted pointer information and state variables and iteratively determine the first combination and the second combination.
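One possible reading of the traversal described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes each probability sequence has already been reduced to a list of (position, probability) candidates, that the preset value of the state variable is 0 (meaning the last stored candidate was a start), and that any end candidates left over once the start sequence is exhausted belong to the last open group. The names `pair_spans`, `Candidate`, and `flush` are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (position id, probability); hypothetical layout


def pair_spans(starts: List[Candidate],
               ends: List[Candidate]) -> List[Tuple[Candidate, Candidate]]:
    """Group runs of start/end candidates and keep the most probable of each."""
    starts, ends = sorted(starts), sorted(ends)
    first_array: List[Candidate] = []   # start candidates of the current group
    second_array: List[Candidate] = []  # end candidates of the current group
    pairs: List[Tuple[Candidate, Candidate]] = []
    flag = 0  # assumed preset value 0: the last stored candidate was a start
    start_pointer = end_pointer = 0

    def flush() -> None:
        # Keep the highest-probability start and end of the finished group.
        if first_array and second_array:
            best_start = max(first_array, key=lambda c: c[1])
            best_end = max(second_array, key=lambda c: c[1])
            pairs.append((best_start, best_end))
        first_array.clear()
        second_array.clear()

    # Preset length condition: both pointers are still inside their sequences.
    while start_pointer < len(starts) and end_pointer < len(ends):
        start, end = starts[start_pointer], ends[end_pointer]
        if start[0] <= end[0]:
            if flag != 0:  # a new run of starts begins: close the old group
                flush()
                flag = 0
            first_array.append(start)
            start_pointer += 1
            if start[0] == end[0]:  # same position: store the end as well
                second_array.append(end)
                end_pointer += 1
                flag = 1
        else:
            second_array.append(end)
            end_pointer += 1
            flag = 1

    # Assumption: leftover ends still belong to the last open group.
    second_array.extend(ends[end_pointer:])
    flush()
    return pairs
```

With start candidates at positions 2, 3, and 10 and end candidates at positions 5, 6, and 12, this sketch forms two groups and keeps the most probable start and end of each group.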

According to an embodiment of the present disclosure, selecting a third combination from the first combination and the second combination, respectively, based on a preset selection rule includes:

for each first combination and each second combination, determining, based on a probability maximum principle, the initial starting point position probability with the maximum value in the first combination and the initial end point position probability with the maximum value in the second combination as a transition starting point position probability and a transition end point position probability, respectively;

and associating the plurality of transition start position probabilities with the plurality of transition end position probabilities, and storing the associated probabilities in the third combination.

According to an embodiment of the present disclosure, determining a probability value having the largest probability value among the plurality of transition start point position probabilities and the plurality of transition end point position probabilities as the probability of the target start point position and the target end point position includes:

sorting the transition starting point position probabilities in the third combination and storing the sorted transition starting point position probabilities in a starting point set;

sorting the transition end point position probabilities in the third combination and storing the sorted transition end point position probabilities in an end point set;

and respectively determining the transition starting point position probability and the transition end point position probability with the highest probability in the starting point set and the end point set as the probabilities of the target starting point position and the target end point position based on the principle of maximum probability.
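The sorting and selection steps above can be illustrated with a short sketch. It assumes the third combination is held as a list of associated (start, end) candidate pairs, each candidate being a (position, probability) tuple; that layout and the name `select_target` are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (position id, probability); hypothetical layout


def select_target(third_combination: List[Tuple[Candidate, Candidate]]
                  ) -> Tuple[Candidate, Candidate]:
    """Sort transition starts and ends by probability and take the top of each."""
    if not third_combination:
        raise ValueError("third combination is empty")
    # Start set: transition start candidates, highest probability first.
    start_set = sorted((s for s, _ in third_combination),
                       key=lambda c: c[1], reverse=True)
    # End set: transition end candidates, highest probability first.
    end_set = sorted((e for _, e in third_combination),
                     key=lambda c: c[1], reverse=True)
    return start_set[0], end_set[0]
```

In this sketch the start and the end are chosen independently of each other, so it does not check that the selected start precedes the selected end.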

According to an embodiment of the present disclosure, the above-described target text information is generated by:

acquiring initial text information;

under the condition that the text length of the initial text information exceeds a preset length threshold value, dividing the initial text information based on a preset dividing rule to obtain a plurality of pieces of target text information, so that the information extraction model generates an extracted information probability combination corresponding to each piece of target text information;

and under the condition that the plurality of extraction information probability combinations correspond to the key prompt information, carrying out fusion processing on the plurality of extraction information probability combinations to obtain a fused extraction information probability combination, and processing the fused extraction information probability combination by using the double pointer method.
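The division and fusion steps can be sketched as below. The disclosure does not specify the dividing rule or the fusion operation here, so both are assumptions: the sketch splits into fixed-size chunks and fuses by shifting each chunk-local position by the chunk offset before merging. The names `split_text` and `fuse` are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (position id, probability); hypothetical layout


def split_text(text: str, max_len: int) -> List[Tuple[int, str]]:
    """Return (offset, chunk) pairs; fixed-size splitting is an assumed rule."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [(0, text)]
    return [(i, text[i:i + max_len]) for i in range(0, len(text), max_len)]


def fuse(chunk_results: List[Tuple[int, List[Candidate], List[Candidate]]]
         ) -> Tuple[List[Candidate], List[Candidate]]:
    """Shift chunk-local positions by each chunk offset and merge them."""
    starts: List[Candidate] = []
    ends: List[Candidate] = []
    for offset, chunk_starts, chunk_ends in chunk_results:
        starts.extend((pos + offset, p) for pos, p in chunk_starts)
        ends.extend((pos + offset, p) for pos, p in chunk_ends)
    # Sorted by position so a two-pointer traversal can consume them directly.
    return sorted(starts), sorted(ends)
```

Keeping the offset next to each chunk lets the fused positions refer to the initial text, so the final extraction can slice the original document rather than an individual chunk.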

A second aspect of the present disclosure provides an information extraction apparatus, including:

the acquisition module is used for acquiring target text information to be extracted and key prompt information to be extracted;

the processing module is used for inputting the target text information and the key prompt information into an information extraction model and outputting an extracted information probability combination corresponding to the key prompt information, wherein the extracted information probability combination comprises a starting point probability sequence and an end point probability sequence, the starting point probability sequence comprises probabilities of different starting point positions of the extracted information, and the end point probability sequence comprises probabilities of different end point positions of the extracted information;

the determining module is used for processing the extracted information probability combination by using a double pointer method, so as to determine target position information based on the start probability sequence and the end probability sequence, wherein the target position information comprises a target start position and a target end position;

and the extraction module is used for extracting target extraction text content corresponding to the key prompt information from the target text information based on the target position information.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the information extraction method described above.

A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described information extraction method.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described information extraction method.

According to the embodiments of the disclosure, the information extraction model processes the target text information and the key prompt information and outputs the extracted information probability combination, which is then processed using the double pointer method. During traversal, the two pointers determine target position information that is closer to the true position in the text, so that the text content corresponding to the key prompt information can be obtained from the original text according to the target position information. Because the target start position and the target end position that make up the target position information are selected from the probability sequences by the two pointers during traversal, the method obtains more accurate text content and effectively improves the accuracy of information extraction.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario diagram of an information extraction method according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of an information extraction method according to an embodiment of the disclosure;

FIG. 3 schematically illustrates a sequential schematic of a start probability sequence and an end probability sequence according to an embodiment of the disclosure;

FIG. 4 schematically illustrates a classification scheme of target text information according to an embodiment of the disclosure;

FIG. 5 schematically illustrates a block diagram of a structure of an information extraction apparatus according to an embodiment of the present disclosure; and

FIG. 6 schematically illustrates a block diagram of an electronic device adapted to implement the information extraction method according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

The embodiment of the disclosure provides an information extraction method, an information extraction device, electronic equipment and a readable storage medium, wherein the method comprises the steps of obtaining target text information to be extracted and key prompt information to be extracted; inputting target text information and key prompt information into an information extraction model, and outputting an extracted information probability combination corresponding to the key prompt information, wherein the extracted information probability combination comprises a starting point probability sequence and an end point probability sequence, the starting point probability sequence comprises probabilities of different starting point positions of the extracted information, and the end point probability sequence comprises probabilities of different end point positions of the extracted information; processing the extracted information probability combination by using a double pointer method to determine target position information based on a start probability sequence and an end probability sequence, wherein the target position information comprises a target start position and a target end position; and extracting target extraction text content corresponding to the key prompt information from the target text information based on the target position information.

It should be noted that the information extraction method and apparatus provided in the present disclosure may be used in a financial field, for example, a financial institution such as a bank, and may also be used in any field other than the financial field, for example, an institution such as a hospital, and thus, an application field of the information extraction method and apparatus provided in the present disclosure is not limited.

In the technical solution of the present disclosure, the user information involved (including but not limited to user personal information, user image information, and user equipment information such as location information) and the data involved (including but not limited to data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of the related data comply with the relevant laws, regulations, and standards of the relevant countries and regions; necessary security measures are adopted; public order and good customs are not violated; and corresponding operation entries are provided for the user to choose to authorize or refuse.

Fig. 1 schematically illustrates an application scenario diagram of an information extraction method according to an embodiment of the present disclosure.

As shown in fig. 1, an application scenario 100 according to this embodiment may include information extraction from credit approval documents and the like. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the information extraction method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the information extraction apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The information extraction method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the information extraction apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically illustrates a flow chart of an information extraction method according to an embodiment of the present disclosure.

As shown in fig. 2, the information extraction method of this embodiment includes operations S210 to S240.

In operation S210, target text information to be extracted and key prompt information to be extracted are acquired;

in operation S220, the target text information and the key prompt information are input into the information extraction model, and an extracted information probability combination corresponding to the key prompt information is output, wherein the extracted information probability combination includes a start probability sequence and an end probability sequence, the start probability sequence includes probabilities of different start positions of the extracted information, and the end probability sequence includes probabilities of different end positions of the extracted information;

in operation S230, the extracted information probability combination is processed using the double pointer method to determine target position information based on the start probability sequence and the end probability sequence, wherein the target position information includes a target start position and a target end position;

in operation S240, target extraction text content corresponding to the key prompt information is extracted from the target text information based on the target position information.

According to the embodiment of the disclosure, the target text information may be a document with text content, for example a document in Word, PDF, or a similar format, or a document obtained by performing image recognition on a picture. The kind of key prompt information is set according to the required content; for example, in the field of sports, the key prompt information may be persons, event names, scores, dates, and the like. The corresponding target text information may be "On the morning of × month × day, in the women's freestyle skiing big air final, player A took first place with a score of 188.25!".

According to an embodiment of the present disclosure, the information extraction model is obtained by training based on a pre-trained language model. Specifically, the information extraction model is a Seq2Seq (sequence-to-sequence) model whose structure includes an encoding network (Encoder) and a decoding network (Decoder).

According to the embodiment of the disclosure, the target text information and the key prompt information are input into the information extraction model, and an extracted information probability combination corresponding to the key prompt information is output, wherein the extracted information probability combination comprises a start probability sequence and an end probability sequence. The start probability sequence has a plurality of start probabilities, and each start probability represents the probability value of the corresponding start position, for example, how closely the start position start_id matches the true start of the text content to be extracted. The start_id is represented by a character number in the target text information; for example, start_id=6 indicates that the 6th character in the target text information is the start of the text content corresponding to the key prompt information.

According to an embodiment of the present disclosure, the extracted information probability combination is processed using a double pointer method to iteratively determine target location information based on a start probability sequence and an end probability sequence, and after determining target location information of text content corresponding to the key hint information, the text content may be extracted from the target text information through the target location information.
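The data handled in operations S220 and S240 can be pictured with a minimal sketch. It assumes the model returns one probability per character, that candidates are kept when their probability exceeds a threshold, that positions are 0-based, and that the end position is inclusive. None of these conventions is stated in the disclosure, and the names `to_candidates` and `extract_span` are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (position id, probability); hypothetical layout


def to_candidates(probs: List[float], threshold: float = 0.5) -> List[Candidate]:
    """Keep (character index, probability) for entries above the threshold."""
    return [(i, p) for i, p in enumerate(probs) if p > threshold]


def extract_span(text: str, start_id: int, end_id: int) -> str:
    """Slice the text between two inclusive, 0-based character positions."""
    if not 0 <= start_id <= end_id < len(text):
        raise ValueError("invalid target position")
    return text[start_id:end_id + 1]
```

For instance, with the text "Player A scored 188.25" and target positions 16 and 21, `extract_span` returns "188.25".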

Fig. 3 schematically illustrates a sequential schematic of a start probability sequence and an end probability sequence according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, processing the extracted information probability combination using the double pointer method to determine target position information based on the start probability sequence and the end probability sequence includes:

traversing the starting point probability sequence and the end point probability sequence by using a double pointer method to obtain a plurality of first combinations and a plurality of second combinations, wherein the first combinations comprise a plurality of initial starting point position probabilities, and the second combinations comprise a plurality of initial end point position probabilities;

selecting a third combination from the first combinations and the second combinations based on a preset selection rule, wherein the third combination comprises a plurality of transition start point position probabilities and a plurality of transition end point position probabilities;

and determining the probability value which is the largest in the probabilities of the plurality of transition starting point positions and the plurality of transition end point positions as the probabilities of the target starting point position and the target end point position.

According to an embodiment of the present disclosure, the start probability sequence and the end probability sequence shown in fig. 3 are traversed by using the double pointer method to obtain a plurality of first combinations sp¹ and a plurality of second combinations ep¹. In fig. 3, a circle represents a start position probability of the start probability sequence, a square represents an end position probability of the end probability sequence, and the items are arranged by position id from small to large. A third combination is then selected from the plurality of first combinations sp¹ and the plurality of second combinations ep¹ based on a preset selection rule. The preset selection rule may refer to determining the initial start position probability with the highest probability in each first combination sp¹ as one transition start position probability in the third combination, and determining the initial end position probability with the highest probability in each second combination ep¹ as one transition end position probability in the third combination.

It should be noted that the preset selection rule may be adjusted according to actual situations; for example, the second-largest probability may instead be determined as the transition position probability.
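An adjustable selection rule of this kind can be sketched as a function that picks a candidate by rank within one combination. The `rank` parameter, the fallback to the lowest-ranked candidate when a group is too small, and the name `pick_by_rank` are assumptions made for illustration.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (position id, probability); hypothetical layout


def pick_by_rank(group: List[Candidate], rank: int = 0) -> Candidate:
    """Return the candidate with the (rank + 1)-th highest probability."""
    if not group:
        raise ValueError("group is empty")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    ordered = sorted(group, key=lambda c: c[1], reverse=True)
    # Assumed fallback: a group with too few members yields its last candidate.
    return ordered[min(rank, len(ordered) - 1)]
```

Calling it with `rank=0` corresponds to the maximum-probability rule, and `rank=1` to choosing the second-largest value.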

According to the embodiments of the present disclosure, for the plurality of transition start point position probabilities and the plurality of transition end point position probabilities in the third combination, the transition start point position probability with the largest probability value and the transition end point position probability with the largest probability value may be determined as the probabilities of the target start point position and the target end point position. Thereby, text content corresponding to the key prompt information can be extracted from the target text information based on the target end point position corresponding to the probability.

According to the embodiment of the disclosure, the first combinations and the second combinations are extracted iteratively by the double pointer method, the third combination is selected from them, and the corresponding target position information is finally determined from the third combination.

traversing the starting point probability sequence and the end point probability sequence by using a double pointer method to obtain a plurality of first combinations and a plurality of second combinations, wherein the method comprises the following steps:

under the condition that the pointer information of the start probability sequence and the pointer information of the end probability sequence meet the preset length condition, acquiring a start position variable and an end position variable corresponding to the current pointer, wherein the start position variable comprises a start position and start probability, and the end position variable comprises an end position and end probability;

under the condition that the starting point position is the same as the end point position and the state variable is not a preset value, respectively determining the first starting point probability and the first end point probability of the probability maximum value in the first array and the second array as the second starting point probability and the second end point probability;

generating a first combination and a second combination based on the plurality of second start probabilities and the plurality of second end probabilities;

Pointer information and state variables of the two pointers are adjusted to iteratively acquire a start position variable and an end position variable using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination.

According to an embodiment of the present disclosure, a pointer for traversing the start positions start_ids in the start probability sequence is first initialized with pointer information start_pointer=0, and a pointer for traversing the end positions end_ids in the end probability sequence is initialized with pointer information end_pointer=0. A state variable flag is used for judging whether the previous id was loaded into the first array sp² or the second array ep², wherein the state variable flag=0 represents loading into the first array sp². The first array sp² is used for storing start_ids, where a start position start_id corresponds to a first start probability; the second array ep² is used for storing end_ids, where an end position end_id corresponds to a first end probability.

According to the embodiment of the disclosure, when the pointer information start_pointer of the start probability sequence is less than or equal to the length of the start probability sequence, length(start_ids), and the pointer information end_pointer of the end probability sequence is less than or equal to the length of the end probability sequence, length(end_ids), i.e., neither pointer has reached the tail of its sequence, the start position variable start_id¹ and the end position variable end_id¹ corresponding to the current pointers are obtained. Each of the two variables includes the position and the probability of its corresponding text. See formula (1) and formula (2).

start_id¹ = start_ids[start_pointer] (1)

end_id¹ = end_ids[end_pointer] (2)

According to an embodiment of the present disclosure, in the case where the start position start_id and the end position end_id are the same and the state variable flag is not the preset value (e.g., 0), the first start probability and the first end probability with the maximum probability in the first array sp² and the second array ep², respectively, are determined as the second start probability start_id and the second end probability end_id. The two are combined into a combination (start_id, end_id) and stored. The first array sp² is then set to [start_id] and the second array ep² is set to [end_id].

According to an embodiment of the present disclosure, a first combination sp¹ and a second combination ep¹ are generated from the plurality of second start probabilities start_id and the plurality of second end probabilities end_id.

According to embodiments of the present disclosure, the pointer information and the state variable of the two pointers are adjusted, e.g., start_pointer is incremented by 1, end_pointer is incremented by 1, and flag is set to 1, so that the start position variable and the end position variable are iteratively acquired using the adjusted pointer information and state variable, and the first combination sp¹ and the second combination ep¹ are iteratively generated.

According to an embodiment of the present disclosure, before the adjustment of the pointer information and the state variable, further comprising:

and under the condition that the starting point position is the same as the ending point position and the state variable is a preset value, storing the first starting point probability and the first ending point probability corresponding to the current pointer in the first array and the second array respectively.

According to an embodiment of the present disclosure, when the start position start_id and the end position end_id are the same and the state variable flag is the preset value (e.g., 0), the first start probability start_id and the first end probability end_id corresponding to the current pointers are stored in the first array sp² and the second array ep², respectively.

under the condition that the starting point position is smaller than the ending point position and the state variable is not a preset value, respectively determining the first starting point probability and the first ending point probability of the probability maximum value in the first array and the second array as the second starting point probability and the second ending point probability;

pointer information and state variables corresponding to the start position are adjusted to iteratively acquire the start position variable and the end position variable using the adjusted pointer information and state variables and iteratively determine the first combination and the second combination.

According to an embodiment of the present disclosure, if the coordinate corresponding to the first start probability start_id is smaller than the coordinate corresponding to the first end probability end_id and the state variable flag is the preset value, the current first start probability start_id is placed into the first array sp².

According to an embodiment of the present disclosure, if the state variable flag is not the preset value and neither the first array sp² nor the second array ep² is empty, the first start probability and the first end probability with the largest probability are selected from the first array sp² and the second array ep², respectively, and determined as the second start probability start_id and the second end probability end_id; the two are combined into a combination (start_id, end_id) and stored. The first array sp² is then set to [start_id] and the second array ep² is set to [].

According to an embodiment of the present disclosure, the pointer information and the state variable corresponding to the start position are adjusted; for example, the pointer information start_pointer corresponding to the start position is incremented by 1 and the state variable flag is set to 0, so that the start position variable and the end position variable are iteratively acquired using the adjusted pointer information and state variable, and the first combination sp¹ and the second combination ep¹ are iteratively determined.

pointer information and state variables corresponding to the end positions are adjusted to iteratively acquire start position variables and end position variables using the adjusted pointer information and state variables and iteratively determine first and second combinations.

According to an embodiment of the present disclosure, if the coordinate corresponding to the first start probability start_id is greater than the coordinate corresponding to the first end probability end_id, the first end probability end_id corresponding to the current pointer is placed into the second array ep², and the pointer information and the state variable corresponding to the end position are adjusted; for example, the pointer information end_pointer corresponding to the end position is incremented by 1 and the state variable flag is set to 1. The start position variable and the end position variable are then iteratively acquired using the adjusted pointer information and state variable, and the first combination sp¹ and the second combination ep¹ are iteratively determined.
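The three branches of the traversal (equal positions, start before end, start after end) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: each candidate is a hypothetical (position, probability) tuple, both input lists are assumed to be sorted by position, the loop bound is a strict `<` because the pointers are 0-based, flag=0 is taken as the preset value, and the handling of leftover candidates after the loop is an assumption because the text does not describe it. The names `best` and `pair_spans` are illustrative.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # hypothetical (position, probability) pair


def best(cands: List[Candidate]) -> Candidate:
    """Return the candidate with the highest probability."""
    return max(cands, key=lambda c: c[1])


def pair_spans(start_ids: List[Candidate],
               end_ids: List[Candidate]) -> List[Tuple[Candidate, Candidate]]:
    """Group runs of starts followed by runs of ends; keep the best of each."""
    pairs: List[Tuple[Candidate, Candidate]] = []
    sp: List[Candidate] = []   # first array: starts of the open group
    ep: List[Candidate] = []   # second array: ends of the open group
    start_pointer = end_pointer = 0
    flag = 0                   # 0: previous id went into sp, 1: into ep

    while start_pointer < len(start_ids) and end_pointer < len(end_ids):
        s, e = start_ids[start_pointer], end_ids[end_pointer]
        if s[0] == e[0]:                      # same position
            if flag != 0:
                if sp and ep:
                    pairs.append((best(sp), best(ep)))
                sp, ep = [s], [e]             # open a new group
            else:
                sp.append(s)
                ep.append(e)
            start_pointer += 1
            end_pointer += 1
            flag = 1
        elif s[0] < e[0]:                     # start comes first
            if flag == 0:
                sp.append(s)
            else:
                if sp and ep:
                    pairs.append((best(sp), best(ep)))
                sp, ep = [s], []              # open a new group
            start_pointer += 1
            flag = 0
        else:                                 # end comes first
            ep.append(e)
            end_pointer += 1
            flag = 1

    # Assumption: leftover ends belong to the open group; leftover starts
    # have no end after them and are dropped.
    ep.extend(end_ids[end_pointer:])
    if sp and ep:
        pairs.append((best(sp), best(ep)))
    return pairs


starts = [(0, 0.2), (1, 0.9), (7, 0.6)]
ends = [(3, 0.8), (4, 0.3), (9, 0.7)]
print(pair_spans(starts, ends))
# [((1, 0.9), (3, 0.8)), ((7, 0.6), (9, 0.7))]
```

Each pointer only moves forward, so every candidate is visited once.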

According to an embodiment of the present disclosure, selecting a third combination from the first combination and the second combination, respectively, based on a preset selection rule, includes:

The plurality of transition start point position probabilities and the plurality of transition end point position probabilities are correlated and stored in a third combination.

According to an embodiment of the present disclosure, if neither the first combination sp¹ nor the second combination ep¹ is empty, the initial start position probability start_id and the initial end position probability end_id with the largest probability are selected from the first combination sp¹ and the second combination ep¹, respectively, and determined as the transition start position probability start_id₁ and the transition end position probability end_id₁; the two are combined into a combination (start_id₁, end_id₁) and stored in the third combination.

According to an embodiment of the present disclosure, determining a probability that a probability value is largest among a plurality of transition start point position probabilities and a plurality of transition end point position probabilities as a probability of a target start point position and a target end point position includes:

the transition starting point position probabilities in the third combination are sorted and stored in a starting point set;

and respectively determining the transition starting point position probability and the transition end point position probability with the maximum probability in the starting point set and the end point set as the probabilities of the target starting point position and the target end point position based on the principle of maximum probability.

According to an embodiment of the present disclosure, the plurality of transition start point position probabilities start_id₁ in the third combination are sorted and then stored in the start point set, and the plurality of transition end point position probabilities end_id₁ in the third combination are sorted and then stored in the end point set.

According to an embodiment of the present disclosure, based on the probability maximization principle, the transition start position probability start_id₁ and the transition end position probability end_id₁ with the highest probability in the start set and the end set, respectively, are determined as the probabilities of the target start point position and the target end point position. The start set and the end set are then cleared, and selection continues under the same conditions until the entire start probability sequence and end probability sequence have been traversed.
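The sort-then-pick step can be illustrated with a short Python sketch. It assumes the third combination is a list of hypothetical ((start position, probability), (end position, probability)) pairs; the function name `pick_target` is illustrative. Note that, as in the text, the start and the end are chosen independently, so this sketch does not check that the chosen start precedes the chosen end.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[int, float]  # hypothetical (position, probability) pair


def pick_target(
    third_combination: List[Tuple[Candidate, Candidate]],
) -> Optional[Tuple[int, int]]:
    """Return (target start position, target end position), or None if empty."""
    if not third_combination:
        return None
    # Sort both sets by probability, highest first.
    start_set = sorted((s for s, _ in third_combination),
                       key=lambda c: c[1], reverse=True)
    end_set = sorted((e for _, e in third_combination),
                     key=lambda c: c[1], reverse=True)
    # Probability maximization: the head of each sorted set wins.
    return start_set[0][0], end_set[0][0]


third = [((1, 0.9), (3, 0.8)), ((7, 0.6), (9, 0.7))]
print(pick_target(third))  # (1, 3)
```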

According to the embodiment of the disclosure, the time complexity of the information extraction method is O(m+n), where m is the sequence length of the start probability sequence start_ids and n is the sequence length of the end probability sequence end_ids. Because each element of the two sequences is visited only once, this is among the fastest logic implementations, and it improves information extraction efficiency while preserving information extraction accuracy. The double pointer method effectively merges the two groups of ids into one array, and the optimal combinations are then grouped and selected according to the categories of the previous id and the current id in that array.
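The idea of merging the two groups of ids into one array can be sketched as a standard two-pointer merge. This is an illustrative sketch only: the tags "start" and "end", the tuple layout, and the rule that a start is emitted before an end at the same position are assumptions, and `merge_tagged` is a hypothetical name.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]        # hypothetical (position, probability)
Tagged = Tuple[str, int, float]      # (category, position, probability)


def merge_tagged(start_ids: List[Candidate],
                 end_ids: List[Candidate]) -> List[Tagged]:
    """Merge two position-sorted lists into one tagged list in O(m+n)."""
    merged: List[Tagged] = []
    i = j = 0
    while i < len(start_ids) and j < len(end_ids):
        if start_ids[i][0] <= end_ids[j][0]:
            merged.append(("start", *start_ids[i]))
            i += 1
        else:
            merged.append(("end", *end_ids[j]))
            j += 1
    # At most one of the two tails is non-empty.
    merged.extend(("start", *c) for c in start_ids[i:])
    merged.extend(("end", *c) for c in end_ids[j:])
    return merged


merged = merge_tagged([(0, 0.2), (1, 0.9), (7, 0.6)],
                      [(3, 0.8), (4, 0.3), (9, 0.7)])
print([tag for tag, _, _ in merged])
# ['start', 'start', 'end', 'end', 'start', 'end']
```

A change from "end" to "start" between neighbouring entries marks the boundary between two groups, which is the role the state variable flag plays in the traversal.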

According to an embodiment of the present disclosure, the target text information is generated by:

acquiring initial text information;

and under the condition that the text length of the initial text information exceeds a preset length threshold value, dividing the initial text information based on a preset dividing rule to obtain a plurality of target text information, so that the information extraction model generates an extracted information probability combination corresponding to each target text information according to the plurality of target text information.

According to an embodiment of the present disclosure, if the text length of the initial text information exceeds a preset length threshold, for example, exceeds the maximum length limit of the information extraction model, the initial text information is sliced according to a step size step_len and an overlap length overlap_len. Each step (the i-th step) intercepts from the initial text information a fragment with start point i*step_len, end point (i+1)*step_len+overlap_len, and fragment length step_len+overlap_len, which serves as the target text information input to the information extraction model.
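The slicing rule can be sketched in Python as follows. This is a minimal sketch assuming character-level positions and a 0-based step index i; the function name `slice_text`, the parameter `max_len` (standing in for the preset length threshold), and the stopping condition are assumptions, since the text does not say when slicing ends.

```python
from typing import List


def slice_text(text: str, step_len: int, overlap_len: int,
               max_len: int) -> List[str]:
    """Cut text into fragments of length step_len + overlap_len."""
    if len(text) <= max_len:
        return [text]                 # short enough: no slicing needed
    segments: List[str] = []
    i = 0
    while True:
        start = i * step_len
        end = (i + 1) * step_len + overlap_len
        segments.append(text[start:end])
        if end >= len(text):          # assumption: stop once the tail is covered
            break
        i += 1
    return segments


print(slice_text("abcdefghij", step_len=4, overlap_len=2, max_len=6))
# ['abcdef', 'efghij']
```

Consecutive fragments share overlap_len characters ("ef" in the example), which is what later allows spans cut at a fragment boundary to be merged.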

Fig. 4 schematically illustrates a classification scheme of target text information according to an embodiment of the present disclosure.

under the condition that the plurality of extraction information probability combinations correspond to the key prompt information, fusion processing is carried out on the plurality of extraction information probability combinations to obtain fused extraction information probability combinations, and the fused extraction information probability combinations are processed by the double pointer method.

According to the embodiment of the disclosure, the plurality of extracted information probability combinations (start_id, end_id) of all target text information belonging to the same key prompt information prompt are fused. Repeated or overlapping combinations are merged into one.

According to the embodiment of the disclosure, the specific implementation concept is as follows. Processing starts with all start-end combinations output for the target text information of the first segment. Start-end combinations whose end_id is before step_len need no processing; those whose end_id is greater than or equal to step_len are placed in a temporary storage area as combinations to be merged. All start-end combinations of the target text information of the next segment are then classified into three classes, as shown in fig. 4.

According to an embodiment of the present disclosure, the first class may intersect with the previous segment, i.e., its start_id is less than or equal to (i-1)*step_len+overlap_len, where start_id lies in the first-class region. The second class needs no processing, i.e., its start_id is greater than (i-1)*step_len+overlap_len and its end_id is less than i*step_len, where both start_id and end_id lie in the second-class region. The third class consists of start-end combinations whose end_id is greater than i*step_len, where end_id lies in the third-class region.
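The three classes can be expressed as a small Python function. This is a sketch under assumptions: positions are absolute positions in the initial text, the segment number is 1-based (so that segment i begins at (i-1)*step_len), the first class takes priority over the third, and the boundary case end_id == i*step_len, which the text leaves open, is placed in the third class. The name `classify_span` is illustrative.

```python
from typing import Tuple

Span = Tuple[int, int]  # hypothetical (start_id, end_id), absolute positions


def classify_span(span: Span, seg_no: int,
                  step_len: int, overlap_len: int) -> int:
    """Return 1, 2 or 3 for a span found in segment seg_no (1-based)."""
    start_id, end_id = span
    overlap_end = (seg_no - 1) * step_len + overlap_len
    next_start = seg_no * step_len
    if start_id <= overlap_end:
        return 1      # may intersect a span from the previous segment
    if end_id >= next_start:
        return 3      # may intersect a span from the next segment
    return 2          # lies entirely inside this segment


# step_len=4, overlap_len=2, second segment: overlap ends at 6, next starts at 8
print(classify_span((5, 7), 2, 4, 2))  # 1
print(classify_span((7, 7), 2, 4, 2))  # 2
print(classify_span((7, 9), 2, 4, 2))  # 3
```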

According to embodiments of the present disclosure, a combination may belong to both the first class and the third class; in that case the first class has the highest priority and the combination is assigned to the first class. After classification, the combinations in the temporary storage area from the previous step are compared pairwise with the first-class combinations. Two combinations intersect if neither of the following holds: the end_id of the temporary-storage combination is smaller than the start_id of the first-class combination, or the start_id of the temporary-storage combination is larger than the end_id of the first-class combination. If the two intersect, they are merged, with start_id taken as the smaller of the two start_ids and end_id as the larger of the two end_ids.
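The intersection test and the merge rule translate directly into code. The sketch below assumes each combination is a hypothetical (start_id, end_id) tuple; the function names are illustrative.

```python
from typing import Tuple

Span = Tuple[int, int]  # hypothetical (start_id, end_id)


def intersects(pending: Span, first_class: Span) -> bool:
    """True unless one span ends before the other begins."""
    return not (pending[1] < first_class[0] or pending[0] > first_class[1])


def merge(a: Span, b: Span) -> Span:
    """Smallest start_id and largest end_id of the two spans."""
    return (min(a[0], b[0]), max(a[1], b[1]))


print(intersects((3, 6), (5, 9)), merge((3, 6), (5, 9)))  # True (3, 9)
print(intersects((3, 4), (5, 9)))                         # False
```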

According to the embodiment of the disclosure, merged combinations are removed from the temporary storage area after merging, and the combinations remaining in the temporary storage area are classified into the second class and receive no further processing. For each new combination produced by merging, it is determined whether its end_id is greater than i*step_len. If so, it is classified into the third class; otherwise it is classified into the second class and receives no further processing. The third class is then placed in the temporary storage area, the combinations of the next segment are classified, and the same steps are used to decide whether to merge them with the temporary storage area, until the target text information of the last segment is reached. After the loop reaches the target text information of the last segment, the combinations in the temporary storage area are incorporated directly into the final result.
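The whole fusion loop can be sketched as follows. This is one possible reading of the procedure rather than the implementation of the disclosure. Assumptions: each combination is a hypothetical (start_id, end_id) tuple in absolute positions, segments are numbered from 1, a first-class combination that finds no partner is treated like any other combination of its segment, and the boundary case end_id == i*step_len goes to the temporary storage area. The name `fuse_segments` is illustrative.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start_id, end_id), absolute positions


def fuse_segments(segment_spans: List[List[Span]],
                  step_len: int, overlap_len: int) -> List[Span]:
    """Merge repeated or overlapping spans found in consecutive segments."""
    final: List[Span] = []
    pending: List[Span] = []                 # temporary storage area
    for seg_no, spans in enumerate(segment_spans, start=1):
        overlap_end = (seg_no - 1) * step_len + overlap_len
        next_start = seg_no * step_len
        carried: List[Span] = []
        for span in spans:
            if seg_no > 1 and span[0] <= overlap_end:       # first class
                hits = [p for p in pending
                        if not (p[1] < span[0] or p[0] > span[1])]
                for p in hits:
                    pending.remove(p)
                    span = (min(p[0], span[0]), max(p[1], span[1]))
            if span[1] >= next_start:                        # third class
                carried.append(span)
            else:                                            # second class
                final.append(span)
        final.extend(pending)     # unmerged pending spans need no more work
        pending = carried
    final.extend(pending)         # last segment: flush the temporary area
    return sorted(final)


segments = [
    [(2, 5), (11, 13)],             # segment 1 covers positions 0..13
    [(11, 16), (18, 19), (21, 23)], # segment 2 covers positions 10..23
    [(21, 23), (30, 31)],           # segment 3 covers positions 20..33
]
print(fuse_segments(segments, step_len=10, overlap_len=4))
# [(2, 5), (11, 16), (18, 19), (21, 23), (30, 31)]
```

In the example, (11, 13) from the first segment is extended to (11, 16) by the second segment, and the duplicate (21, 23) reported by two segments is kept only once.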

According to the embodiment of the disclosure, by combining the segmentation of the initial text information with the fusion of the extracted information probability combinations described above, the information extraction method of the disclosure supports merging repeated and overlapping information extracted from multiple segments, yields text content of higher accuracy, and reduces the likelihood of truncated, out-of-context extraction during information extraction.

Fig. 5 schematically shows a block diagram of the structure of an information extraction apparatus according to an embodiment of the present disclosure.

As shown in fig. 5, the information extraction apparatus 500 of this embodiment includes an acquisition module 501, a processing module 502, a determination module 503, and an extraction module 504.

The acquiring module 501 is configured to acquire target text information to be extracted and key prompt information to be extracted;

the processing module 502 is configured to input the target text information and the key prompt information into the information extraction model, and output an extracted information probability combination corresponding to the key prompt information, where the extracted information probability combination includes a start probability sequence and an end probability sequence, the start probability sequence includes probabilities of different start positions of the extracted information, and the end probability sequence includes probabilities of different end positions of the extracted information;

A determining module 503, configured to process the extracted information probability combination by using a double pointer method, so as to determine target position information based on a start probability sequence and an end probability sequence, where the target position information includes a target start position and a target end position;

the extracting module 504 is configured to extract, from the target text information, target extraction text content corresponding to the key hint information based on the target location information.
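How the four modules hand data to one another can be sketched with a small Python class. This is an illustrative sketch only: the class name, the callable signatures, the inclusive end position, and the stub model and arg-max locator in the usage example are all assumptions. In particular, the arg-max locator is a stand-in for the double pointer method of the determining module 503, not that method itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (start probability sequence, end probability sequence)
Probs = Tuple[List[float], List[float]]


@dataclass
class InformationExtractionApparatus:
    model: Callable[[str, str], Probs]            # processing module 502
    locate: Callable[[Probs], Tuple[int, int]]    # determining module 503

    def extract(self, text: str, prompt: str) -> str:
        # Acquiring module 501: text and prompt arrive as arguments.
        probs = self.model(text, prompt)          # module 502
        start, end = self.locate(probs)           # module 503
        return text[start:end + 1]                # extracting module 504


def stub_model(text: str, prompt: str) -> Probs:
    """Hypothetical model: peaks at characters 4 and 6."""
    start = [0.0] * len(text)
    end = [0.0] * len(text)
    start[4], end[6] = 0.9, 0.8
    return start, end


def argmax_locate(probs: Probs) -> Tuple[int, int]:
    """Stand-in locator: most probable start and end, chosen independently."""
    start, end = probs
    return start.index(max(start)), end.index(max(end))


apparatus = InformationExtractionApparatus(stub_model, argmax_locate)
print(apparatus.extract("pay 100 yuan", "amount"))  # 100
```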

According to an embodiment of the present disclosure, the determining module 503 includes an obtaining unit, a selecting unit, and a determining unit.

The obtaining unit is used for traversing the starting point probability sequence and the end point probability sequence by using a double pointer method to obtain a plurality of first combinations and a plurality of second combinations, wherein the first combinations comprise a plurality of initial starting point position probabilities, and the second combinations comprise a plurality of initial end point position probabilities;

a selecting unit, configured to select a third combination from the plurality of first combinations and the plurality of second combinations based on a preset selecting rule, where the third combination includes a plurality of transition start point position probabilities and a plurality of transition end point position probabilities;

and a determining unit configured to determine, as the probabilities of the target start point position and the target end point position, a probability value that is the largest of the plurality of transition start point position probabilities and the plurality of transition end point position probabilities.

According to an embodiment of the present disclosure, the start probability sequence includes a plurality of start probabilities and the end probability sequence includes a plurality of end probabilities.

According to an embodiment of the present disclosure, the obtaining unit includes an obtaining subunit, a first acquisition subunit, a first determining subunit, a generating subunit, and a first iterating subunit.

The obtaining subunit is used for traversing the starting point probability sequence and the end point probability sequence with one pointer each to obtain a first array and a second array, wherein the first array comprises a plurality of first starting point probabilities, and the second array comprises a plurality of first end point probabilities;

The first acquisition subunit is used for acquiring a starting point position variable and an end point position variable corresponding to the current pointer when the pointer information of the starting point probability sequence and the pointer information of the end point probability sequence meet the preset length condition, wherein the starting point position variable comprises a starting point position and a starting point probability, and the end point position variable comprises an end point position and an end point probability;

the first determining subunit is configured to determine, when the starting point position is the same as the ending point position and the state variable is not a preset value, a first starting point probability and a first ending point probability of a probability maximum value in the first array and the second array as a second starting point probability and a second ending point probability respectively;

a generation subunit for generating a first combination and a second combination based on the plurality of second start probabilities and the plurality of second end probabilities;

and the first iteration subunit is used for adjusting pointer information and state variables of the two pointers to iteratively acquire a starting point position variable and an ending point position variable by using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination.

According to an embodiment of the present disclosure, the obtaining unit further comprises a first storage subunit.

The first storage subunit is configured to store, when the starting point position is the same as the ending point position and the state variable is a preset value, a first starting point probability and a first ending point probability corresponding to the current pointer in the first array and the second array respectively.

According to an embodiment of the present disclosure, the obtaining unit further comprises a second storage subunit, a second determination subunit, and a second iteration subunit.

The second storage subunit is used for storing the first starting point probability corresponding to the current pointer in the first array under the condition that the starting point position is smaller than the end point position and the state variable is a preset value;

the second determining subunit is configured to determine, when the starting point position is smaller than the ending point position and the state variable is not a preset value, a first starting point probability and a first ending point probability of a probability maximum value in the first array and the second array as a second starting point probability and a second ending point probability respectively;

and a second iteration subunit for adjusting pointer information and state variables corresponding to the start position to iteratively acquire the start position variable and the end position variable using the adjusted pointer information and state variables and iteratively determine the first combination and the second combination.

According to an embodiment of the present disclosure, the obtaining unit further comprises a third storing subunit, a third iteration subunit.

The third storage subunit is used for storing the first endpoint probability corresponding to the current pointer in the second array under the condition that the starting point position is larger than the ending point position;

and a third iteration subunit for adjusting pointer information and state variables corresponding to the end position to iteratively acquire the start position variable and the end position variable using the adjusted pointer information and state variables and iteratively determine the first combination and the second combination.

According to an embodiment of the present disclosure, the pick unit comprises a third determination subunit, an association storage subunit.

A third determining subunit configured to determine, for each of the first combinations and each of the second combinations, an initial start point position probability and an initial end point position probability of a maximum probability value in the first combination and the second combination as a transition start point position probability and a transition end point position probability, respectively, based on a probability maximization principle;

and the association storage subunit is used for associating the plurality of transition starting point position probabilities with the plurality of transition ending point position probabilities and storing the plurality of transition starting point position probabilities in a third combination.

According to an embodiment of the present disclosure, the determining unit comprises a first ordering storage subunit, a second ordering storage subunit, a fourth determining subunit.

The first ordering storage subunit is used for ordering the plurality of transition starting point position probabilities in the third combination and then storing the ordered transition starting point position probabilities in the starting point set;

the second ordering storage subunit is used for ordering the transition end point position probabilities in the third combination and then storing the transition end point position probabilities in the end point set;

and the fourth determining subunit is used for determining the transition starting point position probability and the transition end point position probability with the maximum probability in the starting point set and the end point set respectively as the probabilities of the target starting point position and the target end point position based on the principle of maximum probability.

According to an embodiment of the present disclosure, the target text information is generated by a second acquisition subunit, a segmentation subunit, a fusion subunit.

The second acquisition subunit is used for acquiring the initial text information;

the segmentation subunit is configured to segment the initial text information based on a preset segmentation rule to obtain a plurality of target text information when the text length of the initial text information exceeds a preset length threshold, so that the information extraction model generates an extracted information probability combination corresponding to each target text information according to the plurality of target text information.

And the fusion subunit is used for carrying out fusion processing on the plurality of extraction information probability combinations under the condition that the plurality of extraction information probability combinations are corresponding to the key prompt information, so as to obtain the fused extraction information probability combination, and processing the fused extraction information probability combination by using a double pointer method.

According to embodiments of the present disclosure, any of the acquisition module 501, the processing module 502, the determination module 503, and the extraction module 504 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 501, the processing module 502, the determination module 503, and the extraction module 504 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or by hardware or firmware in any other reasonable way of integrating or packaging circuitry, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the acquisition module 501, the processing module 502, the determination module 503, and the extraction module 504 may be at least partially implemented as a computer program module which, when executed, may perform the respective functions.

As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.

In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 600 may also include an input/output (I/O) interface 605, the input/output (I/O) interface 605 also being connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.

The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 602 and/or the RAM 603 described above and/or one or more memories other than the ROM 602 and the RAM 603.

Embodiments of the present disclosure also include a computer program product comprising a computer program that contains program code for performing the methods shown in the flowcharts. When the computer program product runs on a computer system, the program code causes the computer system to carry out the information extraction method provided by the embodiments of the present disclosure.
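As a rough illustration of the kind of program code such a computer program product might contain, the following Python sketch selects a target start position and a target end position from a start probability sequence and an end probability sequence, and then cuts the corresponding content out of the target text. It is only a minimal sketch under stated assumptions and is not the double pointer procedure of the disclosure: it scans every start index against every end index within a length limit, and it does not reproduce the state variables or the first, second, and third combinations. The function names `decode_span` and `extract`, the parameter `max_len`, the scoring of a span by the product of its two probabilities, and the use of one probability per character are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def decode_span(
    start_probs: List[float],
    end_probs: List[float],
    max_len: int = 30,  # hypothetical limit on span length, in positions
) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) for the best span, or None if there is none.

    Assumption: a span is scored by start_probs[start] * end_probs[end].
    """
    if len(start_probs) != len(end_probs):
        raise ValueError("start_probs and end_probs must have the same length")

    best: Optional[Tuple[int, int, float]] = None
    for start, p_start in enumerate(start_probs):
        # The end index never lies to the left of the start index and
        # covers at most max_len positions counted from the start index.
        last = min(start + max_len, len(end_probs))
        for end in range(start, last):
            score = p_start * end_probs[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


def extract(
    text: str,
    start_probs: List[float],
    end_probs: List[float],
    max_len: int = 30,
) -> str:
    """Cut the best-scoring span out of text (one probability per character)."""
    span = decode_span(start_probs, end_probs, max_len)
    if span is None:
        return ""
    start, end, _ = span
    return text[start : end + 1]  # the end position is inclusive
```

Calling `extract(text, start_probs, end_probs)` with two probability lists of the same length as `text` returns the substring between the selected positions, or an empty string when the lists are empty. A real implementation would usually work on token positions produced by the information extraction model rather than on characters.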

The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, downloaded and installed via the communication section 609, and/or installed from the removable medium 611. The program code contained in the computer program may be transmitted using any appropriate network medium, including but not limited to wireless media, wired media, or any suitable combination of the foregoing.

According to embodiments of the present disclosure, the program code of the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages. In particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.

The embodiments of the present disclosure are described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the respective embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and their equivalents. Those skilled in the art can make various alternatives and modifications without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims

1. An information extraction method, comprising:

2. The method of claim 1, wherein processing the extracted information probability combination using a double pointer method to determine target location information based on the starting point probability sequence and the ending point probability sequence comprises:

and determining the probability value which is the largest in the transition starting point position probabilities and the transition end point position probabilities as the probability of the target starting point position and the target end point position.

3. The method of claim 2, wherein the starting point probability sequence comprises a plurality of starting point probabilities, and the ending point probability sequence comprises a plurality of ending point probabilities;

wherein traversing the starting point probability sequence and the ending point probability sequence by using the double pointer method to obtain a plurality of first combinations and a plurality of second combinations comprises:

acquiring a starting point position variable and an end point position variable corresponding to the current pointer under the condition that the pointer information of the starting point probability sequence and the pointer information of the end point probability sequence meet the preset length condition, wherein the starting point position variable comprises a starting point position and a starting point probability, and the end point position variable comprises an end point position and an end point probability;

adjusting the pointer information and state variables of the two pointers, so as to iteratively acquire a start position variable and an end position variable using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination.

4. The method of claim 3, further comprising, prior to the adjusting:

and under the condition that the starting point position is the same as the ending point position and the state variable is a preset value, storing the first starting point probability and the first ending point probability corresponding to the current pointer in a first array and a second array respectively.

5. The method of claim 3, further comprising:

and adjusting pointer information and state variables corresponding to the starting point position to iteratively acquire the starting point position variable and the ending point position variable by using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination.

6. The method of claim 3, further comprising:

and adjusting pointer information and state variables corresponding to the end position to iteratively acquire a start position variable and an end position variable using the adjusted pointer information and state variables and iteratively determine a first combination and a second combination.

7. The method of claim 2, wherein selecting a third combination from the first combination and the second combination, respectively, based on a preset selection rule, comprises:

for each first combination and each second combination, determining, based on a maximum probability principle, the initial starting point position probability and the initial end point position probability having the maximum probability value in the first combination and the second combination as a transition starting point position probability and a transition end point position probability, respectively;

and correlating a plurality of transition start position probabilities with a plurality of transition end position probabilities, and storing the probabilities in the third combination.

8. The method of claim 7, wherein determining the probability value of the largest of the plurality of transition start point position probabilities and the plurality of transition end point position probabilities as the probability of the target start point position and the target end point position comprises:

9. The method of claim 1, wherein the target text information is generated by:

acquiring initial text information;

10. The method of claim 9, further comprising:

and under the condition that the plurality of extraction information probability combinations correspond to the key prompt information, carrying out fusion processing on the plurality of extraction information probability combinations to obtain fused extraction information probability combinations, and processing the fused extraction information probability combinations by using the double pointer method.

11. An information extraction apparatus comprising:

a determining module, configured to process the extracted information probability combination by using a double pointer method, so as to determine target position information based on the starting point probability sequence and the end point probability sequence, wherein the target position information includes a target starting point position and a target end point position;

12. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-10.

13. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1 to 10.

14. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 10.