CN114792086A - Information extraction method, device, equipment and medium supporting text cross coverage


Info

Publication number
CN114792086A
Authority
CN
China
Prior art keywords
sequence
target
vector
word
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110105562.2A
Other languages
Chinese (zh)
Inventor
李建平
朱晓谦
吴登生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Science And Development Chinese Academy Of Sciences
University of Chinese Academy of Sciences
Original Assignee
Institute Of Science And Development Chinese Academy Of Sciences
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Science And Development Chinese Academy Of Sciences, University of Chinese Academy of Sciences filed Critical Institute Of Science And Development Chinese Academy Of Sciences
Priority to CN202110105562.2A
Publication of CN114792086A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Abstract

The embodiments of the disclosure disclose an information extraction method, apparatus, device and computer-readable medium. One embodiment of the method comprises: acquiring a target text; encoding each word in the target text to generate a word vector, obtaining a word vector sequence; determining a target probability value set corresponding to each word vector in the word vector sequence to obtain a target probability value set sequence; generating an object vector sequence set based on the target probability value set sequence and the label group set; generating a label sequence set based on the object vector sequence set and the object transition matrix set; and extracting, from the target text, the object information corresponding to each label sequence in the label sequence set to obtain an object information set. This embodiment enables information extraction from texts in which pieces of information cross or overlap, which is convenient for application scenarios such as text analysis.

Description

Information extraction method, device, equipment and medium supporting text cross coverage
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to an information extraction method, apparatus, device, and computer-readable medium.
Background
Information extraction is a text processing technique for extracting, from natural language text, information such as entities, relationships and events of a specified object. Existing information extraction methods generally convert the information extraction task into a sequence labeling problem: each character in the text is assigned a label, and a subset of the characters is then extracted as the information.
However, when the information extraction is performed in the above manner, the following technical problems often occur:
first, in many application scenarios the pieces of information to be extracted often cross or overlap one another, while existing sequence labeling methods can extract each word only once, so the extracted information is incomplete.
Secondly, in many application scenarios the relationship between the semantics and the labels assigned to the words is not well balanced during extraction, so the accuracy of the extracted information is low.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose information extraction methods to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for information extraction, the method including: acquiring a target text; coding each word in the target text to generate a word vector to obtain a word vector sequence; determining a target probability value set corresponding to each word vector in the word vector sequence to obtain a target probability value set sequence; generating an object vector sequence set based on the target probability value set sequence and the label set; generating a label sequence set based on the object vector sequence set and the object transfer matrix set; and extracting object information corresponding to each label sequence in the label sequence set from the target text to obtain an object information set.
In a second aspect, some embodiments of the present disclosure provide an information extraction apparatus, including: an acquisition unit configured to acquire a target text; the encoding unit is configured to encode each word in the target text to generate a word vector, and a word vector sequence is obtained; a determining unit configured to determine a target probability value set corresponding to each word vector in the word vector sequence, resulting in a target probability value set sequence; a first generation unit configured to generate a set of object vector sequences based on the set of target probability values and the set of labels; a second generating unit configured to generate a set of tag sequences based on the set of object vector sequences and the set of object transfer matrices; and the extracting unit is configured to extract the object information corresponding to each label sequence in the label sequence set from the target text to obtain an object information set.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device, on which one or more programs are stored, which when executed by one or more processors cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, some embodiments of the disclosure provide a computer readable medium on which a computer program is stored, wherein the program when executed by a processor implements the method described in any implementation of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: each word in the acquired target text is encoded to generate a word vector, so the encoded word vectors carry both text information and position information, and the shared bottom layer allows target objects with few training examples to be trained adequately, which improves the learning ability of the whole pipeline. By giving each character in the target text a label for each target object, the object information corresponding to each object can be extracted preliminarily. A label transition matrix is then introduced to express the relationship between every two adjacent words more accurately, so that overlapping information can be extracted precisely.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of one application scenario of an information extraction method according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of an information extraction method according to the present disclosure;
FIG. 3 is a schematic structural diagram of some embodiments of an information extraction apparatus according to the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of another application scenario of an information extraction method according to some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It is noted that the modifiers "a" and "an" in this disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of an information extraction method according to some embodiments of the present disclosure.
In the application scenario of fig. 1, first, the computing device 101 may obtain target text 102; next, the computing device 101 may encode each word in the target text 102 to generate a word vector, resulting in a word vector sequence 103; then, the computing device 101 may determine a target probability value set corresponding to each word vector in the word vector sequence 103, resulting in a target probability value set sequence 104; thereafter, computing device 101 can generate a set of object vector sequences 106 based on the sequence of target probability value sets 104 and the set of tag sets 105; thereafter, computing device 101 can generate a set of tag sequences 108 based on the set of object vector sequences 106 and the set of object transfer matrices 107; finally, the computing device 101 may extract object information corresponding to each tag sequence in the set of tag sequences 108 from the target text 102, resulting in a set of object information 109.
The computing device 101 may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices enumerated above and implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the number of computing devices in FIG. 1 is merely illustrative. There may be any number of computing devices, as implementation needs dictate.
With continued reference to fig. 2, a flow 200 of some embodiments of an information extraction method according to the present disclosure is shown. The information extraction method comprises the following steps:
Step 201, acquiring a target text.
In some embodiments, the execution body of the information extraction method (e.g., the computing device 101) may acquire the target text in various ways, such as from a web page, a text file or a picture. The target text may be an article, a paragraph or a sentence. The target object may be a noun determined in advance according to the requirements of the application scenario.
As an example, the target object may be pork, and the target text may be "pork and corn rose sharply".
Step 202, each word in the target text is encoded to generate a word vector, and a word vector sequence is obtained.
In some embodiments, the execution body may input the word sequence corresponding to the target text into a long short-term memory (LSTM) network for encoding, thereby obtaining the word vector sequence. The word sequence is the sequence formed by all the words in the target text, and a word vector is the vector obtained by encoding a word through the network. As an example, a word vector may be [1, 0, 1, 0, 0, 1, 0, 1, 1, 1].
In some optional implementation manners of some embodiments, the executing entity may input the target text into a pre-trained text coding model to obtain a word vector sequence.
As an example, the text coding model may be BERT (Bidirectional Encoder Representations from Transformers).
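As a minimal illustrative sketch (not code from the disclosure), this encoding step can be realized with a publicly available pre-trained Chinese BERT checkpoint; the model name "bert-base-chinese", the HuggingFace transformers API and the per-character alignment below are assumptions made for demonstration:

    # Sketch of step 202: encode each character of the target text into a vector.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    encoder = BertModel.from_pretrained("bert-base-chinese")

    def encode_text(text: str) -> torch.Tensor:
        """Return one vector per character of `text` (shape [len(text), hidden])."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state[0]
        # Drop the [CLS] and [SEP] positions so vectors align with characters.
        return hidden[1:-1]

    word_vectors = encode_text("今年猪肉大涨")  # 6 characters -> 6 word vectors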
Step 203, determining a target probability value set corresponding to each word vector in the word vector sequence to obtain a target probability value set sequence.
In some embodiments, the execution body may input the word vector sequence into a fully connected layer to obtain a first dimension-reduced word vector sequence, then into a downsampling layer to obtain a second dimension-reduced word vector sequence, and finally into another fully connected layer to obtain a third dimension-reduced word vector sequence. Each word vector in the resulting sequence can be normalized to generate a target probability value set, yielding the sequence of target probability value sets. The number of target probability values in each set equals the number of labels, and the i-th target probability value represents the probability that the word corresponding to the word vector is assigned the i-th label, where i can take any integer between 1 and 2N+1 and N denotes the number of target objects.
In some optional implementations of some embodiments, the execution body may input the word vector sequence into at least one pre-trained fully connected layer to obtain the target probability value set sequence; each word vector is passed through the at least one fully connected layer to obtain its corresponding target probability value set.
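A minimal sketch of this step, assuming PyTorch, N = 2 target objects (hence 2N+1 = 5 labels) and illustrative layer sizes that are not taken from the disclosure:

    # Sketch of step 203: map each word vector to 2N+1 label probabilities.
    import torch
    import torch.nn as nn

    N_OBJECTS = 2                  # e.g. pork and corn
    N_LABELS = 2 * N_OBJECTS + 1   # B-X and I-X per object, plus O
    HIDDEN = 768                   # assumed encoder hidden size

    scorer = nn.Sequential(
        nn.Linear(HIDDEN, 256),    # dimension-reducing fully connected layer
        nn.ReLU(),
        nn.Linear(256, N_LABELS),  # one logit per label
    )

    def label_probabilities(word_vectors: torch.Tensor) -> torch.Tensor:
        """[M, HIDDEN] word vectors -> [M, 2N+1] normalized label probabilities."""
        return torch.softmax(scorer(word_vectors), dim=-1)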
Step 204, generating an object vector sequence set based on the target probability value set sequence and the label group set.
In some embodiments, the execution body may filter the target probability values in the target probability value set sequence according to the label groups. For example, suppose the change of the pork price is to be extracted from "pork and corn rose sharply this year". In this case the target object may be pork, and the three labels "B-pork", "I-pork" and "O" are used as one label group; if corn is another target object, the three labels "B-corn", "I-corn" and "O" may be used as another label group. Each target object corresponds to one label group, and all target objects together correspond to the label group set. Each word in the target text corresponds to one target probability value set, and each target probability value corresponds to one label, the value representing the probability that the word is given that label. Then, a target probability value corresponding to each label in the target label group is selected in turn from each target probability value set in the sequence, giving a target probability value sequence for each label in the target label group. The target probability values at the same position of the sequences corresponding to the labels "B-X", "I-X" and "O" are then taken to form a triple, and the triple is used as an object vector; the three sequences corresponding to the three labels of the target label group thus yield an object vector sequence. Here, the target label group is a label group in the label group set.
In some optional implementations of some embodiments, the execution subject may obtain the set of object vector sequences by:
the method comprises the following steps of firstly, selecting a target probability value corresponding to each target label in a target label group from a target probability value group sequence in sequence to obtain the target probability value sequence, wherein the target label group is a label group in a label group set.
As an example, the target text may be "今年猪肉大涨" ("pork rose sharply this year"), and there may be two target objects: pork and corn. There are then five labels: "B-pork", "I-pork", "B-corn", "I-corn" and "O", and the label group corresponding to pork comprises the three labels "B-pork", "I-pork" and "O". The target probability value set sequence may be [0.2, 0.5, 0.6, 0.1, 0.3], [0.2, 0.8, 0.3, 0.9, 0.2], [0.5, 0.8, 0.3, 0.4, 0.5], [0.4, 0.2, 0.7, 0.6, 0.1], [0.1, 0.2, 0.5, 0.3, 0.8], [0.7, 0.3, 0.2, 0.5, 0.2]. To extract the change of the pork price, the target probability value corresponding to each label in the label group of the target object "pork" is selected from these sets: for the label "B-pork" the sequence 0.2, 0.2, 0.5, 0.4, 0.1, 0.7 is selected; for the label "I-pork" the sequence 0.5, 0.8, 0.8, 0.2, 0.2, 0.3; and for the label "O" the sequence 0.3, 0.2, 0.5, 0.1, 0.8, 0.2.
In the second step, the object vector sequence set is generated based on the obtained target probability value sequences. For any label group, the first step gives one target probability value sequence per label in the group; selecting the elements at the same position of the three target probability value sequences forms a triple, and taking each triple as an object vector yields an object vector sequence.
As an example, selecting the elements at the same position of the three target probability value sequences obtained in the first step gives the following object vectors: (0.2, 0.5, 0.3), (0.2, 0.8, 0.2), (0.5, 0.8, 0.5), (0.4, 0.2, 0.1), (0.1, 0.2, 0.8), (0.7, 0.3, 0.2); ordering them by the position of the elements in the source sequences yields the object vector sequence. The object vectors in the sequence correspond, respectively, to the six characters 今 ("this"), 年 ("year"), 猪 ("pig"), 肉 ("meat"), 大 ("sharply") and 涨 ("rise") of the target text.
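This selection can be written compactly as column slicing of the [M, 2N+1] probability matrix. A sketch, under the assumption that the five columns are ordered "B-pork", "I-pork", "B-corn", "I-corn", "O":

    # Sketch of step 204: slice per-object (B-X, I-X, O) triples out of the
    # [M, 2N+1] probability matrix, one [M, 3] object vector sequence per object.
    import torch

    LABELS = ["B-pork", "I-pork", "B-corn", "I-corn", "O"]
    LABEL_GROUPS = {
        "pork": ["B-pork", "I-pork", "O"],
        "corn": ["B-corn", "I-corn", "O"],
    }

    def object_vector_sequences(probs: torch.Tensor) -> dict:
        """For each target object, gather its (B-X, I-X, O) probability triples."""
        sequences = {}
        for obj, group in LABEL_GROUPS.items():
            cols = [LABELS.index(label) for label in group]
            sequences[obj] = probs[:, cols]  # shape [M, 3]
        return sequences

Applied to the example sequence above, the first pork object vector comes out as (0.2, 0.5, 0.3), matching the triple computed by hand.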
Step 205, generating a tag sequence set based on the object vector sequence set and the object transfer matrix set.
In some embodiments, when all the elements of the object transition matrices are one constant, the execution body may use the object vector sequence set directly as the label sequence set. The elements of an object transition matrix are the values of the transition probabilities between labels, and each object transition matrix is a 3 × 3 matrix, one per target object, N being the number of target objects. For example, if the target object is pork and the goal is to extract the change of the pork price, the label sequence "O O O O B-pork I-pork" may be assigned to "今年猪肉大涨" ("pork rose sharply this year"); once the execution body assigns "O" to the character 今 ("this"), the probability of each label for the next character 年 ("year") can make use of the information that 今 is labeled "O", which increases the probability that 年 is assigned the correct label. In this process, the probability that the next word is assigned a certain label is estimated from the label of the current word; this is called the transition probability between labels.
In some optional implementations of some embodiments, for each object vector sequence in the object vector sequence set and the object transition matrix corresponding to that sequence, the execution body may generate the label sequence with the Viterbi algorithm. The object vector sequence corresponding to a target object X may be arranged into a matrix as follows: the object vectors in the sequence are mapped, in order, to the first, second, third and following columns of the matrix, and the elements of each object vector are assigned, in order, to the first, second and third rows of the corresponding column, giving a 3 × M matrix, where M is the length of the object vector sequence. The matrix generated in this way is called the emission matrix of the target object X for the target text. Combining the emission matrix with the object transition matrix, an optimal sequence can be obtained through the Viterbi algorithm and used as the label sequence.
In some optional implementation manners of some embodiments, for each object vector sequence in the object vector sequence set and the object transition matrix corresponding to that sequence, the execution body may obtain the score of a candidate label sequence by the following formula:

$$s_{i_1} = \lambda \sum_{j=1}^{M} x^{(j)}_{i_j} + \sum_{j=1}^{M-1} y_{i_j,\, i_{j+1}}$$

wherein $s_{i_1}$ represents the score of the candidate label sequence, the first sum is the first score of the sequence starting at $i_1$, and the second sum is the second score of the sequence starting at $i_1$. $\lambda$ represents a regulating factor; $j$ denotes the position in the object vector sequence; $i_j$ denotes the index of the element selected from the $j$-th object vector; $M$ denotes the length of the object vector sequence; $x^{(j)}_{i_j}$ denotes the value of the $i_j$-th element of the $j$-th object vector; and $y_{i_j, i_{j+1}}$ denotes the element in row $i_j$, column $i_{j+1}$ of the object transition matrix. The regulating factor $\lambda$ controls the relative proportion of the object vector sequence and the transition matrix according to the application scenario. The first score is obtained by adding one element selected from each object vector in the sequence, the starting point indicating the position of the element selected from the first object vector. The second score is obtained by adding the object transition matrix elements indexed by each pair of consecutive selections. As an example, if $\lambda$ is 1, the object vector length is 5 and $M$ is 20, then there are 5 choices of starting point and $5^{20}$ selectable non-repeating paths in total; these paths are the resulting candidate sequences, and each $s_{i_1}$ corresponds to one sequence.

The candidate label sequence corresponding to the maximum score among the obtained scores is then selected as the label sequence; that is, the sequence with the largest $s_{i_1}$ serves as the label sequence.
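For illustration only (not code from the disclosure), the following sketch decodes one object vector sequence with its 3 × 3 object transition matrix using the Viterbi algorithm under the score defined above; NumPy and the state order (B-X, I-X, O) are assumptions:

    # Sketch of step 205: Viterbi decoding of one object vector sequence,
    # scoring paths as  s = lambda * sum_j x[j, i_j] + sum_j y[i_j, i_{j+1}].
    import numpy as np

    def viterbi(emissions: np.ndarray, transitions: np.ndarray, lam: float = 1.0):
        """emissions: [M, K] object vectors; transitions: [K, K]. Returns (path, score)."""
        M, K = emissions.shape
        score = lam * emissions[0]          # best score ending in each state at j = 0
        back = np.zeros((M, K), dtype=int)  # back-pointers
        for j in range(1, M):
            # cand[p, q]: best score of a path in state p at j-1 and state q at j
            cand = score[:, None] + transitions + lam * emissions[j][None, :]
            back[j] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        path = [int(score.argmax())]
        for j in range(M - 1, 0, -1):       # follow back-pointers to the start
            path.append(int(back[j][path[-1]]))
        return path[::-1], float(score.max())

Enumerating all candidate paths as in the formula would be exponential in M; the dynamic program above finds the same maximum-score sequence in O(M · K²).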
As an example, the at least one fully connected layer and the object transition matrix set may be obtained through the following steps:
the method comprises the following steps of firstly, obtaining a training sample set, wherein the training sample is a text which is selected according to a target object and does not exceed a preset length. Wherein the preset length may be 100 words in length.
In the second step, BIO labeling is performed on the training sample set to obtain the label sequence corresponding to each training sample. The training samples are then input into BERT to obtain a vector representation of each character; taking these vectors as word vectors, each training sample corresponds to a word vector sequence. BIO labeling gives, based on the target object set, a label to each character of the text, the label being the identification of that character. Each target object has three labels: for a target object X, they are "B-X", "I-X" and "O", and the set formed by these three labels is called the label group of the target object X. A character given the label "B-X" belongs to a word describing information of the target object X and is at the starting position of that word; a character given the label "I-X" belongs to a word describing information of the target object X and is at a non-starting position of that word; and a character given the label "O" expresses information irrelevant to the target object X. For example, if the target object is pork and the purpose is to extract the change of the pork price, the label sequence "O O O O B-pork I-pork" can be assigned to "今年猪肉大涨" ("pork rose sharply this year").
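A small sketch of this labeling convention; the character-span annotations are assumed inputs, and the example span follows the pork example above:

    # Sketch of BIO labeling: the first character of an annotated span for
    # object X gets "B-X", the rest get "I-X", and everything else gets "O".
    def bio_tags(text: str, spans: list) -> list:
        """spans: (start, end, object) character spans with exclusive end."""
        tags = ["O"] * len(text)
        for start, end, obj in spans:
            tags[start] = f"B-{obj}"
            for k in range(start + 1, end):
                tags[k] = f"I-{obj}"
        return tags

    # "今年猪肉大涨": the pork price change occupies characters 4-5.
    print(bio_tags("今年猪肉大涨", [(4, 6, "pork")]))
    # -> ['O', 'O', 'O', 'O', 'B-pork', 'I-pork']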
In the third step, the word vector sequence is input into several sequentially connected fully connected layers, which output the probability value set corresponding to each character of the training sample, giving a probability value set sequence. The number of neurons of the first fully connected layer may be the preset length multiplied by the length of a word vector, and the number of neurons of the last fully connected layer may be the preset length; at the output, the dimension corresponding to each character is the number of labels, 2N+1, where N is the number of target objects. The terms "probability value set" and "target probability value set" are only used to distinguish the outputs produced during training from those produced during testing; apart from possibly different values, there is no difference between them. Each probability value set in the sequence is mapped, in order, to one column of a matrix: the first element of the set corresponds to the first element of the corresponding column, and the remaining elements are placed in order. The matrix obtained in this way is called the emission matrix of the training sample; taking out the elements of each column of the emission matrix recovers the corresponding probability value set sequence.
In the fourth step, the probability of the corresponding optimal sequence is obtained from the emission matrix and the transition matrix through the Viterbi algorithm. The optimal sequence is the label sequence annotated for the training sample in the second step. The objective function is the function that maximizes the probability of this optimal sequence; all the weights of the at least one fully connected layer and the transition matrix are updated continually by gradient descent, finally giving the trained fully connected layer(s) and transition matrix. The trained transition matrix is called the label transition matrix. Each row of this matrix corresponds to a unique label, called the row label, and each column likewise corresponds to a unique label, called the column label. The elements of the matrix are the transition probabilities between every two of all 2N+1 labels, N representing the number of target objects.
In the fifth step, for the target object X, the elements with row label "B-X" and column labels "B-X", "I-X" and "O" are selected in turn from the label transition matrix to obtain a first triple; the elements with row label "I-X" and column labels "B-X", "I-X" and "O" are selected in turn to obtain a second triple; and the elements with row label "O" and column labels "B-X", "I-X" and "O" are selected in turn to obtain a third triple. The three elements of the first triple are taken as the elements of the first, second and third columns of the first row of a 3 × 3 matrix, and the other two triples are placed into the matrix by the same rule, giving the object transition matrix of the target object X. The object transition matrices of the remaining target objects are obtained in the same way.
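This fifth step is plain submatrix selection. A sketch, assuming a label-to-index mapping such as the LABELS list used in the earlier sketch:

    # Sketch: pick the 3x3 object transition matrix of object X out of the
    # trained (2N+1)x(2N+1) label transition matrix, rows/columns ordered
    # B-X, I-X, O.
    import numpy as np

    def object_transition_matrix(label_matrix: np.ndarray,
                                 labels: list, obj: str) -> np.ndarray:
        idx = [labels.index(f"B-{obj}"), labels.index(f"I-{obj}"), labels.index("O")]
        return label_matrix[np.ix_(idx, idx)]  # 3x3 submatrix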
The above formula and the related content constitute an inventive point of the embodiments of the present disclosure, and solve the second technical problem mentioned in the background: when information extraction is performed in many application scenarios, the relationship between the semantics and the labels corresponding to the words is not well balanced, so the accuracy of the extracted information is low. The factor behind the low accuracy is that the relationship between the semantics and the labels corresponding to the words is not fully considered; once it is addressed, the accuracy of information extraction improves. To achieve this effect, the present disclosure introduces the regulating factor. When the semantic information of the target text matters more, the regulating factor can be decreased to obtain more semantic information; when the relationship between the labels corresponding to the words of the target text matters more, the regulating factor can be increased, so that the extracted information is grammatically more accurate.
Step 206, extracting the object information corresponding to each tag sequence in the tag sequence set from the target text to obtain an object information set.
In some embodiments, the execution body may extract from the target text the characters whose labels in the label sequence are not "O", thereby obtaining one piece of object information.
As an example, as shown in fig. 5, the target objects are pork and corn and the target text is "猪肉玉米大涨" ("pork and corn rose sharply"). The label sequence for pork is "O O O O B-pork I-pork", so the pork information "rose sharply" can be extracted; the label sequence for corn is "O O O O B-corn I-corn", so the corn information "rose sharply" can be extracted. The two extracted pieces of information cover the same characters, which a single labeling pass could not produce.
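A sketch of this final step; the tag-sequence inputs mirror the example above:

    # Sketch of step 206: read the contiguous non-"O" stretches of a label
    # sequence back out of the target text as object information.
    def extract_info(text: str, tags: list) -> list:
        pieces, current = [], ""
        for ch, tag in zip(text, tags):
            if tag == "O":
                if current:
                    pieces.append(current)
                current = ""
            elif tag.startswith("B-"):
                if current:
                    pieces.append(current)
                current = ch
            else:               # an "I-" tag continues the current piece
                current += ch
        if current:
            pieces.append(current)
        return pieces

    print(extract_info("猪肉玉米大涨", ["O", "O", "O", "O", "B-pork", "I-pork"]))
    # -> ['大涨']  (the pork information "rose sharply")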
The above embodiments of the present disclosure have the following advantages: each word in the acquired target text is encoded to generate a word vector, so the encoded word vectors carry both text information and position information, and the shared bottom layer allows target objects with few training examples to be trained adequately, which improves the learning ability of the whole pipeline. By giving each character in the target text a label for each target object, the object information corresponding to each object can be extracted preliminarily. Multiple object transition matrices are then introduced to express the relationship between every two adjacent words more accurately, so that overlapping information can be extracted precisely.
With further reference to fig. 3, as an implementation of the method shown above, the present disclosure provides some embodiments of an information extraction apparatus, which correspond to those of the method shown in fig. 2, and which can be applied in various electronic devices in particular.
As shown in fig. 3, the information extraction apparatus 300 of some embodiments includes: an acquisition unit 301, an encoding unit 302, a determination unit 303, a first generation unit 304, a second generation unit 305, and an extraction unit 306. Wherein the obtaining unit 301 is configured to obtain the target text; the encoding unit 302 is configured to encode each word in the target text to generate a word vector, resulting in a word vector sequence; the determining unit 303 is configured to determine a target probability value set corresponding to each word vector in the word vector sequence, resulting in a target probability value set sequence; the first generating unit 304 is configured to generate a set of object vector sequences based on the set of target probability values and the set of labels; the second generating unit 305 is configured to generate a set of tag sequences based on the set of object vector sequences and the set of object transfer matrices; the extracting unit 306 is configured to extract object information corresponding to each tag sequence in the set of tag sequences from the target text, resulting in a set of object information.
It will be understood that the units described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 300 and the units included therein, and are not described herein again.
Referring now to FIG. 4, shown is a schematic structural diagram of an electronic device 400 (e.g., the computing device 101 of FIG. 1) suitable for use in implementing some embodiments of the present disclosure. The electronic device shown in fig. 4 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate with other devices, either wirelessly or by wire, to exchange data. While fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 4 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through communications device 409, or installed from storage device 408, or installed from ROM 402. The computer program, when executed by the processing apparatus 401, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients, servers may communicate using any currently known or future developed network Protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the Internet (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target text; coding each word in the target text to generate a word vector to obtain a word vector sequence; determining a target probability value set corresponding to each word vector in the word vector sequence to obtain a target probability value set sequence; generating an object vector sequence set based on the target probability value set sequence and the tag set; generating a label sequence set based on the object vector sequence set and the object transfer matrix set; and extracting object information corresponding to each label sequence in the label sequence set from the target text to obtain an object information set.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor includes an acquisition unit, an encoding unit, a determination unit, a first generation unit, a second generation unit, and an extraction unit. The names of these units do not in some cases limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires a target text".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only a description of some preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features with similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. An information extraction method, comprising:
acquiring a target text;
encoding each word in the target text to generate a word vector to obtain a word vector sequence;
determining a target probability value set corresponding to each word vector in the word vector sequence to obtain a target probability value set sequence;
generating an object vector sequence set based on the target probability value set sequence and the tag set;
generating a label sequence set based on the object vector sequence set and the object transfer matrix set;
and extracting object information corresponding to each label sequence in the label sequence set from the target text to obtain an object information set.
2. The method of claim 1, wherein said encoding each word in the target text to generate a word vector, resulting in a sequence of word vectors, comprises:
and inputting the target text into a pre-trained text coding model to obtain the word vector sequence.
3. The method of claim 2, wherein said determining a set of target probability values for each word vector in said sequence of word vectors, resulting in a sequence of target probability value sets, comprises:
and inputting the word vector sequence into at least one pre-trained full-connection layer to obtain the target probability value set sequence.
4. The method of claim 3, wherein generating a set of object vector sequences based on the set of target sets of probability values and the set of tag groups comprises:
selecting a target probability value corresponding to each target label in a target label group from the target probability value group sequence in sequence to obtain a target probability value sequence, wherein the target label group is a label group in the label group set;
generating the set of object vector sequences based on the obtained sequence of target probability values.
5. The method of claim 4, wherein generating a set of tag sequences based on the set of object vector sequences and a set of object transfer matrices comprises:
and generating a label sequence by utilizing a Viterbi algorithm for each object vector sequence in the object vector sequence set and the object transfer matrix corresponding to the object vector sequence.
6. The method of claim 5, wherein the generating a tag sequence using a Viterbi algorithm for each object vector sequence in the set of object vector sequences and an object transition matrix corresponding to the object vector sequence comprises:
and for each object vector sequence in the object vector sequence set and the object transfer matrix corresponding to the object vector sequence, obtaining the score of the tag sequence to be selected by the following formula:
Figure FDA0002917272010000021
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0002917272010000022
a score representing the sequence of tags to be selected,
Figure FDA0002917272010000023
is shown in
Figure FDA0002917272010000024
Is the first score of the sequence of starting points,
Figure FDA0002917272010000025
is shown in
Figure FDA0002917272010000026
A second score of the sequence as a starting point, λ represents a regulatory factor, j represents a sequence number of the object vector sequence, i j Denotes the sequence number of an element in the jth object vector, i denotes the sequence number of an element in an object vector, M denotes the length of the sequence of object vectors, x denotes the value of an element of an object vector, y denotes the value of an element of the object transition matrix,
Figure FDA0002917272010000027
representing the ith in the jth object vector j The value of each of the elements is,
Figure FDA0002917272010000028
representing the ith in the object transition matrix j Line, i th j+1 The value of the element of the column;
and selecting the candidate tag sequence corresponding to the maximum score from the obtained scores of the candidate tag sequences as the tag sequence.
7. An information extraction apparatus comprising:
an acquisition unit configured to acquire a target text;
the encoding unit is configured to encode each word in the target text to generate a word vector, and a word vector sequence is obtained;
a determining unit configured to determine a target probability value set corresponding to each word vector in the word vector sequence, resulting in a target probability value set sequence;
a first generating unit configured to generate a set of object vector sequences based on the set of target probability values and the set of labels;
a second generating unit configured to generate a set of tag sequences based on the set of object vector sequences and the set of object transfer matrices;
and the extracting unit is configured to extract object information corresponding to each label sequence in the label sequence set from the target text to obtain an object information set.
8. An electronic device/terminal/server comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
CN202110105562.2A (priority date 2021-01-26, filing date 2021-01-26) Information extraction method, device, equipment and medium supporting text cross coverage. Status: Pending. Published as CN114792086A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110105562.2A CN114792086A (en) 2021-01-26 2021-01-26 Information extraction method, device, equipment and medium supporting text cross coverage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110105562.2A CN114792086A (en) 2021-01-26 2021-01-26 Information extraction method, device, equipment and medium supporting text cross coverage

Publications (1)

Publication Number Publication Date
CN114792086A (en) 2022-07-26

Family

ID=82459573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110105562.2A Pending CN114792086A (en) 2021-01-26 2021-01-26 Information extraction method, device, equipment and medium supporting text cross coverage

Country Status (1)

Country Link
CN (1) CN114792086A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030272A (en) * 2023-03-30 2023-04-28 之江实验室 Target detection method, system and device based on information extraction
CN116030272B (en) * 2023-03-30 2023-07-14 之江实验室 Target detection method, system and device based on information extraction

Similar Documents

Publication Publication Date Title
US10650102B2 (en) Method and apparatus for generating parallel text in same language
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
CN110298019A (en) Name entity recognition method, device, equipment and computer readable storage medium
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
US20240078385A1 (en) Method and apparatus for generating text
CN111368551A (en) Method and device for determining event subject
US20220391425A1 (en) Method and apparatus for processing information
CN116128055A (en) Map construction method, map construction device, electronic equipment and computer readable medium
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN116166271A (en) Code generation method and device, storage medium and electronic equipment
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN116127925B (en) Text data enhancement method and device based on destruction processing of text
CN114792086A (en) Information extraction method, device, equipment and medium supporting text cross coverage
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN116108810A (en) Text data enhancement method and device
CN114792097B (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN112651231B (en) Spoken language information processing method and device and electronic equipment
CN113268575B (en) Entity relationship identification method and device and readable medium
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN114564606A (en) Data processing method and device, electronic equipment and storage medium
CN114492400A (en) Title entity recognition model training method, title entity recognition method and device
CN113986958A (en) Text information conversion method and device, readable medium and electronic equipment
CN114792388A (en) Image description character generation method and device and computer readable storage medium
CN113111169A (en) Deep learning model-based alarm receiving and processing text address information extraction method and device
CN117743555B (en) Reply decision information transmission method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination