CN115983281A - Information extraction and model acquisition method and device, electronic equipment and storage medium - Google Patents

Information extraction and model acquisition method and device, electronic equipment and storage medium

Info

Publication number
CN115983281A
Authority
CN
China
Prior art keywords
entity
processed
text
sequence
dimensional matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310085290.3A
Other languages
Chinese (zh)
Inventor
贾桐
王建华
冯知凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310085290.3A
Publication of CN115983281A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides an information extraction method, a model acquisition method, corresponding devices, an electronic device and a storage medium, and relates to artificial intelligence fields such as deep learning, natural language processing and knowledge graphs. The information extraction method comprises: for a text to be processed, constructing a three-dimensional matrix corresponding to a first sequence by using a first model obtained through pre-training, wherein the first sequence is generated according to the text to be processed, the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, and the two characters may be the same character or different characters; and determining, according to the three-dimensional matrix, the core entity and facet information extracted from the text to be processed. Applying the disclosed scheme can reduce implementation complexity and increase information extraction speed, among other benefits.

Description

Information extraction and model acquisition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for extracting information and obtaining a model in the fields of deep learning, natural language processing, and knowledge graph, an electronic device, and a storage medium.
Background
In many scenarios, core entities and facet information need to be extracted from a given text. For example, in a retrieval scenario, extracting the core entity and facet information can improve the accuracy of the retrieval result.
Disclosure of Invention
The disclosure provides an information extraction and model acquisition method, an information extraction and model acquisition device, an electronic device and a storage medium.
An information extraction method, comprising:
for a text to be processed, constructing a three-dimensional matrix corresponding to a first sequence by using a first model obtained through pre-training, wherein the first sequence is a sequence generated according to the text to be processed, the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, and the two characters may be the same character or different characters;
and determining the core entity and the facet information extracted from the text to be processed according to the three-dimensional matrix.
A model acquisition method, comprising:
acquiring training data, wherein the training data comprises: a text, a core entity extracted from the text, the type of the core entity, and facet information corresponding to at least one core entity;
training a first model by using the training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, wherein the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters may be the same character or different characters, and the three-dimensional matrix is used for determining the core entity and facet information extracted from the text to be processed.
An information extraction apparatus, comprising: a matrix construction module and an information extraction module;
the matrix construction module is configured to construct, for a text to be processed, a three-dimensional matrix corresponding to a first sequence by using a first model obtained through pre-training, wherein the first sequence is generated according to the text to be processed, the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, and the two characters may be the same character or different characters;
and the information extraction module is configured to determine, according to the three-dimensional matrix, the core entity and facet information extracted from the text to be processed.
A model acquisition apparatus, comprising: a data acquisition module and a model training module;
the data acquisition module is configured to acquire training data, wherein the training data comprises: a text, a core entity extracted from the text, the type of the core entity, and facet information corresponding to at least one core entity;
the model training module is configured to train a first model by using the training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, wherein the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters may be the same character or different characters, and the three-dimensional matrix is used to determine the core entity and facet information extracted from the text to be processed.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
A computer program product comprising computer programs/instructions which, when executed by a processor, implement a method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of an embodiment of an information extraction method according to the present disclosure;
FIG. 2 is a schematic diagram of a two-dimensional matrix according to the present disclosure;
FIG. 3 is a schematic diagram of the matrix to be processed (a two-dimensional matrix) corresponding to the entity "x di R #7" according to the present disclosure;
FIG. 4 is a schematic diagram of the matrix to be processed corresponding to the entity "car" according to the present disclosure;
FIG. 5 is a schematic diagram of the matrix to be processed corresponding to the entity "color matching" according to the present disclosure;
FIG. 6 is a schematic diagram of the matrix to be processed corresponding to the facet information "color matching" according to the present disclosure;
FIG. 7 is a flowchart of an embodiment of a model acquisition method according to the present disclosure;
FIG. 8 is a schematic structural diagram of an embodiment 800 of an information extraction apparatus according to the present disclosure;
FIG. 9 is a schematic structural diagram of an embodiment 900 of a model acquisition apparatus according to the present disclosure;
FIG. 10 is a schematic block diagram of an electronic device 1000 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Fig. 1 is a flowchart of an embodiment of an information extraction method according to the present disclosure. As shown in fig. 1, the method includes the following specific implementation.
In step 101, for a text to be processed, a three-dimensional matrix corresponding to a first sequence is constructed by using a first model obtained through pre-training, where the first sequence is generated according to the text to be processed, the three-dimensional matrix includes dependencies of any two characters (tokens) in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types include different entity types and different facet types, and the two characters may be the same character or different characters.
In step 102, core entities and facet information extracted from the text to be processed are determined according to the three-dimensional matrix.
The facet information of the entity refers to attribute information or point of interest information of the entity.
In a conventional extraction method, two independent strategy models are usually adopted to perform entity extraction and entity-facet pair extraction respectively, and the contents extracted by the two strategy models are then combined, for example by taking their intersection, to obtain the finally required core entity and facet information.
The first strategy model can be composed of a Named Entity Recognition (NER) extraction model and an NER ranking model connected in series, where the NER extraction model is used for generating an NER extraction list, and the NER ranking model is used for ranking the entities in the NER extraction list in descending order of core degree. The second strategy model can be an entity-facet extraction model used for extracting entity-facet pairs and ranking them by the degree of association between the entity and the facet. The intersection of the entities obtained by the two strategy models can then be used as the required core entities, and the facets corresponding to these core entities can be used as the required facet information.
In the above processing mode, a plurality of different models need to be trained in advance, so the implementation complexity is high and time cost and resource consumption are increased; moreover, the whole extraction process is cumbersome, so the extraction speed is low.
With the scheme of this method embodiment, only the first model needs to be obtained through pre-training, after which the core entity and facet information can be extracted by means of the first model. The implementation is simple, that is, implementation complexity is reduced, which saves time cost and resource consumption.
Specifically, for a text to be processed, a three-dimensional matrix corresponding to a first sequence can be constructed by using the first model, where the first sequence is generated according to the text to be processed, the three-dimensional matrix includes the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types can include different entity types and different facet types, and the two characters may be the same character or different characters.
Preferably, the first sequence may include: a first start character ([cls0]), a second start character ([cls1]), each character in the text to be processed, and an end character ([sep]), where the first start character and the second start character are identifiers added in order before the first character of the text to be processed, and the end character is an identifier added after the last character of the text to be processed.
For example, if the text to be processed is "the latest version of the x di R #7 car has eight different color matching", the first sequence may be generated to include the following characters in order: [cls0], [cls1], latest, new, original, di, R, #, 7, automobile, owned, eighty, money, different, same, color, match, [sep].
Through the processing, the initial position and the end position of the text to be processed can be identified, so that a good foundation is laid for subsequent processing.
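The construction of the first sequence described above can be sketched as follows. This is a minimal illustration that assumes the text is split into individual characters; the function name `build_first_sequence` is hypothetical and does not come from the disclosure.

```python
from typing import List


def build_first_sequence(text: str) -> List[str]:
    """Wrap the characters of `text` with the two start markers and the end marker.

    Assumption: one token per character, as in the example in the description.
    """
    return ["[cls0]", "[cls1]", *text, "[sep]"]


tokens = build_first_sequence("abc")
print(tokens)  # ['[cls0]', '[cls1]', 'a', 'b', 'c', '[sep]']
```

With this layout, [cls1] always sits at the second position of the sequence, which is why the second row of each two-dimensional matrix can be given a fixed meaning later in the description.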
Preferably, the three-dimensional matrix may include C layers of two-dimensional matrices, where each layer corresponds to a different relation type and C is a positive integer greater than one. The value of the element with coordinate (i, j) in any two-dimensional matrix is the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relation type, where 1 ≤ i ≤ M, 1 ≤ j ≤ M, 1 ≤ c ≤ C, M is a positive integer greater than one representing the number of characters in the first sequence, and the c-th relation type is the relation type corresponding to that two-dimensional matrix. The larger the value, the stronger the dependency. Correspondingly, in the three-dimensional space, the value of the element with coordinate (i, j, c) in the three-dimensional matrix is the dependency of the i-th character and the j-th character in the first sequence on the c-th relation type.
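To make the indexing concrete, the sketch below stores the three-dimensional matrix as C layers of M x M values in plain Python lists and looks up an element by the 1-based coordinates (i, j, c) used in the text. The names `empty_matrix` and `dependency`, and the layer-first storage order, are illustrative assumptions rather than anything the disclosure prescribes.

```python
from typing import List

Matrix3D = List[List[List[float]]]


def empty_matrix(m: int, c: int) -> Matrix3D:
    """Return C layers, each an independent M x M grid of zeros."""
    return [[[0.0] * m for _ in range(m)] for _ in range(c)]


def dependency(matrix: Matrix3D, i: int, j: int, c: int) -> float:
    """Look up element (i, j, c) using 1-based coordinates, 1 <= i, j <= M and 1 <= c <= C."""
    return matrix[c - 1][i - 1][j - 1]


scores = empty_matrix(m=6, c=3)
scores[1][2][4] = 0.9  # layer c = 2, characters i = 3 and j = 5
print(dependency(scores, 3, 5, 2))
```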
The specific value of C can be determined according to actual needs. As mentioned above, the relation types may include different entity types and different facet types. For example, the different entity types may include a person name, a place name, a vehicle name, an organization name, a geographic location, a time, and the like, and the different facet types may include different facets of the different entity types, such as the color matching or the engine displacement of a car. Which entity types and which facet types are included may depend on actual needs; preferably, as many of the possible entity types and facet types as practical are included.
Fig. 2 is a schematic diagram of a two-dimensional matrix according to the present disclosure. As shown in fig. 2, each small square represents an element, and the value of the element with coordinate (i, j) is the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relation type, where the c-th relation type is the relation type corresponding to the two-dimensional matrix. For example, the value of the element with coordinate (3, 5) represents the dependency of the 3rd character "the most" in the first sequence and the 5th character "the money" in the first sequence on the c-th relation type, that is, the likelihood that both are components of the c-th relation type. As another example, the value of the element with coordinate (4, 4) represents the dependency of the 4th character "new" in the first sequence with itself on the c-th relation type.
By means of the constructed three-dimensional matrix, the core entity and the facet information extracted from the text to be processed can be determined.
Preferably, each layer of two-dimensional matrix can be traversed, and the following processing can be performed for each traversed two-dimensional matrix: the traversed two-dimensional matrix is taken as the matrix to be processed; in response to determining that elements with values greater than a first threshold exist in the matrix to be processed, those elements are taken as target elements, and the content formed by the characters corresponding to the target elements is taken as the extracted content. The specific value of the first threshold can be determined according to actual needs.
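The traversal and thresholding step can be sketched as below. The function name `find_targets` and the dictionary it returns are hypothetical; coordinates are 0-based here for convenience, whereas the description counts from 1.

```python
from typing import Dict, List, Tuple

Layer = List[List[float]]


def find_targets(matrix: List[Layer], first_threshold: float) -> Dict[int, List[Tuple[int, int]]]:
    """For each layer, collect the (row, column) positions whose value exceeds the threshold.

    Layers without any such element are skipped, mirroring the "in response to" condition.
    """
    hits: Dict[int, List[Tuple[int, int]]] = {}
    for c, layer in enumerate(matrix):
        targets = [
            (i, j)
            for i, row in enumerate(layer)
            for j, value in enumerate(row)
            if value > first_threshold
        ]
        if targets:
            hits[c] = targets
    return hits


toy = [
    [[0.1, 0.2], [0.3, 0.1]],  # layer 0: nothing above 0.5
    [[0.9, 0.1], [0.2, 0.8]],  # layer 1: two elements above 0.5
]
print(find_targets(toy, first_threshold=0.5))  # {1: [(0, 0), (1, 1)]}
```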
In addition, preferably, whether the extracted content is an entity or facet information may be determined according to the relation type corresponding to the matrix to be processed and/or the position at which the target elements appear in the matrix to be processed, and an extracted entity may be used as a required core entity.
Taking the text to be processed "the latest version of the x di R #7 car has eight different color matching" as an example, suppose that three entities are extracted from it, namely "x di R #7", "car" and "color matching".
Fig. 3 is a schematic diagram of the matrix to be processed (a two-dimensional matrix) corresponding to the entity "x di R #7" according to the present disclosure. As shown in fig. 3, 30 elements have values greater than the first threshold. 25 of them form a square region whose axis of symmetry is the diagonal of the matrix to be processed; this region models the dependency (correlation) between the characters inside the entity "x di R #7" and marks the boundary between characters inside the entity and characters outside it. The other 5 elements are located in the second row of the matrix to be processed. In practical applications, [cls1] generally represents the abstract semantics of the whole text to be processed, so the values of the elements in the second row can be understood as the dependency between the text as a whole and each character. As shown in fig. 3, the 30 elements whose values are greater than the first threshold may be used as target elements, and the content "x di R #7" formed by the characters corresponding to the target elements may be used as the extracted content; that is, the extracted content "x di R #7" may be determined from the values of the elements in the second row together with the entity boundary defined by the square region.
Fig. 4 is a schematic diagram of the matrix to be processed corresponding to the entity "car" according to the present disclosure. Fig. 5 is a schematic diagram of the matrix to be processed corresponding to the entity "color matching" according to the present disclosure.
Since the relation type corresponding to each matrix to be processed is known, it can be determined from the relation types (here, entity types) corresponding to the matrices shown in figs. 3, 4 and 5 that the extracted "x di R #7", "car" and "color matching" are all entities. Alternatively, since the region corresponding to an entity is always a square region whose axis of symmetry is the diagonal of the matrix to be processed, it can also be determined from the position of the target elements in the matrix to be processed that the extracted "x di R #7", "car" and "color matching" are all entities. The extracted entities can all be used as required core entities.
Still taking the text to be processed "the latest version of the x di R #7 car has eight different color matching" as an example, suppose that one piece of facet information, namely "color matching", is extracted from it.
Fig. 6 is a schematic diagram of the matrix to be processed corresponding to the facet information "color matching" according to the present disclosure. As shown in fig. 6, 20 elements have values greater than the first threshold. These 20 target elements form a rectangular region that lies off the diagonal and outside the [cls] rows, and the content "color matching" formed by the characters corresponding to the 20 target elements can be used as the extracted content.
Since the relation type corresponding to the matrix to be processed shown in fig. 6 is known, it may be determined from that relation type that the extracted "color matching" is facet information. Alternatively, because the region corresponding to facet information is a rectangular region lying off the diagonal and outside the [cls] rows, it may also be determined from the position of the target elements in the matrix to be processed that the extracted "color matching" is facet information.
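The position-based rule described for figs. 3 to 6 can be sketched as a small classifier. This is only one possible reading: it treats a region that is symmetric about the diagonal and touches the diagonal as an entity, and a region with no diagonal element as facet information, after discarding the [cls] rows. The name `classify_region`, the 0-based row indices of the two start characters, and the example positions are assumptions.

```python
from typing import Iterable, Optional, Tuple


def classify_region(targets: Iterable[Tuple[int, int]],
                    cls_rows: Tuple[int, ...] = (0, 1)) -> Optional[str]:
    """Guess whether the target elements of one layer describe an entity or a facet."""
    # Ignore target elements that sit in the rows of the two start characters.
    body = {(i, j) for i, j in targets if i not in cls_rows}
    if not body:
        return None
    symmetric = all((j, i) in body for i, j in body)
    on_diagonal = any(i == j for i, j in body)
    if symmetric and on_diagonal:
        return "entity"  # square block mirrored across the diagonal
    if not on_diagonal:
        return "facet"   # rectangle lying entirely off the diagonal
    return None


entity_span = range(5, 10)  # hypothetical 0-based positions of a 5-character entity
facet_span = range(17, 21)  # hypothetical 0-based positions of a 4-character facet

# 25 elements in the square block plus 5 elements in the [cls1] row, as in fig. 3.
entity_targets = [(i, j) for i in entity_span for j in entity_span]
entity_targets += [(1, j) for j in entity_span]

# 20 elements in an off-diagonal rectangle, as in fig. 6.
facet_targets = [(i, j) for i in entity_span for j in facet_span]

print(len(entity_targets), classify_region(entity_targets))
print(len(facet_targets), classify_region(facet_targets))
```

When the relation type of each layer is known, as the description notes, the layer index alone already tells entity layers from facet layers; the positional check is an alternative or a cross-check.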
It can be seen that, by adopting the processing mode, the extracted content can be efficiently and accurately determined by traversing each two-dimensional matrix and judging the values of the elements in the traversed two-dimensional matrix, and the extracted content can be determined to be entity or facet information in different modes, so that the processing mode is very flexible and convenient.
In practical application, the extracted entity can be used as a required core entity, and the extracted facet information can be used as required facet information.
Alternatively, preferably, for any extracted entity, the core degree of the entity in the text to be processed may be determined according to the target elements in the matrix to be processed corresponding to the entity, and in response to determining that the core degree is greater than a second threshold, the entity is taken as an extracted core entity. The specific value of the second threshold can be determined according to actual needs.
In this way, the extracted entities are further filtered and only the entities whose core degree is greater than the second threshold are retained, which removes possible erroneous results and further improves the accuracy of the finally obtained core entities.
Preferably, determining the core degree of the entity in the text to be processed may include: obtaining, in the second row of the matrix to be processed corresponding to the entity, the mean of the values of the elements corresponding to the characters forming the entity, and taking the mean as the core degree of the entity in the text to be processed.
The core degrees of different entities can be efficiently and accurately determined by means of part of target elements with specific meanings, additional acquisition of other information is not needed, and the method is very simple and convenient.
Taking fig. 3 as an example, the mean of the values of the elements corresponding to the 5 characters "x", "di", "R", "#" and "7" in the second row of the matrix to be processed may be calculated, and the resulting mean is used as the core degree of the entity "x di R #7". The core degrees of the two entities "car" and "color matching" can be obtained in a similar manner. The core degrees of the three entities can then be compared with the second threshold; assuming that only the core degree of "x di R #7" is greater than the second threshold, "x di R #7" is used as the extracted core entity.
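The core-degree computation can be written in a few lines. In this sketch the second row is row index 1 (0-based), the entity is given by the column positions of its characters, and the names `core_degree` and `is_core_entity` as well as the toy values are hypothetical.

```python
from typing import List, Sequence

Layer = List[List[float]]


def core_degree(layer: Layer, entity_positions: Sequence[int], cls1_row: int = 1) -> float:
    """Mean of the [cls1]-row values over the columns of the entity's characters."""
    values = [layer[cls1_row][j] for j in entity_positions]
    return sum(values) / len(values)


def is_core_entity(layer: Layer, entity_positions: Sequence[int], second_threshold: float) -> bool:
    """Keep the entity only if its core degree exceeds the second threshold."""
    return core_degree(layer, entity_positions) > second_threshold


# Toy 4 x 4 layer: the entity occupies columns 2 and 3.
layer = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0],  # second row, i.e. the [cls1] row
    [0.0, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.9, 0.9],
]
print(core_degree(layer, [2, 3]))          # 0.75
print(is_core_entity(layer, [2, 3], 0.6))  # True
```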
In addition, as shown in fig. 6, the matrix to be processed makes it possible to determine that the core entity corresponding to the facet information "color matching" is "x di R #7", so the final extracted content is: the core entity "x di R #7" and its corresponding facet information "color matching".
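One way to read an entity-facet pair off a facet layer is sketched below. It assumes that the rows of the off-diagonal rectangle index the characters of the entity and the columns index the characters of the facet; the description does not state this orientation, so it is a guess that merely fits the 20-element (5 by 4) region of fig. 6. The function name `split_facet_region` and the toy tokens are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_facet_region(targets: Iterable[Tuple[int, int]], tokens: List[str]) -> Tuple[str, str]:
    """Return (entity_text, facet_text) from the rows and columns covered by the targets.

    Assumption: rows correspond to entity characters, columns to facet characters.
    """
    pairs = list(targets)
    rows = sorted({i for i, _ in pairs})
    cols = sorted({j for _, j in pairs})
    entity_text = "".join(tokens[i] for i in rows)
    facet_text = "".join(tokens[j] for j in cols)
    return entity_text, facet_text


tokens = ["[cls0]", "[cls1]", "A", "B", "c", "o", "l", "[sep]"]
targets = [(i, j) for i in (2, 3) for j in (4, 5, 6)]
print(split_facet_region(targets, tokens))  # ('AB', 'col')
```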
It can be seen that the implementation of the solution of the present disclosure needs to rely on the first model obtained by pre-training, and the following describes the manner of obtaining the first model.
Fig. 7 is a flowchart of an embodiment of a model acquisition method according to the present disclosure. As shown in fig. 7, the method includes the following specific implementation.
In step 701, training data is obtained, where the training data includes: a text, a core entity extracted from the text, the type of the core entity, and facet information corresponding to at least one core entity.
In step 702, a first model is trained by using the obtained training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, where the three-dimensional matrix includes dependencies of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types include different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters may be the same character or different characters, and the three-dimensional matrix is used to determine the core entity and facet information extracted from the text to be processed.
With the scheme of this method embodiment, only the first model needs to be obtained through pre-training, after which the core entity and facet information can be extracted by means of the first model. The implementation is simple, so implementation complexity is reduced and time cost and resource consumption are saved. In addition, the core entity and facet information extracted from the text to be processed can be determined directly from the three-dimensional matrix constructed by the first model, which simplifies the information extraction process and increases the information extraction speed. Furthermore, large-scale model training can improve the accuracy of the constructed three-dimensional matrix and thereby the accuracy of the information extraction result.
How to acquire the training data is not limited, for example, text data accumulated in each scene may be acquired, and the required training data may be generated by combining with manual labeling and the like.
By training the first model, the first model can learn the dependency of any two (same or different) characters on different relation types, and accordingly, after the training is finished, any text to be processed is given, and the corresponding three-dimensional matrix can be generated by the first model.
It is noted that while for simplicity of explanation, the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure. In addition, for parts which are not described in detail in a certain embodiment, reference may be made to relevant descriptions in other embodiments.
In short, the scheme of the method embodiments can save time cost and resource consumption, increase the information extraction speed, and improve the accuracy of the information extraction result.
The above is a description of embodiments of the method, and the embodiments of the apparatus are described below to further illustrate the aspects of the disclosure.
Fig. 8 is a schematic diagram illustrating a structure of an information extraction apparatus 800 according to an embodiment of the present disclosure. As shown in fig. 8, includes: a matrix construction module 801 and an information extraction module 802.
The matrix construction module 801 is configured to construct, for a text to be processed, a three-dimensional matrix corresponding to a first sequence by using a first model obtained through pre-training, where the first sequence is a sequence generated according to the text to be processed, the three-dimensional matrix includes dependencies of any two characters in the first sequence on predetermined C different relationship types, C is a positive integer greater than one, the relationship types include different entity types and different facet types, and the two characters include the same character and different characters.
And an information extraction module 802, configured to determine, according to the three-dimensional matrix, core entities and facet information extracted from the text to be processed.
By adopting the scheme of the embodiment of the device, the first model is obtained only by pre-training, the core entity and the facet information can be extracted subsequently by means of the first model, the implementation mode is simple, the implementation complexity is reduced, the time cost and the resource consumption are saved, in addition, the core entity and the facet information extracted from the text to be processed can be directly determined by utilizing the three-dimensional matrix constructed based on the first model, the information extraction process is simplified, the information extraction speed is further improved, in addition, the accuracy of the constructed three-dimensional matrix can be improved through large-scale model training, the accuracy of the information extraction result is further improved, and the like.
Preferably, the first sequence may include: a first start character, a second start character, the characters in the text to be processed, and an end character, where the first start character and the second start character are identifiers added, in that order, before the first character of the text to be processed, and the end character is an identifier added after the last character of the text to be processed.
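As an illustration only, the construction of the first sequence can be sketched in Python as follows. The marker strings `[S1]`, `[S2]` and `[E]` and the function name `build_first_sequence` are hypothetical; the disclosure only requires two start identifiers before the text and one end identifier after it.

```python
# Hypothetical marker strings; the disclosure does not name the identifiers.
START_1 = "[S1]"
START_2 = "[S2]"
END = "[E]"


def build_first_sequence(text: str) -> list[str]:
    """Return [START_1, START_2, c_1, ..., c_n, END] for the given text."""
    return [START_1, START_2, *text, END]


sequence = build_first_sequence("abc")
print(sequence)       # ['[S1]', '[S2]', 'a', 'b', 'c', '[E]']
print(len(sequence))  # 6, i.e. M = len(text) + 3
```

Under this sketch the number of characters M in the first sequence is always the length of the text plus three.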
Preferably, the three-dimensional matrix may include C layers of two-dimensional matrices, where each layer corresponds to a different relationship type and C is a positive integer greater than one. The value of the element with coordinates (i, j) in any two-dimensional matrix is the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relationship type, where 1 ≤ i ≤ M, 1 ≤ j ≤ M, 1 ≤ c ≤ C, M is a positive integer greater than one and represents the number of characters in the first sequence, and the c-th relationship type is the relationship type corresponding to that two-dimensional matrix. The larger the value, the stronger the dependency. Correspondingly, in the three-dimensional space, the value of the element with coordinates (i, j, c) in the three-dimensional matrix is the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relationship type.
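The indexing described above can be illustrated with a minimal Python sketch. The nested-list layout `matrix[c][i][j]` and the helper names `make_matrix` and `dependency` are assumptions made for illustration; in practice the values would be produced by the first model rather than set by hand.

```python
def make_matrix(num_types: int, seq_len: int) -> list[list[list[float]]]:
    """Create C layers of M x M zeros, stored as matrix[c][i][j] (0-based)."""
    return [[[0.0] * seq_len for _ in range(seq_len)] for _ in range(num_types)]


def dependency(matrix, i: int, j: int, c: int) -> float:
    """Look up element (i, j, c) using the 1-based indices of the description."""
    num_types, seq_len = len(matrix), len(matrix[0])
    if not (1 <= i <= seq_len and 1 <= j <= seq_len and 1 <= c <= num_types):
        raise IndexError("require 1 <= i, j <= M and 1 <= c <= C")
    return matrix[c - 1][i - 1][j - 1]


C, M = 3, 6                 # illustrative sizes only
matrix = make_matrix(C, M)
matrix[1][2][4] = 0.9       # layer c=2, characters i=3 and j=5 (1-based)
print(dependency(matrix, i=3, j=5, c=2))  # 0.9
```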
With the aid of the constructed three-dimensional matrix, the information extraction module 802 can determine the core entity and facet information extracted from the text to be processed.
Preferably, the information extraction module 802 may traverse each layer of the two-dimensional matrices and, for each traversed two-dimensional matrix, perform the following processing: take the traversed two-dimensional matrix as the matrix to be processed and, in response to determining that elements with values greater than a first threshold exist in the matrix to be processed, take those elements as target elements and take the content formed by the characters corresponding to the target elements as the extracted content.
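The traversal can be sketched as follows. This passage does not spell out how the characters corresponding to the target elements are assembled into content, so the sketch makes a simplifying assumption: the content of a layer is the concatenation, in sequence order, of the distinct non-marker characters touched by that layer's target elements. The function name, marker strings and threshold value are hypothetical.

```python
def extract_contents(matrix, sequence, first_threshold, markers=frozenset()):
    """Return {layer_index: extracted_text} for layers that have target elements."""
    extracted = {}
    for c, layer in enumerate(matrix):          # traverse each 2D matrix
        touched = set()
        for i, row in enumerate(layer):
            for j, value in enumerate(row):
                if value > first_threshold:     # (i, j) is a target element
                    touched.update((i, j))
        # Assumption: join the touched characters in sequence order, skipping markers.
        chars = [sequence[k] for k in sorted(touched) if sequence[k] not in markers]
        if chars:
            extracted[c] = "".join(chars)
    return extracted


sequence = ["[S1]", "[S2]", "a", "b", "c", "[E]"]
markers = frozenset({"[S1]", "[S2]", "[E]"})
matrix = [[[0.0] * 6 for _ in range(6)] for _ in range(2)]
matrix[0][0][2] = 0.8   # layer 0: first start character with "a"
matrix[0][0][3] = 0.7   # layer 0: first start character with "b"
matrix[1][4][4] = 0.95  # layer 1: "c" with itself

print(extract_contents(matrix, sequence, 0.5, markers))  # {0: 'ab', 1: 'c'}
```

A layer with no value above the first threshold contributes nothing, which matches the "in response to" condition in the text.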
In addition, preferably, the information extraction module 802 may further determine whether the extracted content is an entity or facet information according to the relationship type corresponding to the matrix to be processed and/or the occurrence position of the target element in the matrix to be processed, and may use the extracted entity as a required core entity.
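Deciding between entity and facet information from the relationship type of a layer can be sketched as a simple lookup. The table `RELATION_TYPES` and its entries are purely hypothetical examples of the C predetermined relationship types, and the sketch ignores the optional use of the position of the target element.

```python
# Hypothetical table: layer index -> (kind, type name). The actual C relationship
# types are predetermined by whoever trains the first model.
RELATION_TYPES = [
    ("entity", "person"),
    ("entity", "organization"),
    ("facet", "occupation"),
]


def label_extracted(extracted: dict[int, str]) -> list[dict[str, str]]:
    """Attach the kind (entity or facet) and type name of its layer to each content."""
    labelled = []
    for layer_index, content in sorted(extracted.items()):
        kind, type_name = RELATION_TYPES[layer_index]
        labelled.append({"content": content, "kind": kind, "type": type_name})
    return labelled


for item in label_extracted({0: "ab", 2: "c"}):
    print(item)
```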
Alternatively, preferably, the information extraction module 802 may further determine, for any extracted entity, the core degree of the entity in the text to be processed according to the target elements in the matrix to be processed corresponding to the entity and, in response to determining that the core degree is greater than a second threshold, take the entity as an extracted core entity.
Preferably, the manner in which the information extraction module 802 determines the core degree of the entity in the text to be processed may include: obtaining the mean of the values of the elements that correspond to the characters forming the entity in the second row of the matrix to be processed corresponding to the entity, and taking the mean as the core degree of the entity in the text to be processed.
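The core degree computation can be illustrated with the following Python sketch. It assumes that the elements corresponding to the characters of the entity are the entries of the second row at the column positions of those characters; the function names, the threshold and the numeric values are made up for illustration.

```python
from statistics import fmean


def core_degree(layer, entity_columns):
    """Mean of the second-row values at the columns of the entity's characters."""
    second_row = layer[1]  # second row is index 1 with 0-based indexing
    return fmean(second_row[j] for j in entity_columns)


def is_core_entity(layer, entity_columns, second_threshold):
    """An entity counts as a core entity when its core degree exceeds the threshold."""
    return core_degree(layer, entity_columns) > second_threshold


layer = [[0.0] * 6 for _ in range(6)]
layer[1][2] = 1.0   # second row, column of the entity's first character
layer[1][3] = 0.5   # second row, column of the entity's second character

print(core_degree(layer, [2, 3]))          # 0.75
print(is_core_entity(layer, [2, 3], 0.6))  # True
```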
Fig. 9 is a schematic diagram illustrating the structure of a model acquisition apparatus 900 according to an embodiment of the present disclosure. As shown in fig. 9, the apparatus includes a data acquisition module 901 and a model training module 902.
The data acquisition module 901 is configured to acquire training data, where the training data includes: a text, the core entities extracted from the text, the types of the core entities, and the facet information corresponding to at least one core entity.
The model training module 902 is configured to train a first model by using the acquired training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, where the three-dimensional matrix includes dependencies of any two characters in the first sequence on C predetermined different relationship types, C is a positive integer greater than one, the relationship types include different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters include the same character and different characters, and the three-dimensional matrix is used to determine the core entity and facet information extracted from the text to be processed.
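One possible in-memory shape for a training example is sketched below. The class names, field names and the representation of facet information as a mapping from facet type to facet value are assumptions, and the sample values are placeholders rather than data from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class CoreEntity:
    text: str                    # the entity as it appears in the text
    entity_type: str             # the type of the core entity
    # Assumed layout: facet type -> facet value; may be empty for some entities.
    facets: dict[str, str] = field(default_factory=dict)


@dataclass
class TrainingExample:
    text: str
    core_entities: list[CoreEntity]


example = TrainingExample(
    text="<text>",
    core_entities=[
        CoreEntity("<entity A>", "<entity type>", {"<facet type>": "<facet value>"}),
        CoreEntity("<entity B>", "<entity type>"),
    ],
)
# The training data requires facet information for at least one core entity.
print(any(entity.facets for entity in example.core_entities))  # True
```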
With the scheme of this apparatus embodiment, only the first model needs to be obtained through pre-training, and the core entity and facet information can subsequently be extracted by means of the first model. The implementation is therefore simple, which reduces implementation complexity and saves time cost and resource consumption. In addition, the core entity and facet information extracted from the text to be processed can be determined directly from the three-dimensional matrix constructed by the first model, which simplifies the information extraction process and in turn increases the information extraction speed. Furthermore, large-scale model training can improve the accuracy of the constructed three-dimensional matrix and thereby the accuracy of the information extraction result.
The manner of acquiring the training data is not limited. For example, text data accumulated in various scenarios may be collected, and the required training data may be generated with the help of manual labeling and the like.
By training the first model, the first model can learn the dependency of any two (same or different) characters on the different relationship types. Accordingly, after training is finished, the first model can generate the corresponding three-dimensional matrix for any given text to be processed.
For the specific workflow of the apparatus embodiments shown in fig. 8 and fig. 9, reference may be made to the related description in the foregoing method embodiments, which is not repeated here.
In short, the scheme of the apparatus embodiments of the present disclosure can save time cost and resource consumption, increase the information extraction speed, improve the accuracy of the information extraction result, and so on.
The scheme of the present disclosure can be applied to the field of artificial intelligence, in particular to fields such as deep learning, natural language processing and knowledge graphs. Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.
The text and the like in the embodiments of the present disclosure are not directed at any specific user and cannot reflect the personal information of any specific user. In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information all comply with the relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 shows a schematic block diagram of an electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured in any other suitable manner (e.g., by way of firmware) to perform the methods described in this disclosure.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (19)

1. An information extraction method, comprising:
for a text to be processed, constructing a three-dimensional matrix corresponding to a first sequence by using a first model obtained by pre-training, wherein the first sequence is a sequence generated according to the text to be processed, the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, and the two characters comprise the same character and different characters; and
determining, according to the three-dimensional matrix, the core entity and the facet information extracted from the text to be processed.
2. The method of claim 1, wherein,
the first sequence includes: a first start character, a second start character, each character in the text to be processed, and an end character;
the first start character and the second start character are identifiers sequentially added before the first character in the text to be processed, and the end character is an identifier added after the last character in the text to be processed.
3. The method of claim 2, wherein,
the three-dimensional matrix comprises: C layers of two-dimensional matrices, each layer corresponding to a different relation type;
the value of the element with coordinates (i, j) in any two-dimensional matrix is: the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relation type, wherein 1 ≤ i ≤ M, 1 ≤ j ≤ M, 1 ≤ c ≤ C, M is a positive integer greater than one and represents the number of characters in the first sequence, and the c-th relation type represents the relation type corresponding to the two-dimensional matrix.
4. The method of claim 3, wherein,
the determining of the core entity and the facet information extracted from the text to be processed according to the three-dimensional matrix comprises the following steps:
traversing each layer of two-dimensional matrix;
for each traversed two-dimensional matrix, performing the following processing: taking the traversed two-dimensional matrix as a matrix to be processed and, in response to determining that an element with a value greater than a first threshold exists in the matrix to be processed, taking the element with the value greater than the first threshold as a target element and taking the content formed by the characters corresponding to the target element as the extracted content.
5. The method of claim 4, further comprising:
determining whether the extracted content is an entity or facet information according to the relation type corresponding to the matrix to be processed and/or the position at which the target element appears in the matrix to be processed, and taking the extracted entity as the core entity.
6. The method of claim 5, further comprising:
for any extracted entity, determining the core degree of the entity in the text to be processed according to the target elements in the matrix to be processed corresponding to the entity, and taking the entity as the core entity in response to determining that the core degree is greater than a second threshold.
7. The method of claim 6, wherein,
the determining the core degree of the entity in the text to be processed comprises: obtaining a mean of the values of the elements corresponding to each character forming the entity in the second row of the matrix to be processed corresponding to the entity, and taking the mean as the core degree of the entity in the text to be processed.
8. A model acquisition method, comprising:
acquiring training data, wherein the training data comprises: a text, a core entity extracted from the text, the type of the core entity, and facet information corresponding to at least one core entity; and
training a first model by using the training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, wherein the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters comprise the same character and different characters, and the three-dimensional matrix is used for determining the core entity and facet information extracted from the text to be processed.
9. An information extraction apparatus, comprising: a matrix construction module and an information extraction module;
the matrix construction module is configured to construct, for a text to be processed, a three-dimensional matrix corresponding to a first sequence by using a first model obtained by pre-training, wherein the first sequence is a sequence generated according to the text to be processed, the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, and the two characters comprise the same character and different characters;
the information extraction module is configured to determine, according to the three-dimensional matrix, the core entity and the facet information extracted from the text to be processed.
10. The apparatus of claim 9, wherein,
the first sequence includes: a first start character, a second start character, each character in the text to be processed, and an end character;
the first start character and the second start character are identifiers sequentially added before the first character in the text to be processed, and the end character is an identifier added after the last character in the text to be processed.
11. The apparatus of claim 10, wherein,
the three-dimensional matrix comprises: C layers of two-dimensional matrices, each layer corresponding to a different relation type;
the value of the element with coordinates (i, j) in any two-dimensional matrix is: the dependency of the i-th character in the first sequence and the j-th character in the first sequence on the c-th relation type, wherein 1 ≤ i ≤ M, 1 ≤ j ≤ M, 1 ≤ c ≤ C, M is a positive integer greater than one and represents the number of characters in the first sequence, and the c-th relation type represents the relation type corresponding to the two-dimensional matrix.
12. The apparatus of claim 11, wherein,
the information extraction module traverses each layer of two-dimensional matrix and, for each traversed two-dimensional matrix, performs the following processing: taking the traversed two-dimensional matrix as a matrix to be processed and, in response to determining that an element with a value greater than a first threshold exists in the matrix to be processed, taking the element with the value greater than the first threshold as a target element and taking the content formed by the characters corresponding to the target element as the extracted content.
13. The apparatus of claim 12, wherein,
the information extraction module is further configured to determine whether the extracted content is entity or facet information according to the relationship type corresponding to the matrix to be processed and/or the occurrence position of the target element in the matrix to be processed, and use the extracted entity as the core entity.
14. The apparatus of claim 13, wherein,
the information extraction module is further configured to, for any extracted entity, determine a core degree of the entity in the text to be processed according to the target element in the matrix to be processed corresponding to the entity, and in response to determining that the core degree is greater than a second threshold, take the entity as the core entity.
15. The apparatus of claim 14, wherein,
the information extraction module obtains a mean value of values of elements corresponding to characters forming the entity in a second row of the matrix to be processed corresponding to the entity, and the mean value is used as a core degree of the entity in the text to be processed.
16. A model acquisition apparatus, comprising: a data acquisition module and a model training module;
the data acquisition module is configured to acquire training data, wherein the training data comprises: a text, a core entity extracted from the text, the type of the core entity, and facet information corresponding to at least one core entity;
the model training module is configured to train a first model by using the training data, so that the first model learns how to construct a three-dimensional matrix corresponding to a first sequence, wherein the three-dimensional matrix comprises the dependency of any two characters in the first sequence on C predetermined different relation types, C is a positive integer greater than one, the relation types comprise different entity types and different facet types, the first sequence is a sequence generated according to any text to be processed, the two characters comprise the same character and different characters, and the three-dimensional matrix is used for determining the core entity and facet information extracted from the text to be processed.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program/instructions which, when executed by a processor, implement the method of any one of claims 1-8.
CN202310085290.3A 2023-01-19 2023-01-19 Information extraction and model acquisition method and device, electronic equipment and storage medium Pending CN115983281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085290.3A CN115983281A (en) 2023-01-19 2023-01-19 Information extraction and model acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310085290.3A CN115983281A (en) 2023-01-19 2023-01-19 Information extraction and model acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115983281A true CN115983281A (en) 2023-04-18

Family

ID=85962517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085290.3A Pending CN115983281A (en) 2023-01-19 2023-01-19 Information extraction and model acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115983281A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination