CN113420533A - Training method and device of information extraction model and electronic equipment

Training method and device of information extraction model and electronic equipment

Info

Publication number
CN113420533A
Authority
CN
China
Prior art keywords
sample data
data set
character
information
characters
Prior art date
Legal status
Granted
Application number
CN202110780566.0A
Other languages
Chinese (zh)
Other versions
CN113420533B (en)
Inventor
董国鹏
冯丹
桂文才
邢闯锋
朱卫威
宗广辉
赵威
李金贝
李辰
黄林冲
赖正首
Current Assignee
Sun Yat Sen University
China Railway Seventh Group Co Ltd
Survey and Design Research Institute of China Railway Seventh Group Co Ltd
Original Assignee
Sun Yat Sen University
China Railway Seventh Group Co Ltd
Survey and Design Research Institute of China Railway Seventh Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University, China Railway Seventh Group Co Ltd, Survey and Design Research Institute of China Railway Seventh Group Co Ltd filed Critical Sun Yat Sen University
Priority to CN202110780566.0A priority Critical patent/CN113420533B/en
Publication of CN113420533A publication Critical patent/CN113420533A/en
Application granted granted Critical
Publication of CN113420533B publication Critical patent/CN113420533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/12: Use of codes for handling textual entities
    • G06F 40/151: Transformation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing

Abstract

The application belongs to the field of information processing, and provides a training method and device for an information extraction model, and an electronic device. The method comprises the following steps: performing data augmentation processing on an acquired first sample data set to obtain an augmented second sample data set; converting text data in the second sample data set into semantic vectors; and inputting the semantic vectors into an information extraction model for calculation, outputting the information category corresponding to each character, and adjusting the parameters of the character classification model according to the difference between the output information category and the calibrated information category until the difference meets a preset difference requirement. By converting the text data in the second sample data set into semantic vectors, the word segmentation step of traditional information extraction techniques can be eliminated, so that the model can be applied to different language types; no manual labeling is needed during the data augmentation processing, small samples can be used more fully, and a reliable and robust information extraction model can be trained efficiently.

Description

Training method and device of information extraction model and electronic equipment
Technical Field
The application belongs to the field of information processing, and particularly relates to a training method and device of an information extraction model and electronic equipment.
Background
In a construction project, the management of construction safety is very important. When a safety accident occurs, an engineering safety accident notification is used to record the relevant information about the accident that occurred during the actual construction process. Extracting and analyzing the key data information of engineering safety accident notifications, for example when compiling documents, helps managers quickly and accurately grasp information such as the cause of the accident, so that the construction process can be managed more scientifically and safely.
At present, when key information is extracted from engineering safety accident notifications, the data sources in the building engineering industry are very limited and the data set samples are insufficient for establishing a related information extraction model, so processing is usually performed by manual interpretation. This not only requires the processing personnel to have the relevant knowledge background and offers low reliability, but the processing efficiency is also low.
Disclosure of Invention
In view of this, embodiments of the present application provide a training method and apparatus for an information extraction model, and an electronic device, so as to solve the problems in the prior art that data in the building engineering industry is very limited, data samples are insufficient to establish an information extraction model, and manual processing has low reliability and low efficiency.
A first aspect of an embodiment of the present application provides a method for training an information extraction model, where the method includes:
determining a first sample data set for training, and performing data augmentation processing on the first sample data set to obtain an augmented second sample data set;
converting text data in the second sample data set into semantic vectors;
and inputting the semantic vectors into an information extraction model for calculation, outputting the information category corresponding to each character, determining the difference between the output information category of the character and the information category calibrated in the second sample data, and adjusting the parameters of the character classification model according to the difference until the difference meets a preset difference requirement.
With reference to the first aspect, in a first possible implementation manner of the first aspect, performing data amplification processing on the first sample data set to obtain an amplified second sample data set includes:
and performing augmentation processing on the first sample data set in a cross combination mode to obtain a second sample data set.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, performing augmentation processing on the first sample data set in a cross-combination manner to obtain a second sample data set, including:
acquiring information categories marked by characters of the first sample data set, and determining characters included in different information categories;
determining a sentence template which is contained in the first sample data set and is formed by information categories according to the information category information of the first sample data set;
and extracting characters included in different information categories in a permutation and combination mode according to the information categories included in the determined sentence template to obtain a second sample data set after augmentation processing.
With reference to the first aspect, in a third possible implementation manner of the first aspect, converting text data in the second sample data set into a semantic vector includes:
and converting the text data in the second sample data set into semantic vectors by adopting a distributed representation mode.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the converting text data in the second sample data set into a semantic vector in a distributed representation manner includes:
determining a character ID corresponding to a character included in the second sample data set;
determining surrounding characters corresponding to the ith character in the second sample data set;
determining a weight matrix W according to the character IDs of the characters around the ith character and the character ID of the ith character;
and obtaining a semantic vector corresponding to the ith character according to the weight matrix and the character ID of the ith character.
With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, determining a weight matrix W according to the character IDs of the characters around the ith character and the character ID of the ith character includes:
embedding the characters of the second sample data set according to a character embedding formula of the form:

h_i = W^T · (x_{i-c} + … + x_{i-1} + x_{i+1} + … + x_{i+c}) / (2c)

and

Y_i = softmax(W' · h_i)

and determining a weight matrix W in the second sample data set, wherein c is the window radius of the characters around the ith character, x_j is the character ID vector of the jth character, Y_i is the output of the character embedding formula for the ith character, and W' is a matrix for converting the semantic vector into the ith character.
A second aspect of the embodiments of the present application provides an information extraction method, where the information extraction method extracts information types of text data to be extracted according to an information extraction model trained by the training method of the information extraction model according to any one of the first aspect.
A third aspect of an embodiment of the present application provides a training apparatus for an information extraction model, where the apparatus includes:
a sample data processing unit, configured to determine a first sample data set for training, and perform data augmentation processing on the first sample data set to obtain an augmented second sample data set;
a semantic vector conversion unit, configured to convert text data in the second sample data set into a semantic vector;
and the model training unit is used for inputting the semantic vector into the information extraction model for calculation, outputting the information category corresponding to the character, determining the difference between the information category of the output character and the information category calibrated in the second sample data, and adjusting the parameters of the character classification model according to the difference until the difference meets the preset difference requirement.
A fourth aspect of embodiments of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of the first aspect when executing the computer program.
A fifth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored which, when executed by a processor, performs the steps of the method according to any one of the first aspect.
Compared with the prior art, the embodiments of the application have the following advantages: by performing augmentation processing on the first sample data set, a second sample data set with a richer amount of data is obtained, so that the information extraction model can be trained more effectively. By converting the text data in the second sample data set into semantic vectors, the word segmentation step of traditional information extraction techniques can be eliminated, so that the model can be applied to different language types; no manual labeling is needed during the data augmentation processing, small samples can be used more fully, and a reliable and robust information extraction model can be trained efficiently.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flow chart of an implementation of a training method for an information extraction model according to an embodiment of the present application;
fig. 2 is a schematic flow chart illustrating an implementation of a data augmentation processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an implementation flow of a method for converting semantic vectors according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training apparatus for an information extraction model according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
The training method of the information extraction model provided by the application is mainly used for improving the extraction of key information from engineering safety accident notifications in construction engineering projects. It addresses the problem that existing engineering safety accident notifications provide little sample data, and that the insufficient data samples pose a huge challenge for establishing a related information extraction model.
In order to solve the problems that sample data is insufficient and that manual labeling is inefficient, an embodiment of the present application provides a training method for an information extraction model. As shown in fig. 1, the method includes:
in S101, a first sample data set used for training is determined, and data augmentation processing is performed on the first sample data set to obtain an augmented second sample data set.
The first sample data set is a sample data set in which information categories have been calibrated in advance. The information category, also referred to as the information item category, may be preset by the staff. For example, for an engineering safety accident notification of a construction engineering project, the included information categories may include time, location, type of accident, cause of accident, number of injured people, number of dead people, units involved, and so on. Of course, other information categories may also be included without limitation. In a possible implementation, the information categories included in the engineering safety accident notification may include target information categories and an invalid information category. A target information category is an information category expected to be acquired from the safety accident notification.
For example, the first sample data set includes the following engineering safety accident notification: "At 9:42 a.m. on April 27, 2013, a collapse accident occurred between the H75#-H76# main piers of the XX bridge on XX road in XX city during the preloading operation of the cast-in-place box girder support; because the section steel support suddenly fractured, the precast slabs on the support fell and 6 construction workers were pinned down." When the content of this notification is taken as the first sample data set, the information categories it contains can be determined in advance by manual marking. For example, "9:42 a.m. on April 27, 2013" is the "time" information category, "between the H75#-H76# main piers of the XX bridge on XX road in XX city" is the "place" information category, "collapse accident" is the "accident type" information category, "the precast slabs on the support fell because the section steel support suddenly fractured" is the "accident cause" information category, "6 construction workers were pinned down" is the "number of injured people" information category, and the remaining information belongs to the invalid information category.
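As a minimal illustration only, a calibrated sample of this kind could be stored as text plus category spans and then expanded into one label per character; the dictionary layout, field names and helper function below are assumptions made for illustration, not part of the application:

```python
# Hypothetical layout of one calibrated sample in the first sample data set.
# Category names follow the example above; the structure itself is an assumption.
sample = {
    "text": "At 9:42 a.m. on April 27, 2013, a collapse accident occurred ... "
            "6 construction workers were pinned down.",
    "spans": [
        {"category": "time", "text": "9:42 a.m. on April 27, 2013"},
        {"category": "accident type", "text": "collapse accident"},
        {"category": "number of injured people",
         "text": "6 construction workers were pinned down"},
    ],
}

def char_labels(sample):
    """Expand the span annotations into one information category per character;
    characters outside every span fall into the invalid information category."""
    labels = ["invalid"] * len(sample["text"])
    for span in sample["spans"]:
        start = sample["text"].find(span["text"])
        if start >= 0:
            labels[start:start + len(span["text"])] = [span["category"]] * len(span["text"])
    return labels
```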
After the first sample data set is obtained or determined, the first sample data set can be subjected to augmentation processing in a cross combination mode, so that a second sample data set with more sample data is obtained.
The implementation flow of performing the augmentation processing on the first sample data may include, as shown in fig. 2:
in S201, information categories marked by characters of the first sample data set are obtained, and characters included in different information categories are determined.
The information category corresponding to each character in the sample data is obtained according to the calibration information marked in advance for the sample data in the first sample data set. For example, a character in the acquired sample data belongs to the "time" information category, or to the "place" information category, and so on.
Usually, the first sample data set includes a plurality of engineering safety accident notifications, and characters included in different engineering safety accident notifications in each information category are obtained according to the information category calibrated in each engineering safety accident notification. That is, each information category corresponds to a character group, and characters in the character group are from different engineering safety accident notices.
To take an example, the first sample data set includes sentence 1 "Xiaoming goes to school today.", sentence 2 "Xiaoli ate a hamburger yesterday." and sentence 3 "Li will travel tomorrow.". The information categories of the three sentences include "person", "time" and "activity". The characters belonging to the same information category in each sentence are extracted into one group, yielding three character groups, one for each information category. For example, the "person" character group includes the characters "Xiaoming", "Xiaoli" and "Li".
In S202, a sentence template composed of information categories included in the first sample data set is determined according to the information category information of the first sample data set.
After the category information corresponding to the sentences in the first sample data set is extracted, the information category corresponding to each sentence can be obtained according to the category information. And determining a sentence template according to the composition of the information category.
For example, the information category composition of sentence 1 is "person" + "time" + "activity", so a sentence template of "person" + "time" + "activity" can be generated from sentence 1.
In a possible implementation, if a plurality of different information category constituents are included in the first data set, a plurality of sentence templates may be generated.
In S203, according to the information categories included in the determined sentence template, characters included in different information categories are extracted in a permutation and combination manner, so as to obtain a second sample data set after the augmentation processing.
When the sentence template and the characters included in each information category are determined, a plurality of new sentences can be generated by arranging and combining.
For example, for the sentence template of "person" + "time" + "activity", the "person" information category includes "Xiaoming", "Xiaoli" and "Li", the "time" information category includes "today", "tomorrow" and "yesterday", and the "activity" information category includes "goes to school", "ate a hamburger" and "will travel". By permutation and combination, new sentences can be generated, including for example "Xiaoming went traveling yesterday", "Li ate a hamburger today", and so on. The second sample data set with an expanded quantity is obtained from the sentences generated by the sentence template together with the information categories of the characters in those sentences.
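A minimal sketch of this cross-combination augmentation (S201 to S203), using the simple sentences above; the function name and data layout are illustrative assumptions rather than the application's exact procedure:

```python
from itertools import product

# Character groups per information category, taken from sentences 1-3 above.
groups = {
    "person":   ["Xiaoming", "Xiaoli", "Li"],
    "time":     ["today", "yesterday", "tomorrow"],
    "activity": ["goes to school", "ate a hamburger", "will travel"],
}

# Sentence template derived from the information category composition of sentence 1.
template = ["person", "time", "activity"]

def augment(groups, template):
    """Generate new labeled sentences by permuting the characters of each
    category slot of the template; every token keeps its information category."""
    for combo in product(*(groups[cat] for cat in template)):
        text = " ".join(combo)
        labels = list(zip(combo, template))   # (character, information category) pairs
        yield text, labels

second_sample_data_set = list(augment(groups, template))
# 3 * 3 * 3 = 27 labeled sentences are produced from the 3 original samples.
```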
In S102, text data in the second sample data set is converted into semantic vectors.
In a possible implementation manner, the text data in the second sample data may be converted into a semantic vector in a distributed representation manner.
Fig. 3 is a schematic diagram illustrating an implementation process of converting text data into semantic vectors according to an embodiment of the present application, where the process includes:
in S301, a character ID corresponding to a character included in the second sample data set is determined.
After the second sample data set is generated from the first sample data set by means of cross combination, the character ID corresponding to each character of the sample data in the second sample data set can be determined.
The character ID is used to identify a character in the second sample data. For example, if the second sample data includes the 6 characters of "Hello this is a robot.", the corresponding character IDs may be one-hot coded character IDs, namely: "Hello": 000001, "this": 000010, "is": 000100, "a": 001000, "robot": 010000 and ".": 100000.
Of course, the manner of determining the character ID corresponding to the character is not limited to the one-hot encoding manner, and may include other encoding manners.
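A small sketch of the one-hot character IDs of S301; the vocabulary order and the helper name are assumptions for illustration:

```python
import numpy as np

# Vocabulary built from the characters of the second sample data set (order is arbitrary).
vocab = ["Hello", "this", "is", "a", "robot", "."]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return the one-hot character ID vector for a character."""
    vec = np.zeros(len(vocab))
    vec[char_to_id[ch]] = 1.0
    return vec
```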
In S302, the peripheral characters corresponding to the ith character in the second sample data set are determined.
After the character ID corresponding to each character in the second sample data is determined, the surrounding characters of any character in the second sample data may be determined according to a predetermined window radius. Assuming that the window radius is 1, the window corresponding to the target character covers a character length of 1 on each side; that is, one character is taken to the left and one to the right of the target character, and the two characters found form the surrounding characters of the target character.
For example, for the sentence "Hello this is a robot", assume that the target character whose surrounding characters need to be determined is "this"; the corresponding surrounding characters are then "Hello" and "is".
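The surrounding characters of S302 can be gathered with a simple window function; the sketch below assumes a sample is represented as a list of characters:

```python
def surrounding(chars, i, c=1):
    """Return the characters within window radius c of the ith character,
    excluding the ith character itself."""
    return chars[max(0, i - c):i] + chars[i + 1:i + 1 + c]

# surrounding(["Hello", "this", "is", "a", "robot", "."], 1) returns ["Hello", "is"]
```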
In S303, a weight matrix W is determined according to the character IDs of the characters surrounding the ith character and the character IDs of the ith character.
From the determined character IDs of the characters surrounding the ith character, in combination with the character ID of the ith character, a character embedding formula of the following form may be used:

h_i = W^T · (x_{i-c} + … + x_{i-1} + x_{i+1} + … + x_{i+c}) / (2c)

and

Y_i = softmax(W' · h_i)

to determine the weight matrix W corresponding to the second sample data set. Here c is the window radius of the characters around the ith character, indicating that c characters are taken forward and c characters are taken backward with the ith character as the center, x_j is the character ID vector of the jth character, Y_i is the output of the character embedding formula for the ith character, and W' is a matrix for converting the semantic vector into the ith character.
When determining the weight matrix W, the parameters of the weight matrix may be continuously adjusted according to the input and output of the characters of the second sample data set until the calculated weight matrix meets the calculation requirements of the data of the second sample data set.
In S304, a semantic vector corresponding to the ith character is obtained according to the weight matrix and the character ID of the ith character.
After the weight matrix W corresponding to the second sample data set is determined, the semantic vector corresponding to any character can be determined as the product of the character's character ID vector and the weight matrix.
The weight matrix is trained on the marked samples, so that the information category of a middle character can be predicted from the content of its surrounding characters. For example, in the sentence "the accident caused 2 people to be injured", "2 people to be injured" belongs to the "number of injured people" category, and such spans are usually preceded by words such as "caused" and followed by punctuation such as a period.
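Putting S303 and S304 together, the following is a CBOW-style sketch of how the weight matrix W could be fitted from surrounding characters and then used to look up a character's semantic vector. It builds on the helpers sketched above; the embedding dimension, learning rate and update rule are illustrative assumptions consistent with the formula given earlier, not the application's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = len(vocab), 16                           # vocabulary size, embedding dimension (assumed)
W = rng.normal(scale=0.1, size=(V, D))          # weight matrix W (semantic vectors)
W_prime = rng.normal(scale=0.1, size=(V, D))    # matrix W' converting vectors back to characters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(chars, i, c=1, lr=0.05):
    """One update: predict the ith character from the characters around it."""
    global W, W_prime
    context_ids = [char_to_id[ch] for ch in surrounding(chars, i, c)]
    h = W[context_ids].mean(axis=0)             # average of the surrounding rows of W
    y = softmax(W_prime @ h)                    # predicted distribution over characters (Y_i)
    grad = y.copy()
    grad[char_to_id[chars[i]]] -= 1.0           # cross-entropy gradient at the true character
    grad_h = W_prime.T @ grad
    W_prime -= lr * np.outer(grad, h)
    for cid in context_ids:                     # spread the gradient over the context rows
        W[cid] -= lr * grad_h / len(context_ids)

def semantic_vector(ch):
    """Semantic vector of a character = its one-hot character ID times W (a row of W)."""
    return one_hot(ch) @ W
```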
In S103, the semantic vectors are input into the information extraction model for calculation, the information category corresponding to each character is output, the difference between the output information category of the character and the information category calibrated in the second sample data is determined, and the parameters of the character classification model are adjusted according to the difference until the difference between the two meets the preset difference requirement.
The characters converted into semantic vectors are input into the information extraction model with initialized parameters, and the information category corresponding to each input character is output according to the initialized parameters of the information extraction model. The output information category is compared with the information category marked for the character to determine whether they differ. If a difference exists, the parameters of the information extraction model are modified, the output information category is calculated again and compared with the calibrated information category, and whether to further adjust the parameters of the information extraction model is determined according to the comparison result, until the output information category matches the calibrated information category. The training of the information extraction model is finished after all the characters in the second sample data set have been trained.
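A compact sketch of this S103 training loop, where a simple linear classifier over the semantic vectors stands in for the information extraction model; the architecture, loss and stopping rule are assumptions made for illustration, reusing the helpers sketched above:

```python
import numpy as np

categories = ["time", "place", "accident type", "accident cause",
              "number of injured people", "invalid"]
cat_to_id = {c: i for i, c in enumerate(categories)}

rng2 = np.random.default_rng(1)
params = rng2.normal(scale=0.1, size=(len(categories), D))   # initialized model parameters

def train_extraction_model(labeled_chars, epochs=20, lr=0.1, tol=0.05):
    """labeled_chars: list of (character, calibrated information category) pairs.
    Parameters are adjusted until the mismatch rate meets the preset difference requirement."""
    global params
    for _ in range(epochs):
        mismatches = 0
        for ch, cat in labeled_chars:
            v = semantic_vector(ch)                  # input semantic vector
            probs = softmax(params @ v)              # output information category scores
            target = cat_to_id[cat]
            if int(np.argmax(probs)) != target:
                mismatches += 1
            grad = probs.copy()
            grad[target] -= 1.0                      # difference between output and calibration
            params -= lr * np.outer(grad, v)
        if mismatches / max(1, len(labeled_chars)) <= tol:
            break                                    # preset difference requirement met
```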
Using the trained information extraction model, intelligent identification of information categories can be performed on the characters in newly acquired engineering safety accident notifications.
Because the present application performs augmentation processing on the first sample data set, a second sample data set with a richer amount of data is obtained, so that the information extraction model can be trained more effectively. By converting the text data in the second sample data set into semantic vectors, the word segmentation step of traditional information extraction techniques can be eliminated, so that the model can be applied to different language types; no manual labeling is needed during the data augmentation processing, small samples can be used more fully, and a reliable and robust information extraction model can be trained efficiently.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 4 is a schematic diagram of an information extraction model training apparatus provided in an embodiment of the present application, where the apparatus includes:
the sample data processing unit 401 is configured to determine a first sample data set used for training, and perform data augmentation processing on the first sample data set to obtain an augmented second sample data set;
a semantic vector converting unit 402, configured to convert text data in the second sample data set into a semantic vector;
the model training unit 403 is configured to input the semantic vector into the information extraction model for calculation, output an information category corresponding to a character, determine a difference between the information category of the output character and an information category calibrated in second sample data, and adjust a parameter of the character classification model according to the difference until the difference between the two meets a preset difference requirement.
The training device of the information extraction model shown in fig. 4 corresponds to the training method of the information extraction model shown in fig. 1.
Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a processor 50, a memory 51 and a computer program 52, such as a training program for an information extraction model, stored in said memory 51 and executable on said processor 50. The processor 50, when executing the computer program 52, implements the steps in the above-described embodiments of the training method for the respective information extraction models. Alternatively, the processor 50 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 52.
Illustratively, the computer program 52 may be partitioned into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 52 in the electronic device 5.
The electronic device may include, but is not limited to, a processor 50, a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of an electronic device 5 and does not constitute a limitation of the electronic device 5 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The Processor 50 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the electronic device 5, such as a hard disk or a memory of the electronic device 5. The memory 51 may also be an external storage device of the electronic device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 51 is used for storing the computer program and other programs and data required by the electronic device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such an understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in various jurisdictions; for example, in some jurisdictions, computer-readable media may not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for training an information extraction model, the method comprising:
determining a first sample data set for training, and performing data augmentation processing on the first sample data set to obtain an augmented second sample data set;
converting text data in the second sample data set into semantic vectors;
and inputting the semantic vectors into an information extraction model for calculation, outputting the information category corresponding to each character, determining the difference between the output information category of the character and the information category calibrated in the second sample data, and adjusting the parameters of the character classification model according to the difference until the difference meets a preset difference requirement.
2. The method of claim 1, wherein performing data augmentation on the first sample data set to obtain an augmented second sample data set comprises:
and performing augmentation processing on the first sample data set in a cross combination mode to obtain a second sample data set.
3. The method of claim 2, wherein performing augmentation processing on the first sample data set by cross-combining to obtain a second sample data set, comprises:
acquiring information categories marked by characters of the first sample data set, and determining characters included in different information categories;
determining a sentence template which is contained in the first sample data set and is formed by information categories according to the information category information of the first sample data set;
and extracting characters included in different information categories in a permutation and combination mode according to the information categories included in the determined sentence template to obtain a second sample data set after augmentation processing.
4. The method of claim 1, wherein converting text data in the second sample data set to semantic vectors comprises:
and converting the text data in the second sample data set into semantic vectors by adopting a distributed representation mode.
5. The method of claim 4, wherein converting text data in the second sample data set into semantic vectors in a distributed representation comprises:
determining a character ID corresponding to a character included in the second sample data set;
determining surrounding characters corresponding to the ith character in the second sample data set;
determining a weight matrix W according to the character IDs of the characters around the ith character and the character ID of the ith character;
and obtaining a semantic vector corresponding to the ith character according to the weight matrix and the character ID of the ith character.
6. The method of claim 5, wherein determining a weight matrix W according to the character IDs of the characters surrounding the ith character and the character ID of the ith character comprises:
embedding the characters of the second sample data set according to a character embedding formula of the form:

h_i = W^T · (x_{i-c} + … + x_{i-1} + x_{i+1} + … + x_{i+c}) / (2c)

and

Y_i = softmax(W' · h_i)

and determining a weight matrix W corresponding to the second sample data set, wherein c is the window radius of the characters around the ith character, x_j is the character ID vector of the jth character, Y_i is the output of the character embedding formula for the ith character, and W' is a matrix for converting the semantic vector into the ith character.
7. An information extraction method, characterized in that the information extraction method extracts the information type of the text data to be extracted according to the information extraction model trained by the training method of the information extraction model according to any one of claims 1 to 6.
8. An apparatus for training an information extraction model, the apparatus comprising:
a sample data processing unit, configured to determine a first sample data set for training, and perform data augmentation processing on the first sample data set to obtain an augmented second sample data set;
a semantic vector conversion unit, configured to convert text data in the second sample data set into a semantic vector;
and the model training unit is used for inputting the semantic vector into the information extraction model for calculation, outputting the information category corresponding to the character, determining the difference between the information category of the output character and the information category calibrated in the second sample data, and adjusting the parameters of the character classification model according to the difference until the difference meets the preset difference requirement.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110780566.0A 2021-07-09 2021-07-09 Training method and device of information extraction model and electronic equipment Active CN113420533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110780566.0A CN113420533B (en) 2021-07-09 2021-07-09 Training method and device of information extraction model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110780566.0A CN113420533B (en) 2021-07-09 2021-07-09 Training method and device of information extraction model and electronic equipment

Publications (2)

Publication Number Publication Date
CN113420533A true CN113420533A (en) 2021-09-21
CN113420533B CN113420533B (en) 2023-12-29

Family

ID=77721690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110780566.0A Active CN113420533B (en) 2021-07-09 2021-07-09 Training method and device of information extraction model and electronic equipment

Country Status (1)

Country Link
CN (1) CN113420533B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997241A (en) * 2022-06-29 2022-09-02 苏州浪潮智能科技有限公司 Pin inspection method, pin inspection device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695352A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Grading method and device based on semantic analysis, terminal equipment and storage medium
EP3754549A1 (en) * 2019-06-17 2020-12-23 Sap Se A computer vision method for recognizing an object category in a digital image
CN112270379A (en) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 Training method of classification model, sample classification method, device and equipment
CN113033537A (en) * 2021-03-25 2021-06-25 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3754549A1 (en) * 2019-06-17 2020-12-23 Sap Se A computer vision method for recognizing an object category in a digital image
CN111695352A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Grading method and device based on semantic analysis, terminal equipment and storage medium
CN112270379A (en) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 Training method of classification model, sample classification method, device and equipment
CN113033537A (en) * 2021-03-25 2021-06-25 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for training a model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997241A (en) * 2022-06-29 2022-09-02 苏州浪潮智能科技有限公司 Pin inspection method, pin inspection device, computer equipment and storage medium
CN114997241B (en) * 2022-06-29 2024-01-26 苏州浪潮智能科技有限公司 Pin inspection method, pin inspection device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113420533B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
US20220300546A1 (en) Event extraction method, device and storage medium
CN113592019B (en) Fault detection method, device, equipment and medium based on multi-model fusion
CN111651585A (en) Information verification method and device, electronic equipment and storage medium
CN113420533A (en) Training method and device of information extraction model and electronic equipment
CN113434642B (en) Text abstract generation method and device and electronic equipment
CN117112829B (en) Medical data cross-modal retrieval method and device and related equipment
CN113658002A (en) Decision tree-based transaction result generation method and device, electronic equipment and medium
CN112307752A (en) Data processing method and device, electronic equipment and storage medium
CN111985491A (en) Similar information merging method, device, equipment and medium based on deep learning
CN111782649A (en) Data acquisition format updating method and device, computer equipment and storage medium
CN111523322A (en) Requirement document quality evaluation model training method and requirement document quality evaluation method
CN113610427B (en) Event early warning index obtaining method, device, terminal equipment and storage medium
CN115907011A (en) Meteorological emergency early warning knowledge base construction method based on knowledge graph
CN112819622B (en) Information entity relationship joint extraction method and device and terminal equipment
CN113836308A (en) Network big data long text multi-label classification method, system, device and medium
CN113536782A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN112712797A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115271686B (en) Intelligent checking method and device for government affair data
CN113515591A (en) Text bad information identification method and device, electronic equipment and storage medium
CN116069832B (en) Data mining method and device and electronic equipment
CN116644157B (en) Method for constructing Embedding data based on bridge maintenance unstructured data
CN114817526B (en) Text classification method and device, storage medium and terminal
CN117371438A (en) Method and device for word segmentation of power grid risk text
CN116796723B (en) Text set matching method and device, electronic equipment and storage medium
CN117251559B (en) Engineering standard specification acquisition method and system based on natural language big model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant