CN109657207B - Formatting processing method and processing device for clauses

Info

Publication number: CN109657207B
Application number: CN201811443586.3A
Authority: CN
Inventors: 黄成�; 苏孝强; 刘小伟
Original assignee: Aibao Technology Co ltd
Current assignee: Aibao Technology Co ltd
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2023-11-03
Anticipated expiration: 2038-11-29
Also published as: CN109657207A

Abstract

The application discloses a formatting processing method and a formatting processing device for clauses. The method comprises: obtaining clauses in an editable format and performing word segmentation; classifying the clauses; and converting each sentence and each word in the classified and segmented clauses into vectors, and inputting the vectors into the language model of the corresponding category to obtain field values respectively corresponding to different attribute fields of the clauses. The device comprises a clause acquisition unit, a word segmentation unit, a classification unit, a vector conversion unit and a field extraction unit. The application can format clauses rapidly with high yield, and makes it easy to maintain the data, extend the attribute fields, and extend functionality later.

Description

Formatting processing method and processing device for clauses

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a formatting processing method and a processing apparatus for clauses.

Background

For an insurance service platform, storing and managing more than one hundred thousand on-sale insurance clauses in the back end so that they can easily be shown to users at the front end is a great challenge. Currently, the problem is mainly addressed by extracting attribute fields. As shown in FIG. 1, attribute fields of a clause, such as the insurance liabilities (for example, accidental death/disability and accidental medical treatment) and the effective time, are extracted together with their corresponding field values, recorded in a database, retrieved when needed, and presented to the user on an interface.

Theoretically, as long as enough attribute fields can be extracted, the main information of a term can be completely extracted.

Currently, most clause formatting schemes rely on reading the clauses manually and then manually extracting the attribute fields so that the formatted clauses can be managed and stored. This has several disadvantages:

(1) The efficiency of manual extraction is too low;

(2) The approach is hard to maintain and extend later; for example, adding a new attribute field requires manually rechecking every product already recorded;

(3) For the two reasons above, most companies cannot present complete product information to users with a sufficient number of attribute fields.

Disclosure of Invention

The main object of the application is to provide a formatting processing method and a formatting processing device for clauses that use text mining techniques from natural language processing to extract and store all attribute fields in the clauses rapidly and accurately, thereby solving the problem that clauses, being unstructured data, are formatted inefficiently.

To achieve the above object, according to one aspect of the present application, there is provided a formatting process method of clauses. The formatting processing method of the clause comprises the following steps:

obtaining clauses of an editable format and performing word segmentation;

classifying the clauses;

and converting each sentence and each word in the classified and segmented clauses into vectors, and inputting the vectors into language models of corresponding classes to obtain field values respectively corresponding to different attribute fields of the clauses.

Further, obtaining the clauses in the editable format includes: determining whether the clause is in an editable format, and converting it to the editable format if it is not.

Further, the language model is generated by:

classifying the clauses, and acquiring a first preset number of clauses in an editable format for each classified class;

word segmentation is carried out on each clause, and sentences and words in each clause belonging to the same category are converted into vectors;

determining attribute fields to be extracted from different types of clauses, and respectively marking field values corresponding to different attribute fields for the first preset number of clauses;

training the language model of the corresponding category by using the sentences and the vectors converted by the words belonging to the same category, and obtaining the trained language model of each category.

Further, in the case that an attribute field needs to be added for clauses of a specified category, the method for generating the language model further includes:

obtaining a second predetermined number of editable format terms of the specified category;

respectively marking field values corresponding to different attribute fields for the second preset number of clauses according to the original attribute field and the added attribute field to be extracted from the appointed category clauses;

word segmentation is carried out on each clause, and each sentence and each word in each clause of the appointed category are converted into vectors;

and training the language model of the corresponding category by using each sentence and each word in each clause and the vector converted by each word, and obtaining the trained language model of the specified category.

Further, the word segmentation of the clause includes:

and segmenting the clauses of the editable format by using a reference dictionary and a stop word list, removing words belonging to the stop word list, and storing the rest words in the clauses into a database.

Further, the language model is a long short-term memory (LSTM) network model.

To achieve the above object, according to another aspect of the present application, there is provided a formatting processing apparatus of the clause. The formatting processing device of the clause comprises:

a clause acquiring unit for acquiring clauses of the editable format;

the word segmentation unit is used for segmenting the clauses of the editable format;

a classification unit for classifying the clauses;

a vector conversion unit for converting each sentence and each word in the clauses after classification and word segmentation into vectors;

and the field extraction unit is used for inputting the converted vectors of each sentence and each word of the clause into a language model of a corresponding category to obtain field values respectively corresponding to the fields of different attributes of the clause.

Further, the formatting processing device of the clause further includes: a language model training unit for training the language model of the corresponding category by using a predetermined number of clauses belonging to the same category.

Further, the word segmentation unit is further configured to segment the clauses in the editable format by using a reference dictionary and a stop word list, and to store the word segmentation result as a word segmentation list.

Further, the clause obtaining unit includes a format conversion module for converting the clause of the non-editable format into the clause of the editable format.

The clause formatting processing method and device process data quickly and with high yield, can process clauses in batches, format clauses rapidly, and make it easy to maintain the data, extend the attribute fields, and extend functionality later.

The application designs different formatted data storage structures for different insurance categories, and then extracts and stores all attribute fields in the clauses rapidly and accurately using text mining techniques from natural language processing. The problem that clauses, as unstructured data, are formatted inefficiently is thus solved in one go. For later extension, only a small amount of manual labeling is needed, and the features of a new field can be learned through retraining, which makes later maintenance convenient and fast.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application, are incorporated in and constitute a part of this specification. The drawings and their description are illustrative of the application and are not to be construed as unduly limiting the application. In the drawings:

FIG. 1 illustrates an example of attribute fields and corresponding field values extracted from a clause;

FIG. 2 is a flow chart of a formatting processing method for clauses provided by one embodiment of the present application;

FIG. 3 is a flow chart of an example method for generating the language model used in the formatting processing method shown in FIG. 2;

FIG. 4 is a flow chart of an example of the language model generation method shown in FIG. 3 in the case where an attribute field needs to be added for clauses of a specified category;

FIG. 5 is a schematic structural diagram of a formatting processing device according to an embodiment of the present application.

Detailed Description

In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In the present application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal" and the like indicate an azimuth or a positional relationship based on that shown in the drawings. These terms are only used to better describe the present application and its embodiments and are not intended to limit the scope of the indicated devices, elements or components to the particular orientations or to configure and operate in the particular orientations.

Also, some of the terms described above may be used to indicate other meanings in addition to orientation or positional relationships, for example, the term "upper" may also be used to indicate some sort of attachment or connection in some cases. The specific meaning of these terms in the present application will be understood by those of ordinary skill in the art according to the specific circumstances.

Furthermore, the terms "mounted," "configured," "provided," "connected," "coupled," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; may be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements, or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

FIG. 2 is a flow chart of a formatting processing method for clauses provided by one embodiment of the present application. As shown in FIG. 2, the formatting processing method provided in this embodiment includes the following steps:

Step S1, obtaining the clauses in an editable format and performing word segmentation.

The specific process of step S1 may include: determining whether the clause is in an editable format and, if not, converting it to an editable format. For example, clauses in PDF format may be converted to TXT format using the pdfminer library for Python. The clauses in the editable format are then segmented, for example by using a reference dictionary and a stop word list. Taking insurance clauses as an example, the reference dictionary can be built on the default dictionaries of the nlpir and jieba libraries for Python, and all specialized vocabulary in the insurance industry data structure can be imported into the dictionary to reflect the particular and specialized nature of the insurance industry. The stop word list includes all punctuation marks, prepositions, auxiliary words, modal particles, and other words that have no effect on semantic analysis; removing them improves the accuracy of machine learning. All clauses are segmented using the reference dictionary, each clause generating a word segmentation list, and the stop words are then removed entirely from that list using the stop word list.
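The segmentation part of step S1 can be illustrated with a minimal sketch. It uses forward maximum matching, a simple dictionary-based technique chosen here as a standard-library stand-in for libraries such as jieba; it is not the segmentation algorithm of the application. The dictionary entries, the stop word list, the sample sentence, and the names `segment` and `remove_stop_words` are all assumptions made for illustration.

```python
from typing import Iterable, List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 6) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop stop words and whitespace-only tokens from a word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical reference dictionary: general words plus insurance vocabulary.
REFERENCE_DICT = {"意外", "身故", "意外身故", "保险金", "保险", "等待期", "被保险人"}
# Hypothetical stop word list: punctuation and function words.
STOP_WORDS = {"的", "，", "。", "为", "在"}

clause = "被保险人在等待期的意外身故保险金"
word_list = remove_stop_words(segment(clause, REFERENCE_DICT), STOP_WORDS)
print(word_list)
```

Importing industry vocabulary into the dictionary matters in this sketch for the same reason the description gives: without the entry "意外身故", the matcher would split the term into two generic words.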

Step S2, classifying the clauses.

Here, step S2 is described taking insurance clauses as an example. Before step S2, the categories into which existing insurance clauses fall must be determined; clauses may be classified by insurance type, such as critical illness, medical, accident, and life insurance, by insurance company, or by both. The currently acquired clause is then assigned to one of these pre-defined categories.
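The description does not state how a clause is assigned to a pre-defined category. One possible approach, shown purely as an assumption, is to count category-specific keywords in the word segmentation list. The keyword sets, the category names, and the function `classify` below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical keyword sets per pre-defined category; the application does not
# specify how the category of a clause is determined.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "critical_illness": {"重大疾病", "恶性肿瘤", "确诊"},
    "medical": {"住院", "医疗费用", "门诊"},
    "accident": {"意外伤害", "意外身故", "伤残"},
    "life": {"身故保险金", "满期", "生存"},
}


def classify(tokens: List[str], default: str = "unknown") -> str:
    """Return the category whose keywords occur most often in the tokens."""
    counts = Counter(tokens)
    scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, do not guess a category.
    return best if scores[best] > 0 else default


print(classify(["被保险人", "意外伤害", "伤残", "住院"]))
```

Classification by insurance company, or by both type and company, could use the same shape with a composite category key.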

Step S3, converting each sentence and each word in the classified and segmented clauses into vectors, and inputting the vectors into the language model of the corresponding category to obtain field values respectively corresponding to the different attribute fields of the clauses.

The specific process of step S3 may include: converting the sentences in all clauses and the words in the word segmentation list into numerical vectors of a predetermined dimension, such as 128-dimensional vectors, using the doc2vec technique in the gensim library. The field values obtained for the different attribute fields of the clauses are saved into a database, thereby forming a formatted data storage structure for insurance clauses.
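The final part of step S3, saving the extracted field values, might look like the following sketch. The application does not name a database or a schema; SQLite, the table `clause_fields`, the function `save_field_values`, and the sample values are assumptions for illustration only.

```python
import sqlite3
from typing import Dict


def save_field_values(conn: sqlite3.Connection, clause_id: str,
                      category: str, fields: Dict[str, str]) -> None:
    """Store one row per (clause, attribute field) pair."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clause_fields (
               clause_id TEXT NOT NULL,
               category  TEXT NOT NULL,
               field     TEXT NOT NULL,
               value     TEXT,
               PRIMARY KEY (clause_id, field)
           )"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO clause_fields VALUES (?, ?, ?, ?)",
        [(clause_id, category, name, value) for name, value in fields.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Illustrative values only; they do not come from a real clause.
save_field_values(conn, "clause-001", "critical_illness",
                  {"effective time": "90 days", "payment time": "20 years"})
rows = conn.execute(
    "SELECT field, value FROM clause_fields WHERE clause_id = ? ORDER BY field",
    ("clause-001",),
).fetchall()
print(rows)
```

Storing one row per attribute field, rather than one column per field, is a design choice of this sketch: a newly added attribute field then needs no schema change.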

The order of the steps described above is only an example and is not intended to limit the order in which the steps are performed, and the order of the steps may be interchanged in the term formatting process of the present application. For example, the order of steps S1 and S2 may be exchanged, and the terms may be classified first, and then it may be determined whether the classified terms are in an editable format, and if so, the terms may be segmented, and if not, the terms may be converted into an editable format.

All currently available clauses are respectively input into the language model of the corresponding category, and the output result is stored into the data structure of the industry clause.

As shown in FIG. 3, as an alternative embodiment, the method for generating the language model used in step S3 may include the following steps:

Step S31, classifying the clauses, and acquiring a first predetermined number of clauses in an editable format for each category.

Here, the clauses are classified, and a first predetermined number of clauses is then obtained for each category to train the language model of that category. The classification is the same as in step S2 described above. The first predetermined number may be 500 or another number; the number of clauses used for training is chosen according to whether the trained language model meets the requirements, for example an output accuracy of not less than 95%. The corpus used to train the language model may be obtained from relevant websites. It is then determined whether each clause is in an editable format, and if not, the clause is converted into an editable format; for example, clauses in PDF format may be converted to TXT format using the pdfminer library for Python.
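The choice of the first predetermined number can be sketched as a search over candidate sample sizes that stops once the accuracy requirement is met. The candidate sizes other than 500, the callback `evaluate`, and the function `choose_training_size` are hypothetical; the accuracy figures in the usage example are placeholders, not measured results.

```python
from typing import Callable, Sequence, Tuple


def choose_training_size(
    evaluate: Callable[[int], float],
    candidate_sizes: Sequence[int] = (500, 1000, 2000),
    target_accuracy: float = 0.95,
) -> Tuple[int, float]:
    """Return the first candidate size whose accuracy reaches the target.

    `evaluate(n)` is a hypothetical callback that trains on n labeled clauses
    and returns the accuracy measured on held-out clauses.
    """
    best = (candidate_sizes[0], 0.0)
    for size in candidate_sizes:
        accuracy = evaluate(size)
        if accuracy >= target_accuracy:
            return size, accuracy
        if accuracy > best[1]:
            best = (size, accuracy)
    return best  # target not reached; report the best size seen


# Placeholder accuracy curve for illustration only (not measured data).
placeholder_curve = {500: 0.91, 1000: 0.96, 2000: 0.97}
print(choose_training_size(placeholder_curve.__getitem__))
```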

Step S32, word segmentation is carried out on each term, and sentences and words in each term belonging to the same category are converted into vectors.

In step S32, the terms are segmented in the same manner as in step S1 described above.

Step S33, determining attribute fields to be extracted from the terms of different categories, and respectively marking field values corresponding to the different attribute fields for a first predetermined number of terms.

For step S33, taking insurance clauses as an example, a business expert may specify, based on practitioners' experience, the information useful to users in the clauses of each type of insurance (such as critical illness or medical insurance), and all potentially useful information is stored in the database as the attribute field summary table of the data structure. For example, for the critical illness insurance category, attribute fields such as "guaranteed disease category", "effective time", "guard", and "payment time" may be extracted. The clauses may be labeled manually; for example, an insurance industry expert manually labels a predetermined number of insurance clauses, that is, marks the field values corresponding to the attribute fields in the clauses together with their positions, in preparation for subsequent machine learning.
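A labeled clause as described for step S33, with each field value recorded together with its position, could be represented as in the sketch below. The class names `FieldAnnotation` and `LabeledClause`, the use of character offsets as the position, and the sample text are assumptions; the application does not define a label format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldAnnotation:
    attribute: str   # attribute field name, e.g. "effective time"
    value: str       # field value as it appears in the clause text
    start: int       # character offset where the value starts
    end: int         # character offset one past the end of the value


@dataclass
class LabeledClause:
    clause_id: str
    category: str
    text: str
    annotations: List[FieldAnnotation] = field(default_factory=list)

    def label(self, attribute: str, value: str) -> None:
        """Record the first occurrence of `value` together with its position."""
        start = self.text.find(value)
        if start < 0:
            raise ValueError(f"{value!r} does not occur in the clause text")
        self.annotations.append(
            FieldAnnotation(attribute, value, start, start + len(value)))

    def as_dict(self) -> Dict[str, str]:
        """Map each labeled attribute field to its field value."""
        return {a.attribute: a.value for a in self.annotations}


# Illustrative sample text, not taken from a real clause.
labeled = LabeledClause("clause-001", "critical_illness",
                        "The waiting period is 90 days from the effective date.")
labeled.label("effective time", "90 days")
print(labeled.as_dict())
```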

Step S34, training the language model of the corresponding category by using the converted vectors of each sentence and each word belonging to each term of the same category, and obtaining the trained language model of each category.

The order of the respective steps described above is only an example, and is not intended to limit the order of execution of the respective steps described above, and the order of the respective steps of the method of generating a language model may be exchanged. For example, the order of step S32 and step S33 may be exchanged, the attribute fields to be extracted from the terms of the different categories may be determined first, the field values corresponding to the different attribute fields are respectively noted for the first predetermined number of terms, then the terms are segmented, and the sentences and words in the terms belonging to the same category are converted into vectors.

FIG. 4 is a flow chart of an example provided method of generating the language model shown in FIG. 3 in the case where the terms of a specified category require the addition of an attribute field. In the case where the terms of the specified category require the addition of an attribute field, the language model generation method may further include the steps of:

step S41, obtaining the second predetermined number of editable format clauses of the specified category;

step S42, respectively marking field values corresponding to different attribute fields for a second preset number of clauses according to the original attribute field and the added attribute field to be extracted from the clauses of the appointed category;

step S43, word segmentation is carried out on each term, and each sentence and each word in each term of the appointed category are converted into vectors;

and step S44, training the language model of the corresponding category by using each sentence and each word in each clause and the vector converted by each word, and obtaining the trained language model of the specified category.

According to the above steps, if an attribute field needs to be added, a second predetermined number of insurance clauses is selected and labeled again, and feature engineering and model training are carried out again. Here, the second predetermined number may be the same as the first predetermined number.
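The bookkeeping behind adding an attribute field can be sketched as follows: the new field is registered for the category, and the clauses selected for relabeling are checked against both the original and the added fields, as in step S42. The table `ATTRIBUTE_FIELDS` and the functions `add_attribute_field` and `missing_labels` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Hypothetical attribute field table, keyed by clause category.
ATTRIBUTE_FIELDS: Dict[str, List[str]] = {
    "critical_illness": ["guaranteed disease category", "effective time"],
}


def add_attribute_field(category: str, new_field: str) -> List[str]:
    """Register a new field and return every field that must be labeled.

    The returned list holds the original fields plus the added one, because
    the second sample of clauses is labeled for both.
    """
    fields = ATTRIBUTE_FIELDS.setdefault(category, [])
    if new_field not in fields:
        fields.append(new_field)
    return list(fields)


def missing_labels(labels: Dict[str, str], required: List[str]) -> List[str]:
    """List the required fields that a labeled clause does not yet cover."""
    return [name for name in required if name not in labels]


required = add_attribute_field("critical_illness", "payment time")
print(missing_labels({"effective time": "90 days"}, required))
```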

The order of the respective steps described above is only an example, and is not intended to limit the order of execution of the respective steps described above, and the order of the respective steps of the method of generating a language model may be exchanged. For example, the order of step S42 and step S43 may be exchanged, each term may be segmented first, each sentence and each word in each term of the specified category may be converted into a vector, and then, according to the original attribute field and the added attribute field to be extracted from the term of the specified category, the field values corresponding to the different attribute fields may be respectively noted for the second predetermined number of terms.

The language model may be a Long Short Term Memory network (LSTM) model.
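For reference, the recurrence computed by a standard LSTM cell can be written in plain Python as below. This is a generic textbook formulation, not the model of the application: the description gives no architecture, weight layout, or output head, so the gate ordering, the single stacked weight matrix `W`, and the function names are assumptions. In practice a deep learning framework would normally be used instead.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(W: Matrix, v: Sequence[float], b: Sequence[float]) -> Vector:
    """Compute W @ v + b for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x: Vector, h: Vector, c: Vector,
              W: Matrix, b: Vector) -> Tuple[Vector, Vector]:
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden); its rows are stacked in the
    order input gate, forget gate, candidate, output gate (an assumed layout).
    """
    n = len(h)
    z = _affine(W, x + h, b)                    # x + h concatenates the lists
    i = [_sigmoid(v) for v in z[0:n]]           # input gate
    f = [_sigmoid(v) for v in z[n:2 * n]]       # forget gate
    g = [math.tanh(v) for v in z[2 * n:3 * n]]  # candidate cell state
    o = [_sigmoid(v) for v in z[3 * n:4 * n]]   # output gate
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(inputs: Sequence[Vector], W: Matrix, b: Vector, hidden: int) -> Vector:
    """Feed a sequence of word vectors through the cell; return the last h."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in inputs:
        h, c = lstm_step(x, h, c, W, b)
    return h
```

The forget gate `f` decides how much of the previous cell state is kept, which is what lets the model carry information such as a field name across the words that separate it from its value.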

The embodiment also provides a formatting processing device for clauses. As shown in FIG. 5, the formatting processing device provided by the embodiment includes a clause acquisition unit 51, a word segmentation unit 52, a classification unit 53, a vector conversion unit 54, and a field extraction unit 55.

The clause acquisition unit 51 is configured to acquire clauses in an editable format.

The word segmentation unit 52 is used for segmenting the clauses in the editable format. For example, the word segmentation unit 52 may segment the clauses using the reference dictionary and the stop word list, and save the segmentation result as a word segmentation list. Taking insurance clauses as an example, the reference dictionary can be built on the default dictionaries of the nlpir and jieba libraries for Python, and all specialized vocabulary in the insurance industry data structure can then be imported into the dictionary to reflect the particular and specialized nature of the insurance industry. The stop word list includes all punctuation marks, prepositions, auxiliary words, modal particles, and other words that have no effect on semantic analysis; removing them improves the accuracy of machine learning. All clauses are segmented using the reference dictionary, each clause generating a word segmentation list, and the stop words are then removed entirely from that list using the stop word list.

The classification unit 53 is used for classifying the clauses. Taking insurance clauses as an example, the clauses can be classified by insurance type, such as critical illness, medical, accident, and life insurance, by insurance company, or by both.

The vector conversion unit 54 is used to convert each sentence and each word in the classified and segmented clauses into a vector. The doc2vec technique in the gensim library can be used to convert the sentences in all clauses and the words in the word segmentation list into numerical vectors of a predetermined dimension, such as 128-dimensional vectors.

The field extraction unit 55 is configured to input the converted vectors of each sentence and each word of the clause into a language model of a corresponding category, and obtain field values corresponding to the fields of different attributes of the clause.
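How the five units cooperate can be sketched as a small pipeline in which each unit is an injected callable. The class `ClauseFormatter`, its attribute names, and the stub units in the usage example are assumptions for illustration; they only show the data flow between units 51 to 55, not real implementations of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Tokens = List[str]
Vectors = List[List[float]]
FieldValues = Dict[str, str]


@dataclass
class ClauseFormatter:
    """Wires the five units of FIG. 5 together; each unit is a callable."""
    acquire: Callable[[str], str]                        # unit 51: source -> editable text
    segment: Callable[[str], Tokens]                     # unit 52: text -> word list
    classify: Callable[[Tokens], str]                    # unit 53: words -> category
    vectorize: Callable[[str, Tokens], Vectors]          # unit 54: text, words -> vectors
    models: Dict[str, Callable[[Vectors], FieldValues]]  # unit 55: one model per category

    def process(self, source: str) -> FieldValues:
        text = self.acquire(source)
        tokens = self.segment(text)
        category = self.classify(tokens)
        vectors = self.vectorize(text, tokens)
        # Dispatch to the language model trained for this category.
        return self.models[category](vectors)


# Stub units that only demonstrate the data flow.
formatter = ClauseFormatter(
    acquire=lambda source: source,
    segment=str.split,
    classify=lambda tokens: "accident" if "accident" in tokens else "medical",
    vectorize=lambda text, tokens: [[float(len(t))] for t in tokens],
    models={
        "accident": lambda vectors: {"word count": str(len(vectors))},
        "medical": lambda vectors: {},
    },
)
print(formatter.process("accident cover starts today"))
```

Keeping one model per category in a dictionary mirrors the description, where each category has its own trained language model.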

The formatting processing device provided by this embodiment further includes a language model training unit 56 for training the language model of the corresponding category with a predetermined number of clauses belonging to the same category. The corpus used to train the language model may be obtained from relevant websites. It is then determined whether each clause is in an editable format, and if not, the clause is converted into an editable format; for example, clauses in PDF format may be converted to TXT format using the pdfminer library for Python. The clauses are segmented, and the sentences and words in the clauses belonging to the same category are converted into vectors. The attribute fields to be extracted from the clauses of each category are determined, and the field values corresponding to the different attribute fields are labeled for a first predetermined number of clauses. Taking insurance clauses as an example, a business expert may specify, based on practitioners' experience, the information useful to users in the clauses of each type of insurance (such as critical illness or medical insurance), and all potentially useful information is stored in a database as the attribute field summary table of the data structure. For example, for the critical illness insurance category, attribute fields such as "guaranteed disease category", "effective time", "guard", and "payment time" may be extracted. The clauses may be labeled manually; for example, an insurance industry expert manually labels a predetermined number of insurance clauses, that is, marks the field values corresponding to the attribute fields in the clauses together with their positions, in preparation for subsequent machine learning. The language model of the corresponding category is then trained using the vectors converted from each sentence and each word of the clauses belonging to the same category, yielding a trained language model for each category.

The clause acquisition unit 51 comprises a format conversion module 511 for converting clauses in a non-editable format into clauses in an editable format; for example, clauses in PDF format may be converted to TXT format using the pdfminer library for Python.
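A format conversion module of this kind might be sketched as follows. The sketch assumes that the file suffix is enough to decide whether a clause is editable and that the pdfminer.six package (the maintained distribution of pdfminer) is installed when a PDF is actually converted; the names `EDITABLE_SUFFIXES`, `CONVERTERS`, `pdf_to_text`, and `load_editable_text` are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

EDITABLE_SUFFIXES = {".txt"}  # assumption: plain text counts as editable


def pdf_to_text(path: Path) -> str:
    """Extract text from a PDF with pdfminer.six (imported only when needed)."""
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


# Converters for non-editable formats, keyed by file suffix.
CONVERTERS: Dict[str, Callable[[Path], str]] = {".pdf": pdf_to_text}


def load_editable_text(path: Path) -> str:
    """Return clause text, converting first if the file is not editable."""
    suffix = path.suffix.lower()
    if suffix in EDITABLE_SUFFIXES:
        return path.read_text(encoding="utf-8")
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"no converter registered for {suffix!r}") from None
    return converter(path)


# Usage with an already editable file; a .pdf path would go through pdf_to_text.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "clause.txt"
    sample.write_text("sample clause text", encoding="utf-8")
    loaded = load_editable_text(sample)
print(loaded)
```

Registering converters by suffix keeps the editable-format check and the conversion in one place, so support for another non-editable format can be added without touching the callers.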

It will be apparent to those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A formatting process method of clauses, comprising:

obtaining clauses of an editable format and performing word segmentation;

classifying the clauses;

converting each sentence and each word in the classified and segmented clauses into vectors, and inputting the vectors into a language model of a corresponding category to obtain field values respectively corresponding to different attribute fields of the clauses;

the obtaining of the clauses of the editable format includes:

judging whether the clause is in an editable format, if not, converting the clause into the editable format;

the language model is generated by the following steps:

training a language model of a corresponding category by using each sentence and each word of each clause belonging to the same category and the vector converted by each word, and obtaining the trained language model of each category;

in the case that the terms of the specified category need to be added with attribute fields, the method for generating the language model further comprises the following steps:

2. The method of formatting the clause according to claim 1, wherein the word segmentation of the clause comprises:

3. The formatting processing method of claim 1, wherein the language model is a long short-term memory (LSTM) network model.

4. A formatting processing apparatus of the clause, comprising:

a clause acquiring unit for acquiring clauses of the editable format;

a classification unit for classifying the clauses;

the field extraction unit is used for inputting the converted vectors of each sentence and each word of the clause into a language model of a corresponding category to obtain field values respectively corresponding to the fields of different attributes of the clause;

the term acquisition unit includes:

a format conversion subunit, configured to determine whether the clause is in an editable format, and if not, convert the clause to the editable format;

the language model is generated by the following steps:

5. The formatting processing apparatus of claim 4, further comprising: a language model training unit for training the language model of the corresponding category by using a predetermined number of clauses belonging to the same category.

6. The formatting process device of claim 4, wherein the word segmentation unit is further configured to segment the editable format of the clause using a reference dictionary and a stop word list, and store the segmentation result as a word segmentation list.

7. The formatting process apparatus of claim 4, wherein the term acquiring unit includes a format converting module for converting the term of the non-editable format into the term of the editable format.