CN115758997A - Method and device for constructing natural language model and electronic equipment - Google Patents

Method and device for constructing natural language model and electronic equipment

Info

Publication number
CN115758997A
CN115758997A (application CN202211457954.6A)
Authority
CN
China
Prior art keywords
chinese character
natural language
language model
decomposed
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211457954.6A
Other languages
Chinese (zh)
Inventor
秦小林
张思齐
钱杨舸
廖兴滨
单靖杨
陈敏
王乾垒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Information Technology Co Ltd of CAS
Original Assignee
Chengdu Information Technology Co Ltd of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Information Technology Co Ltd of CAS filed Critical Chengdu Information Technology Co Ltd of CAS
Priority to CN202211457954.6A priority Critical patent/CN115758997A/en
Publication of CN115758997A publication Critical patent/CN115758997A/en
Pending legal-status Critical Current


Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The disclosure relates to a method and an apparatus for constructing a natural language model, and an electronic device. The method comprises the following steps: acquiring a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes; constructing a Chinese character decomposer that splits each character to be decomposed into Chinese character components and structural positions; and decomposing every character of the input text data with the decomposer and constructing an initial natural language model from the decomposed sequence. The classification model of the disclosure lengthens the sequence of short texts, thereby alleviating the low semantic accuracy that existing language models exhibit when Chinese characters are repeatedly decomposed, reducing the model's training cost and its dependency on data, and improving the accuracy of semantic understanding of text.

Description

Method and device for constructing natural language model and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a method and an apparatus for constructing a natural language model, and an electronic device.
Background
Among common Chinese character structures, some components were omitted or deformed in the course of the script's evolution. A typical example is the component 月 ("moon"), which carries different meanings in different characters. In 朗 ("lang", bright), 月 denotes the moon; in 肩 ("jian", shoulder), the same shape is actually a variant of 肉 ("meat"), called the "fleshy moon" in some documents, which only converged on the modern 月 form through gradual evolution and writing standardization. It is therefore not the same component as the 月 in 朗, and its meaning cannot easily be inferred in context from the lunar sense of 月. A conventional Chinese character model simply conflates the two, which silently increases both the model's training cost and its dependence on data.
Some existing Chinese language models that exploit character structure borrow a preprocessing method from English language models: English words are decomposed according to their frequency of occurrence (or their roots), and each subword is then assigned position coding information that records its order in the sentence. By analogy, such Chinese models decompose characters manually and assign position information to each character structure. This transfer introduces a systematic deviation, however, because Chinese characters are two-dimensional symbols that carry more information than English words, and decomposing a character into structures while assigning only linear position information loses the two-dimensional positional relationships among those structures.
Consequently, when an existing Chinese language model repeatedly decomposes Chinese characters, its semantic accuracy is low, and the model's training cost and dependence on data increase substantially.
Disclosure of Invention
The present disclosure aims to provide a method and an apparatus for constructing a natural language model, and an electronic device, in order to solve the problems that existing Chinese language models have low semantic accuracy when repeatedly decomposing Chinese characters, and that such decomposition greatly increases the model's training cost and its dependence on data.
To achieve the above object, a first aspect of the present disclosure provides a method of constructing a natural language model, including:
acquiring a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes;
constructing a Chinese character decomposer that splits each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table;
decomposing every Chinese character of the input text data with the Chinese character decomposer, and constructing an initial natural language model from the decomposed sequence;
in the input layer of the natural language model, assigning a position vector to the decomposed sequence, and adding a sentence-start code at the beginning and a sentence-end code at the end of the input text;
encoding the input text data with the pre-training model of the natural language model to obtain word vectors, and forming an input sequence from the word vectors;
reading, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model, computing category weights, and selecting the highest-weighted categories as candidate categories.
Optionally, the method further includes:
reading, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model according to the structure of the input text, and classifying the text on the basis of that sentence-start vector to obtain a classification result.
Optionally, the pre-training model is a BERT pre-training model, and the method further includes: for the BERT pre-training model, adding the Chinese character structure table to BERT's word vector table.
Optionally, when position vectors are assigned to the decomposed sequence, the structures obtained by decomposing the same Chinese character are assigned the same position vector.
A second aspect of the present disclosure provides an apparatus for constructing a natural language model, including:
a Chinese character table acquisition module, configured to acquire a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes;
a Chinese character decomposer construction module, configured to split each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table;
a model construction module, configured to decompose every Chinese character of the input text data with the Chinese character decomposer and construct an initial natural language model from the decomposed sequence; to assign, in the input layer of the natural language model, a position vector to the decomposed sequence, and to add a sentence-start code at the beginning and a sentence-end code at the end of the input text; to encode the input text data with the pre-training model of the natural language model to obtain word vectors and form an input sequence from them; and to read, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model, compute category weights, and select the highest-weighted categories as candidate categories.
A third aspect of the disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
A fourth aspect of the present disclosure provides an electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of the first aspect.
In the scheme of the disclosed embodiments, each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes. When decomposing, the Chinese character decomposer splits each character to be decomposed into components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the resulting sequence uniquely distinguishes the different characters in the character table. Every Chinese character of the input text data is then decomposed with the decomposer, and an initial natural language model is constructed from the decomposed sequence. The classification model of the disclosed embodiments lengthens the sequence of short texts, thereby alleviating the low semantic accuracy that existing language models exhibit when Chinese characters are repeatedly decomposed, reducing the model's training cost and its dependency on data, and improving the accuracy of semantic understanding of text.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram illustrating a method of building a natural language model in accordance with an exemplary embodiment;
FIG. 2 is a diagram of a natural language model shown in accordance with an exemplary embodiment;
FIG. 3 is a block diagram illustrating an apparatus for building a natural language model in accordance with an exemplary embodiment;
FIG. 4 is a block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
The disclosed embodiments provide a method for constructing a natural language model which, as shown in fig. 1, comprises the following steps.
Step 101: acquire a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes.
When preparing the table of Chinese characters to be decomposed, the 3500 or 7000 most common characters are typically selected. A reference Chinese character structure table is also prepared, for example one generally accepted in academia, such as the specification of common modern Chinese character components and component names published by the Ministry of Education in 2009. Each distinct Chinese character structure in the structure table has a unique code. The characters in the character table are then decomposed according to the structure table to form a Chinese character decomposition table.
For example, the component 月 has two forms: the lunar "moon" (as in 朗, "lang") and the "fleshy moon" derived from 肉 ("meat") (as in 肩 "shoulder" or 臂 "arm"). These are different variants of the same Chinese character structure and represent different meanings, so the two components must be assigned different codes.
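As a sketch of this coding rule, a structure table can key each entry on the component shape plus a variant label, so that the two forms of 月 never share a code. The code values and variant labels below are illustrative assumptions, not values taken from the patent:

```python
# Hypothetical structure table: (component shape, variant label) -> unique code.
# Code values and variant labels are illustrative assumptions.
STRUCTURE_CODES = {
    ("月", "moon"):  1001,  # lunar sense, as in 朗
    ("月", "flesh"): 1002,  # "fleshy moon" variant of 肉, as in 肩 or 臂
    ("口", "mouth"): 1003,
    ("天", "sky"):   1004,
}

def code_of(component, variant):
    """Look up the unique code of a component variant."""
    return STRUCTURE_CODES[(component, variant)]
```

Keying on the variant, not merely the shape, is what keeps the two 月 entries distinct in the model's vocabulary.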
Step 102: construct a Chinese character decomposer that splits each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table.
Explicit two-dimensional coordinates are not required for the structural positions; it suffices that the positions distinguish how the same structures are arranged within different characters. For example, the characters 吴 ("wu") and 吞 ("tun", to swallow) both consist of the components 口 ("kou") and 天 ("tian"); annotating them with structural position information as ((口, 1), (天, 2)) and ((口, 2), (天, 1)) lets the sequence of components and structural positions uniquely identify each character in the character table, which resolves the loss of two-dimensional positional relationships between character structures. After decomposition, the structural position vector and the character structure embedding vector together form the initial word vector that is fed into the model.
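A minimal sketch of such a decomposer, assuming a tiny hypothetical decomposition table (the table contents are illustrative, not taken from the patent); characters outside the table pass through unchanged, as the method requires:

```python
# Hypothetical decomposition table: character -> [(component, structural position)].
DECOMP_TABLE = {
    "吴": [("口", 1), ("天", 2)],  # 口 on top, 天 below
    "吞": [("口", 2), ("天", 1)],  # 天 on top, 口 below
}

def decompose(char):
    """Split a character into (component, structural position) pairs.
    Characters not in the table are left unchanged; position 0 marks a
    whole, undecomposed character."""
    return DECOMP_TABLE.get(char, [(char, 0)])
```

Because their structural positions differ, 吴 and 吞 remain distinguishable even though they share exactly the same components.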
Step 103: decompose every Chinese character of the input text data with the Chinese character decomposer, and construct an initial natural language model from the decomposed sequence.
As these steps show, the scheme of the disclosed embodiments distinguishes different variants of the same Chinese character structure, builds a corresponding structure table for the characters in the character table, splits each character to be decomposed into components and structural positions with the decomposer, and feeds the result into the BERT model. Because the structural positions record where each component sits within the whole character, the final classification model lengthens the sequence of short texts, alleviating the low semantic accuracy of existing Chinese language models under repeated character decomposition, reducing the model's training cost and data dependency, and improving the accuracy of semantic understanding of text.
The construction of the initial natural language model proceeds as follows. As shown in fig. 2, the Chinese character structure decomposer splits the characters in the decomposition table into character structures and position information, while characters absent from the decomposition table must be left unchanged.
In the input layer of the natural language model, each character of the input text passes through the Chinese character decomposer, and the decomposed sequence is then assigned position vectors. Note that all structures obtained by decomposing the same character must receive the same position vector (this position vector is distinct from the structural position vector described above), and an additional sentence-start vector [cls] and sentence-end vector [sep] are attached at the start and end of the input text.
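The position-id assignment can be sketched as follows. This is a simplified sketch under the assumption that the actual vectors are later looked up from these integer ids; the [CLS]/[SEP] token names follow BERT convention:

```python
def build_input(decomposed_chars):
    """decomposed_chars: one list of components per original character
    (characters left undecomposed are singleton lists).
    Returns the token sequence and its sentence-position ids; all components
    of the same original character share one position id."""
    tokens, positions = ["[CLS]"], [0]
    for i, components in enumerate(decomposed_chars, start=1):
        for comp in components:
            tokens.append(comp)
            positions.append(i)  # same id for every component of this character
    tokens.append("[SEP]")
    positions.append(len(decomposed_chars) + 1)
    return tokens, positions
```

Sharing one position id across a character's components is what tells the model that 口 and 天 came from a single original character rather than two adjacent ones.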
For the BERT (Bidirectional Encoder Representations from Transformers) pre-training model of the natural language model, the Chinese character structure table can be added to BERT's word vector table.
In the text classification layer of the natural language model, the sentence-start vector [cls] computed by the model can be read from the head of the sequence according to the structure of the input text, and classification relies on this sentence-start vector.
The model is then fine-tuned on the downstream short-text classification task. When fine-tuning, every Chinese character appearing in the downstream task should be covered by the Chinese character structure table; characters outside the table require additional calibration.
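A small pre-flight check for that coverage requirement can be sketched as follows (the table contents in the usage below are hypothetical):

```python
def uncovered_chars(corpus, decomposition_table):
    """Return, sorted, the characters of a downstream corpus that are absent
    from the decomposition table and would therefore need additional
    calibration before fine-tuning."""
    return sorted({ch for ch in corpus if ch not in decomposition_table})
```

For instance, with a table covering only {"吴", "吞"}, the corpus "吴吞天" would flag 天 as needing calibration.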
Based on the same inventive concept, an embodiment of the present disclosure further provides a device for constructing a natural language model, as shown in fig. 3, the device includes:
a Chinese character table acquisition module 301, configured to acquire a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes;
a Chinese character decomposer construction module 302, configured to split each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table;
a model construction module 303, configured to decompose every Chinese character of the input text data with the Chinese character decomposer, and construct an initial natural language model from the decomposed sequence.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an electronic device 400 according to an example embodiment. As shown in fig. 4, the electronic device 400 may include: a processor 401 and a memory 402. The electronic device 400 may also include one or more of a multimedia component 403, an input/output (I/O) interface 404, and a communications component 405.
The processor 401 is configured to control the overall operation of the electronic device 400, so as to complete all or part of the steps in the method for constructing a natural language model. The memory 402 is used to store various types of data to support operations at the electronic device 400, such as instructions for any application or method operating on the electronic device 400 and application-related data. The multimedia components 403 may include a screen and an audio component. The I/O interface 404 provides an interface between the processor 401 and other interface modules. The communication component 405 is used for wired or wireless communication between the electronic device 400 and other devices.
In another exemplary embodiment, there is also provided a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described method of constructing a natural language model. For example, the computer readable storage medium may be the memory 402 comprising program instructions executable by the processor 401 of the electronic device 400 to perform the method of constructing a natural language model described above.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (7)

1. A method of constructing a natural language model, comprising:
acquiring a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes;
constructing a Chinese character decomposer that splits each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table;
decomposing every Chinese character of input text data with the Chinese character decomposer, and constructing an initial natural language model from the decomposed sequence;
in the input layer of the natural language model, assigning a position vector to the decomposed sequence, and adding a sentence-start code at the beginning and a sentence-end code at the end of the input text;
encoding the input text data with the pre-training model of the natural language model to obtain word vectors, and forming an input sequence from the word vectors;
reading, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model, computing category weights, and selecting the highest-weighted categories as candidate categories.
2. The method of claim 1, wherein the method further comprises:
reading, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model according to the structure of the input text, and classifying the text on the basis of that sentence-start vector to obtain a classification result.
3. The method of claim 1, wherein the pre-training model is a BERT pre-training model, the method further comprising: for the BERT pre-training model, adding the Chinese character structure table to BERT's word vector table.
4. The method of claim 1, wherein when position vectors are assigned to the decomposed sequence, the structures obtained by decomposing the same Chinese character are assigned the same position vector.
5. An apparatus for constructing a natural language model, comprising:
a Chinese character table acquisition module, configured to acquire a table of Chinese characters to be decomposed and a Chinese character structure table, wherein each distinct Chinese character structure in the structure table has a unique code, and different variants of the same structure have different codes;
a Chinese character decomposer construction module, configured to split each character to be decomposed into Chinese character components and structural positions, where a structural position indicates the place a component occupies within the whole character, so that the sequence formed by the components and structural positions uniquely distinguishes the different characters in the character table;
a model construction module, configured to decompose every Chinese character of the input text data with the Chinese character decomposer and construct an initial natural language model from the decomposed sequence; to assign, in the input layer of the natural language model, a position vector to the decomposed sequence, and to add a sentence-start code at the beginning and a sentence-end code at the end of the input text; to encode the input text data with the pre-training model of the natural language model to obtain word vectors and form an input sequence from them; and to read, in the text classification layer of the natural language model, the sentence-start vector of the sequence output by the pre-training model, compute category weights, and select the highest-weighted categories as candidate categories.
6. A non-transitory computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps of the method of any one of claims 1 to 4.
7. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 4.
CN202211457954.6A 2022-11-17 2022-11-17 Method and device for constructing natural language model and electronic equipment Pending CN115758997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211457954.6A CN115758997A (en) 2022-11-17 2022-11-17 Method and device for constructing natural language model and electronic equipment

Publications (1)

Publication Number Publication Date
CN115758997A true CN115758997A (en) 2023-03-07

Family

ID=85333911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211457954.6A Pending CN115758997A (en) 2022-11-17 2022-11-17 Method and device for constructing natural language model and electronic equipment

Country Status (1)

Country Link
CN (1) CN115758997A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination