CN110826324B - Language model training and word segmentation prediction method and device and language model - Google Patents


Info

Publication number
CN110826324B
Authority
CN
China
Prior art keywords
machine learning
learning model
word
stroke
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911047639.4A
Other languages
Chinese (zh)
Other versions
CN110826324A (en)
Inventor
曹绍升
崔卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911047639.4A
Publication of CN110826324A
Application granted
Publication of CN110826324B
Legal status: Active

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

Embodiments of this specification provide a method and apparatus for training a language model, a method and apparatus for word segmentation prediction, and a language model. A word segment and its stroke set are used together to train the language model and to predict a target word segment. Because features are extracted at both the word level and the stroke level, the feature granularity is finer, so the trained language model is more accurate and the target word segment is predicted more accurately.

Description

Language model training and word segmentation prediction method and device and language model
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for training a language model and predicting word segmentation, and a language model.
Background
Colloquially, a language model determines how similar a computer-generated sentence is to human language. Language models have very wide application; for example, in question-answering systems, answers can be generated automatically by a language model, and besides answering the question accurately, the generated answers should be expressed as much as possible in human language. There is therefore a need to improve language models to increase their accuracy.
Disclosure of Invention
Based on the above, embodiments of this specification provide a method and apparatus for training a language model, a method and apparatus for word segmentation prediction, and a language model.
According to a first aspect of embodiments of the present specification, there is provided a training method of a language model, the method comprising:
acquiring a training text;
respectively obtaining a first stroke set of each first word in the training text, wherein the first stroke set comprises first stroke codes of each character in the first word;
and taking the first word and the first stroke codes thereof as input of the language model, and taking at least one other first word positioned after the first word in the training text as output of the language model so as to train the language model.
With the solution of this embodiment, the first word segments and their stroke sets are obtained from the training text, and the language model is trained on both together. Word-level and stroke-level features can thus be extracted from the training text; because the extracted features are finer-grained, a language model trained on them extracts text features more effectively, which improves its accuracy.
According to a second aspect of embodiments of the present specification, there is provided a word segmentation prediction method, the method comprising:
acquiring a second word and a second stroke set; wherein the second stroke set comprises second stroke codes of all characters in the second word segmentation;
the second word and the second stroke code thereof are input into a pre-trained language model to predict at least one target word following the second word.
With the solution of this embodiment, the second word segment and its stroke set are input together into the pre-trained language model to predict at least one target word segment following it. Word-level and stroke-level features can be extracted during prediction; because the extracted features are finer-grained, text features are extracted more effectively and the prediction accuracy of the language model is higher.
According to a third aspect of embodiments of the present specification, there is provided a language model comprising:
the first machine learning model, the second machine learning model and the third machine learning model are sequentially connected;
the first machine learning model is used for inputting a second stroke set of a second word segment; wherein the second stroke set comprises second stroke codes of all characters in the second word segment;
the second machine learning model is used for inputting the second word segmentation;
the third machine learning model is used for predicting at least one target word segment after the second word segment.
With the solution of this embodiment, the second word segment and its stroke set are input together into the pre-trained language model to predict at least one target word segment following it. Word-level and stroke-level features can be extracted during prediction; because the extracted features are finer-grained, text features are extracted more effectively and the prediction accuracy of the language model is higher.
According to a fourth aspect of embodiments of the present specification, there is provided a training apparatus for a language model, the apparatus comprising:
the first acquisition module is used for acquiring training texts;
the second acquisition module is used for respectively acquiring a first stroke set of each first word in the training text, wherein the first stroke set comprises first stroke codes of each character in the first word;
the training module is used for taking the first word and the first stroke code thereof as input of the language model, and taking at least one other first word positioned after the first word in the training text as output of the language model so as to train the language model.
With the solution of this embodiment, the first word segments and their stroke sets are obtained from the training text, and the language model is trained on both together. Word-level and stroke-level features can thus be extracted from the training text; because the extracted features are finer-grained, a language model trained on them extracts text features more effectively, which improves its accuracy.
According to a fifth aspect of embodiments of the present specification, there is provided a word segmentation prediction apparatus, the apparatus comprising:
the third acquisition module is used for acquiring the second word and the second stroke set; wherein the second stroke set comprises second stroke codes of all characters in the second word segmentation;
and the prediction module is used for inputting the second word and the second stroke code thereof into a pre-trained language model so as to predict at least one target word after the second word.
With the solution of this embodiment, the second word segment and its stroke set are input together into the pre-trained language model to predict at least one target word segment following it. Word-level and stroke-level features can be extracted during prediction; because the extracted features are finer-grained, text features are extracted more effectively and the prediction accuracy of the language model is higher.
According to a sixth aspect of embodiments of the present description, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of the embodiments.
According to a seventh aspect of the embodiments of the present specification, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the embodiments when executing the program.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flowchart of a training method for a language model according to an embodiment of the present disclosure.
FIG. 2 is a training flow diagram of a language model in accordance with an embodiment of the present disclosure.
Fig. 3 is a flowchart of a word segmentation prediction method according to an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a language model according to an embodiment of the present disclosure.
Fig. 5 is a block diagram of a training apparatus of a language model according to an embodiment of the present specification.
Fig. 6 is a block diagram of a word segmentation prediction apparatus according to an embodiment of the present specification.
FIG. 7 is a schematic diagram of a computer device for implementing the method of an embodiment of the present specification, according to an embodiment of the present specification.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
As shown in fig. 1, an embodiment of the present disclosure provides a training method for a language model, where the method may include:
step S102: acquiring a training text;
step S104: respectively acquiring stroke sets of each first word in the training text, wherein the stroke sets comprise first stroke codes of each character in the first word;
step S106: and taking the first word and the first stroke codes thereof as input of the language model, and taking at least one other first word positioned after the first word in the training text as output of the language model so as to train the language model.
For step S102, in general, a larger training text provides greater linguistic diversity for training a language model, so training text is often crawled from the Internet by a search engine. For a Chinese language model, the training text to be obtained is Chinese text, and can accordingly be crawled from Chinese web pages.
Further, crawled training text is often quite noisy; for example, it may include the code of pictures embedded in the web page. The training text can therefore be filtered to improve training accuracy. A filter can be designed to remove this noise: since the position of such code within a web page is generally fixed, the code positions can be stored in the filter. When training text is crawled, its position in the web page is obtained; if that position matches a code position stored in the filter, the text at that position is filtered out, and otherwise it is retained. The resulting training text can then be used for language model training.
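For illustration only, the following Python sketch shows one way such position-based filtering could be implemented. The page representation, the stored positions, and all names here are hypothetical, since the embodiment does not fix these details.

```python
# Hypothetical sketch of the position-based filter described above.
# KNOWN_CODE_POSITIONS and the (position, text) span representation are
# assumptions; the embodiment only requires that the positions of code
# blocks in a page are fixed and known in advance.
KNOWN_CODE_POSITIONS = {
    ("example.com/page", (120, 480)),  # hypothetical stored code position
}

def filter_training_text(url: str, spans: list) -> str:
    """Keep only spans whose page position is not a stored code position.

    spans: list of ((start, end), text) tuples extracted from the page.
    """
    return "".join(text for pos, text in spans
                   if (url, pos) not in KNOWN_CODE_POSITIONS)
```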
For step S104, after the training text is obtained, word segmentation processing may be performed on it to obtain one or more first word segments. For Chinese, each first word segment obtained by the word segmentation processing is an independent, complete, and correct word. For example, assuming the training text is "小猫吃鱼" ("the kitten eats fish"), two first word segments, "小猫" (kitten) and "吃鱼" (eat fish), are obtained.
For each first word segment, its stroke set may be acquired separately. Chinese characters are generally composed of five basic strokes: horizontal, vertical, left-falling, right-falling, and fold. Each stroke can be assigned a code that serves as its identification information and uniquely identifies it; for example, horizontal, vertical, left-falling, right-falling, and fold may be coded 1, 2, 3, 4, and 5, respectively. A stroke may also have variants (for example, a vertical stroke may be deformed into a vertical hook), and a variant takes the same code as its base stroke. The first stroke codes of each character in a first word segment are generated from these codes. Assuming there are N first word segments and the i-th first word segment contains Ki characters, N stroke sets are obtained, the i-th of which contains the first stroke codes of the Ki characters of the i-th first word segment.
Further, if character C1 precedes character C2 in the first word segment, the first stroke codes of C1 also precede those of C2 in the stroke set, and within a character the codes follow that character's stroke order. The first stroke set of a first word segment can therefore be generated from the order of its characters, the stroke order of each character, and the code of each stroke. A mapping table from the characters in the training texts to their stroke orders can be built in advance; when a stroke set is to be acquired, the stroke orders of the characters in the first word segment are looked up in the mapping table, and the stroke set is then generated from those stroke orders and the stroke codes. The mapping table may be extracted manually from a dictionary or crawled from an online dictionary.
Still taking the training text "小猫吃鱼" as an example: the stroke codes of "小猫" consist of the first stroke codes of "小" followed by those of "猫". The strokes of "小" are, in order, vertical hook, left-falling, and right-falling (codes 2, 3, 4), and the strokes of "猫" are, in order, left-falling, hook, left-falling, horizontal, vertical, vertical, vertical, fold, horizontal, vertical, and horizontal (codes 3, 5, 3, 1, 2, 2, 2, 5, 1, 2, 1). The stroke set corresponding to "小猫" is therefore "23435312225121". The stroke set corresponding to "吃鱼", i.e., "25131535251211", can be obtained in a similar manner.
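For illustration only, the following Python sketch generates these stroke sets. The four mapping-table entries are the actual stroke orders of the characters in this example; a full table would be extracted from a dictionary as described above.

```python
# Stroke codes: horizontal=1, vertical=2, left-falling=3, right-falling=4,
# fold=5; a variant (e.g., vertical hook) shares its base stroke's code.
char_strokes = {
    "小": "234",
    "猫": "35312225121",
    "吃": "251315",
    "鱼": "35251211",
}

def stroke_set(word: str) -> str:
    """Concatenate each character's stroke codes in the order the characters appear."""
    return "".join(char_strokes[c] for c in word)

assert stroke_set("小猫") == "23435312225121"
assert stroke_set("吃鱼") == "25131535251211"
```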
For step S106, the first word segment and its first stroke codes may be used together as the input of the language model, and at least one other first word segment located after the first word segment in the training text as the output of the language model, to train the language model.
FIG. 2 is a training flow diagram of a language model according to an embodiment of the present disclosure. The parameters of the language model may be trained using "小猫" and "23435312225121" as the input of the language model and "吃鱼" as its output. The language model in the figure consists, in sequence, of a stroke-level convolutional neural network, a character-level convolutional neural network, and an LSTM. The output is not limited to a single word segment: for a training text such as "eat together at night", the first word segment "night" and its corresponding first stroke set may be used as the input, and the two following first word segments "together" and "eat" as the output, to train the parameters of the language model.
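For illustration only, the following Python sketch builds such input/output pairs from a segmented sentence, reusing the stroke_set helper sketched above. Predicting a single following word segment is just one configuration; as noted, several following word segments may equally serve as the output.

```python
def training_pairs(words):
    """Yield ((word, its stroke set), next word) pairs from a segmented sentence."""
    for i in range(len(words) - 1):
        yield (words[i], stroke_set(words[i])), words[i + 1]

pairs = list(training_pairs(["小猫", "吃鱼"]))
# pairs == [(("小猫", "23435312225121"), "吃鱼")]
```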
In some embodiments, the language model includes a first machine learning model, a second machine learning model, and a third machine learning model that are connected in sequence; the step of using the first word segment and the first stroke code thereof as input of the language model, and using at least one other first word segment located after the first word segment in the training text as output of the language model to train the language model comprises the following steps: the first stroke code is used as an input of the first machine learning model, the first word segment is used as an input of the second machine learning model, and the at least one other first word segment is used as an output of the third machine learning model to train the language model.
According to this embodiment, the finer-grained strokes are input into the first machine learning model; the output of the first machine learning model, together with the coarser-grained word segment, is input into the second machine learning model; and the output of the second machine learning model is input into the third machine learning model to train the language model. Word-level and stroke-level features can thus be extracted from the training text, and because the extracted features are finer-grained, the trained language model extracts text features more effectively and performs word segmentation prediction more accurately.
In practical application, the first machine learning model and the second machine learning model are convolutional neural networks (Convolutional Neural Networks, CNN), and the third machine learning model is a Long Short-Term Memory network (LSTM).
The stroke-level convolutional neural network (strokeCNN) extracts the more important stroke information from the strokes, and the character-level convolutional neural network (charCNN) extracts the more important character information from the word segment. The LSTM then synthesizes the outputs of the two convolutional neural networks. Processing the training text layer by layer through these three machine learning models captures its features from fine to coarse granularity, so a more accurate language model can be trained.
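For illustration only, the following PyTorch sketch shows one plausible reading of this three-stage architecture. All layer sizes, the max-pooling choices, and the exact way the stroke vector is combined with the character embeddings are assumptions; the embodiments do not fix these details.

```python
import torch
import torch.nn as nn

class StrokeWordLM(nn.Module):
    """Sketch: stroke-level CNN -> character-level CNN -> LSTM -> next word."""

    def __init__(self, n_strokes=6, n_chars=5000, n_words=50000,
                 stroke_dim=32, char_dim=64, hidden=256):
        super().__init__()
        # First model: embed stroke codes 1-5 (0 = padding) and convolve
        # over the stroke sequence of each word segment.
        self.stroke_emb = nn.Embedding(n_strokes, stroke_dim, padding_idx=0)
        self.stroke_cnn = nn.Conv1d(stroke_dim, stroke_dim, kernel_size=3, padding=1)
        # Second model: convolve over the word's characters together with
        # the stroke vector produced by the first model.
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim + stroke_dim, hidden, kernel_size=3, padding=1)
        # Third model: LSTM over per-word vectors, then a next-word classifier.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_words)

    def forward(self, strokes, chars):
        # strokes: (batch, n_word_segments, strokes_per_word)
        # chars:   (batch, n_word_segments, chars_per_word)
        b, t, s = strokes.shape
        sv = self.stroke_emb(strokes.view(b * t, s)).transpose(1, 2)
        sv = torch.relu(self.stroke_cnn(sv)).max(dim=2).values        # stroke vector
        cv = self.char_emb(chars.view(b * t, -1)).transpose(1, 2)
        # Broadcast the stroke vector over the character positions so the
        # character-level CNN sees both the characters and the stroke features.
        sv_tiled = sv.unsqueeze(2).expand(-1, -1, cv.size(2))
        wv = torch.relu(self.char_cnn(torch.cat([cv, sv_tiled], dim=1)))
        wv = wv.max(dim=2).values.view(b, t, -1)                      # word vectors
        h, _ = self.lstm(wv)
        return self.out(h)  # logits for the next word segment at each position
```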
The characters in the above embodiments may be Chinese characters, and the language model may be a Chinese language model. Chinese is a distinctive language with strong pictographic characteristics; compared with Western languages such as English, a Chinese language model can therefore describe more characteristics of Chinese and achieve better results. Extracting finer-grained stroke features characterizes more of these characteristics and makes the language model more accurate.
As shown in fig. 3, a flowchart of a word segmentation prediction method according to an embodiment of the present disclosure is shown. The method may comprise:
step S302: acquiring a second word and a stroke set thereof; wherein the stroke set comprises second stroke codes of all characters in the second word segmentation;
step S304: the second word and the second stroke code thereof are input into a pre-trained language model to predict at least one target word following the second word.
For step S302, a second word segment to be processed and its stroke set may be acquired. The second word segment may be a word entered by a user, and the input manner includes, but is not limited to, keyboard input, voice input, handwriting input, and the like. For example, the second word segment may be "小狗" (puppy). The stroke set of the second word segment is obtained in the same manner as that of a first word segment, which is not repeated here. For the second word segment "小狗", the corresponding stroke set is "23435335251".
For step S304, "小狗" and "23435335251" may be input into a pre-trained language model to predict at least one target word segment following "小狗". For example, the predicted target word segment may be "eating meat".
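For illustration only, the following Python sketch runs this prediction step with the StrokeWordLM and stroke_set sketches above. The character vocabulary and the encode_* helpers are hypothetical stand-ins for a real tokenization pipeline, and the untrained model here would of course not actually predict "eating meat".

```python
import torch

char_strokes["狗"] = "35335251"   # actual stroke order of 狗 (dog)
CHAR_IDS = {"小": 1, "狗": 2}     # hypothetical character vocabulary

def encode_strokes(words):
    """Shape (1, n_words, max_strokes): padded stroke codes per word segment."""
    seqs = [[int(d) for d in stroke_set(w)] for w in words]
    m = max(len(s) for s in seqs)
    return torch.tensor([[s + [0] * (m - len(s)) for s in seqs]])

def encode_chars(words):
    """Shape (1, n_words, chars_per_word): character ids per word segment."""
    return torch.tensor([[[CHAR_IDS[c] for c in w] for w in words]])

model = StrokeWordLM()
model.eval()
with torch.no_grad():
    logits = model(encode_strokes(["小狗"]), encode_chars(["小狗"]))
next_word_id = logits[0, -1].argmax().item()  # id of the predicted target word
```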
In some embodiments, the language model may be trained from each first word segment in the training text and its first stroke codes. The language model adopted in this embodiment may be obtained using the training method of any of the above embodiments: for example, a first stroke set of each first word segment in the training text may be obtained, where the first stroke set comprises the first stroke codes of each character in the first word segment; the first word segment and its first stroke codes are then used as the input of the language model, and at least one other first word segment following it in the training text as the output, to train the language model.
In some embodiments, the language model includes a first machine learning model, a second machine learning model, and a third machine learning model that are connected in sequence; the step of inputting the second word and its second stroke code into a pre-trained language model to predict at least one target word following the second word comprises: inputting the second stroke code into the first machine learning model, inputting the second word into the second machine learning model, and obtaining at least one target word outputted by the third machine learning model.
The finer-grained strokes are first input into the first machine learning model; the output of the first machine learning model, together with the coarser-grained word segment, is then input into the second machine learning model; and the output of the second machine learning model is input into the third machine learning model to obtain the target word segment. Word-level and stroke-level features are thus extracted at prediction time, and the extracted feature granularity is finer.
In practical application, the first machine learning model and the second machine learning model are convolutional neural networks (Convolutional Neural Networks, CNN), and the third machine learning model is a Long Short-Term Memory network (LSTM).
The stroke-level convolutional neural network (strokeCNN) extracts the more important stroke information from the strokes, and the character-level convolutional neural network (charCNN) extracts the more important character information from the word segment. The LSTM then synthesizes the outputs of the two convolutional neural networks. A language model composed of these three machine learning models captures features of the text from fine to coarse granularity, so word segmentation prediction is more accurate.
The characters in the above embodiments may be Chinese characters, and the language model may be a Chinese language model. Chinese is a distinctive language with strong pictographic characteristics; compared with Western languages such as English, a Chinese language model can therefore describe more characteristics of Chinese and achieve better results. Extracting finer-grained stroke features characterizes more of these characteristics and makes the language model more accurate.
As shown in fig. 4, a schematic diagram of a language model according to an embodiment of the present disclosure, the language model includes:
the first machine learning model, the second machine learning model and the third machine learning model are sequentially connected;
the first machine learning model is used for inputting a second stroke set of a second word segment; wherein the second stroke set comprises second stroke codes of all characters in the second word segment;
the second machine learning model is used for inputting the second word segmentation;
the third machine learning model is used for predicting at least one target word segment after the second word segment.
In some embodiments, the first machine learning model and the second machine learning model are both convolutional neural networks, and the third machine learning model is a long short-term memory network.
The language model of the above embodiment may be trained based on the training method of the language model of any embodiment, and after the training is completed, the language model may be used in the word segmentation prediction method of any embodiment to perform word segmentation prediction.
As shown in fig. 5, which is a block diagram of a training apparatus for a language model according to one embodiment of the present specification, the apparatus may include:
a first obtaining module 502, configured to obtain training text;
a second obtaining module 504, configured to obtain a stroke set of each first word in the training text, where the stroke set includes a first stroke code of each character in the corresponding first word;
the training module 506 is configured to take the first word segment and the first stroke code thereof as input of the language model, and take at least one other first word segment located after the first word segment in the training text as output of the language model, so as to train the language model.
Specific details of the implementation process of the functions and roles of each module in the device are shown in the implementation process of corresponding steps in the training method of the language model, and are not repeated here.
As shown in fig. 6, a block diagram of a word segmentation prediction apparatus according to an embodiment of the present disclosure may include:
a third obtaining module 602, configured to obtain a second term and a stroke set thereof; wherein the stroke set comprises second stroke codes of all characters in the second word segmentation;
a prediction module 604, configured to input the second word and the second stroke code thereof into a pre-trained language model to predict at least one target word following the second word.
Specific details of the implementation process of the functions and actions of each module in the device are shown in the implementation process of the corresponding steps in the word segmentation prediction method, and are not repeated here.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of this specification. Those of ordinary skill in the art can understand and implement them without creative effort.
The apparatus embodiments of this specification may be applied to a computer device, such as a server or a terminal device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical apparatus, it is formed by the processor of the computer device in which it is located reading the corresponding computer program instructions from nonvolatile storage into memory and running them. At the hardware level, as shown in fig. 7, which is a hardware structure diagram of the computer device in which the apparatus of the present disclosure is located, in addition to the processor 702, memory 704, network interface 706, and nonvolatile storage 708 shown in fig. 7, the server or electronic device in which the apparatus is located may also include other hardware according to the actual functions of the computer device, which is not described here again.
Accordingly, the present specification embodiment also provides a computer storage medium having a program stored therein, which when executed by a processor, implements the method in any of the above embodiments.
Accordingly, the present description also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the above embodiments when executing the program.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-usable storage media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by the computing device.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present disclosure.

Claims (13)

1. A method of training a language model, the method comprising:
acquiring a training text;
respectively obtaining a first stroke set of each first word in the training text, wherein the first stroke set comprises first stroke codes of each character in the first word;
the first word and the first stroke code are used as input of the language model, and at least one other first word positioned after the first word in the training text is used as output of the language model so as to train the language model;
the language model comprises a first machine learning model, a second machine learning model and a third machine learning model which are sequentially connected;
the input of the first machine learning model is the first stroke codes;
the input of the second machine learning model is the output result of the first machine learning model and the first word segmentation;
the input of the third machine learning model is the output result of the second machine learning model;
the output of the third machine learning model is the at least one other first word segment.
2. The method of claim 1, the first machine learning model and the second machine learning model being convolutional neural networks, the third machine learning model being a long short-term memory network.
3. The method of any of claims 1-2, prior to separately obtaining the first set of strokes for each first segmentation in the training text, the method further comprising:
and filtering the training text.
4. A method of word segmentation prediction, the method comprising:
acquiring a second word and a second stroke set; wherein the second stroke set comprises second stroke codes of all characters in the second word segmentation;
inputting the second word and the second stroke code thereof into a pre-trained language model to predict at least one target word after the second word;
the language model comprises a first machine learning model, a second machine learning model and a third machine learning model which are sequentially connected;
the input of the first machine learning model is the second stroke codes;
the input of the second machine learning model is the output result of the first machine learning model and the second word segmentation;
the input of the third machine learning model is the output result of the second machine learning model;
the output of the third machine learning model is the at least one target word segment.
5. The method of claim 4, wherein the language model is trained from each first word segment in the training text and its first stroke code.
6. The method of claim 5, the language model being trained in accordance with:
respectively obtaining a first stroke set of each first word in the training text, wherein the first stroke set comprises first stroke codes of each character in the first word;
the first word and the first stroke code are used as input of the language model, and at least one other first word positioned after the first word in the training text is used as output of the language model so as to train the language model;
wherein the input of the first machine learning model is the first stroke codes;
the input of the second machine learning model is the output result of the first machine learning model and the first word segmentation;
the input of the third machine learning model is the output result of the second machine learning model;
the output of the third machine learning model is the at least one other first word segment.
7. The method of claim 4, the first machine learning model and the second machine learning model being convolutional neural networks, the third machine learning model being a long short-term memory network.
8. A language model, the language model comprising:
the first machine learning model, the second machine learning model and the third machine learning model are sequentially connected;
the first machine learning model is used for inputting a second stroke set of a second word segment; wherein the second stroke set comprises second stroke codes of all characters in the second word segment;
the input of the second machine learning model comprises the second word segment and the output result of the first machine learning model;
the input of the third machine learning model is the output result of the second machine learning model, and the third machine learning model is used for predicting at least one target word after the second word.
9. The language model of claim 8, the first machine learning model and the second machine learning model being convolutional neural networks, the third machine learning model being a long short-term memory network.
10. A training apparatus for a language model, the apparatus comprising:
the first acquisition module is used for acquiring training texts;
the second acquisition module is used for respectively acquiring a first stroke set of each first word in the training text, wherein the first stroke set comprises first stroke codes of each character in the first word;
the training module is used for taking the first word segmentation and the first stroke codes thereof as input of the language model, and taking at least one other first word segmentation positioned after the first word segmentation in the training text as output of the language model so as to train the language model;
the language model comprises a first machine learning model, a second machine learning model and a third machine learning model which are sequentially connected;
the input of the first machine learning model is the first stroke codes;
the input of the second machine learning model is the output result of the first machine learning model and the first word segmentation;
the input of the third machine learning model is the output result of the second machine learning model;
the output of the third machine learning model is the at least one other first word segment.
11. A word segmentation prediction apparatus, the apparatus comprising:
the third acquisition module is used for acquiring the second word and the second stroke set; wherein the second stroke set comprises second stroke codes of all characters in the second word segmentation;
the prediction module is used for inputting the second word and the second stroke code thereof into a pre-trained language model so as to predict at least one target word after the second word;
the language model comprises a first machine learning model, a second machine learning model and a third machine learning model which are sequentially connected;
the input of the first machine learning model is the second stroke codes;
the input of the second machine learning model is the output result of the first machine learning model and the second word segmentation;
the input of the third machine learning model is the output result of the second machine learning model;
the output of the third machine learning model is the at least one target word segment.
12. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of any of claims 1 to 7.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when the program is executed.
CN201911047639.4A 2019-10-30 2019-10-30 Language model training and word segmentation prediction method and device and language model Active CN110826324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911047639.4A CN110826324B (en) 2019-10-30 2019-10-30 Language model training and word segmentation prediction method and device and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911047639.4A CN110826324B (en) 2019-10-30 2019-10-30 Language model training and word segmentation prediction method and device and language model

Publications (2)

Publication Number Publication Date
CN110826324A CN110826324A (en) 2020-02-21
CN110826324B 2024-02-09

Family

Family ID: 69551616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911047639.4A Active CN110826324B (en) 2019-10-30 2019-10-30 Language model training and word segmentation prediction method and device and language model

Country Status (1)

Country Link
CN (1) CN110826324B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR940024622A (en) * 1993-04-08 1994-11-18 김진형 Hangul handwritten character recognition device and method
US5491758A (en) * 1993-01-27 1996-02-13 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
CN109933686A (en) * 2019-03-18 2019-06-25 阿里巴巴集团控股有限公司 Song Tag Estimation method, apparatus, server and storage medium
CN109992124A (en) * 2018-01-02 2019-07-09 北京搜狗科技发展有限公司 Input method, device and machine readable media
CN110097085A (en) * 2019-04-03 2019-08-06 阿里巴巴集团控股有限公司 Lyrics document creation method, training method, device, server and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7496232B2 (en) * 2004-06-23 2009-02-24 Microsoft Corporation Distinguishing text from non-text in digital ink
US8542927B2 (en) * 2008-06-26 2013-09-24 Microsoft Corporation Character auto-completion for online east asian handwriting input
US10169670B2 (en) * 2015-11-30 2019-01-01 International Business Machines Corporation Stroke extraction in free space

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5491758A (en) * 1993-01-27 1996-02-13 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
KR940024622A (en) * 1993-04-08 1994-11-18 김진형 Hangul handwritten character recognition device and method
CN109992124A (en) * 2018-01-02 2019-07-09 北京搜狗科技发展有限公司 Input method, device and machine readable media
CN109933686A (en) * 2019-03-18 2019-06-25 阿里巴巴集团控股有限公司 Song Tag Estimation method, apparatus, server and storage medium
CN110097085A (en) * 2019-04-03 2019-08-06 阿里巴巴集团控股有限公司 Lyrics document creation method, training method, device, server and storage medium

Also Published As

Publication number Publication date
CN110826324A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
US20200301954A1 (en) Reply information obtaining method and apparatus
RU2691214C1 (en) Text recognition using artificial intelligence
KR102304673B1 (en) Keyword extraction method, computer device, and storage medium
US10475442B2 (en) Method and device for recognition and method and device for constructing recognition model
CN106528845B (en) Retrieval error correction method and device based on artificial intelligence
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
CN107391614A (en) A kind of Chinese question and answer matching process based on WMD
CN109408821B (en) Corpus generation method and device, computing equipment and storage medium
CN106611015B (en) Label processing method and device
CN111428448B (en) Text generation method, device, computer equipment and readable storage medium
CN110147806A (en) Training method, device and the storage medium of image description model
CN112711948A (en) Named entity recognition method and device for Chinese sentences
CN110598790A (en) Image identification method and device, electronic equipment and storage medium
CN110851641A (en) Cross-modal retrieval method and device and readable storage medium
CN110597966A (en) Automatic question answering method and device
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN112633423B (en) Training method of text recognition model, text recognition method, device and equipment
CN111199157B (en) Text data processing method and device
CN106599297A (en) Method and device for searching question-type search terms on basis of deep questions and answers
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN110826324B (en) Language model training and word segmentation prediction method and device and language model
CN109657710B (en) Data screening method and device, server and storage medium
CN110879832A (en) Target text detection method, model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant