CN111859981A - Language model acquisition and Chinese semantic understanding method, device and storage medium - Google Patents
- Publication number
- CN111859981A CN111859981A CN202010552815.6A CN202010552815A CN111859981A CN 111859981 A CN111859981 A CN 111859981A CN 202010552815 A CN202010552815 A CN 202010552815A CN 111859981 A CN111859981 A CN 111859981A
- Authority
- CN
- China
- Prior art keywords
- information
- training
- language model
- embedded information
- sound
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F40/00—Handling natural language data › G06F40/30—Semantic analysis
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/24—Classification techniques
Abstract
The application discloses a method, an apparatus and a storage medium for language model acquisition and Chinese semantic understanding, relating to the fields of natural language processing and deep learning. The method may comprise: acquiring Chinese texts as training data; for any piece of training data, respectively acquiring predetermined embedding information of each character in it, where the predetermined embedding information comprises at least two kinds of embedding information, one of which is tone embedding information; and training a language model with the training data according to the predetermined embedding information, the language model being used to generate semantic representation information of a to-be-processed Chinese text according to the predetermined embedding information of each character in that text. Applying this scheme improves the accuracy of semantic understanding results.
Description
Technical Field
The present application relates to computer application technologies, and in particular, to a method and an apparatus for language model acquisition and Chinese semantic understanding in the fields of natural language processing and deep learning, and a storage medium.
Background
With the introduction of large general-purpose pre-trained language models such as the knowledge-enhanced semantic representation model (ERNIE) and Bidirectional Encoder Representations from Transformers (BERT), Chinese semantic understanding tasks have taken a qualitative leap forward. The Transformer is the common basic structure of such models; the self-attention mechanism it adopts enables the models to better understand the semantics of a text by capturing its context information.
However, Chinese contains many ambiguous characters, and it is difficult to resolve the ambiguity from context information alone, so semantic understanding results are not accurate enough.
Disclosure of Invention
The present application provides a language model acquisition method, a Chinese semantic understanding method, corresponding apparatuses, and a storage medium.
A language model acquisition method, comprising:
acquiring a Chinese text serving as training data;
for any piece of training data, respectively acquiring predetermined embedded information of each character in it, wherein the predetermined embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and training a language model by using the training data according to the preset embedding information, wherein the language model is used for generating semantic representation information of the Chinese text to be processed according to the preset embedding information of each character in the Chinese text to be processed.
A Chinese semantic understanding method comprises the following steps:
respectively acquiring preset embedded information of each character in a Chinese text to be processed, wherein the preset embedded information at least comprises two kinds of embedded information, and one kind of embedded information is tone embedded information;
and obtaining semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
A language model acquisition apparatus comprising: the system comprises a data acquisition module and a model training module;
the data acquisition module is used for acquiring a Chinese text serving as training data;
the model training module is used for acquiring, for any piece of training data, the predetermined embedded information of each character in it, where the predetermined embedded information comprises at least two kinds of embedded information, one of which is tone embedded information, and for training a language model with the training data according to the predetermined embedded information; the language model is used for generating semantic representation information of a to-be-processed Chinese text according to the predetermined embedded information of each character in that text.
A Chinese semantic understanding apparatus, comprising: the system comprises a preprocessing module and a semantic acquisition module;
the preprocessing module is used for respectively acquiring preset embedded information of each character in the Chinese text to be processed, wherein the preset embedded information at least comprises two kinds of embedded information, and one kind of embedded information is tone embedded information;
and the semantic acquisition module is used for acquiring semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
One embodiment of the above application has the following advantages or benefits: tone information is introduced into both the training process of the language model and the Chinese semantic understanding process, so that the model gains the ability to judge the semantic information of a text in different contexts; ambiguity is reduced through the tone information, and the accuracy of semantic understanding results is thereby improved. It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of an embodiment of a language model acquisition method according to the present application;
FIG. 2 is a schematic diagram illustrating a training process of a language model according to the present application;
FIG. 3 is a flowchart of an embodiment of a Chinese semantic understanding method according to the present application;
FIG. 4 is a schematic diagram illustrating a structure of an embodiment of a language model obtaining apparatus 40 according to the present application;
FIG. 5 is a schematic diagram illustrating a structure of an embodiment of a Chinese semantic understanding apparatus 50 according to the present application;
FIG. 6 is a block diagram of an electronic device according to the method of an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
In Chinese, different tones may convey different semantics. For example, for a modal particle, different tones may express affirmation (confirmation) or questioning; introducing tones into the training of the language model and into the Chinese semantic understanding process can therefore greatly reduce ambiguity.
Fig. 1 is a flowchart of an embodiment of a language model obtaining method according to the present application. As shown in fig. 1, the following detailed implementation is included.
In 101, a chinese text is acquired as training data.
In 102, for any piece of training data, predetermined embedding (Embedding) information of each character in it is respectively acquired, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is tone embedding (Tone Embedding) information.
In 103, a language model is trained by using the training data according to the predetermined Embedding information, and the language model is used for generating semantic representation information of the Chinese text to be processed according to the predetermined Embedding information of each character in the Chinese text to be processed.
In this embodiment, the tone information may be introduced into a training process of the language model, so that the model has the capability of determining semantic information of the text in different contexts.
For each piece of training data among the Chinese texts used as training data, the predetermined Embedding information of each character may be acquired, and it may include Tone Embedding information, i.e., information that models the different tones. Preferably, the tones may include the neutral tone and the first, second, third and fourth tones, which can be represented by the five IDs 0, 1, 2, 3 and 4 respectively. Each Chinese text may comprise one sentence or several sentences.
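The tone-to-ID scheme described above can be sketched as follows. This is a minimal illustration; the `TONE_IDS` table and function name are invented for the sketch and are not taken from the patent:

```python
# Hypothetical sketch of the tone-ID scheme described above:
# neutral tone -> 0, first through fourth tones -> 1..4.
TONE_IDS = {"neutral": 0, "first": 1, "second": 2, "third": 3, "fourth": 4}

def tone_ids(tones):
    """Map a per-character tone annotation sequence to integer IDs."""
    return [TONE_IDS[t] for t in tones]

# Two characters annotated as third tone + neutral tone:
print(tone_ids(["third", "neutral"]))  # [3, 0]
```

The resulting ID sequence would then index into a learned tone-embedding table, in the same way token IDs index a word-embedding table.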
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: word embedding (Word Embedding) information, sentence embedding (Sentence Embedding) information, position embedding (Position Embedding) information, and task embedding (Task Embedding) information. Preferably, the predetermined Embedding information includes all five: Tone Embedding, Word Embedding, Sentence Embedding, Position Embedding and Task Embedding information. Generally speaking, the richer the Embedding information, the better the performance of the trained model.
For each piece of training data, the tone of each character may be produced by manual labeling or automatic labeling; the specific manner is not limited. How to obtain each kind of Embedding information is known in the art. Each kind of Embedding information is a vector, and all of them have the same dimension.
For each piece of training data, the Embedding information of each character is added with weights, and the weighted sum corresponding to each character is used as the input of the language model to train it. For example, if the predetermined Embedding information includes all of the Tone, Word, Sentence, Position and Task Embedding information, then for each character in the training data, these five kinds of Embedding information are each multiplied by their corresponding weights and the products are added, yielding the weighted sum for that character. The weights corresponding to different kinds of Embedding information may be the same or different, depending on actual needs. Weighted addition fuses the different kinds of Embedding information, enriching the model input and improving the training effect and model performance.
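The weighted addition above can be sketched in pure Python with toy dimensions. The function name, argument shapes, and the default weight of 1.0 are assumptions made for the illustration, not details from the patent:

```python
def fuse_embeddings(embeddings, weights=None):
    """Weighted sum of one character's embedding vectors (all same dimension).

    embeddings: dict mapping a kind (e.g. "tone", "word", "sentence",
                "position", "task") to that character's vector.
    weights:    dict mapping a kind to its scalar weight; kinds not
                listed default to 1.0, so plain addition is the special case.
    """
    weights = weights or {}
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for kind, vec in embeddings.items():
        w = weights.get(kind, 1.0)
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused

# Toy 2-dimensional example with the tone embedding down-weighted to 0.5:
print(fuse_embeddings({"tone": [1.0, 0.0], "word": [0.0, 2.0]},
                      {"tone": 0.5}))  # [0.5, 2.0]
```

In a real model the vectors would be rows of learned embedding tables and the sum would be computed as a tensor operation, but the fusion step is the same element-wise weighted addition.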
When a language model is trained by using training data, a Word-level Pre-training Task (Word-aware Pre-training Task), a Structure-level Pre-training Task (Structure-aware Pre-training Task), and a Semantic-level Pre-training Task (Semantic-aware Pre-training Task) can be used as training tasks to train the language model.
Each pre-training task of a level may include one or more subtasks. For example, the Word-aware Pre-training Task may include subtasks such as phrase masking (Knowledge Masking), Capitalization Prediction, and predicting whether a token appears elsewhere in the document (Token-Document Relation); the Structure-aware Pre-training Task may include subtasks such as sentence reordering classification (Sentence Reordering) and sentence distance classification (Sentence Distance); and the Semantic-aware Pre-training Task may include subtasks such as discourse relation classification (Discourse Relation) and information-retrieval relevance (IR Relevance).
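As a rough illustration of the Knowledge Masking subtask, which masks whole phrases rather than single characters, one might write the following. The span representation and function name are invented for this sketch, and real implementations select spans randomly rather than taking them as input:

```python
def mask_spans(tokens, spans, mask_token="[MASK]"):
    """Replace whole phrase spans with mask tokens and return the
    masked sequence plus the labels the model must reconstruct.

    spans: list of (start, end) half-open index pairs marking phrases.
    """
    masked = list(tokens)
    labels = {}
    for start, end in spans:
        for i in range(start, end):
            labels[i] = masked[i]
            masked[i] = mask_token
    return masked, labels

# Mask the two-token phrase at positions 1..2 of a four-token sequence.
masked, labels = mask_spans(["ha", "er", "bin", "shi"], [(1, 3)])
print(masked)  # ['ha', '[MASK]', '[MASK]', 'shi']
print(labels)  # {1: 'er', 2: 'bin'}
```

Training then asks the model to predict the label tokens at the masked positions from the surrounding context.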
The training of the language model described in this embodiment is unsupervised training, and each task/subtask is related to learning/computing semantics, so that the model has/learns semantic comprehension capability.
When training the language model with the training data, the language model may also be trained in a continual learning manner, warm-started from a predetermined model.
The predetermined model may be an ERNIE 2.0 model. If the language model of this embodiment were trained from scratch, it would consume considerable machine resources and time; it can instead be warm-started from an ERNIE 2.0 model, i.e., an existing ERNIE 2.0 model is taken as the basis of the language model of this embodiment and training continues from there, reducing the model training cost by saving machine resources and time. In addition, through continual learning, the model can learn faster and better.
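The warm start can be pictured as copying every pretrained parameter whose name and shape match, while newly introduced parameters (such as a tone-embedding table) keep their fresh initialization. This is a simplified sketch with lists standing in for tensors, not the actual ERNIE 2.0 loading code:

```python
def warm_start(model_params, pretrained_params):
    """Copy matching pretrained parameters into the new model in place.

    Both arguments map parameter names to vectors (lists here for
    simplicity); a parameter is copied only if the name exists in the
    new model and the shapes agree. Returns the copied and skipped names.
    """
    loaded, skipped = [], []
    for name, value in pretrained_params.items():
        if name in model_params and len(model_params[name]) == len(value):
            model_params[name] = list(value)
            loaded.append(name)
        else:
            skipped.append(name)
    return loaded, skipped

# The new model adds "tone_emb", which has no pretrained counterpart,
# so it keeps its fresh (here zero) initialization.
params = {"word_emb": [0.0, 0.0], "tone_emb": [0.0, 0.0]}
loaded, skipped = warm_start(params, {"word_emb": [0.3, 0.7]})
print(loaded, skipped)        # ['word_emb'] []
print(params["tone_emb"])     # [0.0, 0.0]
```

Deep-learning frameworks expose the same idea through partial state-dict loading; the point is that only the overlapping parameters are inherited from ERNIE 2.0.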
Based on the above description, fig. 2 is a schematic diagram of a training process of the language model described in the present application, and please refer to the related description.
FIG. 3 is a flowchart of an embodiment of a Chinese semantic understanding method according to the present application. As shown in fig. 3, the following detailed implementation is included.
In 301, for a Chinese text to be processed, the predetermined Embedding information of each character is respectively acquired, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information.
In 302, semantic representation information of the Chinese text to be processed is obtained according to the predetermined Embedding information and the pre-trained language model.
Preferably, the tones may include the neutral tone and the first, second, third and fourth tones, which can be represented by the five IDs 0, 1, 2, 3 and 4 respectively.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five: Tone, Word, Sentence, Position and Task Embedding information.
For the Chinese text to be processed, the Embedding information of each character in it is added with weights, and the weighted sum corresponding to each character is used as the input of the language model to obtain the output semantic representation information.
For example, if the predetermined Embedding information includes all of the Tone, Word, Sentence, Position and Task Embedding information, then for each character in the Chinese text to be processed, these five kinds of Embedding information are multiplied by their corresponding weights and added, yielding the weighted sum for that character. The weights corresponding to different kinds of Embedding information may be the same or different. The weighted sum corresponding to each character serves as the input of the language model, which outputs the semantic representation information. The specific form of the semantic representation information is not limited; it may, for example, be a matrix.
It can be seen that in the method of this embodiment, the tone information can be introduced into the training process of the language model and the chinese semantic understanding process, so that the model has the capability of judging the semantic information of the text in different contexts, ambiguity is reduced by the tone information, and the accuracy of the semantic understanding result is further improved.
It is noted that, while the foregoing method embodiments are described for simplicity as a series of acts, those skilled in the art will appreciate that the present application is not limited by the order of acts described, since according to the present application some steps may occur in other orders or concurrently. Further, the embodiments described in the specification are preferred embodiments, and the acts and modules involved are not necessarily all required by this application. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.
Fig. 4 is a schematic structural diagram of a language model obtaining apparatus 40 according to an embodiment of the present application. As shown in fig. 4, includes: a data acquisition module 401 and a model training module 402.
A data obtaining module 401, configured to obtain a chinese text as training data.
The model training module 402 is configured to acquire, for any piece of training data, the predetermined Embedding information of each character in it, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information, and to train a language model with the training data according to the predetermined Embedding information; the language model is configured to generate semantic representation information of a to-be-processed Chinese text according to the predetermined Embedding information of each character in that text.
Preferably, the tones may include the neutral tone and the first, second, third and fourth tones.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five: Tone, Word, Sentence, Position and Task Embedding information.
The model training module 402 may add the Embedding information of each character in any training data in a weighted manner, and use the weighted addition result corresponding to each character as the input of the language model to train the language model.
For example, if the predetermined Embedding information includes all of the Tone, Word, Sentence, Position and Task Embedding information, then for each character in the training data, these five kinds of Embedding information are multiplied by their corresponding weights and added, yielding the weighted sum for that character. The weights corresponding to different kinds of Embedding information may be the same or different.
In training a language model using training data, model training module 402 may train the language model using Word-aware Pre-training Task, Structure-aware Pre-training Task, and Semantic-aware Pre-training Task as training tasks.
When training the language model with the training data, the model training module 402 may also train it in a continual learning manner, warm-started from a predetermined model. The predetermined model may be an ERNIE 2.0 model.
FIG. 5 is a schematic diagram illustrating a structure of an embodiment of a Chinese semantic understanding apparatus 50 according to the present application. As shown in fig. 5, includes: a preprocessing module 501 and a semantic acquisition module 502.
The preprocessing module 501 is configured to acquire, for a Chinese text to be processed, the predetermined Embedding information of each character in it, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information.
The semantic acquisition module 502 is configured to obtain semantic representation information of the Chinese text to be processed according to the predetermined Embedding information and the pre-trained language model.
Preferably, the tones may include the neutral tone and the first, second, third and fourth tones.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five: Tone, Word, Sentence, Position and Task Embedding information.
The semantic obtaining module 502 may add the Embedding information of each character in the to-be-processed chinese text in a weighted manner, and obtain output semantic representation information by using the weighted addition result corresponding to each character as the input of the language model.
For a specific work flow of the device embodiments shown in fig. 4 and fig. 5, reference is made to the related description in the foregoing method embodiments, and details are not repeated.
In summary, by adopting the solutions of the above apparatus embodiments, tone information can be introduced into the training process of the language model and into the Chinese semantic understanding process, so that the model gains the ability to judge the semantic information of a text in different contexts; ambiguity is reduced through the tone information, improving the accuracy of semantic understanding results. Weighted addition fuses the different kinds of Embedding information, enriching the model input and improving the training effect and model performance. Warm-starting from a predetermined model reduces the model training cost, and continual learning lets the model learn faster and better.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device according to the method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor Y01 is taken as an example.
Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.
The memory Y02, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods of the embodiments of the present application. The processor Y01 executes the various functional applications and data processing of the server, i.e., implements the methods in the above method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or in another manner, and the connection by the bus is exemplified in fig. 6.
The input device Y03 may receive input of numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light-emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, verbal, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; as long as the desired results of the technical solutions disclosed in the present application can be achieved, no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (22)
1. A language model acquisition method, comprising:
acquiring a Chinese text serving as training data;
for any piece of training data, respectively acquiring predetermined embedded information of each character in the training data, wherein the predetermined embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and training a language model with the training data according to the predetermined embedded information, wherein the language model is used for generating semantic representation information of a Chinese text to be processed according to the predetermined embedded information of each character in the Chinese text to be processed.
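The embedding scheme of claim 1 can be sketched as a per-character lookup that combines several embedding tables, one of which is indexed by tone. Everything below (vocabulary contents, embedding size, the choice of plain addition over weighted addition) is an illustrative assumption, not the patent's implementation:

```python
import numpy as np

# Hypothetical vocabularies; the contents and sizes are illustrative only.
CHAR_VOCAB = {"中": 0, "文": 1, "好": 2}
# Tone IDs per claim 2: the neutral tone plus the four Mandarin tones.
TONE_VOCAB = {"neutral": 0, "tone1": 1, "tone2": 2, "tone3": 3, "tone4": 4}
EMB_DIM = 8  # assumed embedding size

rng = np.random.default_rng(0)
char_emb = rng.normal(size=(len(CHAR_VOCAB), EMB_DIM))
tone_emb = rng.normal(size=(len(TONE_VOCAB), EMB_DIM))
pos_emb = rng.normal(size=(16, EMB_DIM))  # assumed max sequence length 16

def embed(chars, tones):
    """Return one vector per character, combining the predetermined
    embeddings (here: character, tone, and position embeddings)."""
    vecs = []
    for i, (c, t) in enumerate(zip(chars, tones)):
        vecs.append(char_emb[CHAR_VOCAB[c]] + tone_emb[TONE_VOCAB[t]] + pos_emb[i])
    return np.stack(vecs)

x = embed(["中", "文"], ["tone1", "tone2"])
print(x.shape)  # (2, 8): one EMB_DIM vector per input character
```

The key point of the claim is only that tone embedding participates alongside at least one other kind of embedding; the exact combination rule is refined in claim 4.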
2. The method of claim 1, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
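The tone categories enumerated in claim 2 can be recovered from tone-numbered pinyin syllables. The `tone_of` helper and the numbered-pinyin input format below are assumptions for illustration; the patent does not specify how tones are obtained:

```python
def tone_of(pinyin_numbered):
    """Map a tone-numbered pinyin syllable (e.g. 'zhong1') to a tone ID:
    0 = neutral tone, 1-4 = the four Mandarin tones (per claim 2)."""
    return int(pinyin_numbered[-1]) if pinyin_numbered[-1].isdigit() else 0

# 'de' carries no tone number, so it maps to the neutral tone (0).
tone_ids = [tone_of(s) for s in ["zhong1", "wen2", "ma3", "shi4", "de"]]
print(tone_ids)  # [1, 2, 3, 4, 0]
```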
3. The method of claim 1, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
4. The method of claim 1, wherein the training a language model with the training data according to the predetermined embedded information comprises: for any piece of training data, performing weighted addition on the embedded information of each character, and taking the weighted addition result corresponding to each character as the input of the language model to train the language model.
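The weighted addition of claim 4 can be sketched as a weighted sum across embedding streams, yielding one input vector per character. The weight values and dimensions below are illustrative assumptions (in practice such weights would typically be learned):

```python
import numpy as np

rng = np.random.default_rng(1)
# Three embedding streams for one 5-character text (dimensions assumed).
char_e, tone_e, pos_e = (rng.normal(size=(5, 8)) for _ in range(3))

# One scalar weight per embedding type (illustrative initialization).
w = np.array([1.0, 0.5, 0.5])

# Weighted addition: each character's model input is the weighted sum of
# that character's vectors across the embedding streams.
model_input = w[0] * char_e + w[1] * tone_e + w[2] * pos_e
print(model_input.shape)  # (5, 8): one combined vector per character
```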
5. The method of claim 1, wherein the training a language model using the training data comprises: and training the language model by taking a word level pre-training task, a structure level pre-training task and a semantic level pre-training task as training tasks.
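Joint training on the three task levels of claim 5 can be sketched as a combined objective: one loss term per task level, summed into a single pretraining loss. The individual loss functions here are illustrative stand-ins, not the patent's actual pretraining tasks:

```python
def word_level_loss(p):
    """Stand-in for a word-level task (e.g. masked-character prediction)."""
    return (p - 1.0) ** 2

def structure_level_loss(p):
    """Stand-in for a structure-level task (e.g. sentence-order prediction)."""
    return (p - 2.0) ** 2

def semantic_level_loss(p):
    """Stand-in for a semantic-level task (e.g. sentence-pair relevance)."""
    return (p - 3.0) ** 2

def pretraining_loss(p):
    # The model is optimized jointly against all three task levels.
    return word_level_loss(p) + structure_level_loss(p) + semantic_level_loss(p)

total = pretraining_loss(2.0)
print(total)  # 1.0 + 0.0 + 1.0 = 2.0
```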
6. The method of claim 1, wherein the training a language model using the training data comprises: training the language model in a continuous learning manner based on a predetermined model warm start.
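Claim 6's warm start can be illustrated with a toy training loop: the new training run starts from a previously trained model's parameters rather than from a fresh initialization, so earlier learning is carried forward. The one-parameter model and loss are purely illustrative:

```python
def train(params, data, lr=0.1, steps=50):
    """Toy gradient-descent loop fitting y = w * x (stand-in for pretraining)."""
    w = params["w"]
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    params["w"] = w
    return params

# First pretraining run from scratch on an earlier task...
pretrained = train({"w": 0.0}, [(1.0, 2.0)])
# ...then a warm start: continue learning on a new task from the
# previously trained parameters instead of a random initialization.
continued = train(dict(pretrained), [(1.0, 2.1)])
```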
7. A Chinese semantic understanding method, comprising:
respectively acquiring predetermined embedded information of each character in a Chinese text to be processed, wherein the predetermined embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and obtaining semantic representation information of the Chinese text to be processed according to the predetermined embedded information and a pre-trained language model.
8. The method of claim 7, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
9. The method of claim 7, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
10. The method according to claim 7, wherein the obtaining semantic representation information of the chinese text to be processed according to the predetermined embedding information and a pre-trained language model comprises:
and for the Chinese text to be processed, performing weighted addition on the embedded information of each character therein, and taking the weighted addition result corresponding to each character as the input of the language model to obtain the output semantic representation information.
11. A language model acquisition apparatus, comprising: a data acquisition module and a model training module;
the data acquisition module is used for acquiring a Chinese text serving as training data;
the model training module is used for: for any piece of training data, respectively acquiring predetermined embedded information of each character in the training data, the predetermined embedded information comprising at least two kinds of embedded information, one of which is tone embedded information; and training a language model with the training data according to the predetermined embedded information, the language model being used for generating semantic representation information of a Chinese text to be processed according to the predetermined embedded information of each character in the Chinese text to be processed.
12. The apparatus of claim 11, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
13. The apparatus of claim 11, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
14. The apparatus according to claim 11, wherein the model training module is configured to perform weighted addition of the embedded information of each character in any training data, and to use the result of the weighted addition corresponding to each character as the input of the language model to train the language model.
15. The apparatus of claim 11, wherein the model training module trains the language model with a word-level pre-training task, a structure-level pre-training task, and a semantic-level pre-training task as training tasks.
16. The apparatus of claim 11, wherein the model training module trains the language model in a continuous learning manner based on a predetermined model warm start.
17. A Chinese semantic understanding apparatus, comprising: a preprocessing module and a semantic acquisition module;
the preprocessing module is used for respectively acquiring predetermined embedded information of each character in a Chinese text to be processed, wherein the predetermined embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and the semantic acquisition module is used for acquiring semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
18. The apparatus of claim 17, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
19. The apparatus of claim 17, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
20. The apparatus according to claim 17, wherein the semantic acquisition module performs weighted addition on the embedded information of each character in the Chinese text to be processed, and obtains the output semantic representation information by taking the weighted addition result corresponding to each character as the input of the language model.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010552815.6A CN111859981B (en) | 2020-06-17 | 2020-06-17 | Language model acquisition and Chinese semantic understanding method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859981A true CN111859981A (en) | 2020-10-30 |
CN111859981B CN111859981B (en) | 2024-03-26 |
Family
ID=72986672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010552815.6A Active CN111859981B (en) | 2020-06-17 | 2020-06-17 | Language model acquisition and Chinese semantic understanding method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111859981B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107577662A (en) * | 2017-08-08 | 2018-01-12 | 上海交通大学 | Towards the semantic understanding system and method for Chinese text |
GB201904719D0 (en) * | 2019-04-03 | 2019-05-15 | Mashtraxx Ltd | Method of training a neural network to reflect emotional perception and related system and method for categorizing and finding associated content |
KR20190085882A (en) * | 2018-01-11 | 2019-07-19 | 네오사피엔스 주식회사 | Method and computer readable storage medium for performing text-to-speech synthesis using machine learning |
CN111078887A (en) * | 2019-12-20 | 2020-04-28 | 厦门市美亚柏科信息股份有限公司 | Text classification method and device |
Non-Patent Citations (3)
Title |
---|
HUAKANG LI ET AL: "Sentiment Analysis based on Bi-LSTM using Tone", 《2019 15TH INTERNATIONAL CONFERENCE ON SEMANTICS, KNOWLEDGE AND GRIDS (SKG)》 * |
ZHAO, LI; CUI, DUWU: "Text watermarking algorithm based on the pinyin tones of Chinese characters", Computer Engineering (计算机工程), no. 10 *
DENG, LI; LIANG, XIANGDONG: "Fast embedded Mandarin speech recognition based on DFT", Research and Exploration in Laboratory (实验室研究与探索), no. 06 *
Also Published As
Publication number | Publication date |
---|---|
CN111859981B (en) | 2024-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7317791B2 (en) | Entity linking method, device, apparatus and storage medium | |
CN111859994B (en) | Machine translation model acquisition and text translation method, device and storage medium | |
CN111125335B (en) | Question and answer processing method and device, electronic equipment and storage medium | |
US11403468B2 (en) | Method and apparatus for generating vector representation of text, and related computer device | |
CN111061868B (en) | Reading method prediction model acquisition and reading method prediction method, device and storage medium | |
JP7179123B2 (en) | Language model training method, device, electronic device and readable storage medium | |
JP7149993B2 (en) | Pre-training method, device and electronic device for sentiment analysis model | |
JP2021108115A (en) | Method and device for training machine reading comprehension model, electronic apparatus, and storage medium | |
CN110674260B (en) | Training method and device of semantic similarity model, electronic equipment and storage medium | |
JP2021111420A (en) | Method and apparatus for processing semantic description of text entity, and device | |
CN112507735A (en) | Training method and device of machine translation model and electronic equipment | |
CN112507101A (en) | Method and device for establishing pre-training language model | |
KR102630243B1 (en) | method and device for predicting punctuation | |
JP2021131858A (en) | Entity word recognition method and apparatus | |
CN111680517A (en) | Method, apparatus, device and storage medium for training a model | |
JP2021108098A (en) | Review information processing method, device, computer apparatus, and medium | |
JP2022008207A (en) | Method for generating triple sample, device, electronic device, and storage medium | |
CN110807331A (en) | Polyphone pronunciation prediction method and device and electronic equipment | |
CN112506949B (en) | Method, device and storage medium for generating structured query language query statement | |
JP7198800B2 (en) | Intention Recognition Optimization Processing Method, Apparatus, Equipment and Storage Medium | |
CN112269862B (en) | Text role labeling method, device, electronic equipment and storage medium | |
CN111079945A (en) | End-to-end model training method and device | |
CN111858880A (en) | Method and device for obtaining query result, electronic equipment and readable storage medium | |
CN113360751A (en) | Intention recognition method, apparatus, device and medium | |
CN112270169B (en) | Method and device for predicting dialogue roles, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||