CN111859981A - Language model acquisition and Chinese semantic understanding method, device and storage medium - Google Patents

Language model acquisition and Chinese semantic understanding method, device and storage medium

Info

Publication number
CN111859981A
Authority
CN
China
Prior art keywords
information
training
language model
embedded information
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010552815.6A
Other languages
Chinese (zh)
Other versions
CN111859981B (en)
Inventor
丁思宇 (Ding Siyu)
王硕寰 (Wang Shuohuan)
孙宇 (Sun Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010552815.6A priority Critical patent/CN111859981B/en
Publication of CN111859981A publication Critical patent/CN111859981A/en
Application granted granted Critical
Publication of CN111859981B publication Critical patent/CN111859981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a language model acquisition and Chinese semantic understanding method, apparatus and storage medium, relating to the fields of natural language processing and deep learning. The method may comprise the following steps: acquiring Chinese texts as training data; for any piece of training data, respectively acquiring preset embedded information of each character therein, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information; and training a language model with the training data according to the preset embedded information, the language model being used to generate semantic representation information of a Chinese text to be processed according to the preset embedded information of each character in that text. By applying this scheme, the accuracy of semantic understanding results can be improved.

Description

Language model acquisition and Chinese semantic understanding method, device and storage medium
Technical Field
The present application relates to computer application technologies, and in particular, to a method and an apparatus for language model acquisition and Chinese semantic understanding in the fields of natural language processing and deep learning, and a storage medium.
Background
With the introduction of large general-purpose pre-trained language models such as the Knowledge-Enhanced Semantic Representation model (ERNIE) and Bidirectional Encoder Representations from Transformers (BERT), the Chinese semantic understanding task has taken a qualitative leap forward. The Transformer is the common basic structure of such models, and the self-attention mechanism it adopts enables the models to better understand the semantic information of a text by capturing the text's context information.
However, Chinese contains many ambiguous characters whose ambiguity is difficult to resolve from context information alone, so semantic understanding results are not accurate enough.
Disclosure of Invention
The application provides a language model acquisition and Chinese semantic understanding method, apparatus and storage medium.
A language model acquisition method, comprising:
acquiring a Chinese text serving as training data;
for any piece of training data, respectively acquiring preset embedded information of each character therein, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and training a language model by using the training data according to the preset embedding information, wherein the language model is used for generating semantic representation information of the Chinese text to be processed according to the preset embedding information of each character in the Chinese text to be processed.
A Chinese semantic understanding method comprises the following steps:
respectively acquiring preset embedded information of each character in a Chinese text to be processed, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and obtaining semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
A language model acquisition apparatus comprising: the system comprises a data acquisition module and a model training module;
the data acquisition module is used for acquiring a Chinese text serving as training data;
the model training module is used for respectively acquiring, for any piece of training data, preset embedded information of each character therein, the preset embedded information comprising at least two kinds of embedded information, one of which is tone embedded information, and for training a language model with the training data according to the preset embedded information, wherein the language model is used for generating semantic representation information of a Chinese text to be processed according to the preset embedded information of each character in the Chinese text to be processed.
A Chinese semantic understanding apparatus, comprising: the system comprises a preprocessing module and a semantic acquisition module;
the preprocessing module is used for respectively acquiring preset embedded information of each character in a Chinese text to be processed, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and the semantic acquisition module is used for acquiring semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
One embodiment of the above application has the following advantages or benefits: tone information is introduced into the training process of the language model and into the Chinese semantic understanding process, so that the model has the capability of judging the semantic information of a text in different contexts; ambiguity is reduced through the tone information, and the accuracy of semantic understanding results is improved. It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of an embodiment of a language model acquisition method according to the present application;
FIG. 2 is a schematic diagram illustrating a training process of a language model according to the present application;
FIG. 3 is a flowchart of an embodiment of a Chinese semantic understanding method according to the present application;
FIG. 4 is a schematic diagram illustrating a structure of an embodiment of a language model obtaining apparatus 40 according to the present application;
FIG. 5 is a schematic diagram illustrating a structure of an embodiment of a Chinese semantic understanding apparatus 50 according to the present application;
FIG. 6 is a block diagram of an electronic device according to the method of an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
In Chinese, different tones may represent different semantics. For example, for an interjection, different tones may convey semantics such as approval (confirmation) or questioning. Introducing tones into the training of the language model and into the Chinese semantic understanding process can therefore greatly reduce ambiguity.
Fig. 1 is a flowchart of an embodiment of a language model obtaining method according to the present application. As shown in fig. 1, the following detailed implementation is included.
In 101, a chinese text is acquired as training data.
In 102, for any piece of training data, predetermined embedding (Embedding) information of each character therein is respectively obtained, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is tone embedding (Tone Embedding) information.
In 103, a language model is trained by using the training data according to the predetermined Embedding information, and the language model is used for generating semantic representation information of the Chinese text to be processed according to the predetermined Embedding information of each character in the Chinese text to be processed.
In this embodiment, the tone information may be introduced into a training process of the language model, so that the model has the capability of determining semantic information of the text in different contexts.
For each piece of training data among the Chinese texts used as training data, the predetermined Embedding information of each character therein may be acquired, and the predetermined Embedding information may include: Tone Embedding information, i.e., information that models the different tones. Preferably, the tones may include: the neutral tone and the first, second, third and fourth tones, which may be represented by the five IDs 0, 1, 2, 3 and 4, respectively. Each Chinese text may comprise one sentence or several sentences.
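To make the tone-ID scheme concrete, the following is a minimal sketch of a tone-embedding table in PyTorch. It is an illustration only, not the patent's implementation; the names TONE_IDS and tone_emb and the hidden size of 768 are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical tone-ID scheme following the description above: the
# neutral tone plus the four Mandarin tones, mapped to the IDs 0-4.
TONE_IDS = {"neutral": 0, "tone1": 1, "tone2": 2, "tone3": 3, "tone4": 4}

# A tone-embedding table; 768 is an assumed hidden size, chosen only so
# that all embedding kinds share one dimension and can later be fused.
HIDDEN_SIZE = 768
tone_emb = nn.Embedding(num_embeddings=len(TONE_IDS), embedding_dim=HIDDEN_SIZE)

# Tone IDs for a three-character input, shape (batch=1, seq_len=3).
tone_ids = torch.tensor([[1, 4, 0]])
tone_vectors = tone_emb(tone_ids)  # shape (1, 3, 768)
```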
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five kinds at the same time: Tone Embedding, Word Embedding, Sentence Embedding, Position Embedding and Task Embedding information. Generally speaking, the richer the Embedding information contained in the predetermined Embedding information, the better the performance of the trained model.
For each piece of training data, the tone of each character therein can be generated by manual labeling or automatic labeling; the specific manner is not limited. How to obtain the aforementioned Embedding information is known in the prior art. Each kind of Embedding information is in vector form, and all of them have the same dimension.
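The patent does not prescribe how automatic labeling is done; one plausible route, sketched below purely as an assumption, derives tone IDs from pinyin using the open-source pypinyin package (its Style.TONE3 appends the tone digit to each syllable).

```python
from pypinyin import Style, pinyin

def tone_ids_for(text: str) -> list[int]:
    """Map each character to a tone ID in 0-4 (0 = neutral tone).

    Illustrative assumption only: the patent does not prescribe
    pypinyin or this exact mapping.
    """
    ids = []
    for syllable in pinyin(text, style=Style.TONE3):
        s = syllable[0]
        # Style.TONE3 appends the tone digit (1-4); the neutral tone has none.
        ids.append(int(s[-1]) if s[-1].isdigit() else 0)
    return ids

print(tone_ids_for("中文"))  # [1, 2]
```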
For each piece of training data, the Embedding information of each character is added in a weighted manner, and the weighted-addition result corresponding to each character is used as the input of the language model to train the language model. For example, if the predetermined Embedding information includes all of the Tone Embedding, Word Embedding, Sentence Embedding, Position Embedding and Task Embedding information, then for each character in the training data, the five kinds of Embedding information of that character may be weighted and added to obtain the weighted-addition result corresponding to the character. Weighted addition means that each kind of Embedding information is multiplied by its corresponding weight and the products are then added. The weights corresponding to different kinds of Embedding information may be the same or different, depending on actual needs. Through weighted addition, different Embedding information can be fused, which enriches the input of the model and improves the model training effect and model performance.
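A minimal sketch of the weighted addition just described, assuming five per-character embedding tensors of identical shape; the equal weights are placeholders, since the patent leaves the weight values to actual needs.

```python
import torch

def fuse_embeddings(embeddings: dict[str, torch.Tensor],
                    weights: dict[str, float]) -> torch.Tensor:
    """Weighted addition of per-character embedding tensors.

    All tensors share the shape (batch, seq_len, hidden); the result is
    the per-character model input described above.
    """
    return sum(weights[name] * emb for name, emb in embeddings.items())

# Hypothetical example with equal weights; the actual weight values are
# a design choice the patent leaves open.
B, L, H = 2, 16, 768
names = ["tone", "word", "sentence", "position", "task"]
embeddings = {n: torch.randn(B, L, H) for n in names}
weights = {n: 1.0 for n in names}
model_input = fuse_embeddings(embeddings, weights)  # shape (2, 16, 768)
```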
When a language model is trained by using training data, a Word-level Pre-training Task (Word-aware Pre-training Task), a Structure-level Pre-training Task (Structure-aware Pre-training Task), and a Semantic-level Pre-training Task (Semantic-aware Pre-training Task) can be used as training tasks to train the language model.
Each level of pre-training task may include one or more sub-tasks. For example, the Word-aware Pre-training Task may include sub-tasks such as phrase masking (Knowledge Masking), Capitalization Prediction, and whether a word appears elsewhere in the document (Token-Document Relation); the Structure-aware Pre-training Task may include sub-tasks such as sentence ordering classification (Sentence Reordering) and sentence distance classification (Sentence Distance); and the Semantic-aware Pre-training Task may include sub-tasks such as the semantic relation between sentences (Discourse Relation) and retrieval relevance (IR Relevance).
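For illustration, a common way to combine such sub-tasks, assumed here rather than taken from the patent, is to sum their individual losses into one pre-training objective:

```python
import torch

# Assumed combination scheme, not taken from the patent: each sub-task
# produces its own loss, and the pre-training objective is their sum.
def total_pretraining_loss(subtask_losses: dict[str, torch.Tensor]) -> torch.Tensor:
    return sum(subtask_losses.values())

subtask_losses = {
    "knowledge_masking": torch.tensor(2.31),    # word-level sub-task
    "sentence_reordering": torch.tensor(0.87),  # structure-level sub-task
    "ir_relevance": torch.tensor(0.45),         # semantic-level sub-task
}
print(total_pretraining_loss(subtask_losses))  # tensor(3.6300)
```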
The training of the language model described in this embodiment is unsupervised training, and each task/sub-task is related to learning or computing semantics, so that the model acquires semantic understanding capability.
When training the language model with the training data, the language model may also be trained in a continual learning manner based on a warm start from a predetermined model.
The predetermined model may be an ERNIE 2.0 model. Training the language model of this embodiment from scratch would consume large machine resources and much time, so it can be warm-started from the ERNIE 2.0 model, i.e., an existing ERNIE 2.0 model is selected as the basis of the language model of this embodiment and training continues on that basis, thereby reducing the model training cost by saving machine resources and time. In addition, by means of continual learning, the model can learn faster and better.
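A hedged sketch of such a warm start in PyTorch: copy the weights shared with a pretrained checkpoint and leave newly added parameters (such as a tone-embedding table) randomly initialized. The checkpoint path and state-dict format are assumptions; a real ERNIE 2.0 checkpoint would be loaded with its own tooling.

```python
import torch

def warm_start(model: torch.nn.Module, checkpoint_path: str) -> None:
    """Initialize shared weights from an existing checkpoint.

    The path and state-dict format are assumptions; a real ERNIE 2.0
    checkpoint would be converted/loaded with its own tooling.
    """
    state = torch.load(checkpoint_path, map_location="cpu")
    # strict=False: parameters that are new in this model (such as the
    # tone-embedding table) stay randomly initialized, while all
    # parameters shared with the checkpoint are copied over.
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"left randomly initialized: {missing}; ignored: {unexpected}")
```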
Based on the above description, fig. 2 is a schematic diagram of the training process of the language model described in the present application; for details, please refer to the foregoing related description.
FIG. 3 is a flowchart of an embodiment of a Chinese semantic understanding method according to the present application. As shown in fig. 3, the following detailed implementation is included.
In 301, for a Chinese text to be processed, predetermined Embedding information of each character therein is respectively obtained, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information.
In 302, semantic representation information of the to-be-processed chinese text is obtained according to the predetermined Embedding information and the pre-trained language model.
Preferably, the tones may include: the neutral tone and the first, second, third and fourth tones, which may be represented by the five IDs 0, 1, 2, 3 and 4, respectively.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five kinds at the same time.
For the Chinese text to be processed, the Embedding information of each character therein is respectively weighted and added, and the weighted-addition result corresponding to each character is used as the input of the language model to obtain the output semantic representation information.
For example, if the predetermined Embedding information includes all of the Tone Embedding, Word Embedding, Sentence Embedding, Position Embedding and Task Embedding information, then for each character in the Chinese text to be processed, the five kinds of Embedding information of that character can be weighted and added to obtain the weighted-addition result corresponding to the character. The weights corresponding to different kinds of Embedding information may be the same or different. The weighted-addition result corresponding to each character can then be used as the input of the language model to obtain the output semantic representation information. The specific form of the semantic representation information is not limited; it may be, for example, a matrix.
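As a runnable stand-in for this inference flow, the sketch below feeds a fused per-character tensor through a single Transformer encoder layer; the real language model is a full pre-trained network, so everything here is illustrative.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the trained language model: one Transformer
# encoder layer. The real model is a full pre-trained network; this
# only makes the inference flow above runnable.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=1,
)

fused = torch.randn(1, 8, 768)   # stands for the weighted-addition result
semantic_repr = encoder(fused)   # semantic representation, shape (1, 8, 768)
```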
It can be seen that with the method of this embodiment, tone information can be introduced into the training process of the language model and into the Chinese semantic understanding process, so that the model has the capability of judging the semantic information of a text in different contexts; ambiguity is reduced by the tone information, and the accuracy of the semantic understanding result is thereby improved.
It is noted that, while for simplicity of explanation the foregoing method embodiments are described as a series of acts or a combination of acts, those skilled in the art will appreciate that the present application is not limited by the order of the acts described, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules involved are not necessarily required by this application. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.
Fig. 4 is a schematic structural diagram of an embodiment of a language model obtaining apparatus 40 according to the present application. As shown in fig. 4, the apparatus includes: a data acquisition module 401 and a model training module 402.
A data obtaining module 401, configured to obtain a chinese text as training data.
The model training module 402 is configured to acquire, for any piece of training data, predetermined Embedding information of each character therein, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information, and to train a language model with the training data according to the predetermined Embedding information, where the language model is configured to generate semantic representation information of the Chinese text to be processed according to the predetermined Embedding information of each character in that text.
Preferably, the tones may include: the neutral tone and the first, second, third and fourth tones.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five kinds at the same time.
The model training module 402 may add the Embedding information of each character in any training data in a weighted manner, and use the weighted addition result corresponding to each character as the input of the language model to train the language model.
For example, if the predetermined Embedding information includes all of the Tone Embedding, Word Embedding, Sentence Embedding, Position Embedding and Task Embedding information, then for each character in the training data, the five kinds of Embedding information of that character may be weighted and added to obtain the weighted-addition result corresponding to the character. The weights corresponding to different kinds of Embedding information may be the same or different.
In training a language model using training data, model training module 402 may train the language model using Word-aware Pre-training Task, Structure-aware Pre-training Task, and Semantic-aware Pre-training Task as training tasks.
When training the language model with the training data, the model training module 402 may also train the language model in a continual learning manner based on a warm start from a predetermined model. The predetermined model may be an ERNIE 2.0 model.
Fig. 5 is a schematic structural diagram of an embodiment of a Chinese semantic understanding apparatus 50 according to the present application. As shown in fig. 5, the apparatus includes: a preprocessing module 501 and a semantic acquisition module 502.
The preprocessing module 501 is configured to acquire, for a Chinese text to be processed, predetermined Embedding information of each character therein, where the predetermined Embedding information includes at least two kinds of Embedding information, one of which is Tone Embedding information.
The semantic acquisition module 502 is configured to obtain semantic representation information of the Chinese text to be processed according to the predetermined Embedding information and the pre-trained language model.
Preferably, the tones may include: the neutral tone and the first, second, third and fourth tones.
Besides the Tone Embedding information, the predetermined Embedding information may further include one or any combination of the following: Word Embedding information, Sentence Embedding information, Position Embedding information, and Task Embedding information. Preferably, the predetermined Embedding information includes all five kinds at the same time.
The semantic acquisition module 502 may add the Embedding information of each character in the Chinese text to be processed in a weighted manner, and obtain the output semantic representation information by using the weighted-addition result corresponding to each character as the input of the language model.
For the specific working processes of the apparatus embodiments shown in fig. 4 and fig. 5, reference is made to the related descriptions in the foregoing method embodiments, which are not repeated here.
In a word, with the solutions of the above apparatus embodiments, tone information can be introduced into the training process of the language model and into the Chinese semantic understanding process, so that the model has the capability of judging the semantic information of a text in different contexts; ambiguity is reduced through the tone information, and the accuracy of semantic understanding results is improved. Through weighted addition, different Embedding information can be fused, which enriches the input of the model and improves the model training effect and model performance. Warm-starting from the predetermined model can reduce the model training cost, and the continual learning manner enables the model to learn faster and better.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device according to the method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, as desired, along with multiple memories. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor Y01 is taken as an example.
Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.
As a non-transitory computer readable storage medium, the memory Y02 can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present application. The processor Y01 executes various functional applications of the server and performs data processing, i.e., implements the methods in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory Y02.
The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03 and the output device Y04 may be connected by a bus or in another manner, and the connection by the bus is exemplified in fig. 6.
The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer, one or more mouse buttons, track ball, joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, verbal, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; as long as the desired results of the technical solutions disclosed in the present application can be achieved, no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A language model acquisition method, comprising:
acquiring a Chinese text serving as training data;
for any piece of training data, respectively acquiring preset embedded information of each character therein, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and training a language model by using the training data according to the preset embedding information, wherein the language model is used for generating semantic representation information of the Chinese text to be processed according to the preset embedding information of each character in the Chinese text to be processed.
2. The method of claim 1, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
3. The method of claim 1, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
4. The method of claim 1, wherein the training a language model with the training data according to the predetermined embedded information comprises: and for any training data, carrying out weighted addition on the embedded information of each character, taking the weighted addition result corresponding to each character as the input of the language model, and training the language model.
5. The method of claim 1, wherein the training a language model using the training data comprises: and training the language model by taking a word level pre-training task, a structure level pre-training task and a semantic level pre-training task as training tasks.
6. The method of claim 1, wherein the training a language model using the training data comprises: training the language model in a continuous learning manner based on a predetermined model warm start.
7. A Chinese semantic understanding method comprises the following steps:
respectively acquiring preset embedded information of each character in a Chinese text to be processed, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and obtaining semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
8. The method of claim 7, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
9. The method of claim 7, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
10. The method according to claim 7, wherein the obtaining semantic representation information of the chinese text to be processed according to the predetermined embedding information and a pre-trained language model comprises:
and for the Chinese text to be processed, performing weighted addition on the embedded information of each character in the Chinese text to be processed respectively, and taking a weighted addition result corresponding to each character as the input of the language model to obtain the output semantic representation information.
11. A language model acquisition apparatus comprising: the system comprises a data acquisition module and a model training module;
the data acquisition module is used for acquiring a Chinese text serving as training data;
the model training module is used for respectively acquiring, for any piece of training data, preset embedded information of each character therein, the preset embedded information comprising at least two kinds of embedded information, one of which is tone embedded information, and for training a language model with the training data according to the preset embedded information, wherein the language model is used for generating semantic representation information of a Chinese text to be processed according to the preset embedded information of each character in the Chinese text to be processed.
12. The apparatus of claim 11, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
13. The apparatus of claim 11, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
14. The apparatus according to claim 11, wherein the model training module is configured to perform weighted addition of the embedded information of each character in any training data, and to use the result of the weighted addition corresponding to each character as the input of the language model to train the language model.
15. The apparatus of claim 11, wherein the model training module trains the language model with a word-level pre-training task, a structure-level pre-training task, and a semantic-level pre-training task as training tasks.
16. The apparatus of claim 11, wherein the model training module trains the language model in a continuous learning manner based on a predetermined model warm start.
17. A Chinese semantic understanding apparatus, comprising: the system comprises a preprocessing module and a semantic acquisition module;
the preprocessing module is used for respectively acquiring preset embedded information of each character in a Chinese text to be processed, wherein the preset embedded information comprises at least two kinds of embedded information, one of which is tone embedded information;
and the semantic acquisition module is used for acquiring semantic representation information of the Chinese text to be processed according to the preset embedded information and a language model obtained by pre-training.
18. The apparatus of claim 17, wherein the tones comprise: a neutral tone, a first tone, a second tone, a third tone and a fourth tone.
19. The apparatus of claim 17, wherein the predetermined embedded information further comprises one or any combination of the following: word embedding information, sentence embedding information, position embedding information, task embedding information.
20. The apparatus according to claim 17, wherein the semantic acquiring module adds embedded information of each character in the to-be-processed chinese text in a weighted manner, and obtains the output semantic representation information by taking a weighted addition result corresponding to each character as an input of the language model.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202010552815.6A 2020-06-17 2020-06-17 Language model acquisition and Chinese semantic understanding method, device and storage medium Active CN111859981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010552815.6A CN111859981B (en) 2020-06-17 2020-06-17 Language model acquisition and Chinese semantic understanding method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010552815.6A CN111859981B (en) 2020-06-17 2020-06-17 Language model acquisition and Chinese semantic understanding method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111859981A 2020-10-30
CN111859981B CN111859981B (en) 2024-03-26

Family

ID=72986672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010552815.6A Active CN111859981B (en) 2020-06-17 2020-06-17 Language model acquisition and Chinese semantic understanding method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111859981B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577662A (en) * 2017-08-08 2018-01-12 上海交通大学 Towards the semantic understanding system and method for Chinese text
GB201904719D0 (en) * 2019-04-03 2019-05-15 Mashtraxx Ltd Method of training a neural network to reflect emotional perception and related system and method for categorizing and finding associated content
KR20190085882A (en) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577662A (en) * 2017-08-08 2018-01-12 上海交通大学 Towards the semantic understanding system and method for Chinese text
KR20190085882A (en) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
GB201904719D0 (en) * 2019-04-03 2019-05-15 Mashtraxx Ltd Method of training a neural network to reflect emotional perception and related system and method for categorizing and finding associated content
CN111078887A (en) * 2019-12-20 2020-04-28 厦门市美亚柏科信息股份有限公司 Text classification method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUAKANG LI ET AL: "Sentiment Analysis based on Bi-LSTM using Tone", 2019 15th International Conference on Semantics, Knowledge and Grids (SKG) *
ZHAO Li; CUI Duwu: "Text Watermarking Algorithm Based on the Pinyin Tones of Chinese Characters", Computer Engineering, no. 10 *
DENG Li; LIANG Xiangdong: "DFT-Based Fast Embedded Mandarin Speech Recognition", Research and Exploration in Laboratory, no. 06 *

Also Published As

Publication number Publication date
CN111859981B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
JP7317791B2 (en) Entity linking method, device, apparatus and storage medium
CN111859994B (en) Machine translation model acquisition and text translation method, device and storage medium
CN111125335B (en) Question and answer processing method and device, electronic equipment and storage medium
US11403468B2 (en) Method and apparatus for generating vector representation of text, and related computer device
CN111061868B (en) Reading method prediction model acquisition and reading method prediction method, device and storage medium
JP7179123B2 (en) Language model training method, device, electronic device and readable storage medium
JP7149993B2 (en) Pre-training method, device and electronic device for sentiment analysis model
JP2021108115A (en) Method and device for training machine reading comprehension model, electronic apparatus, and storage medium
CN110674260B (en) Training method and device of semantic similarity model, electronic equipment and storage medium
JP2021111420A (en) Method and apparatus for processing semantic description of text entity, and device
CN112507735A (en) Training method and device of machine translation model and electronic equipment
CN112507101A (en) Method and device for establishing pre-training language model
KR102630243B1 (en) method and device for predicting punctuation
JP2021131858A (en) Entity word recognition method and apparatus
CN111680517A (en) Method, apparatus, device and storage medium for training a model
JP2021108098A (en) Review information processing method, device, computer apparatus, and medium
JP2022008207A (en) Method for generating triple sample, device, electronic device, and storage medium
CN110807331A (en) Polyphone pronunciation prediction method and device and electronic equipment
CN112506949B (en) Method, device and storage medium for generating structured query language query statement
JP7198800B2 (en) Intention Recognition Optimization Processing Method, Apparatus, Equipment and Storage Medium
CN112269862B (en) Text role labeling method, device, electronic equipment and storage medium
CN111079945A (en) End-to-end model training method and device
CN111858880A (en) Method and device for obtaining query result, electronic equipment and readable storage medium
CN113360751A (en) Intention recognition method, apparatus, device and medium
CN112270169B (en) Method and device for predicting dialogue roles, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant