CN117764061A - Dictionary fusion method and device, electronic equipment and storage medium

Dictionary fusion method and device, electronic equipment and storage medium

Info

Publication number
CN117764061A
CN117764061A (application CN202311789361.4A)
Authority
CN
China
Prior art keywords
training
corpus
language
word
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311789361.4A
Other languages
Chinese (zh)
Inventor
王茜
赵金涛
孟振南
王倪东
丁辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd filed Critical Shanghai Mobvoi Information Technology Co ltd
Priority to CN202311789361.4A
Publication of CN117764061A
Legal status: Pending

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides a dictionary fusion method, a dictionary fusion device, electronic equipment, and a storage medium. The dictionary fusion method of the disclosure comprises the following steps: receiving a pre-training corpus input by a user, wherein the pre-training corpus is based on a first language; training a first word segmentation device (tokenizer) on the pre-training corpus input by the user to obtain a first language word segmentation device; and fusing the vocabulary of the first language word segmentation device with the vocabulary of a second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device trained based on a second language, and the first language is different from the second language.

Description

Dictionary fusion method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of large models, and in particular relates to a dictionary fusion method, a dictionary fusion device, electronic equipment and a storage medium.
Background
To better fit application scenarios in which Chinese is the primary language, Chinese comparison experiments need to be carried out between open-source models and in-house models. The base open-source model LLaMA2 is mainly an open-source project of the English-language community: it is strong overall but performs poorly on Chinese. The tokenizer of the open-source model is trained mainly on English corpora, contains only about 700 Chinese characters, all single characters, and processes characters outside the vocabulary as 3 bytes each. Shredding Chinese characters in this way makes the word segmentation device reflect Chinese semantics poorly, which limits the model's ability to understand Chinese; and since each Chinese character then needs multiple token codes, training and inference efficiency is greatly affected.
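For illustration only (this sketch is not part of the original disclosure; the tokenizer path is a placeholder and assumes a locally available LLaMA2-style tokenizer loaded through the Hugging Face transformers library), the byte-fallback behavior described above can be observed as follows:

```python
# Sketch: observe byte-fallback tokenization of Chinese text with a
# LLaMA-style tokenizer. The model path below is a placeholder.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama2-tokenizer")

for text in ["hello", "犇"]:
    print(text, "->", tokenizer.tokenize(text))

# An English word typically maps to one or two subword pieces, while a
# Chinese character outside the vocabulary falls back to three byte tokens,
# e.g. '犇' -> ['<0xE7>', '<0x8A>', '<0x87>'] (its three UTF-8 bytes).
```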
The related art provides a method comprising: receiving a sentence to be recognized, and matching related vocabulary from a dictionary for each character in the sentence; dynamically learning the relevance weights between characters and their corresponding words using self-attention as a word-information fusion mechanism, so as to fuse word information; adopting an improved Transformer layer that integrates position information by optimizing the initial position encoding while modeling the semantic information of the context; and inputting the learned context representation into a conditional random field for prediction.
However, the above method does not provide a way to fuse a Chinese dictionary, so it cannot improve the model's understanding of Chinese, and the inference efficiency and training efficiency of the model suffer as a result.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a dictionary fusion method, a dictionary fusion device, an electronic apparatus, and a storage medium.
In one aspect, a dictionary fusion method is provided, the dictionary fusion method including:
receiving a pre-training corpus input by a user, wherein the pre-training corpus is a pre-training corpus based on a first language;
training a first word segmentation device based on a pre-training corpus input by a user to obtain a first language word segmentation device; fusing a vocabulary of the first language word segmentation device with a vocabulary of a second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device obtained by training based on a second language, and the first language is different from the second language.
According to an optional embodiment of the disclosure, training the first word segmentation device based on the pre-training corpus to obtain the first language word segmentation device includes:
obtaining a pre-training word corpus based on the pre-training corpus, and training the first word segmentation device based on the pre-training word corpus to obtain the first language word segmentation device.
According to an optional embodiment of the disclosure, obtaining the pre-training word corpus based on the pre-training corpus includes:
segmenting the pre-training corpus into a pre-training character corpus, and obtaining the occurrence frequency of the pre-training character corpus;
and obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus.
According to an optional embodiment of the disclosure, obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus includes:
obtaining the co-occurrence frequency of at least two adjacent pre-training character corpora, and combining adjacent pre-training character corpora whose co-occurrence frequency falls within a threshold range to obtain the pre-training word corpus.
According to an optional embodiment of the disclosure, fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary, including:
and inserting the vocabulary of the first language word segmentation device into the last word of the vocabulary of the second language word segmentation device to obtain a fused dictionary.
According to an alternative embodiment of the present disclosure, receiving a pre-training corpus input by a user includes:
and receiving different types of corpus input by a user, and processing the different types of corpus into corpus with the same format as the pre-training corpus.
According to an alternative embodiment of the present disclosure, processing the different types of corpora into corpora of the same format as the pre-training corpora includes:
labeling different types of corpus, and processing the labeled corpus into corpus with the same format, wherein the pre-training corpus comprises the label.
According to an alternative embodiment of the present disclosure, the first word segmentation device is trained based on the pre-training corpus using the BPE algorithm in SentencePiece.
According to an optional embodiment of the disclosure, fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary, including:
and adding the vocabulary of the first language word segmentation device into the vocabulary of the second language word segmentation device for fusion by adopting an add pieces method of the sense pieces, thereby obtaining a fused dictionary.
In another aspect, a dictionary fusion apparatus is provided, including:
the receiving module receives a pre-training corpus input by a user, wherein the pre-training corpus is based on a first language;
the training module is used for responding to the pre-training corpus input by the user, training the first word segmentation device based on the pre-training corpus and obtaining a first language word segmentation device;
the fusion module is used for fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device obtained by training based on a second language, and the first language is different from the second language.
In yet another aspect, an electronic device is provided, including:
a memory storing execution instructions;
a processor executing the execution instructions stored in the memory, implementing the method of any one of the above.
In yet another aspect, a readable storage medium is provided, in which execution instructions are stored, which when executed by a processor, implement the method of any of the above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart of a dictionary fusion method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of step S102 in the dictionary fusion method provided in the embodiment of the present disclosure.
Fig. 3 is a simplified schematic diagram of a preferred dictionary fusion flow according to an embodiment of the present disclosure.
Fig. 4 is a block diagram schematically illustrating the structure of a dictionary fusion apparatus employing a hardware implementation of a processor according to one embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The use of cross-hatching and/or shading in the drawings is typically used to clarify the boundaries between adjacent components. As such, the presence or absence of cross-hatching or shading does not convey or represent any preference or requirement for a particular material, material property, dimension, proportion, commonality between illustrated components, and/or any other characteristic, attribute, property, etc. of a component, unless indicated. In addition, in the drawings, the size and relative sizes of elements may be exaggerated for clarity and/or descriptive purposes. While the exemplary embodiments may be variously implemented, the specific process sequences may be performed in a different order than that described. For example, two consecutively described processes may be performed substantially simultaneously or in reverse order from that described. Moreover, like reference numerals designate like parts.
When an element is referred to as being "on" or "over", "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. However, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there are no intervening elements present. For this reason, the term "connected" may refer to physical connections, electrical connections, and the like, with or without intermediate components.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to limit the scope of the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising," and variations thereof, are used in the present specification, the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof is described, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not precluded. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as approximation terms and not as degree terms, and as such, are used to explain the inherent deviations of measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
Fig. 1 to 3 are flow diagrams of a dictionary fusion method according to one embodiment of the present disclosure.
Referring to fig. 1, a dictionary fusion method M100 includes:
s102, receiving a pre-training corpus input by a user, wherein the pre-training corpus is a pre-training corpus based on a first language.
S104, responding to the pre-training corpus input by the user, and training the first word segmentation device based on the pre-training corpus to obtain the first language word segmentation device.
S106, fusing a vocabulary of the first language word segmentation device with a vocabulary of the second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device trained based on a second language, and the first language is different from the second language.
According to the dictionary fusion method provided by the embodiments of the present disclosure, the first word segmentation device is trained on a pre-training corpus in the first language to obtain the first language word segmentation device, and the vocabulary of the first language word segmentation device is fused with the vocabulary of the second language word segmentation device to obtain a fused dictionary. Because the second language word segmentation device is trained on a second language different from the first language, a model that adopts the dictionary of the embodiments of the present disclosure gains improved recognition and understanding of both the first language and the second language, can recognize multiple languages, and achieves improved inference capability.
The existing tokenizer of the open-source model is trained mainly on English corpora and contains only about 700 Chinese characters, all single characters. Characters not in the model vocabulary are processed as 3 bytes each; this shreds Chinese characters, so the segmented text reflects Chinese semantics poorly and the model's understanding of Chinese is limited. Moreover, encoding every Chinese character this way greatly affects the efficiency of model training and inference.
In step S102 of the dictionary fusion method provided by the embodiments of the present disclosure, a pre-training corpus input by a user is received, where the pre-training corpus is based on a first language. The pre-training corpus input by the user may be in text form or in speech form. The embodiments of the present disclosure preprocess the pre-training corpus: speech-form corpus is converted into text form, the pre-training corpus is denoised, and semantically unclear, duplicated, and similar corpus is removed.
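As a minimal illustration of such preprocessing (the function name, thresholds, and cleaning rules below are assumptions made for the sketch, not details fixed by the disclosure), duplicate and near-empty entries can be removed before training:

```python
# Hypothetical preprocessing sketch: normalize whitespace, drop fragments
# with little semantic content, and remove exact duplicates.
import re

def clean_corpus(lines: list[str], min_chars: int = 2) -> list[str]:
    seen: set[str] = set()
    cleaned: list[str] = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # normalize whitespace
        if len(text) < min_chars:                 # drop near-empty fragments
            continue
        if text in seen:                          # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

corpus = ["你好，世界。", "你好，世界。", "  ", "预训练语料示例。"]
print(clean_corpus(corpus))  # ['你好，世界。', '预训练语料示例。']
```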
Illustratively, the first language in the embodiments of the present disclosure may be Chinese, or may be another language; the embodiments of the present disclosure do not limit the type of the first language.
As an example, the user may input the pre-training corpus through a terminal, which may include, for example, a mobile phone, a computer, or an iPad; the embodiments of the present disclosure do not limit the type of the terminal.
In step S104 of the dictionary fusion method provided in the embodiments of the present disclosure, in response to the pre-training corpus input by the user, the first word segmentation device is trained based on the pre-training corpus to obtain the first language word segmentation device.
The embodiments of the present disclosure train the first word segmentation device with the pre-training corpus in the first language, so that the first word segmentation device acquires the ability to segment, recognize, and understand the first language.
Further, the embodiments of the present disclosure may limit the number of entries in the first language word segmentation device. Specifically, the number of entries in the vocabulary of the first word segmentation device can be obtained in real time and compared against a preset threshold: while the number of entries is below the threshold, pre-training corpus input by the user continues to be received and the first word segmentation device continues to be trained on it; once the number of entries reaches or exceeds the threshold, receiving further pre-training corpus stops. Illustratively, the vocabulary size of the first word segmentation device provided by the embodiments of the present disclosure may be thirty thousand.
According to a preferred embodiment of the present disclosure, the first word segmentation device is trained based on the pre-training corpus using the BPE algorithm in SentencePiece (SentencePiece is an open-source text tokenization tool from Google).
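A minimal training sketch using the public SentencePiece Python API is shown below. The file names are placeholders, the vocabulary size of 30,000 follows the example above, and character_coverage is an assumed setting that is commonly raised for Chinese text:

```python
# Sketch: train a Chinese BPE word segmentation device (tokenizer) with
# SentencePiece. File paths are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="zh_pretrain_corpus.txt",   # one pre-training sentence per line
    model_prefix="zh_bpe",            # writes zh_bpe.model and zh_bpe.vocab
    model_type="bpe",                 # the BPE algorithm named above
    vocab_size=30000,                 # the illustrative size from the text
    character_coverage=0.9995,        # assumption: high coverage for Chinese
)

sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
print(sp.encode("今天天气真好", out_type=str))
```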
In step S106 of the dictionary fusion method provided in the embodiment of the present disclosure, a vocabulary of a first language word segmentation device and a vocabulary of a second language word segmentation device are fused to obtain a fused dictionary, where the second language word segmentation device is a word segmentation device obtained by training based on a second language, and the first language is different from the second language.
In the embodiments of the present disclosure, since the first language and the second language are different, the vocabulary of the first word segmentation device trained on the first language also differs from the vocabulary of the second word segmentation device trained on the second language. Fusing the two vocabularies therefore yields the capabilities of two word segmentation devices for different languages (the first word segmentation device and the second word segmentation device), so that recognition, segmentation, understanding, and the like can be realized for both languages.
As an example, the first language may be Chinese and the second language may be English. As another example, the first language may be Korean and the second language may be English. The embodiments of the present disclosure do not limit the types of the first language and the second language.
According to a preferred embodiment of the present disclosure, training the first word segmentation device based on the pre-training corpus to obtain the first language word segmentation device includes:
obtaining a pre-training word corpus based on the pre-training corpus, training the first word segmentation device based on the pre-training word corpus, and obtaining the first language word segmentation device.
The pre-training corpus comprises pre-training article corpus, pre-training long sentence corpus, pre-training short sentence corpus, and the like. In the embodiments of the present disclosure, the pre-training word corpus is obtained based on the pre-training corpus, that is, based on the pre-training long sentence corpus, the pre-training short sentence corpus, the pre-training article corpus, and so on.
In other words, the pre-training article corpus, pre-training long sentence corpus, pre-training short sentence corpus, and the like are each converted into word-level corpus.
According to a preferred embodiment of the present disclosure, converting the pre-training article corpus, pre-training long sentence corpus, pre-training short sentence corpus, and the like into word-level corpus includes the following steps, illustrated in the sketch after this list:
segmenting the pre-training article corpus into long sentence corpus taking long sentences as units, which together with the pre-training long sentence corpus already in the pre-training corpus forms a pre-training long sentence corpus set;
segmenting the pre-training long sentence corpus in the pre-training long sentence corpus set into short sentence corpus taking short sentences as units, which together with the pre-training short sentence corpus already in the pre-training corpus forms a pre-training short sentence corpus set;
segmenting the pre-training short sentence corpus in the pre-training short sentence corpus set into pre-training character corpus taking single characters as units;
and obtaining the pre-training word corpus from the segmented pre-training character corpus.
By segmenting the pre-training article corpus, pre-training long sentence corpus, pre-training short sentence corpus, and so on step by step and stage by stage, the embodiments of the present disclosure improve the segmentation accuracy of the corpus and thereby further guarantee the training effect.
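A hypothetical sketch of this stepwise segmentation is given below; the punctuation sets used as long-sentence and short-sentence boundaries are assumptions, since the disclosure does not fix them:

```python
# Sketch: articles -> long sentences -> short sentences -> single characters.
import re

LONG_BOUNDARY = r"[。！？!?]"    # sentence-final punctuation ends a long sentence
SHORT_BOUNDARY = r"[，、；,;:]"  # clause-level punctuation ends a short sentence

def split_article(article: str) -> list[str]:
    return [s for s in re.split(LONG_BOUNDARY, article) if s.strip()]

def split_long_sentence(sentence: str) -> list[str]:
    return [s for s in re.split(SHORT_BOUNDARY, sentence) if s.strip()]

def split_short_sentence(clause: str) -> list[str]:
    return [ch for ch in clause if not ch.isspace()]  # character units

article = "今天天气真好，我们去公园。你想一起来吗？"
for long_sent in split_article(article):
    for short_sent in split_long_sentence(long_sent):
        print(short_sent, "->", split_short_sentence(short_sent))
```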
Referring to fig. 2, in step S102 in the embodiments of the present disclosure, obtaining the pre-training word corpus based on the pre-training corpus includes:
S1022, segmenting the pre-training corpus into a pre-training character corpus, and obtaining the occurrence frequency of the pre-training character corpus.
As mentioned above, in the embodiments of the present disclosure the pre-training corpus includes pre-training article corpus, pre-training long sentence corpus, pre-training short sentence corpus, and the like. After these are segmented, a number of pre-training character corpora are formed. From the occurrence frequency of the pre-training character corpora, the association between them can be determined, so that different pre-training character corpora can be combined into pre-training word corpora.
According to a preferred embodiment of the present disclosure, in step S1022 in the embodiments of the present disclosure, segmenting the pre-training corpus into a pre-training character corpus and obtaining the occurrence frequency of the pre-training character corpus includes:
de-duplicating the segmented pre-training character corpus, that is, removing repeated pre-training character corpus, and obtaining the occurrence frequency of the pre-training character corpus.
S1024, obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus.
According to a preferred embodiment of the present disclosure, step S1024 in the embodiments of the present disclosure, obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus, includes:
obtaining the co-occurrence frequency of at least two adjacent pre-training character corpora, and combining adjacent pre-training character corpora whose co-occurrence frequency falls within a threshold range to obtain the pre-training word corpus.
A pre-training word corpus in the embodiments of the present disclosure may be a word or a phrase, and a word or phrase may include at least two characters. The embodiments of the present disclosure therefore obtain the co-occurrence frequency of at least two adjacent pre-training character corpora, and a pre-training word corpus can be formed from those adjacent pre-training character corpora.
It will be appreciated that if the co-occurrence frequency of two adjacent pre-training character corpora reflects a strong association between them, they can compose a pre-training word corpus. The embodiments of the present disclosure therefore combine adjacent pre-training character corpora whose co-occurrence frequency falls within a threshold range to obtain the pre-training word corpus. The threshold range may be determined according to the relationship between the pre-training character corpus and the pre-training word corpus, so long as the purpose of combining adjacent pre-training character corpora into a pre-training word corpus can be achieved.
As an example, the two adjacent pre-training character corpora that co-occur most frequently may be combined into one pre-training word corpus.
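This frequency-driven combination is the core loop of BPE. A from-scratch sketch follows; the toy corpus and the one-merge-per-round policy are illustrative only, and SentencePiece's production implementation differs in detail:

```python
# Sketch: count adjacent character pairs across the corpus and merge the
# most frequent pair into a single word-level unit, repeating for a few rounds.
from collections import Counter

def merge_most_frequent_pair(sequences: list[list[str]]) -> list[list[str]]:
    pair_counts: Counter = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))  # co-occurrence of adjacent units
    if not pair_counts:
        return sequences
    (a, b), _ = pair_counts.most_common(1)[0]  # highest co-occurrence frequency
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b)              # combine into one unit
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("今天天气"), list("今天下雨"), list("天气预报")]
for _ in range(3):                             # three merge rounds, for illustration
    corpus = merge_most_frequent_pair(corpus)
print(corpus)
```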
According to a preferred embodiment of the present disclosure, in step S106 in the embodiments of the present disclosure, fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary includes:
inserting the vocabulary of the first language word segmentation device after the last entry of the vocabulary of the second language word segmentation device to obtain the fused dictionary.
As an example, when the first language is Chinese and the second language is English, fusing the first language word segmentation device with the second language word segmentation device must keep the segmentation of English consistent with LLaMA (an open-source large pre-trained model from Facebook/Meta), so that the ID of each English token after segmentation is unchanged; the segmentation of numbers also uses LLaMA's processing method, which avoids ambiguity and better preserves the model's ability to understand numbers. Therefore, in the embodiments of the present disclosure, the vocabulary of the first language word segmentation device is inserted into the vocabulary of the second language word segmentation device after its last entry, so that the original ID order of the second language vocabulary is not disturbed while the two vocabularies are fused.
According to a preferred embodiment of the present disclosure, fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary includes:
adding the vocabulary of the first language word segmentation device into the vocabulary of the second language word segmentation device for fusion using the add-pieces method of SentencePiece, thereby obtaining the fused dictionary.
According to a preferred embodiment of the present disclosure, in step S102 provided in the embodiment of the present disclosure, receiving a pre-training corpus input by a user includes:
receiving different types of corpus input by a user, and processing the different types of corpus into corpus with the same format as pre-training corpus.
It will be appreciated that the pre-training corpus input by the user may include text, spaces, special characters, or other types of corpus. These types of corpus therefore need to be processed into the same format, so that in subsequent training a space corpus can be treated in the same way as a text corpus and participate in training as pre-training corpus, without distinguishing text from spaces.
According to a preferred embodiment of the present disclosure, processing different types of corpora into corpora of the same format as a pre-training corpus includes:
labeling different types of corpus, and processing the labeled corpus into corpus with the same format, wherein the pre-trained corpus comprises labels.
It can be understood that the input may contain special characters, special punctuation, or special tokens required by the semantics, so different types of corpus can be labeled according to user requirements, and the labeled corpus of different types is processed into corpus of the same format, where the pre-training corpus includes the labels. In other words, the labels of the pre-training corpus are also embodied in the corpus processed into the same format. This approach improves the segmentation of custom characters and better suits various scenario tasks.
Illustratively, for code segmentation, dialog tagging, and the like, the embodiments of the present disclosure add special characters, that is, the labels of the embodiments of the present disclosure, such as spaces and line feeds of different lengths and markers such as "### user:", "### bot:", and "### end", so that the first language word segmentation device can segment these markers as whole units according to the needs of users. A model adopting the dictionary of the embodiments of the present disclosure can thereby better learn the meaning of the dictionary and achieve better performance.
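With SentencePiece, such markers can be registered at training time through the real user_defined_symbols option so that they are never split. In the sketch below the marker strings are written without embedded spaces, since SentencePiece normalizes whitespace; the file names are placeholders:

```python
# Sketch: register dialog markers as single, unsplittable tokens.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="zh_pretrain_corpus.txt",
    model_prefix="zh_bpe_tagged",
    model_type="bpe",
    vocab_size=30000,
    user_defined_symbols=["###user:", "###bot:", "###end"],
)

sp = spm.SentencePieceProcessor(model_file="zh_bpe_tagged.model")
print(sp.encode("###user: 你好", out_type=str))  # the marker survives as one piece
```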
Referring to fig. 3, the dictionary fusion method provided by the embodiments of the present disclosure is further illustrated with Chinese as the first language and English as the second language. The embodiments of the present disclosure train on the received pre-training corpus input by the user with the SentencePiece model, using the BPE algorithm specified in SentencePiece, to obtain the dedicated Chinese word segmentation device of fig. 3 (that is, the first language word segmentation device), and fuse this Chinese word segmentation device with the word segmentation device already in use (that is, the second language word segmentation device) to obtain a fused new word segmentation device. Using the add-pieces method carried by SentencePiece, the vocabulary of the Chinese word segmentation device (that is, the first language word segmentation device) is added directly into the SentencePiece vocabulary, and the labels of the different types of pre-training corpus are likewise added to the fused new word segmentation device with the add-pieces method, yielding the final word segmentation device (that is, the word segmentation device of the fused dictionary).
FIG. 4 illustrates an example diagram of an apparatus employing a hardware implementation of a processing system.
The apparatus may include corresponding modules that perform the steps of the flowcharts described above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one connection line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
Any process or method description in a flowchart or otherwise described herein may be understood as: a module, segment, or portion of code, which comprises one or more executable instructions for implementing the steps of a specified logical function(s) or process (es). The scope of the preferred embodiments of the present disclosure may include other implementations in which functions may be performed out of the order described, for example, in a substantially simultaneous manner or in an opposite order depending on the function involved, as would be understood by one of skill in the art. The processor may be used to perform the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program stored on a computer readable storage medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when the software program is loaded and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program can be printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the methods of the embodiments described above may be implemented by a program to instruct related hardware. The program may be stored in a readable storage medium. The program, when executed, includes one or a combination of steps for implementing the method.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
Fig. 4 is a schematic structural view of a dictionary fusion device according to one embodiment of the present disclosure.
As shown in fig. 4, a dictionary fusion apparatus 1000 according to the present disclosure may include:
the receiving module 1002 receives a pre-training corpus input by a user, where the pre-training corpus is a pre-training corpus based on a first language.
The training module 1004 is configured to, in response to the pre-training corpus input by the user, train the first word segmentation device based on the pre-training corpus to obtain the first language word segmentation device.
The fusion module 1006 is configured to fuse the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary, where the second language word segmentation device is a word segmentation device trained based on the second language, and the first language is different from the second language.
According to an alternative embodiment of the present disclosure, the training module 1004 obtains a pre-training word corpus based on the pre-training corpus, trains the first word segmentation device based on the pre-training word corpus, and obtains the first language word segmentation device.
According to an alternative embodiment of the present disclosure, the training module 1004 segments the pre-training corpus into a pre-training character corpus, obtains the occurrence frequency of the pre-training character corpus,
and obtains the pre-training word corpus based on the occurrence frequency of the pre-training character corpus.
According to an alternative embodiment of the present disclosure, the training module 1004 obtains the co-occurrence frequency of at least two adjacent pre-training character corpora, and combines adjacent pre-training character corpora whose co-occurrence frequency falls within a threshold range to obtain the pre-training word corpus.
According to an alternative embodiment of the present disclosure, the fusion module 1006 inserts the vocabulary of the first language word segmentation device after the last entry of the vocabulary of the second language word segmentation device, resulting in a fused dictionary.
According to an alternative embodiment of the present disclosure, the receiving module 1002 receives different types of corpora input by a user, and processes the different types of corpora into corpora in the same format as a pre-training corpus.
According to an alternative embodiment of the present disclosure, the receiving module 1002 tags different types of corpora, and processes the tagged corpora into corpora of the same format, where the pre-training corpora includes tags.
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the memory-stored execution instructions, causing the processor or other hardware module to perform the method described above.
The present disclosure also provides a readable storage medium having stored therein execution instructions which when executed by a processor are adapted to carry out the above-described method.
In the description of the present specification, reference to the terms "one embodiment/example," "some embodiments/examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment/mode or example of the present application. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, those skilled in the art may combine the different embodiments/modes or examples described in this specification and the features thereof, provided they do not contradict each other.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A dictionary fusion method, characterized in that the dictionary fusion method comprises:
receiving a pre-training corpus input by a user, wherein the pre-training corpus is a pre-training corpus based on a first language;
training a first word segmentation device based on a pre-training corpus input by a user to obtain a first language word segmentation device; and
fusing a vocabulary of the first language word segmentation device with a vocabulary of a second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device obtained by training based on a second language, and the first language is different from the second language.
2. The dictionary fusion method of claim 1, wherein training the first word segmentation device based on the pre-training corpus to obtain the first language word segmentation device comprises:
obtaining a pre-training word corpus based on the pre-training corpus, and training the first word segmentation device based on the pre-training word corpus to obtain the first language word segmentation device.
3. The dictionary fusion method of claim 2, wherein obtaining the pre-training word corpus based on the pre-training corpus comprises:
segmenting the pre-training corpus into a pre-training character corpus, and obtaining the occurrence frequency of the pre-training character corpus; and
obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus.
4. The dictionary fusion method of claim 3, wherein obtaining the pre-training word corpus based on the occurrence frequency of the pre-training character corpus comprises:
obtaining the co-occurrence frequency of at least two adjacent pre-training character corpora, and combining adjacent pre-training character corpora whose co-occurrence frequency falls within a threshold range to obtain the pre-training word corpus.
5. The dictionary fusion method according to any one of claims 1-4, wherein fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary comprises:
inserting the vocabulary of the first language word segmentation device after the last entry of the vocabulary of the second language word segmentation device to obtain the fused dictionary.
6. The dictionary fusion method of any one of claims 1-4, wherein receiving a pre-training corpus input by a user comprises:
and receiving different types of corpus input by a user, and processing the different types of corpus into corpus with the same format as the pre-training corpus.
7. The dictionary fusion method according to claim 6, wherein processing the different types of corpus into the corpus of the same format as the pre-training corpus includes:
labeling different types of corpus, and processing the labeled corpus into corpus with the same format, wherein the pre-training corpus comprises the label.
8. A dictionary fusion device, comprising:
the receiving module receives a pre-training corpus input by a user, wherein the pre-training corpus is based on a first language;
the training module is used for responding to the pre-training corpus input by the user, training the first word segmentation device based on the pre-training corpus and obtaining a first language word segmentation device; and
the fusion module is used for fusing the vocabulary of the first language word segmentation device with the vocabulary of the second language word segmentation device to obtain a fused dictionary, wherein the second language word segmentation device is a word segmentation device obtained by training based on a second language, and the first language is different from the second language.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing the memory-stored execution instructions to implement the method of any one of claims 1 to 7.
10. A readable storage medium having stored therein execution instructions which when executed by a processor implement the method of any one of claims 1 to 7.
CN202311789361.4A (filed 2023-12-22, priority 2023-12-22) - Dictionary fusion method and device, electronic equipment and storage medium - Pending - CN117764061A (en)

Priority Applications (1)

Application Number: CN202311789361.4A
Priority Date: 2023-12-22
Filing Date: 2023-12-22
Title: Dictionary fusion method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202311789361.4A
Priority Date: 2023-12-22
Filing Date: 2023-12-22
Title: Dictionary fusion method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN117764061A
Publication Date: 2024-03-26

Family

Family ID: 90323547

Family Applications (1)

Application Number: CN202311789361.4A
Title: Dictionary fusion method and device, electronic equipment and storage medium
Status: Pending
Priority Date: 2023-12-22
Filing Date: 2023-12-22

Country Status (1)

Country: CN
Publication: CN117764061A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination