CN114330296A

CN114330296A - New word discovery method, device, equipment and storage medium

Info

Publication number: CN114330296A
Application number: CN202111229387.4A
Authority: CN
Inventors: 刘刚
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-10-21
Filing date: 2021-10-21
Publication date: 2022-04-12

Abstract

The embodiments of the application disclose a new word discovery method, apparatus, device and storage medium, applicable to fields such as artificial intelligence, cloud technology and databases. The method comprises: determining an initial feature sequence corresponding to each character in a sentence to be predicted, and inputting the initial feature sequence into a language model to obtain a predicted feature sequence corresponding to each character in the sentence to be predicted; and determining a new word in the sentence to be predicted based on the predicted feature sequence. With the embodiments of the application, new words can be determined efficiently and accurately, and the applicability is high.

Description

New word discovery method, device, equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for discovering new words.

Background

In an era of rapid internet development, the threshold for producing content keeps falling and the volume of published content grows exponentially, so discovering new words in massive content has become one of the most important steps in content processing.

Common new word discovery methods determine new words in massive content either by statistical means or by model prediction. The former is inefficient, and the latter has difficulty discovering new words accurately because new words are, by nature, newly emerging words. Therefore, how to efficiently and accurately determine new words in massive content has become an urgent problem to be solved.

Disclosure of Invention

The embodiment of the application provides a new word discovery method, a device, equipment and a storage medium, which can efficiently and accurately determine new words and have high applicability.

In one aspect, an embodiment of the present application provides a new word discovery method, including:

determining an initial feature sequence corresponding to each character in a sentence to be predicted, and inputting the initial feature sequence into a language model to obtain a predicted feature sequence corresponding to each character in the sentence to be predicted; determining new words in the sentence to be predicted based on the predicted feature sequence;

wherein, the language model is obtained by training based on the following modes:

acquiring a training sample set, wherein the training sample set comprises a plurality of sample sentences;

for each sample sentence, determining an initial word sequence of the sample sentence, replacing a word to be predicted in the initial word sequence with a preset mask character to obtain a first word sequence, determining an initial character sequence of the sample sentence, replacing the characters to be predicted that correspond to the word to be predicted in the initial character sequence with the preset mask character to obtain a first character sequence, and determining a sample feature sequence corresponding to the sample sentence based on the first word sequence and the first character sequence of the sample sentence;

for each sample sentence, inputting the sample feature sequence corresponding to the sample sentence into an initial language model to obtain a sample prediction feature sequence corresponding to each character in the sample sentence, and determining a predicted word corresponding to the word to be predicted in the sample sentence based on the sample prediction feature sequence;

and determining a training loss value based on the words to be predicted and the corresponding predicted words in each sample sentence, performing iterative training on the initial language model according to the training loss value and the training sample set, and determining the model at the end of training as the language model when the training loss value meets the training end condition.

On the other hand, an embodiment of the present application provides a new word discovery apparatus, including:

a feature processing module, configured to determine an initial feature sequence corresponding to each character in a sentence to be predicted and to input the initial feature sequence into a language model to obtain a predicted feature sequence corresponding to each character in the sentence to be predicted;

a new word determining module, configured to determine a new word in the sentence to be predicted based on the prediction feature sequence;

wherein the language model is trained based on a model training device, the model training device being configured to:

In another aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other;

the memory is used for storing computer programs;

the processor is configured to execute the new word discovery method provided by the embodiment of the application when the computer program is called.

In another aspect, the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the new word discovery method provided by the present application.

In another aspect, the present application provides a computer program product, which includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the new word discovery method provided by the present application is implemented.

In the embodiments of the application, the sample feature sequence corresponding to a sample sentence is determined from the first word sequence and the first character sequence corresponding to the sample sentence, both of which contain preset mask characters, so that the sample feature corresponding to each character in the sample sentence carries information about the character itself as well as information about the word in which the character is located. On this basis, the language model trained on the sample feature sequences can determine the predicted feature sequence of a sentence to be predicted while fully considering the meaning and the contextual semantics of each character in that sentence, so that new words in the sentence to be predicted can be determined accurately and efficiently based on the predicted feature sequence, and the applicability is high.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a new word discovery method provided in an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for training a language model according to an embodiment of the present disclosure;

FIG. 3a is a schematic diagram of a scenario in which a first word sequence is determined according to an embodiment of the present application;

FIG. 3b is a schematic diagram of a scenario for determining a first character sequence according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a scenario for determining a sample feature sequence according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a scenario of an initial language model provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a new word discovery apparatus provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The new word discovery method provided by the embodiment of the application can be applied to various fields such as artificial intelligence, big data and cloud technology, for example, new words in a text can be discovered based on the technologies such as Machine Learning (ML) and Natural Language Processing (NLP) in the field of artificial intelligence.

The artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense the environment, acquire knowledge and obtain the best result by using the knowledge. The language model used in the new word discovery process can be obtained through machine learning according to the embodiment of the application.

Among them, natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics.

The new word discovery method provided by the embodiment of the application can be executed by any terminal device or server. When the method for discovering new words is executed by a server, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or a server cluster providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, a big data and artificial intelligence platform. When the new word discovery method provided by the embodiment of the application is executed by a terminal device, the terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart sound box, a smart watch, a vehicle-mounted terminal, and the like, but is not limited thereto.

Referring to fig. 1, fig. 1 is a schematic flow chart of a new word discovery method provided in an embodiment of the present application. As shown in fig. 1, the new word discovery method provided in the embodiment of the present application includes the following steps:

Step S11, determining an initial feature sequence corresponding to each character in the sentence to be predicted, and inputting the initial feature sequence into the language model to obtain a predicted feature sequence corresponding to each character in the sentence to be predicted.

The language model provided in the embodiment of the present application is obtained through pre-training, and the training mode of the language model may specifically refer to fig. 2, and fig. 2 is a flowchart illustrating a training method of the language model provided in the embodiment of the present application. As shown in fig. 2, the method for training a language model provided in the embodiment of the present application includes the following steps:

Step S21, acquiring a training sample set, wherein the training sample set comprises a plurality of sample sentences.

In some possible embodiments, content from vertical websites may be obtained based on big data, crawler technology and the like; for example, website content from game, video or animation websites may be obtained as a training corpus. Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within a given time frame; it is a massive, fast-growing and diversified information asset that yields stronger decision-making power, insight and process optimization capability only under new processing modes. Big data techniques such as massively parallel processing databases, data mining, distributed file systems, distributed databases and cloud computing make it possible to obtain, efficiently, the training corpora needed to construct training samples.

Optionally, daily incremental news, hourly trending news and the like may also be used as a training corpus, and audio such as radio and television news may be converted into text and used as a training corpus. Alternatively, text accumulated during information stream distribution or in historical data may be used as a corpus, for example text from media accounts, user comments, bullet-screen comments, electronic books or various application programs.

The specific source of the corpus used for constructing the training sample set may be determined based on the requirements of the actual application scenario, which is not limited herein. In addition, in information stream corpus text, authors sometimes hide text in the content in order to raise its weight and exposure in recommendation and distribution, for example by setting the font color to match the background or by marking the text with a hidden tag. Such hidden text therefore needs to be removed when the corpus is acquired, so as to eliminate its negative influence.

Further, after the corpus is obtained, repeated or similar text paragraphs in the corpus may be screened out to obtain a screened corpus set. Specifically, identical or similar text paragraphs may be identified by computing the similarity between text paragraphs. Alternatively, a simhash value may be computed for each text paragraph, and if the difference between the simhash values of two text paragraphs (for example, their Hamming distance) is smaller than a preset threshold, the two text paragraphs may be determined to be similar. Repeated and similar text paragraphs in the corpus are screened out in this manner.
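The simhash-based screening described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method fixed by the application: the character-bigram tokenization, the MD5-derived 64-bit hash, the Hamming-distance comparison, and the names simhash, hamming_distance, deduplicate and max_distance are all hypothetical choices made for the example.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple SimHash fingerprint over character bigrams (assumed tokenization)."""
    tokens = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(paragraphs, max_distance=3):
    """Keep a paragraph only if it is not within max_distance bits of one already kept."""
    kept, fingerprints = [], []
    for p in paragraphs:
        fp = simhash(p)
        if all(hamming_distance(fp, other) > max_distance for other in fingerprints):
            kept.append(p)
            fingerprints.append(fp)
    return kept
```

The pairwise comparison is quadratic in the number of kept paragraphs; for a large corpus a bucketed lookup over fingerprint segments would normally replace it.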

The determination of the similarity or the simhash value between the text paragraphs can be realized by a cloud technology. Cloud Computing refers to obtaining required resources in an on-demand and easily-extensible manner through a Network, and is a product of development and fusion of traditional computers and Network Technologies, such as Grid Computing (Grid Computing), Distributed Computing (Distributed Computing), Parallel Computing (Parallel Computing), Utility Computing (Utility Computing), Network Storage (Network Storage Technologies), Virtualization (Virtualization), Load balancing (Load Balance), and the like.

In some possible embodiments, after the filtered corpus set is obtained, sentences in the corpus set with text lengths within a preset length range may be used as sample sentences, so that the set of sample sentences is used as a training sample set for training the language model.

Step S22, for each sample sentence, determining an initial word sequence of the sample sentence, replacing a word to be predicted in the initial word sequence with a preset mask character to obtain a first word sequence, determining an initial character sequence of the sample sentence, replacing the characters to be predicted that correspond to the word to be predicted in the initial character sequence with the preset mask character to obtain a first character sequence, and determining a sample feature sequence corresponding to each character in the sample sentence based on the first word sequence and the first character sequence of the sample sentence.

In some possible embodiments, before the model training process, for each sample sentence, an initial word sequence of the sample sentence may be determined, and a word to be predicted in the initial word sequence is replaced with a preset mask character, so as to obtain a first word sequence.

Specifically, the sample sentence may be subjected to word segmentation to obtain an initial word sequence of the sample sentence, and then any target word sequence whose text length equals a preset text length is determined within the initial word sequence; that is, the target word sequence is a part of the initial word sequence. Further, words in the target word sequence that belong to a preset part-of-speech type can be determined as words to be predicted, and each word to be predicted is replaced with a preset mask character.

The preset part-of-speech type includes, but is not limited to, nouns, verbs, adjectives, verbal nouns and the like, and may be determined based on the requirements of the actual application scenario, which is not limited herein. The preset mask character may be any character used to replace the word to be predicted; for example, the word to be predicted may be replaced with [MASK], which is not limited herein.

As shown in FIG. 3a, FIG. 3a is a schematic diagram of a scenario for determining a first word sequence according to an embodiment of the present application. If the sample sentence is "tomorrow weather is very good", word segmentation yields the initial word sequence "tomorrow / weather / very good". If "weather" is determined to be the word to be predicted and [MASK] is the preset mask character, the word "weather" is replaced with the preset mask character, giving the first word sequence "tomorrow / [MASK] / very good" corresponding to the sample sentence.

Optionally, a preset proportion of the words of the preset type in the initial word sequence of the sample sentence may be replaced with other words that do not appear in the initial word sequence, so as to obtain a second word sequence. Any target word sequence whose text length equals the preset text length is then determined within the second word sequence, the words in the target word sequence that belong to the preset part-of-speech type are determined as words to be predicted, and each word to be predicted is replaced with the preset mask character to obtain the first word sequence.
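The word-level masking described above can be sketched as follows. This is a minimal sketch under assumptions: the sentence is taken to be already segmented and part-of-speech tagged by some external tool, the preset text length is counted in words, and the names mask_words, pos_tags, target_pos, start and length are hypothetical.

```python
MASK = "[MASK]"


def mask_words(words, pos_tags, target_pos, start, length):
    """Mask words of a target part of speech inside the span [start, start + length).

    Returns the first word sequence and the indices of the masked words.
    """
    masked = list(words)
    targets = []
    for i in range(start, min(start + length, len(words))):
        if pos_tags[i] in target_pos:
            targets.append(i)
            masked[i] = MASK
    return masked, targets


# Example: only the noun "weather" falls in the preset part-of-speech set.
words = ["tomorrow", "weather", "very good"]
pos_tags = ["time", "noun", "adj"]  # hypothetical tag names
first_word_sequence, masked_indices = mask_words(words, pos_tags, {"noun"}, 0, 3)
# first_word_sequence == ["tomorrow", "[MASK]", "very good"], masked_indices == [1]
```

Returning the masked indices alongside the sequence is a design convenience: the same indices can later drive the character-level masking so that both sequences hide the same word.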

In some possible embodiments, before the model training process, the initial character sequence of each sample sentence may be determined as well; that is, the sample sentence is arranged character by character to obtain the initial character sequence.

Furthermore, the characters to be predicted that correspond to the word to be predicted of the sample sentence are determined in the initial character sequence; that is, the characters in the initial character sequence that make up the word to be predicted are determined as the characters to be predicted. For example, if the sample sentence is "tomorrow the air temperature is low" and the word to be predicted in its initial word sequence is "air temperature", the characters to be predicted in its initial character sequence are "air" and "temperature". After the characters to be predicted in the initial character sequence are determined, each of them may be replaced with the preset mask character, so as to obtain the first character sequence of the sample sentence.

As shown in FIG. 3b, FIG. 3b is a schematic diagram of a scenario for determining a first character sequence according to an embodiment of the present application. If the sample sentence is "tomorrow weather is very good", [MASK] is the preset mask character and the word to be predicted in the initial word sequence of the sample sentence is "weather", the initial character sequence of the sample sentence can be determined, and the two characters that make up "weather" are each replaced with the preset mask character, giving the first character sequence "Ming, day, [MASK], [MASK], very, good" corresponding to the sample sentence.

Optionally, if, when determining the first word sequence of the sample sentence, some words of the preset type were replaced with other words in addition to replacing the word to be predicted with the preset mask character, then when determining the first character sequence of the sample sentence, the characters in the initial character sequence that correspond to each replaced word must likewise be replaced with the characters of the substitute word to obtain a second character sequence, and the characters to be predicted in the second character sequence that correspond to the word to be predicted are replaced with the preset mask character to obtain the first character sequence of the sample sentence.

In some possible embodiments, for any sample sentence, since the input of the model consists of features represented as vectors, after the first word sequence and the first character sequence of the sample sentence are determined, the sample feature sequence corresponding to each character in the sample sentence may be determined based on the first word sequence and the first character sequence, so that the sample feature sequence corresponding to the sample sentence is used as the input of the model.
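The character-level masking can be sketched in the same spirit. The sketch below assumes the word segmentation and the indices of the words to be predicted are already known; the Chinese sentence is an assumed rendering of the "tomorrow weather is good" example, and the names mask_characters and char_to_word are hypothetical.

```python
MASK = "[MASK]"


def mask_characters(words, masked_word_indices):
    """Split words into characters, masking every character of each masked word.

    Also returns, for each character position, the index of the word containing it.
    """
    chars, char_to_word = [], []
    for w_idx, word in enumerate(words):
        for ch in word:
            chars.append(MASK if w_idx in masked_word_indices else ch)
            char_to_word.append(w_idx)
    return chars, char_to_word


# Assumed Chinese form of the example sentence, segmented into three words.
words = ["明天", "天气", "很好"]
first_char_sequence, char_to_word = mask_characters(words, {1})
# first_char_sequence == ["明", "天", "[MASK]", "[MASK]", "很", "好"]
# char_to_word == [0, 0, 1, 1, 2, 2]
```

Keeping the character-to-word alignment makes it straightforward to look up, for any character, the feature of the word that contains it.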

Specifically, each character in the first character sequence of the sample sentence, including the preset mask characters, may be encoded to obtain an initial character feature corresponding to each character in the first character sequence. Likewise, each word in the first word sequence of the sample sentence, including the preset mask characters, is encoded to obtain an initial word feature corresponding to each word in the first word sequence. Further, a sample feature sequence corresponding to each character in the sample sentence is determined based on the initial character feature corresponding to each character in the first character sequence and the initial word feature corresponding to each word in the first word sequence.

When the sample feature sequence is determined in this way, for each character in the first character sequence, the initial character feature corresponding to the character and the target initial word feature of the target word in the first word sequence that contains the character can be determined, and a fusion feature corresponding to the character is then determined based on the initial character feature and the target initial word feature. In this manner a fusion feature is obtained for every character in the first character sequence, and the sample feature sequence corresponding to each character in the sample sentence is determined based on these fusion features.

For example, suppose the first character sequence is "tomorrow [MASK][MASK] low" and the first word sequence is "tomorrow [MASK] low". For the character "day" in the first character sequence, the fusion feature corresponding to "day" may be determined based on the initial character feature corresponding to "day" and the target initial word feature of the target word "tomorrow" that contains "day". For either [MASK] in the first character sequence, the corresponding fusion feature may be determined based on its initial character feature and the initial word feature corresponding to the [MASK] in the first word sequence.

For each character, the initial character feature and the target initial word feature corresponding to the character can be concatenated to obtain the fusion feature corresponding to the character. Alternatively, the initial character feature and the target initial word feature may be fusion-encoded to obtain the fusion feature, which may be determined based on the requirements of the actual application scenario and is not limited herein. In this way, the fusion feature of each character in the sample sentence contains both the information of the character and the information of the word in which the character is located, so that the information of each character and the semantic information of each word in the sample sentence can be fully understood during model training.

After the fusion features corresponding to the characters in the first character sequence of the sample sentence are determined, the feature sequence formed by these fusion features can be determined as the sample feature sequence corresponding to the characters in the sample sentence. Alternatively, the fusion features may be further encoded, and the sample feature sequence corresponding to the characters in the sample sentence is obtained based on the encoding result.

Referring to FIG. 4, FIG. 4 is a schematic diagram of a scenario for determining a sample feature sequence according to an embodiment of the present application. As shown in FIG. 4, the sample sentence is "tomorrow weather is very good", the first character sequence corresponding to the sample sentence is "Ming, day, [MASK], [MASK], very, good", and the corresponding first word sequence is "tomorrow, [MASK], very good". Each character in the first character sequence is encoded by a character fine-grained encoding layer to obtain the initial character features T₀, T₁, T₂, T₃, T₄, T₅ corresponding to the characters, and each word in the first word sequence is encoded by a word coarse-grained encoding layer to obtain the initial word features S₀, S₁, S₂ corresponding to the words.

Further, for each character in the first character sequence: based on the initial character feature T₀ corresponding to "Ming" and the corresponding target word feature S₀, the fusion feature H₀ corresponding to "Ming" can be obtained; based on the initial character feature T₁ corresponding to "day" and the corresponding target word feature S₀, the fusion feature H₁ corresponding to "day" can be obtained; based on the initial character feature T₂ corresponding to the first MASK and the word feature S₁ corresponding to the MASK in the first word sequence, the fusion feature H₂ corresponding to the first MASK can be obtained; based on the initial character feature T₃ corresponding to the second MASK and the word feature S₁ corresponding to the MASK in the first word sequence, the fusion feature H₃ corresponding to the second MASK can be obtained; based on the initial character feature T₄ corresponding to "very" and the corresponding target word feature S₂, the fusion feature H₄ corresponding to "very" can be obtained; and based on the initial character feature T₅ corresponding to "good" and the corresponding target word feature S₂, the fusion feature H₅ corresponding to "good" can be obtained. Based on the above fusion features, the sample feature sequence H₀, H₁, H₂, H₃, H₄, H₅ corresponding to the sample sentence can be obtained.
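The fusion by concatenation walked through above can be sketched as follows. This is an illustrative sketch only: features are plain Python lists standing in for encoder outputs, the toy values are made up, and the names fuse_by_concatenation and char_to_word are hypothetical. The alternative fusion encoding mentioned in the text is not shown.

```python
def fuse_by_concatenation(char_features, word_features, char_to_word):
    """Build H_i by concatenating T_i with the feature S_j of the word containing character i."""
    return [char_features[i] + word_features[w] for i, w in enumerate(char_to_word)]


# Toy stand-ins for T0..T5 (two dimensions each) and S0..S2 (one dimension each).
T = [[float(i), float(i)] for i in range(6)]
S = [[10.0 * j] for j in range(3)]
# Characters 0-1 belong to word 0, characters 2-3 to word 1, characters 4-5 to word 2.
char_to_word = [0, 0, 1, 1, 2, 2]

H = fuse_by_concatenation(T, S, char_to_word)
# H[2] == [2.0, 2.0, 10.0], i.e. T2 joined with S1
```

Both masked character positions receive the same word feature S1, which mirrors how H₂ and H₃ are formed in the example.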

Step S23, for each sample sentence, inputting the sample feature sequence corresponding to the sample sentence into the initial language model to obtain the sample prediction feature sequence corresponding to each character in the sample sentence, and determining the predicted word corresponding to the word to be predicted in the sample sentence based on the sample prediction feature sequence.

In some possible embodiments, the initial language model may be a language prediction model, including but not limited to the Bidirectional Encoder Representations from Transformers (BERT) model and other models with character and word prediction capability, which is not limited herein.

The initial language model may also be a combined model of a language prediction model and a Conditional Random Field (CRF), such as a combined model of a BERT model and a CRF.

For each sample sentence in the training sample set, sequence labeling can be performed on the sample sentence to obtain character labeling information corresponding to each character in the sample sentence. The character labeling information corresponding to a character indicates whether the character is a component of a word in the sentence and, if so, its position within that word. For example, the first character of each word in the sample sentence may be labeled "B", indicating the beginning of a word; the characters of a word other than its first character may be labeled "I", indicating a non-initial component of a word; characters that do not belong to any word may be labeled "O"; and the last character of the sample sentence may be labeled "E", indicating the ending character of the sample sentence.

Further, the sample feature sequence corresponding to the sample sentence and the corresponding character labeling information are input into the initial language model. The sample prediction feature sequence corresponding to the sample sentence is determined by the language prediction model in the initial language model, and the character labeling information corresponding to each sample feature in the sample feature sequence is learned by the conditional random field, so that the predicted labeling information of each predicted character corresponding to the sample prediction feature sequence can be determined based on the initial language model. Because the predicted characters corresponding to the sample prediction feature sequence include the characters predicted for the masked positions, the words in the predicted sentence composed of the predicted characters can be determined based on the predicted labeling information. In this way, one round of training of the initial language model is completed, so that the initial language model can predict the predicted words in the sample sentence.

Referring to FIG. 5, FIG. 5 is a schematic diagram of a scenario of an initial language model provided in an embodiment of the present application. The initial language model shown in FIG. 5 includes a BERT model and a conditional random field. The sample feature sequence corresponding to a sample sentence is input into the initial language model, and the sample prediction feature sequence corresponding to the sample sentence is obtained through the BERT model. The predicted character corresponding to each sample prediction feature in the sample prediction feature sequence can be determined through a Softmax function, and the labeling information corresponding to each predicted character can be determined through the conditional random field. Each word in the predicted sentence formed by the predicted characters is then determined based on the labeling information of the predicted characters, which also determines the predicted word corresponding to the word to be predicted in the sample sentence.
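The labeling scheme described above can be sketched as follows. This is a hedged sketch: the text does not spell out how the sentence-final "E" interacts with the other labels, so the sketch simply overrides the label of the last character, which is an assumption. The input format (a list of text segments flagged as word or non-word) and the name label_characters are also hypothetical.

```python
def label_characters(segments):
    """Assign B/I/O labels per character, then mark the sentence-final character as E.

    segments: ordered list of (text, is_word) pairs covering the whole sentence.
    """
    labels = []
    for text, is_word in segments:
        for pos in range(len(text)):
            if not is_word:
                labels.append("O")   # character outside any word
            elif pos == 0:
                labels.append("B")   # first character of a word
            else:
                labels.append("I")   # non-initial character of a word
    if labels:
        labels[-1] = "E"             # assumed: E overrides the last character's label
    return labels


labels = label_characters([("ab", True), (",", False), ("cde", True)])
# labels == ["B", "I", "O", "B", "I", "E"]
```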

Step S24, determining a training loss value based on the words to be predicted and the corresponding predicted words in each sample sentence, performing iterative training on the initial language model according to the training loss value and the training sample set, and, when the training loss value meets the training end condition, determining the model at the end of training as the language model.

In some possible embodiments, the predicted word corresponding to the word to be predicted in each sample sentence may be determined based on the initial language model, and therefore, based on the word to be predicted and the corresponding predicted word in each sample sentence, a corresponding training loss value in the training process for the initial language model may be determined.

The training loss value can represent the difference between the word to be predicted and the corresponding predicted word in each sample sentence. And the training loss value may be determined based on a negative log-likelihood function, a cross entropy loss function, and the like, which is not limited herein.

Further, iterative training is performed on the initial language model according to the training loss value and each sample sentence in the training sample set, and the model parameters of the initial language model are continuously adjusted during the iterative training. When the training loss value meets the training end condition, the model at the end of training is determined as the final language model.

The training end condition may be that the training loss value tends to be stable, or that a plurality of consecutive training loss values are each smaller than a preset threshold and the difference between each of these values and the preceding training loss value is smaller than a preset difference, and the like. The condition may be determined based on the requirements of the actual application scenario, and is not limited herein.
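As a non-limiting illustration, the second training end condition described above may be sketched as follows. The function name `training_finished` and the parameters `window`, `loss_threshold` and `diff_threshold` are hypothetical and are not taken from the present application; the default values are placeholders only.

```python
def training_finished(losses, window=3, loss_threshold=0.05, diff_threshold=0.01):
    """Check an assumed training end condition on a history of loss values.

    Returns True when the last `window` losses are all below `loss_threshold`
    and each of them differs from the loss just before it by less than
    `diff_threshold`. All names and defaults here are illustrative.
    """
    if len(losses) < window + 1:
        return False  # not enough history to compare against a previous loss
    recent = losses[-window:]
    previous = losses[-window - 1:-1]
    below_threshold = all(loss < loss_threshold for loss in recent)
    stable = all(abs(cur - prev) < diff_threshold
                 for cur, prev in zip(recent, previous))
    return below_threshold and stable
```

In such a sketch the check would be called once after each training iteration with the accumulated list of loss values.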

Optionally, in a case where the initial language model includes a conditional random field, in determining the training loss value, a first training loss value may be determined based on the word to be predicted and the corresponding predicted word in each sample sentence, and a second training loss value may be determined based on the word labeling information corresponding to each sample sentence and the word labeling information determined by the initial language model. The second training loss value represents the difference between the real word labeling information of the sample sentence and the word labeling information determined by the initial language model, and may be determined based on a negative log-likelihood function or a cross entropy loss function, which is not limited herein.

Further, a training loss value in the model training process may be determined based on the first training loss value and the second training loss value, so as to iteratively train the initial language model according to the training loss value and each sample sentence in the training sample set, and when the training loss value meets the training ending condition, the model at the training end is determined as the final language model.

The sum of the first training loss value and the second training loss value may be determined as the training loss value in the model training process. Alternatively, a loss weight corresponding to the first training loss value and a loss weight corresponding to the second training loss value may be determined, and the weighted sum of the first training loss value and the second training loss value may then be determined as the training loss value in the model training process.
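As a non-limiting illustration, the combination of the two training loss values may be sketched as follows. The function names and the default weights are hypothetical; the negative log-likelihood shown is the standard averaged form and is only one of the loss functions the application permits.

```python
import math


def negative_log_likelihood(true_label_probs):
    """Average negative log-likelihood of the probabilities that the model
    assigned to the correct labels (illustrative helper)."""
    return -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)


def training_loss(first_loss, second_loss, first_weight=1.0, second_weight=1.0):
    """Weighted sum of the first (word prediction) and second (word labeling)
    training loss values. With both weights equal to 1.0 this is the plain sum."""
    return first_weight * first_loss + second_weight * second_loss
```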

It should be particularly noted that the above implementation of determining the training loss value in the model training process is merely an example; the specific implementation may be determined based on the requirements of the actual application scenario, and is not limited herein.

Based on the training mode, the finally obtained language model has the capabilities of predicting words and determining word labeling information of the to-be-predicted sentence, so that when the to-be-predicted sentence comprises the uncommon words or the new words, the uncommon words and the new words in the to-be-predicted sentence can be accurately distinguished based on the language model.

Meanwhile, the text types of training corpora in the same field are similar, whereas the text types of training corpora in different fields differ greatly; for example, the text types of training corpora in the game field and in the urban traffic field differ greatly. Therefore, a training sample set can be constructed based on the training corpora of a single field, so that training sample sets for different fields are obtained. Furthermore, the same initial language model can be trained separately on the training sample sets of the different fields to obtain a language model suitable for each field. The implementation manner of training the initial language model based on the training samples in any field is not described herein again.

Step S12, determining a new word in the sentence to be predicted based on the prediction characteristic sequence.

In some feasible embodiments, after the final language model is obtained through training based on the method shown in fig. 2, for any statement to be predicted, the initial feature sequence corresponding to each word in the statement to be predicted may be determined, the initial feature sequence is input to the language model to obtain the predicted feature sequence corresponding to each word in the statement to be predicted, and then the word labeling information corresponding to each word in the statement to be predicted may be determined based on the predicted feature sequence.

Candidate words in the sentence to be predicted are then determined based on the word labeling information of each character in the sentence to be predicted. For each character in the sentence to be predicted, whether the character belongs to a word in the sentence can be determined based on the word labeling information corresponding to the character; if it does, the specific position of the character within that word can also be determined based on the word labeling information, and each word in the sentence to be predicted can be determined in this manner. At this point, the word labeling information only identifies the words in the sentence to be predicted and cannot indicate whether a word is a new word, so each word determined from the sentence to be predicted based on the word labeling information is treated as a candidate word for a new word.

Further, the candidate words determined from the sentence to be predicted can be matched against a preset word bank. For any candidate word, if the candidate word is included in the preset word bank, the candidate word is determined to be an existing word; if the candidate word is not included in the preset word bank, the candidate word can be determined as a new word. In this way, the new words in the sentence to be predicted can be determined based on the language model and the preset word bank.
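As a non-limiting illustration, the determination of candidate words from the word labeling information and the matching against the preset word bank may be sketched as follows. The sketch assumes a B/M/E/S labeling scheme (beginning, middle, end of a word, and single-character word), which is a common convention and is not confirmed by the present application; the function names are hypothetical.

```python
def decode_candidates(chars, tags):
    """Rebuild words from per-character labels under an assumed B/M/E/S scheme.

    Malformed label sequences (for example an "E" without a preceding "B")
    are skipped rather than repaired.
    """
    words, buffer = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            buffer = []
            words.append(ch)
        elif tag == "B":          # first character of a multi-character word
            buffer = [ch]
        elif tag == "M":          # middle character; only valid after a "B"
            if buffer:
                buffer.append(ch)
        elif tag == "E":          # last character closes the current word
            if buffer:
                buffer.append(ch)
                words.append("".join(buffer))
            buffer = []
    return words


def find_new_words(chars, tags, preset_word_bank):
    """Candidate words that are absent from the preset word bank."""
    return [w for w in decode_candidates(chars, tags) if w not in preset_word_bank]
```

For instance, with the characters of "我爱元宇宙", the labels S, S, B, M, E and a word bank containing only "我" and "爱", the sketch would report "元宇宙" as a new word.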

In some possible embodiments, the preset word bank in the embodiment of the present application may be stored in a server, a database, a cloud storage, or a blockchain. When any candidate word needs to be matched with the preset word bank, the candidate word can be searched for directly in the storage space of the preset word bank so as to determine whether the candidate word is included in the preset word bank.

The database may be regarded as an electronic file cabinet, which is a place for storing electronic files, and may be used to store a preset lexicon and a training sample set used in a model training process in the present application. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A blockchain is essentially a decentralized database, a string of data blocks that are associated using cryptography.

In the application, each data block in the block chain can store a preset word stock and a training sample set used in a model training process. The cloud storage is a new concept extended and developed from a cloud computing concept, and refers to that a large number of storage devices (storage devices are also called storage nodes) of various types in a network are gathered through application software or application interfaces to cooperatively work through functions such as cluster application, a grid technology, a distributed storage file system and the like, and a preset word bank and a training sample set used in a model training process are jointly stored.

In some possible embodiments, the preset word bank is a word bank including common words or commonly used words, and may be constructed from words obtained through big data, a crawler technology, or the like, or may be constructed based on dictionary data, and the like, which is not limited herein.

In addition, hot words or new words in the original corpora of different fields can be discovered periodically so as to build and update the preset word bank.

Specifically, new words tend to appear in concentrated form within the same field; for example, authors of science fiction works often create many new words, and netizens often create new words in the social field. Therefore, a plurality of candidate texts in which new words may exist can be screened from the original corpora of different fields. Further, for the candidate text in any field, the word features of each character in the candidate text in the field can be determined, a target word can be determined from the candidate text in the field based on the word features of all the characters, and if the preset word bank does not include the target word, the target word can be used as a new word to update the preset word bank.

The word features include, but are not limited to, one or more of word frequency, mutual information, degree of solidity, and information entropy, and may be determined based on the requirements of the actual application scenario, which is not limited herein.

As an example, if the word frequency of a certain character combination in the candidate text of a certain field is higher than the preset word frequency, the character combination may be determined as a target word, and in the case that the preset word bank does not include the target word, the preset word bank may be updated based on the target word.
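As a non-limiting illustration, the word frequency screening may be sketched as follows. The sketch counts every character combination of 2 to 4 characters; the length bounds, the function names and the threshold argument are hypothetical and are not specified by the present application.

```python
from collections import Counter


def count_combinations(texts, min_len=2, max_len=4):
    """Count every contiguous character combination of the given lengths."""
    counts = Counter()
    for text in texts:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def frequent_combinations(texts, preset_frequency, min_len=2, max_len=4):
    """Character combinations whose count is higher than the preset frequency."""
    counts = count_combinations(texts, min_len, max_len)
    return {combo: c for combo, c in counts.items() if c > preset_frequency}
```

The combinations returned by such a sketch would then be compared with the preset word bank, and only those not yet included would be added.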

Alternatively, a term frequency-inverse document frequency (TF-IDF) weight corresponding to the character combination may be determined based on the word frequency of the character combination. If the TF-IDF weight of the character combination is higher than a preset weight, the character combination is determined as a target word, and the preset word bank is updated based on the target word if the preset word bank does not include the target word.
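As a non-limiting illustration, a TF-IDF weight for a character combination may be sketched as follows. The present application does not specify a TF-IDF variant; the sketch assumes the raw count of the combination in one candidate text as the term frequency and log(N / df) as the inverse document frequency, where N is the number of candidate texts and df is the number of texts containing the combination. The function names are hypothetical.

```python
import math


def count_occurrences(text, combo):
    """Number of (possibly overlapping) occurrences of combo in text."""
    n = len(combo)
    return sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == combo)


def tf_idf(combo, text, all_texts):
    """TF-IDF weight of combo in one candidate text (assumed variant)."""
    tf = count_occurrences(text, combo)
    df = sum(1 for t in all_texts if combo in t)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_texts) / df)
```

Under this variant a combination that appears in every candidate text receives a weight of zero, which reflects that it does not distinguish any particular text.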

As an example, the degree of solidity corresponding to each character combination in each field can be determined. The degree of solidity represents how closely the characters in a character combination are bound together, and the higher the degree of solidity, the more likely the character combination is a new word. Therefore, a character combination whose degree of solidity is greater than a preset degree of solidity can be determined as a target word, and the preset word bank can be updated based on the target word in the case that the preset word bank does not include the target word.
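As a non-limiting illustration, the degree of solidity may be sketched as follows. The present application does not give a formula; the sketch assumes a commonly used definition in which the probability of the combination is divided by the product of the probabilities of its two parts, taking the minimum over all split points. Probabilities are estimated as counts divided by the text length, and all names are hypothetical.

```python
from collections import Counter


def substring_probabilities(text, max_len):
    """Estimate the probability of every substring of up to max_len characters
    as its count divided by the text length (an assumed normalisation)."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c / len(text) for s, c in counts.items()}


def solidity(combo, probs):
    """Minimum over all split points of P(combo) / (P(left) * P(right)).

    combo must have at least two characters and must occur in the text
    from which probs was built.
    """
    return min(
        probs[combo] / (probs[combo[:k]] * probs[combo[k:]])
        for k in range(1, len(combo))
    )
```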

As an example, the mutual information can reflect the probability that the two characters of a two-character combination occur together; the larger the mutual information value, the higher the correlation between the two characters and the higher the possibility that they form a new word. Therefore, the mutual information between adjacent characters in the candidate texts of each field can be determined. If the mutual information value is larger than a preset mutual information value, the character combination can be determined as a target word, and in the case that the preset word bank does not include the target word, the preset word bank can be updated based on the target word.
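As a non-limiting illustration, the mutual information between two adjacent characters may be sketched with pointwise mutual information, log2(P(xy) / (P(x) P(y))). Using the pointwise form, and estimating the probabilities from character and adjacent-pair counts of a single text, are assumptions of this sketch; the function name is hypothetical.

```python
import math
from collections import Counter


def pointwise_mutual_information(text, first, second):
    """PMI of the adjacent pair (first, second) estimated from one text.

    Returns negative infinity when the pair never occurs adjacently.
    """
    char_counts = Counter(text)
    pair_counts = Counter(zip(text, text[1:]))
    p_pair = pair_counts[(first, second)] / max(len(text) - 1, 1)
    if p_pair == 0:
        return float("-inf")
    p_first = char_counts[first] / len(text)
    p_second = char_counts[second] / len(text)
    return math.log2(p_pair / (p_first * p_second))
```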

As an example, the occurrence frequency of each left adjacent character and the occurrence frequency of each right adjacent character of any character combination in the candidate texts of each field may be determined. The left information entropy may then be determined based on the occurrence frequencies of the left adjacent characters and the occurrence frequency of the character combination, and the right information entropy may be determined based on the occurrence frequencies of the right adjacent characters and the occurrence frequency of the character combination. The larger the left and right information entropy, the more varied the characters surrounding the character combination, which means that the character combination has a greater degree of freedom and is more likely to be an independent word. If both the left information entropy and the right information entropy are larger than a preset threshold, the character combination can be determined to be a target word, and the preset word bank can be updated based on the target word in the case that the preset word bank does not include the target word.
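As a non-limiting illustration, the left and right information entropy of a character combination may be sketched as follows. The sketch assumes Shannon entropy in bits over the observed neighbouring characters and ignores occurrences at the start or end of the text, which have no neighbour on that side; the function name is hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(text, combo):
    """Return (left_entropy, right_entropy) of combo within text."""
    n = len(combo)
    left, right = Counter(), Counter()
    for i in range(len(text) - n + 1):
        if text[i:i + n] == combo:
            if i > 0:
                left[text[i - 1]] += 1       # character just before the combination
            if i + n < len(text):
                right[text[i + n]] += 1      # character just after the combination

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)
```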

As an example, after a plurality of target words are determined based on the plurality of target word determination manners described above, the same target word may have been determined by several manners at the same time. Therefore, for each target word, the determination manners by which the target word was determined and the weight corresponding to each of these manners may be determined, and the sum of the weights corresponding to the determination manners of the target word may be calculated. The target words whose weight sum is higher than a preset threshold are then matched with the preset word bank, and the preset word bank is updated based on a target word in the case that the preset word bank does not include the target word.
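As a non-limiting illustration, the weight sum over determination manners may be sketched as follows. The manner names, the weights and the threshold used in the usage note are hypothetical values chosen only to show the mechanism.

```python
def select_by_weight_sum(words_by_manner, manner_weights, preset_threshold):
    """Sum, for each target word, the weights of the manners that found it,
    and keep the words whose weight sum is higher than the preset threshold."""
    weight_sums = {}
    for manner, words in words_by_manner.items():
        for word in set(words):  # count each manner at most once per word
            weight_sums[word] = weight_sums.get(word, 0.0) + manner_weights[manner]
    return {w: s for w, s in weight_sums.items() if s > preset_threshold}
```

For example, if the word frequency, mutual information and information entropy manners carry weights 0.25, 0.5 and 0.25 and only "元宇宙" is found by all three, it is the only word whose weight sum (1.0) exceeds a threshold of 0.5.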

As an example, after a plurality of target words are determined based on any one or more of the above manners, the target words may be encoded to obtain encoding features and then clustered based on the encoding features. Target words that are likely to be new words are then screened out from each cluster of target words, and in a case that the preset word bank does not include these target words, the preset word bank is updated based on them.

Alternatively, with the rapid development of social media, the media information on each social media platform tends to attract a large number of comments from netizens, and new words may be generated in this large amount of comment information. Therefore, comment information corresponding to media information in each field, such as comment information in a chat community, barrage information of a video website, and the like, can be obtained, which is not limited herein.

Further, a first original word and a second original word may be determined from each piece of comment information, where the first original word and the second original word are any two different words in the comment information. For example, the names of people in the comment information can be determined based on named entity recognition, and two of the names are determined as the first original word and the second original word. After the first original word and the second original word are determined, the target comment information that includes both the first original word and the second original word can be determined, together with the number of pieces of target comment information and the total number of likes (praise) corresponding to the target comment information. The evaluation score of the combined word composed of the first original word and the second original word is then determined based on the number of pieces of target comment information and the number of likes corresponding to the target comment information.

The higher the evaluation score, the more widely the co-occurrence of the two original words is recognized, and the more likely the combined word is to become a new word. On this basis, if the evaluation score corresponding to the combined word is greater than the preset evaluation score, the combined word can be used as a new word to update the preset word bank. Based on this implementation, the word combinations in each piece of comment information can be traversed, so that as many as possible of the new words generated in the comment information of the media information are obtained.
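As a non-limiting illustration, the evaluation score of a combined word may be sketched as follows. The present application states only that the score depends on the number of pieces of target comment information and the number of likes; the linear formula, the `like_weight` parameter and the dictionary layout of a comment used here are assumptions made for illustration.

```python
def combined_word_score(comments, first_word, second_word, like_weight=0.5):
    """Score the pair (first_word, second_word) from comment data.

    comments is a list of dicts with hypothetical keys "text" and "likes".
    The score is the number of comments containing both words plus
    like_weight times the total likes of those comments (assumed formula).
    """
    target = [c for c in comments
              if first_word in c["text"] and second_word in c["text"]]
    total_likes = sum(c["likes"] for c in target)
    return len(target) + like_weight * total_likes
```

The resulting score would be compared with the preset evaluation score to decide whether the combined word is added to the preset word bank.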

In the embodiment of the application, the sample feature sequence corresponding to the sample sentence is determined from the first character sequence and the first word sequence corresponding to the sample sentence, both of which include preset mask characters, so that the sample feature corresponding to each character in the sample sentence may include related information of the character as well as related information of the word in which the character is located. On this basis, the language model obtained by training on the sample feature sequences can determine the prediction feature sequence corresponding to the sentence to be predicted while fully understanding the meaning of each character in the sentence to be predicted and the meaning of the sentence as a whole, so that the new words in the sentence to be predicted can be determined accurately and efficiently based on the prediction feature sequence, and the applicability is high.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a new word discovery apparatus provided in an embodiment of the present application. The device for discovering new words provided by the embodiment of the application comprises:

the feature processing module 61 is configured to determine an initial feature sequence corresponding to each word in a sentence to be predicted, and input the initial feature sequence into a language model to obtain a predicted feature sequence corresponding to each word in the sentence to be predicted;

a new word determining module 62, configured to determine a new word in the sentence to be predicted based on the prediction feature sequence;

In some possible embodiments, for each of the sample sentences, the model training device is configured to:

respectively encoding each character in a first character sequence of the sample sentence and the preset mask character to obtain an initial character feature corresponding to each character in the first character sequence;

respectively encoding each word in a first word sequence of the sample sentence and the preset mask character to obtain an initial word feature corresponding to each word in the first word sequence;

and determining a sample feature sequence corresponding to the sample sentence based on the initial character features corresponding to the characters in the first character sequence and the initial word features corresponding to the words in the first word sequence.

for each character in the first character sequence, determining a target initial word feature of the target word corresponding to the character in the first word sequence, and determining a fusion feature corresponding to the character based on the initial character feature corresponding to the character and the target initial word feature;

and determining a sample feature sequence corresponding to the sample sentence based on the fusion features corresponding to the characters in the first character sequence.

In some possible embodiments, the new word determining module 62 is configured to:

determining word marking information of each character in the sentence to be predicted based on the prediction characteristic sequence;

determining candidate words in the sentence to be predicted based on the word marking information;

and for any candidate word, if the candidate word is not included in the preset word library, determining the candidate word as a new word.

In some possible embodiments, the model training apparatus is configured to:

determining, in the initial character sequence, any target character sequence whose text length is a preset text length;

determining words belonging to preset word types in the target word sequence as words to be predicted, and replacing each word to be predicted with a preset mask character.

In some possible embodiments, the new word discovering apparatus further includes a new word updating module 63, and the new word updating module 63 is further configured to:

determining candidate texts from original corpora in different fields;

for a candidate text in any field, determining word characteristics of characters in the candidate text in the field, and determining a first target word from the candidate text in the field based on the word characteristics, wherein the word characteristics comprise at least one of word frequency, mutual information, degree of solidity and information entropy;

and updating the preset word bank based on the first target word.

In some possible embodiments, the new word update module 63 is further configured to:

obtaining comment information corresponding to media information in each field, and determining a first original word and a second original word from each comment information, wherein the first original word and the second original word are any two different words in each comment information;

determining target comment information which simultaneously comprises the first original word and the second original word in each comment information;

determining the evaluation score of the combined word corresponding to the first original word and the second original word based on the number of the target comment information and the number of praise corresponding to the target comment information;

and if the evaluation score is larger than a preset evaluation score, updating the preset word bank based on the combined word.

In a specific implementation, the new word discovery apparatus may execute, through each built-in functional module thereof, the implementation manners provided in each step in fig. 1 and/or fig. 2, which may be referred to specifically for the implementation manners provided in each step, and is not described herein again.

Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 7, the electronic device 1000 in the present embodiment may include a processor 1001, a network interface 1004, and a memory 1005, and may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and may optionally also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 7, the memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.

In the electronic device 1000 shown in fig. 7, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:

the processor 1001, when training the language model, is configured to:

In some possible embodiments, for each of the sample statements, the processor 1001 is configured to:

In some possible embodiments, the processor 1001 is configured to:

In some possible embodiments, the processor 1001 is further configured to:

determining candidate texts from original corpora in different fields;

and updating the preset word bank based on the first target word.

In some possible embodiments, the processor 1001 is further configured to:

It should be understood that in some possible embodiments, the processor 1001 may be a Central Processing Unit (CPU), and may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general purpose processor may be a microprocessor or any conventional processor or the like. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

In a specific implementation, the electronic device 1000 may execute, through each built-in functional module thereof, the implementation manners provided in each step in fig. 1 and/or fig. 2, which may be referred to specifically for the implementation manners provided in each step, and are not described herein again.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and is executed by a processor to implement the method provided in each step in fig. 1 and/or fig. 2, which may specifically refer to an implementation manner provided in each step, and is not described herein again.

The computer-readable storage medium may be an internal storage unit of the new word discovery apparatus and/or the electronic device provided in any of the foregoing embodiments, such as a hard disk or a memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, and the like, which are provided on the electronic device. The computer-readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), and the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

Embodiments of the present application provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, implement the method provided in the steps of fig. 1 and/or fig. 2.

The terms "first", "second", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the scope of the present application, which is defined by the appended claims.

Claims

1. A method for discovering new words, the method comprising:

determining an initial characteristic sequence corresponding to each word in a sentence to be predicted, inputting the initial characteristic sequence into a language model to obtain a predicted characteristic sequence corresponding to each word in the sentence to be predicted, and determining a new word in the sentence to be predicted based on the predicted characteristic sequence;

wherein the language model is trained based on the following modes:

obtaining a training sample set, wherein the training sample set comprises a plurality of sample sentences;

for each sample sentence, determining an initial character sequence of the sample sentence, replacing the words to be predicted in the initial character sequence with preset mask characters to obtain a first character sequence, determining an initial word sequence of the sample sentence, replacing the words to be predicted in the initial word sequence with the preset mask characters to obtain a first word sequence, and determining a sample feature sequence corresponding to the sample sentence based on the first character sequence and the first word sequence of the sample sentence;

determining a training loss value based on the words to be predicted and the corresponding predicted words in each sample sentence, performing iterative training on the initial language model according to the training loss value and the training sample set, and determining the model at the end of training as the language model when the training loss value meets the training end condition.

2. The method of claim 1, wherein for each sample sentence, determining a sample feature sequence corresponding to the sample sentence based on the first character sequence and the first word sequence of the sample sentence comprises:

and determining a sample feature sequence corresponding to the sample sentence based on the initial character features corresponding to the characters in the first character sequence and the initial word features corresponding to the words in the first word sequence.

3. The method of claim 2, wherein for each sample sentence, the determining the sample feature sequence corresponding to the sample sentence based on the initial character features corresponding to each character in the first character sequence and the initial word features corresponding to each word in the first word sequence comprises:

and determining the sample feature sequence corresponding to the sample sentence based on the fusion features corresponding to the words in the first word sequence.

4. The method of claim 1, wherein the determining a new word in the sentence to be predicted based on the predicted characteristic sequence comprises:

determining word marking information of each word in the sentence to be predicted based on the predicted characteristic sequence;

5. The method of claim 1, wherein replacing the word to be predicted in the initial word sequence with a preset mask character to obtain a first word sequence comprises:

determining words belonging to a preset word type in the target word sequence as words to be predicted, and replacing each word to be predicted with a preset mask character.
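The masking step can be sketched as follows, assuming the sample sentence is held both as a segmented word sequence and as a character sequence, and that each word carries a type tag. The `[MASK]` token, the tag values, and the function name are hypothetical and chosen only for illustration.

```python
MASK = "[MASK]"  # assumed form of the preset mask character


def mask_sequences(words, word_types, preset_types):
    """Mask every word whose type is in preset_types.

    Returns the masked character sequence, the masked word sequence, and the
    indices of the masked words. Each masked word is replaced by one mask
    token at word level and by one mask token per character at character
    level, so the two sequences stay aligned.
    """
    masked_chars, masked_words, masked_idx = [], [], []
    for i, (word, wtype) in enumerate(zip(words, word_types)):
        if wtype in preset_types:
            masked_words.append(MASK)
            masked_chars.extend([MASK] * len(word))
            masked_idx.append(i)
        else:
            masked_words.append(word)
            masked_chars.extend(word)  # split the word into its characters
    return masked_chars, masked_words, masked_idx


# Hypothetical segmented sentence with part-of-speech style type tags.
words = ["我", "喜欢", "元宇宙"]
types = ["r", "v", "n"]
chars_out, words_out, idx = mask_sequences(words, types, preset_types={"n"})
```

Here only the noun is treated as a word to be predicted, so its three characters are masked together rather than independently.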

6. The method of claim 4, further comprising:

determining candidate texts from original corpora in different fields;

for a candidate text in any field, determining word characteristics of characters in the candidate text in the field, and determining a first target word from the candidate text in the field based on the word characteristics, wherein the word characteristics comprise at least one of word frequency, mutual information, cohesion (degree of solidification) and information entropy;

and updating the preset word bank based on the first target word.
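The statistical word characteristics named in this claim can be sketched as below. This is one common way to compute them and is not taken from the claim itself: cohesion is approximated as the minimum pointwise mutual information over all two-way splits of a candidate string, and information entropy is taken over the characters adjacent to the candidate on each side. All function names are hypothetical, and probabilities are normalized by the text length for simplicity.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_len=3):
    """Count every n-gram up to max_len and its left/right neighbor characters."""
    counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1
    return counts, left, right


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def cohesion(gram, counts, total):
    """Minimum pointwise mutual information over all two-way splits of gram."""
    if len(gram) < 2:
        raise ValueError("cohesion needs a candidate of at least 2 characters")
    p = counts[gram] / total
    return min(
        math.log(p / ((counts[gram[:i]] / total) * (counts[gram[i:]] / total)))
        for i in range(1, len(gram))
    )


def word_features(gram, counts, left, right, total):
    """Collect the characteristics used to decide whether gram is a target word."""
    return {
        "frequency": counts[gram],
        "cohesion": cohesion(gram, counts, total),
        "left_entropy": entropy(left[gram]),
        "right_entropy": entropy(right[gram]),
    }


text = "abab"
counts, left, right = ngram_stats(text)
features = word_features("ab", counts, left, right, total=len(text))
```

A candidate with high frequency, high cohesion, and high neighbor entropy on both sides tends to behave like a standalone word; the thresholds for accepting it as a first target word would have to be chosen per field and are not specified here.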

7. The method of claim 4, further comprising:

determining the evaluation score of a combined word corresponding to the first original word and the second original word based on the number of pieces of target comment information and the number of likes corresponding to the target comment information;

8. A new word discovery apparatus, the apparatus comprising:

the characteristic processing module is used for determining an initial characteristic sequence corresponding to each word in the sentence to be predicted and inputting the initial characteristic sequence into a language model to obtain a predicted characteristic sequence corresponding to each word in the sentence to be predicted;

9. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;

the memory is used for storing a computer program;

the processor is configured to perform the method of any of claims 1 to 7 when the computer program is invoked.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 7.

11. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.