CN114398482A - Dictionary construction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114398482A
CN114398482A (application CN202111475744.5A)
Authority
CN
China
Prior art keywords
word
category
target
named entity
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111475744.5A
Other languages
Chinese (zh)
Inventor
胡飞雄
朱磊
朱晓燕
张雨
付晨阳
何萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Cyber Tianjin Co Ltd
Original Assignee
Tencent Cyber Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Cyber Tianjin Co Ltd filed Critical Tencent Cyber Tianjin Co Ltd
Priority to CN202111475744.5A
Publication of CN114398482A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a dictionary construction method and apparatus, an electronic device, and a storage medium, relating to the technical field of natural language processing. After a text to be processed and a basic dictionary containing a plurality of word categories are obtained, at least one target word whose semantic information indicates a named entity is selected from the text to be processed based on a trained category prediction model; probability values of the at least one target word belonging to each of the plurality of word categories are determined, and each target word is attributed to the word categories whose corresponding probability values meet a set probability condition. Because the target words that belong to named entities, and the probability values with which they belong to each named entity category in the basic dictionary, are determined from semantic information, the target words can be accurately attributed to the corresponding named entity categories, improving the accuracy of dictionary construction.

Description

Dictionary construction method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of natural language processing, in particular to a dictionary construction method and device, electronic equipment and a storage medium.
Background
When recognizing information in a text, a dictionary is usually constructed first: words of various categories are added to the dictionary, and the constructed dictionary is then used to recognize the text, so that the categories of the words contained in the text can be recognized quickly and accurately.
In the related art, dictionary construction is often performed using an N-Gram-based method: the text is first segmented with an N-Gram algorithm to obtain N words, which are then compared against several existing dictionaries, and new words selected from the N words are added to the dictionary.
Because the new words extracted by this method depend on the N words produced by the N-Gram algorithm, inaccurate segmentation yields inaccurate new words, and the accuracy of the constructed dictionary is therefore low.
Disclosure of Invention
In order to solve technical problems in the related art, embodiments of the present application provide a dictionary construction method, apparatus, electronic device, and storage medium, which can improve the accuracy of dictionary construction.
The embodiment of the application provides the following specific technical scheme:
a dictionary construction method, comprising:
acquiring a text to be processed and a basic dictionary; wherein the base dictionary contains a plurality of word categories;
determining at least one candidate word contained in the text to be processed and respective semantic information of the at least one candidate word based on the trained category prediction model;
selecting at least one target word which meets set semantic conditions according to the respective semantic information of the at least one candidate word through the category prediction model, and determining probability values of the at least one target word belonging to the plurality of word categories respectively;
and respectively attributing the at least one target word to the word categories with the corresponding probability values meeting the set probability conditions.
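The four claimed steps can be sketched as a small attribution routine. This is a hypothetical illustration, not the patent's implementation: the function name `build_dictionary`, the dictionary-of-dictionaries input shape, and the 0.5 threshold are all illustrative assumptions.

```python
# Hypothetical sketch of the claimed flow: given per-word probability
# values produced by a trained category prediction model, attribute each
# target word to the word categories that meet the set probability
# condition. All names and the threshold are illustrative assumptions.

def build_dictionary(predictions, base_dictionary, prob_threshold=0.5):
    """Attribute each target word to every word category whose predicted
    probability value meets the set probability condition."""
    for word, category_probs in predictions.items():
        for category, prob in category_probs.items():
            if category in base_dictionary and prob > prob_threshold:
                base_dictionary[category].add(word)
    return base_dictionary

# Example probability values, as if output by the category prediction model.
base = {"Location": {"Qiandao Lake"}, "Person": set()}
preds = {"Zhejiang Province": {"Location": 0.93, "Person": 0.02}}
result = build_dictionary(preds, base)
# "Zhejiang Province" joins the Location category; Person stays empty.
```

The threshold in a real system would correspond to the "set probability condition" of the claim, which the patent leaves configurable.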
A dictionary construction apparatus comprising:
the acquisition module is used for acquiring the text to be processed and the basic dictionary; wherein the base dictionary contains a plurality of word categories;
the word recognition module is used for determining at least one candidate word contained in the text to be processed and semantic information of the at least one candidate word based on the trained category prediction model;
the category identification module is used for selecting at least one target word which meets set semantic conditions according to the respective semantic information of the at least one candidate word through the category prediction model, and determining probability values of the at least one target word belonging to the plurality of word categories respectively;
and the dictionary construction module is used for respectively attributing the at least one target word to the word categories with the corresponding probability values meeting the set probability conditions.
Optionally, the category prediction model includes a pre-training language sub-model and a named entity recognition sub-model; the word recognition module is specifically configured to:
based on the text to be processed, obtaining a word vector corresponding to at least one word contained in the text to be processed through the pre-training language sub-model; wherein each word vector represents semantic information of a corresponding word;
combining the at least one word through the named entity recognition submodel based on the word vector corresponding to the at least one word to obtain at least one candidate word and semantic information of the at least one candidate word; wherein each candidate word comprises at least one word.
Optionally, the category identifying module is specifically configured to:
selecting, through the named entity recognition submodel, at least one target word whose semantic information belongs to a named entity from the at least one candidate word; the named entity is an entity name with specific semantics;
for the at least one target word, performing the following operations, respectively: and determining probability values of the target word belonging to the plurality of word categories respectively according to the semantic information of the target word by the named entity recognition submodel.
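The selection and per-category scoring above can be sketched as a gate plus a softmax. This is an assumption about the interface, not the patent's model: each candidate word is given an "is this a named entity" score and one logit per word category, and the 0.5 gate is an illustrative value.

```python
import math

def softmax(logits):
    """Convert per-category logits into probability values summing to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_target_words(candidates, entity_threshold=0.5):
    """Keep candidates whose semantic information indicates a named
    entity, and return each target word's probability values over the
    word categories. Interface and threshold are illustrative."""
    targets = {}
    for word, (entity_score, category_logits) in candidates.items():
        if entity_score > entity_threshold:  # the set semantic condition
            targets[word] = softmax(category_logits)
    return targets

candidates = {
    "Qiandao Lake": (0.97, [0.2, 3.1, 0.1]),  # likely a named entity
    "carrying":     (0.04, [0.3, 0.2, 0.1]),  # not a named entity
}
targets = select_target_words(candidates)
```

Only "Qiandao Lake" survives the semantic condition; its softmax output then plays the role of the probability values over the word categories.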
Optionally, the system further comprises a model training module, wherein the model training module is configured to:
acquiring a training data set; the training data set comprises a plurality of text data samples, and the text data samples are marked with set categories;
iteratively training the category prediction model based on the training data set until a set convergence condition is met, wherein one iterative training process comprises:
determining at least one target word in the text data sample through the category prediction model based on the text data sample extracted from the training data set, and determining a target word category corresponding to each of the at least one target word;
and determining a corresponding loss value according to the target word category and the set category, and carrying out parameter adjustment on the category prediction model according to the loss value.
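The loss computation in one training iteration can be sketched as a cross-entropy between the predicted category distribution and the labeled set category. The patent does not fix a concrete loss or optimizer; this is a minimal illustrative assumption.

```python
import math

def category_loss(predicted_probs, set_category, categories):
    """Cross-entropy between the model's predicted distribution over
    word categories and the labeled (set) category of the sample."""
    gold = categories.index(set_category)
    return -math.log(predicted_probs[gold])

# One hypothetical iteration's loss for a single target word.
categories = ["Person", "Location", "Organization"]
probs = [0.1, 0.8, 0.1]  # model output for one target word
loss = category_loss(probs, "Location", categories)
# A confident, correct prediction yields a small loss; parameter
# adjustment (not shown) would then follow the gradient of this value.
```

In the full scheme the loss drives parameter adjustment until the set convergence condition is met.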
Optionally, the category prediction model includes a pre-training language sub-model and a named entity recognition sub-model; the training data set comprises encyclopedic text data samples, field text data samples and named entity recognition text data samples; the model training module is further configured to:
determining corresponding embedded vector samples through the pre-training language sub-model based on text data samples extracted from the encyclopedic text data samples and the field text data samples; the encyclopedic text data samples are text data not targeted at any specific field, and each field text data sample is text data comprising a plurality of named entities of set categories and is marked with the corresponding field category;
determining, by the named entity recognition submodel, a corresponding at least one target word and its respective corresponding target word class based on vector samples extracted from the embedded vector samples and word vector samples; the word vector sample is obtained based on the named entity recognition text data sample; the named entity recognition text data sample is text data comprising at least one named entity, and each named entity word is marked with a corresponding named entity category.
Optionally, the model training module is further configured to:
determining a first loss value according to the target word category and the field category, and carrying out parameter adjustment on the pre-training language sub-model according to the first loss value;
and determining a second loss value according to the target word category, the field category and the named entity category, and carrying out parameter adjustment on the named entity recognition submodel according to the second loss value.
Optionally, the model training module is further configured to:
preprocessing the encyclopedic text data samples, the field text data samples and the named entity recognition text data samples in the training data set respectively; the pre-processing operation includes at least one of data screening and format conversion.
Optionally, the dictionary constructing module is specifically configured to:
determining a word category corresponding to the maximum probability value based on the probability values of the target words belonging to the word categories respectively, and taking the word category as the target category;
if the probability value of the target word belonging to the target category is larger than a first set threshold value, attributing the target word to the target category.
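The maximum-probability rule above can be sketched in a few lines. The 0.8 value for the first set threshold is an illustrative assumption; the patent leaves it configurable.

```python
def attribute_target_word(word, category_probs, first_threshold=0.8):
    """Pick the word category with the maximum probability value as the
    target category, and attribute the word to it only if that
    probability exceeds the first set threshold (0.8 is illustrative)."""
    target_category = max(category_probs, key=category_probs.get)
    if category_probs[target_category] > first_threshold:
        return target_category
    return None  # no category is confident enough

# A confident prediction is attributed; an uncertain one is not.
confident = attribute_target_word("Mount Tai", {"Location": 0.95, "Person": 0.03})
uncertain = attribute_target_word("Mount Tai", {"Location": 0.60, "Person": 0.30})
```

Returning `None` here corresponds to the case the next optional clause handles with a fallback rule.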
Optionally, the dictionary constructing module is further configured to:
if the probability value of the target word belonging to the plurality of word categories respectively is not larger than a second set threshold, selecting at least one word category with the probability value larger than a third set threshold as a candidate category; the third set threshold is less than the second set threshold;
and selecting a candidate category meeting a set similarity condition as a target category based on the similarity between the target word and the word contained in each candidate category, and attributing the target word to the target category.
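The fallback rule above can be sketched with cosine similarity between the target word's vector and the vectors of words already in each candidate category. The thresholds (0.8 and 0.3) and the max-over-words similarity are illustrative assumptions; the patent only requires the third threshold to be smaller than the second.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fallback_category(word_vec, category_probs, category_word_vecs,
                      second_threshold=0.8, third_threshold=0.3):
    """When no probability value clears the second set threshold,
    shortlist the word categories above the (smaller) third set
    threshold as candidate categories, and choose the one whose
    existing words are most similar to the target word."""
    if max(category_probs.values()) > second_threshold:
        return None  # handled by the maximum-probability rule instead
    candidates = [c for c, p in category_probs.items() if p > third_threshold]
    best, best_sim = None, -1.0
    for c in candidates:
        sim = max(cosine(word_vec, wv) for wv in category_word_vecs[c])
        if sim > best_sim:
            best, best_sim = c, sim
    return best
```

Usage: with probability values of 0.5/0.4 for Location/Person (neither above 0.8, both above 0.3), the category whose stored word vectors best match the target word wins.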
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores program code, and when the program code is executed by the processor, the processor is caused to execute any one of the steps of the dictionary construction method.
A computer-readable storage medium provided in an embodiment of the present application includes program code, when the program code runs on an electronic device, the program code is configured to cause the electronic device to perform any one of the steps of the dictionary construction method described above.
A computer program product according to an embodiment of the present application includes a computer program/instructions, which when run on a computer, causes the computer to execute the above-mentioned dictionary construction method.
The beneficial effects of this application are as follows:
After a text to be processed and a basic dictionary containing a plurality of word categories are obtained, at least one target word whose semantic information indicates a named entity can be selected from the text to be processed based on a trained category prediction model; probability values of the at least one target word belonging to each of the plurality of word categories are then determined, and each target word is attributed to the word categories whose corresponding probability values meet the set probability condition. Compared with the N-Gram-based dictionary construction method in the related art, the category prediction model can accurately determine, from the semantic information of words, both the target words in the text that belong to named entities and the probability values with which each target word belongs to the named entity categories in the basic dictionary, so the accuracy of dictionary construction can be improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1a is a schematic view of an application scenario in an embodiment of the present application;
fig. 1b is a schematic view of another application scenario in the embodiment of the present application;
FIG. 2a is a schematic flowchart illustrating a process of training a class prediction model according to an embodiment of the present application;
FIG. 2b is a schematic diagram illustrating labeling of named entity recognition text data samples in an embodiment of the present application;
FIG. 2c is a schematic diagram of the output of embedded vector samples in the embodiment of the present application;
FIG. 2d is a schematic diagram illustrating the training of a pre-trained language sub-model according to an embodiment of the present application;
FIG. 2e is a diagram illustrating another example of training a pre-trained language sub-model in the present application;
FIG. 2f is a schematic diagram illustrating training of a named entity recognizer model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating the analysis and processing of a training data set according to an embodiment of the present application;
FIG. 4a is a flowchart illustrating a dictionary construction method according to an embodiment of the present application;
FIG. 4b is a schematic diagram of determining candidate words in an embodiment of the present application;
FIG. 4c is a diagram illustrating an output result of a predictive model for determining a category according to an embodiment of the present application;
FIG. 4d is a schematic diagram of a process for determining a target class in an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating another example of determining a target class according to the present application;
FIG. 6 is a diagram illustrating a dictionary construction method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a dictionary construction apparatus in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another dictionary constructing apparatus in the embodiment of the present application;
fig. 9 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computing device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Dictionary: the words of the predefined categories are extracted from the text to construct a required dictionary, when information of a certain text is identified, the constructed dictionary can be used for identifying the text, and the categories of the words contained in the text can be quickly and accurately identified.
Naming an entity: the entity names with specific semantics in the text mainly include names of people, places, organizations, proper nouns and the like.
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration. Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
The embodiments of the present application relate to Artificial Intelligence (AI), Machine Learning (ML), and Natural Language Processing (NLP), and are designed based on machine learning and natural language processing techniques within AI.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
The embodiments of the present application adopt a machine-learning-based category prediction model to determine at least one target word in a text and the probability values with which the at least one target word belongs to each of a plurality of word categories in the basic dictionary.
The following briefly introduces the design concept of the embodiments of the present application:
dictionaries may be used for information recognition of text to quickly and accurately determine predefined categories to which words in the text belong. In the related technology, a dictionary construction method based on N-Gram is often adopted, the method firstly segments a text based on an N-Gram algorithm to obtain N words, then uses a plurality of existing dictionaries for comparison, selects new words from the N words and adds the new words into the dictionaries. However, the N-Gram algorithm adopted by the method cannot accurately segment words in the text, so that the accuracy of the finally constructed dictionary is not high.
In view of this, embodiments of the present application provide a dictionary construction method, an apparatus, an electronic device, and a storage medium, which may select a target word whose semantic information belongs to a named entity from a to-be-processed text according to semantic information of a word in the text based on a trained category prediction model, and then attribute the target word to a target named entity category in a dictionary according to a probability value that the target word belongs to each named entity category in the dictionary, thereby completing construction of the dictionary, and thus may improve accuracy of dictionary construction.
The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.
Fig. 1a is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes the terminal device 100 and the server 200. The terminal device 100 and the server 200 can communicate with each other through a communication network. Alternatively, the communication network may be a wired network or a wireless network. The terminal device 100 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In the embodiment of the present application, the terminal device 100 is an electronic device used by a user, and the electronic device may be a personal computer, a mobile phone, a tablet computer, a notebook, an electronic book reader, a smart home, a vehicle-mounted terminal, and the like. The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Illustratively, when a user browses text on the terminal device 100, the browsed text may be sent to the server 200 through the terminal device 100, and the server 200 may determine, based on a trained category prediction model, at least one candidate word included in the text and semantic information of each of the at least one candidate word, select, according to the semantic information of each of the at least one candidate word, at least one target word meeting a set semantic condition through the category prediction model, and determine probability values that each of the at least one target word belongs to a plurality of word categories in the basic dictionary. After determining the probability values, the following operations may be performed for at least one target word: and selecting the word categories meeting the set probability condition as corresponding target categories based on the probability values of the target words belonging to the plurality of word categories respectively, and attributing the target words to the target categories, thereby completing construction and expansion of the dictionary.
In addition, after determining the target words contained in the text and the word categories to which the target words belong, the server 200 may also send the result to the terminal device 100, so that the terminal device 100 displays the categories to which the words contained in the text belong to the user, thereby achieving the purpose of performing information identification on the text.
It should be noted that fig. 1a is an exemplary description of an application scenario of the dictionary construction method of the present application, and an application scenario to which the method in the embodiment of the present application may be applied is not limited to this.
In some embodiments, the application scenario in the embodiment of the present application may also be as shown in fig. 1b, and includes the terminal device 100 and the server 200. The server 200 includes a dictionary, and the dictionary is a dictionary constructed by the dictionary construction method in the present application.
Specifically, the terminal device 100 may send the text to be recognized to the server 200, and after receiving the text to be recognized, the dictionary in the server 200 may recognize the named entity in the text to be recognized, and send the obtained recognition result to the terminal device 100.
For example, if the text to be recognized is "Floret appeared when climbing Mount Tai at 5 a.m. in Mid-Autumn", the terminal device 100 sends the text to the server 200, which recognizes it through the dictionary: the person name in the text is "Floret", the festival is "Mid-Autumn", the time is "5 a.m.", and the place is "Mount Tai", i.e., "person name: Floret; festival: Mid-Autumn; time: 5 a.m.; place name: Mount Tai". The server 200, having obtained the recognition result, may transmit the recognition result to the terminal device 100.
Further, after receiving the recognition result, the terminal device 100 may analyze and manage the text to be recognized according to the named entities in the recognition result. For example, assuming the text to be recognized is a news message, the named entities in the message can be recognized through the dictionary to extract its important information, and the message can then be analyzed, based on the extracted information, to determine whether it is fake news or contains sensitive words.
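Recognition against a constructed dictionary can be sketched as a greedy longest-match scan over the text's tokens. This is a minimal illustration; the function name, the three-token window, and the phrase-to-category mapping are assumptions, not the patent's lookup mechanism.

```python
def recognize_entities(tokens, dictionary, max_len=3):
    """Greedy longest-match lookup of dictionary entries in a token
    list. `dictionary` maps a phrase to its word category; longer
    phrases are tried before shorter ones at each position."""
    results, i = [], 0
    while i < len(tokens):
        match = None
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + length])
            if phrase in dictionary:
                match = (phrase, dictionary[phrase], length)
                break
        if match:
            results.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return results

tokens = "Floret climbed Mount Tai".split()
entity_dict = {"Floret": "Person", "Mount Tai": "Location"}
found = recognize_entities(tokens, entity_dict)
# found == [("Floret", "Person"), ("Mount Tai", "Location")]
```

Longest-match-first ensures "Mount Tai" is recognized as one Location entry rather than two unmatched tokens.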
First, a detailed description will be given of a training process of the class prediction model in the embodiment of the present application, which may be performed by a server, such as the server 200 in fig. 1 a. Referring to fig. 2a, a schematic diagram of a training process of the class prediction model in the embodiment of the present application is shown, and the training process of the class prediction model in the embodiment of the present application is described in detail below with reference to fig. 2 a.
Step S201, a training data set is acquired.
The acquired training data set may include a plurality of text data samples. In particular, encyclopedic text data samples, domain text data samples, and named entity recognition text data samples may be included in the training dataset.
Wherein, the encyclopedic text data samples are text data without a targeted field. For example, the text "Shipping company Green Marine's stuck container ship in the busy Suez Canal was re-floated after 6 days" and the text "Hong Kong veteran actor Liao Qizhi, who won the Hong Kong Film Award for Best Supporting Actor, died in the Prince of Wales Hospital" may both be used as encyclopedic text data samples. When the encyclopedic text data are used as sample data to train the category prediction model, the encyclopedic text data samples do not need to be labeled.
The domain text data samples are text data that comprise a plurality of named entities of a set category, namely, text data highly related to a specific named entity category or containing many named entities of that category. For example, the text "After Denmark ceded the Caribbean colony of the West Indies to the United States, the United States began to control the U.S. Virgin Islands" may be a domain text data sample, specifically a location-domain text data sample. When the domain text data are used as sample data to train the category prediction model, each domain text data sample needs to be labeled with a corresponding domain category. For example, since the above domain text data contains many place-name entities, it may be labeled as "Location".
A named entity recognition text data sample is text data that includes at least one named entity, i.e., text data requiring word-level named entity labels. For example, the text "A cruise ship carrying tourists on Qiandao Lake in Zhejiang Province in southeastern China" may be a named entity recognition text data sample. When the named entity recognition text data are used as sample data to train the category prediction model, each named entity word included in each named entity recognition text data sample needs to be labeled with a corresponding named entity category. For each named entity recognition text data sample, the "BIOES notation" may be employed to denote the named entity category and relative position to which each word contained in the sample belongs. For example, as shown in fig. 2B, for the named entity recognition text data "A ship carrying tourists on Qiandao Lake in Zhejiang Province", labeling it with the "BIOES notation" may yield "O S-LOC O B-LOC E-LOC O". Where B denotes a start position, I denotes a middle position, E denotes an end position, S denotes a single-word named entity, and the suffix denotes the entity category, e.g., "-LOC" denotes that the named entity is a place-name entity; O denotes a non-named entity.
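The BIOES labeling described above can be sketched in a few lines of Python; the function name and the (start, end, category) span format are illustrative assumptions, not part of the embodiment:

```python
def bioes_tags(tokens, spans):
    """Label each token with O, or with B-/I-/E-/S-<category> derived from
    entity spans given as (start, end, category), end index exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        if end - start == 1:
            tags[start] = f"S-{cat}"          # a single-word named entity
        else:
            tags[start] = f"B-{cat}"          # beginning of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{cat}"          # middle of the entity
            tags[end - 1] = f"E-{cat}"        # end of the entity
    return tags

# Two multi-word place entities: "Qiandao Lake" and "Zhejiang Province"
tokens = "A cruise ship on Qiandao Lake in Zhejiang Province".split()
tags = bioes_tags(tokens, [(4, 6, "LOC"), (7, 9, "LOC")])
```

Single-word entities receive an S- tag, while multi-word entities receive a B-…E- run, matching the notation of fig. 2B.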
Step S202, extracting text data samples from encyclopedic text data samples and field text data samples.
The class prediction model to be trained comprises a pre-training language submodel and a named entity recognition submodel, and when the pre-training language submodel is trained, text data samples can be extracted from encyclopedic text data samples and field text data samples of a training data set to serve as training sample data.
Step S203, inputting the extracted text data sample into a pre-training language sub-model to be trained to obtain a corresponding embedded vector sample.
The extracted text data sample is input into the pre-training language sub-model to be trained, and feature extraction is performed on the text data sample based on the pre-training language sub-model to obtain a corresponding embedded vector sample, where the embedded vector sample may comprise a word vector corresponding to each word in the text data sample. For example, for the text data sample "After Denmark ceded the Caribbean colony", the embedded vector samples shown in FIG. 2c, i.e., E1, E2, E3, … E8, can be obtained after passing through the pre-training language sub-model.
Specifically, as shown in fig. 2d, before the extracted text data sample is input into the pre-training language sub-model to be trained, the text data sample needs to be processed into a "[CLS] + tokens + [SEP]" structure, that is, the text data sample is processed into a "[CLS], I1, I2, I3, …, [SEP]" structure. The special tag [CLS] can be used for the domain category recognition task.
After processing the text data samples into "[ CLS ], I1, I2, I3, …, [ SEP ]" structures, the text data samples may also be subjected to random Mask processing, and processed into "[ CLS ], I1, [ MASK ], I3, …, [ SEP ]" structures. And processing the text data sample input into the pre-training language submodel by adopting a random Mask mechanism so as to enable the pre-training language submodel to complete a Mask prediction task. Wherein, the random Mask mechanism randomly masks or replaces any character or word in a sentence, and then lets the pre-training language sub-model predict which part is masked or replaced through the understanding of the context.
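As a rough illustration of the random Mask mechanism, a sentence can be masked as follows (a simplification assumed here; BERT-style masking additionally replaces some tokens with random words rather than only [MASK]):

```python
import random

def random_mask(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK], recording the originals
    so the model can later be trained to predict the covered parts."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        # Special tags are never masked
        if tok not in ("[CLS]", "[SEP]") and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok                  # remember what was covered
        else:
            masked.append(tok)
    return masked, targets

tokens = ["[CLS]", "Shipping", "company", "Green", "Marine", "[SEP]"]
masked, targets = random_mask(tokens, mask_prob=0.5, seed=1)
```

The `targets` mapping supplies the labels for the Mask prediction task: the model must recover each covered word from its context.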
After the text data samples are processed by random Mask, the processed samples can be used to train the pre-training language sub-model. First, the text data sample is embedded, i.e., vectorized, to obtain "E[CLS], E1, E[MASK], E3, …, E[SEP]", and the vectorized text data sample then passes through a plurality of hidden layers in the pre-training language sub-model to obtain the output result "C, T1, T[MASK], T3, …, T[SEP]". Because the text data samples input into the pre-training language sub-model also include domain text data samples labeled with domain categories, training enables the pre-training language sub-model to predict the domain category of an input text data sample. Training the pre-training language sub-model with the random Mask mechanism and the domain category recognition task at the same time allows it to learn more semantic information.
For example, the extracted text data sample may be "Shipping company Green Marine's stuck container ship in the busy Suez Canal was re-floated after 6 days, restoring local traffic". As shown in FIG. 2e, before this text data sample is input into the pre-training language sub-model, it may be processed into "[CLS] Shipping company Green Marine's stuck container ship … [SEP]", and after random Mask processing, into "[CLS] Shipping [MASK] Green [MASK] stuck container … [SEP]".
After the text data sample processed by random Mask is input into the pre-training language sub-model, the sub-model can vectorize the text data sample to obtain "E[CLS], E1, E2, E[MASK], E4, E[MASK], E6, …, E[SEP]", and then obtain the output results "C, T1, T2, T[MASK], T4, T[MASK], T6, …, T[SEP]".
After the extracted text data samples are processed by random Mask and input into the pre-training language sub-model for training, the pre-training language sub-model can predict the covered or replaced parts through its understanding of the context. That is, for the input "[CLS] Shipping [MASK] Green [MASK] stuck container … [SEP]", the pre-training language sub-model can predict, through its understanding of the context, that the first "[MASK]" is "company" and the second "[MASK]" is "Marine".
Optionally, the pre-training language sub-model in the embodiment of the present application may be implemented using Bidirectional Encoder Representations from Transformers (BERT).
In the embodiment of the present application, encyclopedic text data and domain text data with domain category labels are used as training data to train the pre-training language sub-model, so that the sub-model can learn more semantic information. Thus, when named entity recognition is performed on text data, the semantic information in the text data can be utilized to identify the target words belonging to named entities more quickly and accurately.
Step S204, extracting vector samples from the embedded vector samples and from the word vector samples obtained from the named entity recognition text data samples.
In this embodiment, the named entity recognition text data samples of the training data set may be vectorized to obtain corresponding word vector samples. When vectorizing the named entity recognition text data sample, a Word Embedding method can be adopted to obtain a Word vector sample corresponding to the named entity recognition text data sample.
After the word vector samples corresponding to the named entity recognition text data samples are obtained, the vector samples can be extracted from the embedded vector samples and the word vector samples obtained based on the pre-training language submodels.
Step S205, inputting the extracted vector samples into the named entity recognition sub-model to obtain at least one corresponding target word and the corresponding target word category.
And inputting the extracted vector samples into a named entity identification submodel, and identifying at least one target word belonging to the named entity in the text data sample and the named entity category corresponding to each target word based on the named entity identification submodel.
Specifically, as shown in FIG. 2f, assume that the extracted text data sample is "[CLS] Shipping company Green Marine stuck container ships … [SEP]", with the label "[CLS] O O B-ORG E-ORG O O O … [SEP]". Wherein O represents a non-named entity, B represents the starting position of a named entity, E represents the ending position of a named entity, S represents a single-word named entity, and the suffix represents the entity category: "-LOC" represents the place entity category and "-ORG" represents the organization entity category.
The text data sample with label is input into a pre-training language sub-model, and the corresponding embedded vector sample 'E [ CLS ], E1, E2, E3, E4, E5, E6, E7, E8, E …, E [ SEP ]' can be obtained through the pre-training language sub-model.
The obtained embedded vector samples are input into the named entity recognition sub-model to train it. The named entity recognition sub-model includes two sub-models: a boundary recognition sub-model (Border Segment Model) for identifying the boundaries of named entities in the text data, and a category recognition sub-model (NER Model) for identifying the named entity category to which each named entity word in the text data belongs. Through training these two sub-models, their prediction results are combined to obtain the final prediction result.
After the embedded vector samples "E[CLS], E1, E2, E3, E4, E5, E6, E7, E8, …, E[SEP]" are input into the named entity recognition sub-model, the named entity boundaries in the text data sample can first be recognized by the boundary recognition sub-model: recognizing the text "[CLS] Shipping company Green Marine stuck container ships … [SEP]" through the boundary recognition sub-model yields "O O B E O O O …". Based on the recognized named entity boundaries and the embedded vector samples, the named entity category to which each named entity word belongs can then be obtained through the category recognition sub-model: recognizing the same text through the category recognition sub-model yields "O O ORG ORG O O O …". The recognition result of the boundary recognition sub-model is then combined with that of the category recognition sub-model to obtain the prediction result for the text "[CLS] Shipping company Green Marine stuck container ships … [SEP]", i.e., the named entity words in the text and their named entity categories are predicted.
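The combination of the boundary recognition sub-model's output with the category recognition sub-model's output can be sketched as follows (the function name and tag formats are assumptions for illustration):

```python
def merge_predictions(boundary_tags, category_tags):
    """Combine boundary tags (O/B/I/E/S) with category tags (O/ORG/LOC/...)
    into BIOES labels such as B-ORG, position by position."""
    merged = []
    for b, c in zip(boundary_tags, category_tags):
        if b == "O" or c == "O":
            merged.append("O")        # either sub-model says non-entity
        else:
            merged.append(f"{b}-{c}")  # e.g. "B" + "ORG" -> "B-ORG"
    return merged
```

For the boundary tags "O B E O" and category tags "O ORG ORG O", the merged result is "O B-ORG E-ORG O", which has the form of the prediction result described above.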
Recognizing the text "[CLS] Shipping company Green Marine stuck container ships … [SEP]" based on the named entity recognition sub-model yields the prediction result "[CLS] O O B-ORG E-ORG O O O … [SEP]". From this prediction result, it can be known that "Shipping", "company", "stuck", "container" and "ships" are all non-named entities, while "Green Marine" is a named entity, specifically an organization named entity.
In the embodiment of the application, the embedded vector output by the pre-training language sub-model and the word vector obtained based on the named entity recognition text sample with the named entity category label are used as training data to train the named entity recognition sub-model, so that the target words belonging to the named entity in the text data can be more accurately recognized by utilizing the integral semantic information of a plurality of words in the text data learned in the pre-training language sub-model stage.
The encyclopedic text data and the field text data with the field category labels are used as training data, the pre-training language submodel is trained, and the pre-training language submodel can learn more semantic information, so that when the named entity identification is carried out on the text data, the semantic information in the text data can be utilized to identify the target words belonging to the named entity in the text data more quickly and better.
Step S206, determining a first loss value according to the target word category and the field category.
And determining a first loss value according to the target word category corresponding to the target word obtained by the recognition of the named entity recognition sub-model and the field category marked by the text data sample in which the target word is positioned. In general, the loss value is a measure of how close the actual output is to the desired output. The smaller the loss value, the closer the actual output is to the desired output.
Step S207, determining whether the first loss value converges to a preset target value; if not, executing step S208; if so, step S209 is performed.
Judging whether the first loss value converges to a preset target value, if the first loss value is smaller than or equal to the preset target value, or if the variation amplitude of the first loss value obtained by continuous N times of training is smaller than or equal to the preset target value, considering that the first loss value converges to the preset target value, and indicating that the first loss value converges; otherwise, it indicates that the first loss value has not converged.
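The convergence test described above (loss at or below the target value, or the variation over N consecutive training rounds at or below the target value) can be sketched as:

```python
def has_converged(losses, target, n=3):
    """losses: loss values from successive training rounds, newest last.
    Converged if the latest loss <= target, or if the variation amplitude
    over the last n rounds <= target."""
    if losses and losses[-1] <= target:
        return True
    if len(losses) >= n:
        recent = losses[-n:]
        return max(recent) - min(recent) <= target
    return False
```

The same check is applied to the first loss value here and to the second loss value in step S210.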
And S208, adjusting the parameters of the pre-training language sub-model to be trained according to the determined first loss value.
And if the first loss value is not converged, adjusting the model parameters of the pre-training language sub-model, and after adjusting the model parameters, returning to execute the step S202 to continue the training process of the next round.
Step S209, a second loss value is determined according to the target word category, the field category and the named entity category.
And determining a second loss value according to the target word category corresponding to the target word obtained by the recognition of the named entity recognition sub-model and the field category or the named entity category marked by the text data sample in which the target word is positioned. In general, the loss value is a measure of how close the actual output is to the desired output. The smaller the loss value, the closer the actual output is to the desired output.
In the embodiment of the present application, the first loss value, determined from the target word category corresponding to the target word output by the named entity recognition sub-model and the domain category labeled in the extracted text data sample, is used to adjust the model parameters of the pre-training language sub-model; the second loss value, determined from the target word category corresponding to the target word output by the named entity recognition sub-model and the domain category or named entity category labeled in the extracted vector sample, is used to adjust the model parameters of the named entity recognition sub-model. In this way, the training of the pre-training language sub-model can be completed first, and after it finishes training, its weights no longer participate in the training of the named entity recognition sub-model. This reduces the training time of the model and enables the named entity recognition sub-model to better combine the two sub-tasks of boundary recognition and category prediction, so that the target words belonging to named entities in the text are identified more accurately and attributed to the correct named entity categories.
Step S210, determining whether the second loss value converges to a preset target value; if not, go to step S211; if so, go to step S212.
Judging whether the second loss value converges to a preset target value, if the second loss value is smaller than or equal to the preset target value, or if the variation amplitude of the second loss value obtained by continuous N times of training is smaller than or equal to the preset target value, considering that the second loss value converges to the preset target value, and indicating that the second loss value converges; otherwise, it indicates that the second loss value has not converged.
And step S211, adjusting parameters of the named entity recognition submodel to be trained according to the determined second loss value.
And if the second loss value is not converged, adjusting the model parameters of the named entity recognition submodel, returning to execute the step S204 after adjusting the model parameters, and continuing the next round of training process.
And step S212, finishing the training to obtain the trained category prediction model.
And if the first loss value and the second loss value are both converged, taking the currently obtained class prediction model as a trained class prediction model.
In the embodiment of the application, a training data set comprising encyclopedic text data samples, field text data samples with field category labels and named entity identification text data samples with named entity category labels is adopted to train a pre-training language sub-model and a named entity identification sub-model in a category prediction model, so that the model can learn more semantic information in the text data, more accurately identify named entities in the text data according to the semantic information in the text data, and divide the named entities into corresponding named entity categories. In addition, after training of the category prediction model is completed, a basic dictionary can be constructed according to an existing training data set, and accordingly construction of the dictionary is completed.
After the trained category prediction model is obtained, a basic dictionary can be constructed based on a training data set comprising encyclopedic text data samples, field text data samples and named entity recognition text data samples, wherein the basic dictionary comprises a plurality of named entity categories, and each named entity category can comprise a plurality of named entity words.
Specifically, after the embedded vector sample corresponding to a text data sample is obtained through the pre-training language sub-model, it is input into the named entity recognition sub-model. In the process of training the model, the named entity recognition task of the named entity recognition sub-model mainly comprises two subtasks: recognition of named entity boundaries and recognition of named entity categories. The BIO/BIOES labeling method labels the boundary and the category of a word at the same time; based on this labeling method, conventional NER models treat the NER problem as a multi-label classification task and directly predict the named entity position and category of each word. Such an approach combines the two subtasks of boundary recognition and category prediction well, but two major problems remain. First, such models perform well when there are few named entity categories, but with more categories it is difficult to accurately predict the entity category of each word. Second, the recognition process can only use the semantic information of a single word; for a named entity formed of multiple words, the overall semantic information of those words cannot be used.
The named entity recognition sub-model in the embodiment of the present application is also trained on the two subtasks, but combines them differently from previous models. First, using the boundary label information, the model is trained to recognize the boundaries of named entities; the text is then divided into different semantic units according to the prediction result, words in the same unit are regarded as belonging to the same named entity, and the word vectors of these words are averaged and assigned back to each word, thereby utilizing the overall semantic information of the named entity.
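The averaging of word vectors within one semantic unit can be sketched as follows; plain Python lists stand in for the real embedding vectors, and the boundary tag set {B, I, E, S, O} is taken from the notation above:

```python
def segment_average(word_vectors, boundary_tags):
    """Group words into semantic units from boundary tags (a B..E run forms
    one unit; S or O words are one-word units), average each unit's vectors,
    and assign the mean back to every word in the unit."""
    segments, current = [], []
    for i, tag in enumerate(boundary_tags):
        if tag == "B":
            current = [i]                  # start a multi-word unit
        elif tag in ("I", "E"):
            current.append(i)
            if tag == "E":                 # unit complete
                segments.append(current)
                current = []
        else:                              # "O" or "S": one-word unit
            segments.append([i])
    out = list(word_vectors)
    for seg in segments:
        dim = len(word_vectors[seg[0]])
        mean = [sum(word_vectors[i][d] for i in seg) / len(seg)
                for d in range(dim)]
        for i in seg:
            out[i] = mean                  # every word gets the unit's mean
    return out
```

Each word in a multi-word named entity thus carries the entity's overall semantics instead of only its own.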
The training of the pre-training language submodel and the named entity recognition submodel can be realized by two modes, one is a fine-tune method, namely, all parameters of the pre-training language submodel and the named entity recognition submodel are uniformly adjusted; the other method is a feature-base method, namely, the output of the pre-training language sub-model is directly used as an embedded vector and used as the input of the named entity recognition sub-model, the parameters of the pre-training language sub-model are fixed, and only the named entity recognition sub-model needs to be trained. The training mode adopted in the embodiment of the application is a feature-base method, after the training process of the pre-training language sub-model is finished, the weight of the pre-training language sub-model does not participate in training any more, the model training time is reduced, and the named entity recognition sub-model can be better combined with two sub-tasks of boundary recognition and category prediction.
In one embodiment, after the training data set is obtained, a plurality of text data samples in the training data set may be further analyzed and processed. Fig. 3 illustrates a specific process of analyzing and processing training data in a training data set in an embodiment of the present application, and as shown in fig. 3, the process may include the following steps:
step S301, a training data set including encyclopedic text data, domain text data and named entity recognition text data is obtained.
The encyclopedic text data are text data without a targeted field; the domain text data are selected text data highly related to a specific named entity category or containing many named entities of a certain category; and the named entity recognition text data are text data requiring word-level named entity labeling.
Step S302, preprocessing the encyclopedic text data, the field text data and the named entity identification text data respectively.
After encyclopedic text data, field text data and named entity identification text data are obtained, preprocessing operations can be respectively carried out on the encyclopedic text data, the field text data and the named entity identification text data. Wherein the preprocessing operation may include at least one of data screening and format conversion. Specifically, after the encyclopedic text data sample, the field text data sample and the named entity recognition text data sample are obtained, data cleaning, denoising and serialization operations can be respectively performed on each text data sample. The data cleaning can comprise missing value cleaning, format content cleaning, logic error cleaning, non-required data cleaning and relevance verification; data denoising may include outlier padding, distance-based detection, etc.; data serialization is the conversion of data into a standard format.
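A hedged sketch of such a preprocessing pass — data screening (dropping empty and duplicate samples), format cleaning (whitespace normalization), and serialization into a standard JSON format — might look like the following; the function name and the JSON schema are assumptions:

```python
import json
import re

def preprocess(samples):
    """Clean raw text samples and serialize them to a standard format."""
    seen, cleaned = set(), []
    for text in samples:
        # Format content cleaning: collapse whitespace
        text = re.sub(r"\s+", " ", text or "").strip()
        # Data screening: drop empty and duplicate samples
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    # Serialization: convert each sample into a standard JSON record
    return [json.dumps({"text": t}) for t in cleaned]
```

Real data cleaning would also cover the logic-error and relevance checks mentioned above; this sketch shows only the mechanical steps.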
Step S303, obtaining corresponding encyclopedic text sequence, field text sequence and named entity identification text sequence.
After preprocessing operation is respectively carried out on the encyclopedic text data, the field text data and the named entity identification text data, an encyclopedic text sequence, a field text sequence and a named entity identification text sequence can be correspondingly obtained.
In the embodiment of the application, the data meeting the data quality requirement can be obtained by performing data cleaning operation on the text data in the training data set, and then performing data denoising and serialization operation on the text data, so that the text data can be converted into a standard format, and therefore the class prediction model can be trained on the basis of the processed text data in the standard format, and the training efficiency and the training result of the model can be improved.
After the trained category prediction model is obtained, dictionary construction can be performed based on the category prediction model. Fig. 4a shows a flowchart of a dictionary construction method provided by an embodiment of the present application, which may be executed by a server, for example, the server 200 in fig. 1 a. The dictionary construction process in the embodiment of the present application will be explained in detail below with reference to fig. 4 a.
Step S40, a text to be processed and a basic dictionary are acquired.
The basic dictionary is a dictionary which is obtained in a training stage of training the category prediction model and comprises a plurality of named entity categories. Each named entity category of the basic dictionary includes a plurality of corresponding named entity words. For example, the basic dictionary may include a plurality of named entity categories such as a person name, a place name, an organization name, etc., the person name may include a plurality of names, the place name may include a plurality of places, and the organization name may include a plurality of organizations.
In an embodiment, after the text to be processed is obtained, data cleaning, denoising and serialization operations may be performed on the text to be processed, so as to obtain a text sequence corresponding to the text to be processed.
Step S41, determining at least one candidate word included in the text to be processed and semantic information of each of the at least one candidate word based on the trained category prediction model.
The category prediction model may include a pre-training language sub-model and a named entity recognition sub-model.
And inputting the text to be processed into a category prediction model, and obtaining a word vector corresponding to at least one word contained in the text to be processed based on a pre-training language sub-model in the category prediction model. Wherein each word vector may represent semantic information of the corresponding word.
After the word vector corresponding to each of the at least one word is obtained, the at least one word may be combined based on the named entity recognition submodel in the category prediction model to obtain at least one candidate word and semantic information of each of the at least one candidate word. Wherein, each candidate word may contain at least one word.
For example, as shown in fig. 4b, suppose the text to be processed is "Xiaoming goes to Hangzhou to play". Before the text to be processed is input into the category prediction model, it needs to be processed into the structure "[CLS] Xiaoming goes to Hangzhou to play [SEP]", which is then input into the category prediction model; based on the pre-training language sub-model, "E[CLS] E1 E2 E3 E4 E5 E6 E7 E[SEP]" is obtained. Based on the obtained "E[CLS] E1 E2 E3 E4 E5 E6 E7 E[SEP]", it is possible to obtain "E[CLS] E1+E2 E1+E2 E3 E4+E5 E4+E5 E6 E7 E[SEP]" through the named entity recognition sub-model.
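The construction of candidate-word vectors such as "E1+E2" from per-word vectors can be sketched as follows; the predicted word groupings are assumed to come from the named entity recognition sub-model, and plain lists stand in for real embeddings:

```python
def combine_candidates(word_vectors, groups):
    """groups: index groups forming candidate words, e.g. [[0, 1], [2]].
    Each position receives the element-wise sum of its group's vectors,
    so both words of a two-word candidate carry E1+E2."""
    out = [None] * len(word_vectors)
    for group in groups:
        dim = len(word_vectors[group[0]])
        summed = [sum(word_vectors[i][d] for i in group) for d in range(dim)]
        for i in group:
            out[i] = summed
    return out
```

For the fig. 4b example, the groups would be [[0, 1], [2], [3, 4], [5], [6]], producing the "E1+E2 E1+E2 E3 E4+E5 E4+E5 E6 E7" pattern.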
In the embodiment of the application, based on the pre-training language sub-model and the named entity recognition sub-model, at least one candidate word contained in the text to be processed and the respective semantic information of the at least one candidate word can be obtained, so that the text to be processed can be divided into a plurality of candidate words according to the semantic information of the text to be processed, and further, when the named entity recognition is performed, the target word belonging to the named entity in the text to be processed can be accurately recognized by using the overall semantic information of the plurality of words in the text.
Step S42, selecting at least one target word meeting the set semantic conditions according to the respective semantic information of at least one candidate word through a category prediction model, and determining the probability values of the at least one target word belonging to a plurality of word categories respectively.
After determining at least one candidate word contained in the text to be processed and the semantic information of each candidate word, at least one target word of which the semantic information belongs to the named entity can be selected from the at least one candidate word through the named entity recognition submodel. The named entity is an entity name with specific semantics.
For at least one target word, the following operations may be performed, respectively: and determining probability values of a target word belonging to a plurality of word categories respectively according to semantic information of the target word by the named entity recognition submodel.
For example, as shown in fig. 4c, assuming that the basic dictionary includes 3 named entity categories, namely person names, organization names and place names, after obtaining the corresponding "E[CLS] E1+E2 E1+E2 E3 E4+E5 E4+E5 E6 E7 E[SEP]" from the text "Xiaoming goes to Hangzhou to play", it can be determined from the semantic information that the words corresponding to "E1+E2" and "E4+E5" are both target words whose semantic information belongs to named entities. By predicting the probabilities that "E1+E2" and "E4+E5" belong to the person name, organization name and place name categories, the probability values P1, P2 and P3 of "E1+E2" belonging to the person name, organization name and place name, and the probability values P4, P5 and P6 of "E4+E5" belonging to the person name, organization name and place name, can be obtained respectively.
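The patent does not state how the probability values P1–P6 are computed; a common choice, assumed here for illustration, is a softmax over per-category scores for each target word:

```python
import math

def category_probabilities(scores, categories):
    """Turn one target word's raw per-category scores into probabilities
    that sum to 1, via softmax."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(categories, exps)}

# e.g. scores for "E1+E2" over (person name, organization name, place name)
probs = category_probabilities([2.0, 0.0, 0.0],
                               ["Person", "Organization", "Location"])
```

The resulting mapping corresponds to one word's (P1, P2, P3) triple in the example above.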
In the embodiment of the application, based on the named entity recognition submodel, the target words belonging to the named entities in the text to be processed can be recognized, and the probability values of the target words belonging to the named entity categories in the basic dictionary are obtained, so that the named entities in the text to be processed can be accurately recognized, and the probability of the named entities relative to each named entity category is obtained.
And step S43, respectively attributing at least one target word to the word categories with the corresponding probability values meeting the set probability conditions.
After the probability values that the at least one target word belongs to each of the plurality of word categories are determined, for each target word, the word category meeting the set probability condition can be selected, based on those probability values, as the target category to which the target word belongs, and the target word is added to that target category of the basic dictionary. For example, suppose it has been determined that the named entity "Xiaoming" in the text to be processed "Xiaoming goes to Hangzhou to play" belongs to the person name, the organization name and the place name with probability values P1, P2 and P3, and that "Hangzhou" belongs to them with probability values P4, P5 and P6. If the probability values P1 and P6 meet the set probability condition, the person name may be used as the named entity category of "Xiaoming" and the place name as the named entity category of "Hangzhou", so that "Xiaoming" is added to the person name category of the basic dictionary and "Hangzhou" to the place name category.
In the above step S43, for one target word in the at least one target word, a process of determining a target category to which the target word belongs and attributing the target word to the target category of the base dictionary may be as shown in fig. 4d, and includes the following steps:
Step S431: based on the probability values that the target word belongs to each of the plurality of word categories, determine the word category corresponding to the maximum probability value and take it as the target category.
The target words are words contained in the text to be processed that belong to named entities and are output by the category prediction model, and the word categories are the named entity categories contained in the basic dictionary. After the text to be processed is input into the category prediction model, the model outputs a prediction result for at least one target word belonging to a named entity contained in the text, namely the probability value that each target word belongs to each named entity category, forming (category, probability) result pairs.
After the probability values that the target words belong to each named entity category are determined, the following takes one target word as an example to elaborate how each target word is attributed to the corresponding named entity category in the basic dictionary.
The probability values that the target word belongs to each named entity category are sorted in descending order, the maximum probability value is determined, and the named entity category corresponding to the maximum probability value is taken as the target category.
Step S432, if the probability value of the target word belonging to the target category is greater than the first set threshold, attributing the target word to the target category.
If the maximum probability value is greater than the first set threshold, the target word is attributed to the target category in the basic dictionary, thereby completing the dictionary expansion.
For example, suppose the target word is "hangzhou", the named entity categories included in the basic dictionary are the person name, the organization name and the place name, the probability values that "hangzhou" belongs to the person name, the organization name and the place name are 0.08, 0.12 and 0.98 respectively, and the first set threshold is 0.8. Sorting these probability values in descending order, the maximum probability value is 0.98, and its corresponding named entity category is the place name. Since the maximum probability value 0.98 is greater than the first set threshold 0.8, the target word "hangzhou" is attributed to the place name category of the basic dictionary.
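Steps S431 and S432 can be sketched as follows — a minimal illustration assuming the probability values are given as a dictionary; `assign_by_max_probability` is an illustrative name, not part of the described device:

```python
def assign_by_max_probability(probs, first_threshold=0.8):
    """S431: take the category with the maximum probability value as the
    target category. S432: assign the word to it only if that probability
    exceeds the first set threshold; otherwise return None so the
    similarity-based branch (S433/S434) can take over."""
    target_category = max(probs, key=probs.get)
    if probs[target_category] > first_threshold:
        return target_category
    return None

# Numbers from the "hangzhou" example above
hangzhou = {"person name": 0.08, "organization name": 0.12, "place name": 0.98}
print(assign_by_max_probability(hangzhou))  # place name, since 0.98 > 0.8
```

When the maximum probability does not clear the threshold, the function falls through, matching the branch handled by steps S433 and S434.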
In the embodiment of the application, after the probability values that the target word belongs to each named entity category are obtained, the target word can be attributed to the named entity category corresponding to the maximum probability value when that maximum probability value exceeds the set threshold, completing the dictionary expansion. In this way, the target words determined in the text as belonging to named entities are divided into the named entity categories with the highest probability values, improving the accuracy of dictionary construction.
Step S433: if none of the probability values that the target word belongs to the plurality of word categories is greater than the second set threshold, select at least one word category with a probability value greater than the third set threshold as a candidate category.
The third set threshold is smaller than the second set threshold, and the second set threshold may be equal to the first set threshold or may not be equal to the first set threshold.
When the second set threshold equals the first set threshold, this is equivalent to: if the maximum of the probability values that the target word belongs to the plurality of word categories is not greater than the first set threshold, at least one word category with a probability value greater than the third set threshold is selected as a candidate category.
When the probability values that the target word respectively belongs to the plurality of word categories are not greater than the second set threshold, the probability values that the target word respectively belongs to the named entity categories may be sorted in a descending order, and at least one named entity category in which the probability value is greater than the third set threshold is selected as a candidate category.
Step S434, based on the similarity between the target word and the word included in each of the at least one candidate category, selecting the candidate category that meets the set similarity condition as the target category, and attributing the target word to the target category.
After at least one candidate category is selected, the similarity between the target word and the words contained in each of the at least one candidate category can be determined, the candidate category with the highest similarity is selected and used as the target category, and the target word is attributed to that target category of the basic dictionary.
In the embodiment of the application, after the probability values that the target word belongs to each named entity category are determined, a plurality of candidate categories can be selected from the named entity categories according to the probability values, the similarity between the target word and the words contained in each candidate category is compared, the target category is determined according to the similarity, and the target word is attributed to the target category to complete the dictionary expansion. In this way, based on the similarity judgment, the target words determined in the text as belonging to named entities are classified into the target categories whose words are most similar to them, improving the accuracy of dictionary construction.
In an embodiment, for each candidate category, a WordNet dictionary tool may be used to determine the similarity between the target word and the existing word set contained in that candidate category. After the similarities for all candidate categories are determined, the candidate category with the highest similarity is selected and used as the target category, and the target word is attributed to that target category of the basic dictionary, completing the dictionary expansion.
For example, suppose the target word is "new york" and the named entity categories included in the basic dictionary are the person name, the organization name, the place name, the festival, the product name and the time. The probability values that "new york" belongs to the person name, the organization name, the place name, the festival, the product name and the time are 0.45, 0.67, 0.8, 0.32, 0.76 and 0.06 respectively, the second set threshold is 0.9 and the third set threshold is 0.5. Since none of the probability values is greater than the second set threshold 0.9, the named entity categories with probability values greater than the third set threshold 0.5 are selected as candidate categories, namely the organization name, the place name and the product name.
The similarity between the target word "new york" and the existing word sets contained in the organization name, the place name and the product name is then determined. Assuming these similarities are 0.06, 0.96 and 0.21 respectively, the place name can be used as the target category of "new york", and the target word "new york" is attributed to the place name category of the basic dictionary.
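Steps S433 and S434 can be sketched in the same spirit — a hypothetical illustration in which the per-category similarity scores are assumed to be precomputed (e.g. with a WordNet-style tool) and passed in as a dictionary; `assign_by_similarity` is an illustrative name:

```python
def assign_by_similarity(probs, similarity,
                         second_threshold=0.9, third_threshold=0.5):
    """S433: when no probability value exceeds the second set threshold,
    keep the categories whose probability exceeds the third set threshold
    as candidates. S434: pick the candidate whose existing word set is
    most similar to the target word."""
    if max(probs.values()) > second_threshold:
        return None  # the max-probability branch (S431/S432) applies instead
    candidates = [c for c, p in probs.items() if p > third_threshold]
    return max(candidates, key=lambda c: similarity[c])

# Numbers from the "new york" example above
probs = {"person name": 0.45, "organization name": 0.67, "place name": 0.8,
         "festival": 0.32, "product name": 0.76, "time": 0.06}
sims = {"organization name": 0.06, "place name": 0.96, "product name": 0.21}
print(assign_by_similarity(probs, sims))  # place name
```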
In summary, the detailed process of attributing a target word to the corresponding target category in the basic dictionary, according to the probability values that the target word belongs to each named entity category, may further be as shown in fig. 5 and includes the following steps:
step S501, determining probability values of target words belonging to various named entity categories respectively.
Step S502, the probability values are sorted in a descending order, and the maximum probability value is determined.
Step S503, determining whether the maximum probability value is greater than a first set threshold value; if not, executing step S504; if so, step S506 is performed.
Step S504, at least one named entity category with the probability value larger than a third set threshold is selected as a candidate category.
Step S505, according to the similarity between the target word and the word included in each of the at least one candidate category, selecting the candidate category with the highest similarity as the target category, and attributing the target word to the target category.
Step S506, taking the named entity category corresponding to the maximum probability value as the target category, and attributing the target word to the target category.
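The whole flow of fig. 5 can be sketched as one function — a hedged illustration assuming the second set threshold equals the first (as discussed for step S433). The function and variable names are invented for this sketch, and the numbers below anticipate the worked example that follows:

```python
def attribute_target_word(probs, similarity_fn,
                          first_threshold=0.95, third_threshold=0.6):
    """Fig. 5, steps S501-S506: probs maps each named entity category to a
    probability value; similarity_fn(category) returns the similarity
    between the target word and that category's existing word set."""
    target = max(probs, key=probs.get)          # S502: maximum probability
    if probs[target] > first_threshold:         # S503 -> S506
        return target
    candidates = [c for c, p in probs.items()   # S504: candidate categories
                  if p > third_threshold]
    return max(candidates, key=similarity_fn)   # S505: most similar candidate

huangshan_probs = {"person name": 0.68, "organization name": 0.78,
                   "place name": 0.89, "festival": 0.07,
                   "product name": 0.72, "time": 0.04}
huangshan_sims = {"person name": 0.06, "organization name": 0.14,
                  "place name": 0.76, "product name": 0.04}
print(attribute_target_word(huangshan_probs, huangshan_sims.get))  # place name
```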
The dictionary construction method provided by the embodiment of the application trains the category prediction model on a training data set comprising encyclopedic text data, field text data with field category labels and named entity recognition text samples with named entity category labels, and constructs a basic dictionary comprising a plurality of named entity categories from the same data set. The trained category prediction model is then used to identify the target words belonging to named entities in the text to be processed, together with the probability values that each target word belongs to each named entity category of the basic dictionary. If the maximum probability value is greater than the set threshold, the target word is attributed to the named entity category corresponding to that maximum probability value, completing the expansion of the dictionary; otherwise, several named entity categories are selected as candidate categories according to the probability values, the similarity between the target word and the words contained in each candidate category is determined, and the target word is attributed to the named entity category with the highest similarity. In this way, complex rules designed by field experts and large amounts of manual labeling work are avoided, massive natural language text and deep models are used to learn text semantic representations, and the construction and expansion of the dictionary are realized efficiently and with high quality.
Meanwhile, the method makes the judgment of new corpus information more flexible and efficient; the dictionary matching speed is high, the accuracy of dictionary construction is greatly improved, and performance problems are alleviated to a certain extent.
Referring to fig. 6, the following describes the above embodiment in further detail with a specific application scenario:
Assume the acquired text to be processed is "Mingming goes to visit Huangshan" and the basic dictionary includes 6 named entity categories: the person name, the organization name, the place name, the festival, the product name and the time.
After the text to be processed "Mingming goes to visit Huangshan" is processed into "[CLS] Mingming goes to visit Huangshan [SEP]", the processed text is input into the pre-training language sub-model of the category prediction model, and the corresponding embedded vectors "E[CLS] E1, E2, E3, E4, E5, E6, E7, E[SEP]" are obtained.
The embedded vectors "E[CLS] E1, E2, E3, E4, E5, E6, E7, E[SEP]" are input into the named entity recognition submodel of the category prediction model, and "E1+E2" and "E6+E7" are obtained as the target words belonging to named entities. Meanwhile, "E1+E2" is found to have a probability value of 0.98 for the person name, 0.31 for the organization name, 0.12 for the place name, 0.06 for the festival, 0.45 for the product name and 0.03 for the time; "E6+E7" has a probability value of 0.68 for the person name, 0.78 for the organization name, 0.89 for the place name, 0.07 for the festival, 0.72 for the product name and 0.04 for the time.
Assuming that the first set threshold is 0.95, the second set threshold is 0.9 and the third set threshold is 0.6, it can be determined that the named entity category to which the target word "Mingming" corresponding to "E1+E2" belongs is the person name, since 0.98 exceeds the first set threshold, and the target word "Mingming" is added to the person name category of the basic dictionary.
Since the probability values that the target word "Huangshan" corresponding to "E6+E7" belongs to the person name, the organization name, the place name, the festival, the product name and the time are all less than the first set threshold 0.95 and the second set threshold 0.9, and the named entity categories with probability values greater than the third set threshold 0.6 are the person name, the organization name, the place name and the product name, the similarity between "Huangshan" and the words contained in each of these categories is determined: 0.06 for the person name, 0.14 for the organization name, 0.76 for the place name and 0.04 for the product name. The named entity category to which the target word "Huangshan" belongs is therefore determined to be the place name, and the target word "Huangshan" is added to the place name category of the basic dictionary.
In some embodiments, the dictionary construction method provided by the present application may be compared with an N-Gram-based dictionary construction method, a word2Vec-based emotion dictionary construction method, a textrank-based dictionary construction method, and a domain dictionary construction method oriented to product evaluation object mining in the related art. The comparison result may be as shown in table 1 below:
TABLE 1
Model                                       Precision  Recall  F1-score
Based on N-Gram                             40.0
Based on word2Vec                           60.3       37.5    46.2
Based on textrank                           59.8       58.6    59.2
Product evaluation object oriented mining   64.5       69.2    66.8
Our Method                                  85.6       86.2    85.9
Wherein Precision represents the accuracy rate, Recall represents the recall rate, and F1-score is a comprehensive evaluation index; higher Precision, Recall and F1-score each indicate better performance of the method.
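The relation between the three indices can be made concrete: F1-score is the harmonic mean of Precision and Recall, which reproduces the F1 column of Table 1 from its Precision and Recall columns:

```python
def f1_score(precision, recall):
    """F1-score: the harmonic mean of Precision and Recall,
    the comprehensive evaluation index used in Table 1."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the word2Vec row of Table 1 from its Precision and Recall
print(round(f1_score(60.3, 37.5), 1))  # 46.2
```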
As can be seen from the above table, compared with the dictionary construction method based on N-Gram, the emotion dictionary construction method based on word2Vec, the dictionary construction method based on textrank and the domain dictionary construction method oriented to product evaluation object mining, the Precision, Recall and F1-score of the dictionary construction method provided by the application are the highest, indicating that its performance is the best.
Based on the same inventive concept as the dictionary construction method shown in fig. 4a, an embodiment of the present application further provides a dictionary construction device, which can be arranged in a server or a terminal device. Because the device corresponds to the dictionary construction method, and the principle by which the device solves the problem is similar to that of the method, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Fig. 7 is a schematic structural diagram of a dictionary construction device according to an embodiment of the present application, and as shown in fig. 7, the dictionary construction device includes an acquisition module 701, a word recognition module 702, a category recognition module 703, and a dictionary construction module 704.
The acquiring module 701 is used for acquiring a text to be processed and a basic dictionary; wherein the base dictionary comprises a plurality of word categories;
a word recognition module 702, configured to determine, based on the trained category prediction model, at least one candidate word included in the text to be processed and semantic information of each of the at least one candidate word;
the category identification module 703 is configured to select, according to respective semantic information of at least one candidate word, at least one target word that meets a set semantic condition through a category prediction model, and determine probability values that the at least one target word respectively belongs to a plurality of word categories;
and the dictionary constructing module 704 is used for attributing at least one target word to the word categories of which the corresponding probability values meet the set probability conditions.
Optionally, the category prediction model includes a pre-training language sub-model and a named entity recognition sub-model; the word recognition module 702 is specifically configured to:
based on the text to be processed, obtaining a word vector corresponding to at least one word contained in the text to be processed through a pre-training language sub-model; wherein each word vector represents semantic information of a corresponding word;
combining at least one word through a named entity recognition submodel based on a word vector corresponding to the at least one word to obtain at least one candidate word and semantic information of the at least one candidate word; wherein each candidate word comprises at least one word.
Optionally, the category identifying module 703 is specifically configured to:
selecting at least one target word of which the semantic information belongs to the named entity from at least one candidate word through the named entity recognition submodel; the named entity is an entity name with specific semantics;
for each of the at least one target word, performing the following operation: determining, through the named entity recognition submodel, the probability values that the target word belongs to each of the plurality of word categories according to the semantic information of the target word.
Optionally, as shown in fig. 8, the apparatus may further include a model training module 801, configured to:
acquiring a training data set; the training data set comprises a plurality of text data samples, and the text data samples are marked with set categories;
iteratively training the category prediction model based on a training data set until a set convergence condition is met, wherein the one-time iterative training process comprises the following steps:
determining at least one target word in the text data sample through a category prediction model based on the text data sample extracted from the training data set, and determining a target word category corresponding to each of the at least one target word;
and determining a corresponding loss value according to the category of the target word and the set category, and performing parameter adjustment on the category prediction model according to the loss value.
Optionally, the category prediction model includes a pre-training language sub-model and a named entity recognition sub-model; the training data set comprises encyclopedic text data samples, field text data samples and named entity recognition text data samples; the model training module 801 is further configured to:
determining corresponding embedded vector samples through the pre-training language sub-model based on text data samples extracted from the encyclopedic text data samples and the field text data samples; wherein the encyclopedic text data samples are text data not targeted at a specific field, and each field text data sample is text data that includes a plurality of named entities of set categories and is labeled with the corresponding field category;
determining at least one corresponding target word and the target word category corresponding to each target word through the named entity recognition submodel, based on vector samples extracted from the embedded vector samples and the word vector samples; wherein the word vector samples are obtained based on the named entity recognition text data samples, each named entity recognition text data sample being text data that includes at least one named entity, with each named entity word labeled with the corresponding named entity category.
Optionally, the model training module 801 is further configured to:
determining a first loss value according to the category of the target words and the category of the field, and carrying out parameter adjustment on the pre-training language sub-model according to the first loss value;
and determining a second loss value according to the target word category, the field category and the named entity category, and carrying out parameter adjustment on the named entity recognition submodel according to the second loss value.
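A sketch of how such loss values might be computed — the patent does not specify the loss function, so cross-entropy is assumed here purely for illustration, and the probability vectors and category indices below are hypothetical:

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Cross-entropy loss for category prediction (an assumed choice):
    the negative log-probability assigned to the labelled category."""
    return -math.log(predicted_probs[true_index])

# First loss value: pre-training language sub-model output vs. the field
# category label; second loss value: named entity recognition submodel
# output vs. the named entity category label.
first_loss = cross_entropy([0.7, 0.2, 0.1], 0)     # field category 0 labelled
second_loss = cross_entropy([0.1, 0.85, 0.05], 1)  # entity category 1 labelled
print(round(first_loss, 3))  # 0.357
```

Each loss value would then drive the parameter adjustment of its corresponding sub-model, as described above.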
Optionally, the model training module 801 is further configured to:
preprocessing the encyclopedic text data samples, the field text data samples and the named entity recognition text data samples in the training data set respectively; the preprocessing operation includes at least one of data screening and format conversion.
Optionally, the dictionary constructing module 704 is specifically configured to:
determining a word category corresponding to the maximum probability value based on the probability values of the target words belonging to the word categories respectively, and taking the word category as the target category;
and if the probability value of a target word belonging to the target category is greater than a first set threshold value, attributing the target word to the target category.
Optionally, the dictionary constructing module 704 is further configured to:
if the probability value of a target word belonging to the plurality of word categories respectively is not greater than a second set threshold, selecting at least one word category with the probability value greater than a third set threshold as a candidate category; the third set threshold is smaller than the second set threshold;
and selecting a candidate category which meets the set similarity condition based on the similarity between a target word and the words contained in at least one candidate category respectively, and using the candidate category as the target category, and attributing the target word to the target category.
Having described the dictionary construction method and apparatus according to the exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
Based on the same inventive concept as the method embodiment, an electronic device is further provided in the embodiment of the present application, referring to fig. 9, which is a schematic diagram of a hardware component structure of an electronic device to which the embodiment of the present application is applied, and the electronic device 900 may at least include a processor 901 and a memory 902. The memory 902 stores therein program codes, which, when executed by the processor 901, cause the processor 901 to perform the steps of any of the above-described dictionary construction methods.
In some possible implementations, a computing device according to the present application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the dictionary construction method according to various exemplary embodiments of the present application described above in the present specification. For example, the processor may perform the steps as shown in fig. 4 a.
A computing device 1000 according to this embodiment of the present application is described below with reference to fig. 10. As shown in fig. 10, computing device 1000 is embodied in the form of a general purpose computing device. Components of computing device 1000 may include, but are not limited to: the at least one processing unit 1001, the at least one storage unit 1002, and a bus 1003 connecting different system components (including the storage unit 1002 and the processing unit 1001).
Bus 1003 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus architectures.
The storage unit 1002 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)10021 and/or cache memory unit 10022, which may further include Read Only Memory (ROM) 10023.
The storage unit 1002 may also include a program/utility 10025 having a set (at least one) of program modules 10024, such program modules 10024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The computing device 1000 may also communicate with one or more external devices 1004 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the computing device 1000, and/or any devices (e.g., router, modem, etc.) that enable the computing device 1000 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 1005. Moreover, computing device 1000 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 1006. As shown, the network adapter 1006 communicates with the other modules of the computing device 1000 over bus 1003. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computing device 1000, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
On the basis of the same inventive concept as the above-described method embodiments, the various aspects of the dictionary construction method provided by the present application may also be implemented in the form of a program product comprising program code for causing an electronic device to perform the steps in the dictionary construction method according to the various exemplary embodiments of the present application described above in this specification when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 4 a.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. A method of constructing a dictionary, comprising:
acquiring a text to be processed and a basic dictionary; wherein the base dictionary contains a plurality of word categories;
determining at least one candidate word contained in the text to be processed and respective semantic information of the at least one candidate word based on the trained category prediction model;
selecting at least one target word which meets set semantic conditions according to the respective semantic information of the at least one candidate word through the category prediction model, and determining probability values of the at least one target word belonging to the plurality of word categories respectively;
and respectively attributing the at least one target word to the word categories with the corresponding probability values meeting the set probability conditions.
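As an illustrative aid only (not part of the claims), the flow of claim 1 can be sketched in Python. The model outputs, threshold value, and helper names below are hypothetical stand-ins for a trained category prediction model:

```python
# Sketch of claim 1: filter candidate words by a semantic condition, then
# attribute each surviving target word to the word category whose predicted
# probability meets the set probability condition.

def build_dictionary(candidates, base_dictionary, is_named_entity,
                     predict_probs, threshold=0.5):
    """candidates:      (word, semantic_info) pairs from the text to process
    base_dictionary: mapping of word category -> list of words
    is_named_entity: predicate standing in for the set semantic condition
    predict_probs:   word -> {category: probability} (mocked model output)
    threshold:       illustrative value for the set probability condition"""
    for word, semantics in candidates:
        if not is_named_entity(semantics):    # semantic condition filter
            continue
        probs = predict_probs(word)
        category = max(probs, key=probs.get)  # most probable word category
        if probs[category] > threshold:       # set probability condition
            base_dictionary.setdefault(category, []).append(word)
    return base_dictionary

# Toy run with mocked model outputs:
result = build_dictionary(
    [("Shanghai", "LOC"), ("runs", "O")],
    {"city": ["Beijing"], "game": ["chess"]},
    is_named_entity=lambda s: s != "O",
    predict_probs=lambda w: {"city": 0.9, "game": 0.1},
)
print(result)  # "Shanghai" joins "city"; "runs" fails the semantic condition
```
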
2. The method of claim 1, wherein the category prediction model comprises a pre-training language sub-model and a named entity recognition sub-model; the determining, based on the trained category prediction model, at least one candidate word included in the text to be processed and respective semantic information of the at least one candidate word includes:
based on the text to be processed, obtaining a word vector corresponding to at least one word contained in the text to be processed through the pre-training language sub-model; wherein each word vector represents semantic information of a corresponding word;
combining the at least one word through the named entity recognition submodel based on the word vector corresponding to the at least one word to obtain at least one candidate word and semantic information of the at least one candidate word; wherein each candidate word comprises at least one word.
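Purely as an illustration of the combination step in claim 2: if the named entity recognition submodel tags each character with BIO labels (a common NER convention assumed here, not stated in the claim), contiguous tagged characters can be merged into candidate words as follows; the tagger itself is mocked:

```python
# Merge characters into candidate words from BIO tags, so that each
# candidate word comprises at least one character (claim 2).

def combine_words(chars, tags):
    words, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                  # beginning of a candidate word
            if current:
                words.append("".join(current))
            current = [ch]
        elif tag == "I" and current:    # continuation of the current word
            current.append(ch)
        else:                           # outside any candidate word
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(combine_words(list("abcd"), ["B", "I", "O", "B"]))  # ['ab', 'd']
```
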
3. The method of claim 2, wherein the selecting, by the category prediction model, at least one target word meeting a set semantic condition according to the semantic information of each of the at least one candidate word, and determining probability values that each of the at least one target word belongs to the plurality of word categories respectively comprises:
selecting, from the at least one candidate word, at least one target word whose semantic information belongs to a named entity through the named entity recognition submodel; the named entity is an entity name with specific semantics;
for the at least one target word, performing the following operations, respectively: and determining probability values of the target word belonging to the plurality of word categories respectively according to the semantic information of the target word by the named entity recognition submodel.
4. The method according to any one of claims 1 to 3, wherein the training process of the class prediction model comprises:
acquiring a training data set; the training data set comprises a plurality of text data samples, and the text data samples are marked with set categories;
iteratively training the category prediction model based on the training data set until a set convergence condition is met, wherein one iterative training process comprises:
determining at least one target word in the text data sample through the category prediction model based on the text data sample extracted from the training data set, and determining a target word category corresponding to each of the at least one target word;
and determining a corresponding loss value according to the target word category and the set category, and carrying out parameter adjustment on the category prediction model according to the loss value.
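One possible reading of claim 4's iterative training reduces to a loop of the following shape. The convergence test, loss, and update rule here are toy stand-ins for the (unspecified) claimed model, chosen only so the loop is runnable:

```python
# Sketch of claim 4: iterate until a set convergence condition is met,
# computing a loss from predicted target word categories versus the labeled
# set categories and adjusting the model parameters on each pass.

def train(params, dataset, adjust, loss_fn, tol=1e-3, max_iters=1000):
    """params:  parameters of the category prediction model (abstracted)
    dataset: (text_sample, set_category_label) pairs
    adjust:  parameter-adjustment rule applied per iteration
    loss_fn: loss between predicted and labeled categories"""
    for _ in range(max_iters):
        loss = sum(loss_fn(params, sample, label) for sample, label in dataset)
        if loss < tol:                 # set convergence condition
            break
        params = adjust(params, loss)  # parameter adjustment step
    return params

# Toy instance: a scalar "model" pulled toward the label 2.0
fitted = train(
    0.0,
    [(None, 2.0)],
    adjust=lambda p, loss: p + 0.5 * (2.0 - p),
    loss_fn=lambda p, s, y: (p - y) ** 2,
)
print(fitted)  # converges toward 2.0
```
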
5. The method of claim 4, wherein the category prediction model comprises a pre-training language sub-model and a named entity recognition sub-model; the training data set comprises encyclopedic text data samples, field text data samples and named entity recognition text data samples;
the determining, by the category prediction model, at least one target word in the text data sample based on the text data sample extracted from the training data set, and determining a target word category corresponding to each of the at least one target word, includes:
determining corresponding embedded vector samples through the pre-training language sub-model based on the extracted text data samples from the encyclopedic text data samples and the field text data samples; the encyclopedic text data samples are text data of non-pertinence fields, each field text data sample is text data comprising a plurality of named entities of set categories, and the text data are marked with corresponding field categories;
determining, by the named entity recognition submodel, a corresponding at least one target word and its respective corresponding target word class based on vector samples extracted from the embedded vector samples and word vector samples; the word vector sample is obtained based on the named entity recognition text data sample; the named entity recognition text data sample is text data comprising at least one named entity, and each named entity word is marked with a corresponding named entity category.
6. The method of claim 5, wherein determining a corresponding loss value according to the target word class and the set class, and performing parameter adjustment on the class prediction model according to the loss value comprises:
determining a first loss value according to the target word category and the field category, and carrying out parameter adjustment on the pre-training language sub-model according to the first loss value;
and determining a second loss value according to the target word category, the field category and the named entity category, and carrying out parameter adjustment on the named entity identification submodel according to the second loss value.
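The two losses of claim 6 could, for instance, be instantiated with cross-entropy; the claims do not name a loss function, so this pairing is an assumption made only for illustration:

```python
import math

def cross_entropy(probs, true_idx):
    """Negative log-probability of the labeled class."""
    return -math.log(probs[true_idx])

def joint_losses(domain_probs, entity_probs, domain_label, entity_label):
    # First loss: target word category vs. field category -> used to adjust
    # the pre-training language sub-model (claim 6, first step).
    first = cross_entropy(domain_probs, domain_label)
    # Second loss: adds the named entity category term -> used to adjust the
    # named entity recognition submodel (claim 6, second step).
    second = first + cross_entropy(entity_probs, entity_label)
    return first, second

l1, l2 = joint_losses([0.9, 0.1], [0.8, 0.2], 0, 0)
print(l1, l2)  # the second loss includes the first plus an entity term
```
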
7. The method of claim 5, wherein after the obtaining of the training data set, the method further comprises:
preprocessing the encyclopedic text data samples, the field text data samples and the named entity recognition text data samples in the training data set respectively; the pre-processing operation includes at least one of data screening and format conversion.
8. The method according to any one of claims 1 to 3, wherein the attributing the at least one target word to the word categories with corresponding probability values meeting the set probability condition respectively comprises:
for the at least one target word, performing the following operations, respectively:
determining a word category corresponding to the maximum probability value based on the probability values of the target word belonging to the plurality of word categories respectively, and taking the word category as the target category;
if the probability value of the target word belonging to the target category is larger than a first set threshold value, attributing the target word to the target category.
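A minimal sketch of claim 8's attribution rule (the threshold value is illustrative, not taken from the application):

```python
# Claim 8: take the category with the maximum probability, and attribute the
# target word to it only if that probability exceeds the first set threshold.

def assign_by_max_prob(probs, first_threshold=0.5):
    target = max(probs, key=probs.get)
    return target if probs[target] > first_threshold else None

print(assign_by_max_prob({"city": 0.7, "game": 0.3}))   # confident -> 'city'
print(assign_by_max_prob({"city": 0.4, "game": 0.35}))  # too uncertain -> None
```
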
9. The method according to any one of claims 1 to 3, wherein the attributing the at least one target word to the word categories with corresponding probability values meeting set probability conditions respectively comprises:
for the at least one target word, performing the following operations, respectively:
if none of the probability values of a target word belonging to the plurality of word categories respectively is greater than a second set threshold, selecting at least one word category with a probability value greater than a third set threshold as a candidate category; the third set threshold is less than the second set threshold;
and selecting a candidate category meeting a set similarity condition as a target category based on the similarity between the target word and the word contained in each candidate category, and attributing the target word to the target category.
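The fallback of claim 9 can be sketched as follows; the two thresholds and the character-overlap similarity are illustrative stand-ins (the application does not specify a similarity measure):

```python
# Claim 9: when no single category is confident enough (second threshold),
# shortlist categories above the smaller third threshold, then pick the one
# whose already-contained words are most similar to the target word.

def assign_by_similarity(word, probs, category_words, similarity,
                         second_threshold=0.5, third_threshold=0.2):
    if max(probs.values()) > second_threshold:
        return None                  # the claim 8 branch applies instead
    candidates = [c for c, p in probs.items() if p > third_threshold]
    if not candidates:
        return None
    def avg_sim(cat):                # set similarity condition
        words = category_words.get(cat, [])
        return sum(similarity(word, w) for w in words) / max(len(words), 1)
    return max(candidates, key=avg_sim)

def char_jaccard(a, b):
    """Toy similarity: overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

cats = {"fruit": ["apple", "apricot"], "metal": ["iron", "zinc"]}
print(assign_by_similarity("applet", {"fruit": 0.45, "metal": 0.35},
                           cats, char_jaccard))  # 'fruit'
```
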
10. A dictionary constructing apparatus, comprising:
the acquisition module is used for acquiring the text to be processed and the basic dictionary; wherein the base dictionary contains a plurality of word categories;
the word recognition module is used for determining at least one candidate word contained in the text to be processed and semantic information of the at least one candidate word based on the trained category prediction model;
the category identification module is used for selecting at least one target word which meets set semantic conditions according to the respective semantic information of the at least one candidate word through the category prediction model, and determining probability values of the at least one target word belonging to the plurality of word categories respectively;
and the dictionary construction module is used for respectively attributing the at least one target word to the word categories with the corresponding probability values meeting the set probability conditions.
11. The apparatus of claim 10, wherein the category prediction model comprises a pre-training language sub-model and a named entity recognition sub-model; the word recognition module is specifically configured to:
based on the text to be processed, obtaining a word vector corresponding to at least one word contained in the text to be processed through the pre-training language sub-model; wherein each word vector represents semantic information of a corresponding word;
combining the at least one word through the named entity recognition submodel based on the word vector corresponding to the at least one word to obtain at least one candidate word and semantic information of the at least one candidate word; wherein each candidate word comprises at least one word.
12. The apparatus of claim 11, wherein the category identification module is specifically configured to:
selecting, from the at least one candidate word, at least one target word whose semantic information belongs to a named entity through the named entity recognition submodel; the named entity is an entity name with specific semantics;
for the at least one target word, performing the following operations, respectively: and determining probability values of the target word belonging to the plurality of word categories respectively according to the semantic information of the target word by the named entity recognition submodel.
13. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 9.
14. A computer-readable storage medium, characterized in that it comprises program code for causing an electronic device to carry out the steps of the method according to any one of claims 1 to 9, when said program code is run on said electronic device.
15. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.
CN202111475744.5A 2021-12-06 2021-12-06 Dictionary construction method and device, electronic equipment and storage medium Pending CN114398482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111475744.5A CN114398482A (en) 2021-12-06 2021-12-06 Dictionary construction method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114398482A true CN114398482A (en) 2022-04-26

Family

ID=81226055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111475744.5A Pending CN114398482A (en) 2021-12-06 2021-12-06 Dictionary construction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114398482A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994536A (en) * 2023-03-24 2023-04-21 浪潮电子信息产业股份有限公司 Text information processing method, system, equipment and computer storage medium
CN115994536B (en) * 2023-03-24 2023-07-14 浪潮电子信息产业股份有限公司 Text information processing method, system, equipment and computer storage medium

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN113326380B (en) Equipment measurement data processing method, system and terminal based on deep neural network
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN114818708B (en) Key information extraction method, model training method, related device and electronic equipment
CN111666766A (en) Data processing method, device and equipment
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN113076758B (en) Task-oriented dialog-oriented multi-domain request type intention identification method
CN114416976A (en) Text labeling method and device and electronic equipment
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
CN115470790A (en) Method and device for identifying named entities in file
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN112132269B (en) Model processing method, device, equipment and storage medium
CN115017906A (en) Method, device and storage medium for identifying entities in text
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN113657092A (en) Method, apparatus, device and medium for identifying label
CN117009532B (en) Semantic type recognition method and device, computer readable medium and electronic equipment
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium
CN116029492B (en) Order sending method and device
CN113342944B (en) Corpus generalization method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination