CN111950265A - Domain lexicon construction method and device - Google Patents

Domain lexicon construction method and device

Info

Publication number
CN111950265A
Authority
CN
China
Prior art keywords
training
keywords
field
trained
word stock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010867382.3A
Other languages
Chinese (zh)
Inventor
汪良果
许文文
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute
Priority to CN202010867382.3A
Publication of CN111950265A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The disclosure provides a domain lexicon construction method and device. The method comprises the following steps: determining the domain for which a lexicon is to be constructed; acquiring corresponding domain text according to the domain; extracting all keywords from the domain text to obtain an initial lexicon; marking a plurality of keywords in the initial lexicon to construct a training data set; training a preset pre-training model with the training data set; and predicting the keywords in the initial lexicon with the trained model to obtain the domain lexicon from the prediction results. In the method and device of the disclosed embodiments, domain text is acquired for a specific domain, keywords are extracted from it, the extracted keywords are labeled to build the training data set, the pre-training model is trained on that set, and domain words are then predicted to construct the domain lexicon. Because model training proceeds from already-extracted keywords, the efficiency and accuracy of training the pre-training model are improved, and therefore so are the efficiency and accuracy of lexicon construction.

Description

Domain lexicon construction method and device
Technical Field
The disclosure belongs to the field of computer technology, and in particular relates to a domain lexicon construction method and device.
Background
Information retrieval in a specific domain depends on effective domain terms as retrieval keywords. Taking the industrial-information domain as an example: the field develops rapidly and new vocabulary emerges continually, so discovering new terms and constructing a domain lexicon in time plays an important role in domain language understanding. In the prior art, domain lexicons are constructed mainly by rule-based and statistics-based methods. Rule-based methods build feature templates from the word-formation, syntactic, and domain characteristics of domain terms, then extract words matching the templates from a corpus; they place high demands on rule design and template quality and cannot cover all linguistic phenomena in a specific domain, so recall is low. Statistics-based methods rely on computing various statistics, including word frequency, mutual information, and information entropy; they require the support of a large-scale corpus and place high demands on corpus quality.
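As an illustration of the statistics-based approach described above (and not of the method disclosed below), the following Python sketch scores adjacent token pairs by pointwise mutual information; high-scoring pairs are candidate multi-word domain terms. The tokenization and any score threshold are assumptions of the example.

```python
import math
from collections import Counter

def pmi_scores(tokens: list[str]) -> dict[tuple[str, str], float]:
    """Score adjacent token pairs by pointwise mutual information (PMI),
    one of the statistics (alongside word frequency and entropy) that
    statistics-based term extraction relies on."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```

As the paragraph notes, such statistics only become reliable on a large, high-quality corpus; on a small corpus the counts, and hence the PMI estimates, are noisy.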
Disclosure of Invention
The present disclosure aims to address at least one of the technical problems in the prior art, and provides a domain lexicon construction method and device.
One aspect of the present disclosure provides a domain lexicon construction method, the method comprising:
determining the domain for which a lexicon is to be constructed;
acquiring corresponding domain text according to the domain;
extracting all keywords from the domain text to obtain an initial lexicon;
marking a plurality of keywords in the initial lexicon to construct a training data set;
training a preset pre-training model with the training data set; and
predicting the keywords in the initial lexicon with the trained pre-training model, and obtaining the domain lexicon according to the prediction results.
Optionally, the marking a plurality of keywords in the initial lexicon to construct a training data set includes:
making retention marks on a first preset number of keywords in the initial lexicon; and
making deletion marks on a second preset number of keywords in the initial lexicon;
wherein a preset proportional relationship exists between the first preset number and the second preset number.
Optionally, the training a preset pre-training model with the training data set includes:
dividing the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion;
constructing a plurality of classifiers based on the pre-training model, and training the plurality of classifiers with the training data subset;
validating the trained classifiers with the validation data subset, and selecting the classifier with the highest accuracy as the trained pre-training model; and
testing the trained pre-training model with the test data subset and recording the test accuracy.
Optionally, the predicting the keywords in the initial lexicon with the trained pre-training model and obtaining a domain lexicon according to the prediction results includes:
predicting the keywords in the initial lexicon with the trained pre-training model, and selecting the keywords whose prediction result is "retain" to form the domain lexicon.
Optionally, the pre-training model adopts a pre-trained BERT model.
In another aspect of the present disclosure, a domain lexicon construction apparatus is provided, the apparatus comprising:
a determining module for determining the domain for which a lexicon is to be constructed;
an acquisition module for acquiring corresponding domain text according to the domain;
an extraction module for extracting all keywords from the domain text to obtain an initial lexicon;
a marking module for marking a plurality of keywords in the initial lexicon to construct a training data set;
a training module for training a preset pre-training model with the training data set; and
a domain lexicon construction module for predicting the keywords in the initial lexicon with the trained pre-training model and obtaining the domain lexicon according to the prediction results.
Optionally, the marking module further includes:
a retention marking sub-module for making retention marks on a first preset number of keywords in the initial lexicon; and
a deletion marking sub-module for making deletion marks on a second preset number of keywords in the initial lexicon;
wherein a preset proportional relationship exists between the first preset number and the second preset number.
Optionally, the training module further includes:
a partitioning sub-module for dividing the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion;
a training sub-module for constructing a plurality of classifiers based on the pre-training model and training the plurality of classifiers with the training data subset;
a validation sub-module for validating the trained classifiers with the validation data subset and selecting the classifier with the highest accuracy as the trained pre-training model; and
a testing sub-module for testing the trained pre-training model with the test data subset and recording the test accuracy.
Yet another aspect of the present disclosure provides an electronic device, comprising:
one or more processors; and
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to carry out the method described above.
Still another aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program being able to carry out the method described above when executed by a processor.
In the domain lexicon construction method and device of the disclosed embodiments, domain text is acquired for a specific domain; keywords in the domain text are extracted; the keywords are labeled on that basis to construct a training data set; a pre-training model is trained on the training data set; domain words are predicted with the trained model; and the domain lexicon is constructed from the prediction results. Because model training builds on keyword extraction, the efficiency and accuracy of training the pre-training model are improved, and therefore so are the efficiency and accuracy of lexicon construction.
Drawings
FIG. 1 is a schematic block diagram of an example electronic device for implementing a domain lexicon construction method and apparatus according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a domain lexicon construction method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of step S140 according to another embodiment of the disclosure;
FIG. 4 is a schematic flow chart of step S150 according to another embodiment of the disclosure;
FIG. 5 is a schematic flow chart of step S160 according to another embodiment of the disclosure;
FIG. 6 is a schematic structural diagram of a domain lexicon construction apparatus according to another embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical aspects of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical and scientific terms used in the present disclosure have the ordinary meaning understood by those of ordinary skill in the art to which the disclosure belongs. The use of "including" or "comprising" and the like in this disclosure does not limit the recited shapes, numbers, steps, actions, operations, members, elements and/or groups thereof, and does not preclude the presence or addition of one or more other shapes, numbers, steps, actions, operations, members, elements and/or groups thereof. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number or order of the technical features indicated; thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
In some descriptions of the disclosure, unless expressly stated or limited otherwise, terms such as "mounted," "connected," or "fixed" are not restricted to physical or mechanical connections but may include electrical connections, whether direct or indirect through an intermediate medium, and whether an internal communication between two elements or an interaction between two elements.
The relative arrangement of parts and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise. It should also be understood that, for ease of description, the dimensions of the various elements shown in the figures are not drawn to scale. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, are to be considered part of the specification. In all examples shown and discussed herein, any specific value should be interpreted as merely exemplary rather than limiting, so other examples may use different values. Note that like symbols and letters represent like items in the figures below; once an item is defined in one figure, it need not be discussed further in subsequent figures.
Before the detailed discussion, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe operations (steps) as sequential, many of the operations can be performed in parallel or concurrently, and the order of the operations may be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not shown in the figure. A process may correspond to a method, a function, a procedure, a subroutine, and the like.
First, an example electronic device for implementing a domain lexicon construction method and apparatus according to an embodiment of the present disclosure is described with reference to FIG. 1.
As shown in FIG. 1, the electronic device 200 includes one or more processors 210, one or more storage devices 220, an input device 230, an output device 240, and the like, interconnected by a bus system and/or another form of connection mechanism 250. It should be noted that the components and structure of the electronic device shown in FIG. 1 are merely exemplary, not limiting; the electronic device may have other components and structures as desired.
The processor 210 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The storage device 220 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 210 may execute them to implement the functionality of the embodiments of the disclosure described below and/or other desired functionality. Various applications and data, such as data used and/or generated by those applications, may also be stored on the computer-readable storage medium.
The input device 230 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 240 may output various information (e.g., images or sounds) to an outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
Exemplarily, the example electronic device for implementing the domain lexicon construction method and apparatus according to an embodiment of the present disclosure may be a smartphone, a tablet computer, or the like.
Next, a domain lexicon construction method according to an embodiment of the present disclosure is described with reference to FIG. 2.
As shown in FIG. 2, a domain lexicon construction method S100 includes:
S110: Determine the domain for which the lexicon is to be constructed.
Specifically, in this step, the target domain is determined according to specific requirements. The domain may be a technical one, such as the computer or construction domain, or another kind, such as the e-commerce domain; this embodiment places no particular limitation on it.
S120: and acquiring a corresponding domain text according to the domain.
Specifically, in this step, the text of the field to be constructed may be obtained in any manner, for example, the corresponding field text may be obtained by crawling an internet web page corresponding to the field to be constructed, for example, the internet web page may include a webpage with specialties such as a popular science page and a paper website, and may also include a webpage with specialties such as an internet news webpage, a web community webpage or a blog webpage, and the like, for example, the field text may include a text with specialties such as an academic conference, an academic journal, and the like, and may also include a non-specialties such as a blog, and a person skilled in the art may also obtain the text of the field to be constructed in other manners, for example, from a paper document such as a publication, and the embodiment is not limited in particular.
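As a minimal sketch of such crawling (the text prescribes no tooling, so the third-party requests and beautifulsoup4 packages and the single-page scope here are assumptions), the visible text of one page might be fetched as follows; a real crawler would add link discovery, rate limiting, and robots.txt handling:

```python
import requests                   # third-party: pip install requests
from bs4 import BeautifulSoup     # third-party: pip install beautifulsoup4

def fetch_domain_text(url: str) -> str:
    """Download one web page and return its visible text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):   # drop non-visible content
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)
```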
S130: and extracting all keywords from the field text to obtain an initial word bank.
Specifically, in this step, any manner may be used to extract the keywords, for example, a crawler technology may be used to capture the keywords in the internet webpage, and a person skilled in the art may also extract the keywords in other manners, which is not limited in this embodiment. It should be noted that the keyword includes a keyword that has been marked in the domain text, for example, if the domain text is acquired from a thesis website such as CNKI, in general, a keyword tag that has been established for the text in the thesis website is used to mark the keyword on the domain text, at this time, such information can be directly extracted as the keyword, and the extraction of the keyword in this way can effectively utilize the marked data in the web page, thereby improving the efficiency and accuracy of keyword extraction; in addition, the keywords may also include keywords that are not marked in the domain text, for example, if the domain text is obtained from a web page such as a blog, the web page of the blog may not mark the keywords on the text, and at this time, the keywords may be extracted from the domain text by a keyword extraction algorithm.
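For the unmarked case, the text does not name a particular keyword-extraction algorithm; as one assumed, concrete choice, the sketch below uses the TextRank implementation in the third-party jieba package, which suits Chinese text:

```python
import jieba.analyse   # third-party: pip install jieba

def extract_keywords(domain_text: str, top_k: int = 200) -> list[str]:
    """Extract candidate keywords from one text with TextRank."""
    return jieba.analyse.textrank(domain_text, topK=top_k, withWeight=False)

# the initial lexicon is the union of the keywords of all collected texts
domain_texts = ["...domain text from step S120...", "..."]
initial_lexicon: set[str] = set()
for text in domain_texts:
    initial_lexicon.update(extract_keywords(text))
```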
S140: and marking a plurality of keywords in the initial word stock to construct a training data set.
Specifically, in this step, the plurality of keywords are labeled according to a preset classification, for example, the preset classification may be retention and deletion, the keywords labeled as retention are finally used for constructing the domain thesaurus, and the keywords labeled as deletion are not finally used for constructing the domain thesaurus. For example, in this step, a manual labeling method may be used to label a plurality of keywords, so as to ensure the accuracy of the constructed training data set.
S150: and training a preset pre-training model by using the training data set.
Specifically, in this step, the pre-training model is a model that is selected according to actual conditions and can realize a word classification prediction function, and the pre-training model is trained through a training data set to obtain a pre-training model that meets the word characteristics of the field to be constructed.
S160: and predicting the keywords in the initial word stock by using the trained pre-training model, and obtaining a field word stock according to a prediction result.
Specifically, in this step, the trained pre-training model is used to mark the keywords in the initial lexicon according to the classification preset in step S140, for example, the keywords are marked as being retained or deleted, and the prediction of the keywords is completed.
In the domain lexicon construction method above, domain text is acquired for a specific domain; keywords are extracted from it; the keywords are labeled on that basis to construct a training data set; a pre-training model is trained on that data set; domain words are predicted with the trained model; and the domain lexicon is constructed from the prediction results. Because the training data set is built on top of keyword extraction, the efficiency and accuracy of model training are improved, and therefore so are the efficiency and accuracy of lexicon construction.
The construction of the training data set is further described below with reference to FIG. 3.
Exemplarily, as shown in FIG. 3, step S140 specifically includes:
S141: Make retention marks on a first preset number of keywords in the initial lexicon.
Specifically, in this step, keywords may be drawn at random from the initial lexicon and judged by manual reading as belonging, or not, to the target domain; if a keyword belongs, it is marked retain. The drawing and marking are repeated until the number of keywords marked retain reaches the first preset number. The first preset number may be set according to the actual situation, for example 500, or such that its ratio to the total number of keywords in the initial lexicon exceeds a preset proportion; this ensures that the training data set contains enough retain-marked keywords for accurate subsequent training of the pre-training model.
It should be noted that, in this step, marking a keyword as "retain" is only an example; those skilled in the art may use any other word conveying the same meaning, for example "save" or "keep"; this embodiment is not particularly limited.
S142: and making deletion marks on the keywords with the second preset number in the initial word stock.
Specifically, in this step, the keywords may be randomly extracted from the initial lexicon, and it is determined whether the keywords should belong to the keywords in the field to be constructed by manual reading, if not, the extracted keywords are marked as deleted, and the above extraction and marking processes are performed in a loop until the number of the keywords marked as deleted reaches a preset second preset number. For example, the second preset number may be set according to an actual situation, for example, the second preset number is 500, or a ratio of the second preset number to the total number of the keywords in the initial lexicon exceeds a preset ratio, so as to ensure that the constructed training data set includes a sufficient number of keywords marked as deleted, thereby ensuring accuracy of subsequent training on the pre-training model.
It should be noted that, in this step, the keyword is only exemplarily marked as deleted, and those skilled in the art may also mark it as any other words that may have a meaning of "delete", for example, delete, remove, etc., and this embodiment is not particularly limited.
For example, a preset proportional relationship exists between the first and second preset numbers, and it may be set according to actual requirements. With a 1:1 ratio, the numbers of retain-marked and delete-marked keywords are equal, so the two classes are balanced in the training data set; the trained model then recognizes both classes equally well, improving its prediction accuracy and balance. A minimal sketch of steps S141 and S142 follows.
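In the sketch below, keywords are drawn at random and judged one by one until both quotas are filled. The interactive prompt stands in for the manual review described above, and the label ids (1 for retain, 0 for delete) are assumptions of the example.

```python
import random

def build_training_set(initial_lexicon, n_retain=500, n_delete=500):
    """Randomly draw keywords and label them until both quotas are met.
    The 1:1 default ratio keeps the two classes balanced, as discussed above."""
    pool = list(initial_lexicon)
    random.shuffle(pool)
    retained, deleted = [], []
    for word in pool:
        if len(retained) >= n_retain and len(deleted) >= n_delete:
            break
        # interactive prompt standing in for the manual review in the text
        keep = input(f"Is '{word}' a term of the target domain? [y/n] ").strip().lower() == "y"
        if keep and len(retained) < n_retain:
            retained.append((word, 1))    # 1 = retain (assumed label id)
        elif not keep and len(deleted) < n_delete:
            deleted.append((word, 0))     # 0 = delete (assumed label id)
    return retained + deleted
```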
It should be noted that steps S141 and S142 have no fixed order: they may be executed in either order or simultaneously.
The training of the pre-training model is further described below with reference to FIG. 4.
Exemplarily, as shown in FIG. 4, step S150 specifically includes:
S151: Divide the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion.
Specifically, in this step, the preset training proportion may be set according to the actual situation. Making the training data subset larger than the validation and test subsets ensures that it contains enough keywords and improves the comprehensiveness and stability of training; for example, the labeled keywords may be divided in a ratio of 8:1:1 to obtain the training, validation, and test data subsets respectively, as in the sketch below.
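A minimal sketch of the split; the shuffle and fixed seed are assumptions added for reproducibility:

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle labelled (keyword, label) pairs and split them, here 8:1:1,
    into training, validation, and test data subsets."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * ratios[0])
    n_val = int(len(data) * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```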
S152: and constructing a plurality of classifiers based on the pre-training model, and training the plurality of classifiers by using the training data subsets.
Specifically, in this step, different settings are respectively performed on the pre-training models to obtain a plurality of classifiers, on this basis, a plurality of keywords in the training data subset are respectively input into the plurality of classifiers, and the classification results output by the classifiers are corrected according to the marked retention and deletion, and the above processes are repeated for a plurality of times to complete the training of the classifiers.
S153: and verifying the trained classifiers by using the verification data subset, and selecting the classifier with the highest accuracy as the trained pre-training model.
Specifically, in this step, a plurality of keywords in the verification data subset are respectively input into a plurality of trained classifiers, the classification results output by the classifiers are verified according to the marked retention and deletion, the accuracy of each classifier is obtained according to the verification results, and the classifier with the highest accuracy is used as a trained pre-training model.
S154: and testing the trained pre-training model by using the test data subset, and recording the test accuracy.
Specifically, in this step, a plurality of keywords in the test data subset are input into the trained pre-training model, and the classification result output by the trained pre-training model is tested according to the marked retention and deletion, so as to obtain the test accuracy, for example, if the test accuracy does not meet the preset accuracy requirement and is too low, the above steps S151 to S154 may be executed again until the test accuracy meets the preset accuracy requirement.
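The sketch below covers steps S152 to S154 under stated assumptions: the text fixes no framework, so PyTorch, the Hugging Face transformers package, and the public bert-base-chinese checkpoint are assumed, and "a plurality of classifiers" is interpreted as the same BERT backbone fine-tuned under several learning rates. Label ids follow the earlier sketches (1 retain, 0 delete).

```python
import numpy as np
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

class KeywordDataset(torch.utils.data.Dataset):
    """Tokenized (keyword, label) pairs for a BERT sequence classifier."""
    def __init__(self, samples, tokenizer):
        words, labels = zip(*samples)
        self.enc = tokenizer(list(words), padding=True, truncation=True, max_length=16)
        self.labels = list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def train_select_test(train_set, val_set, test_set):
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    train_ds, val_ds, test_ds = (KeywordDataset(s, tokenizer)
                                 for s in (train_set, val_set, test_set))
    best_trainer, best_acc = None, -1.0
    for lr in (2e-5, 3e-5, 5e-5):                 # S152: several classifier settings
        model = BertForSequenceClassification.from_pretrained(
            "bert-base-chinese", num_labels=2)
        args = TrainingArguments(output_dir=f"clf_lr{lr}", learning_rate=lr,
                                 num_train_epochs=3,
                                 per_device_train_batch_size=32,
                                 save_strategy="no")
        trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                          eval_dataset=val_ds, compute_metrics=compute_accuracy)
        trainer.train()
        acc = trainer.evaluate()["eval_accuracy"]             # S153: validate
        if acc > best_acc:
            best_trainer, best_acc = trainer, acc
    test_acc = best_trainer.evaluate(test_ds)["eval_accuracy"]  # S154: test
    print(f"validation accuracy {best_acc:.3f}, test accuracy {test_acc:.3f}")
    return best_trainer.model, tokenizer
```

Varying only the learning rate is the simplest instance of "different settings"; epoch count, batch size, or classifier-head size could be varied in the same loop.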
Specifically, because the retain-marked keywords in the training data set belong to the target domain and the delete-marked keywords do not, the trained pre-training model classifies keywords by domain membership: keywords belonging to the target domain are classified as retain, and keywords not belonging to it are classified as delete.
The construction of the domain lexicon from the prediction results is further described below with reference to FIG. 5.
Exemplarily, as shown in FIG. 5, step S160 specifically includes:
S161: Predict the keywords in the initial lexicon with the trained pre-training model.
Specifically, in this step, the keywords in the initial lexicon are input into the trained pre-training model, which classifies each keyword as retained or deleted, completing the classification prediction.
S162: and selecting the keywords with the prediction results as the reserved keywords as the domain word stock.
Specifically, in this step, the prediction result is a reserved keyword, that is, a keyword classified as a reserved keyword, which is a keyword belonging to the domain to be constructed, and therefore, the part of keywords is reserved to form the domain lexicon.
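A sketch of steps S161 and S162, reusing the model and tokenizer returned by the training sketch above; treating label id 1 as retain remains an assumption:

```python
import torch

def predict_domain_lexicon(keywords, model, tokenizer, batch_size=64):
    """Classify every keyword of the initial lexicon and keep those
    predicted as retain; the kept words form the domain lexicon."""
    model.eval()
    words = list(keywords)
    lexicon = []
    with torch.no_grad():
        for i in range(0, len(words), batch_size):
            batch = words[i:i + batch_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=16, return_tensors="pt")
            preds = model(**enc).logits.argmax(dim=-1).tolist()
            lexicon += [w for w, p in zip(batch, preds) if p == 1]  # 1 = retain
    return lexicon
```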
For example, in the present embodiment, the pre-training model may adopt any model capable of classification, such as a pre-trained BERT model.
In the domain lexicon construction method above, marking keywords in the initial lexicon as retain or delete in a preset quantitative relationship builds an accurate and comprehensive data set for the pre-training model, improving the accuracy and efficiency of its training. Dividing the training data set into training, validation, and test subsets in a preset proportion, and training, validating, and testing the model on them respectively, ensures the accuracy of the trained model and improves the reliability and accuracy of training. Because the trained model classifies keywords accurately, predicting the keywords in the initial lexicon with it and selecting those predicted as retain yields an effective and accurate classification of the keywords, and hence an effectively and accurately constructed domain lexicon.
Next, a domain lexicon construction apparatus according to another embodiment of the present disclosure is described with reference to FIG. 6.
Illustratively, as shown in FIG. 6, a domain lexicon construction apparatus 100 includes:
a determining module 110, configured to determine the domain for which the lexicon is to be constructed;
an acquisition module 120, configured to acquire corresponding domain text according to the domain;
an extraction module 130, configured to extract all keywords from the domain text to obtain an initial lexicon;
a marking module 140, configured to mark a plurality of keywords in the initial lexicon to construct a training data set;
a training module 150, configured to train a preset pre-training model with the training data set; and
a domain lexicon construction module 160, configured to predict the keywords in the initial lexicon with the trained pre-training model and obtain the domain lexicon according to the prediction results.
With the domain lexicon construction apparatus, domain text can be acquired for a specific domain and keywords extracted from it; the keywords are labeled on that basis to construct a training data set; a pre-training model is trained on the data set; domain words are predicted with the trained model; and the domain lexicon is built from the prediction results. Training the model on top of keyword extraction improves the efficiency and accuracy of training, so the domain lexicon is constructed more efficiently and accurately.
Illustratively, as shown in FIG. 6, the marking module 140 further includes:
a retention marking sub-module 141, configured to make retention marks on a first preset number of keywords in the initial lexicon; and
a deletion marking sub-module 142, configured to make deletion marks on a second preset number of keywords in the initial lexicon;
wherein a preset proportional relationship exists between the first preset number and the second preset number.
Illustratively, as shown in FIG. 6, the training module 150 further includes:
a partitioning sub-module 151, configured to divide the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion;
a training sub-module 152, configured to construct a plurality of classifiers based on the pre-training model and train them with the training data subset;
a validation sub-module 153, configured to validate the trained classifiers with the validation data subset and select the classifier with the highest accuracy as the trained pre-training model; and
a testing sub-module 154, configured to test the trained pre-training model with the test data subset and record the test accuracy.
With the domain lexicon construction apparatus of the disclosed embodiment, keywords in the initial lexicon can be marked as retain or delete in a preset quantitative relationship, building an accurate and comprehensive data set for the pre-training model and improving the accuracy and efficiency of its training. Dividing the training data set into training, validation, and test subsets in a preset proportion, and training, validating, and testing the model on them respectively, ensures the accuracy of the trained model, improves the reliability and accuracy of training, and allows keyword classification, and hence domain lexicon construction, to be completed effectively and accurately.
The computer-readable medium may be included in the above-described apparatus, device, or system, or it may exist separately.
The computer readable storage medium may be any tangible medium that can contain or store a program, and may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, more specific examples of which include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, an optical fiber, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example in baseband or as part of a carrier wave; the program code may be carried by any suitable medium.
In the description herein, reference to "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, such schematic expressions do not necessarily refer to the same embodiment or example, and the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, those skilled in the art may combine the different embodiments or examples, and the features thereof, described in this specification, provided they do not contradict one another.
It is to be understood that the above embodiments are merely exemplary embodiments employed to illustrate the principles of the present disclosure, and that the disclosure is not limited thereto. Various changes and modifications apparent to those skilled in the art can be made without departing from the spirit and scope of the disclosure, and such changes and modifications are also considered to fall within the scope of the disclosure.

Claims (10)

1. A domain lexicon construction method, characterized by comprising the following steps:
determining the domain for which a lexicon is to be constructed;
acquiring corresponding domain text according to the domain;
extracting all keywords from the domain text to obtain an initial lexicon;
marking a plurality of keywords in the initial lexicon to construct a training data set;
training a preset pre-training model with the training data set; and
predicting the keywords in the initial lexicon with the trained pre-training model, and obtaining a domain lexicon according to the prediction results.
2. The method of claim 1, wherein the marking a plurality of keywords in the initial lexicon to construct a training data set comprises:
making retention marks on a first preset number of keywords in the initial lexicon; and
making deletion marks on a second preset number of keywords in the initial lexicon;
wherein a preset proportional relationship exists between the first preset number and the second preset number.
3. The method of claim 2, wherein the training a preset pre-training model with the training data set comprises:
dividing the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion;
constructing a plurality of classifiers based on the pre-training model, and training the plurality of classifiers with the training data subset;
validating the trained classifiers with the validation data subset, and selecting the classifier with the highest accuracy as the trained pre-training model; and
testing the trained pre-training model with the test data subset and recording the test accuracy.
4. The method of claim 3, wherein the predicting the keywords in the initial lexicon with the trained pre-training model and obtaining a domain lexicon according to the prediction results comprises:
predicting the keywords in the initial lexicon with the trained pre-training model, and selecting the keywords whose prediction result is "retain" to form the domain lexicon.
5. The method of any one of claims 1 to 4, wherein the pre-training model employs a pre-trained BERT model.
6. A domain lexicon construction apparatus, characterized by comprising:
a determining module for determining the domain for which a lexicon is to be constructed;
an acquisition module for acquiring corresponding domain text according to the domain;
an extraction module for extracting all keywords from the domain text to obtain an initial lexicon;
a marking module for marking a plurality of keywords in the initial lexicon to construct a training data set;
a training module for training a preset pre-training model with the training data set; and
a domain lexicon construction module for predicting the keywords in the initial lexicon with the trained pre-training model and obtaining the domain lexicon according to the prediction results.
7. The apparatus of claim 6, wherein the marking module further comprises:
a retention marking sub-module for making retention marks on a first preset number of keywords in the initial lexicon; and
a deletion marking sub-module for making deletion marks on a second preset number of keywords in the initial lexicon;
wherein a preset proportional relationship exists between the first preset number and the second preset number.
8. The apparatus of claim 7, wherein the training module further comprises:
a partitioning sub-module for dividing the training data set into a training data subset, a validation data subset, and a test data subset according to a preset training proportion;
a training sub-module for constructing a plurality of classifiers based on the pre-training model and training the plurality of classifiers with the training data subset;
a validation sub-module for validating the trained classifiers with the validation data subset and selecting the classifier with the highest accuracy as the trained pre-training model; and
a testing sub-module for testing the trained pre-training model with the test data subset and recording the test accuracy.
9. An electronic device, comprising:
one or more processors; and
a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the method of any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein
the computer program, when executed by a processor, is capable of implementing the method of any one of claims 1 to 5.
CN202010867382.3A 2020-08-25 2020-08-25 Domain lexicon construction method and device Pending CN111950265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010867382.3A CN111950265A (en) 2020-08-25 2020-08-25 Domain lexicon construction method and device


Publications (1)

Publication Number Publication Date
CN111950265A true CN111950265A (en) 2020-11-17

Family

ID=73366496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010867382.3A Pending CN111950265A (en) 2020-08-25 2020-08-25 Domain lexicon construction method and device

Country Status (1)

Country Link
CN (1) CN111950265A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836519A (en) * 2021-02-08 2021-05-25 网易(杭州)网络有限公司 Training method of text generation model, and text generation method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635296A * 2018-12-08 2019-04-16 广州荔支网络技术有限公司 New word mining method, device, computer equipment and storage medium
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium
CN111325019A (en) * 2020-01-21 2020-06-23 国网北京市电力公司 Word bank updating method and device and electronic equipment
CN111460820A (en) * 2020-03-06 2020-07-28 中国科学院信息工程研究所 Network space security domain named entity recognition method and device based on pre-training model BERT
CN111563143A (en) * 2020-07-20 2020-08-21 上海二三四五网络科技有限公司 Method and device for determining new words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination