CN111079445A - Training method and device based on semantic model and electronic equipment - Google Patents


Info

Publication number
CN111079445A
CN111079445A (application CN201911385958.6A)
Authority
CN
China
Prior art keywords
model
text
training
recognition
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911385958.6A
Other languages
Chinese (zh)
Inventor
陈喜旺
黄柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Sanbaiyun Information Technology Co Ltd
Original Assignee
Nanjing Sanbaiyun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Sanbaiyun Information Technology Co Ltd filed Critical Nanjing Sanbaiyun Information Technology Co Ltd
Priority to CN201911385958.6A priority Critical patent/CN111079445A/en
Publication of CN111079445A publication Critical patent/CN111079445A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training method and apparatus based on a semantic model, and an electronic device; it relates to the technical field of model training and addresses the low accuracy of the recognition results produced by current semantic recognition models. The method comprises the following steps: training the semantic annotation model based on the labeled training sample set to obtain a trained semantic annotation model; then repeatedly executing the following steps based on the semantic annotation model until the recognition result of the recognition model meets a preset condition, and outputting the trained recognition model: identifying each text in the unlabeled text set based on the semantic annotation model to obtain a preliminary label for each text; judging the preliminary label of each text based on the keyword set and the decision logic to obtain a final label for each text; training the trained semantic annotation model based on the final label of each text, and optimizing the intermediate decision logic and the intermediate keyword set; and determining whether the recognition result of the recognition model meets the preset condition.

Description

Training method and device based on semantic model and electronic equipment
Technical Field
The present application relates to the field of model training technologies, and in particular, to a training method and apparatus based on a semantic model, and an electronic device.
Background
Currently, there are many types of semantic recognition models, such as Natural Language Processing (NLP) models, Bidirectional Encoder Representations from Transformers (BERT), and the like.
However, no matter which model is used for semantic recognition, various ambiguities easily arise and lead to misrecognition. For example, forms of address such as "Zhang Ge" ("Brother Zhang") may denote either an actual relative of a given person (a brother, sister, or father) or merely a polite way of addressing an unrelated acquaintance. As a result, the final recognition results contain a large number of misjudgments, and the accuracy of the recognition results of current semantic recognition models is low.
Disclosure of Invention
The invention aims to provide a training method, a training device and electronic equipment based on a semantic model, and aims to solve the technical problem that the recognition result accuracy of the current semantic recognition model is low.
In a first aspect, an embodiment of the present application provides a training method based on a semantic model, in which a labeled training sample set, an unlabeled text set, and an identification model are predetermined, the identification model includes a semantic labeling model, a decision logic, and a keyword set, and keywords in the keyword set are ambiguous words determined based on the labeled training sample set; the method comprises the following steps:
training the semantic annotation model based on the labeled training sample set to obtain a trained semantic annotation model; repeatedly executing the following steps based on the semantic annotation model until the recognition result of the recognition model meets the preset condition, and outputting the recognition model after training:
identifying each text in the unlabeled text set based on the semantic labeling model to obtain a preliminary label of each text;
judging the preliminary label of each text based on the keyword set and the judgment logic to obtain a final label of each text;
training the trained semantic annotation model based on the final label of each text, and optimizing intermediate decision logic and an intermediate keyword set;
and determining whether the recognition result of the recognition model meets a preset condition.
In one possible implementation, the storage mode of the keyword set is a distributed storage mode; and/or the storage mode of the labeled training sample set and the unlabeled text set is a distributed storage mode.
In one possible implementation, the keywords in the keyword set correspond to tags; the tags are grouped into groups by tag category, and each tag within a group is assigned an index within that group.
In one possible implementation, the keywords in the keyword set are identity keywords of the target object;
the samples in the labeled training sample set and the unlabeled text set are social data samples of the target object.
In one possible implementation, the annotation content of the social data sample of the target object includes any one or more of the following:
time, place, identification of the target object, work industry field, social relationships, and kinship relations.
In one possible implementation, the labeled training sample set includes: training samples and testing samples;
the step of determining whether the recognition result of the recognition model meets a preset condition includes:
obtaining a test result by running the test samples through the recognition model;
and determining whether the recognition result of the recognition model meets a preset condition or not according to the test result.
In one possible implementation, the method further comprises:
packaging the keyword set, the judgment logic and the identification model to obtain a packaging result;
the form of the encapsulation result is a Python module or an API interface.
In a second aspect, a training device based on a semantic model is provided, wherein a labeled training sample set, an unlabeled text set and an identification model are predetermined, the identification model comprises a semantic labeling model, a judgment logic and a keyword set, and keywords in the keyword set are ambiguous words determined based on the labeled training sample set; the device comprises:
the first training module is used for training the semantic annotation model based on the labeled training sample set to obtain a trained semantic annotation model;
the identification module is used for identifying each text in the unlabeled text set based on the semantic labeling model to obtain a preliminary label of each text;
the judging module is used for judging the preliminary label of each text based on the keyword set and the judging logic to obtain a final label of each text;
the second training module is used for training the trained semantic annotation model based on the final label of each text and optimizing an intermediate judgment logic and an intermediate keyword set;
and the determining module is used for determining whether the recognition result of the recognition model meets a preset condition.
The recognition module, the judgment module, the second training module and the determination module operate repeatedly until the recognition result of the recognition model meets the preset condition, and the recognition model after training is output.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, this embodiment of the present application further provides a computer-readable storage medium storing machine executable instructions, which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
The embodiment of the application brings the following beneficial effects:
According to the training method and apparatus based on a semantic model and the electronic device, the semantic annotation model can first be trained on the labeled training sample set to obtain a trained semantic annotation model. The following iterative steps are then executed repeatedly based on the semantic annotation model until the recognition result of the recognition model meets the preset condition, at which point the final trained recognition model is output: first, each text in the unlabeled text set is identified based on the semantic annotation model to obtain a preliminary label for each text; next, the preliminary label of each text is judged based on the keyword set and the decision logic to obtain a final label for each text; then, the trained semantic annotation model is trained based on the final label of each text, and the intermediate decision logic and the intermediate keyword set are optimized; finally, it is determined whether the recognition result of the recognition model meets the preset condition.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a semantic model-based training method provided in an embodiment of the present application;
FIG. 2 is a schematic view of another flowchart of a training method based on semantic models provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a training apparatus based on a semantic model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram illustrating an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as referred to in the embodiments of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
At present, structured information about activities such as company events can be extracted from unstructured texts such as articles. Extracting the key structured content (person names, place names, organization names, times, and numerical expressions) requires identifying entity names and their categories from the text, that is, named entity recognition and classification.
Statistical methods based on large-scale corpora have become the mainstream of natural language processing. Statistical named entity recognition methods can be summarized as follows: supervised learning methods, such as Markov models, maximum entropy models, and conditional random fields; semi-supervised learning methods, which use a small amount of labeled data with bootstrap learning; unsupervised learning methods, which use lexical resources and context clustering; and hybrid methods, which combine several models. However, for data on the order of billions of records, both storage and reading are very memory-intensive and inefficient offline.
Moreover, no matter which model is used, various ambiguities and misrecognitions easily arise; entity labeling with existing Natural Language Processing (NLP) models faces a number of ambiguity problems that are difficult to resolve. For example, forms of address such as "Zhang Ge" ("Brother Zhang") and "Wang Jie" ("Sister Wang") may be recognized as an actual sibling of a relative when they are merely polite address terms, and some car brand names are recognized as person names. This leaves a large number of misjudgments in the final labeling results, so the accuracy of the recognition results of current semantic recognition models is low.
Based on this, the embodiment of the application provides a training method and device based on a semantic model and an electronic device, and the technical problem that the recognition result accuracy of the current semantic recognition model is low can be solved through the method.
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method based on a semantic model according to an embodiment of the present disclosure. The method comprises the steps of determining a labeled training sample set, an unlabeled text set and an identification model in advance, wherein the identification model comprises a semantic labeling model, judgment logic and a keyword set, and keywords in the keyword set are ambiguous words determined based on the labeled training sample set. As shown in fig. 1, the method includes:
step S110, training the semantic annotation model based on the annotated training sample set to obtain the trained semantic annotation model.
The recognition model may be implemented based on a Natural Language Processing (NLP) model, or based on a Convolutional Neural Network (CNN) model, a Transformer, or Bidirectional Encoder Representations from Transformers (BERT).
It should be noted that the keywords are ambiguous words determined based on the labeled training sample set, and the keyword set can be directly stored and read in the form of a system file.
Illustratively, as shown in fig. 2, for the labeled portion of the hundred-million-scale social data (e.g., 110 in fig. 2), a large amount of social data may be summarized manually to obtain a set of keywords that cause ambiguous identity judgments, together with some keywords that allow direct identity judgments. The semantic annotation model can then be trained on the labeled data in the hundred-million-scale social data (e.g., 110 in fig. 2).
Then, repeatedly executing the following steps S120 to S150 based on the semantic annotation model until the recognition result of the recognition model satisfies the preset condition, and outputting the recognition model after training:
and step S120, identifying each text in the unmarked text set based on the semantic marking model to obtain a preliminary label of each text.
For example, as shown in fig. 2, preliminary entity-labeling recognition may be performed on the remaining unlabeled data in the hundred-million-scale social data, yielding the model's recognition result for each text, that is, a preliminary label for the user data.
Step S130, based on the keyword set and the judgment logic, the preliminary label of each text is judged to obtain the final label of each text.
Illustratively, as shown in fig. 2, a logical decision is made on the preliminary social-identity label (e.g., 160 in fig. 2) in combination with the identity keywords (e.g., 130 in fig. 2), and wrong results are removed, yielding an integrated final label result; that is, the final label is obtained by resolving ambiguous labels (e.g., 170 in fig. 2) through the integration logic.
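As a minimal sketch (the patent does not specify a concrete rule format, so the keywords, labels, and rule below are hypothetical), the decision step might veto a preliminary label when the text contains a known ambiguous keyword without corroborating context:

```python
# Hypothetical sketch of the decision-logic step (S130): ambiguous identity
# keywords flag texts whose preliminary label needs corroboration before it
# is accepted as the final label. All terms and labels are illustrative.

AMBIGUOUS_KEYWORDS = {"Zhang Ge", "Wang Jie"}  # assumed ambiguous address terms

def resolve_label(text: str, preliminary_label: str) -> str:
    """Return a final label, downgrading uncorroborated kinship labels."""
    contains_ambiguous = any(kw in text for kw in AMBIGUOUS_KEYWORDS)
    if preliminary_label == "relative" and contains_ambiguous:
        # Keep "relative" only if the text also carries a kinship cue.
        if not any(cue in text for cue in ("family", "brother", "sister")):
            return "acquaintance"  # likely a polite form of address, not kin
    return preliminary_label

print(resolve_label("Met Zhang Ge at the office", "relative"))   # acquaintance
print(resolve_label("My brother Zhang Ge visited", "relative"))  # relative
```

In this sketch the keyword set only gates the correction; the actual decision logic in the patent may combine many such rules.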
Step S140, training the trained semantic annotation model based on the final label of each text, and optimizing the intermediate decision logic and the intermediate keyword set.
In this step, the trained semantic annotation model is trained based on the final label of each text, and the intermediate decision logic and the intermediate keyword set are optimized based on the final label of each text.
And step S150, determining whether the recognition result of the recognition model meets a preset condition. If yes, go to step S160. If not, step S120 is executed again.
In practical application, as shown in fig. 2, each text obtains a new label. The newly labeled data is fed back, on one hand, to repair the identity keyword set and the decision logic (the word-set repair feedback mechanism, e.g., 190 in fig. 2), and, on the other hand, to retrain the entity labeling model of the recognition model (the model repair feedback mechanism, e.g., 200 in fig. 2), forming an iterative model training mechanism. The whole process then enters the next iteration.
And step S160, outputting the trained recognition model.
Illustratively, a certain amount of manually labeled high-quality data can be taken as a test set, and the iteration terminates when the test result meets the requirement. The final result obtained at iteration termination is the user's final social identity tag; the tag is associated with the user index for storage and query.
By combining the recognition model with a keyword set containing ambiguous words, and executing ambiguity decision logic over those ambiguous words, the accuracy of the recognition model's results can be improved, solving the technical problem that the recognition results of current semantic recognition models have low accuracy.
In the embodiment of the application, social identity discovery based on the joint judgment of the recognition model and the keyword set, with a keyword set collected and integrated manually from a large amount of social data and a logic-based ambiguity resolution scheme, can specifically address the problem of erroneous results from NLP entity-labeling techniques in the field of social identity recognition. Moreover, the self-feedback mechanism of the whole recognition model improves both the entity labeling model and the repair of the keyword set. Identity calibration can thus be performed conveniently, quickly, and correctly on hundred-million-scale social data offline at the back end.
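The iteration of steps S110 to S160 can be sketched as a loop. This is a hedged sketch only: every interface name below is a placeholder, since the patent does not define concrete model or decision-logic interfaces, and the stand-in model is a trivial stub used purely to make the loop runnable:

```python
# Hedged sketch of the iterative loop in steps S110-S160. All names are
# placeholders; the patent specifies the steps, not these interfaces.

def train_until_converged(model, labeled_set, unlabeled_texts,
                          decide, meets_condition, max_iters=10):
    model.fit(labeled_set)                                        # S110
    for _ in range(max_iters):
        prelim = [(t, model.predict(t)) for t in unlabeled_texts]  # S120
        final = [(t, decide(t, p)) for t, p in prelim]             # S130
        model.fit(final)                                           # S140
        if meets_condition(model):                                 # S150
            break                                                  # S160
    return model

# Trivial stand-in model: predicts the majority label it was trained on.
class MajorityModel:
    def fit(self, pairs):
        labels = [lbl for _, lbl in pairs]
        self.majority = max(set(labels), key=labels.count)
    def predict(self, text):
        return self.majority

m = train_until_converged(
    MajorityModel(),
    labeled_set=[("a", "x"), ("b", "x"), ("c", "y")],
    unlabeled_texts=["d", "e"],
    decide=lambda t, p: p,               # identity decision logic (stub)
    meets_condition=lambda model: True,  # stop after one pass (stub)
)
print(m.predict("f"))  # x
```

In the actual method, `decide` would apply the keyword set and decision logic, and step S140 would also repair the intermediate keyword set and decision logic from the final labels.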
The above steps are described in detail below.
In some embodiments, the storage mode of the keyword set is a distributed storage mode; and/or the storage mode of the marked training sample set and the unmarked text set is a distributed storage mode.
A distributed data storage structure addresses the inefficiency caused by the sheer order of magnitude of back-end text processing. Specifically, because the social data comprises more than a hundred million text records, fast offline reading at the back end is a major difficulty; the embodiment of the application therefore adopts a distributed data storage scheme, enabling convenient hot updates and real-time storage of the label results.
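The patent specifies distributed storage without a concrete scheme. One simple, purely illustrative way to distribute a large keyword set is hash-based sharding, so that each lookup touches a single shard; the shard count and in-memory dictionaries below stand in for real storage nodes:

```python
# Illustrative sketch only: hash-sharding a keyword set so the same keyword
# always maps to the same shard. Real distributed storage would replace the
# in-memory dicts with storage nodes; this is not the patent's actual scheme.
import hashlib
from typing import Optional

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_of(key: str) -> int:
    # Stable hash (unlike built-in hash(), md5 is consistent across runs).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return digest[0] % NUM_SHARDS

def put(keyword: str, tag: str) -> None:
    shards[shard_of(keyword)][keyword] = tag

def get(keyword: str) -> Optional[str]:
    return shards[shard_of(keyword)].get(keyword)

put("Zhang Ge", "ambiguous-kinship")
print(get("Zhang Ge"))  # ambiguous-kinship
```

The same idea applies to the labeled and unlabeled text sets, where sharding also lets label results be hot-updated per shard.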
In some embodiments, the keywords in the keyword set correspond to tags; the plurality of tags are grouped into groups of a plurality of different tag categories, the tags within each group having an index with the corresponding group.
As shown in fig. 2, for the keyword set obtained from the identity keywords (e.g., 130 in fig. 2), the tags are grouped by tag category, and each tag is given a unique index within its group, which facilitates addition and lookup.
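The tag grouping described above can be sketched as a nested mapping from category to tag to in-group index (the categories and tags below are hypothetical examples, not taken from the patent):

```python
# Minimal sketch: group tags by category and assign each tag a unique index
# within its group. Categories and tags are illustrative placeholders.
from collections import defaultdict

tags = [("brother", "kinship"), ("sister", "kinship"),
        ("engineer", "occupation"), ("teacher", "occupation")]

groups = defaultdict(dict)  # category -> {tag: index within group}
for tag, category in tags:
    groups[category][tag] = len(groups[category])  # next free index

print(groups["kinship"]["sister"])      # 1
print(groups["occupation"]["teacher"])  # 1
```

Adding a tag is an O(1) dictionary insert, and lookup by (category, index) or by tag name is straightforward, which matches the stated goal of easy addition and search.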
Of course, users may also be grouped. For example, as shown in fig. 2, for the remaining unlabeled data in the hundred-million-scale social data (e.g., 110 in fig. 2), the total amount of each user's text data is counted, the users are grouped in batches of a fixed size, each group is assigned a sequence number, and each user is then given an index number within its group based on the group length.
In some embodiments, the keywords in the keyword set are identity keywords of the target object; the samples in the labeled training sample set and the unlabeled text set are social data samples of the target object.
As shown in fig. 2, if the keyword in the keyword set is an identity keyword of the target object (e.g. 130 in fig. 2), the sample is a social data sample of the target object, so as to combine the identity information of the target object with the social data of the target object, thereby improving the accuracy of the data.
In some embodiments, the annotation content of the social data sample for the target object includes any one or more of:
time, place, identification of the target object, work industry field, social relationships, and kinship relations.
As shown in fig. 2, a large number of social data texts are acquired; in the main form, each user corresponds to a piece of descriptive remark text, a portion of which is extracted for labeling. The labeled content may include: time, place, person name (identification of the target object), work industry field, social relationships, kinship relations, and the like. The labeled content is thereby richer and more comprehensive.
In some embodiments, the annotated set of training samples comprises: training samples and testing samples; the step S150 may include the following steps:
obtaining a test result by running the test samples through the recognition model;
and determining whether the recognition result of the recognition model meets a preset condition or not according to the test result.
In practical application, the labeled training samples can be divided into a training set and a test set, wherein the test set can be used for determining whether the recognition result of the recognition model meets a preset condition.
For example, a certain amount of manually labeled high quality data may be taken as a test set, and the iteration terminates when the test result satisfies the requirement. The final result from the iteration termination is the user's final social identity tag.
Using a test set split from the labeled training samples makes it possible to check more reliably whether the recognition result of the recognition model meets the preset condition.
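The preset condition itself is not fixed by the patent; one plausible reading, sketched here with illustrative values, is an accuracy threshold on the held-out manually labeled test set:

```python
# Hedged sketch: treat the "preset condition" as an accuracy threshold on a
# held-out test set. The threshold and the stand-in model are illustrative.

def meets_preset_condition(model, test_samples, threshold=0.95):
    correct = sum(model.predict(text) == label for text, label in test_samples)
    return correct / len(test_samples) >= threshold

class ConstantModel:          # trivial stub that always predicts "x"
    def predict(self, text):
        return "x"

tests = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "x")]  # accuracy = 0.75
print(meets_preset_condition(ConstantModel(), tests, threshold=0.7))  # True
print(meets_preset_condition(ConstantModel(), tests, threshold=0.8))  # False
```

A precision/recall target per identity tag would work the same way; only the scoring inside `meets_preset_condition` changes.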
In some embodiments, the keyword set, the decision logic and the recognition model are encapsulated to obtain an encapsulation result; the encapsulation result is in the form of a Python module or API interface.
In practical application, the keyword set, the decision logic, and the recognition model can be uniformly packaged (e.g., 220 in fig. 2) into a Python module, facilitating reuse, loading, and real-time updating. The whole process is packaged as a Python module or an Application Programming Interface (API) to facilitate data calls.
In the embodiment of the application, identity calibration can be performed conveniently, quickly, and correctly on hundred-million-scale social data offline at the back end, with uniform packaging that facilitates calling, reuse, loading, and real-time updating.
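One minimal shape such a packaged module might take is a single class holding the three components behind one entry point (the class and method names are invented for illustration; the patent does not define this interface):

```python
# Sketch of packaging the model, keyword set, and decision logic into one
# reusable object. All names here are illustrative placeholders.

class RecognitionPackage:
    def __init__(self, model, keywords, decide):
        self.model = model        # semantic annotation model
        self.keywords = keywords  # ambiguous identity keywords
        self.decide = decide      # decision logic

    def recognize(self, text: str) -> str:
        prelim = self.model.predict(text)               # preliminary label
        return self.decide(text, prelim, self.keywords)  # final label

class StubModel:                  # trivial stand-in annotation model
    def predict(self, text):
        return "relative"

pkg = RecognitionPackage(
    StubModel(),
    keywords={"Zhang Ge"},
    decide=lambda t, p, kws: "acquaintance" if any(k in t for k in kws) else p,
)
print(pkg.recognize("Zhang Ge said hi"))   # acquaintance
print(pkg.recognize("my cousin said hi"))  # relative
```

Wrapping `recognize` in a web handler would give the API form; importing the class directly gives the Python-module form, and swapping in updated keywords or a retrained model supports the hot updates mentioned above.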
Fig. 3 provides a schematic structural diagram of a training apparatus based on a semantic model. The method comprises the steps of determining a labeled training sample set, an unlabeled text set and an identification model in advance, wherein the identification model comprises a semantic labeling model, judgment logic and a keyword set, and keywords in the keyword set are ambiguous words determined based on the labeled training sample set. As shown in fig. 3, the training apparatus 300 based on semantic model includes:
the first training module 301 is configured to train a semantic annotation model based on a labeled training sample set to obtain a trained semantic annotation model;
the identification module 302 is configured to identify each text in the unlabeled text set based on the semantic labeling model to obtain a preliminary label of each text;
the judging module 303 is configured to judge the preliminary tag of each text based on the keyword set and the judgment logic, and obtain a final tag of each text;
a second training module 304, configured to train the trained semantic annotation model based on the final label of each text, and optimize the intermediate decision logic and the intermediate keyword set;
a determining module 305, configured to determine whether a recognition result of the recognition model satisfies a preset condition.
The recognition module 302, the judgment module 303, the second training module 304 and the determination module 305 are repeatedly operated until the recognition result of the recognition model meets the preset condition, and the recognition model after training is output.
In some embodiments, the storage mode of the keyword set is a distributed storage mode;
and/or the storage mode of the marked training sample set and the unmarked text set is a distributed storage mode.
In some embodiments, the keywords in the keyword set correspond to tags;
the plurality of tags are grouped into groups of a plurality of different tag categories, the tags within each group having an index with the corresponding group.
In some embodiments, the keywords in the keyword set are identity keywords of the target object;
the samples in the labeled training sample set and the unlabeled text set are social data samples of the target object.
In some embodiments, the annotation content of the social data sample for the target object includes any one or more of:
time, place, and identification of target objects, work industry domains, social relationships, and affiliations.
In some embodiments, the annotated set of training samples comprises: training samples and testing samples;
the determining module 305 is specifically configured to:
obtaining a test result by running the test samples through the recognition model;
and determining whether the recognition result of the recognition model meets a preset condition or not according to the test result.
In some embodiments, the apparatus further comprises:
the packaging module is used for packaging the keyword set, the judgment logic and the identification model to obtain a packaging result;
the encapsulation result is in the form of a Python module or API interface.
The training device based on the semantic model provided by the embodiment of the application has the same technical characteristics as the training method based on the semantic model provided by the embodiment, so that the same technical problems can be solved, and the same technical effect can be achieved.
As shown in fig. 4, the electronic device 4 provided in the embodiment of the present application includes a memory 401 and a processor 402, where the memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the steps of the method provided in the foregoing embodiment.
Referring to fig. 4, the electronic device further includes: a bus 403 and a communication interface 404, the processor 402, the communication interface 404 and the memory 401 being connected by the bus 403; the processor 402 is used to execute executable modules, such as computer programs, stored in the memory 401.
The memory 401 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one magnetic disk storage. The communication connection between the network elements of the system and at least one other network element is implemented through at least one communication interface 404 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
Bus 403 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory 401 is used for storing a program, and the processor 402 executes the program after receiving an execution instruction, and the method performed by the apparatus defined by the process disclosed in any of the foregoing embodiments of the present application may be applied to the processor 402, or implemented by the processor 402.
The processor 402 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 402. The Processor 402 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 401, and the processor 402 reads the information in the memory 401 and completes the steps of the method in combination with the hardware.
Corresponding to the above semantic-model-based training method, the present application further provides a computer-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the steps of the above semantic-model-based training method.
The semantic-model-based training apparatus provided by the embodiments of the present application may be specific hardware on a device, or software or firmware installed on a device. The apparatus provided by the embodiments of the present application has the same implementation principle and technical effect as the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content in the foregoing method embodiments. Those skilled in the art will appreciate that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code comprising one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the portion thereof that contributes to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the semantic-model-based training method according to the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk.
It should be noted that like reference numerals and letters denote like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Moreover, the terms "first", "second", "third", and so on are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features, within the technical scope disclosed in the present application; such modifications, changes, or substitutions do not depart from the scope of the embodiments of the present application and are intended to be covered by its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A training method based on a semantic model, characterized in that a labeled training sample set, an unlabeled text set, and a recognition model are predetermined, the recognition model comprises a semantic annotation model, judgment logic, and a keyword set, and the keywords in the keyword set are ambiguous words determined based on the labeled training sample set; the method comprises the following steps:
training the semantic annotation model based on the labeled training sample set to obtain a trained semantic annotation model; and repeatedly executing the following steps based on the semantic annotation model until the recognition result of the recognition model meets a preset condition, and outputting the trained recognition model:
identifying each text in the unlabeled text set based on the semantic labeling model to obtain a preliminary label of each text;
judging the preliminary label of each text based on the keyword set and the judgment logic to obtain a final label of each text;
training the trained semantic annotation model based on the final label of each text, and optimizing the intermediate judgment logic and the intermediate keyword set; and
and determining whether the recognition result of the recognition model meets a preset condition.
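The iterative procedure recited in claim 1 — initial supervised training, preliminary labeling of the unlabeled texts, keyword-based judgment to obtain final labels, and retraining until a preset condition is met — can be sketched as a toy self-training loop. This is purely illustrative: `TinyAnnotator`, `judge`, and the keyword set below are hypothetical stand-ins, not the claimed implementation.

```python
# Toy sketch of the iterative loop in claim 1. All names are hypothetical.
AMBIGUOUS = {"apple"}  # keyword set: ambiguous words found in the labeled set

def judge(text, prelim_label, keywords):
    """Judgment logic: override the preliminary label when an ambiguous
    keyword co-occurs with a disambiguating context word."""
    if any(k in text for k in keywords) and "fruit" in text:
        return "food"
    return prelim_label

class TinyAnnotator:
    """Stand-in semantic annotation model: majority label per token."""
    def __init__(self):
        self.votes = {}
    def fit(self, pairs):
        for text, label in pairs:
            for tok in text.split():
                self.votes.setdefault(tok, {}).setdefault(label, 0)
                self.votes[tok][label] += 1
    def predict(self, text):
        tally = {}
        for tok in text.split():
            for lab, n in self.votes.get(tok, {}).items():
                tally[lab] = tally.get(lab, 0) + n
        return max(tally, key=tally.get) if tally else "unknown"

def train(labeled, unlabeled, keywords, rounds=3):
    model = TinyAnnotator()
    model.fit(labeled)                                           # initial supervised training
    for _ in range(rounds):                                      # iterate until preset condition
        prelim = [(t, model.predict(t)) for t in unlabeled]      # preliminary labels
        final = [(t, judge(t, p, keywords)) for t, p in prelim]  # final labels via judgment logic
        model.fit(final)                                         # retrain on final labels
    return model

labeled = [("apple launched a phone", "company"),
           ("banana is a fruit", "food")]
unlabeled = ["apple is a sweet fruit"]
m = train(labeled, unlabeled, AMBIGUOUS)
print(m.predict("apple is a sweet fruit"))  # food
```

Here the judgment logic resolves the ambiguous keyword "apple" whenever the context word "fruit" co-occurs, and the corrected labels feed back into retraining, mirroring the feedback structure of the claim.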
2. The method according to claim 1, wherein the keyword set is stored in a distributed manner; and/or the storage mode of the labeled training sample set and the unlabeled text set is a distributed storage mode.
3. The method of claim 1, wherein the keywords in the keyword set correspond to tags; the plurality of tags are divided into groups according to a plurality of different tag categories, and the tags within each group are indexed to the corresponding group.
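One hypothetical way to realize the tag grouping and per-group index of claim 3 in plain Python; all tag and group names below are invented for illustration:

```python
# Keywords map to tags, tags are partitioned into groups by tag category,
# and every tag keeps an index back to its group so the judgment logic can
# look up sibling tags quickly. All names here are illustrative.
keyword_tags = {"apple": "company_name", "jaguar": "animal_name"}

tag_groups = {
    "organization": ["company_name", "brand_name"],
    "nature":       ["animal_name", "plant_name"],
}

# index: tag -> (group, position within the group)
tag_index = {tag: (group, i)
             for group, tags in tag_groups.items()
             for i, tag in enumerate(tags)}

print(tag_index["company_name"])  # ('organization', 0)
```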
4. The method of claim 1, wherein the keywords in the keyword set are identity keywords of the target object;
the samples in the labeled training sample set and the unlabeled text set are social data samples of the target object.
5. The method of claim 4, wherein the annotation content of the social data sample of the target object comprises any one or more of:
time, place, and identification of the target object, work industry domain, social relationship, and affiliation.
6. The method of claim 1, wherein the annotated set of training samples comprises: training samples and testing samples;
the step of determining whether the recognition result of the recognition model meets a preset condition includes:
inputting the test sample into the recognition model to obtain a test result; and
determining, according to the test result, whether the recognition result of the recognition model meets the preset condition.
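The test of claim 6 can be sketched as a simple threshold check; the accuracy criterion and the `threshold` value are assumptions standing in for whatever preset condition an implementation actually uses:

```python
# Hypothetical stopping test: run the test samples through the recognition
# model and compare accuracy against a preset threshold.
def meets_preset_condition(recognize, test_samples, threshold=0.9):
    """recognize: callable text -> label; test_samples: (text, gold) pairs."""
    correct = sum(recognize(t) == gold for t, gold in test_samples)
    return correct / len(test_samples) >= threshold

samples = [("a", "x"), ("b", "y"), ("c", "x")]
# A constant recognizer gets 2 of 3 right (~0.67 accuracy).
print(meets_preset_condition(lambda t: "x", samples, threshold=0.6))  # True
```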
7. The method of claim 1, further comprising:
packaging the keyword set, the judgment logic and the identification model to obtain a packaging result;
the encapsulation result takes the form of a Python module or an API.
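The encapsulation of claim 7 might look like the following minimal Python facade; the class and its members are hypothetical, and a real deployment could expose `recognize` over an HTTP API instead of as a module:

```python
# Hypothetical packaging of model + keyword set + judgment logic behind a
# single entry point, as claim 7 describes (a Python module or an API).
class RecognizerPackage:
    def __init__(self, model, keywords, judge):
        self.model, self.keywords, self.judge = model, keywords, judge

    def recognize(self, text):
        prelim = self.model(text)                       # preliminary label
        return self.judge(text, prelim, self.keywords)  # final label

pkg = RecognizerPackage(
    model=lambda t: "person",
    keywords={"apple"},
    judge=lambda t, p, kw: "company" if any(k in t for k in kw) else p,
)
print(pkg.recognize("apple posted earnings"))  # company
```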
8. A training apparatus based on a semantic model, characterized in that a labeled training sample set, an unlabeled text set, and a recognition model are predetermined, the recognition model comprises a semantic annotation model, judgment logic, and a keyword set, and the keywords in the keyword set are ambiguous words determined based on the labeled training sample set; the apparatus comprises:
the first training module is used for training the semantic annotation model based on the labeled training sample set to obtain a trained semantic annotation model;
the identification module is used for identifying each text in the unlabeled text set based on the semantic labeling model to obtain a preliminary label of each text;
the judging module is used for judging the preliminary label of each text based on the keyword set and the judging logic to obtain a final label of each text;
the second training module is used for training the trained semantic annotation model based on the final label of each text and optimizing an intermediate judgment logic and an intermediate keyword set;
the determining module is used for determining whether the recognition result of the recognition model meets a preset condition or not;
the recognition module, the judgment module, the second training module, and the determination module operate repeatedly until the recognition result of the recognition model meets the preset condition, whereupon the trained recognition model is output.
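The module decomposition of claim 8 maps naturally onto five callables wired together by a device object. The sketch below is hypothetical and uses trivial stand-ins for each module:

```python
# Hypothetical wiring of the five modules recited in claim 8.
class Device:
    def __init__(self, first_train, recognize, judge, second_train, check):
        self.first_train, self.recognize = first_train, recognize
        self.judge, self.second_train, self.check = judge, second_train, check

    def run(self, labeled, unlabeled, max_rounds=5):
        state = self.first_train(labeled)              # first training module
        for _ in range(max_rounds):
            prelim = self.recognize(state, unlabeled)  # recognition module
            final = self.judge(prelim)                 # judgment module
            state = self.second_train(state, final)    # second training module
            if self.check(state):                      # determination module
                break
        return state

dev = Device(
    first_train=lambda labeled: {"rounds": 0},
    recognize=lambda s, texts: [(t, "prelim") for t in texts],
    judge=lambda pairs: [(t, "final") for t, _ in pairs],
    second_train=lambda s, final: {"rounds": s["rounds"] + 1},
    check=lambda s: s["rounds"] >= 3,
)
print(dev.run([], ["t1"])["rounds"])  # 3
```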
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the steps of the method of any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 7.
CN201911385958.6A 2019-12-27 2019-12-27 Training method and device based on semantic model and electronic equipment Pending CN111079445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911385958.6A CN111079445A (en) 2019-12-27 2019-12-27 Training method and device based on semantic model and electronic equipment


Publications (1)

Publication Number Publication Date
CN111079445A true CN111079445A (en) 2020-04-28

Family

ID=70319172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911385958.6A Pending CN111079445A (en) 2019-12-27 2019-12-27 Training method and device based on semantic model and electronic equipment

Country Status (1)

Country Link
CN (1) CN111079445A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069293A (en) * 2020-09-14 2020-12-11 上海明略人工智能(集团)有限公司 Data annotation method and device, electronic equipment and computer readable medium
CN112149179A (en) * 2020-09-18 2020-12-29 支付宝(杭州)信息技术有限公司 Risk identification method and device based on privacy protection
CN112307337A (en) * 2020-10-30 2021-02-02 中国平安人寿保险股份有限公司 Association recommendation method and device based on label knowledge graph and computer equipment
CN112487814A (en) * 2020-11-27 2021-03-12 北京百度网讯科技有限公司 Entity classification model training method, entity classification device and electronic equipment
CN113220836A (en) * 2021-05-08 2021-08-06 北京百度网讯科技有限公司 Training method and device of sequence labeling model, electronic equipment and storage medium
CN113327591A (en) * 2021-06-16 2021-08-31 北京有竹居网络技术有限公司 Voice processing method and device
CN114372446A (en) * 2021-12-13 2022-04-19 北京五八信息技术有限公司 Vehicle attribute labeling method, device and storage medium
CN114492419A (en) * 2022-04-01 2022-05-13 杭州费尔斯通科技有限公司 Text labeling method, system and device based on newly added key words in labeling
CN114637848A (en) * 2022-03-15 2022-06-17 美的集团(上海)有限公司 Semantic classification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469188A (en) * 2016-08-30 2017-03-01 北京奇艺世纪科技有限公司 A kind of entity disambiguation method and device
CN108009589A (en) * 2017-12-12 2018-05-08 腾讯科技(深圳)有限公司 Sample data processing method, device and computer-readable recording medium
CN108875059A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 For generating method, apparatus, electronic equipment and the storage medium of document label




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428