CN111582497A - Training file generation and evaluation method, device, computer system and storage medium - Google Patents


Info

Publication number
CN111582497A
Authority
CN
China
Prior art keywords
file
training
entity
hit
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010344715.4A
Other languages
Chinese (zh)
Inventor
王巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Medical Health Technology Service Co Ltd
Original Assignee
Ping An Medical and Healthcare Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Medical and Healthcare Management Co Ltd
Priority to CN202010344715.4A
Publication of CN111582497A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a training file generation and evaluation method, a training file generation and evaluation device, a computer system and a storage medium, wherein the training file generation and evaluation method comprises the following steps: receiving an original file, acquiring field information and a training entity of the original file, and processing the original file according to the field information and the training entity to acquire a labeled file; recognizing the semantics of the labeled file through a preset natural language understanding model, and carrying out sequence labeling on the semantics to obtain a training file; and inputting the training file into an intelligent search model corresponding to the field information to obtain a training result, calculating the training result through a hit analysis algorithm to obtain a hit rate, summarizing the training file and the hit rate, and generating a hit analysis report. The method and the device have the advantages that the technical effect of automatically obtaining the training file is achieved, the generation quality and the generation speed of the training file are guaranteed, and the problem that the marking quality of the training sample cannot be guaranteed as the real hit rate of the training sample cannot be obtained at present is solved.

Description

Training file generation and evaluation method, device, computer system and storage medium
Technical Field
The invention relates to the technical field of machine learning, in particular to a training file generation and evaluation method, a training file generation and evaluation device, a computer system and a storage medium.
Background
A machine learning model is a general term for algorithms that make predictions or classifications by mining implicit patterns from a large amount of historical data; concretely, such a model receives sample data and applies its function to that data to output a prediction result or a classification result. In the field of intelligent search, an intelligent search model built on a machine learning model is usually trained with labeled sample files, yielding a mature model that accurately understands the sample data and returns accurate retrieval results for it.
Therefore, high quality sample files are crucial for training intelligent search models; however, the current generation method of the training file cannot obtain the real hit rate of the training sample, so that the labeling quality of the training sample cannot be guaranteed, and the situation that the intelligent search model cannot be trained quickly and accurately is caused.
Disclosure of Invention
The invention aims to provide a training file generation and evaluation method, a training file generation and evaluation device, a computer system and a storage medium, which are used for solving the problem that the marking quality of a training sample cannot be ensured due to the fact that the real hit rate of the training sample cannot be obtained in the prior art.
In order to achieve the above object, the present invention provides a training file generation and evaluation method, including:
the method comprises the steps that a label server receives an original file, obtains field information and a training entity of the original file, processes the original file according to the field information and the training entity to obtain a label file, and sends the label file to a recognition server; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
the recognition server recognizes the semantics of the labeled file through a preset natural language understanding model, carries out sequence labeling on the semantic of the labeled file to obtain a training file, and sends the training file to a hit server;
the hit server is provided with an intelligent search model and a hit analysis algorithm; the hit server inputs the training file into the intelligent search model corresponding to the field information to obtain a training result, calculates the training result through the hit analysis algorithm to obtain a hit rate, and summarizes the training file and the hit rate to generate a hit analysis report.
In the above scheme, the receiving an original file and acquiring the domain information and the training entity of the original file includes:
acquiring an original file, performing field identification on the original file to obtain field information, and performing entity identification on the original file to obtain an independent entity;
obtaining the code of the independent entity through a preset relation list, and associating the code with the independent entity;
judging whether two adjacent independent entities have an association relationship according to a preset relation rule; if they do, combining the two independent entities to form an associated entity and then checking whether the next two adjacent independent entities have an association relationship; if they do not, proceeding directly to check whether the next two adjacent independent entities have an association relationship;
and setting the independent entity and the associated entity as training entities.
In the above scheme, the processing the original file according to the domain information and the training entity to obtain the markup file includes:
marking the original file according to the training entity to obtain a marked processing file;
and loading the field information into the label processing file to obtain a label file.
In the above scheme, the recognizing the semantics of the labeled file and performing sequence labeling on the recognized semantics to obtain the training file includes:
performing semantic recognition on the labeled file to obtain a query intention;
filling slot values in the label file according to codes in the label file to realize sequence labeling of training entities in the label file;
and summarizing the query intentions and the labeled files with the sequence labels to form a training file.
In the above scheme, the entering the training file into the intelligent search model corresponding to the domain information to obtain the training result includes:
selecting a corresponding intelligent search model in a production environment according to the field information of the training file, and inputting the training file into the intelligent search model;
and the intelligent search model obtains a training result according to the query intention and the labeled file of the training file.
In the foregoing solution, the calculating the training result by the hit analysis algorithm to obtain the hit rate includes:
calculating, through the hit analysis algorithm, how often each training entity of the training file occurs in each related file of the training result, to obtain a term frequency describing the importance of the training entity to that related file;
calculating, through the hit analysis algorithm, the number of related files in the training result that contain each training entity of the training file, to obtain an inverse document frequency describing how rare the training entity is in the training result;
multiplying the term frequency by the inverse document frequency through the hit analysis algorithm to obtain an entity matching value describing the matching degree between each training entity and each related file;
adding the entity matching values of the related files to obtain a file matching value for describing the matching degree between the training file and the related files;
and adding the file matching values of the related files to obtain a hit rate for describing the matching degree between the training file and the training result.
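The calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the training result is a list of related files, each already split into tokens, and it uses a smoothed logarithmic inverse document frequency because the text gives no exact formula. The function name hit_rate and its parameters are hypothetical.

```python
import math
from collections import Counter


def hit_rate(training_entities, related_files):
    """Sum TF-IDF entity matching values over all related files.

    training_entities: list of entity strings from the training file.
    related_files: list of token lists, one per file in the training result.
    """
    n_files = len(related_files)
    if n_files == 0:
        return 0.0
    counts = [Counter(tokens) for tokens in related_files]
    # Number of related files that contain each entity (document frequency).
    doc_freq = {e: sum(1 for c in counts if c[e] > 0) for e in training_entities}

    total = 0.0
    for tokens, c in zip(related_files, counts):
        if not tokens:
            continue  # an empty file cannot match any entity
        file_match = 0.0
        for e in training_entities:
            tf = c[e] / len(tokens)  # term frequency within this related file
            # Smoothed IDF: an assumption, the source gives no exact formula.
            idf = math.log((1 + n_files) / (1 + doc_freq[e])) + 1.0
            file_match += tf * idf  # entity matching value
        total += file_match  # file matching value added to the hit rate
    return total
```

Under these assumptions the result is an unnormalized score: it grows with the number of related files that mention the training entities.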
In the foregoing solution, after the summarizing the training file and the hit rate to generate the hit analysis report, the method may further include:
comparing the hit rate with a preset hit threshold;
if the hit rate exceeds a preset hit threshold, judging that the training file is qualified, and sending the hit analysis report to a user side;
and if the hit rate does not exceed a preset hit threshold, judging that the training file is unqualified, and sending the hit analysis report to the user side.
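As a minimal sketch of this comparison, the function below returns a small report in both branches, since the hit analysis report is sent to the user side whether or not the training file qualifies. The function name, the dictionary keys and the reading of "exceeds" as strictly greater than are assumptions.

```python
def evaluate_training_file(hit_rate, hit_threshold):
    """Judge a training file against a preset hit threshold.

    Returns a dict standing in for the hit analysis report; it is produced
    for qualified and unqualified files alike.
    """
    qualified = hit_rate > hit_threshold  # "exceeds" read as strictly greater
    return {"hit_rate": hit_rate, "qualified": qualified}
```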
In order to achieve the above object, the present invention further provides a training file generating and evaluating apparatus, including:
the system comprises a label server, a recognition server and a database server, wherein the label server is used for receiving an original file, acquiring field information and a training entity of the original file, processing the original file according to the field information and the training entity to acquire a label file, and sending the label file to the recognition server; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
the recognition server is used for recognizing the semantics of the labeled file through a preset natural language understanding model, performing sequence labeling on the semantic of the labeled file to obtain a training file, and sending the training file to the hit server;
and the hit server is provided with an intelligent search model and a hit analysis algorithm and is used for inputting the training file into the intelligent search model corresponding to the field information to obtain a training result, calculating the training result through the hit analysis algorithm to obtain a hit rate, summarizing the training file and the hit rate and generating a hit analysis report.
In order to achieve the above object, the present invention further provides a computer system, which includes a plurality of computer devices, each computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processors of the plurality of computer devices jointly implement the steps of the training file generation and evaluation method when executing the computer program.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, which includes a plurality of storage media, each storage medium storing a computer program, and the computer programs stored in the storage media, when executed by a processor, collectively implement the steps of the above training file generation and evaluation method.
According to the training file generation and evaluation method, device, computer system and storage medium provided by the invention, the original file is obtained, the field information and the training entity of the original file are obtained, and the original file is processed according to the field information and the training entity to obtain the labeled file; recognizing the semantics of the labeled file and performing sequence labeling on the labeled file to obtain a training file; the technical effect of automatically obtaining the training file is achieved, the influence of human errors is eliminated, and the generation quality and the generation speed of the training file are guaranteed.
The training files are input into an intelligent search model corresponding to the field information to obtain training results, the training results are calculated through a hit analysis algorithm to obtain hit rates, the training files and the hit rates are summarized to generate hit analysis reports, and therefore the problem that the real hit rates of training samples cannot be obtained at present and the labeling quality of the training samples cannot be guaranteed is solved by sending the hit rates of the training results to a user side.
Drawings
FIG. 1 is a flowchart of a first embodiment of a training file generation and evaluation method according to the present invention;
FIG. 2 is a flowchart of acquiring the domain information and the training entity of the original file in the first embodiment of the training file generation and evaluation method according to the present invention;
FIG. 3 is a flowchart of processing the original file to obtain the labeled file in the first embodiment of the training file generation and evaluation method according to the present invention;
FIG. 4 is a flowchart of recognizing the semantics of the labeled file and performing sequence labeling to obtain the training file in the first embodiment of the training file generation and evaluation method according to the present invention;
FIG. 5 is a flowchart of obtaining the training result in the first embodiment of the training file generation and evaluation method according to the present invention;
FIG. 6 is a flowchart of obtaining the hit rate in the training file generation and evaluation method according to the present invention;
FIG. 7 is a flowchart of the steps performed after the hit analysis report is generated in an embodiment of the training file generation and evaluation method according to the present invention;
FIG. 8 is a schematic diagram of program modules of a second embodiment of a training file generation and evaluation apparatus according to the present invention;
FIG. 9 is a schematic diagram of the hardware structure of a computer device in the third embodiment of the computer system according to the present invention.
Reference numerals:
1. training file generation and evaluation device; 2. computer device; 11. label server;
12. recognition server; 13. hit server; 21. memory; 22. processor
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a training file generation and evaluation method, device, computer system and storage medium, which are suitable for the field of machine learning and are based on a label server, a recognition server and a hit server. In the method, the label server receives an original file, acquires the field information and training entity of the original file, and processes the original file according to the field information and the training entity to obtain a labeled file; the recognition server recognizes the semantics of the labeled file through a preset natural language understanding model and performs sequence labeling on it to obtain a training file; the hit server, which is provided with an intelligent search model and a hit analysis algorithm, inputs the training file into the intelligent search model corresponding to the field information to obtain a training result, calculates the training result through the hit analysis algorithm to obtain a hit rate, and summarizes the training file and the hit rate to generate a hit analysis report.
Example one
Referring to fig. 1, a training file generation and evaluation method of the present embodiment includes:
S1: the label server receives an original file, acquires the field information and training entity of the original file, processes the original file according to the field information and the training entity to obtain a labeled file, and sends the labeled file to the recognition server; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
S2: the recognition server recognizes the semantics of the labeled file through a preset natural language understanding model, performs sequence labeling on the labeled file to obtain a training file, and sends the training file to the hit server;
S3: the hit server is provided with an intelligent search model and a hit analysis algorithm; the hit server inputs the training file into the intelligent search model corresponding to the field information to obtain a training result, calculates the training result through the hit analysis algorithm to obtain a hit rate, and summarizes the training file and the hit rate to generate a hit analysis report.
In this application, the original file may be an article or a short sentence stored in a database, or a query term or query sentence sent by a user side. In this embodiment, the domain information may be fund auditing, intelligent supervision, or macro decision. The annotation file is the text information obtained by annotating the original file according to the training entity. Semantic recognition is performed on the annotation file through the natural language understanding model to obtain the query intention of the annotation file, and sequence labeling is performed on the annotation file through the same model, in this embodiment by slot value filling. The query intention is then loaded into the sequence-labeled annotation file to obtain the training file.
The hit analysis algorithm uses the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, a weighting technique commonly used in information retrieval and text mining to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
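For reference, one common formulation of TF-IDF is shown below. The text does not state which variant or logarithm base the hit analysis algorithm uses, so this is only an illustration; the symbols are assumptions introduced here, with f_{t,d} the number of times term t occurs in document d and N the number of documents in the corpus D.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t,D) = \log\frac{N}{\lvert\{d \in D : t \in d\}\rvert}, \qquad
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
```

The first factor rises with the count of the term in the document, and the second falls as the term appears in more documents of the corpus, which matches the two proportionalities described above.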
Therefore, according to the training file generation and evaluation method provided by the invention, the original file is obtained, the field information and the training entity of the original file are obtained, and the original file is processed according to the field information and the training entity to obtain the labeled file; recognizing the semantics of the labeled file and performing sequence labeling on the labeled file to obtain a training file; the technical effect of automatically obtaining the training file is achieved, the influence of human errors is eliminated, and the generation quality and the generation speed of the training file are guaranteed.
The training file is input into the intelligent search model corresponding to the field information to obtain a training result, the training result is calculated through the hit analysis algorithm to obtain a hit rate, and the training file and the hit rate are summarized to generate a hit analysis report. Sending the hit rate of the training result to the user side gives labeling managers a reference metric for the entity recognition model and/or the natural language understanding model, so that they can obtain high-quality sample files. This achieves the technical effect of training the intelligent search model quickly and accurately, and solves the current problem that the labeling quality of training samples cannot be guaranteed because their real hit rate cannot be obtained.
In a preferred embodiment, referring to fig. 2, the receiving an original file and acquiring domain information and a training entity of the original file includes:
S101: acquiring an original file, performing field identification on the original file to obtain field information, and performing entity identification on the original file to obtain an independent entity;
In this step, the original file is obtained either by extracting it from a storage server in which it is pre-stored, or by receiving it from the user side.
S102: and obtaining the code of the independent entity through a preset relation list, and associating the code with the independent entity.
In this step, the relation list is stored in a relational database and contains codes and coding entities, where each code corresponds to at least one coding entity. The coding entity corresponding to an independent entity is looked up in the relation list, the code corresponding to that coding entity is set as the target code, and the target code is loaded into the original file to associate it with the independent entity. For example, assume that the codes include a date code and a location code, namely DATE and LOCATION. The coding entities corresponding to the date code include: yesterday, today, tomorrow; the coding entities corresponding to the location code include: Beijing, Shanghai, Guangzhou, Shenzhen. If the original file is "How is the weather in Shenzhen today?", the independent entities and their associated codes are as follows:
today    Shenzhen
DATE     LOCATION
Here, DATE is the date code and LOCATION is the location code.
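The lookup in this step can be sketched as a dictionary inversion. The snippet is a minimal illustration that holds the relation list in memory rather than in a relational database; the names RELATION_LIST and associate_codes are hypothetical, and the entries are taken from the example above.

```python
# Hypothetical in-memory relation list: code -> coding entities.
RELATION_LIST = {
    "DATE": ["yesterday", "today", "tomorrow"],
    "LOCATION": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
}


def associate_codes(independent_entities, relation_list=RELATION_LIST):
    """Pair each independent entity with its target code (None if unlisted)."""
    # Invert the list once so every coding entity points at its code.
    entity_to_code = {
        entity: code
        for code, entities in relation_list.items()
        for entity in entities
    }
    return [(e, entity_to_code.get(e)) for e in independent_entities]


print(associate_codes(["today", "Shenzhen"]))
# [('today', 'DATE'), ('Shenzhen', 'LOCATION')]
```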
S103: judging whether two adjacent independent entities have an association relationship according to a preset relation rule; if they do, combining the two independent entities to form an associated entity and then checking whether the next two adjacent independent entities have an association relationship; if they do not, proceeding directly to check the next two adjacent independent entities.
In this step, the relationship rule is used to specify the association relationship between codes. Extracting codes of two adjacent independent entities in an original file, and judging whether an association relationship exists between the two codes according to a relationship rule; and if the two codes have the association relationship, copying the independent entities corresponding to the two codes and combining the independent entities to form the association entity.
For example, for an original file concerning the 2019 medical insurance policy of Jiangsu, step S101 identifies two independent entities, "medical insurance policy" and "Jiangsu"; step S102 obtains the code "LOCATION" for "Jiangsu" and the code "POLICY" for "medical insurance policy"; if the relation rule specifies an association relationship between the code "LOCATION" and the code "POLICY", the associated entity "Jiangsu medical insurance policy" is obtained.
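A minimal sketch of this adjacent-pair check follows. It assumes the relation rule is a set of ordered code pairs and that an associated entity is formed by joining the two entity strings with a space; both choices, and the names RELATION_RULE and build_associated_entities, are hypothetical.

```python
# Hypothetical relation rule: ordered pairs of codes that are associated.
RELATION_RULE = {("LOCATION", "POLICY")}


def build_associated_entities(coded_entities, relation_rule=RELATION_RULE):
    """Scan adjacent (entity, code) pairs and merge those whose codes are associated.

    The independent entities are copied rather than consumed, so they remain
    available as training entities alongside the merged results.
    """
    associated = []
    for (left, left_code), (right, right_code) in zip(coded_entities, coded_entities[1:]):
        if (left_code, right_code) in relation_rule:
            associated.append(f"{left} {right}")
    return associated


entities = [("Jiangsu", "LOCATION"), ("medical insurance policy", "POLICY")]
print(build_associated_entities(entities))
# ['Jiangsu medical insurance policy']
```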
S104: and setting the independent entity and the associated entity as training entities.
In this step, the independent entities and the associated entities are set as training entities and deduplicated, which keeps the set of training entities concise and accurate.
In a preferred embodiment, the performing domain identification on the original file to obtain domain information, and the performing entity identification on the original file to obtain an independent entity includes:
S101-1: identifying the original file through a preset domain list to obtain the words in the original file that correspond to the domain keywords in the domain list.
In this step, the domain list includes domain titles and domain keywords, and each domain title has at least one domain keyword. In this embodiment, the domain titles at least include fund auditing, intelligent supervision, and macro decision.
S101-2: and acquiring the number of the words appearing in the original file, setting the domain keyword corresponding to the word with the largest number as a target keyword, and setting the domain title of the target keyword in the domain list as domain information.
In this step, in the original file, the number of occurrences of the words corresponding to the domain keywords is sequentially obtained, the word with the largest number is obtained, and the domain keyword corresponding to the word is set as the target keyword.
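Steps S101-1 and S101-2 can be sketched as a keyword count. This is a minimal illustration that assumes the original file has already been split into words; the domain titles come from this embodiment, while the keywords, the tie-breaking rule (the first keyword reaching the highest count wins) and the names DOMAIN_LIST and identify_domain are hypothetical.

```python
from collections import Counter

# Hypothetical domain list: domain title -> domain keywords (keywords are invented).
DOMAIN_LIST = {
    "fund auditing": ["audit", "fund"],
    "intelligent supervision": ["supervision", "compliance"],
    "macro decision": ["forecast", "trend"],
}


def identify_domain(words, domain_list=DOMAIN_LIST):
    """Return the domain title whose keyword occurs most often, or None."""
    counts = Counter(words)
    best_title, best_count = None, 0
    for title, keywords in domain_list.items():
        for keyword in keywords:
            if counts[keyword] > best_count:
                best_title, best_count = title, counts[keyword]
    return best_title
```

For instance, identify_domain(["fund", "audit", "audit", "trend"]) selects "fund auditing", because "audit" is the most frequent keyword.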
S101-3: and carrying out entity recognition on the words in the original file through an entity recognition model so as to obtain independent entities in the original file.
In this embodiment, the entity recognition model is a Conditional Random Field (CRF) model, a discriminative probabilistic model commonly used for labeling or analyzing sequence data such as natural language text or biological sequences. Because obtaining named entities through a conditional random field model is common general knowledge for those skilled in the art, and the technical problem addressed in this step is how to obtain the domain to which the original file belongs and its independent entities, the conditional random field model is not described in detail in this application.
In a preferred embodiment, after the entity identifying the original file to obtain the independent entity, the method further includes:
S101-4: identifying the independent entity through a preset synonym database in which synonyms are stored, obtaining the synonyms that have the same meaning as the independent entity, and setting those synonyms as independent entities.
In this step, the synonym database contains synonym sets, and the synonyms within a set have the same meaning. Each independent entity is compared against the synonym sets in turn to find a word identical to it, and all synonyms in the synonym set containing that word are set as independent entities.
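The synonym expansion can be sketched as follows. The snippet is a minimal illustration that keeps the synonym sets in memory instead of a database; the set contents and the names SYNONYM_SETS and expand_with_synonyms are hypothetical.

```python
# Hypothetical synonym sets; the words are invented for illustration.
SYNONYM_SETS = [
    {"medical insurance", "health insurance"},
    {"Shenzhen", "Shenzhen City"},
]


def expand_with_synonyms(independent_entities, synonym_sets=SYNONYM_SETS):
    """Add every synonym of each independent entity, keeping first-seen order."""
    expanded = []
    for entity in independent_entities:
        candidates = [entity]
        for synonym_set in synonym_sets:
            if entity in synonym_set:
                # Sorted so the result does not depend on set iteration order.
                candidates.extend(sorted(synonym_set))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```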
In a preferred embodiment, referring to fig. 3, the processing the original document according to the domain information and the training entity to obtain the markup document includes:
S111: labeling the original file according to the training entity to obtain a label processing file;
In this step, words in the original file are labeled according to the independent entities and associated entities among the training entities to obtain the label processing file.
S112: loading the field information into the label processing file to obtain a label file.
In this step, the field information is used as part of the title or part of the file name of the label processing file, which loads the field information into the label processing file and thereby converts it into the label file.
In a preferred embodiment, referring to fig. 4, the identifying semantics of the markup file and performing sequence markup on the markup file to obtain a training file includes:
S201: performing semantic recognition on the labeled file to obtain a query intention.
In this step, semantic recognition is essentially a text classification task: the natural language understanding model recognizes the semantics of the labeled file to obtain its query intention, where the original file has at least one query intention. For example, if the original file is "How is the weather in Shenzhen today?", the user is asking about the weather, so "query weather" can be regarded as the intention.
It should be noted that recognizing text semantics through a natural language understanding model is straightforward for a person skilled in the art, and the problem addressed by this application is how to determine whether a training file meets the operator's expectations; the working principle of the natural language understanding model is therefore not described here.
S202: and filling slot values in the label file according to the codes in the label file so as to realize the sequence labeling of the training entities in the label file.
In this step, the natural language understanding model fills slot values in the label file according to the codes in the label file, so that the independent entities or associated entities that carry codes are accurately sequence-labeled.
In this embodiment, slot value filling is essentially the task of sequence-labeling the entities in the annotation file using the BIO scheme.
Based on the above example, for example, the original file is: "how do the weather of Shenzhen today? Here, the weather query is an intention, where the weather query is specific, and the weather query is specific, where the weather query is specific, and where the weather query is specific, where the user also delivers the information, (where, date, today), and where, the independent entity or the associated entity corresponding to the location code and the date code is the information slot. Or "how does the weather of Shenzhen today? For example, the intention is classified into the intention of "inquire weather" by using a text classification method during the intention identification, and the intention can be labeled as follows by using a sequence labeling method during the slot value filling:
Text (source word order): today / Shenzhen / how is the weather
Tags: B_DATE B_LOCATION O O O O O O
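The BIO labeling above can be sketched in Python as follows. This is a minimal illustration, assuming the query is already tokenized and the coded entities are known as token spans; the function name `bio_tags`, the span format and the four-token tokenization are assumptions made for the example, not part of the application.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per token.

    Each span is (start, end, code) with `end` exclusive; `code` stands in
    for the entity code attached to a training entity in the labeled file.
    """
    tags = ["O"] * len(tokens)          # everything is outside by default
    for start, end, code in spans:
        tags[start] = f"B_{code}"       # first token of the entity
        for k in range(start + 1, end):
            tags[k] = f"I_{code}"       # remaining tokens of the entity
    return tags


# Hypothetical tokenization of the weather query, in source word order.
tokens = ["today", "Shenzhen", "weather", "how"]
spans = [(0, 1, "DATE"), (1, 2, "LOCATION")]
tags = bio_tags(tokens, spans)
print(list(zip(tokens, tags)))
```

Entities that span several tokens receive one `B_` tag followed by `I_` tags, which is what distinguishes BIO labeling from a plain per-token class label.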
It should be noted that slot value filling is the task of sequence-labeling the entities in a text based on natural language understanding technology and belongs to the prior art, so a person skilled in the art can readily perform it. The problem solved by the present application is how to sequence-label the valuable entities in the text in a targeted manner; the specific flow of slot value filling is therefore not described in detail here.
S203: Combine the query intent and the sequence-labeled file to form the training file.
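One way to picture the result of S203 is a single record that carries the field information, the query intent and the sequence labels together. The sketch below is illustrative only: the class name `TrainingFile`, its field names and the JSON serialization are assumptions, since the application does not specify a file layout.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TrainingFile:
    domain: str         # field information carried by the labeled file
    intent: str         # query intent from semantic recognition (S201)
    tokens: List[str]   # text of the labeled file
    tags: List[str]     # BIO sequence labels from slot value filling (S202)

    def __post_init__(self) -> None:
        # A sequence labeling is only valid if every token has one tag.
        if len(self.tokens) != len(self.tags):
            raise ValueError("each token needs exactly one tag")


record = TrainingFile(
    domain="weather",
    intent="query_weather",
    tokens=["today", "Shenzhen", "weather", "how"],
    tags=["B_DATE", "B_LOCATION", "O", "O"],
)
serialized = json.dumps(asdict(record), ensure_ascii=False)
print(serialized)
```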
In a preferred embodiment, referring to fig. 5, inputting the training file into the intelligent search model corresponding to the field information to obtain the training result includes:
S301: Select the corresponding intelligent search model in the production environment according to the field information of the training file, and input the training file into that intelligent search model.
In this step, each intelligent search model in the production environment has a professional label, which describes the field in which that model is good at prediction or classification. The professional label matching the field information is looked up in the production environment, the intelligent search model corresponding to that label is selected as the target model, and the training file is input into the target model.
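The label-matching selection in S301 can be sketched as a lookup from professional label to model. This is a simplified illustration: the registry dictionary, the `select_model` function and the placeholder `weather_model` are hypothetical, and exact string equality is assumed as the matching rule because the application does not say how labels are compared.

```python
from typing import Callable, Dict, List

# A search model is treated here as a callable from a training file to related files.
SearchModel = Callable[[dict], List[str]]


def select_model(registry: Dict[str, SearchModel], domain: str) -> SearchModel:
    """Return the model whose professional label equals the field information."""
    try:
        return registry[domain]
    except KeyError:
        raise LookupError(f"no intelligent search model labeled {domain!r}") from None


def weather_model(training_file: dict) -> List[str]:
    # Placeholder: a real model would return the related files it retrieves.
    return [f"weather result for {training_file['intent']}"]


registry = {"weather": weather_model}
target = select_model(registry, "weather")
print(target({"intent": "query_weather"}))
```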
It should be noted that the production environment refers to a service system that formally provides services to external users, and the intelligent search model refers to a search engine deployed on a server of that service system and built on a machine learning model. "Machine learning model" is a general term for algorithms that mine implicit rules from large amounts of historical data and use them for prediction or classification. The intelligent search model receives sample data, operates on it through its function, and outputs a prediction result or a classification result.
S302: The intelligent search model obtains a training result according to the query intent and the labeled file in the training file.
The training result refers to a prediction result or a classification result obtained by calculating the training file through the function of the intelligent search model.
In a preferred embodiment, referring to fig. 6, calculating the training result by the hit analysis algorithm to obtain the hit rate includes:
S311: Calculate, through the hit analysis algorithm, how often each training entity in the training file occurs in the training result, to obtain the word frequency describing the importance of the training entity to a related file.
In this step, the term frequency (word frequency) is the frequency with which a term appears in a document. This embodiment uses a frequency rather than a raw count so that a long document does not inflate the value merely because it contains more words.
In this embodiment, the hit analysis algorithm has a word frequency objective function, and the word frequency of each training entity in the training result is calculated through the word frequency objective function.
Wherein the word frequency objective function is:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
In the above formula, $tf_{i,j}$ is the word frequency of the i-th training entity of the training file in the j-th related file; $n_{i,j}$ is the number of times the i-th training entity occurs in the j-th related file of the training result; and the denominator $\sum_k n_{k,j}$ is the sum of the occurrence counts in the j-th related file, taken over all training entities (indexed by k).
This normalizes each training entity in the training file and correctly evaluates the importance of each training entity to each related file; that is, the importance of a training entity in a related file increases with the number of times the entity occurs in it.
For example, if a certain related file in the training result contains 100 words in total and the word "Shanghai" appears 3 times, the word frequency of "Shanghai" in that file is 3/100 = 0.03.
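The word-frequency example can be reproduced with a few lines of Python. This is a minimal sketch that divides by the total token count of the related file, as the 3/100 example does; the function name `term_frequency` and the synthetic 100-token document are assumptions for illustration.

```python
from typing import List


def term_frequency(entity: str, doc_tokens: List[str]) -> float:
    """Occurrences of `entity` divided by the number of tokens in the file."""
    if not doc_tokens:
        return 0.0  # an empty related file contributes no frequency
    return doc_tokens.count(entity) / len(doc_tokens)


# Synthetic related file: 100 tokens, 3 of which are "Shanghai".
doc = ["Shanghai"] * 3 + ["other"] * 97
tf = term_frequency("Shanghai", doc)
print(tf)
```

Dividing by the file length rather than using the raw count is what keeps long related files from dominating the score.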
S312: Calculate, through the hit analysis algorithm, the number of related files in the training result that contain each training entity, to obtain the inverse document frequency (reverse file frequency) describing how rare the training entity is in the training result.
In this step, the inverse document frequency (IDF) is derived from the number of documents in a document set that contain a given word. It represents the general importance of a training entity across the training result.
In this embodiment, the hit analysis algorithm includes an inverse objective function, through which the inverse document frequency of each training entity in the training result is calculated.
Wherein the inverse objective function is as follows:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
In the above formula, $idf_i$ is the inverse document frequency, across the training result, of the i-th training entity of the training file; $|D|$ is the total number of files in the document set, which in the present application is the total number of related files in the training result; and $|\{j : t_i \in d_j\}|$ is the number of related files containing the word $t_i$ (that is, the number of files with $n_{i,j} \neq 0$). The inverse document frequency thus indicates the importance of a training entity across the training result: the rarer the entity, the higher its weight, so the value decreases as the number of related files containing the entity grows. Based on the above example, if the training entity "Shanghai" appears in 1,000 related files and the training result contains 10,000,000 related files in total, the inverse document frequency of "Shanghai" is log(10,000,000/1,000) = 4 (using the base-10 logarithm).
Optionally, a training entity may be absent from the training result; in that case the denominator of the inverse objective function is zero, which makes the function fail and may cause an error or even a crash of the computer program. A natural number is therefore added to the denominator of the inverse objective function so that the denominator can never be zero. For example, adding 1 gives the following denominator:
$$1 + |\{j : t_i \in d_j\}|$$
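A minimal Python sketch of the inverse objective function with the optional +1 in the denominator is given below. The function name, the representation of each related file as a set of words, and the `smooth` flag are assumptions for illustration; the base-10 logarithm is chosen to match the worked example.

```python
import math
from typing import List, Set


def inverse_document_frequency(entity: str, docs: List[Set[str]], smooth: bool = True) -> float:
    """log10(|D| / number of related files containing `entity`).

    With smooth=True, 1 is added to the denominator so that an entity
    absent from every related file cannot cause a division by zero.
    """
    if not docs:
        raise ValueError("the training result contains no related files")
    containing = sum(1 for doc in docs if entity in doc)
    denominator = containing + 1 if smooth else containing
    return math.log10(len(docs) / denominator)


# Synthetic training result: 100 related files, 9 of them mention "Shanghai".
docs = [{"Shanghai", "weather"}] * 9 + [{"Beijing", "weather"}] * 91
print(inverse_document_frequency("Shanghai", docs))   # entity present in some files
print(inverse_document_frequency("Guangzhou", docs))  # entity absent from all files
```

Note that the added 1 slightly lowers every value compared with the unsmoothed formula, so scores computed with and without it should not be mixed.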
S313: Multiply the word frequency by the inverse document frequency through the hit analysis algorithm to obtain an entity matching value describing the degree of matching between each training entity and each related file.
In this step, the hit analysis algorithm has a hit objective function, through which the entity matching value of each training entity and each related file is obtained.
The hit objective function is as follows:
$$tfidf_{i,j} = tf_{i,j} \times idf_i$$
where $tfidf_{i,j}$ is the entity matching value of the i-th training entity and the j-th related file, $tf_{i,j}$ is the word frequency of the i-th training entity of the training file in the j-th related file, and $idf_i$ is the inverse document frequency of the i-th training entity across the training result. In summary, a training entity that occurs frequently within a particular related file but appears in few related files across the whole training result yields a high entity matching value. The hit objective function therefore tends to filter out common words and retain important ones.
For example, based on the above example, the resulting entity matching value is $tfidf_{i,j}$ = 0.03 × 4 = 0.12.
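The running example can be checked end to end with a short Python sketch. The function name `entity_match_value` and its four count parameters are assumptions for illustration; the unsmoothed base-10 form of the inverse document frequency is used because that is what the worked numbers rely on.

```python
import math


def entity_match_value(count_in_file: int, file_length: int,
                       total_files: int, files_with_entity: int) -> float:
    """tf-idf of one training entity for one related file."""
    tf = count_in_file / file_length                    # word frequency
    idf = math.log10(total_files / files_with_entity)   # inverse document frequency
    return tf * idf


# "Shanghai": 3 occurrences in a 100-word file, present in 1,000 of 10,000,000 files.
value = entity_match_value(3, 100, 10_000_000, 1_000)
print(round(value, 2))
```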
S314: Add the entity matching values for a related file to obtain a file matching value describing the degree of matching between the training file and that related file.
In this step, the entity matching values between all training entities in the training file and a given related file are added to obtain the file matching value, which describes the degree of matching between the training file and that related file.
Based on the above example, suppose the training file includes the training entities Beijing, Shanghai, Guangzhou and Shenzhen, and their entity matching values with the j-th related file are 0.03, 0.12, 0.01 and 0.10 respectively; then the file matching value of this training file and the j-th related file is 0.03 + 0.12 + 0.01 + 0.10 = 0.26.
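The summation in S314 can be sketched as follows, using the four entity matching values listed in the example (0.03, 0.12, 0.01 and 0.10). The function name `file_match_value` and the dictionary layout are assumptions for illustration.

```python
import math
from typing import Dict


def file_match_value(entity_scores: Dict[str, float]) -> float:
    """Sum the entity matching values of one related file."""
    # math.fsum avoids accumulating floating-point rounding error.
    return math.fsum(entity_scores.values())


# Entity matching values against the j-th related file, taken from the example.
scores = {"Beijing": 0.03, "Shanghai": 0.12, "Guangzhou": 0.01, "Shenzhen": 0.10}
total = file_match_value(scores)
print(round(total, 2))
```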
S315: Add the file matching values of the related files to obtain a hit rate describing the degree of matching between the training file and the training result.
In this step, the hit rate may be obtained by adding the file matching values of all related files in the training result; alternatively, the related files in the training result may be sorted in descending order of file matching value and only the top-ranked file matching values (for example, the top ten) added to obtain the hit rate.
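Both variants of S315, summing all file matching values or only the top-ranked ones, fit in one small function. This is a sketch under assumptions: the name `hit_rate`, the optional `top_k` parameter and the sample values are hypothetical.

```python
import math
from typing import List, Optional


def hit_rate(file_match_values: List[float], top_k: Optional[int] = None) -> float:
    """Sum all file matching values, or only the `top_k` largest ones."""
    ranked = sorted(file_match_values, reverse=True)  # descending order
    if top_k is not None:
        ranked = ranked[:top_k]
    return math.fsum(ranked)


# Hypothetical file matching values of four related files.
values = [0.26, 0.05, 0.40, 0.10]
print(round(hit_rate(values), 2))           # all related files
print(round(hit_rate(values, top_k=2), 2))  # two best-matching related files
```

Restricting the sum to the top-ranked files makes hit rates comparable between training results that return different numbers of related files.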
In a preferred embodiment, referring to fig. 7, after the training file and the hit rate are summarized to generate the hit analysis report, the method may further include:
S321: comparing the hit rate with a preset hit threshold;
S322: if the hit rate exceeds the preset hit threshold, judging that the training file is qualified, and sending the hit analysis report to the user side;
S323: if the hit rate does not exceed the preset hit threshold, judging that the training file is unqualified, and sending the hit analysis report to the user side.
In these steps, if the hit rate exceeds the preset hit threshold, it indicates that the field information and the labeled file obtained from the original file, together with the accuracy of the semantic recognition and sequence labeling of the labeled file, meet the requirements, so the corresponding training file is qualified.
If the hit rate does not exceed the preset hit threshold, it indicates that the field information and the labeled file obtained from the original file, or the accuracy of the semantic recognition and sequence labeling of the labeled file, do not meet the requirements, so the corresponding training file is unqualified. The annotator then needs the hit analysis report as a reference for adjusting the relationship list, and/or the domain list, and/or the relationship rule, and/or the synonym database, and/or the entity recognition model, and/or the natural language understanding model, in order to obtain a training file whose hit rate exceeds the hit threshold.
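The threshold decision in S321 to S323 can be sketched as below. The report class, its field names and the sample threshold are assumptions for illustration; the only behavior taken from the text is that a training file is qualified when its hit rate exceeds the preset hit threshold, and that the report is produced in both cases.

```python
from dataclasses import dataclass


@dataclass
class HitAnalysisReport:
    training_file_id: str
    hit_rate: float
    qualified: bool


def evaluate(training_file_id: str, hit_rate: float, hit_threshold: float) -> HitAnalysisReport:
    """Build the report that is sent to the user side in either case."""
    # Qualified only when the hit rate strictly exceeds the threshold (S322);
    # a hit rate equal to the threshold does not exceed it (S323).
    return HitAnalysisReport(training_file_id, hit_rate, hit_rate > hit_threshold)


report = evaluate("training-file-001", hit_rate=0.81, hit_threshold=0.5)
print(report)
```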
Example two:
Referring to fig. 8, a training file generation and evaluation apparatus 1 of the present embodiment includes:
the annotation server 11, configured to receive an original file, acquire field information and a training entity of the original file, process the original file according to the field information and the training entity to obtain a labeled file, and send the labeled file to the recognition server 12; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
the recognition server 12, configured to recognize the semantics of the labeled file through a preset natural language understanding model, perform sequence labeling on the labeled file to obtain a training file, and send the training file to the hit server 13;
and the hit server 13, provided with an intelligent search model and a hit analysis algorithm, and configured to input the training file into the intelligent search model corresponding to the field information to obtain a training result, calculate the training result through the hit analysis algorithm to obtain a hit rate, and summarize the training file and the hit rate to generate a hit analysis report.
This technical scheme can be applied to the field of model hosting in artificial intelligence. The field information and the training entity of the original file are acquired; the original file is processed according to the field information and the training entity to obtain a labeled file; the semantics of the labeled file are recognized and the labeled file is sequence-labeled to obtain a training file; the training file is input into the intelligent search model corresponding to the field information to obtain a training result; the training result is calculated by the hit analysis algorithm to obtain a hit rate; and the training file and the hit rate are summarized to generate a hit analysis report. This improves the quality and speed of training file generation and gives annotation managers an index for tuning the entity recognition model and/or the natural language understanding model, helping them obtain high-quality sample files and thereby supporting the machine learning task during model building.
Example three:
In order to achieve the above object, the present invention further provides a computer system, which includes a plurality of computer devices 2. The components of the training file generation and evaluation apparatus 1 according to the second embodiment may be distributed among different computer devices, and each computer device may be a smartphone, tablet computer, notebook computer, desktop computer, rack-mounted server, blade server or tower server (including an independent server or a server cluster formed by a plurality of servers) that executes programs, and the like. The computer device of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22, which may be communicatively coupled to each other via a system bus, as shown in FIG. 9. It should be noted that fig. 9 only shows a computer device with these components, but it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
In the present embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 21 may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the memory 21 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 21 is generally used for storing an operating system and various application software installed in the computer device, for example, the program code of the training file generation and evaluation apparatus in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device. In this embodiment, the processor 22 is configured to run program codes stored in the memory 21 or process data, for example, run a training file generation and evaluation device, so as to implement the training file generation and evaluation method according to the first embodiment.
Example four:
to achieve the above objects, the present invention also provides a computer-readable storage system including a plurality of storage media, such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor 22, implements corresponding functions. The computer-readable storage medium of the present embodiment is used for storing a training file generation and evaluation device, and when being executed by the processor 22, the training file generation and evaluation device implements the training file generation and evaluation method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A training file generation and evaluation method is characterized by comprising the following steps:
an annotation server receives an original file, obtains field information and a training entity of the original file, processes the original file according to the field information and the training entity to obtain a labeled file, and sends the labeled file to a recognition server; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
the recognition server recognizes the semantics of the labeled file through a preset natural language understanding model, performs sequence labeling on the labeled file to obtain a training file, and sends the training file to a hit server;
the hit server is provided with an intelligent search model and a hit analysis algorithm; the hit server inputs the training file into the intelligent search model corresponding to the field information to obtain a training result, calculates the training result through the hit analysis algorithm to obtain a hit rate, and summarizes the training file and the hit rate to generate a hit analysis report.
2. The method for generating and evaluating a training file according to claim 1, wherein the receiving an original file and acquiring domain information and a training entity of the original file comprises:
acquiring an original file, performing field identification on the original file to obtain field information, and performing entity identification on the original file to obtain an independent entity;
obtaining the code of the independent entity through a preset relation list, and associating the code with the independent entity;
judging whether two adjacent independent entities have an association relation according to a preset relation rule; if they have the association relation, combining the two independent entities to form an associated entity, and then identifying whether the next two adjacent independent entities have the association relation; if they do not have the association relation, identifying whether the next two adjacent independent entities have the association relation;
and setting the independent entity and the associated entity as training entities.
3. The method for generating and evaluating a training file according to claim 1, wherein the processing the original file according to the domain information and the training entity to obtain the markup file comprises:
marking the original file according to the training entity to obtain a marked processing file;
and loading the field information into the label processing file to obtain a label file.
4. The method for generating and evaluating a training file according to claim 1, wherein the identifying semantics of the labeled file and performing sequence labeling on the semantics to obtain the training file comprises:
performing semantic recognition on the labeled file to obtain a query intention;
filling slot values in the label file according to codes in the label file to realize sequence labeling of training entities in the label file;
and summarizing the query intentions and the labeled files with the sequence labels to form a training file.
5. The method for generating and evaluating the training file according to claim 1, wherein the inputting of the training file into the intelligent search model corresponding to the field information to obtain the training result comprises:
selecting a corresponding intelligent search model in a production environment according to the field information of the training file, and inputting the training file into the intelligent search model;
and the intelligent search model obtains a training result according to the query intention and the labeled file of the training file.
6. The method for generating and evaluating a training file according to claim 1, wherein the calculating the training result by the hit analysis algorithm to obtain the hit rate comprises:
calculating, through a hit analysis algorithm, how often each training entity in the training file occurs in the training result, to obtain a word frequency describing the importance of the training entity to a related file;
calculating, through the hit analysis algorithm, the number of related files in the training result that contain each training entity, to obtain an inverse document frequency describing how rare the training entity is in the training result;
multiplying the word frequency by the inverse document frequency through the hit analysis algorithm to obtain an entity matching value describing the degree of matching between each training entity and each related file;
adding the entity matching values of the related files to obtain a file matching value for describing the matching degree between the training file and the related files;
and adding the file matching values of the related files to obtain a hit rate for describing the matching degree between the training file and the training result.
7. The method for generating and evaluating training files according to claim 1, wherein the step of summarizing the training files and generating hit analysis reports according to hit rates further comprises:
comparing the hit rate with a preset hit threshold;
if the hit rate exceeds a preset hit threshold, judging that the training file is qualified, and sending the hit analysis report to a user side;
and if the hit rate does not exceed a preset hit threshold, judging that the training file is unqualified, and sending the hit analysis report to the user side.
8. A training file generation and evaluation device is characterized by comprising:
an annotation server, configured to receive an original file, acquire field information and a training entity of the original file, process the original file according to the field information and the training entity to obtain a labeled file, and send the labeled file to the recognition server; the field information is information data expressing the field to which the original file belongs, and the training entity is a named entity in the original file;
the recognition server is used for recognizing the semantics of the labeled file through a preset natural language understanding model, performing sequence labeling on the semantic of the labeled file to obtain a training file, and sending the training file to the hit server;
and the hit server is provided with an intelligent search model and a hit analysis algorithm and is used for inputting the training file into the intelligent search model corresponding to the field information to obtain a training result, calculating the training result through the hit analysis algorithm to obtain a hit rate, summarizing the training file and the hit rate and generating a hit analysis report.
9. A computer system comprising a plurality of computer devices, each computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processors of the plurality of computer devices when executing the computer program collectively implement the steps of the training file generation and evaluation method of any of claims 1 to 7.
10. A computer-readable storage medium comprising a plurality of storage media, each storage medium having a computer program stored thereon, wherein the computer programs stored in the storage media, when executed by a processor, collectively implement the steps of the training file generation and evaluation method according to any one of claims 1 to 7.
CN202010344715.4A 2020-04-27 2020-04-27 Training file generation and evaluation method, device, computer system and storage medium Pending CN111582497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010344715.4A CN111582497A (en) 2020-04-27 2020-04-27 Training file generation and evaluation method, device, computer system and storage medium

Publications (1)

Publication Number Publication Date
CN111582497A true CN111582497A (en) 2020-08-25

Family

ID=72115505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010344715.4A Pending CN111582497A (en) 2020-04-27 2020-04-27 Training file generation and evaluation method, device, computer system and storage medium

Country Status (1)

Country Link
CN (1) CN111582497A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270180A (en) * 2020-11-03 2021-01-26 北京阳光云视科技有限公司 BIO automatic labeling system and method for entity recognition training data
CN112380327A (en) * 2020-11-09 2021-02-19 天翼爱音乐文化科技有限公司 Cold-start slot filling method, system, device and storage medium
CN112380327B (en) * 2020-11-09 2022-03-04 天翼爱音乐文化科技有限公司 Cold-start slot filling method, system, device and storage medium

Similar Documents

Publication Publication Date Title
US20210224694A1 (en) Systems and Methods for Predictive Coding
Gottipati et al. Finding relevant answers in software forums
Jonnalagadda et al. A new iterative method to reduce workload in systematic review process
CN111343161B (en) Abnormal information processing node analysis method, abnormal information processing node analysis device, abnormal information processing node analysis medium and electronic equipment
CN107102993B (en) User appeal analysis method and device
US20150066968A1 (en) Authorship Enhanced Corpus Ingestion for Natural Language Processing
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN111753048B (en) Document retrieval method, device, equipment and storage medium
Malik et al. Accurate information extraction for quantitative financial events
Feng et al. Practical duplicate bug reports detection in a large web-based development community
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN111582497A (en) Training file generation and evaluation method, device, computer system and storage medium
CN112181490A (en) Method, device, equipment and medium for identifying function category in function point evaluation method
WO2021004118A1 (en) Correlation value determination method and apparatus
CN116860311A (en) Script analysis method, script analysis device, computer equipment and storage medium
CN116150376A (en) Sample data distribution optimization method, device and storage medium
Kalmar Bootstrapping Websites for Classification of Organization Names on Twitter.
CN117573956B (en) Metadata management method, device, equipment and storage medium
CN109408797A (en) A kind of text sentence vector expression method and system
CN113407859B (en) Resource recommendation method and device, electronic equipment and storage medium
WO2021056740A1 (en) Language model construction method and system, computer device and readable storage medium
CN112767022B (en) Mobile application function evolution trend prediction method and device and computer equipment
US20230162031A1 (en) Method and system for training neural network for generating search string
CN111930545B (en) SQL script processing method, SQL script processing device and SQL script processing server
CN117763109A (en) Data checking method for file full-text retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220520
Address after: 518000 China Aviation Center 2901, No. 1018, Huafu Road, Huahang community, Huaqiang North Street, Futian District, Shenzhen, Guangdong Province
Applicant after: Shenzhen Ping An medical and Health Technology Service Co.,Ltd.
Address before: Room 12G, Area H, 666 Beijing East Road, Huangpu District, Shanghai 200001
Applicant before: PING AN MEDICAL AND HEALTHCARE MANAGEMENT Co.,Ltd.