CN116401543A - Text data label generation method based on artificial intelligence and related equipment - Google Patents

Text data label generation method based on artificial intelligence and related equipment

Info

Publication number
CN116401543A
CN116401543A (application CN202310295186.7A)
Authority
CN
China
Prior art keywords
text
text data
similarity
inquiry
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310295186.7A
Other languages
Chinese (zh)
Inventor
赵越 (Zhao Yue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310295186.7A priority Critical patent/CN116401543A/en
Publication of CN116401543A publication Critical patent/CN116401543A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/60ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
    • G16H40/67ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for remote operation
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)

Abstract

The application relates to the fields of artificial intelligence and digital healthcare, and provides an artificial-intelligence-based text data label generation method, apparatus, electronic device and storage medium. The method comprises: acquiring doctors' online consultation data to obtain a consultation text data set; clustering the consultation text data set into a plurality of consultation text data clusters; designing task labels to generate keyword lists; calculating a first text similarity between each consultation text data cluster and a keyword list with a cosine similarity algorithm; calculating a second text similarity between the cluster and the keyword list with a text similarity algorithm; and matching each consultation text data cluster to its corresponding task label based on the first and second text similarities. The method automatically matches collected consultation data to corresponding labels, which improves the efficiency of obtaining text data labels.

Description

Text data label generation method based on artificial intelligence and related equipment
Technical Field
The application relates to the technical fields of artificial intelligence and digital healthcare, and in particular to an artificial-intelligence-based text data label generation method, apparatus, electronic device and storage medium.
Background
In assisted diagnosis and treatment scenarios, massive online consultations generate a large amount of text data. These data are the basis for building data labels in many supervised learning tasks, such as constructing script libraries, intention recognition, search and recommendation, and dialogue systems; such tasks in turn greatly assist online consultation and improve consultation efficiency and conversion rates.
At present, most data labels are generated manually. However, methods that depend on manual definition cannot scale: the resulting label systems are neither very extensible nor very rich. Moreover, as the volume of text data grows exponentially, manually generating data labels becomes time-consuming and labor-intensive, so label generation for text data remains inefficient.
Disclosure of Invention
In view of the foregoing, it is necessary to propose an artificial-intelligence-based text data label generation method, apparatus, electronic device and storage medium to address the technical problem of how to improve the efficiency of generating labels for text data. The related equipment comprises an artificial-intelligence-based text data label generation apparatus, an electronic device and a storage medium.
The application provides a text data label generation method based on artificial intelligence, which comprises the following steps:
acquiring on-line consultation data of a doctor to acquire a consultation text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
In some embodiments, the acquiring the on-line interview data of the doctor to obtain the interview text dataset comprises:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In some embodiments, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In some embodiments, the designing the task tag according to the preset manner to generate the keyword list includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In some embodiments, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain a first text similarity includes:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In some embodiments, the calculating the similarity between the query text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity includes:
calculating an average value of text vector similarity between a target inquiry text and all keyword vectors in the keyword list as a target average similarity of the target inquiry text, wherein the target inquiry text is any one of the inquiry text data clusters;
traversing the inquiry text data cluster to obtain the target average similarity corresponding to each inquiry text;
and calculating the average value of all the target average similarities as the first text similarity of the inquiry text data cluster.
In some embodiments, the matching the task labels for the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity and the second text similarity to obtain inquiry text similarity;
sequencing the similarity of the inquiry texts according to the sequence from big to small to obtain a sequencing result;
and determining task labels corresponding to the inquiry text data clusters based on the sorting results.
The embodiment of the application also provides a text data label generating device based on artificial intelligence, which comprises an acquisition module, a clustering module, a generating module, a first calculating module, a second calculating module and a matching module:
the acquisition module is used for acquiring on-line consultation data of a doctor to acquire a consultation text data set;
the clustering module is used for clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
The generating module is used for designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
the first calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm so as to obtain a first text similarity;
the second calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm so as to obtain second text similarity;
the matching module is configured to match the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity.
The embodiment of the application also provides electronic equipment, which comprises:
a memory storing at least one instruction;
and the processor executes the instructions stored in the memory to realize the text data label generating method based on artificial intelligence.
Embodiments of the present application also provide a computer-readable storage medium having at least one instruction stored therein, the at least one instruction being executed by a processor in an electronic device to implement the artificial intelligence-based text data tag generation method.
With the method and apparatus of the present application, the collected consultation data are clustered and a keyword list is created for each task label; different text similarity calculations between the keyword lists and the clustered consultation data allow each category of consultation data to be automatically matched with a more accurate and appropriate task label, which effectively improves the efficiency of obtaining text data labels.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of an artificial intelligence based text data tag generation method in accordance with the present application.
Fig. 2 is a functional block diagram of a preferred embodiment of an artificial intelligence based text data tag generating apparatus according to the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the text data tag generation method based on artificial intelligence according to the present application.
Detailed Description
In order that the objects, features and advantages of the present application may be more clearly understood, the application is described in more detail below with reference to the specific embodiments illustrated in the accompanying drawings. It should be noted that, where there is no conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application; the described embodiments are merely some, rather than all, of the embodiments of the present application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The embodiment of the application provides a text data label generation method based on artificial intelligence, which can be applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices and the like.
The electronic device may be any electronic product that can interact with a customer in a human-machine manner, such as a personal computer, tablet, smart phone, personal digital assistant (Personal Digital Assistant, PDA), gaming machine, interactive web television (Internet Protocol Television, IPTV), smart wearable device, etc.
The electronic device may also include a network device and/or a client device. Wherein the network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a Cloud based Cloud Computing (Cloud Computing) composed of a large number of hosts or network servers.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
FIG. 1 is a flow chart of a preferred embodiment of the artificial-intelligence-based text data label generation method of the present application. The order of the steps in the flowchart may be changed, and some steps may be omitted, according to various needs.
In an alternative embodiment, the large amount of text data generated by online consultation can provide data support for many supervised learning tasks, such as constructing a script library, intention recognition, disease diagnosis, etc. Corresponding text data can therefore be collected according to different supervised learning tasks, thereby improving the efficiency of the data acquisition process.
S10, acquiring on-line consultation data of a doctor to acquire a consultation text data set.
In an alternative embodiment, the acquiring the on-line consultation data of the doctor to acquire the consultation text data set includes:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In this optional embodiment, a web crawler may collect the message data sent by multiple doctors during each online consultation, taking each text message a doctor sends as one piece of consultation text data; all collected consultation text data form the initial consultation text data set.
In this alternative embodiment, the initial consultation text data set may contain unnecessary consultation text data, such as "good", "ok", "right", "hello", "?" and other text messages that carry little semantic information. Redundant consultation text data can therefore be screened out and filtered by writing a script command that specifies in advance which consultation text data are not needed.
In this optional embodiment, the data in the screened initial consultation text data set may be stored in a pre-constructed consultation text database, which may be a common text database such as txtSQL, TXTDB or openSSL. In this embodiment, the screened initial consultation text data set stored in the consultation text database is used as the consultation text data set.
Therefore, by collecting doctors' message data during online consultations and screening out redundant messages, a consultation text data set that provides accurate data support for the subsequent steps can be obtained quickly.
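As an illustration of this screening step, the following is a minimal sketch in Python, assuming a simple stop-phrase and minimum-length filter; the stop phrases, length threshold and function names are illustrative assumptions rather than the patent's actual script.

```python
# Minimal sketch of the screening step in S10 (hypothetical filter rules,
# not the patent's actual script command).
STOP_PHRASES = {"good", "ok", "right", "hello", "?"}  # low-information messages
MIN_CHARS = 4  # assumed minimum length for a message to carry semantics

def screen_messages(initial_messages):
    """Keep only doctor messages that carry enough semantic information."""
    kept = []
    for msg in initial_messages:
        text = msg.strip()
        if not text or text.lower() in STOP_PHRASES or len(text) < MIN_CHARS:
            continue  # drop greetings, confirmations and other redundant messages
        kept.append(text)
    return kept

consultation_text_set = screen_messages(
    ["hello", "pay attention to rest and avoid staying up late", "ok"]
)
# -> ["pay attention to rest and avoid staying up late"]
```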
S11, clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters.
In an optional embodiment, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In this alternative embodiment, each piece of inquiry text data in the inquiry text data set may be represented as a binary vector through one-hot encoding, and the binary vector of each piece of inquiry text data is used as its inquiry text vector. Illustratively, a patient's three symptoms of "fever", "nasal obstruction" and "cough" can be represented as "100", "010" and "001", respectively, after one-hot encoding.
In this alternative embodiment, the preset text clustering algorithm may be the single-pass algorithm, which does not require a pre-specified number of clusters and completes clustering by sequentially calculating the similarity between each incoming inquiry text vector and the existing ones. The single-pass algorithm usually measures similarity with cosine similarity or Euclidean distance, whereas this method mainly targets text data, so the edit distance, which is better suited to text, can be substituted instead. This optimization yields more accurate similarities between inquiry text vectors, and the edit-distance-optimized single-pass algorithm is used as the text clustering algorithm in this method.
In this alternative embodiment, all the inquiry text vectors may be clustered by the text clustering algorithm to obtain inquiry text data clusters of multiple categories. Illustratively, for 10 inquiry text vectors numbered 1 to 10, the main clustering process is as follows (a minimal code sketch of this procedure is given at the end of this subsection):
a) Randomly select one inquiry text vector as a clustering seed, for example the inquiry text vector numbered 1;
b) Calculate the similarity between the clustering seed and the other 9 inquiry text vectors through the edit distance;
c) If the similarity is greater than a preset similarity threshold, classify the corresponding inquiry text vector into the same category as the clustering seed. For example, if the similarities between inquiry text vectors 2, 3, 4 and inquiry text vector 1 are greater than the preset similarity threshold, inquiry text vectors 1, 2, 3 and 4 form one category;
d) If the similarity is not greater than the preset similarity threshold, select one of the inquiry text vectors not exceeding the threshold as another clustering seed, and continue calculating the similarity between the remaining inquiry text vectors and each clustering seed, and so on, until all inquiry text vectors have been traversed and assigned to a category. For example, if the similarity between inquiry text vector 5 and the existing clustering seed is smaller than the preset similarity threshold, inquiry text vector 5 becomes a new clustering seed, and its similarity with inquiry text vectors 6, 7, 8, 9 and 10 is calculated; if the similarities with inquiry text vectors 6 and 7 exceed the threshold, vectors 5, 6 and 7 form one category. Finally, any one of inquiry text vectors 8, 9 or 10 becomes another clustering seed, for example vector 8; if its similarities with vectors 9 and 10 exceed the threshold, vectors 8, 9 and 10 form another category, giving three inquiry text data clusters in total.
In this scheme, the inquiry text data corresponding to the inquiry text vectors of each resulting category are taken as one inquiry text data cluster, so that textually similar inquiry text data are gathered together.
In this way, by optimizing the text clustering algorithm, inquiry text data with similar content can be classified more accurately.
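A minimal sketch of the edit-distance-based single-pass clustering described above is given below; the similarity threshold, the mapping from edit distance to a similarity score, and the choice of the first text in a cluster as its seed are illustrative assumptions.

```python
# Sketch of single-pass clustering with edit distance (steps a)-d) above).
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over a single row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

def similarity(a: str, b: str) -> float:
    """Map edit distance into [0, 1]; identical texts score 1 (assumed mapping)."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def single_pass_cluster(texts, threshold=0.6):
    """Assign each text to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list; its first element acts as the seed
    for text in texts:
        for cluster in clusters:
            if similarity(text, cluster[0]) > threshold:
                cluster.append(text)
                break
        else:  # no seed exceeded the threshold: start a new cluster
            clusters.append([text])
    return clusters
```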
S12, designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one.
In an optional embodiment, the designing the task tag according to the preset manner to generate a keyword list, where the tag corresponds to the keyword list one to one, includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In this alternative embodiment, a plurality of task labels corresponding to the task categories may be designed in advance according to different supervised learning tasks. This embodiment takes recognizing the doctor's intention during an online consultation as the example supervised learning task, where the task labels may be "cause", "life advice", "patient comfort", "rehabilitation expectation", "medicine efficacy" and "medication inquiry", each representing a doctor intention. Label classification of the consultation text data set is achieved by matching each consultation text data cluster to its corresponding task label, thereby identifying the doctor intention behind each piece of text data in the set. For example, if one piece of text data in the consultation text data set is "your illness is not serious", the task label automatically identified by the supervised learning task is "patient comfort", i.e. the doctor's intention at that moment is to comfort the patient.
In this alternative embodiment, a plurality of keywords corresponding to each task tag may be configured in advance by a medical expert, and the plurality of keywords corresponding to each task tag may be used as the keyword list of the task tag.
Illustratively, Table 1 shows the keyword list configured for each task label in this solution (the table is provided as an image in the original publication and is not reproduced here).
Table 1
Therefore, corresponding task labels can be designed according to different task categories, and corresponding keywords are configured for each task label, so that the text data can be conveniently matched with the corresponding task labels according to the keywords in the subsequent process.
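For illustration only, the keyword lists might be organized as a simple mapping from task label to keywords. Since Table 1 is not reproduced here, the keywords below are hypothetical placeholders, not the configuration actually used in this solution.

```python
# Hypothetical keyword lists per task label (Table 1 is not reproduced;
# these keywords are placeholders chosen for illustration).
KEYWORD_LISTS = {
    "cause":                      ["caused by", "due to", "reason"],
    "life advice":                ["attention", "diet", "rest"],
    "patient comfort":            ["common", "not serious", "don't worry"],
    "rehabilitation expectation": ["recovery", "course", "improve"],
    "medicine efficacy":          ["efficacy", "take effect", "works"],
    "medication inquiry":         ["take", "dosage", "medication"],
}
```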
S13, calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity.
In an alternative embodiment, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain the first text similarity includes:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In this alternative embodiment, the word vector of each inquiry text in the inquiry text data cluster and the word vector of each keyword in the keyword list may be calculated through a word2vec word vector model and used as the text word vectors and keyword vectors respectively. word2vec is an open-source word vector computation tool that can convert the words in the inquiry texts and the keywords into word vectors. A word vector consists of numbers in multiple dimensions; illustratively, if the inquiry text is "attention rest", the words "attention" and "rest" can be converted by word2vec into the two text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361].
In this alternative embodiment, the average of the elements of each text word vector may be calculated as the text word value corresponding to that vector, i.e. text word vectors and text word values correspond one to one, and all text word values of an inquiry text are used as elements to generate that inquiry text's vector. Illustratively, the element averages of the text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361] are 0.217 and -0.042 respectively, so the inquiry text "attention rest" corresponds to the inquiry text vector [0.217, -0.042].
Let $S_{pq}$ denote the inquiry text vector of the $q$-th text in the $p$-th inquiry text data cluster, $S_{pq} = [S_{pq,0}, S_{pq,1}, \ldots, S_{pq,s}, \ldots, S_{pq,l-1}]$, where $S_{pq,s}$ is the $s$-th text word value of the $q$-th text in the $p$-th inquiry text data cluster and $l$ is the number of text word values of that text.
In this alternative embodiment, in order to make the distribution of the inquiry text vectors in each inquiry text data cluster uniform in each dimension, the inquiry text vector of each inquiry text may be standardized to obtain a standard text vector, which satisfies the relation:

$$\hat{S}_{pq} = \frac{S_{pq} - u_p}{\alpha_p}$$

where $\hat{S}_{pq}$ is the standard text vector of the $q$-th text in the $p$-th inquiry text data cluster, and $u_p$ and $\alpha_p$ are respectively the mean and standard deviation of the inquiry text vectors in the $p$-th inquiry text data cluster.

For example, if inquiry text data cluster a contains 3 texts whose inquiry text vectors are $S_{p1} = [0.2, 100]$, $S_{p2} = [0.9, 120]$ and $S_{p3} = [0.8, 130]$, the mean and standard deviation of the inquiry text vectors in cluster a are $u_p = [0.633, 116.67]$ and $\alpha_p = [0.309, 12.47]$, so the standard text vectors of the three texts are approximately $[-1.40, -1.34]$, $[0.86, 0.27]$ and $[0.54, 1.07]$.
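The construction of the inquiry text vectors and their standardization can be sketched as follows; the word vectors are made up for illustration (in the method they come from the word2vec model), and the sketch assumes all texts in a cluster yield vectors of the same length.

```python
import numpy as np

# Sketch of text-vector construction and per-cluster standardization (S13).
def text_vector(word_vectors):
    """One 'text word value' per word: the mean of that word's embedding elements."""
    return np.array([np.mean(v) for v in word_vectors])

def standardize(cluster_vectors):
    """Z-score each dimension across the inquiry text vectors of one cluster."""
    X = np.vstack(cluster_vectors)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# "attention rest" example from the text:
v_attention = [0.682, 0.388, -0.466, 0.262]
v_rest = [0.823, -0.588, -0.765, 0.361]
print(text_vector([v_attention, v_rest]))        # ~[0.217, -0.042]

# cluster a example from the text:
print(standardize([[0.2, 100], [0.9, 120], [0.8, 130]]))
# ~[[-1.40, -1.34], [0.86, 0.27], [0.54, 1.07]]
```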
In this alternative embodiment, a cosine similarity algorithm may be used to calculate the cosine similarity between the standard text vectors and the keyword vectors as the text vector similarity. Let $Sim_{pq,ij}$ denote the cosine similarity between the $q$-th standard text vector in the $p$-th inquiry text data cluster and the $j$-th keyword vector of the $i$-th task label. The cosine similarity $Sim_{pq,i}$ between the $q$-th standard text vector in the $p$-th inquiry text data cluster and the keyword vectors of all keywords in the keyword list of the $i$-th task label is then:

$$Sim_{pq,i} = \frac{1}{len_i} \sum_{j=0}^{len_i-1} Sim_{pq,ij}$$

where $len_i$ is the number of keywords in the keyword list of the $i$-th task label. Taking any one piece of inquiry text data in the inquiry text data cluster (for example the $q$-th) as the target inquiry text, the obtained $Sim_{pq,i}$ is the target average similarity of that target inquiry text; traversing the cluster yields the target average similarity of every inquiry text in it.

The cosine similarity $Sim_{i,p}$ between the $p$-th inquiry text data cluster and the keyword list of the $i$-th task label can then be calculated as:

$$Sim_{i,p} = \frac{1}{len_p} \sum_{q=0}^{len_p-1} Sim_{pq,i}$$

where $len_p$ is the total number of inquiry text data in the $p$-th inquiry text data cluster. In this scheme, the resulting cosine similarity $Sim_{i,p}$ between the $p$-th inquiry text data cluster and the keyword list of the $i$-th task label is used as the first text similarity of that inquiry text data cluster.
In this way, the first text similarity between the inquiry text data clusters and the keyword list can be calculated preliminarily through the cosine similarity algorithm, so that the follow-up process can be facilitated to match corresponding task labels for each inquiry text data cluster accordingly.
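A sketch of the first-text-similarity computation is shown below: cosine similarity between each standard text vector and each keyword vector, averaged first over the keyword list and then over the cluster. It assumes the text and keyword vectors share a common dimension, which the description leaves implicit.

```python
import numpy as np

# Sketch of Sim_{pq,i} (per-text average over keywords) and Sim_{i,p}
# (cluster-level average over texts) from S13.
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def first_text_similarity(standard_text_vectors, keyword_vectors):
    """Mean over texts of the mean cosine similarity to all keyword vectors."""
    per_text = [
        np.mean([cosine(t, k) for k in keyword_vectors])  # target average similarity
        for t in standard_text_vectors
    ]
    return float(np.mean(per_text))
```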
S14, calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity.
In this alternative embodiment, the TF-IDF algorithm may be selected as the text similarity algorithm for calculating the similarity between the inquiry text data cluster and the keyword list as the second text similarity. TF-IDF is an index that measures the importance of a word in a document: TF is the frequency of a keyword within one inquiry text data cluster, and IDF reflects how frequently the keyword occurs across all inquiry text data clusters.
In this alternative embodiment, the TF value of the $j$-th keyword of the $i$-th task label in the $p$-th inquiry text data cluster satisfies the relation:

$$TF_{ij,p} = \frac{n_{ij,p}}{N_p}$$

where $n_{ij,p}$ is the number of times the $j$-th keyword of the $i$-th task label appears in the $p$-th inquiry text data cluster, and $N_p$ is the number of all words appearing in the $p$-th inquiry text data cluster.

In this alternative embodiment, the IDF value of the $j$-th keyword of the $i$-th task label satisfies the relation:

$$IDF_{ij} = \log\frac{m}{m_j}$$

where $m$ is the total number of inquiry text data clusters and $m_j$ (the denominator) is the number of inquiry text data clusters containing the $j$-th keyword.

From the above two relations, the TF-IDF value of the $j$-th keyword of the $i$-th task label in the $p$-th inquiry text data cluster is:

$$TFIDF_{ij,p} = TF_{ij,p} \times IDF_{ij}$$

In this scheme, the maximum of the TF-IDF values of the keywords of the $i$-th task label in the $p$-th inquiry text data cluster is taken as the second text similarity between the $i$-th task label and the $p$-th inquiry text data cluster, i.e. $TFIDF_{i,p} = \max_j(TFIDF_{ij,p})$.
Therefore, the second text similarity between the inquiry text data cluster and the keyword list can be calculated through a text similarity algorithm, so that more accurate task labels can be matched for the inquiry text data cluster in the follow-up process according to the obtained second text similarity.
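The second-text-similarity computation can be sketched as below; treating each cluster as a list of tokens and returning a zero IDF when a keyword appears in no cluster are assumptions made for the sketch.

```python
import math

# Sketch of the TF-IDF based second text similarity (S14): TF within one
# cluster, IDF across all clusters, and TFIDF_{i,p} as the maximum over the
# label's keywords.
def second_text_similarity(keyword_list, clusters):
    """clusters: one token list per inquiry text data cluster."""
    m = len(clusters)
    scores = []
    for cluster in clusters:
        best = 0.0
        for kw in keyword_list:
            tf = cluster.count(kw) / max(len(cluster), 1)  # n_ij,p / N_p
            df = sum(1 for c in clusters if kw in c)       # clusters containing kw
            idf = math.log(m / df) if df else 0.0          # assumed: 0 when unseen
            best = max(best, tf * idf)
        scores.append(best)  # TFIDF_{i,p} for this cluster and label
    return scores
```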
S15, matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
In an optional embodiment, the matching the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity and the second text similarity to obtain inquiry text similarity;
sequencing the similarity of the inquiry texts in descending order to obtain a sequencing result;
and determining task labels corresponding to the inquiry text data clusters based on the sorting results.
In this alternative embodiment, the first text similarity alone can effectively distinguish task labels that are clearly different, but has difficulty distinguishing task labels that are relatively similar.
Illustratively, consider inquiry text data cluster 1 and inquiry text data cluster 2, each containing two pieces of inquiry text data. The two pieces in cluster 1 are "this illness is very common and not serious" and "this illness is quite common and generally not serious"; through the first text similarity on keywords such as "common", "not" and "serious", it is easy to judge that the corresponding task label is "patient comfort". The two pieces in cluster 2 are "pay attention to your diet to avoid repeated attacks and serious consequences" and "pay attention to your diet in daily life to avoid recurrence and serious symptoms"; with the first text similarity on keywords such as "attention" and "avoid", it is difficult to judge whether the corresponding task label should be "life advice" or "cause".
In this alternative embodiment, the inquiry text similarity may be obtained by multiplying the first text similarity and the second text similarity, which solves the problem that relatively similar task labels are difficult to distinguish using the first text similarity alone. The inquiry text similarity $score_{i,p}$ satisfies the relation:

$$score_{i,p} = \mathrm{abs}(Sim_{i,p} \times TFIDF_{i,p})$$

where $score_{i,p}$ is the inquiry text similarity between the $i$-th task label and the $p$-th inquiry text data cluster, and abs is the absolute value function.
In this alternative embodiment, the inquiry text similarity between the inquiry text data cluster and each task label may be calculated and the similarities sorted in descending order; the task label corresponding to the top-ranked inquiry text similarity is then selected as the matched task label, and finally that task label is assigned to the data in the inquiry text data cluster.
In this way, by combining the first text similarity and the second text similarity, the task labels can be more accurately matched with the inquiry text data clusters.
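Finally, the label-matching step can be sketched as follows; the dictionary-based interface is an assumption made for illustration, not an interface defined by the patent.

```python
# Sketch of S15: multiply the two similarities, rank them, and assign the
# top-ranked task label to the inquiry text data cluster.
def match_label(first_sim, second_sim):
    """first_sim / second_sim: {task_label: similarity} for one cluster."""
    scores = {
        label: abs(first_sim[label] * second_sim.get(label, 0.0))  # score_{i,p}
        for label in first_sim
    }
    ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranking[0][0], ranking  # best label plus the full ordering

label, ranking = match_label(
    {"patient comfort": 0.82, "life advice": 0.35},
    {"patient comfort": 0.40, "life advice": 0.10},
)
# label == "patient comfort"
```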
Referring to fig. 2, fig. 2 is a functional block diagram of a preferred embodiment of the text data tag generating apparatus based on artificial intelligence of the present application. The text data tag generating device 11 based on artificial intelligence comprises an acquisition module 110, a clustering module 111, a generating module 112, a first calculating module 113, a second calculating module 114 and a matching module 115. The unit/module referred to herein is a series of computer readable instructions capable of being executed by the processor 13 and of performing a fixed function, stored in the memory 12. In the present embodiment, the functions of the respective units/modules will be described in detail in the following embodiments.
In an alternative embodiment, the acquisition module 110 is configured to acquire on-line interview data of a doctor to obtain an interview text dataset.
In an alternative embodiment, the acquiring the on-line consultation data of the doctor to acquire the consultation text data set includes:
acquiring text messages sent by a doctor every time during online consultation as a piece of consultation text data to obtain an initial consultation text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial consultation text data set into a consultation text database to obtain a consultation text data set.
In an alternative embodiment, the clustering module 111 is configured to cluster the query text data in the query text data set to obtain a plurality of query text data clusters.
In an optional embodiment, the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
In an alternative embodiment, the generating module 112 is configured to design the task tag according to a preset manner to generate a keyword list, where the tag corresponds to the keyword list one by one.
In an optional embodiment, the designing the task tag according to the preset manner to generate a keyword list, where the tag corresponds to the keyword list one to one, includes:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
In this alternative embodiment, a plurality of task labels corresponding to the task categories may be designed in advance according to different supervised learning tasks. This embodiment takes recognizing the doctor's intention during an online consultation as the example supervised learning task, where the task labels may be "cause", "life advice", "patient comfort", "rehabilitation expectation", "medicine efficacy" and "medication inquiry", each representing a doctor intention. Label classification of the consultation text data set is achieved by matching each consultation text data cluster to its corresponding task label, thereby identifying the doctor intention behind each piece of text data in the set. For example, if one piece of text data in the consultation text data set is "your illness is not serious", the task label automatically identified by the supervised learning task is "patient comfort", i.e. the doctor's intention at that moment is to comfort the patient.
In this alternative embodiment, a plurality of keywords corresponding to each task tag may be configured in advance by a medical expert, and the plurality of keywords corresponding to each task tag may be used as the keyword list of the task tag.
Illustratively, Table 1 shows the keyword list configured for each task label in this solution (the table is provided as an image in the original publication and is not reproduced here).
Table 1
In an alternative embodiment, the first calculating module 113 is configured to calculate the similarity between the query text data cluster and the keyword list according to a cosine similarity algorithm to obtain the first text similarity.
In an alternative embodiment, the calculating the similarity between the query text data cluster and the keyword list according to the cosine similarity algorithm to obtain the first text similarity includes:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data clusters and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vector as a text word value, and constructing a consultation text vector of each consultation text based on the text word value;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
In this alternative embodiment, the word vector of each inquiry text in the inquiry text data cluster and the word vector of each keyword in the keyword list may be calculated through a word2vec word vector model and used as the text word vectors and keyword vectors respectively. word2vec is an open-source word vector computation tool that can convert the words in the inquiry texts and the keywords into word vectors. A word vector consists of numbers in multiple dimensions; illustratively, if the inquiry text is "attention rest", the words "attention" and "rest" can be converted by word2vec into the two text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361].
In this alternative embodiment, the average of the elements of each text word vector may be calculated as the text word value corresponding to that vector, i.e. text word vectors and text word values correspond one to one, and all text word values of an inquiry text are used as elements to generate that inquiry text's vector. Illustratively, the element averages of the text word vectors [0.682, 0.388, -0.466, 0.262] and [0.823, -0.588, -0.765, 0.361] are 0.217 and -0.042 respectively, so the inquiry text "attention rest" corresponds to the inquiry text vector [0.217, -0.042].
Let $S_{pq}$ denote the inquiry text vector of the $q$-th text in the $p$-th inquiry text data cluster, $S_{pq} = [S_{pq,0}, S_{pq,1}, \ldots, S_{pq,s}, \ldots, S_{pq,l-1}]$, where $S_{pq,s}$ is the $s$-th text word value of the $q$-th text in the $p$-th inquiry text data cluster and $l$ is the number of text word values of that text.
In this alternative embodiment, in order to make the distribution of the inquiry text vectors in each inquiry text data cluster uniform in each dimension, the inquiry text vector of each inquiry text may be standardized to obtain a standard text vector, which satisfies the relation:

$$\hat{S}_{pq} = \frac{S_{pq} - u_p}{\alpha_p}$$

where $\hat{S}_{pq}$ is the standard text vector of the $q$-th text in the $p$-th inquiry text data cluster, and $u_p$ and $\alpha_p$ are respectively the mean and standard deviation of the inquiry text vectors in the $p$-th inquiry text data cluster.

For example, if inquiry text data cluster a contains 3 texts whose inquiry text vectors are $S_{p1} = [0.2, 100]$, $S_{p2} = [0.9, 120]$ and $S_{p3} = [0.8, 130]$, the mean and standard deviation of the inquiry text vectors in cluster a are $u_p = [0.633, 116.67]$ and $\alpha_p = [0.309, 12.47]$, so the standard text vectors of the three texts are approximately $[-1.40, -1.34]$, $[0.86, 0.27]$ and $[0.54, 1.07]$.
In this alternative embodiment, a cosine similarity algorithm may be used to calculate the cosine similarity between the standard text vectors and the keyword vectors as the text vector similarity. Let $Sim_{pq,ij}$ denote the cosine similarity between the $q$-th standard text vector in the $p$-th inquiry text data cluster and the $j$-th keyword vector of the $i$-th task label. The cosine similarity $Sim_{pq,i}$ between the $q$-th standard text vector in the $p$-th inquiry text data cluster and the keyword vectors of all keywords in the keyword list of the $i$-th task label is then:

$$Sim_{pq,i} = \frac{1}{len_i} \sum_{j=0}^{len_i-1} Sim_{pq,ij}$$

where $len_i$ is the number of keywords in the keyword list of the $i$-th task label. Taking any one piece of inquiry text data in the inquiry text data cluster (for example the $q$-th) as the target inquiry text, the obtained $Sim_{pq,i}$ is the target average similarity of that target inquiry text; traversing the cluster yields the target average similarity of every inquiry text in it.

The cosine similarity $Sim_{i,p}$ between the $p$-th inquiry text data cluster and the keyword list of the $i$-th task label can then be calculated as:

$$Sim_{i,p} = \frac{1}{len_p} \sum_{q=0}^{len_p-1} Sim_{pq,i}$$

where $len_p$ is the total number of inquiry text data in the $p$-th inquiry text data cluster. In this scheme, the resulting cosine similarity $Sim_{i,p}$ between the $p$-th inquiry text data cluster and the keyword list of the $i$-th task label is used as the first text similarity of that inquiry text data cluster.
In an alternative embodiment, the second calculating module 114 is configured to calculate the similarity between the query text data cluster and the keyword list according to a text similarity algorithm to obtain the second text similarity.
In this alternative embodiment, the TF-IDF algorithm may be selected as the text similarity algorithm for calculating the similarity between the inquiry text data cluster and the keyword list as the second text similarity. TF-IDF is an index that measures the importance of a word in a document: TF is the frequency of a keyword within one inquiry text data cluster, and IDF reflects how frequently the keyword occurs across all inquiry text data clusters.
In this alternative embodiment, the TF value of the $j$-th keyword of the $i$-th task label in the $p$-th inquiry text data cluster satisfies the relation:

$$TF_{ij,p} = \frac{n_{ij,p}}{N_p}$$

where $n_{ij,p}$ is the number of times the $j$-th keyword of the $i$-th task label appears in the $p$-th inquiry text data cluster, and $N_p$ is the number of all words appearing in the $p$-th inquiry text data cluster.

In this alternative embodiment, the IDF value of the $j$-th keyword of the $i$-th task label satisfies the relation:

$$IDF_{ij} = \log\frac{m}{m_j}$$

where $m$ is the total number of inquiry text data clusters and $m_j$ (the denominator) is the number of inquiry text data clusters containing the $j$-th keyword.

From the above two relations, the TF-IDF value of the $j$-th keyword of the $i$-th task label in the $p$-th inquiry text data cluster is:

$$TFIDF_{ij,p} = TF_{ij,p} \times IDF_{ij}$$

In this scheme, the maximum of the TF-IDF values of the keywords of the $i$-th task label in the $p$-th inquiry text data cluster is taken as the second text similarity between the $i$-th task label and the $p$-th inquiry text data cluster, i.e. $TFIDF_{i,p} = \max_j(TFIDF_{ij,p})$.
In an alternative embodiment, the matching module 115 is configured to match the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity.
In an optional embodiment, the matching the task labels corresponding to the query text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity and the second text similarity to obtain inquiry text similarity;
sequencing the similarity of the inquiry texts according to the sequence from big to small to obtain a sequencing result;
and determining task labels corresponding to the inquiry text data clusters based on the sorting results.
In this alternative embodiment, the first text similarity alone can effectively distinguish task labels that are clearly different, but has difficulty distinguishing task labels that are relatively similar.
Illustratively, consider inquiry text data cluster 1 and inquiry text data cluster 2, each containing two pieces of inquiry text data. The two pieces in cluster 1 are "this illness is very common and not serious" and "this illness is quite common and generally not serious"; through the first text similarity on keywords such as "common", "not" and "serious", it is easy to judge that the corresponding task label is "patient comfort". The two pieces in cluster 2 are "pay attention to your diet to avoid repeated attacks and serious consequences" and "pay attention to your diet in daily life to avoid recurrence and serious symptoms"; with the first text similarity on keywords such as "attention" and "avoid", it is difficult to judge whether the corresponding task label should be "life advice" or "cause".
In this alternative embodiment, the inquiry text similarity may be obtained by multiplying the first text similarity and the second text similarity, which solves the problem that relatively similar task labels are difficult to distinguish using the first text similarity alone. The inquiry text similarity $score_{i,p}$ satisfies the relation:

$$score_{i,p} = \mathrm{abs}(Sim_{i,p} \times TFIDF_{i,p})$$

where $score_{i,p}$ is the inquiry text similarity between the $i$-th task label and the $p$-th inquiry text data cluster, and abs is the absolute value function.
In this alternative embodiment, the inquiry text similarity between the inquiry text data cluster and each task label may be calculated and the similarities sorted in descending order; the task label corresponding to the top-ranked inquiry text similarity is then selected as the matched task label, and finally that task label is assigned to the data in the inquiry text data cluster.
With the above technical scheme, the collected consultation data are clustered and a keyword list is created for each task label; different text similarity calculations between the keyword lists and the clustered consultation data allow each category of consultation data to be automatically matched with a more accurate and appropriate task label, which effectively improves the efficiency of obtaining text data labels.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1 comprises a memory 12 and a processor 13. The memory 12 is configured to store computer readable instructions and the processor 13 is configured to execute the computer readable instructions stored in the memory to implement the artificial intelligence based text data tag generating method according to any of the above embodiments.
In an alternative embodiment, the electronic device 1 further comprises a bus, a computer program stored in said memory 12 and executable on said processor 13, such as an artificial intelligence based text data tag generation program.
Fig. 3 shows only an electronic device 1 with a memory 12 and a processor 13. It will be understood by a person skilled in the art that the structure shown in Fig. 3 does not limit the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
In connection with fig. 1, the memory 12 in the electronic device 1 stores a plurality of computer readable instructions to implement an artificial intelligence based text data tag generating method, the processor 13 being executable to implement:
acquiring online inquiry data of a doctor to obtain an inquiry text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate keyword lists, wherein the task labels correspond to the keyword lists one to one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
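As a purely illustrative aid, the following Python sketch walks through these six steps end to end; the choice of scikit-learn components (TfidfVectorizer, KMeans, cosine_similarity), the sample texts, the keyword lists and the placeholder second similarity are all assumptions for demonstration and do not reflect the exact algorithms of the embodiments.

# Illustrative end-to-end sketch of the six steps above (assumed components).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: acquired online inquiry texts (hypothetical examples).
inquiry_texts = [
    "this disease is very common and is not very serious",
    "this disease is quite common and is generally not serious",
    "pay attention to diet at ordinary times to avoid repeated attacks",
    "pay attention to diet in daily life to avoid recurrence",
]

# Step 3: task labels and their keyword lists (assumed).
keyword_lists = {
    "disease comfort": "common not serious",
    "life suggestion": "attention diet avoid recurrence",
}

# Step 2: vectorize the inquiry texts and cluster them.
vectorizer = TfidfVectorizer()
text_matrix = vectorizer.fit_transform(inquiry_texts)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(text_matrix)

# Steps 4-6: for each cluster, compute a first (cosine) similarity against every
# keyword list, use a placeholder of 1.0 for the second similarity, and match
# the task label with the highest combined score.
keyword_matrix = vectorizer.transform(list(keyword_lists.values()))
for cluster in sorted(set(cluster_ids)):
    rows = np.where(cluster_ids == cluster)[0]
    first_sim = cosine_similarity(text_matrix[rows], keyword_matrix).mean(axis=0)
    second_sim = np.ones_like(first_sim)          # stand-in for the TF-IDF-based score
    scores = np.abs(first_sim * second_sim)
    label = list(keyword_lists)[int(np.argmax(scores))]
    print(f"inquiry text data cluster {cluster} -> task label: {label}")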
Specifically, for the implementation of the above instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated herein.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of it. The electronic device 1 may have a bus-type or star-type structure, may comprise more or fewer hardware or software components than illustrated, or may have a different arrangement of components; for example, the electronic device 1 may further comprise an input/output device, a network access device, and the like.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that can be adapted to the present application are also included in the scope of protection of the present application and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. The memory 12 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. The memory 12 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of text data tag generation programs based on artificial intelligence, but also for temporarily storing data that has been output or is to be output.
The processor 13 may in some embodiments be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 13 is the control unit of the electronic device 1; it connects the components of the entire electronic device 1 using various interfaces and lines, runs the programs or modules stored in the memory 12 (for example, the artificial intelligence based text data tag generation program), and invokes the data stored in the memory 12 to perform the various functions of the electronic device 1 and to process data.
The processor 13 executes the operating system of the electronic device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps of the various embodiments of the artificial intelligence based text data tag generation method described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be divided into one or more units/modules, which are stored in the memory 12 and executed by the processor 13 to implement the present application. The one or more units/modules may be a series of computer readable instruction segments capable of performing specified functions, and these instruction segments describe the execution of the computer program in the electronic device 1. For example, the computer program may be partitioned into an acquisition module 110, a clustering module 111, a generation module 112, a first calculation module 113, a second calculation module 114 and a matching module 115.
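Purely for illustration, such a module partition might be organized as in the following Python skeleton; the class and method names are hypothetical and do not appear in the original disclosure.

# Hypothetical skeleton mirroring the module partition 110-115 described above.
class TextDataTagGenerator:
    def acquire_inquiry_texts(self):                       # acquisition module 110
        """Collect online inquiry data of doctors into an inquiry text data set."""

    def cluster_inquiry_texts(self, texts):                # clustering module 111
        """Group the inquiry texts into inquiry text data clusters."""

    def generate_keyword_lists(self):                      # generation module 112
        """Design task labels and one keyword list per label."""

    def first_text_similarity(self, cluster, keywords):    # first calculation module 113
        """Cosine-similarity-based score between a cluster and a keyword list."""

    def second_text_similarity(self, cluster, keywords):   # second calculation module 114
        """Text-similarity (e.g. TF-IDF based) score between a cluster and a keyword list."""

    def match_task_labels(self, first_sim, second_sim):    # matching module 115
        """Assign each inquiry text data cluster its best-matching task label."""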
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer or a network device, etc.) or a processor to perform parts of the artificial intelligence based text data tag generation methods described in the various embodiments of the present application.
If the integrated units/modules of the electronic device 1 are implemented in the form of software functional modules and sold or used as a standalone product, they may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, may implement the steps of each method embodiment described above.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory, other memories, and the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain referred to in the present application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. The blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in Fig. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable connection and communication between the memory 12, the at least one processor 13 and other components.
The embodiment of the application further provides a computer readable storage medium (not shown), in which computer readable instructions are stored, and the computer readable instructions are executed by a processor in an electronic device to implement the method for generating a text data tag based on artificial intelligence according to any one of the embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit. The integrated units can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Furthermore, it is evident that the word "comprising" does not exclude other modules or steps, and that the singular does not exclude the plural. Several modules or means recited in the specification may also be implemented by one module or means through software or hardware. The terms first, second and the like are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the present application and not to limit it. Although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present application may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present application.

Claims (10)

1. A method for generating text data labels based on artificial intelligence, the method comprising:
acquiring online inquiry data of a doctor to obtain an inquiry text data set;
clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
designing task labels according to a preset mode to generate keyword lists, wherein the task labels correspond to the keyword lists one to one;
calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm to obtain a first text similarity;
calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm to obtain a second text similarity;
and matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
2. The artificial intelligence based text data tag generation method of claim 1, wherein the acquiring online inquiry data of a doctor to obtain an inquiry text data set comprises:
acquiring each text message sent by a doctor during an online inquiry as a piece of inquiry text data to obtain an initial inquiry text data set;
screening the initial inquiry text data set according to a preset mode;
and storing the data in the screened initial inquiry text data set into an inquiry text database to obtain the inquiry text data set.
3. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the clustering the query text data in the query text data set to obtain a plurality of query text data clusters includes:
respectively converting each inquiry text data in the inquiry text data set into inquiry text vectors;
optimizing a preset text clustering algorithm to obtain a text clustering optimization algorithm;
and clustering the inquiry text vectors based on the text clustering optimization algorithm to obtain inquiry text data clusters of a plurality of categories.
4. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the designing task labels according to a preset manner to generate the keyword list comprises:
designing a plurality of task labels according to preset task categories;
configuring a plurality of corresponding keywords for each task label according to a preset mode;
and taking a plurality of keywords corresponding to each task label as a keyword list.
5. The artificial intelligence based text data tag generation method of claim 1, wherein the calculating the similarity between the inquiry text data cluster and the keyword list according to the cosine similarity algorithm to obtain a first text similarity comprises:
respectively calculating text word vectors of all the inquiry texts in the inquiry text data cluster and keyword vectors of all the keywords in the keyword list according to a word vector model;
calculating an average value of each element in the text word vectors as a text word value, and constructing an inquiry text vector for each inquiry text based on the text word values;
carrying out standardization processing on the inquiry text vector to obtain a standard text vector;
calculating the similarity between the standard text vector and the keyword vector according to a cosine similarity algorithm to obtain text vector similarity;
and calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity.
6. The artificial intelligence based text data tag generation method of claim 5, wherein the calculating the similarity between the inquiry text data cluster and the keyword list based on the text vector similarity to obtain a first text similarity comprises:
calculating an average value of the text vector similarities between a target inquiry text and all the keyword vectors in the keyword list as the target average similarity of the target inquiry text, wherein the target inquiry text is any inquiry text in the inquiry text data cluster;
traversing the inquiry text data cluster to obtain the average similarity of the targets corresponding to each inquiry text;
and calculating the average value of all the target average similarities as the first text similarity of the inquiry text data cluster.
7. The method for generating text data labels based on artificial intelligence according to claim 1, wherein the matching the task labels corresponding to the inquiry text data clusters based on the first text similarity and the second text similarity includes:
multiplying the first text similarity and the second text similarity to obtain inquiry text similarity;
sorting the inquiry text similarities in descending order to obtain a sorting result;
and determining the task label corresponding to the inquiry text data cluster based on the sorting result.
8. A text data label generation device based on artificial intelligence, characterized by comprising an acquisition module, a clustering module, a generation module, a first calculation module, a second calculation module and a matching module, wherein:
the acquisition module is used for acquiring online inquiry data of a doctor to obtain an inquiry text data set;
the clustering module is used for clustering the inquiry text data in the inquiry text data set to obtain a plurality of inquiry text data clusters;
the generating module is used for designing task labels according to a preset mode to generate a keyword list, wherein the labels correspond to the keyword list one by one;
the first calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a cosine similarity algorithm so as to obtain a first text similarity;
the second calculation module is used for calculating the similarity between the inquiry text data cluster and the keyword list according to a text similarity algorithm so as to obtain second text similarity;
the matching module is used for matching corresponding task labels for the inquiry text data clusters based on the first text similarity and the second text similarity.
9. An electronic device, the electronic device comprising:
a memory storing computer readable instructions; and
A processor executing computer readable instructions stored in the memory to implement the artificial intelligence based text data tag generation method of any one of claims 1 to 7.
10. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the artificial intelligence based text data tag generation method of any of claims 1 to 7.
CN202310295186.7A 2023-03-22 2023-03-22 Text data label generation method based on artificial intelligence and related equipment Pending CN116401543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310295186.7A CN116401543A (en) 2023-03-22 2023-03-22 Text data label generation method based on artificial intelligence and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310295186.7A CN116401543A (en) 2023-03-22 2023-03-22 Text data label generation method based on artificial intelligence and related equipment

Publications (1)

Publication Number Publication Date
CN116401543A true CN116401543A (en) 2023-07-07

Family

ID=87015192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310295186.7A Pending CN116401543A (en) 2023-03-22 2023-03-22 Text data label generation method based on artificial intelligence and related equipment

Country Status (1)

Country Link
CN (1) CN116401543A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination