CN115048927A - Method, device and equipment for identifying disease symptoms based on text classification - Google Patents


Publication number
CN115048927A
Authority
CN
China
Prior art keywords
text
clause
semantics
semantic
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210687158.5A
Other languages
Chinese (zh)
Inventor
彭立彪
郑银河
黄民烈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lingxin Intelligent Technology Co ltd
Original Assignee
Beijing Lingxin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lingxin Intelligent Technology Co ltd filed Critical Beijing Lingxin Intelligent Technology Co ltd
Priority to CN202210687158.5A priority Critical patent/CN115048927A/en
Publication of CN115048927A publication Critical patent/CN115048927A/en
Priority to CN202310707427.4A priority patent/CN116992830B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application relate to the field of artificial intelligence and disclose a method, a device and equipment for identifying disease symptoms based on text classification. The method comprises the following steps: acquiring a text set to be recognized; recognizing semantic information of each text to be recognized; acquiring the deviation degree between the semantic information of each text to be recognized and target semantics; and, if the deviation degree is smaller than a preset threshold, classifying the corresponding text into the class corresponding to the target disease symptom. The characteristics of recognition errors are extracted and used as training input to train the classification model, which improves the model's ability to recognize such errors. Because a similarity relation is established between the first clause and the target semantics, point-to-point deviation identification can be performed on all content in the text data. In addition, because the model is trained on the recognition errors and the trained model is then used to classify text data, the compatibility of the text data classification process is improved and the cost is reduced.

Description

Method, device and equipment for identifying disease symptoms based on text classification
Technical Field
The embodiments of the invention relate to the field of artificial intelligence, and in particular to a method, a device and equipment for identifying disease symptoms based on text classification.
Background
Text classification is a technology that uses artificial intelligence to classify collected texts semantically according to the keywords they contain, and it is widely applied in various fields. For example, in the context of identifying psychological and mental diseases, text classification can be used to determine whether a detected user has a medical condition or is at potential risk of developing one. For example, a descriptive text related to the emotion of the detected user, such as a diary text or a voice message converted into text, may be obtained first; keyword recognition is then performed on the obtained text, the classification of the text is determined from the keyword recognition result, and finally whether the user suffers from a disease, or the user's level of potential disease risk, is identified from the classification result.
In the text classification process, because the semantics expressed by the same text in different contexts may be different, a misrecognition phenomenon is easily generated in the process of keyword recognition, that is, the semantics recognized based on the keywords are not matched with the semantics actually expressed by the detected user, so that the deviation between the final classification result of the text data and the semantics actually expressed by the text is large, and further, the recognition of the disease symptoms of the detected user is wrong.
In the existing text data classification technology, the corresponding semantics or identification parameters of keywords are generally adjusted to reduce the identification deviation, so that the accuracy of the disease identification result is improved. However, in practical applications, because the content difference between text data is large, and there may be keyword patterns of the same character between different text data, the way of adjusting the keyword semantics and the identification parameters affects all the contents in the text data, and the point-to-point deviation of all the contents cannot be reduced. On the other hand, if deviation identification or deviation reduction of partial contents in the text data is to be realized, the corresponding platform needs to be constructed manually, and identification cost of the text data is increased.
Disclosure of Invention
The embodiment of the application provides a method, a device and equipment for identifying a disease state based on text classification, and aims to solve the problem that the targeted identification deviation of text data cannot be reduced in the existing method for identifying the disease state based on text classification.
In a first aspect, an embodiment of the present application provides a method for identifying a disease condition based on text classification, where the method includes:
acquiring a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
identifying semantic information of each text to be identified in the text set to be identified;
acquiring the deviation degree of the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms;
and if the deviation degree is smaller than a preset threshold, classifying the text to be recognized that corresponds to the semantic recognition result into the class corresponding to the target symptoms.
In some possible embodiments, the method for identifying a disease condition based on text classification further includes:
acquiring a training sample set, wherein the semantic of each text in the training sample set is opposite to the target semantic;
and training according to the training sample set to obtain a text recognition model, wherein the text recognition model is used for acquiring the deviation degree of the semantic information of each text to be classified and preset target semantics. Therefore, the recognition deviation is used as an independent training sample to train the recognition model, the recognition degree of the recognition model on the recognition deviation can be improved, and the deviation of a disease recognition result is further reduced.
In some possible embodiments, acquiring the training sample set includes:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of texts;
segmenting words to obtain at least one clause corresponding to each text in the training sample set;
acquiring a first clause set according to the similarity between the semantics of the at least one clause and the target semantics;
generating a second training sample set according to each clause in the first clause set, wherein the semantic meaning of each text in the second training sample set is opposite to the semantic meaning of the first clause corresponding to the corresponding text;
using the second training sample set as the training sample set. Therefore, the recognition error can be converted into a specific training sample in a clause semantic replacement mode, the learning effect of the recognition model is improved, and the recognition degree of the recognition model on the recognition error is further improved.
In some possible embodiments, obtaining the first clause set according to the similarity between the semantics of the at least one clause and the target semantics comprises:
obtaining a semantic result of each clause in the at least one clause;
comparing each clause semantic result with the target semantics to obtain a similarity set, wherein the similarity set comprises at least one similarity, and each clause semantic result corresponds to one similarity;
if at least one similarity in the similarity set is greater than or equal to a preset threshold, selecting the clause corresponding to the largest similarity as the first clause;
and if no similarity in the similarity set is greater than or equal to the preset threshold, adjusting the word segmentation processing rule and dividing the text into clauses again.
Therefore, the clause closest to the target semantic meaning can be accurately found, the accuracy of obtaining the identification error subsequently is improved, and the identification degree of the identification model to the identification error is further improved.
In some possible implementations, the method further includes:
if the deviation degree is greater than or equal to the preset threshold, classifying the text to be recognized that corresponds to the semantic recognition result into the non-target-symptom class.
In some possible embodiments, the adjusting the word segmentation processing rule includes: the number of divided characters is changed. Therefore, the clause closest to the target semantic can be accurately found, and the accuracy of acquiring the identification error subsequently is improved.
In some possible embodiments, before recognizing the semantic information of each text to be recognized in the text set to be recognized, the method further includes performing deviation pre-identification on the text set to be recognized, where the deviation pre-identification includes:
truncating each text to be recognized character by character, starting from a first position character and proceeding along a first direction, to obtain a first truncated clause;
recognizing the semantics of the first truncated clause and comparing their similarity with the target semantics;
if the similarity between the semantics of the first truncated clause and the target semantics is smaller than a preset threshold, keeping the truncated clause obtained at the previous character as the first truncated clause;
truncating the first truncated clause character by character, starting from a second position character and proceeding along a second direction, to obtain a second truncated clause;
recognizing the semantics of the second truncated clause and comparing their similarity with the target semantics;
if the similarity between the semantics of the second truncated clause and the target semantics is smaller than a preset threshold, keeping the truncated clause obtained at the previous character as the second truncated clause;
and determining the clauses formed by the characters of the text other than the second truncated clause as the recognition deviation. In this way, recognition errors in the text set to be recognized can be exposed before actual classification, which makes them easier for the recognition model to recognize and improves classification accuracy.
In one possible implementation, the first position character and the second position character correspond to the first character and the last character of the text, respectively.
In a possible embodiment, the first direction is opposite to the second direction.
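The deviation pre-identification steps above can be sketched as follows. This is only one reading of the procedure, under stated assumptions: the text is trimmed one character at a time from the start and then from the end, and each trim is kept only while the remaining clause stays at or above the threshold. The names `pre_identify_deviation` and `toy_similarity` are hypothetical, and the keyword-based similarity is a stand-in for a real semantic model.

```python
from typing import Callable, Tuple


def pre_identify_deviation(
    text: str,
    similarity: Callable[[str], float],
    threshold: float,
) -> Tuple[str, str, str]:
    """Return (leading deviation, retained clause, trailing deviation)."""
    start, end = 0, len(text)
    # First direction: drop leading characters one at a time. A trim is kept
    # only if the shortened clause is still similar enough to the target;
    # otherwise the previous truncated clause is kept.
    while start < end - 1 and similarity(text[start + 1:end]) >= threshold:
        start += 1
    # Second direction: the same procedure from the end of the text.
    while end - 1 > start and similarity(text[start:end - 1]) >= threshold:
        end -= 1
    # Characters outside the retained clause are treated as the deviation.
    return text[:start], text[start:end], text[end:]


def toy_similarity(clause: str) -> float:
    # Hypothetical stand-in: 1.0 if the clause still contains "angry".
    return 1.0 if "angry" in clause else 0.0


leading, core, trailing = pre_identify_deviation(
    "well, so angry today", toy_similarity, threshold=0.5
)
```

With the toy similarity, the retained clause is "angry" and the remaining characters on both sides form the recognition deviation.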
In some possible embodiments, a pre-trained language model is obtained, and the categories of the model include the large-vocabulary language model N-gram.
In a second aspect, an embodiment of the present application further provides a device for identifying a disease condition based on text classification, where the device includes:
a first acquisition module, configured to acquire a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
a recognition module, configured to recognize semantic information of each text to be recognized in the text set to be recognized;
a second acquisition module, configured to acquire the deviation degree between the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing a target disease symptom;
and the recognition module is further configured to classify the text to be recognized that corresponds to the semantic recognition result into the class corresponding to the target disease symptom if the deviation degree is smaller than a preset threshold.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor performing the method of the first aspect or any of the possible embodiments of the first aspect by executing the computer instructions.
In a fourth aspect, the present application further provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause the computer to execute the method in the first aspect or any possible implementation manner of the first aspect.
The embodiment of the application provides a disease identification method based on text classification, and the method comprises the steps of firstly obtaining a text set to be identified, wherein the text set to be identified comprises at least one text to be identified; identifying semantic information of each text to be identified in the text set to be identified; acquiring the deviation degree of the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms; and if the deviation degree is smaller than a preset threshold value, classifying and determining the texts to be classified corresponding to the corresponding semantic recognition results as the classes corresponding to the target symptoms. Therefore, the characteristics of the recognition errors are extracted and used as training input conditions to train the classification model, so that the recognition degree of the classification model on the recognition errors is improved, and the classification model is optimized. Furthermore, in the text classification process, the optimized classification model is used, and the model trained by the recognition deviation is used for text classification, so that the final disease identification result is more accurate. On one hand, due to the similarity relation established between the first clause and the target semantics, when the target semantics are changed by an application scene or the content of the text data or need to be redefined, the content of the first clause is also changed, and the point-to-point deviation identification of all the content in the text data is further realized. 
On the other hand, after a second clause opposite to the first clause is obtained, the second clause is used as a recognition error for model training, and text data is classified through the trained model, so that point-to-point deviation recognition of different target semantics can be achieved, even the point-to-point deviation is reduced, the compatibility of the text data classification process is improved, and the cost is reduced.
Drawings
Fig. 1 is a schematic flowchart of a method for identifying a disease condition based on text classification according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a disease recognition apparatus based on text classification according to an embodiment of the present application.
Fig. 3 is an exemplary structural diagram of a disease identification device based on text classification provided in an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the present application is for the purpose of describing alternative embodiments and is not intended to limit the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well. It should also be understood that although the terms first, second, etc. may be used in the following embodiments to describe a class of objects, the objects are not limited by these terms. These terms are used only to distinguish between particular objects of that class. The following embodiments may use the terms first, second, etc. to describe other classes of objects in the same way, which is not described again here.
Any electronic device related to the embodiments of the present application may be an electronic device such as a mobile phone, a tablet computer, a wearable device (e.g., a smart watch, a smart bracelet, etc.), a notebook computer, a desktop computer, and an in-vehicle device. The electronic device is pre-installed with a software deployment application. It is understood that the embodiment of the present application does not set any limit to the specific type of the electronic device.
It can be understood that the technical solutions of the present application are applicable not only to the recognition of psychological or mental diseases, but also to emotion detection for a detected user (for example, whether the user is angry or depressed) and to the detection of dangerous behavior triggered by the user's extreme emotion (for example, whether anger leads to behavior that adversely affects the public).
The following is a description of several exemplary embodiments, and the technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be explained.
In a first aspect of the present application, a method for identifying a disease condition based on text classification is provided, and referring to fig. 1, fig. 1 is a schematic flow chart of the method for identifying a disease condition based on text classification provided in an embodiment of the present application, and includes the following steps:
acquiring a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
identifying semantic information of each text to be identified in the text set to be identified;
acquiring the deviation degree of the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms;
and if the deviation degree is smaller than a preset threshold, classifying the text to be recognized that corresponds to the semantic recognition result into the class corresponding to the target symptoms.
Illustratively, to identify "mania" among psychological diseases, that is, to determine whether the person to be detected has "mania", a rule is defined: detect whether the text information provided by the person to be detected (namely the text set to be recognized) contains text capable of representing the semantics of "mania", and define "anger" as the target semantics. When a text to be recognized contains the word "anger" or has "anger"-related semantics, the higher its similarity to the target semantics, the smaller the deviation degree; if the deviation degree is smaller than the preset threshold, the person to be detected is judged to have "mania".
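The threshold rule above can be sketched in a few lines. This is a minimal sketch, not the applicant's implementation: `classify_texts` and `toy_deviation` are hypothetical names, and the keyword-based deviation score is an assumed stand-in for the trained text recognition model that would produce the deviation degree in practice.

```python
from typing import Callable, List


def classify_texts(
    texts: List[str],
    deviation: Callable[[str], float],
    threshold: float,
) -> List[bool]:
    """True means the text falls in the target-symptom class.

    A text is assigned to the target class when its deviation degree from
    the target semantics is smaller than the threshold; otherwise it is
    assigned to the non-target class.
    """
    return [deviation(text) < threshold for text in texts]


# Hypothetical stand-in for a trained model: low deviation when an
# anger-related word appears, high deviation otherwise.
ANGER_WORDS = ("angry", "anger", "furious")


def toy_deviation(text: str) -> float:
    lowered = text.lower()
    return 0.1 if any(word in lowered for word in ANGER_WORDS) else 0.9


labels = classify_texts(
    ["I am too angry today", "The weather is nice"],
    toy_deviation,
    threshold=0.5,
)
```

Here the first text is assigned to the target class and the second to the non-target class.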
In a possible embodiment, the method for identifying a disease condition based on text classification further includes:
acquiring a training sample set, wherein the semantic of each text in the training sample set is opposite to the target semantic;
and training according to the training sample set to obtain a text recognition model, wherein the text recognition model is used for acquiring the deviation degree of the semantic information of each text to be classified and preset target semantics.
Optionally, acquiring the training sample set includes:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of texts;
segmenting words to obtain at least one clause corresponding to each text in the training sample set;
acquiring a first clause set according to the similarity between the semantics of the at least one clause and the target semantics;
generating a second training sample set according to each clause in the first clause set, wherein the semantic meaning of each text in the second training sample set is opposite to the semantic meaning of the first clause corresponding to the corresponding text;
using the second training sample set as the training sample set.
Illustratively, take a text intended for recognizing "angry" semantics as an example.
First, each text sentence stored in advance in the training sample set of the recognition model is split to obtain at least one clause.
For example, the splitting can be performed in an N-gram manner.
Specifically, using the N-gram manner with the number of characters per clause set to 4 (the character count refers to the original Chinese text), the example sentence "everything is out of order, today is really too angry, I get slow" is divided into the clauses "everything is out of order", "today is really", "too angry" and "I get slow".
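The clause-splitting step can be illustrated with a short sketch. The text does not spell out the exact N-gram procedure, so the code below assumes one plausible reading: the text is first cut at punctuation and each segment is then divided into non-overlapping chunks of a fixed number of characters. The function name `split_clauses` and the punctuation set are assumptions made for illustration.

```python
import re
from typing import List


def split_clauses(text: str, size: int = 4) -> List[str]:
    """Split text into clauses of at most `size` characters.

    Assumption: punctuation marks act as hard boundaries, and each
    punctuation-free segment is cut into consecutive fixed-size chunks.
    """
    clauses: List[str] = []
    for segment in re.split(r"[，。！？；,.!?;]+", text):
        for start in range(0, len(segment), size):
            clauses.append(segment[start:start + size])
    return clauses


# Placeholder characters stand in for the characters of a real sentence.
clauses = split_clauses("abcdefgh,ijkl!mn", size=4)
```

Changing `size` corresponds to adjusting the segmentation rule by changing the number of divided characters.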
Then, the similarity between each clause of each text and the target semantics is determined. Within each text, the clauses whose similarity to the target semantics reaches the preset threshold are selected, the clause with the largest similarity among them is taken as the first clause of that text, and the first clauses of the multiple texts form the first clause set.
Optionally, obtaining the first clause set according to the similarity between the semantics of the at least one clause and the target semantics includes:
obtaining a semantic result of each clause in the at least one clause;
comparing the similarity of each clause semantic result with the target semantic to obtain a similarity set, wherein the similarity set comprises at least one similarity, and each clause semantic result corresponds to one similarity;
if at least one similarity in the similarity set is greater than or equal to a preset threshold, selecting a clause corresponding to the similarity with the largest value as a first clause,
and if the similarity which is greater than or equal to the preset threshold does not exist in the similarity set, adjusting the word segmentation processing rule and subdividing the clauses.
For example, comparing the semantics of the four clauses with the target semantics ("anger") may give the following similarities: "everything is out of order", 90%; "today is really", 40%; "too angry", 95%; and "I get slow", 70%. Assuming that the preset threshold is 80%, "too angry" is selected as the first clause of the example sentence.
Optionally, the first clause may also be screened as follows: if the rule is so set, every clause whose similarity is greater than or equal to the preset screening threshold is directly determined as a first clause.
Similarly, the same semantic similarity comparison is performed on the other example sentences to obtain the first clause of each example sentence, and the first clause set is constructed from these first clauses.
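The selection rule can be sketched with the four similarity values from the example above. This is a minimal sketch under assumptions: the function names are hypothetical, and the `split` and `score` callables stand in for the segmentation rule and the semantic similarity model, neither of which is specified in detail here. The second function shows the fallback of re-dividing the text with a different clause length when no clause reaches the threshold.

```python
from typing import Callable, Dict, List, Optional, Sequence


def select_first_clause(
    similarities: Dict[str, float], threshold: float
) -> Optional[str]:
    """Return the clause with the largest similarity if it reaches the threshold."""
    best = max(similarities, key=similarities.get, default=None)
    if best is None or similarities[best] < threshold:
        return None
    return best


def first_clause_with_resegmentation(
    text: str,
    split: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    threshold: float,
    sizes: Sequence[int] = (4, 3, 2),
) -> Optional[str]:
    """Try each clause length in turn until some clause reaches the threshold."""
    for size in sizes:
        similarities = {clause: score(clause) for clause in split(text, size)}
        chosen = select_first_clause(similarities, threshold)
        if chosen is not None:
            return chosen
    return None


example = {
    "everything is out of order": 0.90,
    "today is really": 0.40,
    "too angry": 0.95,
    "I get slow": 0.70,
}
first_clause = select_first_clause(example, threshold=0.80)
```

With a threshold of 80%, the clause "too angry" (95%) is chosen even though "everything is out of order" (90%) also reaches the threshold, because only the largest similarity is kept.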
Finally, a second training sample set is constructed according to the reverse semantics of the first clause, specifically: generating at least one second text according to the reverse semantics of the clause, and constructing the second training sample set from the second texts.
Illustratively, with "too angry" as the first clause, clauses with the opposite semantics may be "happy", "too happy", "very happy", etc.
Optionally, the second text may be generated from the reverse semantics of the clause by replacement, which specifically includes:
identifying a position of the first clause in the corresponding text;
and replacing the first clause with a clause with opposite semantics to obtain a second text.
Illustratively, take the example sentence "everything is out of order, today is really too angry, I get slow", with "anger" set as the target semantics; screening then yields "too angry" as the first clause.
First, the position of "too angry" in the original example sentence is identified.
Then, "too angry" in the original example sentence is replaced with words (or clauses) of opposite semantics, including "happy", "too happy" and "very happy" (the replacement words or clauses may differ from the first clause in number of characters).
Finally, the substituted sentences are obtained: "everything is out of order, today is really happy, I get slow", "everything is out of order, today is really too happy, I get slow" and "everything is out of order, today is really very happy, I get slow". These sentences are taken as second texts and stored in the second training sample set.
Optionally, the first clause may also be replaced with a word or clause that is semantically dissimilar but not opposite, which broadens the range of recognition deviations and further improves the ability of the recognition model to recognize them.
For example, the replacement words for the first clause are changed to "too fast", "too tired", and "too rich", so that the second texts include: "everything is out of order, today is really too fast, i get slow", "everything is out of order, today is really too tired, i get slow", and "everything is out of order, today is really too rich, i get slow". Although the semantics of these sentences are not opposite to the semantics of the first clause, they can also be used as second texts and included in the second training sample set for training the recognition model.
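The replacement procedure (locate the first clause, then substitute it) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `build_second_texts` is hypothetical, the replacement lists are supplied by hand because the source does not say how opposite or dissimilar clauses are generated, and only the first occurrence of the clause is replaced.

```python
from typing import List


def build_second_texts(text: str, first_clause: str, replacements: List[str]) -> List[str]:
    """Replace the first clause in `text` with each replacement clause.

    The position of the first clause is identified first; the replacement may
    have a different number of characters than the clause it replaces.
    """
    position = text.find(first_clause)
    if position < 0:
        raise ValueError("first clause does not occur in the text")
    end = position + len(first_clause)
    return [text[:position] + replacement + text[end:] for replacement in replacements]


sentence = "everything is out of order, today is really too angry, i get slow"

# Hand-written replacement lists (assumptions for illustration only).
opposite = ["too happy", "happy"]
dissimilar = ["too fast", "too tired", "too rich"]

second_training_samples = (
    build_second_texts(sentence, "too angry", opposite)
    + build_second_texts(sentence, "too angry", dissimilar)
)
```

Both opposite-semantics and merely dissimilar replacements end up in the same list, mirroring the way the text places both kinds of second text in the second training sample set.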
In a possible implementation manner, before the semantic information of each text to be recognized in the text set to be recognized is identified, deviation pre-recognition is further performed on the text set to be recognized, where the deviation pre-recognition method includes:
carrying out truncated clause processing on each text to be recognized, starting from a first position character and proceeding along a first direction, to obtain a first truncated clause;
identifying the semantics of the first truncated clause and comparing the similarity between the semantics of the first truncated clause and the target semantics;
if the similarity between the semantics of the first truncated clause and the target semantics is smaller than a preset threshold, retaining the truncated clause obtained before the most recent character was removed as the first truncated clause;
carrying out truncated clause processing on the first truncated clause, starting from a second position character and proceeding along a second direction, to obtain a second truncated clause;
identifying the semantics of the second truncated clause and comparing the similarity between the semantics of the second truncated clause and the target semantics;
if the similarity between the semantics of the second truncated clause and the target semantics is smaller than the preset threshold, retaining the truncated clause obtained before the most recent character was removed as the second truncated clause;
and determining the clauses formed by the characters of the text other than the second truncated clause as recognition deviation.
Optionally, the first position character and the second position character correspond to the first character and the last character of the text, respectively.
Optionally, the first direction is opposite to the second direction.
Illustratively, take the example sentence "everything is out of order, today is really too angry, i get slow", with the target semantics "anger" and a preset threshold of 90%.
Character selection can adopt an N-gram mode, and the recognition deviation is obtained by a greedy truncation method from long to short, specifically as follows:
single characters are deleted from left to right (truncated clause processing), and the similarity between the semantics of each resulting sentence (first truncated clause) and the target semantics is compared, for example:
"the thing is not smooth, but is really too angry today, i get slow" corresponds to a similarity of 99%;
"not smooth, really too angry today, i get slow" corresponds to a similarity of 99%;
"angry, i get slow" corresponds to a similarity of 96%;
"Qi is enough, i get slow" corresponds to a similarity of 70%.
Since 70% is below the preset threshold, the clause from the previous step is retained, and the first truncated clause is determined to be "angry, i get slow";
similarly, characters of the first truncated clause are deleted one by one from right to left and the comparison is repeated, so that "angry" is finally obtained as the second truncated clause. Thus "angry" is the clause whose semantics are closest to the target semantics "anger", and the clauses formed by the truncated characters ("everything is out of order", "today is really too", and "i get slow") can be regarded as recognition deviation.
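The two-direction greedy truncation can be sketched as follows. This is a simplified illustration, not the patented implementation: it truncates word tokens rather than Chinese characters, the names `truncate` and `pre_identify_deviation` are hypothetical, and `toy_similarity` is a stand-in keyword scorer, since the source does not define the semantic similarity model.

```python
from typing import Callable, List, Tuple

Similarity = Callable[[List[str]], float]


def truncate(tokens: List[str], similarity: Similarity, threshold: float, from_left: bool) -> List[str]:
    """Greedily drop one token at a time from one end while similarity stays >= threshold."""
    current = list(tokens)
    while len(current) > 1:
        candidate = current[1:] if from_left else current[:-1]
        if similarity(candidate) < threshold:
            break  # keep the clause from the previous step
        current = candidate
    return current


def pre_identify_deviation(
    tokens: List[str], similarity: Similarity, threshold: float
) -> Tuple[List[str], List[str]]:
    """Return (second truncated clause, tokens regarded as recognition deviation)."""
    first = truncate(tokens, similarity, threshold, from_left=True)
    second = truncate(first, similarity, threshold, from_left=False)
    start = len(tokens) - len(first)   # tokens removed from the left
    end = start + len(second)          # everything after `end` was removed from the right
    return second, tokens[:start] + tokens[end:]


def toy_similarity(tokens: List[str]) -> float:
    """Hypothetical stand-in scorer: high when the token "angry" is present."""
    return 0.95 if "angry" in tokens else 0.10


words = "everything is out of order today is really too angry i get slow".split()
core, deviation = pre_identify_deviation(words, toy_similarity, threshold=0.90)
```

The sketch assumes the full text already reaches the threshold; a text that does not would be returned untruncated, which a real pipeline would need to handle separately.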
The foregoing embodiments introduce the method for identifying disease symptoms based on text classification provided in the embodiments of the present application from the aspects of acquiring the text set to be recognized and the semantic information of the text to be recognized, acquiring the semantics of the text to be recognized, and recognizing the target disease category. It should be understood that, in the embodiments of the present application, the processing steps of acquiring the text set to be recognized and the semantic information of the text to be recognized, acquiring the semantics of the text to be recognized, and recognizing the target disease category may be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
For example, the above implementation steps may implement the corresponding functions through software modules. As shown in fig. 3, the apparatus for identifying disease symptoms based on text classification may include a first acquisition module, a first recognition module, a second acquisition module, and a second recognition module. The apparatus can be used for executing part or all of the operations of the above method for identifying disease symptoms based on text classification.
For example:
the first acquisition module is used for acquiring a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
the first recognition module is used for recognizing semantic information of each text to be recognized in the text set to be recognized;
the second acquisition module is used for acquiring the deviation degree between the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms;
and the second recognition module is used for classifying the text to be classified corresponding to the semantic recognition result into the category corresponding to the target symptoms if the deviation degree is smaller than a preset threshold value.
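The final classification step performed by the modules above can be sketched as a simple threshold test on the deviation degree. This is a hedged illustration only: `classify_by_deviation` and the label strings are hypothetical names, and `toy_deviation` is a placeholder for the trained text recognition model, whose form the source does not specify.

```python
from typing import Callable, Dict, Iterable


def classify_by_deviation(
    texts: Iterable[str],
    deviation_degree: Callable[[str], float],
    threshold: float,
) -> Dict[str, str]:
    """Label each text by comparing its deviation degree with the threshold.

    A deviation degree below the threshold places the text in the target
    category; otherwise the text is treated as non-target.
    """
    labels = {}
    for text in texts:
        is_target = deviation_degree(text) < threshold
        labels[text] = "target" if is_target else "non-target"
    return labels


def toy_deviation(text: str) -> float:
    """Placeholder scorer (assumption): low deviation when "angry" appears."""
    return 0.05 if "angry" in text else 0.90


texts_to_recognize = [
    "everything is out of order, today is really too angry, i get slow",
    "everything is out of order, today is really too happy, i get slow",
]
labels = classify_by_deviation(texts_to_recognize, toy_deviation, threshold=0.20)
```

Passing the scorer in as a parameter keeps the threshold logic separate from whichever model produces the deviation degree.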
Therefore, the embodiment of the present application provides a method for identifying disease symptoms based on text classification. In this scheme, a text set to be recognized is first acquired, wherein the text set to be recognized comprises at least one text to be recognized; semantic information of each text to be recognized in the text set is identified; the similarity between the semantic information of each text to be recognized and preset target semantics is acquired, wherein the target semantics are predefined semantics representing target symptoms; and if the similarity is greater than or equal to a preset threshold value, the text to be classified corresponding to the semantic recognition result is classified into the category corresponding to the target symptoms. The features of recognition deviations are extracted and used as training input to train the classification model, which improves the ability of the classification model to recognize such deviations. Furthermore, the classification model trained with recognition deviations is used in the text classification process, so that the final symptom identification result is more accurate. On the one hand, because a similarity relation is established between the first clause and the target semantics, when the target semantics change with the application scenario or the content of the text data, or need to be redefined, the content of the first clause changes accordingly, thereby achieving point-to-point deviation recognition for all content in the text data.
On the other hand, after the second clause with semantics opposite to the first clause is obtained, the second clause is used as a recognition deviation for model training, and text data is classified by the trained model. Point-to-point deviation recognition can thus be achieved for different target semantics, and the deviation can even be reduced, which improves the compatibility of the text data classification process and reduces cost.
It is understood that the functions of the above modules may be implemented by being integrated into a hardware entity. For example, the first acquisition module and the second acquisition module may be integrated into a transceiver, the first recognition module and the second recognition module may be integrated into a processor, and the programs and instructions for implementing the functions of the above modules may be maintained in a memory. As shown in fig. 3, an electronic device is provided, which includes a processor, a transceiver, and a memory, wherein the transceiver is configured to execute learning result acquisition corresponding to the target reference information and each of the encoding information in the disease species identification method based on multiple information, and the memory is configured to store the preinstalled program/code and may also store code for execution by the processor. When the processor executes the code stored in the memory, the electronic device is caused to execute part or all of the operations of the method described above.
The specific process is described in the above embodiments of the method, and is not described in detail here.
In a specific implementation, corresponding to the foregoing electronic device, an embodiment of the present application further provides a computer storage medium. The computer storage medium, disposed in the electronic device, may store a program, and when the program is executed, part or all of the steps in each embodiment of the method described above may be implemented. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
One or more of the above modules or units may be implemented in software, hardware, or a combination of both. When any of the above modules or units is implemented in software, the software is present as computer program instructions stored in a memory, and a processor may be used to execute the program instructions and implement the above method flows. The processor may include, but is not limited to, at least one of the following computing devices that run software: a Central Processing Unit (CPU), a microprocessor, a Digital Signal Processor (DSP), a Microcontroller (MCU), or an artificial intelligence processor, each of which may include one or more cores for executing software instructions to perform operations or processing. The processor may be built into an SoC (system on chip) or an Application Specific Integrated Circuit (ASIC), or may be a separate semiconductor chip. In addition to the cores for executing software instructions, the processor may further include a necessary hardware accelerator, such as a Field Programmable Gate Array (FPGA), a PLD (programmable logic device), or a logic circuit implementing a dedicated logic operation.
When the above modules or units are implemented in hardware, the hardware may be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator, or a non-integrated discrete device, which may run the necessary software or operate independently of software to perform the above method flows.
Further, a bus interface may also be included in FIG. 3. The bus interface may include any number of interconnected buses and bridges that link together various circuits, including one or more processors, represented by the processor, and memory, represented by the memory. The bus interface may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver provides a means for communicating with various other apparatus over a transmission medium. The processor is responsible for managing the bus architecture and general processing, and the memory may store data used by the processor in performing operations.
When the above modules or units are implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be understood that, in the various embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic thereof, and should not constitute any limitation to the implementation process of the embodiments.
All parts of the specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the relevant parts of the description of the method embodiments.
While alternative embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
The above-mentioned embodiments further describe the objects, technical solutions, and advantages of the present application in detail. It should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present application should be included in the scope of the present application.

Claims (10)

1. A method for identifying a medical condition based on text classification, the method comprising:
acquiring a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
identifying semantic information of each text to be identified in the text set to be identified;
acquiring the deviation degree between the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms;
and if the deviation degree is smaller than a preset threshold value, classifying the text to be classified corresponding to the semantic recognition result into the category corresponding to the target symptoms.
2. The method of text classification based disorder recognition according to claim 1, further comprising:
acquiring a training sample set, wherein the semantic of each text in the training sample set is opposite to the target semantic;
and training according to the training sample set to obtain a text recognition model, wherein the text recognition model is used for acquiring the deviation degree of the semantic information of each text to be classified and preset target semantics.
3. The method of claim 2, wherein acquiring the training sample set comprises:
acquiring a first training sample set, wherein the first training sample set comprises a plurality of texts;
segmenting words to obtain at least one clause corresponding to each text in the training sample set;
acquiring a first clause set according to the similarity between the semantics of the at least one clause and the target semantics;
generating a second training sample set according to each clause in the first clause set, wherein the semantic meaning of each text in the second training sample set is opposite to the semantic meaning of the first clause corresponding to the corresponding text;
using the second training sample set as the training sample set.
4. The method of claim 3, wherein obtaining a first set of clauses based on similarity of the semantics of the at least one clause and the target semantics comprises:
obtaining a semantic result of each clause in the at least one clause;
comparing the similarity of each clause semantic result with the target semantic to obtain a similarity set, wherein the similarity set comprises at least one similarity, and each clause semantic result corresponds to one similarity;
if at least one similarity in the similarity set is greater than or equal to a preset threshold, selecting a clause corresponding to the similarity with the largest value as a first clause,
and if the similarity which is greater than or equal to the preset threshold does not exist in the similarity set, adjusting the word segmentation processing rule and subdividing the clauses.
5. The method of claim 1, wherein the training recognition bias implementation comprises:
and if the deviation degree is greater than or equal to a preset threshold value, determining the text to be classified corresponding to the semantic recognition result as the category of the non-target disease.
6. The method of claim 4, wherein the adjusting the segmentation processing rules comprises: the number of divided characters is changed.
7. The method according to claim 1 or 2, wherein a pre-trained language model is obtained, and the model category comprises: a large-vocabulary N-gram language model.
8. An apparatus for identifying a medical condition based on text classification, the apparatus comprising:
a first acquisition module, used for acquiring a text set to be recognized, wherein the text set to be recognized comprises at least one text to be recognized;
a first recognition module, used for recognizing semantic information of each text to be recognized in the text set to be recognized;
a second acquisition module, used for acquiring the deviation degree between the semantic information of each text to be recognized and preset target semantics, wherein the target semantics are predefined semantics representing target symptoms;
and a second recognition module, used for classifying the text to be classified corresponding to the semantic recognition result into the category corresponding to the target symptoms if the deviation degree is smaller than a preset threshold value.
9. An electronic device, characterized in that the electronic device comprises: a memory and a processor communicatively coupled to each other, the memory having stored therein computer instructions, the processor performing the method of any of claims 1-7 by executing the computer instructions.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202210687158.5A 2022-06-17 2022-06-17 Method, device and equipment for identifying disease symptoms based on text classification Withdrawn CN115048927A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210687158.5A CN115048927A (en) 2022-06-17 2022-06-17 Method, device and equipment for identifying disease symptoms based on text classification
CN202310707427.4A CN116992830B (en) 2022-06-17 2023-06-14 Text data processing method, related device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210687158.5A CN115048927A (en) 2022-06-17 2022-06-17 Method, device and equipment for identifying disease symptoms based on text classification

Publications (1)

Publication Number Publication Date
CN115048927A true CN115048927A (en) 2022-09-13

Family

ID=83161209

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210687158.5A Withdrawn CN115048927A (en) 2022-06-17 2022-06-17 Method, device and equipment for identifying disease symptoms based on text classification
CN202310707427.4A Active CN116992830B (en) 2022-06-17 2023-06-14 Text data processing method, related device and computing equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310707427.4A Active CN116992830B (en) 2022-06-17 2023-06-14 Text data processing method, related device and computing equipment

Country Status (1)

Country Link
CN (2) CN115048927A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992830A (en) * 2022-06-17 2023-11-03 北京聆心智能科技有限公司 Text data processing method, related device and computing equipment

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160012038A1 (en) * 2014-07-10 2016-01-14 International Business Machines Corporation Semantic typing with n-gram analysis
CN108549656B (en) * 2018-03-09 2022-06-28 北京百度网讯科技有限公司 Statement analysis method and device, computer equipment and readable medium
CN111611374A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Corpus expansion method and device, electronic equipment and storage medium
US10796104B1 (en) * 2019-07-03 2020-10-06 Clinc, Inc. Systems and methods for constructing an artificially diverse corpus of training data samples for training a contextually-biased model for a machine learning-based dialogue system
CN110888968A (en) * 2019-10-15 2020-03-17 浙江省北大信息技术高等研究院 Customer service dialogue intention classification method and device, electronic equipment and medium
CN111709247B (en) * 2020-05-20 2023-04-07 北京百度网讯科技有限公司 Data set processing method and device, electronic equipment and storage medium
CN111931492A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Data expansion mixing strategy generation method and device and computer equipment
US20220277197A1 (en) * 2021-03-01 2022-09-01 Nec Laboratories America, Inc. Enhanced word embedding
CN112906392B (en) * 2021-03-23 2022-04-01 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN113822047A (en) * 2021-07-02 2021-12-21 腾讯科技(深圳)有限公司 Text enhancement method and device, electronic equipment and storage medium
CN115221872B (en) * 2021-07-30 2023-06-02 苏州七星天专利运营管理有限责任公司 Vocabulary expansion method and system based on near-sense expansion
CN113807098B (en) * 2021-08-26 2023-01-10 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
US20230153528A1 (en) * 2021-11-12 2023-05-18 Oracle International Corporation Data augmentation and batch balancing methods to enhance negation and fairness
CN114330359A (en) * 2021-11-30 2022-04-12 青岛海尔科技有限公司 Semantic recognition method and device and electronic equipment
CN114298030A (en) * 2021-12-14 2022-04-08 达闼机器人有限公司 Statement extraction method and device, electronic equipment and computer-readable storage medium
CN114595327A (en) * 2022-02-22 2022-06-07 平安科技(深圳)有限公司 Data enhancement method and device, electronic equipment and storage medium
CN115879458A (en) * 2022-04-08 2023-03-31 北京中关村科金技术有限公司 Corpus expansion method, apparatus and storage medium
CN115033753A (en) * 2022-06-17 2022-09-09 北京金山数字娱乐科技有限公司 Training corpus construction method, text processing method and device
CN115048927A (en) * 2022-06-17 2022-09-13 北京聆心智能科技有限公司 Method, device and equipment for identifying disease symptoms based on text classification
CN115408495A (en) * 2022-08-25 2022-11-29 厦门市美亚柏科信息股份有限公司 Social text enhancement method and system based on multi-modal retrieval and keyword extraction
CN115422326A (en) * 2022-08-31 2022-12-02 北京沃东天骏信息技术有限公司 Text sample expansion method and device, electronic equipment and computer readable medium


Also Published As

Publication number Publication date
CN116992830A (en) 2023-11-03
CN116992830B (en) 2024-03-26

CN116956914A (en) Named entity identification method, named entity identification device, named entity identification equipment, named entity identification medium and named entity identification program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220913