CN117611581A - Tongue picture identification method and device based on multi-mode information and electronic equipment - Google Patents

Tongue picture identification method and device based on multi-mode information and electronic equipment

Info

Publication number
CN117611581A
Authority
CN
China
Prior art keywords
tongue
image
features
feature
historical
Prior art date
Legal status
Granted
Application number
CN202410076457.4A
Other languages
Chinese (zh)
Other versions
CN117611581B (en)
Inventor
张建峰
李劲松
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410076457.4A priority Critical patent/CN117611581B/en
Priority claimed from CN202410076457.4A external-priority patent/CN117611581B/en
Publication of CN117611581A publication Critical patent/CN117611581A/en
Application granted granted Critical
Publication of CN117611581B publication Critical patent/CN117611581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The specification discloses a tongue picture identification method and device based on multi-mode information and electronic equipment. The method comprises the following steps: acquiring tongue images and symptom information of a user, wherein the symptom information comprises symptom complaints, symptoms and signs; inputting the tongue image and the symptom information into a tongue image recognition model, and determining image features corresponding to the tongue image and text features corresponding to the symptom information; according to a preset tongue diagnosis knowledge graph, determining a corresponding relation between a symptom entity and a tongue image entity, and acquiring graph characteristics determined based on the corresponding relation; and fusing the image features, the text features and the map features to obtain target tongue image features, so as to identify the tongue images according to the target tongue image features. According to the scheme, the multi-mode information is integrated, so that the accuracy and the reliability of tongue picture identification results are effectively improved.

Description

Tongue picture identification method and device based on multi-mode information and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a tongue identification method and apparatus based on multi-mode information, and an electronic device.
Background
Tongue diagnosis is a core part of "inspection", one of the four diagnostic methods of traditional Chinese medicine (TCM), in which a patient's condition is assessed by observing the state of the tongue. Tongue inspection covers both the tongue coating and the tongue quality: the tongue quality refers to the body of the tongue itself, and the tongue coating refers to the layer attached to its surface. When observing a patient's tongue, TCM practitioners mainly examine the color and vitality of the tongue body and the distribution and color of the tongue coating to infer the patient's health condition and disease progression.
Traditional tongue diagnosis relies on direct observation of the tongue by physicians and therefore depends heavily on their subjective judgment; it is constrained by the physician's knowledge, reasoning habits and diagnostic skill, shows poor repeatability between different physicians, and this limits the clinical application of tongue diagnosis to some extent. With the development of computer imaging, digital image recognition technology has been widely applied to intelligent tongue diagnosis, facilitating objective tongue diagnosis.
However, existing tongue image recognition methods do not follow the holistic diagnostic reasoning of TCM tongue diagnosis: their accuracy is low, they cannot comprehensively evaluate the overall state of the tongue, and they are therefore difficult to use as a reliable reference for clinical tongue diagnosis.
How to accurately identify tongue images, and thereby ensure the reliability of clinical diagnosis, is therefore an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a tongue recognition method and apparatus based on multi-mode information, and an electronic device, so as to partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a tongue picture identification method based on multi-mode information, which comprises the following steps:
receiving a tongue picture identification request;
according to the tongue picture identification request, acquiring tongue picture and symptom information of a user, wherein the symptom information comprises symptom complaints, symptom information and sign information of the user;
inputting the tongue image and the symptom information into a pre-trained tongue image recognition model to extract symptom entities contained in the symptom information through the tongue image recognition model, and determining image features corresponding to the tongue image and text features corresponding to the symptom information;
according to a preset tongue diagnosis knowledge graph, determining a corresponding relation between the symptom entity and the tongue picture entity, and acquiring graph characteristics determined based on the corresponding relation;
and fusing the image features, the text features and the map features to obtain target tongue image features, so as to identify tongue images according to the target tongue image features.
Optionally, according to the tongue picture identification request, acquiring a tongue picture of the user specifically includes:
acquiring an original image;
and extracting an image of the tongue region from the original image to obtain the tongue image.
Optionally, fusing the image feature, the text feature and the map feature to obtain a target tongue feature, which specifically includes:
fusing the text features and the image features to obtain first fusion features, and fusing the image features and the map features to obtain second fusion features;
and determining the target tongue image characteristic according to the first fusion characteristic and the second fusion characteristic.
Optionally, fusing the text feature and the image feature to obtain a first fused feature, which specifically includes:
determining a first attention weight through a first feature fusion network in the tongue picture recognition model, wherein the first attention weight is used for representing the association degree between the text feature and the image feature;
and fusing the text feature and the image feature based on the first attention weight to obtain the first fused feature.
Optionally, based on the first attention weight, the text feature and the image feature are fused to obtain the first fusion feature, which specifically includes:
determining a query vector corresponding to the image feature and a key vector and a value vector corresponding to the text feature;
and determining the first fusion feature according to the first attention weight, the query vector corresponding to the image feature, and the key vector and the value vector corresponding to the text feature.
Optionally, fusing the image feature and the map feature to obtain a second fused feature, which specifically includes:
determining a second attention weight through a second feature fusion network in the tongue picture identification model, wherein the second attention weight is used for representing the association degree between the map features and the image features;
and fusing the map features with the image features based on the second attention weight to obtain the second fused features.
Optionally, the profile features include: entity embedding features and relationship embedding features for characterizing relationships between entities;
based on the second attention weight, fusing the map feature and the image feature to obtain the second fused feature, which specifically comprises:
Determining a query vector corresponding to the image feature and a key vector and a value vector corresponding to the entity embedded feature;
and determining the second fusion feature according to the second attention weight, the query vector corresponding to the image feature, the relation embedding feature and the key vector and the value vector corresponding to the entity embedding feature.
Optionally, training the tongue image recognition model specifically includes:
acquiring a training sample, wherein the training sample comprises historical tongue images and historical disorder information of a designated user;
inputting the historical tongue image and the historical disorder information into a tongue image recognition model to be trained, extracting a historical disorder entity contained in the historical disorder information through the tongue image recognition model, and determining historical image features corresponding to the historical tongue image and historical text features corresponding to the historical disorder information;
according to the tongue diagnosis knowledge graph, determining the corresponding relation between the historical disorder entity and the tongue picture entity, and acquiring the characteristic of the historical graph determined based on the corresponding relation between the historical disorder entity and the tongue picture entity;
fusing the historical image features, the historical text features and the historical map features to obtain historical tongue image features, and carrying out tongue image recognition according to the historical tongue image features to obtain tongue image recognition results;
And training the tongue picture recognition model by taking minimizing the deviation between the tongue picture recognition result and the actual tongue picture type of the appointed user as an optimization target.
Optionally, the method further comprises:
diagnosing the condition of the user according to a tongue picture identification result, wherein the tongue picture identification result comprises: a tongue quality type of the tongue body and a tongue coating type.
The specification provides a tongue picture recognition device based on multi-mode information, which comprises:
the receiving module receives a tongue picture identification request;
the acquisition module is used for acquiring tongue images and symptom information of a user according to the tongue image identification request, wherein the symptom information comprises symptom complaints, symptom information and sign information of the user;
the input module is used for inputting the tongue image and the symptom information into a pre-trained tongue image recognition model so as to extract symptom entities contained in the symptom information through the tongue image recognition model and determine image features corresponding to the tongue image and text features corresponding to the symptom information;
the determining module is used for determining the corresponding relation between the symptom entity and the tongue picture entity according to a preset tongue diagnosis knowledge graph and acquiring graph characteristics determined based on the corresponding relation;
And the identification module is used for fusing the image features, the text features and the map features to obtain target tongue image features so as to identify tongue images according to the target tongue image features.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above tongue recognition method based on multimodal information.
The present specification provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the tongue recognition method based on multimodal information described above when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
in the tongue picture identification method based on the multi-mode information provided by the specification, tongue picture and disorder information of a user are obtained, and the disorder information comprises disorder complaints, symptoms and signs; inputting the tongue image and the symptom information into a tongue image recognition model, and determining image features corresponding to the tongue image and text features corresponding to the symptom information; according to a preset tongue diagnosis knowledge graph, determining a corresponding relation between a symptom entity and a tongue picture type, and acquiring graph characteristics determined based on the corresponding relation; and fusing the image features, the text features and the map features to obtain target tongue image features, and further carrying out tongue image recognition according to the target tongue image features.
In this method, the tongue picture identification process considers not only the image information of the tongue body but also integrates the patient's chief complaint and symptom/sign knowledge together with the empirical knowledge of the tongue diagnosis knowledge graph, providing rich experience support for tongue picture classification. The target tongue picture features obtained from this information therefore make full use of the holistic diagnostic reasoning of TCM tongue diagnosis, allowing tongue pictures to be identified more accurately and further ensuring the reliability of subsequent clinical diagnosis.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a schematic flow chart of a tongue image recognition method based on multi-modal information provided in the present specification;
FIG. 2 is a schematic diagram of a fusion process of image features and text features provided in the present specification;
FIG. 3 is a schematic diagram of a fusion process of image features and map features provided in the present specification;
FIG. 4 is a schematic diagram of a tongue recognition device based on multimodal information provided in the present specification;
Fig. 5 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Existing tongue picture classification methods fall into two main categories. One extracts handcrafted features tailored to the characteristics of the tongue using traditional image processing or hyperspectral techniques, and then classifies them with classifiers such as AdaBoost, KNN or SVM. The other builds a tongue feature classification model directly with a deep learning network.
However, both categories rely mainly on the input tongue images, and such purely image-driven methods have the following problems:
1. They do not follow the holistic diagnostic reasoning of TCM tongue diagnosis: when examining the tongue, a TCM practitioner refers to the tongue image, but, guided by tongue diagnosis theory and clinical experience, usually also combines other diagnostic methods (such as inquiry and pulse diagnosis) to judge the tongue state comprehensively;
2. They are single-modal: traditional tongue classification methods generally rely only on feature extraction and analysis of the tongue image, ignoring other important clinical information such as the chief complaint, symptoms and signs. Such single-modal classification leads to low accuracy and cannot comprehensively evaluate the overall state of the tongue;
3. They lack joint learning: existing methods often simply concatenate the features of different modalities, without effectively modeling the correlations between them. Such naive feature fusion cannot fully exploit the complementarity and interactions between modalities, limiting the performance of the tongue recognition model.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a tongue image recognition method based on multi-mode information provided in the present specification, which includes the following steps:
s101: a tongue identification request is received.
S102: and acquiring tongue images and symptom information of the user according to the tongue image identification request, wherein the symptom information comprises symptom complaints, symptom information and sign information of the user.
This scheme introduces a knowledge-enhanced multi-modal tongue picture classification method that imitates the way clinicians reason when diagnosing tongue images. The method fuses two key types of TCM knowledge: first, the patient's basic information, chief complaint, symptoms and signs, which provide contextual knowledge for tongue classification; and second, the empirical knowledge encoded in the tongue diagnosis knowledge graph, which provides rich experience support, so that tongue image classification better conforms to the holistic diagnostic reasoning of tongue diagnosis.
In the present specification, an execution subject for implementing a tongue image recognition method based on multimodal information may be a designated device such as a server, or may be a client installed on a terminal device such as a medical diagnostic apparatus, a mobile phone, or a computer, and for convenience of description, only a server is taken as an execution subject to describe a tongue image recognition method based on multimodal information provided in the present specification.
After receiving the tongue picture identification request, the server can acquire the original image and the symptom information of the tongue of the user according to the tongue picture identification request.
In this specification, the condition information may include condition complaints, symptom information, and sign information of the user.
A disorder complaint (chief complaint) refers to the most pronounced or most distressing symptoms reported by the patient at the visit, together with their duration.
Symptom information may refer to the painful manifestations perceived by the patient himself, and is subjective, such as pain, dizziness, nasal obstruction, nausea, fever, dyspnea, etc.
Sign information refers to diagnostically meaningful findings that a physician observes when examining a patient; unlike symptoms, signs are objective, a clinically common example being an abnormal heart rate.
The user may input text of the above-mentioned disorder information in the client to cause the client to upload the disorder information to the server.
The original image can be taken directly by the user or a doctor with a mobile phone camera or a diagnostic instrument camera under natural lighting conditions and uploaded to the server; alternatively, the user can select an existing tongue image from the photo album through the client on the mobile phone and upload it to the server.
Specifically, after the user clicks into the application client, the client may provide two options for the user to select how to acquire the tongue image. First, the user can choose to directly shoot the original image of the tongue with the camera of the mobile phone. At this point, the system will provide a series of instructions and advice to ensure that the user is able to properly capture the tongue image. For example, the system may suggest that the user take a photograph under natural lighting conditions to obtain a more accurate tongue picture. In addition, the system can remind the user to naturally extend the tongue when shooting, so that excessive force or curling of the tongue body is avoided. The user can use the camera of the mobile phone to shoot tongue images according to the guidance of the system.
In the shooting process, the client can display the image captured by the camera in real time so that the user can confirm the shooting effect. If the user is not satisfied with the current result, shooting can be repeated at any time until the user is satisfied. In addition, the user can select an existing tongue photo from the album: the system provides an interface for browsing the photo album, and the user can browse and select an appropriate tongue photo. Once the selection is completed, the client may upload the selected image to the server and send a tongue picture identification request.
After the server acquires the original image of the tongue body, the image of the tongue body area can be extracted from the original image, and a final tongue body image is obtained. In this process, the server may segment the tongue region of the original image of the tongue, and exclude redundant information in the original image, such as lips, face, chin, etc.
In practical application, the server can adopt a UNET semantic segmentation model based on a deep learning technology to realize the segmentation task of the tongue image, and the specific implementation process is as follows:
first, the server may take the original image of the tongue as the basis for segmenting the dataset. In order to train the deep learning model, the edges of the tongue body need to be definitely defined, and the edges of the tongue body in the data set are manually marked by using an open source image marking tool Labelme. This labeling process covers the key step of accurately dividing the tongue from the original image. Labelme supports a variety of labeling modes, including polygonal, multi-line segment and point labeling, so that the boundaries of the tongue can be precisely defined.
After annotation with Labelme, a JSON file containing the edge information is generated for each image. To facilitate training of the deep learning model, each JSON file is converted into a binarized contour map in PNG format, and a text file required for training is generated that lists the original image and its contour map. The dataset is then split into a training set and a test set, so that the UNET semantic segmentation model can be trained and evaluated on them.
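As a rough illustration of the conversion step above, the following sketch (assumed tooling, not the patent's actual scripts) shows how one Labelme JSON annotation might be turned into a binarized PNG mask; the label name "tongue" is an assumption, while the field names follow the standard Labelme format.

```python
import json
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path: str, png_path: str) -> None:
    """Convert one Labelme annotation into a binarized tongue mask (PNG)."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape["label"] == "tongue":                  # assumed label name
            points = [tuple(p) for p in shape["points"]]
            draw.polygon(points, fill=255)              # tongue region -> white
    mask.save(png_path)
```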
After the UNET semantic segmentation model training is completed, the server may use the UNET semantic segmentation model to perform tongue region segmentation. The UNET model structure includes three key parts: a backbone feature extraction part, an enhanced feature extraction part and a prediction part. These parts work together to accurately identify and segment the tongue.
The main feature extraction part is responsible for extracting preliminary effective features of the tongue body from the original image. This stage includes convolution operations and pooling operations to capture important information in the image. The enhanced feature extraction portion further processes the backbone features to generate an efficient feature layer with higher level features by upsampling and feature fusion. These feature layers will help us locate the boundary of the tongue more accurately. The prediction section classifies each pixel point in the image using the final effective feature layer. Through the process, the server can identify the pixel points belonging to the tongue body, and the tongue body area is obtained.
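The following is a minimal sketch of how a trained UNET-style segmentation network might be applied to crop the tongue region from an original image; the `unet` module, the 0.5 threshold and the bounding-box crop are illustrative assumptions rather than the patent's exact post-processing.

```python
import numpy as np
import torch
from PIL import Image

def extract_tongue_region(unet: torch.nn.Module, image_path: str) -> Image.Image:
    """Apply a trained UNET (assumed to output a 1-channel logit map) and crop the tongue."""
    img = Image.open(image_path).convert("RGB").resize((256, 256))
    x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    with torch.no_grad():
        mask = torch.sigmoid(unet(x))[0, 0] > 0.5       # binary tongue mask

    ys, xs = np.nonzero(mask.numpy())
    if len(ys) == 0:                                    # no tongue pixels found
        return img
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return img.crop(box)                                # keep only the tongue area
```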
S103: and inputting the tongue image and the symptom information into a pre-trained tongue image recognition model to extract symptom entities contained in the symptom information through the tongue image recognition model, and determining image features corresponding to the tongue image and text features corresponding to the symptom information.
The server can input the tongue image and the symptom information into a pre-trained tongue image recognition model, and determine the image characteristics corresponding to the tongue image and the text characteristics corresponding to the symptom information through a tongue image recognition model characteristic extraction network.
In this specification, the tongue recognition model feature extraction network may include an image feature extraction sub-network and a text feature extraction sub-network. Through the image feature extraction sub-network, the server can extract image features F_img from a single tongue image or from tongue images of several different views, which can be expressed as:

F_img ∈ R^(C×D)

where C and D represent the number of image feature channels and the dimension of the image features, respectively.
In practical application, the server can ensure that the model can effectively extract the image features of the tongue image through a series of optimization strategies and advanced model architecture. For example, the server may preferably normalize the size of each tongue image to 256×256 pixels, thereby ensuring consistency and stability of the input image.
During the model training phase, the server may implement a random clipping strategy. A rectangular region is randomly selected from the normalized 256 x 256 pixel image, and the cropped region has a randomly determined area ratio between 0.09 and 1.0. The selected region size is then adjusted to 224 x 224 pixels to ensure a match to the input size of the model.
In addition, in order to increase the diversity of training data, a random horizontal inversion method is preferably employed. Part of the images are subjected to horizontal mirroring in the training process, so that variability of model training is increased.
The server may select the pre-trained ResNet model as the image feature extraction sub-network. The model is excellent in image feature extraction, and has high accuracy and generalization capability.
The information of the tongue image can be fully utilized through the optimization strategies, and the efficiency and the accuracy of feature extraction are improved, so that the performance and the reliability of the tongue image recognition model are enhanced.
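A hedged sketch of the image branch described above (256×256 size normalization, a random crop with area ratio 0.09 to 1.0 resized to 224×224, random horizontal flipping, and a pretrained ResNet backbone); the ResNet-50 depth and the removal of the classification head are assumptions, not the patent's exact configuration.

```python
import torch
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),                         # normalize size to 256x256
    transforms.RandomResizedCrop(224, scale=(0.09, 1.0)),  # random area ratio 0.09-1.0
    transforms.RandomHorizontalFlip(p=0.5),                # random horizontal flipping
    transforms.ToTensor(),
])

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()                          # keep features, drop the classifier
backbone.eval()

def image_features(pil_image) -> torch.Tensor:
    """Return the image feature F_img for one tongue image (shape (1, 2048))."""
    with torch.no_grad():
        return backbone(train_tf(pil_image).unsqueeze(0))
```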
The server may convert unstructured text information (including complaints, symptoms and signs) into embedded vectors that the model can understand as text features through the text feature extraction sub-network.
In the present specification, the server may select a pretrained BERT model as the text feature extraction sub-network, and first embed the unstructured condition information to obtain an embedded vector. The condition text T is first segmented into words, which can be expressed as:

T = {w_1, w_2, ..., w_n}

The server may then use the pretrained BERT model to generate a feature vector for each word in the sequence, obtaining the word vector embedding sequence E_text, which can be seen as the specific textual knowledge of each input. The process can be expressed as:

E_text = BERT(T) = {e_1, e_2, ..., e_n} ∈ R^(n×300)

where e_i is the generated word vector embedding of word w_i, n refers to the total number of words the text contains, and 300 represents the dimension of the embedding vector.
The server may perform word segmentation on the original text (the condition information). This segmentation splits the text into individual words or phrases, making each word the basic unit of processing. For example, the complaint/symptom/sign text "the patient feels dry mouth and a dry tongue" can be broken down into vocabulary units such as "patient", "feels", "dry mouth" and "dry tongue".
Next, a corresponding word vector is generated for each word or phrase, and the feature vectors contain semantic information of the words and can capture the relevance between the words. After the word vectors for each word are obtained, they are combined into a text sequence and embedded to obtain the final text feature, which represents the entire text information and can be regarded as a vectorized representation of the text.
In practical applications, the dimensions of the text feature are typically set to a fixed value, such as 300 dimensions. The choice of this dimension value can be adjusted according to the specific task and model requirements.
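The text branch could be sketched as follows, assuming a Chinese BERT checkpoint and a linear projection from BERT's 768-dimensional hidden states down to the 300-dimensional word vectors mentioned above; both the checkpoint name and the projection layer are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
project = torch.nn.Linear(bert.config.hidden_size, 300)          # 768 -> 300 (assumed)

def text_features(condition_text: str) -> torch.Tensor:
    """Return the word-vector sequence E_text with shape (n_tokens, 300)."""
    inputs = tokenizer(condition_text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state                # (1, n_tokens, 768)
    return project(hidden).squeeze(0)
```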
In addition, during the extraction of the text features through the feature extraction network, the server may also extract the symptom entities contained in the condition information, where the symptom entities represent specific symptoms of the user, such as "dysphoria with smothery sensation", "inappetence", "bitter taste in mouth and dry throat", "insomnia and dreaminess", etc.
S104: and determining the corresponding relation between the symptom entity and the tongue picture entity according to a preset tongue diagnosis knowledge graph, and acquiring graph characteristics determined based on the corresponding relation.
After extracting the symptom entities contained in the condition information, the server can determine the tongue picture entities associated with those symptom entities based on a preset tongue diagnosis knowledge graph, and then acquire the graph features determined based on the correspondence between the symptom entities and the tongue picture entities. Here, a symptom entity represents a symptom type, and a tongue picture entity represents a tongue picture type (including tongue coating types and tongue quality types).
The profile features may include: the entity embedding characteristics corresponding to the disease entity and the tongue image entity and the relation embedding characteristics representing the relation between the disease entity and the tongue image entity.
The tongue diagnosis knowledge graph includes tongue image entities related to different disease entities, and the server can construct the tongue diagnosis knowledge graph in advance before making the tongue diagnosis knowledge graph.
Specifically, the server can obtain the medical knowledge required for constructing the tongue diagnosis knowledge graph, and then represent this knowledge as triples consisting of a disorder entity, a tongue picture entity and the relation between them, namely:

G = {(h_i, r_i, t_i)}, i = 1, 2, ..., N

where N indicates the number of triples in the tongue diagnosis knowledge graph.
Taking the example of the disorder entity "dysphoria with five hearts" and the corresponding tongue-like entity "red tongue", the triplet formed by these two entities may be expressed as "red tongue" - "common tongue color" - "dysphoria with five hearts". This means that a red tongue is a type of tongue pattern associated with dysphoria with feverish sensation in the five palms.
Further, in order to obtain knowledge from the tongue diagnosis knowledge graph that can be used by the tongue image recognition model, the server may obtain entity embedding features and relation embedding features representing the correspondences between entities through a pre-trained model (such as the graph embedding model RotatE), which defines each relation as a rotation from the source entity to the target entity in complex vector space, expressed as:

t = h ∘ r

where h, r and t are the embeddings of the head entity, the relation and the tail entity, E and R represent the entity embedding features and the relation embedding features respectively, N_e and N_r refer to the number of entities and the number of relations in the knowledge graph, and 400 represents the dimension of the embedding vectors.
The key objective of this step is to encode the information obtained from the tongue knowledge graph into entity and relationship embeddings that will be used in subsequent model training for fusion of rich visual features and knowledge text.
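To make the triple and embedding representation concrete, the sketch below stores (head, relation, tail) triples and scores them with a RotatE-style rotation in complex space; the entity and relation counts, the example triple and the scoring details are illustrative assumptions, not the patent's training setup.

```python
import torch

# Illustrative triples: (tongue picture entity, relation, disorder entity)
triples = [
    ("red tongue", "common tongue color", "dysphoria with five hearts"),
    # ... further triples from the tongue diagnosis knowledge graph
]

dim = 400                                        # embedding dimension from the text
entity_emb = torch.nn.Embedding(1000, dim * 2)   # real+imaginary parts (1000 entities assumed)
relation_phase = torch.nn.Embedding(50, dim)     # rotation angles (50 relations assumed)

def rotate_score(h_idx: torch.Tensor, r_idx: torch.Tensor, t_idx: torch.Tensor) -> torch.Tensor:
    """RotatE: the tail should equal the head rotated by the relation in complex space."""
    h = torch.view_as_complex(entity_emb(h_idx).reshape(-1, dim, 2))
    t = torch.view_as_complex(entity_emb(t_idx).reshape(-1, dim, 2))
    phase = relation_phase(r_idx)
    r = torch.polar(torch.ones_like(phase), phase)   # unit-modulus rotation
    return -(h * r - t).abs().sum(dim=-1)            # higher score = more plausible triple
```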
S105: and fusing the image features, the text features and the map features to obtain target tongue image features, so as to identify tongue images according to the target tongue image features.
After the image features and the text features are determined, the server can fuse the image features and the text features to obtain first fusion features.
The server can determine a first attention weight representing the association degree between text features and image features through a first fusion network in the tongue image recognition model, and then fuse the text features with the image features based on the first attention weight to obtain the first fusion features. For ease of understanding, the present disclosure provides a schematic diagram of a fusion process of image features and text features, as shown in fig. 2.
Fig. 2 is a schematic diagram of a fusion process of image features and text features provided in the present specification.
The server may pass the text features and the image features to a first fusion network with a multi-head attention mechanism, where the mechanism includes several heads, each of which independently learns the relationship between the text and the image, and determine the query vector (Query) corresponding to the image features and the key vector (Key) and value vector (Value) corresponding to the text features. The first fusion network processes the query and key vectors output by the linear layers through Matmul, Scale and Softmax operations in turn, and then processes the Softmax output together with the value vector through another Matmul operation to obtain the final first fusion feature.

For each head, the server may determine a first attention weight from the query vector corresponding to the image features and the key vector (Key) and value vector (Value) corresponding to the text features, which may be expressed as:

A_i = softmax( (Q W_i^Q)(K W_i^K)^T / sqrt(d) )

where Q, K and V represent the query, key and value vectors, d is the feature dimension, and A_i is the first attention weight used to characterize the degree of association between the text features and the image features.

Next, the server may obtain a text-enhanced visual representation, used as the first fusion feature, by a weighted sum over the attention scores of the multiple heads. This feature integrates the text information and the image information and emphasizes the correlation between them. Specifically, multi-head attention is applied to obtain the final first fusion feature, which can be calculated by the following formulas:

head_i = A_i (V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i denotes the attention output of the i-th head, and W_i^Q, W_i^K, W_i^V and W^O are the query, key and value transformation matrices of the i-th head and the output transformation matrix, respectively.

Applying the text features E_text and the image features F_img to the first fusion formula, the first fusion feature can be further expressed as:

F_1 = MultiHead(F_img, E_text, E_text)

where the text features E_text supply two feature vectors, serving as the key vector and the value vector, respectively.
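A compact sketch of this first fusion step, with image features as queries and text features as keys and values in multi-head cross-attention; the projection dimensions, head count and module names are assumptions rather than the patent's exact network.

```python
import torch

class TextImageFusion(torch.nn.Module):
    """First fusion: image features query the text features (multi-head cross-attention)."""

    def __init__(self, img_dim=2048, txt_dim=300, d_model=512, heads=8):
        super().__init__()
        self.q_proj = torch.nn.Linear(img_dim, d_model)
        self.kv_proj = torch.nn.Linear(txt_dim, d_model)
        self.attn = torch.nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, N_img, img_dim); txt_feat: (B, N_words, txt_dim)
        q = self.q_proj(img_feat)                 # queries from the image
        kv = self.kv_proj(txt_feat)               # keys and values from the text
        fused, _ = self.attn(q, kv, kv)           # softmax(QK^T / sqrt(d)) weighting of V
        return fused                              # first fusion feature F_1
```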
Meanwhile, the server can fuse the image features and the map features to obtain second fusion features.
Specifically, the server can determine a second attention weight representing the association degree between the map features and the image features through a second fusion network in the tongue image recognition model, and then fuse the map features and the image features based on the second attention weight to obtain second fusion features. For ease of understanding, the present disclosure provides a schematic diagram of a fusion process of image features and atlas features, as shown in fig. 3.
Fig. 3 is a schematic diagram of a fusion process of image features and map features provided in the present specification.
The server may transmit the map feature and the image feature to a second fusion network with a multi-head attention mechanism, and determine a Query vector (Query) corresponding to the image feature, and a Key vector (Key) and a Value vector (Value) corresponding to the map feature. The second fusion network may process the query vector and the key vector output by the linear layer (linear) sequentially through a Matmul function and a Scale function, further activate the processed information and the relation embedded feature through a Softmax function, and then process the output result of the Softmax function and the value vector output by the linear layer (linear) through the Matmul function to obtain the second fusion feature.
Specifically, since the graph features involve two kinds of entities, namely disorder entities and tongue picture entities, for each entity in the knowledge the server may aggregate the relation (edge) features of all its neighbors and add these aggregated features to the model as a relation bias.

The server may establish an aggregated relation embedding, where each element r̄_ij represents the relationship between entity e_i and entity e_j. Since there may be multiple relations between two entities, the relation embedding feature is represented by the average of the relation embeddings, so r̄_ij can be expressed as:

r̄_ij = MeanPool(R_ij)

where MeanPool denotes the average pooling function and R_ij denotes all relations between entity e_i and entity e_j.

The server may then calculate a relation bias, which may be expressed as:

b_ij = r̄_ij W_r

where W_r is a learnable matrix parameter of size 400 × 1 and r̄_ij, the aggregated relation embedding of the entities after pooling, has size 1 × 400.

The second fusion network may determine the second attention weight based on the relation bias, the query vector (Query) corresponding to the image features, and the key vector (Key) and value vector (Value) corresponding to the graph features, where the second attention weight may be expressed as:

A'_i = softmax( (Q W_i^Q)(K W_i^K)^T / sqrt(d) + B )

where Q, K and V represent the query, key and value vectors, respectively, d is the embedding dimension, and B is the matrix of relation biases b_ij.

Then the second fusion network can determine the second fusion feature according to the second attention weight, the query vector corresponding to the image features, the relation embedding features, and the key and value vectors corresponding to the entity embedding features, which can be calculated by the following formulas:

head'_i = A'_i (V W_i^V)
MultiHead'(Q, K, V) = Concat(head'_1, ..., head'_h) W^O

where B denotes the relation bias of the i-th head, and W_i^Q, W_i^K, W_i^V and W^O are the query, key and value transformation matrices of the i-th head and the output transformation matrix, respectively.

Applying the image features F_img, the entity embedding features of the two kinds of entities, and the relation embedding features to the second fusion formula, the second fusion feature can be further expressed as:

F_2 = MultiHead'(F_img, E_ent, E_ent)

where the entity embedding features E_ent serve as the key vector and the value vector.
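A hedged sketch of the second fusion step, in which the image features attend over the entity embeddings and the mean-pooled relation embeddings are mapped by a 400×1 matrix into an additive bias on the attention scores; the module names, single-head formulation and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

class GraphImageFusion(torch.nn.Module):
    """Second fusion: image features attend over entity embeddings with a relation bias."""

    def __init__(self, img_dim=2048, kg_dim=400, d_model=512):
        super().__init__()
        self.q = torch.nn.Linear(img_dim, d_model)
        self.k = torch.nn.Linear(kg_dim, d_model)
        self.v = torch.nn.Linear(kg_dim, d_model)
        self.rel_bias = torch.nn.Linear(kg_dim, 1, bias=False)   # the 400 x 1 learnable matrix

    def forward(self, img_feat, entity_emb, relation_emb):
        # img_feat: (B, N_img, img_dim); entity_emb, relation_emb: (B, N_ent, kg_dim)
        q, k, v = self.q(img_feat), self.k(entity_emb), self.v(entity_emb)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # QK^T / sqrt(d)
        bias = self.rel_bias(relation_emb).transpose(-2, -1)     # (B, 1, N_ent) relation bias
        weights = F.softmax(scores + bias, dim=-1)               # relation-biased attention
        return weights @ v                                       # second fusion feature F_2
```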
after the first fusion feature and the second fusion feature are determined, the server can fuse the first fusion feature and the second fusion feature through a third fusion network of the tongue picture recognition model to obtain a target tongue picture feature, and the target tongue picture feature can be expressed as:
after the target tongue picture features are obtained, the server can conduct tongue picture recognition according to the target tongue picture features to obtain tongue picture recognition results.
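One possible (assumed) realization of the third fusion network and the recognition heads is sketched below: the two fused features are pooled, concatenated and mapped to the target tongue feature, which then feeds separate tongue-quality and tongue-coating classifiers. The pooling, concatenation and class counts are placeholders, since the recoverable text does not specify them.

```python
import torch

class TongueRecognitionHead(torch.nn.Module):
    """Assumed third fusion + classification heads for tongue quality and tongue coating."""

    def __init__(self, d_model=512, n_quality_types=8, n_coating_types=8):
        super().__init__()
        self.fuse = torch.nn.Linear(2 * d_model, d_model)   # concat F_1 and F_2, then project
        self.quality_head = torch.nn.Linear(d_model, n_quality_types)
        self.coating_head = torch.nn.Linear(d_model, n_coating_types)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        # f1, f2: (B, N, d_model); mean-pool the fused sequences before combining
        target = torch.relu(self.fuse(torch.cat([f1.mean(1), f2.mean(1)], dim=-1)))
        return self.quality_head(target), self.coating_head(target)
```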
In the present specification, the tongue picture recognition result may include a tongue quality type and a tongue coating type, and the specific classifications of the tongue quality types and tongue coating types are shown in Table 1:
Table 1
After the tongue picture identification result is determined, the server can diagnose the symptoms of the user according to the obtained tongue picture type.
For example, the server may input the tongue recognition result into a pre-trained auxiliary diagnostic model, thereby outputting a classification result through the auxiliary diagnostic model.
The server can also directly feed back the tongue picture recognition result to the doctor so that the doctor can diagnose the user by referring to the recognized tongue picture type.
Of course, the server may also send the tongue recognition result to the client used by the user, so as to push information to the user.
In addition, before the tongue picture recognition model is used, the server can train the tongue picture recognition model in advance, wherein the server can acquire a training sample containing historical tongue picture and historical disorder information of a designated user, input the historical tongue picture and the historical disorder information into the tongue picture recognition model to be trained, so as to extract the historical disorder entity contained in the historical disorder information, and determine the historical picture characteristics corresponding to the historical tongue picture and the historical text characteristics corresponding to the historical disorder information.
According to the tongue diagnosis knowledge graph, determining the corresponding relation between the historical disorder entity and the tongue picture entity, acquiring the historical graph features determined based on the corresponding relation between the historical disorder entity and the tongue picture entity, fusing the historical image features, the historical text features and the historical graph features to obtain historical tongue picture features, and carrying out tongue picture recognition according to the historical tongue picture features to obtain a tongue picture recognition result.
The server may train the tongue picture recognition model with the optimization objective of minimizing the deviation between the tongue picture recognition result and the actual tongue picture type of the specified user. Training continues until the training target is met (e.g., a preset number of training iterations is reached or the model converges to within a preset range), after which the trained tongue picture recognition model is deployed on the server.
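A minimal training-loop sketch for the optimization target described above (minimizing the deviation from the actual tongue types via cross-entropy on the two heads); the optimizer, learning rate, epoch count and the end-to-end model signature are assumptions.

```python
import torch

def train(model, dataloader, epochs=50, lr=1e-4):
    """Minimize the deviation between predicted and actual tongue types (cross-entropy)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for tongue_image, condition_text, quality_label, coating_label in dataloader:
            quality_logits, coating_logits = model(tongue_image, condition_text)
            loss = loss_fn(quality_logits, quality_label) + loss_fn(coating_logits, coating_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```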
According to the method, two key types of traditional Chinese medicine knowledge, namely complaint and symptom sign knowledge and tongue diagnosis map knowledge, are fused in addition to tongue images. The complaint and symptom sign knowledge provides rich combined parameter information for tongue picture identification, and allows a model to learn doctor thinking to consider the complaint and symptom sign of a patient, so that classification is more accurate. The tongue diagnosis map knowledge provides precious medical experience support for tongue image classification.
In addition, the scheme provides a comprehensive multi-modal representation learning method for tongue image recognition through joint learning of tongue images, unstructured complaint and symptom sign data and structured tongue diagnosis knowledge graph.
Notably, the present invention may be the first tongue classification method to fuse multimodal clinical information into overall representation learning. The innovative method greatly improves the accuracy and the interpretability of tongue classification, and is expected to bring important help to the field of tongue diagnosis of traditional Chinese medicine.
The above is a method for identifying tongue images based on multi-modal information implemented by one or more embodiments of the present disclosure, and based on the same concept, the present disclosure further provides a corresponding tongue image identifying device based on multi-modal information, as shown in fig. 4.
Fig. 4 is a schematic diagram of a tongue identifier based on multi-mode information provided in the present specification, including:
a receiving module 401, configured to receive a tongue picture identification request;
an obtaining module 402, configured to obtain, according to the tongue image identification request, a tongue image of a user and symptom information, where the symptom information includes a symptom complaint, symptom information, and sign information of the user;
the input module 403 is configured to input the tongue image and the condition information into a pre-trained tongue recognition model, so as to extract a condition entity included in the condition information through the tongue recognition model, and determine an image feature corresponding to the tongue image and a text feature corresponding to the condition information;
the determining module 404 is configured to determine a correspondence between the disorder entity and the tongue image entity according to a preset tongue diagnosis knowledge graph, and obtain a graph feature determined based on the correspondence;
And the recognition module 405 is configured to fuse the image feature, the text feature and the map feature to obtain a target tongue feature, so as to perform tongue recognition according to the target tongue feature.
Optionally, the acquiring module 402 is specifically configured to acquire an original image; and extracting an image of the tongue region from the original image to obtain the tongue image.
Optionally, the identifying module 405 is specifically configured to fuse the text feature with the image feature to obtain a first fusion feature, and fuse the image feature with the map feature to obtain a second fusion feature; and determining the target tongue image characteristic according to the first fusion characteristic and the second fusion characteristic.
Optionally, the identifying module 405 is specifically configured to determine, through a first feature fusion network in the tongue image identifying model, a first attention weight, where the first attention weight is used to characterize a degree of association between the text feature and the image feature; and fusing the text feature and the image feature based on the first attention weight to obtain the first fused feature.
Optionally, the identifying module 405 is specifically configured to determine a query vector corresponding to the image feature and a key vector and a value vector corresponding to the text feature; and determining the first fusion feature according to the first attention weight, the query vector corresponding to the image feature, and the key vector and the value vector corresponding to the text feature.
Optionally, the identifying module 405 is specifically configured to determine, through a second feature fusion network in the tongue image identifying model, a second attention weight, where the second attention weight is used to characterize a degree of association between the map feature and the image feature; and fusing the map features with the image features based on the second attention weight to obtain the second fused features.
Optionally, the profile features include: entity embedding features and relationship embedding features for characterizing relationships between entities;
the identifying module 405 is specifically configured to determine a query vector corresponding to the image feature and a key vector and a value vector corresponding to the entity embedded feature; and determining the second fusion feature according to the second attention weight, the query vector corresponding to the image feature, the relation embedding feature and the key vector and the value vector corresponding to the entity embedding feature.
Optionally, the apparatus further comprises:
a training module 406, configured to obtain a training sample, where the training sample includes a historical tongue image and historical condition information of a specified user; inputting the historical tongue image and the historical disorder information into a tongue image recognition model to be trained, extracting a historical disorder entity contained in the historical disorder information through the tongue image recognition model, and determining historical image features corresponding to the historical tongue image and historical text features corresponding to the historical disorder information; according to the tongue diagnosis knowledge graph, determining the corresponding relation between the historical disorder entity and the tongue picture entity, and acquiring the historical graph features determined based on the corresponding relation between the historical disorder entity and the tongue picture entity; fusing the historical image features, the historical text features and the historical graph features to obtain historical tongue image features, and carrying out tongue image recognition according to the historical tongue image features to obtain tongue image recognition results; and training the tongue picture recognition model by taking minimizing the deviation between the tongue picture recognition result and the actual tongue picture type of the appointed user as an optimization target.
Optionally, the identifying module 405 is further configured to diagnose a disorder of the user according to a tongue image identifying result, where the tongue image identifying result includes: a tongue quality type of the tongue body and a tongue coating type.
The present specification also provides a computer readable storage medium storing a computer program operable to perform a tongue recognition method based on multimodal information as provided in fig. 1 above.
The present specification also provides a schematic structural diagram of an electronic device corresponding to fig. 1 shown in fig. 5. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as illustrated in fig. 5, although other hardware required by other services may be included. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the tongue image recognition method based on the multi-mode information, which is described in the above-mentioned figure 1. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
Improvements to one technology can clearly distinguish between improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) and software (improvements to the process flow). However, with the development of technology, many improvements of the current method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain corresponding hardware circuit structures by programming improved method flows into hardware circuits. Therefore, an improvement of a method flow cannot be said to be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the programming of the device by a user. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented by using "logic compiler" software, which is similar to the software compiler used in program development and writing, and the original code before the compiling is also written in a specific programming language, which is called hardware description language (Hardware Description Language, HDL), but not just one of the hdds, but a plurality of kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), lava, lola, myHDL, PALASM, RHDL (Ruby Hardware Description Language), etc., VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, apart from implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Indeed, the means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above device is described with its functions divided into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the description of the system embodiment is relatively brief because it is substantially similar to the method embodiment; for relevant details, reference may be made to the corresponding parts of the description of the method embodiment.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (12)

1. A tongue image recognition method based on multi-modal information, characterized by comprising the following steps:
receiving a tongue image recognition request;
acquiring a tongue image and disorder information of a user according to the tongue image recognition request, wherein the disorder information comprises a chief complaint, symptom information, and sign information of the user;
inputting the tongue image and the disorder information into a pre-trained tongue image recognition model, so as to extract, through the tongue image recognition model, disorder entities contained in the disorder information, and to determine image features corresponding to the tongue image and text features corresponding to the disorder information;
determining a correspondence between the disorder entities and tongue image entities according to a preset tongue diagnosis knowledge graph, and acquiring graph features determined based on the correspondence;
and fusing the image features, the text features, and the graph features to obtain target tongue image features, so as to perform tongue image recognition according to the target tongue image features.
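Purely as an editorial illustration, and not as part of the claims, the following Python sketch arranges the steps recited in claim 1 into one possible processing flow; every component passed in (entity extractor, encoders, knowledge graph, fusion, and classifier) is a hypothetical placeholder rather than the claimed implementation.

def recognize_tongue(tongue_image, disorder_info,
                     entity_extractor, image_encoder, text_encoder,
                     tongue_kg, fuse, classify):
    """Illustrative arrangement of the steps of claim 1; all arguments are placeholders."""
    entities   = entity_extractor(disorder_info)          # extract disorder entities
    image_feat = image_encoder(tongue_image)              # image features of the tongue image
    text_feat  = text_encoder(disorder_info)              # text features of the disorder information
    relations  = tongue_kg.lookup(entities)               # correspondence with tongue image entities
    graph_feat = tongue_kg.embed(relations)               # graph features from those correspondences
    target     = fuse(image_feat, text_feat, graph_feat)  # target tongue image features
    return classify(target)                               # tongue image recognition result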
2. The method according to claim 1, wherein acquiring the tongue image of the user according to the tongue image recognition request specifically comprises:
acquiring an original image;
and extracting an image of the tongue region from the original image to obtain the tongue image.
3. The method according to claim 1, wherein fusing the image features, the text features, and the graph features to obtain the target tongue image features specifically comprises:
fusing the text features with the image features to obtain first fused features, and fusing the image features with the graph features to obtain second fused features;
and determining the target tongue image features according to the first fused features and the second fused features.
4. The method according to claim 3, wherein fusing the text features with the image features to obtain the first fused features specifically comprises:
determining a first attention weight through a first feature fusion network in the tongue image recognition model, wherein the first attention weight is used to represent a degree of association between the text features and the image features;
and fusing the text features with the image features based on the first attention weight to obtain the first fused features.
5. The method according to claim 4, wherein fusing the text features with the image features based on the first attention weight to obtain the first fused features specifically comprises:
determining a query vector corresponding to the image features, and a key vector and a value vector corresponding to the text features;
and determining the first fused features according to the first attention weight, the query vector corresponding to the image features, and the key vector and the value vector corresponding to the text features.
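One conventional way to realize the weighted fusion recited in claims 4 and 5 is scaled dot-product cross-attention with the image features supplying the query and the text features supplying the key and value. The sketch below is a generic illustration under assumed tensor shapes, not the patented network; learned projection layers are omitted for brevity.

import torch

def cross_attention_fusion(image_feat, text_feat, d_k: int = 64):
    """Generic cross-attention sketch (cf. claims 4-5).
    Assumed shapes: image_feat (B, Nq, d_k), text_feat (B, Nk, d_k)."""
    q = image_feat                       # query from the image features
    k = v = text_feat                    # key and value from the text features
    # softmax scores play the role of the "first attention weight"
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v                      # first fused features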
6. The method according to claim 3, wherein fusing the image features with the graph features to obtain the second fused features specifically comprises:
determining a second attention weight through a second feature fusion network in the tongue image recognition model, wherein the second attention weight is used to represent a degree of association between the graph features and the image features;
and fusing the graph features with the image features based on the second attention weight to obtain the second fused features.
7. The method according to claim 6, wherein the graph features comprise entity embedding features and relation embedding features used to characterize relations between entities;
and fusing the graph features with the image features based on the second attention weight to obtain the second fused features specifically comprises:
determining a query vector corresponding to the image features, and a key vector and a value vector corresponding to the entity embedding features;
and determining the second fused features according to the second attention weight, the query vector corresponding to the image features, the relation embedding features, and the key vector and the value vector corresponding to the entity embedding features.
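Claims 6 and 7 describe an attention fusion in which the graph features carry both entity embeddings and relation embeddings. One common way to inject the relation embedding is to add it to the key before the attention scores are computed; the sketch below uses that variant purely as an assumption, since the claims do not fix how the relation embedding enters the computation.

import torch

def graph_attention_fusion(image_feat, entity_emb, relation_emb, d_k: int = 64):
    """Illustrative second fusion (cf. claims 6-7); all shapes assumed (B, N, d_k)."""
    q = image_feat                       # query from the image features
    k = entity_emb + relation_emb        # relation-aware key (assumed scheme, not the claim)
    v = entity_emb                       # value from the entity embedding features
    # softmax scores play the role of the "second attention weight"
    attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v                      # second fused features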
8. The method according to claim 1, wherein training the tongue image recognition model specifically comprises:
acquiring a training sample, wherein the training sample comprises a historical tongue image and historical disorder information of a designated user;
inputting the historical tongue image and the historical disorder information into a tongue image recognition model to be trained, so as to extract, through the tongue image recognition model, historical disorder entities contained in the historical disorder information, and to determine historical image features corresponding to the historical tongue image and historical text features corresponding to the historical disorder information;
determining a correspondence between the historical disorder entities and the tongue image entities according to the tongue diagnosis knowledge graph, and acquiring historical graph features determined based on the correspondence between the historical disorder entities and the tongue image entities;
fusing the historical image features, the historical text features, and the historical graph features to obtain historical tongue image features, and performing tongue image recognition according to the historical tongue image features to obtain a tongue image recognition result;
and training the tongue image recognition model with minimizing a deviation between the tongue image recognition result and an actual tongue type of the designated user as an optimization target.
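Training as recited in claim 8 amounts to minimizing the deviation between the model's recognition result and the annotated tongue type. A minimal sketch of one optimization step is given below; the choice of cross-entropy as the deviation measure and the model's call signature are assumptions, not fixed by the claim.

import torch
import torch.nn as nn

def train_step(model, optimizer, hist_tongue_image, hist_disorder_info, true_tongue_type):
    """One illustrative optimization step for the training described in claim 8."""
    optimizer.zero_grad()
    logits = model(hist_tongue_image, hist_disorder_info)          # historical recognition result
    loss = nn.functional.cross_entropy(logits, true_tongue_type)   # deviation from the actual type
    loss.backward()
    optimizer.step()
    return loss.item()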
9. The method according to claim 1, wherein the method further comprises:
diagnosing a disorder of the user according to a tongue image recognition result, wherein the tongue image recognition result comprises the tongue type of the tongue body and the type of the tongue coating.
10. A tongue image recognition device based on multi-modal information, comprising:
a receiving module, configured to receive a tongue image recognition request;
an acquisition module, configured to acquire a tongue image and disorder information of a user according to the tongue image recognition request, wherein the disorder information comprises a chief complaint, symptom information, and sign information of the user;
an input module, configured to input the tongue image and the disorder information into a pre-trained tongue image recognition model, so as to extract, through the tongue image recognition model, disorder entities contained in the disorder information, and to determine image features corresponding to the tongue image and text features corresponding to the disorder information;
a determining module, configured to determine a correspondence between the disorder entities and tongue image entities according to a preset tongue diagnosis knowledge graph, and to acquire graph features determined based on the correspondence;
and an identification module, configured to fuse the image features, the text features, and the graph features to obtain target tongue image features, so as to perform tongue image recognition according to the target tongue image features.
11. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-9.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-9 when executing the program.
CN202410076457.4A 2024-01-18 Tongue picture identification method and device based on multi-mode information and electronic equipment Active CN117611581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410076457.4A CN117611581B (en) 2024-01-18 Tongue picture identification method and device based on multi-mode information and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410076457.4A CN117611581B (en) 2024-01-18 Tongue picture identification method and device based on multi-mode information and electronic equipment

Publications (2)

Publication Number Publication Date
CN117611581A 2024-02-27
CN117611581B 2024-05-14

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070737A (en) * 2020-09-02 2020-12-11 张书臣 Traditional Chinese medicine tongue observation and syndrome differentiation intelligent identification method based on tongue picture image processing
CN113160203A (en) * 2021-04-30 2021-07-23 湄洲湾职业技术学院 Artificial intelligence tongue picture coating color recognition system
CN113569002A (en) * 2021-02-01 2021-10-29 腾讯科技(深圳)有限公司 Text search method, device, equipment and storage medium
CN113724228A (en) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Tongue color and coating color identification method and device, computer equipment and storage medium
CN115116576A (en) * 2022-06-30 2022-09-27 上海国民集团健康科技有限公司 Knowledge-fused automatic generation method, system and terminal for traditional Chinese medicine diagnosis report
CN116030961A (en) * 2022-12-23 2023-04-28 电子科技大学 Traditional Chinese medicine constitution identification method and system based on multi-view tongue picture feature fusion
CN116844687A (en) * 2023-05-24 2023-10-03 华南理工大学 Prescription recommendation method and system based on tongue images and knowledge patterns

Similar Documents

Publication Publication Date Title
WO2020215984A1 (en) Medical image detection method based on deep learning, and related device
US10991094B2 (en) Method of analyzing dental image for correction diagnosis and apparatus using the same
WO2020113326A1 (en) Automatic image-based skin diagnostics using deep learning
BR112020008021A2 (en) computing devices, method for generating a cnn trained to process images and methods for processing an image
CN110491502A (en) Microscope video stream processing method, system, computer equipment and storage medium
EP2972662A1 (en) Brain computer interface (bci) system based on gathered temporal and spatial patterns of biophysical signals
CN112614571B (en) Training method and device for neural network model, image classification method and medium
KR101925603B1 (en) Method for faciliating to read pathology image and apparatus using the same
Li et al. Appearance-based gaze estimation for ASD diagnosis
CN116187448B (en) Information display method and device, storage medium and electronic equipment
CN110427994A (en) Digestive endoscope image processing method, device, storage medium, equipment and system
Grimmer et al. Deep face age progression: A survey
CN109658399A (en) A kind of neck patch image-recognizing method and device
Panetta et al. Software architecture for automating cognitive science eye-tracking data analysis and object annotation
CN115116576A (en) Knowledge-fused automatic generation method, system and terminal for traditional Chinese medicine diagnosis report
Matuszewski et al. High-resolution comprehensive 3-D dynamic database for facial articulation analysis
CN117611581B (en) Tongue picture identification method and device based on multi-mode information and electronic equipment
CN111339878B (en) Correction type real-time emotion recognition method and system based on eye movement data
CN116030247B (en) Medical image sample generation method and device, storage medium and electronic equipment
CN117611581A (en) Tongue picture identification method and device based on multi-mode information and electronic equipment
Lu et al. Video-based neonatal pain expression recognition with cross-stream attention
CN117372744A (en) Eye surface color photographic image classification method, system, electronic device and storage medium
Yang et al. Semantic-Preserving Surgical Video Retrieval with Phase and Behavior Coordinated Hashing
CN117316387A (en) Multi-mode time sequence processing depression state data processing method, electronic equipment and medium
Zhang Hand-Eye Behaviour Analytics for Children with Autism Spectrum Disorder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant