CN117576405A - Tongue picture semantic segmentation method, device, equipment and medium - Google Patents

Tongue picture semantic segmentation method, device, equipment and medium

Info

Publication number
CN117576405A
CN117576405A (Application No. CN202410063634.5A)
Authority
CN
China
Prior art keywords
feature map
feature
tongue picture
convolution
shallow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410063634.5A
Other languages
Chinese (zh)
Inventor
李会霞
韩爱庆
唐燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huiyi Bida Medical Technology Co ltd
Original Assignee
Shenzhen Huiyi Bida Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huiyi Bida Medical Technology Co ltd
Priority to CN202410063634.5A
Publication of CN117576405A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806 Fusion of extracted features, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Abstract

The invention relates to a tongue picture semantic segmentation method, device, equipment and medium. The method comprises: obtaining a target tongue picture and performing feature extraction preprocessing to obtain a shallow feature map; performing multi-scale feature extraction on the shallow feature map to obtain a multi-scale feature map, and performing channel dimension reduction on the multi-scale feature map to obtain an integrated feature map; performing channel dimension reduction on the shallow feature map to obtain a first feature map, upsampling the integrated feature map to obtain a second feature map, and fusing the first and second feature maps to obtain a deep feature map; and inputting the shallow and deep feature maps into an image segmentation model to obtain a segmentation prediction result. The method can better identify and classify objects, reduce model complexity, improve running speed and accuracy, better understand and identify image content, improve the semantic segmentation of tongue pictures, provide a more accurate and objective diagnostic basis, and facilitate traditional Chinese medicine diagnosis and health monitoring.

Description

Tongue picture semantic segmentation method, device, equipment and medium
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a tongue picture semantic segmentation method, device, equipment and medium.
Background
Semantic segmentation is a technique in the field of computer vision whose main purpose is to divide an image into distinct regions and to label these regions with distinct semantic categories. Compared with conventional image segmentation techniques, semantic segmentation allows the different regions in an image to be understood more accurately and matched to their corresponding semantic categories. The semantic segmentation technology therefore has wide application prospects in many visual scenes.
In recent years, tongue image segmentation has been a research hotspot in the field of traditional Chinese medicine modernization. Many scholars have carried out related research on tongue image segmentation and tried various segmentation methods, and some research results have been achieved. The existing DeepLabV3+ model can perform semantic segmentation, but the model has a large number of parameters and its edge segmentation precision is not high, so a better segmentation effect cannot be achieved.
Therefore, how to improve the existing model so as to improve the effect of tongue picture semantic segmentation is an urgent problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a tongue picture semantic segmentation method, apparatus, device and medium, so as to solve the problem of how to improve the existing model and thereby improve the effect of tongue picture semantic segmentation.
In a first aspect, a tongue picture semantic segmentation method is provided, where the tongue picture semantic segmentation method includes:
obtaining a target tongue picture, and carrying out feature extraction pretreatment on the target tongue picture to obtain a shallow feature map of the target tongue picture;
carrying out multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture, and carrying out channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue picture;
performing channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue picture, up-sampling the integrated feature map to obtain a second feature map of the target tongue picture, and performing feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue picture;
and inputting the shallow feature map and the deep feature map into a preset image segmentation model for semantic segmentation tasks to obtain a segmentation prediction result of the target tongue picture.
In a second aspect, a tongue picture semantic segmentation device is provided, where the tongue picture semantic segmentation device includes:
the shallow feature map acquisition module is used for acquiring a target tongue picture, and carrying out feature extraction pretreatment on the target tongue picture to obtain a shallow feature map of the target tongue picture;
The integrated feature map acquisition module is used for carrying out multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture, and carrying out channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue picture;
the deep feature map acquisition module is used for carrying out channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue picture, upsampling the integrated feature map to obtain a second feature map of the target tongue picture, and carrying out feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue picture;
the prediction result generation module is used for inputting the shallow feature map and the deep feature map into a preset image segmentation model for semantic segmentation tasks to obtain a segmentation prediction result of the target tongue picture.
In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor implements the tongue semantic segmentation method according to the first aspect when the processor executes the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the computer program, when executed by a processor, implements the tongue semantic segmentation method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program product, which when run on a computer device, causes the computer device to perform the tongue semantic segmentation method according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that: through the steps, the invention can fully analyze the characteristics of the image on different scales, better identify and classify objects, reduce the complexity of a model, improve the running speed and accuracy, better understand and identify the image content, effectively improve the semantic segmentation of tongue image pictures, accurately segment each part of the tongue, provide more accurate and objective diagnosis basis, facilitate traditional Chinese medicine diagnosis and health detection, and reduce the influence of human factors on diagnosis results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a tongue semantic segmentation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a tongue semantic segmentation method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a tongue semantic segmentation method according to a third embodiment of the present invention;
fig. 4 is a flow chart of a tongue picture semantic segmentation method according to a fourth embodiment of the present invention;
FIG. 5 is a flowchart of a tongue semantic segmentation method according to a fifth embodiment of the present invention;
FIG. 6 is a flowchart of a tongue semantic segmentation method according to a sixth embodiment of the present invention;
fig. 7 is a flow chart of a tongue semantic segmentation method according to a seventh embodiment of the present invention;
FIG. 8 is a flowchart of a tongue semantic segmentation method according to an eighth embodiment of the present invention;
FIG. 9 is a schematic diagram of a model architecture of a tongue semantic segmentation method according to a ninth embodiment of the present invention;
fig. 10 is a schematic structural diagram of a tongue image semantic segmentation device according to a tenth embodiment of the present invention;
fig. 11 is a schematic structural diagram of a computer device according to an eleventh embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
It should be understood that the sequence numbers of the steps in the following embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
The tongue picture semantic segmentation method provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, in which a client communicates with a server. The clients include, but are not limited to, palmtop computers, desktop computers, notebook computers, ultra-mobile personal computers (UMPC), netbooks, cloud computing devices, personal digital assistants (PDA), and other computing devices. The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers.
Referring to fig. 2, a flow chart of a tongue picture semantic segmentation method according to a second embodiment of the present invention is provided, where the tongue picture semantic segmentation method may be applied to the client in fig. 1, and a user uses the client to analyze a tongue picture, and the tongue picture semantic segmentation method is performed in the client to output a result of tongue picture semantic segmentation. As shown in fig. 2, the tongue semantic segmentation method may include the following steps:
step S201, a target tongue picture is obtained, and feature extraction pretreatment is carried out on the target tongue picture to obtain a shallow feature map of the target tongue picture.
The target tongue picture can be obtained through a public database or a study, or can be obtained directly through image acquisition of tongue of a patient through medical equipment, or can be obtained through uploading after manual photographing by a user.
The shallow feature map is a feature map obtained by preliminary processing of the target tongue picture and can specifically present the detailed features of the original tongue image. Here, "shallow" is only a distinguishing name rather than a statement about the essential nature of the feature map; the shallow feature map is the image obtained by performing feature extraction preprocessing on the captured image.
The feature extraction preprocessing may include the steps of size adjustment, color normalization, denoising, etc. for feature extraction, where the extracted features generally include: color, texture, shape, etc. The feature extraction preprocessing can obtain an image which is mainly represented by some features, and particularly can be an image expressed in a matrix form.
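As a concrete illustration only, and not a required implementation, a minimal preprocessing routine of this kind might be coded as follows, assuming Python with OpenCV and NumPy; the target size, blur kernel and normalization constants are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess_tongue_image(path, size=(512, 512)):
    """Resize, denoise and color-normalize a tongue picture (illustrative parameters)."""
    img = cv2.imread(path)                      # BGR uint8 image
    img = cv2.resize(img, size)                 # size adjustment
    img = cv2.GaussianBlur(img, (3, 3), 0)      # simple denoising
    img = img.astype(np.float32) / 255.0        # scale to [0, 1]
    mean = np.array([0.485, 0.456, 0.406])      # per-channel normalization (assumed values)
    std = np.array([0.229, 0.224, 0.225])
    img = (img - mean) / std                    # color normalization
    return img.transpose(2, 0, 1)               # HWC -> CHW matrix form
```

In practice the normalization constants and denoising step would be chosen to suit the acquisition device rather than taken from this sketch.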
Step S202, performing multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture, and performing channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue picture.
The multi-scale feature extraction refers to processing the shallow feature map multiple times, using different scales or parameters for each pass, and finally combining the different processing results to obtain the multi-scale feature map of the target tongue picture. When performing the multi-scale feature extraction processing on the shallow feature map, the shallow feature map can be reduced or enlarged to different degrees so as to obtain image feature information at different scales, and feature representations at different levels can also be obtained by adjusting structural parameters of the network, such as the network depth, the convolution kernel size and the stride.
A large number of feature channels may be generated in the process of performing multi-scale feature extraction, which increases complexity and computational burden of subsequent processing, so that channel dimension reduction processing is required, the number of feature channels is reduced, and key information is retained. The integrated feature map contains important information of tongue images on different scales and channels, and can provide comprehensive feature representation for subsequent analysis.
Channel dimension reduction processing refers to reducing the dimension of the channel axis of the data, where the channel generally refers to the number of feature maps in the input data. The purpose of channel dimension reduction is to reduce the complexity and the number of parameters of the model, thereby improving the efficiency of model training and inference. It can also effectively prevent overfitting.
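A common way to realize this kind of channel dimension reduction is a 1×1 convolution; the sketch below, assuming PyTorch, compresses an illustrative 1280-channel multi-scale feature map to 256 channels without changing the spatial resolution.

```python
import torch
import torch.nn as nn

# 1x1 convolution that compresses, e.g., 1280 multi-scale channels down to 256
reduce_channels = nn.Sequential(
    nn.Conv2d(1280, 256, kernel_size=1, bias=False),  # channel dimension reduction
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
)

multi_scale = torch.randn(1, 1280, 32, 32)   # dummy multi-scale feature map
integrated = reduce_channels(multi_scale)    # -> (1, 256, 32, 32), spatial size unchanged
```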
And step S203, performing channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue picture, up-sampling the integrated feature map to obtain a second feature map of the target tongue picture, and performing feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue picture.
The first feature map refers to an image or map representation obtained after performing channel dimension reduction processing on a shallow feature map, for example, the shallow feature map is a feature map of 3 channels, and the shallow feature map can be processed by performing channel dimension reduction processing on the shallow feature map to obtain a corresponding first feature map by using 2 channels or 1 channel.
The second feature map refers to an image or map representation obtained after upsampling the integrated feature map, where the integrated feature map is the image or map representation obtained after performing channel dimension reduction processing on the multi-scale feature map. It should be understood that the size of the integrated feature map is smaller than that of the first feature map, so the integrated feature map needs to be upsampled, and a second feature map of the same size as the first feature map is obtained through interpolation or transposed convolution.
Two feature maps of the same size can be fused so that the information of the first feature map and the second feature map is combined into a richer, multi-dimensional deep feature map. The fusion can be performed by addition fusion, splicing (concatenation) fusion, weighted fusion and the like, and the fusion mode can be selected according to actual needs.
The deep feature map not only contains texture and color information of the original target tongue picture, but also can contain semantic information after multi-level processing and fusion.
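As an illustration of step S203, the sketch below (assuming PyTorch; channel counts and sizes are made-up values) upsamples the integrated feature map to the size of the first feature map and fuses the two by splicing, one of the fusion modes mentioned above.

```python
import torch
import torch.nn.functional as F

def fuse_features(first_map, integrated_map):
    """Upsample the integrated feature map and concatenate it with the first feature map."""
    second_map = F.interpolate(integrated_map,
                               size=first_map.shape[-2:],
                               mode="bilinear", align_corners=False)  # upsampling
    return torch.cat([first_map, second_map], dim=1)                  # splicing fusion

first = torch.randn(1, 48, 128, 128)       # shallow map after channel reduction
integrated = torch.randn(1, 256, 32, 32)   # integrated multi-scale map
deep = fuse_features(first, integrated)    # -> (1, 304, 128, 128) deep feature map
```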
Step S204, the shallow feature map and the deep feature map are input into a preset image segmentation model for semantic segmentation task, and a segmentation prediction result of the target tongue picture is obtained.
The image segmentation model may be a pre-trained model, including a machine learning model, a neural network model, etc., and the collected tongue image images are selected as a training set during training, so that the trained model can be used for segmentation of tongue images. In this embodiment, the method is specifically implemented by accurately dividing different areas of the tongue picture, for example, dividing a tongue body and a background in the tongue picture, extracting the tongue body, and the like.
Inputting the deep feature map into the image segmentation model yields a preliminary segmentation prediction result. Because global features may not be fully taken into account when segmentation is performed on the deep features alone, feature information can be lost; to avoid this loss, the shallow feature map is also used to refine the generated segmentation prediction result, which improves the completeness of the tongue picture segmentation.
According to the tongue picture semantic segmentation method, the multi-scale extraction processing makes full use of the image features at different scales so that objects are better identified and classified; the dimension reduction processing reduces the complexity of the model and improves the running speed and accuracy; and combining shallow and deep information allows the image content to be better understood and identified, effectively improving the semantic segmentation of the tongue picture and accurately segmenting each part of the tongue. This provides a more accurate and objective diagnostic basis, facilitates traditional Chinese medicine diagnosis and health monitoring, and, because the target tongue picture is automatically analyzed and processed through machine learning and deep learning, reduces the influence of human factors on the diagnostic result.
Referring to fig. 3, a flowchart of a tongue image semantic segmentation method according to a third embodiment of the present invention is shown in fig. 3, in step S202, a multi-scale feature extraction process is performed on a shallow feature map to obtain a multi-scale feature map of a target tongue image, where the method includes:
In step S301, the shallow feature map is passed through a pooling layer and at least one convolution layer, respectively.
Wherein the shallow feature map needs to pass through a pooling layer and one or more convolution layers. The pooling layer is generally used for downsampling, which reduces the dimension of the feature map while retaining important information, lowers the amount of computation and the risk of overfitting, and promotes higher-level features; the convolution layers can capture the detail information in the image, and several convolution layers can be connected to further extract features, so that the complex features in the tongue picture can be better learned and understood.
And step S302, when the number of the convolution layers is greater than two, each convolution layer is used for performing expansion convolution processing of different expansion rates on the shallow feature map to obtain each convolution feature map of the target tongue picture.
The expansion convolution is a special convolution operation, the expansion rate is introduced in the convolution process, the step length of the expansion of the convolution kernel on the input feature graph can be controlled, when a plurality of convolution layers exist, different expansion rates can be set according to requirements, and each convolution layer can expand the receptive field in different modes, so that features are extracted from different angles and scales. The dilation convolution process for each dilation rate can produce a convolution signature.
Step S303, a pooling layer is used for carrying out average pooling treatment on the shallow feature map to obtain a pooled feature map of the target tongue picture.
The shallow feature map is subjected to global average pooling treatment, so that the dimension of the feature map is further reduced, important information is reserved, and more concise and efficient feature representation is provided for subsequent image segmentation and classification tasks.
And step S304, carrying out feature fusion on each convolution feature map and the pooled feature map to obtain a multi-scale feature map.
The feature fusion is to combine feature graphs from different levels to obtain a multi-scale feature representation, perform feature fusion on each processed feature graph, and fuse the embodied features together to form a feature graph with multiple scales. The specific fusion method can be to connect the feature graphs of different layers in series in the channel dimension, or to make the feature graphs consistent in size through up-sampling and down-sampling.
For example, four convolution layers may be provided: one 1×1 convolution layer and three 3×3 dilated convolution layers with dilation rates of 3, 6 and 9 respectively, together with one pooling layer; the pooling layer and each convolution layer each generate a processed feature map.
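The parallel branches just described resemble an ASPP-style structure; a minimal PyTorch sketch under the configuration of this example (one 1×1 convolution, three 3×3 dilated convolutions with rates 3, 6 and 9, and a global average pooling branch) is given below, with illustrative channel counts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleExtractor(nn.Module):
    """Parallel 1x1 conv, dilated 3x3 convs (rates 3/6/9) and global average pooling."""
    def __init__(self, in_ch=320, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=3, dilation=3)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=9, dilation=9)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)          # pooled feature map
        branches = [self.branch1(x), self.branch2(x),
                    self.branch3(x), self.branch4(x), pooled]
        return torch.cat(branches, dim=1)                    # multi-scale feature map
```

Setting the padding equal to the dilation rate keeps each branch's output at the input resolution, so the five feature maps can be concatenated directly.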
According to the tongue picture semantic segmentation method, pooling and dilated convolutions with different dilation rates are used to capture multi-scale features, so that the tongue picture can be better understood and analyzed; the dimension of the feature map is reduced, which saves computing resources and time and improves processing efficiency; and a certain flexibility is provided in the choice of convolution layers.
Referring to fig. 4, a flow chart of a tongue image semantic segmentation method provided in a fourth embodiment of the present invention is shown in fig. 4, and after performing expansion convolution processing with different expansion rates on a shallow feature map in step S302, obtaining each convolution feature map of a target tongue image map, the method further includes:
step S401, for each convolution layer, calculating the channel dimension attention weight of the convolution feature map according to a preset channel attention module, and calculating the space dimension attention weight of the convolution feature map according to a preset space attention module.
Each convolution layer can output a convolution feature map, the output convolution feature map is input into an attention fusion module, and attention weights in the channel dimension and the space dimension of the convolution feature map are calculated respectively. The method for calculating the attention weight of the channel dimension comprises global average pooling, self-attention mechanism and the like, and the method for calculating the attention weight of the space dimension comprises position embedding, space pyramid pooling and the like.
Step S402, the convolution feature map is processed by using the channel dimension attention weight and the space dimension attention weight to obtain a weighted feature map of the convolution feature map.
And multiplying the calculated attention weight by the convolution feature map, and adjusting the output of the convolution layer by using the weight to obtain a weighted feature map of the convolution feature map. Specifically, the weights may be multiplied by the channel characteristics to obtain weighted channel characteristics, and similarly, for each location in the feature map, the attention weight of the spatial dimension may be used to weight the characteristics of that location.
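In essence, step S402 amounts to two element-wise multiplications; a self-contained sketch with dummy tensors, assuming PyTorch and taking the two attention weights as already computed (they are derived in the following two embodiments), is:

```python
import torch

conv_feat = torch.randn(1, 256, 64, 64)   # convolution feature map
channel_w = torch.rand(1, 256, 1, 1)      # channel dimension attention weight
spatial_w = torch.rand(1, 1, 64, 64)      # spatial dimension attention weight

weighted = conv_feat * channel_w          # weight each channel
weighted = weighted * spatial_w           # weight each spatial position -> weighted feature map
```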
According to the tongue image semantic segmentation method, weighting the convolution feature map with the attention weights allows the structure and semantic information of the input data to be captured better, so that the model can understand and process the input data better, improving the performance and accuracy of the model.
Referring to fig. 5, a flowchart of a tongue image semantic segmentation method provided in a fifth embodiment of the present invention is shown in fig. 5, and in step S402, processing a convolution feature map by using a channel dimension attention weight includes:
step S501, carrying out average pooling operation on the convolution feature map to obtain a first pooling result.
And carrying out average pooling operation on the convolution characteristic diagram, carrying out average operation on each local area in the characteristic diagram to obtain an average value in the area as output, and replacing each pixel point in the characteristic diagram with an average value so as to reduce the dimension of the characteristic diagram.
Step S502, performing maximum pooling operation on the convolution feature map to obtain a second pooling result.
The maximum pooling is to select the maximum value in the area covered by the pooling kernel as the output, and compared with the average pooling, the maximum pooling replaces each pixel point with the maximum value in the area, and the mutation information in the extracted feature map is more focused.
Step S503, inputting the first pooling result and the second pooling result into the first multi-layer perceptron to obtain a first output feature map and a second output feature map respectively.
The first Multi-Layer Perceptron (MLP) further processes and analyzes the first pooling result and the second pooling result to give the first output feature map and the second output feature map respectively. The specific operation process is as follows: layer-by-layer convolution and pooling operations are performed on the input feature map to extract deeper features; in each layer an activation function applies a nonlinear transformation to the result, increasing the expressive capacity of the model; and in the last layer the feature map is flattened into a one-dimensional vector and a fully connected operation is performed to obtain the output result.
Step S504, the first output feature diagram and the second output feature diagram are added to obtain the channel dimension attention weight.
The first output feature map and the second output feature map are combined to obtain a richer feature fusion result; each channel is weighted and summed, and the attention weight of the channel dimension is determined so as to emphasize important channels and suppress unimportant ones. The calculated channel dimension attention weight can further be used to guide how the model distributes attention over the feature map, improving the performance and generalization capability of the model.
Step S505, multiplying the channel dimension attention weight with the convolution feature map to obtain a channel attention mechanism fusion feature map.
The channel attention mechanism is a method for enabling the model to pay attention to important channel information, and by distributing different weights to each channel, the model can pay more attention to important characteristics, so that the performance of the model is improved. Multiplying the obtained channel dimension attention weight with the convolution feature map to enable the model to pay attention to important channels in the feature map better, and inhibit irrelevant or redundant channels, so that the performance and generalization capability of the model are improved.
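A minimal sketch of a CBAM-style channel attention computation corresponding to steps S501 to S505, assuming PyTorch; the shared MLP is realized with 1×1 convolutions, and the reduction ratio of 16 is an illustrative choice rather than a value fixed by this embodiment.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Average-pool and max-pool the feature map, pass both through a shared MLP,
    add the outputs and use the result to weight the channels (steps S501-S505)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # first pooling result
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # second pooling result
        self.mlp = nn.Sequential(                 # first multi-layer perceptron (shared)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        weight = self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * weight                         # channel attention mechanism fusion feature map
```

In this sketch the sigmoid squashes the summed MLP outputs into weights between 0 and 1 before they multiply the channels of the convolution feature map; the reduction ratio trades expressiveness against parameter count.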
According to the tongue picture semantic segmentation method, attention fusion is carried out on the channel dimension, the representation capacity and the classification accuracy of the model are effectively improved, average pooling and maximum pooling are combined, the feature map dimension is reduced, and meanwhile important information in an image is reserved, so that tongue picture images are segmented better.
Referring to fig. 6, a flowchart of a tongue semantic segmentation method according to a sixth embodiment of the present invention is shown in fig. 6, and after obtaining a channel attention mechanism fusion feature map in step S505, the method further includes:
and step S601, carrying out channel-based maximum pooling and average pooling operation on the channel attention mechanism fusion feature map to respectively obtain a maximum pooling result and an average pooling result.
The channel-based maximum pooling and average pooling operations follow the common pooling methods: maximum pooling selects the maximum value within the pooling window as the output, and average pooling computes the average value within the pooling window as the output, reducing the size of the feature map while keeping the important feature information. The specific operation is that the feature map is split along the channel direction, each channel corresponding to one feature map, and then maximum pooling and average pooling are applied respectively to obtain new feature vectors, namely the maximum pooling result and the average pooling result.
And step S602, combining the maximum pooling result and the average pooling result to obtain a spatial attention combination characteristic diagram.
Specifically, the maximum pooling result and the average pooling result are respectively flattened into one-dimensional vectors, the two one-dimensional vectors are spliced together to form an information feature vector, the feature vector contains the maximum value and the average value information of the channel attention mechanism fusion feature map on each channel, and the new feature vector is adjusted to form a new spatial attention merging feature map.
And step S603, performing convolution activation on the spatial attention merging feature map to obtain spatial dimension attention weight.
The convolution operation can be carried out on the spatial attention merging feature images, one convolution kernel can be used for carrying out one-time convolution, or a plurality of convolution kernels can be used for carrying out multiple-time convolution, the convolution operation can be used for extracting local features in the feature images and generating new feature images, after the convolution operation, nonlinear transformation is carried out on the feature images by using an activation function, and the feature images are processed by using a sigmoid activation function. Under the action of the activation function, each pixel point in the feature map is assigned a value that represents the attention weight of the pixel point in the spatial dimension.
Step S604, multiplying the spatial dimension attention weight and the channel attention mechanism fusion feature map to obtain a weighted feature map.
The calculated spatial dimension attention weight and the channel attention mechanism fusion feature map have the same size, and element-by-element multiplication is performed, namely the feature map of each channel is weighted, so that the model can pay more attention to important spatial features, and the weighted feature map is obtained through element-by-element multiplication, wherein the value of each pixel point is the product of the corresponding channel attention weight and the spatial dimension attention weight.
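A corresponding sketch of the spatial attention computation of steps S601 to S604, again in the CBAM style and assuming PyTorch; the 7×7 convolution kernel is an illustrative choice.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max and average pooling, concatenation, convolution and sigmoid
    produce a spatial weight map that re-weights every position (steps S601-S604)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # channel-based max pooling
        avg_map = torch.mean(x, dim=1, keepdim=True)     # channel-based average pooling
        merged = torch.cat([max_map, avg_map], dim=1)    # spatial attention merged feature map
        weight = self.sigmoid(self.conv(merged))         # spatial dimension attention weight
        return x * weight                                # weighted feature map
```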
The tongue image semantic segmentation method improves the characterization capability of the feature map, combines the maximum pooling and the average pooling, reduces the calculation amount of the subsequent convolution operation, improves the running efficiency of the model, can better process various different image data, and improves the generalization capability of the model.
Referring to fig. 7, a flowchart of a tongue picture semantic segmentation method provided in a seventh embodiment of the present invention is shown in fig. 7, and in step S204, a shallow feature map and a deep feature map are input into a preset image segmentation model for a semantic segmentation task to obtain a segmentation prediction result of a target tongue picture, where the method includes:
And step S701, carrying out convolution feature extraction on the deep feature map to obtain a feature representation result of the deep feature map.
Further processing operation is carried out on the deep feature map, the deep feature map after the convolution feature extraction can obtain more effective feature representation, and local areas of each feature map are convolved to extract more discriminative features.
In step S702, point sampling is performed around the pixel points of the feature representation result, so as to obtain a plurality of preset sampling points.
The preset sampling points target locations where the feature representation result is relatively fuzzy and the semantic information cannot be clearly determined. The points can be selected uniformly around the pixel points, and a specific sampling process can be designed according to actual requirements, controlling the density and distribution of the sampling points so that the important information in the feature representation result is fully covered. It should be noted that the number and distribution of sampling points need to be selected and adjusted according to the specific task and the characteristics of the data set to ensure optimal results.
Step S703, obtaining the feature vector of each preset sampling point according to the feature representation result.
It should be understood that the feature vector of each preset sampling point may be extracted from the feature representation result, and an appropriate feature vector extraction method and parameter setting may be selected according to the actual implementation, so as to improve the performance and accuracy of the classification task.
In one embodiment, as shown in fig. 8, obtaining the feature vector of each preset sampling point includes:
step S801, obtaining point coordinates of each preset sampling point according to the characteristic representation result;
step S802, sampling the shallow feature map according to the point coordinates to obtain a low-level feature vector of a preset sampling point;
step S803, sampling the feature representation result according to the point coordinates to obtain a high-level feature vector of a preset sampling point;
step S804, combining the low-level feature vector and the high-level feature vector to obtain the feature vector of the preset sampling point.
The position of a preset sampling point (i.e., its point coordinate) is determined from the feature representation result. Since the size of the shallow feature map is consistent with the size of the feature representation result, the feature vector at the corresponding position in the shallow feature map, i.e., the low-level feature vector of the preset sampling point, can be obtained according to the point coordinate, and the feature vector at the corresponding position in the feature representation result, i.e., the high-level feature vector, can likewise be obtained according to the point coordinate. The low-level feature vector and the high-level feature vector of the same preset sampling point are combined to obtain the feature vector of the preset sampling point.
By fusing the high-level feature vector and the low-level feature vector, feature information of different levels can be obtained, the essential features of images or data can be better understood, the representation capability of the features is effectively improved, different tasks and data sets can be processed more flexibly, the comprehensiveness and accuracy of feature representation are improved, and the generalization capability of a model is enhanced.
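One possible way to gather and combine the low-level and high-level feature vectors of the preset sampling points, assuming PyTorch and point coordinates normalized to [0, 1], is sketched below; the use of grid_sample is an assumption made for illustration, not the only admissible sampling method.

```python
import torch
import torch.nn.functional as F

def point_features(shallow, deep_repr, coords):
    """Sample feature vectors at the given points from the shallow feature map (low-level)
    and the feature representation result (high-level) and concatenate them.

    coords: (N, P, 2) point coordinates normalized to [0, 1]."""
    grid = 2.0 * coords.unsqueeze(2) - 1.0                                   # -> [-1, 1], shape (N, P, 1, 2)
    low = F.grid_sample(shallow, grid, align_corners=False).squeeze(-1)      # (N, C_low, P)
    high = F.grid_sample(deep_repr, grid, align_corners=False).squeeze(-1)   # (N, C_high, P)
    return torch.cat([low, high], dim=1)          # combined feature vector per sampling point
```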
In step S704, the feature vector is calculated by the second multi-layer perceptron to obtain a segmentation prediction result.
The multi-layer perceptron is composed of a plurality of neurons, each of which receives an input signal and outputs a value; through training, the multi-layer perceptron can learn the mapping relation from the input feature vector to the output segmentation result. The training can be performed using optimization algorithms such as the back-propagation algorithm, and the error between the prediction result and the actual segmentation result can be minimized by continuously adjusting the weight parameters of the second multi-layer perceptron. It should be noted that the configuration and parameter settings of the second multi-layer perceptron need to be selected and adjusted according to the specific task and the characteristics of the data set. Finally, a segmentation prediction result of the same size as the original image is obtained through the calculation.
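The second multi-layer perceptron can be sketched as a small per-point network, here assuming PyTorch with 1×1 one-dimensional convolutions acting as shared fully connected layers over the points; the layer widths and class count are illustrative.

```python
import torch
import torch.nn as nn

class PointHead(nn.Module):
    """Per-point MLP: maps the combined feature vector of each preset sampling point
    to segmentation logits (the second multi-layer perceptron)."""
    def __init__(self, in_dim=304, hidden=256, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, num_classes, 1),   # per-point class scores
        )

    def forward(self, point_feats):              # point_feats: (N, in_dim, P)
        return self.mlp(point_feats)             # (N, num_classes, P)
```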
In an embodiment, the obtained segmentation prediction result may be further subjected to iterative repair processing, so as to continuously optimize the segmentation result, thereby achieving a more accurate effect. The segmentation result is updated continuously, the optimal solution is gradually approached, the weight parameters of the second multi-layer perceptron can be adjusted according to the difference between the current segmentation prediction result and the actual segmentation result, so that a more accurate segmentation result is obtained, then the more accurate segmentation result is used as a new input feature vector, and the calculation of the second multi-layer perceptron is performed again, so that a further optimized segmentation prediction result is obtained. In this way, the error between the prediction result and the actual segmentation result is gradually reduced, and the accuracy and stability of segmentation are improved. In addition, the number of iterative repair processing and parameter setting need to be selected and adjusted according to the specific task and the characteristics of the data set so as to achieve the optimal segmentation effect.
As shown in fig. 9, which is a schematic diagram of the model architecture of the tongue picture semantic segmentation method, the target tongue picture is input into the lightweight network MobileNetV2 to obtain the shallow feature map. The shallow feature map is then input, in parallel, into a 1×1 convolution layer (Conv), a 3×3 convolution layer with dilation rate 3, a 3×3 convolution layer with dilation rate 6, a 3×3 convolution layer with dilation rate 9, and an average pooling layer, and the feature maps output by these branches are fused to obtain the multi-scale feature map. After the shallow feature map passes through a convolution layer, the output feature map can be processed with CBAM (Convolutional Block Attention Module) to integrate channel attention and spatial attention; the feature map produced by the pooling layer does not need CBAM processing. Channel dimension reduction processing is performed on the multi-scale feature map to obtain the integrated feature map, and the upsampled integrated feature map (i.e., the second feature map) is merged (Concat) with the shallow feature map after channel dimension reduction processing (i.e., the first feature map) to obtain the deep feature map. The deep feature map, after a convolution layer, and the shallow feature map are input into the point rendering (PointRend) module, which refines and outputs the segmentation prediction result of the target tongue picture.
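Putting the pieces of fig. 9 together, a highly condensed forward pass might look as follows, assuming PyTorch, the multi-scale module sketched for the third embodiment, and a MobileNetV2-style backbone supplied externally; the CBAM and PointRend stages are omitted for brevity and all channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TongueSegNet(nn.Module):
    """Condensed sketch of the fig. 9 pipeline: backbone -> multi-scale extraction ->
    channel reduction -> upsample + concat -> segmentation head."""
    def __init__(self, backbone, multi_scale, num_classes=2):
        super().__init__()
        self.backbone = backbone                       # e.g. MobileNetV2 feature extractor
        self.multi_scale = multi_scale                 # module sketched for the third embodiment
        self.reduce_ms = nn.Conv2d(5 * 256, 256, 1)    # integrate multi-scale channels
        self.reduce_shallow = nn.Conv2d(320, 48, 1)    # first feature map
        self.head = nn.Conv2d(304, num_classes, 3, padding=1)

    def forward(self, image):
        shallow = self.backbone(image)                            # shallow feature map
        integrated = self.reduce_ms(self.multi_scale(shallow))    # integrated feature map
        first = self.reduce_shallow(shallow)
        second = F.interpolate(integrated, size=first.shape[-2:],
                               mode="bilinear", align_corners=False)
        deep = torch.cat([first, second], dim=1)                  # deep feature map
        coarse = self.head(deep)                                  # coarse prediction
        return F.interpolate(coarse, scale_factor=4, mode="bilinear",
                             align_corners=False)                 # prediction at input scale
```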
According to the tongue picture semantic segmentation method, some preset sampling points are selected for feature vector calculation, the obtained segmentation prediction result is more accurate, the segmentation accuracy and stability are improved, and convenience is provided for tongue picture analysis of traditional Chinese medicine.
Referring to fig. 10, a schematic structural diagram of a tongue picture semantic segmentation device according to a tenth embodiment of the present application is provided. Based on the tongue picture semantic segmentation method, the tongue picture semantic segmentation device according to the tenth embodiment includes:
the shallow feature map obtaining module 101, which is used to obtain a target tongue picture and perform feature extraction preprocessing on the target tongue picture to obtain a shallow feature map of the target tongue picture;
the integrated feature map obtaining module 102 is configured to perform multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue image map, and perform channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue image map;
the deep feature map obtaining module 103 is configured to perform channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue image map, upsample the integrated feature map to obtain a second feature map of the target tongue image map, and perform feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue image map;
The prediction result generation module 104 is configured to input the shallow feature map and the deep feature map to a preset image segmentation model for a semantic segmentation task, so as to obtain a segmentation prediction result of the target tongue picture.
Optionally, the integrated feature map obtaining module 102 includes:
the feature map processing submodule is used for respectively passing the shallow feature map through a pooling layer and at least one convolution layer;
when the number of the convolution layers is greater than two, each convolution layer is used for performing expansion convolution processing of different expansion rates on the shallow feature images to obtain each convolution feature image of the target tongue image;
the pooling layer is used for carrying out average pooling treatment on the shallow feature images to obtain pooled feature images of the target tongue picture;
and the feature fusion sub-module is used for carrying out feature fusion on each convolution feature map and the pooled feature map to obtain a multi-scale feature map.
Optionally, the feature map processing submodule includes:
the weight calculation unit is used for calculating the channel dimension attention weight of the convolution feature map according to a preset channel attention module and calculating the space dimension attention weight of the convolution feature map according to a preset space attention module for each convolution layer;
The weighted feature map obtaining unit is used for processing the convolution feature map by utilizing the channel dimension attention weight and the space dimension attention weight to obtain a weighted feature map of the convolution feature map.
Optionally, the weighted feature map acquiring unit includes:
the average pooling subunit is used for carrying out average pooling operation on the convolution characteristic diagram to obtain a first pooling result;
the maximum pooling subunit is used for carrying out maximum pooling operation on the convolution characteristic diagram to obtain a second pooling result;
the characteristic diagram output subunit is used for inputting the first pooling result and the second pooling result into the first multi-layer perceptron to respectively obtain a first output characteristic diagram and a second output characteristic diagram;
the channel attention weight calculation subunit is used for adding the first output feature diagram and the second output feature diagram to obtain channel dimension attention weight;
and the channel attention fusion subunit is used for multiplying the channel dimension attention weight with the convolution feature map to obtain a channel attention mechanism fusion feature map.
Optionally, the channel attention fusion subunit further includes:
the channel pooling subunit is used for performing channel-based maximum pooling and average pooling operations on the channel attention mechanism fusion feature map to obtain a maximum pooling result and an average pooling result respectively;
The pool combining subunit is used for combining the maximum pooling result and the average pooling result to obtain a spatial attention combining characteristic diagram;
the space attention weight acquisition subunit is used for performing convolution activation on the space attention merging feature map to obtain space dimension attention weight;
and the weighted feature map acquisition subunit is used for multiplying the spatial dimension attention weight and the channel attention mechanism fusion feature map to obtain a weighted feature map.
Optionally, the prediction result generating module 104 includes:
the feature extraction submodule is used for carrying out convolution feature extraction on the deep feature map to obtain a feature representation result of the deep feature map;
the point sampling sub-module is used for sampling points around the pixel points of the characteristic representation result to obtain a plurality of preset sampling points;
the feature vector acquisition sub-module is used for acquiring the feature vector of each preset sampling point according to the feature representation result;
and the prediction result acquisition sub-module is used for calculating the feature vector through the second multi-layer perceptron to obtain a segmentation prediction result.
Optionally, the feature vector obtaining submodule includes:
the point coordinate acquisition unit is used for acquiring the point coordinate of each preset sampling point according to the characteristic representation result;
The low-level feature vector acquisition unit is used for sampling the shallow feature map according to the point coordinates to obtain a low-level feature vector of a preset sampling point;
the high-level feature vector acquisition unit is used for sampling the feature representation result according to the point coordinates to obtain a high-level feature vector of a preset sampling point;
and the feature vector acquisition unit is used for combining the low-layer feature vector and the high-layer feature vector to obtain the feature vector of the preset sampling point.
It should be noted that, because the content of information interaction and execution process between the modules and the embodiment of the method of the present invention are based on the same concept, specific functions and technical effects thereof may be referred to in the method embodiment section, and details thereof are not repeated herein.
Fig. 11 is a schematic structural diagram of a computer device according to an eleventh embodiment of the present invention. As shown in fig. 11, the computer device of this embodiment includes: at least one processor (only one is shown in fig. 11), a memory, and a computer program stored in the memory and executable on the at least one processor, the processor executing the computer program to perform the steps of any of the various tongue semantic segmentation method embodiments described above.
The computer device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that fig. 11 is merely an example of a computer device and is not intended to be limiting; a computer device may include more or fewer components than shown, may combine certain components, or may use different components, and may for example also include a network interface, a display screen, an input device, and the like.
The processor may be a CPU, but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory includes a readable storage medium, an internal memory, etc., where the internal memory may be the memory of the computer device, the internal memory providing an environment for the execution of an operating system and computer-readable instructions in the readable storage medium. The readable storage medium may be a hard disk of a computer device, and in other embodiments may be an external storage device of the computer device, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. that are provided on the computer device. Further, the memory may also include both internal storage units and external storage devices of the computer device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs such as program codes of computer programs, and the like. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention.
For the specific working process of the units and modules in the above device, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated herein. The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it may implement the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunications signals.
The present invention may also be implemented as a computer program product which, when run on a computer device, causes the computer device to execute all or part of the steps of the method embodiments described above.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative; the division into modules or units is merely a logical functional division, and there may be other divisions in actual implementation, e.g. multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the scope of the present invention.

Claims (10)

1. A tongue picture semantic segmentation method, characterized by comprising the following steps:
obtaining a target tongue picture, and carrying out feature extraction preprocessing on the target tongue picture to obtain a shallow feature map of the target tongue picture;
performing multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture, and performing channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue picture;
performing channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue picture, up-sampling the integrated feature map to obtain a second feature map of the target tongue picture, and performing feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue picture;
and inputting the shallow feature map and the deep feature map into a preset image segmentation model for semantic segmentation tasks to obtain a segmentation prediction result of the target tongue picture.
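By way of a non-limiting illustration, the following PyTorch-style sketch shows one possible data flow for the steps recited in claim 1. The placeholder backbone, the stand-in multi-scale module, the channel counts (256, 1280, 48) and the bilinear interpolation mode are assumptions made for readability; they are not part of the claimed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TongueSegPipeline(nn.Module):
    """Illustrative data flow only; sub-modules and channel sizes are assumptions."""
    def __init__(self, num_classes=2):
        super().__init__()
        # "feature extraction preprocessing" -> shallow feature map (placeholder backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # "multi-scale feature extraction" (stand-in; see the dilated-convolution sketch under claim 2)
        self.multi_scale = nn.Conv2d(256, 1280, 3, padding=1)
        self.reduce_multi = nn.Conv2d(1280, 256, 1)    # channel reduction -> integrated feature map
        self.reduce_shallow = nn.Conv2d(256, 48, 1)    # channel reduction -> first feature map
        self.head = nn.Sequential(
            nn.Conv2d(48 + 256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, x):
        shallow = self.backbone(x)                                  # shallow feature map
        integrated = self.reduce_multi(self.multi_scale(shallow))   # integrated feature map
        first = self.reduce_shallow(shallow)                        # first feature map
        second = F.interpolate(integrated, size=first.shape[-2:],
                               mode="bilinear", align_corners=False)  # up-sampled second feature map
        deep = torch.cat([first, second], dim=1)                    # feature fusion -> deep feature map
        logits = self.head(deep)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```

The fusion step simply concatenates the dimension-reduced shallow features with the up-sampled integrated features, mirroring the first/second feature map fusion recited above.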
2. The tongue picture semantic segmentation method according to claim 1, wherein the performing multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture comprises:
respectively passing the shallow feature map through a pooling layer and at least one convolution layer;
when the number of the convolution layers is greater than two, each convolution layer is used for performing dilated convolution processing with different dilation rates on the shallow feature map to obtain each convolution feature map of the target tongue picture;
the pooling layer is used for carrying out average pooling processing on the shallow feature map to obtain a pooled feature map of the target tongue picture;
and carrying out feature fusion on each convolution feature map and the pooled feature map to obtain the multi-scale feature map.
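The sketch below illustrates one way to realise the multi-scale extraction of claim 2 with parallel dilated convolutions and an average-pooling branch. The dilation rates (6, 12, 18), the extra 1×1 branch, and the channel counts are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleExtractor(nn.Module):
    """Parallel dilated convolutions plus a global average-pooling branch,
    fused along the channel axis (assumed ASPP-style arrangement)."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +   # assumed extra 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False) for r in rates]
        )
        self.pool_branch = nn.Sequential(                  # average-pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
        )

    def forward(self, shallow):
        h, w = shallow.shape[-2:]
        feats = [branch(shallow) for branch in self.branches]          # convolution feature maps
        pooled = F.interpolate(self.pool_branch(shallow), size=(h, w),
                               mode="bilinear", align_corners=False)   # pooled feature map
        return torch.cat(feats + [pooled], dim=1)                      # multi-scale feature map
```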
3. The tongue picture semantic segmentation method according to claim 2, wherein after performing dilated convolution processing with different dilation rates on the shallow feature map to obtain each convolution feature map of the target tongue picture, the method further comprises:
for each convolution layer, calculating the channel dimension attention weight of the convolution feature map according to a preset channel attention module, and calculating the space dimension attention weight of the convolution feature map according to a preset space attention module;
and processing the convolution feature map by using the channel dimension attention weight and the space dimension attention weight to obtain a weighted feature map of the convolution feature map.
4. The tongue picture semantic segmentation method according to claim 3, wherein the processing the convolution feature map by using the channel dimension attention weight comprises:
carrying out average pooling operation on the convolution feature map to obtain a first pooling result;
performing maximum pooling operation on the convolution feature map to obtain a second pooling result;
inputting the first pooling result and the second pooling result into a first multi-layer perceptron to obtain a first output characteristic diagram and a second output characteristic diagram respectively;
adding the first output feature map and the second output feature map to obtain the channel dimension attention weight;
multiplying the channel dimension attention weight with the convolution feature map to obtain a channel attention mechanism fusion feature map.
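A minimal sketch of the channel-dimension attention outlined in claim 4 follows. The shared two-layer perceptron, the reduction ratio of 16, and the sigmoid used to turn the summed outputs into a weight are conventional assumptions rather than requirements of the claim.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-dimension attention weight from pooled statistics (claim 4, assumed details)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # -> first pooling result
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # -> second pooling result
        self.mlp = nn.Sequential(                  # shared "first multi-layer perceptron"
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, conv_feat):
        avg_out = self.mlp(self.avg_pool(conv_feat))   # first output feature map
        max_out = self.mlp(self.max_pool(conv_feat))   # second output feature map
        weight = torch.sigmoid(avg_out + max_out)      # channel dimension attention weight
        return weight * conv_feat                      # channel attention mechanism fusion feature map
```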
5. The tongue picture semantic segmentation method according to claim 4, further comprising, after obtaining the channel attention mechanism fusion feature map:
carrying out maximum pooling and average pooling operations on the channel attention mechanism fusion feature map to obtain a maximum pooling result and an average pooling result, respectively;
combining the maximum pooling result and the average pooling result to obtain a spatial attention combination feature map;
performing convolution activation on the spatial attention merging feature map to obtain the spatial dimension attention weight;
multiplying the spatial dimension attention weight by the channel attention mechanism fusion feature map to obtain the weighted feature map.
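A minimal sketch of the spatial-dimension attention outlined in claim 5 is given below. The 7×7 convolution kernel and the sigmoid activation are conventional assumptions and are not dictated by the claim.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial-dimension attention weight over the channel-attention fusion map (claim 5, assumed details)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, channel_fused):
        max_out, _ = channel_fused.max(dim=1, keepdim=True)   # channel-wise maximum pooling result
        avg_out = channel_fused.mean(dim=1, keepdim=True)     # channel-wise average pooling result
        merged = torch.cat([max_out, avg_out], dim=1)         # spatial attention merged feature map
        weight = torch.sigmoid(self.conv(merged))             # spatial dimension attention weight
        return weight * channel_fused                         # weighted feature map
```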
6. The tongue picture semantic segmentation method according to claim 1, wherein the step of inputting the shallow feature map and the deep feature map to a preset image segmentation model for a semantic segmentation task to obtain a segmentation prediction result of the target tongue picture comprises:
performing convolution feature extraction on the deep feature map to obtain a feature representation result of the deep feature map;
performing point sampling around the pixel points of the feature representation result to obtain a plurality of preset sampling points;
according to the feature representation result, obtaining a feature vector of each preset sampling point;
and calculating the feature vector through a second multi-layer perceptron to obtain the segmentation prediction result.
7. The tongue picture semantic segmentation method according to claim 6, wherein the obtaining the feature vector of each preset sampling point comprises:
acquiring point coordinates of each preset sampling point according to the characteristic representation result;
sampling the shallow feature map according to the point coordinates to obtain a low-level feature vector of the preset sampling point;
sampling the feature representation result according to the point coordinates to obtain a high-level feature vector of the preset sampling point;
and combining the low-level feature vector and the high-level feature vector to obtain the feature vector of the preset sampling point.
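The following sketch illustrates the point-sampling prediction head outlined in claims 6 and 7, built on bilinear grid sampling. The uniform random choice of sampling points, the number of points, and the sizes of the second multi-layer perceptron are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_at_points(feature_map, point_coords):
    """Bilinearly sample a feature map at normalized point coordinates in [0, 1].
    feature_map: (N, C, H, W); point_coords: (N, P, 2) as (x, y)."""
    grid = 2.0 * point_coords.unsqueeze(2) - 1.0                 # -> (N, P, 1, 2) in [-1, 1]
    sampled = F.grid_sample(feature_map, grid, align_corners=False)
    return sampled.squeeze(3).transpose(1, 2)                    # -> (N, P, C)

class PointHead(nn.Module):
    """Per-point prediction from combined low-level and high-level feature vectors
    (claims 6-7; channel sizes and sampling strategy are assumptions)."""
    def __init__(self, low_ch=256, high_ch=256, num_classes=2, num_points=1024):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(                                # "second multi-layer perceptron"
            nn.Linear(low_ch + high_ch, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, shallow, feature_repr):
        n = shallow.size(0)
        coords = torch.rand(n, self.num_points, 2, device=shallow.device)  # preset sampling points
        low = sample_at_points(shallow, coords)        # low-level feature vectors
        high = sample_at_points(feature_repr, coords)  # high-level feature vectors
        vec = torch.cat([low, high], dim=-1)           # combined feature vector per point
        return self.mlp(vec), coords                   # per-point segmentation prediction
```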
8. A tongue picture semantic segmentation device, characterized by comprising:
the shallow feature map acquisition module is used for acquiring a target tongue picture, and carrying out feature extraction pretreatment on the target tongue picture to obtain a shallow feature map of the target tongue picture;
the integrated feature map acquisition module is used for carrying out multi-scale feature extraction processing on the shallow feature map to obtain a multi-scale feature map of the target tongue picture, and carrying out channel dimension reduction processing on the multi-scale feature map to obtain an integrated feature map of the target tongue picture;
the deep feature map acquisition module is used for carrying out channel dimension reduction processing on the shallow feature map to obtain a first feature map of the target tongue picture, up-sampling the integrated feature map to obtain a second feature map of the target tongue picture, and carrying out feature fusion on the first feature map and the second feature map to obtain a deep feature map of the target tongue picture;
the prediction result generation module is used for inputting the shallow feature image and the deep feature image into a preset image segmentation model for a semantic segmentation task to obtain a segmentation prediction result of the target tongue picture.
9. A computer device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor implements the tongue picture semantic segmentation method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the tongue picture semantic segmentation method according to any one of claims 1 to 7.
CN202410063634.5A 2024-01-17 2024-01-17 Tongue picture semantic segmentation method, device, equipment and medium Pending CN117576405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410063634.5A CN117576405A (en) 2024-01-17 2024-01-17 Tongue picture semantic segmentation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410063634.5A CN117576405A (en) 2024-01-17 2024-01-17 Tongue picture semantic segmentation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117576405A true CN117576405A (en) 2024-02-20

Family

ID=89864846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410063634.5A Pending CN117576405A (en) 2024-01-17 2024-01-17 Tongue picture semantic segmentation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117576405A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022216521A1 (en) * 2021-11-10 2022-10-13 Innopeak Technology, Inc. Dual-flattening transformer through decomposed row and column queries for semantic segmentation
CN114549563A (en) * 2022-02-26 2022-05-27 福建工程学院 Real-time composite insulator segmentation method and system based on deep LabV3+
CN114708493A (en) * 2022-02-26 2022-07-05 上海大学 Traditional Chinese medicine crack tongue diagnosis portable device and using method
CN114581432A (en) * 2022-03-18 2022-06-03 河海大学 Tongue appearance tongue image segmentation method based on deep learning
CN114998580A (en) * 2022-04-24 2022-09-02 长安大学 Fracture semantic segmentation method based on deep LabV3+ network model
CN116543448A (en) * 2022-07-14 2023-08-04 南京航空航天大学 Automatic human body tracking method under fixed background of passive terahertz image
CN115760866A (en) * 2022-11-29 2023-03-07 青岛农业大学 Crop drought detection method based on remote sensing image
CN116977631A (en) * 2023-06-30 2023-10-31 长春工业大学 Streetscape semantic segmentation method based on DeepLabV3+

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晓明 (WANG Xiaoming) et al.: "基于改进DeepLabv3+的接触网开口销缺陷检测" [Cotter pin defect detection for the catenary based on improved DeepLabv3+], 《华东交通大学学报》 (Journal of East China Jiaotong University), vol. 40, no. 5, 31 October 2023 (2023-10-31), pages 120-126 *

Similar Documents

Publication Publication Date Title
US11887311B2 (en) Method and apparatus for segmenting a medical image, and storage medium
CN113077471B (en) Medical image segmentation method based on U-shaped network
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
Olimov et al. FU-Net: fast biomedical image segmentation model based on bottleneck convolution layers
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
CN112330684B (en) Object segmentation method and device, computer equipment and storage medium
CN111382616B (en) Video classification method and device, storage medium and computer equipment
US20200111214A1 (en) Multi-level convolutional lstm model for the segmentation of mr images
Shu et al. LVC-Net: Medical image segmentation with noisy label based on local visual cues
CN113838067A (en) Segmentation method and device of lung nodule, computing equipment and storable medium
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
Cui et al. A gender classification method for Chinese mitten crab using deep convolutional neural network
CN115601551A (en) Object identification method and device, storage medium and electronic equipment
CN111833363B (en) Image edge and saliency detection method and device
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction
CN117576405A (en) Tongue picture semantic segmentation method, device, equipment and medium
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN117036658A (en) Image processing method and related equipment
CN114913339A (en) Training method and device of feature map extraction model
Bajaj et al. Diabetic retinopathy stage classification
CN112990215B (en) Image denoising method, device, equipment and storage medium
CN115984583B (en) Data processing method, apparatus, computer device, storage medium, and program product
CN116958535B (en) Polyp segmentation system and method based on multi-scale residual error reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination