CN116797614A - CBAUnet-based double-attention rapid tongue contour extraction method and system - Google Patents

CBAUnet-based double-attention rapid tongue contour extraction method and system

Info

Publication number
CN116797614A
CN116797614A (application CN202310294304.2A)
Authority
CN
China
Prior art keywords
attention
cbaunet
module
tongue
map
Prior art date
Legal status
Granted
Application number
CN202310294304.2A
Other languages
Chinese (zh)
Other versions
CN116797614B (en)
Inventor
王新强
路文焕
刘佳
韦钰
郝丽燕
Current Assignee
Tianjin Binhai Xunteng Technology Group Co ltd
Tianjin University
Tianjin Sino German University of Applied Sciences
Original Assignee
Tianjin Binhai Xunteng Technology Group Co ltd
Tianjin University
Tianjin Sino German University of Applied Sciences
Priority date
Filing date
Publication date
Application filed by Tianjin Binhai Xunteng Technology Group Co ltd, Tianjin University, Tianjin Sino German University of Applied Sciences
Priority to CN202310294304.2A priority Critical patent/CN116797614B/en
Publication of CN116797614A publication Critical patent/CN116797614A/en
Application granted granted Critical
Publication of CN116797614B publication Critical patent/CN116797614B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The application discloses a CBAUnet-based double-attention rapid tongue contour extraction method and system. The method comprises the steps of acquiring an original tongue ultrasonic image dataset; preprocessing the original ultrasonic image dataset; inputting the preprocessed data into a CBAUnet network, encoding the preprocessed ultrasonic image, and obtaining feature maps containing information at different scales by using the dual-attention mechanism of AG gated attention and CBAM attention in a comprehensive attention module; and aggregating the target feature information according to the feature maps, after which the decoder decodes stage by stage to obtain a contour map with restored pixels. On the basis of a simplified original U-Net network, the CBAM module and the AG module are introduced and connected in parallel to form a comprehensive attention module with a dual-attention function, and this attention module, which receives information at different scales, replaces the original skip connections, further improving the feature representation capability.

Description

CBAUnet-based double-attention rapid tongue contour extraction method and system
Technical Field
The application relates to the technical field of contour extraction, in particular to a CBAUnet-based double-attention rapid tongue contour extraction method.
Background
Ultrasound is clean, safe and inexpensive, and can image the tongue and the oral cavity. Accurately extracting the tongue contour from such images allows doctors to observe the articulation of patients with abnormal pronunciation or language disorders caused by disease, provides a pronunciation reference for confidential occasions, and allows tongue features to be fed as biological signals into a silent speech interface. In short, the extraction of the ultrasonic tongue contour provides a powerful guarantee for person-to-person speech communication.
Studies have shown that the tongue contour is a good starting point for quantitative speech research, and information derived from the tongue contour can support a deeper understanding and the development of pronunciation models. Because ultrasound can dynamically depict the tongue positions belonging to different sounds and characterize the tongue movements that produce sound transitions during pronunciation, almost all applications related to silent speech involve the extraction of ultrasonic tongue contours, which is a fundamental and necessary operation. The extraction accuracy of the ultrasonic tongue contour affects the accuracy of the whole speech task, and the real-time performance of the extraction affects the efficiency of the whole process. It is therefore important to explore an accurate and rapid ultrasonic tongue contour tracking and extraction method.
At present, automatic tracking of tongue contours remains extremely challenging. From the perspective of the ultrasonic tongue imaging process, a high degree of speckle noise accompanies the whole procedure; the hyoid bone and jawbone sometimes block the ultrasound; differences in the reflectivity of the tongue's own muscle fibers make the echo path incomplete, resulting in an incomplete sagittal contour; imaging of the soft tissue structure of the tongue contains artifacts, and the contour may even be completely absent when the tongue position changes. From the perspective of the extraction method, the accuracy of the tongue contour fit depends largely on the quality of the ultrasound data and the type of contour tracking algorithm. Meanwhile, the semi-automatic or manual nature of many extraction methods prevents high extraction speeds. Current research results rarely discuss speed, and a uniform industry specification has not yet been established; only the study of document 3 records a tongue contour extraction speed, of 29.8 fps.
A variety of techniques exist for tongue contour tracking in ultrasound images, such as active contour models, graph-based techniques and machine-learning-based methods. In these studies, manual labelling is required at least for initialization, so even well-known software packages such as EdgeTrak cannot track tongue contours in real time. The advent of deep learning methods has attracted considerable attention from researchers. Convolutional neural networks are considered powerful enough to be used in feature extraction studies such as ultrasonic tongue contour tracking, and deep belief networks and deep auto-encoders have shown good results. Subsequently, researchers found that the accuracy of deep learning methods is highly correlated with the size of the training dataset and the complexity of the deep network model, so there is always a trade-off between the number of training samples and the number of network parameters. A higher-precision extraction result requires the segmentation network to acquire sufficient semantic information and rich detail information; if high-precision segmentation is achieved by deepening the network and increasing the resolution of the input image, the amount of computation increases sharply and segmentation efficiency decreases. The U-Net network, even with little labeled training data, achieves good results in segmenting medical images, to the extent that it has become a de facto standard for medical image segmentation. However, its deep architecture with multiple internal layers is very costly in terms of computing resources during training and testing, which is a serious problem for real-time tracking of ultrasonic tongue contours.
The desire for minimal computational resource consumption has in recent years drawn deep learning research toward attention mechanisms. The attention mechanism plays a vital role in human perception. Through an attention module, a deep convolutional neural network can accelerate the learning process, extract more key features for the target task and enhance the robustness of the network model. Kaul et al. proposed the FocusNet method, which injects attention, generated by a separate convolutional auto-encoder from a feature map, into a fully convolutional neural network performing medical image segmentation. In the literature, a method that adds Attention Gates (AG) to the U-Net skip connections has been proposed to improve prediction accuracy and sensitivity in a pancreas segmentation protocol. In other work, SENet was proposed to adaptively recalibrate channel feature responses by explicitly modelling the inter-dependencies between channels. The Convolutional Block Attention Module (CBAM) has also been proposed as a lightweight, generic module that consumes almost no computational resources and can perform adaptive refinement of feature maps based on given intermediate features. For the ultrasonic tongue contour extraction task, the tongue contour line occupies only a very small area of the whole image. Focusing attention on this small target region can accelerate training, enhance the object representation in the region and highlight feature details; adding an attention mechanism to the network therefore reduces the attention paid to irrelevant background elements by weighting the features, and speeds up the learning of tongue contour features. However, for this task the ultrasonic tongue contour map may have blurred boundaries and irregular shapes, so a single attention mechanism struggles to perform well in the tongue contour segmentation task.
Disclosure of Invention
Therefore, the application aims to provide a CBAUnet-based double-attention rapid tongue contour extraction method, which redesigns the internal structure of the U-Net network together with a comprehensive attention learning module, and embeds the output of the comprehensive attention learning module into the redesigned network so that the tongue contour can be rapidly segmented and extracted.
In order to achieve the above object, the method for extracting a double-attention quick tongue profile based on CBAUnet according to the present application comprises the following steps:
s1, acquiring an original tongue ultrasonic image dataset;
s2, preprocessing an original ultrasonic image data set;
s3, inputting the preprocessed data into a CBAUnet network; after encoding the preprocessed ultrasonic image, obtaining feature maps containing information at different scales by using the dual-attention mechanism of AG gated attention and CBAM attention in a comprehensive attention module;
and S4, aggregating the target feature information according to the feature maps, and then decoding stage by stage with a decoder to obtain a contour map with restored pixels.
Further preferably, in S1, the ultrasound image dataset comprises an NS dataset, a TJU dataset and a TIMIT dataset.
Further preferably, in S2, the process of preprocessing the ultrasound image dataset includes the steps of:
normalizing the acquired data set, and uniformly adjusting the size of the picture to 96 pixels by 96 pixels after normalization;
carrying out random rotation and random flipping of the normalized pictures during training by using the transforms package;
in the training process, adjusting hue, saturation, brightness and contrast according to random probability;
and labeling the adjusted image to form a labeled data set.
Further preferably, in S3, after encoding the preprocessed ultrasound image, feature maps containing information at different scales are obtained by using the dual-attention mechanism of AG gated attention and CBAM attention in the comprehensive attention module, comprising the following steps:
removing one convolution layer from the encoding convolution block and the decoding convolution block of each stage of the traditional U-Net network, and embedding the comprehensive attention module in the traditional U-Net network to form the CBAUnet network;
connecting the AG gated attention and the CBAM attention in parallel in the comprehensive attention module; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively;
the spatial attention map and the channel attention map are then transmitted to the decoding convolution blocks of the corresponding level for decoding.
Further preferably, the output of the AG gated attention is expressed by the following formulas:

$$q_{att}^{l}=\psi^{T}\left(\sigma_{1}\left(W_{x}^{T}x_{i}^{l}+W_{g}^{T}g_{i}+b_{g}\right)\right)+b_{\psi}$$
$$\alpha_{i}^{l}=\sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l},g_{i};\Theta_{att}\right)\right)$$
$$\hat{x}_{i}^{l}=x_{i}^{l}\cdot\alpha_{i}^{l}$$

wherein $\sigma_{2}$ is the sigmoid activation function; $\Theta_{att}$ denotes the set of parameters characterizing the AG, and the linear transformations are computed as 1 × 1 convolutions of the input tensors in the channel direction; $\sigma_{1}$ corresponds to the ReLU function; $W_{x}$, $W_{g}$ and $\psi$ are linear transformation matrices and $b_{g}$, $b_{\psi}$ are bias terms; $g_{i}$ is the gating vector used for each pixel $i$ to determine the focus region.
Further preferably, the CBAM attention generates the channel attention map based on the inter-channel relationship of the features, comprising the following steps:
aggregating the spatial information of the feature map using average pooling and maximum pooling operations to generate two different spatial context descriptors, $F_{avg}^{c}$ and $F_{max}^{c}$, representing the average-pooled feature and the max-pooled feature, respectively;
forwarding the two descriptors to a shared network to generate the channel attention map.
Further preferably, the CBAM attention generates the spatial attention map based on the inter-spatial relationship of the features, comprising the following steps:
aggregating the channel information of the feature map using average pooling and maximum pooling operations to generate two effective two-dimensional feature descriptors, $F_{avg}^{s}$ and $F_{max}^{s}$;
concatenating the descriptors and convolving them with a standard convolution layer to generate the 2D spatial attention map.
The application also provides a CBAUnet-based dual-attention rapid tongue profile extraction system, which comprises: the system comprises a data acquisition module, a data preprocessing module and a CBAUnet network;
the data acquisition module is used for acquiring an original tongue ultrasonic image data set;
the data preprocessing module is used for preprocessing an original ultrasonic image data set;
the CBAUnet network comprises an encoding block, a decoding block and a comprehensive attention module; the encoding block is used for encoding the preprocessed ultrasonic image to form encoding information;
the comprehensive attention module is used for obtaining feature maps containing information at different scales based on the dual-attention mechanism of AG gated attention and CBAM attention;
and the decoder is used for aggregating the target feature information according to the feature maps and then decoding stage by stage to obtain a contour map with restored pixels.
Further, the ultrasound image dataset includes an NS dataset, a TJU dataset, and a TIMIT dataset.
Further, the AG gated attention and the CBAM attention are connected in parallel in the comprehensive attention module; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively; the spatial attention map and the channel attention map are then transmitted to the decoding convolution blocks of the corresponding level for decoding.
Compared with the prior art, the CBAUnet-based double-attention rapid tongue contour extraction method and system disclosed by the application have at least the following advantages:
the application adopts the double-attention quick tongue profile extraction in the CBAUnet, changes the structure of the original Unet network, and adds AG gate-controlled attention and CBAM attention in the source network to form the CBAUnet; the method has the advantages that the key features are rapidly positioned by adopting a double-attention mechanism, the connection and convolution operation in the modules are tiny in operation amount for the convenience of further segmentation and extraction, the consumption of calculation resources can be ignored, the segmentation speed of tongue contours is accelerated by the cooperation of the two attention modules, and the segmentation accuracy is improved.
Drawings
FIG. 1 is a flow chart of a CBAUnet-based dual-attention rapid tongue contour extraction method of the present application;
FIG. 2 is a block diagram of a CBAUnet network provided by the application;
FIG. 3 is a block diagram of a comprehensive attention module diagram in accordance with an embodiment of the present application;
FIG. 4 is a diagram of an AG gated attention module in accordance with an embodiment of the present application;
FIG. 5 is a diagram of a channel attention module in CBAM attention of the present application;
FIG. 6 is a diagram of the spatial attention module in CBAM attention of the present application;
FIG. 7 is a display view of an ultrasound image dataset in accordance with the present application;
FIG. 8 is a graph showing the effect of preprocessing data in an ultrasound image dataset in accordance with the present application;
FIG. 9 (a) is a graph showing a IoU index comparison of five networks of the present application over three data sets;
FIG. 9 (b) is a graph showing comparison of loss indicators of five networks on three data sets according to the present application;
FIG. 9 (c) is a graph showing a comparison of Time indicators for five networks on three data sets in accordance with the present application;
FIG. 10 is a graph comparing the performance of networks on an ultrasound tongue dataset;
FIG. 11 (a) is a graph showing statistics of accuracy of five networks of the present application in comparison to ablation experiments;
FIG. 11 (b) is a graph showing time statistics of five networks according to the present application in comparison to ablation experiments;
fig. 12 is a schematic diagram of the test results of the ablation experiment of five networks in the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and the detailed description.
As shown in fig. 1, a CBAUnet-based dual-attention rapid tongue profile extraction method provided by an embodiment of the present application includes the following steps:
s1, acquiring an original tongue ultrasonic image dataset; the ultrasound image dataset includes an NS dataset, a TJU dataset, and a TIMIT dataset.
S2, preprocessing an original ultrasonic image data set;
s3, inputting the preprocessed data into the CBAUnet network; after encoding the preprocessed ultrasonic image, obtaining feature maps containing information at different scales by using the dual-attention mechanism of AG gated attention and CBAM attention in the comprehensive attention module;
and S4, aggregating the target feature information according to the feature maps, and then decoding stage by stage with the decoder to obtain a contour map with restored pixels.
In S2, the process of preprocessing the ultrasound image dataset includes the steps of:
normalizing the acquired data set, and uniformly adjusting the size of the picture to 96 pixels by 96 pixels after normalization;
carrying out random rotation and random flipping of the normalized pictures during training by using the transforms package;
in the training process, adjusting hue, saturation, brightness and contrast according to random probability;
and labeling the adjusted image to form a labeled data set.
In S3, after encoding the preprocessed ultrasound image, the dual-attention mechanism of AG gated attention and CBAM attention in the comprehensive attention module is used to obtain feature maps containing information at different scales, comprising the following steps:
as shown in fig. 2, one convolution layer is removed from the encoding convolution block and the decoding convolution block of each stage of the conventional U-Net network, and the comprehensive attention module is embedded in the conventional U-Net network to form the CBAUnet network;
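As an illustration of this simplification, the following is a minimal PyTorch sketch of one encoding/decoding stage reduced to a single convolution; the use of batch normalization and ReLU after the convolution, and the class name, are assumptions not specified in the text.

```python
import torch.nn as nn

class SimplifiedConvBlock(nn.Module):
    """One encoding/decoding stage of the simplified U-Net: a single 3x3 convolution
    instead of the two stacked convolutions of the original U-Net stage."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),   # assumed normalization, not stated in the text
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```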
the AG gated attention and the CBAM attention are connected in parallel in the comprehensive attention module; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively;
the spatial attention map and the channel attention map are then transmitted to the decoding convolution blocks of the corresponding level for decoding.
The AG uses the relevant information extracted at a coarser scale to eliminate noisy and irrelevant responses in the skip connection. In addition, the AG merges the related activations immediately before the concatenation operation, filtering neuron activations during both forward and backward propagation; after the complementary information from each sub-AG encoding and decoding path has been extracted and fused, the output of the skip connection is concatenated. Like non-local blocks, the AG applies linear transformations without any spatial support, and downsampling the gating signal lowers the resolution of the input feature map, thereby reducing the parameters of the network model and the computational resource consumption. The CBAM module can effectively emphasize or suppress and refine the content and location of intermediate features, thereby directing the network to focus correctly on the target object. Given an input image, the module computes complementary attention through its channel and spatial attention sub-modules, which focus respectively on the content and the position of the feature. The channel attention module focuses on the study object, the ultrasonic tongue contour, while the spatial attention focuses on the pixel positions corresponding to the white sagittal tongue contour line.
The comprehensive attention module formed by the parallel connection of AG and CBAM not only processes the features intensively in their respective dimensions, but their cooperation also enables accurate and rapid localization of the extracted features. The comprehensive attention module is shown in fig. 3: the internal structure of the AG is within dashed box a, the internal structure of the CBAM is within dashed box b, the channel attention part of the CBAM is within solid box c, and the spatial attention part of the CBAM is within solid box d. The two main internal modules of the comprehensive attention module are described in detail below.
The AG gated attention selects spatial regions by analyzing the contextual information and the activations provided by the gating signal g collected from a coarser scale. The internal structure of the AG attention module is shown in FIG. 4: the input feature $x^{l}$ is scaled by the attention coefficient $\alpha$ on a resampling grid, using trilinear interpolation. The gating vector $g_{i}$ is used for each pixel $i$ to determine the focus region, and the calculation is given in equations (1) and (2):

$$q_{att}^{l}=\psi^{T}\left(\sigma_{1}\left(W_{x}^{T}x_{i}^{l}+W_{g}^{T}g_{i}+b_{g}\right)\right)+b_{\psi}\qquad(1)$$
$$\alpha_{i}^{l}=\sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l},g_{i};\Theta_{att}\right)\right)\qquad(2)$$

wherein $\sigma_{1}$ corresponds to the ReLU function and $\sigma_{2}$ corresponds to the sigmoid activation function. The linear transformations $W_{x}$, $W_{g}$, $\psi$ and the bias terms $b_{g}$, $b_{\psi}\in\mathbb{R}$ form the set of parameters $\Theta_{att}$ characterizing the AG, and each linear transformation is computed as a 1 × 1 convolution of the input tensor in the channel direction. The linear mapping projects the concatenated features $x^{l}$ and $g$ into an intermediate space of dimension $F_{int}$, which is referred to as vector-concatenation-based attention. The output of the AG is the product of the input element map and the attention coefficient, as given in equation (3):

$$\hat{x}_{i}^{l}=x_{i}^{l}\cdot\alpha_{i}^{l}\qquad(3)$$

wherein $\hat{x}_{i}^{l}$ is the product of the input element map $x_{i}^{l}$ and the attention coefficient $\alpha_{i}^{l}$, and $q_{att}^{l}$ denotes the additive attention. By default the AG computes a single scalar attention value for every pixel vector $x_{i}^{l}\in\mathbb{R}^{F_{l}}$, where $F_{l}$ corresponds to the number of feature maps in layer $l$. The AG attention module adaptively adjusts and automatically learns to focus on target structures of different shapes and sizes in the medical image, and through implicit learning the AG-filtered model highlights features useful for a particular task while suppressing irrelevant regions in the input image. For the ultrasonic tongue contour extraction task, the AG helps the extraction network to effectively highlight the key contour while suppressing irrelevant pixels outside the key features.
The CBAM attention generates the channel attention map based on the inter-channel relationship of the features, as follows:
the spatial information of the feature map is aggregated using average pooling and maximum pooling operations to generate two different spatial context descriptors, $F_{avg}^{c}$ and $F_{max}^{c}$, representing the average-pooled feature and the max-pooled feature, respectively;
the two descriptors are then forwarded to the shared network to generate the channel attention map.
The shared network is a multi-layer perceptron (MLP) with one hidden layer. To reduce the parameter overhead, the hidden activation size is set to $\mathbb{R}^{C/r\times1\times1}$, where $r$ is the reduction ratio. After the shared network has been applied to each descriptor, the output feature vectors are merged using element-wise summation.
The channel attention is computed as in equation (4):

$$M_{c}(F)=\sigma\left(MLP\left(AvgPool(F)\right)+MLP\left(MaxPool(F)\right)\right)=\sigma\left(W_{1}\left(W_{0}\left(F_{avg}^{c}\right)\right)+W_{1}\left(W_{0}\left(F_{max}^{c}\right)\right)\right)\qquad(4)$$

wherein $\sigma$ denotes the sigmoid function, $W_{0}\in\mathbb{R}^{C/r\times C}$ and $W_{1}\in\mathbb{R}^{C\times C/r}$ are the two weights of the shared MLP applied to both inputs, with a ReLU activation function following $W_{0}$; AvgPool(F) denotes average pooling and MaxPool(F) denotes maximum pooling; $F_{avg}^{c}$ denotes the average-pooled feature and $F_{max}^{c}$ the max-pooled feature. The channel attention module is shown in fig. 5.
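A compact PyTorch sketch of this channel attention computation is shown below, assuming a reduction ratio r = 16 (a common default; the text does not state the value used).

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP: W_1(ReLU(W_0(.)))
            nn.Linear(channels, channels // r),   # W_0
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W_1
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))             # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))              # MLP(MaxPool(F))
        m_c = self.sigmoid(avg + mx).view(b, c, 1, 1)  # M_c(F), equation (4)
        return f * m_c                                 # channel-refined feature map
```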
the CBAM attention generation channel attention map based on the channel relation of the feature includes the steps of:
the channel information of the feature map is aggregated using an averaging pooling and a max pooling operation, two-dimensional valid channel feature descriptors are generated, and />
The channel feature descriptors are concatenated and convolved by a standard convolution layer to generate a 2D spatial attention map.
Unlike channel attention, spatial attention is focused on the location information of the target feature, which is complementary to channel attention. The internal structure of the spatial attention module is shown in fig. 6. First, two pooling operations are used along the channel axis to aggregate the channel information of the feature map, generating two-dimensional valid feature descriptors, i.e and />These features are then concatenated and convolved by a standard convolution layer to generate a 2D spatial attention map M s (F)∈R H×W The map encodes the emphasized or suppressed positions, enabling feature extraction. The spatial attention calculation mode is shown in formula (5).
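A matching PyTorch sketch of the spatial attention branch corresponding to equation (5) follows; the 7 × 7 kernel size of the standard convolution layer is an assumption, since the text does not specify it.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # the "standard" conv layer
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)      # F_avg^s: average over channels
        mx, _ = f.max(dim=1, keepdim=True)     # F_max^s: maximum over channels
        m_s = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F), equation (5)
        return f * m_s                         # spatially refined feature map
```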
In the network design, the encoded information is processed by the CBAM with respect to both the channel and the spatial relationships, the features to be extracted are characterized in the form of feature vectors, and the feature information transmitted to the decoding block is input only after the key positions of the original features have been determined; the full-scale information is not passed directly to the decoding block by a skip connection as in the original U-Net network. Therefore, after the CBAM is embedded, the U-Net directly receives the extracted key information during decoding and restores the pixels from the key information of that scale together with the decoding information of the other scales to complete the final prediction. For decoding, the CBAM accelerates the process, and at the same time the accuracy of pixel restoration is improved because the process is aided by the key localization provided by the CBAM.
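Putting the two branches together, a possible sketch of the comprehensive attention module that replaces the skip connection is shown below. How the parallel AG and CBAM outputs are fused is not spelled out in the text, so element-wise addition is used here purely as an assumption; AttentionGate, ChannelAttention and SpatialAttention refer to the earlier sketches.

```python
import torch.nn as nn

class ComprehensiveAttention(nn.Module):
    """Dual-attention skip connection: AG branch and CBAM branch in parallel (FIG. 3)."""
    def __init__(self, f_l, f_g, f_int, r=16):
        super().__init__()
        self.ag = AttentionGate(f_l, f_g, f_int)   # gated branch (dashed box a)
        self.ca = ChannelAttention(f_l, r)         # CBAM channel branch (solid box c)
        self.sa = SpatialAttention()               # CBAM spatial branch (solid box d)

    def forward(self, x_l, g):
        gated = self.ag(x_l, g)                    # AG-filtered skip features
        refined = self.sa(self.ca(x_l))            # CBAM-refined skip features
        return gated + refined                     # assumed fusion of the two parallel branches
```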
The application also provides a CBAUnet-based dual-attention rapid tongue profile extraction system, which comprises: the system comprises a data acquisition module, a data preprocessing module and a CBAUnet network;
the data acquisition module is used for acquiring an original tongue ultrasonic image data set;
the data preprocessing module is used for preprocessing an original ultrasonic image data set;
the CBAUnet network comprises an encoding block, a decoding block and a comprehensive attention module; the encoding block is used for encoding the preprocessed ultrasonic image to form encoding information;
the comprehensive attention module is used for obtaining feature maps containing information at different scales based on the dual-attention mechanism of AG gated attention and CBAM attention;
and the decoder is used for aggregating the target feature information according to the feature maps and then decoding stage by stage to obtain a contour map with restored pixels.
The ultrasound image dataset includes an NS dataset, a TJU dataset, and a TIMIT dataset.
The AG gated attention and the CBAM attention in the comprehensive attention module are connected in parallel; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively; the spatial attention map and the channel attention map are then transmitted to the decoding convolution blocks of the corresponding level for decoding.
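For orientation, a hypothetical end-to-end skeleton tying the pieces of the system together is sketched below: simplified encoder/decoder stages with the comprehensive attention module in place of the plain skip connections. The channel widths, number of stages and transposed-convolution upsampling are assumptions, and SimplifiedConvBlock and ComprehensiveAttention refer to the earlier sketches.

```python
import torch
import torch.nn as nn

class CBAUnetSketch(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, widths=(32, 64, 128, 256)):
        super().__init__()
        self.encoders = nn.ModuleList()                          # simplified encoding stages
        prev = in_ch
        for w in widths:
            self.encoders.append(SimplifiedConvBlock(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(2)
        self.attn = nn.ModuleList(                               # dual-attention skip connections
            [ComprehensiveAttention(f_l=w, f_g=widths[i + 1], f_int=w // 2)
             for i, w in enumerate(widths[:-1])]
        )
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(widths[i + 1], widths[i], 2, stride=2)
             for i in reversed(range(len(widths) - 1))]
        )
        self.decoders = nn.ModuleList(                           # simplified decoding stages
            [SimplifiedConvBlock(widths[i] * 2, widths[i])
             for i in reversed(range(len(widths) - 1))]
        )
        self.head = nn.Conv2d(widths[0], out_ch, kernel_size=1)  # 1-channel contour probability map

    def forward(self, x):
        skips = []
        for enc in self.encoders[:-1]:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.encoders[-1](x)                                 # bottleneck stage
        for i, (up, dec) in enumerate(zip(self.ups, self.decoders)):
            level = len(skips) - 1 - i
            attended = self.attn[level](skips[level], x)         # attention-filtered skip features
            x = dec(torch.cat([attended, up(x)], dim=1))         # pixel restoration at this stage
        return torch.sigmoid(self.head(x))
```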
The CBAUnet network provided by the present application will be tested for its various performances in specific embodiments.
The datasets used in this experiment include the NS dataset, the TJU dataset and the TIMIT dataset. The NS dataset consists of 3926 video frames of two native American English speakers, one male and one female, reciting "The North Wind and the Sun". The TJU dataset comes from a Tianjin University laboratory, where four male native Chinese speakers recorded ultrasonic tongue videos in real time while speaking in a professional noise-reduced room. The TIMIT dataset is a speech database of continuous English recorded by native speakers from different regions of the United States (covering several dialects). Each dataset contains a series of ultrasound frames of 480 pixels × 640 pixels capturing a mid-sagittal view of the tongue during speech. A single ultrasound frame is recorded as a 2D matrix, with each column representing the ultrasound reflection intensities along a single scan line.
In the ultrasound recordings, the tongue contour is created by the ultrasound reflected at the boundary between the tongue and the air above it, and appears as a bright band. The results of ultrasonic transducer imaging are shown in fig. 7: (a) to (c) are samples of the NS dataset, the TJU dataset and the TIMIT dataset, respectively, with the bright line in the middle being the sagittal tongue contour, running from the tongue root to the tongue tip from left to right. Visualizing the three datasets shows that, compared with the other two, the NS dataset has a more vivid contrast between the white sagittal contour and the background; in the TJU dataset the tongue contour line blends severely with the internal environment of the oral cavity, making its segmentation harder than for the other two datasets; in the TIMIT dataset the tongue contour line blends with the oral environment only to a low degree, but the line is thin and its clarity is not uniform along its length, so its segmentation as a whole is still difficult.
The data preprocessing procedure is as follows:
The white pixels of the tongue contour account for only 2% of all pixels in the image, and the soft tissue structure of the tongue lacks a reference, so the raw contour shows burrs, discontinuities and the like when no smoothing is applied, which affects the extraction of the ultrasonic tongue contour. When labelling the dataset, the labelling of details is therefore increased so that the ground truth contains more detailed knowledge, which improves the accuracy of feature extraction and in turn benefits later speech research. Meanwhile, in order to improve the recognition rate of the dataset in the network, a data augmentation operation is added in this experiment before the pictures are input into the network: the transforms package is applied to the loaded training images with random rotation and random flipping, and hue, saturation, brightness and contrast are adjusted according to random probabilities. The two kinds of data augmentation are implemented by the Compose and OneOf functions, respectively. In addition, in order for the network to achieve accurate feature extraction on ultrasonic datasets of different sizes and sources, normalization preprocessing is applied uniformly to the datasets before network training, and in this experiment the image size of the datasets is adjusted to 96 pixels × 96 pixels. A sample of the preprocessed tongue dataset is shown in fig. 8.
Experimental environment
The entire experiment was performed on an Nvidia Tesla V100 high-performance GPU training server with 32 GB of video memory, an 8-core CPU and 40 GB of memory. All training and testing were performed in the same hardware environment. The experiments use the Windows 10 operating system and Python 3.6 as the programming language, and the neural network structure design and model debugging were carried out under the PyTorch 1.6.0 open-source deep learning framework.
The loss, IoU and Time are used as indexes for judging the segmentation performance of the networks.
Regarding the indexes characterizing feature extraction accuracy, loss and IoU are used herein. The loss is a binary cross entropy and reflects the difference between the expected output and the actual output; IoU is an extremely important evaluation function in the field of object detection and intuitively reflects, at pixel level, the difference between the tongue contour segmentation result and the ground-truth mask. The loss is expressed by equation (6), where $N$ denotes the number of pixels in the image, $y_{i}$ is the true value of the $i$-th pixel and $\hat{y}_{i}$ is its predicted value:

$$loss=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log\hat{y}_{i}+\left(1-y_{i}\right)\log\left(1-\hat{y}_{i}\right)\right]\qquad(6)$$

IoU is expressed by equation (7), where $y^{*}$ denotes the prediction result, $\hat{y}$ denotes the ground truth, the intersection and union are taken over the pixels $c$, and $J(\cdot)$ denotes the IoU mapping:

$$IoU=J\left(y^{*},\hat{y}\right)=\frac{\left|y^{*}\cap\hat{y}\right|}{\left|y^{*}\cup\hat{y}\right|}\qquad(7)$$

The smaller the loss and the larger the IoU, the higher the accuracy with which the segmentation task is accomplished.
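The two accuracy indexes can be computed as in the following sketch; the binarization threshold of 0.5 and the small epsilon for numerical stability are assumptions.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred, target):
    """Equation (6): binary cross entropy between predicted probabilities and ground truth."""
    return F.binary_cross_entropy(pred, target)

def iou_score(pred, target, threshold=0.5, eps=1e-6):
    """Equation (7): intersection over union of the binarized prediction and the ground truth."""
    pred_bin = (pred > threshold).float()
    inter = (pred_bin * target).sum()
    union = pred_bin.sum() + target.sum() - inter
    return (inter + eps) / (union + eps)
```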
Regarding the evaluation index characterizing real-time performance, Time is used herein. All three datasets used in the experiment are picture sets: the videos are processed into datasets composed of the pictures corresponding to their frames, so all data input to the network are static pictures and all datasets used for testing are picture datasets. Therefore, under the existing datasets, the time to test one picture is used herein to characterize the testing speed, and this index is named Time, measured in milliseconds, representing the time taken to process each frame. Time characterizes the speed at which the segmentation task is completed and serves as one of the test indexes of the network; the shorter the time, the faster the picture processing speed and the better the real-time performance in practical applications.
The datasets are randomly split into a 50% training set, 20% validation set and 30% test set; the experiments use a batch size of 32, a learning rate of 0.001, 100 iterations, a momentum of 0.9 and the Adam optimizer. An early-stopping mechanism is introduced in the training process: the validation set is used to supervise the results, and the model with the smallest average loss on the validation set is saved as the optimal trained model for later testing.
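A hypothetical training-loop skeleton reflecting the stated hyperparameters is given below (the batch size of 32 is assumed to be set in the data loaders, the Adam beta1 of 0.9 is taken as the stated momentum, and the early-stopping patience of 10 is an assumption); bce_loss refers to the metric sketch above.

```python
import torch

def train(model, train_loader, val_loader, epochs=100, lr=1e-3, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for images, masks in train_loader:
            optimizer.zero_grad()
            loss = bce_loss(model(images), masks)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(bce_loss(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                   # keep the model with the smallest validation loss
            best_val, wait = val_loss, 0
            torch.save(model.state_dict(), "best_model.pth")
        else:
            wait += 1
            if wait >= patience:                  # early stopping
                break
    return best_val
```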
Since the U-Net network is currently the most popular medical segmentation network and is also the baseline model for many improved image recognition networks, several groups of comparison experiments were designed around the U-Net network. The accuracy and real-time performance of the feature extraction task on the three datasets are compared for several networks, namely Unet++, SA-Unet and SegAN, by computing the loss, IoU and Time indexes. An ablation experiment was then designed to verify the contributions of the simplified network, the AG module and the CBAM module to the improvement of U-Net performance in the CBAUnet network designed herein.
With the U-Net network as the reference network, Unet++, SA-Unet and SegAN were introduced for comparison experiments. The comparison results are shown in Table 1, which lists the IoU, loss and Time indexes of each comparison network on the different datasets. Fig. 9 visually illustrates, in bar-chart form, the comparison of the indexes of the five networks on the three datasets, where (a) is the IoU index, (b) is the loss index and (c) is the Time index. Fig. 10 shows the segmentation effect of the five comparison networks and contains three rows and seven columns. One frame of each of the three datasets was selected at random for testing: the first row shows frame 2138 of the NS dataset, the second row frame 266 of the TJU dataset and the third row frame 46 of TIMIT; the first column from the left shows the original of the selected sample, the second column the corresponding mask map, and the third to seventh columns the prediction maps of the Unet, Unet++, SA-Unet, SegAN and CBAUnet networks, respectively. Each network is analysed comprehensively using the three indexes and the segmentation effect maps, so that the network with a relatively good cost-performance ratio is obtained by balancing the criteria of high accuracy and high speed.
From the perspective of the datasets, the performance of the five groups of networks involved in the comparison experiments on the three datasets is: NS dataset > TIMIT dataset > TJU dataset. This fully demonstrates that the clarity of the main features in a dataset greatly affects the extraction by the network, and also confirms that the earlier assessment (section 3.1) of the relative difficulty of the feature extraction task on the three datasets was correct.
Table 1 results of comparative experiments
From the perspective of the comparison networks, Unet++ internally expands U-Net in terms of feature scale selection, and its nested network structure increases the richness of the semantic information to be understood, but its internal structure is finely divided and complex, and the number of parameters multiplies during operation, which is clearly a difficulty for ultrasonic tongue applications with strong real-time requirements. This can be seen in the Time index, where Unet++ is 7.85 ms slower than U-Net in test time on the NS dataset. SA-Unet is a reorganization of U-Net that takes the attention mechanism with spatial relationships into account. This network is a typical representative of improving U-Net from the attention perspective, and the results show that spatial attention alone, although taking slightly more time than U-Net, is superior to U-Net in accuracy. SegAN is a generative adversarial network with a segmentation function, which needs only a small number of samples to train a relatively stable feature extraction network, but it involves an intermediate process in which the discriminator is confronted with false samples, so the extraction task is delayed in time. Compared with the other networks, SegAN lags behind U-Net in the Time index, is on a par with SA-Unet, and is second only to Unet++ in time consumption.
The method proposed herein can be understood as adding the AG module on the basis of the SA-Unet network and replacing the original spatial attention module with the CBAM module, which attends to both spatial and channel relationships. Such an operation appears to add to the internal structure of the network, yet in fact it achieves the goal of rapid extraction, as the experimental results show. The AG module and the CBAM module act within the network to accelerate the localization of key features, thereby facilitating further segmentation and extraction. The concatenation and convolution operations inside the modules involve a very small amount of computation, so their consumption of computational resources is negligible. In this way, both modules increase the accuracy of ultrasonic tongue contour extraction without increasing the running time of the network. Furthermore, on the basis of the simplified network, the cooperation of the two attention modules accelerates the segmentation of the tongue contour: a segmentation result of 94.23% at 34.55 ms/frame is achieved on the NS dataset, 91.95% at 35.17 ms/frame on the TJU dataset and 92.06% at 34.93 ms/frame on the TIMIT dataset, which is effective for the task of rapid segmentation.
The U-Net network is simplified by removing layers, and the AG module and the CBAM module are embedded into U-Net through the parallel design to obtain CBAUnet. The three most critical parts of the whole design process are recorded as: 1) the layer-reduced simplification of U-Net; 2) the embedding of the AG branch; 3) the embedding of the CBAM branch. In order to see clearly the effect of each part on the baseline U-Net network, a corresponding ablation experiment was designed.
The segmentation effect on NS is superior to the other two datasets in terms of both accuracy and speed, so the NS dataset was chosen as the control data for the following ablation experiments. This part involves seven groups of comparison experiments in total, with the network in which each U-Net codec stage loses one convolution layer abbreviated in the tables as Simplify. Taking Simplify, AG and CBAM as three configuration elements and forming all their combinations with the U-Net network yields seven networks in addition to the U-Net network itself. By training and testing the seven models, the contribution rates of the three elements are obtained. The contribution rate is represented by comparing the experimental results: the better the accuracy and the time performance, the greater the contribution of that part to the improvement of the original network performance. Since the main task is the segmentation and extraction of the sagittal contour of the ultrasonic tongue, IoU is taken as the accuracy index in the tables. The Time index in the tables represents the time it takes to test the same ultrasound frame with each of the seven trained models. Table 2 shows the results of the ablation experiments. Fig. 11 is a comparative statistical chart of the ablation experiments, showing the accuracy and time indexes measured by each model on the NS dataset as bar charts. Taking frame 2138 of the NS dataset as an example, the visual test results of this frame under the various networks of the ablation experiment are listed in Table 3.
Referring to Tables 2 and 3 and fig. 11, the Simplify version of U-Net shows only a slight decrease in accuracy of 1.8% relative to the original network, but owing to the reduction in convolution-layer parameters and computation, its processing time is 1.11 ms faster than before. To some extent, Simplify achieves a compromise between accuracy and speed, and its overall effect on the ultrasonic tongue contour extraction task is superior to that of the original U-Net network. Introducing an attention module on the basis of the simplified U-Net improves both accuracy and time. When the AG module is used alone, the accuracy is better than both Simplify and U-Net, while the time lies between the two. The CBAM module alone is superior to the AG module alone in both the accuracy and the time index. When the two attention modules are used together, the accuracy and the time are better than when either module is used alone: specifically, the accuracy is 1.6% higher and the processing time 2.25 ms faster than with the AG module alone, and the accuracy is 1.1% higher and the processing time 1.66 ms faster than with the CBAM module alone. When Simplify works together with both modules, the improvement in both indexes is more pronounced than with Simplify alone: the accuracy improves by 5.87% and the processing time is 2.37 ms faster.
Table 2 ablation experimental results
From the above ablation experiments it can be concluded that a compromise between accuracy and speed is feasible in practice, and that a lightweight redesign of an already mature and excellent image segmentation network such as U-Net can still guarantee feature extraction with high accuracy. Herein, the network is redesigned by simplifying the original network structure and adding the attention mechanism operations. The combined operation of the Simplify, AG and CBAM modules further improves the originally excellent U-Net network in the effect of extracting the ultrasonic tongue contour.
The application provides a U-Net-based ultrasonic tongue contour image segmentation algorithm incorporating the AG and the CBAM. The experimental results on three datasets show that efficiency can be improved by adding an attention mechanism to the skip connections of the original U-Net network, or by appropriately simplifying the network through the removal of a convolution layer; the combined application of the two allows the network to determine the extraction features more quickly, thereby saving computational resources and obtaining more accurate segmentation results.
It is apparent that the above embodiments are merely examples given for clarity of description and do not limit the implementation. Other variations or modifications in different forms may be made by those of ordinary skill in the art on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here, and the obvious variations or modifications derived therefrom remain within the protection scope of the application.

Claims (10)

1. The CBAUnet-based double-attention rapid tongue contour extraction method is characterized by comprising the following steps of:
s1, acquiring an original tongue ultrasonic image dataset;
s2, preprocessing an original ultrasonic image data set;
s3, inputting the preprocessed data into a CBAUnet network; after encoding the preprocessed ultrasonic image, obtaining feature maps containing information at different scales by using the dual-attention mechanism of AG gated attention and CBAM attention in a comprehensive attention module;
and S4, aggregating the target feature information according to the feature maps, and then decoding stage by stage with a decoder to obtain a contour map with restored pixels.
2. The CBAUnet-based dual-attention rapid tongue profile extraction method of claim 1, wherein in S1 the ultrasound image dataset comprises an NS dataset, a TJU dataset, and a TIMIT dataset.
3. The CBAUnet-based dual-attention rapid tongue profile extraction method of claim 1, wherein in S2 the process of preprocessing the ultrasound image dataset comprises the steps of:
normalizing the acquired data set, and uniformly adjusting the size of the picture to 96 pixels by 96 pixels after normalization;
carrying out random rotation and random flipping of the normalized pictures during training by using the transforms package;
in the training process, adjusting hue, saturation, brightness and contrast according to random probability;
and labeling the adjusted image to form a labeled data set.
4. The CBAUnet-based dual-attention rapid tongue contour extraction method of claim 1, wherein in S3, after encoding the preprocessed ultrasound image, feature maps containing information at different scales are obtained by using the dual-attention mechanism of AG gated attention and CBAM attention in the comprehensive attention module, comprising the following steps:
removing one convolution layer from the encoding convolution block and the decoding convolution block of each stage of the traditional U-Net network, and embedding the comprehensive attention module in the traditional U-Net network to form the CBAUnet network;
connecting the AG gated attention and the CBAM attention in parallel in the comprehensive attention module; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively;
the spatial attention map and the channel attention map are transmitted to the decoding convolution blocks of the corresponding level for decoding.
5. The CBAUnet-based dual-attention fast tongue profile extraction method of claim 1, wherein the AG gated attention selects spatial regions by analyzing the contextual information and the activations provided by the gating signal g collected from a coarser scale; the output of the AG, $\hat{x}_{i}^{l}$, is the product of the input element map $x_{i}^{l}$ and the attention coefficient $\alpha_{i}^{l}$, calculated as

$$q_{att}^{l}=\psi^{T}\left(\sigma_{1}\left(W_{x}^{T}x_{i}^{l}+W_{g}^{T}g_{i}+b_{g}\right)\right)+b_{\psi}$$
$$\alpha_{i}^{l}=\sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l},g_{i};\Theta_{att}\right)\right)$$
$$\hat{x}_{i}^{l}=x_{i}^{l}\cdot\alpha_{i}^{l}$$

wherein $\sigma_{2}$ is the sigmoid activation function; $\Theta_{att}$ denotes the set of parameters characterizing the AG, and the linear transformations are computed as 1 × 1 convolutions of the input tensors in the channel direction; $\sigma_{1}$ corresponds to the ReLU function; $W_{x}$, $W_{g}$ and $\psi$ are linear transformation matrices and $b_{g}$, $b_{\psi}$ are bias terms; $g_{i}$ is the gating vector used for each pixel $i$ to determine the focus region.
6. The CBAUnet-based dual-attention quick tongue profile extraction method of claim 1, wherein the CBAM attention generates the channel attention map based on the inter-channel relationship of the features, comprising the following steps:
aggregating the spatial information of the feature map using average pooling and maximum pooling operations to generate two different spatial context descriptors, $F_{avg}^{c}$ and $F_{max}^{c}$, representing the average-pooled feature and the max-pooled feature, respectively;
forwarding the two descriptors to a shared network to generate the channel attention map.
7. The CBAUnet-based dual-attention quick tongue profile extraction method of claim 1, wherein the CBAM attention generates the spatial attention map based on the inter-spatial relationship of the features, comprising the following steps:
aggregating the channel information of the feature map using average pooling and maximum pooling operations to generate two effective two-dimensional feature descriptors, $F_{avg}^{s}$ and $F_{max}^{s}$;
concatenating the descriptors and convolving them with a standard convolution layer to generate the 2D spatial attention map.
8. A CBAUnet-based dual-attention rapid tongue profile extraction system, comprising: the system comprises a data acquisition module, a data preprocessing module and a CBAUnet network;
the data acquisition module is used for acquiring an original tongue ultrasonic image data set;
the data preprocessing module is used for preprocessing an original ultrasonic image data set;
the CBAUnet network comprises an encoding block, a decoding block and a comprehensive attention module; the encoding block is used for encoding the preprocessed ultrasonic image to form encoding information;
the comprehensive attention module is used for obtaining feature maps containing information at different scales based on the dual-attention mechanism of AG gated attention and CBAM attention;
and the decoder is used for aggregating the target feature information according to the feature maps and then decoding stage by stage to obtain a contour map with restored pixels.
9. The CBAUnet-based dual-attention rapid tongue profile extraction system of claim 8, wherein the ultrasound image dataset comprises an NS dataset, a TJU dataset, and a TIMIT dataset.
10. The CBAUnet-based dual-attention rapid tongue profile extraction system of claim 8, wherein the AG gated attention and the CBAM attention in the comprehensive attention module are connected in parallel; the AG gated attention adaptively learns to focus on target structures of different shapes and sizes in the encoded information, highlighting features useful for the specific task through implicit learning and suppressing irrelevant regions in the input image; the CBAM attention exploits the spatial and channel relationships of the features to generate a spatial attention map and a channel attention map, respectively; the spatial attention map and the channel attention map are then transmitted to the decoding convolution blocks of the corresponding level for decoding.
CN202310294304.2A 2023-03-23 2023-03-23 CBAUnet-based double-attention rapid tongue contour extraction method and system Active CN116797614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310294304.2A CN116797614B (en) 2023-03-23 2023-03-23 CBAUnet-based double-attention rapid tongue contour extraction method and system


Publications (2)

Publication Number Publication Date
CN116797614A true CN116797614A (en) 2023-09-22
CN116797614B CN116797614B (en) 2024-02-06

Family

ID=88038827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310294304.2A Active CN116797614B (en) 2023-03-23 2023-03-23 CBAUnet-based double-attention rapid tongue contour extraction method and system

Country Status (1)

Country Link
CN (1) CN116797614B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365505A (en) * 2020-09-30 2021-02-12 上海导萃智能科技有限公司 Lightweight tongue body segmentation method based on coding and decoding structure
US20220108097A1 (en) * 2020-10-05 2022-04-07 Rakuten, Inc. Dual encoder attention u-net
CN115187621A (en) * 2021-10-11 2022-10-14 广东技术师范大学 Automatic U-Net medical image contour extraction network integrating attention mechanism
CN113989271A (en) * 2021-11-25 2022-01-28 江苏科技大学 Paint image segmentation system and method based on double-attention mechanism and U-net network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIN PENG et al.: "The Circular U-Net with Attention Gate for Image Splicing Forgery Detection", MDPI, pages 1 - 12 *
XIAOZHONG TONG et al.: "ASCU-Net: Attention Gate, Spatial and Channel Attention U-Net for Skin Lesion Segmentation", MDPI, pages 1 - 12 *
ZHOU Xing et al.: "Object Detection in Remote Sensing Images Based on a Dual Attention Mechanism", Computer and Modernization, no. 08, pages 1 - 7 *

Also Published As

Publication number Publication date
CN116797614B (en) 2024-02-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant