CN114842238A - Embedded breast ultrasound image identification method - Google Patents


Info

Publication number
CN114842238A
CN114842238A (application CN202210349097.1A)
Authority
CN
China
Prior art keywords
image
network
network model
positive
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210349097.1A
Other languages
Chinese (zh)
Other versions
CN114842238B (en)
Inventor
孙自强 (Sun Ziqiang)
龚任 (Gong Ren)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Shishang Medical Technology Co ltd
Original Assignee
Suzhou Shishang Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Shishang Medical Technology Co., Ltd.
Priority to CN202210349097.1A
Publication of CN114842238A
Application granted
Publication of CN114842238B
Legal status: Active

Classifications

    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06N 3/045: Computing arrangements based on biological models; neural network architectures; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06T 7/0012: Image analysis; inspection of images; biomedical image inspection
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling; mappings, e.g. subspace methods
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2207/30068: Indexing scheme for image analysis; biomedical image processing; mammography; breast
    • G06V 2201/03: Recognition of patterns in medical or anatomical images

Abstract

The invention discloses an embedded breast ultrasound image identification method, which comprises the following steps: constructing a target detection network model and a target classification network model for breast lesions in dynamic ultrasound images, and training the two models separately; pruning the trained target detection network model, deploying the pruned target detection network model and the trained target classification network model as sub-networks in an embedded system, thereby generating an embedded breast ultrasound image recognition network system; inputting the breast ultrasound image to be recognized into the target detection sub-network for screening and outputting the screening result; and inputting the positively detected video-sequence dynamic-image feature vector matrix into the target classification sub-network, outputting the classification result, and generating the breast ultrasound image recognition result from the classification result. On the premise of keeping the false negative (missed diagnosis) rate low, the method effectively reduces the false positive rate of breast ultrasound image recognition, provides a reference for medical staff, helps them judge the patient's condition more accurately, and reduces missed diagnoses.

Description

Embedded breast ultrasound image identification method
Technical Field
The invention relates to the technical field of artificial intelligence and ultrasonic medical image processing, and in particular to an embedded breast ultrasound image identification method.
Background
Breast ultrasound is noninvasive, fast, highly repeatable and radiation-free, and can clearly display changes in each layer of breast soft tissue, in tumors within that tissue, and in their internal structures and adjacent tissues, which greatly facilitates breast disease screening. In recent years, with the rapid development of computer technology and advances in medical technology, artificial intelligence has made substantial progress in the auxiliary examination of breast cancer. Combining artificial intelligence with ultrasound imaging for assisted breast cancer screening can effectively reduce the excessive dependence on the ultrasound skills and experience of medical staff during screening, lower the threshold for carrying out screening, break the bottleneck of scarce professionals and uneven diagnostic levels, enable breast cancer screening at the grassroots level and at scale, provide a reference for medical staff, and help them judge the breast cancer condition more accurately.
Owing to limitations of the imaging equipment and the scanning method, an ultrasound image is inevitably affected by noise, artifacts and other factors during imaging. Moreover, given the infiltrative nature of breast tumors, lesion images have low contrast and resolution and blurred boundaries, making features difficult to extract; in particular, a single frame often contains too little information for the system to make an accurate judgment, so false recognition and missed recognition easily occur.
Current AI-based intelligent detection and auxiliary diagnostic analysis of ultrasound images, built on artificial-intelligence deep learning, and its clinical application have the following main problems:
1. Poor real-time performance. Deep learning is a compute-intensive technology with high demands on the CPU, GPU and so on. As a result, most medical-image AI systems and devices are currently deployed and run on workstations or remote cloud platforms built on large, extremely powerful GPU and CPU resources. Their application scenarios are therefore severely limited by site, environment and the quality of the surrounding communication network, and problems such as slow response and large latency are common, which greatly limits the user experience and efficiency of medical staff.
2. Existing methods for auxiliary lesion identification, localization and diagnostic analysis of breast ultrasound images are all based on static ultrasound images, and recognition in dynamic ultrasound images generally suffers from low speed and a high false positive rate.
3. Current medical-image AI network models are mostly wide and deep. Such models are deployed either in a remote cloud or on large high-end workstations with strong edge-computing capability or on high-end ultrasound imaging equipment, and their cost and poor portability severely restrict wide application in primary-level medical institutions. With the spread of portable ultrasound imaging equipment in primary medical institutions, there is a growing demand for embedded AI neural network models that perform real-time target detection and auxiliary diagnosis on resource-limited hardware.
Therefore, on the basis of existing ultrasound medical image processing technology, how to overcome the low timeliness and accuracy of existing breast ultrasound image recognition, its dependence on region, environment and communication, and the fact that it can only be deployed and run on large GPU and CPU platforms with very high computing power, has become an urgent problem for those skilled in the art.
Disclosure of Invention
In view of the above problems, the present invention provides an embedded breast ultrasound image recognition method that solves at least some of the above technical problems, and the method can effectively reduce the false positive rate and false negative rate of breast lesion recognition of dynamic ultrasound images, provide reference for medical staff, and facilitate medical staff to determine the state of an illness more accurately.
The embodiment of the invention provides an embedded breast ultrasound image identification method, which comprises the following steps:
s1, constructing a target detection network model and a target classification network model of the breast lesion based on the dynamic ultrasonic image, respectively obtaining a target detection network model data set and a target classification network model data set, and training the target detection network model and the target classification network model to obtain a trained target detection network and a trained target classification network model;
s2, cutting the trained target detection network model, deploying the cut target detection network model and the trained target classification network model as sub-networks into an embedded system, and generating an embedded breast ultrasound image recognition network system; the embedded breast ultrasound image recognition network system comprises: a target detection subnetwork and a target classification subnetwork;
s3, inputting the breast ultrasonic image to be identified into the target detection subnetwork for screening, and outputting a screening result; the screening result is a video sequence dynamic image characteristic vector matrix which is detected positively; the positively detected video sequence dynamic image feature vector matrix comprises: image feature vectors of a positive detection display frame and image feature vectors of a first n frames of video sequence of the positive detection display frame; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image;
s4, inputting the positively detected video sequence dynamic image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasonic image recognition result according to the classification result; and the classification result is a true positive feature vector or a false positive feature vector.
Further, in step S1, the target detection network model data set is obtained by:
respectively collecting breast ultrasonic image data of a clinical breast confirmed patient, breast part ultrasonic image data of normal people and non-breast part ultrasonic image data, and constructing a target detection network model data set; the breast ultrasonic image data of the clinical breast confirmed patient is marked with a target focus, and the breast part ultrasonic image data and the non-breast part ultrasonic image data of the normal population are marked with characteristic classification.
Further, in step S1, the target classification network model data set is obtained by:
inputting the collected original breast ultrasound image data video sequences of a preset number into the target detection network model for data screening, outputting image characteristic vectors of a positive detection display frame and the first n frames of video sequence image characteristic vectors of the positive detection display frame, and constructing a positive detection sample video sequence image characteristic composite vector matrix; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image; the image feature vector includes: the detected confidence score, the height and the width of the detection frame and the coordinate position of the center point of the detection frame;
marking each positive detection display frame to generate a video file containing marks;
acquiring the manual retest result of the video file, and writing a first label on the corresponding positive-detected sample video sequence image feature composite vector matrix if the retest result is true positive; if the rechecking result is false positive, writing a second label on the corresponding positive detection sample video sequence image characteristic composite vector matrix;
and repeating the processes to obtain a target classification network model data set containing positive and negative samples.
Further, in step S1, the training of the target classification network model includes:
performing time characteristic enhancement and space characteristic enhancement on the positive detection sample video sequence image characteristic composite vector matrix by learning the space-time change characteristics of the first n frames of video sequence image characteristic vectors of the positive detection sample video sequence image characteristic composite vector matrix; the spatiotemporal variation features include: the scoring change of the confidence coefficient, the height and width change of the detection frame and the coordinate position change of the central point of the detection frame;
and training the target classification network model through a data set formed by the positive detection sample video sequence image feature composite vector matrix after feature enhancement.
Performing temporal feature enhancement and spatial feature enhancement on the positive-detection sample video sequence image feature composite vector matrix comprises:
performing random transformation enhancement on the data formed by the positive-detection sample video sequence image feature composite vector matrix, wherein the random transformation enhancement comprises: randomly and synchronously enlarging or reducing the height and width of the detection frame, and randomly translating the coordinates of the center point of the detection frame.
further, in step S2, the cutting the trained target detection network model includes:
normalizing the BN layer of the trained target detection network model, introducing a group of scaling factors into each channel in the BN layer, and adding an L1 norm to constrain the scaling factors;
and scoring each channel in the BN layer according to the scaling factor, filtering out the channels with the scores lower than a preset threshold value, and finishing the cutting of the target detection network model.
Further, the step S4 includes:
s41, inputting the positive detected video sequence dynamic image feature vector matrix into the target classification sub-network, and outputting a classification normalization score;
s42, comparing the classification normalization score with a preset optimal threshold value of the target classification sub-network, and when the classification normalization score is larger than the optimal threshold value, taking the positive detected video sequence dynamic image feature vector matrix as a true positive feature vector; otherwise, the vector is a false positive feature vector;
and S43, generating a breast ultrasonic image recognition result according to the true positive characteristic vector or the false positive characteristic vector.
Further, the network architecture of the target detection network model is composed of an input end, a backbone network, a neck network and a head prediction network;
the input end performs data enhancement on an input data set;
the backbone network is composed of a convolutional neural network; the convolutional neural network adopts a Focus + CSPNet + SPP series structure; the Focus performs image slicing on the data set; the CSPNet performs feature extraction on the sliced data set to generate a feature map; the SPP converts the feature map of arbitrary size into a feature vector of fixed size;
the neck network adopts an AF-FPN framework, which comprises: an adaptive attention module and a feature enhancement module; the neck network aggregates paths of image features and transmits the image features to the head prediction network;
the head prediction network outputs a mammary gland ultrasonic image recognition result; the recognition result comprises: the breast ultrasound image to be recognized comprises the target object category, confidence score, bounding box size characteristic information and position coordinate characteristic information.
Further, the data enhancement comprises:
performing rotation, left-right inversion, translation, scaling and affine transformation enhancement on the data set;
randomly adding noise perturbations to the data set, comprising: performing random disturbance on each pixel gray value of the image in the data set by adopting salt and pepper noise and Gaussian noise;
performing Gaussian blur on the data set;
and carrying out contrast and brightness image enhancement on the data set through Gamma transformation.
Further, the adaptive attention module obtains a plurality of context features of different scales through an adaptive average pooling layer; and generating a spatial weight graph for each feature graph output by the backbone network through a spatial attention mechanism, and fusing the context features through the weight graphs to generate a new feature graph containing multi-scale context information.
Further, the feature enhancement module is composed of a multi-branch convolutional layer and a branch pooling layer;
the multi-branch convolutional layer comprises: a dilated (atrous) convolution layer, a BN layer and a ReLU activation layer; the multi-branch convolutional layer provides a corresponding receptive field for the input feature map through dilated convolution;
the branch pooling layer fuses image feature information from the receptive field.
Further, the target classification network model adopts a logistic regression classifier model; inputting the positive detected video sequence dynamic image feature vector matrix into the trained target classification network model, and synchronizing to the current positive detected display frame; and performing time characteristic enhancement and space characteristic enhancement on the image of the current positive detection display frame by utilizing the space-time characteristic information of the video sequence in the preset time period of the positive detection video sequence dynamic image characteristic vector matrix, performing true positive and false positive classification judgment on the current positive detection display frame, and outputting a true positive characteristic vector or a false positive characteristic vector.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the embodiment of the invention provides an embedded breast ultrasound image identification method, which comprises the following steps: constructing a target detection network model and a target classification network model of the breast lesion based on the dynamic ultrasonic image, respectively acquiring a target detection network model data set and a target classification network model data set, and training the target detection network model and the target classification network model; cutting the trained target detection network model, deploying the cut target detection network model and the trained target classification network model as sub-networks into an embedded system, and generating an embedded mammary gland ultrasonic image recognition network system; inputting the breast ultrasonic image to be identified into a target detection subnetwork for screening, and outputting a screening result; and inputting the positively detected video sequence dynamic image feature vector matrix into a target classification sub-network, outputting a classification result, and generating a mammary gland ultrasonic image identification result according to the classification result. The method effectively reduces the false positive rate of the network model on the premise of maintaining a low network missed diagnosis rate, provides reference for medical personnel, is convenient for the medical personnel to judge the state of an illness more accurately, effectively helps the medical personnel to reduce missed diagnosis, and improves the diagnosis efficiency of the medical personnel.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of an embedded breast ultrasound image identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a basic network architecture according to an embodiment of the present invention;
FIG. 3 is a diagram of a target detection network model architecture according to an embodiment of the present invention;
FIG. 4 is a diagram of an AF-FPN architecture provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating classification training results provided by an embodiment of the present invention;
fig. 6 is a flowchart of a method for pruning a target detection network through a structured network channel according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides an embedded breast ultrasound image identification method, which is shown in figure 1 and comprises the following steps:
s1, constructing a target detection network model and a target classification network model of the breast lesion based on the dynamic ultrasonic image, respectively obtaining a target detection network model data set and a target classification network model data set, and training the target detection network model and the target classification network model to obtain a trained target detection network model and a trained target classification network model;
s2, cutting the trained target detection network model, deploying the cut target detection network model and the trained target classification network model as sub-networks into an embedded system, and generating the embedded breast ultrasound image recognition network system; the embedded breast ultrasound image identification network system comprises: a target detection subnetwork and a target classification subnetwork;
s3, inputting the breast ultrasonic image to be identified into a target detection subnetwork for screening, and outputting a screening result; the screening result is a video sequence dynamic image characteristic vector matrix detected positively or a video sequence dynamic image characteristic vector matrix detected negatively;
s4, inputting the positively detected video sequence dynamic image feature vector matrix into a target classification sub-network, outputting a classification result, and generating a breast ultrasonic image recognition result according to the classification result; the classification result is a true positive feature vector or a false positive feature vector.
The embedded breast ultrasound image identification method provided by this embodiment can effectively improve the efficiency of breast ultrasound auxiliary examination, provide a reference for medical staff, help them judge the patient's condition more accurately, effectively reduce missed diagnoses, and improve diagnostic efficiency. On the premise of maintaining a low missed-diagnosis rate, it effectively reduces the false positive rate of the network model.
The method provided by the embodiment is explained in detail below:
the method comprises the following steps: and respectively constructing a target detection network data set and a target classification network data set.
And constructing a target detection network data set. The data set is composed of negative sample data and positive sample data. Collecting breast part ultrasonic image data and non-breast part ultrasonic image data of normal people of various ages, carrying out desensitization processing on the data, and carrying out characteristic classification and labeling on the breast part ultrasonic image data to be used as negative sample data of a target detection network. Collecting mammary gland ultrasonic video recording and focus image data of a clinical mammary gland confirmed patient, carrying out desensitization processing on the data, and carrying out target focus labeling on a target mammary gland focus image to be used as positive sample data of a target detection network.
And constructing a target classification network data set. The method comprises the following steps:
Firstly, the original breast ultrasound image video sequences collected in real time are input into the trained target detection network for screening, which outputs the image feature vector of each positively detected sample, {t0, c0, w0, h0, x0, y0}, where t is the target object category, c is the positive confidence of the target object, and w, h and (x, y) are the normalized width, height and center-point coordinates of the target detection frame, together with the image feature vectors of the n video frames preceding the current positive image (collected according to the preceding-n-frames acquisition mode), {t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n}. The value of n depends on the clarity of the collected ultrasound images and the scanning frame rate, with 0 ≤ n ≤ 10. Secondly, a positive-detection video sequence image feature composite vector matrix (feature composite vector matrix for short) of size (n+1)×6 is constructed: {t0, c0, w0, h0, x0, y0; t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n}. The matrix is written into a CSV file as n+1 rows and 6 columns, finally generating target classification network sample data.
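As an illustration of how such a composite vector matrix might be assembled and stored, a minimal Python sketch follows; the function names, dictionary fields and CSV layout are assumptions for illustration only, not the patent's reference implementation.

```python
# Minimal sketch: build an (n+1) x 6 feature composite vector matrix and append it to a CSV file.
import csv

FIELDS = ["t", "c", "w", "h", "x", "y"]  # class, confidence, box w/h, box centre x/y

def build_composite_matrix(current_det, history, n):
    """current_det: dict for the positive frame; history: dicts for prior frames, most recent first."""
    rows = [current_det] + history[:n]                  # frame 0, then frames -1 .. -n
    return [[frame[k] for k in FIELDS] for frame in rows]

def append_sample(csv_path, matrix, label=None):
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for row in matrix:                              # (n+1) rows, 6 columns each
            writer.writerow(row)
        if label is not None:                           # 1 = true positive, 0 = false positive
            writer.writerow(["label", label])
```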
Performing data enhancement on the generated target classification network data set:
and performing random transformation enhancement on data consisting of the characteristic composite vector matrix, including random synchronous amplification or reduction on the height and width of the detection frame, random translation on the central point coordinate of the detection frame and sparse processing on the confidence score.
The target classification network is trained on data constructed from the positive-detection video sequence image feature composite vector matrices, and the image features of a positively detected frame are enhanced with the image feature vectors of the preceding n frames of the video sequence: the network is trained to learn how the confidence score and the position and size of the positive breast detection change over the past n frames, thereby enhancing the temporal and spatial characteristics of the positive-detection image features and compensating for the incomplete information of single-frame image features.
The number n of video frames acquired before a positively detected image is preset according to the video scanning speed (frame rate) of the ultrasound imaging equipment and can be divided into three levels: frame rate ≤ 50 fps, frame rate ≤ 150 fps, and frame rate > 150 fps.
When the breast ultrasound image video sequence is input, each detected positive frame is marked, and a video file containing the marks is generated and stored. The video file is manually rechecked by an experienced sonographer, sample data of true positive or false positive detection is screened out, if the sample data is true positive, a first label (for example: label 1) is written in the corresponding characteristic composite vector matrix, and otherwise, a second label (for example: label 0) is written in the corresponding characteristic composite vector matrix. And repeating the processes to obtain a batch of data sets containing positive and negative samples, and recording the data sets in the CSV file. These data sets will be used for training, testing and optimization of the target classification network model.
Step two: and constructing a basic network framework which consists of an object detection network and an object classification network.
Referring to fig. 2, the base network is composed of an object detection network and an object classification network. Referring to fig. 3, the target detection network architecture adopts an improved YOLOv5 as an algorithm model architecture of the target detection network, the YOLO architecture has the characteristics of low requirement on hardware equipment and low computation cost, and the network architecture includes: the system comprises an Input end (Input) part, a Backbone network (Backbone) part, a Neck network (Neck) part and a Head Prediction network (Head Prediction) part.
The input end enriches the data set by combining various data enhancement methods, strengthening image features and generalization ability. The backbone network aggregates and forms a convolutional neural network that trains image features at different fine image granularities; it adopts a Focus + CSPNet + SPP series structure, where Focus performs the image slicing operation, CSPNet (cross-stage partial network) performs feature extraction, and SPP (spatial pyramid pooling) converts feature maps of arbitrary size into fixed-size feature vectors, increasing the receptive field of the network. The neck network uses an FPN + PANet framework to aggregate paths of image features and pass them to the prediction layer. Finally, the head prediction network produces the target prediction result, generating and outputting the target object category, confidence, bounding-box size and position-coordinate feature information.
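As a concrete illustration of the Focus slicing operation mentioned above, a minimal PyTorch sketch follows (as used in YOLOv5-style backbones); the channel counts are illustrative assumptions, and even input height/width is assumed.

```python
# Minimal sketch of the Focus module: interleaved slicing followed by convolution.
import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        # four interleaved slices are concatenated on the channel axis, then convolved
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):                                   # x: (B, C, H, W), H and W even
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)  # (B, 4C, H/2, W/2)
        return self.conv(sliced)
```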
The constructed target detection network can dynamically recognize and locate breast lesions: it rapidly recognizes and locates the breast lesion target in the input ultrasound video frames or images, and constructs a dynamic positive-detection video sequence image feature vector matrix (dynamic feature vector matrix) using a method similar to the construction of the target classification network data set in step one. The image feature vector of the current positive-detection display frame, {t0, c0, w0, h0, x0, y0}, and the image feature vectors of the n frames preceding it, {t-1, c-1, w-1, h-1, x-1, y-1; ...; t-n, c-n, w-n, h-n, x-n, y-n}, form an (n+1)×6 dynamic feature vector matrix. This matrix carries spatial and temporal enhancement characteristics and compensates for the incomplete feature information of a single frame. The feature matrix is dynamically output to the target classification network of the next stage for further discrimination of false positives and true positives; because it carries the spatial and temporal dynamic enhancement features of the positive detection, it helps the target classification network further distinguish true positives from false positives.
The dynamic feature vector matrix is an output generated dynamically while the deployed target detection network runs and performs inference: the feature vectors are dynamic positive-detection results produced by the target detection network from the input dynamic video, and they are immediately passed to the subsequent target classification network for classification. This corresponds to online operation (i.e., actual use after training is complete).
The feature composite vector matrix, by contrast, is obtained offline (i.e., during training): specific ultrasound video data are input into the target detection network to generate positive-detection feature vectors, and these feature vector data are used to train the target classification network.
The constructed target detection network can also be used as a data screening tool to generate a characteristic composite vector matrix for forming a data set of the target classification network.
YOLOv5s and YOLOv5m are selected as the two basic network model frameworks; their small network scale and very high real-time detection performance make them convenient to apply on different edge-computing hardware platforms and embedded systems (including small edge-computing imaging devices or apparatuses). The main reasons for selecting YOLOv5 as the target detection network model framework are as follows:
1) Conventional CNNs typically require a large number of parameters and floating-point operations (FLOPs) to achieve satisfactory precision; for example, ResNet-50 needs approximately 25.6 million parameters and about 4.1 billion floating-point operations to process an image of size 224 × 224. However, mobile devices with limited memory and computing resources (e.g., portable ultrasound devices) cannot deploy and run such large networks. As a single-stage detector, YOLOv5 has obvious advantages such as a small computation load and high recognition speed.
2) The YOLOv5 can realize real-time target detection, and target area prediction and target category prediction are integrated in a single-stage network framework, so that the network framework is relatively simple, the reasoning and detection speed is higher, and the network model scale is smaller, and the method can be deployed in an embedded system based on an edge computing hardware platform with limited resources (AI computing power and storage space).
For the target detection network, a stricter recall (true positive rate) strategy, i.e., a lower missed-diagnosis rate, is adopted. To effectively reduce the false positive rate of the system, a target classification network is added after the target detection network (YOLOv5) to further classify the positive samples detected by the preceding YOLOv5 stage, i.e., to decide which positive detections are true positives and which are false positives.
A target classification network is constructed using a logistic regression classifier as the target classification network model. This classifier is a binary classifier that models the output with a Bernoulli distribution; it has the advantages of a simple structure, a small computation load (high speed) and easy implementation, and places very little extra computational demand on the system.
The network algorithm model of the target classification network adopts a logistic regression classifier model, the classifier adopts a target feature enhancement classification prediction method based on a video sequence, the feature vector matrix information of a positively detected video sequence dynamic image output by the target detection network is synchronized to a current display frame, and the feature enhancement operation reasoning is carried out on the image of the current display frame by utilizing the spatio-temporal feature information of a previous section of the video sequence, so that the true positive and false positive classification discrimination is further carried out on the current positive detection sample.
A video-sequence image feature method, i.e., a video-sequence-based target feature enhancement classification prediction method, is adopted: the positive-detection sample video sequence feature matrix (i.e., the dynamic feature vector matrix) output by the target detection network (YOLOv5) is synchronized to the current display frame (the positively detected image frame), and the target classification network performs feature enhancement using the current image features of the displayed frame together with the spatio-temporal feature information of the preceding segment of the video sequence, for example the spatio-temporal change of the breast lesion's position and size, background changes, and confidence changes, and then makes a further true-positive/false-positive classification judgment on the currently displayed positive-detection frame. The specific principle is as follows:
The classifier takes as input the positive-detection sample video sequence feature matrix X output by the preceding target detection network (YOLOv5). The classifier is a trained and optimized classification network whose mathematical model is:
h_θ(x) = 1 / (1 + e^(−θ^T x))
where h_θ(x) is the network prediction function, θ is the network weight parameter vector, and x is the input vector. After computation a normalized score is obtained; the target classification network compares this score with a preset optimal threshold (the first preset threshold), obtained by training and optimizing the classification network beforehand. If the score is greater than the first preset threshold, the positively detected sample is judged to be a true positive; otherwise it is judged to be a false positive.
Whether the YOLOv5 network reports a positive detection is determined by whether the confidence of the image features exceeds a system-set threshold. For most falsely detected positive samples the feature confidence is usually not very high, and many fall near the threshold. By combining the image features of several preceding frames of the video sequence (part category, confidence, size and position of the target bounding box), the features of the currently displayed frame are enhanced in the spatial and temporal dimensions, making the threshold judgment multi-dimensional. The same video-sequence image feature method is used to train and optimize the target classification network model, which makes it possible to compare and analyze the spatial and temporal changes of the input multi-dimensional image feature vectors, for example the spatio-temporal change of the target's position and size, background changes, and confidence changes, and thereby strengthen the secondary interpretation of the current positive detection, reducing the system's false positive rate (while keeping the system's true positive rate high).
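A minimal sketch of this second-stage decision, assuming the (n+1)×6 dynamic feature matrix is flattened into an input vector and a logistic-regression score is compared against a pre-tuned optimal threshold; the names and the flattening convention are assumptions for illustration.

```python
# Minimal sketch: true-positive / false-positive discrimination of one positive detection.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_positive_detection(feature_matrix, theta, optimal_threshold):
    """feature_matrix: (n+1, 6) array; theta: weight vector including a bias term."""
    x = np.concatenate(([1.0], feature_matrix.flatten()))   # prepend bias input
    score = sigmoid(theta @ x)                               # normalised score in (0, 1)
    return "true_positive" if score > optimal_threshold else "false_positive"
```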
Step three: aiming at the constructed target detection network framework, an improved characteristic pyramid model method is adopted, and the recognition and positioning capacity of the network model on the multi-scale change of the breast lesion in continuous multi-frame images is improved.
Referring to fig. 4, an improved feature pyramid model, AF-FPN, consisting of an Adaptive Attention Module (AAM) and a Feature Enhancement Module (FEM), is used to replace the original Feature Pyramid Network (FPN) in the neck network part of the YOLOv5 framework.
The neck network is designed to make better use of the features extracted by the backbone, reprocessing the backbone's feature maps at different stages and using them appropriately. The original YOLOv5 neck framework aggregates image features with a Feature Pyramid Network (FPN) and a Path Aggregation Network (PANet). FPN is a common multi-layer feature fusion method, but its use can make the network focus more on optimizing low-level features, sometimes reducing detection accuracy for targets of varying scales, and it is difficult to improve multi-scale detection accuracy while keeping detection real-time. Breast lesions and diseases, from duct dilatation and nodules to various breast cancers, have completely different size scales and visual characteristics, so a network model is needed that has strong multi-scale target recognition ability and can effectively balance recognition speed and precision. With the improved AF-FPN framework, adaptive feature fusion and receptive-field enhancement preserve channel information to a great extent during feature transfer, adaptively learn different receptive fields in each feature map, enhance the representation of the feature pyramid, and effectively improve the accuracy of multi-scale target recognition.
The AAM adaptive attention module obtains multiple context features of different scales through an adaptive average pooling layer, with a pooling coefficient in [0.1, 0.5] that adapts to the target size of the data set; it then generates a spatial weight map for each feature map through a spatial attention mechanism and fuses the context features through the weight maps, producing a new feature map containing multi-scale context information. The adaptive attention module reduces feature channels and reduces the loss of context information in high-level feature maps.
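A minimal PyTorch sketch of the adaptive-attention idea described above: multi-scale adaptive average pooling produces context features that are fused back into the input feature map through a learned spatial weight map. The pooling sizes, channel counts and the exact fusion layout are illustrative assumptions, not the patented module.

```python
# Minimal sketch of an adaptive attention module (AAM)-style block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttentionModule(nn.Module):
    def __init__(self, channels, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in pool_sizes])
        self.reduce = nn.Conv2d(channels * len(pool_sizes), channels, 1)
        self.spatial_attn = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):                                     # x: (B, C, H, W)
        h, w = x.shape[-2:]
        ctx = [F.interpolate(p(x), size=(h, w), mode="bilinear",
                             align_corners=False) for p in self.pools]
        ctx = self.reduce(torch.cat(ctx, dim=1))              # fused multi-scale context
        weight = torch.sigmoid(self.spatial_attn(ctx))        # spatial weight map
        return x + weight * ctx                               # feature map with context info
```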
The FEM feature enhancement module mainly uses dilated convolution to adaptively learn different receptive fields in each feature map according to the different scales of the detected targets, improving the accuracy of multi-scale target detection and recognition. It can be divided into two parts: a multi-branch convolutional layer and a branch pooling layer. Referring to fig. 4, the multi-branch convolutional layer provides receptive fields of different sizes for the input feature map through dilated convolution, while the average pooling layer fuses image feature information from the three branch receptive fields to improve multi-scale prediction precision.
The multi-branch convolutional layer consists of a dilated convolution, a BN layer and a ReLU activation layer. The dilated convolutions in the three parallel branches have the same kernel size but different dilation rates: each kernel is 3 × 3, and the dilation rates d of the branches are 1, 3 and 5 respectively.
Dilated convolution supports an exponentially expanding receptive field without loss of resolution. In a dilated convolution the elements of the convolution kernel are spaced apart, with the spacing determined by the dilation rate, unlike a standard convolution where the kernel elements are adjacent. The feature enhancement module uses dilated convolution to adaptively learn different receptive fields in each feature map, improving the accuracy of multi-scale target detection and recognition.
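A minimal PyTorch sketch of the FEM structure described above: three parallel 3×3 dilated-convolution branches (dilation rates 1, 3, 5), each with BN and ReLU, fused by averaging as a simple branch-pooling step. Channel counts are illustrative assumptions.

```python
# Minimal sketch of a feature enhancement module (FEM)-style block.
import torch
import torch.nn as nn

class FeatureEnhancementModule(nn.Module):
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        # branch pooling: average the three receptive-field branches
        outs = [branch(x) for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)
```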
Step four: and training and optimizing the constructed target detection network and the target classification network.
The common five steps of machine learning network model training are: data → model → loss → optimizer → iterative training, through the process of forward propagation, the difference between the model output and the real label, i.e. the loss function, is obtained; the gradient of the parameters is obtained through a back propagation process, and the parameters of the network optimizer are updated according to the gradient. The optimization method is to optimize the loss function through repeated iterative training, so that the loss is continuously reduced, and the best model is trained.
The training of the target detection network comprises the following steps:
performing data enhancement on a data set input by a target detection network;
initializing a weight value;
evaluating and optimally training the loss function by adopting an adaptive moment estimation algorithm;
a preheating learning rate method is adopted for a network training learning rate strategy, one-dimensional linear interpolation is adopted in a preheating stage to carry out learning rate iteration and updating, and a cosine annealing algorithm is adopted after the preheating stage.
Carrying out weight initialization, loss function evaluation and optimization training on a target detection network and setting a learning rate strategy of the target detection network training, and specifically comprises the following steps:
firstly, weight initialization is carried out on a target detection network (YOLOv5), and the specific steps comprise: setting initial hyper-parameters of basic network training by adopting a pre-training weight model based on ultrasonic image characteristics; the network hyper-parameter is optimized by adopting a hyper-parameter evolution method, and the hyper-parameter evolution is a method for optimizing the hyper-parameter by utilizing a Genetic Algorithm (GA). YOLOv5 has about 25 hyper-parameters for various training settings, which are stored in yaml files. The convergence of the model can be accelerated by correct weight initialization, and the output of an output layer is too large or too small due to improper weight initialization, so that gradient explosion or disappearance is finally caused, and the model cannot be trained.
Secondly, performing loss function evaluation and optimization training on a target detection network (YOLOv5), wherein the method specifically comprises the following steps: and (3) adopting adaptive moment estimation (Adam), namely an optimization algorithm of gradient descent of an adaptive learning rate, solving the problems of overlarge swing amplitude and unstable convergence of a training gradient and accelerating the convergence speed of a function. It dynamically adjusts the learning rate of each parameter using first moment estimates and second moment estimates of the gradient.
Finally, the training learning-rate strategy of the target detection network (YOLOv5) is set. A warm-up learning-rate method is adopted: during the warm-up stage the learning rate is iterated and updated by one-dimensional linear interpolation, and after the warm-up stage a cosine annealing algorithm is used. When a gradient descent algorithm is used to optimize the loss function, the learning rate should become smaller as the loss approaches its global minimum so that the model can get as close to that point as possible. Cosine annealing reduces the learning rate via a cosine function, whose value first decreases slowly as x increases, then decreases rapidly, and then decreases slowly again; this decay pattern cooperates with the learning rate to produce a better convergence effect.
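A minimal PyTorch sketch of such a schedule, assuming Adam as the optimizer, linear warm-up and cosine annealing; the warm-up length, epoch count and learning rates are illustrative assumptions.

```python
# Minimal sketch: Adam optimizer with linear warm-up followed by cosine annealing.
import math
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-3, warmup_epochs=3, total_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                            # one-dimensional linear interpolation
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine annealing

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```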
The bounding-box coordinates in the label text files used for network training are normalized. An early-stopping mechanism is adopted during model training: the mAP and loss values are monitored dynamically, a maximum number of epochs is set, and if the mAP and loss do not improve after training continues beyond the maximum set value, training stops automatically.
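A minimal sketch of the early-stopping mechanism described above: training stops when the monitored metric (e.g. mAP) has not improved for a preset number of epochs. The metric choice and patience value are illustrative assumptions.

```python
# Minimal sketch of an early-stopping helper.
class EarlyStopping:
    def __init__(self, patience=30):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Return True when training should stop."""
        if metric > self.best:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```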
The training of the target classification network comprises the following steps:
performing data enhancement on a target classification network data set formed by a composite vector matrix based on video sequence image features;
and training and optimizing a network function through an optimized gradient ascending (or descending) algorithm to determine an optimal threshold value.
Training and optimizing a target classification network by adopting a data set consisting of a characteristic composite vector matrix, and determining an optimal threshold value, wherein the specific steps comprise:
and constructing a sample data set, and preparing the data set by adopting the method for constructing the target classification network data set. Reading a sample data set stored in a CSV file, namely a data set formed by a characteristic composite vector matrix (n +1) x 6), wherein n is more than or equal to 0 and less than or equal to 10, and is set in advance, and the value of the n depends on the definition of an ultrasonic image and the scanning frame rate. And randomly dividing a training set and a testing set according to 8: 2.
Training and optimizing a network function through an optimization gradient ascending (or descending) algorithm to determine an optimal threshold value, wherein the optimal threshold value comprises the following steps:
The core of the logistic regression classification network is to solve a binary classification problem; logistic regression seeks the parameters that maximize the probability of the observed data, and the mathematical tool used is the likelihood function. The mathematical model of the classifier network is:
h_θ(x) = 1 / (1 + e^(−θ^T x))
where h_θ(x) is the network prediction function, θ is the network weight parameter vector, and x is the input vector.
Suppose the probabilities of the output y being 1 and 0 are:
P(y=1|x;θ) = h_θ(x)
P(y=0|x;θ) = 1 − h_θ(x)
The likelihood function is the product of these probabilities:
L(θ) = Π_{i=1..m} h_θ(x_i)^{y_i} · (1 − h_θ(x_i))^{1 − y_i}
This expression is also called the likelihood function of the m observations, where m is the number of observed statistical samples.
The goal is to find the parameter estimate that maximizes the value of this likelihood function, i.e., the network weights θ_0, θ_1, ..., θ_n at which the expression above attains its maximum. Using a gradient-based optimization algorithm, taking the logarithm of the likelihood function and differentiating gives the gradient with respect to θ:
∂ℓ(θ)/∂θ_j = Σ_{i=1..m} (y_i − h_θ(x_i)) · x_i^(j)
the learning rule for θ is:
θ_j := θ_j + α · Σ_{i=1..m} (y_i − h_θ(x_i)) · x_i^(j)
where j indexes the j-th attribute (feature) of a sample vector, and α is the learning-rate step size, which can be set freely.
The optimization proceeds by repeated iterative training until the gradient has sufficiently converged, yielding the weight vector θ. The optimal operating threshold is then determined from the network's ROC curve.
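A minimal NumPy sketch of the gradient-ascent rule derived above and of picking an operating threshold from the ROC curve; the iteration count, learning rate and threshold criterion (Youden's J statistic) are illustrative assumptions.

```python
# Minimal sketch: gradient-ascent training of the logistic classifier and ROC-based threshold choice.
import numpy as np
from sklearn.metrics import roc_curve

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.01, iters=5000):
    X = np.hstack([np.ones((len(X), 1)), X])            # bias column
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ theta))            # d(log-likelihood)/d(theta)
        theta += lr * grad                               # gradient ascent step
    return theta

def best_threshold(theta, X_val, y_val):
    X_val = np.hstack([np.ones((len(X_val), 1)), X_val])
    scores = sigmoid(X_val @ theta)
    fpr, tpr, thresholds = roc_curve(y_val, scores)
    return thresholds[np.argmax(tpr - fpr)]              # threshold at the ROC optimum
```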
Referring to fig. 5, the classification training results on the public Wisconsin Breast Cancer Dataset (569 cases: 212 malignant (M), 357 benign (B)) are shown. The classification effect is clearly visible in the figure.
Step five: and (3) enhancing the data set of the target classification network by adopting a data characteristic extension means instead of the conventional image enhancement technology.
Aiming at a target classification network, a data feature enhancement method is adopted to expand the scale of a training and testing data set of the target classification network, and the specific steps comprise:
1) For example, for the 10 × 6 feature matrix of a certain group of 10 video frames, denoted X:
X = [ t-9, c-9, w-9, h-9, x-9, y-9 ;
      t-8, c-8, w-8, h-8, x-8, y-8 ;
      ... ;
      t0, c0, w0, h0, x0, y0 ]
2) X is randomly transformed, as in the sketch below:
columns 1 and 2 of X are kept unchanged; w and h in columns 3 and 4 (the width and height of the detection frame) are synchronously multiplied by the same random coefficient; column 5 (the detection-frame center x coordinate) is translated by one common value; and column 6 (the center y coordinate) is translated by another common value.
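A minimal sketch of this random-transformation augmentation; the ranges of the random scale and shift factors are illustrative assumptions.

```python
# Minimal sketch: random transform of an (n+1) x 6 feature matrix.
import numpy as np

def augment_feature_matrix(X, scale_range=(0.9, 1.1), shift_range=(-0.02, 0.02)):
    X_aug = X.copy()                                   # X: (n+1, 6) float matrix
    s = np.random.uniform(*scale_range)                # same factor for every row
    X_aug[:, 2:4] *= s                                 # synchronously scale w and h
    X_aug[:, 4] += np.random.uniform(*shift_range)     # translate centre x
    X_aug[:, 5] += np.random.uniform(*shift_range)     # translate centre y
    return X_aug
```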
Further, breast ultrasound images are characterized by low contrast and resolution, blurred boundaries, background noise and artifact interference. Therefore, in addition to the conventional image enhancement techniques such as rotation, left-right flipping, translation, scaling and affine (perspective) transformation, the following image enhancement methods are also employed for the target detection network (YOLOv5), as illustrated in the sketch after this list:
1) randomly copy a portion of the samples and add random noise disturbance: the gray value of each image pixel is randomly perturbed using salt-and-pepper noise and Gaussian noise;
2) randomly copy a portion of the samples and apply Gaussian blur using the Pillow library;
3) enhance image contrast and brightness through Gamma transformation.
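The following sketch illustrates these three augmentations with NumPy and the Pillow library; the noise probabilities, blur radius and gamma value are assumed parameters.

```python
# A rough sketch (parameter values are assumptions) of the three extra augmentations:
# salt-and-pepper / Gaussian noise, Gaussian blur via Pillow, and gamma correction.
import numpy as np
from PIL import Image, ImageFilter

def add_noise(img, salt_prob=0.01, sigma=10.0):
    """Salt-and-pepper plus Gaussian perturbation of pixel gray values."""
    arr = np.asarray(img, dtype=np.float32)
    mask = np.random.rand(*arr.shape) < salt_prob
    arr[mask] = np.random.choice([0, 255], size=int(mask.sum()))  # salt & pepper
    arr += np.random.normal(0.0, sigma, arr.shape)                # Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def gaussian_blur(img, radius=2):
    return img.filter(ImageFilter.GaussianBlur(radius=radius))    # Pillow blur

def gamma_transform(img, gamma=0.8):
    """Contrast/brightness enhancement via Gamma transformation."""
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return Image.fromarray((255.0 * np.power(arr, gamma)).astype(np.uint8))
```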
Step six: cutting and compressing the basic network model by means of a network pruning technique.
The basic network model is further cut, compressed and adapted to generate an embedded network algorithm model suitable for low-power edge computing systems or devices with limited computing power, so as to address the insufficient AI computing power at the embedded end.
When deploying deep learning in practice, the model must be compressed to cope with the limited AI computing power of the embedded end; pruning is one of the main deep learning model compression techniques.
In the breast ultrasound lesion identification and localization system composed of the aforementioned YOLOv5 target detection network and the logistic regression classifier, the resources and computation occupied by the classifier are negligible, so the overall scale of the network system essentially depends on the scale of the target detection network (YOLOv5). YOLOv5, and especially YOLOv5s, is an excellent lightweight target detection network, yet the model can still be large, particularly for high-resolution ultrasound images. To deploy the network into an embedded system based on an edge computing platform, which usually has low power consumption, small memory and limited AI computing power, the network scale often has to be reduced so that the system can run smoothly.
This embodiment adopts a structured channel pruning method to clip and compress the YOLOv5 network model. In the YOLOv5 architecture, the combination of a batch normalization (BN) layer and a convolution layer is used extensively as the minimum operation unit in the backbone and neck parts. Although the BN layer acts as regularization and plays a positive role in speeding up convergence and avoiding overfitting during training, at inference time it adds extra layer operations, affects model performance and occupies additional resources.
Referring to fig. 6, a group of trainable scaling factors is introduced for each channel of the BN layers, and an L1 regularization term is added to constrain these scaling factors. Sparse training drives the scaling factors toward sparsity; the BN scaling factors (i.e., weights) are used to evaluate the importance (score) of the input channels, channels whose scores fall below a threshold value are filtered out, i.e., channels with small sparsity or small scaling factors are cut away, the pruned network is then fine-tuned, and this process is iterated to obtain a more compressed and refined model. The specific steps are as follows:
1) Channel sparsity regularization training

Channel-wise sparsity can be applied to pruning and thinning any classical CNN or fully connected network, thereby increasing inference speed. The network's target loss function here is defined as:
L = \sum_{(x, y)} l\bigl(f(x, W), y\bigr) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)
where (x, y) denotes the training data and labels, W denotes the trainable parameters of the network, the first term is the normal training loss of the network, and the second term is the sparsity constraint with regularization coefficient λ (the larger λ is, the stronger the constraint). Choosing g(γ) = |γ|, i.e. L1 regularization (the L1 norm), is widely used for sparsification. γ is a scaling factor multiplied with the channel input; the network weights and the scaling factors are trained jointly with a gradient descent optimization algorithm. The added L1 norm constraint pushes many of the trained scaling factors (weights) toward 0, making them sparser, so that the network automatically identifies unimportant channels, which can then be removed with almost no loss of accuracy. Sparse training thus thins out the channels, and channels with small sparsity or small scaling factors are cut away. A minimal sketch of this sparsity-regularized training step is given below.
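As an illustration, the sketch below shows one common way (in the spirit of network slimming) to impose the L1 sparsity term on the BN scaling factors in PyTorch by adding a subgradient λ·sign(γ) to their gradients after the ordinary backward pass. This is an assumed implementation, not the embodiment's exact code.

```python
# A minimal PyTorch sketch (assumed implementation) of the channel-sparsity
# regularization: after the normal backward pass, an L1 subgradient term
# lambda * sign(gamma) is added to every BN scaling factor's gradient.
import torch
import torch.nn as nn

def add_bn_l1_subgradient(model: nn.Module, lam: float = 1e-4) -> None:
    """Apply the L1 sparsity constraint to all BatchNorm2d scaling factors."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            m.weight.grad.add_(lam * torch.sign(m.weight.detach()))

# Usage inside the training loop (loss, optimizer, model are assumed to exist):
#   loss.backward()
#   add_bn_l1_subgradient(model, lam=1e-4)   # sparsity term on the BN gammas
#   optimizer.step()
```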
In the YOLOv5 network architecture, the batch normalization (BN) layer and the convolution layer are used extensively as the minimum operation unit in the backbone and neck parts. The transformation performed by the BN layer is:
\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta
where z_in and z_out are the input activation vector and the output activation value of a BN-layer channel, respectively, μ_B and σ_B are the mean and variance of the mini-batch input activation feature vectors, ε is a small constant for numerical stability, and γ and β are the scaling factor and offset of the corresponding activation channel; γ represents the activation degree of that channel. As the formula shows, BN first normalizes the mini-batch input features (subtracting the mean and dividing by the standard deviation) and then applies an affine transformation with the learnable parameters γ and β to produce the final BN output. This per-channel scaling in BN fits naturally with the channel scaling factor concept described above: the scaling factor γ of the BN layer serves as the importance factor. The smaller γ is, the weaker the corresponding activation and its influence on subsequent layers, so the channel corresponding to a small γ is less important and can be cut away.
2) Model pruning and fine-tuning
After sparsity regularization training, the model contains BN scaling factors (or weights) that are largely sparse. A pruning threshold is derived from a preset pruning rate, and the scaling factors of channels below the threshold are set to 0, i.e., those channels are pruned. The pruned network is then fine-tuned, and this process is iterated to obtain a more compressed and refined target detection network model.
Many of these weights are close to 0. Assuming a preset pruning rate P (a percentage), the pruning threshold can be obtained as:

θ_prune = sort_P(M), where M = {γ_1, γ_2, ..., γ_n} is the list of channel importance scores and sort_P(·) sorts the values in ascending order and outputs the value at the P-quantile position.

The value of the pruning rate P is determined by the computing power of the specific hardware platform and is usually 40% to 70%. If the pruning rate is 70%, the 0.7-quantile value of the list M is the pruning threshold; the scaling factors γ of channels below this threshold are set to 0, yielding a more compact network model with fewer parameters and lower memory and computational requirements. After heavy pruning the model accuracy typically drops, and it can largely be restored by fine-tuning for a few training epochs. A minimal sketch of this thresholding step follows.
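For illustration, the sketch below derives the pruning threshold as the P-quantile of all BN scaling-factor magnitudes and zeroes out the factors below it; structural channel removal and the subsequent fine-tuning are omitted, and the code is an assumed implementation rather than the embodiment's.

```python
# A minimal PyTorch sketch (assumed implementation) of deriving the pruning threshold
# from a preset pruning rate P and zeroing the BN scaling factors below it.
import torch
import torch.nn as nn

def prune_bn_channels(model: nn.Module, prune_rate: float = 0.7) -> float:
    """Zero out BN gammas below the prune_rate quantile; returns the threshold used."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_rate)       # e.g. 0.7-quantile of |gamma|
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                mask = m.weight.abs() >= threshold       # keep only important channels
                m.weight.mul_(mask)                      # gamma -> 0 for pruned channels
                m.bias.mul_(mask)                        # beta  -> 0 as well
    return float(threshold)
```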
According to the embedded breast ultrasound image identification method, the constructed basic network model is cut and compressed through the pruning technique to generate an embedded network algorithm model suitable for low-power edge computing systems or devices with limited AI computing power. The resulting embedded system can be widely deployed and run on various small, low-power, low-cost portable ultrasound imaging devices, greatly lowering the operational and technical thresholds of breast ultrasound auxiliary screening and preliminary examination work.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An embedded breast ultrasound image identification method is characterized by comprising the following steps:
s1, constructing a target detection network model and a target classification network model of the breast lesion based on the dynamic ultrasonic image, respectively obtaining a target detection network model data set and a target classification network model data set, and training the target detection network model and the target classification network model to obtain a trained target detection network model and a trained target classification network model;
s2, cutting the trained target detection network model, deploying the cut target detection network model and the trained target classification network model as sub-networks into an embedded system, and generating an embedded breast ultrasound image recognition network system; the embedded breast ultrasound image recognition network system comprises: a target detection subnetwork and a target classification subnetwork;
s3, inputting the breast ultrasonic image to be identified into the target detection subnetwork for screening, and outputting a screening result; the screening result is a feature vector matrix of a video sequence dynamic image which is detected positively; the positively detected video sequence dynamic image feature vector matrix comprises: the image feature vector of a positive detection display frame and the image feature vector of the first n frames of video sequence of the positive detection display frame; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image;
s4, inputting the positively detected video sequence dynamic image feature vector matrix into the target classification sub-network, outputting a classification result, and generating a breast ultrasonic image recognition result according to the classification result; and the classification result is a true positive feature vector or a false positive feature vector.
2. The method for identifying an embedded breast ultrasound image as claimed in claim 1, wherein in the step S1, the target classification network model data set is obtained by:
inputting the collected original breast ultrasound image data video sequences of a preset number into the target detection network model for data screening, outputting image characteristic vectors of a positive detection display frame and the first n frames of video sequence image characteristic vectors of the positive detection display frame, and constructing a positive detection sample video sequence image characteristic composite vector matrix; the value of n depends on the image definition and the scanning frame rate of the ultrasonic image; the image feature vector includes: the detected confidence score, the height and the width of the detection frame and the coordinate position of the center point of the detection frame;
marking each positive detection display frame to generate a video file containing marks;
acquiring the manual retest result of the video file, and writing a first label on the corresponding positive-detected sample video sequence image feature composite vector matrix if the retest result is true positive; if the rechecking result is false positive, writing a second label on the corresponding positive detection sample video sequence image characteristic composite vector matrix;
and repeating the processes to obtain a target classification network model data set containing positive and negative samples.
3. The method for recognizing an embedded breast ultrasound image as claimed in claim 2, wherein in step S1, the training of the target classification network model includes:
performing time characteristic enhancement and space characteristic enhancement on the positive detection sample video sequence image characteristic composite vector matrix by learning the space-time change characteristics of the first n frames of video sequence image characteristic vectors of the positive detection sample video sequence image characteristic composite vector matrix; the spatiotemporal variation features include: the scoring change of the confidence coefficient, the height and width change of the detection frame and the coordinate position change of the central point of the detection frame;
and training the target classification network model through a data set formed by the positive detection sample video sequence image feature composite vector matrix after feature enhancement.
4. The method for recognizing an embedded breast ultrasound image as claimed in claim 1, wherein in step S2, the clipping the trained target detection network model includes:
normalizing the BN layer of the trained target detection network model, introducing a group of scaling factors into each channel in the BN layer, and adding an L1 norm to constrain the scaling factors;
and scoring each channel in the BN layer according to the scaling factors, filtering out the channels with the scores lower than a preset threshold value, and finishing the cutting of the target detection network model.
5. The method for identifying an embedded breast ultrasound image as claimed in claim 1, wherein the step S4 includes:
s41, inputting the positive detected video sequence dynamic image feature vector matrix into the target classification sub-network, and outputting a classification normalization score;
s42, comparing the classification normalization score with a preset optimal threshold value of the target classification sub-network, and when the classification normalization score is larger than the optimal threshold value, taking the positive detected video sequence dynamic image feature vector matrix as a true positive feature vector; otherwise, the vector is a false positive feature vector;
and S43, generating a breast ultrasonic image recognition result according to the true positive characteristic vector or the false positive characteristic vector.
6. The method for identifying an embedded breast ultrasound image as claimed in claim 1, wherein the network architecture of the target detection network model is composed of an input end, a backbone network, a neck network and a head prediction network;
the input end performs data enhancement on an input data set;
the backbone network is composed of a convolutional neural network; the convolutional neural network adopts a Focus + CSPNet + SPP series structure; the Focus performs image slicing on the data set; the CSPNet performs feature extraction on the sliced data set to generate a feature map; the SPP converts the feature map of arbitrary size into a feature vector of fixed size;
the neck network adopts an AF-FPN framework, comprising: an adaptive attention module and a feature enhancement module; the neck network aggregates paths of image features and transmits the image features to the head prediction network;
the head prediction network outputs a mammary gland ultrasonic image recognition result; the recognition result comprises: object category, confidence score, bounding box size feature information, and location coordinate feature information.
7. The method of claim 6, wherein the data enhancement comprises:
performing rotation, left-right inversion, translation, scaling and affine transformation enhancement on the data set;
randomly adding noise perturbations to the data set, comprising: performing random disturbance on each pixel gray value of the image in the data set by adopting salt and pepper noise and Gaussian noise;
performing Gaussian blur on the data set;
and carrying out contrast and brightness image enhancement on the data set through Gamma transformation.
8. The method of claim 6, wherein the adaptive attention module obtains a plurality of context features of different scales through an adaptive average pooling layer; and generating a spatial weight graph for each feature graph output by the backbone network through a spatial attention mechanism, and fusing the context features through the weight graphs to generate a new feature graph containing multi-scale context information.
9. The method for identifying an embedded breast ultrasound image as claimed in claim 6, wherein the feature enhancement module is composed of a multi-branch convolution layer and a branch pooling layer;
the multi-branch convolutional layer comprises: a void convolution layer, a BN layer and a ReLU active layer; the multi-branch convolution layer provides a corresponding receptive field for the input characteristic diagram through cavity convolution;
the branch pooling layer fuses image feature information from the receptive field.
10. The method for identifying an embedded breast ultrasound image as claimed in claim 1, wherein the target classification network model employs a logistic regression classifier model; inputting the positive detected video sequence dynamic image feature vector matrix into the trained target classification network model, and synchronizing to the current positive detected display frame; and performing time characteristic enhancement and space characteristic enhancement on the image of the current positive detection display frame by utilizing the space-time characteristic information of the video sequence in the preset time period of the positive detection video sequence dynamic image characteristic vector matrix, performing true positive and false positive classification judgment on the current positive detection display frame, and outputting a true positive characteristic vector or a false positive characteristic vector.
CN202210349097.1A 2022-04-01 2022-04-01 Identification method of embedded breast ultrasonic image Active CN114842238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210349097.1A CN114842238B (en) 2022-04-01 2022-04-01 Identification method of embedded breast ultrasonic image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210349097.1A CN114842238B (en) 2022-04-01 2022-04-01 Identification method of embedded breast ultrasonic image

Publications (2)

Publication Number Publication Date
CN114842238A true CN114842238A (en) 2022-08-02
CN114842238B CN114842238B (en) 2024-04-16

Family

ID=82564600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210349097.1A Active CN114842238B (en) 2022-04-01 2022-04-01 Identification method of embedded breast ultrasonic image

Country Status (1)

Country Link
CN (1) CN114842238B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110991254A (en) * 2019-11-08 2020-04-10 深圳大学 Ultrasound image video classification prediction method and system
CN111539930A (en) * 2020-04-21 2020-08-14 浙江德尚韵兴医疗科技有限公司 Dynamic ultrasonic breast nodule real-time segmentation and identification method based on deep learning
CN111862044A (en) * 2020-07-21 2020-10-30 长沙大端信息科技有限公司 Ultrasonic image processing method and device, computer equipment and storage medium
CN112086197A (en) * 2020-09-04 2020-12-15 厦门大学附属翔安医院 Mammary nodule detection method and system based on ultrasonic medicine
CN113855079A (en) * 2021-09-17 2021-12-31 上海仰和华健人工智能科技有限公司 Real-time detection and breast disease auxiliary analysis method based on breast ultrasonic image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAIXIA LIU: "Artificial Intelligence-Based Breast Cancer Diagnosis Using Ultrasound Images and Grid-Based Deep Feature Generator", International Journal of General Medicine, 1 March 2022 (2022-03-01), pages 2271 - 2282 *
ZHENG YUANJIE; SONG JINGQI: "A review of breast image diagnosis based on artificial intelligence" (基于人工智能的乳腺影像诊断综述), Journal of Shandong Normal University (Natural Science Edition), no. 02, 15 June 2020 (2020-06-15) *
GONG XUN: "A survey of automatic analysis techniques for thyroid and breast ultrasound images" (甲状腺、乳腺超声影像自动分析技术综述), Journal of Software, 15 July 2020 (2020-07-15), pages 2245 - 2282 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115590584A (en) * 2022-09-06 2023-01-13 汕头大学(Cn) Hair follicle hair taking control method and system based on mechanical arm
CN115590584B (en) * 2022-09-06 2023-11-14 汕头大学 Hair follicle taking control method and system based on mechanical arm

Also Published As

Publication number Publication date
CN114842238B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110598029B (en) Fine-grained image classification method based on attention transfer mechanism
US10846566B2 (en) Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN112052886B (en) Intelligent human body action posture estimation method and device based on convolutional neural network
US20190228268A1 (en) Method and system for cell image segmentation using multi-stage convolutional neural networks
WO2020133636A1 (en) Method and system for intelligent envelope detection and warning in prostate surgery
CN112598643B (en) Depth fake image detection and model training method, device, equipment and medium
CN110276745B (en) Pathological image detection algorithm based on generation countermeasure network
CN111291809A (en) Processing device, method and storage medium
CN106778687A (en) Method for viewing points detecting based on local evaluation and global optimization
CN111626993A (en) Image automatic detection counting method and system based on embedded FEFnet network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN112700461B (en) System for pulmonary nodule detection and characterization class identification
CN112613350A (en) High-resolution optical remote sensing image airplane target detection method based on deep neural network
CN112232395B (en) Semi-supervised image classification method for generating countermeasure network based on joint training
CN116884623B (en) Medical rehabilitation prediction system based on laser scanning imaging
CN115527269B (en) Intelligent human body posture image recognition method and system
CN111524140B (en) Medical image semantic segmentation method based on CNN and random forest method
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN111968124A (en) Shoulder musculoskeletal ultrasonic structure segmentation method based on semi-supervised semantic segmentation
CN116721414A (en) Medical image cell segmentation and tracking method
CN111639697A (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN114842238B (en) Identification method of embedded breast ultrasonic image
CN112836755B (en) Sample image generation method and system based on deep learning
Ji et al. Lung nodule detection in medical images based on improved YOLOv5s

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant