WO2021155679A1 - Target positioning method, device, and system - Google Patents

Target positioning method, device, and system

Info

Publication number
WO2021155679A1
Authority
WO
WIPO (PCT)
Prior art keywords: image, target lesion, area image, video frame, search area
Prior art date
Application number
PCT/CN2020/124623
Other languages
English (en)
French (fr)
Inventor
尚鸿
章子健
孙钟前
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2021155679A1
Priority claimed by US application 17/677,990, published as US20220180520A1

Classifications

    • All codes fall under G Physics; G06 Computing, calculating or counting; G06T Image data processing or generation, in general:
    • G06T7/0012 Biomedical image inspection
    • G06T7/0014 Biomedical image inspection using an image reference approach
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06T2207/10016 Video; image sequence
    • G06T2207/10068 Endoscopic image
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30096 Tumor; lesion


Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

A target positioning method, device, and system, relating to the field of computer technology. The method includes: when a video frame image containing a target lesion is detected from a video stream to be detected, determining position information of the target lesion on the video frame image (200); and according to the position information of the target lesion on the video frame image, tracking the target lesion and determining position information of the target lesion on video frame images to be tracked in the video stream to be detected (210).

Description

Target positioning method, device, and system
This application claims priority to the Chinese patent application with application number 202010083134X, titled "Target positioning method, device, and system", filed with the China Patent Office on February 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a target positioning method, device, and system.
Background
In the related art, in an endoscopic diagnosis system, lesions are detected with a target detection algorithm that processes every video frame image in the real-time endoscope video stream; that is, on each video frame image it determines any number of lesions that may exist at any position and gives the location of each lesion.
However, this related-art approach relies entirely on the target detection method and must detect on every video frame, so it has speed and robustness problems. In terms of speed, the frame rate of an endoscope video stream is usually high, while a target detection algorithm, in order to guarantee a certain accuracy, usually takes longer than the frame interval; as a result some frames are easily missed, or the lesion has already shifted by the time the detection result is output, causing inaccurate positioning. In terms of robustness, because the timing information of the video stream is not considered and detection is performed frame by frame, some of the consecutive frames in which a lesion appears may be predicted as "no lesion", and after each "no lesion" prediction the next "lesion" frame is treated as a newly appearing lesion even though it is actually the same lesion, which reduces robustness and reliability.
Summary
According to various embodiments provided by this application, a target positioning method, device, and system are provided.
An embodiment of this application provides a target positioning method, including:
when a video frame image containing a target lesion is detected from a video stream to be detected, determining position information of the target lesion on the video frame image; and
according to the position information of the target lesion on the video frame image, tracking the target lesion, and determining position information of the target lesion on video frame images to be tracked in the video stream to be detected.
Another embodiment of this application provides a target positioning device, including:
a detection module, configured to determine position information of a target lesion on a video frame image when a video frame image containing the target lesion is detected from a video stream to be detected; and
a tracking module, configured to track the target lesion according to the position information of the target lesion on the video frame image, and determine position information of the target lesion on video frame images to be tracked in the video stream to be detected.
Another embodiment of this application provides a target positioning system, at least including a video capture device, a processing device, and an output device, specifically:
the video capture device, used to obtain a video stream to be detected;
the processing device, configured to determine position information of a target lesion on a video frame image when a video frame image containing the target lesion is detected from the video stream to be detected, and, according to the position information of the target lesion on the video frame image, track the target lesion and determine position information of the target lesion on video frame images to be tracked in the video stream to be detected; and
the output device, used to output the position information of the target lesion on the video frame image and the position information on the video frame images to be tracked.
Another embodiment of this application provides an electronic device, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the above target positioning method.
Another embodiment of this application provides one or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the above target positioning method.
The details of one or more embodiments of this application are set forth in the following drawings and description. Other features, objects, and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the application architecture of the target positioning method in an embodiment of this application;
FIG. 2 is a flowchart of the target positioning method in an embodiment of this application;
FIG. 3 is a schematic diagram of the network structure of a siamese network in an embodiment of this application;
FIG. 4 is a schematic framework diagram of the principle of the tracking model in an embodiment of this application;
FIG. 5 is a flowchart of the tracking model training method in an embodiment of this application;
FIG. 6 is a schematic diagram of selecting a template area image and a search area image in an embodiment of this application;
FIG. 7 is a schematic structural diagram of a target positioning system in an embodiment of this application;
FIG. 8 is a schematic structural diagram of a target positioning device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
To facilitate understanding of the embodiments of this application, several concepts are briefly introduced first:
Video stream: the embodiments of this application are mainly directed at image video streams scanned during various kinds of medical diagnosis, for example medical image video streams obtained by endoscopic scanning, including endoscopic colorectal video streams; of course this is not a limitation, and the video stream may also come from other business fields.
Lesion: the part of the body where disease occurs, for example a polyp in the colorectum.
Target: in the embodiments of this application, if the video stream is a medical video stream, the target is the target lesion.
Siamese network: a machine learning network structure, i.e., a neural network framework rather than a specific network; in a concrete implementation a convolutional neural network (CNN) can be used. It is used to measure the degree of similarity of two inputs. The tracking model in the embodiments of this application is mainly based on a siamese network: through similarity detection of the target lesion in different video frame images, it tracks the target lesion and determines the position of the target lesion on the video frame image.
CNN: a convolutional neural network, a deep feedforward artificial neural network.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technology. Basic artificial intelligence technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to machine vision that uses cameras and computers in place of human eyes to identify, track, and measure targets, with further graphics processing so that the computer output becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the related theories and technologies, trying to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional (3D) technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition. For example, the embodiments of this application apply artificial intelligence technology to the medical field, mainly involving computer vision: the image semantic understanding technology of computer vision can be used to perform target lesion detection on the video frame images in the video stream to be detected, i.e., to detect whether a video frame image contains a target lesion; for another example, the video semantic understanding technology of computer vision can be used to realize tracking of the target lesion.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service. With the development of technology, artificial intelligence will be applied in more fields and exert more and more important value.
The solutions provided by the embodiments of this application mainly involve computer vision and related artificial intelligence technologies, and are specifically described through the following embodiments:
At present, AI-assisted detection of target lesions under endoscopy, for example colorectal polyp detection, usually adopts a target detection method that processes every video frame image in the real-time endoscope video stream. But this approach relies entirely on the target detection method and detects on every video frame image, so it has speed and robustness problems. In terms of speed, AI-assisted detection only has clinical value if it provides prediction results in real time, and the frame rate of an endoscope video stream is usually high: currently it is typically 25 frames per second (fps), i.e., an interval of 40 milliseconds (ms) per frame, whereas a target detection method, in order to guarantee a certain accuracy, takes longer than this interval, so some frames are missed and the lesion has already shifted by the time a prediction is given, causing inaccurate positioning or a dragging effect in the product experience. In terms of robustness, the timing information of the video stream is not considered: a lesion in an endoscope video stream cannot appear or disappear instantaneously; it continuously appears from the edge of the picture, comes closer, moves away, and finally disappears. The frame-by-frame detection approach may predict some of the consecutive frames in which the lesion appears as "no lesion", and each "no lesion" prediction causes the next "lesion" frame to be treated as a newly appearing lesion and a new alarm to be issued, even though it is actually the same lesion. This reduces robustness and reliability, causes multiple alarms for the same lesion, and easily interferes with the doctor's clinical operation.
Therefore, in view of the above problems, the embodiments of this application provide a target positioning method: when a video frame image containing a target lesion is detected from a video stream to be detected, position information of the target lesion on that video frame image is determined, and a tracking process can then be triggered; according to the position information of the target lesion on the detected video frame image, the target lesion is tracked and its position information on the video frame images to be tracked in the video stream is determined. In this way, once the target lesion is detected, the tracking process is triggered to locate and follow it. Compared with detection, tracking is less difficult, because the target lesion is a determined object near a given position with a known shape, so the known information available during subsequent tracking is clearer and richer; therefore tracking is faster than detection and real-time performance can be guaranteed. Moreover, tracking the target lesion in the video stream to be detected makes use of the timing information of the stream, so the consecutive video frame images in which the target lesion appears can be predicted as the same target lesion, which reduces misjudgment and improves robustness and reliability.
Referring to FIG. 1, a schematic diagram of the application architecture of the target positioning method in an embodiment of this application, the architecture includes a server 100 and a terminal device 200.
The terminal device 200 can be a medical device; for example, a user can collect an endoscopic image video stream through the terminal device 200 and can also view on it the tracking result of the target lesion in the video stream to be detected, including the position information of the lesion on the video frame images of the stream.
The terminal device 200 and the server 100 can be connected via the Internet to communicate with each other. Optionally, the above Internet uses standard communication technologies and/or protocols. The network is usually the Internet, but can be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, technologies and/or formats including HyperText Markup Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network. In addition, conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or some of the links. In other embodiments, customized and/or dedicated data communication technologies can also be used in place of, or in addition to, the above data communication technologies.
The server 100 can provide various network services for the terminal device 200, where the server 100 can be one server, a server cluster composed of several servers, or a cloud computing center.
Specifically, the server 100 may include a processor 110 (Central Processing Unit, CPU), a memory 120, an input device 130, an output device 140, and so on; the input device 130 may include a keyboard, a mouse, a touch screen, etc., and the output device 140 may include a display device such as a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT).
It should be noted that in the embodiments of this application the target positioning method is mainly executed on the server 100 side, and the training process of the tracking model is also executed by the server 100. For example, the terminal device 200 sends the collected video stream to be detected to the server 100, and the server 100 performs target lesion detection on each video frame image in the stream, for example using a trained detection model; when a video frame image containing the target lesion is detected, the tracking model is triggered to track the target lesion and determine its position information on the video frame images to be tracked, until the target lesion disappears. The server 100 can send the detection and tracking results, i.e., the position information of the target lesion in each video frame image, to the terminal device 200, for example sending position information each time the target lesion is detected or tracked, so that the user can see on the terminal device 200 the real-time position information of the target lesion in each video frame image. The application architecture shown in FIG. 1 is described taking application on the server 100 side as an example.
Of course, the target positioning method in the embodiments of this application can also be executed by the terminal device 200; for example, the terminal device 200 can obtain the trained detection model and tracking model from the server 100 side, detect the target lesion based on the detection model, and, when a video frame image containing the target lesion is detected, trigger the tracking model to track and locate the target lesion. This is not limited in the embodiments of this application.
For example, if the video stream to be detected is an endoscopic colorectal video stream, the target lesion is, for example, a polyp.
The application architecture diagram in the embodiments of this application is intended to explain the technical solutions in the embodiments more clearly and does not constitute a limitation on them; of course, the solutions are not limited to medical business applications either. For other application architectures and business applications, the technical solutions provided by the embodiments of this application are equally applicable to similar problems.
Each embodiment of this application is schematically described taking the application architecture shown in FIG. 1 as an example.
Based on the above embodiments, referring to FIG. 2, a flowchart of the target positioning method in an embodiment of this application, the method includes:
Step 200: when a video frame image containing a target lesion is detected from a video stream to be detected, determine position information of the target lesion on the video frame image.
The emphasis of the embodiments of this application is that, when a video frame image containing the target lesion is detected, the tracking process can be triggered to track and locate the target lesion; as for the specific method of detecting the target lesion:
For example, when executing step 200, a possible implementation is provided in the embodiments of this application: according to image feature information of the target lesion and a preset detection method, target lesion detection is performed on each video frame image of the video stream to be detected to determine whether a video frame image containing the target lesion is detected; if so, the position information of the target lesion on that video frame image is determined.
For another example, the detection method is a detection model, pre-trained on an image sample set of the target lesion; the detection model can then be used to detect each video frame image in the video stream to be detected to determine whether the target lesion is detected.
The detected position information is not merely the coordinates of a single point: the target lesion usually appears on a video frame image as a region rather than a point, so the position information represents the coordinate range of the target region of the lesion on the video frame image, for example the position coordinates of a bounding box.
It should be noted that detection of the target lesion in the embodiments of this application does not need to be performed on every video frame image. The video stream to be detected has a certain frame rate, and the detection method also takes a certain amount of time per video frame image; the two are usually different. Therefore, the embodiments of this application can allow the detection method to perform target lesion detection at intervals matching its own detection time; once the lesion is detected, every frame of the video frame images to be tracked can be handled by the tracking process until the target lesion disappears, i.e., until it is determined that the target lesion is no longer tracked, which reduces missed frames, as sketched below.
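To make that scheduling concrete, here is a minimal sketch of interleaving a slow detector with per-frame tracking. The function names (`detect`, `track`) and the frame source are illustrative assumptions, not part of the patent, and for clarity the sketch runs the two stages sequentially, whereas the embodiments also allow them to run in parallel:

```python
# Illustrative detect-then-track scheduling (names are assumptions).
# detect(frame) -> lesion box or None; runs only while no lesion is tracked.
# track(frame, template) -> box, or None once the similarity condition fails.

def process_stream(frames, detect, track):
    template = None  # lesion crop from the frame in which it was detected
    for frame in frames:
        if template is None:
            box = detect(frame)            # slow model, its own cadence
            if box is not None:
                template = (frame, box)    # lesion found: switch to tracking
                yield frame, box
        else:
            box = track(frame, template)   # fast tracking on every frame
            if box is None:
                template = None            # lesion disappeared: detect again
            else:
                yield frame, box
```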
Step 210: according to the position information of the target lesion on the video frame image, track the target lesion, and determine position information of the target lesion on the video frame images to be tracked in the video stream to be detected.
Executing step 210 specifically includes:
S1. Take the area image corresponding to the position information of the target lesion on the video frame image as a template area image, and successively select, from the video frame images to be tracked in the video stream to be detected, a preset-range area image centered on the position information corresponding to the target lesion as a search area image.
In the embodiments of this application, the position information of the target lesion on the first video frame image in which it appears is known at tracking time, so that position information can be used as input to track the target lesion.
The video frame images to be tracked in the video stream to be detected are preferably the consecutive video frame images starting from the video frame image following the one in which the target lesion was detected, until the video frame image in which the tracked lesion is determined to have disappeared; however, this is not limited in the embodiments of this application.
Furthermore, during tracking, to improve efficiency and speed, an image near the target lesion is selected from the video frame image to be tracked: centered on the position information from the previous video frame in which the target lesion appeared, a preset-range area image is selected as the search area image; for more accurate positioning, a relatively large area image can be selected as the search area image.
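A small sketch of the crop selection in S1, assuming square crops and mean-padding at the image border (both are assumptions; the patent only specifies a preset range centered on the last known position):

```python
import numpy as np

def center_crop(image, center, size):
    """Crop a size x size patch centered at (x, y), padding borders with the
    image mean. Used for both the template (small) and search (large) crops."""
    x, y = int(center[0]), int(center[1])
    half = size // 2
    pad = ((half, half), (half, half)) + ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad, mode="mean")
    return padded[y:y + size, x:x + size]

# Template z around the detected lesion, larger search region x around the
# same position on the next frame (127/255 match the example sizes below):
# z = center_crop(detected_frame, lesion_center, 127)
# x = center_crop(frame_to_track, lesion_center, 255)
```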
S2. Determine a first similarity value between the search area image and the template area image.
Determining the first similarity value between the search area image and the template area image specifically includes:
S2.1. Based on a convolutional neural network, map the search area image and the template area image into a feature space of set dimensions, respectively, to obtain feature vectors corresponding to the search area image and the template area image.
S2.2. Perform a two-dimensional convolution operation on the feature vectors corresponding to the search area image and the template area image, and determine a second similarity value between each image block of the search area image and the template area image, where each image block is obtained by sliding a window over the search area image with a preset step size.
The size of the sliding window is the same as that of the template area image. For example, if the size of the template area image is 6*6*128, the size of the search area image is 22*22*128, and the preset step size is 1, then when the two-dimensional convolution operation divides the search area image into image blocks, a 6*6*128 sliding window sliding with step size 1 divides the 22*22*128 search area image into 17*17*1 image blocks.
S2.3. Take the two-dimensional matrix formed by arranging the second similarity values according to the positions of the corresponding image blocks on the search area image as the first similarity value between the search area image and the template area image.
That is, in the embodiments of this application the second similarity value between each image block and the template area image is calculated, yielding multiple second similarity values. The first similarity value between the template area image and the search area image is therefore not a single number but a two-dimensional matrix, for example a 17*17 matrix, in which each value represents the second similarity value with the corresponding image block of the search area image.
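The sliding-window comparison in S2.2 and S2.3 is exactly a two-dimensional cross-correlation, which is what `torch.nn.functional.conv2d` computes; a sketch with the feature sizes of the example above:

```python
import torch
import torch.nn.functional as F

# Feature maps in (batch, channels, height, width) layout:
z_feat = torch.randn(1, 128, 6, 6)    # template features = the sliding window
x_feat = torch.randn(1, 128, 22, 22)  # search-region features

# Using the template features as the convolution kernel scores every
# 6x6 window of the search features at stride 1 (PyTorch's conv2d is
# cross-correlation), producing the first-similarity-value matrix:
score_map = F.conv2d(x_feat, z_feat)
print(score_map.shape)  # torch.Size([1, 1, 17, 17]); 17 = 22 - 6 + 1
```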
S3. If the first similarity value is determined to satisfy a similarity condition, determine that the target lesion is tracked, and determine position information of the target lesion on the search area image.
Specifically: if it is determined that the two-dimensional matrix of the first similarity value contains a second similarity value not smaller than a preset threshold, it is determined that the target lesion is tracked, and the position information of the image block corresponding to the largest of the second similarity values is determined as the position information of the target lesion on the search area image.
For example, if the first similarity value is a 2*2 two-dimensional matrix with values (0.3, 0.4; 0.5, 0.8) and the preset threshold is 0.6, then since 0.8 is greater than 0.6 it is determined that the target lesion is tracked, and since 0.8 is the maximum, the position information of the image block corresponding to 0.8 is the position information of the target lesion on that video frame image to be tracked.
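Reading the decision in S3 off the score map might look as follows; the mapping from the argmax cell back to pixel coordinates assumes a fully convolutional network with a known total stride, which the patent does not specify:

```python
import torch

def locate(score_map, threshold, stride, search_center):
    """Threshold-and-argmax decision on a (1, 1, H, W) score map. Returns the
    lesion position in search-region pixel coordinates, or None when no score
    reaches the threshold (tracking ends)."""
    scores = torch.sigmoid(score_map).squeeze()
    if scores.max().item() < threshold:
        return None                                  # lesion not tracked
    row, col = divmod(int(scores.argmax()), scores.shape[1])
    dy = (row - (scores.shape[0] - 1) / 2) * stride  # offset of peak cell
    dx = (col - (scores.shape[1] - 1) / 2) * stride  # from the map center
    return search_center[0] + dx, search_center[1] + dy
```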
Further, if the first similarity value is determined not to satisfy the similarity condition, it is determined that the target lesion is not tracked, and tracking ends. That is, if no second similarity value in the two-dimensional matrix of the first similarity value is at least the preset threshold, it is determined that the target lesion is not tracked and tracking can end.
For example, if the video frame image in which the target lesion is detected from the video stream to be detected is the 6th frame, the tracking process is triggered and tracking starts from the 7th video frame image; if the target lesion is determined to be tracked on the 7th through 18th video frame images, and the first similarity value calculated for the 19th video frame image does not satisfy the similarity condition, it is determined that the target lesion is not tracked and this round of the tracking process ends. In this way, by combining detection and tracking, it can be established that the 6th through 18th video frame images all contain the target lesion, and the position information of the target lesion can be determined, reducing missed frames and improving robustness.
Further, in the embodiments of this application an alarm can also be issued when the target lesion is detected. A possible implementation is provided: when a video frame image containing the target lesion is detected from the video stream to be detected, the method further includes issuing an alarm in a preset manner to prompt that a target lesion has appeared.
For example, the alarm can be given by voice, by text, or by a distinct sound such as a "beep".
In this way, in the embodiments of this application, after the target lesion is detected, the tracking method is used to follow it. Tracking is less difficult, and it uses the timing information of the video stream (the displacement of an object between two adjacent frames is limited), so consistent predictions can be output on the consecutive frames in which the target lesion appears, reducing false negatives and improving robustness and reliability, thereby reducing repeated alarms for the same target lesion and the interference with the doctor's clinical operation.
Further, the above method of executing step 210 can be implemented with a tracking model. The embodiments of this application provide a possible implementation: tracking the target lesion according to its position information on the video frame image and determining its position information on the video frame images to be tracked in the video stream to be detected includes: triggering a trained tracking model, and, based on the tracking model, taking the position information of the target lesion on the video frame image as an input parameter, tracking the target lesion, and determining its position information on the video frame images to be tracked in the video stream to be detected.
The tracking model is obtained by training on a set of training image sample pairs; the set includes multiple training image sample pairs with similarity value labels, and each training image sample pair is constructed from two video frame images randomly extracted from a video stream sample in which the target lesion appears.
It should be noted that the specific implementation of tracking the target lesion based on the tracking model is the same as the implementation of step 210 in the above embodiment; the only difference is that the tracking model needs to be obtained in advance through machine-learning training, i.e., the specific implementation of step 210 above is realized through the tracking model.
Specifically, in the embodiments of this application the tracking model adopts an algorithm based on a siamese network. For ease of understanding, the network structure of the siamese network is briefly described first; see FIG. 3, a schematic diagram of the network structure of the siamese network in an embodiment of this application. As shown in FIG. 3, the input of the siamese network is a data pair x1 and x2, for example the search area image and the template area image; each passes through the same network, for example a CNN, outputting convolutional features Gw(x1) and Gw(x2), and whether the two images are similar is judged by measuring some distance ‖Gw(x1) − Gw(x2)‖ between the two convolutional features.
In this way, based on the siamese network, whether the target lesion is tracked, and its position information, can be determined by comparing the similarity between the two inputs, namely the search area image and the template area image. For example, see FIG. 4, a schematic framework diagram of the principle of the tracking model in an embodiment of this application. As shown in FIG. 4, when a video frame image containing the target lesion is detected, the area image corresponding to the position information of the target lesion on that video frame image is taken as the template area image, denoted z; then, for example centered on the position information of the target lesion on that video frame image, a search area image is selected from the video frame image to be tracked and denoted x. The sizes of x and z need not be the same: for more accurate positioning, z is smaller and x is larger, for example z is 127*127*3 and x is 255*255*3, so the output similarity value is not a number but a two-dimensional matrix.
φ denotes a feature mapping operation; to improve computational efficiency, x and z are passed through φ, which maps the original images into the feature space of set dimensions and can be implemented with the convolutional and pooling layers of a CNN. In FIG. 4, 6*6*128 denotes the feature obtained after z passes through φ, i.e., a 128-channel feature of size 6*6; similarly, 22*22*128 is the feature of x after φ. The "*" in FIG. 4 denotes the convolution operation: the 22*22*128 feature is convolved by the 6*6*128 convolution kernel to obtain a 17*17 two-dimensional matrix, in which each value represents the similarity value between an image block of the search area image and the template area image. Here the CNN can be a fully convolutional AlexNet, and the similarity values are computed by cross-correlation, implemented by the two-dimensional convolution operation of the CNN. Then, if a value greater than the preset threshold exists, it is determined that the target lesion is tracked, and the position information of the image block corresponding to the largest value is determined as the position information of the target lesion on the current video frame image.
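Putting FIG. 3 and FIG. 4 together, a minimal runnable sketch of the siamese tracker: one shared embedding network φ applied to both inputs, followed by cross-correlation. The small convolutional stack below merely stands in for the fully convolutional AlexNet mentioned in the text, so its output map size differs from the 17*17 of the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseTracker(nn.Module):
    """Shared embedding phi for template z and search region x, then
    cross-correlation of the two feature maps into a similarity map."""
    def __init__(self):
        super().__init__()
        self.phi = nn.Sequential(  # one network, used for both inputs
            nn.Conv2d(3, 64, 11, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 128, 5, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(128, 128, 3), nn.ReLU(),
        )

    def forward(self, z, x):
        return F.conv2d(self.phi(x), self.phi(z))  # the "*" operation of FIG. 4

net = SiameseTracker()
z = torch.randn(1, 3, 127, 127)  # template area image
x = torch.randn(1, 3, 255, 255)  # search area image
print(net(z, x).shape)           # a 2-D similarity map, here (1, 1, 9, 9)
```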
Then, in the embodiments of this application, tracking the target lesion based on the tracking model with the position information of the target lesion on the video frame image as the input parameter, and determining its position information on the video frame images to be tracked in the video stream to be detected, specifically includes:
1) With the position information of the target lesion on the video frame image as the input parameter, map the template area image and the search area image into the feature space of set dimensions through two identical neural networks, to obtain the feature vectors corresponding to the search area image and the template area image.
The network structure of the tracking model includes at least two identical neural networks, each including at least a convolutional layer and a pooling layer; the template area image is the area image corresponding to the position information of the target lesion on the video frame image, and the search area image is a preset-range area image selected from the video frame image to be tracked, centered on the position information corresponding to the target lesion.
The neural network can be a CNN.
2) Perform a two-dimensional convolution operation on the feature vectors corresponding to the search area image and the template area image through a convolutional layer to obtain the first similarity value between the search area image and the template area image.
3) If the first similarity value is determined to satisfy the similarity condition, determine that the target lesion is tracked, and determine the position information of the target lesion on the search area image.
Further, it should be noted that if it is detected based on the detection model that a certain video frame image contains multiple target lesions, parallel processing of the tracking model can be triggered to track the multiple target lesions separately and determine their respective position information on the video frame images to be tracked. Moreover, while the tracking model is tracking a target lesion, the detection model continues to detect at its own detection interval; if tracking has not yet ended and the detection model detects a new target lesion, the tracking model is triggered again to track the new target lesion, without affecting the tracking already in progress, and the processes can run in parallel.
In this embodiment of this application, when a video frame image containing the target lesion is detected from the video stream to be detected, the position information of the target lesion on that video frame image is determined and the tracking process is triggered; the target lesion is then tracked according to that position information, and its position information on the video frame images to be tracked in the video stream to be detected is determined. In this way, a tracking model is introduced, and the target lesion is located through detection plus tracking: when the first video frame image in which the target lesion appears is detected, the tracking model is triggered and tracks the target lesion in real time until it disappears from view, at which point the tracking ends. Tracking is easier than detection and therefore faster. For example, experiments show that with the tracking model introduced, the target positioning method in this embodiment of this application runs at 60-70 fps, far above the detection model's 5-12 fps and also above the real-time frame rate of the video stream, 25 fps, guaranteeing the real-time performance of the product. At the same time, tracking exploits the temporal information of the video stream, improving robustness and reliability and reducing repeated and false alarms for the same lesion.
Based on the embodiments above, the training process of the tracking model in this embodiment of this application is briefly described below. Refer to Fig. 5, a flowchart of the tracking model training method in this embodiment of this application. The method includes:

Step 500: Obtain a set of training image sample pairs.

Specifically, performing step 500 includes:

S1. Obtain a set of video stream samples in which the target lesion appears, the set including multiple such video stream samples.

For example, for the colorectal polyp detection scenario, the video stream sample set can be determined by collecting a series of endoscopic colorectal videos and clipping the segments in which polyps appear.
S2. For each video frame image contained in a video stream sample, with the target lesion as the center, select from the video frame image a template region image of a first preset range and a search region image of a second preset range, the second preset range being larger than the first preset range.

For example, refer to Fig. 6, a schematic diagram of selecting the template region image and the search region image in this embodiment of this application. For each frame of the video stream sample, with the polyp as the center (the part enclosed by the small box in Fig. 6), a template region image z and a search region image x are selected, the search region image being larger than the template region image.

S3. Randomly draw two video frame images from the video frame images contained in the video stream sample, and select the template region image of one of the two drawn video frame images and the search region image of the other.

S4. Use the selected template region image and search region image as one training image sample pair, and generate a similarity value label for the training image sample pair, the similarity value label being a two-dimensional matrix.

For example, two different frames showing the same polyp, say frame a and frame b, are randomly drawn from the video stream sample; the template region image selected on one frame, say frame a, and the search region image selected on the other frame, frame b, form a data pair, and the true similarity value label of this data pair is generated. Since the search region image and the template region image differ in size, the similarity value label is a two-dimensional matrix: it takes the value 1 where the lesion coincides at the center, and the background value 0 elsewhere.

By repeating the above, enough data pairs with similarity value labels, i.e. multiple training image sample pairs, can be obtained, each training image sample pair consisting of one search region image and one template region image.
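A minimal sketch of this pair construction and its label, assuming the crop_region helper from earlier and a 17*17 output score map; the positive radius below is an illustrative assumption, not a value stated in this document:

    import random
    import numpy as np

    def make_training_pair(frames, lesion_centers, out_size=17, pos_radius=2):
        # Randomly draw two frames showing the same lesion; take the template
        # from one and the (larger) search region from the other, both
        # centered on the lesion, as in Fig. 6.
        a, b = random.sample(range(len(frames)), 2)
        z = crop_region(frames[a], lesion_centers[a], 127)
        x = crop_region(frames[b], lesion_centers[b], 255)
        # Label: 2D matrix, 1 where the lesion coincides at the center,
        # background value 0 everywhere else.
        label = np.zeros((out_size, out_size), dtype=np.float32)
        c = out_size // 2
        label[c - pos_radius:c + pos_radius + 1,
              c - pos_radius:c + pos_radius + 1] = 1.0
        return z, x, label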
Step 510: Input each training image sample pair into the tracking model for training until the loss function of the tracking model converges, obtaining the trained tracking model, where the loss function is the sum, over the training image sample pairs, of the cross-entropy between the determined similarity value and the similarity value label.

Specifically, in this embodiment of this application, the similarity network is trained on the set of training image sample pairs. For each training image sample pair, the similarity network outputs a two-dimensional matrix representing the similarity value; for example, with a template region image of size 127*127*3 and a search region image of size 255*255*3, the output similarity value is a 17*17 two-dimensional matrix. It is converted into the range 0-1 by an element-wise sigmoid function, and the binary cross-entropy is then computed against the true similarity value label. The sum of the cross-entropies over all training image sample pairs serves as the total loss function. Through iterative training, for example with stochastic gradient descent, until the loss function converges and is minimized, the trained tracking model is obtained.
In addition, in this embodiment of this application, to construct an effective loss function, the position points corresponding to the search region image are divided into positive and negative samples: points within a certain range of the target are positive samples, and points outside that range are negative samples; for example, in the final generated two-dimensional matrix, one part consists of positive samples and the other of negative samples. When computing the loss function of one training image sample pair, since the negative samples far outnumber the positive ones, the loss terms of the positive and negative samples can be averaged separately and then added, preventing the contribution of the positive samples from being swamped by the negative ones and further improving accuracy.
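A sketch of this balanced loss, assuming PyTorch; logits is the raw 17*17 score map produced by the network above, and label is the 0/1 matrix built during pair construction:

    import torch
    import torch.nn.functional as F

    def balanced_bce_loss(logits, label):
        # Element-wise sigmoid + binary cross-entropy, then average the
        # positive and negative terms separately before adding, so the many
        # background cells cannot swamp the few positive cells.
        loss = F.binary_cross_entropy_with_logits(logits, label, reduction="none")
        pos, neg = label > 0.5, label <= 0.5
        return loss[pos].mean() + loss[neg].mean()

    # One training step with stochastic gradient descent (setup elided):
    # loss = balanced_bce_loss(net(z, x).squeeze(1), labels)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()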
In this way, in this embodiment of this application, video stream samples showing the target lesion are obtained for the target lesion's application scenario, a set of training image sample pairs is derived from them, and the tracking model is obtained by training on this set. Tracking and positioning of the target lesion can then be performed based on the tracking model, which suits target lesion detection scenarios and improves speed and reliability.

It should be understood that the steps in the embodiments of this application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated herein, there is no strict ordering constraint on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the embodiments above, refer to Fig. 7, a schematic structural diagram of a target positioning system in an embodiment of this application.

The target positioning system includes at least a video capture device 70, a processing device 71 and an output device 72. In this embodiment of this application, the video capture device 70, the processing device 71 and the output device 72 are related medical instruments; they may be integrated in the same medical instrument, or be divided into multiple devices that connect and communicate with each other to form a medical system. For example, for colorectal polyp diagnosis, the video capture device 70 may be an endoscope, and the processing device 71 and the output device 72 may be computer equipment communicating with the endoscope.

Specifically, the video capture device 70 is configured to obtain a video stream to be detected.

The processing device 71 is configured to: when a video frame image containing a target lesion is detected from the video stream to be detected, determine position information of the target lesion on the video frame image; and track the target lesion according to that position information, determining position information of the target lesion on video frame images to be tracked in the video stream to be detected.

The output device 72 is configured to output the position information of the target lesion on the video frame as well as its position information on the video frames to be tracked.

In this embodiment of this application, the target lesion is detected and located; when the target lesion is detected, the tracking process can be triggered to track it and determine its position information, which can then be displayed for the user to view. In this way, combining detection and tracking improves speed and real-time performance compared with relying entirely on detection, and exploits the temporal information of the video stream, improving robustness.
Based on the same inventive concept, an embodiment of this application further provides a target positioning apparatus. The target positioning apparatus may, for example, be the server in the foregoing embodiments, and may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the embodiments above, refer to Fig. 8; the target positioning apparatus in this embodiment of this application specifically includes:

a detection module 80, configured to: when a video frame image containing a target lesion is detected from a video stream to be detected, determine position information of the target lesion on the video frame image; and

a tracking module 81, configured to track the target lesion according to its position information on the video frame image, determining position information of the target lesion on video frame images to be tracked in the video stream to be detected.

Optionally, when tracking the target lesion according to its position information on the video frame image and determining its position information on the video frame images to be tracked in the video stream to be detected, the tracking module 81 is specifically configured to:

use the region image corresponding to the position information of the target lesion on the video frame image as a template region image;

select, in turn, from the video frame images to be tracked in the video stream to be detected, a preset-range region image centered on the position information, as a search region image;

determine a first similarity value between the search region image and the template region image; and

if it is determined that the first similarity value satisfies a similarity condition, determine that the target lesion is tracked, and determine position information of the target lesion on the search region image.
Optionally, when determining the first similarity value between the search region image and the template region image, the tracking module 81 is specifically configured to:

based on a convolutional neural network, map the search region image and the template region image respectively into a feature space of a set dimension, to obtain feature vectors corresponding to the search region image and the template region image;

perform a two-dimensional convolution operation on the feature vectors corresponding to the search region image and the template region image, to determine a second similarity value between each image block in the search region image and the template region image, where each image block is obtained by sliding a sliding window over the search region image with a preset stride, the size of the sliding window being the same as that of the template region image; and

use the two-dimensional matrix formed by arranging the second similarity values according to the positions of their corresponding image blocks in the search region image as the first similarity value between the search region image and the template region image.

Optionally, when determining that the target lesion is tracked and determining its position information on the search region image if the first similarity value satisfies the similarity condition, the tracking module 81 is specifically configured to:

if it is determined that the two-dimensional matrix of the first similarity value contains a second similarity value not less than a preset threshold, determine that the target lesion is tracked; and

determine the position information of the image block corresponding to the largest of the second similarity values as the position information of the target lesion on the search region image.

Optionally, for the case where a video frame image containing the target lesion is detected from the video stream to be detected, the apparatus further includes:

an alarm module 82, configured to raise an alarm in a preset manner to indicate that the target lesion has appeared.
Optionally, when tracking the target lesion according to its position information on the video frame image and determining its position information on the video frame images to be tracked in the video stream to be detected, the tracking module 81 is specifically configured to:

trigger a trained tracking model and, based on the tracking model, track the target lesion with its position information on the video frame image as an input parameter, determining its position information on the video frame images to be tracked in the video stream to be detected, where the tracking model is obtained by training on a set of training image sample pairs, the set includes multiple training image sample pairs with similarity value labels, and each training image sample pair is constructed from two video frame images randomly drawn from a video stream sample in which the target lesion appears.

Optionally, when tracking the target lesion based on the tracking model with its position information on the video frame image as an input parameter and determining its position information on the video frame images to be tracked in the video stream to be detected, the tracking module 81 is specifically configured to:

with the position information of the target lesion on the video frame image as an input parameter, map the template region image and the search region image into the feature space of the set dimension through two identical neural networks respectively, to obtain the feature vectors corresponding to the search region image and the template region image, where the network structure of the tracking model includes at least two identical neural networks, each including at least a convolutional layer and a pooling layer, the template region image is the region image corresponding to the position information of the target lesion on the video frame image, and the search region image is a preset-range region image selected from the video frame image to be tracked and centered on the position information of the target lesion;

perform a two-dimensional convolution operation on the feature vectors corresponding to the search region image and the template region image through a convolutional layer, to obtain the first similarity value between the search region image and the template region image; and

if it is determined that the first similarity value satisfies the similarity condition, determine that the target lesion is tracked, and determine the position information of the target lesion on the search region image.
Optionally, the apparatus further includes a training module 83, configured to:

obtain a set of training image sample pairs; and

input each training image sample pair into the tracking model for training until the loss function of the tracking model converges, obtaining the trained tracking model, where the loss function is the sum, over the training image sample pairs, of the cross-entropy between the determined similarity value and the similarity value label.

Optionally, when obtaining the set of training image sample pairs, the training module 83 is specifically configured to:

obtain a set of video stream samples in which the target lesion appears, the set including multiple such video stream samples;

for each video frame image contained in a video stream sample, with the target lesion as the center, select from the video frame image a template region image of a first preset range and a search region image of a second preset range, the second preset range being larger than the first preset range;

randomly draw two video frame images from the video frame images contained in the video stream sample, and select the template region image of one of the two drawn video frame images and the search region image of the other; and

use the selected template region image and search region image as one training image sample pair, and generate a similarity value label for it, the similarity value label being a two-dimensional matrix.
Based on the embodiments above, an embodiment of this application further provides an electronic device of another exemplary implementation. In some possible implementations, the electronic device in this embodiment of this application may include a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, can implement the steps of the target positioning method in the embodiments above.

For example, taking the electronic device as the server 100 in Fig. 1 of this application for illustration, the processor in the electronic device is the processor 110 in the server 100, and the memory in the electronic device is the memory 120 in the server 100.

Based on the embodiments above, an embodiment of this application provides one or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the target positioning method in any of the method embodiments above.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer-readable-instruction product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer-readable-instruction product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

This application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer-readable-instruction product according to the embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer-readable instructions. These computer-readable instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data-processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data-processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer-readable instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data-processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer-readable instructions may also be loaded onto a computer or another programmable data-processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of this application.

Obviously, those skilled in the art can make various changes and variations to the embodiments of this application without departing from the spirit and scope of the embodiments of this application. If these modifications and variations of the embodiments of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.

Claims (15)

  1. A target positioning method, comprising:
    when a video frame image containing a target lesion is detected from a video stream to be detected, determining position information of the target lesion on the video frame image; and
    tracking the target lesion according to the position information of the target lesion on the video frame image, and determining position information of the target lesion on video frame images to be tracked in the video stream to be detected.
  2. The method according to claim 1, wherein tracking the target lesion according to the position information of the target lesion on the video frame image, and determining the position information of the target lesion on the video frame images to be tracked in the video stream to be detected, specifically comprises:
    using a region image corresponding to the position information of the target lesion on the video frame image as a template region image;
    selecting, in turn, from the video frame images to be tracked in the video stream to be detected, a preset-range region image centered on the position information, as a search region image;
    determining a first similarity value between the search region image and the template region image; and
    if it is determined that the first similarity value satisfies a similarity condition, determining that the target lesion is tracked, and determining position information of the target lesion on the search region image.
  3. The method according to claim 2, wherein determining the first similarity value between the search region image and the template region image specifically comprises:
    based on a convolutional neural network, mapping the search region image and the template region image respectively into a feature space of a set dimension, to obtain feature vectors corresponding to the search region image and the template region image;
    performing a two-dimensional convolution operation on the feature vectors corresponding to the search region image and the template region image, to determine a second similarity value between each image block in the search region image and the template region image, wherein each image block is obtained by sliding a sliding window over the search region image with a preset stride, and the size of the sliding window is the same as that of the template region image; and
    using a two-dimensional matrix formed by arranging the second similarity values according to the positions of their corresponding image blocks in the search region image as the first similarity value between the search region image and the template region image.
  4. The method according to claim 3, wherein if it is determined that the first similarity value satisfies the similarity condition, determining that the target lesion is tracked and determining the position information of the target lesion on the search region image specifically comprises:
    if it is determined that the two-dimensional matrix of the first similarity value contains a second similarity value not less than a preset threshold, determining that the target lesion is tracked; and
    determining the position information of the image block corresponding to the largest of the second similarity values as the position information of the target lesion on the search region image.
  5. The method according to claim 1, wherein when the video frame image containing the target lesion is detected from the video stream to be detected, the method further comprises:
    raising an alarm in a preset manner to indicate that the target lesion has appeared.
  6. The method according to any one of claims 1-5, wherein tracking the target lesion according to the position information of the target lesion on the video frame image, and determining the position information of the target lesion on the video frame images to be tracked in the video stream to be detected, specifically comprises:
    triggering a trained tracking model and, based on the tracking model, tracking the target lesion with the position information of the target lesion on the video frame image as an input parameter, and determining the position information of the target lesion on the video frame images to be tracked in the video stream to be detected, wherein the tracking model is obtained by training on a set of training image sample pairs, the set comprises multiple training image sample pairs with similarity value labels, and each training image sample pair is constructed from two video frame images randomly drawn from a video stream sample in which the target lesion appears.
  7. The method according to claim 6, wherein tracking the target lesion based on the tracking model with the position information of the target lesion on the video frame image as an input parameter, and determining the position information of the target lesion on the video frame images to be tracked in the video stream to be detected, specifically comprises:
    with the position information of the target lesion on the video frame image as an input parameter, mapping a template region image and a search region image into a feature space of a set dimension through two identical neural networks respectively, to obtain feature vectors corresponding to the search region image and the template region image, wherein the network structure of the tracking model comprises at least two identical neural networks, each neural network comprises at least a convolutional layer and a pooling layer, the template region image is a region image corresponding to the position information of the target lesion on the video frame image, and the search region image is a preset-range region image selected from the video frame image to be tracked and centered on the position information;
    performing a two-dimensional convolution operation on the feature vectors corresponding to the search region image and the template region image through a convolutional layer, to obtain a first similarity value between the search region image and the template region image; and
    if it is determined that the first similarity value satisfies a similarity condition, determining that the target lesion is tracked, and determining position information of the target lesion on the search region image.
  8. The method according to claim 6, further comprising:
    obtaining the set of training image sample pairs; and
    inputting each training image sample pair into the tracking model for training until a loss function of the tracking model converges, to obtain the trained tracking model, wherein the loss function is the sum, over the training image sample pairs, of the cross-entropy between the determined similarity value and the similarity value label.
  9. The method according to claim 8, wherein obtaining the set of training image sample pairs specifically comprises:
    obtaining a set of video stream samples in which the target lesion appears, wherein the set of video stream samples comprises multiple video stream samples in which the target lesion appears;
    for each video frame image contained in a video stream sample, with the target lesion as the center, selecting from the video frame image a template region image of a first preset range and a search region image of a second preset range, wherein the second preset range is larger than the first preset range;
    randomly drawing two video frame images from the video frame images contained in the video stream sample, and selecting the template region image of one of the two drawn video frame images and the search region image of the other; and
    using the selected template region image and search region image as one training image sample pair, and generating a similarity value label for the training image sample pair, wherein the similarity value label is a two-dimensional matrix.
  10. A target positioning apparatus, comprising:
    a detection module, configured to: when a video frame image containing a target lesion is detected from a video stream to be detected, determine position information of the target lesion on the video frame image; and
    a tracking module, configured to track the target lesion according to the position information of the target lesion on the video frame image, and determine position information of the target lesion on video frame images to be tracked in the video stream to be detected.
  11. The apparatus according to claim 10, wherein when tracking the target lesion according to the position information of the target lesion on the video frame image and determining the position information of the target lesion on the video frame images to be tracked in the video stream to be detected, the tracking module is specifically configured to:
    use a region image corresponding to the position information of the target lesion on the video frame image as a template region image;
    select, in turn, from the video frame images to be tracked in the video stream to be detected, a preset-range region image centered on the position information, as a search region image;
    determine a first similarity value between the search region image and the template region image; and
    if it is determined that the first similarity value satisfies a similarity condition, determine that the target lesion is tracked, and determine position information of the target lesion on the search region image.
  12. The apparatus according to claim 11, wherein when determining the first similarity value between the search region image and the template region image, the tracking module is specifically configured to:
    based on a convolutional neural network, map the search region image and the template region image respectively into a feature space of a set dimension, to obtain feature vectors corresponding to the search region image and the template region image;
    perform a two-dimensional convolution operation on the feature vectors corresponding to the search region image and the template region image, to determine a second similarity value between each image block in the search region image and the template region image, wherein each image block is obtained by sliding a sliding window over the search region image with a preset stride, and the size of the sliding window is the same as that of the template region image; and
    use a two-dimensional matrix formed by arranging the second similarity values according to the positions of their corresponding image blocks in the search region image as the first similarity value between the search region image and the template region image.
  13. A target positioning system, comprising at least a video capture device, a processing device and an output device, wherein:
    the video capture device is configured to obtain a video stream to be detected;
    the processing device is configured to: when a video frame image containing a target lesion is detected from the video stream to be detected, determine position information of the target lesion on the video frame image; and track the target lesion according to the position information of the target lesion on the video frame image, determining position information of the target lesion on video frame images to be tracked in the video stream to be detected; and
    the output device is configured to output the position information of the target lesion on the video frame and the position information of the target lesion on the video frames to be tracked.
  14. An electronic device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the steps of the method according to any one of claims 1-9.
  15. One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the processors to perform the steps of the method according to any one of claims 1-9.
PCT/CN2020/124623 2020-02-08 2020-10-29 Target positioning method, apparatus and system WO2021155679A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/677,990 US20220180520A1 (en) 2020-02-08 2022-02-22 Target positioning method, apparatus and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010083134.XA 2020-02-08 2020-02-08 Target positioning method, apparatus and system
CN202010083134.X 2020-02-08

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/677,990 Continuation US20220180520A1 (en) 2020-02-08 2022-02-22 Target positioning method, apparatus and system

Publications (1)

Publication Number Publication Date
WO2021155679A1 true WO2021155679A1 (zh) 2021-08-12

Family

ID=71160021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124623 WO2021155679A1 (zh) 2020-02-08 2020-10-29 一种目标定位方法、装置及系统

Country Status (3)

Country Link
US (1) US20220180520A1 (zh)
CN (1) CN111311635A (zh)
WO (1) WO2021155679A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111311635A (zh) Target positioning method, apparatus and system
CN111915573A (zh) Lesion tracking method under digestive endoscopy based on temporal feature learning
CN112766066A (zh) Method and system for processing and displaying dynamic video streams and static images
CN112907628A (zh) Video target tracking method and apparatus, storage medium and electronic device
KR102536369B1 (ko) Artificial-intelligence-based gastroscopy image diagnosis assisting system and method
KR102531400B1 (ko) Artificial-intelligence-based colonoscopy image diagnosis assisting system and method
CN113344854A (zh) Lesion detection method, apparatus, device and medium based on breast ultrasound video
CN113344855A (zh) Method, apparatus, device and medium for reducing the false positive rate of breast ultrasound lesion detection
CN113689469A (zh) Method for automatically identifying small hepatocellular carcinoma lesions in contrast-enhanced ultrasound and ultrasound system
CN114091507B (zh) Ultrasound lesion region detection method and apparatus, electronic device and storage medium
CN114627395B (zh) Multi-rotor UAV angle analysis method, system and terminal based on nested targets
CN117523379B (zh) AI-based underwater photography target positioning method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648327A (zh) Artificial-intelligence-based automatic tracking method and device for ultrasound image video
CN110738211A (zh) Object detection method, related apparatus and device
CN111311635A (zh) Target positioning method, apparatus and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679455A (zh) Target tracking apparatus and method, and computer-readable storage medium
CN109886243B (zh) Image processing method and apparatus, storage medium, device and system
CN109993774B (zh) Online video target tracking method based on deep cross similarity matching
CN110276780A (zh) Multi-target tracking method and apparatus, electronic device and storage medium
CN110766730B (zh) Image registration and follow-up evaluation method, storage medium and computer device


Also Published As

Publication number Publication date
CN111311635A (zh) 2020-06-19
US20220180520A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
WO2021155679A1 (zh) 一种目标定位方法、装置及系统
US11747898B2 (en) Method and apparatus with gaze estimation
WO2021227726A1 (zh) 面部检测、图像检测神经网络训练方法、装置和设备
JP2022526750A (ja) オブジェクト追跡方法、オブジェクト追跡装置、コンピュータプログラム、及び電子機器
WO2020107847A1 (zh) 基于骨骼点的跌倒检测方法及其跌倒检测装置
US10509957B2 (en) System and method for human pose estimation in unconstrained video
Irfanullah et al. Real time violence detection in surveillance videos using Convolutional Neural Networks
WO2021238548A1 (zh) 区域识别方法、装置、设备及可读存储介质
WO2021218238A1 (zh) 图像处理方法和图像处理装置
CN111240476A (zh) 基于增强现实的交互方法、装置、存储介质和计算机设备
Nie et al. Cross-view action recognition by cross-domain learning
Harrou et al. Malicious attacks detection in crowded areas using deep learning-based approach
WO2023083030A1 (zh) 一种姿态识别方法及其相关设备
Zhou et al. A study on attention-based LSTM for abnormal behavior recognition with variable pooling
CN115482523A (zh) 轻量级多尺度注意力机制的小物体目标检测方法及系统
Kumar et al. Early estimation model for 3D-discrete indian sign language recognition using graph matching
Sharma et al. Study on HGR by Using Machine Learning
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Verma et al. Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach
JP2013016170A (ja) 人体動作の認識の方法、装置、及びプログラム
Jangade et al. Study on Deep Learning Models for Human Pose Estimation and its Real Time Application
Guo et al. Hand gesture recognition and interaction with 3D stereo camera
Zhou et al. Multiple perspective object tracking via context-aware correlation filter
Faujdar et al. Human Pose Estimation using Artificial Intelligence with Virtual Gym Tracker
Mesbahi et al. Hand Gesture Recognition Based on Various Deep Learning YOLO Models

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20917713; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20917713; Country of ref document: EP; Kind code of ref document: A1)