WO2023246641A1 - Method, apparatus, and storage medium for identifying an object - Google Patents
Method, apparatus, and storage medium for identifying an object
- Publication number
- WO2023246641A1 (international application PCT/CN2023/100703)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target object
- visual data
- information
- semantic
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/235—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on user input or interaction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/945—User interactive design; Environments; Toolboxes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Definitions
- the present application relates to the field of computer vision, and in particular to a method, device and storage medium for identifying objects.
- Visual data can be data such as images or videos. By processing the visual data, the objects in the visual data are obtained. Different applications can then be performed on the object, such as positioning based on the object, classifying the object, or segmenting the object.
- the object recognition model corresponds to at least one object category.
- the object recognition model is used to identify objects belonging to the at least one object category from the visual data. For example, assume that the object categories corresponding to the object recognition model include apples, peaches, and bananas, and the visual data to be processed are pictures. If the picture includes apples, peaches, and bananas, the objects recognized from the picture by the object recognition model include apples, peaches, and bananas.
- If the visual data includes objects of the object categories corresponding to the object recognition model, all objects belonging to those categories are identified from the visual data, and the flexibility of object recognition is poor.
- This application provides a method, device and storage medium for identifying objects to improve the flexibility of identifying objects.
- the technical solutions are as follows:
- the present application provides a method for identifying an object.
- visual data to be processed and indication information of at least one target object to be identified are obtained.
- Semantic information is obtained based on the indication information of the at least one target object, and the semantic information is used to describe the semantics of the at least one target object. Based on the object recognition model and this semantic information, the target object in the visual data is identified.
- In this way, the semantic information is obtained based on the indication information of the at least one target object, and the semantic information is used to describe the semantics of the at least one target object.
- The device can understand the semantics of the at least one target object through the semantic information, so that it can identify the target object in the visual data based on the object recognition model and the semantic information.
- At least one target object to be recognized is an object that needs to be recognized, so that objects in the visual data can be recognized on demand and the flexibility of identifying objects can be improved.
- The indication information of the at least one target object includes text description information of the at least one target object. Based on the correspondence between text description information and semantic features, the semantic features corresponding to the text description information of each target object are obtained respectively, and the semantic information includes the semantic features corresponding to the text description information of each target object.
- the text description information of each target object is converted through the corresponding relationship, and the semantic features corresponding to the text description information of each target object are obtained.
- The implementation through the correspondence relationship is simple, and the text description information of each target object can be quickly converted into the corresponding semantic features.
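As a rough illustration of the two forms this correspondence can take, the following Python sketch shows both a correspondence table and a text description information conversion model (a text encoder). All names, dimensions, and the random placeholder encoder are assumptions for illustration; the patent does not disclose a concrete implementation.

```python
# Sketch: converting text description information into semantic features.
# Names (lookup_table, TextEncoder, 512-dim vectors) are illustrative
# assumptions, not part of the patent's disclosure.
from typing import Dict, List
import numpy as np

# Form 1: a correspondence table mapping text description information
# to a pre-computed semantic feature vector.
lookup_table: Dict[str, np.ndarray] = {
    "car": np.random.rand(512),
    "building": np.random.rand(512),
    "road": np.random.rand(512),
}

def semantic_features_from_table(descriptions: List[str]) -> List[np.ndarray]:
    # Query the semantic feature corresponding to each text description.
    return [lookup_table[d] for d in descriptions]

# Form 2: a text description information conversion model (e.g. a trained
# text encoder) that maps arbitrary description text to a feature vector.
class TextEncoder:
    def encode(self, description: str) -> np.ndarray:
        # Placeholder for a trained encoder; returns a fixed-size vector.
        rng = np.random.default_rng(abs(hash(description)) % (2 ** 32))
        return rng.random(512)

def semantic_features_from_model(encoder: TextEncoder,
                                 descriptions: List[str]) -> List[np.ndarray]:
    return [encoder.encode(d) for d in descriptions]
```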
- At least one visual feature vector is obtained based on the object recognition model and the visual data, and the at least one visual feature vector is used to indicate the encoding semantics of the visual data. Based on the at least one visual feature vector and the semantic information, a target object in the visual data is identified.
- In this way, the device can understand the coding semantics of the visual data through the at least one visual feature vector, and understand the semantics of the at least one target object through the semantic information, so that the target object that needs to be identified can be accurately identified in the visual data.
- The indication information of the at least one target object includes indication information of a first object, and the indication information of the first object is used to indicate the first object and at least one component of the first object. The first object in the visual data is identified based on the object recognition model and the semantic information, and the at least one component is then identified from the first object based on the object recognition model and the semantic information.
- In this way, the object can be identified first, and then the components of the object can be identified from the object to achieve hierarchical recognition. Since the components are identified from the object rather than from the entire visual data, the amount of data that needs to be processed can be reduced and the efficiency of identifying components can be improved.
- the indication information of at least one target object also includes position information used to indicate the position range of the target object in the visual data.
- the position characteristics of the target object are obtained based on the position information, and the position characteristics are used to indicate the spatial orientation of the target object.
- Based on the semantic information and the location features, the target object in the visual data is identified.
- the location information may be the location where the user clicked in the visual data.
- the outline of the target object located at the user's click location is identified from the visual data, realizing on-demand recognition and improving the flexibility of recognition.
- the visual data includes images or videos
- the at least one visual feature vector includes a visual feature vector of each pixel point in the visual data.
- the scores between the first pixel and each target object to be identified are obtained respectively.
- The visual data includes the first pixel, and the score between the first pixel and a target object to be identified is used to reflect the probability that the first pixel belongs to that target object. From each target object to be recognized, a target object whose score with the first pixel satisfies the specified condition is selected, and the first pixel is taken as a pixel of the selected target object.
- In this way, the score between the first pixel and each target object to be recognized is obtained respectively, so that the object to which the first pixel belongs can be accurately identified through the scores, which can improve the accuracy of identifying objects.
- the object recognition model is obtained by performing model training based on at least one training sample and semantic information corresponding to the indication information of at least one object to be annotated.
- The training sample includes at least one object indicated by the indication information, and part or all of the at least one object is labeled.
- The at least one object to be labeled is an object that needs to be labeled, so that objects can be labeled as needed, and some or all of the objects in the at least one object are labeled, which can improve the flexibility of labeling objects.
- The marked objects include a second object whose image clarity exceeds a clarity threshold, and the components of the second object are also marked. That is, when the image clarity of the second object exceeds the clarity threshold, the components of the second object are marked.
- The correspondence between text description information and semantic features can convert more text description information than the text description information of the at least one object to be annotated.
- For example, the text description information that can be converted by the correspondence between text description information and semantic features includes first text description information, while the text description information of the at least one object to be annotated does not include the first text description information. That is to say, through the correspondence relationship and the object recognition model, the object indicated by the first text description information can be identified, so objects beyond the object categories corresponding to the object recognition model can be recognized.
- the visual data includes annotated objects
- the object recognition accuracy of the object recognition model is obtained based on the annotated objects and the target object. In this way, the object recognition model's ability to recognize objects can be evaluated based on the accuracy.
- this application provides a device for identifying an object, for performing the method in the first aspect or any possible implementation of the first aspect.
- the apparatus includes a unit for performing the method in the first aspect or any possible implementation of the first aspect.
- this application provides a device for identifying objects, including a processor and a memory; the processor is configured to execute instructions stored in the memory, so that the device executes the method of the first aspect or any possible implementation of the first aspect.
- the present application provides a computer program product containing instructions that, when executed by a device, cause the device to execute the method of the above-mentioned first aspect or any possible implementation of the first aspect.
- the present application provides a computer-readable storage medium for storing a computer program.
- When the computer program is executed by a device, the device executes the method of the above-mentioned first aspect or any possible implementation of the first aspect.
- this application provides a chip, including a memory and a processor.
- the memory is used to store computer instructions
- the processor is used to call and run the computer instructions from the memory to execute the method of the above first aspect or any possible implementation of the first aspect.
- Figure 1 is a schematic diagram of a network architecture provided by an embodiment of the present application.
- Figure 2 is a schematic diagram of visual data provided by an embodiment of the present application.
- Figure 3 is a schematic diagram of a training sample provided by an embodiment of the present application.
- Figure 4 is a schematic diagram of another training sample provided by an embodiment of the present application.
- Figure 5 is a schematic diagram of a knowledge base provided by an embodiment of the present application.
- Figure 6 is a flow chart of a model training method provided by an embodiment of the present application.
- Figure 7 is a flow chart of a method for identifying objects provided by an embodiment of the present application.
- Figure 8 is a schematic diagram of a user clicking an object provided by an embodiment of the present application.
- Figure 9 is a schematic structural diagram of a device for identifying objects provided by an embodiment of the present application.
- Figure 10 is a schematic structural diagram of a device for identifying objects provided by an embodiment of the present application.
- Figure 11 is a schematic structural diagram of a cluster of devices for identifying objects provided by an embodiment of the present application.
- Figure 12 is a schematic structural diagram of another cluster of devices for identifying objects provided by an embodiment of the present application.
- Visual recognition technology refers to technology that uses computers to predict and analyze various kinds of important information in visual data, and it is a core research topic in the field of computer vision. In recent years, thanks to the rapid development of theories and technologies such as deep learning and convolutional neural networks, visual recognition technology has been widely used in all aspects of production and life; emerging industries such as smart cities, smart healthcare, and autonomous driving cannot do without visual recognition technology.
- Visual recognition technology uses object recognition models, which are intelligent models used to identify objects. In this way, the visual recognition technology can identify the target object from the visual data to be processed based on the object recognition model. This target object can be used to achieve different tasks.
- the visual data includes data such as images and/or videos.
- the target object is used to implement many tasks such as object classification, object positioning, and/or object segmentation.
- the purpose of the object segmentation task is to identify several target objects (also called target areas) with specific characteristics from the image, and to segment the identified target objects from the image. That is, the object segmentation task is used to divide the image into several target regions with specific characteristics.
- Depending on the target object, the task can be subdivided into different object segmentation tasks, for example, semantic segmentation, instance segmentation and/or part segmentation.
- Semantic segmentation refers to classifying each pixel in the image into the corresponding semantic concept; instance segmentation refers to dividing out the areas corresponding to certain specific instances in the image; part segmentation refers to further dividing certain instances into the areas corresponding to their different parts.
- The target area corresponding to semantic segmentation is a semantic concept; the semantic concept may be an object category (such as roads, cars, buildings, etc.). The target area corresponding to instance segmentation is an object (such as a specific car or a specific building), and the target area corresponding to part segmentation is an object component (such as the door, body, or wheels of a car).
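Purely as an illustration of the three granularities described above, the labels below are invented examples (not taken from the patent) showing how the same pixel region could be labeled under semantic, instance, and part segmentation:

```python
# Invented labels for one pixel region of a street scene, illustrating
# the three segmentation granularities described above.
pixel_labels = {
    "semantic_segmentation": "car",            # semantic concept / object category
    "instance_segmentation": "car_3",          # a specific car instance
    "part_segmentation": ("car_3", "wheel"),   # a component of that instance
}
```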
- Visual recognition technology includes components such as annotation process, recognition process and evaluation process.
- The annotation process is used to annotate objects in visual data to obtain training samples. After the training samples are obtained, the labeled training samples are used for model training to obtain the object recognition model.
- the labeling process provides a data basis for visual recognition technology.
- the recognition process is used to identify the target object from the visual data to be processed based on the object recognition model.
- the recognition process refers to the specific execution or running process.
- the evaluation process is used to give scores and feedback on the recognition results, and the evaluation process is used to obtain the accuracy of the recognition object.
- an embodiment of the present application provides a network architecture 100.
- the network architecture 100 includes a first device 101 and a second device 102.
- the first device 101 communicates with the second device 102.
- The network architecture 100 includes one or more second devices 102, and the first device 101 communicates with each second device 102.
- the first device 101 can be used to perform the annotation process.
- The first device 101 is used to assist the annotator in annotating objects in the visual data to obtain training samples, and to use the training samples to perform model training to obtain the object recognition model.
- Since the first device 101 communicates with each second device 102, the first device 101 can deploy the object recognition model on each second device 102.
- The second device 102 is used to perform the recognition process and/or the evaluation process based on the object recognition model; that is, the second device 102 is used to obtain the visual data to be processed, identify the target object from the visual data to be processed based on the object recognition model, and/or provide scoring and feedback on the identified target object.
- the first device 101 is a computer
- each of the second devices 102 is a camera.
- the cameras can be deployed on roads and other places.
- the object recognition model is deployed on the camera.
- the camera captures the visual data to be processed, and the target object in the visual data to be processed is identified based on the object recognition model.
- the first device 101 displays the visual data that needs to be annotated, and the annotator annotates the objects in the visual data to obtain training samples.
- The first device 101 acquires at least one piece of visual data; for any piece of visual data, the visual data includes at least one object, and the first device 101 displays the visual data.
- the annotator annotates the at least one object in the visual data, and the first device 101 uses the annotated visual data as a training sample.
- Annotating an object in visual data means filling it with one or more colors.
- each object in the visual data is often labeled, and the workload of labeling is huge.
- See Figure 2, which shows cars, buildings, and roads in the picture.
- the annotator labels each object in the picture, that is, labels the cars, buildings, and roads in the picture.
- the first device 101 uses the picture as a training sample, and each object in the training sample is labeled.
- the first device 101 performs model training based on the training sample to obtain an object recognition model.
- The object recognition model corresponds to the text description information of each labeled object, and the object recognition model can recognize many objects.
- When the second device 102 uses the object recognition model to identify objects in the visual data to be processed, it often recognizes all such objects in the visual data to be processed.
- the recognized objects may include objects that the user does not need to recognize, which not only results in low flexibility in identifying objects, but also wastes a large amount of computing resources.
- embodiments of the present application adopt on-demand object labeling; and/or, in order to improve the flexibility of identifying objects and avoid waste of computing resources, embodiments of the present application adopt on-demand object identification.
- the embodiment of the present application defines indication information of at least one object that needs to be marked.
- the first device 101 obtains at least one visual data.
- the visual data includes at least one object indicated by the indication information.
- the at least one object is the object to be labeled.
- the annotator annotates part or all of the at least one object, and the first device 101 uses the annotated visual data as a training sample. That is, the training sample includes at least one object indicated by the instruction information, and part or all of the objects in the at least one object are labeled, so that on-demand labeling of objects is achieved.
- the indication information includes textual description information for each of the at least one object.
- the object's text description information is used to describe the object.
- The text description information of the object includes the object category of the object, etc. For example, assuming that the object is a car, the text description information of the object is "car".
- the annotator also inputs the text description information of each annotated object to the first device 101, and the first device 101 associates the training sample with the text description information of each annotated object.
- The first device 101 marks the text description information corresponding to the annotated object on the annotated object in the training sample, so the training sample includes the text description information corresponding to the marked object.
- For example, the annotator fills the building in the visual data shown in Figure 2 with black in order to label the building in the visual data.
- the first device 101 uses the annotated visual data as training samples.
- The annotator may continue to annotate the components included in the object. That is, if the image clarity of an object in the training sample exceeds the clarity threshold and the object is labeled, the components of the object are also labeled.
- each component of the object can be labeled, or some of the components in the object can be labeled.
- Otherwise, the components included in the object are not labeled, but the object itself can still be labeled, thus avoiding labeling errors when labeling the components of the object.
- the annotator also inputs text description information of the annotated component in the object to the first device 101 .
- the first device 101 marks the text description information of the annotated component in the object in the training sample, so the training sample includes the text description information of the annotated component in the object.
- the text description information of the component is used to describe the component.
- the text description information of the component may include the name of the component, etc.
- the components of a car include doors, wheels, and body.
- Assume that the wheels of the car need to be annotated. See Figure 4, where the annotator fills the wheels with black.
- the component includes at least one sub-component, and at least one sub-component included in the component can be further labeled.
- the wheels include tires and hubs, and the tires and hubs of the wheels can also be marked.
- After obtaining at least one training sample, the first device 101 also establishes a knowledge base based on the at least one training sample.
- the knowledge base may be a graph including multiple nodes.
- A node represents an annotated object.
- Each child node of the node represents a different component of the object.
- The node stores the text description information of the object represented by the node, and each child node stores the text description information of the component represented by that child node.
- the node represents a component of an object, and each child node of the node represents a different subcomponent of the component.
- In that case, the node stores the text description information of the component represented by the node, and the child nodes store the text description information of the subcomponents they represent.
- The operation of establishing the knowledge base is: for any training sample, obtain the text description information of the annotated object from the training sample, and establish a node in the knowledge base for saving the text description information; obtain the text description information of the components of the marked object from the training sample, and establish child nodes of that node in the knowledge base, where each child node is used to save the text description information of a component.
- the various components included in the objects of any object category can be clearly obtained through the knowledge base.
- the nodes corresponding to each object have the same parent node, and the parent node is a virtual node and is the root node of the knowledge base.
- a knowledge base as shown in Figure 5 is established based on the training sample.
- the knowledge base includes node 1 corresponding to the car, node 2 corresponding to the road, and node 3 corresponding to the building.
- the parent nodes of node 2 and node 3 are both virtual nodes "Root".
- the child nodes of node 1 include the child node 11 corresponding to the door and the child node 12 corresponding to the wheel. The meanings of the child nodes of other nodes will not be listed one by one.
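A minimal sketch of such a knowledge base, assuming a simple in-memory tree whose class and field names are illustrative rather than taken from the patent, might look like this:

```python
# Sketch of the knowledge base as a tree of nodes; each node saves the
# text description information of the object, component, or subcomponent
# it represents, and the root is a virtual node.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    description: str                 # text description information stored at the node
    children: List["Node"] = field(default_factory=list)

    def add_child(self, description: str) -> "Node":
        child = Node(description)
        self.children.append(child)
        return child

root = Node("Root")                  # virtual root node
car = root.add_child("car")          # node 1
road = root.add_child("road")        # node 2
building = root.add_child("building")  # node 3
car.add_child("car door")            # child node 11
wheel = car.add_child("wheel")       # child node 12
wheel.add_child("tire")              # subcomponent of the wheel
wheel.add_child("hub")               # subcomponent of the wheel
```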
- this embodiment of the present application provides a method 600 for model training.
- the method 600 is applied to the network architecture 100 shown in Figure 1.
- the method 600 is executed by the first device 101 in the network architecture 100.
- the model training method 600 includes the following process of steps 601-605.
- Step 601 The first device obtains at least one training sample and indication information of at least one object to be labeled.
- the training sample includes at least one object indicated by the indication information, and part or all of the at least one object is labeled.
- At least one object to be annotated is an object that needs to be annotated.
- the first device may store indication information of the object that needs to be annotated. The indication information is used to guide the annotator to annotate objects in the visual data to obtain training samples.
- the indication information includes text description information of the at least one object. That is, the indication information includes text description information of each object in the at least one object, and the text description information of the object is used to describe the object.
- the text description information of the object includes an object category of the object through which the object can be described.
- The first device acquires at least one piece of visual data; for any piece of visual data, the visual data includes at least one object to be annotated indicated by the indication information, and the first device displays the visual data.
- the annotator annotates all or part of the at least one object in the displayed visual data.
- the first device then uses the visual data as a training sample.
- The first device may display each visual data in the at least one visual data one by one; the annotator annotates some or all of the at least one object to be annotated in each visual data, and the first device uses each annotated visual data as a training sample.
- Step 602 The first device acquires first semantic information based on the indication information of at least one object to be annotated.
- The first semantic information is used to describe the semantics of the at least one object to be annotated.
- the indication information includes text description information of each object
- the first semantic information includes semantic features corresponding to the text description information of each object.
- the semantic features corresponding to the text description information of the object are used to describe the semantics of the object.
- the semantic feature of the object is a feature vector.
- In step 602, based on the correspondence between text description information and semantic features, the semantic features corresponding to the text description information of each object are obtained respectively.
- the correspondence between text description information and semantic features may be a correspondence table.
- Each record in the correspondence table includes the text description information of an object and the semantic features corresponding to that text description information.
- the semantic features corresponding to the text description information of each object are queried from the correspondence table.
- the correspondence between text description information and semantic features may be a text description information conversion model.
- the text description information conversion model is used to obtain semantic features corresponding to the text description information based on the text description information to be converted.
- the text description information conversion model is a text encoder, etc.
- The text description information of each object is input into the text description information conversion model, so that the text description information conversion model converts the text description information of each object, and the semantic features corresponding to the text description information of each object output by the text description information conversion model are obtained.
- The text description information conversion model is obtained by training an intelligent model: technicians create multiple first samples, each first sample includes the text description information of an object and the semantic features corresponding to that text description information, and the intelligent model is trained with the plurality of first samples to obtain the text description information conversion model.
- the at least one object includes a first object
- The text description information of the first object is used to indicate the first object and at least one component of the first object. Therefore, the semantic features corresponding to the text description information of the first object include the semantic features of the first object and the semantic features of each of the at least one component.
- the first device performs model training based on at least one training sample and the first semantic information. See the following steps 603-605 for the detailed implementation process.
- Step 603 The first device identifies the object in each training sample based on at least one training sample, the first semantic information and the object recognition model to be trained.
- the object recognition model to be trained has a visual feature extraction function, including a convolutional neural network, a vision transformer model (ViT), or any network with a visual feature extraction function, etc.
- the network with visual feature extraction function includes a deep residual network (deep residual network, ResNet) and other networks.
- Step 603 can be implemented through the following operations 6031-6032.
- the first device obtains at least one visual feature vector based on the object recognition model to be trained and the training sample.
- The at least one visual feature vector is used to indicate the coding semantics of the training sample.
- the training sample includes pictures and/or videos, etc.
- the training sample includes multiple pixels
- the coding semantics of the training sample includes the coding semantics of each pixel in the training sample.
- the at least one visual feature vector includes a visual feature vector of each pixel in the training sample
- the visual feature vector of the pixel includes at least one visual feature used to indicate the coding semantics of the pixel.
- The first device inputs the training sample into the object recognition model to be trained, so that the object recognition model to be trained processes the training sample and obtains the visual feature vector of each pixel in the training sample, and the first device obtains the visual feature vector of each pixel output by the object recognition model to be trained.
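A minimal sketch of per-pixel visual feature extraction, assuming PyTorch and a small convolutional backbone (the patent allows any network with a visual feature extraction function, e.g. a CNN, ViT, or ResNet; the layer sizes below are illustrative):

```python
# Sketch: a backbone that maps an image to a visual feature vector per pixel.
import torch
import torch.nn as nn

class PixelFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        # Padding keeps the spatial size so every pixel gets a feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feature_dim, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, H, W) -> per-pixel visual feature vectors (N, D, H, W)
        return self.backbone(image)

features = PixelFeatureExtractor()(torch.rand(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 512, 224, 224])
```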
- the first device identifies the object in the training sample based on the at least one visual feature vector and the first semantic information.
- the first semantic information includes semantic features corresponding to text description information of each object to be annotated.
- the training sample includes an image or a video
- the at least one visual feature vector includes a visual feature vector of each pixel in the training sample.
- the objects in the training sample can be identified through the following operations (1) to (2).
- the training sample includes a first pixel, and the score between the first pixel and the object to be labeled is used to reflect the probability that the first pixel belongs to the object to be labeled.
- The semantic feature corresponding to the text description information of any object to be annotated is also a vector. Based on the visual feature vector of the first pixel and the semantic feature corresponding to the text description information of the object to be annotated, the score between the first pixel and the object to be labeled is obtained according to the following first formula.
- u = E^T f(w, h), where u is the score between the first pixel and the object to be labeled; E is a vector comprising the semantic features corresponding to the text description information of the object to be labeled; E^T is the transpose of E; f(w, h) is the visual feature vector of the first pixel; and (w, h) are the coordinates of the first pixel in the visual data.
- the score between the first pixel and each object to be labeled is calculated according to the above-mentioned first formula.
- The specified condition refers either to selecting any object whose score with the first pixel is greater than the score threshold, or to selecting, among the objects whose scores with the first pixel are greater than the score threshold, the object whose score with the first pixel is the largest.
- That is, from each object to be labeled, any object whose score with the first pixel is greater than the score threshold is selected; or, from each object to be labeled, every object whose score with the first pixel is greater than the score threshold is selected, and from these objects the object whose score with the first pixel is the largest is selected. The first pixel is taken as a pixel of the selected object.
- all pixels belonging to the object can be obtained from the training sample, thereby identifying the object in the training sample.
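The per-pixel scoring and the selection of the object whose score satisfies the specified condition could be sketched as follows, assuming NumPy arrays; the shapes, the use of argmax, and the threshold handling are illustrative assumptions:

```python
# Sketch of the first formula u = E^T f(w, h) and the pixel-to-object
# assignment described above.
import numpy as np

def assign_pixels(pixel_features: np.ndarray,     # (H, W, D) visual feature vectors
                  semantic_features: np.ndarray,  # (K, D), one row per object to label
                  score_threshold: float) -> np.ndarray:
    # Scores u[k, h, w] = E_k^T f(w, h) for every pixel and every object.
    scores = np.einsum("kd,hwd->khw", semantic_features, pixel_features)
    # A pixel belongs to the object with the largest score, provided that
    # score exceeds the threshold; -1 marks "no object selected".
    best = scores.argmax(axis=0)
    best_score = scores.max(axis=0)
    best[best_score <= score_threshold] = -1
    return best  # (H, W) map of selected object indices

masks = assign_pixels(np.random.rand(8, 8, 512), np.random.rand(3, 512), 0.5)
```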
- Assume the recognized object is the above-mentioned first object. After the first object is recognized, at least one component of the first object is identified from the first object based on the object recognition model and the semantic features corresponding to the text description information of the first object. When implemented:
- The semantic features corresponding to the text description information of the first object include the semantic features of each component in the at least one component. Based on the visual feature vector of the second pixel and the semantic features of each component, the score between the second pixel and each component is obtained according to the above-mentioned first formula.
- the first object includes a second pixel, and the score between the second pixel and any component is used to reflect the probability that the second pixel belongs to the component.
- a component whose score satisfies a specified condition with respect to a second pixel is selected from each component, and the second pixel is a pixel in the selected component.
- the component includes at least one subcomponent.
- the at least one subcomponent can be identified from the component in the above manner, and will not be described in detail here.
- Step 604 The first device calculates a loss value through the loss function based on the marked objects in each training sample and the identified objects in each training sample, and adjusts the parameters of the object recognition model to be trained based on the loss value.
- Step 605: The first device determines whether to continue training the object recognition model to be trained. If it determines to continue training the object recognition model to be trained, it returns to step 603; if it determines not to continue training the object recognition model to be trained, it uses the object recognition model to be trained as the object recognition model.
- For example, when the number of times the object recognition model to be trained has been trained reaches a specified number of times, it is determined not to continue training the object recognition model to be trained.
- Alternatively, each verification sample includes annotated objects. Based on the object recognition model to be trained and the semantic features corresponding to the text description information of the annotated objects, the objects in each verification sample are identified. Based on the labeled objects in each verification sample and the identified objects in each verification sample, the accuracy of identifying objects is calculated. When the accuracy does not exceed a specified threshold, it is determined to continue training the object recognition model to be trained; when the accuracy exceeds the specified threshold, it is determined not to continue training the object recognition model to be trained.
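Steps 603 to 605 could be sketched roughly as the following training loop, assuming PyTorch; the loss function, optimizer, einsum-based scoring, and the exact stopping logic are illustrative assumptions rather than details disclosed by the patent:

```python
# Sketch of the training loop (steps 603-605) with a verification-based
# stopping criterion; all names and hyperparameters are assumptions.
import torch

def evaluate(model, val_batches, semantic_features):
    # Fraction of correctly assigned pixels over the verification samples.
    correct, total = 0, 0
    with torch.no_grad():
        for sample, labels in val_batches:
            scores = torch.einsum("kd,ndhw->nkhw", semantic_features, model(sample))
            pred = scores.argmax(dim=1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

def train(model, optimizer, loss_fn, train_batches, val_batches,
          semantic_features, max_steps=10_000, accuracy_threshold=0.9):
    for step, (sample, labels) in enumerate(train_batches):
        # Step 603: identify objects in the training sample via per-pixel scores.
        pixel_features = model(sample)                       # (N, D, H, W)
        scores = torch.einsum("kd,ndhw->nkhw", semantic_features, pixel_features)
        # Step 604: compute the loss against the annotated objects and update.
        loss = loss_fn(scores, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Step 605: decide whether to continue training.
        if (step + 1 >= max_steps
                or evaluate(model, val_batches, semantic_features) > accuracy_threshold):
            break
    return model
```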
- After the first device has trained the object recognition model, it can send the object recognition model to the second device.
- After the second device receives the object recognition model, it obtains the visual data to be processed and identifies the target object in the visual data to be processed based on the object recognition model.
- the embodiment of the present application defines requirement information, which is used to indicate at least one object that needs to be identified, and the at least one object is an object to be identified.
- Based on the demand information and the object recognition model, the second device identifies the target object in the visual data to be processed, and the target object is an object indicated by the demand information, thereby realizing on-demand object recognition.
- this embodiment of the present application provides a method 700 for identifying an object.
- the method 700 is applied to the network architecture 100 shown in Figure 1.
- the method 700 is executed by the second device 102 in the network architecture 100.
- the second device 102 includes an object recognition model, which may be an object recognition model trained by the method 600 shown in FIG. 6 .
- the method 700 includes the following process of steps 701 to 704.
- Step 701 The second device obtains visual data to be processed and indication information of at least one target object to be recognized.
- the above-mentioned requirement information includes indication information of at least one target object to be identified, and the indication information includes text description information of each target object to be identified. At least one target object to be identified is an object that needs to be identified indicated by the requirement information.
- the visual data to be processed includes pictures and/or videos, etc.
- the second device may store at least one visual data that needs to be identified, and one visual data may be selected from the at least one visual data as the visual data to be processed.
- the second device is a device such as a camera, and the second device captures the visual data to be processed.
- the second device may also use other methods to obtain the visual data to be processed, which will not be listed here.
- At least one target object to be identified includes a second object
- the indication information is used to indicate that the second object and at least one component of the second object need to be identified.
- The indication information includes the text description information of the second object, and the text description information is used to indicate the second object and at least one component of the second object.
- The text description information of the second object includes the object category of the second object and the name of at least one component of the second object, so that the text description information of the second object indicates the second object that needs to be identified and at least one component of the second object.
- the second device locally stores indication information of at least one target object to be identified.
- the second device obtains the locally saved indication information of at least one target object to be identified.
- the user inputs indication information of at least one target object to be recognized to the second device, and the second device receives the indication information of at least one target object to be recognized.
- Alternatively, the user inputs the indication information of the at least one target object to be identified to the first device, the first device sends the indication information of the at least one target object to be identified to the second device, and the second device receives the indication information of the at least one target object to be identified.
- the second device may also use other methods to obtain the indication information of at least one target object to be identified, which will not be listed here.
- the indication information includes text description information of each target object to be identified
- The user can refer to the knowledge base to determine the text description information of each target object to be identified.
- the target object may be an object or a component of an object
- The text description information of the target object includes the object category of the target object and/or the names of the components of the target object, etc.
- the user refers to the knowledge base shown in Figure 5 and selects the buildings and cars that need to be identified, as well as the wheels and doors of the cars that need to be identified.
- The user inputs text description information 1 and text description information 2 to the second device; text description information 1 includes the object category "building", and text description information 2 includes the object category "car" and the component names "wheel" and "car door".
- the user selects that the car door needs to be identified, and the user inputs text description information 3 to the second device, where the text description information 3 includes the name of the component "car door".
- Optionally, the indication information of the at least one target object to be recognized also includes position information indicating the position range of the target object in the visual data.
- When the user inputs the text description information of the target object, the user also inputs the position information of the position range of the target object in the visual data.
- the position information indicates that the user needs to identify the target object located at the position information in the visual data.
- the visual data to be processed is a street view image as shown in Figure 2.
- the text description information of the target object input by the user includes the object category "car”. See Figure 8.
- The user can click on the image of a car in the street view image.
- the second device obtains the clicked position, which is a two-dimensional coordinate, and uses the position as position information of the target object in the position range in the visual data.
- the target object in the visual data to be processed can be identified based on the object recognition model and the instruction information to achieve on-demand object identification. See the following steps 702-704 for the detailed implementation process.
- Step 702 The second device acquires second semantic information based on the indication information of at least one target object to be recognized.
- The second semantic information is used to describe the semantics of the at least one target object to be recognized.
- the second semantic information includes semantic features corresponding to the text description information of each target object to be recognized, and the semantic features corresponding to the text description information of each target object are respectively used to reflect the semantics of each target object.
- In step 702, based on the correspondence between text description information and semantic features and the text description information of each target object to be identified, the semantic features corresponding to the text description information of each target object to be identified are obtained respectively.
- the semantic feature may be a vector, and the semantic feature uses mathematical methods to describe the semantics of the target object.
- the correspondence between text description information and semantic features may be a correspondence table.
- Each record in the correspondence table includes a piece of text description information and a semantic feature corresponding to the piece of text description information.
- the semantic features corresponding to the text description information of each target object to be recognized are queried from the correspondence table.
- Alternatively, the correspondence between text description information and semantic features may be a text description information conversion model. In step 702, the text description information of each target object to be recognized is input into the text description information conversion model, so that the text description information conversion model converts the text description information of each target object to be identified and obtains the corresponding semantic features, and the semantic features corresponding to the text description information of each target object to be identified output by the text description information conversion model are obtained.
- the indication information includes text description information of the second object
- the text description information of the second object is used to indicate the second object that needs to be identified and at least one component of the second object. Therefore, the semantic features corresponding to the text description information of the second object include semantic features used to describe the second object and semantic features used to describe each component of the at least one component.
- the indication information also includes position information of the position range of the target object in the visual data.
- The position features of the target object may also be obtained based on the position information of the target object. The position features are used to indicate the spatial orientation of the target object.
- the position feature of the target object may be a vector, and the position feature uses a mathematical method to describe the spatial orientation of the target object.
- the position information of the target object is input into the position conversion model, so that the position conversion model obtains the position characteristics of the target object based on the position of the target object, and obtains the position characteristics of the target object output by the position conversion model.
- The position conversion model is obtained by training an intelligent model: technicians create multiple second samples, each second sample includes the position information of an object and the position features corresponding to that position information, and the intelligent model is trained with the plurality of second samples to obtain the position conversion model.
- the position conversion model is a coordinate encoder, etc.
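A position conversion model of this kind could be sketched as a small coordinate encoder, assuming PyTorch and an MLP architecture; the architecture and dimensions are illustrative assumptions:

```python
# Sketch: a coordinate encoder that maps a clicked (x, y) position to a
# position feature vector describing the spatial orientation of the target.
import torch
import torch.nn as nn

class CoordinateEncoder(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, position: torch.Tensor) -> torch.Tensor:
        # position: (N, 2) normalized (x, y) coordinates -> (N, D) position features
        return self.mlp(position)

position_feature = CoordinateEncoder()(torch.tensor([[0.4, 0.7]]))
```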
- Step 703 The second device obtains at least one visual feature vector based on the object recognition model and the visual data, and the at least one visual feature vector is used to indicate the coding semantics of the visual data.
- the visual data includes pictures and/or videos, the visual data includes multiple pixels, and the coding semantics of the visual data includes the coding semantics of each pixel in the visual data.
- the at least one visual feature vector includes a visual feature vector of each pixel in the visual data, and the visual feature vector of the pixel includes at least one visual feature used to indicate the coding semantics of the pixel.
- In step 703, the second device inputs the visual data into the object recognition model, causing the object recognition model to process the visual data and obtain the visual feature vector of each pixel in the visual data, and the second device obtains the visual feature vector of each pixel output by the object recognition model.
- Step 704 The second device identifies the target object in the visual data based on the at least one visual feature vector and the second semantic information.
- the target object in the visual data can also be identified based on the at least one visual feature vector, the second semantic information and the positional features.
- the target object located at the user's click position is identified from the visual data, that is, the outline of the target object located at the position is identified, realizing on-demand recognition and improving the flexibility of recognition.
- the visual data includes images or videos, and the at least one visual feature vector includes a visual feature vector for each pixel in the visual data.
- the target object in the visual data can be identified through the following operations 7041 to 7042.
- The second device obtains the scores between the third pixel and each target object to be recognized based on the visual feature vector of the third pixel and the second semantic information.
- the visual data includes a third pixel, and the score between the third pixel and the target object to be identified is used to reflect the probability that the third pixel belongs to the target object to be identified.
- The second semantic information includes semantic features corresponding to the text description information of each target object to be recognized.
- The score between the third pixel and the target object to be identified is obtained according to the following second formula.
- U = E^T F(x, y), where U is the score between the third pixel and the target object to be recognized; E is a vector comprising the semantic features corresponding to the text description information of the target object to be recognized; E^T is the transpose of E; F(x, y) is the visual feature vector of the third pixel; and (x, y) are the coordinates of the third pixel in the visual data.
- Optionally, the vector E also includes the position features of the target object to be identified; that is, the vector E includes the semantic features corresponding to the text description information of the target object to be identified and the position features of the target object to be identified.
- the score between the third pixel and each target object to be recognized is calculated according to the above second formula.
- the second device selects a target object whose score with the third pixel point satisfies the specified condition from each target object to be recognized, and the third pixel point is a pixel point in the selected target object.
- That is, from each target object to be identified, any target object whose score with the third pixel is greater than the score threshold is selected; or, from each target object to be identified, every target object whose score with the third pixel is greater than the score threshold is selected, and from these the target object whose score with the third pixel is the largest is selected. The third pixel is taken as a pixel of the selected target object.
- The semantic features corresponding to the text description information of the second object include semantic features used to describe the second object and semantic features used to describe the at least one component of the second object. After the second object is identified, the at least one component is identified from the second object based on the object recognition model and the semantic features corresponding to the text description information of the second object.
- the score between the fourth pixel and each component is obtained according to the above-mentioned second formula.
- the second object includes a fourth pixel, and the scores between the fourth pixel and each component are respectively used to reflect the probability that the fourth pixel belongs to each component.
- a component whose score satisfies a specified condition with respect to a fourth pixel is selected from each component, and the fourth pixel is a pixel in the selected component.
- In this way, hierarchical recognition is achieved: coarse-grained objects are identified first, and fine-grained components are then identified within those objects. Because a component is searched for only within its parent object rather than in the entire visual data, the amount of data to be processed is reduced and recognition efficiency is improved.
- For any component of the second object, the component may include at least one subcomponent; in that case, the at least one subcomponent can be identified from the component in the same manner, which is not described in detail here.
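The object-component-subcomponent recursion described above can be sketched as follows. It is a minimal illustration under stated assumptions: the nested node structure, the fixed threshold, and the name identify_hierarchically are not taken from the patent, and the scoring reuses the same dot-product form as the second formula.

```python
import numpy as np

def identify_hierarchically(pixel_features, node, region_mask, threshold=0.5):
    """Recursively identify an object, then its components, then their subcomponents.

    node is a dict such as {"name": "car", "feature": e_car, "children": [...]}, where
    "feature" is the semantic feature vector for that level and "children" lists the
    component nodes. region_mask restricts the search to pixels already assigned to
    the parent, so components are looked for only inside their object.
    Returns {name: boolean mask} for the node and all of its descendants.
    """
    scores = pixel_features @ node["feature"]        # U = E^T . F(x, y) for every pixel
    mask = (scores > threshold) & region_mask        # pixels belonging to this node
    results = {node["name"]: mask}
    for child in node.get("children", []):
        results.update(identify_hierarchically(pixel_features, child, mask, threshold))
    return results
```

Starting the recursion with region_mask set to an all-True array reproduces the coarse-to-fine behavior described above: the second object is found in the whole image, and its components and subcomponents are then searched only within the pixels of their parent.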
- In some embodiments, the text description information that the correspondence between text description information and semantic features is able to convert includes, but is not limited to, the text description information of each object to be annotated, so the text description information of the target objects to be recognized can cover more objects than the text description information of the annotated objects.
- For example, assume that the text description information of a target object to be recognized includes the text description information of a third object, while the text description information of the objects to be annotated does not. In other words, even though the object recognition model was never trained to recognize the third object, the third object in the visual data can still be recognized based on the object recognition model and the semantic features corresponding to the third object's text description information.
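This open-vocabulary behavior follows directly from the scoring scheme: a new class only needs a semantic feature vector in the same space as the training classes. The sketch below assumes a text_encoder callable that provides such a vector; the patent only requires that a correspondence between text descriptions and semantic features exists, and does not prescribe a particular encoder or threshold.

```python
def recognize_unseen_object(pixel_features, description, text_encoder, threshold=0.5):
    """Recognize an object class that was never annotated during training.

    text_encoder is assumed to map a text description (e.g. "traffic light") into the
    same D-dimensional semantic feature space used by the object recognition model.
    Returns a boolean (H, W) mask of pixels whose score with the new class exceeds
    the threshold.
    """
    e = text_encoder(description)     # semantic feature E of the new, unseen class
    scores = pixel_features @ e       # U = E^T . F(x, y) for every pixel
    return scores > threshold
```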
- In some embodiments, the visual data to be processed includes annotated objects, and the accuracy with which the object recognition model recognizes objects is obtained based on the annotated objects and the recognized target objects. If at least one component of a target object has been recognized, the recognition accuracy of the target object is obtained according to a third formula, in which t_l is the target object, HPQ(t_l) is the recognition accuracy of the target object, u_l is the set of components recognized in the target object, |u_l| is the number of components the target object includes, t_l' is a component of the target object, and HPQ(t_l') is the recognition accuracy of that component.
- If no component of the target object has been recognized, the intersection and the union between the target object and its corresponding annotated object are obtained, and the recognition accuracy of the target object is equal to the ratio between the number of pixels in the intersection and the number of pixels in the union.
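A compact sketch of how such a hierarchical accuracy could be computed is given below. The body of the third formula is not reproduced above, so the rule of averaging a node's own pixel IoU with the HPQ values of its recognized components is an assumption made only to be consistent with the symbol definitions; the function name hpq and the argument layout are likewise illustrative.

```python
import numpy as np

def hpq(pred_mask, gt_mask, components=()):
    """Hierarchical recognition accuracy of one target object.

    components is a sequence of (pred_mask, gt_mask, sub_components) triples, one per
    recognized component t_l' of the target object t_l. A node with no recognized
    components scores its plain pixel IoU against the annotated object.
    """
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = inter / union if union else 0.0
    if not components:
        return iou                                   # leaf case: intersection over union
    child_scores = [hpq(p, g, sub) for p, g, sub in components]
    return (iou + sum(child_scores)) / (1 + len(child_scores))   # assumed averaging rule
```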
- When the accuracy with which the object recognition model recognizes objects is below a specified accuracy threshold, the object recognition model may continue to be trained based on the at least one training sample.
- an embodiment of the present application provides a device 900 for identifying objects.
- The device 900 is deployed on the second device in the network architecture shown in Figure 1.
- The device 900 includes:
- an acquisition unit 901, configured to acquire visual data to be processed and indication information of at least one target object to be recognized;
- the acquisition unit 901 is further configured to obtain semantic information based on the indication information of the at least one target object, the semantic information being used to describe the semantics of the at least one target object;
- a recognition unit 902, configured to recognize the target object in the visual data based on the object recognition model and the semantic information.
- For the detailed process by which the acquisition unit 901 acquires the visual data and the indication information, please refer to the relevant content in step 701 of the method 700 shown in FIG. 7, which is not described in detail here.
- the indication information of the at least one target object includes text description information of the at least one target object
- the recognition unit 902 is configured to:
- obtain at least one visual feature vector based on the object recognition model and the visual data, the at least one visual feature vector being used to indicate the encoded semantics of the visual data; and
- identify the target object in the visual data based on the at least one visual feature vector and the semantic information.
- For the detailed process of the recognition unit 902 obtaining the at least one visual feature vector, please refer to the relevant content in step 703 of the method 700 shown in FIG. 7, which is not described in detail here.
- For the detailed process of the recognition unit 902 identifying the target object in the visual data, please refer to the relevant content in step 704 of the method 700 shown in FIG. 7, which is not described in detail here.
- the indication information of the at least one target object includes indication information of the first object, and the indication information of the first object is used to indicate the first object and at least one component of the first object;
- the recognition unit 902 is configured to:
- identify the first object in the visual data based on the object recognition model and the semantic information; and
- identify the at least one component from the first object based on the object recognition model and the semantic information.
- For the detailed process of the recognition unit 902 identifying the first object and the at least one component of the first object, please refer to the relevant content in operations 7041-7042 of the method 700 shown in FIG. 7, which is not described in detail here.
- the indication information of the at least one target object also includes position information used to indicate the position range of the target object in the visual data
- the acquisition unit 901 is also used to acquire the position characteristics of the target object based on the position information, and the position characteristics are used to indicate the spatial orientation of the target object;
- the identification unit 902 is used to identify the target object in the visual data based on the object recognition model, the semantic information and the location features.
- For the detailed process of the recognition unit 902 identifying the target object in the visual data based on the object recognition model, the semantic information and the position features, please refer to the relevant content in operations 7041-7042 of the method 700 shown in Figure 7, which is not described in detail here.
- the visual data includes images or videos, and the at least one visual feature vector includes a visual feature vector of each pixel in the visual data;
- the recognition unit 902 is configured to:
- obtain, based on the visual feature vector of the first pixel and the semantic information, a score between the first pixel and each target object to be recognized, where the visual data includes the first pixel and the score between the first pixel and a target object to be recognized is used to reflect the probability that the first pixel belongs to that target object; and
- select, from the target objects to be recognized, a target object whose score with the first pixel satisfies the specified condition, the first pixel being a pixel of the selected target object.
- the annotated object includes a second object, the image clarity of the second object exceeds a clarity threshold, and the components of the second object are annotated.
- the visual data includes annotated objects
- the acquisition unit 901 is also configured to obtain the accuracy of object recognition by the object recognition model based on the annotated objects and the target object.
- For the detailed process of the acquisition unit 901 obtaining the accuracy, please refer to the relevant content in step 704 of the method 700 shown in Figure 7, which is not described in detail here.
- In this embodiment of the present application, the acquisition unit obtains the indication information of the at least one target object to be recognized and acquires the second semantic information based on that indication information, where the second semantic information describes the semantics of the at least one target object to be recognized. The recognition unit then recognizes the target object from the visual data based on the second semantic information and the object recognition model, thereby realizing on-demand object recognition and improving the flexibility of object recognition.
- When the indication information indicates the first object and at least one component of the first object, the first object in the visual data is identified based on the semantic information and the object recognition model, and the at least one component is then identified from the first object, which enables hierarchical object recognition and further improves the flexibility of recognition.
- an embodiment of the present application provides a device 1000 for identifying objects.
- the device 1000 includes: a bus 1002, a processor 1004, a memory 1006 and a communication interface 1008.
- the processor 1004, the memory 1006 and the communication interface 1008 communicate through the bus 1002.
- the device 1000 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the device 1000.
- the bus 1002 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
- the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 10, but it does not mean that there is only one bus or one type of bus.
- Bus 1002 may include a path that carries information between various components of computing device 1000 (e.g., memory 1006, processor 1004, communication interface 1008).
- the processor 1004 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
- Memory 1006 may include volatile memory, such as random access memory (RAM).
- the memory 1006 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- Executable program code is stored in the memory 1006, and the processor 1004 executes the executable program code to implement the functions of the acquisition unit 901 and the recognition unit 902 in the device 900 shown in Figure 9, thereby implementing the method of identifying objects. That is, instructions for executing the method of identifying objects are stored in the memory 1006.
- the communication interface 1008 uses transceiver modules such as, but not limited to, network interface cards and transceivers to implement communication between the computing device 1000 and other devices or communication networks.
- An embodiment of the present application also provides a cluster for identifying objects.
- The cluster for identifying objects includes at least one device 1000.
- the device 1000 may be a server, such as a central server, an edge server, or a local server in a local data center.
- the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
- As shown in Figure 11, the cluster for identifying objects includes at least one device 1000.
- The memory 1006 in one or more devices 1000 in the cluster for identifying objects may store the same instructions for executing the method provided by any of the above embodiments.
- Alternatively, the memory 1006 of one or more devices 1000 in the cluster for identifying objects may each store part of the instructions for executing the above method of identifying objects.
- a combination of one or more computing devices 1000 may jointly execute instructions for performing the method provided by any of the above embodiments.
- one or more computing devices in a cluster that identifies objects may be connected through a network.
- the network may be a wide area network or a local area network, etc.
- Figure 11 shows one possible implementation. As shown in Figure 12, two devices 1000A and 1000B are connected through a network; specifically, each device 1000 connects to the network through its communication interface.
- the memory 1006 in the device 1000A stores instructions for executing the function of the acquisition unit 901 in the embodiment shown in FIG. 9 .
- the memory 1006 in the device 1000B stores instructions for performing the functions of the identification unit 902 in the embodiment shown in FIG. 9 .
- the functions of the device 1000A shown in FIG. 12 can also be completed by multiple devices 1000.
- the functions of device 1000B can also be completed by multiple devices 1000.
- An embodiment of the present application also provides another cluster for identifying objects.
- The connection relationship between the computing devices in this cluster may be similar to the connection manner of the cluster described in FIG. 12.
- The difference is that the memory 1006 of one or more devices 1000 in this cluster may store the same instructions for executing the method provided by any of the above embodiments.
- the memory 1006 of one or more devices 1000 in the cluster of identification objects may also store part of the instructions for executing the method provided by any of the above embodiments.
- a combination of one or more devices 1000 can jointly execute instructions for performing the method provided by any of the above embodiments.
Abstract
The present application discloses a method, apparatus, and storage medium for identifying objects, belonging to the field of computer vision. The method includes: acquiring visual data to be processed and indication information of at least one target object to be recognized; acquiring semantic information based on the indication information of the at least one target object, the semantic information being used to describe the semantics of the at least one target object; and identifying the target object in the visual data based on an object recognition model and the semantic information. The present application can improve the flexibility of identifying objects.
Description
本申请要求于2022年6月24日提交的申请号为202210727401.1、发明名称为“一种按需视觉识别的方法”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。以及,本申请还要求于2022年7月19日提交的申请号为202210851482.6、发明名称为“识别对象的方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及计算机视觉领域,特别涉及一种识别对象的方法、装置及存储介质。
视觉数据可以是图像或视频等数据,通过对视觉数据进行处理,得到视觉数据中的对象。然后可以对该对象进行不同的应用,例如基于该对象进行定位,对该对象进行分类或对该对象进行分割等应用。
对视觉数据进行处理时需要使用到对象识别模型,对象识别模型与至少一个对象类别相对应,对象识别模型用于从视觉数据中识别属于该至少一个对象类别的对象。例如,假设对象识别模型对应的对象类别包括苹果、桃子和香蕉,需要处理的视觉数据为图片。如果该图片中包括苹果、桃子和香蕉,则通过对象识别模型从该图片中识别出的对象包括苹果、桃子和香蕉。
目前在视觉数据包括对象识别模型对应的对象类别的对象时,则从该视觉数据中识别出属于该对象类别的所有对象,对象识别的灵活性差。
发明内容
本申请提供了一种识别对象的方法、装置及存储介质,以提高识别对象的灵活性。所述技术方案如下:
第一方面,本申请提供了一种识别对象的方法,在所述方法中,获取待处理的视觉数据和待识别的至少一个目标对象的指示信息。基于该至少一个目标对象的指示信息获取语义信息,该语义信息是用于描述该至少一个目标对象的语义。基于对象识别模型和该语义信息,识别视觉数据中的目标对象。
由于基于该至少一个目标对象的指示信息获取语义信息,该语义信息是用于描述该至少一个目标对象的语义,这样设备通过该语义信息,能够理解该至少一个目标对象的语义,从而能够基于对象识别模型和该语义信息,识别视觉数据中的目标对象。待识别的至少一个目标对象是需要识别的对象,这样实现按需识别该视觉数据中的对象,提高识别对象的灵活性。
在一种可能的实现方式中,该至少一个目标对象的指示信息包括该至少一个目标对象的文本描述信息。基于该至少一个目标对象的文本描述信息与语义特征的对应关系,分别获取该每个目标对象的文本描述信息对应的语义特征,该语义信息包括该每个目标对象的文本描
述信息对应的语义特征。
这样通过该对应关系对每个目标对象的文本描述信息进行转换,得到该每个目标对象的文本描述信息对应的语义特征。通过对应关系的方式实现简单,可以快速转换出每个目标对象的文本描述信息对应的语义特征。
在另一种可能的实现方式中,基于对象识别模型和视觉数据获取至少一个视觉特征向量,该至少一个视觉特征向量用于指示视觉数据的编码语义。基于该至少一个视觉特征向量和该语义信息,识别该视觉数据中的目标对象。
由于该至少一个视觉特征向量用于指示视觉数据的编码语义,这样设备通过该至少一个视觉特征向量能够理解该视觉数据的编码语义,以及通过语义信息理解该至少一个目标对象的语义,从而能够从该视觉数据中准确地识别出需要识别的目标对象。
在另一种可能的实现方式中,至少一个目标对象的指示信息包括第一对象的指示信息,第一对象的指示信息用于指示第一对象和第一对象的至少一个组成部件。基于对象识别模型和该语义信息,识别视觉数据中的第一对象;以及,基于对象识别模型和该语义信息,从第一对象中识别至少一个组成部件。这样可以先识别出对象,再从对象中识别对象的组成部件,实现层次化识别,且从对象中识别组成部件,可以减少需要处理的数据量,提高识别组成部件的效率。
在另一种可能的实现方式中,至少一个目标对象的指示信息还包括用于指示目标对象在该视觉数据中的位置范围的位置信息。基于该位置信息获取目标对象的位置特征,该置特征用于指示目标对象的空间方位。基于对象识别模型、该语义信息和该位置特征,识别视觉数据中的目标对象。这样该位置信息可能是用户在视觉数据中点击的位置,如此从该视觉数据中识别出位于用户点击位置处的目标对象的轮廓,实现按需识别,提高识别的灵活性。
在另一种可能的实现方式中,视觉数据包括图像或视频,该至少一个视觉特征向量包括视觉数据中的每个像素点的视觉特征向量。基于第一像素点的视觉特征向量和该语义信息,分别获取第一像素点与每个待识别的目标对象之间的评分,该视觉数据包括第一像素点,第一像素点与待识别的目标对象之间的评分用于反映第一像素点属于待识别的目标对象的概率。从每个待识别的目标对象中,选择与第一像素点之间的评分满足指定条件的目标对象,第一像素点是选择的目标对象中的像素点。
由于基于第一像素点的视觉特征向量和该语义信息,分别获取第一像素点与每个待识别的目标对象之间的评分,这样通过该评分可以准确地识别出第一像素点属于的对象,能够提高识别对象的精度。
在另一种可能的实现方式中,对象识别模型是基于至少一个训练样本和待标注的至少一个对象的指示信息对应的语义信息进行模型训练得到的,该训练样本包括该指示信息指示的至少一个对象,该至少一个对象中的部分或全部对象被标注。这样待标注的至少一个对象是需要标注的对象,这样可以实现按需要标注对象。且该至少一个对象中的部分或全部对象被
标注,这样可以提高标注对象的灵活性。
在另一种可能的实现方式中,被标注的对象包括第二对象,第二对象的图像清晰度超过清晰度阈值,第二对象的组成部件被标注。也就是说,对于图像清晰度超过清晰度阈值的对象,才标注该对象的组成部件,避免标注出错。
在另一种可能的实现方式中,文本描述信息与语义特征的对应关系能够转换的文本描述信息多于待标注的至少一个对象的文本描述信息。例如,文本描述信息与语义特征的对应关系能够转换的文本描述信息包括第一文本描述信息,待标注的至少一个对象的文本信息不包括第一文本描述信息,也就是说,通过该对应关系和对象识别模型可以识别第一文本描述信息指示的对象,能够识别出超出该对象识别模型对应的对象类别的对象。
在另一种可能的实现方式中,视觉数据包括被标注的对象,基于被标注的对象和目标对象获取对象识别模型识别对象的精度。这样可以基于该精度来评测该对象识别模型识别对象的情况。
第二方面,本申请提供了一种识别对象的装置,用于执行第一方面或第一方面的任意一种可能的实现方式中的方法。具体地,所述装置包括用于执行第一方面或第一方面的任意一种可能的实现方式中的方法的单元。
第三方面,本申请提供了一种识别对象的设备,包括处理器和存储器;所述处理器用于执行所述存储器中存储的指令,以使得所述设备执行第一方面或第一方面的任意可能的实现方式中的方法。
第四方面,本申请提供了一种包含指令的计算机程序产品,当所述指令被设备运行时,使得所述设备执行上述第一方面或第一方面任意可能的实现方式的方法。
第五方面,本申请提供了一种计算机可读存储介质,用于存储计算机程序,所述计算机程序被设备执行时,所述设备执行上述第一方面或第一方面任意可能的实现方式的方法。
第六方面,本申请提供了一种芯片,包括存储器和处理器,存储器用于存储计算机指令,处理器用于从存储器中调用并运行该计算机指令,以执行上述第一方面或第一方面任意可能的实现方式的方法。
图1是本申请实施例提供的一种网络架构示意图;
图2是本申请实施例提供的一种视觉数据的示意图;
图3是本申请实施例提供的一种训练样本的示意图;
图4是本申请实施例提供的另一种训练样本的示意图;
图5是本申请实施例提供的一种知识库的示意图;
图6是本申请实施例提供的一种模型训练的方法流程图;
图7是本申请实施例提供的一种识别对象的方法流程图;
图8是本申请实施例提供的一种用户点击对象的示意图;
图9是本申请实施例提供的一种识别对象的装置结构示意图;
图10是本申请实施例提供的一种识别对象的设备结构示意图;
图11是本申请实施例提供的一种识别对象的集群的结构示意图;
图12是本申请实施例提供的另一种识别对象的集群的结构示意图。
下面将结合附图对本申请实施方式作进一步地详细描述。
视觉识别(visual recognition)技术是指利用计算机来预测和分析视觉数据中各类重要信息的一种技术,是计算机视觉领域的核心研究内容。近年来,得益于深度学习、卷积神经网络等理论和技术的快速发展,视觉识别技术在生产生活的各个方面都得到了广泛应用,例如智慧城市、智慧医疗、自动驾驶等新兴行业都离不开视觉识别技术。
视觉识别技术使用到对象识别模型,对象识别模型是用于识别对象的智能模型。这样视觉识别技术能够基于该对象识别模型,从待处理的视觉数据中识别出目标对像。该目标对象可以用于实现不同的任务。可选地,视觉数据包括图像和/或视频等数据。
在一些实施例中,目标对象用于实现对象分类、对象定位和/或对象分割等诸多任务。例如,以对象分割任务为例进行说明,对象分割任务的目的是从图像中识别出若干个具有特定特征的目标对象(又可称为目标区域),并从图像中分割识别出的目标对象。也就是说,对象分割任务用于将图像分成若干个具有特定特征的目标区域。根据目标对象的不同定义方式又可以细分为不同的对象分割任务,例如可以细分为语义分割(semantic segmentation),实例分割(instance segmentation)和/或部件分割(part segmentation)。
语义分割是指将图像中的每个像素分类到对应的语义概念上;实例分割是指在图像中划分出某些具体实例所对应的区域;部件分割是指将某些实例进一步划分成不同部件所对应的区域。语义分割对应的目标区域为语义概念,语义概念可能是对象类别(例如为道路、汽车、建筑物等类别),实例分割对应的目标区域为对象(例如某个具体的车或某栋具体的建筑物),部件分割对应的目标区域为对象部件(例如车的车门、车身、轮子等)。上述对图像分割任务进行了说明,对视觉识别的其他任务不再一一列举说明。
视觉识别技术包括标注过程、识别过程和评测过程等组成部分,其中,标注过程用于对视觉数据中的对象进行标注得到训练样本。在得到训练样本之后,使用标注的训练样本进行模型训练,得到对象识别模型,标注过程为视觉识别技术提供数据基础。识别过程用于基于对象识别模型从待处理的视觉数据中识别目标对象,识别过程是指具体的执行或运行过程。评测过程用于对识别的结果给出评分和反馈,评测过程用于获取识别对象的精度。
参见图1,本申请实施例提供了一种网络架构100,该网络架构100包括第一设备101和第二设备102,第一设备101与第二设备102通信。可选地,该网络架构100包括一个或多个第二
设备102,第一设备101与每个第二设备102通信。
对于视觉识别技术包括的标注过程,第一设备101可用于执行该标注过程,第一设备101用于协助标注员标注视觉数据中的对象得到训练样本,使用该训练样本进行模型训练,得到对象识别模型。
由于第一设备101与每个第二设备102通信,第一设备101可以在每个第二设备102上部署该对象识别模型。
对于视觉识别技术包括的识别过程和评测过程,第二设备102用于基于该对象识别模型执行识别过程和/或评测过程,第二设备102用于获取待处理的视觉数据,基于该对象识别模型从待处理的视觉数据中识别出目标对象,和/或,对识别出的目标对象进行评分和反馈。
例如,上述第一设备101为计算机,上述每个第二设备102为摄像机,摄像机可以部署在道路等场所,计算机训练出对象识别模型后,将对象识别模型部署在摄像机上。摄像机拍摄得到待处理的视觉数据,基于该对象识别模型识别待处理的视觉数据中的目标对象。
在一些实施例中,对于上述标注过程,第一设备101显示需要标注的视觉数据,标注员对视觉数据中的对象进行标注,得到训练样本。在实现时,第一设备101获取至少一个视觉数据,对于任一个视觉数据,该视觉数据包括至少一个对象,显示该视觉数据。然后,标注员对该视觉数据中的该至少一个对象进行标注,第一设备101将标注后的视觉数据作为训练样本。所谓标注视觉数据中的对象是指使用一种或多种颜色来填充该对象。
目前在对视觉数据中的对象进行标注时,往往标注视觉数据中的各对象,标注的工作量很大。例如,参见图2,该图片中的汽车、建筑物和道路,标注员标注该图片中的各对象,即标注该图片中的汽车、建筑物和道路。然后,第一设备101将该图片作为训练样本,该训练样本中的各对象被标注。
由于训练样本中的各对象被标注,第一设备101基于此训练样本进行模型训练得到对象识别模型,该对象识别模型与该各对象的文本描述信息相对应的,对象识别模型能够识别较多的对象。这样第二设备102在使用该对象识别模型识别待处理的视觉数据中的对象时,往往识别待处理的视觉数据中的该各对象。识别出的对象中可能包括用户不需要识别的对象,不仅导致识别对象的灵活性低,也浪费了大量的计算资源。
为了减小标注的工作量,本申请实施例采用按需标注对象;和/或,为了提高识别对象的灵活性以及避免计算资源浪费,本申请实施例采用按需识别对象。
本申请实施例定义了需要标注的至少一个对象的指示信息,第一设备101获取至少一个视觉数据,对于任一个视觉数据,该视觉数据包括该指示信息指示的至少一个对象,该至少一个对象为待标注的对象。标注员标注该至少一个对象中的部分对象或全部对象,第一设备101将该被标注的视觉数据作为训练样本。即该训练样本包括该指示信息指示的至少一个对象,该至少一个对象中的部分对象或全部对象被标注,如此实现按需标注对象。
在一些实施例中,该指示信息包括该至少一个对象中的每个对象的文本描述信息。对象的文本描述信息用于描述该对象。可选地,对象的文本描述信息包括该对象的对象类别等,假设该对象为汽车,该对象的文本描述信息为“汽车”。
在一些实施例中,标注员还向第一设备101输入被标注的每个对象的文本描述信息,第一设备101关联该训练样本和该被标注的每个对象的文本描述信息。可选地,对于该关联操作,第一设备101在该训练样本中的被标注的对象上标记该被标注的对象对应的文本描述信息,所
以该训练样本包括被标注的对象对应的文本描述信息。
例如,假设需要标注的至少一个对象包括建筑物,对于图2所示的视觉数据,该视觉数据中包括建筑物、汽车和道路,标注员使用黑色填充该图2所示的视觉数据中的建筑物,以实现标注该视觉数据中的建筑物。参见图3,第一设备101将被标注的视觉数据作为训练样本。
在一些实施例中,对于该训练样本中被标注的任一个对象,如果该对象的图像清晰度超过清晰度阈值,标注员还可能继续标注该对象包括的组成部件。即该训练样本中的图像清晰度超过清晰度阈值的被标注的对象,该对象的组成部件也被标注。可选地,可以标注该对象的每个组成部件,或者,标注该对象中的部分组成部件。可选地,在该对象的图像清晰度未超过清晰度阈值,不标注该对象包括的组成部件,但可以标注该对象,这样避免标注该对象的组成部件时出现标注错误的情况。
在一些实施例中,标注员还向第一设备101输入该对象中被标注的组成部件的文本描述信息。可选地,第一设备101在该训练样本中标记该对象中被标注的组成部件的文本描述信息,所以该训练样本包括该对象中被标注的组成部件的文本描述信息。可选地,组成部件的文本描述信息用于描述该组成部件。例如,该组成部件的文本描述信息可以包括该组成部件的名称等。
例如,汽车的组成部件包括车门、车轮和车身等,对于图2所示视觉数据中的图像清晰度超过清晰度阈值的汽车图像,假设需要标注汽车的车轮,参见图4,假设标注员使用黑色填充车轮。
在一些实施例中,对于该对象包括的任一个组成部件,该组成部件包括至少一个子部件,还可以继续标注该组成部件包括的至少一个子部件。例如,对于汽车的车轮,车轮包括轮胎和轮毂,还可以标注该车轮的轮胎和轮毂。
在一些实施例中,第一设备101在得到至少一个训练样本后,还基于该至少一个训练样本建立知识库,该知识库可能是一个图谱,包括多个节点。对于该知识库中的一个节点,该节点表示被标注的对象的文本描述信息,该节点的每个子节点表示该对象的不同组成部件,可选地,该节点保存有该节点表示的对象的文本描述信息,子节点保存有该子节点表示的组成部件的文本描述信息。或者,该节点表示一个对象的组成部件,该节点的每个子节点表示该组成部件的不同子部件,可选地,该节点保存有该节点表示的组成部件的文本描述信息,子节点保存有该子节点表示的子部件的文本描述信息。
在一些实施例中,建立该知识库的操作为:对于任一个训练样本,从该训练样本中获取被标注的对象的文本描述信息,在知识库中建立一个用于保存该文本描述信息的节点;从该训练样本中获取该被标注的对象中的组成部件的文本描述信息,在知识库中建立该节点的子节点,该子节点用于保存该组成部件的文本描述信息。其中,通过知识库可以清晰地得出任一对象类别的对象包括的各组成部件。
在一些实施例中,在知识库中,各对象对应的节点具有相同的父节点,该父节点是一个虚拟节点,是知识库的根节点。例如,对于图4所示的训练样本,基于该训练样本建立如图5所示的知识库,该知识库包括汽车对应的节点1,道路对应的节点2和建筑物对应的节点3等,节点1、节点2和节点3的父节点均为虚拟节点“Root”。节点1的子节点包括车门对应的子节点11,车轮对应的子节点12。对于其他节点的子节点的含义,不再一一列举。
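A minimal sketch of the knowledge base described above, stored as a nested structure, is given below. The layout mirrors the example of Figure 5 (a virtual root with car, road and building nodes, the car node having door and wheel children, and the wheel having tire and hub subcomponents); the dictionary format and the helper name components_of are illustrative assumptions, not part of the patented design.

```python
# Knowledge base as a tree: each node stores the text description information of an
# annotated object, component, or subcomponent; its children are its parts.
knowledge_base = {
    "name": "Root",                                  # virtual root shared by all objects
    "children": [
        {"name": "汽车", "children": [               # car
            {"name": "车门", "children": []},        # door
            {"name": "车轮", "children": [           # wheel
                {"name": "轮胎", "children": []},    # tire
                {"name": "轮毂", "children": []},    # hub
            ]},
        ]},
        {"name": "道路", "children": []},            # road
        {"name": "建筑物", "children": []},          # building
    ],
}

def components_of(node, target):
    """Return the text descriptions of the direct components of the node named target."""
    if node["name"] == target:
        return [child["name"] for child in node["children"]]
    for child in node["children"]:
        found = components_of(child, target)
        if found is not None:
            return found
    return None
```

For example, components_of(knowledge_base, "汽车") returns ["车门", "车轮"], which is how a user can consult the knowledge base when composing the text description information of the objects to be recognized.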
参见图6,本申请实施例提供了一种模型训练的方法600,所述方法600应用于图1所示的网络架构100,所述方法600由该网络架构100中的第一设备101来执行。参见图6,该模型训练的方法600包括如下步骤601-605的流程。
步骤601:第一设备获取至少一个训练样本和待标注的至少一个对象的指示信息,该训练样本包括该指示信息指示的至少一个对象,该至少一个对象中的部分或全部对象被标注。
待标注的至少一个对象为需要标注的对象,第一设备中可能保存有需要标注的对象的指示信息,该指示信息用于指导标注员标注视觉数据中的对象,得到训练样本。
该指示信息包括该至少一个对象的文本描述信息。即该指示信息包括该至少一个对象中的每个对象的文本描述信息,对象的文本描述信息用于描述该对象。例如,该对象的文本描述信息包括该对象的对象类别,通过该对象类别可以描述该对象。
在步骤601中,第一设备获取至少一个视觉数据,对于该视觉数据中的任一个视觉数据,该视觉数据包括该指示信息指示的待标注的至少一个对象,显示该视觉数据。标注员在显示的该视觉数据中标注该至少一个对象中的全部或部分对象。第一设备再将该视觉数据作为训练样本。第一设备可以一一显示该至少一个视觉数据中的每个视觉数据,标注员标注每个视觉数据中的待标注的至少一个对象中的部分或全部对象,第一设备将被标注的每个视觉数据作为训练样本。
步骤602:第一设备基于待标注的至少一个对象的指示信息,获取第一语义信息,第一语义信息是用于描述待标注的至少一个对象的语义。
在一些实施例中,该指示信息包括每个对象的文本描述信息,第一语义信息包括每个对象的文本描述信息对应的语义特征。对象的文本描述信息对应的语义特征用于描述该对象的语义。可选地,该对象的语义特征是一个特征向量。
在步骤602中,基于文本描述信息与语义特征的对应关系,分别获取每个对象的文本描述信息对应的语义特征。
在一些实施例中,文本描述信息与语义特征的对应关系可能是一个对应关系表,该对应关系表中的每条记录包括一个对象的文本描述信息和与该一个文本描述信息相对应的语义特征。这样在步骤602中,基于每个对象的文本对象信息,从该对应关系表中查询每个对象的文本描述信息对应的语义特征。
在一些实施例中,文本描述信息与语义特征的对应关系可能是一个文本描述信息转换模型,该文本描述信息转换模型用于基于待转换的文本描述信息获取与该文本描述信息相对应的语义特征,例如,该文本描述信息转换模型为文本编码器等。这样在步骤602中,将每个对象的文本描述信息输入到该文本描述信息转换模型,使该文本描述信息转换模型对每个对象的文本描述信息进行转换,分别得到每个对象的文本描述信息对应的语义特征,获取该文本描述信息转换模型输出的每个对象的文本描述信息对应的语义特征。
在一些实施例中,该文本描述信息转换模型是对智能模型进行训练得到的,技术人员创建多个第一样本,每个第一样本包括一个对象的文本描述信息和与该文本描述信息相对应的语义特征,使用该多个第一样本训练智能模型,得到文本描述信息转换模型。
在一些实施例中,该至少一个对象包括第一对象,第一对象的文本描述信息用于指示第一对象和第一对象的至少一个组成部件。所以第一对象的文本描述信息的语义特征包括第一
对象的语义特征和该至少一个组成部件中的每个组成部件的语义特征。
接下来,第一设备基于至少一个训练样本和第一语义信息进行模型训练,详细实现过程见如下步骤603-605。
步骤603:第一设备基于至少一个训练样本、第一语义信息和待训练对象识别模型,识别每个训练样本中的对象。
待训练对象识别模型具有视觉特征提取功能,包括卷积神经网络、视觉变压器模型(vision transformer,ViT)或任意具有视觉特征提取功能的网络等。可选地,具有视觉特征提取功能的网络包括深度残差网络(deep residual network,ResNet)等网络。
步骤603可以通过如下6031-6032的操作来实现。
6031:对于该至少一个训练样本中的任一个训练样本,第一设备基于待训练对象识别模型和该训练样本获取至少一个视觉特征向量,该至少一个视觉特征向量用于指示该训练样本的编码语义。
其中,该训练样本包括图片和/或视频等,该训练样本包括多个像素点,该训练样本的编码语义包括该训练样本中的每个像素点的编码语义。该至少一个视觉特征向量包括该训练样本中的每个像素点的视觉特征向量,像素点的视觉特征向量包括至少一个视觉特征,用于指示该像素点的编码语义。
在6031中,第一设备将该训练样本输入到待训练对象识别模型,使待训练对象识别模型对该训练样本进行处理并得到该训练样本中的每个像素点的视觉特征向量,获取待训练对象识别模型输出的该每个像素点的视觉特征向量。
6032:第一设备基于该至少一个视觉特征向量和第一语义信息,识别该训练样本中的对象。
对于待标注的至少一个对象,第一语义信息包括每个待标注的对象的文本描述信息对应的语义特征。
在一些实施例中,该训练样本包括图像或视频,该至少一个视觉特征向量包括该训练样本中的每个像素点的视觉特征向量。在6032,可以通过如下(1)至(2)的操作,识别该训练样本中的对象。
(1):基于第一像素点的视觉特征向量和每个待标注的对象的文本描述信息对应的语义特征,分别获取第一像素点与每个待标注的对象之间的评分。
其中,该训练样本包括第一像素点,第一像素点与待标注的对象之间的评分用于反映第一像素点属于待标注的对象的概率。
任一个待标注的对象的文本描述信息对应的语义特征也是一个向量,基于第一像素点的视觉特征向量和待标注的对象的文本描述信息对应的语义特征,按如下第一公式获取第一像素点与该待标注的对象之间的评分。
第一公式为:u=ET·f(w,h),
在第一公式中,u为第一像素点与该待标注的对象之间的评分,E是一个向量,该向量包括该待标注的对象的文本描述信息对应的语义特征,ET为该向量的转置向量,f(w,h)为第一像素点,(w,h)为第一像素点在该视觉数据中的坐标。
在操作(1)中,按上述第一公式计算出第一像素点与每个待标注的对象之间的评分。
(2):从每个待标注的对象中,选择与第一像素点之间的评分满足指定条件的对象,第一像素点是选择的对象中的像素点。
在一些实施例中,该指定条件是指选择与第一像素点之间的评分大于评分阈值的任一个对象,或者,该指定条件是指选择与第一像素点之间的评分大于评分阈值且与第一像素点之间的评分最大的对象。
也就是说,在操作(2)中,从每个待标注的对象中,选择与第一像素点之间的评分大于评分阈值的任一个对象。或者,从每个待标注的对象中,选择与第一像素点之间的评分大于评分阈值的每个对象,从该每个对象中选择与第一像素点之间的评分最大的一个对象。将第一像素点作为选择的一个对象的像素点。
重复上述(1)-(2)的操作,可以从该训练样本中得到属于该对象的所有像素点,从而识别出该训练样本中的对象。
在一些实施例中,当识别出的对象是上述第一对象,在识别出第一对象后,还基于对象识别模型和第一对象的文本描述信息对应的语义特征,从第一对象中识别至少一个组成部件。在实现时,
第一对象的文本描述信息对应的语义特征包括该至少一个组成部件中的每个组成部件的语义特征,基于第二像素点的视觉特征向量和每个组成部件的语义特征,按上述第一公式获取第二像素点与每个组成部件之间的评分。第一对象包括第二像素点,第二像素点与任一个组成部件之间的评分用于反映第二像素点属于该组成部件的概率。从该每个组成部件中选择与第二像素点之间的评分满足指定条件的组成部件,第二像素点是选择的组成部件中的像素点。重复上述过程,可以从第一对象中识别出属于该选择的组成部件的所有像素点,从而从第一对象中识别出该选择的组成部件。
对于第一对象的任一个组成部件,该组成部件包括至少一个子部件,按上述方式可以从该组成部件中识别出该至少一个子部件,在此不再详细说明。
步骤604:第一设备基于每个训练样本中被标注的对象和每个训练样本中被识别出的对象,通过损失函数计算损失值,基于该损失值调整待训练对象识别模型的参数。
步骤605:第一设备确定是否继续训练待训练对象识别模型,如果确定继续训练待训练对象识别模型,返回步骤603,如果确定不继续训练待训练对象识别模型,将待训练对象识别模型作为对象识别模型。
在一些实施例中,当对待训练对象识别模型进行训练的次数达到指定次数时,确定不继续对待训练对象识别模型进行训练。或者,
使用多个校验样本获取待训练对象识别模型识别对象的精度,在该精度超过指定阈值,确定不继续对待训练对象识别模型进行训练。在实现时:
获取多个校验样本,每个校验样本包括被标注的对象。基于待训练对象识别模型和被标注的对象的文本描述信息对应的语义特征,识别每个校验样本中的对象。基于每个校验样本中被标注的对象和每个校验样本中被识别出的对象,计算识别对象的精度。在该精度未超过指定阈值,确定继续对待训练对象识别模型进行训练,在该精度超过指定阈值,确定不继续对待训练对象识别模型进行训练。
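Steps 603 to 605 describe an iterative training loop: per-pixel visual features are scored against the semantic features of the annotated classes with the first formula, a loss between the annotated and the recognized objects is used to adjust the model parameters, and training stops after a fixed number of iterations or once the accuracy on verification samples exceeds a threshold. The sketch below illustrates one such training step; PyTorch and the cross-entropy loss are assumptions made for illustration, since the text only refers to a generic loss function.

```python
import torch
import torch.nn.functional as F

def train_step(model, image, pixel_labels, class_semantics, optimizer):
    """One training iteration of the object recognition model being trained.

    model maps an image to per-pixel visual feature vectors of shape (H, W, D);
    class_semantics is a (K, D) tensor holding the semantic feature vector of each
    object to be annotated; pixel_labels is an (H, W) tensor of annotated class indices.
    """
    features = model(image)                               # (H, W, D) visual feature vectors
    scores = features @ class_semantics.T                 # first formula u = E^T . f(w, h)
    loss = F.cross_entropy(scores.reshape(-1, scores.shape[-1]),
                           pixel_labels.reshape(-1))      # compare with the annotation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```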
第一设备训练出对象识别模型后,可以向第二设备发送该对象识别模型。第二设备接收
该对象识别模型后,获取待处理的视觉数据,基于该对象识别模型,识别待处理的视觉数据中的目标对象。
本申请实施例定义需求信息,该需求信息用于指示需要识别的至少一个对象,该至少一个对象为待识别的对象。第二设备基于该需求信息和该对象识别模型,识别待处理的视觉数据中的目标对象,目标对象是该需求信息指示的对象,如此实现按需识别对象。按需识别对象的详细实现过程,见如下任一实施例。
参见图7,本申请实施例提供了一种识别对象的方法700,所述方法700应用于图1所示的网络架构100,所述方法700由该网络架构100中的第二设备102来执行,第二设备102包括对象识别模型,该对象识别模型可能是图6所示的方法600训练出的对象识别模型。该方法700包括如下步骤701-步骤704的流程。
步骤701:第二设备获取待处理的视觉数据和待识别的至少一个目标对象的指示信息。
上述需求信息包括待识别的至少一个目标对象的指示信息,该指示信息包括每个待识别的目标对象的文本描述信息。待识别的至少一个目标对象是该需求信息指示的需要识别的对象。
在一些实施例中,待处理的视觉数据包括图片和/或视频等。可选地,第二设备可能保存有至少一个需要识别对象的视觉数据,可以从该至少一个视觉数据中选择一个视觉数据作为待处理的视觉数据。或者,第二设备为摄像机等设备,第二设备拍摄得到待处理的视觉数据。当然,第二设备还可能采用其他方式获取待处理的视觉数据,在此不再一一列举。
在一些实施例中,待识别的至少一个目标对象包括第二对象,该指示信息用于指示需要识别第二对象和第二对象的至少一个组成部件。在实现时,
该指示信息包括第二对象的文本描述信息,该文本描述信息用于指示的第二对象和第二对象的至少一个组成部件。例如,第二对象的文本描述信息包括第二对象的对象类别和第二对象的至少一个组成部件的名称,使得第二对象的文本描述信息表示需要识别的第二对象和第二对象的至少一个组成部件。
在一些实施例中,第二设备本地保存有待识别的至少一个目标对象的指示信息,在步骤701中,第二设备获取本地保存的待识别的至少一个目标对象的指示信息。或者,在步骤701中,用户向第二设备输入待识别的至少一个目标对象的指示信息,第二设备接收待识别的至少一个目标对象的指示信息。或者,在步骤701中,用户向第一设备输入待识别的至少一个目标对象的指示信息,第一设备向第二设备发送待识别的至少一个目标对象的指示信息,第二设备接收待识别的至少一个目标对象的指示信息。当然,第二设备还可能采用其他方式获取待识别的至少一个目标对象的指示信息,在此不再一一列举。
在一些实施例中,对于用户输入的待识别的至少一个目标对象的指示信息,该指示信息包括每个待识别的目标对象的文本描述信息,用户可以参照知识库确定每个待识别的目标对象的文本描述信息。可选的,对于任一个待识别的目标对象,该目标对象可能是一个对象或是对象的一个组成部件,该目标对象的文本描述信息包括该目标对象的对象类别和/或该目标对象的组成部件的名称等。
例如,用户参照图5所示的知识库,选择需要识别建筑物和汽车,以及选择需要识别汽车的车轮和车门。用户向第二设备输入文本描述信息1和文本描述信息2,文本描述信息1
包括对象类别“建筑物”,文本描述信息2包括对象类别“汽车”、组成部件的名称“车轮”和“车门”。或者,用户选择需要识别车门,用户向第二设备输入文本描述信息3,文本描述信息3包括组成部件的名称“车门”。
在一些实施例中,待识别的至少一个目标对象的指示信息还包括目标对象在该视觉数据中的位置范围的位置信息。可选地,用户在输入目标对象的文本描述信息时还输入目标对象在该视觉数据中的位置范围的位置信息。该位置信息表示用户需要识别视觉数据中位于该位置信息处的目标对象。
例如,假设待处理的视觉数据为如图2所示的街景图片,假设用户输入的目标对象的文本描述信息包括对象类别“汽车”,参见图8,在显示该街景图片后,用户可以点击该街景图片中的某一个汽车图像。第二设备获取被点击的位置,该位置是一个二维坐标,并将该位置作为目标对象在该视觉数据中的位置范围中的位置信息。
接下来,可以基于对象识别模型和该指示信息,识别待处理的视觉数据中的目标对象,以实现按需识别对象。详细实现过程见如下步骤702-704。
步骤702:第二设备基于待识别的至少一个目标对象的指示信息获取第二语义信息,第二语义信息是用于描述待识别的至少一个目标对象的语义。
第二语义信息包括每个待识别的目标对象的文本描述信息对应的语义特征,每个目标对象的文本描述信息对应的语义特征分别用于反映每个目标对象的语义。
在步骤702中,基于文本描述信息与语义特征的对应关系和每个待识别的目标对象的文本描述信息,分别获取每个待识别的目标对象的文本描述信息对应的语义特征。
对于任一个目标对象的文本描述信息对应的语义特征,该语义特征可能是一个向量,语义特征是使用数学方式来描述该目标对象的语义。
在一些实施例中,文本描述信息与语义特征的对应关系可能是一个对应关系表,该对应关系表中的每条记录包括一个文本描述信息和与该一个文本描述信息相对应的语义特征。这样在步骤702中,基于每个待识别的目标对象的文本描述信息,从该对应关系表中分别查询每个待识别的目标对象的文本描述信息对应的语义特征。
在一些实施例中,文本描述信息与语义特征的对应关系可能是一个文本描述信息转换模型,这样在步骤702中,将每个待识别的目标对象的文本描述信息输入到该文本描述信息转换模型,使该文本描述信息转换模型对每个待识别的目标对象的文本描述信息进行转换,分别得到每个待识别的目标对象的文本描述信息对应的语义特征,获取该文本描述信息转换模型输出的每个待识别的目标对象的文本描述信息对应的语义特征。
在一些实施例中,该指示信息包括第二对象的文本描述信息,第二对象的文本描述信息用于指示需要识别的第二对象和第二对象的至少一个组成部件。所以第二对象的文本描述信息对应的语义特征包括用于描述第二对象的语义特征和用于描述该至少一个组成部件中的每个组成部件的语义特征。
在一些实施例中,该指示信息还包括目标对象在该视觉数据中的位置范围的位置信息,在步骤702中,还可能基于目标对象的位置信息获取目标对象的位置特征,该位置特征用于指示目标对象的空间方位。
在一些实施例中,目标对象的位置特征可能是一个向量,该位置特征是使用数学方式来描述目标对象的空间方位。
在一些实施例中,将目标对象的位置信息输入到位置转换模型,使该位置转换模型基于目标对象的位置获取目标对象的位置特征,获取该位置转换模型输出的目标对象的位置特征。
在一些实施例中,该位置转换模型是对智能模型进行训练得到的,技术人员创建多个第二样本,每个第二样本包括一个对象的位置信息和与该位置信息相对应的位置特征,使用该多个第二样本训练智能模型,得到位置转换模型。可选地,该位置转换模型为坐标编码器等。
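The position conversion model above maps the position information of a target object (for example, the coordinates of a user's click) to a position feature indicating its spatial orientation. The sketch below uses a fixed sinusoidal coordinate encoding purely as an illustration; the text describes a trained coordinate encoder, so the encoding choice, the dimension, and the function name encode_position are assumptions.

```python
import numpy as np

def encode_position(click_xy, image_size, dim=16):
    """Turn a click position into a position feature vector (one possible coordinate encoder).

    click_xy:   (x, y) pixel coordinates of the user's click in the visual data.
    image_size: (width, height) of the visual data, used to normalize the coordinates.
    Returns a dim-dimensional vector built from sinusoids of the normalized coordinates.
    """
    x, y = click_xy[0] / image_size[0], click_xy[1] / image_size[1]
    freqs = 2.0 ** np.arange(dim // 4)            # dim/4 frequencies per sine/cosine term
    return np.concatenate([np.sin(freqs * np.pi * x), np.cos(freqs * np.pi * x),
                           np.sin(freqs * np.pi * y), np.cos(freqs * np.pi * y)])
```

The resulting vector can then be included in the vector E alongside the semantic features of the target object's text description, as described for the second formula.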
步骤703:第二设备基于对象识别模型和该视觉数据获取至少一个视觉特征向量,该至少一个视觉特征向量用于指示该视觉数据的编码语义。
其中,视觉数据包括图片和/或视频等,视觉数据包括多个像素点,该视觉数据的编码语义包括该视觉数据中的每个像素点的编码语义。该至少一个视觉特征向量包括该视觉数据中的每个像素点的视觉特征向量,像素点的视觉特征向量包括至少一个视觉特征,用于指示该像素点的编码语义。
在步骤703中,第二设备将该视觉数据输入到对象识别模型,使该对象识别模型对该视觉数据进行处理并得到该视觉数据中的每个像素点的视觉特征向量,获取该对象识别模型输出的该每个像素点的视觉特征向量。
步骤704:第二设备基于该至少一个视觉特征向量和第二语义信息,识别该视觉数据中的目标对象。
在一些实施例中,在得到目标对象的位置特征的情况,还能够基于该至少一个视觉特征向量、第二语义信息和该位置特征,识别该视觉数据中的目标对象。从而从该视觉数据中识别出位于用户点击位置处的目标对象,即识别出位于该位置处的目标对象的轮廓,实现按需识别,提高识别的灵活性。
在一些实施例中,视觉数据包括图像或视频,至少一个视觉特征向量包括视觉数据中的每个像素点的视觉特征向量。在步骤704,可以通过如下7041至7042的操作,识别该视觉数据中的目标对象。
7041:第二设备基于第三像素点的视觉特征向量和第二语义信息,分别获取第一像素点与每个待识别的目标对象之间的评分。
其中,视觉数据包括第三像素点,第三像素点与待识别的目标对象之间的评分用于反映第三像素点属于待识别的目标对象的概率。
对于每个待识别的目标对象,第一语义信息包括每个待识别的目标对象的文本描述信息对应的语义特征。在7041中,基于第三像素点的视觉特征向量和任一个待识别的目标对象的文本描述信息对应的语义特征,按如下第二公式获取第三像素点与该待识别的目标对象之间的评分。
第二公式为:U=ET·F(x,y),
在第二公式中,U为第三像素点与该待识别的目标对象之间的评分,E是一个向量,该向量包括该待识别的目标对象的文本描述信息对应的语义特征,ET为该向量的转置向量,F(x,y)为第三像素点,(x,y)为第三像素点在该视觉数据中的坐标。
在一些实施例中,如果还获取到该待识别的目标对象的位置特征,该向量E还包括该待识别的目标对象的位置特征,即该向量E包括该待识别的目标对象的文本描述信息对应的语
义特征和该待识别的目标对象的位置特征。
在7041中,按上述第二公式计算出第三像素点与该每个待识别的目标对象之间的评分。
7042:第二设备从每个待识别的目标对象中,选择与第三像素点之间的评分满足指定条件的目标对象,第三像素点是选择的目标对象中的像素点。
在7042中,从每个待识别的目标对象中,选择与第三像素点之间的评分大于评分阈值的任一个目标对象。或者,从每个待识别的目标对象中,选择与第三像素点之间的评分大于评分阈值的每个对象,从该每个对象中选择与第三像素点之间的评分最大的一个目标对象。将第三像素点作为选择的目标对象的像素点。
重复上述7041-7042的操作,可以从视觉数据中得到属于该选择的目标对象的所有像素点,从而识别出视觉数据中的该选择的目标对象。
在一些实施例中,当目标对象是上述第二对象,第二对象的文本描述信息对应的语义特征包括用于描述第二对象的语义特征和用于描述第二对象的至少一个组成部件的语义特征。在识别出第二对象后,还基于对象识别模型和第二对象的文本描述信息对应的语义特征,从第二对象中识别至少一个组成部件。在实现时,
基于第四像素点的视觉特征向量和每个组成部件的语义特征,按上述第二公式获取第四像素点与每个组成部件之间的评分。第二对象包括第四像素点,第四像素点与每个组成部件之间的评分分别用于反映第四像素点属于每个组成部件的概率。从该每个组成部件中选择与第四像素点之间的评分满足指定条件的组成部件,第四像素点是选择的组成部件中的像素点。重复上述过程,可以从第二对象中识别出属于该选择的组成部件的所有像素点,从而在第二对象中识别出至少一个组成部件。
这样实现层次化识别,即先识别出粗粒度的对象,在从该对象中识别细粒度的组成部件。由于从该对象中识别组成部件,相比从整个视觉数据中识别该组成部件,可以减小要处理的数据量,提高识别效率。
对于第二对象的任一个组成部件,该组成部件包括至少一个子部件,按上述方式可以从该组成部件中识别出该至少一个子部件,在此不再详细说明。
在一些实施例中,文本描述信息与语义特征的对应关系能够转换的文本描述信息包括待标注的每个对象的文本描述信息,这样待识别的每个目标对象的文本描述信息能够多于待标注的每个对象的文本描述信息。例如,假设该待识别的目标对象的文本描述信息包括第三对象的文本描述信息,而该待标注的每个对象的文本描述信息不包括第三对象的文本描述信息,也就是说,在没有训练对象识别模型识别第三对象的情况下,也可以基于第三对象的文本描述信息对应的语义特征和对象识别模型,识别视觉数据中的第三对象。
在一些实施例中,待处理的视觉数据包括被标注的对象。可选地,待处理的视觉数据可能是上述校验样本。基于该被标注的对象和目标对象获取对象识别模型识别对象的精度。在实现时,
确定每个目标对象对应的被标注的对象,对于任一个目标对象,如果识别出该目标对象的至少一个组成部件,基于该被标注的对象和该目标对象,按如下第三公式获取该目标对象的识别精度。
第三公式为:
在第三公式中,tl为该目标对象,HPQ(tl)为该目标对象的识别精度,ul为该目标对象中被识别出的各组成部件,|ul|为该目标对象包括的组成部件的个数,tl'为该目标对象中的某个组成部件,HPQ(tl')为该组成部件的识别精度。
其中,该组成部件与被标注的对象中的一个被标注的组成部件相对应,在该组成部件中的子部件没有被识别的情况下,获取该组成部件与该被标注的组成部件之间的交集,以及获取该组成部件与该被标注的组成部件之间的并集,HPQ(tl')等于该交集中的像素点个数与该并集中的像素点个数之间的比值,HPQ(tl')是该组成部件的识别精度。在该组成部件中的子部件被识别的情况下,迭代上述第三公式先计算出该组成部件的识别精度,即在上述第三公式中,tl为该组成部件,HPQ(tl)为该组件部件的识别精度,ul为该组成部件中被识别出的各子部件,|ul|为该组成部件包括的子部件的个数,tl'为该组成部件中的某个子部件,HPQ(tl')为该子部件的识别精度。
如果没有识别出该目标对象的至少一个组成部件,获取该目标对象与该目标对象对应的被标注的对象之间的交集,以及获取该目标对象与该目标对象对应的被标注的对象之间的并集,该目标对象的识别精度等于该交集中的像素点个数与该并集中的像素点个数之间的比值。
在得到该视觉数据中的每个目标对象的识别精度后,基于该每个目标对象的识别精度,迭代上述第三公式先计算出该对象识别模型识别对象的精度,即在上述第三公式中,tl为该视觉数据,HPQ(tl)为对象识别模型在该视觉数据中识别对象的精度,ul为从该视觉数据中识别出的各目标对象,|ul|为识别出的目标对象的个数,tl'为识别出的某个目标对象,HPQ(tl')为该目标对象的识别精度。
在一些实施例中,当该对象识别模型识别对象的精度小于指定的精度阈值时,还可以继续基于至少一个训练样本,训练该对象识别模型。
在本申请实施例中,由于获取待识别的至少一个目标对象的指示信息,基于该指示信息获取第二语义信息,第二语义信息是用于描述待识别的至少一个目标对象的语义,这样基于第二语义信息和对象识别模型,从视觉数据中识别目标对象,从而实现基于需求来识别对象,提高识别对象的灵活性。对于该指示信息用于指示第一对象和第一对象的至少一个组成部件,这样基于第一语义信息和对象识别模型,识别视觉数据中的第一对象,以及从第一对象中识别该至少一个组成部件,这样能够层次化识别对象,更能提高识别的灵活性。由于先识别出第一对象,在从第一对象中识别至少一个组成部件,相比从整个视觉数据中识别该至少一个组成部件,可以减小需要处理的数据量,减小对计算资源的占用,以及提高识别该至少一个组成部件的效率。
参见图9,本申请实施例提供了一种识别对象的装置900,所述装置900部署在图1所示
的网络架构100中的第二设备上或图7所示方法700中的第二设备上,包括:
获取单元901,用于获取待处理的视觉数据和待识别的至少一个目标对象的指示信息;
获取单元901,还用于基于该至少一个目标对象的指示信息获取语义信息,该语义信息是用于描述该至少一个目标对象的语义;
识别单元902,用于基于对象识别模型和该语义信息,识别该视觉数据中的目标对象。
可选地,获取单元901获取该视觉数据和该指示信息的详细实现过程,参见图7所示的方法700的步骤701中的相关内容,在此不再详细说明。
可选地,获取单元901获取语义信息的详细实现过程,参见图7所示的方法700的步骤702中的相关内容,在此不再详细说明。
可选地,识别单元902识别目标对象的详细实现过程,参见图7所示的方法700的步骤703和704中的相关内容,在此不再详细说明。
可选地,该至少一个目标对象的指示信息包括该至少一个目标对象的文本描述信息;
获取单元901,用于基于该至少一个目标对象的文本描述信息与语义特征的对应关系,分别获取每个目标对象的文本描述信息对应的语义特征,该语义信息包括每个目标对象的文本描述信息对应的语义特征。
可选地,获取单元901获取每个目标对象的文本描述信息对应的语义特征的详细实现过程,参见图7所示的方法700的步骤702中的相关内容,在此不再详细说明。
可选地,识别单元902,用于:
基于对象识别模型和视觉数据获取至少一个视觉特征向量,该至少一个视觉特征向量用于指示视觉数据的编码语义;
基于至少一个视觉特征向量和该语义信息,识别视觉数据中的目标对象。
可选地,识别单元902获取至少一个视觉特征向量的详细实现过程,参见图7所示的方法700的步骤703中的相关内容,在此不再详细说明。
可选地,识别单元902识别视觉数据中的目标对象的详细实现过程,参见图7所示的方法700的步骤704中的相关内容,在此不再详细说明。
可选的,该至少一个目标对象的指示信息包括第一对象的指示信息,第一对象的指示信息用于指示第一对象和第一对象的至少一个组成部件;
识别单元902,用于:
基于对象识别模型和该语义信息,识别视觉数据中的第一对象;以及,
基于对象识别模型和该语义信息,从第一对象中识别至少一个组成部件。
可选地,识别单元902识别第一对象和第一对象的至少一个组成部件的详细实现过程,参见图7所示的方法700的7041-7042中的相关内容,在此不再详细说明。
可选地,该至少一个目标对象的指示信息还包括用于指示目标对象在视觉数据中的位置范围的位置信息,
获取单元901,还用于基于该位置信息获取目标对象的位置特征,该位置特征用于指示目标对象的空间方位;
识别单元902,用于基于该对象识别模型、该语义信息和该位置特征,识别视觉数据中的目标对象。
可选地,识别单元902识别视觉数据中的目标对象的详细实现过程,参见图7所示的方
法700的步骤7041-7042中的相关内容,在此不再详细说明。
可选地,该视觉数据包括图像或视频,该至少一个视觉特征向量包括视觉数据中的每个像素点的视觉特征向量;
识别单元902,用于:
基于第一像素点的视觉特征向量和该语义信息,分别获取第一像素点与每个待识别的目标对象之间的评分,视觉数据包括第一像素点,第一像素点与待识别的目标对象之间的评分用于反映第一像素点属于待识别的目标对象的概率;
从每个待识别的目标对象中,选择与第一像素点之间的评分满足指定条件的目标对象,第一像素点是选择的目标对象中的像素点。
可选地,识别单元902获取评分的详细实现过程,参见图7所示的方法700的7041中的相关内容,在此不再详细说明。
可选地,识别单元902选择目标对象的详细实现过程,参见图7所示的方法700的7042中的相关内容,在此不再详细说明。
可选地,对象识别模型是基于至少一个训练样本和待标注的至少一个对象的指示信息对应的语义信息进行模型训练得到的,训练样本包括该指示信息指示的至少一个对象,至少一个对象中的部分或全部对象被标注。
可选地,被标注的对象包括第二对象,第二对象的图像清晰度超过清晰度阈值,第二对象的组成部件被标注。
可选地,该视觉数据包括被标注的对象,获取单元901,还用于基于被标注的对象和目标对象获取对象识别模型识别对象的精度。
可选地,获取单元901获取精度的详细实现过程,参见图7所示的方法700的步骤704中的相关内容,在此不再详细说明。
在本申请实施例中,由于获取单元获取待识别的至少一个目标对象的指示信息,基于该指示信息获取第二语义信息,第二语义信息是用于描述待识别的至少一个目标对象的语义,这样识别单元基于第二语义信息和对象识别模型,从视觉数据中识别目标对象,从而实现基于需求来识别对象,提高识别对象的灵活性。对于该指示信息用于指示第一对象和第一对象的至少一个组成部件,这样基于第一语义信息和对象识别模型,识别视觉数据中的第一对象,以及从第一对象中识别该至少一个组成部件,这样能够层次化识别对象,更能提高识别的灵活性。
参见图10,本申请实施例提供了一种识别对象的设备1000。如图10所示,该设备1000包括:总线1002、处理器1004、存储器1006和通信接口1008。处理器1004、存储器1006和通信接口1008之间通过总线1002通信。该设备1000可以是服务器或终端设备。应理解,本申请不限定该设备1000中的处理器、存储器的个数。
总线1002可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图10中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。总线1002可包括在计算设备1000各个部件(例如,存储器1006、处理器1004、通信接口1008)之间传送信息的通路。
处理器1004可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
存储器1006可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。处理器1004还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。
参见图10,存储器1006中存储有可执行的程序代码,处理器1004执行该可执行的程序代码以分别实现图9所示的装置900中的获取单元901和识别单元902的功能,从而实现识别对象的方法。也即,存储器1006上存有用于执行识别对象的方法的指令。
通信接口1008使用例如但不限于网络接口卡、收发器一类的收发模块,来实现计算设备1000与其他设备或通信网络之间的通信。
本申请实施例还提供了一种识别对象的集群。该识别对象的集群包括至少一台设备1000。该设备1000可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。
如图11所示,所述识别对象的集群包括至少一个设备1000。识别对象的集群中的一个或多个设备1000中的存储器1006中可以存有相同的用于执行上述任意实施例提供的方法的指令。
在一些可能的实现方式中,该识别对象的集群中的一个或多个设备1000的存储器1006中也可以分别存有用于执行上述识别对象的方法的部分指令。换言之,一个或多个计算设备1000的组合可以共同执行用于执行上述任意实施例提供的方法的指令。
在一些可能的实现方式中,识别对象的集群中的一个或多个计算设备可以通过网络连接。其中,所述网络可以是广域网或局域网等等。图11示出了一种可能的实现方式。如图12所示,两个设备1000A和1000B之间通过网络进行连接。具体地,通过各个设备1000中的通信接口与所述网络进行连接。
在这一类可能的实现方式中,设备1000A中的存储器1006中存有执行如图9所示实施例中的获取单元901功能的指令。同时,设备1000B中的存储器1006中存有执行如图9所示实施例中的识别单元902的功能的指令。
应理解,图12中示出的设备1000A的功能也可以由多个设备1000完成。同样,设备1000B的功能也可以由多个设备1000完成。
本申请实施例还提供了另一种识别对象的集群。该识别对象的集群中各计算设备之间的连接关系可以类似的参考图12所述处理源代码的集群的连接方式。不同的是,该识别对象的集群中的一个或多个设备1000中的存储器1006中可以存有相同的用于执行上述任意实施例提供的方法的指令。
在一些可能的实现方式中,该识别对象的集群中的一个或多个设备1000的存储器1006中也可以分别存有用于执行上述任意实施例提供的方法的部分指令。换言之,一个或多个设备1000的组合可以共同执行用于执行上述任意实施例提供的方法的指令。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
Claims (20)
- 一种识别对象的方法,其特征在于,所述方法包括:获取待处理的视觉数据和待识别的至少一个目标对象的指示信息;基于所述至少一个目标对象的指示信息获取语义信息,所述语义信息是用于描述所述至少一个目标对象的语义;基于对象识别模型和所述语义信息,识别所述视觉数据中的所述目标对象。
- 如权利要求1所述的方法,其特征在于,所述至少一个目标对象的指示信息包括所述至少一个目标对象的文本描述信息;所述基于所述至少一个目标对象的指示信息获取语义信息,包括:基于所述至少一个目标对象的文本描述信息与语义特征的对应关系,分别获取每个目标对象的文本描述信息对应的语义特征,所述语义信息包括所述每个目标对象的文本描述信息对应的语义特征。
- 如权利要求1或2所述的方法,其特征在于,所述基于对象识别模型和所述语义信息,识别所述视觉数据中的目标对象,包括:基于所述对象识别模型和所述视觉数据获取至少一个视觉特征向量,所述至少一个视觉特征向量用于指示所述视觉数据的编码语义;基于所述至少一个视觉特征向量和所述语义信息,识别所述视觉数据中的所述目标对象。
- 如权利要求1-3任一项所述的方法,其特征在于,所述至少一个目标对象的指示信息包括第一对象的指示信息,所述第一对象的指示信息用于指示所述第一对象和所述第一对象的至少一个组成部件;所述基于对象识别模型和所述语义信息,识别所述视觉数据中的所述目标对象,包括:基于所述对象识别模型和所述语义信息,识别所述视觉数据中的所述第一对象;以及,基于所述对象识别模型和所述语义信息,从所述第一对象中识别所述至少一个组成部件。
- 如权利要求1-3任一项所述的方法,其特征在于,所述至少一个目标对象的指示信息还包括用于指示所述目标对象在所述视觉数据中的位置范围的位置信息,所述方法还包括:基于所述位置信息获取所述目标对象的位置特征,所述位置特征用于指示所述目标对象的空间方位;所述基于对象识别模型和所述语义信息,识别所述视觉数据中的所述目标对象,包括:基于所述对象识别模型、所述语义信息和所述位置特征,识别所述视觉数据中的所述目标对象。
- 如权利要求3所述的方法,其特征在于,所述视觉数据包括图像或视频,所述至少一个视觉特征向量包括所述视觉数据中的每个像素点的视觉特征向量;所述基于所述至少一个视觉特征向量和所述语义信息,识别所述视觉数据中的所述目标 对象,包括:基于第一像素点的视觉特征向量和所述语义信息,分别获取所述第一像素点与每个待识别的目标对象之间的评分,所述视觉数据包括所述第一像素点,所述第一像素点与所述待识别的目标对象之间的评分用于反映所述第一像素点属于所述待识别的目标对象的概率;从所述每个待识别的目标对象中,选择与所述第一像素点之间的评分满足指定条件的目标对象,所述第一像素点是所述选择的目标对象中的像素点。
- 如权利要求1-6任一项所述的方法,其特征在于,所述对象识别模型是基于至少一个训练样本和待标注的至少一个对象的指示信息对应的语义信息进行模型训练得到的,所述训练样本包括所述指示信息指示的至少一个对象,所述至少一个对象中的部分或全部对象被标注。
- 如权利要求7所述的方法,其特征在于,所述被标注的对象包括第二对象,所述第二对象的图像清晰度超过清晰度阈值,所述第二对象的组成部件被标注。
- 如权利要求1-8任一项所述的方法,其特征在于,所述视觉数据包括被标注的对象,所述方法还包括:基于所述被标注的对象和所述目标对象获取所述对象识别模型识别对象的精度。
- 一种识别对象的装置,其特征在于,所述装置包括:获取单元,用于获取待处理的视觉数据和待识别的至少一个目标对象的指示信息;所述获取单元,还用于基于所述至少一个目标对象的指示信息获取语义信息,所述语义信息是用于描述所述至少一个目标对象的语义;识别单元,用于基于对象识别模型和所述语义信息,识别所述视觉数据中的所述目标对象。
- 如权利要求10所述的装置,其特征在于,所述至少一个目标对象的指示信息包括所述至少一个目标对象的文本描述信息;所述获取单元,用于基于所述至少一个目标对象的文本描述信息与语义特征的对应关系,分别获取每个目标对象的文本描述信息对应的语义特征,所述语义信息包括所述每个目标对象的文本描述信息对应的语义特征。
- 如权利要求10或11所述的装置,其特征在于,所述识别单元,用于:基于所述对象识别模型和所述视觉数据获取至少一个视觉特征向量,所述至少一个视觉特征向量用于指示所述视觉数据的编码语义;基于所述至少一个视觉特征向量和所述语义信息,识别所述视觉数据中的所述目标对象。
- 如权利要求10-12任一项所述的装置,其特征在于,所述至少一个目标对象的指示信息包括第一对象的指示信息,所述第一对象的指示信息用于指示所述第一对象和所述第一对 象的至少一个组成部件;所述识别单元,用于:基于所述对象识别模型和所述语义信息,识别所述视觉数据中的所述第一对象;以及,基于所述对象识别模型和所述语义信息,从所述第一对象中识别所述至少一个组成部件。
- 如权利要求10-12任一项所述的装置,其特征在于,所述至少一个目标对象的指示信息还包括用于指示所述目标对象在所述视觉数据中的位置范围的位置信息,所述获取单元,还用于基于所述位置信息获取所述目标对象的位置特征,所述位置特征用于指示所述目标对象的空间方位;所述识别单元,用于基于所述对象识别模型、所述语义信息和所述位置特征,识别所述视觉数据中的所述目标对象。
- 如权利要求12所述的装置,其特征在于,所述视觉数据包括图像或视频,所述至少一个视觉特征向量包括所述视觉数据中的每个像素点的视觉特征向量;所述识别单元,用于:基于第一像素点的视觉特征向量和所述语义信息,分别获取所述第一像素点与每个待识别的目标对象之间的评分,所述视觉数据包括所述第一像素点,所述第一像素点与所述待识别的目标对象之间的评分用于反映所述第一像素点属于所述待识别的目标对象的概率;从所述每个待识别的目标对象中,选择与所述第一像素点之间的评分满足指定条件的目标对象,所述第一像素点是所述选择的目标对象中的像素点。
- 如权利要求10-15任一项所述的装置,其特征在于,所述对象识别模型是基于至少一个训练样本和待标注的至少一个对象的指示信息对应的语义信息进行模型训练得到的,所述训练样本包括所述指示信息指示的至少一个对象,所述至少一个对象中的部分或全部对象被标注。
- 如权利要求16所述的装置,其特征在于,所述被标注的对象包括第二对象,所述第二对象的图像清晰度超过清晰度阈值,所述第二对象的组成部件被标注。
- 如权利要求10-17任一项所述的装置,其特征在于,所述视觉数据包括被标注的对象,所述获取单元,还用于基于所述被标注的对象和所述目标对象获取所述对象识别模型识别对象的精度。
- 一种设备,其特征在于,包括处理器和存储器;所述处理器用于执行所述存储器中存储的指令,以使得所述设备执行如权利要求1-9任一项所述的方法。
- 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被设备执行时,所述设备执行如权利要求1-9任一项所述的方法。
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210727401.1 | 2022-06-24 | ||
CN202210727401 | 2022-06-24 | ||
CN202210851482.6 | 2022-07-19 | ||
CN202210851482.6A CN117333868A (zh) | 2022-06-24 | 2022-07-19 | 识别对象的方法、装置及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023246641A1 true WO2023246641A1 (zh) | 2023-12-28 |
Family
ID=89281753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/100703 WO2023246641A1 (zh) | 2022-06-24 | 2023-06-16 | 识别对象的方法、装置及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117333868A (zh) |
WO (1) | WO2023246641A1 (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019055114A1 (en) * | 2017-09-12 | 2019-03-21 | Hrl Laboratories, Llc | VIEW-FREE VIEW-SENSITIVE SYSTEM FOR ATTRIBUTES THROUGH SHARED REPRESENTATIONS |
CN114299321A (zh) * | 2021-08-04 | 2022-04-08 | 腾讯科技(深圳)有限公司 | 视频分类方法、装置、设备及可读存储介质 |
CN114401417A (zh) * | 2022-01-28 | 2022-04-26 | 广州方硅信息技术有限公司 | 直播流对象跟踪方法及其装置、设备、介质 |
CN114429552A (zh) * | 2022-01-21 | 2022-05-03 | 北京有竹居网络技术有限公司 | 对象属性识别方法、装置、可读存储介质及电子设备 |
CN114581710A (zh) * | 2022-03-04 | 2022-06-03 | 腾讯科技(深圳)有限公司 | 图像识别方法、装置、设备、可读存储介质及程序产品 |
Also Published As
Publication number | Publication date |
---|---|
CN117333868A (zh) | 2024-01-02 |