WO2022179587A1 - Method and apparatus for feature extraction - Google Patents

Method and apparatus for feature extraction

Info

Publication number
WO2022179587A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
fusion
feature extraction
processed
Prior art date
Application number
PCT/CN2022/077807
Other languages
English (en)
French (fr)
Inventor
韩凯
王云鹤
肖安
郭健元
许春景
钱莉
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22758951.2A (published as EP4290408A1)
Publication of WO2022179587A1
Priority to US18/237,995 (published as US20230419646A1)

Classifications

    • G06V 10/806: Fusion of extracted features (combining data at the sensor, preprocessing, feature extraction or classification level)
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region; detection of occlusion
    • G06V 10/40: Extraction of image or video features
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/803: Fusion of input or preprocessed data (at the sensor, preprocessing, feature extraction or classification level)
    • G06V 10/82: Image or video recognition or understanding using neural networks

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a method and device for feature extraction.
  • Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military. It is the study of how to use cameras/camcorders and computers to obtain the data and information of a photographed subject. Figuratively speaking, it means installing eyes (cameras/camcorders) and a brain (algorithms) on a computer so that the computer can identify, track and measure targets in place of the human eye, and thereby perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of the visual organ to obtain input information, and then uses the computer in place of the brain to process and interpret that input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • Visual perception models are used for tasks including image classification, 2D detection, semantic segmentation, keypoint detection, linear object detection (such as detection of lane lines or stop lines in autonomous driving technology), drivable area detection, scene recognition, and so on. How to enable a visual perception model to better complete its target task, and how to improve the performance and effect of the visual perception model, are matters of great concern.
  • the present application provides a method and device for feature extraction, which can enable the extracted features of an object to be processed to better represent the object to be processed, thereby improving the performance of a model to which the feature extraction method is applied.
  • the present application provides a method for feature extraction, which may include: performing feature extraction on a first vector by using a first feature extraction model to obtain a first feature.
  • the first vector is used to represent the first segmented object.
  • the first segmented object may include some elements in the object to be processed.
  • The data type of the object to be processed may be image data, text data, or voice data. It can be understood that a segmented object includes some elements of the object to be processed: when the object to be processed is an image, some elements of the object to be processed refer to some pixels in the image; when the object to be processed is text or voice, some elements of the object to be processed refer to characters or words in the text or voice.
  • Feature extraction is performed on the second vector by the second feature extraction model to obtain a plurality of second features.
  • the second vector is used to represent some of the elements in the first segmented object.
  • the at least two second features are fused according to the first target weight to obtain the first fusion feature, the first target weight is determined according to the first parameter value, and the first target weight and the first parameter value are positively correlated.
  • The first parameter value is used to represent the similarity between each of the at least two second features and a target second feature, where the target second feature is any one of the at least two second features; alternatively, the first target weight is a second parameter value, and the second parameter value includes at least one preset constant.
  • The similarity between one or more second features and the target second feature can be measured in different ways, for example by the size of the inner product between two second features. Suppose the target second feature is feature B: if the inner product between feature A and feature B is greater than the inner product between feature C and feature B, then feature A is more similar to feature B and has a greater influence on feature B, while feature C has a smaller influence on feature B. The weights can then be set to 0.9 and 0.1 respectively, and fusing the at least two second features according to the first target weight can be understood as computing 0.9*A + B + 0.1*C; this result represents one first fusion feature.
  • The first target weight may also be preset. For example, each of the at least two second features can be assumed to have the same impact on the target second feature; in that case, the target second feature and the other one or more second features are averaged, and the average is superimposed on the target second feature (a sketch is given after this bullet). It should be noted that the ways of measuring the influence of one or more second features on the target second feature are not exhaustively listed above; besides the measurement methods listed, other methods may also be used.
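  • A minimal, illustrative sketch of the two fusion options described above follows. It is not the patent's exact implementation: the softmax-style normalization of the inner products and the function names are assumptions made for illustration only.

```python
import numpy as np

# Option 1: weights derived from inner products (larger inner product -> larger weight),
# matching the positive correlation between the first target weight and the first parameter value.
def fuse_second_features(features, target_idx):
    """features: (m, c) array of m second features; target_idx: index of the target second feature."""
    target = features[target_idx]            # the target second feature
    sims = features @ target                  # inner product with every second feature
    weights = np.exp(sims - sims.max())       # softmax-style normalization (one possible choice)
    weights /= weights.sum()
    return weights @ features                 # weighted sum = one first fusion feature

# Option 2: preset weights; average the other second features and superimpose the average
# onto the target second feature, as described in the bullet above.
def fuse_with_preset_weights(features, target_idx):
    others = np.delete(features, target_idx, axis=0)
    return features[target_idx] + others.mean(axis=0)

feats = np.random.randn(4, 8)                 # 4 second features of length 8
fused = fuse_second_features(feats, target_idx=0)
```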
  • the first feature and the first fused feature are fused to obtain a second fused feature, and the second fused feature is used to obtain the feature of the object to be processed.
  • the second fusion feature is used to determine the final feature of the object to be processed.
  • The second fusion feature output by the last feature extraction module among the multiple feature extraction modules in the first feature extraction model is used to determine the final extracted feature of the object to be processed. For each segmented object, the last feature extraction module outputs a corresponding second fusion feature, and the set of these second fusion features is the final feature of the object to be processed.
  • weighting processing is performed on the second fusion feature corresponding to each segmented object output by the last feature extraction module, and the result of the weighting processing is used as the final feature of the object to be processed.
  • In this solution, the first fusion feature, which captures the association relationship between elements, is fused with the first feature, so that the extracted feature includes both the elements and the relationships between the elements and can therefore better characterize the object to be processed.
  • the method may further include: acquiring a third feature, where the third feature is acquired by performing feature extraction on the third vector by using the first feature extraction model.
  • the third vector is used to represent the second segmented object, and the second segmented object may include some elements in the object to be processed.
  • Performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: fusing the first feature and the third feature according to the second target weight to obtain the third fusion feature.
  • The second target weight is determined according to a third parameter value, and the third parameter value is used to represent the similarity between the third feature and the first feature; alternatively, the second target weight is a fourth parameter value, and the fourth parameter value includes at least one preset constant.
  • a fusion process is performed on the third fusion feature and the first fusion feature to obtain the second fusion feature.
  • In this solution, an association relationship between segmented objects is established through the first feature extraction model. In this way, both the relationships between segmented objects and the relationships between elements are preserved, so that the extracted features can better characterize the object to be processed, and the performance of the model to which the feature extraction method is applied can be improved.
  • The first vector is specifically used to represent the first segmented object carrying first position information, and the first position information is the position information of the first segmented object within the object to be processed.
  • the first position information may be represented by the coordinate information of one pixel or may be represented by the coordinate information of a plurality of pixels.
  • the position information of each image block can be represented by the coordinates of the upper left corner of each image block.
  • the position information of each image block may be represented by the coordinates of the upper left pixel and the coordinates of the lower right pixel of each image block.
  • the first position information may also be represented by a coding vector.
  • In this solution, the first vector carries more information, namely the first position information, so that the first feature extraction model can obtain more information. The more information the first feature extraction model can obtain, the more it helps the first feature extraction model learn to extract image features better.
  • Each second vector is specifically used to represent some elements in the first segmented object carrying second position information, and the second position information is the position information of those elements within the first segmented object.
  • In this solution, the second vector carries more information, namely the second position information. The more information the second feature extraction model can obtain, the more it helps the second feature extraction model learn to extract image features better. A minimal sketch of attaching position information is given below.
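  • The following sketch illustrates one common way of attaching position information to the vectors. It is an assumption for illustration: here the position information is encoded as vectors that are simply added to the block and pixel-block vectors; other encodings are equally possible.

```python
import numpy as np

n, m, c = 9, 16, 24                        # image blocks, pixel blocks per image block, vector length
block_vectors = np.random.randn(n, c)      # first vectors (one per image block / first segmented object)
pixel_vectors = np.random.randn(n, m, c)   # second vectors (pixel blocks inside each image block)

block_pos = np.random.randn(n, c)          # first position information, one code per image block
pixel_pos = np.random.randn(m, c)          # second position information, shared across image blocks

first_vectors  = block_vectors + block_pos               # first vectors carrying first position info
second_vectors = pixel_vectors + pixel_pos[None, :, :]   # second vectors carrying second position info
```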
  • Performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: performing end-to-end splicing processing on the first feature and the first fusion feature to obtain the second fusion feature.
  • a specific method of performing fusion processing on the first feature and the first fusion feature is provided, which increases the diversity of the solution.
  • performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: performing target operation on the first feature and the first fusion feature to obtain the second fusion feature.
  • the target operation may include at least one of addition or multiplication.
  • a specific method of performing fusion processing on the first feature and the first fusion feature is provided, which increases the diversity of the solution.
  • Performing the target operation on the first feature and the first fusion feature to obtain the second fusion feature may include: when there are multiple first fusion features, splicing the multiple first fusion features end to end to obtain a spliced feature; mapping the spliced feature to a feature of a target length, where the target length is determined according to the length of the first feature; and adding the first feature and the feature of the target length to obtain the second fusion feature.
  • In this embodiment, a specific method of fusing the first feature and the first fusion feature is provided, which increases the diversity of the solution; a sketch of this variant is given below.
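  • A hedged sketch of the splice-project-add variant described above: multiple first fusion features are spliced end to end, mapped to the length of the first feature, and added to it. The projection matrix W_proj is illustrative; in practice it would be a learned parameter.

```python
import numpy as np

c = 24                                            # length of the first feature (the target length)
m = 16                                            # number of first fusion features
first_feature = np.random.randn(c)
first_fusion  = np.random.randn(m, c)             # multiple first fusion features

spliced = first_fusion.reshape(-1)                # end-to-end splicing -> length m*c
W_proj  = np.random.randn(c, m * c) / np.sqrt(m * c)   # illustrative projection to the target length
mapped  = W_proj @ spliced                        # spliced feature mapped to length c
second_fusion_feature = first_feature + mapped    # addition yields the second fusion feature
```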
  • the at least two second features are fused according to the first target weight to obtain the first fusion feature, which may include: using the at least two second features as the input of the target model,
  • the output of the target model is the first fusion feature
  • the target model can include one of the self-attention network Transformer, the convolutional neural network CNN or the recurrent neural network RNN.
  • When the target model is the Transformer, the first target weight is determined according to the inner product between each of the at least two second features and the target second feature.
  • When the target model is a CNN or an RNN, the first target weight is preset. In this embodiment, several ways of acquiring the first fusion feature are provided, which increases the diversity of the solution.
  • When the object to be processed is an image to be processed, the first vector is specifically used to represent a first segmented image, and the first segmented image may specifically include some of the pixels in the image to be processed; the second vector is specifically used to represent some of the pixels in the first segmented image; and the second fusion feature is specifically used to obtain the feature of the image to be processed.
  • In this solution, the object to be processed is the image to be processed. In the process of extracting image features, both the relationships between image blocks and the relationships between pixels (or pixel blocks) are preserved, so the extracted image features can well capture the color features, texture features, shape features and spatial relationship features of the image, which can improve the performance of the visual perception model.
  • the present application provides a feature extraction model.
  • the feature extraction model may include a first feature extraction model and a second feature extraction model.
  • The first feature extraction model is used to obtain a first feature, where the first feature is obtained after the first feature extraction model performs feature extraction on a first vector, the first vector is used to represent the first segmented object, and the first segmented object may include some elements in the object to be processed.
  • The second feature extraction model is used to obtain a plurality of second features, where a second feature is obtained after the second feature extraction model performs feature extraction on a second vector, and the second vector is used to represent some of the elements in the first segmented object.
  • The second feature extraction model is further configured to fuse at least two second features according to the first target weight to obtain the first fusion feature, where the first target weight is determined according to the first parameter value and is positively correlated with the first parameter value.
  • The first parameter value is used to represent the similarity between each of the at least two second features and a target second feature, where the target second feature is any one of the at least two second features; alternatively, the first target weight is a second parameter value, and the second parameter value includes at least one preset constant.
  • the first feature extraction model is further configured to perform fusion processing on the first feature and the first fusion feature to obtain a second fusion feature, and the second fusion feature is used to obtain the feature of the object to be processed.
  • The first feature extraction model is further used to obtain a third feature, where the third feature is obtained after the first feature extraction model performs feature extraction on a third vector, the third vector is used to represent the second segmented object, and the second segmented object may include some elements in the object to be processed.
  • the first feature and the third feature are fused according to the second target weight to obtain the third fusion feature.
  • The first feature extraction model is specifically configured such that the second target weight is determined according to a third parameter value, where the third parameter value is used to represent the similarity between the third feature and the first feature; alternatively, the second target weight is a fourth parameter value, and the fourth parameter value includes at least one preset constant.
  • a fusion process is performed on the third fusion feature and the first fusion feature to obtain the second fusion feature.
  • The first vector is specifically used to represent the first segmented object carrying the first position information, and the first position information is the position information of the first segmented object within the object to be processed.
  • Each second vector is specifically used to represent some elements in the first segmented object carrying the second position information, and the second position information is the position information of those elements within the first segmented object.
  • the first feature extraction model is specifically configured to: perform end-to-end splicing processing on the first feature and the first fusion feature to obtain the second fusion feature.
  • The first feature extraction model is specifically used to perform a target operation on the first feature and the first fusion feature to obtain the second fusion feature, where the target operation may include at least one of addition or multiplication.
  • The first feature extraction model is specifically used to: when there are multiple first fusion features, perform end-to-end splicing processing on the multiple first fusion features to obtain a spliced feature.
  • the spliced feature is mapped to the feature of the target length, and the target length is determined according to the length of the first feature.
  • the first feature and the feature of the target length are added to obtain the second fusion feature.
  • The second feature extraction model is specifically used to take at least two second features as the input of a target model, where the output of the target model is the first fusion feature, and the target model may include one of a self-attention network (Transformer), a convolutional neural network (CNN), or a recurrent neural network (RNN); when the target model is the Transformer, the first target weight is determined according to the inner product between each of the at least two second features and the target second feature.
  • the target model is one of CNN or RNN
  • the first target weight is preset.
  • When the object to be processed is an image to be processed, the first vector is specifically used to represent a first segmented image, and the first segmented image may specifically include some of the pixels in the image to be processed.
  • the second vector is specifically used to represent some of the pixels in the first segmented image
  • the second fusion feature is specifically used to obtain the feature of the image to be processed.
  • The present application provides an image processing method, which may include: acquiring an image to be processed; inputting the image to be processed into a visual perception model to extract image features through a feature extraction model that may be included in the visual perception model, where the feature extraction model is the feature extraction model described in the second aspect or any possible implementation manner of the second aspect; and performing visual perception on the image to be processed according to the image features.
  • performing visual perception of the image to be processed according to the image feature may include: classifying the image to be processed according to the image feature to obtain a classification result of the image to be processed.
  • acquiring the to-be-processed image may include: acquiring the to-be-processed image through a sensor of the vehicle.
  • Visual perception of the to-be-processed image according to the image features may include: performing semantic segmentation on the to-be-processed image according to the image features to obtain the region where the target object is located in the to-be-processed image, and the target object may include one or more of people, vehicles, and road surfaces.
  • acquiring the image to be processed may include: acquiring the image to be processed through a monitoring device.
  • Performing visual perception on the image to be processed according to the image features may include: if it is identified, according to the image features, that the image to be processed includes a person, identifying the attributes of the person according to the image features, where the attributes may include one or more of gender, skin color, age, clothing, and the like.
  • The present application provides an electronic device, which may include a processor, where the processor is coupled to a memory and the memory stores program instructions; when the program instructions stored in the memory are executed by the processor, the method described in the first aspect or any possible implementation manner of the first aspect is implemented.
  • the present application provides a computer-readable storage medium, which may include a program, which, when executed on a computer, causes the computer to execute the method described in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a circuit system, the circuit system may include a processing circuit configured to perform the method as described in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer program product.
  • the computer program product may include instructions.
  • When the instructions are loaded and executed by an electronic device, the electronic device is enabled to execute the method described in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a chip, which is coupled to a memory and configured to execute a program stored in the memory, so as to execute the method described in the first aspect or any possible implementation manner of the first aspect.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence main frame provided by an embodiment of the present application.
  • FIG. 2 is an architectural diagram of a system provided by an embodiment of the present application
  • Fig. 3 is a kind of schematic flow chart of carrying out feature extraction to an image
  • FIG. 4 is a schematic flowchart of a method for feature extraction provided by an embodiment of the present application.
  • FIG. 5 provides a schematic flowchart of acquiring an element set according to an embodiment of the present application
  • FIG. 6 is a schematic flowchart of converting an image block into a vector representation according to an embodiment of the present application
  • FIG. 7 is a schematic diagram of a feature extraction model provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a feature extraction model provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a method for feature extraction provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a feature extraction model provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of an application scenario of a feature extraction method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an application scenario of a feature extraction method provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the architecture of an image classification model provided by an embodiment of the present application.
  • FIG. 14 is a graph of the experimental result of applying a model of a feature extraction method provided by the present application to perform an image classification task
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 16 is another schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 17 is another schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the embodiments of the present application provide a method and apparatus for feature extraction, and the solution provided by the present application can improve the performance and effect of a visual perception model.
  • Fig. 1 shows a schematic structural diagram of the main framework of artificial intelligence. The main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” from the underlying infrastructure of artificial intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system reflects the value brought by artificial intelligence to the information technology industry.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors; computing power is provided by smart chips, which, as an example, include hardware acceleration chips such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), and a field programmable gate array (FPGA); the basic platform includes a distributed computing framework, network-related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data are provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure indicates the source of data in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • After the above data processing, some general capabilities can be formed based on the results of the data processing, such as algorithms or a general system, for example image classification, personalized management of images, personalized management of battery charging, text analysis, computer vision processing, speech recognition, and the like.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, the productization of intelligent information decision-making, and the realization of practical applications. The application fields mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, smart cities, and so on.
  • the embodiments of the present application may be applied to multiple application scenarios in the above-mentioned various fields, for example, it can be applied to the application scenario of natural language search to improve the accuracy of natural language search; it can also be applied to the application scenario of machine translation , so that the translation results are more accurate; it can also be applied to the application scenarios of multiple rounds of dialogues to improve the efficiency of human-machine communication.
  • the embodiments of the present application are mainly applied to application scenarios related to the computer vision field in the above-mentioned various fields.
  • the embodiments of the present application can be applied to face recognition, image classification, target detection, semantic segmentation, key point detection, linear object detection (such as lane line or stop line detection in automatic driving technology), drivable area detection, scene detection identification and other application scenarios.
  • it can be specifically applied to the application scenario of automatic driving.
  • Self-driving vehicles acquire images of the environment around the vehicle through cameras. The image acquired by the camera is segmented to segment the area where different objects such as road surface, roadbed, vehicle, pedestrian, etc. are located from the image, so as to keep the vehicle driving in the correct area.
  • the accuracy of image segmentation is very important to the safety of vehicle driving.
  • the solution provided in this application can improve the accuracy of image segmentation in the field of automatic driving.
  • the embodiments of the present application may be applied in the field of intelligent monitoring.
  • pedestrian attribute recognition is a key task based on images obtained by monitoring equipment.
  • The pedestrian attribute recognition task needs to identify common attributes of pedestrians, such as gender, age, hair, and clothing.
  • This requires the image features to represent more image information, for example by carrying more detailed information of the image. Here, the image features can be extracted by inputting the image obtained by the monitoring device into a feature extraction model, and the image features of the image are extracted through the feature extraction model.
  • the feature extraction model is sometimes referred to as a feature extraction module in this application, and the two have the same meaning.
  • the image acquired by the monitoring device is input into the target model, and the target model is used to perform the task of pedestrian attribute recognition.
  • the target model includes a feature extraction module, and the image is extracted by the feature extraction module. features to enable the target model to identify pedestrian attributes based on the extracted image features.
  • FIG. 2 is an architecture diagram of a system provided by an embodiment of the present application.
  • The system 200 includes an execution device 210, a training device 220, a database 230 and a data storage system 240.
  • a training data set is stored in the database 230, and the database 230 can specifically be represented as a storage medium in any form, and is not limited to a database in the traditional sense.
  • This application does not limit the data type of the training samples, for example, the training samples may be image data, or the training samples may be voice data, or the training samples may be text data. It should be noted that, in general, the data types of the training samples included in the training dataset are the same.
  • the training device 220 generates the first machine learning model/rule 201, and performs iterative training on the first machine learning model/rule 201 by using the training data set in the database to obtain a mature first machine learning model/rule 201.
  • the present application also refers to the first machine learning model/rule 201 as a visual perception model.
  • the training sample is used as the input of the first machine learning model/rule 201
  • The first machine learning model/rule 201 extracts the image features of the image data through the feature extraction model, and uses the extracted image features to iteratively train the first machine learning model/rule 201.
  • The training device may use the training data to train the first machine learning model/rule 201 and thereby obtain a mature first machine learning model/rule 201.
  • The work of each layer of the first machine learning model/rule 201 can be described mathematically. From a physical perspective, the work of each layer in a deep neural network can be understood as completing the transformation from an input space to an output space (that is, from the row space of a matrix to its column space) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a(), so that the work of one layer can be written as y = a(W·x + b).
  • W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training the first machine learning model/rule 201 is to finally obtain the weight matrix of all layers of the trained first machine learning model/rule 201 (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the first machine learning model/rule 201 is essentially learning how to control the spatial transformation, and more specifically, learning a weight matrix.
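  • A minimal illustration of the layer operation y = a(W·x + b) described above: the input vector x is transformed by the weight matrix W, shifted by the bias b, and passed through a non-linearity a(). Taking ReLU as a() is purely an illustrative assumption.

```python
import numpy as np

def layer(x, W, b):
    # one layer of a deep neural network: y = a(W·x + b), with a() taken as ReLU for illustration
    return np.maximum(W @ x + b, 0.0)

x = np.random.randn(8)        # input vector
W = np.random.randn(16, 8)    # this layer's weight matrix (the quantity that is learned)
b = np.zeros(16)              # translation term
y = layer(x, W, b)            # output of this layer, which becomes the input of the next layer
```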
  • the output of the first machine learning model/rule 201 be as close as possible to the value that is really intended to be predicted.
  • the value actually to be predicted is related to the training objective of the first machine learning model/rule 201 or the task that the first machine learning model/rule 201 needs to complete.
  • For example, the first machine learning model/rule 201 is used to perform an image classification task, and the output of the first machine learning model/rule 201 is expected to be as close to the real image classification result as possible. It should be noted that this application focuses on how to make the features extracted by the first machine learning model/rule 201 better represent the information of the object to be processed; the specific task performed with the extracted features is not limited in this application.
  • Therefore, the predicted value of the current network can be compared with the target value that is really expected, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the actually desired target value. A small illustrative training step is sketched below.
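  • The following is an illustration-only training step, assuming a single linear layer and a least-squares objective: the prediction is compared with the expected target, and the weights are nudged in the direction that reduces the difference.

```python
import numpy as np

def train_step(W, x, target, lr=0.1):
    pred = W @ x                      # current prediction of the network
    error = pred - target             # difference between prediction and expected target
    grad = np.outer(error, x)         # gradient of 0.5*||error||^2 with respect to W
    return W - lr * grad              # adjust the weight matrix to lower the loss

W = np.random.randn(4, 8)             # pre-configured (initialized) weights
x, target = np.random.randn(8), np.random.randn(4)
for _ in range(100):
    W = train_step(W, x, target)      # keep adjusting until predictions approach the target
```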
  • the execution device 210 may call data, codes, etc. in the data storage system 240 , and may also store data, instructions, etc. in the data storage system 240 .
  • the data storage system 240 may be configured in the execution device 210 , or may be a memory outside the execution device 210 .
  • the execution device 210 may invoke the mature first machine learning model/rule 201 to extract the features of the object to be processed, and perform specific tasks according to the extracted features of the object to be processed.
  • the data type of the object to be processed and the data type of the training sample are generally the same.
  • the specific tasks are determined according to the training tasks in the training phase.
  • the first machine learning model/rule 201 is iteratively trained by using the training data set in the database, so that the mature first machine learning model/rule 201 can perform feature extraction on the image, and execute the image according to the extracted features. classification task.
  • the execution device 210 can invoke the mature first machine learning model/rule 201 to extract the features of the image, and perform the image classification task according to the extracted image features.
  • a “user” may directly interact with the execution device 210 , that is, the execution device 210 and the client device are integrated in the same device.
  • the execution device 210 may be represented as a terminal device, such as a mobile phone, a camera, a smart home, and so on.
  • the user can input the object to be processed through the execution device 210 , for example, the user takes a picture with a camera, and the image obtained by the camera is used as the input of the mature first machine learning model/rule 201 .
  • The execution device 210 may be embodied as an execution device configured with a display screen. In the inference stage, after completing a task (or multiple tasks), the execution device 210 can show the user the output of the first machine learning model/rule 201. For example, after executing an image classification task, the execution device 210 displays the result of the image classification to the user.
  • The execution device 210 may also take other forms, which are not listed here. FIG. 2 is only a schematic architecture diagram provided by the embodiment of the present invention, and the positional relationship among the devices, modules, etc. shown in the figure does not constitute any limitation.
  • the execution device 210 and the client device may be separate devices, the execution device 210 is configured with an input/output interface, and performs data interaction with the client device.
  • The client device inputs at least one task to the execution device 210 through the input/output interface, and the execution device 210 returns the processing result to the client device through the input/output interface.
  • Both the training phase and the inference phase involve feature extraction of the object to be processed. Therefore, the solutions provided in this application can be executed by both the training device 220 and the execution device 210.
  • the input requirements for some of the first machine learning models/rules 201 are one-dimensional vectors.
  • For models such as the self-attention network (Transformer), the long short-term memory (LSTM) neural network, and the gated recurrent unit (GRU) network, the required input is a one-dimensional vector.
  • However, the objects to be processed are often multi-dimensional tensors; for example, images are usually three-dimensional tensors. Therefore, it is necessary to preprocess the object to be processed and convert the tensor into vectors before it can be used as the input of these models.
  • the image is divided into nine image blocks, and each image block includes 1/9 of the image, that is, includes 1/9 of the pixels in the image.
  • the image block is converted into a vector representation, and the image block represented by the vector is used as the input of the Transformer.
  • Transformer is a neural network based on self-attention mechanism. For an image block, when Transformer extracts image features, it can establish the association between this image block and all input image blocks.
  • However, this method destroys the internal structure of the image: it only considers the relationships between image blocks, but does not consider the relationships between pixels.
  • After an image block is converted into a vector representation, some of the associations between pixels are lost. For example, pixels that were originally adjacent are no longer adjacent after the image block is converted into a vector representation, so the adjacency relationship between those pixels is lost. Furthermore, trying to solve this problem by dividing the image into small enough blocks leads to a new problem: the increase in the number of image blocks greatly increases the amount of computation, which reduces the efficiency of model training and the efficiency of prediction by the trained model.
  • an embodiment of the present application provides a method for feature extraction, so that the first machine learning model/rule 201 includes at least two self-attention modules, wherein one self-attention module is used to establish a relationship between an image block and an image block. Another self-attention module is used to establish the correlation between pixels and pixels, which can improve the performance of the model.
  • FIG. 4 it is a schematic flowchart of a method for feature extraction provided by an embodiment of the present application.
  • a method for feature extraction provided by an embodiment of the present application may include the following steps:
  • the data type of the object to be processed may be image data (hereinafter abbreviated as image), text data (hereinafter abbreviated as text), and voice data (hereinafter abbreviated as voice).
  • A segmented object includes some elements of the object to be processed: when the object to be processed is an image, some elements of the object to be processed refer to some pixels in the image; when the object to be processed is text or voice, some elements of the object to be processed refer to characters or words in the text or voice.
  • the object to be processed in this application is image data.
  • a method for feature extraction provided by the present application will be introduced by taking the object to be processed as image data as an example.
  • The segmented images are hereinafter referred to as image blocks; each image block includes some pixels in the image, and all the image blocks together form the complete image.
  • the image may be uniformly segmented, so that each segmented image block includes the same number of pixels.
  • Alternatively, the image may be segmented non-uniformly, so that the numbers of pixels included in the segmented image blocks are not exactly the same: some image blocks may include the same number of pixels while others differ, or all image blocks may include different numbers of pixels.
  • all the pixels included in each image block may be adjacent pixels, or some pixels may be adjacent pixels, and some pixels may not be adjacent pixels. Among them, adjacent pixels means that the spatial positional relationship between pixels in the complete image is adjacent.
  • In this embodiment, the image can be divided evenly, with all the pixels included in each image block being adjacent pixels. For an image that is evenly divided into n image blocks, this can be understood with reference to Formula 1-1:
  • Formula 1-1: X → [X_1, X_2, ..., X_n] ∈ R^(n×p×p×3)
  • Here X represents the image to be processed; X_1 to X_n respectively represent the image blocks after segmentation; n is a positive integer greater than 1 and represents the number of image blocks after segmentation; R represents the tensor space, and the size of the tensor is n×p×p×3, where the size of each image block is p×p×3, p×p represents the two spatial dimensions of an image block, and 3 represents the channel dimension. For example, the pixel values of each image block may be red, green and blue (RGB) color values, in which case the channel dimension of the image block is 3, and a pixel value may be a long integer representing the color.
  • Dividing the image to be processed into multiple image blocks helps to speed up the extraction of image features: the model can process multiple image blocks in parallel and extract the image features of multiple image blocks at the same time. A small sketch of this segmentation follows.
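  • A minimal sketch of the segmentation in Formula 1-1: an H×W×3 image is evenly divided into n = (H/p)·(W/p) image blocks of size p×p×3. The sketch assumes, for simplicity, that H and W are divisible by p.

```python
import numpy as np

def split_into_blocks(image, p):
    """Evenly split an (H, W, 3) image into (n, p, p, 3) image blocks."""
    H, W, C = image.shape
    blocks = image.reshape(H // p, p, W // p, p, C)   # cut rows and columns into p-sized tiles
    blocks = blocks.transpose(0, 2, 1, 3, 4)          # group the two tile indices together
    return blocks.reshape(-1, p, p, C)                # (n, p, p, 3), as in Formula 1-1

image = np.random.rand(48, 48, 3)
blocks = split_into_blocks(image, p=16)               # 9 image blocks, as in the example of FIG. 3
```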
  • Each element set includes some elements in the segmented object to be processed. For example, for each image block, an element set is obtained, and each element set includes some pixels in each image block. For the convenience of description, some of the pixels in the image block are referred to as pixel blocks below.
  • For each image block, multiple pixel blocks may be obtained, and any two pixel blocks among the multiple pixel blocks may include the same or different numbers of pixels. Moreover, the pixels included in each pixel block may be adjacent pixels or non-adjacent pixels. Exemplarily, this can be understood with reference to Formula 1-2:
  • Formula 1-2: X_i → Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,m}] ∈ R^(m×c)
  • Here n is a positive integer greater than 1 and represents the number of image blocks after segmentation; m is a positive integer greater than 1 and represents the number of pixel blocks included in one image block; c represents the length of the vector corresponding to a pixel block (the notation y_{i,j}, denoting the vector of the j-th pixel block of the i-th image block, is introduced here for illustration).
  • For the n image blocks there are n pixel block groups, which can be understood with reference to Formula 1-3:
  • Formula 1-3: [Y_1, Y_2, ..., Y_n] ∈ R^(n×m×c)
  • pixels included in any two pixel blocks in the multiple pixel blocks may overlap.
  • the collection of elements can be obtained through image matrix column conversion (image to column, im2col).
  • im2col mainly converts the data contained in each window in the image data into a column vector, and finally arranges it into a new matrix by column.
  • In FIG. 5, each number represents a pixel, and the channel dimension of each pixel is not shown in FIG. 5.
  • The size of the window can be customized, such as the 3×3 window in FIG. 5; it can also be customized to other sizes, such as 2×2 or 4×4, which is not limited in this embodiment of the present application.
  • The sliding distance of the window can also be customized, such as a distance of 1 pixel per slide or a distance of 2 pixels per slide, which is not limited in this embodiment of the present application.
  • All the pixels included in one window can be regarded as a pixel block. Since each pixel has a channel dimension, after expansion each pixel corresponds to multiple positions in the column vector. For example, if each pixel includes three channels (red, green and blue), then after the pixel block is expanded, each pixel corresponds to the positions of three elements in the column vector.
  • Each pixel block can be converted into a row vector or a column vector, as shown in Figure 5, which shows the process of converting a pixel block into a column vector.
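  • A small im2col sketch matching the description above: each k×k window of an image block is expanded into a column vector (channels included), and the columns are arranged into a new matrix. The window size and stride used here are illustrative choices, not values fixed by the present application.

```python
import numpy as np

def im2col(block, k=3, stride=1):
    """Convert each k x k window of an (H, W, C) image block into a column vector."""
    H, W, C = block.shape
    cols = []
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            window = block[i:i + k, j:j + k, :]   # one k x k window (a pixel block)
            cols.append(window.reshape(-1))       # expand to a column vector of length k*k*C
    return np.stack(cols, axis=1)                 # (k*k*C, number_of_windows)

block = np.random.rand(16, 16, 3)                 # one image block
columns = im2col(block, k=3, stride=1)            # each column represents one pixel block
```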
  • the first vector is used to represent the segmented object to be processed, for example, the first vector is used to represent the image blocks mentioned in step 401 and step 402 .
  • the second vector is used to represent some elements in the segmented object, for example, the second vector is used to represent the pixel blocks mentioned in step 401 and step 402 .
  • the first feature extraction model and the second feature extraction model may be understood as multiple feature extraction modules in the first machine learning model/rule 201 mentioned above.
  • the first feature extraction model and the second feature extraction model may be CNN or RNN.
  • the first feature extraction model includes a plurality of feature extraction modules
  • the second feature extraction model includes a plurality of feature extraction modules.
  • The multiple feature extraction modules are connected end to end, and the output of the former feature extraction module is used as the input of the latter feature extraction module, so that the latter feature extraction module continues to perform feature extraction.
  • Each feature extraction module has a specific weight matrix, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix, traverses the input through the weight matrix, and completes the work of extracting specific features from the image.
  • the output of the previous feature extraction module can be regarded as the first feature.
  • image features mainly include color features, texture features, shape features and spatial relationship features of images.
  • The color feature is a global feature that describes the surface properties of the scene corresponding to the image or image area. The general color feature is a pixel-based feature, and all pixels belonging to the image or image area contribute to it. Since color is not sensitive to changes in the orientation, size, etc. of the image or image area, color features do not capture the local details of objects well.
  • The texture feature is also a global feature, which likewise describes the surface properties of the scene corresponding to the image or image area; however, since texture only characterizes the surface of an object and cannot fully reflect its essential properties, high-level image content cannot be obtained using texture features alone.
  • texture features are not pixel-based features, and require statistical calculations in areas containing multiple pixels. There are two types of representation methods for shape features, one is contour feature, the other is regional feature.
  • the contour feature of the image is mainly aimed at the outer boundary of the object, while the regional feature of the image is related to the entire shape area;
  • The spatial relationship feature refers to the mutual spatial position or relative direction relationship between multiple objects segmented from the image; these relationships can be divided into connection/adjacency relationships, overlapping relationships, inclusion relationships, and the like.
  • spatial location information can be divided into two categories: relative spatial location information and absolute spatial location information.
  • the former relationship emphasizes the relative situation between targets, such as the relationship between up, down, left and right, etc.
  • the latter relationship emphasizes the distance and orientation between the targets.
  • the image features listed above can be used as some examples of specific features in the image, and the image can also have other features, such as higher-level features: semantic features, which will not be expanded here.
  • Fusion of at least two second features is performed according to the first target weight to obtain a first fusion feature.
  • the feature extraction module fuses at least two second features according to the first target weight to obtain the first fusion feature.
  • the purpose of acquiring the first fusion feature is to establish the connection between pixel blocks and pixel blocks.
  • establishing an association relationship between a pixel block and a pixel block can be understood as taking into account the influence of one or more other pixel blocks on the pixel block when extracting the image features of one pixel block. The greater the influence of the other one or more pixel blocks on the pixel block, the greater the weight, and the less the influence of the other one or more pixel blocks on the pixel block, the smaller the weight.
  • the influence of one or more pixel blocks on the pixel block can be measured in different ways.
  • the similarity between the vectors corresponding to the two pixel blocks can be measured.
  • the vector corresponding to the pixel block to be processed and the vectors corresponding to the pixel blocks adjacent to the pixel block to be processed may be averaged, and the average value may be superimposed on the vector corresponding to the pixel block to be processed.
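The two ways of measuring influence just described, inner-product similarity and simple averaging with neighboring pixel blocks, can be sketched as follows. The function names and the softmax normalization of the inner products are illustrative assumptions, not the exact computation used by the embodiment:

```python
import numpy as np

def fuse_by_inner_product(second_features, target_idx):
    """Weight each second feature by its (softmax-normalised) inner product with
    the target second feature, then sum: one possible first fusion feature."""
    F = np.asarray(second_features)      # shape (m, c): m pixel-block features
    sims = F @ F[target_idx]             # inner products with the target feature
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()             # larger similarity -> larger weight
    return weights @ F                   # weighted sum over the m features

def fuse_by_average(second_features, target_idx, neighbor_idx):
    """Alternative: superimpose the mean of the target and its neighbours onto the target."""
    F = np.asarray(second_features)
    mean = F[[target_idx] + list(neighbor_idx)].mean(axis=0)
    return F[target_idx] + mean
```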
  • the second feature extraction model may be a neural network with a self-attention mechanism, for example, the second feature extraction model may be a Transformer.
  • the second feature extraction model is a Transformer
  • the first fusion feature can be obtained.
  • The multiple feature extraction modules in the second feature extraction model may specifically be multiple feature extraction blocks (blocks) used for feature processing; the blocks are connected end to end, and the output of the former block is used as the input of the latter block, so that the latter block can continue to perform feature extraction.
  • each block has a specific weight matrix, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix, and traverses the input through the weight matrix to complete the extraction of specific features from the image.
  • each block of the second feature extraction model performs self-attention calculation on multiple pixel blocks, and takes into account the influence of each pixel block on the currently processed pixel block.
  • FIG. 7 is a schematic diagram of the architecture of the Transformer.
  • A block generally includes a normalization processing module, which is used to normalize the input. Normalization can be understood as making the mean value of the input data 0 and the standard deviation 1, so that the loss value decreases smoothly during each training pass.
  • the output of the normalization processing module can be regarded as the second feature
  • a block can also include a self-attention module, and the output of the normalized processing module is used as the input of the self-attention module.
  • the self-attention calculation is performed on multiple pixel blocks through the self-attention module to establish the association between the pixel blocks and the pixel blocks.
  • A block can also include another normalization processing module, which normalizes the output of the self-attention module, so that the loss value can decrease more smoothly during each training pass.
  • the output of the previous block of the current block can be regarded as the second feature
  • the input of the previous block can be regarded as the second vector
  • the output of the previous block can be regarded as the input of the current block
  • the output of the current block can be regarded as the second vector.
  • the output is the first fusion feature.
  • the output of the current block can be regarded as the second feature
  • the output of the latter block is the first fusion feature. It should be noted that when the current block is the first block, the input of the current block is not the second feature, but can only be the second vector.
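As a rough sketch of the block structure described above (normalization, self-attention over the pixel-block vectors, and a second normalization), the following PyTorch snippet may help; the dimensions and the residual connection are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One feature extraction block: LayerNorm -> self-attention -> LayerNorm."""
    def __init__(self, dim=24, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, m pixel blocks, dim)
        h = self.norm1(x)              # normalised input ("second feature")
        h, _ = self.attn(h, h, h)      # self-attention links pixel blocks to each other
        x = x + h                      # residual connection (assumed)
        return self.norm2(x)           # output of the block ("first fusion feature")

x = torch.randn(1, 9, 24)              # 9 pixel blocks, 24-dimensional vectors
print(Block()(x).shape)                # torch.Size([1, 9, 24])
```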
  • the first feature extraction model and the second feature extraction model are the aforementioned models whose input requirements are one-dimensional vectors.
  • The first feature extraction model may be one of a Transformer, a GRU, or an LSTM.
  • The second feature extraction model may be one of a Transformer, a GRU, or an LSTM.
  • The input requirements of these feature extraction models are one-dimensional vectors. For these models, after the image blocks and pixel blocks are obtained through steps 401 and 402, the image blocks also need to be converted into vector representations, and the image blocks represented by vectors are used as the input of the first feature extraction model; likewise, the pixel blocks are converted into vector representations, and the pixel blocks represented by vectors are used as the input of the second feature extraction model.
  • the conversion of the image block into a vector representation can have multiple representations.
  • The pixels of each row of the image block can be spliced end to end. Since each pixel has a channel dimension, after expansion each pixel corresponds to multiple positions in a column vector.
  • For example, each pixel may include the three channels red, green and blue, and after the pixel block is expanded, each pixel corresponds to the positions of three elements in the column vector.
  • Each image block can be converted into a row vector or a column vector. After the vectors corresponding to all image blocks are sorted by row or column, the vector matrix of the image block can be obtained.
  • the way of converting a pixel block into a vector representation can be understood by referring to the way of converting an image block into a vector representation.
  • All the pixels included in each window are expanded into one column vector; this yields multiple column vectors, which are then sorted by column to obtain the vector matrix of the pixel blocks.
  • the first feature extraction model and the second feature extraction model may also have requirements on the size of the input vector, for example, only a vector with a preset length can be used as the input of the first feature extraction model and the second feature extraction model.
  • Mapping processing is performed on the converted vector of each image block, so as to map it into a vector of the preset length that satisfies the input requirements of the first feature extraction model; mapping processing also needs to be performed on the converted vector of each pixel block, so as to map it into a vector of the preset length that satisfies the input requirements of the second feature extraction model. A sketch of this flattening and mapping is given below.
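A minimal sketch of the flattening and length mapping, assuming an illustrative 16×16×3 image block and a hypothetical preset length of 384; in practice the projection would be a learned layer rather than a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

patch = rng.standard_normal((16, 16, 3))   # one image block: 16 x 16 RGB pixels
flat = patch.reshape(-1)                    # rows spliced end to end: length 768

preset_len = 384                            # input length required by the model (assumed)
W = rng.standard_normal((preset_len, flat.size)) * 0.02   # stand-in for a learned projection
mapped = W @ flat                           # vector of the preset length
print(mapped.shape)                         # (384,)
```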
  • the first vector in this application is used to represent the segmented object to be processed, for example, the first vector is used to represent the image blocks mentioned in step 401 and step 402 .
  • the second vector is used to represent some elements in the segmented object, for example, the second vector is used to represent the pixel blocks mentioned in step 401 and step 402 . If the first vector is used as the input of the first feature extraction model and the second vector is used as the input of the second feature extraction model, the first vector also meets the input requirements of the first feature extraction model, and the second vector also meets the second feature extraction model. input requirements.
  • the solution provided by the present application can realize the fusion processing of the first feature and the first fusion feature in various ways. The following will explain from two aspects of fusion timing and fusion method.
  • The solution provided by this application makes the first machine learning model/rule 201 include two feature extraction models, namely the first feature extraction model and the second feature extraction model.
  • the second feature extraction model establishes an association relationship between pixels (or pixel blocks) and pixels (or pixel blocks), which can be understood by referring to step 403 for details.
  • the first feature extraction model can establish an association relationship between image blocks and image blocks. How to establish an association relationship between image blocks and image blocks can be understood by referring to how to establish an association relationship between pixel blocks and pixel blocks.
  • The first feature extraction model described above includes multiple feature extraction modules, wherein, for the current feature extraction module, the output of the previous feature extraction module is used to obtain the input of the current feature extraction module.
  • This input can be regarded as the first vector.
  • In one possible implementation, the first feature output by the previous feature extraction module in the first feature extraction model and the first fusion feature output by the current feature extraction module in the second feature extraction model are fused to obtain a second fusion feature, and the second fusion feature is used as the input of the current feature extraction module in the first feature extraction model.
  • In another possible implementation, the first feature output by the previous feature extraction module in the first feature extraction model is used as the input of the current feature extraction module in the first feature extraction model; after the output of the current feature extraction module in the first feature extraction model and the first fusion feature output by the current feature extraction module in the second feature extraction model are fused, the resulting second fusion feature is used as the input of the latter feature extraction module in the first feature extraction model.
  • a head-to-tail splicing process may be performed on the multiple first fusion features to obtain the spliced features.
  • The spliced feature is mapped to a feature of the target length, where the target length is determined according to the length of the first feature; if the lengths of the spliced feature and the first feature are already the same, the two can be added directly. The first feature and the feature of the target length are added to obtain the second fusion feature.
  • the first feature and the first fused feature are spliced end to end to obtain the second fused feature, for example, the first feature and the spliced feature are spliced to obtain the second fused feature.
  • a target operation is performed on the first feature and the first fusion feature to obtain the second fusion feature, and the target operation includes at least one of addition or multiplication.
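The fusion options listed above (splicing multiple first fusion features, mapping to the target length, and element-wise addition) can be sketched as follows; the feature lengths and the random projection standing in for the learned mapping are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

first_feature = rng.standard_normal(384)                    # output of the previous module in the first model
first_fusion = [rng.standard_normal(24) for _ in range(9)]  # one first fusion feature per pixel block

spliced = np.concatenate(first_fusion)                       # head-to-tail splicing: length 216
proj = rng.standard_normal((first_feature.size, spliced.size)) * 0.02
second_fusion = first_feature + proj @ spliced               # map to the target length, then add
print(second_fusion.shape)                                    # (384,)
```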
  • the second fusion feature is used to determine the final feature of the object to be processed.
  • the second fusion feature output by the last feature extraction module among the multiple feature extraction modules in the first feature extraction model is used to determine The final extracted features of the object to be processed.
  • the last feature extraction module will output the corresponding second fusion feature, and the set of the second fusion feature is the final feature of the object to be processed.
  • weighting processing is performed on the second fusion feature corresponding to each image block output by the last feature extraction module, and the result of the weighting processing is used as the final feature of the object to be processed.
  • the text block may be understood as including some elements in the text data to be processed, such as including some adjacent words in the text data to be processed.
  • a word block may be understood to include some elements in the text block, such as including some adjacent words in the text block.
  • The position information of the image blocks and pixel blocks can also be retained in the process of extracting image features by the model, which is described below with reference to an embodiment.
  • FIG. 9 it is a schematic flowchart of a method for feature extraction provided by an embodiment of the present application.
  • a method for feature extraction provided by an embodiment of the present application may include the following steps:
  • Steps 901 and 902 can be understood with reference to 401 and 402 in the embodiment corresponding to FIG. 4 , and details are not repeated here.
  • the first position information is position information of the segmented object in the object to be processed.
  • The first position information is the position information of the image block in the image to be processed.
  • the second position information is the position information of some elements in the segmented object in the segmented object.
  • The second position information is the position information of the pixel block in the image block.
  • the first position information may be represented by coordinate information of one pixel or may be represented by coordinate information of multiple pixels.
  • the position information of each image block can be represented by the coordinates of the upper left corner of each image block.
  • the position information of each image block may be represented by the coordinates of the upper left pixel and the coordinates of the lower right pixel of each image block.
  • The coordinates of the pixel in the upper left corner and the coordinates of the pixel in the lower right corner are only examples, used to illustrate that the first position information can be represented by the coordinate information of one pixel or of multiple pixels, and do not represent a limitation of the solutions provided in this application.
  • the second position information may be represented by the coordinate information of one pixel or may be represented by the coordinate information of a plurality of pixels.
  • the position information in one pixel block may be represented by the coordinates of all the pixels included in the pixel block.
  • the first position information and the second position information may also be represented by a coding vector.
  • the first machine learning model/rule 201 may include a position encoding module, and in an initial state, the position encoding module may randomly set a vector to represent the position information of each image block.
  • The parameters of the position encoding module can be updated according to the loss value, so that the vector encoded by the position encoding module to represent the position information of an image block can be closer to the real position information of the image block.
  • Fusing the first position information into the first vector and fusing the second position information into the second vector can be understood as updating X_n in Formula 1-1 and Y_0^i in Formula 1-3; please refer to Formulas 1-4 and 1-5 for details.
  • E position-patch is used to represent the first position information
  • E position-pixel is used to represent the second position information
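A possible reading of this position fusion is sketched below, with learnable position vectors added to the image-block vectors and pixel-block vectors; the shapes and the exact form of Formulas 1-4 and 1-5 are assumptions here, since the formulas themselves are not reproduced above:

```python
import torch
import torch.nn as nn

n, m, c, d = 4, 9, 24, 384                       # image blocks, pixel blocks per image block, dims (assumed)

patch_vecs = torch.randn(n, d)                   # first vectors (image blocks)
pixel_vecs = torch.randn(n, m, c)                # second vectors (pixel blocks)

E_position_patch = nn.Parameter(torch.zeros(n, d))      # first position information, learned
E_position_pixel = nn.Parameter(torch.zeros(1, m, c))   # second position information, shared per image block

patch_vecs = patch_vecs + E_position_patch       # update X_n, as in Formula 1-4
pixel_vecs = pixel_vecs + E_position_pixel       # update Y_0^i, as in Formula 1-5
```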
  • Fusion of at least two second features is performed according to the first target weight to obtain a first fusion feature.
  • Steps 904 to 906 can be understood with reference to steps 403 to 405 in the embodiment corresponding to FIG. 4; the difference is that the embodiment corresponding to FIG. 9 provides more information to both the first feature extraction model and the second feature extraction model, specifically the first position information and the second position information. The more information the first feature extraction model and the second feature extraction model can obtain, the more it helps them learn, so as to better extract image features.
  • The solution provided by the present application makes the first machine learning model/rule 201 include the first feature extraction model and the second feature extraction model, so that the association between image blocks and image blocks is preserved, and the association between pixels (or pixel blocks) and pixels (or pixel blocks) is preserved; the image features extracted by the first machine learning model/rule 201 can thus well capture the color features, texture features, shape features, and spatial relationship features of the image, improving the performance of the first machine learning model/rule 201.
  • the first machine learning model/rule 201 may further include a greater number of first feature extraction models and a greater number of second feature extraction models. The following description will be given with reference to a specific embodiment.
  • The model may include multiple feature extraction models, such as feature extraction model 1, feature extraction model 2, and feature extraction model 3 shown in FIG. 10.
  • feature extraction model 1 is equivalent to the first feature extraction model
  • feature extraction model 2 is equivalent to the second feature extraction model
  • feature extraction model 3 is equivalent to the second feature extraction model
  • The to-be-processed image may be segmented multiple times. For example, in one segmentation, the to-be-processed image is segmented into 4 image blocks, and after the 4 image blocks are preprocessed to satisfy the input requirements of feature extraction model 1, they are used as the input of feature extraction model 1; in another segmentation, the to-be-processed image is segmented into 16 image blocks, and after the 16 image blocks are preprocessed to satisfy the input requirements of feature extraction model 2, they are used as the input of feature extraction model 2; in yet another segmentation, the to-be-processed image is segmented into 64 image blocks, and after the 64 image blocks are preprocessed to satisfy the input requirements of feature extraction model 3, they are used as the input of feature extraction model 3. It should be noted that, in a possible implementation, multiple feature extraction models, such as feature extraction model 1, feature extraction model 2, and feature extraction model 3, may be executed in parallel at the same time, as sketched below.
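A sketch of this multi-granularity segmentation, with hypothetical model indices and block counts of 4, 16 and 64; the stand-in `split` helper simply tiles the image evenly:

```python
import numpy as np

def split(image, blocks_per_side):
    """Split an H x W x C image into blocks_per_side**2 equal image blocks."""
    H, W, C = image.shape
    h, w = H // blocks_per_side, W // blocks_per_side
    return [image[i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(blocks_per_side) for j in range(blocks_per_side)]

image = np.zeros((64, 64, 3))
granularities = {1: 2, 2: 4, 3: 8}               # model index -> blocks per side (4, 16, 64 blocks)
inputs = {k: split(image, s) for k, s in granularities.items()}
print({k: len(v) for k, v in inputs.items()})    # {1: 4, 2: 16, 3: 64}
```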
  • the performance of the feature extraction model can be improved, so that the extracted image features can better represent image information.
  • The more image information the image features can represent, the more beneficial it is to improving the accuracy of visual analysis tasks.
  • the solution provided by this application is introduced below by taking the application of the solution provided by this application to several typical visual analysis tasks as an example.
  • the mature first machine learning model/rule 201 can be deployed on the automatic driving vehicle or on the cloud device.
  • the autonomous vehicle obtains the image of the environment around the vehicle through the camera
  • The obtained image is input into the preprocessing module, so that the preprocessing module can segment the image to obtain image blocks and pixel blocks, and the obtained image blocks and pixel blocks are converted into vectors that meet the input requirements of the first feature extraction model and the second feature extraction model.
  • the preprocessing module can be regarded as a part of the mature first machine learning model/rule 201, or can be regarded as a separate part.
  • the preprocessing module can be deployed on the autonomous vehicle, and the mature first machine learning model/rule 201 can be deployed on the cloud device.
  • The first machine learning model/rule 201 uses the first feature extraction model and the second feature extraction model to perform feature extraction on the image of the environment around the vehicle obtained by the camera. Because both the association between image blocks and image blocks and the association between pixel blocks and pixel blocks are preserved during feature extraction, the extracted image features can better represent the area where each object is located in the environment around the vehicle.
  • This is beneficial to the semantic segmentation model in the first machine learning model/rule 201, which segments the image of the environment around the vehicle obtained by the camera according to the extracted features, so as to separate the areas where different objects such as the road surface, roadbed, vehicles and pedestrians are located from the image and keep the vehicle driving in the correct area.
  • the mature first machine learning model/rule 201 can be deployed on the intelligent monitoring device or on the cloud device.
  • the image obtained by the monitoring device (such as through the camera A, the camera B, and the camera C shown in FIG. 12 ) is input into the preprocessing module, so that the preprocessing module performs segmentation processing on the image to obtain image blocks and pixel blocks,
  • the acquired image blocks and pixel blocks are converted into vectors that meet the input requirements of the first feature extraction model and the second feature extraction model.
  • the preprocessing module can be regarded as a part of the mature first machine learning model/rule 201, or can be regarded as a separate part.
  • the preprocessing module can be deployed on the monitoring device, and the mature first machine learning model/rule 201 can be deployed on the cloud device.
  • The first machine learning model/rule 201 performs feature extraction on the image acquired by the intelligent monitoring device through the first feature extraction model and the second feature extraction model. Because both the association between image blocks and image blocks and the association between pixel blocks and pixel blocks are preserved during feature extraction, the extracted image features can better represent the features of objects appearing within the sensing range of the intelligent monitoring device.
  • The image features extracted by the solution provided in this application can better characterize the attributes and detailed characteristics of pedestrians, which helps the attribute recognition model in the first machine learning model/rule 201 identify pedestrian attributes in the images obtained by the intelligent monitoring device according to the extracted features, such as the gender, age, hair color, clothes and clothing of a pedestrian.
  • the results of pedestrian attribute recognition can be displayed on the end-side device, or stored in the server.
  • the first machine learning model/rule 201 is used to perform an image classification task. As shown in FIG. 13 , it is a schematic flowchart of performing an image classification task through the first machine learning model/rule 201 .
  • the first machine learning model/rule 201 includes a plurality of target feature extraction models, and each target feature extraction model includes a first feature extraction model and a second feature extraction model.
  • L represents a positive integer.
  • the first machine learning model/rule 201 may further include an image preprocessing module, or the image preprocessing module may also serve as a module independent of the first machine learning model/rule 201 .
  • the image preprocessing module divides the image to obtain image blocks and pixel blocks, and converts the obtained image blocks and pixel blocks into vectors that meet the input requirements of the first feature extraction model and the second feature extraction model.
  • weighting processing may also be performed on vectors corresponding to multiple image blocks that meet the input requirements of the first feature extraction model, and the result is also used as the input of the first feature extraction model.
  • The first machine learning model/rule 201 may also include a multi-layer perceptron head (MLP head), which is used to perform an image classification task according to the output of the last target feature extraction model and output a classification result; for example, in the scene corresponding to FIG. 13, the output classification result is "house".
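A minimal sketch of such an MLP head, assuming an illustrative feature length of 384 and 1000 classes; the actual head structure is not specified beyond the description above:

```python
import torch
import torch.nn as nn

mlp_head = nn.Sequential(
    nn.LayerNorm(384),
    nn.Linear(384, 1000),        # 1000 classes, e.g. ImageNet
)

features = torch.randn(1, 384)   # feature of the image to be processed (e.g. weighted image-block features)
logits = mlp_head(features)
print(logits.argmax(dim=-1))     # index of the predicted class, e.g. "house"
```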
  • Parameter one is used to indicate the number of feature extraction modules included in each feature extraction model; that is, the first feature extraction model includes 12 feature extraction modules, and the second feature extraction model includes 12 feature extraction modules.
  • Parameter two is used to indicate the requirement of the second feature extraction model on the length of the input vector.
  • Parameter three is used to indicate the number of heads of the multi-head self-attention in the self-attention module of the second feature extraction model.
  • Parameter four is used to indicate the requirement of the first feature extraction model on the length of the input vector.
  • Parameter five is used to indicate the number of heads of the multi-head self-attention in the self-attention module of the first feature extraction model.
  • Parameter six is used to indicate the total number of parameters, in millions, of the first machine learning model/rule 201 to which the feature extraction method provided by this application is applied.
  • Parameter seven is used to indicate the number of floating point operations (FLOPs), in billions.
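For illustration, parameters one to five could be collected into a configuration such as the following; the concrete values other than the module counts of 12 are assumptions, since they are not given in the text above:

```python
# Hypothetical configuration mirroring parameters one to five above.
config = {
    "outer_depth": 12,    # parameter one: feature extraction modules in the first model
    "inner_depth": 12,    # parameter one: feature extraction modules in the second model
    "inner_dim": 24,      # parameter two: input vector length of the second model (assumed)
    "inner_heads": 4,     # parameter three: self-attention heads in the second model (assumed)
    "outer_dim": 384,     # parameter four: input vector length of the first model (assumed)
    "outer_heads": 6,     # parameter five: self-attention heads in the first model (assumed)
}
```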
  • the test data set is the ImageNet data set, and the test experiment of image classification is carried out on the ImageNet data set.
  • the test results are shown in Figure 14.
  • In the process of extracting image features through the first machine learning model/rule 201, the solution provided by the present application preserves the association between image blocks and image blocks and the relationship between pixels (or pixel blocks) and pixels (or pixel blocks), so that the image features extracted by the first machine learning model/rule 201 can well capture the color features, texture features, shape features and spatial relationship features of the image, which can further improve the classification accuracy of the first machine learning model/rule 201.
  • The first machine learning model/rule 201 to which the feature extraction method provided by the present application is applied therefore classifies images more accurately.
  • The test results also show that, compared with existing image classification models, the first machine learning model/rule 201 using the feature extraction method provided by the present application requires less computation; in other words, the first machine learning model/rule 201 applying the feature extraction method provided by this application is more efficient. Specifically, when several existing models and the first machine learning model/rule 201 applying the feature extraction method provided by this application are used for image classification, the first machine learning model/rule 201 to which the feature extraction method provided in this application is applied requires less computation.
  • A feature extraction method provided by an embodiment of the present application has been introduced above. With this method, the extracted features of the object to be processed can better represent the object to be processed, which can further improve the performance of the model to which this feature extraction method is applied.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device may include a first acquisition module 1501 , a second acquisition module 1502 , a first fusion module 1503 , a second fusion module 1504 and a third fusion module 1505 .
  • the first acquisition module 1501 is used to acquire the first feature
  • the second acquisition module 1502 is used to acquire a plurality of second features
  • the first features are obtained after feature extraction is performed on the first vector through the first feature extraction model.
  • the first vector is used to represent the first segmented object
  • the first segmented object includes some elements in the object to be processed
  • the second feature is obtained after feature extraction is performed on the second vector by the second feature extraction model.
  • The second vector is used to represent some elements in the first segmented object.
  • the first fusion module 1503 is configured to fuse at least two second features according to the first target weight to obtain the first fusion feature.
  • The first target weight is determined according to the influence of each second feature in the at least two second features on the target second feature, and the target second feature is any one of the at least two second features.
  • the second fusion module 1504 is configured to perform fusion processing on the first feature and the first fusion feature to obtain a second fusion feature, and the second fusion feature is used to obtain the feature of the object to be processed.
  • the first obtaining module 1501 is further configured to obtain a third feature, the third feature is obtained after feature extraction is performed on a third vector by the first feature extraction model, and the third vector is used to represent The second segmented object, where the second segmented object includes some elements in the object to be processed.
  • the third fusion module 1505 is configured to fuse the first feature and the third feature according to the second target weight to obtain the third fusion feature, and the second target weight is determined according to the influence of the third feature on the first feature.
  • the second fusion module 1504 is specifically configured to perform fusion processing on the third fusion feature and the first fusion feature to obtain the second fusion feature.
  • the first vector is specifically used to represent the first segmented object carrying the first position information
  • the first position information is the position information of the first segmented object in the object to be processed
  • each second vector is specifically used to represent some elements in the first segmented object that carry the second position information
  • the second position information is the position information of the some elements of the first segmented object within the first segmented object.
  • the second fusion module 1504 is specifically configured to perform head-to-tail splicing processing on the first feature and the first fusion feature to obtain the second fusion feature.
  • the second fusion module 1504 is specifically configured to perform target operation on the first feature and the first fusion feature to obtain the second fusion feature, and the target operation includes at least one of addition or multiplication .
  • the second fusion module 1504 is specifically configured to perform head-to-tail splicing processing on the plurality of first fusion features when there are multiple first fusion features to obtain the spliced features.
  • the spliced feature is mapped to the feature of the target length, and the target length is determined according to the length of the first feature.
  • the first feature and the feature of the target length are added to obtain the second fusion feature.
  • The first fusion module 1503 is specifically configured to use at least two second features as the input of a target model, and the output of the target model is the first fusion feature. The target model includes one of a self-attention network (Transformer), a convolutional neural network (CNN), or a recurrent neural network (RNN); when the target model is a Transformer, the first target weight is determined based on the inner product between each of the at least two second features and the target second feature, and when the target model is a CNN or an RNN, the first target weight is preset.
  • the object to be processed is an image to be processed
  • the first vector is specifically used to represent the first segmented image
  • the first segmented image specifically includes some pixels in the to-be-processed image
  • the second vector is specifically used to represent some of the pixels in the first segmented image
  • the second fusion feature is specifically used to obtain features of the image to be processed.
  • the electronic device may be the training device 220 described in FIG. 2 or the execution device 210 described in FIG. 2 .
  • FIG. 16 is a schematic structural diagram of the electronic device provided by the embodiment of the present application.
  • The first machine learning model/rule 201 described in FIG. 4 to FIG. 10 may be deployed on the electronic device 1400, and the first machine learning model/rule 201 includes a first feature extraction model and a second feature extraction model for performing the corresponding steps in FIG. 4 to FIG. 10.
  • The electronic device 1400 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 1422 (for example, one or more processors), a memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) that store applications 1442 or data 1444.
  • the memory 1432 and the storage medium 1430 may be short-term storage or persistent storage.
  • The memory 1432 is random access memory (RAM), which can directly exchange data with the central processing unit 1422 and is used to load data 1444, application programs 1442 and/or the operating system 1441 for the central processing unit 1422 to run and use directly, usually serving as a temporary data storage medium for the operating system or other running programs.
  • the program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the electronic device.
  • the central processing unit 1422 may be configured to communicate with the storage medium 1430 to execute a series of instruction operations in the storage medium 1430 on the electronic device 1400 .
  • Electronic device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input and output interfaces 1458, and/or one or more operating systems 1441, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so on.
  • The central processing unit 1422 is also configured to execute the other steps performed by the first machine learning model/rule 201 in FIG. 4 to FIG. 10. For the steps executed by the central processing unit 1422 that correspond to the first machine learning model/rule 201 in the embodiments of FIG. 4 to FIG. 10, and the beneficial effects brought about, reference may be made to the descriptions in the respective method embodiments corresponding to FIG. 4 to FIG. 10, which will not be repeated here.
  • FIG. 17 is a schematic structural diagram of the electronic device provided by the embodiment of the present application.
  • The first machine learning model/rule 201 described in FIG. 4 to FIG. 10 may be deployed on the electronic device 1500, and the first machine learning model/rule 201 includes a first feature extraction model and a second feature extraction model for performing the corresponding steps in FIG. 4 to FIG. 10.
  • The electronic device 1500 includes: a receiver 1501, a transmitter 1502, a processor 1503, and a memory 1504 (the number of processors 1503 in the electronic device 1500 may be one or more, and one processor is taken as an example in FIG. 17), wherein the processor 1503 may include an application processor 15031 and a communication processor 15032.
  • the receiver 1501, the transmitter 1502, the processor 1503, and the memory 1504 may be connected by a bus or otherwise.
  • Memory 1504 may include read-only memory and random access memory, and provides instructions and data to processor 1503 .
  • a portion of memory 1504 may also include non-volatile random access memory (NVRAM).
  • The memory 1504 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1503 controls the operation of the electronic device.
  • various components of an electronic device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1503 or implemented by the processor 1503 .
  • the processor 1503 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1503 or an instruction in the form of software.
  • The above-mentioned processor 1503 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1503 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1504, and the processor 1503 reads the information in the memory 1504, and completes the steps of the above method in combination with its hardware.
  • the receiver 1501 can be used to receive input numerical or character information, and to generate signal input related to performing relevant settings and function control of the device.
  • the transmitter 1502 can be used to output digital or character information through the interface; the transmitter 1502 can also be used to send instructions to the disk group through the above interface to modify the data in the disk group; the transmitter 1502 can also include a display device such as a display screen.
  • the application processor 15031 is configured to execute the method executed by the first machine learning model/rule 201 described in the corresponding embodiments in FIG. 4 to FIG. 10 .
  • A vehicle may have more or fewer components than those shown, may combine two or more components, or may have a different configuration or arrangement of components.
  • the execution device and the training device provided by the embodiments of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip executes the method for feature extraction of the model described in the embodiments shown in FIG. 4 to FIG. 10 .
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), or the like.
  • FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The chip may be represented as a neural network processor NPU 160; the NPU 160 is mounted as a co-processor to the main CPU (Host CPU), and tasks are allocated by the Host CPU.
  • the core part of the NPU is the arithmetic circuit 1603, which is controlled by the controller 1604 to extract the matrix data in the memory and perform multiplication operations.
  • the operation circuit 1603 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1603 is a two-dimensional systolic array.
  • the arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1603 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1602 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 1601 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 1608 .
  • Unified memory 1606 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 1602 through a storage unit access controller (Direct Memory Access Controller, DMAC) 1605 .
  • Input data is also moved into unified memory 1606 via the DMAC.
  • the bus interface unit 1610 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1609 to obtain instructions from the external memory, and also for the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 , the weight data to the weight memory 1602 , or the input data to the input memory 1601 .
  • The vector calculation unit 1607 includes a plurality of operation processing units, and, when necessary, further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and size comparison. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • vector computation unit 1607 can store the processed output vectors to unified memory 1606 .
  • the vector calculation unit 1607 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1603, such as linear interpolation of the feature plane extracted by the convolutional layer, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 1607 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1603, eg, for use in subsequent layers in a neural network.
  • the instruction fetch buffer 1609 connected to the controller 1604 is used to store the instructions used by the controller 1604; the unified memory 1606, the input memory 1601, the weight memory 1602 and the instruction fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • each layer in the recurrent neural network can be performed by the operation circuit 1603 or the vector calculation unit 1607 .
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the method in the first aspect.
  • Embodiments of the present application also provide a chip, which includes a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executable instructions stored in the memory unit to cause the chip to perform the methods described above in FIGS. 4 to 10 .
  • The storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), or the like.
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processing digital signal processor (DSP), application specific integrated circuit (ASIC) or field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
  • Embodiments of the present application further provide a computer-readable storage medium, where a program for training a model is stored in the computer-readable storage medium, and when the program runs on a computer, it causes the computer to execute the methods described in FIG. 4 to FIG. 10 above.
  • the embodiments of the present application also provide a computer program product, which, when running on a computer, causes the computer to execute the steps in the methods described in the embodiments shown in the foregoing FIG. 4 to FIG. 10 .
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (for example, infrared, radio, microwave) manner.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.
  • An embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit, and the processing circuit is configured to execute the steps in the methods described in the embodiments shown in the foregoing FIG. 4 to FIG. 10 .
  • The aforementioned storage medium includes various media that can store program code, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
  • the computer software product may also be embodied in the form of controls, drivers, stand-alone or downloadable software objects, and the like.
  • modules may be combined or integrated into another system, or some features may be ignored.
  • The mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between modules may be electrical or in other similar forms.
  • The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed into multiple circuit modules; some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application.


Abstract

The embodiments of this application relate to the field of artificial intelligence and disclose a feature extraction method and apparatus. The method includes: acquiring an object to be processed, and acquiring a segmented object according to the object to be processed, where the segmented object includes some elements in the object to be processed; representing the segmented object by a first vector, and representing some elements in the segmented object by second vectors; performing feature extraction on the first vector to obtain a first feature, and performing feature extraction on the second vectors to obtain second features; fusing at least two second features according to a first target weight to obtain a first fusion feature; and performing fusion processing on the first feature and the first fusion feature to obtain a second fusion feature, where the second fusion feature is used to obtain the features of the object to be processed, so that the extracted features of the object to be processed can better represent the object to be processed, which in turn can improve the performance of a model to which the feature extraction method is applied.

Description

A feature extraction method and apparatus
This application claims priority to Chinese Patent Application No. 202110223032.8, entitled "A feature extraction method and apparatus", filed with the Chinese Patent Office on February 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a feature extraction method and apparatus.
Background
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and the military. It is a discipline concerned with how to use cameras/video cameras and computers to obtain the data and information we need about a photographed subject. Figuratively speaking, it means equipping a computer with eyes (cameras/video cameras) and a brain (algorithms) so that it can identify, track and measure targets in place of the human eye, thereby enabling the computer to perceive the environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret this input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, with the ability to adapt to the environment autonomously.
With the development of computer vision, more and more tasks can be performed using visual perception models, including image classification, 2D detection, semantic segmentation, key point detection, linear object detection (for example, lane line or stop line detection in autonomous driving), drivable area detection, scene recognition, and so on. How to enable a visual perception model to better complete the target task and achieve better performance and results is an issue of great concern.
Summary of the Invention
This application provides a feature extraction method and apparatus, which can make the extracted features of an object to be processed better represent the object to be processed, thereby improving the performance of a model to which the feature extraction method is applied.
To solve the above technical problem, the embodiments of this application provide the following technical solutions:
In a first aspect, this application provides a feature extraction method, which may include: performing feature extraction on a first vector through a first feature extraction model to obtain a first feature. The first vector is used to represent a first segmented object. The first segmented object may include some elements in an object to be processed. The data type of the object to be processed may be image data, text data or speech data. It can be understood that the segmented object includes some elements of the object to be processed. When the object to be processed is an image, some elements in the object to be processed refer to some pixels in the image; when the object to be processed is text or speech, some elements in the object to be processed refer to words or phrases in the text or speech. Feature extraction is performed on second vectors through a second feature extraction model to obtain multiple second features. A second vector is used to represent some elements in the first segmented object.
At least two second features are fused according to a first target weight to obtain a first fusion feature. The first target weight is determined according to a first parameter value, and the first target weight is positively correlated with the first parameter value. The first parameter value is used to represent the similarity between each second feature of the at least two second features and a target second feature, where the target second feature is any one of the at least two second features; alternatively, the first target weight is a second parameter value, and the second parameter value includes at least one preset constant. The similarity between one or more second features and the target second feature can be measured in different ways, for example by the magnitude of the inner product between two second features: the larger the inner product between two second features, the higher their similarity and the larger the weight, that is, the two features have a large influence on each other. For example, suppose the second features include feature A, feature B and feature C, and the target second feature is feature B. If the inner product between feature A and feature B is larger than the inner product between feature A and feature C, this means that feature A is more similar to feature B, so feature A has a greater influence on feature B and feature C has little influence on feature B. The weights may then be set to 0.9 and 0.1 respectively, and fusing the at least two second features according to the first target weight can be understood as 0.9*A + B + 0.1*C, where this result represents one first fusion feature. It should be noted that representing the similarity between features by the inner product is only one way of measuring the similarity between two features; the similarity can also be measured in other ways, for example a neural network model can be trained and the trained neural network used to obtain the similarity between two features. The first target weight may also be preset; for example, it can be set that each second feature of the at least two second features has the same influence on the target second feature, in which case the target second feature and one or more other second features may be averaged and the average superimposed on the target second feature. It should be noted that the above does not exhaust the ways of measuring the influence of one or more second features on the target second feature; besides the measures listed above, other measures may also be adopted.
Fusion processing is performed on the first feature and the first fusion feature to obtain a second fusion feature, and the second fusion feature is used to obtain the features of the object to be processed. The second fusion feature is used to determine the final features of the object to be processed. In a possible implementation, the second fusion feature output by the last feature extraction module among the multiple feature extraction modules in the first feature extraction model is used to determine the finally extracted features of the object to be processed. For each segmented object, the last feature extraction module outputs a corresponding second fusion feature, and the set of second fusion features is the final feature of the object to be processed. In a possible implementation, weighting processing is performed on the second fusion features corresponding to the segmented objects output by the last feature extraction module, and the weighted result is used as the final feature of the object to be processed.
It can be seen from the solution provided in the first aspect that the second feature extraction model establishes the association between elements, that is, the first fusion feature. Fusing the first fusion feature with the first feature makes the extracted feature include the association between elements, so that the features of the object to be processed can be better represented. The more information the extracted features of the object to be processed can represent, the more it helps the model analyze the object to be processed.
In a possible implementation of the first aspect, the method may further include: obtaining a third feature, where the third feature is obtained after feature extraction is performed on a third vector through the first feature extraction model. The third vector is used to represent a second segmented object, and the second segmented object may include some elements in the object to be processed. Performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: fusing the first feature and the third feature according to a second target weight to obtain a third fusion feature. The second target weight is determined according to a third parameter value, where the third parameter value is used to represent the similarity between the third feature and the first feature; alternatively, the second target weight is a fourth parameter value, and the fourth parameter value includes at least one preset constant. Fusion processing is performed on the third fusion feature and the first fusion feature to obtain the second fusion feature. In this implementation, the first feature extraction model establishes the association between segmented objects. In the process of extracting the features of the object to be processed, both the association between segmented objects and the association between elements are preserved, so that the extracted features can better represent the features of the object to be processed, which in turn can improve the performance of a model to which the feature extraction method is applied.
In a possible implementation of the first aspect, the first vector is specifically used to represent the first segmented object carrying first position information, and the first position information is the position information of the first segmented object in the object to be processed. Taking an image as the object to be processed as an example, the first position information may be represented by the coordinate information of one pixel or by the coordinate information of multiple pixels. For example, when the image to be processed is evenly segmented to obtain multiple image blocks, the position information of each image block may be represented by the coordinates of the pixel in the upper left corner of the image block. As another example, when each image block is a regular rectangle or square, the position information of each image block may be represented by the coordinates of the pixel in the upper left corner and the coordinates of the pixel in the lower right corner of the image block. The first position information may also be represented by an encoded vector. In this implementation, the first vector includes more information, namely the first position information, so that the first feature extraction model can obtain more information; the more information the first feature extraction model can obtain, the more it helps the first feature extraction model learn so as to extract image features better.
In a possible implementation of the first aspect, each second vector is specifically used to represent some elements of the first segmented object carrying second position information, and the second position information is the position information of the some elements of the first segmented object within the first segmented object. In this implementation, the second vector includes more information, namely the second position information. The more information the second feature extraction model can obtain, the more it helps the second feature extraction model learn so as to extract image features better.
In a possible implementation of the first aspect, performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: performing head-to-tail splicing processing on the first feature and the first fusion feature to obtain the second fusion feature. This implementation gives a specific way of fusing the first feature and the first fusion feature, which increases the diversity of the solution.
In a possible implementation of the first aspect, performing fusion processing on the first feature and the first fusion feature to obtain the second fusion feature may include: performing a target operation on the first feature and the first fusion feature to obtain the second fusion feature, where the target operation may include at least one of addition or multiplication. This implementation gives a specific way of fusing the first feature and the first fusion feature, which increases the diversity of the solution.
In a possible implementation of the first aspect, performing the target operation on the first feature and the first fusion feature to obtain the second fusion feature may include: when there are multiple first fusion features, performing head-to-tail splicing processing on the multiple first fusion features to obtain a spliced feature; mapping the spliced feature to a feature of a target length, where the target length is determined according to the length of the first feature; and adding the first feature and the feature of the target length to obtain the second fusion feature. This implementation gives a specific way of fusing the first feature and the first fusion feature, which increases the diversity of the solution.
In a possible implementation of the first aspect, fusing at least two second features according to the first target weight to obtain the first fusion feature may include: using the at least two second features as the input of a target model, where the output of the target model is the first fusion feature, and the target model may include one of a self-attention network (Transformer), a convolutional neural network (CNN) or a recurrent neural network (RNN). When the target model is a Transformer, the first target weight is determined according to the inner product between each second feature of the at least two second features and the target second feature; when the target model is a CNN or an RNN, the first target weight is preset. This implementation gives several ways of obtaining the first fusion feature, which increases the diversity of the solution.
In a possible implementation of the first aspect, the object to be processed is an image to be processed, the first vector is specifically used to represent a first segmented image, the first segmented image may specifically include some pixels in the image to be processed, the second vector is specifically used to represent some pixels in the first segmented image, and the second fusion feature is specifically used to obtain the features of the image to be processed. In this implementation, in the process of extracting image features, both the association between image blocks and the association between pixels (or pixel blocks) are preserved, so that the extracted image features can well capture the color features, texture features, shape features and spatial relationship features of the image, which in turn can improve the performance of the visual perception model.
第二方面,本申请提供一种特征提取模型,该特征提取模型可以包括第一特征提取模型和第二特征提取模型,第一特征提取模型,用于获取第一特征,第一特征是通过第一特征提取模型对第一向量进行特征提取后获取的,第一向量用于表示第一切分后的对象,第一切分后的对象可以包括待处理对象中的部分元素。第二特征提取模型,用于获取多个第二特征,第二特征是通过第二特征提取模型对第二向量进行特征提取后获取的,第二向量用于表示第一切分后的对象中的部分元素。第二特征提取模型,还用于根据第一目标权重对至少两个第二特征进行融合,以获取第一融合特征,第一目标权重是根据第一参数值确定的,第一目标权重和第一参数值正相关。第一参数值用于表示至少两个第二特征中的每个第二特征,和目标第二特征之间的相似度,目标第二特征是至少两个第二特征中的任意一个第二特征,或者第一目标权重为第二参数值,第二参数值包括至少一个预设常数。第一特征提取模型,还用于对第一特征和第一融合特征进行融合处理,以获取第二融合特征,第二融合特征用于获取待处理对象的特征。
在第二方面的一种可能实现方式中，第一特征提取模型，还用于：获取第三特征，第三特征是通过第一特征提取模型对第三向量进行特征提取后获取的，第三向量用于表示第二切分后的对象，第二切分后的对象可以包括待处理对象中的部分元素。第一特征提取模型，具体用于：根据第二目标权重对第一特征和第三特征进行融合，以获取第三融合特征，第二目标权重是根据第三参数值确定的，第三参数值用于表示第三特征和第一特征之间的相似度，或者第二目标权重为第四参数值，第四参数值包括至少一个预设常数。对第三融合特征和第一融合特征进行融合处理，以获取第二融合特征。
在第二方面的一种可能实现方式中,第一向量具体用于表示携带了第一位置信息的第一切分后的对象,第一位置信息为第一切分后的对象在待处理对象中的位置信息。
在第二方面的一种可能实现方式中,每个第二向量具体用于表示携带了第二位置信息的,第一切分后的对象中的部分元素,第二位置信息为第一切分后的对象中的部分元素在第一切分后的对象中的位置信息。
在第二方面的一种可能实现方式中,第一特征提取模型,具体用于:对第一特征和第一融合特征进行首尾拼接处理,以获取第二融合特征。
在第二方面的一种可能实现方式中,第一特征提取模型,具体用于:对第一特征和第一融合特征进行目标运算,以获取第二融合特征,目标运算可以包括相加或相乘中的至少一种。
在第二方面的一种可能实现方式中，第一特征提取模型，具体用于：第一融合特征可以包括多个时，对多个第一融合特征进行首尾拼接处理，以获取拼接后的特征。将拼接后的特征映射为目标长度的特征，目标长度根据第一特征的长度确定。对第一特征和目标长度的特征进行相加处理，以获取第二融合特征。
在第二方面的一种可能实现方式中,第二特征提取模型,具体用于:将至少两个第二特征作为目标模型的输入,目标模型的输出为第一融合特征,目标模型可以包括自注意力网络Transformer、卷积神经网络CNN或循环神经网络RNN中的其中一种,目标模型是Transformer时,第一目标权重是根据至少两个第二特征中的每个第二特征和目标第二特征之间的内积确定的,目标模型是CNN或RNN中的其中一种时,第一目标权重是预设的。
在第二方面的一种可能实现方式中,待处理对象是待处理图像,第一向量具体用于表示第一切分后的图像,第一切分后的图像具体可以包括待处理图像中的部分像素,第二向量具体用于表示第一切分后的图像中的部分像素,第二融合特征具体用于获取待处理图像的特征。
对于本申请第二方面以及各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第三方面,本申请提供一种图像处理方法,可以包括:获取待处理图像。将待处理图像输入至视觉感知模型中,以通过视觉感知模型中可以包括的特征提取模型提取图像特征,特征提取模型为第二方面或第二方面任意一种可能的实施方式中所描述的特征提取模型。根据图像特征对待处理图像进行视觉感知。
在第三方面的一种可能实现方式中,根据图像特征对待处理图像进行视觉感知,可以包括:根据图像特征对待处理图像进行分类,以获取待处理图像的分类结果。
在第三方面的一种可能实现方式中,获取待处理图像,可以包括:通过车辆的传感器获取待处理图像。根据图像特征对待处理图像进行视觉感知,可以包括:根据图像特征对待处理图像进行语义分割,以获取待处理图像中目标物体所在区域,目标物体可以包括人物、车辆、路面中的一种或者多种。
在第三方面的一种可能实现方式中,获取待处理图像,可以包括:通过监控设备获取待处理图像。根据图像特征对待处理图像进行视觉感知,可以包括:若根据图像特征识别待处理图像中可以包括人物,则根据图像特征识别人物的属性,属性可以包括性别、肤色、年龄、服装中的一种或者多种。
第四方面,本申请提供一种电子设备,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现第一方面或第一方面任意一种可能的实施方式所描述的方法。
第五方面,本申请提供一种计算机可读存储介质,可以包括程序,当其在计算机上运行时,使得计算机执行如第一方面或第一方面任意一种可能的实施方式所描述的方法。
第六方面,本申请提供一种电路系统,电路系统可以包括处理电路,处理电路配置为执行如第一方面或第一方面任意一种可能的实施方式所描述的方法。
第七方面,本申请提供一种计算机程序产品,计算机程序产品可以包括指令,当指令 由电子设备加载并执行,使电子设备执行第一方面或第一方面任意一种可能的实施方式所描述的方法。
第八方面,本申请提供一种芯片,芯片与存储器耦合,用于执行存储器中存储的程序,以执行如第一方面或第一方面任意一种可能的实施方式所描述的方法。
对于本申请第四方面至第八方面以及各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
附图说明
图1为本申请实施例提供的人工智能主体框架的一种结构示意图;
图2为本申请实施例提供的一种系统的架构图;
图3为一种对图像进行特征提取的流程示意图;
图4为本申请实施例提供的一种特征提取的方法的流程示意图;
图5为本申请实施例提供一种获取元素集合的流程示意图;
图6为本申请实施例提供的一种将图像块转化为向量表示的流程示意图;
图7为本申请实施例提供的一种特征提取模型的示意图;
图8为本申请实施例提供的一种特征提取模型的示意图;
图9为本申请实施例提供的一种特征提取的方法的流程示意图;
图10为本申请实施例提供的一种特征提取模型的示意图;
图11为本申请实施例提供的一种特征提取方法的一种应用场景的示意图;
图12为本申请实施例提供的一种特征提取方法的一种应用场景的示意图;
图13为本申请实施例提供的一种图像分类模型的架构示意图;
图14为应用本申请提供的一种特征提取方法的模型进行图像分类任务的实验结果图;
图15为本申请实施例提供的电子设备的一种结构示意图;
图16为本申请实施例提供的电子设备的另一种结构示意图;
图17为本申请实施例提供的电子设备的另一种结构示意图;
图18为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
本申请实施例提供了一种特征提取的方法以及装置,本申请提供的方案可以提升视觉感知模型的性能和效果。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
为了更好的理解本申请提供的方案，首先对人工智能系统总体工作流程进行描述，请参见图1，图1示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”（水平轴）和“IT价值链”（垂直轴）两个维度对上述人工智能主体框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片提供,作为示例,该智能芯片包括中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程逻辑门阵列(field programmable gate array,FPGA)等硬件加速芯片;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据指示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,图像的分类、图像的个性化管理、电池充电个性化管理、文本分析、计算机视觉的处理、语音识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、智慧城市等。
本申请实施例可能应用于上述各种领域中的多个应用场景中,比如可以应用于自然语言搜索的应用场景中,以提升自然语言搜索的准确度;还可以应用于机器翻译的应用场景中,以使翻译的结果更准确;还可以应用于多轮对话的应用场景中,以提升人机沟通的效率。本申请实施例主要应用于上述各种领域中的与计算机视觉领域相关的应用场景中。比如,本申请实施例可以应用于人脸识别、图像分类、目标检测、语义分割、关键点检测、 线性物体检测(比如自动驾驶技术中的车道线或停止线检测)、可行驶区域检测、场景识别等应用场景中。作为示例,具体可以应用于自动驾驶的应用场景中。自动驾驶车辆通过摄像头获取车辆周围的环境图像。对摄像头获取的图像进行分割,以从图像中分割出路面、路基、车辆、行人等不同物体所在的区域,从而保持车辆行驶在正确的区域。在自动驾驶领域中,图像分割的准确性对车辆行驶的安全性至关重要,通过本申请提供的方案,可以提升自动驾驶领域中图像分割的准确性。作为另一个示例,本申请实施例可以应用于智能监控领域中。在智能监控领域中,根据监控设备获取的图像,进行行人属性识别是一个关键任务,行人属性识别任务需要识别出行人的常见属性,如性别、年龄、头发、衣服、穿搭等。这要求图像特征能够表征的图像信息更多,比如携带更多的图像的细节信息,其中图像特征可以通过将监控设备获取的图像输入至特征提取模型,通过特征提取模型提取图像的图像特征。需要说明的是,本申请有时也将特征提取模型称为特征提取模块,二者表示相同的意思。比如,在智能监控领域这一示例中,将监控设备获取的图像输入至目标模型中,该目标模型用于执行行人属性识别任务,该目标模型中包括特征提取模块,通过该特征提取模块提取图像特征,以使目标模型根据提取的图像特征识别行人属性。通过本申请实施例提供的方案,可以提升特征提取模型的性能,使提取出的图像特征可以更好的表征图像信息。图像特征能够表征的图像信息越多,越有利于提升视觉分析任务的准确性。针对于行人属性识别这一任务,则越有利于提升行人属性识别的准确性。
应理解，此处不对本申请实施例的应用场景进行穷举。在前述种种场景中，均可以采用本申请实施例提供的特征提取方法，从而提升特征提取模型的性能。
为了便于理解本方案,先结合图2对本申请实施例提供的一种系统进行介绍,请参阅图2,图2为本申请实施例提供的一种系统的架构图,在图2中,系统200包括执行设备210、训练设备220、数据库230和数据存储系统240。
在训练阶段,数据库230中存储有训练数据集合,数据库230具体可以表现为任意形式的存储介质,不限定为传统意义上的数据库。训练数据集合中可以有多个训练样本。本申请对训练样本的数据类型并不进行限定,比如训练样本可以是图像数据,或者训练样本可以是语音数据,或者训练样本可以是文字数据。需要说明的是,通常情况训练数据集中包括的训练样本的数据类型是相同的。训练设备220生成第一机器学习模型/规则201,并利用数据库中的训练数据集合对第一机器学习模型/规则201进行迭代训练,得到成熟的第一机器学习模型/规则201。当训练样本是图像数据时,本申请也将第一机器学习模型/规则201称为视觉感知模型。以训练样本是图像数据为例,对如何对第一机器学习模型/规则201迭代训练,得到成熟的第一机器学习模型/规则201进行说明。图像数据作为第一机器学习模型/规则201的输入时,第一机器学习模型/规则201通过特征提取模型提取图像数据的图像特征,第一机器学习模型/规则201通过提取的图像特征对第一机器学习模型/规则201迭代训练。训练设备可以采用第一机器学习模型/规则201对数据进行训练,以获取成熟的第一机器学习模型/规则201。第一机器学习模型/规则201的每一层的工作可以用数学表达式
y=a(W·x+b)
来描述：从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作，完成输入空间到输出空间的变换(即矩阵的行空间到列空间)，这五种操作包括：1、升维/降维；2、放大/缩小；3、旋转；4、平移；5、“弯曲”。其中1、2、3的操作由W·x完成，4的操作由+b完成，5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物，而是一类事物，空间是指这类事物所有个体的集合。其中，W是权重向量，该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换，即每一层的权重W控制着如何变换空间。训练第一机器学习模型/规则201的目的，也就是最终得到训练好的第一机器学习模型/规则201的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。因此，第一机器学习模型/规则201的训练过程本质上就是学习控制空间变换的方式，更具体的就是学习权重矩阵。
因为希望第一机器学习模型/规则201的输出尽可能的接近真正想要预测的值。其中真正想要预测的值与第一机器学习模型/规则201的训练目标或者说第一机器学习模型/规则201需要完成的任务相关。比如第一机器学习模型/规则201用于进行图像分类任务,则第一机器学习模型/规则201的输出尽可能的接近真实的图像分类结果。需要说明的是,本申请重点研究如何使第一机器学习模型/规则201提取的特征可以更好的表征待处理对象的信息,至于第一机器学习模型/规则201根据提取的特征执行何种具体任务,本申请并不进行限定。为了使第一机器学习模型/规则201的输出尽可能的接近真正想要预测的值,可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么第一机器学习模型/规则201的训练就变成了尽可能缩小这个loss的过程。
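为便于理解“通过损失函数衡量预测值与目标值之间的差异，并据此更新各层权重”的训练过程，下面给出一个以图像分类为例的极简训练步骤草图。其中假设使用PyTorch框架，模型结构、图像尺寸、类别数与学习率均为示意性的占位取值，并非对本申请训练方式的限定。

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # 仅作占位的分类模型
criterion = nn.CrossEntropyLoss()            # 损失函数：用于衡量预测值和目标值的差异
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)           # 一批训练样本（随机数据，仅为示意）
labels = torch.randint(0, 10, (8,))          # 真正想要预测的目标值（分类标签）

logits = model(images)                       # 当前网络的预测值
loss = criterion(logits, labels)             # loss 越高表示预测值与目标值差异越大
optimizer.zero_grad()
loss.backward()
optimizer.step()                             # 根据差异更新每一层神经网络的权重
```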
在推理阶段,执行设备210可以调用数据存储系统240中的数据、代码等,也可以将数据、指令等存入数据存储系统240中。数据存储系统240可以配置于执行设备210中,也可以为执行设备210外部的存储器。执行设备210可以调用成熟的第一机器学习模型/规则201提取待处理对象的特征,并根据提取出的待处理对象的特征执行具体的任务。其中,待处理对象的数据类型和训练样本的数据类型一般是相同的。具体的任务是根据训练阶段的训练任务确定的。比如,训练阶段,利用数据库中的训练数据集合对第一机器学习模型/规则201进行迭代训练,使成熟的第一机器学习模型/规则201可以对图像进行特征提取,并根据提取的特征执行图像分类任务。则在推理阶段,执行设备210可以调用成熟的第一机器学习模型/规则201提取图像的特征,并根据提取出的图像特征执行图像分类任务。
本申请的一些实施例中,例如图2中,“用户”可以直接与执行设备210进行交互,也即执行设备210与客户设备集成于同一设备中。作为示例,在一些应用场景中,执行设备210可以表现为终端设备,比如手机、摄像头、智能家居等等。则在推理阶段,用户可以通过执行设备210输入待处理对象,比如用户通过摄像头进行拍照,摄像头获取的图像作为成熟的第一机器学习模型/规则201的输入。在另一些应用场景中,执行设备210具体可以表现为配置有显示屏的执行设备,则在推理阶段,执行设备210在完成一个任务(或者多个任务)之后,可以向用户展示第一机器学习模型/规则201的输出结果。比如执行设 备210执行了图像分类任务之后,向用户展示图像分类的结果。执行设备210还可以表现为其它形态,此处不一一进行列举,但图2仅是本发明实施例提供的架构示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。
在本申请的另一些实施例中,执行设备210和客户设备可以为分别独立的设备,执行设备210配置有输入/输出接口,与客户设备进行数据交互,“用户”可以通过客户设备的输入/输出接口向执行设备210输入至少一个任务,执行设备210通过输入/输出接口将处理结果返回给客户设备。
在对第一机器学习模型/规则201进行迭代训练的过程中,以及应用成熟的第一机器学习模型/规则201执行任务时,都涉及到对待处理对象进行特征提取的过程。因此,本申请提供的方案既可以由训练设备220执行也可以由执行设备210执行。
目前,部分第一机器学习模型/规则201的输入要求是一维的向量。比如,自注意力网络(self-attention network,Transformer)、长短期记忆(long short term memory,LSTM)神经网络以及门控循环神经网络(gated recurrent unit networks,GRU),这些模型的输入要求是一维的向量。然而,待处理对象往往是多维的张量,比如图像通常是三维的张量。所以,要对待处理对象进行预处理,将张量转换为向量后,才能作为这些模型的输入。申请人发现一些对待处理对象进行预处理的方案,会破坏待处理对象的内部结构,导致提取的待处理对象的特征损失了细节信息,不利于这些模型进行正确的预测。下面以待处理对象是图像,第一机器学习模型/规则201是Transformer为例,对一些方案的缺陷进行说明。
参阅图3,一些做法是将图像切分为多个图像块。如图3所示,将图像切分为九个图像块,每个图像块包括1/9的图像,即包括图像中的1/9的像素。针对每一个图像块,将图像块转换为向量表示,将通过向量表示的图像块作为Transformer的输入。Transformer是一种基于自注意力机制的神经网络。针对一个图像块,Transformer在提取图像特征时,可以建立这个图像块与输入的所有图像块之间的关联关系。然而,申请人发现这种方式导致图像的内部结构被破坏,具体表现在这种方式只考虑图像块和图像块之间的关联关系,而没有考虑像素和像素之间的关联关系。将图像块转化为向量表示后,丢失了部分像素和像素之间的关联关系,比如原来是相邻的像素,由于将图像块转化为向量表示,导致相邻的像素不再相邻,则丢失了像素和像素之间的相邻关系。此外,如果试图将图像块切分的足够小来解决这一问题,又会引发新的问题。即图像块数量的增加,将会导致计算量大大增加,计算量的增长导致模型训练效率降低,以及训练后的模型预测效率下降。
为了解决上述问题,本申请实施例提供一种特征提取的方法,使第一机器学习模型/规则201包括至少两个自注意力模块,其中一个自注意力模块用于建立图像块和图像块之间的关联关系,另一个自注意力模块用于建立像素和像素之间的关联关系,进而可以提升模型的性能。
参阅图4,为本申请实施例提供的一种特征提取的方法的流程示意图。
如图4所示,本申请实施例提供的一种特征提取的方法可以包括如下步骤:
401、对待处理对象进行切分处理,以获取切分后的待处理对象。
待处理对象的数据类型可以是图像数据(以下简称为图像)、文本数据(以下简称为文本)以及语音数据(以下简称为语音)。可以理解为切分后的待处理对象包括待处理对象中的部分元素。待处理对象是图像时，待处理对象中的部分元素是指图像中的部分像素；待处理对象是文本或者语音时，待处理对象中的部分元素是指文本或者语音中的单词或者词语。在一个优选的实施方式中，本申请中的待处理对象是图像数据。在以下实施例中将以待处理对象是图像数据为例对本申请提供的一种特征提取的方法进行介绍。为了便于描述，以下将切分后的图像称为图像块，每个图像块包括图像中的部分像素，所有图像块组成了完整的图像。
在一个可能的实施方式中,可以对图像进行均匀的切分,使切分后的每个图像块包括的像素数量是相同的。在一个可能的实施方式中,也可以不对图像进行均匀的切分,使切分后的每个图像块包括的像素数量并不完全相同,具体的,部分图像块包括的像素数量是相同的,部分图像块包括的像素数量是不同的,或者全部图像块包括的像素数量都是不相同的。此外,每个图像块中包括的像素可以全部像素都是相邻的像素,也可以部分像素是相邻的像素,部分像素不是相邻的像素。其中相邻的像素是指在完整的图像中像素之间的空间位置关系是相邻的。在一个优选的实施方式中,可以对图像进行均匀的切分,并且每个图像块包括的全部像素都是相邻的像素。对于一张图像,将其均匀切分为n个图像块,可以参照公式1-1进行理解。
X=[X_1,X_2,…,X_n]∈R^(n×p×p×3)   (1-1)
其中，X表示待处理图像。X_1至X_n分别表示切分后的各个图像块，n为大于1的正整数，用于表示切分后的图像块的数目。R代表张量，该张量的尺寸为n×p×p×3，其中，每个图像块的尺寸为p×p×3，p×p可以用于表示图像块的两个维度，3表示另一个维度，即通道维度，比如，每个图像块包括图像的像素值可以是一个红绿蓝(RGB)颜色值，则图像块的通道维度为3，像素值可以是表示颜色的长整数。
将待处理图像切分为多个图像块,有助于加快模型提取图像特征的进度,即模型可以并行处理多个图像块,同时提取多个图像块的图像特征。
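作为公式1-1所述切分过程的一个示意性实现（并非对切分方式的限定），下面给出一段将图像均匀切分为n个p×p×3图像块的代码草图。其中假设使用PyTorch，函数名split_into_patches、图像尺寸224×224以及p=16均为举例性假设，且假设图像的高、宽可以被p整除。

```python
import torch

def split_into_patches(image: torch.Tensor, p: int) -> torch.Tensor:
    """将形状为 (H, W, 3) 的图像均匀切分为 n 个形状为 (p, p, 3) 的图像块。"""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "此示意实现假设 H、W 可被 p 整除"
    patches = image.reshape(H // p, p, W // p, p, C)   # 先按行、列分块
    patches = patches.permute(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p, p, C)                # (n, p, p, C)，n = H*W / p^2

patches = split_into_patches(torch.randn(224, 224, 3), p=16)   # 得到 n = 196 个图像块
```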
402、针对每一个切分后的待处理对象,获取多个元素集合。
每个元素集合中包括切分后的待处理对象中的部分元素。比如,针对每一个图像块,获取元素集合,每个元素集合包括每个图像块中的部分像素。为了便于描述,以下将图像块中的部分像素称为像素块。
针对每一个图像块,可以获取多个像素块,该多个像素块中的任意两个像素块包括的像素的数目可以相同或者不同。此外,每个像素块中包括的像素可以是相邻的像素或者不相邻的像素。示例性的,可以参照公式1-2进行理解。
Figure PCTCN2022077807-appb-000004
其中,i=1,2,…,n。n为大于1的正整数,用于表示切分后的图像块的数目。
Figure PCTCN2022077807-appb-000005
m为大于1的正整数,用于表示一个图像块中包括的像素块的数目。c用于表示一个像素块对应的向量的长度。
则n个图像块,就有n个像素块组,可以参照公式1-3进行理解。
Figure PCTCN2022077807-appb-000006
需要说明的是,多个像素块中的任意两个像素块中包括的像素可能有重叠。
示例性的，下面给出一种获取元素集合的方式。可以通过图像矩阵列转换(image to column，im2col)方式实现元素集合的获取。im2col主要将图像数据中每个窗口内含有的数据转为列向量，最后按列排成新的矩阵。下面结合图5进行说明，如图5所示，每个数字代表一个像素，图5中未展示每个像素的通道维度。通过滑动窗口遍历图像，其中窗口的尺寸可以自定义，比如图5中为3×3的窗口，还可以自定义为其他尺寸的窗口，比如2×2，4×4等等，本申请实施例对此并不进行限定。窗口每次滑动的步长也可以自定义，比如每次滑动1个像素的距离，每次滑动两个像素的距离等等，本申请实施例对此并不进行限定。每一次在图像块上滑动窗口，窗口中包括的全部像素可以看做一个像素块。由于每个像素块具有通道维度，则每个像素展开后，每个像素对应一个列向量中的多个位置。比如，每个像素可以包括红绿蓝三个通道，则每个像素展开后，对应一个列向量中的3个元素位置。每一个像素块都可以转化为一个行向量或者列向量，如图5中所示，展示了将一个像素块转化为列向量的过程。
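与上述im2col过程对应，下面给出一段通过滑动窗口获取像素块并展开为向量的示意代码。其中假设使用PyTorch提供的unfold接口，函数名、窗口尺寸3×3与步长1均为举例性假设，仅用于说明“一个窗口对应一个像素块、一个像素块展开为一个向量”的过程。

```python
import torch
import torch.nn.functional as F

def image_block_to_pixel_blocks(block: torch.Tensor, window: int = 3, stride: int = 1) -> torch.Tensor:
    """对形状为 (C, p, p) 的图像块滑动窗口，返回形状为 (m, C*window*window) 的像素块向量。

    m 为该图像块中像素块的数目，每个像素块展开为一个长度为 C*window*window 的向量。
    """
    cols = F.unfold(block.unsqueeze(0), kernel_size=window, stride=stride)  # (1, C*window*window, m)
    return cols.squeeze(0).t()                                              # (m, C*window*window)

pixel_blocks = image_block_to_pixel_blocks(torch.randn(3, 16, 16))  # m = 14*14 = 196 个像素块
```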
403、通过第一特征提取模型对第一向量进行特征提取,以获取第一特征,通过第二特征提取模型对第二向量进行特征提取,以获取第二特征。
其中,第一向量用于表示切分后的待处理对象,比如第一向量用于表示步骤401和步骤402中提到的图像块。第二向量用于表示切分后的对象中的部分元素,比如第二向量用于表示步骤401和步骤402中提到的像素块。
第一特征提取模型和第二特征提取模型可以理解为上文提到的第一机器学习模型/规则201中的多个特征提取模块。比如第一特征提取模型和第二特征提取模型可以是CNN或者RNN。第一特征提取模型包括多个特征提取模块,第二特征提取模型包括多个特征提取模块。针对第一特征提取模型或者第二特征提取模型中的一个,多个特征提取模块首尾连接,前一个特征提取模块的输出作为后一个特征提取模块的输入,以使后一个特征提取模块继续进行特征提取。每个特征提取模块具有特定的权重矩阵,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,通过权重矩阵遍历输入,从而完成从图像中提取特定特征的工作。针对于当前的第一特征提取模型的特征提取模块,前一个特征提取模块的输出可以看做第一特征。其中,图像特征主要有图像的颜色特征、纹理特征、形状特征和空间关系特征等。颜色特征是一种全局特征,描述了图像或图像区域所对应的景物的表面性质;一般颜色特征是基于像素点的特征,此时所有属于图像或图像区域的像素都有各自的贡献。由于颜色对图像或图像区域的方向、大小等变化不敏感,所以颜色特征不能很好地捕捉图像中对象的局部特征。纹理特征也是一种全局特征,它也描述了图像或图像区域所对应景物的表面性质;但由于纹理只是一种物体表面的特性,并不能完全反映出物体的本质属性,所以仅仅利用纹理特征是无法获得高层次图像内容的。与颜色特征不同,纹理特征不是基于像素点的特征,它需要在包含多个像素点的区域中进行统计计算。形状特征有两类表示方法,一类是轮廓特征,另一类是区域特征,图像的轮廓特征主要针对物体的外边界,而图像的区域特征则关系到整个形状区域;空间关系特征,是指图像中 分割出来的多个目标之间的相互的空间位置或相对方向关系,这些关系也可分为连接/邻接关系、交叠/重叠关系和包含/包容关系等。通常空间位置信息可以分为两类:相对空间位置信息和绝对空间位置信息。前一种关系强调的是目标之间的相对情况,如上下左右关系等,后一种关系强调的是目标之间的距离大小以及方位。需要说明的,上述列举的图像特征可以作为图像中具有的特定特征的一些举例,图像还可以具有其他特征,如更高层级的特征:语义特征,此处不再展开。
404、根据第一目标权重对至少两个第二特征进行融合,以获取第一融合特征。
针对第二特征提取模型中的一个特征提取模块,该特征提取模块根据第一目标权重对至少两个第二特征进行融合,以获取第一融合特征。获取第一融合特征是为了建立像素块和像素块之间的联系。其中,建立像素块和像素块之间的关联关系可以理解为在提取一个像素块的图像特征时,将其他一个或者多个像素块对该像素块的影响考虑进去。其他一个或者多个像素块对该像素块的影响越大,则权重越大,其他一个或者像素块对该像素块的影响越小,则权重越小。可以通过不同的方式衡量一个或者多个像素块对该像素块的影响,比如可以通过两个像素块对应的向量之间的相似度进行衡量,具体的可以通过两个像素块对应的向量之间的内积的大小进行衡量,两个像素块对应的向量之间的内积越大,表示两个像素块的相似度越高,则权重越大;再比如,还可以通过训练神经网络模型,通过神经网络模型获取像素块和像素块之间的相似度;还可以对两个像素块对应的向量进行预设运算,根据预设运算后的结果获取其他像素块对该像素块的影响,比如可以对待处理像素块对应的向量以及与待处理像素块相邻的像素块对应的向量求平均值,将平均值叠加到待处理像素块对应的向量上。
在一个优选的实施方式中,第二特征提取模型可以是具有基于自注意力机制的神经网络,比如第二特征提取模型可以是Transformer。当第二特征提取模型是Transformer时,通过第二特征提取模型对第二向量进行特征提取后,可以获取第一融合特征。下面以第二特征提取模型是Transformer,待处理对象是图像数据为例,对通过第二特征提取模型对第二向量进行特征提取,以获取第二特征或者第一融合特征进行说明。参阅图7,第二特征提取模型是Transformer时,第二特征提取模型中的多个特征提取模块具体可以是多个用于进行特征处理的特征提取块(block),多个block首尾连接,前一个block的输出作为后一个block的输入,以使后一个block继续进行特征提取。其中,每一个block具有特定的权重矩阵,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,通过权重矩阵遍历输入,从而完成从图像中提取特定特征的工作。
每一个block在提取图像特征的过程中,可以通过多种方式建立像素(或者像素块)和像素(或像素块)之间的关联关系。第二特征提取模型的每个block在对像素块进行特征提取时,对多个像素块进行自注意力计算,将各个像素块对当前处理的像素块的影响都考虑进去。继续参阅图7,为Transformer的架构示意图。一个block一般包括标准化处理模块,该标准化处理模块用于对输入进行标准化处理,其中标准化处理可以理解使输入的数据的均值为0,标准差为1,使损失值在每次训练过程中平滑下降。可以将标准化处理模块的输出看做第二特征,一个block还可以包括自注意力模块,标准化处理模块的输出作 为自注意力模块的输入。一个block是第二特征提取模型的block时,通过自注意力模块对多个像素块进行自注意力计算,建立像素块和像素块之间的关联关系。一个block还可以包括另一个标准化处理模块,对自注意力模块的输出进行标准化处理,使损失值在每次训练过程中可以更好的平滑下降。针对于当前的block,当前的block的前一个block的输出可以看做第二特征,前一个block的输入可以看做第二向量,前一个block的输出作为当前的block的输入,当前的block的输出为第一融合特征,对于当前的block的后一个block来说,当前的block的输出可以看做第二特征,后一个block的输出为第一融合特征。需要说明的是,当前的block是第一个block时,当前的block的输入不是第二特征,只能是第二向量。
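与上文描述的block结构（标准化处理、自注意力、再标准化处理）对应，下面给出一个对一个图像块内的多个像素块特征进行自注意力计算的block代码草图。其中假设使用PyTorch，类名PixelBlockLayer、特征维度24与头数4均为举例性假设；为突出文中描述的三个模块，草图省略了Transformer中常见的残差连接与前馈网络，并非对block确切结构的限定。

```python
import torch
import torch.nn as nn

class PixelBlockLayer(nn.Module):
    """对一个图像块内的 m 个像素块特征建立关联关系的示意 block。"""

    def __init__(self, dim: int = 24, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                        # 标准化处理模块
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                        # 另一个标准化处理模块

    def forward(self, second_features: torch.Tensor) -> torch.Tensor:
        # second_features: (n, m, dim)，n 个图像块，每个图像块含 m 个像素块特征
        x = self.norm1(second_features)
        fused, _ = self.attn(x, x, x)                         # 自注意力：建立像素块之间的关联关系
        return self.norm2(fused)                              # 输出可视为第一融合特征

layer = PixelBlockLayer()
first_fusion = layer(torch.randn(196, 36, 24))
```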
在一个优选的实施方式中,第一特征提取模型和第二特征提取模型是上文提到的输入要求是一维的向量的模型,比如第一特征提取模型可以是Transformer、GRU以及LSTM中的其中一种,第二特征提取模型可以是Transformer、GRU以及LSTM中的其中一种。正如上文提到的,这些特征提取模型的输入要求是一维的向量,所以针对这些模型,在通过步骤401和步骤402获取了图像块和像素块之后,还需要将图像块转换为向量表示,将通过向量表示的图像块作为第一特征提取模型的输入,将像素块转化为向量表示,将通过向量表示的像素块作为第二特征提取模型的输入。其中,将图像块转化为向量表示可以有多种表示,示例性的,参照图6进行理解,可以将图像块的每一行的像素首尾拼接,由于每个像素块具有通道维度,则每个像素展开后,每个像素对应一个列向量中的多个位置。比如,每个像素可以包括红绿蓝三个通道,则每个像素块展开后,对应一个列向量中的3个元素位置。每一个图像块都可以转化为一个行向量或者列向量,全部图像块对应的向量按行排序或者按列排序后,可以获取图像块的向量矩阵。将像素块转化为向量表示的方式可以参照将图像块转化为向量表示的方式进行理解,比如在上述通过im2col获取像素块的示例中,对每个窗口内包括的全部像素展开成一列向量,可以获取多列向量,进而按列排序,以获取到像素块的向量矩阵。此外,第一特征提取模型和第二特征提取模型对于输入向量的尺寸可能也会有要求,比如只有预设长度的向量才能作为第一特征提取模型和第二特征提取模型的输入。所以,在一些可能的实施方式中,还需要对每个图像块转化后的向量进行映射处理,以将图像块转化后的向量映射为预设长度的向量,满足第一特征提取模型对输入的要求;还需要对每个像素块转化后的向量进行映射处理,以将像素块转化后的向量映射为预设长度的向量,满足第二特征提取模型对输入的要求。本申请中的第一向量用于表示切分后的待处理对象,比如第一向量用于表示步骤401和步骤402中提到的图像块。第二向量用于表示切分后的对象中的部分元素,比如第二向量用于表示步骤401和步骤402中提到的像素块。如果第一向量作为第一特征提取模型的输入,第二向量作为第二特征提取模型的输入时,第一向量还满足第一特征提取模型的输入要求,第二向量还满足第二特征提取模型的输入要求。
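针对上述“将图像块/像素块转化为向量表示，并映射为预设长度”的预处理过程，下面给出一个示意性的代码片段。其中假设使用PyTorch，图像块数目196、像素块数目36以及预设长度384、24等数值均为举例性假设，线性映射层也仅是一种可能的映射方式。

```python
import torch
import torch.nn as nn

patches = torch.randn(196, 16, 16, 3)                 # n 个图像块，(n, p, p, 3)
patch_vectors = patches.flatten(start_dim=1)          # 每个图像块首尾拼接为一个向量，(n, 768)

patch_proj = nn.Linear(patch_vectors.shape[-1], 384)  # 映射为第一特征提取模型要求的预设长度
first_vectors = patch_proj(patch_vectors)             # 第一向量，(n, 384)

pixel_vectors = torch.randn(196, 36, 27)              # 每个图像块的 m 个像素块向量，(n, m, 27)
pixel_proj = nn.Linear(27, 24)                        # 映射为第二特征提取模型要求的预设长度
second_vectors = pixel_proj(pixel_vectors)            # 第二向量，(n, m, 24)
```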
405、对第一特征和第一融合特征进行融合处理,以获取第二融合特征,第二融合特征用于获取待处理对象的特征。
针对第二特征提取模型中的一个特征提取模块,本申请提供的方案可以通过多种方式 实现对第一特征和第一融合特征进行融合处理。以下将分别从融合时机以及融合方式两个方面进行说明。
先对融合时机进行说明:参阅图8中的子图a,本申请提供的方案使第一机器学习模型/规则201中包括两个特征提取模型,即第一特征提取模型和第二特征提取模型。其中,第二特征提取模型建立像素(或者像素块)和像素(或像素块)之间的关联关系,具体的可以参照步骤403进行理解。第一特征提取模型可以建立图像块和图像块之间的关联关系,关于如何建立图像块和图像块之间的关联关系可以参照如何建立像素块和像素块之间的关联关系进行理解,具体的,在提取一个图像块的图像特征时,将其他一个或者多个图像块对该图像块的影响考虑进去,这里不再重复赘述。上文介绍到第一特征提取模型包括多个特征提取模块,其中,针对于当前的特征提取模块,前一个特征提取模块的输出用于获取下一个特征提取模块的输入,前一个特征提取模块的输入可以看做是第一向量。继续参阅图8中的子图a,在一个可能的实施方式中,可以将第一特征提取模型中的前一个特征提取模块输出的第一特征,和第二特征提取模型的当前特征提取模块输出第一融合特征进行融合处理后,获取第二融合特征,将第二融合特征作为第一特征提取模型中的当前的特征提取模块的输入。参阅图8中的子图b,在一个可能的实施方式中,可以将第一特征提取模型中的前一个特征提取模块输出的第一特征,作为第一特征提取模型中的当前的特征提取模块的输入,对第一特征提取模型中的当前的特征提取模块输出,和第二特征提取模型中的当前的特征提取模块输出的第一融合特征进行融合处理后,获取第二融合特征,作为第一特征提取模型中的后一个特征提取模块的输入。
再对融合方式进行说明:在一个可能的实施方式中,第一融合特征包括多个时,可以对多个第一融合特征进行首尾拼接处理,以获取拼接后的特征。将拼接后的特征映射为目标长度的特征,目标长度根据第一特征的长度确定。拼接后的特征和第一特征的长度相同,则可以直接对二者进行相加处理,对第一特征和目标长度的特征进行相加处理,以获取第二融合特征。在一个可能的实施方式中,对第一特征和第一融合特征进行首尾拼接处理,以获取第二融合特征,比如将第一特征和拼接后的特征进行拼接处理,以获取第二融合特征。在一个可能的实施方式中,对第一特征和第一融合特征进行目标运算,以获取第二融合特征,目标运算包括相加或相乘中的至少一种。
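与上述“首尾拼接、映射为目标长度、再与第一特征相加”的融合方式对应，下面给出一个示意性的代码片段。其中假设使用PyTorch，各维度取值与线性映射层均为举例性假设，仅用于说明这一种融合方式的数据流。

```python
import torch
import torch.nn as nn

n, m, inner_dim, outer_dim = 196, 36, 24, 384

first_feature = torch.randn(n, outer_dim)        # 第一特征：每个图像块对应一个
first_fusion = torch.randn(n, m, inner_dim)      # 第一融合特征：每个图像块内 m 个像素块各对应一个

# 对多个第一融合特征进行首尾拼接，得到拼接后的特征
concatenated = first_fusion.reshape(n, m * inner_dim)           # (n, 864)

# 将拼接后的特征映射为目标长度的特征，目标长度与第一特征的长度相同
proj = nn.Linear(m * inner_dim, outer_dim)
projected = proj(concatenated)                                   # (n, 384)

# 对第一特征和目标长度的特征进行相加处理，得到第二融合特征
second_fusion = first_feature + projected                        # (n, 384)
```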
第二融合特征用于确定待处理对象的最终特征,在一个可能的实施方式中,第一特征提取模型中的多个特征提取模块中的最后一个特征提取模块输出的第二融合特征用于确定待处理对象的最终提取的特征。针对于每个图像块,最后一个特征提取模块都会输出对应的第二融合特征,第二融合特征的集合即为待处理对象的最终特征。在一个可能的实施方式中,对最后一个特征提取模块输出的每个图像块对应的第二融合特征进行加权处理,加权处理后的结果作为待处理对象的最终特征。
由图4对应的实施例可知,本申请提供的方案,在通过第一机器学习模型/规则201提取图像特征的过程中,保留图像块和图像块之间的关联关系,又保留了像素(或者像素块)和像素(或像素块)之间的关联关系,使第一机器学习模型/规则201提取的图像特征能够很好的捕捉图像的颜色特征、纹理特征、形状特征和空间关系特征等,进而可以提升第一 机器学习模型/规则201的性能。此外,需要说明的是,图4对应的实施例主要以图像数据为例进行的说明,但是应当明确针对其他类型的数据,本申请提供的方案同样适用。比如针对文本数据,在通过第一机器学习模型/规则201提取文本特征的过程中,保留文本块和文本块之间的关联关系,又保留了词语块和词语块之间的关联关系,使第一机器学习模型/规则201提取的文本特征能够很好的捕捉文本的语义特征等。其中文本块可以理解为包括待处理文本数据中的部分元素,比如包括待处理文本数据中的部分相邻的词语。词语块可以理解为包括文本块中的部分元素,比如包括文本块中的部分相邻的词语。以下实施例均以图像数据为例进行说明,其他类型的数据可以参照对图像数据的处理流程进行理解,以下对此不再重复说明。
为了使第一机器学习模型/规则201提取的图像特征能够很好的捕捉图像的颜色特征、纹理特征、形状特征和空间关系特征等,还可以在模型提取图像特征的过程中保留图像块和像素块的位置信息,下面结合一个实施例进行说明。
参阅图9,为本申请实施例提供的一种特征提取的方法的流程示意图。
如图9所示,本申请实施例提供的一种特征提取的方法可以包括如下步骤:
901、对待处理对象进行切分处理,以获取切分后的待处理对象。
902、针对每一个切分后的待处理对象,获取多个元素集合。
步骤901和步骤902可以参照图4对应的实施例中的401和402进行理解,这里不再重复赘述。
903、在第一向量上融合第一位置信息,在第二向量上融合第二位置信息。
第一位置信息为切分后的对象在待处理对象中的位置信息。比如第一位置信息为图像块在待处理图像中的位置信息，第二位置信息为切分后的对象中的部分元素在切分后的对象中的位置信息，比如第二位置信息为像素块在图像块中的位置信息。
其中,第一位置信息可以通过一个像素的坐标信息进行表示或者可以通过多个像素的坐标信息进行表示。比如,对待处理图像进行均匀切分,以获取多个图像块时,每个图像块的位置信息可以通过每个图像块左上角的像素的坐标进行表示。再比如,每个图像块是规整的矩形或者正方形时,每个图像块的位置信息可以通过每个图像块左上角的像素的坐标和右下角的像素的坐标进行表示。需要说明的是,这里的左上角的像素的坐标和右下角的像素的坐标仅为举例说明,用于说明第一位置信息可以通过一个像素的坐标信息或者多个像素的坐标信息进行表示,并不代表本申请提供的方案的限制。
第二位置信息可以通过一个像素的坐标信息进行表示或者可以通过多个像素的坐标信息进行表示。此外,由于一个像素块中包括的各个像素可以都是不相邻的像素,则一个像素块中的位置信息可以通过像素块中包括的全部像素的坐标进行表示。
除了可以通过像素的坐标信息表示第一位置信息和第二位置信息,还可以通过编码向量表示第一位置信息和第二位置信息。以第一位置信息为例进行说明,第一机器学习模型/规则201可以包括位置编码模块,可以在初始状态下,位置编码模块可以随机设定一个向量表示每个图像块的位置信息,在对第一机器学习模型/规则201进行迭代训练的过程中,可以根据损失值更新位置编码模块的参数,使位置编码模块编码的用于表示图像块位置信 息的向量可以更接近图像块真实的位置信息。
在第一向量上融合第一位置信息，在第二向量上融合第二位置信息，可以理解为对公式1-1中的X_n、公式1-3中的Y_0^i进行了更新，可以参照公式1-4和公式1-5进行理解。
X_n←X_n+E_position-patch   (1-4)
Y_0^i←Y_0^i+E_position-pixel   (1-5)
其中，E_position-patch用于表示第一位置信息，E_position-pixel用于表示第二位置信息。
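与公式1-4、公式1-5对应，下面给出一段在第一向量、第二向量上叠加可学习位置编码的示意代码。其中假设使用PyTorch，位置编码以零初始化、各图像块共享像素级位置编码等做法均为举例性假设，仅用于说明位置编码可以随训练更新并叠加到向量上。

```python
import torch
import torch.nn as nn

n, m, outer_dim, inner_dim = 196, 36, 384, 24

first_vectors = torch.randn(n, outer_dim)        # 第一向量（图像块）
second_vectors = torch.randn(n, m, inner_dim)    # 第二向量（像素块）

# 随机设定、可在迭代训练中更新的位置编码
e_position_patch = nn.Parameter(torch.zeros(n, outer_dim))       # E_position-patch
e_position_pixel = nn.Parameter(torch.zeros(1, m, inner_dim))    # E_position-pixel（此处假设各图像块共享）

first_vectors = first_vectors + e_position_patch                  # 对应公式(1-4)
second_vectors = second_vectors + e_position_pixel                # 对应公式(1-5)
```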
图4对应的实施例介绍到在一个可能的实施方式中，第一特征提取模型中的各个特征提取模块的权重矩阵均为0。则第一向量只携带了第一位置信息，举例说明：若第一特征提取模型中的各个特征提取模块的权重矩阵均为0，则对于第一特征提取模型中的第一个特征提取模块，第一个特征提取模块的输入是n个第一向量，并且该n个第一向量的每个向量中的全部元素的取值均为0。该n个第一向量融合了第一位置信息后作为第一个特征提取模块的输入。在一个可能的实施方式中，第一个特征提取模块的输入还可以是n+1个第一向量，并且该n+1个第一向量中的每个向量中的全部元素的取值均为0。针对该n+1个第一向量，其中n个向量用于融合第一位置信息，剩余的一个向量用于表示每个图像块对应的第一位置信息的加权平均值。
904、通过第一特征提取模型对融合了第一位置信息的第一向量进行特征提取,以获取第一特征,通过第二特征提取模型对融合了第二位置信息的第二向量进行特征提取,以获取第二特征。
905、根据第一目标权重对至少两个第二特征进行融合,以获取第一融合特征。
906、对第一特征和第一融合特征进行融合处理,以获取第二融合特征,第二融合特征用于获取待处理对象的特征。
步骤904至步骤906可以参照图4对应的实施例中的步骤403至步骤405进行理解，区别在于图9对应的实施例可以给第一特征提取模型和第二特征提取模型提供更多的信息，具体的，提供第一位置信息和第二位置信息。第一特征提取模型和第二特征提取模型能够获取到的信息越多，越有助于第一特征提取模型和第二特征提取模型进行学习，以更好的提取图像特征。
图4以及图9的实施例中,介绍了本申请提供的方案使第一机器学习模型/规则201中包括第一特征提取模型和第二特征提取模型,使得在通过第一机器学习模型/规则201提取图像特征的过程中,保留图像块和图像块之间的关联关系,又保留了像素(或者像素块)和像素(或像素块)之间的关联关系,使第一机器学习模型/规则201提取的图像特征能够很好的捕捉图像的颜色特征、纹理特征、形状特征和空间关系特征等,进而可以提升第一机器学习模型/规则201的性能。在一些可能的实施方式中,第一机器学习模型/规则201还可以包括更多数目的第一特征提取模型和更多数目的第二特征提取模型。下面结合一个具体的实施例进行说明。
参阅图10，为本申请实施例提供的一种模型的架构示意图。该模型中可以包括多个特征提取模型，比如图10中所示的特征提取模型1、特征提取模型2以及特征提取模型3。其中，对于特征提取模型1来说，特征提取模型1相当于是第一特征提取模型，特征提取模型2和特征提取模型3都相当于是第二特征提取模型；对于特征提取模型2来说，特征提取模型1相当于第一特征提取模型，特征提取模型2相当于是第二特征提取模型；或者特征提取模型2相当于是第一特征提取模型，特征提取模型3相当于是第二特征提取模型；对于特征提取模型3来说，特征提取模型1和特征提取模型2都相当于是第一特征提取模型，特征提取模型3相当于第二特征提取模型。第一特征提取模型和第二特征提取模型的具体执行过程已经在图4和图9对应的实施例中进行了详细的介绍，这里不再重复说明。
在一个可能的实施方式中，可以对待处理图像进行多次切分，比如在一次切分中，将待处理图像切分为4个图像块，对该4个图像块进行预处理，使其满足特征提取模型1对于输入的要求后，作为特征提取模型1的输入；在另一次切分中，将待处理图像切分为16个图像块，对该16个图像块进行预处理，使其满足特征提取模型2对于输入的要求后，作为特征提取模型2的输入；在另一次切分中，将待处理图像切分为64个图像块，对该64个图像块进行预处理，使其满足特征提取模型3对于输入的要求后，作为特征提取模型3的输入。需要说明的是，在一个可能的实施方式中，多个特征提取模型，比如特征提取模型1、特征提取模型2以及特征提取模型3可以同时并行执行。
通过本申请实施例提供的方案,可以提升特征提取模型的性能,使提取出的图像特征可以更好的表征图像信息。图像特征能够表征的图像信息越多,越有利于提升视觉分析任务的准确性。下面以本申请提供的方案应用于几个典型的视觉分析任务为例,对本申请提供的方案进行介绍。
参阅图11，本申请提供的方案应用于自动驾驶的应用场景中时，成熟的第一机器学习模型/规则201可以部署在自动驾驶车辆上，也可以部署在云端设备上。自动驾驶车辆通过摄像头获取车辆周围的环境图像后，将获取到的图像输入至预处理模块中，以使预处理模块对图像进行切分处理，以获取图像块以及像素块，并将获取到的图像块以及像素块转化为满足第一特征提取模型和第二特征提取模型输入要求的向量。其中，预处理模块可以看做成熟的第一机器学习模型/规则201的一部分，也可以看做一个单独的部分。当预处理模块是一个单独的部分时，预处理模块可以部署在自动驾驶车辆上，而成熟的第一机器学习模型/规则201可以部署在云端设备上。第一机器学习模型/规则201通过第一特征提取模型和第二特征提取模型对摄像头获取的车辆周围的环境图像进行特征提取，由于在特征提取的过程中，既保留了图像块和图像块之间的关联关系，又保留了像素块和像素块之间的关联关系，使提取后的图像特征可以更好的表征车辆周围的环境中每个物体所在的区域，有利于第一机器学习模型/规则201中的语义分割模型根据提取后的特征对摄像头获取的车辆周围的环境图像进行分割，以从图像中分割出路面、路基、车辆、行人等不同物体所在的区域，从而保持车辆行驶在正确的区域。
参阅图12，本申请提供的方案应用于智能监控领域中时，成熟的第一机器学习模型/规则201可以部署在智能监控设备上，也可以部署在云端设备上。监控设备(比如图12中所示的摄像头A、摄像头B以及摄像头C)获取的图像输入至预处理模块中，以使预处理模块对图像进行切分处理，以获取图像块以及像素块，并将获取到的图像块以及像素块转化为满足第一特征提取模型和第二特征提取模型输入要求的向量。其中，预处理模块可以看做成熟的第一机器学习模型/规则201的一部分，也可以看做一个单独的部分。当预处理模块是一个单独的部分时，预处理模块可以部署在监控设备上，而成熟的第一机器学习模型/规则201可以部署在云端设备上。第一机器学习模型/规则201通过第一特征提取模型和第二特征提取模型对智能监控设备获取的图像进行特征提取，由于在特征提取的过程中，既保留了图像块和图像块之间的关联关系，又保留了像素块和像素块之间的关联关系，使提取后的图像特征可以更好的表征智能监控设备的感知范围内出现的对象的特征。比如若执行行人属性识别任务，通过本申请提供的方案提取的图像特征，可以更好的表征行人的属性以及行人的细节特征，有利于第一机器学习模型/规则201中的属性识别模型根据提取后的特征对智能监控设备获取的图像进行行人属性的识别，例如识别行人的性别、年龄、头发颜色、衣服、穿搭等等，其中行人属性识别的结果可以在端侧设备进行展示，也可以存储在服务器中。
为了更直观地理解本方案所带来的有益效果,以下结合数据对本申请实施例带来的有益效果进行说明。测试时,第一机器学习模型/规则201用于执行图像分类任务。如图13所示,为通过第一机器学习模型/规则201执行图像分类任务的流程示意图。第一机器学习模型/规则201包括多个目标特征提取模型,每个目标特征提取模型包括第一特征提取模型和第二特征提取模型。图13中,L表示正整数。第一机器学习模型/规则201还可以包括图像预处理模块,或者图像预处理模块也可以作为一个独立于第一机器学习模型/规则201的模块。图像预处理模块对图像进行切分处理,以获取图像块以及像素块,并将获取到的图像块以及像素块转化为满足第一特征提取模型和第二特征提取模型输入要求的向量。此外,在一个可能的实施方式中,还可以对满足第一特征提取模型输入要求的多个图像块对应的向量进行加权处理,将该结果也作为第一特征提取模型的输入。图像预处理模块和目标特征提取模型执行的相关流程可以参照图4以及图9对应的实施例中的相关描述进行理解,这里不再重复赘述。第一机器学习模型/规则201还可以包括多层感知机头部(multi-layer perceptron head,MLP head),MLP head用于根据最后一个目标特征提取模型的输出执行图像分类任务,以输出分类结果,比如在图13对应的场景中,输出的分类结果为“house”。第一测试用到的第一机器学习模型/规则201的配置信息如表1所示。
表1:
（表1中各项参数的具体取值以原始申请的附图形式给出，此处未逐项转录，下文仅说明各参数的含义。）
参数一用于表示特征提取模型中包括的特征提取模块的数目，即第一特征提取模型中包括12个特征提取模块，第二特征提取模型中包括12个特征提取模块。参数二用于表示第二特征提取模型对于输入向量长度的要求。参数三用于表示第二特征提取模型中自注意力模块中的头部(multi-head self-attention)的数目。参数四用于表示第一特征提取模型对于输入向量长度的要求。参数五用于表示第一特征提取模型中自注意力模块中的头部(multi-head self-attention)的数目。参数六用于表示应用了本申请提供的一种特征提取方法的第一机器学习模型/规则201中的总的参数数目，参数数目的单位是million。参数七用于表示浮点计算次数(Floating Point Operations，FLOPs)，单位是billion。
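为便于理解图13所示"多个目标特征提取模型级联、再经MLP head输出分类结果"的整体流程，下面给出一个端到端前向过程的代码草图。其中假设使用PyTorch，类名TargetFeatureExtractor、各维度与头数取值、级联层数以及对图像块特征取平均后分类的做法均为举例性假设，并非对图13各模块或表1参数取值的确切实现。

```python
import torch
import torch.nn as nn

class TargetFeatureExtractor(nn.Module):
    """示意性的目标特征提取模型：内部含像素块级与图像块级两级特征提取。"""

    def __init__(self, outer_dim: int = 384, inner_dim: int = 24, m: int = 36):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(inner_dim, nhead=4, batch_first=True)   # 第二特征提取模型的一层
        self.proj = nn.Linear(m * inner_dim, outer_dim)                                 # 拼接后映射为目标长度
        self.outer = nn.TransformerEncoderLayer(outer_dim, nhead=6, batch_first=True)   # 第一特征提取模型的一层

    def forward(self, patch_feat, pixel_feat):
        pixel_feat = self.inner(pixel_feat)                        # 像素块级融合，(n, m, inner_dim)
        patch_feat = patch_feat + self.proj(pixel_feat.flatten(1)) # 将第一融合特征融合进图像块特征
        patch_feat = self.outer(patch_feat.unsqueeze(0)).squeeze(0)# 图像块级融合，(n, outer_dim)
        return patch_feat, pixel_feat

blocks = nn.ModuleList([TargetFeatureExtractor() for _ in range(2)])   # 级联L个，此处L=2仅作演示
mlp_head = nn.Linear(384, 1000)                                         # MLP head，输出分类结果

patch_feat, pixel_feat = torch.randn(196, 384), torch.randn(196, 36, 24)
for blk in blocks:
    patch_feat, pixel_feat = blk(patch_feat, pixel_feat)
logits = mlp_head(patch_feat.mean(dim=0))        # 对各图像块的第二融合特征取平均后进行分类
```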
测试数据集为ImageNet数据集,通过在ImageNet数据集上进行图像分类的测试实验。测试结果通过图14进行展示。如图14所示,相比于几种已有的图像分类模型,本申请提供的方案在通过第一机器学习模型/规则201提取图像特征的过程中,保留图像块和图像块之间的关联关系,又保留了像素(或者像素块)和像素(或像素块)之间的关联关系,使第一机器学习模型/规则201提取的图像特征能够很好的捕捉图像的颜色特征、纹理特征、形状特征和空间关系特征等,进而可以提升第一机器学习模型/规则201的分类的准确性。具体表现在,当几种已有的模型和应用了本申请提供的特征提取方法的第一机器学习模型/规则201的计算量相同时,应用了本申请提供的特征提取方法的第一机器学习模型/规则201对于图像的分类的准确率更高。此外,测试结果还展示了,相比于已有的图像分类模型,应用了本申请提供的特征提取方法的第一机器学习模型/规则201的计算量更少,换句话说应用了本申请提供的特征提取方法的第一机器学习模型/规则201的效率更高,具体表现在,当几种已有的模型和应用了本申请提供的特征提取方法的第一机器学习模型/规则201对于图像的分类的准确率相同时,应用了本申请提供的特征提取方法的第一机器学习模型/规则201需要的计算量更少。
以上对本申请实施例提供的一种特征提取的方法进行了介绍,通过本申请提供的一种特征提取的方法,可以使提取的待处理对象的特征可以更好的表征待处理对象,进而可以提升应用了该特征提取方法的模型的性能。
可以理解的是,为了实现上述功能,下面还提供用于实施上述方案的相关设备。这些相关设备包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的模块及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定 的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
具体参阅图15,图15为本申请实施例提供的电子设备的一种结构示意图。该电子设备可以包括第一获取模块1501,第二获取模块1502,第一融合模块1503,第二融合模块1504以及第三融合模块1505。
第一获取模块1501,用于获取第一特征,第二获取模块1502用于获取多个第二特征,第一特征是通过第一特征提取模型对第一向量进行特征提取后获取的,第一向量用于表示第一切分后的对象,第一切分后的对象包括待处理对象中的部分元素,第二特征是通过第二特征提取模型对第二向量进行特征提取后获取的,第二向量用于表示第一切分后的对象中的部分元素。第一融合模块1503用于根据第一目标权重对至少两个第二特征进行融合,以获取第一融合特征,第一目标权重是根据至少两个第二特征中的每个第二特征,对目标第二特征的影响确定的,目标第二特征是至少两个第二特征中的任意一个第二特征。第二融合模块1504,用于对第一特征和第一融合特征进行融合处理,以获取第二融合特征,第二融合特征用于获取待处理对象的特征。
在一种可能的实施方式中,第一获取模块1501,还用于获取第三特征,第三特征是通过第一特征提取模型对第三向量进行特征提取后获取的,第三向量用于表示第二切分后的对象,第二切分后的对象包括待处理对象中的部分元素。第三融合模块1505,用于根据第二目标权重对第一特征和第三特征进行融合,以获取第三融合特征,第二目标权重是根据第三特征对第一特征的影响确定的。第二融合模块1504具体用于对第三融合特征和第一融合特征进行融合处理,以获取第二融合特征。
在一种可能的实施方式中,第一向量具体用于表示携带了第一位置信息的第一切分后的对象,第一位置信息为第一切分后的对象在待处理对象中的位置信息。
在一种可能的实施方式中,每个第二向量具体用于表示携带了第二位置信息的,第一切分后的对象中的部分元素,第二位置信息为第一切分后的对象中的部分元素在第一切分后的对象中的位置信息。
在一种可能的实施方式中,第二融合模块1504具体用于对第一特征和第一融合特征进行首尾拼接处理,以获取第二融合特征。
在一种可能的实施方式中,第二融合模块1504具体用于对第一特征和第一融合特征进行目标运算,以获取第二融合特征,目标运算包括相加或相乘中的至少一种。
在一种可能的实施方式中,第二融合模块1504具体用于第一融合特征包括多个时,对多个第一融合特征进行首尾拼接处理,以获取拼接后的特征。将拼接后的特征映射为目标长度的特征,目标长度根据第一特征的长度确定。对第一特征和目标长度的特征进行相加处理,以获取第二融合特征。
在一种可能的实施方式中,第一融合模块1503具体用于将至少两个第二特征作为目标模型的输入,目标模型的输出为第一融合特征,目标模型包括自注意力网络Transformer、卷积神经网络CNN或循环神经网络RNN中的其中一种,目标模型是Transformer时,第一目标权重是根据至少两个第二特征中的每个第二特征和目标第二特征之间的内积确定的,目标模型是CNN或RNN中的其中一种时,第一目标权重是预设的。
在一种可能的实施方式中,待处理对象是待处理图像,第一向量具体用于表示第一切分后的图像,第一切分后的图像具体包括待处理图像中的部分像素,第二向量具体用于表示第一切分后的图像中的部分像素,第二融合特征具体用于获取待处理图像的特征。
在一种可能的实施方式中,该电子设备可以是图2中所描述的训练设备220或者是图2中所描述的执行设备210。
需要说明的是，图15中所示的电子设备中各模块之间的信息交互、执行过程等内容，与本申请中图4至图10对应的各个方法实施例基于同一构思，具体内容可参见本申请前述所示的方法实施例中的叙述，此处不再赘述。
本申请实施例还提供一种电子设备,请参阅图16,图16为本申请实施例提供的电子设备的一种结构示意图。电子设备1400上可以部署有图4至图10中所描述的第一机器学习模型/规则201,该第一机器学习模型/规则201中包括第一特征提取模型和第二特征提取模型,用于执行图4至图10中的对应步骤。具体的,电子设备1400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1422(例如,一个或一个以上处理器)和存储器1432,一个或一个以上存储应用程序1442或数据1444的存储介质1430(例如一个或一个以上海量存储设备)。其中,存储器1432和存储介质1430可以是短暂存储或持久存储。在一个实施例中,存储器1432为随机存储存储器(random access memory,RAM),可以与中央处理器1422直接交换数据,用于加载数据1444和应用程序1442和/或操作系统1441以供中央处理器1422直接运行与运用,通常作为操作系统或其他正在运行中的程序的临时数据存储媒介。存储在存储介质1430的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对电子设备中的一系列指令操作。更进一步地,中央处理器1422可以设置为与存储介质1430通信,在电子设备1400上执行存储介质1430中的一系列指令操作。
电子设备1400还可以包括一个或一个以上电源1426,一个或一个以上有线或无线网络接口1450,一个或一个以上输入输出接口1458,和/或,一个或一个以上操作系统1441,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
需要说明的是,中央处理器1422还用于执行图4至图10中第一机器学习模型/规则201执行的其他步骤,对于中央处理器1422执行图4至图10对应实施例中的第一机器学习模型/规则201执行的步骤的具体实现方式以及带来的有益效果,均可以参考图4至图10对应的各个方法实施例中的叙述,此处不再一一赘述。
本申请实施例还提供一种电子设备,请参阅图17,图17为本申请实施例提供的电子设备的一种结构示意图。电子设备1500上可以部署有图4至图10中所描述的第一机器学习模型/规则201,该第一机器学习模型/规则201中包括第一特征提取模型和第二特征提取模型,用于执行图4至图10中的对应步骤。具体的,电子设备1500包括:接收器1501、发射器1502、处理器1503和存储器1504(其中电子设备1500中的处理器1503的数量可以为一个或多个,图17中以一个处理器为例),其中,处理器1503可以包括应用处理器15031和通信处理器15032。在本申请的一些实施例中,接收器1501、发射器1502、处理器1503和存储器1504可通过总线或其它方式连接。
存储器1504可以包括只读存储器和随机存取存储器,并向处理器1503提供指令和数据。存储器1504的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1504存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1503控制电子设备的操作。具体的应用中,电子设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1503中,或者由处理器1503实现。处理器1503可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1503中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1503可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1503可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1504,处理器1503读取存储器1504中的信息,结合其硬件完成上述方法的步骤。
接收器1501可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1502可用于通过接口输出数字或字符信息;发射器1502还可用于通过上述接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1502还可以包括显示屏等显示设备。
在一种情况下，本申请实施例中，应用处理器15031用于执行图4至图10中对应的实施例中描述的第一机器学习模型/规则201执行的方法。
对于应用处理器15031执行图4至图10对应实施例中第一机器学习模型/规则201的功能的具体实现方式以及带来的有益效果,均可以参考图4至图10对应的各个方法实施例中的叙述,此处不再一一赘述。
应当理解，上述仅为本申请实施例提供的一个例子，并且，电子设备可具有比示出的部件更多或更少的部件，可以组合两个或更多个部件，或者可具有部件的不同配置实现。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
本申请实施例提供的执行设备和训练设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使芯片执行上述图4至 图10所示实施例描述的模型的特征提取的方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图18,图18为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 160,NPU 160作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1603,通过控制器1604控制运算电路1603提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1603内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1603是二维脉动阵列。运算电路1603还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1603是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1602中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1601中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1608中。
统一存储器1606用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1605被搬运到权重存储器1602中。输入数据也通过DMAC被搬运到统一存储器1606中。
总线接口单元1610(Bus Interface Unit,简称BIU),用于取指存储器1609从外部存储器获取指令,还用于存储单元访问控制器1605从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1606或将权重数据搬运到权重存储器1602中或将输入数据搬运到输入存储器1601中。
向量计算单元1607包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1607能将经处理的输出的向量存储到统一存储器1606。例如,向量计算单元1607可以将线性函数和/或非线性函数应用到运算电路1603的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1607生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1603的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1604连接的取指存储器(instruction fetch buffer)1609,用于存储控制器1604使用的指令;统一存储器1606,输入存储器1601,权重存储器1602以及取指存储器1609均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路1603或向量计算单元1607执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面方法的程序执行的集成电路。
本申请实施例提供还提供一种芯片,该芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使芯片执行上述图4至图10中所描述的方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。具体地,前述的处理单元或者处理器可以是中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于训练模型的程序,当其在计算机上运行时,使得计算机执行上述图4至图10中所描述的方法。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述图4至图10所示实施例描述的方法中的步骤。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
本申请实施例中还提供一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行如前述图4至图10所示实施例描述的方法中的步骤。
通过以上的实施方式的描述，所属领域的技术人员可以清楚地了解到本申请可借助纯软件或软件加必需的通用硬件的方式来实现，当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下，凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现，而且，用来实现同一功能的具体硬件结构也可以是多种多样的，例如模拟电路、数字电路或专用电路等。但是，对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在可读取的存储介质中，如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，训练设备，或者网络设备等）执行本申请各个实施例所述的方法。此外，该计算机软件产品也可以以控件、驱动程序、独立或可下载软件对象等形式体现。
本申请的说明书和权利要求书及上述附图中的术语“第一”,“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。本申请中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程,方法,系统,产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程,方法,产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的模块的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个模块可以结合成或集成在另一个系统中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些端口,模块之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的模块或子模块可以是也可以不是物理上的分离,可以是也可以不是物理模块,或者可以分布到多个电路模块中,可以根据实际的需要选择其中的部分或全部模块来实现本申请方案的目的。

Claims (27)

  1. 一种特征提取的方法,其特征在于,包括:
    获取第一特征和多个第二特征,所述第一特征是通过第一特征提取模型对第一向量进行特征提取后获取的,所述第一向量用于表示第一切分后的对象,所述第一切分后的对象包括待处理对象中的部分元素,所述第二特征是通过第二特征提取模型对第二向量进行特征提取后获取的,所述第二向量用于表示所述第一切分后的对象中的部分元素;
    根据第一目标权重对至少两个所述第二特征进行融合,以获取第一融合特征,所述第一目标权重是根据第一参数值确定的,所述第一参数值用于表示所述至少两个第二特征中的每个所述第二特征,和所述目标第二特征之间的相似度,所述目标第二特征是所述至少两个第二特征中的任意一个所述第二特征,或者所述第一目标权重为第二参数值,所述第二参数值包括至少一个预设常数;
    对所述第一特征和所述第一融合特征进行融合处理,以获取第二融合特征,所述第二融合特征用于获取所述待处理对象的特征。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取第三特征,所述第三特征是通过所述第一特征提取模型对第三向量进行特征提取后获取的,所述第三向量用于表示第二切分后的对象,所述第二切分后的对象包括所述待处理对象中的部分元素;
    所述对所述第一特征和所述第一融合特征进行融合处理,以获取第二融合特征,包括:
    根据第二目标权重对所述第一特征和所述第三特征进行融合,以获取第三融合特征,所述第二目标权重是根据第三参数值确定的,所述第三参数值用于表示所述第三特征和所述第一特征之间的相似度,或者所述第二目标权重为第四参数值,所述第四参数值包括至少一个预设常数;
    对所述第三融合特征和所述第一融合特征进行融合处理,以获取所述第二融合特征。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一向量具体用于表示携带了第一位置信息的所述第一切分后的对象,所述第一位置信息为所述第一切分后的对象在所述待处理对象中的位置信息。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,每个所述第二向量具体用于表示携带了第二位置信息的,所述第一切分后的对象中的部分元素,所述第二位置信息为所述第一切分后的对象中的部分元素在所述第一切分后的对象中的位置信息。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述对所述第一特征和所述第一融合特征进行融合处理,以获取第二融合特征,包括:
    对所述第一特征和所述第一融合特征进行首尾拼接处理,以获取所述第二融合特征。
  6. 根据权利要求1至4任一项所述的方法,其特征在于,所述对所述第一特征和所述第一融合特征进行融合处理,以获取第二融合特征,包括:
    对所述第一特征和所述第一融合特征进行目标运算,以获取所述第二融合特征,所述目标运算包括相加或者相乘中的至少一种。
  7. 根据权利要求6所述的方法,其特征在于,所述对所述第一特征和所述第一融合特征进行目标运算,以获取所述第二融合特征,包括:
    所述第一融合特征包括多个时,对多个所述第一融合特征进行首尾拼接处理,以获取拼接后的特征;
    将所述拼接后的特征映射为目标长度的特征,所述目标长度根据所述第一特征的长度确定;
    对所述第一特征和所述目标长度的特征进行相加处理,以获取所述第二融合特征。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述根据第一目标权重对至少两个所述第二特征进行融合,以获取第一融合特征,包括:
    将所述至少两个所述第二特征作为目标模型的输入,所述目标模型的输出为所述第一融合特征,所述目标模型包括自注意力网络Transformer、卷积神经网络CNN或者循环神经网络RNN中的其中一种,所述目标模型是所述Transformer时,所述第一目标权重是根据所述至少两个第二特征中的每个所述第二特征和所述目标第二特征之间的内积确定的,所述目标模型是所述CNN或所述RNN中的其中一种时,所述第一目标权重是所述第二参数值。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述待处理对象是待处理图像,所述第一向量具体用于表示第一切分后的图像,所述第一切分后的图像具体包括所述待处理图像中的部分像素,所述第二向量具体用于表示所述第一切分后的图像中的部分像素,所述第二融合特征具体用于获取所述待处理图像的特征。
  10. 一种特征提取模型,其特征在于,所述特征提取模型包括第一特征提取模型和第二特征提取模型,
    所述第一特征提取模型,用于获取第一特征,所述第一特征是通过所述第一特征提取模型对第一向量进行特征提取后获取的,所述第一向量用于表示第一切分后的对象,所述第一切分后的对象包括待处理对象中的部分元素;
    所述第二特征提取模型,用于获取多个第二特征,所述第二特征是通过所述第二特征提取模型对第二向量进行特征提取后获取的,所述第二向量用于表示所述第一切分后的对象中的部分元素;
    所述第二特征提取模型,还用于根据第一目标权重对至少两个所述第二特征进行融合,以获取第一融合特征,所述第一目标权重是根据第一参数值确定的,所述第一参数值用于表示所述至少两个第二特征中的每个所述第二特征,和所述目标第二特征之间的相似度,所述目标第二特征是所述至少两个第二特征中的任意一个所述第二特征,或者所述第一目标权重为第二参数值,所述第二参数值包括至少一个预设常数;
    所述第一特征提取模型,还用于对所述第一特征和所述第一融合特征进行融合处理,以获取第二融合特征,所述第二融合特征用于获取所述待处理对象的特征。
  11. 根据权利要求10所述的特征提取模型,其特征在于,所述第一特征提取模型,还用于:
    获取第三特征,所述第三特征是通过所述第一特征提取模型对第三向量进行特征提取后获取的,所述第三向量用于表示第二切分后的对象,所述第二切分后的对象包括所述待处理对象中的部分元素;
    所述第一特征提取模型,具体用于:
    根据第二目标权重对所述第一特征和所述第三特征进行融合,以获取第三融合特征,所述第二目标权重是根据第三参数值确定的,所述第三参数值用于表示所述第三特征和所述第一特征之间的相似度,或者所述第二目标权重为第四参数值,所述第四参数值包括至少一个预设常数;
    对所述第三融合特征和所述第一融合特征进行融合处理,以获取所述第二融合特征。
  12. 根据权利要求10或11所述的特征提取模型,其特征在于,所述第一向量具体用于表示携带了第一位置信息的所述第一切分后的对象,所述第一位置信息为所述第一切分后的对象在所述待处理对象中的位置信息。
  13. 根据权利要求10至12任一项所述的特征提取模型,其特征在于,每个所述第二向量具体用于表示携带了第二位置信息的,所述第一切分后的对象中的部分元素,所述第二位置信息为所述第一切分后的对象中的部分元素在所述第一切分后的对象中的位置信息。
  14. 根据权利要求10至13任一项所述的特征提取模型,其特征在于,所述第一特征提取模型,具体用于:
    对所述第一特征和所述第一融合特征进行首尾拼接处理,以获取所述第二融合特征。
  15. 根据权利要求10至13任一项所述的特征提取模型，其特征在于，所述第一特征提取模型，具体用于：
    对所述第一特征和所述第一融合特征进行目标运算,以获取所述第二融合特征,所述目标运算包括相加或者相乘中的至少一种。
  16. 根据权利要求15所述的特征提取模型,其特征在于,所述第一特征提取模型,具体用于:
    所述第一融合特征包括多个时,对多个所述第一融合特征进行首尾拼接处理,以获取拼接后的特征;
    将所述拼接后的特征映射为目标长度的特征,所述目标长度根据所述第一特征的长度确定;
    对所述第一特征和所述目标长度的特征进行相加处理,以获取所述第二融合特征。
  17. 根据权利要求10至16任一项所述的特征提取模型,其特征在于,所述第二特征提取模型,具体用于:
    将所述至少两个所述第二特征作为目标模型的输入,所述目标模型的输出为所述第一融合特征,所述目标模型包括自注意力网络Transformer、卷积神经网络CNN或者循环神经网络RNN中的其中一种,所述目标模型是所述Transformer时,所述第一目标权重是根据所述至少两个第二特征中的每个所述第二特征和所述目标第二特征之间的内积确定的,所述目标模型是所述CNN或所述RNN中的其中一种时,所述第一目标权重是所述第二参数值。
  18. 根据权利要求10至17任一项所述的特征提取模型,其特征在于,所述待处理对象是待处理图像,所述第一向量具体用于表示第一切分后的图像,所述第一切分后的图像具体包括所述待处理图像中的部分像素,所述第二向量具体用于表示所述第一切分后的图像 中的部分像素,所述第二融合特征具体用于获取所述待处理图像的特征。
  19. 一种图像处理方法,其特征在于,包括:
    获取待处理图像;
    将所述待处理图像输入至视觉感知模型中,以通过所述视觉感知模型中包括的特征提取模型提取图像特征,所述特征提取模型为权利要求10至18中所描述的特征提取模型;
    根据所述图像特征对所述待处理图像进行视觉感知。
  20. 根据权利要求19所述的图像处理方法,其特征在于,所述根据所述图像特征对所述待处理图像进行视觉感知,包括:
    根据所述图像特征对所述待处理图像进行分类,以获取所述待处理图像的分类结果。
  21. 根据权利要求19所述的图像处理方法,其特征在于,所述获取待处理图像,包括:
    通过车辆的传感器获取待处理图像;
    所述根据所述图像特征对所述待处理图像进行视觉感知,包括:
    所述根据所述图像特征对所述待处理图像进行语义分割,以获取所述待处理图像中目标物体所在区域,所述目标物体包括人物、车辆、路面中的一种或者多种。
  22. 根据权利要求19所述的图像处理方法,其特征在于,所述获取待处理图像,包括:
    通过监控设备获取待处理图像;
    所述根据所述图像特征对所述待处理图像进行视觉感知,包括:
    若根据所述图像特征识别所述待处理图像中包括人物,则根据所述图像特征识别所述人物的属性,所述属性包括性别、肤色、年龄、服装中的一种或者多种。
  23. 一种电子设备,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至9中任一项所述的方法。
  24. 一种计算机可读存储介质,其特征在于,包括程序,当其在计算机上运行时,使得计算机执行如权利要求1至9中任一项所述的方法。
  25. 一种电路系统,其特征在于,所述电路系统包括处理电路,所述处理电路配置为执行如权利要求1至9中任一项所述的方法。
  26. 一种计算机程序产品,其特征在于,所述计算机程序产品包括指令,当所述指令由电子设备加载并执行,使电子设备执行权利要求1至9中任一项所述的方法。
  27. 一种芯片,其特征在于,所述芯片与存储器耦合,用于执行所述存储器中存储的程序,以执行如权利要求1至9任一项所述的方法。
PCT/CN2022/077807 2021-02-26 2022-02-25 一种特征提取的方法以及装置 WO2022179587A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22758951.2A EP4290408A1 (en) 2021-02-26 2022-02-25 Feature extraction method and apparatus
US18/237,995 US20230419646A1 (en) 2021-02-26 2023-08-25 Feature extraction method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110223032.8A CN113065576A (zh) 2021-02-26 2021-02-26 一种特征提取的方法以及装置
CN202110223032.8 2021-02-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/237,995 Continuation US20230419646A1 (en) 2021-02-26 2023-08-25 Feature extraction method and apparatus

Publications (1)

Publication Number Publication Date
WO2022179587A1 true WO2022179587A1 (zh) 2022-09-01

Family

ID=76559203

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077807 WO2022179587A1 (zh) 2021-02-26 2022-02-25 一种特征提取的方法以及装置

Country Status (4)

Country Link
US (1) US20230419646A1 (zh)
EP (1) EP4290408A1 (zh)
CN (1) CN113065576A (zh)
WO (1) WO2022179587A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065576A (zh) * 2021-02-26 2021-07-02 华为技术有限公司 一种特征提取的方法以及装置
CN114092773B (zh) * 2021-10-29 2023-11-21 北京百度网讯科技有限公司 信号处理方法、信号处理装置、电子设备及存储介质
CN115953654A (zh) * 2022-03-24 2023-04-11 北京字跳网络技术有限公司 一种图像处理方法、装置、电子设备及存储介质
CN117746047A (zh) * 2022-09-21 2024-03-22 华为技术有限公司 一种图像处理方法及其相关设备
CN115757692A (zh) * 2022-10-20 2023-03-07 华为技术有限公司 一种数据处理方法及其装置
CN117274450B (zh) * 2023-11-21 2024-01-26 长春职业技术学院 基于人工智能的动画形象生成系统及方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416323A (zh) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 用于识别人脸的方法和装置
CN108509904A (zh) * 2018-03-30 2018-09-07 百度在线网络技术(北京)有限公司 用于生成信息的方法和装置
CN108776787A (zh) * 2018-06-04 2018-11-09 北京京东金融科技控股有限公司 图像处理方法及装置、电子设备、存储介质
US20200026952A1 (en) * 2018-07-17 2020-01-23 Samsung Electronics Co., Ltd. Electronic apparatus, method for processing image and computer-readable recording medium
CN112257759A (zh) * 2020-09-27 2021-01-22 华为技术有限公司 一种图像处理的方法以及装置
CN113065576A (zh) * 2021-02-26 2021-07-02 华为技术有限公司 一种特征提取的方法以及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4770932B2 (ja) * 2002-07-16 2011-09-14 日本電気株式会社 パターン特徴抽出方法及びその装置
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN109472240B (zh) * 2018-11-12 2020-02-28 北京影谱科技股份有限公司 人脸识别多模型自适应特征融合增强方法和装置
CN110263324B (zh) * 2019-05-16 2021-02-12 华为技术有限公司 文本处理方法、模型训练方法和装置
CN110335261B (zh) * 2019-06-28 2020-04-17 山东科技大学 一种基于时空循环注意力机制的ct淋巴结检测系统
CN111444709B (zh) * 2020-03-09 2022-08-12 腾讯科技(深圳)有限公司 文本分类方法、装置、存储介质及设备
CN112257858A (zh) * 2020-09-21 2021-01-22 华为技术有限公司 一种模型压缩方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416323A (zh) * 2018-03-27 2018-08-17 百度在线网络技术(北京)有限公司 用于识别人脸的方法和装置
CN108509904A (zh) * 2018-03-30 2018-09-07 百度在线网络技术(北京)有限公司 用于生成信息的方法和装置
CN108776787A (zh) * 2018-06-04 2018-11-09 北京京东金融科技控股有限公司 图像处理方法及装置、电子设备、存储介质
US20200026952A1 (en) * 2018-07-17 2020-01-23 Samsung Electronics Co., Ltd. Electronic apparatus, method for processing image and computer-readable recording medium
CN112257759A (zh) * 2020-09-27 2021-01-22 华为技术有限公司 一种图像处理的方法以及装置
CN113065576A (zh) * 2021-02-26 2021-07-02 华为技术有限公司 一种特征提取的方法以及装置

Also Published As

Publication number Publication date
EP4290408A1 (en) 2023-12-13
CN113065576A (zh) 2021-07-02
US20230419646A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
WO2022179587A1 (zh) 一种特征提取的方法以及装置
JP7289918B2 (ja) 物体認識方法及び装置
US20220108546A1 (en) Object detection method and apparatus, and computer storage medium
EP4148622A1 (en) Neural network training method, image classification system, and related device
WO2021143101A1 (zh) 人脸识别方法和人脸识别装置
WO2022017245A1 (zh) 一种文本识别网络、神经网络训练的方法以及相关设备
CN109902548B (zh) 一种对象属性识别方法、装置、计算设备及系统
CN112990211B (zh) 一种神经网络的训练方法、图像处理方法以及装置
CN111931764B (zh) 一种目标检测方法、目标检测框架及相关设备
US20230076266A1 (en) Data processing system, object detection method, and apparatus thereof
WO2021164751A1 (zh) 一种感知网络结构搜索方法及其装置
CN110222718B (zh) 图像处理的方法及装置
WO2021218238A1 (zh) 图像处理方法和图像处理装置
WO2022111617A1 (zh) 一种模型训练方法及装置
WO2022012668A1 (zh) 一种训练集处理方法和装置
CN113326735B (zh) 一种基于YOLOv5的多模态小目标检测方法
CN113807399A (zh) 一种神经网络训练方法、检测方法以及装置
CN113011562A (zh) 一种模型训练方法及装置
CN113033321A (zh) 目标行人属性识别模型的训练方法及行人属性识别方法
CN111414915A (zh) 一种文字识别方法以及相关设备
CN113361549A (zh) 一种模型更新方法以及相关装置
WO2022194069A1 (zh) 一种生成显著图的方法、异常对象检测的方法以及装置
WO2022100607A1 (zh) 一种神经网络结构确定方法及其装置
WO2021155661A1 (zh) 一种图像处理方法以及相关设备
US20230401838A1 (en) Image processing method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22758951

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022758951

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022758951

Country of ref document: EP

Effective date: 20230908

NENP Non-entry into the national phase

Ref country code: DE