CN112529116A - Scene element fusion processing method, device and equipment and computer storage medium

Scene element fusion processing method, device and equipment and computer storage medium

Info

Publication number
CN112529116A
Authority
CN
China
Prior art keywords
target
sample picture
picture
target sample
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110176324.0A
Other languages
Chinese (zh)
Other versions
CN112529116B (en)
Inventor
李德辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110176324.0A priority Critical patent/CN112529116B/en
Publication of CN112529116A publication Critical patent/CN112529116A/en
Application granted granted Critical
Publication of CN112529116B publication Critical patent/CN112529116B/en
Legal status: Active

Classifications

    • G06F 18/25: Pattern recognition; Analysing; Fusion techniques
    • G06F 18/214: Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a scene element fusion processing method, device and equipment and a computer storage medium, and relates to the field of computer technology. Target area pictures containing a target type element are extracted from a labeled source sample picture set and fused into target sample pictures of a target sample picture set. In this way, elements that are missing or unbalanced in the target sample picture set can be added to the target sample pictures in a targeted manner, so that the target sample picture set is enhanced and the accuracy of a model trained on it can be improved. The method can be applied to scenarios such as autonomous driving and map navigation.

Description

Scene element fusion processing method, device and equipment and computer storage medium
Technical Field
The application relates to the field of computer technology, in particular to the field of Artificial Intelligence (AI), and provides a scene element fusion processing method, device and equipment and a computer storage medium.
Background
Computer vision is an important branch of artificial intelligence, research on computer vision algorithms has become a popular field in recent years, and supervised learning is the mainstream approach in current computer vision algorithms. In supervised learning, labeled data sets are indispensable: they provide ground-truth values for model inputs and outputs during training, guide the model in learning its parameters, and quantitatively evaluate the effectiveness of the model during testing.
Training data sets are currently acquired mostly through manual labeling, which is a very time-consuming and labor-intensive task. Moreover, because task requirements keep expanding and the frequencies of different elements in real scenes vary greatly, data sets often lack certain types of elements entirely or contain unbalanced numbers of elements of different types, even after careful manual labeling.
To address these problems, new sample pictures can be obtained by spatial and channel transformations such as flipping, cropping, rotating, scaling, deforming, adding noise, or blurring, thereby enhancing sample diversity. However, such methods cannot increase the element types in a data set; a data set containing more element types can only be acquired through renewed manual labeling.
Therefore, how to remedy missing and unbalanced elements in a training data set is an urgent problem to be solved.
Disclosure of Invention
The embodiments of the application provide a scene element fusion processing method, device and equipment and a computer storage medium, which are used to enrich the element types in sample pictures and to balance the number of elements of each type in a sample picture set.
In one aspect, a scene element fusion processing method is provided, where the method includes:
acquiring a source sample picture set containing a plurality of source sample pictures and a target sample picture set containing a plurality of target sample pictures; the source sample pictures and the target sample pictures have the same scene type, each source sample picture and each target sample picture is composed of multiple types of elements, and at least one element included in each source sample picture and each target sample picture carries corresponding labeling information;
respectively acquiring target area pictures containing target type elements from each source sample picture according to the labeling information corresponding to each source sample picture in the source sample picture set to obtain a target area picture set;
and based on the obtained target area picture set, performing fusion processing aiming at the target type element on at least one target sample picture in the target sample picture set to obtain at least one target sample picture containing the target type element.
In one aspect, a scene element fusion processing apparatus is provided, the apparatus including:
a data set acquisition unit, configured to acquire a source sample picture set including a plurality of source sample pictures and a target sample picture set including a plurality of target sample pictures; the source sample pictures and the target sample pictures have the same scene type, each source sample picture and each target sample picture is composed of multiple types of elements, and at least one element included in each source sample picture and each target sample picture carries corresponding labeling information;
the element extraction unit is used for respectively acquiring target area pictures containing target type elements from each source sample picture according to the labeling information corresponding to each source sample picture in the source sample picture set to obtain a target area picture set;
and the fusion processing unit is used for performing fusion processing aiming at the target type element on at least one target sample picture in the target sample picture set based on the obtained target area picture set to obtain at least one target sample picture containing the target type element.
Optionally, the element extracting unit is specifically configured to:
acquiring the corresponding labeling information from the labeling file of each source sample picture;
for each piece of obtained labeling information, respectively performing the following operation: when it is determined that the element types indicated in the obtained labeling information include the target type element, cropping the target area picture corresponding to the target type element from the corresponding source sample picture according to the coordinate information of the target type element indicated by the labeling information.
Optionally, the fusion processing unit is specifically configured to:
determining the at least one target sample picture from the set of target sample pictures;
for each target sample picture of the at least one target sample picture, respectively performing the following operations: for one target sample picture, selecting at least one target area picture from the target area picture set, and performing fusion processing on the target sample picture and the at least one target area picture to obtain one target sample picture containing the target type element.
Optionally, the fusion processing unit is specifically configured to:
determining the total number of the target type elements and the total number of other type elements marked in the target sample picture set according to the marking information of each target sample picture in the target sample picture set;
determining the total number of the target type elements required to be added to the target sample picture set according to the total number of the target type elements and the total number of the other type elements;
determining the at least one target sample picture from the target sample picture set according to the total number of the target type elements to be added; the sum of the number of the target type elements corresponding to each target sample picture in the at least one target sample picture is the same as the total number of the target type elements to be added.
Optionally, the fusion processing unit is specifically configured to:
selecting a corresponding number of target area pictures for the target sample picture according to the total number of the target type elements needing to be added; wherein the sum of the number of the target type elements to be added of each target sample picture in the at least one target sample picture is the same as the total number of the target type elements to be added; or,
and selecting a set number of target area pictures for the target sample picture.
Optionally, the fusion processing unit is specifically configured to:
respectively determining the number of the target type elements corresponding to each target sample picture according to the labeling information of each target sample picture in the target sample picture set;
determining a target sample picture with the number of the target type elements not greater than a preset number threshold as the at least one target sample picture.
Optionally, the fusion processing unit is specifically configured to:
determining a fusion region of the target type element in the target sample picture;
respectively fusing the at least one target area picture corresponding to the target sample picture into the fusion region to obtain a target sample picture containing the target type element; when a target area picture is fused into the fusion region, the pixels of the target type element in the target area picture cover the pixels at the corresponding position in the fusion region.
Optionally, the fusion processing unit is specifically configured to:
determining a region in a preset row range as the fusion region based on the pixel matrix corresponding to the target sample picture; or,
and identifying the region of the target type element with the occurrence probability higher than the preset probability value in the target sample picture, and determining the region with the occurrence probability higher than the preset probability value as the fusion region.
Optionally, the apparatus further includes a label updating unit, configured to:
and updating the annotation file of the target sample picture according to the element type information corresponding to each target area picture in at least one target area picture corresponding to the target sample picture and the coordinate information of each target area picture in the target sample picture.
Optionally, the scene type is a road scene, each source sample picture and each target sample picture is composed of a plurality of road elements in the road scene, and the type and the position of at least one road element in each source sample picture and each target sample picture are labeled;
the road elements include one or more of pedestrians, vehicles, traffic indicators, vegetation, or buildings in combination.
Optionally, the device further comprises a model training unit, a recognition unit and a driving guidance unit;
the model training unit is used for obtaining a training sample picture set of a road element recognition model according to the at least one target sample picture, and performing model training on the road element recognition model according to the training sample picture set to obtain a trained road element recognition model;
the recognition unit is used for collecting road pictures in the vehicle running process and recognizing road elements on the road pictures by using the trained road element recognition model;
and the driving guidance unit is used for determining a driving guidance scheme of the vehicle according to the road element recognition result.
In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, a target area picture containing a target type element is extracted from a source sample picture set which has the same or similar scene type as a target sample picture set, the target area picture set corresponding to the target type element is constructed, and a plurality of target sample pictures in the target sample picture set are fused by the target area picture set, so that the target sample picture added with the target type element can be obtained, thus, for a missing element or an unbalanced element in the target sample picture set, the target sample picture set can be added in a targeted manner through the process, so that the target sample picture set is enhanced, the number of elements in the target sample picture set can reach an expected standard, and further, when the target sample picture set is used for model training, the trained model can be used for learning the characteristics of each element in a balanced manner, and the identification of certain elements is not prone to be realized, so that the accuracy of the trained model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a scene element fusion processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of cropping a target area picture from a source sample picture A according to an embodiment of the present application;
fig. 4 is a schematic diagram of cropping a target area picture from a source sample picture A according to an embodiment of the present application;
fig. 5 is a schematic view of a target area picture set taking a target type element as a traffic light as an example according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a fusion process performed by using a target area picture set and a target sample picture set according to an embodiment of the present application;
FIG. 7 is a schematic view of a fusion region provided in an embodiment of the present application;
fig. 8 is a schematic diagram illustrating comparison of target sample pictures before and after fusion according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating a comparison of a target sample picture before and after fusion according to an embodiment of the present application;
FIG. 10 is a schematic flowchart of an element recognition model training method according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a scene element fusion processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application. In the present application, the embodiments and the features of the embodiments may be combined with each other arbitrarily provided there is no conflict. Also, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from the one shown here.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
Sample picture set: the embodiments of the application mainly involve a source sample picture set and a target sample picture set. The source sample picture set is used to extract target type elements and construct a target type element set; the target sample picture set is the data set to be enhanced, into whose samples the elements extracted from the source sample picture set are added.
Elements: a picture is usually made up of multiple items, and each item in the picture may be an element.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Its infrastructure generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Computer Vision (CV) is the science of how to make a machine "see": using cameras and computers instead of human eyes to identify, track, and measure targets, and performing further image processing so that the result is better suited for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques that attempt to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application mainly relates to computer vision technology, machine learning/deep learning and other technologies belonging to the field of artificial intelligence, wherein the computer vision technology is an important branch of the artificial intelligence technology, and the learning based on supervision is a mainstream method of the current computer vision algorithm.
Because task requirements keep expanding and the frequencies of various elements in actual scenes differ greatly, a data set often lacks elements of certain categories or contains unbalanced numbers of elements across categories. In supervised learning, a model tends to learn the weights that minimize the training loss; when the samples of one category form the majority, the model tends to fit those samples and neglect the categories with few samples, so the final accuracy on the minority categories is low.
The data enhancement methods in the related art cannot fundamentally solve the problem of missing elements or unbalanced numbers of elements across classes.
In view of this, an embodiment of the present application provides a scene element fusion processing method. A target area picture containing a target type element is extracted from a source sample picture set having the same or a similar scene type as a target sample picture set, and a target area picture set corresponding to the target type element is constructed. Several target sample pictures in the target sample picture set are then fused with pictures from the target area picture set, producing target sample pictures with the target type element added. In this way, elements that are missing or unbalanced in the target sample picture set can be added in a targeted manner, enhancing the target sample picture set so that the number of elements of each type reaches the expected standard. When the target sample picture set is then used for model training, the trained model learns the characteristics of each element in a balanced manner instead of being biased toward recognizing certain elements, which improves its accuracy.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The solution provided by the embodiments of the present application is applicable to most machine learning scenarios based on picture samples. As shown in fig. 1, the application scenario may include an image capturing device 10 and a data generating device 20.
The image capturing device 10 is configured to capture sample pictures and may be a different device for different capturing manners. For example, the image capturing device 10 may be a camera: in one possible embodiment it is a camera arranged on a vehicle, and pictures of the road the vehicle passes are captured by the camera to serve as sample pictures; in another possible embodiment it is a monitoring device installed at the roadside, and the pictures captured by the monitoring device serve as sample pictures. Alternatively, sample pictures may be obtained from a network, in which case the image capturing device 10 may be a corresponding computer device.
The data generating apparatus 20 is a computer apparatus having a certain processing capability, and may be, for example, a Personal Computer (PC), a notebook computer, a server, or the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The data generation device 20 includes one or more processors 201, a memory 202, and an I/O interface 203 for interacting with other devices. In addition, the data generation device 20 may be configured with a database 204, which may be used to store the target area picture sets and data sets involved in the solution provided by the embodiments of the present application. The memory 202 of the data generating device 20 may store program instructions of the scene element fusion processing method provided in the embodiments of the present application; when executed by the processor 201, these instructions implement the steps of the method to generate a training data set to which the target type elements have been added.
In specific implementation, the image capturing device 10 may send the captured sample picture to the data generating device 20, the data generating device 20 stores the sample picture, and when the picture processing is required, the stored sample picture is used to obtain a training data set to which the target type element is added by the above-mentioned scene element fusion processing method, and the training data set may be used for machine learning to train a corresponding model.
In a possible implementation, the scene presented by the sample pictures may be a road scene, and each sample picture may be composed of a plurality of road elements, for example one or more of pedestrians, vehicles, traffic indicators, vegetation, or buildings. In practical application, each sample picture can be an actually acquired road picture, and these road pictures form the source sample picture set and the target sample set. For example, traffic lights (their type and position) are labeled in the source sample picture set; traffic light elements can then be extracted from the source sample picture set and fused into the target sample set. The fused target sample set can include labels of various road elements, so a road element model can be trained with it, and the trained model can be used in an actual road element identification process.
One possible mode is to apply the method to an automatic driving scenario. For example, while an automatic driving vehicle is running, road pictures can be taken by a camera device arranged on the vehicle, and the road element recognition model obtained through the above training identifies the surrounding road elements in real time, so that an automatic driving scheme can be formulated according to the identified road elements. For example, when a traffic light is recognized ahead, its state needs to be further confirmed to decide whether the vehicle should stop; when a pedestrian is recognized ahead, the pedestrian needs to be avoided in time.
Another possible mode is to apply to map navigation, for example, when a driver drives a vehicle, a road picture can be collected, and the road element recognition model obtained through the training can recognize surrounding road elements in real time, so that driving guidance information can be formulated for the driver according to the recognized road elements, and the driver can be guided in navigation. For example, when a pedestrian is recognized ahead, the driver is prompted to avoid the pedestrian.
The image capturing device 10 and the data generating device 20 may be directly or indirectly communicatively coupled via one or more networks 30. The network 30 may be a wired network or a wireless network; for example, the wireless network may be a mobile cellular network or a Wireless-Fidelity (WIFI) network, or another possible network, which is not limited in the embodiments of the present application.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
Referring to fig. 2, a schematic flowchart of a scene element fusion processing method provided in an embodiment of the present application is shown, where the method may be executed by the data generating device 20 in fig. 1, and a flow of the method is described as follows.
Step 201: Acquire a source sample picture set and a target sample picture set having the same scene type.
In the embodiment of the present application, the source sample picture set is a data set composed of source sample pictures and containing the target type element, and the target sample picture set is a data set composed of target sample pictures that is to be enhanced: the target type elements are extracted from the source sample picture set, and the extracted target type elements are fused into the target sample picture set.
In order to make the elements better fused and closer to the scene of the target sample picture set, the source sample picture set and the target sample picture set may have similar or identical scene types, for example, both may be street view pictures or traffic light intersection pictures, etc.
Illustratively, the source sample picture set is a data set A labeled with traffic lights, and the target sample picture set is a data set B labeled with pedestrians and vehicles; that is, data set B lacks labels for traffic lights. A model trained on data set B can identify pedestrians and vehicles but not traffic lights. If the model is required to identify traffic lights as well, traffic light elements need to be added to data set B; since data set A already labels traffic lights, the labeled elements in data set A, i.e. the traffic lights, can be used to perform element enhancement on data set B.
Alternatively, the source sample picture set is the data set A labeled with traffic lights, and the target sample picture set is a data set C labeled with traffic lights, pedestrians, and vehicles, but with few traffic light samples and severe category imbalance. A model trained on data set C is more inclined to identify pedestrians and vehicles and is weaker at identifying traffic lights. To balance the number of elements of each category in data set C, traffic light elements need to be added to it; again, the labeled traffic lights in data set A can be used to perform element enhancement on data set C.
In a specific implementation, the source sample picture set and the target sample picture set may also be the same data set, for example, the data set C may be used, and then traffic light elements may be extracted from the data set C and fused into each sample picture, so as to increase the number of traffic light elements in the data set C.
Step 202: According to the labeling information corresponding to each source sample picture in the source sample picture set, respectively crop a target area picture containing the target type element from each source sample picture containing the target type element, so as to obtain a target area picture set.
In the embodiment of the application, after the source sample pictures in the source sample picture set are labeled, each source sample picture has corresponding labeling information, which is generally stored in a labeling file. Alternatively, one source sample picture may correspond to one annotation file, so that when the annotation information of a certain source sample picture is needed, the annotation file of that picture can be looked up to obtain the annotation information.
The labeling information may include the type of each labeled element and its coordinate information in the source sample picture. The coordinate information may be represented by the coordinates of the center point of the element's labeling box together with the width and height of the rectangular box, or by the coordinates of two diagonal vertices of the labeling box.
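As a concrete illustration of these two coordinate representations, the following is a minimal Python sketch of converting between them; the function names are illustrative only and not defined by the application:

```python
def center_to_corners(cx, cy, w, h):
    """Center point plus width/height -> coordinates of two diagonal vertices."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_center(x1, y1, x2, y2):
    """Coordinates of two diagonal vertices -> center point plus width/height."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```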
In the embodiment of the application, whether a source sample picture includes the target type element can be determined from its labeling information, and the target type element is then extracted only from the source sample pictures that contain it. Since the extraction process is similar for every source sample picture containing a target type element, it is described below taking a source sample picture A as an example. Fig. 3 is a schematic flow chart of cropping a target area picture from the source sample picture A.
S2021: Acquire the labeling information of the source sample picture A from its labeling file.
Specifically, all the annotation information of the source sample picture A is stored in its annotation file, so the corresponding annotation information can be obtained from that file.
Exemplarily, fig. 4 is a schematic diagram of cropping a target area picture from the source sample picture A. The source sample picture A is an intersection picture in which the traffic lights are labeled. Taking the target type element as a traffic light as an example, as shown in fig. 4, the labeling information may specifically include the element type information and the coordinate information of the labeling frame, where the element type "light" shown in fig. 4 indicates a traffic light.
S2022: Determine whether the source sample picture A contains the target type element according to the labeling information.
The labeling information indicates the labeled element types of the source sample picture A, so whether the picture contains the target type element can be determined from it; when the element types indicated in the labeling information do not include the target type element, the process ends.
S2023: If the determination result in S2022 is yes, determine the target type element region in the source sample picture A according to the coordinate information of the target type element indicated by the annotation information.
When the element types of the source sample picture A indicated in the annotation information include the target type element, the target type element region may be determined from the source sample picture A according to the coordinate information indicated by the annotation information.
S2024: Crop the target area picture corresponding to the target type element from the source sample picture A.
Specifically, a cropping manner may be adopted: after the target type element region is determined, the other regions in the source sample picture A are cut away and only the target type element region is retained, yielding the target area picture. Alternatively, since an image is stored in a computer device in the form of a pixel matrix, the data of the target type element region may be read from the source sample picture A according to the mapping between coordinate information and pixels, and the corresponding target area picture is then generated from this data.
As shown in fig. 4, it can be determined from the label information that the source sample picture A includes traffic lights, so the target type elements can be extracted from it: the traffic light regions in the source sample picture A are determined according to the coordinate information, the corresponding pixels are read from the pixel matrix of the source sample picture A, and target area pictures containing the traffic lights are generated.
After the above operation is performed on each source sample picture in the source sample picture set, the target area pictures corresponding to the source sample pictures are obtained, and these area pictures form the target area picture set. Fig. 5 shows a target area picture set taking the target type element as a traffic light as an example: the set is composed of the target area pictures of the target type elements extracted from the source sample pictures, i.e. of the extracted traffic light pictures.
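The following minimal Python sketch illustrates S2021 through S2024, assuming one JSON annotation file per source sample picture with illustrative "type" and "bbox" fields; the actual annotation format and file layout are assumptions, not specified by the application:

```python
import json
from pathlib import Path

from PIL import Image

def build_target_area_set(source_dir, target_type="light"):
    """Crop every labeled region of the target element type from the source
    sample pictures, forming the target area picture set."""
    area_pictures = []
    for ann_path in Path(source_dir).glob("*.json"):
        annotations = json.loads(ann_path.read_text())   # S2021: read labeling info
        image = Image.open(ann_path.with_suffix(".jpg"))
        for ann in annotations:                          # assumed: {"type", "bbox"}
            if ann["type"] == target_type:               # S2022: target type present?
                x1, y1, x2, y2 = ann["bbox"]             # S2023: locate the region
                area_pictures.append(image.crop((x1, y1, x2, y2)))  # S2024: crop it
    return area_pictures
```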
Step 203: Based on the obtained target area picture set, perform fusion processing for the target type element on at least one target sample picture in the target sample picture set, to obtain at least one target sample picture containing the target type element.
In this embodiment of the present application, after the target area picture set is obtained, it can be used together with the target sample picture set to perform fusion processing, obtaining at least one enhanced target sample picture, that is, at least one target sample picture containing the target type element. In practical application, the enhanced target sample pictures and the other target sample pictures in the target sample picture set that did not undergo fusion processing form a new sample picture set; this element-enhanced sample picture set can be used as the training data set for actual model training.
Fig. 6 is a schematic flow chart of the fusion processing performed with the target area picture set and the target sample picture set.
S2031: at least one target sample picture is determined from the set of target sample pictures.
In the embodiment of the present application, at least one target sample picture that needs to be subjected to fusion processing may be selected according to the specific situation of the target sample picture set.
Specifically, when the target sample picture set is a data set that does not include a target type element at all, such as the data set B in the above example, all pictures in the entire target sample picture set may be subjected to the fusion processing, or at least one target sample picture may be selected according to a certain ratio, for example, 80% or 85% of the target sample pictures in the target sample picture set may be subjected to the fusion processing.
Specifically, when the target sample picture set is a data set which partially includes the target type element but has a severely unbalanced number of elements, such as the data set C in the above example, all pictures in the entire target sample picture set may be fused, or at least one target sample picture which does not include the target type element may be selected.
In specific implementation, in order to keep the numbers of the various elements in the target sample picture set balanced, the number of target type elements to add may be determined according to the number of elements of other types already labeled in the set. Specifically, the numbers of elements of the other types can be obtained from the labeling information, so the total number of labeled target type elements and the total number of elements of other types can be determined from the labeling information of each target sample picture. The total number of target type elements that need to be added is then determined from these two totals. For example, when the total number of target type elements in the target sample picture set is zero and the total number of elements of other types is 10000, the total number of target type elements to add may be determined to be 10000; when the totals are 3000 and 10000 respectively, the total number to add may be determined to be 7000.
Furthermore, the number of the target sample pictures to be selected is determined according to the total number of the target type elements to be added, that is, at least one target sample picture is determined from the target sample picture set, so that the sum of the number of the target type elements corresponding to each target sample picture in the determined at least one target sample picture is the same as the total number of the target type elements to be added.
In a specific implementation, target sample pictures containing few target type elements may be selected as the at least one target sample picture. Specifically, the number of target type elements in each target sample picture can be determined from its labeling information, and the target sample pictures whose number of target type elements is not greater than a preset threshold are determined as the at least one target sample picture.
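The element counting and picture selection just described can be sketched as follows; the in-memory annotation format and the function names are assumptions for illustration only:

```python
from collections import Counter

def plan_fusion(picture_annotations, target_type="light", count_threshold=0):
    """picture_annotations maps picture id -> list of labeled element types
    (an assumed format). Returns the number of target type elements to add
    and the pictures selected for fusion."""
    totals = Counter(t for types in picture_annotations.values() for t in types)
    other_total = sum(n for t, n in totals.items() if t != target_type)
    # e.g. 0 target elements vs. 10000 others -> add 10000;
    # 3000 vs. 10000 -> add 7000, as in the examples above.
    to_add = max(other_total - totals[target_type], 0)
    # Prefer pictures whose target-type count does not exceed the threshold.
    selected = [pid for pid, types in picture_annotations.items()
                if types.count(target_type) <= count_threshold]
    return to_add, selected
```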
S2032: for each target sample picture, at least one target area picture is selected from the set of target area pictures.
After the at least one target sample picture is selected, at least one target area picture may be selected from the target area picture set for each of the selected at least one target sample picture.
Specifically, a corresponding number of target area pictures may be selected for each target sample picture according to the total number of target type elements that need to be added, such that the per-picture numbers sum to that total. For example, if 3000 target sample pictures are selected for fusion processing and 9000 target type elements need to be added in total, the additions can be allocated across the target sample pictures, and a corresponding number of target area pictures is then selected from the target area picture set for each picture; the per-picture numbers may differ, but the numbers over the 3000 target sample pictures sum to 9000.
Alternatively, the number of target area pictures for each target sample picture may be preset, and the set number of target area pictures is then selected for each target sample picture.
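One simple allocation of the per-picture counts is sketched below; the random scheme is an illustrative choice, not one mandated by the application:

```python
import random

def allocate_counts(total_to_add, picture_ids, seed=0):
    """Distribute `total_to_add` target type elements over the selected
    pictures; per-picture counts may differ but sum to the total."""
    rng = random.Random(seed)
    counts = dict.fromkeys(picture_ids, 0)
    for _ in range(total_to_add):
        counts[rng.choice(picture_ids)] += 1
    return counts
```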
It should be noted that, the selection of the target area pictures corresponding to the target sample pictures and the selection of the target sample pictures to be subjected to the fusion processing may be performed simultaneously, that is, S2031 and S2032 may be performed simultaneously, for example, after the total number of the target type elements to be added is determined, the number of at least one target sample picture and the number of the target area pictures corresponding to each target sample picture may be selected simultaneously.
S2033: Perform fusion processing on the target sample picture and the corresponding at least one target area picture to obtain a target sample picture containing the target type element.
Since the fusion process is similar for every pair of target sample picture and target area picture, the following describes the fusion of one target area picture into one target sample picture, for example fusing the target area picture C into the target sample picture B.
Specifically, when the target area picture C is fused into the target sample picture B, a fusion region for the target type element in the target sample picture needs to be determined first.
Generally, each type of element appears in a characteristic region of a picture. For example, street view pictures are usually taken with the camera at a roughly constant height, and traffic lights hang relatively high, so traffic lights usually appear in the upper part of the picture. A region at a set height in the target sample picture can therefore be used as the region into which a target area picture may be fused. Since the target sample picture is stored as a pixel matrix, a region within a preset row range of the pixel matrix can be determined as the fusion region.
As shown in fig. 7, when the target type element is a traffic light, an upper region of the target sample picture, for example the upper half, may be set as the fusion region.
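A minimal sketch of the preset-row-range choice of fusion region, assuming pixel coordinates with the origin at the top-left corner; the fraction of 0.5 is the illustrative "upper half" choice mentioned above:

```python
def fusion_region(image_width, image_height, top_fraction=0.5):
    """A preset row range of the pixel matrix used as the fusion region,
    here the upper part of the picture, returned as (x1, y1, x2, y2)."""
    return (0, 0, image_width, int(image_height * top_fraction))
```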
In specific implementation, a model for identifying the possible occurrence regions of the elements of each type can be trained, so that the regions of the target type elements with the occurrence probability higher than the preset probability value in the target sample picture are identified through the model, and the regions with the occurrence probability higher than the preset probability value are determined as fusion regions.
In the embodiment of the application, after the fusion region is determined, the target area picture may be fused into it. The target area picture may be placed at a random position within the fusion region, but the element regions already labeled in the target sample picture need to be avoided.
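The random placement with avoidance of labeled regions can be sketched as follows; the box format and function names follow the earlier illustrative snippets, and the retry limit is an assumption:

```python
import random

def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x1, y1, x2, y2)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def paste_element(target_img, element_img, region, labeled_boxes, rng=None, max_tries=50):
    """Paste one element picture at a random position inside the fusion
    region while avoiding the already-labeled element regions; the element's
    pixels overwrite the target picture's pixels at the chosen position."""
    rng = rng or random.Random()
    x1, y1, x2, y2 = region
    w, h = element_img.size
    for _ in range(max_tries):
        x = rng.randint(x1, max(x1, x2 - w))
        y = rng.randint(y1, max(y1, y2 - h))
        box = (x, y, x + w, y + h)
        if not any(boxes_overlap(box, b) for b in labeled_boxes):
            target_img.paste(element_img, (x, y))  # pixel overwrite
            return box
    return None  # no free position found within the region
```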
In this embodiment of the application, after the fusion processing is performed, the target type elements have been added to the at least one target sample picture, yielding at least one fused target sample picture.
S204: Update the annotation file of the target sample picture according to the element type information corresponding to each of the at least one target area picture fused into it and the coordinate information of each target area picture within the target sample picture.
In the embodiment of the application, the target sample pictures are used for subsequent model training, so labeling information needs to be added for each newly added target type element. This labeling information includes the element type information corresponding to each of the at least one target area picture and the coordinate information of each target area picture within the target sample picture; the corresponding labeling information is then updated in the labeling file of each target sample picture for use in subsequent model training.
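A minimal sketch of the annotation update, under the same assumed JSON annotation layout as the earlier snippets:

```python
import json

def update_annotation_file(ann_path, new_boxes, element_type="light"):
    """Append the element type and in-picture coordinates of each newly
    fused target area picture to the annotation file (assumed JSON layout)."""
    with open(ann_path) as f:
        annotations = json.load(f)
    for box in new_boxes:
        annotations.append({"type": element_type, "bbox": list(box)})
    with open(ann_path, "w") as f:
        json.dump(annotations, f, indent=2)
```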
The above example of data set A, data set B, and data set C is continued. Since data set B originally does not include traffic lights, after traffic lights are fused into the target sample pictures in data set B through the above process, a data set B' containing traffic light elements is obtained. As shown in fig. 8, before the traffic light elements are added, the target sample picture in data set B contains two labeled element types, person and vehicle, but no traffic light elements; after the fusion processing, 3 traffic lights have been added, and the resulting target sample picture is a sample picture containing traffic light elements.
Alternatively, element fusion is performed on data set C to increase its number of traffic light elements, obtaining a data set C' with a more balanced sample distribution. As shown in fig. 9, before the new traffic light elements are added, the target sample picture in data set C contains three labeled element types, namely traffic light, person, and vehicle; but since the overall number of traffic light elements in data set C is small, new traffic light elements need to be added. After the fusion processing, 2 traffic lights have been added, and the resulting target sample picture is a sample picture containing more traffic light elements.
After the data set B' or C' is obtained through the above process, the newly generated data set can be used to train a model that detects traffic lights, vehicles, and pedestrians at the same time.
The following describes an application of the scene element fusion processing method described above with reference to the accompanying drawings. Referring to fig. 10, a flowchart of an application method of the scene element fusion processing method is shown.
Step 1001: an existing plurality of data sets is obtained.
The existing multiple data sets refer to data sets that have already been element-labeled, mainly sample picture sets. For example, they may include the following:
(1) a data set A in which only traffic lights are labeled;
(2) a data set B in which only pedestrians and vehicles are labeled;
(3) a data set C in which traffic lights, pedestrians, and vehicles are labeled, but the number of traffic light samples is small and the category imbalance is severe.
Step 1002: target type elements in the data set are extracted.
Here, taking the target type element as a traffic light as an example, the traffic light element regions may be extracted according to the labels of data set A to obtain a traffic light element set. For the element extraction process, refer to the description of the embodiment shown in fig. 2, which is not repeated here.
Step 1003: New elements are fused into data sets that lack those elements.
Step 1004: a data set is generated that contains the new element.
For example, for the data set B, the data set B lacks traffic light elements, so that the data set B can be subjected to element fusion to obtain a data set B' of newly added traffic light elements.
Step 1005: Existing elements are supplemented into data sets with unbalanced element numbers.
Step 1006: A data set with a more balanced sample distribution is generated.
For example, the number of traffic light samples in data set C is small and the data set has severe category imbalance, so element fusion can be performed on data set C to obtain a data set C' with a more balanced sample distribution.
Step 1007: the model is trained with the newly generated data set.
Specifically, the element recognition model may be trained using the newly generated data set to obtain a trained element recognition model. The element recognition model may adopt any structure and may be trained with any model training method; the embodiments of the application are not limited in this respect.
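For completeness, the following composition ties the earlier sketches together for steps 1003 through 1006; `fusion_region`, `paste_element`, `plan_fusion`, and `allocate_counts` are the illustrative helpers sketched above, the per-picture data layout is an assumption, and the actual model training of step 1007 can use any element recognition model and training routine, as stated above:

```python
import random

def enhance_dataset(target_pictures, element_pictures, counts, seed=0):
    """Fuse counts[pid] randomly chosen element pictures into each selected
    target sample picture, keeping the picture's label list in sync."""
    rng = random.Random(seed)
    for pid, n in counts.items():
        pic = target_pictures[pid]  # assumed: {"image": PIL image, "boxes": [(x1, y1, x2, y2), ...]}
        region = fusion_region(*pic["image"].size)  # preset row range (upper part)
        for element in rng.choices(element_pictures, k=n):
            box = paste_element(pic["image"], element, region, pic["boxes"], rng)
            if box is not None:
                pic["boxes"].append(box)  # the newly fused element must also be labeled
```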
In this embodiment, one possible scene type is a road scene; the constituent elements of the source sample pictures and the target sample pictures are then road elements, which may include one or a combination of pedestrians, vehicles, traffic indicators, vegetation, or buildings, as well as other possible road elements.
After at least one target sample picture containing the target type elements is obtained through the above process, a training sample picture set for the road element recognition model can be built from these pictures, and the road element recognition model is trained on this set to obtain a trained model. Road pictures are then collected while the vehicle is driving, and the trained road element recognition model identifies the road elements in them, so that a driving guidance scheme for the vehicle can be determined according to the recognition results.
For example, the data set B' or C' obtained above may be used to train the road element recognition model, and the trained model may then be used in an actual road element recognition process. During the driving of an autonomous vehicle, the trained road element recognition model can recognize road elements in captured road pictures in real time, so that an automatic driving scheme can be formulated from the recognized elements. Alternatively, while a driver is operating the vehicle, the model can recognize the surrounding road elements in real time, so that driving guidance information can be generated for the driver, for example during navigation.
To sum up, the embodiment of the present application provides a method for image processing and model training based on element fusion, in which a new data set is generated through element-level data processing. By fusing elements from existing labeled data sets, data sets containing more element categories can be obtained without manual re-labeling, and by controlling the number of each kind of fused element, the sample imbalance of a data set is fundamentally improved. In other words, relying only on existing data sets, data sets containing more element categories can be generated at extremely low cost, and the imbalance among element categories in an existing data set can likewise be remedied at extremely low cost by supplementing element counts.
Referring to fig. 11, based on the same inventive concept, an embodiment of the present application further provides a scene element fusion processing apparatus 110, including:
a data set acquiring unit 1101 configured to acquire a source sample picture set including a plurality of source sample pictures and a target sample picture set including a plurality of target sample pictures; the source sample pictures and the target sample pictures have the same scene type, each source sample picture and each target sample picture are formed by multiple types of elements, and at least one element included in each source sample picture and each target sample picture is provided with corresponding labeling information;
an element extracting unit 1102, configured to obtain, according to the labeling information corresponding to each source sample picture in the source sample picture set, a target area picture including a target type element from each source sample picture, respectively, and obtain a target area picture set;
a fusion processing unit 1103, configured to perform fusion processing on at least one target sample picture in the target sample picture set for the target type element based on the obtained target region picture set, and obtain at least one target sample picture including the target type element.
Optionally, the element extracting unit 1102 is specifically configured to:
acquire the corresponding labeling information from the labeling file of each source sample picture; and
perform the following operation for each piece of obtained labeling information: when it is determined that the element type indicated in the labeling information contains the target type element, intercept the target area picture corresponding to the target type element from the corresponding source sample picture according to the coordinate information of the target type element indicated by the labeling information.
Optionally, the fusion processing unit 1103 is specifically configured to:
determine at least one target sample picture from the target sample picture set;
for each target sample picture in the at least one target sample picture, perform the following operations: for one target sample picture, select at least one target area picture from the target area picture set, and fuse the target sample picture with the at least one target area picture to obtain one target sample picture containing the target type element;
and obtain a training data set based on the obtained target sample pictures containing the target type element and the remaining target sample pictures in the target sample picture set that were not subjected to fusion processing.
Optionally, the fusion processing unit 1103 is specifically configured to:
determine the total number of target type elements and the total number of other type elements labeled in the target sample picture set, according to the labeling information of each target sample picture in the set;
determine the total number of target type elements that need to be added to the target sample picture set, according to those two totals;
determine at least one target sample picture from the target sample picture set according to the total number of target type elements to be added, where the sum of the numbers of target type elements corresponding to the respective selected pictures equals that total. A brief worked example follows.
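As a purely hypothetical illustration of this constraint: if 4,470 traffic light elements must be added and 1,490 target sample pictures are selected, fusing three elements into each selected picture makes the per-picture counts sum to exactly the required total of 4,470.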
Optionally, the fusion processing unit 1103 is specifically configured to:
select a corresponding number of target area pictures for one target sample picture according to the total number of target type elements to be added, where the sum of the numbers of target type elements to be added across the at least one target sample picture equals that total; or
select a set number of target area pictures for one target sample picture.
Optionally, the fusion processing unit 1103 is specifically configured to:
determine the number of target type elements corresponding to each target sample picture, according to the labeling information of each target sample picture in the target sample picture set;
and determine the target sample pictures whose number of target type elements is not greater than a preset number threshold as the at least one target sample picture.
Optionally, the fusion processing unit 1103 is specifically configured to:
determine a fusion region for the target type element in one target sample picture;
fuse the at least one target area picture corresponding to the target sample picture into the fusion region to obtain one target sample picture containing the target type element; when one target area picture is fused into the fusion region, the pixels of the target type element in that target area picture cover the pixels at the corresponding position in the fusion region, as sketched below.
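A minimal sketch of this pixel-covering fusion, assuming images are NumPy arrays of shape (H, W, 3); the random placement inside the fusion region is an illustrative policy, not mandated by the disclosure:

```python
import random

def fuse_element(target_img, element_crop, region):
    """Paste one element crop (a target area picture) into the fusion region
    of a target sample picture.

    region = (row_min, row_max, col_min, col_max) in target_img coordinates.
    The crop's pixels simply overwrite ("cover") the pixels at the chosen spot.
    Returns the pasted bounding box (x1, y1, x2, y2) for the annotation update.
    """
    h, w = element_crop.shape[:2]
    r0, r1, c0, c1 = region
    y = random.randint(r0, max(r0, r1 - h))  # random placement is an assumption
    x = random.randint(c0, max(c0, c1 - w))
    target_img[y:y + h, x:x + w] = element_crop
    return (x, y, x + w, y + h)
```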
Optionally, the fusion processing unit 1103 is specifically configured to:
determine a region within a preset row range as the fusion region, based on the pixel matrix corresponding to one target sample picture; or
identify the region of the target sample picture in which the occurrence probability of the target type element is higher than a preset probability value, and determine that region as the fusion region. Both strategies are sketched below.
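Sketches of both region-determination strategies, under the assumption that a region is represented as (row_min, row_max, col_min, col_max); the 40%–70% row band and the probability heat map input are illustrative assumptions:

```python
import numpy as np

def fusion_region_by_rows(img, row_frac=(0.4, 0.7)):
    """Strategy 1: a preset band of pixel rows of the target sample picture."""
    h, w = img.shape[:2]
    return (int(h * row_frac[0]), int(h * row_frac[1]), 0, w)

def fusion_region_by_heatmap(heatmap, threshold=0.5):
    """Strategy 2: bounding box of the pixels whose occurrence probability for
    the target type element exceeds the threshold; `heatmap` is an H x W array
    of per-pixel probabilities, assumed to contain at least one such pixel."""
    rows, cols = np.where(heatmap > threshold)
    return (int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max()))
```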
Optionally, the apparatus further includes an annotation updating unit 1104 configured to:
update the annotation file of one target sample picture according to the element type information of each of the at least one target area picture fused into that picture and the coordinate information of each target area picture within the target sample picture, as sketched below.
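A sketch of the annotation update, assuming a simple JSON label file with an "objects" list; the schema and the helper name are assumptions for illustration:

```python
import json

def update_annotation(ann_path, fused_elements):
    """Append the fused elements to a target sample picture's annotation file.

    fused_elements: list of (element_type, (x1, y1, x2, y2)) pairs, with boxes
    in the target sample picture's coordinate system, as returned by fusion.
    """
    with open(ann_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    for elem_type, bbox in fused_elements:
        ann["objects"].append({"type": elem_type, "bbox": list(bbox)})
    with open(ann_path, "w", encoding="utf-8") as f:
        json.dump(ann, f, ensure_ascii=False, indent=2)
```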
Optionally, the scene type is a road scene, each source sample picture and each target sample picture are formed by a plurality of road elements in the road scene, and the type and the position of at least one road element in each source sample picture and each target sample picture are labeled;
the road elements include one of, or a combination of, pedestrians, vehicles, traffic indicators, vegetation, and buildings.
Optionally, the apparatus further includes a model training unit 1105, a recognition unit 1106, and a driving guidance unit 1107;
the model training unit 1105 is configured to obtain a training sample picture set of the road element recognition model according to at least one target sample picture, and perform model training on the road element recognition model according to the training sample picture set to obtain a trained road element recognition model;
the recognition unit 1106 is configured to collect road pictures during vehicle driving and recognize road elements in the road pictures using the trained road element recognition model;
the driving guidance unit 1107 is configured to determine a driving guidance scheme for the vehicle according to the road element recognition results.
The apparatus may be configured to execute the methods of the embodiments shown in fig. 2 to 10; for the functions that can be realized by each functional module of the apparatus, reference may be made to the descriptions of those embodiments, which are not repeated here. Note that although the annotation updating unit 1104, model training unit 1105, recognition unit 1106, and driving guidance unit 1107 are all shown in fig. 11, they are not indispensable functional units and are therefore drawn with broken lines in fig. 11.
Referring to fig. 12, based on the same technical concept, an embodiment of the present application further provides a computer device 120, which may include a memory 1201 and a processor 1202.
The memory 1201 stores the computer program executed by the processor 1202. The memory 1201 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, application programs required for at least one function, and the like; the data storage area may store data created according to the use of the computer device, and the like. The processor 1202 may be a central processing unit (CPU), a digital processing unit, or the like. The embodiment of the present application does not limit the specific connection medium between the memory 1201 and the processor 1202. In fig. 12 the memory 1201 and the processor 1202 are connected by a bus 1203, drawn as a thick line; the connections between other components are illustrated only schematically and are not limiting. The bus 1203 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 12, but this does not mean there is only one bus or one type of bus.
The memory 1201 may be a volatile memory such as a random-access memory (RAM), or a non-volatile memory such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer. The memory 1201 may also be a combination of the above memories.
The processor 1202 is configured to execute the methods performed by the apparatus in the embodiments shown in fig. 2 to 10 when invoking the computer program stored in the memory 1201.
In some possible embodiments, various aspects of the methods provided herein may also be implemented in the form of a program product including program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the methods of the embodiments shown in fig. 2 to 10.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A scene element fusion processing method is characterized by comprising the following steps:
acquiring a source sample picture set containing a plurality of source sample pictures and a target sample picture set containing a plurality of target sample pictures; the source sample pictures and the target sample pictures have the same scene type, each source sample picture and each target sample picture are formed by multiple types of elements, and at least one element included in each source sample picture and each target sample picture is provided with corresponding labeling information;
respectively acquiring target area pictures containing target type elements from each source sample picture according to the labeling information corresponding to each source sample picture in the source sample picture set to obtain a target area picture set;
and based on the obtained target area picture set, performing fusion processing aiming at the target type element on at least one target sample picture in the target sample picture set to obtain at least one target sample picture containing the target type element.
2. The method of claim 1, wherein obtaining a target area picture including a target type element from each source sample picture according to labeling information corresponding to each source sample picture in the source sample picture set, to obtain a target area picture set, comprises:
acquiring the corresponding labeling information from the labeling file of each source sample picture; and
performing the following operation for each piece of obtained labeling information: when it is determined that the element type indicated in the labeling information contains the target type element, intercepting the target area picture corresponding to the target type element from the corresponding source sample picture according to the coordinate information of the target type element indicated by the labeling information.
3. The method according to claim 1, wherein the performing, based on the obtained target region picture set, a fusion process for the target type element on at least one target sample picture in the target sample picture set to obtain at least one target sample picture containing the target type element comprises:
determining the at least one target sample picture from the set of target sample pictures;
for each target sample picture of the at least one target sample picture, respectively performing the following operations: for one target sample picture, selecting at least one target area picture from the target area picture set, and performing fusion processing on the target sample picture and the at least one target area picture to obtain one target sample picture containing the target type element.
4. The method of claim 3, wherein determining the at least one target sample picture from the set of target sample pictures comprises:
determining the total number of the target type elements and the total number of other type elements marked in the target sample picture set according to the marking information of each target sample picture in the target sample picture set;
determining the total number of the target type elements required to be added to the target sample picture set according to the total number of the target type elements and the total number of the other type elements;
determining the at least one target sample picture from the target sample picture set according to the total number of the target type elements to be added; the sum of the number of the target type elements corresponding to each target sample picture in the at least one target sample picture is the same as the total number of the target type elements to be added.
5. The method of claim 4, wherein the selecting at least one target area picture from the set of target area pictures for one target sample picture comprises:
selecting a corresponding number of target area pictures for the target sample picture according to the total number of the target type elements needing to be added; wherein the sum of the number of the target type elements to be added of each target sample picture in the at least one target sample picture is the same as the total number of the target type elements to be added; or,
and selecting a set number of target area pictures for the target sample picture.
6. The method of claim 3, wherein determining the at least one target sample picture from the set of target sample pictures comprises:
respectively determining the number of the target type elements corresponding to each target sample picture according to the labeling information of each target sample picture in the target sample picture set;
determining a target sample picture with the number of the target type elements not greater than a preset number threshold as the at least one target sample picture.
7. The method according to claim 3, wherein the fusing the target sample picture with the at least one target area picture to obtain a target sample picture containing the target type element comprises:
determining a fusion region of the target type element in the target sample picture;
respectively fusing the at least one target area picture corresponding to the target sample picture in the fusion region to obtain a target sample picture containing the target type element; wherein, when the target area picture is fused in the fusion region, the pixels of the target type element in the target area picture cover the pixels at the corresponding position in the fusion region.
8. The method of claim 7, wherein determining a fusion region of the target type element in the one target sample picture comprises:
determining a region in a preset row range as the fusion region based on the pixel matrix corresponding to the target sample picture; or,
and identifying, in the target sample picture, the region in which the occurrence probability of the target type element is higher than a preset probability value, and determining that region as the fusion region.
9. The method according to claim 3, wherein after the fusing the target sample picture and the at least one target area picture to obtain a target sample picture containing the target type element, the method further comprises:
and updating the annotation file of the target sample picture according to the element type information corresponding to each target area picture in at least one target area picture corresponding to the target sample picture and the coordinate information of each target area picture in the target sample picture.
10. The method of claim 1, wherein the scene type is a road scene, each of the source sample pictures and the target sample pictures is composed of a plurality of road elements in the road scene, and the type and the position of at least one road element in each of the source sample pictures and the target sample pictures are labeled;
the road elements include one of, or a combination of, pedestrians, vehicles, traffic indicators, vegetation, and buildings.
11. The method of claim 10, wherein after obtaining at least one target sample picture containing the target type element, the method further comprises:
obtaining a training sample picture set of a road element recognition model according to the at least one target sample picture, and performing model training on the road element recognition model according to the training sample picture set to obtain a trained road element recognition model;
collecting a road picture in the driving process of a vehicle, and identifying road elements on the road picture by using the trained road element identification model;
and determining a driving guidance scheme of the vehicle according to the road element recognition result.
12. A scene element fusion processing apparatus, characterized in that the apparatus comprises:
a picture set acquiring unit, configured to acquire a source sample picture set including a plurality of source sample pictures and a target sample picture set including a plurality of target sample pictures; the source sample pictures and the target sample pictures have the same scene type, each source sample picture and each target sample picture are formed by multiple types of elements, and at least one element included in each source sample picture and each target sample picture is provided with corresponding labeling information;
the element extraction unit is used for respectively acquiring target area pictures containing target type elements from each source sample picture according to the labeling information corresponding to each source sample picture in the source sample picture set to obtain a target area picture set;
and the fusion processing unit is used for performing fusion processing aiming at the target type element on at least one target sample picture in the target sample picture set based on the obtained target area picture set to obtain at least one target sample picture containing the target type element.
13. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor, when executing the computer program, realizes the steps of the method of any one of claims 1 to 11.
14. A computer storage medium having computer program instructions stored thereon, wherein,
the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.
CN202110176324.0A 2021-02-07 2021-02-07 Scene element fusion processing method, device and equipment and computer storage medium Active CN112529116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176324.0A CN112529116B (en) 2021-02-07 2021-02-07 Scene element fusion processing method, device and equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176324.0A CN112529116B (en) 2021-02-07 2021-02-07 Scene element fusion processing method, device and equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN112529116A true CN112529116A (en) 2021-03-19
CN112529116B CN112529116B (en) 2021-06-25

Family

ID=74975639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176324.0A Active CN112529116B (en) 2021-02-07 2021-02-07 Scene element fusion processing method, device and equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112529116B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1987909A (en) * 2005-12-22 2007-06-27 腾讯科技(深圳)有限公司 Method, System and device for purifying Bayes negative sample
CN108932457A (en) * 2017-05-24 2018-12-04 腾讯科技(深圳)有限公司 Image-recognizing method and relevant apparatus
CN108491757A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Remote sensing image object detection method based on Analysis On Multi-scale Features study
CN111080748A (en) * 2019-12-27 2020-04-28 北京工业大学 Automatic picture synthesis system based on Internet
CN112258504A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Image detection method, device and computer readable storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677541A (en) * 2022-03-23 2022-06-28 成都智元汇信息技术股份有限公司 Method and system for extracting adhesion sample set based on target

Also Published As

Publication number Publication date
CN112529116B (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN112232293B (en) Image processing model training method, image processing method and related equipment
CN111754541B (en) Target tracking method, device, equipment and readable storage medium
CN111222500B (en) Label extraction method and device
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN110348463B (en) Method and device for identifying vehicle
CN111428448B (en) Text generation method, device, computer equipment and readable storage medium
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
CN115115872A (en) Image recognition method, device, equipment and storage medium
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN114429528A (en) Image processing method, image processing apparatus, image processing device, computer program, and storage medium
CN113569627B (en) Human body posture prediction model training method, human body posture prediction method and device
CN112215205B (en) Target identification method and device, computer equipment and storage medium
CN115760886B (en) Land parcel dividing method and device based on unmanned aerial vehicle aerial view and related equipment
CN112954399A (en) Image processing method and device and computer equipment
CN115100469A (en) Target attribute identification method, training method and device based on segmentation algorithm
CN112529116B (en) Scene element fusion processing method, device and equipment and computer storage medium
CN116664873B (en) Image information processing method, device and storage medium
CN114332484A (en) Key point detection method and device, computer equipment and storage medium
CN113537207B (en) Video processing method, training method and device of model and electronic equipment
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
Soni et al. Deep learning based approach to generate realistic data for ADAS applications
CN114120287B (en) Data processing method, device, computer equipment and storage medium
CN112102398B (en) Positioning method, device, equipment and storage medium
Zhuo et al. A novel vehicle detection framework based on parallel vision
CN114119757A (en) Image processing method, apparatus, device, medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040452

Country of ref document: HK