Detailed Description
The technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method and a device for generating simulated pedestrian animation, which solve the technical problem that conventional methods for generating simulated pedestrian animation cannot achieve low cost and high accuracy at the same time.
Referring to fig. 1, fig. 1 is a schematic view of a scene to which the method for generating simulated pedestrian animation of the present invention is applicable. The scene may include terminals and servers, which are connected and communicate with one another through a network formed by various gateways and the like. The application scene includes a perception source 11, a server 12 and a pedestrian 13, wherein:
the perception source 11 may be a roadside perception source arranged on one or both sides of a lane in a vehicle-road cooperative system, or a vehicle-mounted perception source arranged on an autonomous vehicle or a manned vehicle; each perception source comprises a laser radar and a camera, enabling accurate collection of environment-related data on the road, such as lanes, vehicles, pedestrians and obstacles;
the server 12 comprises a local server and/or a remote server and the like;
the pedestrian 13 is a real pedestrian walking in the target area.
The perception source 11, the server 12 and the pedestrian 13 are located in a wireless network or a wired network to realize data interaction among the three, wherein:
the server 12 first obtains perception data of the pedestrian 13 in a target area from at least two perception sources 11, the perception sources 11 comprising at least one of a vehicle-mounted perception source and a roadside perception source. The server matches the perception data of the different perception sources 11 and fuses the perception data according to the matching result to obtain fused data. From the fused data, the server extracts a plurality of first key frame data of a target pedestrian completing a complete action in the target area, and then labels the plurality of first key frame data to obtain labeled data. The labeled data comprise first labeled data and second labeled data; the first labeled data comprise real pedestrian attribute information, real pedestrian position information and real environment information, and the second labeled data comprise real pedestrian skeleton key point position information. The server then obtains a correspondence between the first labeled data and the second labeled data, and obtains second target data according to first target data and the correspondence. The first target data comprise simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information; the second target data comprise a plurality of second key frame data of a simulated pedestrian completing a complete action, each second key frame data comprising simulated pedestrian skeleton key point position information. Finally, the server generates the simulated pedestrian animation according to the second target data.
It should be noted that the system scene diagram shown in fig. 1 is only an example; the server and the scene described in the present invention are intended to illustrate the technical solution of the present invention more clearly and do not limit it, and, as known to those skilled in the art, the technical solution provided by the present invention is equally applicable to similar technical problems as the system evolves and new service scenes appear. Details are described below. It should be noted that the order in which the embodiments are described is not intended to limit their preferred order.
Referring to fig. 2, fig. 2 is a schematic flow chart of the method for generating simulated pedestrian animation of the present invention, the method comprising:
S201: acquiring perception data of a target area from at least two perception sources, wherein the perception sources comprise at least one of a vehicle-mounted perception source and a roadside perception source.
In the invention, the vehicle-road cooperative system adopts advanced wireless communication, new-generation internet and other technologies to implement dynamic, real-time vehicle-road information interaction in all directions. On the basis of full-time-space dynamic traffic information acquisition and fusion, it develops active vehicle safety control and cooperative road management, fully realizing effective cooperation among humans, vehicles and roads, ensuring traffic safety and improving traffic efficiency, thereby forming a safe, efficient and environment-friendly road traffic system. A perception source is a hardware resource deployed in the target area in the vehicle-road cooperative system.
The target area is an actual ground area from which perception data need to be acquired. Multiple perception objects, such as pedestrians, vehicles, obstacles, lanes and traffic signs, are present in the target area, and at least two perception sources are arranged in it. The perception sources comprise at least one of vehicle-mounted perception sources and roadside perception sources; that is, the target area may contain only vehicle-mounted perception sources, only roadside perception sources, or both at the same time, so the perception sources in the same perception scene may be vehicle-mounted + vehicle-mounted, roadside + roadside, or vehicle-mounted + roadside. A vehicle-mounted perception source is a vehicle-mounted sensor mounted on an autonomous vehicle or a manned vehicle; a roadside perception source is a roadside sensor arranged on one side or both sides of a lane in the vehicle-road cooperative system. The vehicle-mounted and roadside sensors comprise cameras and laser radars, and may also comprise other types of sensors, enabling accurate collection of environment-related data on the road, such as lanes, vehicles, pedestrians and obstacles.
In one embodiment, S201 specifically includes: acquiring initial perception data of a target area from at least two perception sources, wherein the initial perception data of each perception source comprise point cloud data and image data associated with each other; extracting, from each point cloud data, a three-dimensional bounding box and pedestrian point cloud data of each pedestrian in the target area, and extracting, from the image data associated with the point cloud data, a two-dimensional bounding box of each pedestrian in the target area; and combining the three-dimensional bounding box, the pedestrian point cloud data and the two-dimensional bounding box extracted from the same initial perception data to obtain the perception data corresponding to that perception source.
Each perception source comprises a laser radar and a camera, and the initial perception data of a perception source comprise first initial perception data obtained by the laser radar perceiving the target area and second initial perception data obtained by the camera perceiving the target area, the first initial perception data being point cloud data and the second initial perception data being image data. The laser radar and the camera in the same perception source are jointly calibrated before perception, i.e. their extrinsic parameters are known, so the point cloud data and the image data of the same perception source can be associated. In the invention, the perception data are acquired mainly to obtain pedestrian-related information, while the point cloud data also contain points of other perception objects; therefore, for every pedestrian in the point cloud data, the three-dimensional bounding box and the pedestrian point cloud data corresponding to that pedestrian are extracted. The image data comprise a complete image of the overall environment within the perception range, i.e. the complete image also contains other perception objects and the environment. After the three-dimensional bounding box is extracted, the corresponding pedestrian position in the complete image is found according to the three-dimensional bounding box and the extrinsic parameters between the laser radar and the camera, and the two-dimensional bounding box of the pedestrian is extracted; the three-dimensional bounding box, the pedestrian point cloud data and the two-dimensional bounding box extracted from the same initial perception data are then combined as the perception data of the corresponding perception source for the pedestrians in the target area. In the resulting perception data of each perception source, the three-dimensional bounding box, the pedestrian point cloud data and the two-dimensional bounding box are all associated, i.e. once the three-dimensional bounding box of a pedestrian is known, that pedestrian's point cloud data and two-dimensional bounding box can be determined from it. For the image data associated with each point cloud data, the complete image only needs to be stored once; the two-dimensional bounding boxes in the perception data of the same perception source are associated with the complete image, and the complete image is retained so that the current real environment information can be identified in subsequent steps.
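As an illustration of how a two-dimensional bounding box can be located in the complete image from a three-dimensional bounding box and the known laser radar-camera extrinsics, the following is a minimal sketch that projects the eight corners of a 3D box into the image plane. The function name, the 4x4 extrinsic matrix T_cam_lidar and the 3x3 intrinsic matrix K are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def project_3d_box_to_image(box_corners_lidar, T_cam_lidar, K):
    """Project the 8 corners of a pedestrian's 3D bounding box (lidar frame)
    into the camera image and take the tight 2D box around the projections.
    Assumes the pedestrian is in front of the camera."""
    corners = np.asarray(box_corners_lidar, float)       # (8, 3)
    corners_h = np.hstack([corners, np.ones((8, 1))])    # homogeneous coords
    corners_cam = (T_cam_lidar @ corners_h.T).T[:, :3]   # into camera frame
    corners_cam = corners_cam[corners_cam[:, 2] > 0]     # drop points behind
    pix = (K @ corners_cam.T).T                          # pinhole projection
    pix = pix[:, :2] / pix[:, 2:3]
    x_min, y_min = pix.min(axis=0)
    x_max, y_max = pix.max(axis=0)
    return x_min, y_min, x_max, y_max                    # 2D bounding box
```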
S202: matching the perception data of different perception sources, and fusing the perception data according to the matching result to obtain fused data.
When the target area is perceived, the perception data of a single perception source are not comprehensive enough; for example, only certain angles may be covered, leaving perception blind areas at other angles. The perception data of different perception sources are therefore matched, the pieces of perception data belonging to the same pedestrian are determined from the matching, and these pieces are fused to obtain fused data, which then reflect the perception of that pedestrian more comprehensively and accurately. Since there are usually several pedestrians in the target area, perception data of several pedestrians can be obtained at the same time. Before matching, the coordinate position of each perception source in the target area is obtained, and the matching and fusion of the perception data are then carried out according to the positional relationship among the perception sources.
In one embodiment, S202 specifically includes: matching the three-dimensional bounding boxes of different perception data, and generating a matching result according to the overlap ratio of the three-dimensional bounding boxes; and fusing the different perception data according to the matching result to obtain fused data. During perception, the perception data of one perception source at one moment form a frame of data containing the perception results of several pedestrians; at least two perception sources perceiving the same scene at the same moment yield at least two frames of data, which are matched and fused to obtain the fused data. For matching, the three-dimensional bounding boxes of different pedestrians extracted from the same point cloud data differ, whereas the three-dimensional bounding boxes of the same pedestrian extracted from different point cloud data are highly similar. The perception data of different perception sources are therefore matched box by box, and whether two boxes belong to the same pedestrian is decided from their overlap ratio: when the overlap ratio of two three-dimensional bounding boxes from different point cloud data exceeds a preset ratio, the pedestrians corresponding to the two boxes are taken to be the same pedestrian. After all three-dimensional bounding boxes in each point cloud data have been matched, the perception data belonging to the same pedestrian are fused according to the matching results, giving complete perception data for each pedestrian, from which a complete three-dimensional point cloud model of the pedestrian can be obtained. This model combines pedestrian point cloud data from different angles, so it represents the pedestrian's perception information more accurately and completely. It should be noted that, in order to improve the fit when merging multiple pieces of pedestrian point cloud data, the positions of the pedestrian point clouds may be fine-tuned manually.
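A minimal sketch of this matching step, assuming axis-aligned three-dimensional bounding boxes given as (x_min, y_min, z_min, x_max, y_max, z_max): the overlap ratio is computed as a 3D intersection-over-union, and pairs exceeding a preset ratio are treated as the same pedestrian. The greedy all-pairs loop and the threshold value are illustrative choices.

```python
import numpy as np

def overlap_ratio_3d(box_a, box_b):
    """Overlap ratio (3D IoU) of two axis-aligned bounding boxes given as
    (x_min, y_min, z_min, x_max, y_max, z_max)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    lo = np.maximum(a[:3], b[:3])               # intersection lower corner
    hi = np.minimum(a[3:], b[3:])               # intersection upper corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

def match_pedestrians(boxes_a, boxes_b, preset_ratio=0.5):
    """Pair detections from two perception sources whose 3D bounding boxes
    overlap by more than the preset ratio, i.e. the same pedestrian."""
    return [(i, j)
            for i, a in enumerate(boxes_a)
            for j, b in enumerate(boxes_b)
            if overlap_ratio_3d(a, b) > preset_ratio]
```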
S203: extracting, from the fused data, a plurality of first key frame data of a target pedestrian completing a complete action in the target area.
There may be one or more pedestrians in the target area, and they can perform various actions there, such as walking, jogging, turning, watching a mobile phone or opening an umbrella. Each fused data contains the perception results of all pedestrians within the corresponding perception range; since the actions of different pedestrians are not necessarily synchronized, a target pedestrian is determined in the target area and a plurality of first key frame data of one complete action of that pedestrian are extracted. Taking walking as an example, a complete action may be the pedestrian's motion over the whole process of one step, from the start of the step to its end. Pedestrian motion is continuous, and the fused data are likewise continuous frame data; the plurality of first key frame data are extracted from these continuous frames, each first key frame data containing the perception result of the target pedestrian at a different stage of the complete action. A key frame is the frame in which a key action occurs during the complete action, a key action being, for example, lifting a leg, swinging an arm or placing a foot. Usually 15 first key frame data suffice to describe one complete action; of course, fewer or more than 15 frames may be extracted according to the actual situation, and a person skilled in the art may set the number of extracted first key frame data as needed.
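The key frames in the described method are the frames of key actions such as lifting a leg, swinging an arm or placing a foot. As a simplified stand-in for key-action detection, the sketch below samples 15 evenly spaced frames across the continuous frame data of one complete action; the function name and the uniform-sampling strategy are illustrative assumptions.

```python
import numpy as np

def extract_first_keyframes(frames, num_keyframes=15):
    """Sample num_keyframes frames spanning one complete action.
    Uniform sampling stands in here for detecting key actions such as
    leg lifting, arm swinging and foot placing."""
    idx = np.linspace(0, len(frames) - 1, num_keyframes).round().astype(int)
    return [frames[i] for i in idx]
```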
When a perception source perceives the target area, each sensor has its own perception period; when the acquired perception data cover several perception periods, several pieces of perception data are acquired and several pieces of fused data are obtained. In addition, the number and placement of perception sources differ at different positions of the target area, such as different sidewalks and zebra crossings, so a plurality of perception data can be obtained over the whole target area and a plurality of fused data are obtained after fusion.
In one embodiment, S203 specifically includes: acquiring the tracking duration of each fused data; determining the fused data whose tracking duration is greater than a preset value as target fused data; and extracting, from the target fused data, a plurality of first key frame data of a target pedestrian completing a complete action in the target area. Different fused data have different tracking durations, the tracking duration being the actual length of time for which a pedestrian is tracked in the acquired perception data; because factors such as the perception frequency, perception position and perception task differ from source to source, the tracking durations of the fused data differ as well. The tracking duration of each fused data is therefore acquired first, a preset value such as 3 seconds is set, the fused data whose tracking duration exceeds the preset value are determined as target fused data and stored, and the fused data whose tracking duration does not exceed the preset value are not stored. When the tracking time of certain fused data is short, its perception result has little reference value, so that portion of the fused data is discarded and only the fused data with long tracking time are retained, which reduces the occupation of storage resources without harming the subsequent extraction. After the target fused data are determined, the plurality of first key frame data of the target pedestrian completing a complete action in the target area are extracted from the target fused data, improving the quality of the first key frame data.
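A minimal sketch of the tracking-duration filter, assuming each fused track records its first and last perceived timestamps; the dictionary keys and the 3-second preset value are illustrative assumptions.

```python
def filter_by_tracking_duration(fused_tracks, min_seconds=3.0):
    """Keep only fused data whose tracking duration exceeds the preset
    value; shorter tracks are discarded to save storage resources."""
    return [track for track in fused_tracks
            if track["last_timestamp"] - track["first_timestamp"] > min_seconds]
```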
In one embodiment, the step of extracting, from the target fused data, a plurality of first key frame data of a target pedestrian completing a complete action in the target area comprises: intercepting, from the target fused data, complete and clear target continuous frame data of the target pedestrian in the target area; and extracting, from the target continuous frame data, a plurality of first key frame data of the target pedestrian completing a complete action. When a perception source perceives the target area, complete and clear perception data are not necessarily obtained, owing to factors such as the perception environment and perception angle. After the target fused data are determined, the stored target fused data are therefore checked manually, complete and clear target continuous frame data of the target pedestrian are intercepted, and the plurality of first key frame data are then extracted from those target continuous frame data, further improving the quality of the first key frame data.
S204: labeling the plurality of first key frame data to obtain labeled data, wherein the labeled data comprise first labeled data and second labeled data, the first labeled data comprise real pedestrian attribute information, real pedestrian position information and real environment information, and the second labeled data comprise real pedestrian skeleton key point position information.
In the present invention, the real pedestrian attribute information is the attribute information of the target pedestrian actually present in the target area, specifically including height, width, age range (teenager/youth/adult/old age), gender (male/female), action (watching a mobile phone/making a call/opening an umbrella, etc.) and the like; the real pedestrian position information is the position of the target pedestrian within the target area, such as on a zebra crossing, a sidewalk or a motor vehicle lane; the real environment information is the current environmental weather in the target area, such as sunny, rain or snow; and the real pedestrian skeleton key point position information is the positions of the skeleton key points while the target pedestrian performs the current action. Skeleton key points are the movable key joints of the human body, specifically including the head, eyes, neck, spine, shoulders, arms, hips and legs; in view of data acquisition precision and the application scene, parts such as the face and fingers are not included. Once the skeleton key point position information of the target pedestrian is acquired, the pedestrian's current action can be reconstructed.
During labeling, the plurality of first key frame data extracted together for one complete action are used as one group of information to be labeled and input into a manual or semi-automatic labeling tool, which yields the labeled data for that group.
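The two types of labeled data can be pictured as the following record structures; the class and field names and types are purely illustrative assumptions about how the first labeled data (attributes, position, environment) and the second labeled data (skeleton key point positions per key frame) might be organized.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FirstLabeledData:
    """Real pedestrian attributes, position and environment (illustrative fields)."""
    height_m: float
    width_m: float
    age_range: str     # "teenager" / "youth" / "adult" / "old age"
    gender: str        # "male" / "female"
    action: str        # "watching a mobile phone", "making a call", ...
    location: str      # "sidewalk" / "zebra crossing" / "motor vehicle lane"
    weather: str       # "sunny" / "rain" / "snow"

@dataclass
class SecondLabeledData:
    """Skeleton key point positions for one key frame: name -> (x, y, z)."""
    keypoints: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)

@dataclass
class LabeledGroup:
    """One complete action: one first labeled record plus one second labeled
    record per extracted key frame (e.g. 15)."""
    first: FirstLabeledData
    second_per_frame: List[SecondLabeledData] = field(default_factory=list)
```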
In one embodiment, S204 specifically includes: determining the real pedestrian attribute information according to the three-dimensional bounding boxes of the first key frame data and the complete image associated with the two-dimensional bounding boxes of the first key frame data; determining the real pedestrian position information according to the three-dimensional bounding box and a prior map corresponding to the target area; determining the real environment information according to the complete image; determining the real pedestrian skeleton key point position information according to the pedestrian point cloud data of the first key frame data and the complete image; and generating the first labeled data according to the real pedestrian attribute information, the real pedestrian position information and the real environment information, generating the second labeled data according to the real pedestrian skeleton key point position information, and obtaining the labeled data from the first labeled data and the second labeled data.
As shown at a in fig. 3, the fused data correspond to a certain pedestrian and include the merged three-dimensional bounding box 31 of the pedestrian and the pedestrian point cloud data 32, the three-dimensional bounding box 31 completely enclosing the pedestrian point cloud data 32. In the labeling tool, the height and width of the pedestrian can be determined from the size of the three-dimensional bounding box 31. The two-dimensional bounding box represents the coordinate position of the pedestrian in the corresponding complete image, so the complete image can be retrieved by querying with the two-dimensional bounding box (not shown) corresponding to the fused data; the pedestrian's age range (teenager/youth/adult/old age), gender (male/female) and action (watching a mobile phone/making a call/opening an umbrella, etc.) are then identified from the complete image, and the real pedestrian attribute information is obtained by combining the two. Similarly, after the pedestrian is located in the associated complete image from the coordinate information of the two-dimensional bounding box, the weather of the current environment (sunny, rain, snow, etc.), i.e. the real environment information, can be determined. The three-dimensional bounding box represents the coordinate position of the pedestrian in the corresponding point cloud data, so the pedestrian's location (sidewalk/zebra crossing/motor vehicle lane, etc.), i.e. the real pedestrian position information, can be obtained by querying, with the coordinate information of the three-dimensional bounding box, a prior map of the target area (a high-precision map storing information about each environment and object in the target area). With the merged pedestrian point cloud data 32, i.e. the complete pedestrian three-dimensional point cloud model, as the primary basis and the corresponding complete image as reference and supplement, the positions of all skeleton key points of the whole body under the pedestrian's current action, i.e. the real pedestrian skeleton key point position information, can be obtained. Because the complete three-dimensional point cloud model represents the pedestrian's perception information more accurately and completely, labeling the skeleton key points on the pedestrian three-dimensional point cloud model yields a more accurate labeling result.
The human skeleton key points comprise the left foot, left leg, left hip, left hand, left arm, left shoulder, left eye, right foot, right leg, right hip, right hand, right arm, right shoulder, right eye, pelvis and cervical vertebra; the connections between the skeleton key points are shown in fig. 4. When pedestrians act differently, the position information of each skeleton key point differs, but the skeleton as a whole conforms to the arrangement rules of the human skeleton. When the real pedestrian skeleton key point position information is determined, the relative relation between each skeleton key point and a reference point located on the ground is determined in turn. Specifically, as shown in fig. 5, assuming the reference point is point A and a certain skeleton key point is point B, point A is taken as the coordinate origin and an x-axis, y-axis and z-axis are established to form a coordinate system; the distance and rotation angle between point B and point A then represent the position information of point B. When point B moves with the pedestrian's motion, its position information changes accordingly.
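A minimal sketch of expressing a skeleton key point relative to the ground reference point of fig. 5: the distance plus two rotation angles (azimuth and elevation) fix the position of point B with respect to point A. The decomposition into azimuth and elevation is one possible reading of the "distance and rotation angle" representation; the function name is illustrative.

```python
import numpy as np

def keypoint_relative_pose(point_a, point_b):
    """Express skeleton key point B relative to ground reference point A,
    with A as the coordinate origin, as a distance plus two rotation angles."""
    v = np.asarray(point_b, float) - np.asarray(point_a, float)
    distance = float(np.linalg.norm(v))
    azimuth = float(np.arctan2(v[1], v[0]))        # rotation about the z-axis
    elevation = float(np.arcsin(v[2] / distance))  # angle above the x-y plane
    return distance, azimuth, elevation
```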
The real pedestrian skeleton key point positions determined from the complete three-dimensional point cloud model are shown at b in fig. 3. Through the above process, the first labeled data are generated from the determined real pedestrian attribute information, real pedestrian position information and real environment information, the second labeled data are generated from the real pedestrian skeleton key point position information, and combining the first labeled data and the second labeled data yields the labeled data corresponding to this group of first key frame data.
S205: acquiring a correspondence between the first labeled data and the second labeled data, and acquiring second target data according to first target data and the correspondence, wherein the first target data comprise simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information, the second target data comprise a plurality of second key frame data of a simulated pedestrian completing a complete action, and each second key frame data comprises simulated pedestrian skeleton key point position information.
The first labeled data comprise the real pedestrian attribute information, the real pedestrian position information and the real environment information, and the second labeled data comprise the real pedestrian skeleton key point position information. For a class of real pedestrians with given attributes, the actions they display at a given position in a given environment share certain commonalities. For example, for pedestrians whose age range is teenager, height h1, width w1, gender female, and whose action is watching a mobile phone, the phone-watching actions on a sidewalk on a sunny day are similar: holding the phone at a particular height with the right hand, operating it with one hand, and looking down at it with a large head-lowering amplitude. Likewise, for pedestrians whose age range is old age, height h2, width w2 and gender male, the phone-watching actions on a sidewalk on a sunny day are similar: holding the phone at a particular height with the left hand, raising the right hand to operate it, and looking down at it with a small head-lowering amplitude. That is, there is a correspondence between the first labeled data and the second labeled data.
After the correspondence between the two types of labeled data is obtained, the first target data are input into the simulation system, and the second target data corresponding to the first target data can be obtained according to the correspondence between the first labeled data and the second labeled data. The first target data comprise the simulated pedestrian attribute information, the simulated pedestrian position information and the simulated environment information. The simulated pedestrian attribute information is the attribute information of a simulated pedestrian created in the simulation system, specifically including height, width, age range (teenager/youth/adult/old age), gender (male/female) and action (watching a mobile phone/making a call/opening an umbrella, etc.); the simulated pedestrian position information is the position of the simulated pedestrian in the simulation area created in the simulation system, such as on a zebra crossing, a sidewalk or a motor vehicle lane; and the simulated environment information is the current environmental weather of the simulation system, such as sunny, rain or snow. The second target data comprise the simulated pedestrian skeleton key point position information, i.e. the positions of the skeleton key points while the simulated pedestrian performs the current action; as above, the skeleton key points are the movable key joints of the human body, including the head, eyes, neck, spine, shoulders, arms, hips and legs, while parts such as the face and fingers are not included in view of data acquisition precision and the application scene.
The first target data and the first labeled data both comprise pedestrian attributes, pedestrian position and environment information, while the second target data and the second labeled data both comprise pedestrian skeleton key point position information, so the correspondence between the first target data and the second target data can be obtained by analogy from the correspondence between the first labeled data and the second labeled data. Once the correspondence between the first labeled data and the second labeled data and the first target data are known, the corresponding second target data can be obtained. The second target data are a plurality of second key frame data, each comprising simulated pedestrian skeleton key point position information, and the number of second key frame data equals the number of first key frame data, for example 15 frames. The correspondence between the first labeled data and the second labeled data may be obtained in various ways, for example with a machine learning model or with a statistical analysis method.
In one embodiment, when a machine learning model is adopted, S205 specifically includes: training a skeleton key point learning model with the first labeled data as training input data and the second labeled data as training output data, and obtaining the correspondence between the first labeled data and the second labeled data from the trained skeleton key point learning model; and configuring and inputting the first target data and calling the skeleton key point learning model to obtain the second target data. The skeleton key point learning model is a machine learning model; the first labeled data and the second labeled data are used as training input and training output respectively, and the trained model automatically generates output data from input first labeled data. As the amount of training samples and the number of training iterations increase, the error between the output data and the corresponding second labeled data decreases, so the trained model can express the correspondence between the first labeled data and the second labeled data. After the first target data are input, calling the trained skeleton key point learning model yields the corresponding second target data, which comprise simulated pedestrian skeleton key point position information; compared with the real pedestrian skeleton key point position information obtained from a real pedestrian with the same attributes in the same environment and position, the error is small, i.e. the simulated pedestrian skeleton key point position information truly and accurately reflects the action of a real pedestrian.
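A minimal training sketch for such a skeleton key point learning model, assuming the first labeled data are encoded as a fixed-length feature vector and the second labeled data are flattened into key point coordinates (16 key points x 3 coordinates x 15 key frames). The network shape, the feature encoding and the use of PyTorch are illustrative choices; the method itself does not prescribe a particular model.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 7 encoded attribute/position/environment features in,
# 16 skeleton key points x 3 coordinates x 15 key frames out.
NUM_FEATURES, NUM_KEYPOINTS, NUM_FRAMES = 7, 16, 15

model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, NUM_KEYPOINTS * 3 * NUM_FRAMES),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(first_labeled, second_labeled):
    """One gradient step: first_labeled is a (batch, NUM_FEATURES) encoding of
    the first labeled data, second_labeled a (batch, 16*3*15) flattening of
    the corresponding skeleton key point positions."""
    optimizer.zero_grad()
    loss = loss_fn(model(first_labeled), second_labeled)
    loss.backward()
    optimizer.step()
    return loss.item()
```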
In one embodiment, when the statistical analysis method is adopted, S205 further includes: acquiring multiple groups of labeled data; acquiring identical target first labeled data from the multiple groups of labeled data, and respectively acquiring each target second labeled data corresponding to the target first labeled data; obtaining a target second labeled data set according to the labeled content of each target second labeled data; acquiring the correspondence between the target first labeled data and the target second labeled data set; and determining the first target data according to the target first labeled data, and generating the second target data within a preset labeling range corresponding to the target second labeled data set according to the correspondence.
With the statistical analysis method, the first labeled data and the second labeled data in multiple groups of labeled data are analyzed. Identical target first labeled data exist among the multiple groups, identical target first labeled data being first labeled data obtained from pedestrians with the same class of attributes in the same environment and at the same position, for example pedestrians whose age range is teenager, height h1, width w1, gender female, whose action is watching a mobile phone, and who are on a sidewalk on a sunny day. Since such pedestrians are different people in the actual scene and each person's actions differ somewhat, a plurality of different target second labeled data can be obtained from different groups of labeled data for the same target first labeled data; if there are N identical target first labeled data among M groups of labeled data, there are N different target second labeled data.
A target second labeled data set is obtained from the N different target second labeled data, and the target second labeled data in the set are then screened, analyzed and sorted by statistical methods to extract their commonalities. For example, when such pedestrians hold a mobile phone, the common behavior is holding the phone at a particular height with the right hand, operating it with one hand, and looking down at it with a large head-lowering amplitude; the real pedestrian skeleton key point position information fluctuates within a certain range from person to person, but the overall behavior is similar, and the correspondence between the target first labeled data and the target second labeled data set is obtained after this analysis. When the amount of labeled data is large enough, a plurality of target first labeled data and a plurality of target second labeled data sets can be obtained. In subsequent simulation, if the content of the first target data is the same as certain target first labeled data, the second target data can be generated directly and randomly within the preset labeling range of the corresponding target second labeled data set according to the correspondence, the preset labeling range being the range formed by all possible values within a preset variance range around the data in the target second labeled data set.
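A minimal sketch of this statistical route, assuming each labeled group carries its target first labeled data as a lookup key and its key point positions as an array: groups with identical keys are pooled, per-key-point mean and standard deviation are computed, and second target data are later sampled within that variance-based preset range. The dictionary keys and the Gaussian sampling are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def build_correspondence(labeled_groups):
    """Pool second labeled data over identical target first labeled data and
    compute per-key-point mean and standard deviation."""
    buckets = defaultdict(list)
    for group in labeled_groups:
        key = (group["age_range"], group["gender"], group["action"],
               group["location"], group["weather"])      # target first labeled data
        buckets[key].append(np.asarray(group["keypoints"]))  # (frames, joints, 3)
    return {key: (np.mean(arrs, axis=0), np.std(arrs, axis=0))
            for key, arrs in buckets.items()}

def sample_second_target(correspondence, first_target,
                         rng=np.random.default_rng()):
    """Randomly generate second target data within the variance-based preset
    labeling range of the matching target second labeled data set."""
    mean, std = correspondence[first_target]
    return rng.normal(mean, std)
```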
S206: generating the simulated pedestrian animation according to the second target data.
Besides the plurality of second key frame data, the second target data comprise the first frame time and the last frame time of the second key frame data. The second key frame data are not fully continuous, whereas the simulated pedestrian animation consists of continuous frame data; after the second target data are obtained, they are processed so that continuous frame data are formed over the period between the first frame time and the last frame time, yielding the simulated pedestrian animation. The simulated pedestrian animation reproduces the whole process of the simulated pedestrian completing a complete action. In the simulation, the first target data and the second target data both serve as input data of the simulation system, and the output of the simulation system is the finally rendered 3D world and on-screen effect.
In one embodiment, S206 specifically includes: inserting transition frame data among the plurality of second key frame data by interpolation; superimposing action attribute data on the second target data; and generating the simulated pedestrian animation corresponding to the second target data according to the insertion result, the superposition result and the time difference between the first and last frames of the plurality of second key frame data. Transition frame data are inserted between adjacent second key frame data by interpolation, and the transition frame data likewise comprise simulated pedestrian skeleton key point position information; if the simulated pedestrian's arm is at a first position in one second key frame data and at a second position in the next, the transition frame data represent the arm's position in each frame as it moves from the first position to the second. During interpolation, a preset number of transition frames may be inserted between adjacent second key frame data according to actual needs. The time difference between the first and last frames of the second key frame data represents the action speed of the simulated pedestrian: the smaller the difference, the faster the action, and conversely the slower. Meanwhile, action attribute data are superimposed on the second target data. The action attribute data are usually random noise acting mainly on the swing amplitude of the hands and feet and on the overall action speed, and represent the allowable action deviation range of the simulated pedestrian; for example, if the hand swing amplitude of a crowd with specific attributes walking one step in a specific environment is calculated from the plurality of second key frame data to be L, then after random noise is superimposed the hand swing amplitude for one step becomes L ± s, where s is the allowable action deviation when walking. Superimposing random noise in this way increases the diversity of pedestrian actions in the simulation system. Finally, the simulated pedestrian animation is generated from the insertion result, the superposition result and the first/last frame time difference.
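A minimal sketch of this step, assuming the second key frame data arrive as an array of key point positions plus first and last frame times: linear interpolation inserts a preset number of transition frames between adjacent key frames, random noise superimposes the allowed action deviation (the L ± s range), and the first/last frame time difference fixes the playback speed. The frame count and noise scale are illustrative assumptions.

```python
import numpy as np

def build_animation(keyframes, first_time, last_time,
                    frames_between=4, noise_scale=0.01):
    """Interpolate transition frames between adjacent second key frames and
    superimpose random noise as the allowable action deviation (L ± s).
    keyframes: (K, joints, 3) skeleton key point positions."""
    keyframes = np.asarray(keyframes, float)
    frames = []
    for k in range(len(keyframes) - 1):
        for s in range(frames_between + 1):
            t = s / (frames_between + 1)
            frame = (1 - t) * keyframes[k] + t * keyframes[k + 1]
            frame += np.random.normal(0.0, noise_scale, frame.shape)
            frames.append(frame)
    frames.append(keyframes[-1])
    # The first/last frame time difference sets the action speed (playback fps).
    fps = len(frames) / (last_time - first_time)
    return np.stack(frames), fps
```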
According to the above embodiment, the perception data are acquired directly through the vehicle-mounted and roadside perception sources in the vehicle-road cooperative system, so no additional acquisition equipment needs to be purchased and the cost is reduced. The perception data are fused to ensure their completeness and accuracy, the plurality of first key frame data are extracted from the fused data and labeled, and the corresponding second target data are obtained after the first target data are configured, according to the correspondence between the first labeled data and the second labeled data. Because the data acquired in the early stage are all data of real pedestrians in a real environment, the second target data obtained through the correspondence reflect the action changes of pedestrians in a specific environment more truly, which improves the fineness, authenticity and accuracy of the simulated pedestrian animation obtained from the second target data. The whole labeling process is simple and easy to operate, greatly reducing the difficulty of generating simulated pedestrian animation; the cost is reduced while the accuracy is still guaranteed, i.e. the invention provides a simulated animation generation method with better overall performance. The simulated pedestrian animation is generated automatically in the simulation system from real data, and this automated process makes the construction of simulation scenes more intelligent. Simulation reproduction based on actual scene data adds more real and effective information to unmanned-driving system tests in the simulation environment, facilitates tests of functional modules such as pedestrian trajectory prediction and human-vehicle interaction, and accelerates unmanned systems toward higher-level, fully unmanned driving.
In a traditional simulation system, simulated environment data can be acquired automatically from the environment in the simulation system and simulated pedestrian attribute information is input directly, but the simulated pedestrians cannot display different actions as the environment changes or the position differs. In the simulated pedestrian animation finally obtained in the invention, each pedestrian automatically generates varying actions according to differences in its attributes, environment and position, so the action changes of pedestrians in a real environment are simulated more truly and accurately, and the generated simulated pedestrian animation has more reference value.
As shown in fig. 6, a second flow diagram of the method for generating simulated pedestrian animation of the present invention is shown from the product side. There, the method is embodied by hardware resources deployed at the vehicle end and the road end of a vehicle-road cooperative system, a skeleton key point labeling tool, and software for automatically generating pedestrian skeleton animation in a simulation system, the hardware resources mainly comprising laser radars, cameras and related computing, storage and communication units. The method specifically includes the steps of:
S601: perception data acquisition.
The perception sources in the same perception scene may be vehicle-mounted + vehicle-mounted, roadside + roadside or vehicle-mounted + roadside; each perception source comprises a laser radar and a camera, the laser radar acquires point cloud data, the camera acquires image data, and the point cloud data and image data of the same perception source are associated. For each perception source, the three-dimensional bounding box and pedestrian point cloud data corresponding to each pedestrian are extracted from the point cloud data, the two-dimensional bounding box corresponding to each pedestrian is extracted from the image data, and the three-dimensional bounding box, pedestrian point cloud data and two-dimensional bounding box corresponding to the same point cloud data are then combined as that perception source's perception data for the pedestrians in the target area. In addition, the complete image of the image data, which is associated with the two-dimensional bounding boxes, is stored.
S602: fusion and tracking.
During perception, the perception data of one perception source at one moment form a frame of data containing the perception results of several pedestrians, and at least two perception sources perceiving the same scene at the same moment yield at least two frames of data. The frames are sent to a fusion unit and matched according to the overlap ratio of the three-dimensional bounding boxes; results in which different perception data overlap and meet the conditions are retained, the matched perception results are cached, the fused data are input to a tracking unit, and the fused pedestrian point cloud data form a complete three-dimensional point cloud model of the pedestrian. The tracking unit stores continuous frame data stably tracked for 3 seconds or more in the storage unit.
S603: manual screening.
The stored continuous frame data are checked manually, and the complete and clear target continuous frame data of the target pedestrian in the target area are intercepted.
S604: data labeling.
Fifteen first key frame data of a pedestrian completing a complete action (walking, jogging, etc.) are extracted from the target continuous frame data for labeling. The 15 first key frame data extracted together are labeled as one group with the labeling tool. During labeling, the real pedestrian attribute information is determined from the three-dimensional bounding box and the complete image associated with the two-dimensional bounding box; the real pedestrian position information is determined from the three-dimensional bounding box and a prior map corresponding to the target area; the real environment information is determined from the complete image; the real pedestrian skeleton key point position information is determined from the pedestrian point cloud data and the complete image; the first labeled data are generated from the real pedestrian attribute information, real pedestrian position information and real environment information, the second labeled data are generated from the real pedestrian skeleton key point position information, and the labeled data are obtained from the first labeled data and the second labeled data.
There is a correspondence between the first labeled data and the second labeled data, and it can be obtained in two ways: when a machine learning model is adopted, S605 is executed; when the statistical analysis method is adopted, S606 is executed.
S605: training and invoking the learning model.
The skeleton key point learning model is trained with the first labeled data as training input data and the second labeled data as training output data, the correspondence between the first labeled data and the second labeled data is obtained from the trained model, the first target data are then configured and input in the simulation system, and the skeleton key point learning model is called to obtain the second target data. The first target data comprise simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information; the second target data comprise a plurality of second key frame data of a simulated pedestrian completing a complete action, each second key frame data comprising simulated pedestrian skeleton key point position information.
S606: data statistics.
Multiple groups of labeled data are first acquired, the action commonalities of a class of real pedestrians with the same attributes in a specific environment and position are obtained by the statistical analysis method, the same attributes, environment and position are then input into the simulation system as the first target data, and the simulated pedestrian skeleton key point positions are generated randomly within the variance range according to the commonalities, giving the corresponding second target data.
S607: generating the simulated pedestrian animation.
Transition frame data are inserted among the plurality of second key frame data by interpolation, and action attribute data, which act mainly on the swing amplitude of the hands and feet and the overall action speed and represent the allowable action deviation range of the simulated pedestrian, are superimposed on the second target data; the simulated pedestrian animation corresponding to the second target data is generated from the insertion result, the superposition result and the time difference between the first and last frames of the plurality of second key frame data.
Through the above process, the simulated pedestrian animation is generated automatically in the simulation system from real data. The automated process makes the construction of simulation scenes more intelligent; simulation reproduction based on actual scene data adds more real and effective information to unmanned-driving system tests in the simulation environment, facilitates tests of functional modules such as pedestrian trajectory prediction and human-vehicle interaction, and accelerates unmanned systems toward higher-level, fully unmanned driving.
Correspondingly, fig. 7 is a schematic structural diagram of the device for generating simulated pedestrian animation of the present invention. Referring to fig. 7, the device includes:
the acquisition module 110 is configured to acquire perception data of a target area from at least two perception sources, where the perception sources include at least one of a vehicle-mounted perception source and a roadside perception source;
the matching module 120 is configured to match the perception data of different perception sources, and fuse the perception data according to a matching result to obtain fused data;
the extraction module 130 is configured to extract, from the fused data, a plurality of first key frame data of a target pedestrian completing a complete action in the target area;
the labeling module 140 is configured to label the plurality of first key frame data to obtain labeled data, where the labeled data include first labeled data and second labeled data, the first labeled data include real pedestrian attribute information, real pedestrian position information and real environment information, and the second labeled data include real pedestrian skeleton key point position information;
the obtaining module 150 is configured to obtain a correspondence between the first labeled data and the second labeled data, and obtain second target data according to first target data and the correspondence, where the first target data include simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information, the second target data include a plurality of second key frame data of a simulated pedestrian completing a complete action, and each second key frame data includes simulated pedestrian skeleton key point position information;
and the generating module 160 is used for generating the simulated pedestrian animation according to the second target data.
In one embodiment, the acquisition module 110 includes:
the first acquisition submodule is used for acquiring initial perception data of a target area from at least two perception sources, and the initial perception data of each perception source include point cloud data and image data associated with each other;
the first extraction submodule is used for extracting, from each point cloud data, a three-dimensional bounding box and pedestrian point cloud data of each pedestrian in the target area, and extracting, from the image data associated with the point cloud data, a two-dimensional bounding box of each pedestrian in the target area;
and the combination submodule is used for combining the three-dimensional bounding box, the pedestrian point cloud data and the two-dimensional bounding box extracted from the same initial perception data to obtain the perception data corresponding to that perception source.
In one embodiment, the matching module 120 includes:
the matching submodule is used for matching the three-dimensional bounding boxes of different perception data and generating a matching result according to the overlap ratio of the three-dimensional bounding boxes;
and the fusion submodule is used for fusing different perception data according to the matching result to obtain fused data.
In one embodiment, the labeling module 140 comprises:
the first determining submodule is used for determining real pedestrian attribute information according to the three-dimensional bounding boxes of the first key frame data and the complete image associated with the two-dimensional bounding boxes of the first key frame data;
the second determining submodule is used for determining the real pedestrian position information according to the three-dimensional bounding box and the prior map corresponding to the target area;
the third determining submodule is used for determining real environment information according to the complete image;
the fourth determining submodule is used for determining the position information of the key points of the real pedestrian skeleton according to the pedestrian point cloud data of the first key frame data and the complete image;
the first generation submodule is used for generating the first labeled data according to the real pedestrian attribute information, the real pedestrian position information and the real environment information, generating the second labeled data according to the real pedestrian skeleton key point position information, and obtaining the labeled data according to the first labeled data and the second labeled data.
In one embodiment, the extraction module 130 includes:
the second acquisition submodule is used for acquiring the tracking duration of each fused data;
the fifth determining submodule is used for determining the fused data whose tracking duration is greater than a preset value as target fused data;
and the second extraction submodule is used for extracting, from the target fused data, a plurality of first key frame data of a target pedestrian completing a complete action in the target area.
In one embodiment, the obtaining module 150 includes:
the training submodule is used for training a skeleton key point learning model with the first labeled data as training input data and the second labeled data as training output data, and obtaining the correspondence between the first labeled data and the second labeled data according to the trained skeleton key point learning model;
and the calling submodule is used for configuring and inputting the first target data and calling the trained skeleton key point learning model to obtain the second target data.
In one embodiment, the obtaining module 150 further includes:
the third obtaining submodule is used for obtaining a plurality of groups of marked data;
the fourth obtaining submodule is used for obtaining the same target first labeling data from the multiple groups of labeling data and respectively obtaining each target second labeling data corresponding to the target first labeling data;
the obtaining submodule is used for obtaining a target second labeling data set according to the labeling content of each target second labeling data;
the fifth obtaining submodule is used for obtaining the corresponding relation between the target first labeling data and the target second labeling data set;
and the second generation submodule is used for determining the first target data according to the target first labeling data and, according to the corresponding relation, generating the second target data within a preset labeling range corresponding to the target second labeling data set.
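The one-to-many correspondence between one target first labeling data and its set of second labeling data could be organized as in the sketch below; grouping by a hashable key and sampling uniformly within the per-keypoint range are illustrative interpretations of the "preset labeling range".

```python
from collections import defaultdict
import numpy as np

def build_correspondence(labeled_groups):
    """Group the second labeled data by identical first labeled data."""
    table = defaultdict(list)
    for group in labeled_groups:
        key = tuple(sorted(group["first"].items()))  # same target first labeling data
        table[key].append(np.asarray(group["second"]["skeleton"]))
    return table

def generate_second_target(table, first_target_key, rng=None):
    """Sample skeleton keypoints inside the range spanned by the matching set."""
    if rng is None:
        rng = np.random.default_rng()
    candidates = np.stack(table[first_target_key])   # (m, K, 3) keypoint sets
    lo, hi = candidates.min(axis=0), candidates.max(axis=0)
    return rng.uniform(lo, hi)                       # stay within the preset range
```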
In one embodiment, the generation module 160 includes:
the inserting submodule is used for inserting transition frame data among the second key frame data by adopting an interpolation method;
the superposition submodule is used for superposing the action attribute data on the second target data;
and the third generation submodule is used for generating the simulated pedestrian animation corresponding to the second target data according to the insertion result, the superposition result, and the time difference between the first frame and the last frame of the second key frame data.
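The inserting submodule's interpolation can be sketched with simple linear blending between consecutive second key frames; the frame rate and the linear scheme are assumptions (a production system might prefer splines or quaternion slerp for joint rotations).

```python
import numpy as np

def insert_transition_frames(key_frames, timestamps, fps=30):
    """key_frames: list of (K, 3) keypoint arrays; timestamps: matching times
    in seconds. Returns a dense frame sequence from first to last key frame."""
    frames = []
    for (k0, t0), (k1, t1) in zip(zip(key_frames, timestamps),
                                  zip(key_frames[1:], timestamps[1:])):
        n = max(int(round((t1 - t0) * fps)), 1)
        for i in range(n):
            alpha = i / n
            frames.append((1 - alpha) * k0 + alpha * k1)  # transition frame
    frames.append(key_frames[-1])
    return frames
```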
Different from the prior art, the simulated pedestrian animation generation device provided by the invention obtains perception data directly from the vehicle-mounted perception sources and roadside perception sources in the vehicle-road cooperative system, so no additional acquisition equipment needs to be purchased and the cost is reduced. The perception data are fused to ensure their integrity and accuracy, a plurality of pieces of first key frame data are extracted from the fused data and labeled, and after the first target data are configured, the corresponding second target data are obtained according to the corresponding relation between the first labeled data and the second labeled data. Because the second target data obtained in this way reflect the action changes of pedestrians in a specific environment more truly, the fineness and accuracy of the simulated pedestrian animation generated from them are improved. The device thus combines authenticity with accuracy, and the whole labeling process is simple and easy to operate, which greatly reduces the difficulty of generating the simulated pedestrian animation, reduces the cost, and ensures the accuracy.
Furthermore, the simulated pedestrian animation is generated automatically in the simulation system from real data, so the simulation scene is constructed more intelligently. Based on the simulated recurrence of actual scene data, more real and effective information is provided to the unmanned driving system during simulation environment tests, which facilitates the testing of functional modules such as pedestrian trajectory prediction and human-vehicle interaction and accelerates the progress of unmanned driving systems toward higher-level, fully unmanned driving.
Accordingly, the present invention also provides an electronic device, as shown in fig. 8, which may include components such as a radio frequency circuit 801, a memory 802 including one or more computer-readable storage media, an input unit 803, a display unit 804, a sensor 805, an audio circuit 806, a WiFi module 807, a processor 808 including one or more processing cores, and a power supply 809. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 8 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the radio frequency circuit 801 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, it receives downlink information from a base station and delivers it to one or more processors 808 for processing, and it transmits uplink data to the base station. The memory 802 may be used to store software programs and modules, and the processor 808 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 802. The input unit 803 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The display unit 804 may be used to display information input by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof.
The electronic device may also include at least one sensor 805, such as light sensors, motion sensors, and other sensors. The audio circuitry 806 includes speakers that can provide an audio interface between the user and the electronic device.
WiFi is a short-range wireless transmission technology. Through the WiFi module 807, the electronic device can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although fig. 8 shows the WiFi module 807, it is understood that the module is not an essential part of the electronic device and may be omitted as needed without changing the essence of the application.
The processor 808 is the control center of the electronic device: it connects the various parts of the entire device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 802 and calling the data stored in the memory 802, thereby monitoring the electronic device as a whole.
The electronic device also includes a power supply 809 (e.g., a battery) for powering the various components. Preferably, the power supply is logically coupled to the processor 808 through a power management system, so that charging, discharging, and power consumption are managed by the power management system.
Although not shown, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 808 in the electronic device loads an executable file corresponding to a process of one or more application programs into the memory 802 according to the following instructions, and the processor 808 runs the application programs stored in the memory 802, so as to implement the following functions:
acquiring perception data collected by at least two perception sources for a target area, wherein the perception sources comprise at least one of a vehicle-mounted perception source and a roadside perception source; matching the perception data of different perception sources, and fusing each piece of perception data according to the matching result to obtain fused data; extracting, from the fused data, a plurality of pieces of first key frame data of a target pedestrian completing one complete action in the target area; labeling the plurality of pieces of first key frame data to obtain labeled data, wherein the labeled data comprise first labeled data and second labeled data, the first labeled data comprise real pedestrian attribute information, real pedestrian position information and real environment information, and the second labeled data comprise real pedestrian skeleton key point position information; acquiring the corresponding relation between the first labeled data and the second labeled data, and acquiring second target data according to first target data and the corresponding relation, wherein the first target data comprise simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information, the second target data comprise a plurality of pieces of second key frame data of a simulated pedestrian completing one complete action, and each piece of second key frame data comprises simulated pedestrian skeleton key point position information; and generating the simulated pedestrian animation according to the second target data.
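For orientation, the following condensed sketch strings the above steps into the end-to-end flow the processor implements; each stage is passed in as a callable so the sketch stays self-contained, and all names are illustrative rather than taken from the disclosure.

```python
def run_animation_pipeline(sources, acquire, match_and_fuse, extract_key_frames,
                           annotate, fit_correspondence, first_target_data,
                           generate_animation):
    perception = [acquire(source) for source in sources]    # >= 2 perception sources
    fused = match_and_fuse(perception)                      # match, then fuse
    first_key_frames = extract_key_frames(fused)            # one complete action
    labeled = [annotate(frame) for frame in first_key_frames]
    correspondence = fit_correspondence(labeled)            # first -> second mapping
    second_target_data = correspondence(first_target_data)  # configured input
    return generate_animation(second_target_data)           # simulated animation
```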
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present invention provides a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to perform the following functions:
acquiring perception data collected by at least two perception sources for a target area, wherein the perception sources comprise at least one of a vehicle-mounted perception source and a roadside perception source; matching the perception data of different perception sources, and fusing each piece of perception data according to the matching result to obtain fused data; extracting, from the fused data, a plurality of pieces of first key frame data of a target pedestrian completing one complete action in the target area; labeling the plurality of pieces of first key frame data to obtain labeled data, wherein the labeled data comprise first labeled data and second labeled data, the first labeled data comprise real pedestrian attribute information, real pedestrian position information and real environment information, and the second labeled data comprise real pedestrian skeleton key point position information; acquiring the corresponding relation between the first labeled data and the second labeled data, and acquiring second target data according to first target data and the corresponding relation, wherein the first target data comprise simulated pedestrian attribute information, simulated pedestrian position information and simulated environment information, the second target data comprise a plurality of pieces of second key frame data of a simulated pedestrian completing one complete action, and each piece of second key frame data comprises simulated pedestrian skeleton key point position information; and generating the simulated pedestrian animation according to the second target data.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps of any method provided by the present invention, they can achieve the beneficial effects achievable by any such method; for details, see the foregoing embodiments, which are not repeated here.
The method, device, electronic device, and storage medium for generating a simulated pedestrian animation provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help readers understand the technical solution and core idea of the invention. Those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.