KR102507253B1

KR102507253B1 - Reinforcement learning device for user data based object position optimization

Info

Publication number: KR102507253B1
Application number: KR1020220110857A
Authority: KR
Inventors: 이성령; 민예린
Original assignee: 주식회사 애자일소다
Priority date: 2022-09-01
Filing date: 2022-09-01
Publication date: 2023-03-08

Abstract

A reinforcement learning device for optimizing object positions based on user data is disclosed. Based on a user's design data or placement data, the invention configures a learning environment that includes the placement target, the positions of the objects to be placed, and order information on the previous and future placement of the objects. A reinforcement learning-based model then generates an optimal position for each object, taking into account the relationships between objects, including those already placed.

Description

Reinforcement learning device for optimizing object position based on user data {REINFORCEMENT LEARNING DEVICE FOR USER DATA BASED OBJECT POSITION OPTIMIZATION}

The present invention relates to a reinforcement learning device for optimizing object positions based on user data. More particularly, it relates to a device that, based on a user's design data or placement data, configures a learning environment including the placement target, the positions of the objects to be placed, and order information on the previous and future placement of the objects, and that uses a reinforcement learning-based model to generate an optimal position for each object in consideration of the relationships between objects, including those already placed.

Reinforcement learning is a learning method concerned with an agent that interacts with an environment to achieve a goal, and it is widely used in robotics and artificial intelligence.

The aim of reinforcement learning is to find out which actions the reinforcement learning agent, the acting subject of the learning, must take to receive more reward.

In other words, the agent learns what to do to maximize reward even when there is no fixed answer. Rather than being told in advance which action to take in a situation where input and output have a clear relationship, it learns to maximize reward through trial and error.

The agent selects actions sequentially as time steps pass and receives a reward based on the effect each action has on the environment.

FIG. 1 is a block diagram showing the configuration of a reinforcement learning device according to the prior art. As shown in FIG. 1, the agent 10 learns, through training of a reinforcement learning model, how to determine an action A; each action A affects the next state S, and the degree of success can be measured by a reward R.

That is, when learning proceeds through a reinforcement learning model, the reward is a score for the action that the agent 10 selects in a given state, and it serves as feedback on the agent 10's decision-making.

The environment 20 is the full set of rules, such as the actions the agent 10 can take and the rewards that follow. States, actions, and rewards are all components of the environment, and everything fixed other than the agent 10 belongs to the environment.

Because the agent 10 acts to maximize future reward, how the reward is defined strongly influences the learning result.

Meanwhile, when placing data, many existing deep learning architectures cannot take into account the size or position of parts that have already been placed, so parts that were already placed end up being placed again.

In addition, when the space and orientation of the object to be placed are considered, the number of cases to choose from becomes too large for learning to proceed well.

In addition, a placement problem should consider the relationship between the actual placed state and the object to be placed, but existing methodologies consider only the relationships within the placement and do not use information on what has actually been placed.

In addition, conventional methodologies place objects based only on current-state information, so the placement order before and after the object being placed is not reflected.

In addition, incorrect learning results may be produced when information on the placement order of objects is not reflected or when the placement order changes.

Korean Patent Laid-open Publication No. 10-2021-0082210 (Generation of integrated circuit floor plan using neural network)

To solve these problems, an object of the present invention is to provide a reinforcement learning device for optimizing object positions based on user data. Based on a user's design data or placement data, the device configures a learning environment including the placement target, the positions of the objects to be placed, and order information on the previous and future placement of the objects, and uses a reinforcement learning-based model to generate an optimal position for each object in consideration of the relationships between objects, including those already placed.

To achieve the above object, an embodiment of the present invention is a reinforcement learning device for optimizing object positions based on user data. The device extracts feature information and embedding information from a user's design data or placement data, based on image-based design information showing the state of the currently placed objects, graph information containing the relationships among all objects to be placed, and object information on an arbitrary target to be placed. Based on the extracted feature information and embedding information, it extracts rotation information for the object in an autoregressive manner, and it determines an action by performing reinforcement learning to decide the object's position coordinates from the extracted rotation information, the state information of the placed objects, and position information based on the relationships between the objects.

In addition, in the reinforcement learning device according to the above embodiment, the design data or placement data further includes placement order information for the objects.

In addition, the reinforcement learning device according to the above embodiment includes: an input unit that receives image information of the state in which objects are currently placed, graph information representing the relationships between the nodes and edges that describe the connections among the objects to be placed, placement target information containing feature information of the object currently to be placed, and mask information describing the region in which that object can actually be placed; a preprocessing unit that extracts, from the image information, feature information on the positions of already placed objects, analyzes the nodes and edges among the target objects from the graph information, extracts an embedding vector for each object by embedding the placement target information, and extracts, through attention between the embedding vectors, feature information that reflects the relationships among the objects and their relevance to the object to be placed; and a reinforcement learning agent that extracts rotation information for the object in an autoregressive manner based on the extracted feature information and embedding information, and determines an action by performing reinforcement learning to decide the object's position coordinates from the extracted rotation information, the state information of the placed objects, and position information based on the relationships between the objects.

In addition, the reinforcement learning agent according to the above embodiment extracts the optimal position coordinates of the object based on the rotation information and on position information extracted subject to the non-placeable region, which is derived from the state information of the already placed objects and the relationships between the objects.

In addition, the reinforcement learning device according to the above embodiment further includes a derivation unit that performs a sub-task so that placement proceeds while the actual masked positions are learned, using a loss function that minimizes the metric of the mask grid policy unit 350, which determines the coordinates.

In addition, the reinforcement learning device according to the above embodiment further includes: a prediction network that, when information on each action in the current state is input from the user, predicts the action value of that action; and a prediction loss function that reflects the predicted action value in the loss function so that learning proceeds in the direction the user intends.

In addition, the reinforcement learning device according to the above embodiment generates, as the reward, the HPWL (Half-Perimeter Wirelength) added by each reinforcement learning action; when an action causes no change in HPWL, a constant reward is given to that action. The sparse reward that arises in object placement is thereby provided as an immediate reward for each action, which prevents the credit assignment problem that grows with the number of placements.

In addition, the immediate reward for the action according to the above embodiment is calculated from the following formula

where t is the number of placements and c is a positive constant.
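As a rough illustration of the reward described above, the following Python sketch computes HPWL over the objects placed so far and returns an immediate reward for each action. Only the use of HPWL and the positive constant c for an unchanged HPWL come from the description; the sign convention (reward equal to the negative HPWL increase), the net and position data structures, and the names `hpwl` and `immediate_reward` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(nets: List[List[str]], positions: Dict[str, Point]) -> float:
    """Sum of half-perimeter bounding boxes over all nets.

    Only objects that already have a position are counted (assumption).
    """
    total = 0.0
    for net in nets:
        pts = [positions[name] for name in net if name in positions]
        if len(pts) < 2:
            continue  # fewer than two placed objects: no wirelength yet
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def immediate_reward(prev_hpwl: float, new_hpwl: float, c: float = 1.0) -> float:
    """Per-action reward: +c if HPWL is unchanged, else the negative increase.

    The negative sign is an assumed convention so that shorter wiring is rewarded.
    """
    delta = new_hpwl - prev_hpwl
    if delta == 0:
        return c
    return -delta
```

Because every placement yields a reward, the agent does not have to wait until all objects are placed to receive a learning signal.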

In addition, the cumulative reward G_t for credit assignment according to the above embodiment is calculated from the following formula

where t is the number of placements, T is the total number of parts to be placed, γ∈[0,1], and R_k is the reward.
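Assuming the usual discounted return (the indexing below is an assumption and is not taken from the source), the symbols t, T, γ and R_k defined above would combine as:

```latex
G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k, \qquad \gamma \in [0, 1]
```

Under this reading, an immediate reward at every placement makes each term R_k informative, so G_t carries a learning signal at every step t rather than only at the final placement.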

In addition, the reinforcement learning device according to the above embodiment generates a feature map that simplifies the entire spatial region, based on the size of the object within the discretized action space and a convolution kernel trick operation over the entire space, and uses the feature map to search for the placement detection region.
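One plausible reading of the convolution kernel trick is a sliding-window sum over an occupancy grid, using a kernel the size of the object. The sketch below follows that reading; the all-ones kernel, the top-left anchoring of the object, and the names `window_sums` and `placeable_mask` are assumptions, not details given in the description.

```python
from typing import List

Grid = List[List[int]]


def window_sums(occupancy: Grid, h: int, w: int) -> Grid:
    """Valid-mode convolution of the occupancy grid with an all-ones h x w kernel.

    Each output cell counts the occupied cells under the object's footprint
    when the object's top-left corner sits at that cell.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    out: Grid = []
    for r in range(rows - h + 1):
        out_row = []
        for c in range(cols - w + 1):
            s = sum(occupancy[r + dr][c + dc] for dr in range(h) for dc in range(w))
            out_row.append(s)
        out.append(out_row)
    return out


def placeable_mask(occupancy: Grid, h: int, w: int) -> Grid:
    """Feature map with 1 where the footprint overlaps nothing, else 0."""
    return [[1 if s == 0 else 0 for s in row] for row in window_sums(occupancy, h, w)]
```

The resulting map is smaller than the original grid and already encodes the object's size, so candidate positions can be read off directly instead of testing each cell for overlap.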

In addition, the input unit 100b according to the above embodiment receives placement order information on the object currently being placed, the object placed in the previous step, and the object to be placed in a subsequent step.

The present invention has the advantage that, based on a user's design data or placement data, it configures a learning environment including the placement target, the positions of the objects to be placed, and order information on the previous and future placement of the objects, and uses a reinforcement learning-based model to generate an optimal position for each object in consideration of the relationships between objects, including those already placed.

In addition, because the present invention can provide fine-grained positions that take actual positions into account, it can provide an efficient design.

In addition, the present invention trains the model in line with the designer's intent while also optimizing placement with respect to the reward. Because the reinforcement learning-based model determines rotation and coordinates from the actual placed state and the relationships between objects, the objects to be placed can be positioned optimally on the basis of those relationships.

In addition, by providing information on the placement order of objects, the present invention can provide an optimal placement that considers both the steps before and the steps after the object being placed.

In addition, by additionally incorporating a component that predicts the state value, the present invention makes it possible to interpret how closely the current action matches the actual domain, and to develop a model that reflects the user's needs rather than producing an entirely new result.

In addition, the present invention can place objects so that they do not overlap, enables efficient learning by minimizing the number of actions to be learned, and speeds up learning by reducing the space that must be output as a result.

In addition, for the problem of applying reinforcement learning to object placement, the present invention gives an immediate reward for each action instead of a sparse reward, which resolves the credit assignment problem that otherwise grows with the number of placements.

In addition, the present invention can easily find placeable regions through fast detection when exploring various placement regions, and can therefore detect them quickly even when the discretized space is large.

FIG. 1 is a block diagram showing the configuration of a reinforcement learning device according to the prior art.
FIG. 2 is a block diagram showing the configuration of a reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a reinforcement learning device for optimizing object positions based on user data according to another embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a reinforcement learning device for optimizing object positions based on user data according to yet another embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating the HPWL reward of the reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram illustrating the convolution kernel trick of the reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram illustrating the process by which the reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention detects a placeable region.

Hereinafter, the present invention is described in detail with reference to preferred embodiments and the accompanying drawings, on the premise that identical reference numerals in the drawings denote identical components.

Before describing the specific details for carrying out the present invention, it should be noted that components not directly related to the technical gist of the present invention have been omitted to the extent that the gist is not obscured.

In addition, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the invention, based on the principle that an inventor may define terms appropriately to describe the invention in the best way.

In this specification, the statement that a part "includes" a component means that it may further include other components, not that other components are excluded.

In addition, terms such as "unit" and "module" refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of the two.

In addition, the term "at least one" is defined to cover both the singular and the plural. Even where the term "at least one" is absent, it is self-evident that each component may exist in the singular or the plural and may mean either.

Hereinafter, a preferred embodiment of the reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram showing the configuration of a reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention.

Referring to FIG. 2, the reinforcement learning device for optimizing object positions based on user data according to an embodiment of the present invention extracts feature information and embedding information from a user's design data or placement data, based on image-based design information showing the state of the currently placed objects, graph information containing the relationships among all objects to be placed, and object information on an arbitrary target to be placed. Based on the extracted feature information and embedding information, it extracts rotation information for the object in an autoregressive manner, and it can determine an action by performing reinforcement learning to decide the object's position coordinates from the extracted rotation information, the state information of the placed objects, and position information based on the relationships between the objects.

Here, the rotation information may be the angle by which the object is rotated, for example 0°, -45°, or 90°, and may include positional information about the direction in which the object is rotated.

In addition, the reinforcement learning device according to an embodiment of the present invention can extract the optimal position coordinates of an object based on the rotation information and on information extracted subject to the non-placeable region, which reflects the relationships between objects, including those already placed, as given by the user's design data or placement data.

Here, the object to be placed may be an electronic component, such as a semiconductor device or an electric circuit element.

To this end, the reinforcement learning device may include an input unit 100, a preprocessing unit 200, and a reinforcement learning agent 300.

The input unit 100 inputs the information used for reinforcement learning. It can receive image 110 information of the state in which objects are currently placed, graph 120 information representing the relationships between the nodes and edges that describe the connections among the objects to be placed, placement target information 130 containing feature information of the object currently to be placed, and mask 140 information describing the region in which that object can actually be placed.
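The four inputs can be pictured as one record per placement step. The following Python sketch is only an illustration of that grouping: the class name `PlacementInput`, the field names, the grid encoding (1 for occupied or placeable), and the `validate` method are all hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Grid = List[List[int]]


@dataclass
class PlacementInput:
    image: Grid                   # image (110): 1 where a cell is already occupied
    num_nodes: int                # graph (120): number of objects (nodes)
    edges: List[Tuple[int, int]]  # graph (120): connections between objects
    target: Dict[str, float]      # placement target info (130), e.g. width and height
    mask: Grid                    # mask (140): 1 where the target may be placed

    def validate(self) -> None:
        """Check that both grids share one shape and edges reference known nodes."""
        rows, cols = len(self.image), len(self.image[0])
        if any(len(row) != cols for row in self.image):
            raise ValueError("image rows must have equal length")
        if len(self.mask) != rows or any(len(row) != cols for row in self.mask):
            raise ValueError("image and mask must have the same shape")
        for a, b in self.edges:
            if not (0 <= a < self.num_nodes and 0 <= b < self.num_nodes):
                raise ValueError("edge references an unknown node")
```

Keeping the mask alongside the image makes it explicit that the placeable region is supplied as an input rather than inferred from the image alone.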

The preprocessing unit 200 processes the information received from the input unit 100 so that it can be used for reinforcement learning. From the image 110 information, it can extract feature information on the positions of already placed objects through a CNN (Convolutional Neural Network, 210) and an image feature unit 211.

In addition, the preprocessing unit 200 can apply a GNN (Graph Neural Network, 220) to the input graph 120 information, analyzing the nodes and edges among the target objects from the points and the lines connecting them.

In addition, the preprocessing unit 200 can use an embedding unit 221 to embed the information analyzed by the GNN 220 together with the placement target information 130 received from the input unit 100, and thereby extract an embedding vector for each object.

In addition, the preprocessing unit 200 can compute attention between the embedding vectors of the objects through a multi-head attention unit 222 to extract the relationships among the objects.
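The attention computation can be illustrated with plain scaled dot-product attention over the object embeddings. This is a simplified, single-head sketch without learned projections, whereas the description names a multi-head attention unit; the function names and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[List[float], Vector]:
    """Scaled dot-product attention for one query vector.

    Returns the attention weights over the objects and the weighted sum of values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context
```

Here the query would be the embedding of the object currently being placed, and the weights indicate how strongly each other object relates to it.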

In addition, the preprocessing unit 200 can use a node extraction unit 223 to extract feature information for the current node of the placement target information, reflecting its relevance to the object to be placed.

The reinforcement learning agent 300 outputs an action through reinforcement learning. Based on the feature information of the image 110 and of the graph 120 extracted by the preprocessing unit 200, a rotation policy 311 uses the rotation policy network 310 to output a query 360, a weight vector for the object targeted by the rotation. Based on the same feature information, the value network 320 can output a value 321, a weight vector indicating how strongly each object is related to the object corresponding to the query.

In addition, the reinforcement learning agent 300 extracts rotation information for the object in an autoregressive manner, and can determine an action by performing reinforcement learning to decide the object's position coordinates from the extracted rotation information, the state information of the placed objects, and position information based on the relationships between the objects.
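The autoregressive decision can be sketched as a two-stage choice in which the rotation is sampled first and the position is then sampled from a distribution conditioned on that rotation. The sketch below assumes this factorization; `position_logits_fn` is a hypothetical stand-in for the network that scores positions given a rotation, and all names are illustrative.

```python
import math
import random
from typing import Callable, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def autoregressive_action(
    rotation_logits: List[float],
    position_logits_fn: Callable[[int], List[float]],
    rng: random.Random,
) -> Tuple[int, int, float]:
    """Sample a rotation index, then a position index conditioned on it.

    Returns (rotation, position, log-probability of the joint action).
    """
    rot_probs = softmax(rotation_logits)
    rot = rng.choices(range(len(rot_probs)), weights=rot_probs, k=1)[0]
    # The position distribution depends on the rotation chosen above.
    pos_probs = softmax(position_logits_fn(rot))
    pos = rng.choices(range(len(pos_probs)), weights=pos_probs, k=1)[0]
    log_prob = math.log(rot_probs[rot]) + math.log(pos_probs[pos])
    return rot, pos, log_prob
```

Splitting the action this way keeps each output small: the agent never has to score every combination of rotation and position at once.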

To this end, the key value output by the rotation policy 311 is combined with the existing placement information, namely the feature information of the image 110 and of the graph 120 extracted by the preprocessing unit 200, and input to the grid policy network 330. The grid policy network 330 determines the policy for deciding coordinates, and the coordinates can be determined through the grid policy unit 340.

In addition, the reinforcement learning agent 300 may extract a logit for the rotation based on information extracted subject to the non-placeable region, and determine the coordinates from that information together with existing information including the feature information of the image 110 and of the graph 120.

In addition, the reinforcement learning agent 300 extracts a grid mask 370 containing only the specific mask information selected by the mask 140 information from the input unit 100 and the query 360 from the rotation policy 311, and provides the grid mask 370 as input to the mask grid policy unit 350 so that it can be combined with the policy that determines the position coordinates.
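One common way to combine a mask with a coordinate policy is a masked softmax, in which non-placeable cells receive zero probability. The sketch below assumes that approach and assumes one mask per rotation selected by the chosen rotation; the dictionary `masks_by_rotation` and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Grid = List[List[float]]


def masked_softmax(logits: List[float], mask: List[int]) -> List[float]:
    """Softmax over allowed entries only; masked-out entries get probability 0."""
    allowed = [l for l, ok in zip(logits, mask) if ok]
    if not allowed:
        raise ValueError("no placeable cell for this rotation")
    peak = max(allowed)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def select_cell(
    logits: Grid,
    masks_by_rotation: Dict[int, List[List[int]]],
    rotation: int,
) -> Tuple[Tuple[int, int], List[float]]:
    """Pick the most probable placeable cell for the given rotation."""
    grid_mask = masks_by_rotation[rotation]  # mask selected by the rotation choice
    cols = len(logits[0])
    flat_logits = [v for row in logits for v in row]
    flat_mask = [v for row in grid_mask for v in row]
    probs = masked_softmax(flat_logits, flat_mask)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return divmod(best, cols), probs
```

Because masked cells can never be selected, objects cannot be placed on occupied or otherwise unavailable cells, regardless of how the policy scores them.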

Through this, the reinforcement learning agent (300) can determine an action that takes the already placed state into account, based on the user's design data or placement data, and can generate the rotation in the space together with coordinate information for which attention with the object currently being placed, reflecting that rotation, has been computed.

The reinforcement learning device may further include a derivation unit (400) that performs a sub-task so that placement proceeds while the actual masked positions are learned, through a loss function that minimizes the metric of the mask grid policy unit (350) that determines the coordinates.

That is, the derivation unit (400) may generate a loss function between the actual non-placeable region and the generated non-placeable region, so that the reinforcement learning agent (300) can learn which regions are not placeable.
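
The text does not state which loss is used between the actual and the generated non-placeable regions. As an illustration only, the sketch below assumes a per-cell binary cross-entropy; `mask_bce_loss` is a hypothetical name.

```python
import math


def mask_bce_loss(predicted, actual, eps=1e-7):
    """Mean binary cross-entropy between predicted and actual mask cells.

    predicted: probabilities in [0, 1] that each cell is non-placeable.
    actual:    0/1 labels for the true non-placeable region.
    """
    total = 0.0
    for p, y in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(actual)


good = mask_bce_loss([0.9, 0.1], [1, 0])
bad = mask_bce_loss([0.1, 0.9], [1, 0])
```

A prediction that agrees with the true mask yields a smaller loss, so minimizing it pushes the generated region toward the actual one.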

Accordingly, the target objects can be placed without overlapping, and the number of actions to be learned is reduced, which makes learning more efficient; because the output space is also reduced, learning proceeds faster.

Such object placement can be applied in various ways, for example to mounting multiple components on the surface of an insulator and connecting the individual components into a circuit so that they operate together, to arranging items within a building, to arranging facilities or instruments in a medical facility, and to arranging goods in a logistics center so as to optimize storage locations and travel distance.

FIG. 3 is a block diagram showing the configuration of a reinforcement learning device for user data based object position optimization according to another embodiment of the present invention.

The reinforcement learning device according to the embodiment of FIG. 3 differs from that of FIG. 2 in that it includes a prediction network (500) and a prediction loss function (600).

When placement optimization is driven by rewards alone, the resulting "optimal" placement may differ from the placement the user has in mind.

That is, the device of FIG. 3 differs from the device of FIG. 2 in that it further includes a prediction network (500) that, when information on each action in the current state is input from the user, predicts the action value of the input action, and a prediction loss function (600) that reflects the action value prediction for the input action in the loss function so that learning proceeds in the direction intended by the user.

Here, the loss function may be L = α·L_actor + β·L_critic - Ψ·E_entropy + λ·(state action loss), where L_actor and L_critic are the loss functions used to train the actor and the critic, respectively, and E_entropy is the entropy.

When objects are related, it is preferable to place them close together, and the distance between objects can be computed whenever a new object is added. If the 'state action loss' is known for each action, it can be used during training so that learning is fast and the result agrees to some extent with the users' intuition; this term may be defined as the prediction loss function (600).

Here, a large value of λ trains the agent strongly toward the existing metric, while a small value of λ allows this term to be reflected only to a limited degree.

In addition, adding the 'state action loss' to the existing loss function α·L_actor + β·L_critic - Ψ·E_entropy yields a prediction loss function that takes into account the direction the user considers optimal, so that the actual output does not look unnatural.
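
The combined loss given above can be written directly as a weighted sum. The sketch below is only a restatement of that expression; the function name `total_loss` and the default coefficient values are assumptions chosen for illustration, not values from the source.

```python
def total_loss(l_actor, l_critic, entropy, state_action_loss,
               alpha=1.0, beta=0.5, psi=0.01, lam=0.1):
    """L = alpha*L_actor + beta*L_critic - psi*E_entropy + lam*(state action loss).

    The default coefficients are placeholders, not values from the patent.
    """
    return (alpha * l_actor
            + beta * l_critic
            - psi * entropy
            + lam * state_action_loss)


loss = total_loss(1.0, 2.0, 3.0, 4.0, alpha=1.0, beta=0.5, psi=0.1, lam=0.25)
```

Setting `lam=0.0` recovers the existing actor-critic loss with the entropy bonus, so the state action term can be switched on or scaled without touching the other terms.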

Accordingly, by additionally providing the prediction network (500), which predicts the action value of an action input by the user, and the prediction loss function (600), an element of supervised learning is added to reinforcement learning, and it can be used to interpret how closely the current action matches the actual domain.

In addition, the reinforcement learning device according to an embodiment of the present invention may convert the sparse reward that arises in placement into an intrinsic reward, and may generate and provide as a reward the HPWL (Half-Perimeter Wirelength) added by each reinforcement learning action.

To solve a placement problem with reinforcement learning, a metric such as wire length may be considered; if this metric is used as is, however, the reward may become sparse.

With a sparse reward, when many placements are performed, the reinforcement learning agent (300a) faces a credit assignment problem: its actions cannot be evaluated accurately, and learning may fail.

For example, when an action was performed in the previous step and the next action performs a placement, the HPWL is calculated. If the next action increases the HPWL, the increase is given as a negative reward; if the HPWL of the previous step and the current step are the same, a constant of a certain magnitude may be provided as the reward.
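
The per-step reward just described can be sketched as follows. HPWL is computed here in the usual way, as the sum over nets of the half-perimeter of the bounding box of the placed pins; the data layout (`nets` as tuples of object names, `positions` as a name-to-coordinate dictionary) and the function names are assumptions made for illustration.

```python
def hpwl(nets, positions):
    """Half-perimeter wirelength over nets, counting only placed objects."""
    total = 0.0
    for net in nets:
        pts = [positions[name] for name in net if name in positions]
        if len(pts) < 2:
            continue  # a net with fewer than two placed pins adds no length
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def step_reward(prev_hpwl, curr_hpwl, c=1.0):
    """Negative HPWL increase, or the constant c when HPWL is unchanged."""
    delta = curr_hpwl - prev_hpwl
    if delta == 0:
        return c
    return -delta


nets = [("A", "B"), ("B", "C")]
before = hpwl(nets, {"A": (0, 0)})
after = hpwl(nets, {"A": (0, 0), "B": (3, 4)})
reward = step_reward(before, after)
```

Because every placement receives a reward immediately, the agent does not have to wait until all objects are placed to learn which action caused a long wire.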

FIG. 4 is a block diagram showing the configuration of a reinforcement learning device for user data based object position optimization according to yet another embodiment of the present invention.

The reinforcement learning device according to the embodiment of FIG. 4 differs from that of FIG. 2 in the configuration of the input unit (100b) and the preprocessing unit (200b).

The input unit (100b) inputs the information used by the reinforcement learning device. It may receive an image (110) of the state in which objects are currently placed; a graph (120) representing the relationships, as nodes and edges, among the target objects to be placed; placement target information (130), which includes feature information of the object currently being placed; a mask (140), which is information on the region where the current object can actually be placed; and placement order (150) information, which includes the order in which the objects are placed.

The placement order (150) allows reinforcement learning to reflect the order of placement when, for example, the objects must be placed in the order 'A', 'B', 'C'.

In ordinary reinforcement learning, when object 'A' is placed, learning is performed with respect to 'A' alone, so that from the standpoint of object 'B', which is placed next, an appropriate placement may no longer be available because of the previously placed 'A'.

That is, for object 'B', the current placement target, both the placement information of the preceding object 'A' and that of the following object 'C' are important. By providing the placement order information to the reinforcement learning agent (300b), reinforcement learning can reflect the information of object 'A' placed in the previous step and of object 'C' to be placed in a later step, so that object 'B' is placed at an optimal position.

The placement order (150) may also be expressed as a graph, and the graph may be graph information whose node and edge connections are directed, for example in the order 'A' -> 'B' -> 'C'.
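
A directed placement order such as 'A' -> 'B' -> 'C' can be represented as an edge list, from which the previous and next objects of the current target are looked up. This is a minimal sketch; the helper names `order_edges` and `order_context` are hypothetical.

```python
def order_edges(order):
    """Directed edges between consecutive objects in the placement order."""
    return list(zip(order, order[1:]))


def order_context(order, current):
    """Return (previous object, next object) for the current target, or None."""
    i = order.index(current)
    previous = order[i - 1] if i > 0 else None
    following = order[i + 1] if i + 1 < len(order) else None
    return previous, following


order = ["A", "B", "C"]
edges = order_edges(order)
context_b = order_context(order, "B")
```

For object 'B' the lookup yields both 'A' (already placed) and 'C' (still to be placed), which is the information the order-aware agent is meant to take into account.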

The preprocessing unit (200b) may use a GNN (230) to analyze the nodes and edges of the placement order (150) information input from the input unit (100b), according to the order information among the target objects.

The preprocessing unit (200b) may also embed the information analyzed by the GNN (230) through a placement order embedding unit (231) to extract an embedding vector for each object.

In addition, the preprocessing unit (200b) computes attention among the order-aware embedding vectors of the objects through multi-head attention unit 1 (232) to extract feature information of the current node for each object, reflecting the order information. The extracted feature information is input to the reinforcement learning agent (300b) together with the feature information of the image (110) and the feature information of the graph (120) extracted by the preprocessing unit (200b).
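
The attention computation between object embeddings can be illustrated with single-head scaled dot-product attention. This is a simplified stand-in for multi-head attention unit 1 (232): there are no learned projections and only one head, and the function name `attention` and the toy vectors are assumptions.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Embedding of the current object attends over the embeddings of two objects.
out, w = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The weights show how strongly the current object attends to each other object, and the output is the correspondingly weighted mix of their embeddings.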

FIG. 5 is an exemplary diagram illustrating the HPWL reward of a reinforcement learning device for user data based object position optimization according to an embodiment of the present invention.

As shown in FIG. 5, when an object (710) is placed in a certain region (700) according to an action, the area is calculated and a reward can be provided through the HPWL. If the HPWL does not change with the action, however, a credit assignment problem due to sparse reward may arise.

Therefore, if the HPWL does not change with the action, a reward equal to a certain constant is given to the action. The sparse reward arising in placement is thus delivered as an immediate reward for each action, which prevents the credit assignment problem that grows with the number of placements.

The immediate reward for an action can be calculated from the following equation.

Here, t is the number of placements and c is a positive constant.
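
The expression itself does not appear in this text. As an assumption only, one form consistent with the preceding description (a negative reward equal to the HPWL increase, and the constant c when the HPWL is unchanged) would be the following. The notation HPWL_t for the HPWL after the t-th placement is introduced here for illustration and is not taken from the source.

```latex
R_t =
\begin{cases}
-\left(\mathrm{HPWL}_t - \mathrm{HPWL}_{t-1}\right), & \mathrm{HPWL}_t > \mathrm{HPWL}_{t-1} \\
c, & \mathrm{HPWL}_t = \mathrm{HPWL}_{t-1}
\end{cases}
```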

In addition, the cumulative reward (return) G_t can be made larger by using the reward R_k in the following equation.

Here, t is the number of placements, T is the total number of placements over all components, and γ∈[0,1].
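
The equation for G_t is likewise not shown here. Assuming the standard discounted return, G_t = sum over k from t+1 to T of γ^(k-t-1)·R_k, which is an assumption rather than a formula confirmed by this text, the computation can be sketched as follows. The function name `discounted_return` and the list layout are hypothetical.

```python
def discounted_return(rewards, t, gamma=0.99):
    """Sum of gamma^(k-t-1) * R_k for k = t+1 .. T.

    rewards[k - 1] holds R_k for k = 1 .. T, so T == len(rewards).
    """
    T = len(rewards)
    return sum((gamma ** (k - t - 1)) * rewards[k - 1] for k in range(t + 1, T + 1))


g0 = discounted_return([1.0, 1.0, 1.0], t=0, gamma=0.5)
g2 = discounted_return([1.0, 1.0, 1.0], t=2, gamma=0.5)
```

With a positive constant c in the per-step reward, every step contributes to the sum, so the return carries a learning signal even for steps that do not change the HPWL.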

Therefore, summed over all placements, the HPWL is kept smaller while a larger reward is provided, because the change in HPWL is taken into account and a positive value is additionally provided. When G_t is maximized, each R_t takes the form of an intrinsic reward, which resolves the credit assignment problem for wire length in placement.

In addition, the reinforcement learning device according to an embodiment of the present invention may generate a feature map that simplifies the entire spatial region, based on the size of the object in the discretized action space and a convolution kernel trick operation over the entire space, and may search for the placement detection region using the generated feature map.

That is, in reinforcement learning, as the discretized action space grows, the mask may spend a great deal of time finding the placement detection region.

In this case, the reinforcement learning device of the present invention can generate a shape corresponding to the size of the object and, through a convolution operation over the large space, select the placeable space (or region) within the entire space.

FIG. 6 is an exemplary diagram illustrating the convolution kernel trick of a reinforcement learning device for user data based object position optimization according to an embodiment of the present invention.

As shown in FIG. 6, the entire spatial region (800) with coordinates is convolved with the object size region (810) to generate a simplified feature map (820) for each coordinate, which is used to check whether a new object can be placed. In this way, a placeable detection region can be found quickly.
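
The idea of convolving the space with a kernel of the object's size can be sketched as a sliding-window sum over an occupancy grid, which is equivalent to a convolution with an all-ones kernel. The grid encoding (1 for occupied, 0 for free) and the function name `placeable_map` are assumptions for illustration; a practical implementation would likely use an optimized convolution routine instead of nested loops.

```python
def placeable_map(occupancy, obj_h, obj_w):
    """Feature map with 1 where an obj_h x obj_w object fits at that top-left cell.

    Each output cell is the all-ones-kernel convolution of the occupancy grid,
    thresholded at zero overlap.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    feature_map = []
    for r in range(rows - obj_h + 1):
        row = []
        for c in range(cols - obj_w + 1):
            overlap = sum(
                occupancy[r + dr][c + dc]
                for dr in range(obj_h)
                for dc in range(obj_w)
            )
            row.append(1 if overlap == 0 else 0)
        feature_map.append(row)
    return feature_map


occupancy = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
fmap = placeable_map(occupancy, 2, 2)
```

Each entry of the resulting map answers the feasibility question for one candidate coordinate, so the search reduces to reading off the nonzero entries.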

FIG. 7 is an exemplary diagram illustrating the process by which a reinforcement learning device for user data based object position optimization according to an embodiment of the present invention detects the placeable region.

As shown in FIG. 7, in a discretized space (900) in which actual objects (901) are placed, the placeable space (921) for the placement target object (910) can be detected within the placement space (920) and provided.

Accordingly, placeable regions can be found easily through rapid detection when exploring various placement regions, and the placeable space for a placement target object can be detected quickly even when the discretized space is large.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

In addition, the reference numerals given in the claims of the present invention are provided only for clarity and convenience of explanation and are not limiting, and the thickness of lines or the size of components shown in the drawings may be exaggerated for clarity and convenience in describing the embodiments.

In addition, the terms used above are defined in consideration of their functions in the present invention and may vary according to the intention or practice of a user or operator; these terms should therefore be interpreted on the basis of the content of this specification as a whole.

In addition, even where not explicitly shown or described, it is evident that a person of ordinary skill in the art to which the present invention belongs can make various modifications that embody the technical idea of the present invention from its description, and such modifications still fall within the scope of the present invention.

In addition, the embodiments described above with reference to the accompanying drawings are given for the purpose of explaining the present invention, and the scope of the present invention is not limited to these embodiments.

100, 100a, 100b: input unit
200, 200a, 200b: pre-processing unit
300, 300a, 300b: reinforcement learning agent
400: derivation unit
500: prediction network
600: prediction loss function
700: area
710: object
800: entire space area
810: object size area
820: feature map
900: discretized space
901: object
910: object to be placed
920: layout space
921: placeable space

Claims

A reinforcement learning device for user data based object position optimization, comprising:
an input unit (100, 100a, 100b) that receives, from a user's design data or placement data, information on a design-information-based image (110) of the state in which objects are currently placed, information on a graph (120) representing the relationships, as nodes and edges, among all the target objects to be placed, placement target information (130) including feature information of the object currently being placed, and information on a mask (140) indicating the region where the current object can actually be placed;
a preprocessing unit (200, 200a, 200b) that extracts, from the image (110) information, feature information on the positions of objects already placed, analyzes the nodes and edges among the target objects from the graph (120) information, extracts an embedding vector for each object by embedding the placement target information (130), and extracts, through attention among the embedding vectors of the objects, feature information that takes into account the relationships among the objects and their relevance to the object to be placed; and
a reinforcement learning agent (300, 300a, 300b) that extracts rotation information of the object using an autoregressive method based on the extracted feature information and embedding information, and determines an action by performing reinforcement learning to determine the position coordinates of the object based on the extracted rotation information, the state information of the placed objects, and position information derived from the relationships among the objects.

The reinforcement learning device according to claim 1,
wherein the design data or placement data further includes placement order information of the objects.

(deleted)

The reinforcement learning device according to claim 1,
wherein the reinforcement learning agent (300, 300a, 300b) extracts the optimal position coordinates of the object based on the rotation information, the position information of the already placed objects derived from the relationships among the objects, and position information extracted under the condition of the non-placeable region.

The reinforcement learning device according to claim 1,
further comprising a derivation unit (400) that performs a sub-task so that placement proceeds while the actual masked positions are learned, through a loss function that minimizes the metric of a mask grid policy unit (350) that determines coordinates.

The reinforcement learning device according to claim 1, further comprising:
a prediction network (500) that, when information on each action in the current state is input from the user, predicts the action value of the action; and
a prediction loss function (600) that reflects the action value prediction for the action in the loss function so that learning proceeds in the direction intended by the user.

The reinforcement learning device according to claim 6,
wherein the reinforcement learning device generates, as a reward, the HPWL (Half-Perimeter Wirelength) added by each reinforcement learning action, and
wherein, if the HPWL does not change with the action, a reward equal to a certain constant is given to the action, so that the sparse reward arising in the placement of objects is provided as an immediate reward for each action, thereby preventing the credit assignment problem that arises with the number of placements.

The reinforcement learning device according to claim 7,
wherein the immediate reward for the action is calculated from the following equation,

where t is the number of placements and c is a positive constant.

The reinforcement learning device according to claim 8,
wherein the cumulative reward G_t for the credit assignment is calculated from the following equation,

where t is the number of placements, T is the total number of placements over all components, γ∈[0,1], and R_k is the reward.

The reinforcement learning device according to claim 1,
wherein the reinforcement learning device generates a feature map that simplifies the entire spatial region, based on the size of the object in a discretized action space and a convolution kernel trick operation over the entire space, and searches for the placement detection region using the feature map.

The reinforcement learning device according to claim 1,
wherein the input unit (100b) receives placement order information on the current placement target object, the object placed in a previous step, and the object to be placed in a subsequent step.