[Object Detection] Implementing YOLO (v3) from Scratch in PyTorch - Part 3

AI 꿈나무 2021. 1. 25. 21:20

This post is a translation, made for study purposes, of the article below.

How to implement a YOLO (v3) object detector from scratch in PyTorch: Part 3

Part 3 of the tutorial series on how to implement a YOLO v3 object detector from scratch in PyTorch.

blog.paperspace.com

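This is Part 3 of the tutorial on implementing a YOLO v3 detector from scratch. In the previous part we implemented the layers used in the YOLO architecture. In this part we implement the YOLO network architecture in PyTorch so that it produces an output for a given image.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. The full code can be found here.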

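This tutorial is split into 5 parts.

1. Part 1 : Understanding how YOLO works

2. Part 2 : Creating the layers of the network architecture

3. Part 3 : (this one) Implementing the forward pass of the network

4. Part 4 : Objectness score thresholding and Non-maximum suppression

5. Part 5 : Designing the input and output pipelines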

Prerequisites

Parts 1 and 2 of the tutorial
Basic working knowledge of PyTorch, including how to create custom architectures with the nn.Module, nn.Sequential and torch.nn.parameter classes
Working with images in PyTorch

Defining the Network

As pointed out earlier, we use the nn.Module class to build custom architectures in PyTorch. Let us define a network for our detector. Add the following class to the darknet.py file.

class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)

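Here we subclass nn.Module and name our class Darknet. We initialize the network with the members blocks, net_info and module_list.

Implementing the Forward Pass of the Network

The forward pass of the network is implemented by overriding the forward method of the nn.Module class.

forward serves two purposes. The first is to calculate the output. The second is to transform the output detection feature maps into a form that is easier to process (for example, transforming them so that detection maps across multiple scales can be concatenated, which is otherwise impossible because they have different dimensions).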

def forward(self, x, CUDA):
    modules = self.blocks[1:]
    outputs = {} # Cache the outputs of every layer for the route and shortcut layers.

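forward takes three arguments: self, the input x, and CUDA, a flag that, when true, runs the forward pass on the GPU to speed it up.

Here we iterate over self.blocks[1:] instead of self.blocks, because the first element of self.blocks is the net block, which is not part of the forward pass.

Since route and shortcut layers need the output maps of previous layers, we cache the output feature map of every layer in a dict named outputs. The keys are the layer indices and the values are the feature maps.

As in the create_modules function, we now iterate over module_list, which contains the modules of the network. Note that the modules were appended in the same order in which they appear in the configuration file. This means we can simply run the input through each module in turn to get the output.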

    write = 0 # This is explained a bit later.
    for i, module in enumerate(modules):
        module_type = (module['type'])

Convolutional and Upsample Layers

If the module is a convolutional or upsample module, this is how the forward pass works.

        if module_type == 'convolutional' or module_type == 'upsample':
            x = self.module_list[i](x)

Route Layer / Shortcut Layer

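If you look at the code for the route layer, there are two cases to account for (as described in Part 2). When two feature maps have to be concatenated, we use the torch.cat function with the second argument set to 1, because we want to concatenate the feature maps along the depth. (In PyTorch, the input and output of a convolutional layer have the format B x C x H x W. The depth corresponds to the channel dimension.)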

        elif module_type == 'route':
            layers = module['layers']
            layers = [int(a) for a in layers]
            
            if (layers[0]) > 0:
                layers[0] = layers[0] - i
            
            if len(layers) == 1:
                x = outputs[i + (layers[0])]
            
            else:
                if (layers[1]) > 0:
                    layers[1] = layers[1] - i
                    
                map1 = outputs[i + layers[0]]
                map2 = outputs[i + layers[1]]
                
                x = torch.cat((map1, map2), 1)
            
        elif module_type == 'shortcut':
            from_ = int(module['from'])
            x = outputs[i-1] + outputs[i+from_]
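
The index arithmetic in the route branch can be checked in isolation. The following is a minimal sketch in plain Python. resolve_route is a hypothetical helper, not part of the tutorial code, and it only reproduces the index handling (no tensors are involved). The layer indices in the example are illustrative.

```python
def resolve_route(i, layers):
    """Return the absolute indices of the layers a route block at index i reads from.

    `layers` holds the values of the `layers` attribute from the cfg file:
    negative values are offsets relative to the current layer, positive
    values are absolute layer indices.
    """
    resolved = []
    for layer in layers:
        # Same normalisation as in forward: an absolute index is first
        # turned into a relative offset ...
        offset = layer - i if layer > 0 else layer
        # ... and `outputs` is then indexed with i + offset.
        resolved.append(i + offset)
    return resolved


# A route block at index 86 with "layers = -1, 61" reads layers 85 and 61.
print(resolve_route(86, [-1, 61]))  # [85, 61]
# A route block at index 83 with "layers = -4" reads layer 79.
print(resolve_route(83, [-4]))      # [79]
```

Converting to an offset and back looks redundant, but it lets both cases share the same outputs[i + offset] lookup.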

YOLO (Detection Layer)

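The output of YOLO is a convolutional feature map that contains the bounding box attributes along the depth of the feature map. The bounding box attributes predicted by a cell are stacked one after another. So, to access the second bounding box of the cell at (5,6), you have to index it as map[5,6, (5+C) : 2*(5+C)]. This form is very inconvenient for output processing such as thresholding by object confidence, adding grid offsets to the centers, and applying anchors.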
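
To make the indexing concrete, here is a small sketch that computes the depth slice holding the k-th bounding box of a cell. box_slice is a hypothetical helper, and C = 80 is an assumption (the number of class scores mentioned later in this post).

```python
def box_slice(k, num_classes):
    """Depth slice holding the attributes of the k-th box (0-based) of a cell."""
    bbox_attrs = 5 + num_classes  # x, y, w, h, objectness + class scores
    return slice(k * bbox_attrs, (k + 1) * bbox_attrs)


C = 80
second = box_slice(1, C)
print(second.start, second.stop)  # 85 170, i.e. (5+C) : 2*(5+C)
```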

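Another problem is that, since detection happens at three scales, the dimensions of the prediction maps differ. Although the three feature maps have different dimensions, the output processing operations applied to them are the same. It would be better to perform these operations on a single tensor rather than on three separate tensors.

To solve these problems, we introduce the predict_transform function.

Transforming the Output

The predict_transform function lives in the util.py file, and we will import it when we use it in the forward method of the Darknet class.

Add the imports to the top of util.py.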

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np
import cv2

predict_transform takes 5 parameters: prediction (the output), inp_dim (the input image dimension), anchors, num_classes, and an optional CUDA flag.

def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA = True):

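The predict_transform function takes a detection feature map and turns it into a 2-D tensor, where each row of the tensor holds the attributes of one bounding box, in the order shown in the figure below.

Here is the code that performs this transformation.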

    batch_size = prediction.size(0)
    stride = inp_dim // prediction.size(2)
    grid_size = inp_dim // stride
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)
    
    prediction = prediction.view(batch_size, bbox_attrs * num_anchors, grid_size * grid_size)
    prediction = prediction.transpose(1,2).contiguous()
    prediction = prediction.view(batch_size, grid_size * grid_size * num_anchors, bbox_attrs)
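
The effect of the view / transpose / view sequence is easier to see as an index mapping. The sketch below, in plain Python, computes which entry of the original B x C x H x W map ends up at a given row and column of the flattened table. source_index is a hypothetical helper written for illustration only.

```python
def source_index(row, attr, grid_size, num_anchors, bbox_attrs):
    """Map (row, attr) of the flattened table back to (channel, y, x) of the feature map."""
    cell, anchor = divmod(row, num_anchors)  # rows are grouped per cell, anchors consecutive
    y, x = divmod(cell, grid_size)           # cells are laid out row by row
    channel = anchor * bbox_attrs + attr     # attributes of one anchor are contiguous in depth
    return channel, y, x


# 13 x 13 grid, 3 anchors, 85 attributes per box:
# objectness (attr 4) of the second box of the cell in grid row 6, column 5
row = (6 * 13 + 5) * 3 + 1
print(row, source_index(row, 4, 13, 3, 85))  # 250 (89, 6, 5)
```

In other words, the rows of the table run cell by cell, and the boxes of a single cell sit next to each other.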

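The dimensions of the anchors follow the height and width attributes of the net block. These attributes describe the dimensions of the input image, which is larger than the detection map by a factor of stride. Therefore, we must divide the anchors by the stride of the detection feature map.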

    anchors = [(a[0]/stride, a[1]/stride) for a in anchors]

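Now we need to transform the output according to the equations discussed in Part 1.

Pass the x, y coordinates and the objectness score through the sigmoid function.

    # Sigmoid the center x, y coordinates and the object confidence.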
    prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
    prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
    prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])

Add the grid offsets to the center coordinate predictions.

    # Add the center offsets.
    grid = np.arange(grid_size)
    a, b = np.meshgrid(grid, grid)
    
    x_offset = torch.FloatTensor(a).view(-1, 1)
    y_offset = torch.FloatTensor(b).view(-1, 1)
    
    if CUDA:
        x_offset = x_offset.cuda()
        y_offset = y_offset.cuda()
        
    x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0)
    
    prediction[:,:,:2] += x_y_offset
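
The chained cat / repeat / view call is compact but hard to read. Below is a plain-Python sketch of the ordering it produces: one (x, y) offset per row of the flattened table, cell by cell, with the anchors of one cell consecutive. offsets_for is a hypothetical name used only for this illustration.

```python
def offsets_for(grid_size, num_anchors):
    """(x, y) offset of the grid cell for every row of the flattened prediction table."""
    return [
        (float(x), float(y))
        for y in range(grid_size)    # rows of the grid
        for x in range(grid_size)    # columns within a row (x varies fastest)
        for _ in range(num_anchors)  # one copy per anchor of the cell
    ]


offsets = offsets_for(2, 3)
print(len(offsets))  # 12 rows: 2 * 2 cells, 3 anchors each
print(offsets[:4])   # [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
```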

Apply the anchors to the dimensions of the bounding box.

    # Log-space transform of the height and width.
    anchors = torch.FloatTensor(anchors)
    
    if CUDA:
        anchors = anchors.cuda()
        
    anchors = anchors.repeat(grid_size * grid_size, 1).unsqueeze(0)
    prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4]) * anchors

Apply the sigmoid activation to the class scores.

    # Apply the sigmoid activation to the class scores.
    prediction[:,:,5: 5 + num_classes] = torch.sigmoid((prediction[:,:,5 : 5 + num_classes]))

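The last thing to do here is to resize the detection map to the size of the input image. At this point the bounding box attributes are sized according to the feature map (say, 13 x 13). If the input image was 416 x 416, we multiply the attributes by 32, which is the value of the stride variable.

    # Resize the detection map to the size of the input image.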
    prediction[:,:,:4] *= stride
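
Putting the steps together for a single box, the following scalar sketch mirrors what predict_transform does to every row. It uses only the math module. decode_box and its argument names are hypothetical, and the anchor (116, 90) is just an example value given in input-image pixels.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor, stride):
    """Decode the raw outputs of one box into center and size in input-image pixels."""
    aw, ah = anchor[0] / stride, anchor[1] / stride  # anchor in feature-map units
    bx = sigmoid(tx) + cell_x                        # center in feature-map units
    by = sigmoid(ty) + cell_y
    bw = math.exp(tw) * aw                           # size in feature-map units
    bh = math.exp(th) * ah
    # Final step: scale everything back to the input image.
    return bx * stride, by * stride, bw * stride, bh * stride


# All-zero raw outputs in cell (6, 6) of a 13 x 13 map (stride 32):
print(decode_box(0, 0, 0, 0, 6, 6, (116, 90), 32))  # (208.0, 208.0, 116.0, 90.0)
```

With all raw outputs at zero, the box sits in the middle of its cell and has exactly the size of its anchor.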

That concludes the body of the transformation.

At the end of the function, return the predictions.

    return prediction

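Detection Layer Revisited

Now that the output tensors have been transformed, we can concatenate the detection maps at the three different scales into one big tensor. Remember that this was not possible before the transformation, because feature maps with different spatial dimensions cannot be concatenated. From now on, however, each output tensor acts simply as a table with bounding boxes as its rows, so concatenation is possible.

One obstacle is that we cannot initialize an empty tensor and then concatenate a non-empty tensor (of a different shape) to it. So we delay the initialization of the collector (the tensor that holds the detections) until we get the first detection map, and then concatenate subsequent detection maps to it.

Recall the write = 0 line just before the loop in the forward function. The write flag indicates whether we have encountered the first detection or not. If write is 0, the collector has not been initialized. If it is 1, the collector has been initialized and we can simply concatenate detection maps to it.

Now that the predict_transform function is ready, we can write the code that handles detection feature maps in the forward function.

At the top of darknet.py, add the following import.

from util import *

Then add the following to the forward function.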

        elif module_type == 'yolo':

            anchors = self.module_list[i][0].anchors
            # Get the input dimensions.
            inp_dim = int(self.net_info['height'])

            # Get the number of classes.
            num_classes = int(module['classes'])

            # Transform the output with predict_transform from util.py.
            x = x.data
            x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
            if not write:    # the collector has not been initialized yet
                detections = x
                write = 1

            else:
                detections = torch.cat((detections, x), 1)

        outputs[i] = x

Now, simply return the detections.

    return detections

Testing the Forward Pass

Here is a function that creates a dummy input, which we will pass to the network. Before writing this function, save the image below into your working directory.

The image can be downloaded here.

https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png

Now define the function at the top of darknet.py as follows.

def get_test_input():
    img = cv2.imread('dog-cycle-car.png')
    img = cv2.resize(img, (416,416))        # Resize to the input dimension.
    img_ = img[:,:,::-1].transpose((2,0,1)) # BGR -> RGB, HxWxC -> CxHxW
    img_ = img_[np.newaxis,:,:,:]/255.0     # Add a dimension at 0 (for the batch) and normalize.
    img_ = torch.from_numpy(img_).float()   # Convert to float.
    img_ = Variable(img_)                   # Convert to Variable.
    return img_

Then write the following code.

model = Darknet("cfg/yolov3.cfg")
inp = get_test_input()
pred = model(inp, torch.cuda.is_available())
print (pred)

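You will see an output like the following.

The shape of this tensor is 1 x 10647 x 85. The first dimension is the batch size, which is 1 because we used a single image. For each image in a batch, we have a 10647 x 85 table. Each row of this table represents a bounding box (4 bounding box attributes, 1 objectness score and 80 class scores).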
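
The figure 10647 follows from the three detection scales. A quick check, assuming a 416 x 416 input, strides of 32, 16 and 8, and 3 anchors per scale:

```python
inp_dim = 416
strides = (32, 16, 8)    # assumed strides of the three detection scales
num_anchors = 3          # anchors per detection scale
num_classes = 80

rows = sum((inp_dim // s) ** 2 * num_anchors for s in strides)
print(rows)              # 10647 = (13*13 + 26*26 + 52*52) * 3
print(5 + num_classes)   # 85 attributes per row
```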

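At this point the network has random weights and will not produce correct output. We need to load a weights file into the network. We will use the official weights file for this purpose.

Downloading the Pre-trained Weights

Download the weights file into your detector directory. The weights file can be downloaded from here.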

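Understanding the Weights File

The official weights file is a binary file that contains the weights stored in a serial fashion.

Extreme care must be taken when reading the weights. The weights are stored simply as floats, with nothing to indicate which layer they belong to. If you make a mistake, nothing stops you from, say, loading the weights of a batch norm layer into a convolutional layer. Since you are reading only floats, there is no way to tell which weight belongs to which layer. Hence, we must understand how the weights are stored.

First, the weights belong to only two types of layers: batch norm layers and convolutional layers.

The weights for these layers are stored in exactly the same order as the layers appear in the configuration file. So, if a convolutional block is followed by a shortcut block, and the shortcut block by another convolutional block, you can expect the file to contain the weights of the first convolutional block, followed by those of the second.

When a batch norm layer appears in a convolutional block, the convolutional layer has no biases. When there is no batch norm layer, however, the bias "weights" have to be read from the file.

The following diagram sums up how the weights file stores the weights.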
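
The same layout can also be written down as code. This sketch lists, in file order, the segments stored for one convolutional block and how many floats each holds. conv_block_layout is a hypothetical helper, and the order follows the loading order used by load_weights in this post.

```python
def conv_block_layout(in_channels, out_channels, kernel_size, batch_normalize):
    """Ordered (name, number_of_floats) segments of one convolutional block in the weights file."""
    if batch_normalize:
        # With batch norm the conv layer has no bias; four BN vectors come first.
        segments = [
            ("bn_biases", out_channels),
            ("bn_weights", out_channels),
            ("bn_running_mean", out_channels),
            ("bn_running_var", out_channels),
        ]
    else:
        segments = [("conv_biases", out_channels)]
    # The convolution kernels always come last.
    segments.append(("conv_weights", out_channels * in_channels * kernel_size * kernel_size))
    return segments


# Example: a 3x3 convolution from 3 to 32 channels with batch norm
layout = conv_block_layout(3, 32, 3, True)
print(sum(n for _, n in layout))  # 992 = 4 * 32 + 32 * 3 * 3 * 3
```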

Loading the Weights

Let us write a function that loads the weights. It will be a member function of the Darknet class. Besides self, it takes one argument, the path to the weights file.

    def load_weights(self, weightfile):

The first 20 bytes (160 bits) of the weights file store 5 int32 values, which make up the header of the file.

        # Open the weights file.
        fp = open(weightfile, 'rb')
        
        # The first 5 values are header information
        # 1. Major version number
        # 2. Minor version number
        # 3. Subversion number
        # 4, 5. Images seen by the network (during training)
        header = np.fromfile(fp, dtype = np.int32, count = 5)
        self.header = torch.from_numpy(header)
        self.seen = self.header[3]
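
To see the header layout without a real weights file, the sketch below writes a fake 5-value header followed by two floats to a temporary file and reads it back with the standard struct module. The header values are made up for illustration, and little-endian byte order is an assumption.

```python
import os
import struct
import tempfile

# Build a fake file: 5 int32 header values followed by two float32 weights.
fake_header = (0, 2, 0, 32013312, 0)  # made-up values, not taken from a real file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<5i", *fake_header))
    tmp.write(struct.pack("<2f", 0.5, -1.25))
    path = tmp.name

with open(path, "rb") as fp:
    header = struct.unpack("<5i", fp.read(5 * 4))  # 5 values * 4 bytes = 20 bytes
    rest = fp.read()                               # everything else is float32 data
    weights = struct.unpack("<%df" % (len(rest) // 4), rest)
os.remove(path)

print(header)   # (0, 2, 0, 32013312, 0)
print(weights)  # (0.5, -1.25)
```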

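The remaining bits represent the weights, in the order described above. The weights are stored as float32 values. Let us load the rest of the weights into an np.ndarray.

        weights = np.fromfile(fp, dtype = np.float32)

Now we iterate over the weights file and load the weights into the modules of the network.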

        ptr = 0
        for i in range(len(self.module_list)):
            module_type = self.blocks[i + 1]['type']

Inside the loop, we first check whether the convolutional block has batch_normalize set. Based on that, we load the weights.

            # If module_type is convolutional, load the weights.
            # Otherwise ignore the module.
            if module_type == 'convolutional':
                model = self.module_list[i]
                try:
                    batch_normalize = int(self.blocks[i+1]['batch_normalize'])
                except:
                    batch_normalize = 0
                
                conv = model[0]

We keep a variable called ptr to track where we are in the weights array. Now, if batch_normalize is true, we load the weights as follows.

                if (batch_normalize):
                    bn = model[1]
                    
                    # Get the number of weights of the batch norm layer.
                    num_bn_biases = bn.bias.numel()
                    
                    # Load the weights.
                    bn_biases = torch.from_numpy(weights[ptr:ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr += num_bn_biases
                    
                    # Cast the loaded weights into the dimensions of the model weights.
                    bn_biases = bn_biases.view_as(bn.bias.data)
                    bn_weights = bn_weights.view_as(bn.weight.data)
                    bn_running_mean = bn_running_mean.view_as(bn.running_mean)
                    bn_running_var = bn_running_var.view_as(bn.running_var)
                    
                    # Copy the data to the model.
                    bn.bias.data.copy_(bn_biases)
                    bn.weight.data.copy_(bn_weights)
                    bn.running_mean.copy_(bn_running_mean)
                    bn.running_var.copy_(bn_running_var)

If batch_normalize is not true, simply load the biases of the convolutional layer.

                else:
                    # No batch norm: load the biases of the convolutional layer.
                    # Number of biases
                    num_biases = conv.bias.numel()
                    
                    # Load the weights.
                    conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases])
                    ptr = ptr + num_biases
                    
                    # Reshape the loaded weights to the dimensions of the model weights.
                    conv_biases = conv_biases.view_as(conv.bias.data)
                    
                    # Finally, copy the data.
                    conv.bias.data.copy_(conv_biases)

Finally, we load the weights of the convolutional layer.

                # Load the weights of the convolutional layer.
                num_weights = conv.weight.numel()
                
                # Do the same as above for the weights.
                conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights])
                ptr = ptr + num_weights
                
                conv_weights = conv_weights.view_as(conv.weight.data)
                conv.weight.data.copy_(conv_weights)
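
A useful sanity check after loading is that the pointer has consumed the whole array. A mismatch usually suggests that the cfg file and the weights file do not belong together. The tutorial code does not perform this check, so the plain-Python sketch below is only a suggestion, and split_flat is a hypothetical helper that mimics the ptr bookkeeping on an ordinary list.

```python
def split_flat(flat, counts):
    """Cut a flat sequence into consecutive chunks, mimicking the ptr bookkeeping."""
    ptr = 0
    chunks = []
    for n in counts:
        chunk = flat[ptr:ptr + n]
        if len(chunk) != n:
            raise ValueError("ran out of values at offset %d" % ptr)
        chunks.append(chunk)
        ptr += n
    if ptr != len(flat):
        raise ValueError("%d values were left unread" % (len(flat) - ptr))
    return chunks


flat = list(range(10))
print(split_flat(flat, [2, 2, 6]))  # [[0, 1], [2, 3], [4, 5, 6, 7, 8, 9]]
```

In load_weights, the equivalent check would be to compare ptr with len(weights) once the loop has finished.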

The function is complete! You can now load weights into a Darknet object by calling the load_weights function on it.

model = Darknet("cfg/yolov3.cfg")
model.load_weights("yolov3.weights")

In the next part, we will cover objectness confidence thresholding and Non-maximum suppression, which produce the final set of detections.
