Cite
Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. "Scaled-YOLOv4: Scaling Cross Stage Partial Network." arXiv, February 21, 2021. http://arxiv.org/abs/2011.08036.
Synth
Contribution::
Md
Author:: Wang, Chien-Yao
Author:: Bochkovskiy, Alexey
Author:: Liao, Hong-Yuan Mark
Title:: Scaled-YOLOv4: Scaling Cross Stage Partial Network
Year:: 2021
Citekey:: @WangEtAl2021
Tags:: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
itemType:: preprint
LINK
Abstract
abstract:: We show that the YOLOv4 object detection neural network based on the CSP approach, scales both up and down and is applicable to small and large networks while maintaining optimal speed and accuracy. We propose a network scaling approach that modifies not only the depth, width, resolution, but also structure of the network. YOLOv4-large model achieves state-of-the-art results: 55.5% AP (73.4% AP50) for the MS COCO dataset at a speed of ~16 FPS on Tesla V100, while with the test time augmentation, YOLOv4-large achieves 56.0% AP (73.3 AP50). To the best of our knowledge, this is currently the highest accuracy on the COCO dataset among any published work. The YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on RTX 2080Ti, while by using TensorRT, batch size = 4 and FP16-precision the YOLOv4-tiny achieves 1774 FPS.
Annotations
Highlight
We propose a network scaling approach that modifies not only the depth, width, resolution, but also structure of the network. (Go to Paper)
1. Introduction
Highlight
In order to design an effective object detector, model scaling technique is very important, because it can make object detector achieve high accuracy and real-time inference on various types of devices. (Go to Paper)
Highlight
The most common model scaling technique is to change the depth (number of layers in a neural network) and width (number of filters in a layer) of the backbone, and then train neural networks suitable for different devices. (Go to Paper)
Image
Highlight
In [2], Cai et al. try to develop techniques that can be applied to various device network architectures with only training once. They use techniques such as decoupling training and search and knowledge distillation to decouple and train several sub-nets, so that the entire network and sub-nets are capable of processing target tasks. (Go to Paper)
Comment:
Cai et al. attempt to develop techniques that can be applied to various device network architectures with only a single training run. They use techniques such as decoupling training and search, together with knowledge distillation, to decouple and train several sub-nets so that both the entire network and its sub-nets can handle the target task. This approach is part of the effort to improve network efficiency and flexibility. Here, "decoupling training and search" means carrying out model training and the search for the optimal architecture as separate processes. "Knowledge distillation" refers to transferring what a large model or network has learned to a smaller model or network, so that the smaller model or sub-net can approach the performance of the full network. The goal is to develop a model that, after a single training run, can easily be applied to diverse devices and network architectures.
Highlight
Tan et al. [34] proposed using network architecture search (NAS) technique to perform compound scaling width, depth, and resolution on EfficientNet-B0. They use this initial network to search for the best convolutional neural network (CNN) architecture for a given amount of computation and set it as EfficientNet-B1, and then use linear scale-up technique to obtain EfficientNet-B2 to EfficientNet-B7. (Go to Paper)
Comment:
Tan et al. proposed using the network architecture search (NAS) technique to perform compound scaling on EfficientNet-B0, where compound scaling means adjusting width, depth, and resolution together. They use this initial network to search for the best convolutional neural network (CNN) architecture for a given amount of computation and set it as EfficientNet-B1; they then use a linear scale-up technique to obtain EfficientNet-B2 through EfficientNet-B7. In this process, NAS automatically explores and evaluates a variety of network structures to find the most efficient architecture within a given resource budget (e.g., computation). This makes it possible to design networks of various sizes efficiently on top of EfficientNet-B0 and to produce models optimized for different requirements (e.g., higher accuracy or lower computational cost). EfficientNet-B1 is based on the architecture found by this search, and the subsequent models improve performance by linearly scaling width, depth, and resolution while keeping EfficientNet-B1's structure, providing a model series suited to applications with different performance demands.
Highlight
Radosavovic et al. [27] summarized and added constraints from the vast parameter search space AnyNet, and then designed RegNet to find optimal depth, bottleneck ratio, and width increase rate of a CNN. In addition, there are NAS and model scaling methods specifically proposed for object detection [6, 35]. (Go to Paper)
Comment:
Radosavovic et al. summarized and added constraints from the very large parameter search space AnyNet, and then designed RegNet to find the optimal depth, bottleneck ratio, and width increase rate of a CNN. There are also NAS and model scaling methods proposed specifically for object detection. Here, AnyNet refers to a network design space in which a wide variety of architecture configurations can be explored. Radosavovic et al. introduced additional constraints to make search tractable within this large space, and designed RegNet on that basis. RegNet maximizes CNN performance by finding the best combination of key parameters such as depth, bottleneck ratio, and width increase rate. For specific applications such as object detection, dedicated NAS and model scaling methods have been proposed; these take the particular requirements and constraints of detection into account when deciding the architecture and model size. Such methods were developed to improve both the accuracy and the efficiency of object detection, with an emphasis on producing models that perform well across diverse computational budgets and environments.
Highlight
Through analysis of state-of-the-art object detectors [1, 3, 6, 26, 35, 40, 44], we found that CSPDarknet53, which is the backbone of YOLOv4 [1], matches almost all optimal architecture features obtained by network architecture search technique [27]. (Go to Paper)
Highlight
In the proposed scaled-YOLOv4, we discussed the upper and lower bounds of linear scaling up/down models, and respectively analyzed the issues that need to be paid attention to in model scaling for small models and large models. Thus, we are able to systematically develop YOLOv4-large and YOLOv4-tiny models. (Go to Paper)
Highlight
We summarize the contributions of this paper : (1) design a powerful model scaling method for small model, which can systematically balance the computation cost and memory bandwidth of a light CNN; (2) design a simple yet effective strategy for scaling a large object detector; (3) analyze the relations among all model scaling factors and then perform model scaling based on most advantageous group partitions; (4) experiments have confirmed that the FPN structure is inherently a once-for-all structure; and (5) we make use of the above methods to develop YOLOv4-tiny and YOLOv4-large. (Go to Paper)
Comment:
The contributions of this paper can be summarized as follows:
- Designed a powerful model scaling method for small models, which can systematically balance the computation cost and memory bandwidth of a lightweight CNN.
- Designed a simple yet effective strategy for scaling up a large object detector.
- Analyzed the relations among all model scaling factors and then performed model scaling based on the most advantageous group partitions.
- Confirmed experimentally that the FPN structure is inherently a once-for-all structure.
- Used the above methods to develop YOLOv4-tiny and YOLOv4-large.

Explanation: contribution 1 means that a scaling method suited to small models makes it possible to design CNNs that perform efficiently even in resource-constrained environments, which is especially useful for embedded systems and mobile devices that need lightweight models. Contribution 2 provides a strategy for improving large object detection models, enabling greater accuracy and efficiency. Contribution 3 proposes a way to understand the interactions among the various factors involved in model scaling and to find the optimal configuration based on them. Contribution 4 shows that the FPN structure can be applied flexibly to models of various sizes and shapes, an important finding for object detection that points to a general-purpose structure usable across applications. Finally, contribution 5 emphasizes that all the techniques discussed above were integrated to actually build two variants of YOLOv4, extending its applicability to a broader range of uses.
2. Related work
2.1. Real-time object detection
Highlight
Object detectors are mainly divided into one-stage object detectors [28, 29, 30, 21, 18, 24] and two-stage object detectors [10, 9, 31]. The output of a one-stage object detector can be obtained after only one CNN operation. As for a two-stage object detector, it usually feeds the high score region proposals obtained from the first-stage CNN to the second-stage CNN for final prediction. The inference time of one-stage object detectors and two-stage object detectors can be expressed as T_one = T_1st and T_two = T_1st + m·T_2nd, where m is the number of region proposals whose confidence score is higher than a threshold. Today's popular real-time object detectors are almost all one-stage object detectors. One-stage object detectors mainly have two kinds: anchor-based [30, 18] and anchor-free [7, 13, 14, 36]. Among all anchor-free approaches, CenterNet [46] is very popular because it does not require complicated post-processing, such as Non-Maximum Suppression (NMS). At present, the more accurate real-time one-stage object detectors are anchor-based EfficientDet [35], YOLOv4 [1], and PP-YOLO [22]. In this paper, we developed our model scaling methods based on YOLOv4 [1]. (Go to Paper)
Comment:
A one-stage object detector predicts objects' locations and classes directly with a single CNN operation, and its fast processing makes it suitable for real-time systems. A two-stage object detector, by contrast, first proposes regions likely to contain objects and then performs more precise detection on those regions, raising accuracy. Anchor-based methods detect objects using predefined anchor boxes, while anchor-free methods directly predict properties such as object centers without anchor boxes. This paper develops its new model scaling methods on top of YOLOv4, an anchor-based one-stage detector that offers real-time processing together with high accuracy.
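Comment:
A minimal sketch of the inference-time model quoted above (T_one = T_1st, T_two = T_1st + m·T_2nd). The function names and timing numbers are illustrative assumptions, not measurements from the paper; the point is simply that two-stage latency grows with the proposal count m.

```python
def one_stage_latency(t_first: float) -> float:
    """One-stage detector: a single CNN pass."""
    return t_first


def two_stage_latency(t_first: float, t_second: float, m: int) -> float:
    """Two-stage detector: first-stage pass plus one second-stage pass
    per region proposal above the confidence threshold."""
    return t_first + m * t_second


# Illustrative numbers: 20 ms first stage, 2 ms per proposal, 30 proposals.
print(one_stage_latency(20.0))           # 20.0 ms, independent of content
print(two_stage_latency(20.0, 2.0, 30))  # 80.0 ms, grows with m
```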
2.2. Model scaling
Highlight
Traditional model scaling method is to change the depth of a model, that is to add more convolutional layers. (Go to Paper)
Comment:
Traditional model scaling mainly increases a model's depth, adding more convolutional layers to make the model deeper and thereby improve performance. VGGNet is a representative example of this approach, with different versions varying the number of layers. ResNet took this a step further, building very deep networks that substantially improved performance.
Highlight
The subsequent methods generally follow the same methodology for model scaling. (Go to Paper)
Highlight
[43] thought about the width of the network, and they changed the number of kernel of convolutional layer to realize scaling. They therefore design wide ResNet (WRN), while maintaining the same accuracy. Although WRN has higher amount of parameters than ResNet, the inference speed is much faster. (Go to Paper)
Comment:
Subsequent studies also turned their attention to network width; by increasing width, WRN achieved faster inference speed together with high performance.
Highlight
The subsequent DenseNet [12] and ResNeXt [41] also designed a compound scaling version that puts depth and width into consideration. (Go to Paper)
Comment:
DenseNet and ResNeXt, which introduced compound scaling, proposed further improved models by considering depth and width at the same time.
Highlight
As for image pyramid inference, it is a common way to perform augmentation at run time. It takes an input image and makes a variety of different resolution scaling, and then input these distinct pyramid combinations into a trained CNN. Finally, the network will integrate the multiple sets of outputs as its ultimate outcome. Redmon et al. [30] use the above concept to execute input image size scaling. They use higher input image resolution to perform fine-tune on a trained Darknet53, and the purpose of executing this step is to get higher accuracy. (Go to Paper)
Comment:
Image pyramid inference is a method for obtaining higher accuracy by feeding images at a variety of resolutions into the network; adjusting the input image size is a strategy for fine-tuning performance. These diverse scaling methods are all efforts to optimize model performance and, especially in high-resolution image processing, to achieve higher accuracy.
Highlight
In recent years, network architecture search (NAS) related research has been developed vigorously, and NASFPN [8] has searched for the combination path of feature pyramid. We can think of NAS-FPN as a model scaling technique which is mainly executed at the stage level. (Go to Paper)
Comment:
Recent NAS research has attracted attention as a way to optimize network structure and design more efficient models. NAS-FPN focuses on searching over combination paths of the feature pyramid to raise object detection accuracy.
Highlight
As for EfficientNet [34], it uses compound scaling search based on depth, width, and input size. The main design concept of EfficientDet [35] is to disassemble the modules with different functions of object detector, and then perform scaling on the image size, width, BiFPN layers, and class layer. (Go to Paper)
Comment:
EfficientNet and EfficientDet each adopt a compound scaling strategy that maximizes performance by scaling the model along several dimensions such as depth, width, and input size, adjusting these different elements of the model to improve detector performance.
Highlight
Another design that uses NAS concept is SpineNet [6], which is mainly aimed at the overall architecture of fish-shaped object detector for network architecture search. This design concept can ultimately produce a scale-permuted structure. Another network with NAS design is RegNet [27], which mainly fixes the number of stage and input resolution, and integrates all parameters such as depth, width, bottleneck ratio and group width of each stage into depth, initial width, slope, quantize, bottleneck ratio, and group width. Finally, they use these six parameters to perform compound model scaling search. (Go to Paper)
Comment:
SpineNet and RegNet use NAS to optimize the overall network structure; RegNet in particular performs a compound scaling search that jointly considers depth, width, bottleneck ratio, and other parameters.
Highlight
The above methods are all great work, but few of them analyze the relation between different parameters. (Go to Paper)
Comment:
These approaches improve model performance efficiently, but they fall short when it comes to analyzing the interactions among the various parameters.
Highlight
In this paper, we will try to find a method for synergistic compound scaling based on the design requirements of object detection. (Go to Paper)
Comment:
To resolve this issue, this paper explores a new method that scales the model in a mutually complementary (synergistic) way, tailored to the design requirements of object detection. This can lead to more refined and efficient model design in the object detection field.
3. Principles of model scaling
Highlight
After performing model scaling for the proposed object detector, the next step is to deal with the quantitative factors that will change, including the number of parameters with qualitative factors. These factors include model inference time, average precision, etc. The qualitative factors will have different gain effects depending on the equipment or database used. (Go to Paper)
3.1. General principle of model scaling
Highlight
When designing the efficient model scaling methods, our main principle is that when the scale is up/down, the lower/higher the quantitative cost we want to increase/decrease, the better. In this section, we will show and analyze various general CNN models, and try to understand their quantitative costs when facing changes in (1) image size, (2) number of layers, and (3) number of channels. The CNNs we chose are ResNet, ResNext, and Darknet. (Go to Paper)
Highlight
For the k-layer CNNs with b base layer channels, the computations of ResNet layer is k∗[conv(1×1, b/4) → conv(3×3, b/4) → conv(1×1, b)], and that of ResNext layer is k∗[conv(1×1, b/2) → gconv(3×3/32, b/2) → conv(1×1, b)]. As for the Darknet layer, the amount of computation is k∗[conv(1×1, b/2) → conv(3×3, b)]. (Go to Paper)
Highlight
Let the scaling factors that can be used to adjust the image size, the number of layers, and the number of channels be α, β, and γ, respectively. (Go to Paper)
Image
Highlight
the scaling size, depth, and width cause increase in the computation cost. They respectively show square, linear, and square increase. (Go to Paper)
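Comment:
A small sketch checking the square/linear/square claim, built from the Darknet layer formula quoted earlier (conv(1×1, b/2) → conv(3×3, b), repeated k times). The base dimensions here are arbitrary assumptions; only the ratios matter.

```python
def darknet_stage_flops(w: int, h: int, k: int, b: int) -> float:
    """FLOPs of k Darknet layers: conv(1x1, b/2) then conv(3x3, b)."""
    conv1x1 = w * h * 1 * 1 * b * (b / 2)   # b channels in, b/2 out
    conv3x3 = w * h * 3 * 3 * (b / 2) * b   # b/2 channels in, b out
    return k * (conv1x1 + conv3x3)


base = darknet_stage_flops(64, 64, 8, 256)
for name, (a, bt, g) in [("size alpha=2", (2, 1, 1)),
                         ("depth beta=2", (1, 2, 1)),
                         ("width gamma=2", (1, 1, 2))]:
    scaled = darknet_stage_flops(64 * a, 64 * a, 8 * bt, 256 * g)
    print(f"{name}: x{scaled / base:.0f}")
# size alpha=2: x4 (square), depth beta=2: x2 (linear), width gamma=2: x4 (square)
```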
Highlight
The CSPNet [37] proposed by Wang et al. can be applied to various CNN architectures, while reducing the amount of parameters and computations. In addition, it also improves accuracy and reduces inference time. We apply it to ResNet, ResNeXt, and Darknet and observe the changes in the amount of computations, as shown in Table 2. (Go to Paper)
Highlight
From the figures shown in Table 2, we observe that after converting the above CNNs to CSPNet, the new architecture can effectively reduce the amount of computations (FLOPs) on ResNet, ResNeXt, and Darknet by 23.5%, 46.7%, and 50.0%, respectively. Therefore, we use CSP-ized models as the best model for performing model scaling. (Go to Paper)
Image
3.2. Scaling Tiny Models for Low-End Devices
Highlight
For low-end devices, the inference speed of a designed model is not only affected by the amount of computation and model size, but more importantly, the limitation of peripheral hardware resources must be considered. Therefore, when performing tiny model scaling, we must also consider factors such as memory bandwidth, memory access cost (MACs), and DRAM traffic. In order to take into account the above factors, our design must comply with the following principles: (Go to Paper)
Make the order of computations less than O(whkb²):
Highlight
Lightweight models are different from large models in that their parameter utilization efficiency must be higher in order to achieve the required accuracy with a small amount of computations. (Go to Paper)
Highlight
In Table 3, we analyze the network with efficient parameter utilization, such as the computation load of DenseNet and OSANet [15], where g means growth rate. (Go to Paper)
Image
Highlight
For general CNNs, the relationship among g, b, and k listed in Table 3 is k << g < b. Therefore, the order of computation complexity of DenseNet is O(whgbk), and that of OSANet is O(max(whbg, whkg²)). The order of computation complexity of the above two is less than O(whkb²) of the ResNet series. Therefore, we design our tiny model with the help of OSANet, which has a smaller computation complexity. (Go to Paper)
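Comment:
A quick sketch of the quoted complexity orders under the paper's regime k << g < b. The constants are placeholders; only the relative ordering is meaningful.

```python
def resnet_order(w, h, k, b, g):
    return w * h * k * b ** 2                      # O(whkb^2)


def densenet_order(w, h, k, b, g):
    return w * h * g * b * k                       # O(whgbk)


def osanet_order(w, h, k, b, g):
    return max(w * h * b * g, w * h * k * g ** 2)  # O(max(whbg, whkg^2))


w = h = 32
k, g, b = 3, 32, 128                               # k << g < b
for name, fn in [("ResNet", resnet_order),
                 ("DenseNet", densenet_order),
                 ("OSANet", osanet_order)]:
    print(name, fn(w, h, k, b, g))
# OSANet < DenseNet < ResNet in this regime, matching the choice of
# OSANet as the basis for the tiny model.
```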
Minimize/balance size of feature map:
Highlight
In order to get the best trade-off in terms of computing speed, we propose a new concept, which is to perform gradient truncation between computational block of the CSPOSANet. (Go to Paper)
Highlight
If we apply the original CSPNet design to the DenseNet or ResNet architectures, because the jth layer output of these two architectures is the integration of the 1st to (j − 1)th layer outputs, we must treat the entire computational block as a whole. (Go to Paper)
Highlight
Because the computational block of OSANet belongs to the PlainNet architecture, making CSPNet from any layer of a computational block can achieve the effect of gradient truncation. We use this feature to re-plan the b channels of the base layer and the kg channels generated by computational block, and split them into two paths with equal channel numbers, as shown in Table 4. (Go to Paper)
Image
Highlight
When the number of channel is b + kg, if one wants to split these channels into two paths, the best partition is to divide it into two equal parts, i.e. (b + kg)/2. When we actually consider the bandwidth τ of the hardware, if software optimization is not considered, the best value is ceil((b + kg)/2τ) × τ. The CSPOSANet we designed can dynamically adjust the channel allocation. (Go to Paper)
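Comment:
A sketch of the channel-split rule above: with b + kg channels, the ideal two-way split is (b + kg)/2, and with hardware bandwidth τ the practical best is ceil((b + kg)/2τ) × τ. Reading the garbled symbol in this export as τ is my assumption, as are the example numbers.

```python
import math


def best_split(b: int, k: int, g: int, tau: int) -> int:
    """Channels per path: (b + kg)/2 rounded up to a multiple of tau."""
    total = b + k * g
    return math.ceil(total / (2 * tau)) * tau


print(best_split(b=64, k=3, g=32, tau=8))  # 160 channels -> 80 per path
print(best_split(b=64, k=3, g=30, tau=8))  # 154 channels -> 77 -> padded to 80
```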
Maintain the same number of channels after convolution:
Highlight
For evaluating the computation cost of low-end device, we must also consider power consumption, and the biggest factor affecting power consumption is memory access cost (MAC). Usually the MAC calculation method for a convolution operation is as follows: MAC = hw(C_in + C_out) + K·C_in·C_out (1), where h, w, C_in, C_out, and K represent, respectively, the height and width of feature map, the channel number of input and output, and the kernel size of convolutional filter. By calculating geometric inequalities, we can derive the smallest MAC when C_in = C_out [23]. (Go to Paper)
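Comment:
Eq. (1) is easy to check numerically. The sketch below fixes the product C_in·C_out and varies the split, showing the MAC minimum at C_in = C_out, which is the geometric-inequality argument cited from [23]. Feature-map size and kernel size are illustrative assumptions.

```python
def mac(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """Memory access cost: MAC = hw(C_in + C_out) + K * C_in * C_out."""
    return h * w * (c_in + c_out) + k * c_in * c_out


h = w = 56
k = 9  # a 3x3 kernel contributes K = 9 weights per in/out channel pair
for c_in, c_out in [(32, 128), (64, 64), (128, 32)]:  # same product 4096
    print(c_in, c_out, mac(h, w, c_in, c_out, k))
# The balanced (64, 64) case yields the smallest MAC.
```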
Minimize Convolutional Input/Output (CIO):
Highlight
CIO [4] is an indicator that can measure the status of DRAM IO. Table 5 lists the CIO of OSA, CSP, and our designed CSPOSANet. (Go to Paper)
Image
Highlight
When kg > b/2, the proposed CSPOSANet can obtain the best CIO. (Go to Paper)
3.3. Scaling Large Models for High-End GPUs
Highlight
Since we hope to improve the accuracy and maintain the real-time inference speed after scaling up the CNN model, we must find the best combination among the many scaling factors of object detector when performing compound scaling. Usually, we can adjust the scaling factors of an object detectorโs input, backbone, and neck. (Go to Paper)
Image
Highlight
The biggest difference between image classification and object detection is that the former only needs to identify the category of the largest component in an image, while the latter needs to predict the position and size of each object in an image. (Go to Paper)
Highlight
In one-stage object detector, the feature vector corresponding to each location is used to predict the category and size of an object at that location. The ability to better predict the size of an object basically depends on the receptive field of the feature vector. In the CNN architecture, the thing that is most directly related to receptive field is the stage, and the feature pyramid network (FPN) architecture tells us that higher stages are more suitable for predicting large objects. (Go to Paper)
Comment:
Here the authors emphasize the key difference between image classification and object detection: classification only needs to identify the class of the main object in an image, whereas detection is the more complex task of also determining the position and size of every object. This matters for one-stage object detectors, which predict an object's category and size from the feature vector assigned to each location. That predictive ability rests on the receptive field of the feature vector, which in a CNN architecture is determined by the "stage". The FPN architecture exploits this to provide a structure that detects objects of various sizes effectively. Table 7 of the paper shows how the receptive field relates to the other parameters, an essential element for the model to perceive object sizes accurately.
Image
Highlight
From Table 7, it is apparent that width scaling can be independently operated. (Go to Paper)
Highlight
When the input image size is increased, if one wants to have a better prediction effect for large objects, he/she must increase the depth or number of stages of the network. (Go to Paper)
Highlight
Among the parameters listed in Table 7, the compound of {size^input, #stage} turns out with the best impact. Therefore, when performing scaling up, we first perform compound scaling on size^input and #stage, and then according to real-time requirements, we further perform scaling on depth and width respectively. (Go to Paper)
4. Scaled-YOLOv4
4.1. CSP-ized YOLOv4
Highlight
In this sub-section, we re-design YOLOv4 to YOLOv4-CSP to get the best speed/accuracy trade-off. (Go to Paper)
Backbone:
Highlight
In the design of CSPDarknet53, the computation of down-sampling convolution for cross-stage process is not included in a residual block. Therefore, we can deduce that the amount of computation of each CSPDarknet stage is whb²(9/4 + 3/4 + 5k/2). From the formula deduced above, we know that CSPDarknet stage will have a better computational advantage over Darknet stage only when k > 1 is satisfied. The number of residual layer owned by each stage in CSPDarknet53 is 1-2-8-8-4 respectively. In order to get a better speed/accuracy trade-off, we convert the first CSP stage into original Darknet residual layer. (Go to Paper)
Comment:
The CSPDarknet53 architecture adopts a particular design in which the down-sampling convolution for the cross-stage process is not counted as part of a residual block. As a result, the computation of each stage is determined by a formula in terms of the feature-map area (wh), the base channel count (b), and the number of residual layers (k). According to this formula, CSPDarknet53 has a computational advantage over the original Darknet architecture only when k > 1. CSPDarknet53 has different numbers of residual layers at different stages, which affects the computational complexity of each stage. Through this configuration, the designers seek the best balance between speed and accuracy. In particular, converting the first CSP stage back to the original Darknet residual layer optimizes the model to maintain high accuracy at high speed, an important design decision that improves performance in practical settings.
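Comment:
A hedged check of the k > 1 claim, in units of whb². The CSPDarknet stage cost (9/4 + 3/4 + 5k/2) is the formula quoted above; treating a plain Darknet stage of k residual layers as costing 5k (from the Darknet layer formula in Section 3.1) and ignoring the shared down-sampling convolution on both sides are my simplifications, not the paper's full derivation.

```python
def csp_stage(k: int) -> float:
    """CSPDarknet stage cost in units of w*h*b^2."""
    return 9 / 4 + 3 / 4 + 5 * k / 2


def plain_stage(k: int) -> float:
    """Plain Darknet stage: k residual layers at 5*w*h*b^2 each."""
    return 5.0 * k


for k in [1, 2, 4, 8]:  # CSPDarknet53 stage depths are 1-2-8-8-4
    cheaper = csp_stage(k) < plain_stage(k)
    print(f"k={k}: CSP {csp_stage(k):.2f} vs plain {plain_stage(k):.2f} "
          f"-> CSP cheaper: {cheaper}")
# Only k = 1 fails, which is why the first CSP stage is converted back
# to an original Darknet residual layer.
```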
Image
Neck:
Highlight
In order to effectively reduce the amount of computation, we CSP-ize the PAN [20] architecture in YOLOv4. The computation list of a PAN architecture is illustrated in Figure 2(a). It mainly integrates the features coming from different feature pyramids, and then passes through two sets of reversed Darknet residual layer without shortcut connections. After CSP-ization, the architecture of the new computation list is shown in Figure 2(b). This new update effectively cuts down 40% of computation. (Go to Paper)
SPP:
Highlight
The SPP module was originally inserted in the middle position of the first computation list group of the neck. Therefore, we also inserted SPP module in the middle position of the first computation list group of the CSPPAN. (Go to Paper)
4.2. YOLOv4-tiny
Image
Highlight
We will use the CSPOSANet with PCB architecture to form the backbone of YOLOv4. We set g = b/2 as the growth rate and make it grow to b/2 + kg = 2b at the end. Through calculation, we deduced k = 3, and its architecture is shown in Figure 3. As for the number of channels of each stage and the part of neck, we follow the design of YOLOv3-tiny. (Go to Paper)
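Comment:
The growth-rate arithmetic quoted above pins down k: with g = b/2 and a final channel count of b/2 + kg = 2b, k must be 3. A short check (b = 64 is an arbitrary example):

```python
b = 64                       # any base channel count; 64 is illustrative
g = b // 2                   # growth rate g = b/2
k = (2 * b - b // 2) // g    # solve b/2 + k*g = 2b for k
assert b // 2 + k * g == 2 * b
print(k)                     # 3, matching the paper's deduction
```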
4.3. YOLOv4-large
Highlight
YOLOv4-large is designed for cloud GPU, the main purpose is to achieve high accuracy for object detection. We designed a fully CSP-ized model YOLOv4-P5 and scaling it up to YOLOv4-P6 and YOLOv4-P7. (Go to Paper)
Highlight
Figure 4 shows the structure of YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7. We designed to perform compound scaling on {size^input, #stage}. We set the depth scale of each stage to 2^dsi, and ds to [1, 3, 15, 15, 7, 7, 7]. Finally, we further use inference time as constraint to perform additional width scaling. Our experiments show that YOLOv4-P6 can reach real-time performance at 30 FPS video when the width scaling factor is equal to 1. For YOLOv4-P7, it can reach real-time performance at 16 FPS video when the width scaling factor is equal to 1.25. (Go to Paper)
Comment:
Here the authors introduce the design and scaling strategy of three versions of YOLOv4: YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7. Compound scaling is applied to the model's input size and number of stages together, and the depth of each stage is adjusted exponentially: the depth scaling factor 2^dsi is set differently per stage, giving different computational complexity along the model's hierarchy. In addition, width scaling is adjusted under an inference-time constraint to find the best operating point for real-time video processing. Experiments confirmed that YOLOv4-P6 and YOLOv4-P7 reach real-time video performance at 30 FPS and 16 FPS, respectively, by adjusting the width scaling factor, demonstrating that high-accuracy detection models can still serve real-time applications.
5. Experiments
Highlight
We use MSCOCO 2017 object detection dataset to verify the proposed scaled-YOLOv4. We do not use ImageNet pre-trained models, and all scaled-YOLOv4 models are trained from scratch and the adopted tool is SGD optimizer. The time used for training YOLOv4-tiny is 600 epochs, and that used for training YOLOv4-CSP is 300 epochs. As for YOLOv4-large, we execute 300 epochs first and then followed by using stronger data augmentation method to train 150 epochs. As for the hyper-parameters, such as anchors, learning rate, and the degree of different data augmentation methods, we use k-means and genetic algorithms to determine. All details related to hyper-parameters are elaborated in Appendix. (Go to Paper)
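Comment:
The paper mentions determining anchors with k-means (and other hyper-parameters with genetic algorithms). Below is a generic sketch of k-means anchor clustering on box (width, height) pairs, not the authors' actual script: the random boxes stand in for real COCO annotations, and plain Euclidean distance is used for brevity where YOLO implementations typically use an IoU-based distance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
boxes_wh = rng.uniform(8, 320, size=(5000, 2))  # placeholder (w, h) pairs

kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(boxes_wh)
anchors = sorted(kmeans.cluster_centers_.tolist(),
                 key=lambda wh: wh[0] * wh[1])   # small to large by area
for w, h in anchors:
    print(f"{w:.0f} x {h:.0f}")                  # 9 anchor shapes
```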
Image
5.1. Ablation study on CSP-ized model
Highlight
we will CSP-ize different models and analyze the impact of CSP-ization on the amount of parameters, computations, throughput, and average precision. (Go to Paper)
Highlight
We use Darknet53 (D53) as backbone and choose FPN with SPP (FPNSPP) and PAN with SPP (PANSPP) as necks to design ablation studies. (Go to Paper)
Highlight
We use LeakyReLU (Leaky) and Mish activation function respectively to compare the amount of used parameters, computations, and throughput. Experiments are all conducted on COCO minval dataset and the resulting APs are shown in the last column of Table 8. (Go to Paper)
Highlight
From the data listed in Table 8, it can be seen that the CSP-ized models have greatly reduced the amount of parameters and computations by 32%, and brought improvements in both Batch 8 throughput and AP. (Go to Paper)
Highlight
If one wants to maintain the same frame rate, he/she can add more layers or more advanced activation functions to the models after CSP-ization. (Go to Paper)
Highlight
we can see that both CD53s-CFPNSPP-Mish and CD53s-CPANSPP-Leaky have the same batch 8 throughput with D53-FPNSPP-Leaky, but they respectively have 1% and 1.6% AP improvement with lower computing resources. (Go to Paper)
Highlight
From the above improvement figures, we can see the huge advantages brought by model CSP-ization. Therefore, we decided to use CD53s-CPANSPP-Mish, which results in the highest AP in Table 8 as the backbone of YOLOv4-CSP. (Go to Paper)
5.2. Ablation study on YOLOv4-tiny
Highlight
In this sub-section, we design an experiment to show how flexible can be if one uses CSPNet with partial functions in computational blocks. We also compare with CSPDarknet53, in which we perform linear scaling down on width and depth. The results are shown in Table 9. (Go to Paper)
Image
Highlight
we can see that the designed PCB technique can make the model more flexible, because such a design can be adjusted according to actual needs. (Go to Paper)
Highlight
we also confirmed that linear scaling down does have its limitation. It is apparent that when under limited operating conditions, the residual addition of tinyCD53s becomes the bottleneck of inference speed, because its frame rate is much lower than the COSA architecture with the same amount of computations. (Go to Paper)
Comment:
From the above results, we also confirmed that linear scaling down does have its limits. Under constrained operating conditions, the residual addition of tinyCD53s clearly becomes the bottleneck for inference speed, since its frame rate is much lower than that of the COSA architecture with the same amount of computation.
Highlight
we also see that the proposed COSA can get a higher AP. Therefore, we finally chose COSA-2x2x which received the best speed/accuracy trade-off in our experiment as the YOLOv4-tiny architecture. (Go to Paper)
5.3. Ablation study on YOLOv4-large
Highlight
In Table 10 we show the AP obtained by YOLOv4 models in training from scratch and fine-tune stages. (Go to Paper)
5.4. Scaled-YOLOv4 for object detection
Highlight
We compare with other real-time object detectors, and the results are shown in Table 11. (Go to Paper)
Highlight
We can see that all scaled YOLOv4 models, including YOLOv4-CSP, YOLOv4-P5, YOLOv4-P6, YOLOv4-P7, are Pareto optimal on all indicators. (Go to Paper)
Highlight
When we compare YOLOv4-CSP with the same accuracy of EfficientDet-D3 (47.5% vs 47.5%), the inference speed is 1.9 times. (Go to Paper)
Highlight
When YOLOv4-P5 is compared with EfficientDet-D5 with the same accuracy (51.8% vs 51.5%), the inference speed is 2.9 times. (Go to Paper)
Highlight
The situation is similar to the comparisons between YOLOv4-P6 vs EfficientDet-D7 (54.5% vs 53.7%) and YOLOv4-P7 vs EfficientDet-D7x (55.5% vs 55.1%). In both cases, YOLOv4-P6 and YOLOv4-P7 are, respectively, 3.7 times and 2.5 times faster in terms of inference speed. (Go to Paper)
Highlight
The results of test-time augmentation (TTA) experiments of YOLOv4-large models are shown in Table 12. YOLOv4-P5, YOLOv4-P6, and YOLOv4-P7 gets 1.1%, 0.7%, and 0.5% higher AP, respectively, after TTA is applied. (Go to Paper)
Image
Highlight
We then compare the performance of YOLOv4-tiny with that of other tiny object detectors, and the results are shown in Table 13. It is apparent that YOLOv4-tiny achieves the best performance in comparison with other tiny models. (Go to Paper)
Image
Highlight
It is apparent that YOLOv4-tiny can achieve real-time performance no matter which device is used. If we adopt FP16 and batch size 4 to test Xavier AGX and Xavier NX, the frame rate can reach 380 FPS and 199 FPS respectively. In addition, if one uses TensorRT FP16 to run YOLOv4-tiny on general GPU RTX 2080ti, when the batch size respectively equals to 1 and 4, the respective frame rate can reach 773 FPS and 1774 FPS, which is extremely fast. (Go to Paper)
Image
5.5. Scaled-YOLOv4 as naïve once-for-all model
Highlight
In this sub-section, we design experiments to show that an FPN-like architecture is a naïve once-for-all model. Here we remove some stages of top-down path and detection branch of YOLOv4-P7. YOLOv4-P7\P7 and YOLOv4-P7\P7\P6 represent the model which has removed {P7} and {P7, P6} stages from the trained YOLOv4-P7. Figure 5 shows the AP difference between pruned models and original YOLOv4-P7 with different input resolution. (Go to Paper)
Comment:
This part of the study explores the efficiency of the YOLOv4-P7 model, which uses an FPN-like architecture. In particular, it analyzes experimentally how removing specific stages from the model can optimize object detection performance at different input resolutions.
Image
Highlight
We can find that YOLOv4-P7 has the best AP at high resolution, while YOLOv4-P7\P7 and YOLOv4-P7\P7\P6 have the best AP at middle and low resolution, respectively. This means that we can use sub-nets of FPN-like models to execute the object detection task well. Moreover, we can perform compound scale-down the model architectures and input size of an object detector to get the best performance (Go to Paper)
Comment:
As a result, the YOLOv4-P7 model showed the best performance at high resolution, while the models with certain stages removed performed better at middle and low resolutions. This finding suggests that an object detector's architecture and input size can be adjusted to match a given situation, helping achieve the best speed/accuracy balance under a variety of operating conditions.
6. Conclusions
Highlight
We show that the YOLOv4 object detection neural network based on the CSP approach, scales both up and down and is applicable to small and large networks. (Go to Paper)
Highlight
we achieve the highest accuracy 56.0% AP on test-dev COCO dataset for the model YOLOv4-large, extremely high speed 1774 FPS for the small model YOLOv4-tiny on RTX 2080Ti by using TensorRT-FP16, and optimal speed and accuracy for other YOLOv4 models. (Go to Paper)