# Hardware Design for HEVC-based Adaptive Loop Filter

Ândrio Araújo, Ruhan Conceição, Bruno Zatt, Marcelo Porto, Luciano Agostini Federal University of Pelotas – UFPEL Group of Architecture and Integrated Circuits - GACI St. Gomes Carneiro, 1 – Pelotas – Brazil {afdacampos,radconceicao,zatt,porto,agostini}@inf.ufpel.edu.br

# ABSTRACT

This paper presents a hardware architecture design for an HEVC-based Adaptive Loop Filter (ALF). The ALF is responsible for reducing the distortion between the original and the encoded image by correcting the artifacts introduced by previous encoding stages. The developed hardware architecture targets real-time processing (30 frames per second) of UHD 4K (3840x2160 pixels) videos, with low hardware cost and power consumption. The architecture was implemented targeting two different technologies: a Stratix IV FPGA and an ASIC with TSMC 45nm standard cells. The synthesis results show that the designed hardware architecture is capable of processing UHD 4K videos in real time in both technologies. The ASIC synthesis results show a power consumption of 8.95mW when processing UHD 4K videos in real time.

### **Categories and Subject Descriptors**

B.7.1 [**Integrated Circuits**]: Types and Design Styles – advanced technologies, algorithms implemented in hardware, VLSI (very large scale integration).

#### **General Terms**

Design, Algorithm, Performance, Experimentation.

#### Keywords

Video Coding, Signal Processing, Adaptive Loop Filter, Hardware Design.

# 1. INTRODUCTION & RELATED WORKS

Over the last few years, the quality and resolution of digital videos have increased substantially, demanding the representation of a huge volume of data. Meanwhile, a growing number of devices supporting these digital videos have become available at low cost. As a result, the study and improvement of video encoders has become an essential activity, since the devices that handle digital videos must be able to process high-resolution videos in real time. For this reason, researchers are constantly looking to improve video encoders in terms of compression rate, video quality, complexity and energy consumption.

Modern video encoders incorporate many tools that increase the video compression rate without introducing significant losses in coded video quality. Besides the high compression gains, these tools also increase the encoder/decoder complexity and generate undesirable coding artifacts, such as ringing artifacts and the block effect. Thus, in this process, the subjective quality can be degraded, especially by the quantization stage, which inserts artifacts in the video as a side effect of discarding high frequencies. In this context, filters are inserted to increase the subjective visual quality of the encoded videos. The Deblocking Filter (DF) is an example of a commonly used filter in video encoders, which aims to reduce the "block effect" caused by the block-based encoding process. Moreover, the Adaptive Loop Filtering (ALF) was proposed [1] to reduce the coding error of output and reference pictures (those used as reference during motion estimation), for both boundary (block border) and non-boundary (within block) samples.

Studies based on the High Efficiency Video Coding (HEVC) [2] Reference Software conclude that ALF can achieve a bit rate reduction of 4-5% for High Definition (HD) video sequences when B-predictive frames are allowed, and of 10% when only P-predictive frames are considered [1]. However, the use of ALF increases the decoding time by about 7-14% [1]. Thus, in spite of the coding gains, the added complexity was considered too high and the ALF was removed from the HEVC project in the HM 8.0 [3] version. This decision, however, was not unanimous in the video coding community. Citing the importance of the ALF coding gains, some researchers [4] [5] keep defending the reincorporation of the ALF into the HEVC. For this reason, studies focusing on the ALF are still relevant, since it can be incorporated into an HEVC extension, such as 3D-HEVC [6], or into a new video coding standard.

The high complexity and data-oriented characteristics make the ALF an ideal target for hardware implementation. However, no works targeting ALF hardware designs were found in the literature, except for our previous works presented in [7] and [8]. Although few works targeting HEVC ALF hardware designs are found in the current literature, there are works focusing on other HEVC filters, such as the Sample Adaptive Offset (SAO) and the Deblocking Filter (DF), or on the in-loop filters of previous standards. The work proposed in [9] presents an optimized parallel architecture that implements the DF and the ALF targeting a video decoder. An architecture for the H.264/AVC [12] DF is presented in [10]. Focusing on the implementation of HEVC filters, the work presented in [11] proposed a hardware design for the DF, and Park [13] proposed an architecture for the SAO.

In our previous works [7] and [8], dedicated hardware architectures for the HEVC-based ALF were presented. The work presented in [7] proposes a solution with three different hardware designs, one for each size of the ALF diamond shape (9x9, 7x7 and 5x5). The work in [8] proposes a single multi-size hardware design that covers the three sizes handled in [7], aiming to save hardware resources.

This work proposes a hardware design capable of computing the filtered sample from the filter coefficients and the coded samples following the ALF square shape, proposed for the HEVC standard in version 5.0 [14]. Moreover, the proposed architecture aims to achieve a processing rate high enough to process UHD 4K (3840x2160 pixels) video in real time, i.e. at 30 frames per second.

The paper is organized as follows: Section 2 explains the HEVC Adaptive Loop Filter (ALF). Section 3 presents the hardware design, followed by the synthesis results in Section 4. Finally, Section 5 presents the conclusions and future works.

# 2. HEVC ADAPTIVE LOOP FILTER

A video encoder is generally composed of a sequence of stages, each responsible for a part of the encoding process. Among these stages, the In-Loop Filter is responsible for reducing the distortions introduced by the encoding process. The subjective image quality can be enhanced when the distortion is reduced. Fig. 1 shows a block diagram of the HEVC video encoder with the In-Loop Filter and its main components.

The Deblocking Filter (DF), followed by the Sample Adaptive Offset (SAO) and the Adaptive Loop Filter (ALF), composes the In-Loop Filter.



Fig. 1. HEVC video encoder block diagram

The first filter in the In-Loop Filter is the DF. This filter is similar to the one in the H.264/AVC standard [12]. It aims to reduce the block effect caused by the block-based coding used in HEVC [15]. After the DF, the SAO filter is applied. The SAO filter is an innovation proposed by the HEVC, which aims to reduce the ringing artifacts [16].

The last filter in the In-Loop Filter is the Adaptive Loop Filter (ALF), the focus of this work. The ALF was proposed for HEVC aiming to reduce the mean square error between original samples and decoded samples by using a Wiener-based adaptive filter [1]. The filter is applied to the reconstructed image after the DF and SAO filters. It further reduces the distortion left by the previous coding stages.

In order to apply the ALF, the encoder must first decide which units (blocks, for example) of the frame need to be filtered. To decide which regions will be filtered, a simplified filter (5x5) is applied to all regions. The resulting distortion is then compared with the distortion obtained without the filter. The regions where the distortion has been reduced are marked as ON, and the regions where the distortion has increased are marked as OFF. The full filtering process then applies three filter sizes and compares the results to find the best filter size for the analyzed block. This stage is performed only for the regions marked as ON; the OFF regions (previously identified) are not considered. This procedure describes the 16-pass encoding algorithm used for the ALF [1]. Another algorithm that can be used for the ALF is the One-pass Encoding Algorithm [1]. This simplified algorithm reuses the ON/OFF decision mask from the previously coded frames, since consecutive pictures tend to be very similar.
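The ON/OFF decision described above can be sketched in Python as follows. This is a minimal illustration rather than the reference software implementation: the function names `block_sse` and `on_off_mask`, the use of the sum of squared errors as the distortion measure, the square `size` x `size` regions and the rule that a tie keeps the region OFF are all assumptions made for the example.

```python
def block_sse(ref, test, top, left, size):
    """Sum of squared errors between two frames over one size x size block."""
    return sum(
        (ref[y][x] - test[y][x]) ** 2
        for y in range(top, top + size)
        for x in range(left, left + size)
    )


def on_off_mask(original, reconstructed, filtered, size):
    """Return one flag per block: True (ON) if filtering lowered the distortion."""
    rows = len(original) // size
    cols = len(original[0]) // size
    mask = []
    for r in range(rows):
        flags = []
        for c in range(cols):
            top, left = r * size, c * size
            d_off = block_sse(original, reconstructed, top, left, size)
            d_on = block_sse(original, filtered, top, left, size)
            flags.append(d_on < d_off)  # assumption: a tie keeps the block OFF
        mask.append(flags)
    return mask


# Toy 2x4 frame split into two 2x2 blocks.
original = [[10, 10, 10, 10], [10, 10, 10, 10]]
reconstructed = [[12, 12, 12, 12], [12, 12, 12, 12]]
filtered = [[11, 11, 14, 14], [11, 11, 14, 14]]
print(on_off_mask(original, reconstructed, filtered, size=2))  # [[True, False]]
```

In the toy frame, filtering brings the left block closer to the original (ON) and moves the right block further away (OFF).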

Fig. 2 (a) illustrates the pixel samples that will be used in the filtering process to generate the new value for the sample a''.

Moreover, Fig. 2 (b) shows the coefficients used in the filtering process. The filtering process multiplies each sample by its corresponding coefficient and then adds the results, generating a single sample. Equations 1 and 2 show how the filtered sample a'' is generated considering the 5x5 ALF square shape shown in Fig. 2.





Fig. 2. 5x5 ALF square shape

$$a' = a \cdot C_0 + b \cdot C_1 + c \cdot C_1 + \dots + p \cdot C_8 + q \cdot C_8 \tag{1}$$

$$a'' = \mathrm{Clip}\big(0,\ 255,\ (a' + (1 \ll 7)) \gg 8\big) \tag{2}$$

In (2), the clipping operation keeps the result within the typical image sample bit width (8 bits).
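As a software reference for (1) and (2), the following Python sketch computes one filtered sample. It assumes the 17 samples are given in the order *a* to *q* and that, after the center sample, consecutive pairs share one coefficient (b and c use C1, and so on up to p and q with C8). The spatial position of each letter is defined by Fig. 2 and is not modeled here, and the function name `alf_filter_sample` is hypothetical.

```python
def alf_filter_sample(samples, coeffs):
    """Apply (1) and (2) to one position.

    samples: 17 reconstructed samples in the order a, b, ..., q (8-bit values).
    coeffs:  9 integer coefficients C0..C8, scaled by 256 as implied by the >> 8.
    """
    if len(samples) != 17 or len(coeffs) != 9:
        raise ValueError("expected 17 samples and 9 coefficients")
    acc = samples[0] * coeffs[0]  # center sample a uses C0
    for k in range(1, 9):
        # Assumed pairing: the k-th pair after the center shares Ck.
        acc += samples[2 * k - 1] * coeffs[k]
        acc += samples[2 * k] * coeffs[k]
    rounded = (acc + (1 << 7)) >> 8  # add half an LSB, then drop 8 fractional bits
    return max(0, min(255, rounded))  # Clip(0, 255, .)


flat_area = [50] * 17
smoothing = [128] + [8] * 8  # weights sum to 256 over 17 taps, i.e. unity gain
print(alf_filter_sample(flat_area, smoothing))  # 50
```

With unity-gain coefficients a flat area is left unchanged, which is a quick sanity check for the rounding and shift.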

In the next section, the hardware design implemented for the 5x5 square shape ALF filter is presented.

# 3. HARDWARE DESIGN

This paper proposes a novel hardware architecture for the HEVC-based ALF, considering the square shape filter. The hardware was designed for high throughput, being able to process high-resolution videos in real time. This design is based on the Working Draft 5 [14] and the Test Model HM5 of the HEVC standard.

As shown in Fig. 2 and (1), the filter coefficients are symmetrical about the center and only the central coefficient is not repeated. This observation was used to reduce the required hardware resources. In order to reduce the number of multiplications, the samples that share the same coefficient are first added and the resulting sum is then multiplied by the coefficient. For example, the coefficient $C_8$ multiplies the samples p and q. Instead of computing $C_8$ times p plus $C_8$ times q, p and q are first added and the result is then multiplied by $C_8$. This manipulation reduces the number of multiplications from 17 to 9.
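Assuming that consecutive samples after the center share one coefficient, as the visible terms of (1) suggest, the regrouping can be written out as:

$$
\begin{aligned}
a' &= a\,C_0 + b\,C_1 + c\,C_1 + \dots + p\,C_8 + q\,C_8 && (1 + 2 \cdot 8 = 17 \text{ multiplications}) \\
   &= a\,C_0 + (b + c)\,C_1 + \dots + (p + q)\,C_8 && (1 + 8 = 9 \text{ multiplications})
\end{aligned}
$$

Both forms need 16 additions (eight pair sums plus eight to accumulate the nine products in the second form), so the saving comes entirely from the eight multipliers that are removed.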

Since neither the samples nor the coefficients are known in advance, the multipliers cannot be replaced by shift-and-add operations.

Fig. 3 shows the architecture implemented in this work for the HEVC-based square shape ALF. It shows the input samples (from *a* to *q*) and the input coefficients (from *C*<sub>0</sub> to *C*<sub>8</sub>). The first pipeline stage adds the samples that share the same coefficient. The sample *a* and the coefficient *C*<sub>0</sub> pass through the first pipeline stage unchanged because they are the sample and the coefficient at the center of the square shape.

The architecture was implemented with eight pipeline stages to increase the hardware performance. Thus, the architecture has a latency of eight clock cycles to process one sample and, once the pipeline is full, delivers one filtered sample per clock cycle. Considering an HD1080p (1920x1080 pixels) frame, the architecture takes 2,073,608 clock cycles to perform the calculations (1920x1080 = 2,073,600 samples plus eight cycles of pipeline latency).
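The cycle count above and the clock frequency needed for real-time operation follow from simple arithmetic, sketched below in Python. The helper names are hypothetical, and the sketch assumes one filtered sample per clock cycle once the pipeline is full, with every sample of a single 1920x1080 or 3840x2160 plane being filtered.

```python
PIPELINE_LATENCY = 8  # clock cycles, one per pipeline stage


def cycles_per_frame(width, height, latency=PIPELINE_LATENCY):
    """Cycles to filter every sample of one frame, one sample per cycle."""
    return width * height + latency


def required_mhz(width, height, fps=30):
    """Clock frequency in MHz for real-time filtering, ignoring pipeline fill."""
    return width * height * fps / 1e6


print(cycles_per_frame(1920, 1080))        # 2073608
print(round(required_mhz(1920, 1080), 1))  # 62.2
print(round(required_mhz(3840, 2160), 1))  # 248.8
```

Under these assumptions, 30 frames per second requires roughly 62.2 MHz for HD1080p and 248.8 MHz for UHD 4K.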



Fig. 3. HEVC-based square shape ALF hardware architecture

Altogether, the architecture has 17 adders and 9 multipliers. Eight of these adders are 10-bit wide and the other nine are 20-bit wide. All multipliers included in this architecture are 10-bit wide.

Finally, the clipping is performed, which, as mentioned before, keeps the result within the typical image sample bit width (8 bits). For this purpose, a typical hardware rounding technique is applied.

This work considers 10-bit filter coefficients and 8-bit video samples, both inputs of the proposed architecture. The coefficient width was defined by analyzing the operation of the HEVC reference software.
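The 17 adders and 9 multipliers reported above can be cross-checked with a small behavioral model in Python. The split used here (eight pre-adders, nine multipliers, eight adders to sum the nine products and one adder for the rounding offset) is one plausible reading of the description, not a structure confirmed by Fig. 3. The pairing of samples and the name `alf_datapath` are likewise assumptions, and bit widths and pipeline registers are not modeled.

```python
def alf_datapath(samples, coeffs):
    """Behavioral model: returns (filtered sample, adders used, multipliers used)."""
    adders = multipliers = 0

    # Pre-addition: the center sample passes through, the eight pairs are summed.
    operands = [samples[0]]
    for k in range(1, 9):
        operands.append(samples[2 * k - 1] + samples[2 * k])
        adders += 1

    # One multiplication per coefficient.
    products = []
    for value, coeff in zip(operands, coeffs):
        products.append(value * coeff)
        multipliers += 1

    # Adder tree that reduces the nine products to a single sum.
    level = products
    while len(level) > 1:
        reduced = []
        for i in range(0, len(level) - 1, 2):
            reduced.append(level[i] + level[i + 1])
            adders += 1
        if len(level) % 2:
            reduced.append(level[-1])  # odd element moves to the next level
        level = reduced

    # Rounding (one more adder) followed by clipping to 8 bits.
    rounded = (level[0] + (1 << 7)) >> 8
    adders += 1
    return max(0, min(255, rounded)), adders, multipliers


print(alf_datapath([50] * 17, [128] + [8] * 8))  # (50, 17, 9)
```

The operator counts are independent of the input values, so any valid input reproduces the 17 and 9 figures.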

# 4. SYNTHESIS RESULTS

The proposed architecture was described in VHDL, using the Altera Quartus II software tool, and synthesized targeting an Altera Stratix IV EP4SE530F43C2ES FPGA device. The architecture was also synthesized for an ASIC implementation, with a 45nm TSMC standard-cell technology, using the Synopsys DC-Compiler tool.

Tab. 1 presents the synthesis results obtained for these two technologies. For the FPGA implementation, the hardware consumption is presented as the number of logic elements (Adaptive Look-up Tables, ALUTs), the number of registers and the number of Digital Signal Processing (DSP) blocks (multipliers). Tab. 1 also shows the maximum frequency achieved by the FPGA implementation and the maximum number of UHD 4K frames processed per second by the architecture. The results of the ASIC implementation are given in terms of gate count and total power dissipation (dynamic and leakage). The ASIC power consumption results are shown for the processing of HD1080p and UHD 4K video at 30 frames per second.

As can be seen, in both implementations the architecture is capable of processing UHD 4K videos in real time. Moreover, considering the ASIC implementation, our architecture dissipates only 8.95mW for UHD 4K videos and 5.16mW for HD1080p videos. The performance results are calculated considering the worst case, in which all picture samples must be filtered.

Table 1. Synthesis and performance results for FPGA and ASIC implementations

| FPGA Parameter            | Value | ASIC Target | ASIC Parameter   | Value |
|---------------------------|-------|-------------|------------------|-------|
| Logic Elements (ALUTs)    | 107   | -           | Gate Count       | 7,402 |
| Total Registers           | 415   | HD 1080p    | Frequency (MHz)  | 62.2  |
| DSP Blocks (Multipliers)  | 10    | HD 1080p    | Total Power (mW) | 5.16  |
| Max Frequency (MHz)       | 397.6 | UHD 4K      | Frequency (MHz)  | 248.8 |
| UHD 4K fps                | 43    | UHD 4K      | Total Power (mW) | 8.95  |

Tab. 2 presents a comparison between this work and the architectures presented in [8] and [9]. It is important to mention that this work aims to complement the previous work presented in [8], and does not focus on exactly the same subject as [9]. Moreover, as mentioned before, to the best of our knowledge, there is no related work presenting a hardware design for the square shape ALF. Nevertheless, the comparison is still important to show that our solution presents competitive results in terms of hardware resource consumption and processing rate.

Considering the performance results, all works (this, [8] and [9]) are capable of processing HD1080p videos in real time, while only this work and [8] are capable of processing UHD 4K videos.

Considering the synthesis results, this design uses fewer logic elements than [8] and [9]. It is important to highlight that the three architectures were synthesized targeting three different FPGA devices. Furthermore, this work uses fewer DSP blocks, which represent the embedded multipliers, than [8]. Finally, it is important to highlight that [9] also implements the DF in its design, which explains the high hardware consumption of its architecture.

The work [10] presents a hardware architecture for the DF of the H.264/AVC standard. The architecture was synthesized with UMC 180nm technology. It uses 14.8K gates and processes 30 HD1080p frames per second. No power consumption results were presented.

The work in [11] presents a hardware design for the HEVC DF. It was synthesized for FPGA and ASIC technologies. This architecture uses 16.4K gates and is capable of processing 30 or 86 HD1080p frames per second (depending on the number of parallel datapaths used). The total power consumption of its implementations ranges from 55.99mW to 58.43mW.

In [13], the hardware design was implemented in TSMC 180nm technology and is able to process UHD 4K videos in real time. The architecture uses 30.7K gates and achieves a maximum frequency of 250MHz. No power consumption results were presented.

| Parameter           | This Work  | [8]       | [9]         |
|---------------------|------------|-----------|-------------|
| FPGA Device Family  | Stratix IV | Stratix V | Virtex-5    |
| Logic Elements      | 107 ALUTs  | 436 ALMs  | 15,561 LUTs |
| Total Registers     | 415        | 1,028     | -           |
| DSP Blocks          | 10         | 11        | -           |
| Max Frequency (MHz) | 397.61     | 279       | 211         |
| HD1080p fps         | 172        | 132       | 30          |
| UHD 4K fps          | 43         | 33        | -           |
Table 2. Comparison with related works

As can be seen, the hardware architecture presented in this work achieved comparable results for performance and power consumption when compared to other works in the literature on filters used in video encoders/decoders.

# 5. CONCLUSION

This work presented a hardware design for the HEVC-based ALF square shape, focusing on real-time processing of high-definition videos. The ALF introduces a significant improvement in the subjective quality of coded videos, and its implementation complexity calls for a dedicated hardware design.

The architecture was developed targeting two different technologies: FPGA and 45nm ASIC. The synthesis results showed that our architecture is able to process 43 UHD 4K frames per second on the FPGA. Considering the ASIC design, the developed architecture is able to process 30 frames per second for UHD 4K and HD 1080p videos with 8.95mW and 5.16mW, respectively.

The achieved results met the goals of this work and, when compared to related works, our design presented competitive results.

As future work, an implementation of the complete ALF structure, including the coefficient generation, is planned. Moreover, we intend to implement the whole In-Loop Filter of the HEVC.

# 6. ACKNOWLEDGMENTS

Our thanks to CNPq and FAPERGS for the financial support, allowing the development of this work.

# 7. REFERENCES

- [1] C. Tsai, et al., "Adaptive Loop Filtering for Video Coding", IEEE Journal of Selected Topics in Signal Processing, vol. PP, July 2013.
- [2] JCT-VC Editors, Recommendation ITU-T H.265 High Efficiency Video Coding (ITU-T Rec. H.265), April 2013.
- [3] Il-Koo Kim, et al., HM8: High Efficiency Video Coding (HEVC) Test Model 8 Encoder Description. JCTVC-J1002. 10th JCT-VC Meeting. Stockholm, 2012.
- [4] I. Chong, M. Karczewicz, AHG6: ALF in HM80. JCTVC-K0273. 11th JCT-VC Meeting. Shanghai, 2012.
- [5] C. Chen, et al., AHG6: Further cleanups and simplifications of the ALF in JCTVC-J0048. JCTVC-J0390. 10th JCT-VC Meeting. Stockholm, 2012.
- [6] K. Muller, et al., "3D High-Efficiency Video Coding for Multi-View Video and Depth Data", IEEE Transactions on Image Processing, 2013.
- [7] F. Rediess, et al., "High Throughput Hardware Design for the Adaptive Loop Filter of the Emerging HEVC Video Coding", in 25th Symposium on Integrated Circuits and Systems Design (SBCCI), Brasília, Brazil, 2012.
- [8] R. Conceição, et al., "Configurable Hardware Design for the HEVC-Based Adaptive Loop Filter", in 4th Latin American Symposium on Circuits and Systems (LASCAS), Santiago, Chile, 2014.
- [9] J. Du, "A parallel and area-efficient architecture for deblocking filter and Adaptive Loop Filter", in IEEE International Symposium on Circuits and Systems (ISCAS), Rio de Janeiro, Brazil, 2011.
- [10] H. Lin, et al., "Efficient Deblocking Filter Architecture for H.264 Video Coders", in IEEE International Symposium on Circuits and Systems (ISCAS), Island of Kos, 2006.
- [11] E. Ozcan, Y. Adibelli, I. Hamzaoglu, "A high performance deblocking filter hardware for high efficiency video coding", IEEE Transactions on Consumer Electronics, vol. 59, issue 3, August 2013.
- [12] International Telecommunication Union, "ITU-T Recommendation H.264/AVC (03/05): advanced video coding for generic audiovisual services", 2005.
- [13] S. Park, K. Ryoo, "The Hardware Design of Effective SAO for HEVC Decoder", in 2<sup>nd</sup> Global Conference on Consumer Electronics (GCCE), Tokyo, Japan, 2013.
- [14] T. Wiegand, et al., WD5: Working Draft 5 of High-Efficiency Video Coding. JCTVC-G001. 7th JCT-VC Meeting. Geneva, 2011.
- [15] G. J. Sullivan, et al., "Overview of the High Efficiency Video Coding (HEVC) standard", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, Dec. 2012.
- [16] A. Alshin, et al., "Sample Adaptive Offset Design in HEVC", Data Compression Conference (DCC), 2013.