Geometric Lightguide for Near-eye Light Field Display


Most near-eye displays with one fixed focal plane suffer from the vergence-accommodation conflict and cause visual discomfort to users. In contrast, light field displays can provide natural and comfortable 3D visual sensation to users without the conflict. This research presents a near-eye light field display consisting of a geometric lightguide and a light field generator, along with a collimator to ensure the light rays propagating in the lightguide are collimated. Unlike most lightguides that reduce thickness by employing total internal reflection, which can easily generate stray light, our lightguide directly propagates light rays without total internal reflection. The partially reflective mirrors of the lightguide expand the exit pupil to achieve an eyebox of 13 mm (horizontal) by 6.5 mm (vertical) at an eye relief of 18 mm. The collimator and the light field generator, both having effective focal lengths different in the horizontal and vertical directions, are designed to provide a 40-degree diagonal field of view. The working range of the light field generator, which is 30 cm to infinity, is verified qualitatively and quantitatively by experiments. We optimize the illuminance uniformity and analyze the illumination variation across the eyebox. Furthermore, we minimize the ghost artifact (referring to the split-up of light fields replicated by the partially reflective mirrors) by orienting the partially reflective mirrors at slightly different angles to enhance the image quality for short-range applications such as medical surgery.

Light Field Conversion from Stereo Images


We propose a conversion pipeline that converts stereo images to light fields for a near-eye light field display. We introduce the notion of display-oriented view synthesis and distinguish it from conventional camera-oriented view synthesis. By taking the image formation process of the near-eye light field display into consideration, we compute novel viewpoints according to the micro-projector baseline and extrapolate novel subviews at these viewpoints accordingly. In addition, by pre-warping and pre-shifting each subview, we digitally compensate for the optical artifact of the light field display. Through these operations, we optimize the visual quality of the synthesized light field in accordance with the hardware specification of the light field display. To reduce the computational complexity, we downsample and reuse image features and avoid computationally heavy 3-D convolution operations in disparity estimation. The loss of image quality from the low-resolution disparity map is mitigated by residual disparity refinement and distance-based view blending. Overall, the proposed system generates light fields of 512×512 spatial resolution at 26–30 fps. With nearly real-time speed and an extraordinarily small parameter count of 34,000, our approach is well-poised for real-world deployment.
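
As a rough illustration of the display-oriented view synthesis step, the sketch below forward-warps a source view by its disparity map to a novel viewpoint and blends candidate warps with distance-based weights. It is a minimal numpy sketch under simplifying assumptions (integer pixel shifts, no occlusion handling); the function names and the baseline_ratio parameter (ratio of the novel-view baseline to the source baseline) are illustrative, not the actual pipeline.

    import numpy as np

    def warp_by_disparity(image, disparity, baseline_ratio):
        """Forward-warp a source view: shift each pixel horizontally by
        baseline_ratio * disparity (integer shifts, no occlusion handling)."""
        h, w = disparity.shape
        out = np.zeros_like(image)
        xs = np.arange(w)
        for y in range(h):
            x_new = np.clip(np.round(xs + baseline_ratio * disparity[y]).astype(int), 0, w - 1)
            out[y, x_new] = image[y, xs]
        return out

    def blend_views(candidates, distances, eps=1e-6):
        """Distance-based view blending: nearer source views get larger weights."""
        weights = np.array([1.0 / (d + eps) for d in distances])
        weights /= weights.sum()
        return sum(w * c for w, c in zip(weights, candidates))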

Basal Cell Carcinoma Segmentation from Full-Field OCT Images


Semantic segmentation of basal cell carcinoma (BCC) from full-field optical coherence tomography (FF-OCT) images of human skin has received considerable attention in medical imaging. However, it is challenging for dermatopathologists to annotate the training data due to OCT’s lack of color specificity. Very often, they are uncertain about the correctness of the annotations they make. In practice, annotations fraught with uncertainty profoundly impact the effectiveness of model training and hence the performance of BCC segmentation. To address this issue, we propose an approach to model training with uncertain annotations. The proposed approach includes a data selection strategy to mitigate the uncertainty of the training data, a class expansion that treats the sebaceous gland and hair follicle as additional classes to enhance the performance of BCC segmentation, and a self-supervised pre-training procedure to improve the initial weights of the segmentation model. Furthermore, we develop three post-processing techniques to reduce the impact of speckle noise and image discontinuities on BCC segmentation. The mean Dice score of BCC of our model reaches 0.503±0.003, which, to the best of our knowledge, is the best performance to date for semantic segmentation of BCC from FF-OCT images.
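
For concreteness, the sketch below shows the Dice score used to report BCC segmentation performance and a simple uncertainty-based data selection rule; the per-sample uncertainty scores and the threshold are illustrative assumptions, not the exact selection strategy of the paper.

    import numpy as np

    def dice_score(pred_mask, gt_mask, eps=1e-7):
        """Dice = 2|A n B| / (|A| + |B|) for binary masks."""
        inter = np.logical_and(pred_mask, gt_mask).sum()
        return (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

    def select_confident_samples(samples, uncertainty, threshold=0.3):
        """Keep training samples whose annotation uncertainty is below a
        threshold (both the scores and the threshold are assumptions here)."""
        return [s for s, u in zip(samples, uncertainty) if u < threshold]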

Selected Publications:
[1]
L.-W. Fu, C.-H. Liu, M. Jain, C.-S. Chen, Y.-H. Wu, S.-L. Huang, and H. H. Chen, “Training with Uncertain Annotations for Semantic Segmentation of Basal Cell Carcinoma from Full-Field OCT Images,” in IEEE Transactions on Medical Imaging, in revision stage after the first-round review.

Phase Detection Autofocus


A phase detection autofocus (PDAF) algorithm iteratively estimates the phase shift between the left and right phase images captured during an autofocus process and uses it to determine the lens movement until the estimated in-focus lens position is reached. Phase detectors have been embedded in image sensors to improve autofocus performance; however, the phase shift estimation between the left and right phase images is sensitive to noise. Moreover, PDAF problems have often been treated as equivalent to stereo matching problems. In this project, we first argue that PDAF and stereo matching are two different problems and provide insights into the distinctions between phase images and stereo images from the autofocus perspective. Then, to address the noise sensitivity issue, we present a robust model, called AF-Net, based on a convolutional neural network. The final lens position error of our model is five times smaller than that of a state-of-the-art statistical PDAF method. Furthermore, the model works consistently well for all initial lens positions.
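
For reference, the sketch below illustrates a classical way to estimate the phase shift between the left and right phase images by searching for the horizontal shift that maximizes their correlation; AF-Net replaces this hand-crafted estimator with a convolutional network, so this is only a statistical baseline under simplified assumptions (a single global, integer-pixel shift).

    import numpy as np

    def estimate_phase_shift(left, right, max_shift=32):
        """Return the integer horizontal shift that best aligns right to left."""
        left = (left - left.mean()) / (left.std() + 1e-8)
        right = (right - right.mean()) / (right.std() + 1e-8)
        best_shift, best_score = 0, -np.inf
        for s in range(-max_shift, max_shift + 1):
            if s >= 0:
                a, b = left[:, s:], right[:, :right.shape[1] - s]
            else:
                a, b = left[:, :s], right[:, -s:]
            score = (a * b).mean()                 # normalized cross-correlation
            if score > best_score:
                best_shift, best_score = s, score
        return best_shift

The estimated shift is then mapped to a lens movement through a sensor-specific calibration.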

Selected Publications:
[1]
C.-J. Ho, C.-C. Chan, and H. H. Chen, “AF-Net: A Convolutional Neural Network Approach to Phase Detection Autofocus,” in IEEE Transactions on Image Processing, vol. 29, pp. 6386–6395, 2020, doi: 10.1109/TIP.2019.294734
[2]
C.-J. Ho and H. H. Chen, “On the Distinction between Phase Images and Two-View Light Field for PDAF of Mobile Imaging,” in Electronic Imaging, 2020, doi: https://doi.org/10.2352/ISSN.2470-1173.2020.14.COIMG-39

Deep Face Framework for Illumination Invariant Face Recognition


The performance of many state-of-the-art deep face recognition models deteriorates significantly for images captured under low illumination, mainly because the features of dim probe face images cannot match well with those of normal-illumination gallery images. We propose a novel deep face recognition framework to address this issue. The framework consists of a feature restoration network, a feature extraction network, an embedding refinement module, and an embedding matching module. The feature restoration network adopts a two-branch structure based on a convolutional neural network to generate a feature image from the raw image and the illumination-enhanced image. The feature extraction network encodes the feature image into an embedding, which is then made more discriminative by the embedding refinement module and used by the embedding matching module for face verification and identification. The overall verification accuracy is improved by 3.1% to 9.1% when tested on the Specs on Faces (SoF) dataset. For face identification, the rank-1 identification accuracy is improved by 3.7%.
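
The embedding matching module boils down to similarity search in the embedding space. The minimal sketch below shows cosine-similarity verification and rank-1 identification; the decision threshold is an illustrative assumption.

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def verify(probe_emb, gallery_emb, threshold=0.5):
        """Declare the same identity if similarity exceeds a threshold (an assumption)."""
        return cosine_similarity(probe_emb, gallery_emb) > threshold

    def identify(probe_emb, gallery_embs, gallery_ids):
        """Rank-1 identification: return the identity of the most similar gallery embedding."""
        scores = [cosine_similarity(probe_emb, g) for g in gallery_embs]
        return gallery_ids[int(np.argmax(scores))]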

Selected Publications:
[1]
Y.-H. Huang and H. H. Chen, “Face recognition under low illumination via deep feature reconstruction network,” accepted for presentation in Proc. IEEE International Conference on Image Processing, Oct. 2020.
[2]
Y.-H. Huang and H. H. Chen, “Deep face recognition for dim images,” submitted to IEEE Transactions on Image Processing, Apr. 2020.

H&E-Like Staining of OCT Images of Human Skin via Generative Adversarial Network


Non-invasive, high-speed optical coherence tomography (OCT) has been deployed for clinical use. However, the specificity provided by hematoxylin and eosin (H&E) staining is unavailable from grey-level OCT images, making them difficult for pathologists to read. We present an OCT2HE image translation model that converts OCT images to H&E-like stained images using unpaired OCT and H&E data for training. Specifically, pre-trained segmentation models for the dermal-epidermal junction (DEJ) and the stratum corneum (SC) are exploited to enhance the performance of anatomical image translation and reduce the DEJ and SC lower boundary errors to ±2.3 μm and ±1.7 μm, respectively. The feature map of the nuclei is extracted by a pre-trained VGG16 network. The Pearson correlation coefficient of the nuclei location and size consistency is 84%±1%.
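
The sketch below shows one plausible way the pre-trained DEJ/SC segmentation models could constrain an unpaired translation model: a boundary-consistency penalty is added to the generator's adversarial objective. The function names (generator, discriminator, segment_dej) and the loss weighting are placeholders for illustration only, not the actual OCT2HE implementation.

    import numpy as np

    def boundary_error(mask_a, mask_b):
        """Mean absolute row difference of the lower boundaries of two binary masks."""
        def lower_boundary(mask):
            rows_from_bottom = np.argmax(mask[::-1], axis=0)   # first True from the bottom
            return mask.shape[0] - 1 - rows_from_bottom
        return np.abs(lower_boundary(mask_a) - lower_boundary(mask_b)).mean()

    def translation_loss(oct_img, generator, discriminator, segment_dej, lam=1.0):
        """Generator objective = adversarial term + boundary-consistency term."""
        fake_he = generator(oct_img)                     # OCT -> H&E-like image
        adv = -np.log(discriminator(fake_he) + 1e-8)     # discriminator outputs a probability
        seg = boundary_error(segment_dej(oct_img), segment_dej(fake_he))
        return adv + lam * seg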

Selected Publications:
[1]
S.-T. Tsai, C.-C. Chan, Y.-H. Li, S.-L. Huang, and H. H. Chen, “H&E-Like Staining of OCT Images of Human Skin via Generative Adversarial Network,” submitted to IEEE Trans. on Image Process., Aug. 2020.

Deep Face Rectification for 360° Dual-Fisheye Cameras


Fisheye distortion is an important issue in many imaging tasks. We present a method to combat the effect of fisheye image distortion on face recognition. The method consists of a classification network and a restoration network. The classification network classifies an input fisheye image according to its distortion level. The restoration network takes a distorted image as input and restores the rectilinear geometric structure of the face. The performance of the proposed method is tested on an end-to-end face recognition system constructed by integrating the proposed rectification method with a conventional rectilinear face recognition system. The face verification accuracy of the integrated system is 99.18% when tested on a real image dataset, resulting in an average accuracy improvement of 6.57% over the conventional face recognition system. For face identification, the average improvement over the conventional face recognition system is 4.51%.

Selected Publications:
[1]
Y.-H. Li, I.-C. Lo, and H. H. Chen, “Deep Face Rectification for 360° Dual-Fisheye Cameras,” accepted by IEEE Trans. on Image Process., Aug. 2020.

Training Deep Auto-Tagging Models using Context-based Cost-Sensitive Tag Propagation


The advances of deep learning have led to an increasing demand for training data of desirable quality. Inspired by the abundance of contextual information in music streaming services, we develop a context-based tag propagation method to reduce the training noise. Specifically, the training tags are propagated between songs sharing the same contextual element (e.g., an artist) and between nearby songs in the same playlist. In our first attempt, the playlist-based method successfully improved a neural network-based model (SampleCNN) by 25.8% in MAP [1]. However, certain models suffered degraded performance due to the propagation noise. To properly handle the noise, we further incorporate cost-sensitive learning; as a result, the improvement extends to two other networks (CRNN and 2D-CNN), with gains of 6.5%, 4.9%, and 5.5% for SampleCNN, CRNN, and 2D-CNN, respectively.
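
The playlist-based propagation step of [1] can be sketched as below: a song inherits a tag held by nearby songs in the same playlist, with a reduced confidence that can later serve as a per-label weight in a cost-sensitive loss. The window size and confidence value are illustrative assumptions.

    from collections import defaultdict

    def propagate_tags(playlists, song_tags, window=2, weight=0.5):
        """playlists: {playlist_id: [song_id, ...]}; song_tags: {song_id: set(tags)}.
        Returns {song_id: {tag: confidence}} with original tags kept at confidence 1.0."""
        propagated = defaultdict(dict)
        for song, tags in song_tags.items():
            for t in tags:
                propagated[song][t] = 1.0
        for songs in playlists.values():
            for i, song in enumerate(songs):
                for j in range(max(0, i - window), min(len(songs), i + window + 1)):
                    if j == i:
                        continue
                    for t in song_tags.get(songs[j], ()):
                        propagated[song][t] = max(propagated[song].get(t, 0.0), weight)
        return propagated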

Selected Publications:
[1]
Y.-H. Lin, C.-H. Chung, and H. H. Chen, “Playlist-Based Tag Propagation for Improving Music Auto-Tagging,” in Proc. EUSIPCO, Rome, Italy, pp. 2270–2274, 2018.

Cross-Cultural Music Emotion Recognition by Adversarial Discriminative Domain Adaptation


A music emotion recognizer trained on Western pop song datasets may not work well for non-Western pop songs because of cultural differences in acoustic characteristics and emotion perception. This problem has been observed in many cross-cultural and cross-dataset studies; however, little has been done on adapting a model pre-trained on a source music domain to a target music domain of interest. We propose to address the problem with an unsupervised adversarial domain adaptation method. It employs neural network models to make the target music indistinguishable from the source music in a learned feature representation space. The results show that the proposed method effectively improves the prediction of the valence of Chinese pop songs by a model trained on Western pop songs.
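
A minimal PyTorch sketch of the adversarial adaptation idea is given below: a domain discriminator is trained to tell source features from target features, while the target encoder is updated to fool it. The network sizes, the 40-dimensional input features, and the learning rates are illustrative assumptions, not the actual models used.

    import torch
    import torch.nn as nn

    feat_dim = 128
    source_encoder = nn.Sequential(nn.Linear(40, feat_dim), nn.ReLU())   # frozen, pre-trained on source
    target_encoder = nn.Sequential(nn.Linear(40, feat_dim), nn.ReLU())
    discriminator = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    bce = nn.BCEWithLogitsLoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    opt_t = torch.optim.Adam(target_encoder.parameters(), lr=1e-4)

    def adaptation_step(source_x, target_x):
        # 1) Train the discriminator to separate source (label 1) from target (label 0).
        with torch.no_grad():
            f_s = source_encoder(source_x)
            f_t = target_encoder(target_x)
        d_loss = bce(discriminator(f_s), torch.ones(len(f_s), 1)) + \
                 bce(discriminator(f_t), torch.zeros(len(f_t), 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # 2) Train the target encoder to fool the discriminator (label 1 for target).
        g_loss = bce(discriminator(target_encoder(target_x)), torch.ones(len(target_x), 1))
        opt_t.zero_grad(); g_loss.backward(); opt_t.step()
        return float(d_loss), float(g_loss)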

Selected Publications:
[1]
Y.-W. Chen, Y.-H. Yang, and H. H. Chen, “Cross-Cultural Music Emotion Recognition by Adversarial Discriminative Domain Adaptation,” in Proc. ICML, Orlando, FL, pp. 467–472, 2018.

Blood Vessel Extraction using Short-Time RPCA


Recent advances in optical coherence tomography (OCT) have led to the development of OCT angiography, which provides additional helpful information for the diagnosis of diseases such as basal cell carcinoma. We use the robust principal component analysis (RPCA) technique to extract blood vessels of human skin from full-field OCT data. Specifically, we propose a short-time RPCA method that divides the full-field OCT data into segments and decomposes each segment into a low-rank structure representing the relatively static tissues of human skin and a sparse matrix representing the blood vessels. The method mitigates the problem associated with the slowly varying background and is free of the detection error that RPCA may have when dealing with full-field OCT data. Experimental results show that the proposed method works equally well for full-field OCT volumes of different quality. The average accuracy improvements over the correlation-mapping OCT method and the amplitude-decorrelation OCT angiography method are 18.35% and 10.32%, respectively.
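
The core decomposition can be sketched as below: the frame stack is split into short segments, each segment is rearranged into a pixels-by-time matrix, and principal component pursuit separates it into a low-rank part (static tissue) and a sparse part (vessels). The segment length and the inexact-ALM parameters are illustrative defaults, not the tuned values of the paper.

    import numpy as np

    def rpca(M, lam=None, mu=None, n_iter=100):
        """Small inexact-ALM RPCA: M ~ L (low-rank) + S (sparse)."""
        m, n = M.shape
        lam = lam or 1.0 / np.sqrt(max(m, n))
        mu = mu or 0.25 * m * n / (np.abs(M).sum() + 1e-8)
        L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
        shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
        for _ in range(n_iter):
            U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
            L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt      # singular value thresholding
            S = shrink(M - L + Y / mu, lam / mu)             # soft thresholding
            Y = Y + mu * (M - L - S)
        return L, S

    def short_time_rpca(frames, segment_len=20):
        """frames: (T, H, W) full-field OCT stack; returns the sparse (vessel) stack."""
        T, H, W = frames.shape
        vessels = np.zeros(frames.shape, dtype=float)
        for t0 in range(0, T, segment_len):
            seg = frames[t0:t0 + segment_len].astype(float).reshape(-1, H * W).T   # pixels x time
            _, S = rpca(seg)
            vessels[t0:t0 + segment_len] = S.T.reshape(-1, H, W)
        return vessels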

Selected Publications:
[1]
P.-H. Lee, C.-C. Chan, S.-L. Huang, A. Chen, and H. H. Chen, “Blood vessel extraction from OCT data by short-time RPCA,” IEEE Int. Conf. Image Process., pp. 394–398, 2016.

Musical Meter Estimation using EEG Signals


Musical meter is an important element of music. Based on the fact that EEG signals resonate at the beat frequency of a music stimulus and its subharmonics, we develop an approach to classifying musical meter from the EEG signals recorded from music listeners. We first apply independent component analysis (ICA), averaging techniques, and spatial filtering to improve the signal-to-noise ratio of the EEG signals. Then, we analyze the ratios of the spectral peak frequencies of the EEG signals to determine the beat frequency of the music stimulus. Finally, we obtain the musical meter by comparing the magnitudes of the subharmonics of the beat frequency.
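
As a simplified illustration of the final step, the sketch below compares the EEG spectral magnitudes at the beat-frequency subharmonics f/2 and f/3 to decide between duple and triple meter. The decision rule and the assumption that the beat frequency is already known are simplifications of the actual procedure.

    import numpy as np

    def meter_from_eeg(eeg, fs, beat_freq):
        """eeg: 1-D preprocessed EEG signal; fs: sampling rate (Hz); beat_freq: Hz."""
        spectrum = np.abs(np.fft.rfft(eeg))
        freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
        def mag_at(f):
            return spectrum[np.argmin(np.abs(freqs - f))]    # magnitude at the nearest bin
        return "duple" if mag_at(beat_freq / 2) > mag_at(beat_freq / 3) else "triple"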

Selected Publications:
[1]
C.-H. Tsai, “Musical meter classification using EEG signals,” Master Thesis, Natl. Taiwan Univ., Mar. 2017.

WRGB Color Filter Array Demosaicking


[paper]

The presence of spectral mismatch between the components of a WRGB color filter array severely affects the performance of demosaicking. We present a novel method that compensates for the spectral mismatch and greatly enhances the accuracy of R, G, and B interpolation. The method is tested on a two-megapixel WRGB CMOS image sensor. The results show that the proposed method brings the performance of WRGB demosaicking to an unprecedented level competitive with that of the state-of-the-art Bayer demosaicking.

Selected Publications:
[1]
P.-H. Su, P.-C. Chen, and H. H. Chen, “Compensation of spectral mismatch to enhance WRGB demosaicking,” IEEE Int. Conf. Image Process. 2015, Quebec City, Canada, Sept. 2015.

Highlights Extraction


[paper]

Emotion-based highlights extraction is useful for retrieval and automatic trailer generation of drama video because the emotionally rich part of a drama video is often the center of attraction to the viewer. We formulate highlights extraction as a regression problem to extract highlight segments and to predict how strongly the viewer’s emotion would be evoked by the video segments. Unlike conventional rule-based approaches that rely on heuristics, the proposed system determines the relation between drama highlights and audiovisual features by machine learning. We also examine the special characteristics of drama video and propose human face, music emotion, shot duration, and motion magnitude as feature sets for highlights extraction.

Selected Publications:
[1]
K.-S. Lin, A. Lee, Y.-H. Yang, C.-T. Lee, and H. H. Chen, “Automatic highlights extraction for drama video using music emotion and human face features,” Proc. IEEE Int. Workshop on Multimedia Signal Processing 2011, Hangzhou, China, (Top 10% Paper Award), Oct. 2011
[2]
K.-S. Lin, A. Lee, Y.-H. Yang, and H. H. Chen, “Rule-based automatic highlights extraction for drama video using music emotion and human face features,” Neurocomputing, (accepted), 2012

Piano Music Transcription


Pitch, together with other midlevel music features such as rhythm and timbre, holds the promise of bridging the semantic gap between low-level features and high-level semantics for music understanding. We investigate the pitch estimation of a piano music signal by exemplar-based sparse representation. A note exemplar is a segment of a piano note, stored in the dictionary. We first describe how to represent a segment of the piano music signal as a linear combination of a small number of note exemplars from a large note exemplar dictionary and then show how the sparse representation problem can be solved by regularized minimization. Unlike previous approaches, the proposed approach does not require retraining for a new piano. Instead, only a dozen notes of the new piano are needed. This feature is computationally attractive and avoids intense manual labeling.
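
The sparse representation step can be sketched as below: a magnitude-spectrum segment is approximated by a non-negative, l1-regularized combination of note exemplars, and the active notes are read off from the nonzero coefficients. The regularization weight and activation threshold are illustrative assumptions, and scikit-learn's Lasso stands in for the regularized minimization described above.

    import numpy as np
    from sklearn.linear_model import Lasso

    def transcribe_segment(spectrum, dictionary, note_labels, alpha=0.1, thresh=0.05):
        """spectrum: (F,) magnitude spectrum; dictionary: (F, K) note exemplars;
        note_labels: length-K list mapping each exemplar to a piano note."""
        model = Lasso(alpha=alpha, positive=True, max_iter=5000)
        model.fit(dictionary, spectrum)
        active = {}
        for label, c in zip(note_labels, model.coef_):
            if c > thresh:
                active[label] = active.get(label, 0.0) + c   # pool multiple exemplars per note
        return active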

Selected Publications:
[1]
C.-T. Lee, Y.-H. Yang, and H. H. Chen, “Automatic transcription of piano music by sparse representation of magnitude spectra,” IEEE Int. Conf. Multimedia Expo., Barcelona, Spain, Jul. 2011
[2]
C.-T. Lee, Y.-H. Yang, and H. H. Chen, “Multipitch estimation of piano music by exemplar-based sparse representation,” IEEE Trans. Multimedia, vol. 14, no. 3, pp. 608-618, Jun. 2012

Automatic Accompaniment Generation


We present a system to automatically generate accompaniment that evokes specific emotions for a given melody. In particular, we propose harmony progression and onset rate as two key features for emotion-based accompaniment generation. The former refers to the progression of chords, and the latter refers to the number of music events (such as notes and drums) in a unit time. The harmony progression and the onset rate are altered according to the specified emotion expressed by the valence and arousal parameters, respectively.

Programmable Aperture Photography



[youtube]

We present a system including a novel component called a programmable aperture and two associated post-processing algorithms for high-quality light field acquisition. The shape of the programmable aperture can be adjusted and used to capture the light field at full sensor resolution through multiple exposures without any additional optics and without moving the camera. High acquisition efficiency is achieved by employing an optimal multiplexing scheme, and high-quality data is obtained by using the two post-processing algorithms designed for self-calibration of photometric distortion and for multi-view depth estimation. The view-dependent depth maps thus generated help boost the angular resolution of the light field. Various post-exposure photographic effects are presented to demonstrate the effectiveness of the system and the quality of the captured light field.
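
The multiplexed acquisition can be sketched as a linear inverse problem: each exposure records the sum of the angular samples admitted by one aperture pattern, so the light field is recovered per pixel by inverting the multiplexing matrix. The pseudo-inverse below is a minimal stand-in; the actual system uses an optimized multiplexing scheme and photometric self-calibration.

    import numpy as np

    def demultiplex(captures, patterns):
        """captures: (M, H, W) multiplexed exposures; patterns: (M, N) 0/1 aperture
        patterns over N angular samples. Returns the (N, H, W) light field."""
        n_exposures, h, w = captures.shape
        pinv = np.linalg.pinv(patterns)               # (N, M) demultiplexing matrix
        stack = captures.reshape(n_exposures, -1)     # (M, H*W)
        return np.tensordot(pinv, stack, axes=1).reshape(-1, h, w)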

Selected Publications:
[1]
C.-K. Liang, T.-H. Lin, B.-Y. Weng, C. Liu, and H. H. Chen, “Programmable aperture photography: Multiplexed light field acquisition,” ACM Trans. Graph. (Proc. SIGGRAPH 2008), vol. 27, no. 3, 55:1-55:10, Aug. 2008
[2]
C.-K. Liang, G. Liu, H. H. Chen, “Light field acquisition using programmable aperture camera,” IEEE Int. Conf. Image Proc., 233-236, San Antonio, TX, Sept. 2007

Image Enhancement for Mobile Devices


[project page]

Reducing the LCD backlight saves power on a portable device, but it also decreases the contrast and brightness of the displayed images. Previous approaches adjust the backlight level frame by frame to reach a specified image quality level but do not optimize it. In contrast, the proposed method adjusts the backlight to meet a target power level while maintaining image quality. This is achieved by applying brightness compensation and local contrast enhancement according to the given backlight level. Experimental results show that the proposed algorithm outperforms previous methods.
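
A minimal sketch of the compensation idea, assuming a grayscale image in [0, 1] and a known backlight ratio: pixel values are boosted to offset the dimmer backlight, then a fraction of the high-pass detail is added back as a simple local contrast enhancement. The gains and the Gaussian-based enhancement operator are illustrative; the published method is driven by a JND model.

    import numpy as np
    from scipy import ndimage

    def compensate(image, backlight_ratio):
        """image: grayscale array in [0, 1]; backlight_ratio in (0, 1],
        e.g. 0.6 means the backlight runs at 60% of full power."""
        boosted = np.clip(image / backlight_ratio, 0.0, 1.0)   # brightness compensation
        blurred = ndimage.gaussian_filter(boosted, sigma=2.0)
        # local contrast enhancement: add back part of the high-pass detail
        return np.clip(boosted + 0.3 * (boosted - blurred), 0.0, 1.0)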

Selected Publications:
[1]
T.-H. Huang, C.-K. Liang, S.-L. Yeh, and H. H. Chen, “JND-based enhancement of perceptibility for dim images,” IEEE Int. Conf. Image Process., San Diego, Oct. 2008
[2]
P.-S. Tsai, C.-K. Liang, T.-H. Huang, and H. H. Chen, “Image enhancement for backlight-scaled TFT-LCD displays,” IEEE Trans. Circuits Syst. Video Technol., (accepted), 2008
[3]
K.-T. Shih, T.-H. Huang, and H. H. Chen, “An anchoring method for color enhancement of images illuminated with dim backlight,” IEEE Int. Conf. Image Process., (submitted), 2012.

Rolling Shutter Distortion


[project page]

The electronic rolling shutter approach found in most low-end CMOS image sensors collects image data row by row, analogous to an open slit that scans over the image sequentially. Each row integrates light when the slit passes over it. Therefore, the scanlines of the image are not exposed at the same time. This sensor architecture creates a geometric distortion, known as the rolling shutter effect, for moving objects. We address this problem by using digital image processing techniques.
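
Under a constant-velocity global motion model, the rolling shutter skew can be undone by shifting each row back in proportion to its readout delay. The toy sketch below assumes a purely horizontal translation, integer shifts, and wrap-around at the image borders; it illustrates the geometry rather than the full compensation algorithm of [1].

    import numpy as np

    def compensate_rolling_shutter(image, vx, readout_fraction=1.0):
        """image: (H, W) or (H, W, C) frame; vx: global horizontal velocity in
        pixels per frame; readout_fraction: portion of the frame time spent
        reading out the rows."""
        h = image.shape[0]
        out = np.empty_like(image)
        for y in range(h):
            # Later rows are exposed later, so they have drifted further.
            shift = vx * readout_fraction * (y / h)
            out[y] = np.roll(image[y], -int(round(shift)), axis=0)
        return out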

Selected Publications:
[1]
C.-K. Liang, L.-W. Chang, and H. H. Chen, “Analysis and compensation of rolling shutter effect,” IEEE Trans. Image Process., vol. 17, no. 8, 1323-1330, Aug. 2008

Digital Image Stabilization


Digital image processing is an important technology for consumer video electronics and other video capture devices. Digital image stabilization, in particular, is a key component in applications such as security surveillance, military reconnaissance, and digital cameras. Image sequence stabilization is accomplished by estimating and then compensating the global motion to remove involuntary image movement caused by, for example, unstable hand-shake or vibration. Our research aims to develop fast and cost-effective image stabilization algorithms for embedded systems such as digital cameras.

Selected Publications:
[1]
H. H. Chen, C.-K. Liang, Y.-C. Peng, and H.-A. Chang, “Integration of digital stabilizer with video codec for digital video cameras,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 7, 801-813, Jul. 2007 (2008 IEEE Circuits and Systems Society CSVT Transactions Best Paper Award)

Digital Home Technology


Our goal is to research multimedia signal processing for the home gateway in order to provide digital multimedia access in the home with excellent quality and complete digital rights management. The milestone of the first year is to research content-aware QoS and digital rights management systems. The milestone of the second year is to research complexity-aware streaming and digital rights management on MHP and mobile devices. The milestone of the final year is to research error-resilient video streaming for H.264, rate-distortion optimization for complexity-aware encoders, system integration, and performance evaluation.

Selected Publications:
[1]
M.-T. Lu, J.-C. Wu, K.-J. Peng, Polly Huang, Jason J. Yao, and Homer H. Chen, “Design and Evaluation of a P2P IPTV System for Heterogeneous Networks,” IEEE Trans. Multimedia, Dec. 2007

Music Emotion


[project page]
[demo]

Music plays an important role in human history, even more so in the digital age. Never before has such a large collection of music been created and accessed daily by people. As the amount of content continues to explode, the way music information is organized has to evolve in order to meet the ever-increasing demand for easy and effective information access. Music classification and retrieval by emotion is a plausible approach, for it is content-centric and functionally powerful.

Selected Publications:
[1]
Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, “A Regression approach to music emotion recognition,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 2, 448-457, Feb. 2008
[2]
Y.-H. Yang, Y.-C. Lin, H.-T. Cheng, and H. H. Chen, “Mr. Emo: Music retrieval in the emotion plane,” ACM Multimedia Technical demonstrations, (accepted), Vancouver, Canada, Oct. 2008

Video Codec

Video coding has become a key technology for a wide range of applications, from personal computers to television. It makes video storage and transmission practical and efficient. The latest video coding standard, H.264, has emerged as a popular video coding technology for multimedia communications and consumer electronics alike. Our research topics on H.264 encoder optimization include fast motion estimation and mode decision, rate control, and rate-distortion optimization.

Selected Publications:
[1]
C.-C. Su, J. J. Yao, P. Huang, and H. H. Chen, “H.264/AVC-Based Multiple Description Video Coding Using Dynamic Slice Groups,” Signal Processing: Image Communication, (accepted), 2008
[2]
C.-C. Su, J. J. Yao, and H. H. Chen, “H.264/AVC-based multiple description coding scheme,” IEEE Int. Conf. Image Proc., 265-268, San Antonio, TX, Sept. 2007
[3]
M.-L. Wong, Y.-L. Lin, and H. H. Chen, “A hardware-oriented intra prediction scheme for high definition AVS encoder,” Picture Coding Symp., Lisbon, Portugal, Nov. 2007

Perceptual-Based Video Coding


The rate-distortion optimization (RDO) framework for video coding achieves a tradeoff between bit-rate and quality. However, objective distortion metrics such as mean squared error traditionally used in this framework are poorly correlated with perceptual quality. We address this issue by proposing an approach that incorporates the structural similarity (SSIM) index as the quality metric in the framework. In particular, we develop a predictive Lagrange multiplier estimation method to resolve the chicken-and-egg dilemma of perceptual-based RDO and apply it to H.264 intra and inter mode decision. Given a perceptual quality level, the resulting video encoder achieves on average a 9% bit-rate reduction for intra-frame coding and 11% for inter-frame coding over the JM reference software. Subjective tests further confirm that, at the same bit-rate, the proposed perceptual RDO preserves image details and prevents block artifacts better than traditional RDO.
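
The mode decision step can be sketched as choosing, per block, the mode that minimizes J = (1 - SSIM) + λ·R. The sketch below uses scikit-image's structural_similarity and treats the candidate reconstructions, their bit costs, and the Lagrange multiplier as given; in the actual encoder the multiplier is obtained by the predictive estimation method described above.

    from skimage.metrics import structural_similarity as ssim

    def choose_mode(original_block, candidates, lambda_ssim):
        """candidates: list of (reconstructed_block, bits) pairs, one per coding mode."""
        best_mode, best_cost = None, float("inf")
        for mode, (recon, bits) in enumerate(candidates):
            distortion = 1.0 - ssim(original_block, recon, data_range=255)
            cost = distortion + lambda_ssim * bits
            if cost < best_cost:
                best_mode, best_cost = mode, cost
        return best_mode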

Selected Publications:
[1]
Y.-H. Huang, T.-S. Ou, and H. H. Chen, “Perceptual-based coding mode decision,” IEEE Int. Symp. Circuits and Systems, 393-396, May 2010
[2]
P.-Y. Su, Y.-H. Huang, T.-S. Ou, and H. H. Chen, “Predictive Lagrange multiplier selection for perceptual rate-distortion optimization,” Fifth International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, Arizona, Jan. 2010
[3]
T.-S. Ou, Y.-H. Huang, and H. H. Chen, “A perceptual-based approach to bit allocation for H.264 encoder,” in Proc. of SPIE Vol. 7744 Visual Communication and Image Processing, 77441B, 1-10, Huang Shan, China, Jul. 2010
[4]
H. H. Chen, Y.-H. Huang, P.-Y. Su, and T.-S. Ou, “Improving video coding quality by perceptual rate-distortion optimization,” IEEE Int. Conf. Multimedia Expo., pp. 1287-1292, Jul. 2010
[5]
P.-Y. Su, Y.-H. Huang, T.-S. Ou, and H. H. Chen, “Recent progress on perceptual video coding,” in Proc. Visual Comm. Image Process., Nov. 2011

Codec-Friendliness of Perceptual Video Quality Metrics

It is a natural expectation that the field of video coding can benefit considerably from recent advances in perceptual image and video quality assessment. However, the truth is that there has been only limited progress in the application of perceptual quality metrics as optimality criteria for video encoder design. Indeed, such a design task is extremely challenging, if not impossible, largely because the complicated mathematical representations of most perceptual image and video quality metrics do not fit in well with the computational framework of modern video encoders. We provide, mostly qualitatively, an analysis of the fundamental issues of this mismatch for a number of popular perceptual quality metrics and, from the video coding perspective, suggest a number of “codec-friendly” guidelines for future development of perceptual quality metrics.

Selected Publications:
[1]
P.-Y. Su, T.-Y. Huang, C.-K. Kao, and H. H. Chen, “Adopting perceptual quality metrics in video encoders: Progress and critiques,” IEEE Int. Workshop Emerging Multimedia Systems Applications, Melbourne, Jul. 2012

Rate-Distortion Optimized Quantization for Video Encoder


[Program]

Rate-distortion optimized quantization improves the coding performance of video compression. However, the search process involved in most existing methods is computationally expensive. We develop a method for accelerating the rate-distortion optimized quantization process. The acceleration is achieved by using a rate model of entropy coding to directly solve the rate-distortion optimization problem. Compared with the H.264/AVC reference encoder, our method achieves an average 5% bitrate reduction for IBBP GOP structure and 1% bitrate reduction for IPPP GOP structure, with negligible computational overhead. Compared with existing methods, our method is significantly more efficient. We believe the efficiency gain justifies the performance tradeoff for many real-world video coding systems, particularly in low-complexity applications.
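
For a single transform coefficient, the rate-distortion optimized level decision can be sketched as below: a few candidate levels around the hard-decision quantizer output are compared by the cost D + λR, where R comes from a rate model of the entropy coder. The toy rate model here is illustrative only; the actual method uses a model fitted to H.264/AVC entropy coding.

    import math

    def rdoq_level(coeff, qstep, lam):
        """coeff: transform coefficient; qstep: quantization step size; lam: Lagrange multiplier."""
        base = int(abs(coeff) / qstep)                 # hard-decision level (floor)
        best_level, best_cost = 0, None
        for level in {max(base - 1, 0), base, base + 1}:
            distortion = (abs(coeff) - level * qstep) ** 2
            rate = 1.0 if level == 0 else 2.0 + math.log2(level + 1)   # toy rate model
            cost = distortion + lam * rate
            if best_cost is None or cost < best_cost:
                best_level, best_cost = level, cost
        return int(math.copysign(best_level, coeff)) if best_level else 0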

Selected Publications:
[1]
T.-Y. Huang, P.-Y. Su, C.-K. Kao, and H. H. Chen, “Quality improvement of video codec by rate-distortion optimized quantization,” IEEE Workshop on Multimedia Quality of Experience, Dana Point, CA, Dec. 2011

Visual Attention Modeling

Visual attention is an important characteristic of the human visual system and is useful for image processing and compression. We develop a computational scheme that adopts both low-level and high-level features to predict visual attention from a video signal. The adoption of low-level features (color, orientation, and motion) is based on the study of visual cells, whereas the adoption of the human face as a high-level feature is based on the study of media communications. The low-level and high-level features are then fused by machine learning. We show that such a scheme is more robust than those using purely low-level or high-level features, as it is able to learn the relationship between features and visual attention and thereby avoid perceptual mismatch.