Document Type



Doctor of Philosophy


Materials Science and Engineering

First Adviser

Huang, Xiaolei


Representation learning is about learning representative features of the data that make it easier to extract useful information for the subsequent learning task. Owing to the success of deep learning, representations learned by deep neural networks have shown significant improvement over handcrafted features on most learning tasks. However, it remains very challenging to learn fine-grained visual representations, that is, highly localized features extracted from images that are useful for image understanding tasks such as fine-grained recognition, image generation, and semantic segmentation. Fine-grained recognition identifies subtle visual differences to distinguish among subordinate categories; image generation learns fine-grained visual features to generate realistic details; and semantic segmentation depends on coarse-to-fine representations to segment objects with pixel-wise precision and global coherence. In this thesis, I focus on improving or extending deep neural networks to learn better fine-grained visual representations for solving these image understanding tasks.
(i) Part-based fine-grained representation learning: A new Semantic Part Detection and Abstraction (SPDA) CNN architecture is proposed for fine-grained recognition. It has a detection sub-network for detecting small semantic parts and a recognition sub-network for learning discriminative part-based features for fine-grained object categorization.
(ii) Multimodal fine-grained representation learning: A multimodal deep learning framework is developed for fine-grained medical image classification by leveraging image and non-image clinical data collected during a patient's visit. The proposed multimodal framework learns better complementary fine-grained features from the image and non-image modalities for disease grading.
(iii) Adversarial fine-grained representation learning: An Attentional Generative Adversarial Network (AttnGAN) is presented for text-to-image synthesis, and an end-to-end adversarial neural network (called SegAN) is proposed for semantic segmentation. The AttnGAN learns coarse-to-fine-grained conditions (sentence-level and word-level information) to generate images with photo-realistic details. The SegAN adopts a novel adversarial critic network with a multi-scale L1 loss function to capture long- and short-range spatial relationships between pixels. Both qualitative and quantitative validation experiments are conducted for all proposed methods.
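
To make the idea of a multi-scale L1 loss concrete, the following is a minimal sketch and not the thesis's implementation. It assumes the loss compares an image masked by the predicted segmentation with the same image masked by the ground truth, at several scales, and averages the mean absolute differences. The abstract describes a learned critic network; here plain 2x2 average pooling is a stand-in for that network's multi-scale features, and all function names and the `num_scales` parameter are hypothetical.

```python
def avg_pool2x2(img):
    """Downsample a 2D list by averaging non-overlapping 2x2 blocks.

    Stand-in (assumption) for the learned features of a critic network.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def mean_abs_diff(a, b):
    """Mean absolute (L1) difference between two equally sized 2D lists."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def apply_mask(image, mask):
    """Element-wise product of an image and a segmentation mask."""
    return [[p * m for p, m in zip(ri, rm)] for ri, rm in zip(image, mask)]


def multi_scale_l1(image, pred_mask, true_mask, num_scales=3):
    """Average the L1 difference of the two masked images over several scales.

    Assumes the image side lengths are divisible by 2 ** (num_scales - 1).
    """
    masked_pred = apply_mask(image, pred_mask)
    masked_true = apply_mask(image, true_mask)
    losses = []
    for scale in range(num_scales):
        if scale > 0:
            # Move to the next, coarser scale before comparing again.
            masked_pred = avg_pool2x2(masked_pred)
            masked_true = avg_pool2x2(masked_true)
        losses.append(mean_abs_diff(masked_pred, masked_true))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    image = [[1.0] * 4 for _ in range(4)]
    true_mask = [[1.0] * 4 for _ in range(4)]
    pred_mask = [[0.0] * 4 for _ in range(4)]
    print(multi_scale_l1(image, pred_mask, true_mask))
```

In this sketch the fine scale penalizes individual mislabeled pixels, while the coarser scales summarize agreement over larger regions, which is one simple way to relate short- and long-range structure in a single loss.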