A. James Clark School of Engineering

Permanent URI for this community: http://hdl.handle.net/1903/1654

The collections in this community comprise faculty research works, as well as graduate theses and dissertations.

Search Results

Now showing 1 - 10 of 22
  • Item
    FROM PARTS TO WHOLE IN ACTION AND OBJECT UNDERSTANDING
    (2024) Devaraj, Chinmaya; Aloimonos, Yiannis; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The traditional paradigm of supervised learning in action or object recognition often relies on a top-down approach, without explicitly modeling what an activity or object consists of. Recent approaches in generative AI have shown the ability to generate images and videos from text, indirectly indicating that we have control over the constituents of images and videos. In this dissertation, we explore ways to use the constituents of actions to improve action understanding. We devise approaches that utilize the parts of actions, namely object motion, object state changes, and motion descriptions obtained from LLMs, in tasks such as next-active-object segmentation, zero-shot action recognition, and video-text retrieval. We show promising benefits in action anticipation, zero-shot action recognition, and text-video retrieval, demonstrating the practical applications of our methods.

    In the first part of the dissertation, we explore the idea of using the constituents of actions in GCNs for zero-shot human-object action recognition. The main idea is that semantically similar actions (those with similar constituents) lie closer in feature space; accordingly, our graph encodes the edges connecting such actions with higher similarity weights. We introduce a method to visually ground the external knowledge graph using the concept of shared similarity between similar actions. We evaluate the method on the EPIC Kitchens and Charades datasets, showing strong results over baseline methods. We further show that visually grounding the knowledge graph makes GCNs more robust when an adversarial attack corrupts the input graph.

    In the second part of the thesis, we extend our ideas to human-object interactions in first-person videos. Human actions involving hand manipulations are structured around the making and breaking of hand-object contact, and human visual understanding of action relies on anticipation of contact, as demonstrated by pioneering work in cognitive science. Taking inspiration from this, we introduce representations and models centered on contact, which we then use for action prediction and anticipation. We train the Anticipation Module, which produces Contact Anticipation Maps and Next Active Object Segmentations, novel low-level representations providing the temporal and spatial characteristics of anticipated near-future action. On top of the Anticipation Module, we apply Egocentric Object Manipulation Graphs (Ego-OMG), a framework for action anticipation and prediction. Using the Anticipation Module to aid Ego-OMG produces state-of-the-art results, achieving first and second place on the unseen and seen test sets of the EPIC Kitchens Action Anticipation Challenge, respectively, and state-of-the-art results on action anticipation and action prediction over EPIC Kitchens.

    Continuing this line of thinking about the constituents of action, we next investigate how motion understanding can be modeled in current video-text models. We introduce motion descriptions generated by GPT-4 on three action datasets, capturing fine-grained motion descriptions of activities. We evaluated several video-text models on the task of retrieving motion descriptions and found that they fall well short of human expert performance. We introduce a method for improving motion understanding in video-text models by utilizing motion descriptions, and demonstrate it on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions with fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.
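    As a concrete illustration of the graph construction described in the first part, here is a minimal sketch (not drawn from the dissertation's code) of propagating action features over a similarity-weighted graph with one GCN layer; the embeddings, dimensions, and threshold are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings of action "constituents" (e.g., verb/object
# vectors); 6 actions, 50-d features. A real system would use learned
# or language-model embeddings instead of random ones.
X = rng.normal(size=(6, 50))

# Edge weights from cosine similarity: semantically similar actions
# (those with similar constituents) get stronger connections.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
A = np.clip(Xn @ Xn.T, 0.0, None)   # keep non-negative similarities
A[A < 0.1] = 0.0                    # drop weak edges
np.fill_diagonal(A, 0.0)

# Standard GCN normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}
A = A + np.eye(len(A))
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))

# One GCN layer: ReLU(A_hat @ X @ W). Features of similar actions are
# pulled together, the property exploited for zero-shot recognition.
W = rng.normal(size=(50, 16))
H = np.maximum(A_hat @ X @ W, 0.0)
print(H.shape)  # (6, 16)
```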
  • Item
    Understanding and Improving Reliability of Predictive and Generative Deep Learning Models
    (2024) Kattakinda, Priyatham; Feizi, Soheil; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Deep learning models are prone to acquiring spurious correlations and biases during training and to adversarial attacks during inference. In the context of predictive models, this results in inaccurate predictions that rely on spurious features. Our research delves into this phenomenon specifically for objects placed in uncommon settings, where they are not conventionally found in the real world (e.g., a plane on water or a television in a cave). We introduce the "FOCUS: Familiar Objects in Common and Uncommon Settings" dataset, which aims to stress-test the generalization capabilities of deep image classifiers. By leveraging the power of modern search engines, we deliberately gather data containing objects in common and uncommon settings across a wide range of locations, weather conditions, and times of day. Our comprehensive analysis of popular image classifiers on the FOCUS dataset reveals a noticeable decline in performance when classifying images in atypical scenarios.

    FOCUS consists only of natural images, which are extremely challenging to collect: by definition, it is rare to find objects in unusual settings. To address this challenge, we introduce an alternative dataset named Diffusion Dreamed Distribution Shifts (D3S). D3S comprises synthetic images generated with Stable Diffusion, using text prompts and image guides derived from placing a sample foreground image onto a background template image. This scalable approach allows us to create 120,000 images featuring objects from all 1000 ImageNet classes set against 10 diverse backgrounds. Due to the photorealism of the diffusion model, our images are much closer to natural images than those in previous synthetic datasets.

    To alleviate the reliance on spurious features, we propose two methods of learning richer and more robust image representations. In the first approach, we harness the foreground and background labels within D3S to learn a foreground (background) representation resistant to changes in background (foreground). This is achieved by penalizing the mutual information between the foreground (background) features and the background (foreground) labels. We demonstrate the efficacy of these representations by training classifiers on a task with strong spurious correlations. Thus far, our focus has centered on predictive models, scrutinizing the robustness of the learned object representations, particularly when the contextual surroundings are unconventional. In the second approach, we propose to use embeddings of objects and their relationships, extracted with off-the-shelf image segmentation models and text encoders respectively, as input tokens to a transformer. This yields remarkably richer features that improve performance on downstream tasks such as image retrieval.

    Large language models are also prone to failures during inference. Given the widespread use of LLMs, understanding the propensity of these models to fail under adversarial inputs is crucial. To that end, we propose a series of fast adversarial attacks called BEAST that use beam search to add adversarial tokens to a given input prompt. These attacks induce hallucination, cause the models to jailbreak, and facilitate unintended membership inference from model outputs. Our attacks are fast and executable in relatively compute-constrained environments.
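    The mutual-information penalty in the first approach is often realized in practice with an adversarial classifier trained through a gradient reversal layer; the sketch below shows that common surrogate, not the dissertation's exact estimator. The module sizes and the 1000/10 class counts (ImageNet foreground classes, D3S backgrounds) are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

fg_encoder = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
fg_head = nn.Linear(128, 1000)  # object classes (illustrative)
bg_head = nn.Linear(128, 10)    # background classes (illustrative)
ce = nn.CrossEntropyLoss()

def loss_fn(x, fg_y, bg_y, lam=1.0):
    z = fg_encoder(x)
    # Task loss: predict the object class from foreground features.
    task = ce(fg_head(z), fg_y)
    # Adversarial term: the bg head tries to read the background label
    # from z, while reversed gradients push the encoder to discard it.
    adv = ce(bg_head(GradReverse.apply(z, lam)), bg_y)
    return task + adv

x = torch.randn(8, 512)  # stand-in image features
print(loss_fn(x, torch.randint(0, 1000, (8,)), torch.randint(0, 10, (8,))))
```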
  • Item
    A Framework for Remaining Useful Life Prediction and Optimization for Complex Engineering Systems
    (2024) Weiner, Matthew Joesph; Azarm, Shapour; Groth, Katrina M; Reliability Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Remaining useful life (RUL) prediction plays a crucial role in maintaining the operational efficiency, reliability, and performance of complex engineering systems. Recent efforts have primarily focused on individual components or subsystems, neglecting the intricate relationships between components and their impact on system-level RUL (SRUL). This gap in predictive methodologies has prompted the need for an integrated approach that addresses the complex nature of these systems while optimizing performance with respect to these predictive indicators. This thesis introduces a novel methodology for predicting and optimizing SRUL and demonstrates how the predicted SRUL can be used to optimize system operation. The approach incorporates various types of data, including condition monitoring sensor data and component reliability data. The methodology leverages probabilistic deep learning (PDL) techniques to predict component RUL distributions from sensor data, falling back on component reliability data when sensor data are not available. Furthermore, an equation node-based Bayesian network (BN) is employed to capture the complex causal relationships between components and predict the SRUL. Finally, system operation is optimized using a multi-objective genetic algorithm (MOGA), in which SRUL is treated both as a constraint and as an objective, with mission completion time as the other objective.

    The validation process includes a thorough examination of the component-level methodology using the C-MAPSS data set. The practical application of the proposed methodology is a case study involving an unmanned surface vessel (USV), which incorporates all aspects of the methodology, including system-level validation through qualitative metrics. Evaluation metrics are employed to quantify and qualify both component- and system-level results, as well as the results from the optimizer, providing a comprehensive understanding of the proposed approach's performance.

    This thesis makes several main contributions. These include a new deep learning structure for component-level PHM that utilizes a hybrid loss function for a multi-layer long short-term memory (LSTM) regression model to predict RUL with a given confidence interval while also considering the complex interactions among components. Another contribution is a new framework for computing SRUL from the predicted component RULs, in which a Bayesian network performs logic operations to determine the SRUL. These contributions not only advance the field of PHM but also provide a practical engineering application. The ability to accurately predict and manage the RUL of components within a system has profound implications for maintenance scheduling, cost reduction, and overall system reliability. The integration of the proposed method with an optimization algorithm closes the loop, offering a comprehensive solution for offline planning and SRUL prediction and optimization. The results of this research can be used to enhance the efficiency and reliability of engineering systems, leading to more informed decision-making.
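    To illustrate how component-level RUL distributions can roll up to a system-level figure, here is a small Monte Carlo sketch that pushes sampled component RULs through series/redundancy logic. The thesis uses an equation node-based Bayesian network for this step; the Monte Carlo stand-in, the system layout, and the lognormal parameters below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # Monte Carlo samples

# Hypothetical component RUL distributions, e.g. as produced by a
# probabilistic deep learning model (means/sigmas are made up).
pump  = rng.lognormal(mean=4.0, sigma=0.3, size=n)   # hours
motor = rng.lognormal(mean=4.3, sigma=0.4, size=n)
gen_a = rng.lognormal(mean=3.8, sigma=0.5, size=n)
gen_b = rng.lognormal(mean=3.9, sigma=0.5, size=n)

# System logic: pump and motor in series (earliest failure wins);
# generators A/B are redundant (system keeps the later failure).
srul = np.minimum.reduce([pump, motor, np.maximum(gen_a, gen_b)])

print(f"median SRUL: {np.median(srul):.1f} h")
print(f"10th percentile (usable as an optimization constraint): "
      f"{np.percentile(srul, 10):.1f} h")
```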
  • Item
    DEEP LEARNING ENSEMBLES FOR LIGHTWEIGHT OBJECT DETECTION
    (2023) Mattingly, Alexander Singfei; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Object detection, the task of identifying and localizing important objects within an image frame, is a critical task in automation, surveillance, and safety applications. Further, developments in lightweight sensor technologies, improved small-scale computing, and the widespread accessibility of well-labeled data have enabled numerous applications for object detection on inexpensive or low-power hardware. Many applications, such as self-driving and unmanned aerial vehicles, must process sensor data as it arrives (in real time) using onboard hardware (at the edge) in order to continually inform systems such as navigation. Additionally, detection must often be achieved on platforms with limited Size, Weight, and Power (SWaP), since it may not be possible to place advanced computing hardware near the sensor. This presents a unique challenge: how can we best provide accurate real-time object detection on limited-SWaP systems while maintaining low power and computational cost?

    A widespread approach to detection is deep learning. An object detection network is trained on a labeled dataset of images containing known objects and their locations. After training, the network may be used for inference on new data, providing both bounding boxes and class identifiers for each box. Popular single-shot detectors have been demonstrated to achieve real-time performance on some systems while having acceptable detection accuracy.

    An ensemble is a system comprised of several detectors. In theory, detectors with architectural differences, ones trained on different data, or detectors given differently augmented data at inference time will discover and detect different features of an image. Unifying the results of several different detectors has been demonstrated to improve the detection performance of the ensemble compared to that of any component network, at the expense of additional computational cost. Further, systems using an ensemble of detectors have been shown to be good solutions to object detection problems in limited-SWaP applications such as surveillance and search-and-rescue.

    Unlike tasks such as classification, where the output of a network describes the entire input, object detection is concerned with both localization and classification of one or multiple objects in an image. Two different bounding boxes for partially occluded objects may overlap, or highly similar bounding boxes may describe the same object. As a result, unifying the results of object detector networks is far more difficult than unifying classifier networks. Current works typically accomplish this by applying strategies that iteratively combine bounding boxes by overlap. However, little comparative study has been done to determine the effectiveness of these approaches.

    This thesis builds on current methods of ensembling object detector networks using novel approaches to combine bounding boxes. We first introduce current methods for ensembling and a dataflow-based framework for efficient, scalable computation of ensembles of detectors. We then contribute a novel method for ensembling and implement a practical system for scalable detection using an elastic neural network.
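    The overlap-based combination step that detector ensembles build on can be made concrete with a classic greedy non-maximum suppression (NMS) routine over the pooled detections of all ensemble members; this is the kind of baseline strategy the thesis compares against and extends. The boxes and scores below are invented.

```python
import numpy as np

def iou(box, boxes):
    # Boxes are [x1, y1, x2, y2]; returns IoU of `box` with each row.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedily keep the highest-scoring box, drop boxes overlapping it.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) < thresh]
    return keep

# Pooled detections from two hypothetical ensemble members.
boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [80, 80, 120, 130]], float)
scores = np.array([0.9, 0.85, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the overlapping pair is collapsed
```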
  • Item
    Generalizable Depression Detection and Severity Prediction Using Articulatory Representations of Speech
    (2022) Seneviratne, Nadee; Espy-Wilson, Carol; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Major Depressive Disorder (MDD) is a mental health disorder that has taken a massive toll on society, both socially and financially. Timely diagnosis of MDD is crucial to minimizing serious consequences such as suicide, so automated solutions that can reliably detect and predict the severity of MDD can play a pivotal role in assisting healthcare professionals with timely treatment. MDD is known to affect speech, and many vocal biomarkers that leverage depression-related changes in speech characteristics are being developed to detect it. However, changes in articulatory coordination associated with depression remain under-explored. Speech articulation is a complex activity that requires finely timed coordination across articulators. In a depressed state involving psychomotor slowing, this coordination changes and in turn modifies the perceived speech signal.

    In this work, we use a direct representation of articulation known as vocal tract variables (TVs) to capture the coordination between articulatory gestures. TVs define the constriction degree and location of the articulators (tongue, jaw, lips, velum, and glottis). Previously, the correlation structure of formants or mel-frequency cepstral coefficients (MFCCs) was used as a proxy for underlying articulatory coordination. We compute articulatory coordination features (ACFs), which capture the correlation among time-series data at different time delays and are therefore rich with information about the underlying coordination level of speech production. Using the rank-ordered eigenspectra obtained from TV-based ACFs, we show that depressed speech exhibits simpler coordination relative to the speech of the same subjects in remission, in line with previous findings. In a preliminary study using a small subset of speech from subjects who transitioned from severe depression to remission, we show that TV-based ACFs outperform formant-based ACFs in binary depression classification. We also show that depressed speech has reduced variability in terms of reduced coarticulation and undershoot. To validate this, we present a comprehensive acoustic analysis and the results of a speech-in-noise perception study comparing the intelligibility of depressed and not-depressed speech; our results indicate that depressed speech is at least as intelligible as not-depressed speech.

    The next stage of our work focuses on developing deep learning models using TV-based ACFs to detect depression, and attempts to overcome the limitations of existing work. We combine two speech depression databases with different characteristics, which increases generalizability, a key objective of this research. Moreover, we segment audio recordings prior to feature extraction to obtain the data volumes required to train deep neural networks. We reduce the dimensionality of conventional stacked ACFs of multiple delay scales by using refined ACFs, carefully curated to remove redundancies, together with the strengths of dilated convolutional neural networks. We show that models trained on TV-based ACFs are more generalizable than their proxy counterparts. We then develop a multi-stage convolutional recurrent neural network that performs classification at the session level, and derive the constraints under which this segment-to-session approach can boost classification performance. We extend our models to perform depression severity level classification, where TV-based ACFs outperform other feature sets as well.

    Language patterns and semantics can reveal vital information about a person's mental state. We develop a multimodal depression classifier that incorporates TV-based ACFs and hierarchical attention-based text embeddings. The fusion strategy of the proposed architecture enables data from each modality to be segmented independently (overlapping segments for audio and sentences for text), in the most suitable way for that modality, when performing segment-to-session classification. The multimodal classifier clearly performs better than the unimodal classifiers. Finally, we develop a multimodal system to predict the depression severity score, a more challenging regression problem due to the quasi-numerical nature of the scores. The multimodal regressor achieves the lowest root mean squared error, showing the synergy of combining modalities such as audio and text. We conclude with an exhaustive error analysis that reveals potential future improvements. The work in this dissertation takes a step toward the betterment of humanity by developing technologies that improve speech-based depression assessment, utilizing the strengths of ACFs derived from direct articulatory representations.
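    A minimal sketch of the ACF computation described above, under assumed details: stack time-delayed copies of the vocal tract variable channels, correlate them, and rank-order the eigenvalues of the resulting correlation matrix. The delays, channel count, and synthetic trajectories are hypothetical.

```python
import numpy as np

def acf_eigenspectrum(tv, delays=(0, 7, 14, 21)):
    """Rank-ordered eigenspectrum of a delayed correlation matrix.

    tv: (channels, samples) array of vocal tract variable trajectories.
    Correlating time-delayed copies of every channel captures the
    coordination across articulators at multiple delays.
    """
    max_d = max(delays)
    T = tv.shape[1]
    stacked = np.vstack([tv[:, d:T - max_d + d] for d in delays])
    corr = np.corrcoef(stacked)            # (C*D, C*D) correlation matrix
    return np.linalg.eigvalsh(corr)[::-1]  # eigenvalues, descending

# Hypothetical 6-channel TV stream (constriction degrees/locations).
rng = np.random.default_rng(0)
tv = rng.normal(size=(6, 2000)).cumsum(axis=1)  # smooth-ish trajectories
spectrum = acf_eigenspectrum(tv)
# A spectrum concentrated in a few large eigenvalues indicates simpler
# coordination, the pattern reported for depressed speech.
print(spectrum[:5])
```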
  • Item
    Wavefront Shaping in a Complex Reverberant Environment with a Binary Tunable Metasurface
    (2021) Frazier, Benjamin West; Antonsen, Thomas M; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Electromagnetic environments are becoming increasingly complex and congested, creating a growing challenge for systems that rely on electromagnetic waves for communication, sensing, or imaging. Intelligent, reconfigurable metasurfaces provide a potential means of achieving a radio environment that can direct propagating waves to optimize wireless channels on demand, ensuring reliable operation and protecting sensitive electronic components. The capability to isolate or reject unwanted signals in order to mitigate vulnerabilities is critical for any practical application.

    In the first part of this dissertation, I describe the use of a binary programmable metasurface to (i) control the spatial degrees of freedom of waves propagating inside an electromagnetic cavity and demonstrate the ability to create nulls in the transmission coefficient between selected ports, and (ii) create the conditions for coherent perfect absorption. Both objectives are achieved at arbitrary frequencies. In the first case, I present a novel and effective stochastic optimization algorithm that selectively generates coldspots over a single frequency band or simultaneously over multiple frequency bands, and I show that this algorithm succeeds with multiple input port configurations and varying optimization bandwidths. In the second case, I show how this technique can be used to establish a multi-port coherent perfect absorption state for the cavity.

    In the second part of this dissertation, I introduce a technique that combines a deep learning network with a binary programmable metasurface to shape waves in complex electromagnetic environments, in particular ones without a direct line of sight. I applied this technique to wavefront reconstruction, accurately determining metasurface configurations from measured system scattering responses in a chaotic microwave cavity. The metasurface state that realizes desired electromagnetic wave field distribution properties was successfully determined even in cases previously unseen by the deep learning algorithm. My technique is enabled by the reverberant nature of the cavity and is effective with a metasurface that covers only ~1.5% of the total cavity surface area.
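    The coldspot search can be illustrated with a toy stochastic optimizer: flip random subsets of binary metasurface elements and keep a flip whenever it lowers the mean transmission magnitude over the target band. A random linear model stands in for measured cavity S-parameters here; the dissertation's algorithm and cavity response are more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
n_elem, n_freq = 240, 64

# Stand-in model: each binary element perturbs the complex transmission
# S21(f) linearly. A real setup would measure S21 from the cavity.
base = rng.normal(size=n_freq) + 1j * rng.normal(size=n_freq)
coupling = 0.1 * (rng.normal(size=(n_elem, n_freq))
                  + 1j * rng.normal(size=(n_elem, n_freq)))

def s21_mag(state, band):
    return np.abs(base[band] + state @ coupling[:, band]).mean()

band = slice(28, 36)                    # target frequency bins
state = rng.integers(0, 2, n_elem).astype(float)
best = s21_mag(state, band)
for _ in range(5000):
    flip = rng.integers(0, n_elem, size=3)  # flip a few random elements
    trial = state.copy()
    trial[flip] = 1.0 - trial[flip]
    if (m := s21_mag(trial, band)) < best:  # keep only improving flips
        state, best = trial, m
print(f"mean |S21| in band after optimization: {best:.4f}")
```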
  • Item
    Systematic Integration of PHM and PRA (SIPPRA) for Risk and Reliability Analysis of Complex Engineering Systems
    (2021) Moradi, Ramin; Groth, Katrina; Mechanical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Complex Engineering Systems (CES) such as power plants, process plants, and manufacturing plants have numerous, interrelated, and heterogeneous subsystems with different characteristics and different risk and reliability analysis requirements. With advances in sensing and computing technology, abundant monitoring data are being collected, a rich source of information for more accurate assessment and management of these systems. Current risk and reliability analysis approaches and practices are inadequate for incorporating various sources of information, providing a system-level perspective, and dynamically assessing the operating condition and operational risk of CES.

    In this dissertation, this challenge is addressed by integrating techniques and models from two major subfields of reliability engineering: Probabilistic Risk Assessment (PRA) and Prognostics and Health Management (PHM). PRA is very effective at modeling complex hardware systems, and approaches have been designed to incorporate the risks introduced by humans, software, organizations, and other contributors into quantitative risk assessments. However, PRA has largely remained a static technology used mainly for regulation. PHM, on the other hand, has developed powerful algorithms for understanding and predicting mechanical and electrical device health to support maintenance; yet it lacks a system-level perspective, relies heavily on operational data, and its outcomes are not risk-informed.

    I propose a novel framework at the intersection of PHM and PRA that provides a forward-looking, model- and data-driven analysis paradigm for assessing and predicting the operational risk and condition of CES. I operationalize this framework by developing two mathematical architectures and applying them to real-world systems. The first architecture enables online system-level condition monitoring. The second improves upon the first and realizes the objectives of using various sources of information and monitoring operating condition together with operational risk.
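    As a rough illustration of the PHM-to-PRA handoff such a framework enables, the sketch below maps monitoring-derived component health to failure probabilities and feeds them into a PRA-style fault tree, so the system risk estimate updates as health degrades. The structure, independence assumption, and numbers are invented, not the dissertation's architectures.

```python
# Minimal sketch: component health -> failure probability -> fault-tree
# top-event probability (OR of pump failure with AND of redundant valves).
def p_fail(health):
    # Hypothetical mapping from a PHM health index in [0, 1].
    return min(1.0, max(0.0, 1.0 - health))

def top_event(p_pump, p_valve_a, p_valve_b):
    p_valves = p_valve_a * p_valve_b                 # AND gate (redundancy)
    return 1.0 - (1.0 - p_pump) * (1.0 - p_valves)   # OR gate,
                                                     # independence assumed

# As monitoring data shows degradation, operational risk rises.
for pump_health in (0.99, 0.90, 0.70):
    risk = top_event(p_fail(pump_health), 0.05, 0.05)
    print(f"pump health {pump_health:.2f} -> top-event prob {risk:.4f}")
```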
  • Item
    Impact Of Semantics, Physics And Adversarial Mechanisms In Deep Learning
    (2020) Kavalerov, Ilya; Chellappa, Rama; Czaja, Wojciech; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Deep learning has greatly advanced the performance of algorithms on tasks such as image classification, speech enhancement, sound separation, and generative image modeling. However, many currently popular systems are driven by empirical rules that do not fully exploit the underlying physics of the data. Many speech and audio systems fix STFT preprocessing before their networks. Hyperspectral image (HSI) methods often do not deliberately consider the spectral-spatial trade-off, which is not present in ordinary images. Generative Adversarial Networks (GANs) that learn a generative distribution of images do not prioritize the semantic labels of the training data. To meet these opportunities, we propose to alter known deep learning methods to depend more on the semantic and physical underpinnings of the data, creating better-performing and more robust algorithms for sound separation and classification, image generation, and HSI segmentation. Our approaches take inspiration from harmonic analysis, SVMs, and classical statistical detection theory, and further the state of the art in source separation, defense against audio adversarial attacks, HSI classification, and GANs.

    Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation. To study this question, we develop a dataset of mixtures containing arbitrary sounds and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. We compare a short-time Fourier transform (STFT) with a learnable basis at variable window sizes for the feature extraction stage of our sound separation network. We also compare the robustness to adversarial examples of speech classification networks that similarly hybridize established time-frequency (TF) methods with learnable filter weights.

    We analyze hyperspectral images for material classification. For hyperspectral image cubes, TF methods decompose spectra into multi-spectral bands, while neural networks (NNs) incorporate spatial information across scales and model multiple levels of dependencies between spectral features. The Fourier scattering transform is an amalgamation of time-frequency representations with neural network architectures. We propose and test a three-dimensional Fourier scattering method on hyperspectral datasets and present results indicating that the Fourier scattering transform is highly effective at representing spectral data compared with other state-of-the-art methods. We study the spectral-spatial trade-off that our scattering approach allows. We also use a similar multi-scale approach to develop a defense against audio adversarial attacks: we propose a unification of a computational model of speech processing in the brain with commercial wake-word networks to create a cortical network, and show that it can increase resistance to adversarial noise without degrading performance.

    Generative Adversarial Networks are an attractive approach to constructing generative models that mimic a target distribution, and they typically use conditional information (cGANs), such as class labels, to guide the training of the discriminator and the generator. We propose a loss that ensures generator updates are always class-specific: rather than training a function that measures the information-theoretic distance between the generative distribution and one target distribution, we generalize the successful hinge loss that has become an essential ingredient of many GANs to the multi-class setting and use it to train a single generator-classifier pair. While the canonical hinge loss updates the generator according to a class-agnostic margin learned by a real/fake discriminator, our multi-class hinge-loss GAN updates the generator according to many classification margins. With this modification, we are able to accelerate training and achieve state-of-the-art Inception and FID scores on ImageNet128. We study the trade-off between class fidelity and the overall diversity of generated images, and show that modifications of our method can prioritize either during training. We also present theoretical results on K+1 GANs showing that there is a limit to how closely classification and discrimination can be combined while maintaining sample diversity.
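    The multi-class hinge loss can be sketched as follows (a simplified reading of the idea, not the exact published formulation): the discriminator-classifier enforces a per-class margin, scoring real samples above +1 on their true class and generated samples below -1 on their conditioned class, while the generator raises its samples' target-class scores.

```python
import torch
import torch.nn.functional as F

def d_multi_hinge(real_logits, real_y, fake_logits, fake_y):
    # Per-class margins instead of a single real/fake margin:
    # real samples should score > +1 on their true class,
    # generated samples should score < -1 on their conditioned class.
    real_s = real_logits.gather(1, real_y[:, None]).squeeze(1)
    fake_s = fake_logits.gather(1, fake_y[:, None]).squeeze(1)
    return F.relu(1.0 - real_s).mean() + F.relu(1.0 + fake_s).mean()

def g_multi_hinge(fake_logits, fake_y):
    # Generator updates are class-specific: raise the score of the
    # class each sample was conditioned on.
    return -fake_logits.gather(1, fake_y[:, None]).mean()

# Toy usage with random "logits" for a 10-class problem.
real_logits, fake_logits = torch.randn(8, 10), torch.randn(8, 10)
real_y, fake_y = torch.randint(0, 10, (8,)), torch.randint(0, 10, (8,))
print(d_multi_hinge(real_logits, real_y, fake_logits, fake_y),
      g_multi_hinge(fake_logits, fake_y))
```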
  • Item
    DEEP LEARNING FOR FORENSICS
    (2020) Zhou, Peng; Davis, Larry; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The advent of media sharing platforms and the easy availability of advanced photo and video editing software have resulted in a large quantity of manipulated images and videos being shared on the internet. While the intent behind such manipulations varies widely, concern about the spread of fake news and misinformation is growing, and detecting manipulation has become a pressing necessity. Unlike traditional classification, semantic object detection, or segmentation, manipulation detection and classification pay more attention to low-level tampering artifacts than to semantic content. The main challenges in this problem include (a) investigating features that reveal tampering artifacts, (b) developing generic models that are robust to a large range of post-processing methods, (c) applying algorithms to high-resolution images in real scenarios, and (d) handling newly emerging manipulation techniques. In this dissertation, we propose approaches to tackling these challenges.

    Manipulation detection utilizes both low-level tampering artifacts and semantic content, suggesting that richer features need to be harnessed to reveal more evidence. To learn rich features, we propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions in a manipulated image. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream and achieves state-of-the-art performance compared to alternative methods, with robustness to resizing and compression. Additionally, to extend manipulation detection from images to video, we introduce VIDNet, a Video Inpainting Detection Network, which contains an encoder-decoder architecture with a quad-directional local attention module. To reveal artifacts encoded by compression, VIDNet additionally takes in Error Level Analysis (ELA) frames to augment the RGB frames, producing multimodal features at different levels with the encoder.

    To improve the generalization of manipulation detection models, we introduce a manipulated-image generation process that creates true positives using currently available datasets. Drawing from traditional work on image blending, we propose a novel generator for creating such examples, and we further create examples that force the algorithm to focus on boundary artifacts during training. Extensive experimental results validate our proposal. Furthermore, to apply deep learning models to high-resolution scenarios efficiently, we treat the problem as mask refinement given a coarse low-resolution prediction: we convert the regions of interest into strip images and compute a boundary prediction in the strip domain. Extensive experiments on both public datasets and a newly created high-resolution dataset strongly validate our approach. Finally, to handle newly emerging manipulation techniques while preserving performance on previously learned manipulations, we investigate incremental learning. We propose a multi-model and multi-level knowledge distillation strategy that preserves performance on old categories while training on new ones. Experiments on standard incremental learning benchmarks show that our method improves overall performance over standard distillation techniques.
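    The incremental learning step can be illustrated with a standard single-level distillation loss: cross-entropy supervises the new manipulation categories while a temperature-softened KL term anchors the student's outputs on the old categories to a frozen teacher. This is a simplification of the multi-model, multi-level strategy in the dissertation; all shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def incremental_loss(student_logits, teacher_logits, labels,
                     n_old, T=2.0, alpha=0.5):
    """Hypothetical single-level distillation loss.

    Cross-entropy supervises all current categories, while a KL term
    keeps the student's outputs on the old categories close to the
    frozen teacher's (softened by temperature T).
    """
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits[:, :n_old] / T, dim=1),
        F.softmax(teacher_logits[:, :n_old] / T, dim=1),
        reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd

# Toy shapes: 5 old + 3 new manipulation categories.
s = torch.randn(4, 8)          # student predicts all 8 classes
t = torch.randn(4, 5)          # frozen teacher was trained on 5
y = torch.randint(0, 8, (4,))
print(incremental_loss(s, t, y, n_old=5))
```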
  • Item
    FACIAL EXPRESSION RECOGNITION AND EDITING WITH LIMITED DATA
    (2020) Ding, Hui; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Over the past five years, methods based on deep features have taken over the computer vision field. While dramatic performance improvements have been achieved on tasks such as face detection and verification, these methods usually need large amounts of annotated data. In practice, not all computer vision tasks have access to such data; facial expression analysis is one such task. In this dissertation, we focus on facial expression recognition and editing problems with small datasets. In addition, to cope with challenging conditions like pose and occlusion, we also study unaligned facial attribute detection and occluded expression recognition. This dissertation is divided into four parts.

    In the first part, we present FaceNet2ExpNet, a novel idea for training a lightweight, high-accuracy classification model for expression recognition with small datasets. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed: in the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics, and evaluations on four public expression databases demonstrate that our method achieves better results than the state of the art.

    In the second part, we focus on robust facial expression recognition under occlusion and propose a landmark-guided attention branch to find and discard corrupted feature elements. An attention map is first generated to indicate whether a specific facial part is occluded, guiding our model to attend to non-occluded regions. To further increase robustness, we propose a facial region branch that partitions the feature maps into non-overlapping facial blocks and enforces each block to predict the expression independently. Owing to the synergistic effect of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets.

    In the third part, we propose a cascade network that simultaneously learns to localize the face regions specific to attributes and performs attribute classification without alignment. First, a weakly supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined by a region switch layer and an attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to obtain an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%.

    In the final part of this dissertation, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that ExprGAN can be applied to other tasks, such as expression transfer, image retrieval, and data augmentation for training improved facial expression recognition models. To tackle the small size of the training database, an effective incremental learning scheme is proposed. Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.
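    A minimal sketch of the intensity-control idea: scaling an expression code by a scalar in [0, 1] before decoding yields a continuously adjustable expression strength. ExprGAN's learned expression controller is richer than this one-hot stand-in; the decoder, dimensions, and class count below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical decoder: identity code (64-d) + expression code (6 classes).
decoder = nn.Sequential(nn.Linear(64 + 6, 128), nn.ReLU(),
                        nn.Linear(128, 32 * 32), nn.Tanh())

def edit_expression(identity_code, expr_class, intensity):
    # Scaling a one-hot expression code by a scalar in [0, 1] gives a
    # continuously adjustable intensity, a simplified stand-in for the
    # compact code ExprGAN's expression controller module learns.
    expr_code = torch.zeros(identity_code.size(0), 6)
    expr_code[:, expr_class] = intensity
    return decoder(torch.cat([identity_code, expr_code], dim=1))

z_id = torch.randn(1, 64)           # stand-in identity code
for intensity in (0.2, 0.6, 1.0):   # low -> high expression strength
    img = edit_expression(z_id, expr_class=3, intensity=intensity)
    print(intensity, img.shape)     # torch.Size([1, 1024])
```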