Richburg, Aquia

Large language models have revolutionized natural language processing with their capabilities in text generation and understanding. Their rich contextual representations, learned from training on diverse text datasets, have led LLMs to be used across a variety of settings. However, this increases the chance of models being applied to unintended use cases and causing harm to users. This dissertation delves into empirical studies of out-of-distribution issues in text generation (machine translation) and text classification (authorship analysis) tasks, examining how LLMs perform in settings distant from their training distributions.

In our first work, the goal is to understand the characteristics of the training distribution of LLMs by visualizing the roles of samples during the training of a machine translation model. Our results indicate that sample contributions are not uniform and play complex roles throughout the training process. This highlights the difficulty of describing samples that are representative of the training distribution and motivates thorough evaluation of models in diverse settings. Our second and third works turn to the evaluation of LLMs in out-of-distribution settings to better understand their strengths and limitations for generalization on unseen tasks. We evaluate LLMs in machine translation tasks, focusing on how translation quality is affected by the presence or absence of specific language pairs in the training data. Our findings show that while finetuning improves translation for unseen languages, the impact varies across different language pairs. This emphasizes the need for further research to enable effective massively multilingual translation with LLMs. In text classification, we explore out-of-distribution generalization for authorship analysis in the context of human-AI collaborative writing. Our studies reveal that traditional AI detection models underperform when distinguishing between human and AI cowritten text.
Simpler n-gram techniques are more robust than LLMs for authorship identification, suggesting the need for adapted authorship analysis tools. In summary, this dissertation advances our understanding of LLM generalization and provides insights for improving the robustness and adaptability of NLP systems.

en
OUT OF DISTRIBUTION EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS: GENERALIZATION TO LOW-RESOURCE AND DISTANT LANGUAGES AND HUMAN-AI COLLABORATIVE WRITING
Dissertation
Computer science
Applied mathematics