POST-TRAINING OF VISION-LANGUAGE AGENTS FOR DECENTRALIZED AUTONOMOUS VEHICLE COORDINATION USING GENERALIZABLE MULTI-AGENT REWARDS

Cole, John Robert

dc.contributor.advisor	Goldstein, Thomas A	en_US
dc.contributor.author	Cole, John Robert	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2026-07-03T05:39:36Z
dc.date.issued	2026	en_US
dc.description.abstract	Decentralized coordination at unsignalized intersections remains a persistent failure mode for modern autonomous driving policies when vehicle-to-everything (V2X) communication is unavailable. Policies trained primarily with ego-centric objectives (e.g., collision avoidance, comfort, and action consistency) can be overly conservative in symmetric interactions, leading to deadlocks, or can make conflicting commitments, leading to unsafe near-collisions. This thesis addresses this gap by introducing a social post-training method for Alpamayo-R1 (AR1) that explicitly rewards behavior that is predictable to neighboring agents. We extend AR1’s Group Relative Policy Optimization (GRPO) post-training by augmenting the reward with Expectation Alignment (ELIGN), an intrinsic social term that penalizes mismatch between a learned neighbor-expectation model and the realized shared next observation. To make ELIGN applicable to AR1’s continuous trajectory outputs, we define the shared observation space over low-dimensional kinematic waypoints (x, y, ψ, v) rather than high-dimensional perception features, and we learn a compact trajectory prediction model offline before fine-tuning. The composite reward combines a trajectory-fidelity L2 term, a comfort score, and the ELIGN social penalty under a gated formulation that prevents the social term from masking large trajectory failures. Post-training is implemented using the cosmos rl framework with ReasoningVLAGRPOTrainer and vLLM-accelerated rollout generation, representing the first application of GRPO to a production-scale VLA model in a neural-rendered closed-loop driving simulator. We evaluate the proposed AR1+ELIGN post-training in a multi-agent simulation benchmark of symmetric four-way arrival scenarios in AlpaSim and compare against an ego-centric AR1 baseline as well as standard multi-agent reinforcement learning baselines (PPO and MAPPO).
Performance is measured by collision rate (as a hard safety constraint), deadlock rate, intersection clearance time, and jerk variance as an indicator of indecision. Finally, we study zero-shot social generalization by testing whether ELIGN-fine-tuned agents coordinate effectively with novel partner agents not encountered during training. Quantitative benchmark results are pending completion of GRPO training; pre-training reward validation and closed-loop baseline evaluation confirm that the reward pipeline is stable and well-calibrated, with the social term contributing a mean penalty of approximately −0.02 per step under the deployed weighting while the trajectory-fidelity signal remains dominant.	en_US
dc.identifier	https://doi.org/10.13016/ly5w-vadg
dc.identifier.uri	http://hdl.handle.net/1903/36019
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Artificial intelligence	en_US
dc.subject.pquncontrolled	Autonomous Driving	en_US
dc.subject.pquncontrolled	Multi-Agent AI	en_US
dc.subject.pquncontrolled	Vision-Language Agents	en_US
dc.title	POST-TRAINING OF VISION-LANGUAGE AGENTS FOR DECENTRALIZED AUTONOMOUS VEHICLE COORDINATION USING GENERALIZABLE MULTI-AGENT REWARDS	en_US
dc.type	Thesis	en_US
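
The gated composite reward described in the abstract could be sketched roughly as follows. This is an illustrative reading only, not the thesis implementation: the function names, the weights (w_traj, w_comfort, w_social), the gate threshold, and the unweighted squared mismatch over (x, y, ψ, v) are all assumptions introduced here. The gate is interpreted as applying the ELIGN term only when the trajectory L2 error is below a threshold, so that the social term cannot influence the reward for large trajectory failures.

```python
import math
from typing import Sequence

# A waypoint is assumed to be (x, y, psi, v): position, heading, speed.
Waypoint = Sequence[float]


def l2_error(pred: Sequence[Waypoint], target: Sequence[Waypoint]) -> float:
    """Mean Euclidean (x, y) distance between predicted and reference waypoints."""
    dists = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)


def elign_penalty(expected: Sequence[Waypoint], realized: Sequence[Waypoint]) -> float:
    """Non-positive social term: negative mean squared mismatch between the
    neighbor-expected and realized waypoints (hypothetical, unweighted form)."""
    total = 0.0
    for e, r in zip(expected, realized):
        # Wrap the heading difference into (-pi, pi] before squaring.
        dpsi = math.atan2(math.sin(e[2] - r[2]), math.cos(e[2] - r[2]))
        total += (e[0] - r[0]) ** 2 + (e[1] - r[1]) ** 2 + dpsi ** 2 + (e[3] - r[3]) ** 2
    return -total / len(expected)


def composite_reward(
    pred: Sequence[Waypoint],
    target: Sequence[Waypoint],
    expected: Sequence[Waypoint],
    realized: Sequence[Waypoint],
    comfort: float,
    w_traj: float = 1.0,     # hypothetical weight
    w_comfort: float = 0.1,  # hypothetical weight
    w_social: float = 0.05,  # hypothetical weight
    gate: float = 2.0,       # hypothetical L2 threshold in meters
) -> float:
    """Trajectory-fidelity term plus comfort, with the social term gated on L2 error."""
    err = l2_error(pred, target)
    reward = -w_traj * err + w_comfort * comfort
    if err <= gate:
        # The social term only contributes when the trajectory is roughly on track.
        reward += w_social * elign_penalty(expected, realized)
    return reward
```

With these placeholder weights the trajectory term dominates the social term, matching the qualitative behavior the abstract reports; the actual weighting and gate used in the thesis are not stated in this record.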

Files

Original bundle

Name: Cole_umd_0117N_26352.pdf
Size: 393.27 KB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations