We explore how natural language inference (NLI) can be augmented with visual information. We replicate and extend existing NLI baselines, including recent deep learning methods, and add image features to these models to study how the textual and visual modalities interact. We show that image features provide a small boost in classifier performance for simpler models, but that the visual information is largely redundant with the premise statement and thus does not benefit the more complex models. Additionally, we demonstrate a weakness in the SNLI dataset: the entailment label is often predictable from the hypothesis alone, without reference to the premise.

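As a rough illustration of the setup, the sketch below concatenates precomputed image features with premise and hypothesis sentence encodings before a small classifier. This is a minimal PyTorch sketch under our own assumptions; the class name, feature dimensions, and concatenation scheme are illustrative and not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn


class NLIWithImageFeatures(nn.Module):
    """Hypothetical sentence-pair classifier that concatenates precomputed
    image features (e.g., from a CNN) with the premise and hypothesis
    encodings before classification."""

    def __init__(self, sent_dim=300, img_dim=2048, hidden_dim=512, n_classes=3):
        super().__init__()
        # Input: premise encoding + hypothesis encoding + image features.
        self.classifier = nn.Sequential(
            nn.Linear(2 * sent_dim + img_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),  # entailment / neutral / contradiction
        )

    def forward(self, premise_vec, hypothesis_vec, image_vec):
        # Fuse the textual and visual modalities by simple concatenation.
        features = torch.cat([premise_vec, hypothesis_vec, image_vec], dim=-1)
        return self.classifier(features)


# Usage with random stand-ins for encoded sentences and CNN image features:
model = NLIWithImageFeatures()
premise = torch.randn(8, 300)      # batch of premise sentence embeddings
hypothesis = torch.randn(8, 300)   # batch of hypothesis sentence embeddings
image = torch.randn(8, 2048)       # batch of image features (e.g., a ResNet pooling layer)
logits = model(premise, hypothesis, image)  # shape: (8, 3)
```

Dropping `premise_vec` from the concatenation gives a hypothesis-only variant, which is one way to probe the SNLI weakness noted above.
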
CS224U Paper

YouTube Presentation