This research investigates the integration of object detection and relationship prediction models to improve image interpretability, addressing the core question: what challenges necessitate a comparative analysis of object detection and transformer models for relationship determination? The object detection model performs well, especially at lower Intersection over Union (IoU) thresholds and for larger objects, which provides a solid foundation for the subsequent relationship analysis. The transformer models, including GIT, GPT-2, and PromptCap, are evaluated for their language generation capabilities on several performance metrics, including novel keyword-based metrics. The study explicitly addresses limitations arising from dataset constraints and potential difficulties in model generalization, which also clarifies the rationale for the research. Taken together, the evaluations of the object detection and transformer models offer insight into the interplay between visual and linguistic understanding in image comprehension. The acknowledged limitations point the way to future refinements and to the exploration of broader application domains. By treating visual and textual elements jointly, the work contributes to ongoing research in computer vision and natural language processing.