Abstract—The growing prevalence of mental health issues underscores the need for advanced sentiment analysis, particularly for multimodal content on platforms like Twitter. This study investigates the fusion of text and image data for mental health sentiment analysis and proposes an attention-enhanced hybrid CNN-BiLSTM model. The model was trained and evaluated on a primary dataset of 40,000 manually labeled tweets (text and images), enriched with a corpus of 71,213 BBC News articles to improve semantic generalization. Three findings emerge: (1) a hybrid model trained on the combined Twitter and BBC News corpus achieves the highest accuracy, 78.21%, supporting cross-domain data enrichment as a strategy for model robustness; (2) text is the stronger modality, reaching 77.55% accuracy against 69.85% for images, which highlights current challenges in visual sentiment interpretation for this domain; and (3) the attention mechanism improves accuracy by 1-2% in multimodal fusion, indicating that it is effective at prioritizing sentiment-critical features. These findings demonstrate the efficacy of attention-based hybrid models and highlight the importance of data fusion and feature prioritization for developing more nuanced digital mental health interventions.