Advances in large language models (LLMs) such as GPT-4o, Gemini 2.0, and LLAMA 4 offer new possibilities for automated essay scoring, addressing the time-consuming and subjective nature of manual grading in Indonesian language learning. The objective of this study is to analyze and compare the competence of GPT-4o, Gemini 2.0, and LLAMA 4 in answering Indonesian essay questions. The methodology involved presenting a dataset of 203 Indonesian essay questions as prompts to each LLM. The generated answers were evaluated automatically with Sentence-BERT (SBERT), which measures semantic similarity via cosine similarity, using BERTScore as the reference metric. Agreement between the SBERT and BERTScore values was measured using mean absolute error (MAE), root mean square error (RMSE), and Pearson correlation. The findings indicate that Gemini 2.0 achieved the highest average cosine similarity (0.5651), while GPT-4o performed best on BERTScore (0.6744). LLAMA 4 demonstrated the highest consistency with the lowest MAE (0.1302) and RMSE (0.1581), and Gemini 2.0 showed the highest Pearson correlation (0.5699).
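To make the evaluation pipeline concrete, the sketch below shows one way the described metrics could be computed, assuming the sentence-transformers, bert-score, scikit-learn, and scipy packages. The SBERT checkpoint name, the example answer pairs, and the library choices are assumptions for illustration only; the study does not specify these details.

```python
# Minimal sketch of the evaluation pipeline: SBERT cosine similarity per
# answer pair, BERTScore F1 as reference, then MAE/RMSE/Pearson agreement.
import numpy as np
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr

# Hypothetical example data: LLM-generated answers and reference answers.
generated = [
    "Fotosintesis adalah proses tumbuhan mengubah cahaya matahari menjadi energi.",
    "Ibu kota Indonesia adalah Jakarta.",
    "Pancasila merupakan dasar negara Indonesia.",
]
references = [
    "Fotosintesis adalah proses pembentukan energi dari cahaya matahari pada tumbuhan.",
    "Jakarta adalah ibu kota negara Indonesia.",
    "Dasar negara Indonesia adalah Pancasila.",
]

# 1. SBERT cosine similarity between each generated answer and its reference.
#    The multilingual MiniLM checkpoint is an assumed choice, not the paper's.
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
gen_emb = sbert.encode(generated, convert_to_tensor=True)
ref_emb = sbert.encode(references, convert_to_tensor=True)
sbert_sim = util.cos_sim(gen_emb, ref_emb).diagonal().cpu().numpy()

# 2. BERTScore F1 for the same pairs, treated as the reference metric.
_, _, bert_f1 = bert_score(generated, references, lang="id")
bert_f1 = bert_f1.numpy()

# 3. Agreement between the two metrics: MAE, RMSE, and Pearson correlation.
mae = mean_absolute_error(bert_f1, sbert_sim)
rmse = np.sqrt(mean_squared_error(bert_f1, sbert_sim))
r, _ = pearsonr(bert_f1, sbert_sim)
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  Pearson r={r:.4f}")
```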