Nature Medicine, Published online: 17 September 2025; doi:10.1038/s41591-025-03988-x
We created and curated three benchmark datasets to assess large language model (LLM) performance on sleep and fitness tasks, ranging from answering expert questions to handling real-world coaching scenarios. Fine-tuning the Gemini LLM on real-world coaching tasks and self-reported sleep-quality outcomes improved its performance and provided a benchmark for further development.
