Can we trust them? An expert evaluation of large language models to provide sleep and jet lag recommendations for athletes

Vitale, Jacopo; McCall, Alan; Cina, Andrea; Janse van Rensburg, Dina Christina; Halson, Shona

Date

2025

Authors

Vitale, Jacopo

McCall, Alan

Cina, Andrea

Janse van Rensburg, Dina Christina

Halson, Shona

Publisher

Springer Nature

Abstract

BACKGROUND: With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes.

OBJECTIVE: This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes.

METHODS: Conducted in two phases between January and June 2024, the study first identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two intervals (T1 and T2). Inter-rater reliability was evaluated using Fleiss' Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity using the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests.

RESULTS: Experts’ response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss' Kappa: 0.21–0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0%), with significant differences compared to ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. ChatGPT-4 outperformed the other models overall.

CONCLUSIONS: This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.
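The abstract names three agreement and validity measures without giving their formulas. The sketch below shows the textbook definitions of each (Lawshe's content validity ratio, the Jaccard similarity index, and Fleiss' Kappa) as a reading aid only. It is not the authors' analysis code: the function names, the input layouts, and the idea of comparing a rater's T1 and T2 answers as sets are assumptions made for illustration, and the abstract does not state which CVR variant or threshold the study used.

```python
from typing import Hashable, Iterable, Sequence


def content_validity_ratio(n_agree: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2).

    n_agree is the number of experts who rated an item as appropriate
    (assumed meaning here); n_experts is the panel size N.
    """
    if n_experts <= 0 or not 0 <= n_agree <= n_experts:
        raise ValueError("need 0 <= n_agree <= n_experts and n_experts > 0")
    half = n_experts / 2
    return (n_agree - half) / half


def jaccard_similarity(first: Iterable[Hashable], second: Iterable[Hashable]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of labels."""
    a, b = set(first), set(second)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def fleiss_kappa(table: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a table where table[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters n (n >= 2)."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(table[0])

    # Proportion of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return 1.0  # all ratings in one category: agreement is trivially perfect
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the study's data.
    print(content_validity_ratio(n_agree=17, n_experts=20))
    print(jaccard_similarity({"Q1:ok", "Q2:ok", "Q3:bad"}, {"Q1:ok", "Q2:bad", "Q3:bad"}))
    print(fleiss_kappa([[3, 0], [0, 3], [2, 1]]))
```

With a panel of 20, as in this study, the CVR formula reduces to (n_e - 10) / 10, so an item reaches a positive CVR only when more than half of the experts rate it as appropriate.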

Description

AVAILABILITY OF DATA AND MATERIALS: The datasets generated and analyzed during the current study are available from the corresponding author, Jacopo Vitale (jacopo.vitale@kws.ch), upon reasonable request.

SUPPLEMENTARY FILE 1. The horizontal 6-point Likert scale (and instructions) used by experts to rate the appropriateness of the answers provided by the LLMs.
SUPPLEMENTARY FILE 2. Items and questions for final evaluation.
SUPPLEMENTARY FILE 3. Content Validity Ratios (CVR) and Content Validity Indexes (CVI) for each question (Q1-Q10) on sleep for ChatGPT-3.5 (3a), Google Bard (3b), and ChatGPT-4 (3c).
SUPPLEMENTARY FILE 4. Content Validity Ratios (CVR) and Content Validity Indexes (CVI) for each question (Q1-Q10) on jet lag for ChatGPT-3.5 (4a), Google Bard (4b), and ChatGPT-4 (4c).
SUPPLEMENTARY FILE 6. Raters’ scores for the jet lag survey.
SUPPLEMENTARY FILE 7. Median and IQR of raters' scores for answers' understandability, clarity, professionality, and length for each LLM (ChatGPT-3.5: black lines; Google Bard: red lines; ChatGPT-4: blue lines) for the sleep (upper panels) and jet lag (lower panels) surveys. Legend: *: p < 0.05; **: p < 0.01; ***: p < 0.001.
SUPPLEMENTARY FILE 8. Donut charts for the percentage distribution of the preferred LLM identified by the experts for the sleep (upper image) and jet lag (lower image) surveys.
SUPPLEMENTARY FILE 9. Upper panel: Donut charts showing the percentage distribution of experts identifying the use of LLMs in sports as a risk or advantage for sleep (left image) and jet lag (right image). Lower panel: Box and whisker plots showing individual data points, median, first and third quartiles, and minimum and maximum scores of experts' expectations (left image) and attitudes (right image) toward LLMs. Blue circles: sleep; red circles: jet lag.

Sustainable Development Goals

SDG-03: Good health and well-being

Citation

Vitale, J., McCall, A., Cina, A. et al. “Can We Trust Them?” An Expert Evaluation of Large Language Models to Provide Sleep and Jet Lag Recommendations for Athletes. Sports Medicine (2025). https://doi.org/10.1007/s40279-025-02303-5.

URI

http://hdl.handle.net/2263/104795

Collections

Research Articles (Sports Medicine)
Research Articles (Sport, Exercise Medicine & Lifestyle Institute (SEMLI))
Research Articles (University of Pretoria)
