Game-Based Assessment of Statistical Self-Efficacy

An Alternative to the Self-Report of Internal Unobservable Beliefs

G. Curt Fulwider

2026-06-16

Self-Efficacy

Defined as the belief one holds about their ability to manage and execute actions required to achieve specific outcomes, self-efficacy plays a critical role in how people think, behave, and feel [1].

Why self-efficacy matters

Self-efficacy is pivotal in understanding motivation and achievement [2].
It shapes whether students engage, avoid, persist, or withdraw.
It is one of the strongest self-belief predictors of academic achievement [3].

Perceived threat versus perceived opportunity figure

Self-belief bias

Overconfident learner example

Underconfident learner example

Self-efficacy beliefs can distort how students interpret their own performance, leading to overconfidence or underconfidence [4].
Self-efficacy can change performance itself, so assessment scores are not insulated from self-belief [1], [4].
Regularly assessing self-efficacy can help identify and address these biases.

The Problem

Assessing Self-Efficacy

Hard to measure

Self-efficacy is internal and not directly observable.
- Behavioral side effects (e.g., persistence) are observable, but the beliefs themselves are not.
It is also dynamic and specific to the task being performed.
Self-report is the only method.

Self-report is Problematic

Self-report introduces both bias and burden [5], [6], [7].
Added time and effort required to complete surveys [6], [7].
Overtested students and overburdened teachers.

Design Constraints

To make this work, the study had to satisfy three design constraints.
The game had to elicit relevant behavior.
- Sufficient difficulty to elicit persistence and risk-taking
- Opportunities for goal setting and interest expression
The assessment had to remain unobtrusive.
- Interrupting play breaks authentic behavior
- More frequent measurement opportunities
The models had to be feasible to deploy in practice.

The Idea

Bridge Variables

Observable behaviors can serve as bridge variables to reach an unobservable construct.
Relied on evidence-centered design (ECD) to construct the logic [8].
The variables persistence, goal setting, and risk-taking are…
- Theoretically related to self-efficacy beliefs.
- Directly observable (by definition) in a learning game context.

Bridge Variable Operationalization

Bridge variable	Operationalized definition
Persistence	Higher self-efficacy, more attempts and more time spent on task
Goal setting	Higher self-efficacy, sets higher goals
Risk-taking	Higher self-efficacy, willing to take risks
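
As a minimal sketch of how two of the operationalizations above might be computed from event-level telemetry, the Python below aggregates per-player persistence and goal-setting indicators. The event schema (`player`, `type`, `t`, `success`, `level`), the sample events, and the helper `bridge_features` are all hypothetical and are not taken from the study. Risk-taking is left out because its event definition is not given here.

```python
from collections import defaultdict

# Hypothetical telemetry events; field names and event types are assumptions.
events = [
    {"player": "p1", "type": "attempt", "t": 0.0, "success": False},
    {"player": "p1", "type": "attempt", "t": 30.0, "success": False},
    {"player": "p1", "type": "attempt", "t": 75.0, "success": True},
    {"player": "p1", "type": "goal_set", "t": 80.0, "level": 3},
    {"player": "p2", "type": "attempt", "t": 0.0, "success": False},
    {"player": "p2", "type": "goal_set", "t": 20.0, "level": 1},
]


def bridge_features(events):
    """Aggregate per-player persistence and goal-setting indicators."""
    per_player = defaultdict(lambda: {
        "attempts": 0,               # persistence: number of attempts
        "retries_after_failure": 0,  # persistence: attempts that follow a failure
        "time_on_task": 0.0,         # persistence: seconds between attempts
        "goal_levels": [],           # goal setting: chosen goal levels
    })
    last_attempt = {}  # player -> (time, success) of the previous attempt
    for e in sorted(events, key=lambda e: (e["player"], e["t"])):
        f = per_player[e["player"]]
        if e["type"] == "attempt":
            f["attempts"] += 1
            prev = last_attempt.get(e["player"])
            if prev is not None:
                f["time_on_task"] += e["t"] - prev[0]
                if not prev[1]:
                    f["retries_after_failure"] += 1
            last_attempt[e["player"]] = (e["t"], e["success"])
        elif e["type"] == "goal_set":
            f["goal_levels"].append(e["level"])
    out = {}
    for player, f in per_player.items():
        levels = f.pop("goal_levels")
        f["mean_goal_level"] = sum(levels) / len(levels) if levels else None
        out[player] = f
    return out


features = bridge_features(events)
```

Aggregating to one row per player mirrors the participant-level features used in the talk. A sequence-preserving variant would keep the ordered attempts instead of summing them.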

Educational Data Mining

Educational Data Mining (EDM) offers a complementary response to the limitations of ECD by focusing on the extraction of meaningful patterns from educational data to better understand and improve learning processes [9].

In practice, simple models often did not recover the outcome well.
The relationship between self-efficacy and its behavioral correlates may be too complex to reduce cleanly to a simple pattern.
- Consider a 6-dimensional scatterplot.
EDM helped by supporting large-scale feature engineering and fast model testing.
That made it possible to search for stable patterns across many theoretically derived variables.

The Method

Study context

Setting: two-day classroom study in a video game design course at a K–12 research school in the southeastern United States
Population: grades 8–12; predominantly male, consistent with the course context
Gameplay: about 48 minutes across two class periods, generating full event-level telemetry

Sample stage	n
Full gameplay telemetry	109
Matched pretest survey + gameplay	102
Self-efficacy posttest available	95
Content posttest available	93
Common analytic modeling subset	86

Mean Alchemy

A custom learning game designed to elicit interpretable behavioral evidence of statistical self-efficacy.
Targets middle school students (i.e., 6th to 8th grade) learning about measures of center and spread.

Cycle of gameplay flow

Bounty board screenshot

Screenshot of the alchemy table

Example figure three

Study Design

Design:
- quantitative,
- nonexperimental predictive,
- multiverse-style model search [10]
Model-building choices are treated as analytic uncertainty rather than collapsed into one final model.
- Not asking: “Which single model wins?”
- Asking: “What patterns remain when reasonable analytic choices change?”
In practice, that meant varying feature subsets, preprocessing pipelines, model families, and hyperparameters while holding the theoretical target constant.
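
One way to picture a multiverse-style search is as the Cartesian product of the analytic decision layers. The sketch below enumerates such a grid in Python. The option labels are illustrative placeholders rather than the study's configuration objects, hyperparameter grids are omitted, and `enumerate_multiverse` is a hypothetical helper.

```python
from itertools import product

# Illustrative option labels for each decision layer (assumptions, not the
# study's actual settings). Hyperparameters are deliberately left out.
LAYERS = {
    "target": ["pretest", "posttest"],
    "features": ["all", "top_k", "random_subset"],
    "preprocessing": ["base", "standardized", "standardized_pca", "discretized"],
    "model": ["logistic", "l1_logistic", "naive_bayes", "gradient_boosting"],
}


def enumerate_multiverse(layers):
    """Return one dict per combination of analytic choices."""
    names = list(layers)
    return [dict(zip(names, combo)) for combo in product(*layers.values())]


configs = enumerate_multiverse(LAYERS)
print(len(configs))  # 2 * 3 * 4 * 4 = 96
```

Each configuration would then be fit and scored, and the results compared across the whole set instead of being ranked to pick one winner.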

Scope of the model search

For each pretest and posttest target, I defined the model search around these decision layers:

Search layer	What varied	Plain-language purpose
Feature ranking	Mutual-information ranking within each target	Decide which variables looked most promising before building subsets.
Feature subsets	All 11 features; top 6, 8, and 11; six reproducible random subsets from 4 to 10 features	Test whether the signal depended on one exact variable combination.
Preprocessing	Base; standardized; standardized + PCA (6 components); discretized numeric features (5 quantile bins)	Check whether the models worked better with different ways of preparing the data.
Model families	Logistic regression; L1 logistic regression; Naive Bayes; gradient boosting; family-specific hyperparameter grids	Compare different kinds of learners instead of trusting one algorithm.
Evaluation	Stratified 20-fold cross-validation with shuffling; accuracy, macro-F1, precision, recall, MCC, kappa, AUC, off-diagonal error rate	Judge performance from multiple angles, not one score.

This produced up to 60 evaluated configurations per model family per target, for 480 retained configurations overall.
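
Several of the evaluation metrics listed above can be derived from a single confusion matrix. The sketch below computes accuracy, macro-F1, Cohen's kappa, the Matthews correlation coefficient (MCC, in its multiclass form), and the off-diagonal error rate in plain Python. The function name `confusion_metrics` is hypothetical, and the counts are toy values chosen for easy arithmetic, not results from the study.

```python
from math import sqrt


def confusion_metrics(cm):
    """Compute agreement metrics from a square confusion matrix.

    cm[i][j] counts cases whose true class is i and predicted class is j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    true_tot = [sum(cm[i]) for i in range(k)]
    pred_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    correct = sum(cm[i][i] for i in range(k))

    accuracy = correct / n

    # Per-class F1 = 2*TP / (2*TP + FP + FN); the denominator equals
    # (true total + predicted total) for that class.
    f1 = []
    for i in range(k):
        denom = true_tot[i] + pred_tot[i]
        f1.append(2 * cm[i][i] / denom if denom else 0.0)
    macro_f1 = sum(f1) / k

    # Cohen's kappa: observed agreement corrected for chance agreement.
    cross = sum(t * p for t, p in zip(true_tot, pred_tot))
    p_e = cross / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Multiclass MCC computed from the same marginals.
    den = sqrt((n * n - sum(p * p for p in pred_tot))
               * (n * n - sum(t * t for t in true_tot)))
    mcc = (correct * n - cross) / den if den else 0.0

    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "kappa": kappa,
        "mcc": mcc,
        "off_diagonal_rate": 1 - accuracy,
    }


# Toy two-class matrix (rows = true class, columns = predicted class).
toy = [[40, 10],
       [5, 45]]
metrics = confusion_metrics(toy)
```

Reporting these side by side guards against a configuration that looks strong on accuracy alone while agreeing with the labels little better than chance.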

The Results

Main findings

When predicting pretest self-efficacy, there was a modest but consistent signal across models.
When predicting posttest self-efficacy, the signal was weaker and less consistent across models.
The strongest recurring features were:
- goal setting,
- persistence after failure,
- and interest.

PREtest Model Metric Profile

POSTtest Model Metric Profile

Model Metric Heatmaps

Pretest top models heatmap

Pretest models

Stronger agreement across model families
Average metrics higher than posttest

Posttest top models heatmap

Posttest models

Weaker agreement across models for posttest
A single model performed well, maybe too well…

Feature Importance from Top Three Models

L1 logistic regression feature importance

Naive Bayes feature importance

Final pretest (L1 Logistic Regression) confusion matrix

Confusion matrix for final pretest model

What changed across targets

Dynamic or shifting cases were harder to classify.
Aggregated participant-level features may be more useful for static constructs.
The most informative errors pointed toward movement in self-efficacy over time.

On- versus off-diagonal content and interest

Self-efficacy change by classification group

What It Means

Interpretation

Behavioral traces can carry information about self-efficacy beliefs.
The value is in recurring agreement across models, not one isolated result.
This supports the use of theory-guided behavioral proxies for hard-to-measure constructs.

Methodological implication

Participant-level aggregate features appear to compress too much within-session variation.
Dynamic constructs likely require finer-grained temporal modeling.
Future work should preserve more of the sequence and timing of behavior.

Limits and Implications

Limits

Modest sample
One game context
Short exposure in an MVP environment
Predominantly male sample from a video game design course

Implications

Accessibility: game-based assessment could reduce reliance on burdensome testing formats for students whose performance is undermined by test anxiety or other barriers tied to traditional assessment.
Differentiation: if self-belief and related behaviors can be assessed continuously, support could be offered sooner and with better targeting rather than waiting for failure on a later test.
More complete education: the goal is not only that students leave knowing statistical facts, but that they leave believing they can use those ideas and keep learning beyond the classroom.

Closing

Main takeaway

Behavior in a learning game produced a modest but credible signal for statistical self-efficacy, suggesting a viable path beyond self-report alone.

Questions

Thank you.

References

[1]

A. Bandura, Self-efficacy: The exercise of control. W. H. Freeman, 1997.

[2]

D. H. Schunk and F. Pajares, “The development of academic self-efficacy,” in Development of achievement motivation, A. Wigfield and J. S. Eccles, Eds., Academic Press, 2002, pp. 15–31. doi: 10.1016/B978-012750053-9/50003-6.

[3]

L. Stankov, S. Morony, and Y. P. Lee, “Confidence: The best non-cognitive predictor of academic achievement?” Educational Psychology, vol. 34, no. 1, pp. 9–28, Jan. 2014, doi: 10.1080/01443410.2013.814194.

[4]

F. Pajares and D. H. Schunk, “Self-beliefs and school success: Self-efficacy, self-concept, and school achievement,” in Self perception, in International perspectives on individual differences, vol. 2. Westport, CT, US: Ablex Publishing, 2001, pp. 239–265.

[5]

C. Kormos and R. Gifford, “The validity of self-report measures of proenvironmental behavior: A meta-analytic review,” Journal of Environmental Psychology, vol. 40, pp. 359–371, Dec. 2014, doi: 10.1016/j.jenvp.2014.09.003.

[6]

K. Watson, T. Baranowski, D. Thompson, R. Jago, J. Baranowski, and L. M. Klesges, “Innovative application of a multidimensional item response model in assessing the influence of social desirability on the pseudo-relationship between self-efficacy and behavior,” Health Education Research, vol. 21, pp. i85–i97, Oct. 2006, doi: 10.1093/her/cyl137.

[7]

P. Ben-Nun, “Respondent fatigue,” in Encyclopedia of survey research methods, Thousand Oaks, CA: Sage Publications, Inc., 2008, p. 743. Accessed: Sep. 17, 2024. [Online]. Available: https://methods.sagepub.com/reference/encyclopedia-of-survey-research-methods/n480.xml

[8]

S. Toulmin, The uses of argument. Cambridge: Cambridge University Press, 1958.

[9]

R. S. Baker and P. S. Inventado, “Educational Data Mining and Learning Analytics,” in Learning Analytics, J. A. Larusson and B. White, Eds., New York, NY: Springer New York, 2014, pp. 61–75. doi: 10.1007/978-1-4614-3305-7_4.

[10]

S. Steegen, F. Tuerlinckx, A. Gelman, and W. Vanpaemel, “Increasing transparency through a multiverse analysis,” Perspectives on Psychological Science, vol. 11, no. 5, pp. 702–712, 2016, doi: 10.1177/1745691616658637.