Validate the Best Model ************* Lesson ================ .. article-info:: :avatar: UCLA_Suit.jpg :avatar-link: https://www.decisionneurosciencelab.com/elijahgalvan :author: Elijah Galvan :date: September 1, 2023 :read-time: 17 min read :class-container: sd-p-2 sd-outline-muted sd-rounded-1 Goal During this Stage --------------- We want to see if the model is actually valid: there are two things that we need to prove in order for our model to be valid: first, we have to show that our model accurately captures the data generation process and, second, we have to show that our parameter recovery process is robust. How to Achieve this Goal ------------ .. dropdown:: Assessing Accuracy in Capturing the Data Generation Process In order to show that our model accurately captures the data generation process, we have to show that the model predicts behavior equally well at all values. Essentially, we must test four assumptions about how our model predictions relate to the behavioral data that they were trained upon: linearity, normality of error, independence of error, and homoscedasticty. These are also four of the five assumptions of linear regression (the fifth is the independence of :bdg-primary:`Independent Variables`, but we don't rely on this assumption). .. Note:: The assumptions of our model are (with one exception) the assumptions of linear regression, but not the assumptions of mixed effects regression despite the fact that we obviously are using repeated measures - why is this? Well, the answer is essentially the same answer as why this fifth assumption - the independence of :bdg-primary:`Independent Variables` - is not included. According to our model, all of the ways that :bdg-success:`Subjects` can be different are encapsulated in the :bdg-success:`Free Parameters`. Another way to phrase the assumption of the independence of :bdg-primary:`Independent Variables` is the independence of observations. Linear modeling requires that observations are not related, so if :bdg-primary:`Independent Variables` are collinear, then the :bdg-danger:`Dependent Variable` are predicted by both :bdg-primary:`Independent Variables` which means that the observations are dependent upon each other. Thus since we don't know what actually causes the :bdg-danger:`Dependent Variable` to change neither :bdg-primary:`Independent Variable` can be used to predict the variance explained by both. In a similar vein, linear mixed effects models control for the independence of observations (i.e. produced by the same :bdg-success:`Subject`) using random effects. See, where we're going here? Since our model asserts that all of the ways that :bdg-success:`Subjects` can be different are encapsulated in the :bdg-success:`Free Parameters`, we essentially assert that the way our model predicts :bdg-danger:`Decisions` does not differ across :bdg-success:`Subjects`. We have one :bdg-primary:`Independent Variable` - the model predicted :bdg-danger:`Decisions`. According to our model, *nothing else* can predict :bdg-danger:`Decisions`. However, we're not going to blindly assume this - we'll also verify that this is true. .. dropdown:: Model Performance .. tab-set:: .. tab-item:: Plain English Before we start checking assumptions, we need to do a simple regression: using the predicted :bdg-danger:`Decisions` of our best model to predict actual :bdg-danger:`Decisions`. So, we will have to use the trial-by-trial data to accomplish this. What we want to see first is the R-squared: the multiple R-squared should always be essentially equal to the adjusted R-squared since there are going to be a lot of observations. The R-squared tells us how much variance our model explains: any model with an R-squared above 0.70 is very good. Particularly for models without noise parameters, it means that you are really capturing the data generation process and essentially rules out the possibility that you are missing an additionally strategy or additional motive. As long as you don't find any assumptions to be grotesquely violated, you should proceed and feel very confident that your model is a good representation of the data generation process. An R-squared between 0.5 and 0.7 is acceptable. Although this good model performance by objective standards, these tasks are designed to elicit very well-defined, consistent preferences with a very high signal to noise ratio. You might be missing an additional strategy in your model - so use the steps below and see if this is the case! If not, you should proceed with apprehension but it does not immediately invalidate any conclusions that you want to draw. An R-squared of less than 0.5 is bad. Either your task-and-experimental procedure was 1) poorly thought out, producing too high of a signal-to-noise ratio, or 2) your model does not account for one or more preferences that govern behavior in your task. Ask yourself what the cause is: if it is a problem with the task either consider adding noise or bias parameters, find a different kind of model to analyze your data, or otherwise throw the data out because you cannot analyze it with a utility model. If it is the latter then identify :bdg-success:`Subjects` who are fit particularly poorly by the model, identify the behavioral trend, and try to think about a value-based preference that could lead one to behave in such a way. We can try this on all of these :bdg-success:`Subjects` - if this doesn't offer clear insight then we might want to look at individual subjects. .. tab-item:: R :: modelPredictions = lm(data = trialData, Decisions ~ Prediction) summary(modelPredictions) # R-squared #we can identify if we're missing a strategy using a density plot of MFIs - a plot with a group of really high AICs with a group of lower AICs would suggest a missing strategy qplot(x = subjectData$modelAIC, geom = 'density') #if we're missing something let's identify worst explained quartile of subjects according to our model worstExplained = which(subjectData$modelAIC > as.numeric(summary(subjectData$modelAIC)[5])) qplot(data = trialData[which(trialData$SubjectID == subjectData$SubjectID[worstExplained]), x = IV, y = DV, group = trialData$SubjectID]) + geom_smooth() #a loess line for all subjects .. tab-item:: MatLab :: modelPredictions = fitlm(trialData, 'Decisions ~ Prediction'); summary(modelPredictions); % R-squared % We can identify if we're missing a strategy using a density plot of MFIs - a plot with a group of really high AICs with a group of lower AICs would suggest a missing strategy figure; ksdensity(subjectData.modelAIC); % If we're missing something, let's identify the worst-explained quartile of subjects according to our model worstExplained = find(subjectData.modelAIC > str2double(summary(subjectData.modelAIC).Variables{5, 1})); figure; for i = 1:length(worstExplained) subjectIndex = worstExplained(i); subplot(length(worstExplained), 1, i); scatter(trialData.IV(trialData.SubjectID == subjectData.SubjectID(subjectIndex)), trialData.DV(trialData.SubjectID == subjectData.SubjectID(subjectIndex)), 'filled'); hold on; lsline; hold off; end .. tab-item:: Python :: import statsmodels.api as sm import seaborn as sns import matplotlib.pyplot as plt modelPredictions = sm.OLS(trialData['Decisions'], sm.add_constant(trialData['Prediction'])).fit() print(modelPredictions.summary()) # R-squared # We can identify if we're missing a strategy using a density plot of MFIs - a plot with a group of really high AICs with a group of lower AICs would suggest a missing strategy sns.kdeplot(subjectData['modelAIC']) # If we're missing something, let's identify the worst-explained quartile of subjects according to our model worstExplained = subjectData[subjectData['modelAIC'] > float(subjectData['modelAIC'].describe()['75%'])].index plt.figure(figsize=(8, 6)) for i, subjectIndex in enumerate(worstExplained): plt.subplot(len(worstExplained), 1, i + 1) sns.scatterplot(x=trialData.loc[trialData['SubjectID'] == subjectData.loc[subjectIndex, 'SubjectID'], 'IV'], y=trialData.loc[trialData['SubjectID'] == subjectData.loc[subjectIndex, 'SubjectID'], 'DV'], hue=trialData['SubjectID']) sns.regplot(x=trialData.loc[trialData['SubjectID'] == subjectData.loc[subjectIndex, 'SubjectID'], 'IV'], y=trialData.loc[trialData['SubjectID'] == subjectData.loc[subjectIndex, 'SubjectID'], 'DV'], scatter=False) plt.show() .. dropdown:: Visually Checking Assumptions .. dropdown:: Linearity .. tab-set:: .. tab-item:: Plain English We need to ensure that the relationship between model predictions of :bdg-danger:`Decisions` and observed :bdg-danger:`Decisions` is linear. If this relationship is flat (slope is 0) rather than linear (slope is 1) then our model is doing terribly at predicting :bdg-danger:`Decisions`, essentially making predictions completely at chance-level. Thus our model would not be capturing the data generation process and, most likely, our :bdg-success:`Free Parameters` are fitted to noise at all values. So let's plot a regression line with model predictions on the x-axis and observed :bdg-danger:`Decisions` on the y-axis. The slope should be essentially 1. .. tab-item:: R :: qplot(x = trialData$Prediction, y = trialData$Decision, geom = 'smooth') + geom_abline(slope = 1, intercept = 0) .. tab-item:: MatLab :: scatter(trialData.Prediction, trialData.Decision); hold on; loess_smooth = fitloess(trialData.Prediction, trialData.Decision); plot(loess_smooth); plot(trialData.Prediction, trialData.Prediction, 'k--'); hold off; .. tab-item:: Python :: plt.scatter(trialData['Prediction'], trialData['Decision']) lowess = sm.nonparametric.lowess(trialData['Decision'], trialData['Prediction']) plt.plot(lowess[:, 0], lowess[:, 1], 'r-') plt.plot(trialData['Prediction'], trialData['Prediction'], 'k--') plt.show() .. dropdown:: Normality of Error .. tab-set:: .. tab-item:: Plain English We need to ensure that prediction errors are normally distributed. If prediction errors are skewed, then our model is making making more underpredictions (negative) or overpredictions (positive) of :bdg-danger:`Decisions`. If the distribution is kurtosed (i.e. too skinny or too fat) then we're more or less likely to have extreme errors in predicting :bdg-danger:`Decisons` compared to a normal distribution. We can create a density plot of prediction errors (i.e. the difference between model predictions of :bdg-danger:`Decisions` and observed :bdg-danger:`Decisions`) and see if it follows a bell-curve. We can create a density plot with a normal distribution where the standard deviation is the standard deviation of prediction errors to check this. .. tab-item:: R :: normvals = rnorm(1000, mean = 0, sd = sd(trialData$Prediction - trialData$Decision)) qplot(x = trialData$Prediction - trialData$Decision, geom = 'density', bw = 1, color = 'Actual') + geom_density(aes(x = normvals, color = 'Predicted'), bw = 1) .. tab-item:: MatLab :: rng('default'); % Set random number generator seed for reproducibility normvals = normrnd(0, std(trialData.Prediction - trialData.Decision), 1000, 1); figure; hold on; histogram(trialData.Prediction - trialData.Decision, 'Normalization', 'pdf', 'EdgeColor', 'b', 'FaceColor', 'none'); histogram(normvals, 'Normalization', 'pdf', 'EdgeColor', 'r', 'FaceColor', 'none'); legend('Actual', 'Predicted'); .. tab-item:: Python :: np.random.seed(0) # Set random number generator seed for reproducibility normvals = np.random.normal(0, np.std(trialData['Prediction'] - trialData['Decision']), 1000) plt.hist(trialData['Prediction'] - trialData['Decision'], bins='auto', density=True, color='b', edgecolor='b', alpha=0.5) plt.hist(normvals, bins='auto', density=True, color='r', edgecolor='r', alpha=0.5) plt.legend(['Actual', 'Predicted']) plt.show() .. dropdown:: Independence of Error .. tab-set:: .. tab-item:: Plain English Next, we need to ensure that our model's prediction errors of :bdg-danger:`Decisions` are not confounded with values of :bdg-primary:`Independent Variable`. If our model predicts :bdg-danger:`Decisions` worse or better at certain values of :bdg-primary:`Independent Variable` compared to others, then our recovery of :bdg-success:`Free Parameters` is influenced disproportionately by :bdg-danger:`Decisions` made when :bdg-primary:`Independent Variable` have a certain value. Thus, our :bdg-success:`Free Parameters` would be overfit for certain values and underfit for other values. We can check this assumption by creating loess line with our :bdg-primary:`Independent Variable` on the x-axis and the model prediction errors on the y-axis. This slope should be 0 with an intercept of 0. .. tab-item:: R :: qplot(x = trialData$IV, y = (trialData$Prediction-trialData$Decision), geom = 'smooth') .. tab-item:: MatLab :: x = trialData.IV; y = trialData.Prediction - trialData.Decision; loess = fit(x, y, 'loess'); scatter(x, y); hold on; plot(loess, x, y); hold off; .. tab-item:: Python :: x = trialData['IV'] y = trialData['Prediction'] - trialData['Decision'] lowess = sm.nonparametric.lowess(y, x, frac=0.25) # Adjust the frac parameter as needed plt.scatter(x, y) plt.plot(lowess[:, 0], lowess[:, 1], 'r-') plt.show() .. dropdown:: Homoscedasticty .. tab-set:: .. tab-item:: Plain English Finally, we need to ensure that variance in model prediction errors does not change as a function of an :bdg-primary:`Independent Variable`. Of all of the assumptions, this one is the least problematic if violated - essentially it indicates that our model has less predictive accuracy at certain values of the :bdg-primary:`Independent Variable`. This is a bigger issue for regression models because it can make Confidence Intervals of regression coefficients which are too narrow - we don't have this issue unless you're doing means testing on your :bdg-success:`Free Parameters` (i.e. testing modulatory hypotheses). Nonetheless, this could produce unreliable :bdg-success:`Free Parameter` estimates so we have to rely on out-of-sample validation to rule this out. If this assumption is very badly violated, it might make sense to included a noise parameter into your model to scale with the :bdg-primary:`Independent Variable` or to, instead, use a different estimator such as Weighted Least Squares to recover your :bdg-success:`Free Parameters`. To check this, we can create a loess line with a variance cloud with our :bdg-primary:`Independent Variable` on the x-axis and the model prediction errors on the y-axis. The cloud should be a constant width around the loess line. .. tab-item:: R :: qplot(x = trialData$IV, y = (trialData$Prediction-trialData$Decision), geom = 'smooth') .. tab-item:: MatLab :: x = trialData.IV; y = trialData.Prediction - trialData.Decision; loessSmoothing = fit(x, y, 'loess'); plot(loessSmoothing, x, y); hold on; plot(x, y, 'o', 'MarkerEdgeColor', 'b', 'MarkerFaceColor', 'b', 'MarkerSize', 5); legend('LOESS Smoothed Line', 'Data Points'); xlabel('IV'); ylabel('Prediction - Decision'); hold off; .. tab-item:: Python :: x = trialData['IV'] y = trialData['Prediction'] - trialData['Decision'] lowess = sm.nonparametric.lowess(y, x, frac=0.3) # Adjust frac as needed smoothed_x, smoothed_y = lowess.T plt.plot(smoothed_x, smoothed_y, label='LOESS Smoothed Line') plt.scatter(x, y, marker='o', edgecolors='b', facecolors='b', s=50, label='Data Points') plt.legend() plt.xlabel('IV') plt.ylabel('Prediction - Decision') plt.show() .. dropdown:: Assessing Independence of Observations We want to ensure that accounting for :bdg-success:`Subjects`' differences from each other using :bdg-success:`Free Parameters` results in model predictions of :bdg-danger:`Decisions` which are not attributable to individual differences. We can accomplish this using the same linear modeling formula, but including random intercepts for the subject. .. tab-set:: .. tab-item:: Plain English First, we can try a model with a random intercept and random slope of model predicted value. This model might have convergence issues or singularity issues - if so, great! Otherwise, don't worry just go to the next model. Next, we can try a model with a only a random intercept. We can estimate an R-squared for this model - the marginal R-squared (i.e. including variance explained by random effects) should be roughly equivalent to the conditional R-squared (i.e. variance only explained by fixed effects - i.e. your model) - within 0.05 is reasonable. .. tab-item:: R :: ris_model = lmer(data = trialData, Decision ~ Prediction + (1 + Prediction | SubjectID)) summary(ris_model) #model should have issues ri_model = lmer(data = trialData, Decision ~ Prediction + (1 | SubjectID)) summary(ri_model) library(MuMin) r.squaredGLMM(ri_model) #if conditional Rsq is between 0 and 0.05 lower than the multiple Rsq, that's good enough .. tab-item:: MatLab :: ris_model = fitlme(trialData, 'Decision ~ Prediction + (1 + Prediction | SubjectID)'); disp(ris_model); ri_model = fitlme(trialData, 'Decision ~ Prediction + (1 | SubjectID)'); disp(ri_model); disp(ri_model.Rsquared); #if conditional Rsq is between 0 and 0.05 lower than the multiple Rsq, that's good enough .. tab-item:: Python :: import statsmodels.api as sm import r2glmm ris_model = sm.MixedLM.from_formula('Decision ~ Prediction + (1 + Prediction | SubjectID)', data=trialData) result_ris = ris_model.fit() print(result_ris.summary()) ri_model = sm.MixedLM.from_formula('Decision ~ Prediction + (1 | SubjectID)', data=trialData) result_ri = ri_model.fit() print(result_ri.summary()) r_squared = r2glmm.get_r2(result_ri) print(r_squared) #if conditional Rsq is between 0 and 0.05 lower than the multiple Rsq, that's good enough .. dropdown:: Validating Parameter Recovery Process In order to show that our parameter recovery process is robust, we have to show that the model can predict behavior that it was not trained on. We accomplish this by performing Fivefold Validation. .. dropdown:: Fivefold Validation .. tab-set:: .. tab-item:: Plain English We essentially want to prove that our :bdg-success:`Free Parameters` are not overfitting the :bdg-danger:`Decisions` that they are training on. In other words, we want to rule out that our favored model isn't outperforming other models because it's fitting weird quirks in :bdg-success:`Subjects`' :bdg-danger:`Decisions`. It should seem really intuitive that, in order to prove this, we're going to separate :bdg-danger:`Decisions` into a training set and a testing set. Let's talk about how specifically we're going to do this. So, we're going to randomly split each :bdg-success:`Subjects`' :bdg-danger:`Decisions` into one of five groups, called folds. We're going to take one fold and remove those :bdg-primary:`Trials` from our training set - we're going to recover :bdg-success:`Free Parameters` from four-fifths of :bdg-danger:`Decisions` and we're going to test it on this fold - this fifth of :bdg-danger:`Decisions` that we excluded. We're going to save the :bdg-success:`Free Parameters` from this four-fifths and the predicted :bdg-danger:`Decisions` of the withheld one-fifth. We rinse and repeat for the remaining four folds. Once this is done, we can get the model error for the predicted-against-observed :bdg-danger:`Decisions` for all :bdg-primary:`Trials`. Now we should have five sets of :bdg-success:`Free Parameters` for each :bdg-success:`Subject` and, combining the model error of all five folds, we should have a model error estimate across all :bdg-primary:`Trials`. Now, we want to look at two things: first we want to assess how much larger this model error is compared to the model error of the model trained on all of the data. Here, you should report the change in root mean squared error per trial - you can also do a paired t-test on the MFI of the fivefold predictions compared to the standard model predictions but this is not necessary. Second, we also want to assess the similarity of the :bdg-success:`Free Parameters` we recovered withholding each of the five folds to the :bdg-success:`Free Parameters` recovered on the entire data set. You should report the average cosine similarity across all folds for each of the :bdg-success:`Free Parameters`. .. tab-item:: R :: fivefold = data.frame() #preallocate for parameters and errors from the fivefold validation to go into for (i in 1:length(included_subjects)){ datafile = paste(parentfolder, included_subjects[i], restoffilepath, sep = '') # produces a character vector 'parentfolder/included_subjects[i]**.filetype' df = read.csv2(datafile) order = sample(20) Parameter1_ff = vector('numeric', length = 5) Parameter2_ff = vector('numeric', length = 5) SS_ff = 0 Prediction_ff = vector('numeric', length(df$Decision)) for (z in 1:5){ j = (z - 1) * 4 + 1 n = z * 4 withheld = order[j:n] m = ((i - 1) * 5) + z result_ff = fmincon(obj_function,x0 = initial_params, A = NULL, b = NULL, Aeq = NULL, beq = NULL, lb = lower_bounds, ub = upper_bounds, df = df[-withheld,]) Parameter1_ff[m] = result_ff$par[1] Parameter2_ff[m] = result_ff$par[2] for (n in 1:length(withheld)){ utility_choices = vector('numeric', length(Choices)) for (q in 1:length(Choices)){ utility_choices[q] = utility(result_ff$par[1], result_ff$par[2], construct1(df$IV[withheld[n]], df$Constant[withheld[n]], Choices[q]), construct2(df$IV[withheld[n]], df$Constant[withheld[n]], Choices[q]), construct3(df$IV[withheld[n]], df$Constant[withheld[n]], Choices[q])) } Prediction_ff[withheld[n]] = Choices[which(utility_choices == max(utility_choices))] } } SS_ff = sum((df$Decision - Prediction_ff)**2) fivefold[i, 1:11] = c(SS_ff, Parameter1_ff, Parameter2_ff) } colnames(fivefold) = c('SS', 'Par1_fold1', 'Par1_fold2', 'Par1_fold3', 'Par1_fold4', 'Par1_fold5', 'Par2_fold1', 'Par2_fold2', 'Par2_fold3', 'Par2_fold4', 'Par2_fold5', SubjectID) sqrt(mean(fivefold$SS)/length(df$IV)) - sqrt(mean(subjectData$modelSS)/length(df$IV)) #the change in root mean squared error, per trial fivefold$AIC = length(df$IV) * log(fivefold$SS/length(df$IV)) + 2 * 2 t.test(fivefold$AIC, subjectData$AIC, paired = T) #test fivefold MFI against normal MFI for this model library(lsa) cosines = vector('numeric', length = 10) for (i in 1:5){ cosines[i] = cosine(subjectData$Parameter1, fivefold[, (i + 1)]) #to get the correct columns in the fivefold dataframe (2-6) cosines[(i+5)] = cosine(subjectData$Parameter1, fivefold[, (i + 6)]) #to get the correct columns in the fivefold dataframe (7-11) } mean(cosines[1:5]) #cosine similarity of parameter 1 mean(cosines[6:10]) #cosine similarity of parameter 2 .. tab-item:: MatLab :: fivefold = []; for i = 1:length(included_subjects) datafile = strcat(parentfolder, included_subjects{i}, restoffilepath); df = readtable(datafile); order = randperm(20); Parameter1_ff = zeros(1, 5); Parameter2_ff = zeros(1, 5); SS_ff = 0; Prediction_ff = zeros(1, length(df.Decision)); for z = 1:5 j = (z - 1) * 4 + 1; n = z * 4; withheld = order(j:n); m = ((i - 1) * 5) + z; options = optimset('fmincon'); options.Display = 'off'; result_ff = fmincon(@(params) obj_function(params, df(~ismember(1:size(df, 1), withheld), :)), initial_params, [], [], [], [], lower_bounds, upper_bounds, [], options); Parameter1_ff(m) = result_ff(1); Parameter2_ff(m) = result_ff(2); for n = 1:length(withheld) utility_choices = zeros(1, length(Choices)); for q = 1:length(Choices) utility_choices(q) = utility(result_ff(1), result_ff(2), ... construct1(df.IV(withheld(n)), df.Constant(withheld(n)), Choices(q)), ... construct2(df.IV(withheld(n)), df.Constant(withheld(n)), Choices(q)), ... construct3(df.IV(withheld(n)), df.Constant(withheld(n)), Choices(q))); end [~, max_utility_idx] = max(utility_choices); Prediction_ff(withheld(n)) = Choices(max_utility_idx); end end SS_ff = sum((df.Decision - Prediction_ff).^2); fivefold(i, 1:11) = [SS_ff, Parameter1_ff, Parameter2_ff]; end fivefold.Properties.VariableNames = {'SS', 'Par1_fold1', 'Par1_fold2', 'Par1_fold3', 'Par1_fold4', 'Par1_fold5', ... 'Par2_fold1', 'Par2_fold2', 'Par2_fold3', 'Par2_fold4', 'Par2_fold5'}; root_mean_squared_error_change = sqrt(mean(fivefold.SS) / length(df.IV)) - sqrt(mean(subjectData.modelSS) / length(df.IV)); fivefold.AIC = length(df.IV) * log(fivefold.SS / length(df.IV)) + 2 * 2; ttest_result = ttest(fivefold.AIC, subjectData.AIC, 'paired', true); cosines = zeros(1, 10); for i = 1:5 cosines(i) = cosine(subjectData.Parameter1, fivefold{:, i + 1}); cosines(i + 5) = cosine(subjectData.Parameter1, fivefold{:, i + 6}); end mean_cosine_parameter1 = mean(cosines(1:5)); mean_cosine_parameter2 = mean(cosines(6:10)); .. tab-item:: Python :: from scipy.stats import ttest_rel from lsa.cosine import cosine fivefold = pd.DataFrame() for i in range(len(included_subjects)): datafile = parentfolder + included_subjects[i] + restoffilepath df = pd.read_csv(datafile, sep=';') order = np.random.permutation(20) Parameter1_ff = np.zeros(5) Parameter2_ff = np.zeros(5) SS_ff = 0 Prediction_ff = np.zeros(len(df['Decision'])) for z in range(1, 6): j = (z - 1) * 4 n = z * 4 withheld = order[j:n] m = ((i - 1) * 5) + z result_ff = fmincon(obj_function_ff, x0=initial_params, args=( df[~df.index.isin(withheld)]), bounds=(lower_bounds, upper_bounds)) Parameter1_ff[m - 1] = result_ff[0] Parameter2_ff[m - 1] = result_ff[1] for n in withheld: utility_choices = np.zeros(len(Choices)) for q in range(len(Choices)): utility_choices[q] = utility(result_ff[0], result_ff[1], construct1(df['IV'][n], df['Constant'][n], Choices[q]), construct2(df['IV'][n], df['Constant'][n], Choices[q]), construct3(df['IV'][n], df['Constant'][n], Choices[q])) max_utility_idx = np.argmax(utility_choices) Prediction_ff[n] = Choices[max_utility_idx] SS_ff = sum((df['Decision'] - Prediction_ff)**2) fivefold.loc[i, 0:10] = [SS_ff] + list(Parameter1_ff) + list(Parameter2_ff) fivefold.columns = ['SS', 'Par1_fold1', 'Par1_fold2', 'Par1_fold3', 'Par1_fold4', 'Par1_fold5', 'Par2_fold1', 'Par2_fold2', 'Par2_fold3', 'Par2_fold4', 'Par2_fold5'] root_mean_squared_error_change = np.sqrt(fivefold['SS'].mean() / len(df['IV'])) - np.sqrt(subjectData['modelSS'].mean() / len(df['IV'])) fivefold['AIC'] = len(df['IV']) * np.log(fivefold['SS'] / len(df['IV'])) + 2 * 2 ttest_result = ttest_rel(fivefold['AIC'], subjectData['AIC']) cosines = np.zeros(10) for i in range(5): cosines[i] = cosine(subjectData['Parameter1'], fivefold.iloc[:, i + 1]) cosines[i + 5] = cosine(subjectData['Parameter1'], fivefold.iloc[:, i + 6]) mean_cosine_parameter1 = np.mean(cosines[0:5]) mean_cosine_parameter2 = np.mean(cosines[5:10]) Tutorials ========== Tutorial 1 - van Baar, Chang, & Sanfey, 2019 ---------------------- .. dropdown:: Assessing Accuracy in Capturing the Data Generation Process .. tab-set:: .. tab-item:: R :: ### Model Performance modelPredictions = lm(data = trialData, as.numeric(Returned) ~ Prediction) summary(modelPredictions) # R-squared ### Visually Checking Assumptions qplot(x = trialData$Prediction, y = as.numeric(trialData$Returned), geom = 'smooth') + geom_abline(slope = 1, intercept = 0) + labs(x = 'Prediction', y = 'Observed') + lims(x = c(0, 30), y = c(0, 30)) # linearity normvals = rnorm(1000, mean = 0, sd = sd(trialData$Prediction - as.numeric(trialData$Returned))) qplot(x = trialData$Prediction - as.numeric(trialData$Returned), geom = 'density', bw = 1, color = 'Actual') + geom_density(aes(x = normvals, color = 'Predicted'), bw = 1) + labs(x = 'Prediction - Observed', y = 'Density') #normality qplot(x = trialData$Investment, y = (trialData$Prediction-as.numeric(trialData$Returned)), group = as.factor(trialData$Multiplier), color = as.factor(trialData$Multiplier), geom = 'smooth') + labs(x = 'Investment', y = 'Prediction - Observed', color = 'Multiplier') #independence of error and homoscedasticity ### Assessing Independence of Observations library(lme4) library(MuMIn) ris_model = lmer(data = trialData, as.numeric(Returned) ~ Prediction + (1 + Prediction | Subject)) r.squaredGLMM(ris_model) ri_model = lmer(data = trialData, as.numeric(Returned) ~ Prediction + (1 | Subject)) r.squaredGLMM(ri_model) ric_model = lmer(data = trialData, as.numeric(Returned) ~ Prediction + (1 | Subject)) r.squaredGLMM(ric_model) .. tab-item:: MatLab :: % Model Performance modelPredictions = fitlm(trialData, 'Returned ~ Prediction'); disp('R-squared:'); disp(modelPredictions.Rsquared.Ordinary); % Visually Checking Assumptions figure; % Linearity subplot(2, 2, 1); scatter(trialData.Prediction, trialData.Returned, 'filled'); hold on; refline(1, 0); xlabel('Prediction'); ylabel('Observed'); xlim([0, 30]); ylim([0, 30]); title('Linearity'); % Normality subplot(2, 2, 2); diffResiduals = modelPredictions.Residuals.Raw - trialData.Returned; histogram(diffResiduals, 'Normalization', 'pdf', 'EdgeColor', 'none'); hold on; normvals = normrnd(0, std(diffResiduals), 1000, 1); histogram(normvals, 'Normalization', 'pdf', 'EdgeColor', 'none'); xlabel('Prediction - Observed'); ylabel('Density'); title('Normality'); % Independence of Error and Homoscedasticity subplot(2, 2, 3); gscatter(trialData.Investment, diffResiduals, trialData.Multiplier, [], [], 15); xlabel('Investment'); ylabel('Prediction - Observed'); title('Independence of Error and Homoscedasticity'); % Assessing Independence of Observations ris_model = fitlme(trialData, 'Returned ~ Prediction + (1 + Prediction | SubjectID)'); disp(ris_model); disp(r2(ris_model)); ri_model = fitlme(trialData, 'Returned ~ Prediction + (1 | SubjectID)'); disp(ri_model) disp(r2(ri_model)); ric_model = fitlme(trialData, 'Returned ~ Prediction + (1 | SubjectID)'); disp(ric_model); disp(r2(ric_model)); .. tab-item:: Python :: import seaborn as sns import matplotlib.pyplot as plt import statsmodels.api as sm # Model Performance modelPredictions = sm.OLS(trialData['Returned'], sm.add_constant(trialData['Prediction'])).fit() print(modelPredictions.summary()) # Visually Checking Assumptions fig, axs = plt.subplots(2, 2, figsize=(12, 10)) # Linearity axs[0, 0].scatter(trialData['Prediction'], trialData['Returned'], color='blue', alpha=0.5) axs[0, 0].plot([0, 30], [0, 30], color='red', linestyle='--') axs[0, 0].set_xlabel('Prediction') axs[0, 0].set_ylabel('Observed') axs[0, 0].set_xlim([0, 30]) axs[0, 0].set_ylim([0, 30]) axs[0, 0].set_title('Linearity') # Normality axs[0, 1].hist(modelPredictions.resid, bins='auto', color='blue', alpha=0.5, density=True) sns.histplot(normvals, bins='auto', color='red', alpha=0.5, kde=True, ax=axs[0, 1]) axs[0, 1].set_xlabel('Prediction - Observed') axs[0, 1].set_ylabel('Density') axs[0, 1].set_title('Normality') # Independence of Error and Homoscedasticity axs[1, 0].scatter(trialData['Investment'], modelPredictions.resid, c=trialData['Multiplier'], cmap='viridis', s=15) axs[1, 0].set_xlabel('Investment') axs[1, 0].set_ylabel('Prediction - Observed') axs[1, 0].set_title('Independence of Error and Homoscedasticity') plt.tight_layout() plt.show() ### Assessing Independence # can do directly in R .. dropdown:: Validating Parameter Recovery Process .. tab-set:: .. tab-item:: R :: fivefold = data.frame() #preallocate for parameters and errors from the fivefold validation to go into for (i in 1:length(included_subjects)){ df = trialData[which(included_subjects[i] == trialData$Subject), ] order = sample(length(df$Returned)) Theta_ff = vector('numeric', length = 5) Phi_ff = vector('numeric', length = 5) SS_ff = 0 Prediction_ff = vector('numeric', length(df$Returned)) for (z in 1:5){ j = round((z - 1) * (length(df$Returned)/5) + 1) n = round(z * (length(df$Returned)/5)) withheld = order[j:n] m = ((i - 1) * 5) + z result_ff = fmincon(obj_function,x0 = initial_params, A = NULL, b = NULL, Aeq = NULL, beq = NULL, lb = lower_bounds, ub = upper_bounds, df = df[-withheld,]) Theta_ff[z] = result_ff$par[1] Phi_ff[z] = result_ff$par[2] for (n in 1:length(withheld)){ if (df$Investment[withheld[n]] > 10) { Choices = seq(0, (df$Investment[withheld[n]] * df$Multiplier[withheld[n]]), round((df$Investment[withheld[n]]*df$Multiplier[withheld[n]])/10)) } else { Choices = seq(0, (df$Investment[withheld[n]] * df$Multiplier[withheld[n]]), 1) } utility_choices = vector('numeric', length(Choices)) for (q in 1:length(Choices)){ utility_choices[q] = utility(theta = result_ff$par[1], phi = result_ff$par[2], guilt = guilt(df$Investment[withheld[n]], believed_multiplier = 4, Choices[q], df$Multiplier[withheld[n]]), payout = payout_maximization(df$Investment[withheld[n]], df$Multiplier[withheld[n]], Choices[q]), inequity = inequity(df$Investment[withheld[n]], df$Multiplier[withheld[n]], Choices[q], endowment = 10)) } Prediction_ff[withheld[n]] = Choices[which(utility_choices == max(utility_choices))[1]] } } SS_ff = sum((as.numeric(df$Returned) - Prediction_ff)**2) fivefold[i, 1:12] = c(SS_ff, Theta_ff, Phi_ff, included_subjects[i]) } colnames(fivefold) = c('SS', 'Par1_fold1', 'Par1_fold2', 'Par1_fold3', 'Par1_fold4', 'Par1_fold5', 'Par2_fold1', 'Par2_fold2', 'Par2_fold3', 'Par2_fold4', 'Par2_fold5', 'SubjectID') sqrt(mean(fivefold$SS)/length(df$Investment)) - sqrt(mean(subjectData$modelSS)/length(df$Investment)) #the change in root mean squared error, per trial fivefold$AIC = length(df$Investment) * log(fivefold$SS/length(df$Investment)) + 2 * 2 t.test(fivefold$AIC, subjectData$modelAIC, paired = T) #test fivefold MFI against normal MFI for this model library(lsa) cosines = vector('numeric', length = 10) for (i in 1:5){ cosines[i] = cosine(subjectData$Theta, fivefold[, (i + 1)]) #to get the correct columns in the fivefold dataframe (2-6) cosines[(i+5)] = cosine(subjectData$Phi, fivefold[, (i + 6)]) #to get the correct columns in the fivefold dataframe (7-11) } mean(cosines[1:5]) #cosine similarity of parameter 1 mean(cosines[6:10]) #cosine similarity of parameter 2 .. tab-item:: MatLab :: fivefold = table(); % preallocate for parameters and errors from the fivefold validation to go into for i = 1:length(included_subjects) df = trialData(included_subjects(i) == trialData.Subject, :); order = randperm(length(df.Returned)); Theta_ff = NaN(1, 5); Phi_ff = NaN(1, 5); SS_ff = 0; Prediction_ff = NaN(1, length(df.Returned)); for z = 1:5 j = round((z - 1) * (length(df.Returned)/5) + 1); n = round(z * (length(df.Returned)/5)); withheld = order(j:n); result_ff = fmincon(@(params) obj_function(params, df(~ismember(1:length(df.Returned), withheld), :)), ... initial_params, [], [], [], [], lower_bounds, upper_bounds); Theta_ff(z) = result_ff(1); Phi_ff(z) = result_ff(2); for n = 1:length(withheld) if df.Investment(withheld(n)) > 10 Choices = 0:round((df.Investment(withheld(n)) * df.Multiplier(withheld(n)))/10):(df.Investment(withheld(n)) * df.Multiplier(withheld(n))); else Choices = 0:1:(df.Investment(withheld(n)) * df.Multiplier(withheld(n))); end utility_choices = NaN(1, length(Choices)); for q = 1:length(Choices) utility_choices(q) = utility(result_ff(1), result_ff(2), ... guilt(df.Investment(withheld(n)), 4, Choices(q), df.Multiplier(withheld(n))), ... payout_maximization(df.Investment(withheld(n)), df.Multiplier(withheld(n)), Choices(q)), ... inequity(df.Investment(withheld(n)), df.Multiplier(withheld(n)), Choices(q), 10)); end [~, maxIndex] = max(utility_choices); Prediction_ff(withheld(n)) = Choices(maxIndex); end end SS_ff = sum((df.Returned - Prediction_ff).^2); fivefold(i, 1:12) = [SS_ff, Theta_ff, Phi_ff, included_subjects(i)]; end fivefold.Properties.VariableNames = {'SS', 'Par1_fold1', 'Par1_fold2', 'Par1_fold3', 'Par1_fold4', 'Par1_fold5', ... 'Par2_fold1', 'Par2_fold2', 'Par2_fold3', 'Par2_fold4', 'Par2_fold5', 'SubjectID'}; sqrt(mean(fivefold.SS)/length(df.Investment)) - sqrt(mean(subjectData.modelSS)/length(df.Investment)); % the change in root mean squared error, per trial fivefold.AIC = length(df.Investment) * log(fivefold.SS/length(df.Investment)) + 2 * 2; ttest(fivefold.AIC, subjectData.modelAIC); % test fivefold MFI against normal MFI for this model cosines = NaN(1, 10); for i = 1:5 cosines(i) = cosine(subjectData.Theta, fivefold{:, (i + 1)}); % to get the correct columns in the fivefold dataframe (2-6) cosines(i + 5) = cosine(subjectData.Phi, fivefold{:, (i + 6)}); % to get the correct columns in the fivefold dataframe (7-11) end mean(cosines(1:5)) % cosine similarity of parameter 1 mean(cosines(6:10)) % cosine similarity of parameter 2 .. tab-item:: Python :: fivefold = pd.DataFrame() # preallocate for parameters and errors from the fivefold validation to go into for i in range(1, len(included_subjects) + 1): df = trialData[included_subjects[i - 1] == trialData['Subject']] order = np.random.permutation(len(df['Returned'])) Theta_ff = np.zeros(5) Phi_ff = np.zeros(5) SS_ff = 0 Prediction_ff = np.zeros(len(df['Returned'])) for z in range(1, 6): j = round((z - 1) * (len(df['Returned']) / 5) + 1) n = round(z * (len(df['Returned']) / 5)) withheld = order[j-1:n] result_ff = minimize(lambda params: obj_function(params, df.loc[~df.index.isin(withheld), :]), initial_params, bounds=list(zip(lower_bounds, upper_bounds))) Theta_ff[z-1] = result_ff.x[0] Phi_ff[z-1] = result_ff.x[1] for n in withheld: if df['Investment'].iloc[n] > 10: Choices = np.arange(0, (df['Investment'].iloc[n] * df['Multiplier'].iloc[n]), round((df['Investment'].iloc[n] * df['Multiplier'].iloc[n])/10)) else: Choices = np.arange(0, (df['Investment'].iloc[n] * df['Multiplier'].iloc[n]), 1) utility_choices = np.zeros(len(Choices)) for q in range(len(Choices)): utility_choices[q] = utility(result_ff.x[0], result_ff.x[1], guilt(df['Investment'].iloc[n], 4, Choices[q], df['Multiplier'].iloc[n]), payout_maximization(df['Investment'].iloc[n], df['Multiplier'].iloc[n], Choices[q]), inequity(df['Investment'].iloc[n], df['Multiplier'].iloc[n], Choices[q], 10)) correct_choice = Choices[np.argmax(utility_choices)] Prediction_ff[n] = correct_choice SS_ff = np.sum((df['Returned'].to_numpy() - Prediction_ff)**2) fivefold[i - 1:i] = [SS_ff] + list(Theta_ff) + list(Phi_ff) + [included_subjects[i - 1]] sqrt(np.mean(fivefold['SS'])/len(df['Investment'])) - np.sqrt(np.mean(subjectData['modelSS'])/len(df['Investment'])) # the change in root mean squared error, per trial fivefold['AIC'] = len(df['Investment']) * np.log(fivefold['SS']/len(df['Investment'])) + 2 * 2 ttest_rel(fivefold['AIC'], subjectData['modelAIC']) # test fivefold MFI against normal MFI for this model cosines = np.zeros(10) for i in range(1, 6): cosines[i - 1] = cosine_similarity(subjectData['Theta'].to_numpy().reshape(1, -1), fivefold.iloc[:, i:i+1].to_numpy().T)[0, 0] cosines[i + 4] = cosine_similarity(subjectData['Phi'].to_numpy().reshape(1, -1), fivefold.iloc[:, i+5:i+6].to_numpy().T)[0, 0] np.mean(cosines[0:5]) # cosine similarity of parameter 1 np.mean(cosines[5:10]) # cosine similarity of parameter 2 Tutorial 2 - Galvan & Sanfey, 2024 ------------------- .. dropdown:: Assessing Accuracy in Capturing the Data Generation Process .. tab-set:: .. tab-item:: R :: ### Model Performance taxRatePredictions = lm(data = trialData, observedTaxRate ~ predictedTaxRate) summary(taxRatePredictions) # R-squared predicting the subject's tax rate payoutPredictions = lm(data = trialData, observedOutcome ~ predictedOutcome) summary(payoutPredictions) # R-squared predicting the subject's payout, we're going to focus on this ### Visually Checking Assumptions qplot(x = trialData$predictedOutcome, y = trialData$observedOutcome, geom = 'smooth') + geom_abline(slope = 1, intercept = 0) + labs(x = 'Prediction', y = 'Observed') + lims(x = c(0, 30), y = c(0, 30)) # linearity normvals = rnorm(1000, mean = 0, sd = sd(trialData$predictedOutcome - trialData$observedOutcome)) qplot(x = (trialData$predictedOutcome - trialData$observedOutcome), geom = 'density', bw = 1, color = 'Actual') + geom_density(aes(x = normvals, color = 'Predicted'), bw = 1) + labs(x = 'Prediction - Observed', y = 'Density') #normality qplot(x = trialData$initialAllocation, y = (trialData$predictedOutcome - trialData$observedOutcome), group = as.factor(trialData$Multiplier), color = as.factor(trialData$Multiplier), geom = 'smooth') + labs(x = 'Initial Allocation', y = 'Prediction - Observed', color = 'Multiplier') #independence of error and homoscedasticity ### Assessing Independence library(lme4) library(MuMin) ris_model = lmer(data = trialData, observedOutcome ~ predictedOutcome + (1 + predictedOutcome | SubjectID)) summary(ris_model) #model should have issues r.squaredGLMM(ris_model) cri_model = lmer(data = trialData, observedOutcome ~ predictedOutcome + condition + (1 | SubjectID)) summary(cri_model) #model should have issues r.squaredGLMM(cri_model) cric_model = lmer(data = trialData, observedOutcome ~ predictedOutcome + condition + (1 + condition | SubjectID)) summary(cric_model) #model should have issues r.squaredGLMM(cric_model) ri_model = lmer(data = trialData, observedOutcome ~ predictedOutcome + (1 | SubjectID)) summary(ri_model) r.squaredGLMM(ri_model) #if conditional Rsq is between 0 and 0.05 lower than the multiple Rsq, that's good enough .. tab-item:: MatLab :: % Model Performance taxRatePredictions = fitlm(trialData, 'observedTaxRate ~ predictedTaxRate'); disp(taxRatePredictions); % Display summary payoutPredictions = fitlm(trialData, 'observedOutcome ~ predictedOutcome'); disp(payoutPredictions); % Display summary % Visually Checking Assumptions figure; % Linearity subplot(2, 2, 1); scatter(trialData.predictedOutcome, trialData.observedOutcome, 'filled'); hold on; plot([0 30], [0 30], '--k'); xlabel('Prediction'); ylabel('Observed'); xlim([0 30]); ylim([0 30]); title('Linearity'); % Normality subplot(2, 2, 2); histogram(trialData.predictedOutcome - trialData.observedOutcome, 'Normalization', 'probability'); hold on; normvals = normrnd(0, std(trialData.predictedOutcome - trialData.observedOutcome), 1, 1000); histogram(normvals, 'Normalization', 'probability'); xlabel('Prediction - Observed'); ylabel('Density'); title('Normality'); % Independence of Error and Homoscedasticity subplot(2, 2, 3); gscatter(trialData.initialAllocation, trialData.predictedOutcome - trialData.observedOutcome, trialData.Multiplier, 'br', '.', 10); xlabel('Initial Allocation'); ylabel('Prediction - Observed'); title('Independence of Error and Homoscedasticity'); legend('off'); % Turn off automatic legend % Assessing Independence ris_model = fitlme(trialData, 'observedOutcome ~ predictedOutcome + (1 + predictedOutcome | SubjectID)'); disp(ris_model); disp(r2(ris_model)); cri_model = fitlme(trialData, 'observedOutcome ~ predictedOutcome + condition + (1 | SubjectID)'); disp(cri_model); disp(r2(cri_model)); cric_model = fitlme(trialData, 'observedOutcome ~ predictedOutcome + condition + (1 + condition | SubjectID)'); disp(cric_model); disp(r2(cric_model)); ri_model = fitlme(trialData, 'observedOutcome ~ predictedOutcome + (1 | SubjectID)'); disp(ri_model); disp(r2(ri_model)); .. tab-item:: Python :: import seaborn as sns import matplotlib.pyplot as plt from statsmodels.regression.linear_model import OLS import statsmodels.api as sm # Model Performance tax_rate_predictions = OLS.from_formula('observedTaxRate ~ predictedTaxRate', trialData).fit() print(tax_rate_predictions.summary()) payout_predictions = OLS.from_formula('observedOutcome ~ predictedOutcome', trialData).fit() print(payout_predictions.summary()) # Visually Checking Assumptions # Linearity plt.subplot(2, 2, 1) sns.scatterplot(x='predictedOutcome', y='observedOutcome', data=trialData) plt.plot([0, 30], [0, 30], '--k') plt.xlabel('Prediction') plt.ylabel('Observed') plt.title('Linearity') # Normality plt.subplot(2, 2, 2) sns.histplot(trialData['predictedOutcome'] - trialData['observedOutcome'], kde=True, color='blue', label='Actual') norm_vals = np.random.normal(0, np.std(trialData['predictedOutcome'] - trialData['observedOutcome']), 1000) sns.histplot(norm_vals, kde=True, color='orange', label='Predicted') plt.xlabel('Prediction - Observed') plt.ylabel('Density') plt.title('Normality') plt.legend() # Independence of Error and Homoscedasticity plt.subplot(2, 2, 3) sns.lineplot(x='initialAllocation', y='predictedOutcome - observedOutcome', hue='Multiplier', style='Multiplier', markers=True, data=trialData) plt.xlabel('Initial Allocation') plt.ylabel('Prediction - Observed') plt.title('Independence of Error and Homoscedasticity') plt.tight_layout() plt.show() ### Assessing Independence # can do directly in R .. dropdown:: Validating Parameter Recovery Process .. tab-set:: .. tab-item:: R :: fivefold = data.frame() for (i in 1:length(included_subjects)){ datafile = paste(parentfolder, included_subjects[i], restoffilepath, sep = '') # produces a character vector 'parentfolder/included_subjects[i]**.filetype' fullData = read.csv2(datafile) thetaPerCondition = vector('numeric', length(conditions) * 5) phiPerCondition = vector('numeric', length(conditions) * 5) SSPerCondition = vector('numeric', length(conditions)) for (j in 1:length(conditions)){ df = fullData[which(fullData$condition == conditions[j]), c(49, 40:48), 33] #49 is subject's initial allocation, 40:48 are players 1:9 initial allocation, 33 is redistribution rate df$redistributionRate = df$redistributionRate/100 #converting to a decimal from a percent order = sample(20) for (z in 1:5){ j = (z - 1) * 4 + 1 n = z * 4 withheld = order[j:n] m = ((j - 1) * 5) + z result = fmincon(obj_function,x0 = initial_params, A = NULL, b = NULL, Aeq = NULL, beq = NULL, lb = lower_bounds, ub = upper_bounds, df = df[-withheld,]) thetaPerCondition[m] = result$par[1] phiPerCondition[m] = result$par[2] for (k in 1:length(withheld)){ y = withheld[k] Utility = vector('numeric', length(Choices)) for (n in 1:length(Choices)){ Utility[n] = utility(theta = thetaPerCondition[m], phi = phiPerCondition[m], Equity = equity(new_value(df[y, 1:10], choices[n]), df[y, 1:10], choices[n]), Equality = equality(new_value(df[y, 1:10], choices[n]), df[y, 1:10], choices[n]), Payout = payout(new_value(df[y, 1], choices[n]), df[y, 1], choices[n])) } correct_choice = which(Utility == max(Utility)) predictedOutcome = new_value(df$myself[y], Choices[correct_choice[sample(length(correct_choice), 1)]]) observedOutcome = new_value(df$myself[y], df$redistributionRate) SSPerCondition = SSPerCondition + (observedOutcome - predictedOutcome)**2 } } } fivefold[i, 1:45] = c(included_subjects[i], thetaPerCondition, phiPerCondition, SSPerCondition) } colnames(fivefold) = c('SubjectID', 'thetaMeritF1', 'thetaMeritF2', 'thetaMeritF3', 'thetaMeritF4', 'thetaMeritF5', 'thetaEntitlementF1', 'thetaEntitlementF2', 'thetaEntitlementF3', 'thetaEntitlementF4', 'thetaEntitlementF5', 'thetaCorruptionF1', 'thetaCorruptionF2', 'thetaCorruptionF3', 'thetaCorruptionF4', 'thetaCorruptionF5', 'thetaLuckF1', 'thetaLuckF2', 'thetaLuckF3', 'thetaLuckF4', 'thetaLuckF5', 'phiMeritF1', 'phiMeritF2', 'phiMeritF3', 'phiMeritF4', 'phiMeritF5', 'phiEntitlementF1', 'phiEntitlementF2', 'phiEntitlementF3', 'phiEntitlementF4', 'phiEntitlementF5', 'phiCorruptionF1', 'phiCorruptionF2', 'phiCorruptionF3', 'phiCorruptionF4', 'phiCorruptionF5', 'phiLuckF1', 'phiLuckF2', 'phiLuckF3', 'phiLuckF4', 'phiLuckF5', 'SSMerit', 'SSEntitlement', 'SSCorruption', 'SSLuck') sqrt(mean(fivefold$SSMerit)/length(df$redistributionRate)) - sqrt(mean(subjectData$SSMerit)/length(df$redistributionRate)) #the change in root mean squared error, per trial in merit condition sqrt(mean(fivefold$SSEntitlement)/length(df$redistributionRate)) - sqrt(mean(subjectData$SSEntitlement)/length(df$redistributionRate)) #the change in root mean squared error, per trial in entilement condition sqrt(mean(fivefold$SSCorruption)/length(df$redistributionRate)) - sqrt(mean(subjectData$SSCorruption)/length(df$redistributionRate)) #the change in root mean squared error, per trial in corruption condition sqrt(mean(fivefold$SSLuck)/length(df$redistributionRate)) - sqrt(mean(subjectData$SSLuck)/length(df$redistributionRate)) #the change in root mean squared error, per trial in luck condition fivefold$AICMerit = length(df$redistributionRate) * log(fivefold$SSMerit/length(df$redistributionRate)) + 2 * 2 fivefold$AICEntitlement = length(df$redistributionRate) * log(fivefold$SSEntitlement/length(df$redistributionRate)) + 2 * 2 fivefold$AICCorruption = length(df$redistributionRate) * log(fivefold$SSCorruption/length(df$redistributionRate)) + 2 * 2 fivefold$AICLuck = length(df$redistributionRate) * log(fivefold$SSLuck/length(df$redistributionRate)) + 2 * 2 t.test(fivefold$AICMerit, subjectData$AIC, paired = T) #test fivefold MFI against normal MFI for this model t.test(fivefold$AICEntitlement, subjectData$AICEntitlement, paired = T) #test fivefold MFI against normal MFI for this model t.test(fivefold$AICCorruption, subjectData$AICCorruption, paired = T) #test fivefold MFI against normal MFI for this model t.test(fivefold$AICLuck, subjectData$AICLuck, paired = T) #test fivefold MFI against normal MFI for this model library(lsa) cosines = vector('numeric', length = length(conditions) * 2 * 5) #number of conditions times number of parameters times the number of folds for (i in 1:5){ for (j in 1:length(conditions)){ k = ((i - 1) * 5) + j cosines[k] = cosine(subjectData$Parameter1, fivefold[, (k + 1)]) cosines[(k+20)] = cosine(subjectData$Parameter1, fivefold[, (k + 21)]) } } mean(cosines[1:5]) #cosine similarity of theta in merit condition mean(cosines[6:10]) #cosine similarity of theta in entitlement condition mean(cosines[11:15]) #cosine similarity of theta in corruption condition mean(cosines[16:20]) #cosine similarity of theta in luck condition mean(cosines[21:25]) #cosine similarity of phi in merit condition mean(cosines[26:30]) #cosine similarity of phi in entitlement condition mean(cosines[31:35]) #cosine similarity of phi in corruption condition mean(cosines[36:40]) #cosine similarity of phi in luck condition mean(cosines[1:20]) #cosine similarity of theta across all conditions mean(cosines[21:40]) #cosine similarity of phi across all conditions .. tab-item:: MatLab :: fivefold = table(); for i = 1:length(included_subjects) datafile = strcat(parentfolder, included_subjects{i}, restoffilepath); fullData = readtable(datafile); thetaPerCondition = zeros(length(conditions) * 5, 1); phiPerCondition = zeros(length(conditions) * 5, 1); SSPerCondition = zeros(length(conditions), 1); for j = 1:length(conditions) df = fullData(fullData.condition == conditions{j}, [49, 40:48, 33]); df.redistributionRate = df.redistributionRate / 100; order = randperm(20); for z = 1:5 start_index = (z - 1) * 4 + 1; end_index = z * 4; withheld = order(start_index:end_index); m = ((start_index - 1) * 5) + z; result = fmincon(@obj_function, initial_params, [], [], [], [], lower_bounds, upper_bounds, df(~ismember(1:20, withheld), :)); thetaPerCondition(m) = result(1); phiPerCondition(m) = result(2); for k = 1:length(withheld) y = withheld(k); Utility = zeros(length(Choices), 1); for n = 1:length(Choices) Utility(n) = utility(result(1), result(2), ... equity(new_value(df{y, 1:10}, choices{n}), df{y, 1:10}, choices{n}), ... equality(new_value(df{y, 1:10}, choices{n}), df{y, 1:10}, choices{n}), ... payout(new_value(df{y, 1}, choices{n}), df{y, 1}, choices{n})); end [~, correct_choice] = max(Utility); predictedOutcome = new_value(df.myself(y), Choices(correct_choice(randi(length(correct_choice))))); observedOutcome = new_value(df.myself(y), df.redistributionRate(y)); SSPerCondition(j) = SSPerCondition(j) + (observedOutcome - predictedOutcome)^2; end end end fivefold{i, 1:45} = [included_subjects{i}, thetaPerCondition', phiPerCondition', SSPerCondition']; end fivefold.Properties.VariableNames = {'SubjectID', ... 'thetaMeritF1', 'thetaMeritF2', 'thetaMeritF3', 'thetaMeritF4', 'thetaMeritF5', ... 'thetaEntitlementF1', 'thetaEntitlementF2', 'thetaEntitlementF3', 'thetaEntitlementF4', 'thetaEntitlementF5', ... 'thetaCorruptionF1', 'thetaCorruptionF2', 'thetaCorruptionF3', 'thetaCorruptionF4', 'thetaCorruptionF5', ... 'thetaLuckF1', 'thetaLuckF2', 'thetaLuckF3', 'thetaLuckF4', 'thetaLuckF5', ... 'phiMeritF1', 'phiMeritF2', 'phiMeritF3', 'phiMeritF4', 'phiMeritF5', ... 'phiEntitlementF1', 'phiEntitlementF2', 'phiEntitlementF3', 'phiEntitlementF4', 'phiEntitlementF5', ... 'phiCorruptionF1', 'phiCorruptionF2', 'phiCorruptionF3', 'phiCorruptionF4', 'phiCorruptionF5', ... 'phiLuckF1', 'phiLuckF2', 'phiLuckF3', 'phiLuckF4', 'phiLuckF5', ... 'SSMerit', 'SSEntitlement', 'SSCorruption', 'SSLuck'}; % Change in root mean squared error per trial in each condition disp(sqrt(mean(fivefold.SSMerit) / height(df) - sqrt(mean(subjectData.SSMerit) / height(df)))); disp(sqrt(mean(fivefold.SSEntitlement) / height(df) - sqrt(mean(subjectData.SSEntitlement) / height(df)))); disp(sqrt(mean(fivefold.SSCorruption) / height(df) - sqrt(mean(subjectData.SSCorruption) / height(df)))); disp(sqrt(mean(fivefold.SSLuck) / height(df) - sqrt(mean(subjectData.SSLuck) / height(df)))); fivefold.AICMerit = height(df) * log(fivefold.SSMerit / height(df)) + 2 * 2; fivefold.AICEntitlement = height(df) * log(fivefold.SSEntitlement / height(df)) + 2 * 2; fivefold.AICCorruption = height(df) * log(fivefold.SSCorruption / height(df)) + 2 * 2; fivefold.AICLuck = height(df) * log(fivefold.SSLuck / height(df)) + 2 * 2; % Test fivefold MFI against normal MFI for this model disp(ttest(fivefold.AICMerit, subjectData.AIC, 'paired', true)); disp(ttest(fivefold.AICEntitlement, subjectData.AICEntitlement, 'paired', true)); disp(ttest(fivefold.AICCorruption, subjectData.AICCorruption, 'paired', true)); disp(ttest(fivefold.AICLuck, subjectData.AICLuck, 'paired', true)); cosines = zeros(length(conditions) * 2 * 5, 1); for i = 1:5 for j = 1:length(conditions) k = ((i - 1) * 5) + j; cosines(k) = cosine(subjectData.Parameter1, fivefold{:, k + 1}); cosines(k + 20) = cosine(subjectData.Parameter1, fivefold{:, k + 21}); end end % Cosine similarity of theta and phi across conditions and folds disp(mean(cosines(1:5))); disp(mean(cosines(6:10))); disp(mean(cosines(11:15))); disp(mean(cosines(16:20))); disp(mean(cosines(21:25))); disp(mean(cosines(26:30))); disp(mean(cosines(31:35))); disp(mean(cosines(36:40))); % Cosine similarity of theta across all conditions and folds disp(mean(cosines(1:20))); % Cosine similarity of phi across all conditions and folds disp(mean(cosines(21:40))); .. tab-item:: Python :: fivefold = pd.DataFrame() for i in range(len(included_subjects)): datafile = parentfolder + included_subjects[i] + restoffilepath fullData = pd.read_csv(datafile) thetaPerCondition = np.zeros(len(conditions) * 5) phiPerCondition = np.zeros(len(conditions) * 5) SSPerCondition = np.zeros(len(conditions)) for j in range(len(conditions)): df = fullData[fullData['condition'] == conditions[j]][[48, *range(39, 48), 32]] df['redistributionRate'] = df['redistributionRate'] / 100 order = np.random.permutation(20) for z in range(1, 6): start_index = (z - 1) * 4 end_index = z * 4 withheld = order[start_index:end_index] m = (start_index * 5) + z - 1 # Corrected indexing result = minimize(obj_function, initial_params, args=(df[~df.index.isin(withheld)],), bounds=list(zip(lower_bounds, upper_bounds))) thetaPerCondition[m] = result.x[0] phiPerCondition[m] = result.x[1] for k in withheld: Utility = np.zeros(len(Choices)) for n in range(len(Choices)): Utility[n] = utility(result.x[0], result.x[1], equity(new_value(df.iloc[k, 0:9], choices[n]), df.iloc[k, 0:9], choices[n]), equality(new_value(df.iloc[k, 0:9], choices[n]), df.iloc[k, 0:9], choices[n]), payout(new_value(df.iloc[k, 0], choices[n]), df.iloc[k, 1], choices[n])) correct_choice = np.argmax(Utility) predictedOutcome = new_value(df['myself'].iloc[k], Choices[np.random.choice(np.where(Utility == max(Utility))[0])]) observedOutcome = new_value(df['myself'].iloc[k], df['redistributionRate'].iloc[k]) SSPerCondition[j] += (observedOutcome - predictedOutcome) ** 2 fivefold.loc[i, 0:44] = [included_subjects[i], *thetaPerCondition, *phiPerCondition, *SSPerCondition] fivefold.columns = ['SubjectID', 'thetaMeritF1', 'thetaMeritF2', 'thetaMeritF3', 'thetaMeritF4', 'thetaMeritF5', 'thetaEntitlementF1', 'thetaEntitlementF2', 'thetaEntitlementF3', 'thetaEntitlementF4', 'thetaEntitlementF5', 'thetaCorruptionF1', 'thetaCorruptionF2', 'thetaCorruptionF3', 'thetaCorruptionF4', 'thetaCorruptionF5', 'thetaLuckF1', 'thetaLuckF2', 'thetaLuckF3', 'thetaLuckF4', 'thetaLuckF5', 'phiMeritF1', 'phiMeritF2', 'phiMeritF3', 'phiMeritF4', 'phiMeritF5', 'phiEntitlementF1', 'phiEntitlementF2', 'phiEntitlementF3', 'phiEntitlementF4', 'phiEntitlementF5', 'phiCorruptionF1', 'phiCorruptionF2', 'phiCorruptionF3', 'phiCorruptionF4', 'phiCorruptionF5', 'phiLuckF1', 'phiLuckF2', 'phiLuckF3', 'phiLuckF4', 'phiLuckF5', 'SSMerit', 'SSEntitlement', 'SSCorruption', 'SSLuck'] # Change in root mean squared error per trial in each condition print(np.sqrt(np.mean(fivefold['SSMerit']) / len(df) - np.sqrt(np.mean(subjectData['SSMerit']) / len(df)))) print(np.sqrt(np.mean(fivefold['SSEntitlement']) / len(df) - np.sqrt(np.mean(subjectData['SSEntitlement']) / len(df)))) print(np.sqrt(np.mean(fivefold['SSCorruption']) / len(df) - np.sqrt(np.mean(subjectData['SSCorruption']) / len(df)))) print(np.sqrt(np.mean(fivefold['SSLuck']) / len(df) - np.sqrt(np.mean(subjectData['SSLuck']) / len(df)))) fivefold['AICMerit'] = len(df) * np.log(fivefold['SSMerit'] / len(df)) + 2 * 2 fivefold['AICEntitlement'] = len(df) * np.log(fivefold['SSEntitlement'] / len(df)) + 2 * 2 fivefold['AICCorruption'] = len(df) * np.log(fivefold['SSCorruption'] / len(df)) + 2 * 2 fivefold['AICLuck'] = len(df) * np.log(fivefold['SSLuck'] / len(df)) + 2 * 2 # Test fivefold MFI against normal MFI for this model print(ttest_rel(fivefold['AICMerit'], subjectData['AIC'])) print(ttest_rel(fivefold['AICEntitlement'], subjectData['AICEntitlement'])) print(ttest_rel(fivefold['AICCorruption'], subjectData['AICCorruption'])) print(ttest_rel(fivefold['AICLuck'], subjectData['AICLuck'])) cosines = np.zeros(len(conditions) * 2 * 5) for i in range(5): for j in range(len(conditions)): k = (i * 5) + j cosines[k] = cosine(subjectData['Parameter1'], fivefold.iloc[:, k + 1]) cosines[k + 20] = cosine(subjectData['Parameter1'], fivefold.iloc[:, k + 21]) # Cosine similarity of theta and phi across conditions and folds print(np.mean(cosines[0:4])) print(np.mean(cosines[5:9])) print(np.mean(cosines[10:15])) print(np.mean(cosines[15:19])) print(np.mean(cosines[20:24])) print(np.mean(cosines[25:29])) print(np.mean(cosines[30:34])) print(np.mean(cosines[35:39])) # Cosine similarity of theta across all conditions and folds print(np.mean(cosines[0:19])) # Cosine similarity of phi across all conditions and folds print(np.mean(cosines[20:39])) Tutorial 3 - Crockett et al., 2014 ------------------- .. dropdown:: Assessing Accuracy in Capturing the Data Generation Process .. tab-set:: .. tab-item:: R :: ### Assessing Model Performance sum(trialData$choseHarm == round(trialData$ProbHarm))/nrow(trialData) qplot(subjectData$BIC[-20], geom = 'density') worstExplained = which(subjectData$BIC[-20] > as.numeric(summary(subjectData$BIC[-20])[4])) qplot(data = trialData[which(trialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = moneyDifference, y = choseHarm - ProbHarm, group = SubjectID) + geom_smooth() qplot(data = trialData[which(trialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = shocksDifference, y = choseHarm - ProbHarm, group = SubjectID) + geom_smooth() ### Checking Assumptions ggplot(data = trialData) + geom_smooth(aes(x = ProbHarm, y = choseHarm, group = isSelf, color = factor(isSelf))) ggplot(data = trialData) + geom_density(aes(x = choseHarm, fill = factor(choseHarm))) normvals = rnorm(1000, mean = 0, sd = sd(trialData$ProbHarm - trialData$choseHarm)) qplot(x = trialData$ProbHarm - trialData$choseHarm, geom = 'density', bw = sd(trialData$ProbHarm - trialData$choseHarm), color = 'Actual') + geom_density(aes(x = normvals, color = 'Predicted'), bw = sd(trialData$ProbHarm - trialData$choseHarm)) qplot(x = trialData$shocksDifference, y = trialData$ProbHarm - trialData$choseHarm, geom = 'jitter') + geom_smooth() qplot(x = trialData$moneyDifference, y = trialData$ProbHarm - trialData$choseHarm, geom = 'jitter') + geom_smooth() qplot(x = trialData$shocksDifference, y = trialData$ProbHarm - trialData$choseHarm, color = factor(trialData$isSelf), geom = 'smooth') + lims(y = c(-1,1)) qplot(x = trialData$moneyDifference, y = trialData$ProbHarm - trialData$choseHarm, color = factor(trialData$isSelf), geom = 'smooth') + lims(y = c(-1,1)) ### Assessing Independence library(lme4) library(MuMIn) ris_model = glmer(data = trialData, choseHarm ~ ProbHarm + (1 + ProbHarm | SubjectID), family = "binomial") r.squaredGLMM(ris_model) ri_model = glmer(data = trialData, choseHarm ~ ProbHarm + (1 | SubjectID), family = "binomial") r.squaredGLMM(ri_model) ric_model = glmer(data = trialData, choseHarm ~ ProbHarm + isSelf + (1 | SubjectID), family = "binomial") r.squaredGLMM(ric_model) .. tab-item:: MatLab :: .. tab-item:: Python :: .. dropdown:: Validating Parameter Recovery Process .. tab-set:: .. tab-item:: R :: fivefold = data.frame() #preallocate for parameters and errors from the fivefold validation to go into trialData$ProbHarm_ff = 0 optimize = function(df){ tryCatch({ fmincon(obj_function, x0 = initial_params, lb = lower_bounds, ub = upper_bounds, df = df, optimMethod = "MLE") }, error = function(e){ fmincon(obj_function, x0 = initial_params, lb = lower_bounds, ub = upper_bounds, df = df, optimMethod = "OLS") }) } adj = 1 for (i in 1:length(included_subjects)){ df = grab_data(included_subjects[i]) if (is.character(df)){adj = adj + 1; next} df$ProbHarm = 0 order = sample(152) KSelf_ff = vector('numeric', length = 5) KOther_ff = vector('numeric', length = 5) BSelf_ff = vector('numeric', length = 5) BOther_ff = vector('numeric', length = 5) E_ff = vector('numeric', length = 5) G_ff = vector('numeric', length = 5) Deviance_ff = 0 df$Pred = 0 for (z in 1:5){ j = round((z - 1) * (152/5) + 1) n = round(z * (152/5)) withheld = order[j:n] result_ff = optimize(df[-withheld,]) KSelf_ff[z] = result_ff$par[1] KOther_ff[z] = result_ff$par[2] BSelf_ff[z] = result_ff$par[3] BOther_ff[z] = result_ff$par[4] E_ff[z] = result_ff$par[5] G_ff[z] = result_ff$par[6] df[withheld, ] = generate_predictions(df[withheld, ], result_ff) } Deviance_ff = -2*sum(df$choseHarm * log(df$ProbHarm) + (1 - df$choseHarm) * log(1 - df$ProbHarm)) fivefold[i, 1:32] = c(included_subjects[i], Deviance_ff, KSelf_ff, KOther_ff, BSelf_ff, BOther_ff, E_ff, G_ff) j = (i-adj)*152 + 1; n = (i-adj + 1)*152 trialData$ProbHarm_ff[j:n] = df$ProbHarm } colnames(fivefold) = c('SubjectID', 'Deviance', 'KS_F1', 'KS_F2', 'KS_F3', 'KS_F4', 'KS_F5', 'KO_F1','KO_F2','KO_F3','KO_F4','KO_F5', 'BS_F1', 'BS_F2', 'BS_F3', 'BS_F4', 'BS_F5', 'BO_F1','BO_F2','BO_F3','BO_F4','BO_F5', 'E_F1', 'E_F2', 'E_F3', 'E_F4', 'E_F5', 'G_F1', 'G_F2', 'G_F3', 'G_F4', 'G_F5') sum(round(trialData$ProbHarm_ff) == trialData$choseHarm)/nrow(trialData) fivefold$BIC = as.numeric(fivefold$Deviance) + log(152) * 6 t.test(fivefold$BIC, subjectData$BIC, paired = T) #test fivefold MFI against normal MFI for this model library(lsa) cosines = vector('numeric', length = 60) for (i in 1:5){ cosines[i] = cosine(as.numeric(subjectData$Kappa_Self[-20]), as.numeric(fivefold[-20, (i + 2)])) cosines[(i+5)] = cosine(as.numeric(subjectData$Kappa_Other[-20]), as.numeric(fivefold[-20, (i + 7)])) cosines[(i+10)] = cosine(as.numeric(subjectData$Beta_Self[-20]), as.numeric(fivefold[-20, (i + 12)])) cosines[(i+15)] = cosine(as.numeric(subjectData$Beta_Other[-20]), as.numeric(fivefold[-20, (i + 17)])) cosines[(i+20)] = cosine(as.numeric(subjectData$Epsilon[-20]), as.numeric(fivefold[-20, (i + 22)])) cosines[(i+25)] = cosine(as.numeric(subjectData$Gamma[-20]), as.numeric(fivefold[-20, (i + 27)])) } mean(cosines[1:5]) mean(cosines[6:10]) mean(cosines[11:15]) mean(cosines[16:20]) mean(cosines[21:25]) mean(cosines[26:30]) .. tab-item:: MatLab :: .. tab-item:: Python :: Tutorial 4 - Li et al., 2022 ------------------- .. dropdown:: Assessing Accuracy in Capturing the Data Generation Process .. tab-set:: .. tab-item:: R :: ### Assessing Model Performance sum(altTrialData$Chose1 == round(altTrialData$ar_Prob1))/nrow(altTrialData) qplot(subjectData$BIC, geom = 'density') worstExplained = which(subjectData$BIC > as.numeric(summary(subjectData$BIC)[4])) qplot(data = altTrialData[which(altTrialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = a0-a1, y = Chose1 - Prob1, group = SubjectID) + geom_smooth(se=F) qplot(data = altTrialData[which(altTrialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = a0-a2, y = Chose1 - Prob1, group = SubjectID) + geom_smooth(se=F) qplot(data = altTrialData[which(altTrialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = b0-b1, y = Chose1 - Prob1, group = SubjectID) + geom_smooth(se=F) qplot(data = altTrialData[which(altTrialData$SubjectID %in% altSubjectData$SubjectID[worstExplained]),], x = b0-b2, y = Chose1 - Prob1, group = SubjectID) + geom_smooth(se=F) ### Checking Assumptions ggplot(data = altTrialData) + geom_smooth(aes(x = ar_Prob1, y = Chose1, group = a0 < b0, color = factor(a0 < b0))) ggplot(data = altTrialData) + geom_density(aes(x = Chose1, group = a0 < b0, fill = factor(a0 < b0), alpha = 0.5)) normvals = rnorm(1000, mean = 0, sd = sd(altTrialData$ar_Prob1 - altTrialData$Chose1)) qplot(x = altTrialData$ar_Prob1 - altTrialData$Chose1, geom = 'density', bw = sd(altTrialData$ar_Prob1 - altTrialData$Chose1), color = 'Actual') + geom_density(aes(x = normvals, color = 'Predicted'), bw = sd(altTrialData$ar_Prob1 - altTrialData$Chose1)) qplot(x = abs(altTrialData$a1 - altTrialData$a2), y = altTrialData$ar_Prob1 - altTrialData$Chose1, geom = 'jitter') + geom_smooth(method = 'loess') qplot(x = abs(altTrialData$a1 - altTrialData$a2), y = altTrialData$ar_Prob1 - altTrialData$Chose1, color = factor(altTrialData$a0 > altTrialData$b0), geom = 'jitter') + geom_smooth(method = 'loess') ### Assessing Independence library(lme4) library(MuMIn) ris_model = glmer(data = altTrialData, Chose1 ~ ar_Prob1 + (1 + ar_Prob1 | SubjectID), family = "binomial") r.squaredGLMM(ris_model) ri_model = glmer(data = altTrialData, Chose1 ~ ar_Prob1 + (1 | SubjectID), family = "binomial") r.squaredGLMM(ri_model) ric_model = glmer(data = altTrialData, Chose1 ~ ar_Prob1 + factor(a0