Kwantlen Polytechnic University KORA: Kwantlen Open Resource Access All Faculty Scholarship Faculty Scholarship 7-2014 Predicting the Uptake of a University’s Offers of Places Stephen Peplow Kwantlen Polytechnic University Follow this and additional works at: http://kora.kpu.ca/facultypub Part of the Higher Education Administration Commons, and the Management Sciences and Quantitative Methods Commons Original Publication Citation Peplow, Stephen. "Predicting the Uptake of a University's Offers of Places." Transformative Dialogues : Teaching & Learning Journal 7:2 (2014) Online. This Article is brought to you for free and open access by the Faculty Scholarship at KORA: Kwantlen Open Resource Access. It has been accepted for inclusion in All Faculty Scholarship by an authorized administrator of KORA: Kwantlen Open Resource Access. For more information, please contact kora@kpu.ca. Offers of Places July 2014 Predicting the Uptake of a University’s Offers of Places Stephen Peplow, PhD, Kwantlen Polytechnic University Author's Contact Information Stephen Peplow, PhD, Kwantlen Polytechnic University 12666 – 72nd Avenue, Surrey, BC, V3W 2M8 Email: Stephen.Peplow@kpu.ca Abstract: Institutions such as Universities spend considerable resources in recruiting and following up on applicants. Unfortunately, much wastage results from the sending out of follow-up letters to students who never arrive, and who perhaps had applied only as a form of insurance; and also in hiring instructors and scheduling classes when the attendance is uncertain. In a competitive market, a predictive model of the uptake of offers made might well be helpful. Key Words: Recruiting, offers of place, predictive model, students, faculty. Introduction This paper sets out to develop a predictive model for one particular institution, but the techniques and the modeling process could be applied elsewhere with adaptations. In fact, the techniques could be used in any environment in which customers are free to make multiple applications. As well as reducing the more immediate wastage problem, a predictive model based on historical information would be helpful in diverting attention towards the most plausible reasons for the rejection of an offer, highly useful information at a time when the values that students place on higher education are changing. As Maringe points out in an interesting but limited study of student motivations in the United Kingdom, students are becoming increasingly ‘consumerist’ in their approach (Maringe, 2006). This paper provides an example of the construction of a predictive and analytical model. Free open-source software is used in both the GIS (QGIS) and statistical analysis (‘R') parts of the paper (R Core Team, 2013) . There are therefore no software costs in implementation. The accuracy of the model is reasonably high at nearly eighty per cent. 1 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 Data and methods The data has been kindly provided by Kwantlen Polytechnic University in British Columbia, Canada. The dataset identifies students who were accepted for the academic year 2011 and also whether or not they took up the offer. Each applicant’s record is georeferenced by postcode. There are 8,899 unique postcodes in the dataset, the majority of which (seventy-three percent) contain information on only one applicant. The total number of student records is 12,968. The acceptance or refusal of the offer provides the binary dependent variable to be used in the analyses. The dataset also provides information such as age; faculty applied for; ratecode (international or domestic) and other pertinent details. Of the nearly thirteen thousand students in the dataset, just over twenty-two per cent declined their offers. The dataset provides the names of the Faculties to which the student applied and also the age-group of the student. Faculty Name Academic and Career Advancement Arts Business Community and Health Design Non-credential Science and horticulture Trade and Technology Code 1 2 3 4 5 6 7 8 Table 1. Coding of Faculty names Agegroups have been coded as follows: Agegroup 18 Code 1 19-22 2 23-28 3 29-32 4 33-38 5 39-44 6 45-50 7 51-55 8 56-60 9 60 + 10 Below 18 11 Table 2. Coding of agegroups 2 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 Statistical methods The task is to ‘explain’ the binary dependent variable in terms of the other independent variables. I have approached the task in two ways; by using the classification tree, and by logistic regression. These methods have their own strengths and weaknesses, and a combination of the two yields deeper insights (Long, 1997). This is not an instructional document, and so I have not explained the statistical theory behind the approaches in depth. Instead I have provided suitable references and would be pleased to enter into correspondence. The classification tree The classification tree method dates from the 1980s (Breiman, 1993). An algorithm partitions the data using splits or nodes. At each possible split, the algorithm decides whether the node contributes useful information about the dependent variable. If it does then it is defined as a split. The method has been used in a wide range of disciplines, for example in mental health care to calculate suicide risks , while in oncology, Camp and Slattery use the classification tree to identify types of cancer (Camp & Slattery, 2002). In remote sensing, the tree has been used to assess ground cover (Davranche, Lefebvre, & Poulin, 2010). One advantage of the classification tree approach is that the nodes are ordered in decreasing statistical significance. This means that the node which contributes the most information comes first. Logistic regression Logistic regression is a well-established method of calculating the odds ratio of the occurrence of one of the two alternatives in a binary dependent variable. An odds ratio is defined as the probability of an event (p) divided by (1-p). An odds ratio of 1 thus means that the event is as likely to occur as not to occur. In betting parlance, this is ‘evens’. It is common practice to present results from the analysis as the natural logarithm of the odds ratio. This is to obviate problems with zeroes and negative numbers. Logistic regression is used in a wide range of disciplines, but naturally is especially popular in disciplines in which binary dependent variables are common. This includes marketing and in finance. For example, Yeung and Yee use the tool to predict customer propensity to purchase (Yeung & Yee, 2011). Restaurant bankruptcies have been predicted by logistic regression (Youn & Gu, 2010). The tests From the university's perspective, interesting questions might be: 1. Does the acceptance rate change between faculties? Are some faculties more successful than others in retaining the students who have applied to them? 2. Is the age of the applicant related to the acceptance rate? 3. Are international students more or less likely than domestic students to accept their offers? 3 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 4. Does the home location of the applicant affect his or her acceptance rate? This is tied to a supplementary question regarding applications to multiple institutions as an ‘insurance policy’. Tests 1 - 3 can be answered by the classification tree and logistic regression. Test 4 will use data gleaned from GIS. Results of Tests 1 -3 Using classification tree The plot below provides a classification tree using the 'rpart' algorithm with the splits in order of statistical significance. Figure 1. Classification tree output The data has been split into training and testing sets. The most important split is at node 1, concerning whether the applicant was seeking vocational or undergraduate study. If YES, then the probability of accepting the offer (0) was 99%. If he or she was seeking undergraduate training, then the next most important node concerning age group. The algorithm has identified a split at age group >= 1.5, which, referring back to the table, meaning that the applicant was aged over 18. The rest of the tree is straightforward. The overall error is 23%, meaning that 77% of the classification were correct. 4 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 The model may be used for prediction in bulk. A large dataset of applicants detailed could be applied to the model, and the predicted probability of offer acceptance for each applicant produced. This output could then be ranked and appropriate action taken. Using logistic regression I have repeated the analysis using the same variables. Given the coding of zero for accept and 1 for reject, the meaning of the coefficients is this: the more positive (or less negative) the coefficient, the more likely the student is to reject the offer and vice-versa. Arts Business Community and Health Studies Design Non-credential students (Academic) Science and Horticulture Trades and Technology INTERNATIONAL age18 age19 - 22 age23 - 28 age29 - 32 age33 - 38 age39 - 44 age45 - 50 age51 - 55 age56 - 60 age60+ Constant N Log Likelihood AIC Refuse -0.094 -0.520** -2.749*** -14.677 0.065 0.005 -3.297*** -0.367** 0.666** -0.073 -0.294 -0.330 -0.220 0.010 0.273 -0.004 0.322 -0.347 -0.954** 12,968 -6,289.654 12,617.310 *p < .05; **p < .01; ***p < .001 Table 3. Logistic regression output The asterisks after the coefficients indicate the statistical significance of the variable, as shown in the table below the results. There are only seven faculties listed above, while Table 1 provides eight. This is because we need one faculty to be the reference level. The missing faculty is academic and career advancement. The coefficients for the seven displayed faculties should be considered in reference to Academic and Career Advancement, which had a forty-one per cent drop rate. It is immediately apparent that the probability of a student failing to take up an offer from the Faculty of Design is extremely small, and in fact no student dropped an offer. LevelVO refers to Vocational or Undergraduate, with Undergraduate being the reference level. We already know that Vocational students rarely drop offers, and so the negative sign is expected. The probability of a vocational student rejecting an offer is 5 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 much lower than that of the reference level, undergraduate students. The age variable uses age < 18 as the reference level. Age (18) shows high statistical significance and also a positive coefficient. This means that eighteen year olds are the group most at risk of failing to take up offers. The size of the negative coefficient decreases with age, perhaps reflecting greater stability and decision-making maturity. The exception is the coefficient for the age 23-28 group. Perhaps students of this age are active in the job market and turn down the offer because they have obtained a job? I have no other plausible explanation for this change, but it might be worthy of further research. A section below explores this issue a little further. Ratecode is a dummy variable, splitting the applicants into domestic and international students, with the international students paying more and perhaps having more choice. The reference level for this variable is domestic, simply because there were many more of them in the dataset. The negative sign shows that international students are less at risk for rejecting offers compared to domestic students. In fact, the difference is quite stark: twenty-three per cent of domestic students failed to take up offers made, while the figure for international students was fourteen per cent. GIS The statistical analysis above can be complemented by insights from geographical information systems (GIS). We can use GIS in three ways. 1. to gain a visual impression of the geographical distribution of the offer take-up 2. to gauge the effect of competition. KPU students almost certainly apply for other institutions apart from KPU as a form of insurance. It is interesting to gauge the effect of competing offers. The dataset does not provide data on alternative offers of course, but we can estimate it by constructing 'buffers' around both KPU and competing institutions and observing whether the takeup rate differs. We can also estimate the effect of having to cross a bridge or travel great distances. Visualisation It may be helpful to visualise the geographical distribution of the acceptance of offers. As with the statistical analyses above, I have assigned a zero to a student who accepted an offer, and a one to a student who did not take up the offer. I have selected only those postcodes which contained one applicant. Figure 1 below shows the distribution. Yellow marks those who took up offers, red those who received an offer but who did not take it up. 6 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 Figure 2. Map of offer takeup (light green) and refusal (red). KPU campuses indicated with blue circles. It is noticeable that acceptances (the light green colour) are clustered around the campuses of KPU. Buffers We can examine this further by placing a buffer around KPU campuses and also those of institutions which might be considered ‘competing’. I chose a 2 km radius buffer, but this was an arbitrary choice. Table 3 below shows the numbers and also the odds of a student rejecting an offer. The column headings are: the names of postsecondary institutions likely to be attractive to KPU applicants. ‘Drop’ and ‘total offers’ are the number of applicants who did not take up offers and the total KPU offers made within a 2 km radius of the institution. Pdrop is the probability of a drop, shown as an odds ratio in the next column. Institution BCIT Langara Langley Richmond SFU Surrey TWU UBC Drop 16 31 14 53 0 13 4 1 Total_Offers Pdrop 27 0.593 108 0.287 99 0.141 245 0.216 3 0.000 142 0.092 12 0.333 11 0.091 2km odds 1.455 0.403 0.165 0.276 0.000 0.101 0.500 0.100 Table 3. Odds within 2km buffers of competing institutions I also calculated the odds using a 1 km buffer but only for KPU’s three campuses: Richmond, Surrey and Langley. Data was insufficient at the other campuses. From the 7 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 2 km figures, it appears that BCIT is KPU’s greatest competitor, which is hardly surprising since both institutions are similar. The overall rejection rate is 22 per cent, or an odds of rejection of 0.28. Since the odds of rejection are lower within the 2 km buffer, it is possible that proximity to a campus increases the take up rate. Again this is not surprising; students who live near a campus are perhaps more likely to apply to only that institution. It is interesting that the 1 km figures confirm the proximity finding; the odds are lower. Langley is particularly small, perhaps because students there have few nearby alternatives. Effect of age Above I noted that the sign for age in the logistic regression changed. The plot below in Figure 3 uses the age group as the explanatory variable, with the probability of rejection of KPU’s offer on the vertical y axis. This is interesting because the plot begins and ends with a highly defined probability, because the line is tightly focussed. However the focus diminishes in the middle age ranges, reinforcing the logistic regression result. Figure 3. Refusal probability by agegroup Prediction example The logistic regression model above yields insights into the areas of concern in recruitment. It is also possible to use the model to predict the probability of refusal or acceptance either for an individual or a whole group. As an example, I predict the probability of refusal for this applicant: 8 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 Age = 18; level = UG; ratecode = DOMESTIC; faculty = Arts. The response is 0.4009791, meaning that there is a forty per cent change of refusal. It is possible to feed in a large dataset of applicants and receive probabilities for each one. The applicants could then be ranked by probability and appropriate action taken. Operationalization The information discussed and presented here is a beginning only. However, the following insights drawn from the analysis could be acted upon. Triage system. KPU’s Office of International Analysis and Planning prepared an internal report in March 2013, the Acceptance/Declined Survey. In the survey, five per cent of respondents claimed that other institutions communicated more quickly than KPU, and this was an important reason for their decision not to take up the offer. Perhaps the frontline staff in the registrar’s office could use the full classification tree to prioritize work. Of course, all students should be attended to promptly, and no doubt are, but students whose applications fall into ‘risky’ nodes might warrant extra care. The logistic regression has highlighted particular areas of concern. By faculty, Academic and Career Advancement, Non-credential, and Science and Horticulture students deserve attention at the institutional level; why are so many students rejecting offers from these two faculties? Age groups also provide interesting questions and opportunities. If scholarships are to be offered, they could be targeted to the youngest age-group; the effect of a scholarship on mature students is likely to be negligible and would therefore be money unwisely spent. Further work I would like to combine raw responses to the KPU Acceptance/Decline Survey with geospatial and observed offer take-up. From this, we might be able to observe patterns which match students’ responses to the Survey. For example, did those who claimed that lack of public transport was a major factor live inconveniently distant from a KPU campus? Correspondence and principal component analysis would also be possible. In addition, investigating more recent data might yield some interesting intertemporal insights. Conclusion The analyses which I have performed above are very elementary, and yet have brought out some interesting insights. In addition, the dataset is limited and in particular lacks intertemporal data. We cannot therefore examine trends over time, perhaps the most interesting feature of dynamic student enrolments. Data analyses such as mine are the bread and butter of modern business, and organisations which fail to build data capture and data analysis into their regular routines are likely to fall behind. In contrast, as some interesting recent studies have found, even modest data applications can propel organisations forward 9 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014 Offers of Places July 2014 References Breiman, L. (1993). Classification and regression trees. New York: Chapman & Hall. Camp, N. J., & Slattery, M. L. (2002). Classification tree analysis: a statistical tool to investigate risk factor interactions with an example for colon cancer (United States). Cancer Causes & Control, 13(9), 813–823 Davranche, A., Lefebvre, G., & Poulin, B. (2010). Wetland monitoring using classification trees and SPOT-5 seasonal time series. Remote Sensing of Environment, 114(3), 552– 562. doi:10.1016/j.rse.2009.10.009 Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. SAGE. Maringe, F. (2006). University and course choice: Implications for positioning, recruitment and marketing. International Journal of Educational Management, 20(6), 466–479. R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Yeung, R. M. W., & Yee, W. M. S. (2011). Logistic Regression: An advancement of predicting consumer purchase propensity. Marketing Review, 11(1), 71–81. Youn, H., & Gu, Z. (2010). Predict US restaurant firm failures: The artificial neural network model versus logistic regression model. Tourism & Hospitality Research, 10(3), 171–187. 10 Transformative Dialogues: Teaching & Learning Journal Volume 7 Issue 2 July 2014