The Application of Machine Learning to a General Risk–Need Assessment Instrument in the Prediction of Criminal Recidivism

Mehdi Ghasemi, University of Saskatchewan
Daniel Anvari, Kwantlen Polytechnic University
Mahshid Atapour, Capilano University
J. Stephen Wormith, University of Saskatchewan
Keira C. Stockdale, Saskatoon Police Service and University of Saskatchewan
Raymond J. Spiteri, University of Saskatchewan

The Level of Service/Case Management Inventory (LS/CMI) is one of the most frequently used tools to assess criminogenic risk–need in justice-involved individuals. Meta-analytic research demonstrates strong predictive accuracy for various recidivism outcomes. In this exploratory study, we applied machine learning (ML) algorithms (decision trees, random forests, and support vector machines) to a data set with nearly 100,000 LS/CMI administrations to provincial corrections clientele in Ontario, Canada, and approximately 3 years of follow-up. The overall accuracies and areas under the receiver operating characteristic curve (AUCs) were comparable, although ML outperformed the LS/CMI in terms of predictive accuracy for the middle scores, where it is hardest to predict the recidivism outcome. Moreover, ML improved the AUCs for individual scores to near 0.60, from 0.50 for the LS/CMI, indicating that ML also improves the ability to rank individuals according to their probability of recidivating. Potential considerations, applications, and future directions are discussed.

Keywords: LS/CMI; risk–need assessment; predictive accuracy; machine learning

Authors' Note: The views expressed are solely those of the authors and do not necessarily reflect those of the Saskatoon Police Service. In addition, we wish to acknowledge support from the Ontario Ministry of Community Safety and Correctional Services, the Centre for Forensic Behavioural Science and Justice Studies at the University of Saskatchewan, the Saskatchewan Police Predictive Analytics Lab, Mitacs, and the Natural Sciences and Engineering Research Council of Canada. Correspondence concerning this article should be addressed to Raymond J. Spiteri, Department of Computer Science, University of Saskatchewan, 176 Thorvaldson Building, 110 Science Place, Saskatoon, Saskatchewan, Canada S7N 5C9; e-mail: spiteri@cs.usask.ca.

CRIMINAL JUSTICE AND BEHAVIOR, 2020, Vol. XX, No. X, 1–21. DOI: 10.1177/0093854820969753. © 2020 International Association for Correctional and Forensic Psychology

Although efforts to predict criminal recidivism date back 90 years (Burgess, 1928), the last two decades have witnessed an explosion in the use of risk-assessment tools in criminal justice systems around the world. These tools vary dramatically in their length, scope, design, and method of calculating or appraising risk. They also vary in the type of forensic clientele for whom they are designed, the type of outcome they are meant to predict (e.g., types of recidivism), and the context in which they are applied (Andrews et al., 2006). Yet, they also tend to have some common characteristics. For example, most risk-assessment tools capture data about an individual's criminal history, a so-called static or historical factor and perhaps the most well-established risk factor for subsequent criminal behavior.
Another characteristic that binds all forensic risk-assessment instruments is that they are ultimately intended to promote public safety by identifying individuals who are most likely to reoffend. It is then the responsibility of the criminal justice system (police, courts, correctional agencies, and community organizations) to use the results of forensic risk assessments to employ the appropriate means at their disposal to reduce or prevent further criminal behavior.

The Level of Service (LS) Family of Risk-Assessment Tools

The Level of Service/Case Management Inventory (LS/CMI; Andrews et al., 2004) is the latest version of a forensic risk–need assessment measure from a family of tools known as the LS scales. Versions of the LS scales have been used worldwide since the early 1990s, with increasing popularity over the last decade. For instance, by 2010, more than one million administrations were officially registered with the test publisher in a single year (Wormith, 2011).

The popularity of the LS scales may be attributed to several important characteristics. First, unlike strictly actuarial measures, the LS scales were developed from well-established criminological and psychological theories (e.g., differential association theory, social learning theory), including a general personality and cognitive social learning theory of criminal behavior (e.g., Andrews & Bonta, 1994). Second, the LS scales have a rich tradition of research supporting their content and use in practical ways for correctional practitioners (Gendreau et al., 1996). This includes numerous validation studies and meta-analyses (e.g., Olver et al., 2014). Third, the LS scales have been found to have general applicability across many forensic populations. This includes adults and youth in custody or on community supervision, male and female populations, and various ancestral/ethnic backgrounds and cultures on diverse measures of recidivism, ranging from technical violations to criminal charges and convictions (e.g., Olver et al., 2009; Smith et al., 2009; Wilson & Gutierrez, 2014). Fourth, the LS scales have multiple applications in corrections. This includes not only the prediction of criminal recidivism but also the planning and delivery of forensic services and case management practices to prevent recidivism (e.g., Luong & Wormith, 2011), an attribute made possible because the scale includes dynamic risk factors, also known as criminogenic needs, as well as static risk factors, hence its status as a risk–need scale. Fifth, the LS scales strike a balance between comprehensiveness and simplicity. Ratings in applied settings require a skilled interview of forensic clientele, yet items are scored in a dichotomous (0–1) fashion and then summed. As such, the scale can easily be scored manually by a trained assessor.

A pilot version of the LS/CMI called the Level of Service Inventory–Ontario Revision (LSI-OR; Andrews et al., 1995) was introduced in Ontario, Canada, in 1995, and remains in use throughout this provincial jurisdiction. More than 20,000 administrations of this version are applied to forensic clientele in Ontario annually. For simplicity, in this study, we use the more generally known and widely used name for this version of the tool, the LS/CMI.

The LS/CMI (Andrews et al., 2004) consists of 43 items that are grouped into eight domains or subsections, commonly referred to as the "central eight" (Andrews & Bonta, 2010).
They include criminal history (eight items), education/employment (nine items), family/marital (four items), leisure/recreation (two items), companions (four items), substance abuse (eight items), pro-criminal attitudes (four items), and antisocial pattern (four items). Although individual items are scored in the same dichotomous fashion, the domains are weighted by virtue of their differing numbers of items. Other sections of the LS/CMI are used in a checklist fashion, serving as flags for issues of particular concern, but they will not be reviewed here because they are not the focus of the current study.

Numerous studies have examined the predictive validity of the LS scales. A recent meta-analysis of 151 independent samples and 137,931 justice-involved individuals by Olver et al. (2014) demonstrated the predictive validity of the LS scales for any recidivism (mean random effects correlations of .30 for males and .31 for females). Although the predictive accuracy for general recidivism was consistently higher than the predictive accuracy for violent recidivism (overall mean random effects correlations r were .29 and .23, respectively), the LS/CMI generated higher correlations for both general and violent recidivism (r = .42 and .27, respectively) than the other LS variants (r = .25–.30 and .21–.28, respectively). However, the number of studies examining the LS/CMI was modest (k = 12 and 11, respectively) because of the relative newness of this version of the tool. Regardless, these investigations have evaluated the prescribed arithmetic scoring of the LS/CMI using traditional statistical approaches. It is possible that the predictor variables (risk factors) are not of equal weight or do not demonstrate purely linear relationships with criterion data (recidivism), as discussed by Garb and Wood (2019) in their recent review of methodological advances in statistical prediction. Newer statistical approaches that examine complex predictors in novel ways may yield important insights.

Applications of Machine Learning (ML) to Forensic Risk Assessment

It is well known that humans generally do not have the best track record when it comes to making rational decisions and judgments (e.g., Grove et al., 2000). Even trained professionals do not fare nearly as well as basic actuarial algorithms when predicting human behaviors such as criminal recidivism (Andrews & Bonta, 2010), especially when faced with extensive information, limited feedback, and varying base rates for recidivistic events (Lin et al., 2020). The reasons for this are many, including the human mind's limits on working memory and human susceptibility to cognitive bias, emotion, fatigue, and so on (e.g., Dawes et al., 1989). Augmenting human capabilities with actuarial algorithms and computer-aided tools, including ML, may help to improve risk assessment and decision-making for correctional clientele.

ML is a branch of computer science that evolved from computational learning theory in artificial intelligence (e.g., Marsland, 2015; Murphy, 2012). It explores the analysis and construction of algorithms that can learn from, and make predictions about, relevant data. Because the amount of data available to scientists has recently seen unprecedented growth and ML techniques are "designed for the analysis of high-dimensional data with hundreds or thousands of predictors" (Garb & Wood, 2019, p. 1461), they have been attracting a great deal of attention.
ML has been successfully applied to the solution of problems in diverse areas, including medicine and health care delivery systems, and has resulted in improved diagnostic accuracy and efficiency (e.g., Deo, 2015; Lavecchia, 2015; Topol, 2019).

Over the last decade, a growing debate has mounted about the use of "big data" and ML algorithms in criminology and criminal justice generally and forensic risk assessment in particular (Berk & Bleich, 2013). Some have suggested that ML may improve the "hit rate" of extant risk-assessment tools (true positives [TPs] plus true negatives [TNs]) such as the LS scales (Wormith & Bonta, 2017, p. 135), whereas others have cautioned that the incremental validity of using ML approaches may be "modest," especially when the data are less complex (e.g., a data set containing scores on a single risk-assessment instrument; Garb & Wood, 2019, p. 1464). However, Duwe and Kim (2017) remind us that ML includes many different statistical techniques (e.g., decision tree [DT]-based algorithms, neural networks [NN], and support vector machines [SVMs]) and that applications to the criminal justice field are still in their "infancy" (p. 597). Helpful overviews of ML techniques are provided by Tollenaar and van der Heijden (2013) and Duwe and Kim (2017).

We limit the scope of our review to applications of ML to risk assessment, specifically predictive validity, given the importance of this type of validity in criminal justice decision-making and considering that the majority of applications of ML to risk assessment have focused on predictive accuracy. However, we recognize that the "success" of a model strongly depends on the performance metric used, and predictive accuracy is only one of many important functions of risk-assessment tools. Although a range of prediction performance metrics exists, in addition to accuracy (ACC), we report the commonly used area under the receiver operating characteristic curve (AUC) because of its frequency of use and intuitive application. Possible AUC values range from 0 to 1, representing the probability that a randomly selected recidivist would have a higher score than a randomly selected nonrecidivist (Rice & Harris, 2005).

In an early application of ML, Liu et al. (2011) compared logistic regression (LR), classification and regression trees (CART), and NN in the prediction of violent reoffending using a large sample of adult males in custody in the United Kingdom (N = 1,225). Prediction variables were taken from the Historical Clinical Risk Management–20 (HCR-20; Webster et al., 1997), a structured professional judgment (SPJ) approach to the assessment and management of violence, whereby assessors make clinical decisions based on the item data they collect, as opposed to quantitative estimates of risk. Although NN performed marginally better than LR and CART, the authors concluded that the improvement did not warrant the use of NN over traditional prediction schemes (with all AUCs ranging from 0.65 to 0.72 for violent recidivism).

In 2013, Tollenaar and van der Heijden compared the use of LR with several ML techniques, including multivariate regression splines, linear discriminant analysis, flexible discriminant analysis, recursive partitioning, adaptive boosting, LogitBoost, NN, linear support vector networks, k-nearest-neighbors classification, and partial least squares, in the prediction of general, violent, and sexual recidivism.
However, rather than using items from an existing risk-assessment tool, Tollenaar and van der Heijden used a host of static variables that were available from offender databases (N = 20,000) in the Netherlands (e.g., criminal history counts). Overall, they found that the most accurate model varied with the type of sample (e.g., offending subtypes and recidivism base rates) and the outcome being predicted (e.g., sexual, violent, and general reoffending), and ML approaches to the prediction of criminal recidivism generally were not superior to traditional regression-based approaches (with AUCs ranging from 0.708 to 0.776 for general recidivism). However, the conclusions drawn by Tollenaar and van der Heijden in 2013 were criticized at the time for being premature (Berk & Bleich, 2013), and there were calls for further explorations of ML approaches (e.g., Brennan & Oliver, 2013; Bushway, 2013; Ridgeway, 2013).

More recent results have been mixed. For instance, Hamilton et al. (2015) compared the predictive accuracy of the Washington State Static Risk Assessment using traditional (LR) and ML methodologies (NN and random forest [RF] approaches) in a large sample of corrections clients reentering the community in the state of Washington (N = 297,600). AUCs ranged from 0.732 to 0.762 depending on the outcome of interest, with LR and ML approaches demonstrating comparable performance. However, using a sample of 40,000 individuals released from prison in Minnesota, Duwe and Kim (2016) found that prediction models developed with supervised learning classifiers outperformed classification techniques commonly used in risk–need assessment tools (e.g., summative classification or the Burgess method). To further investigate the performance of newer ML approaches relative to traditional methods in predicting recidivism, Duwe and Kim (2017) subsequently compared the predictive accuracy of 12 supervised learning algorithms. The data set used in the study was derived from that used to develop the Minnesota Screening Tool Assessing Recidivism Risk (MnSTARR; Duwe, 2014) and comprised 27,772 individuals released from prisons in Minnesota. The MnSTARR contains both static (e.g., criminal history) and dynamic items pertaining to institutional adjustment (e.g., disciplinary infractions, involvement in programming; Duwe, 2014, 2019), and as such, both static and dynamic predictors were included in the data set. Newer ML approaches such as LogitBoost (AUC = 0.777), RFs (AUC = 0.781), and MultiBoosting (AUC = 0.775) were found to yield better results for general recidivism, albeit only modestly. Moreover, the methods yielding the best performance varied across 10 "scenarios" that varied by gender and type of recidivism. As Duwe and Kim (2017) pointed out, and we concur, these results would seem to suggest that the type of statistical method employed to assess risk for recidivism may depend on the purpose for which risk is being assessed and how the results are being used. For example, a large correctional agency looking to automate risk classification within a geographic region or across institutions may require different technologies than an individual assessor who is looking to identify criminogenic needs and generate recommendations for case management.
As with traditional methodologies, further "tuning" of ML models (e.g., calibration) is also required to address issues of diversity, including responsivity considerations (e.g., gender, ethnic/cultural background, age), location (e.g., region, country, institution vs. community settings), and time of assessment (e.g., intake, release, pre-/posttreatment). Risk variables may also change over time with or without intervention (e.g., cohort effects, treatment change).

In the last few years, there have been some promising and also concerning findings. For instance, using a large data set of predictors of offending in Texas (N = 258,248), Curtis (2018) found that modern ML approaches predicted general arrest well. Large effects were reported for RFs (AUC = 0.808) and XGBoost (AUC = 0.792). However, the majority of the top predictors were static predictors (e.g., criminal history), and one of the top predictors was the number of tattoos! An examination of 336 predictor variables in a sample of 3,061 juveniles in Florida with a history of sexual offenses by Ozkan et al. (2020) also found that RF models yielded strong findings, with AUCs of 0.71 for an "all-predictors model" and 0.65 for a "legal factors" model. These values are comparable with AUCs reported in a recent meta-analysis of tools used to assess sexual recidivism in juveniles (AUCs = 0.64–0.67; Viljoen et al., 2012). To be included in the data set, youth were required to have been scored on the Positive Achievement Change Tool (PACT; Baglivio, 2009). It is unclear how these findings may compare with the predictive accuracy of the PACT alone in this sample or how best to interpret and apply this "black-box" all-predictors model.

Perhaps most interestingly, using only static, historical information about adult males who had been convicted of a sexual offense for the first time (N = 756), Lussier et al. (2019) were able to generate novel insights using ML approaches, specifically decision tree algorithms (DTAs), including chi-square automatic interaction detection (CHAID), Quick Unbiased Efficient Statistical Tree (QUEST), and CART. Although classic LR and DTA predictive models were not appreciably different, the use of DTAs (AUCs = 0.704–0.733) revealed the presence of different risk profiles for entry into sexual recidivism that were not revealed by classic LR (AUC = 0.746). Thus, there may be benefit to combining traditional and modern approaches when assessing risk over time, development, and the course of a criminal career.

The purpose of the current study is to extend this body of work by applying modern ML approaches to a widely used, general risk–need assessment tool that is theoretically informed (in contrast to dustbowl empiricism), includes both static and dynamic factors, and can follow justice-involved individuals from intake through to case closure, often referred to as a "fourth-generation" tool (Andrews et al., 2006). Duwe and Kim (2017) have suggested that "fourth-generation risk-assessment instruments based on ML algorithms could potentially improve correctional practice" (p. 596), for example, access to programming. Moreover, the approach taken by Lussier and colleagues (2019) indicates that deductive (e.g., LR), inductive (i.e., ML), and combined approaches to risk modeling may contribute different theoretical and analytic insights.
As a first exploratory step, we examine the performance of ML techniques relative to the LS/CMI score in terms of predictive accuracy and the ability to rank individuals according to their probability of recidivating by means of a secondary analysis of two large data sets containing LS/CMI administrations for individuals in provincial custody in Ontario, Canada.

Method

Data Sets

Provincial correctional policy in Ontario requires the administration of the LS/CMI to all individuals who are given a term of probation or sentenced to a period of incarceration of between 3 months and 2 years. Individuals who are sentenced to more than 2 years are transferred to the federal correctional authority, the Correctional Service of Canada; hence, they are not under provincial jurisdiction and are not administered the LS/CMI by the provincial correctional authority. With the introduction of an electronic data capture and scoring mechanism for the LS/CMI, the collection of large LS/CMI data sets became possible. In this article, we analyzed a combination of two data sets provided by the Ontario Ministry of Community Safety and Correctional Services (MCSCS).

The first data set (D1) consists of 72,725 records for individuals who were interviewed and assessed on the LS/CMI by MCSCS correctional staff during 2010 and 2011. This cohort included correctional clientele who had been sentenced to prison for a term of 3 to 24 months and then released from custody as well as those who were sentenced to a period of probation during this 2-year period. Individuals were then followed up at the end of 2013 through official records of readmission to the justice system in Ontario. Where applicable, dates of the first recidivistic event were used to determine which individuals criminally recidivated and how long it took them to do so. The average follow-up time during which individuals were eligible to reoffend was approximately 2.96 years (SD = 7.5 months), ranging from 1.04 to 4.5 years.

The second data set (D2) was retrieved using an earlier cohort from the same jurisdiction and under the same conditions as the 2010–2011 cohort; however, the second cohort spanned only a single year, 2004. LS/CMI data and recidivism outcomes were collected in the same manner as with the previous data set. A total of 26,450 individuals were then followed until January 2009, an average follow-up time of 4.54 years (SD = 3.5 months), ranging from 4.02 to 5.02 years. It is important to note that these data sets represent two cohorts of consecutive admissions to the MCSCS system. Recidivism for both data sets was defined as any criminal offense for which an individual was returned to the MCSCS system on a reconviction and sentenced to either incarceration or community supervision.

Table 1 provides summary statistics for data sets D1 and D2, including information on mean total LS/CMI risk scores as well as recidivism rates for each gender and risk level. We note that the mean LS/CMI score for D1 (M = 14.30, SD = 8.91) is higher than the mean score for D2 (M = 12.53, SD = 8.79). We find this difference to be statistically significant, with a p value of less than 10^-8; however, we consider this p value to be an artifact of the large sample size. Effect sizes given in Table 1 using Hedges's g suggest small effect sizes (≤0.27) between mean LS/CMI scores across databases (overall and sex stratified) and trivial effect sizes (≤0.07) across databases stratified by risk levels.
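For readers who wish to check or reproduce the effect-size comparisons in Table 1, the sketch below implements the standard formulas for Hedges's g (a standardized mean difference with a small-sample bias correction) and Cohen's h (the difference between arcsine-transformed proportions). This is a minimal illustration in Python (the language the authors report using); the function names are ours, and the inputs are the summary values quoted in the text, not the raw data.

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference with Hedges's small-sample correction."""
    # Pooled standard deviation across the two groups
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled
    # Approximate small-sample bias correction factor J
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

def cohens_h(p1, p2):
    """Effect size for the difference between two proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Overall D1 vs. D2 values reported in the text
g = hedges_g(14.30, 8.91, 72725, 12.53, 8.79, 26450)  # mean LS/CMI scores
h = cohens_h(0.3052, 0.3601)                          # recidivism rates
print(f"Hedges's g = {g:.2f}, Cohen's h = {h:.2f}")   # g = 0.20, h = 0.12
```

Applied to the overall values, these formulas reproduce the g = 0.20 and h = 0.12 reported in the "All" row of Table 1.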
Figure 1 provides a more detailed visual presentation of the difference in the distribution of the LS/CMI scores for the two data sets. Referring to the boxplots in Figure 1A and B, it can be seen that the box containing the middle 50% of scores for D1 is located at a higher level than that of D2. Also, D2 contains more outliers with high scores. We also note from Table 1 that the rate of recidivism is significantly higher (with a p value of less than 10^-8) for D2, 36.01%, compared with 30.52% for D1. We also consider this p value to be an artifact of the large sample size. Effect sizes given in Table 1 using Cohen's h suggest small effect sizes (≤0.26) between recidivism rates across databases (overall and stratified by sex and risk level). One possible explanation for this difference is the longer average follow-up time for D2 (4.54 years) compared with about 3 years for D1, allowing more opportunity for recidivism to occur and be recorded. Other summary statistics are given, including the mean, standard deviation, and median LS/CMI scores as well as AUCs (with 95% confidence intervals) for the individual data sets according to gender and LS/CMI risk level. Although data sets D1 and D2 differ in potentially important ways, the AUC values associated with the LS/CMI total scores are similar for D1 and D2 as well as for male and female subgroups (AUCs from 0.70 to 0.72).

We make use of the combined data set (comprising D1 and D2) for building and testing our predictive models to retain maximal data. The rationale for this approach is that our ultimate goal is to build a dynamic predictive model based on the maximum amount of available data. We create a training data set by selecting 50% of the records from each of D1 and D2 in a uniformly random fashion. These records are combined to form the training data set. The remaining records are combined to form the testing data set (a minimal sketch of this split follows Table 1).

Table 1: Summary Statistics of LS/CMI Scores and Recidivism Rates for Data Sets D1 and D2

D1 (n = 72,725)
Category            Prop. (%)   Recid. (%)   M (SD)         Mdn   AUC [95% CI]        Age
Female              17.35       25.69        13.32 (8.45)   12    0.72 [0.69, 0.74]   33.81
Male                82.65       31.54        14.51 (8.99)   13    0.72 [0.70, 0.73]   34.02
All                 100         30.52        14.30 (8.91)   13    0.72 [0.71, 0.73]   33.99
Very low (0–4)      13.67        7.51         2.51 (1.22)    3    —                   —
Low (5–10)          25.90       15.00         7.49 (1.69)    7    —                   —
Medium (11–19)      33.37       30.07        14.74 (2.59)   15    —                   —
High (20–29)        20.32       52.51        23.93 (2.83)   24    —                   —
Very high (30–43)    6.74       72.87        33.23 (2.78)   33    —                   —

D2 (n = 26,450)
Category            Prop. (%)   Recid. (%)   M (SD)         Mdn   AUC [95% CI]        Age
Female              18.32       28.90        11.05 (8.01)    9    0.71 [0.69, 0.73]   33.42
Male                81.68       37.61        12.86 (8.93)   11    0.70 [0.68, 0.72]   33.23
All                 100         36.01        12.53 (8.79)   11    0.70 [0.68, 0.72]   33.27
Very low (0–4)      19.88       11.73         2.44 (1.22)    3    —                   —
Low (5–10)          29.25       22.57         7.38 (1.70)    7    —                   —
Medium (11–19)      29.97       41.75        14.59 (2.59)   14    —                   —
High (20–29)        15.44       64.85        23.99 (2.85)   24    —                   —
Very high (30–43)    5.46       83.39        33.05 (2.72)   33    —                   —

D1 vs. D2
Category            Hedges's g   Cohen's h
Female              0.27         0.07
Male                0.18         0.13
All                 0.20         0.12
Very low (0–4)      0.06         0.14
Low (5–10)          0.06         0.19
Medium (11–19)      0.06         0.24
High (20–29)        0.02         0.25
Very high (30–43)   0.07         0.26

Note. Bivariate differences between D1 and D2 on mean LS/CMI scores and rates of recidivism are significant. AUC = area under the receiver operating characteristic curve; CI = confidence interval; LS/CMI = Level of Service/Case Management Inventory; Prop. = proportion of data set size; Recid. = rate of recidivism for each category; M (SD) = mean (standard deviation) LS/CMI score; Mdn = median LS/CMI score.
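As a minimal sketch of the cohort-stratified 50/50 split described above, assume (hypothetically) that each record is stored with its 43 item scores in columns A1–A43 plus a binary recidivated outcome; the file and column names below are placeholders, not those of the MCSCS data sets.

```python
import pandas as pd

# Placeholder file names; each row holds items A1..A43 plus a 0/1
# 'recidivated' outcome for one LS/CMI administration.
d1 = pd.read_csv("lscmi_2010_2011.csv")
d2 = pd.read_csv("lscmi_2004.csv")

# Draw 50% of each cohort uniformly at random so that both cohorts are
# represented proportionally in the training and testing halves.
d1_train = d1.sample(frac=0.5, random_state=0)
d2_train = d2.sample(frac=0.5, random_state=0)

train = pd.concat([d1_train, d2_train])
test = pd.concat([d1.drop(d1_train.index), d2.drop(d2_train.index)])
```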
Figure 1: Visualizations of Different Aspects of the Distribution of LS/CMI Scores for Data Sets D1, D2, and the Combined Data Set: (A) Boxplot for Data Set D1 (n = 72,725); (B) Boxplot for Data Set D2 (n = 26,450); (C) Distribution of the LS/CMI Scores for the Combined Data Set; and (D) Rate of Recidivism for Each LS/CMI Score for the Combined Data Set
Note. LS/CMI = Level of Service/Case Management Inventory.

Additional details regarding these two data sets can be obtained from Wormith et al. (2012) and Wormith et al. (2015) as well as two master's theses (Hogg, 2011; Orton, 2014). Ethics approvals were obtained from the University of Saskatchewan for these projects as well as for a broader program of predictive research the current work sought to inform (BEH 16-166).

LS/CMI Scores

The General Risk/Need Factors section of the LS/CMI consists of 43 risk–need items, A_i, scored dichotomously (0 = not present, 1 = present). Items are summed to provide a total score ranging from 0 to 43,

LS/CMI = \sum_{i=1}^{43} A_i,

and there are five risk levels associated with various ranges of scores. Table 1 gives the proportion of scores for each risk level as well as the corresponding rates of recidivism for data sets D1 and D2. In practice, a total LS/CMI score is obtained and compared with available norms to get a recidivism estimate. To simulate this process algorithmically in this study, the LS/CMI is applied in the following way to predict recidivism. For a given data set, the recidivism rate for each score is calculated. If the recidivism rate for a given score is above 0.5, then any individual with that score is classified as likely to recidivate and otherwise not.

ML Algorithms

There are various types of ML algorithms that can be applied to our data set, but because we have outcome or "target" data (i.e., data on whether an individual recidivated), we employ a class of ML algorithms generally known as supervised algorithms (Marsland, 2015). In supervised algorithms, the algorithm is fed previously existing data (training data), where the target data are known, and the algorithm builds a model from these data. The goal is to enable the model to reliably predict target values on a new set of data (test data; that is, data on which the model has not been trained). More detailed descriptions of various ML algorithms can be found in Marsland (2015) and Murphy (2012). We now briefly describe the supervised ML algorithms used in this study.

DTs

An early example of a DT approach applied to risk assessment comes from the MacArthur Violence Risk Assessment study, in which Steadman and colleagues (2000) designed a method to predict violent recidivism among offenders with mental disorders. A DT learning approach refers to a predictive model that maps observations about an item to conclusions about the item's target value. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels, and branches represent conjunctions of features that lead to those class labels. A tree can be constructed by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner (known as recursive partitioning).
The recursion is completed when the subset at a node has the same value of the target variable or when splitting no longer adds value to the predictions (Marsland, 2015). DTs may best be applied to those problems where instances are represented by attribute–value pairs, the target function has discrete output values, and the training data may contain errors. This makes DTs appropriate for our study because the target function (whether an individual recidivated) and all the input data (scores on LS/CMI items) are binary values. In fact, the LS/CMI itself can be interpreted as a DT. For example, the LS/CMI classifies an individual based on their LS/CMI score into one of the five risk levels; then, based on the existing statistics for each risk level, it determines whether they are likely to recidivate or not.

RFs

RFs (e.g., Marsland, 2015) represent a set of classification algorithms that make predictions based on the outputs of a large number of decision trees built on random subsets of features. RFs use the so-called bagging process to allow each individual tree to randomly sample from the training data set with replacement. This results in trees that are ultimately trained with different data and leads to more variation and diversification among the large number of trees in the forest. To increase the success of RF models, one needs to start with features that have a good level of predictive power and ensure these features are not highly correlated with each other. The overall idea is that if one tree can provide a good model, then many trees (a forest) should be able to do even better, provided there is enough diversity in the constituent trees.

SVMs

SVMs provide a state-of-the-art learning method that has been highly successful in a variety of applications. They are particularly effective when dealing with continuous data and data sets that are not linearly separable (Schölkopf & Smola, 2002). The SVM method has been developed based on two main ideas. The first idea is to map the feature vectors (data points) in a (nonlinear) way to a high (possibly infinite) dimensional space and then utilize linear classifiers in this new space. This mapping produces nonlinear classifiers in the original space, thus overcoming the representational limitations of linear classifiers. However, the use of linear classifiers in the transformed space depends heavily on the computational methods for finding a classifier that performs well on the training data. The second idea is that, among the generally infinitely many hyperplanes that may separate the data, the linear classifier chosen is the one that maximizes the separation of the data (i.e., the one whose distance to the nearest data point on each side is maximized; Steinwart & Christmann, 2008). SVMs are suitable for classifying data of relatively high dimension. Because our data set consists of 43 LS/CMI variables, SVMs are a reasonable approach for classification.

K-Fold Cross-Validation

The various ML models were all built in the following way using k-fold cross-validation with k = 10. First, the training set was randomly shuffled and divided into k equal parts (or folds). For each ML algorithm, k models were built using k − 1 of the folds as training data and the final fold as testing data. For any given performance metric, the results of the k models are averaged to provide the value reported.
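The article states only that standard Python libraries were used; the sketch below shows how the 10-fold cross-validation just described might look with scikit-learn, continuing the placeholder train layout sketched earlier and using the Gini criterion and cubic kernel ultimately selected in the Hyperparameters subsection below. It is an illustration of the procedure, not the authors' code.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

item_cols = [f"A{i}" for i in range(1, 44)]  # the 43 dichotomous items
X, y = train[item_cols], train["recidivated"]

models = {
    "DT": DecisionTreeClassifier(criterion="gini"),
    "RF": RandomForestClassifier(criterion="gini"),
    "SVM": SVC(kernel="poly", degree=3),  # cubic polynomial kernel
}

for name, model in models.items():
    # k = 10 folds; each model is trained on 9 folds, tested on the
    # held-out fold, and the fold results are averaged.
    scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC over 10 folds = {scores.mean():.4f}")
```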
Software Used

The analytics presented in this article are the results of scripts written in the Python programming language that use standard Python libraries for ML calculations and visualizations.

Data Analysis

Evaluating the Performance of Classification Methods

We examine three types of ML algorithms (DTs, RFs, and SVMs) to predict whether an individual is likely to recidivate or not. These algorithms can be thought of as classifiers. To evaluate a classifier, we need a way to compare the performance of each one (i.e., a measure that shows how well a given classifier predicts positive and negative cases of recidivism). A natural way to do this is to apply the classifier to a data set where the outcome is known, and hence, the performance of the classifier can be compared with existing data. We use the following quantities to compare the performance of various classifiers: TP, the number of cases correctly predicted as positive; false positive (FP), the number of cases incorrectly predicted as positive; TN, the number of cases correctly predicted as negative; and false negative (FN), the number of cases incorrectly predicted as negative. All of these numbers are often summarized in the following matrix:

\begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix}.

The most popular (and yet arguably naïve when used exclusively) measure associated with a classifier is the accuracy (ACC), which is defined as follows:

ACC = \frac{TP + TN}{TP + TN + FP + FN}.

The accuracy tells us what proportion of the testing data is correctly classified. However, such a measure is not without its shortcomings. For example, consider a hypothetical classifier that classifies everything as negative (i.e., no individual is predicted to recidivate). The accuracy of this all-negative classifier applied to our data set is in fact about 65% because the recidivism rate is about 35%. Despite its high accuracy, this classifier is likely not acceptable because it never correctly classifies positives (i.e., individuals who recidivate). The high accuracy can be attributed to the low recidivism rate.

Hyperparameters

The hyperparameters of an ML algorithm refer to parameters of the algorithm that must be specified in addition to the data. We experimented with the Gini and entropy versions of DTs and RFs and with SVMs with linear, quadratic, cubic, and radial basis function (RBF) kernels. The DT and RF methods with the Gini criterion (Marsland, 2015) and the SVM method with the cubic kernel (Steinwart & Christmann, 2008) resulted in the best performance; hence these hyperparameter settings were used in all the comparisons with the LS/CMI score.

Comparing Overall Performance

We examine whether there is a difference in performance of the ML algorithms relative to the LS/CMI score in predicting recidivism. We examine the predictive validity of the LS/CMI and ML algorithms for general recidivism using receiver operating characteristic (ROC) analyses. ROCs generate an AUC value from 0 to 1, representing the probability that a randomly selected recidivist will obtain a higher score than a randomly selected nonrecidivist (Rice & Harris, 1995). We use the interpretive rubric of Rice and Harris (2005), in which the magnitude of AUC values is mapped to predictive effect sizes as follows: 0.55 to 0.63 (small/low), 0.64 to 0.70 (medium), and 0.71 and up (large/high). AUCs are evaluated by magnitude and in their ability to rank predictive models according to individual LS/CMI scores.
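Continuing the same hypothetical setup, these quantities can be computed on the held-out testing half as follows; scikit-learn remains an assumed implementation, and the score-threshold rule for the LS/CMI baseline is the one described in the LS/CMI Scores subsection.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

model = models["RF"].fit(X, y)  # fit one classifier on the training half
X_test, y_test = test[item_cols], test["recidivated"]

y_pred = model.predict(X_test)  # 0/1 class predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)  # identical to accuracy_score
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"TP={tp} FP={fp} FN={fn} TN={tn} ACC={acc:.3f} AUC={auc:.4f}")

# LS/CMI baseline: classify a total score as "likely to recidivate" when
# the training recidivism rate for that score exceeds 0.5.
rate_by_score = train.groupby(train[item_cols].sum(axis=1))["recidivated"].mean()
lscmi_pred = (test[item_cols].sum(axis=1).map(rate_by_score) > 0.5).astype(int)
print(f"LS/CMI baseline ACC = {accuracy_score(y_test, lscmi_pred):.3f}")
```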
Sensitivity Analysis for Feature Selection

ML algorithms such as DTs and RFs have the capability to identify and report the most influential features in the models they build. In this study, we use sensitivity analysis to elicit the LS/CMI items (or features) that are most important in predicting recidivism. Sensitivity analysis is generally the study of how perturbations (small changes or uncertainties) in model inputs are propagated to uncertainties in model outputs. Specifically, when a small change to a model input leads to a large change in model output, we say that the model is sensitive to that input. There are a number of ways in which sensitivity can be measured. In this study, we consider three of the most popular methods: the Morris method, which performs global sensitivity analysis by making a number of local changes at different possible input values; the Sobol (or variance-based sensitivity analysis) method, which decomposes the variance of the output of the model into fractions and attributes them to inputs; and the moment-independent δ index, which measures the relative importance of an individual input in determining the uncertainty of model output by looking at the entire distribution range of model output.

As a basic usage of sensitivity analysis, one can consider a scenario in which two individuals have the same score and hence the same prediction and ranking based on the LS/CMI. In this situation, the values of the top items can be used as additional information to rank the two individuals and predict their future recidivism. For example, a score of 1 on the top items indicates a higher probability of recidivism, and a score of 0 indicates a lower probability.

Results

The overall rate of recidivism for the two data sets is 31.98%. Table 1 shows the distribution of individuals with respect to the five LS/CMI risk levels and their corresponding rates of recidivism for each data set. A typical interpretation of a row in Table 1 is, for example, that an individual classified as high risk in D2 is likely to recidivate with probability 64.85% (and correspondingly will not recidivate with probability 35.15%). In general, an individual is classified as likely to recidivate if more than 50% of individuals with the same score have done so; otherwise, the individual is classified as unlikely to recidivate. From this table, we also see that the data are imbalanced (i.e., the risk levels do not have equal representation). Figure 1C provides a visual representation of the skew in the population distribution of LS/CMI scores in the combined data set.

Figure 1D shows the rate of recidivism for each LS/CMI score of the combined data set. As expected, the recidivism rate shows a steady increase as the LS/CMI score increases. The decrease in the recidivism rate and prediction accuracy for the maximum LS/CMI score (43) is likely due to insufficient data (n = 7).

Figure 2: ACC and AUC Performance Metrics for the Four Methods Examined (LS/CMI, DT, RF, and SVM): (A) ACC of the LS/CMI Method for Each Score; (B) ACC of the Four Methods for Each LS/CMI Score; and (C) AUC of the Four Methods for Each LS/CMI Score
Note. LSI refers to the LS/CMI = Level of Service/Case Management Inventory; DT = decision tree; RF = random forest; SVM = support vector machine; ACC = accuracy; AUC = area under the receiver operating characteristic curve.

The distribution of the prediction accuracy of LS/CMI scores is depicted in Figure 2A.
From this figure, it can be observed that individuals classified as very high risk (LS/CMI scores between 30 and 43) can be confidently regarded as likely to recidivate, and those classified as low or very low risk (LS/CMI scores between 0 and 10) as unlikely to recidivate. However, for the relatively wide range of LS/CMI scores between 14 and 29, the LS/CMI predictive accuracy is below 70%. This decrease in the predictive properties of LS/CMI scores occurs for individuals classified as medium and high risk, and these two risk groups form over half (51.48%) of the individuals in the combined data set.

Next, we calculate the performance measures for the prediction of recidivism for the LS/CMI, DT, RF, and SVM methods. Table 2 shows the overall predictive accuracy (ACC) as a weighted average of the accuracies for each score. It can be observed that the overall predictive accuracy of all four methods is comparable, with RF only slightly outperforming the LS/CMI.

Table 2: Performance Measures for Each Prediction Method

Method    ACC     AUC [95% CI]
LS/CMI    0.734   0.7517 [0.7511, 0.7524]
DT        0.695   0.7529 [0.7514, 0.7545]
RF        0.736   0.7531 [0.7519, 0.7545]
SVM       0.704   0.7545 [0.7528, 0.7562]

Note. ACC = accuracy; AUC = area under the receiver operating characteristic curve; CI = confidence interval; LS/CMI = Level of Service/Case Management Inventory; DT = decision tree; RF = random forest; SVM = support vector machine.

From Figure 2B, it can be observed that all four methods behave similarly as a function of the LS/CMI score, showing high predictive accuracy at the extreme scores and relatively low predictive accuracy for the middle scores. It is noteworthy, however, that the RF method consistently outperforms the LS/CMI method in terms of predictive accuracy over the entire range of LS/CMI scores. In particular, the lowest value of predictive accuracy for RF is approximately 0.57, whereas for the LS/CMI, it is slightly below 0.50.

The performance according to AUC is also shown in Table 2. According to the interpretive rubric of Rice and Harris (2005), these AUCs for all four methods correspond to large predictive effect sizes, with extremely small differences in magnitude between the methods. With this in mind, the AUC for the LS/CMI total score was slightly lower than that of the other three methods, and its 95% CI does not overlap with that of SVM, which has the highest AUC, representing a statistically significant difference. Figure 2C shows the distribution of AUC values for the different methods tested at each possible individual LS/CMI score. This figure illustrates that ML algorithms do a better job of discriminating recidivists from nonrecidivists compared with the traditional LS/CMI summative method (AUC values often around 0.6 compared with 0.5 for the LS/CMI summative score) for a broad range of scores from low to moderately high. By construction, ML algorithms take into account the way in which the individual items are combined to produce a given total score, and this leads to improved performance compared with simple consideration of total scores. RF has the second-highest AUC and seems to be the most effective method overall in terms of both the ACC and AUC performance metrics.

Figure 3 shows a heatmap of the sensitivities of the LS/CMI items according to the three different sensitivity metrics discussed above, sorted in decreasing order by Sobol index.
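The article does not name the implementation behind these sensitivity metrics; SALib is one standard Python library that provides the Morris, Sobol, and moment-independent δ methods. The sketch below illustrates only the Sobol variant, under the simplifying assumptions that the fitted model's predicted probability of recidivism is the output of interest and that each item is varied on [0, 1] and rounded to a 0/1 value.

```python
import pandas as pd
from SALib.sample import saltelli
from SALib.analyze import sobol

# Treat the 43 items as inputs on [0, 1]; the model output is the
# predicted probability of recidivism (model, item_cols as above).
problem = {
    "num_vars": 43,
    "names": item_cols,
    "bounds": [[0, 1]] * 43,
}

param_values = saltelli.sample(problem, 1024)  # Sobol sampling design
X_sens = pd.DataFrame(param_values.round(), columns=item_cols)
Y = model.predict_proba(X_sens)[:, 1]

Si = sobol.analyze(problem, Y)  # first-order indices in Si["S1"]
top = sorted(zip(item_cols, Si["S1"]), key=lambda t: -t[1])[:5]
print(top)  # the most influential items (e.g., A18, A14 in Figure 3)
```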
These analyses can be used to inform prediction of future recidivism because they demonstrate which factors have the most influence in predicting recidivism. From Figure 3, we see that Items A18 (charge laid, probation breached, or parole suspended during prior community supervision), A14 (three or more present offenses), and A423 (could make better use of time) and, to a lesser extent, A735 (current drug problem) are the most sensitive items in the LS/CMI according to the Sobol and moment-independent δ indices.

Figure 3: Sensitivities of LS/CMI Items According to Different Metrics
Note. LS/CMI = Level of Service/Case Management Inventory. Individual LS/CMI items are numbered along the left-hand side (e.g., A18, A14), with the methods used for the sensitivity analyses (Sobol, Morris, and moment-independent δ index) across the top.

Discussion

ML is often misconstrued as "pitting human minds against the machine" (Ahuja, 2019; Norman, 2018) or equated with completely "automated offender risk assessment" (Wormith, 2017). Neither represents an accurate or complete understanding of applications of ML. In our view, ML is a tool or set of techniques that can potentially augment current risk-assessment approaches and assist in understanding behavioral patterns relevant to criminal justice (e.g., criminal recidivism). The results of this study build upon the limited previous findings demonstrating that ML algorithms can perform as well as or better than summative scores on validated risk-assessment tools (e.g., Duwe & Kim, 2016); a novel contribution of this study is that it features a theoretically driven, fourth-generation, general risk–need tool, the LS/CMI.

Interestingly, classification accuracy as measured by ACC was found to improve significantly for the middle LS/CMI scores using RFs. As an example, for an LS/CMI score of 22, the ACC increased from 0.51 to 0.59 using RFs. The accuracy of the LS/CMI for these middle scores is near 0.50, making it difficult to assess whether an individual is likely to recidivate. Because as many as 15% to 20% of individuals may fall into this portion of the "High" risk band, these results suggest that ML algorithms may help us to increase the predictive capability for recidivism among individuals in lower-confidence, higher-density risk classification groupings.

Examination of a more sensitive performance metric (the AUC), which accounts for fluctuations in base rates and considers the relative rankings of scores, also revealed that ML algorithms performed as well as total LS/CMI scores, with SVM slightly outperforming the LS/CMI score (AUCs = 0.7517 for LS/CMI to 0.7545 for SVM; Table 2), with prediction magnitudes consistent with previous meta-analytic reviews (Olver et al., 2014). This pattern was observed across all LS/CMI scores with the exception of those in the very high risk range, owing to insufficient n at the most extreme scores (e.g., the maximum score of 43). Moreover, all the ML approaches investigated consistently improved the AUC by approximately 10 percentage points for the majority of individual LS/CMI scores (Figure 2C). These AUCs may be further improved with the incorporation of additional risk-relevant and dynamic data beyond LS/CMI scores, as recommended by Garb and Wood (2019).

It is important to note that LS/CMI summative scoring makes an identical prediction for two individuals with similar scores, say a score of 22.
However, mathematically, the number of distinct ways an individual can be assigned the summative score of 22 is C(43, 22) ≈ 1.05 × 10^12, approximately 1 trillion. In fact, the LS/CMI method does not distinguish among any of these different cases, whereas an ML algorithm like SVM can provide a more nuanced analysis and could differentiate between individuals with a summative score of 22, classifying some as likely to recidivate and others not, depending upon how the score was reached. Thus, these preliminary results suggest that there may be underlying patterns or different combinations of scores that are more predictive of recidivism. Further analyses of these patterns may be fruitful, not only in terms of predictive accuracy but also to identify clusters and weightings of criminogenic needs that may separate recidivists from nonrecidivists with similar LS/CMI scores, including frequently obtained and seemingly less predictive risk scores. Results of exploratory sensitivity analyses have begun to identify dynamic factors or criminogenic risk–needs that may have the most influence in the prediction of recidivism (e.g., poor use of time, current drug problem). Our plan for future work is to conduct additional mathematical and statistical analyses on the most common combinations of these items and the number of unique paths resulting in a specific summative LS/CMI score as well as to include available features beyond LS/CMI items.

Mere "prediction" should not be the primary goal of any risk assessment. Rather, prevention is the primary purpose (e.g., risk reduction). Thus, all predictive technology is perhaps best viewed as preventive technology, and this includes ML. To our knowledge, there are few studies that have evaluated "real world" applications of ML. In one such illustrative study, Berk (2016) examined the impact of ML "risk forecasts" on parole board decisions in Pennsylvania. Although some evidence for "smarter decision making" was reported, it was difficult to ascertain the full implications of the evidence because standard practices and ML approaches were drawing upon much of the same information by virtue of the fact that ML "forecasts were meant to supplement the information available to the Board, not replace it" (p. 22). This is one of many possible applied uses of ML data to augment, not replace, other decision-making tools and mechanisms. No one tool, computer-aided or not, should be used in a standalone fashion to inform criminal justice decision-making; rather, best practice requires the use of validated tools as part of comprehensive and contextualized assessments. Moreover, as we seek to integrate additional information into risk assessments, such as individual strengths or protective factors or changes in dynamic risk over time, ML approaches may make better use of enriched information (e.g., repeated or multiple assessments incorporating risk, protective factors, and change information), and they could also uncover relationships between risk-relevant variables and outcomes that may differ across groups, settings, and time.

Furthermore, ML may also provide new insights that can inform intervention practices. For instance, Lussier et al. (2019) recently used DT algorithms to identify risk factors for entry into sexual reoffending.
Future work may be able to employ ML to test hypotheses regarding changes in risk over time and to test causal inferences (Barabas et al., 2018), including the effects of interventions such as correctional programming and resultant changes in risk–need scores. When used in this manner, ML approaches could have utility in preventing criminal or aggressive behavior, including identifying situations that may lead to violence toward self and others. For example, Bala and Trautman (2019) have recently explored the applicability of ML approaches to promote identification and intervention for individuals in custody who may be at risk for engaging in self-harming behaviors.

Study Limitations and Future Directions

As with all tools, we encourage further evaluation of ML approaches and advocate for their responsible use. As cautioned by Barabas et al. (2018), some ML approaches "transform the space of input features into a higher order space that is often difficult to interpret" (p. 8). Clear interpretations must be advanced and tested, and such analyses should include local validation and updated models. There is also the important issue of algorithmic fairness (Berk et al., 2018; Corbett-Davies & Goel, 2018). It has been argued that both risk-assessment tools and models may be biased for a variety of reasons (e.g., label bias, feature bias, sample bias, and calibration issues); however, the very data used to train and test ML algorithms may also be compromised, and one must avoid entrenching biases (Corbett-Davies & Goel, 2018).

Although ML approaches require large sample sizes, "big data" are not necessarily "deep." The current study examined data from a single source (i.e., LS/CMI scores). Integrating data from other sources may improve models, advance understanding, and reduce inherent biases. Recently, Menger and colleagues (2019) endeavored to predict inpatient violence from clinical notes in patient electronic health records. As such, novel data sources and data elements (e.g., text data) could be integrated with traditional data sources (e.g., risk scores) to enhance statistical models and further our understanding of behavioral patterns relevant to criminal justice outcomes (e.g., recidivism and desistance).

Finally, although two large data sets of LS/CMI administrations were utilized to retain maximal data for exploratory analyses, it is recognized that, as is typical of field research, there is a lack of uniformity between the samples (e.g., differences in available follow-up time and mean LS/CMI score). However, healthy sampling variance was observed, and techniques robust to fluctuations in base rates were employed. This said, more nuanced analyses beyond the scope of the present work could examine the associations between important moderators and recidivism that may have bearing on LS/CMI score and outcome. For instance, ML approaches may further contribute to our understanding of the mechanisms that may underlie the relative superiority of discrimination of item patterns by gender, age, or other risk moderators. Further examinations of underlying risk–need patterns (e.g., pathways of criminogenic needs) using ML may also assist practitioners to refine prevention and correctional strategies based on the patterns observed in the data.
Conclusions for the "Near Future" of Risk–Need Assessment

Always looking to push the field of risk assessment forward, Andrews, Bonta, and Wormith discussed "The Recent Past and Near Future of Risk and/or Need Assessment" in their seminal 2006 paper. Over 10 years later, Wormith (2017) provided additional glimpses into the future of risk assessment in his policy paper titled "Automated Offender Risk Assessment: The Next Generation or a Black Hole?" Risk assessment in the digital era, use of artificial intelligence in criminal justice, and "smart prisons" are no longer the near future; they are the present. ML approaches are now making significant contributions to health care, not to mention business and entertainment, and there have been calls to build "fair algorithms" to assist criminal justice decision-making (Corbett-Davies & Goel, 2018). Recent preliminary findings suggest that although ML approaches can contribute meaningfully to risk assessment, management, and reduction, they should be developed with care. With smart, automated technologies advancing at "warp speed" (Wormith, 2017, p. 281), research and statistical methodologies must keep pace to support ethical, effective, and cost-efficient correctional practices; promote innovation in risk assessment and management; and ultimately, better, safer outcomes for criminal justice clients and communities.

ORCID iD

Raymond J. Spiteri https://orcid.org/0000-0002-3513-6237

References

Ahuja, A. (2019). The impact of artificial intelligence in medicine on the future role of the physician. PeerJ, 7, Article e7702. https://doi.org/10.7717/peerj.7702
Andrews, D. A., & Bonta, J. (1994). The psychology of criminal conduct (1st ed.). Anderson.
Andrews, D. A., & Bonta, J. (2010). The psychology of criminal conduct (5th ed.). LexisNexis.
Andrews, D. A., Bonta, J., & Wormith, J. S. (1995). Level of Service Inventory–Ontario Revision (LSI-OR): Interview and scoring guide. Ontario Ministry of the Solicitor General and Correctional Services.
Andrews, D. A., Bonta, J., & Wormith, J. S. (2004). Level of Service/Case Management Inventory (LS/CMI): An offender assessment system. User's guide. Multi-Health Systems.
Andrews, D. A., Bonta, J., & Wormith, J. S. (2006). The recent past and near future of risk and/or needs assessment. Crime and Delinquency, 52, 7–27. https://doi.org/10.1177/0011128705281756
Baglivio, M. T. (2009). The assessment of risk to recidivate among a juvenile offending population. Journal of Criminal Justice, 37(6), 596–607. https://doi.org/10.1016/j.jcrimjus.2009.09.008
Bala, N., & Trautman, L. (2019). "Smart" technology is coming for prisons, too. Slate. https://slate.com/technology/2019/04/smart-ai-prisons-surveillance-monitoring-inmates.html
Barabas, C., Dinakar, K., Ito, J., Virza, M., & Zittrain, J. (2018). Interventions over predictions: Reframing the ethical debate for actuarial risk assessment. Proceedings of Machine Learning Research, 81, 1–15.
Berk, R. A. (2016). An impact assessment of machine learning risk forecasts on parole board decisions and recidivism (Working Paper No. 2016-4.0). University of Pennsylvania.
Berk, R. A., & Bleich, J. (2013). Statistical procedures for forecasting criminal behavior: A comparative assessment. Criminology & Public Policy, 12, 513–544. https://doi.org/10.1111/1745-9133.12047
Berk, R. A., Heidari, H., Jabbari, S., Kearns, M., & Roth, A. (2018). Fairness in criminal justice risk assessments: The state of the art. Sociological Methods and Research.
Advance online publication, July 2018. https://doi.org/10.1177/0049124118782533

Brennan, T., & Oliver, W. L. (2013). The emergence of machine learning techniques in criminology: Implications of complexity in our data and in research questions. Criminology & Public Policy, 12, 551–562. https://doi.org/10.1111/1745-9133.12055

Burgess, E. W. (1928). Factors determining success or failure on parole. In A. A. Bruce (Ed.), The workings of the indeterminate sentence law and the parole system in Illinois (pp. 221–234). Illinois State Board of Parole.

Bushway, S. D. (2013). Is there any logic to using logit: Finding the right tool for the increasingly important job of risk prediction. Criminology & Public Policy, 12, 563–567. https://doi.org/10.1111/1745-9133.12059

Corbett-Davies, S., & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. https://arxiv.org/abs/1808.00023

Curtis, J. (2018). On using machine learning to predict recidivism [Unpublished doctoral dissertation, Texas Tech University].

Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.

Deo, R. C. (2015). Machine learning in medicine. Circulation, 132, 1920–1930. https://doi.org/10.1161/CIRCULATIONAHA.115.001593

Duwe, G. (2014). The development, validity, and reliability of the Minnesota Screening Tool Assessing Recidivism Risk (MnSTARR). Criminal Justice Policy Review, 25, 579–613. https://doi.org/10.1177/0887403413478821

Duwe, G. (2019). Better practices in the development and validation of recidivism risk assessments: The Minnesota Sex Offender Screening Tool-4. Criminal Justice Policy Review, 30, 538–564. https://doi.org/10.1177/0887403417718608

Duwe, G., & Kim, K. (2016). Sacrificing accuracy for transparency in recidivism risk assessment: The impact of classification method on predictive performance. Corrections: Policy, Practice and Research, 1, 155–176. https://doi.org/10.1080/23774657.2016.1178083

Duwe, G., & Kim, K. (2017). Out with the old and in with the new? An empirical comparison of supervised learning algorithms to predict recidivism. Criminal Justice Policy Review, 28, 570–600. https://doi.org/10.1177/0887403415604899

Garb, H. N., & Wood, J. M. (2019). Methodological advances in statistical prediction. Psychological Assessment, 31, 1456–1466. https://doi.org/10.1037/pas0000673

Gendreau, P., Little, T., & Goggin, C. (1996). A meta-analysis of predictors of adult offender recidivism: What works! Criminology, 34, 401–433. https://doi.org/10.1111/j.1745-9125.1996.tb01220.x

Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30. https://doi.org/10.1037/1040-3590.12.1.19

Hamilton, Z., Neuilly, M., Lee, S., & Barnoski, R. (2015). Isolating modeling effects in offender risk assessment. Journal of Experimental Criminology, 11, 299–318. https://doi.org/10.1007/s11292-014-9221-8

Hogg, S. M. (2011). The Level of Service Inventory (Ontario Revision) scale validation for gender and ethnicity: Addressing reliability and predictive validity [Unpublished master's thesis, University of Saskatchewan].

Lavecchia, A. (2015). Machine-learning approaches in drug discovery: Methods and applications. Drug Discovery Today, 20, 318–331. https://doi.org/10.1016/j.drudis.2014.10.012

Lin, Z., Jung, J., Goel, S., & Skeem, J. (2020). The limits of human predictions of recidivism.
Science Advances, 6, Article eaaz0652. https://doi.org/10.1126/sciadv.aaz0652

Liu, Y. Y., Yang, M., Ramsey, M., Xiao, S. L., & Coid, J. W. (2011). A comparison of logistic regression, classification and regression tree, and neural networks models in predicting violent re-offending. Journal of Quantitative Criminology, 27, 547–573. https://doi.org/10.1007/s10940-011-9137-7

Luong, D., & Wormith, J. S. (2011). Applying risk/need assessment to probation practice and its impact on the recidivism of young offenders. Criminal Justice and Behavior, 38, 1177–1199. https://doi.org/10.1177/0093854811421596

Lussier, P., Deslauriers-Varin, N., Collin-Santerre, J., & Bélanger, R. (2019). Using decision tree algorithms to screen individuals at risk of entry into sexual recidivism. Journal of Criminal Justice, 63, 12–24. https://doi.org/10.1016/j.jcrimjus.2019.05.003

Marsland, S. (2015). Machine learning: An algorithmic perspective (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b17476

Menger, V., Spruit, M., van Est, R., Nap, E., & Scheepers, F. (2019). Machine learning approach to inpatient violence risk assessment using routinely collected clinical notes in electronic health records. JAMA Network Open, 2, Article e196709. https://doi.org/10.1001/jamanetworkopen.2019.6709

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. The MIT Press.

Norman, A. (2018, January 31). Your future doctor may not be human. This is the rise of AI in medicine. Futurism. https://futurism.com/ai-medicine-doctor

Olver, M. E., Stockdale, K. C., & Wormith, J. S. (2009). Risk assessment with young offenders: A meta-analysis of three assessment measures. Criminal Justice and Behavior, 36, 329–353. https://doi.org/10.1177/0093854809331457

Olver, M. E., Stockdale, K. C., & Wormith, J. S. (2014). Thirty years of research on the Level of Service scales: A meta-analytic examination of predictive accuracy and sources of variability. Psychological Assessment, 26, 156–176. https://doi.org/10.1037/a0022200

Orton, L. C. (2014). An examination of the professional override in the Level of Service Inventory–Ontario Revision (LSI-OR) [Unpublished master's thesis, University of Saskatchewan].

Ozkan, T., Clipper, S. J., Piquero, A. R., Baglivio, M., & Wolff, K. (2020). Predicting sexual recidivism. Sexual Abuse, 32, 375–399. https://doi.org/10.1177/1079063219852944

Rice, M. E., & Harris, G. T. (1995). Violent recidivism: Assessing predictive validity. Journal of Consulting and Clinical Psychology, 63(5), 737–748. https://doi.org/10.1037/0022-006X.63.5.737

Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC area, Cohen's d, and r. Law and Human Behavior, 29, 615–620. https://doi.org/10.1007/s10979-005-6832-7

Ridgeway, G. (2013). Linking prediction and prevention. Criminology & Public Policy, 12, 545–550. https://doi.org/10.1111/1745-9133.12057

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. The MIT Press.

Smith, P., Cullen, F. T., & Latessa, E. J. (2009). Can 14,737 women be wrong? A meta-analysis of the LSI-R and recidivism for female offenders. Criminology & Public Policy, 8, 183–208. https://doi.org/10.1111/j.1745-9133.2009.00551.x

Steadman, H. J., Silver, E., Monahan, J., Appelbaum, P., Robbins, P. C., Mulvey, E. P., Grisso, T., Roth, L. H., & Banks, S. (2000). A classification tree approach to the development of actuarial violence risk assessment tools. Law and Human Behavior, 24, 83–100. https://doi.org/10.1023/A:1005478820425
Steinwart, I., & Christmann, A. (2008). Support vector machines (1st ed.). Springer. https://doi.org/10.1007/978-0-387-77242-4

Tollenaar, N., & van der Heijden, P. G. M. (2013). Which method predicts recidivism best? A comparison of statistical, machine learning and data mining predictive models. Journal of the Royal Statistical Society, 176, 565–584. https://doi.org/10.1371/journal.pone.0213245

Topol, E. (2019). Deep medicine: How artificial intelligence can make healthcare human again. Basic Books.

Viljoen, J. L., Mordell, S., & Beneteau, J. (2012). Prediction of adolescent sexual reoffending: A meta-analysis of the J-SOAP-II, ERASOR, JSORRAT-II, and Static-99. Law and Human Behavior, 36, 423–438. https://doi.org/10.1037/h0093938

Webster, C. D., Douglas, K. S., Eaves, D., & Hart, S. D. (1997). HCR-20: Assessing risk for violence (Version 2). Mental Health, Law and Policy Institute.

Wilson, H. A., & Gutierrez, L. (2014). Does one size fit all? A meta-analysis examining the predictive ability of the Level of Service Inventory (LSI) with Aboriginal offenders. Criminal Justice and Behavior, 41(2), 196–219. https://doi.org/10.1177/0093854813500958

Wormith, J. S. (2011). The legacy of D. A. Andrews in the field of criminal justice: How theory and research can change policy and practice. International Journal of Forensic Mental Health, 10, 78–82. https://doi.org/10.1080/14999013.2011.577138

Wormith, J. S. (2017). Automated offender risk assessment: The next generation or a black hole? Criminology & Public Policy, 16, 281–303. https://doi.org/10.1111/1745-9133.12277

Wormith, J. S., & Bonta, J. (2017). The Level of Service (LS) instruments. In J. P. Singh, D. G. Kroner, J. S. Wormith, S. L. Desmarais, & Z. Hamilton (Eds.), Handbook of recidivism risk/need tools (pp. 117–145). John Wiley.

Wormith, J. S., Hogg, S. M., & Guzzo, L. (2012). The predictive validity of a general risk/needs assessment inventory on sexual offender recidivism and an exploration of the professional override. Criminal Justice and Behavior, 39, 1511–1538. https://doi.org/10.1177/0093854812455741

Wormith, J. S., Hogg, S. M., & Guzzo, L. (2015). The predictive validity of the LS/CMI with Aboriginal offenders in Canada. Criminal Justice and Behavior, 42, 481–508. https://doi.org/10.1177/0093854814552843

Mehdi Ghasemi works as a senior scientist at the Edmonton Police Service and an adjunct professor in the Department of Mathematics and Statistics at the University of Saskatchewan. His scientific activities mainly include mathematical modeling, optimization, and advanced data analysis.

Daniel Anvari is a faculty member in the Department of Mathematics and Statistics at Kwantlen Polytechnic University. His areas of research are applications of dynamical systems and machine learning in biology, health, and social sciences.

Mahshid Atapour is a faculty member in the Department of Mathematics and Statistics at Capilano University. Her areas of research are applied probability and applications of statistics and machine learning in health and social sciences.
J. Stephen Wormith (now deceased) began his career as a psychologist and researcher in various correctional jurisdictions in Canada. He then became a professor in the Department of Psychology at the University of Saskatchewan and the director of the Centre for Forensic Behavioural Science and Justice Studies. Over his long career, he made fundamental contributions to offender risk and psychological assessment, offender treatment, sexual offending, and crime prevention. He was a fellow of the Canadian Psychological Association (CPA) and represented the CPA on the National Associations Active in Criminal Justice (NAACJ).

Keira C. Stockdale is a registered clinical psychologist currently employed by the Saskatoon Police Service and an adjunct professor in the Department of Psychology at the University of Saskatchewan. Her research and clinical activities include risk assessment and treatment for justice-involved youth and adults.

Raymond J. Spiteri is a professor in the Department of Computer Science at the University of Saskatchewan. His areas of research are numerical analysis, scientific computing, and high-performance computing, with specialization in time-stepping methods for differential equations.