Development of an Analytical Process to Measure Teacher Effectiveness Based on Student Growth to Augment an Educator Evaluation System by Lamar D. Adams A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy Auburn, Alabama August 4, 2012 Keywords: Student Growth, Teacher Evaluation System, Linear Mixed Model, Quantile Regression, Principal Component Analysis, Cluster Analysis Copyright 2012 by Lamar D. Adams Approved by Dr. Jeffrey S. Smith, Chair, Joe W. Forehand Jr. Professor, Industrial and Systems Engineering Dr. Saeed Maghsoodloo, Professor Emeritus of Industrial and Systems Engineering Dr. David Shannon, Humana-Germany-Sherman Distinguished Professor, Educational Foundations, Leadership and Technology Dr. Joni Lakin, Assistant Professor of Educational Foundations, Leadership and Technology ii Abstract Teacher quality is one of the most important school related variables associated with student achievement. Therefore, raising the quality of the U.S. public education teaching force is essential to ensure that every child has the opportunity to achieve academic success. In order to accomplish this, significant analytical inspection of teachers is needed to assist with the determination of whether teachers contribute appropriately to students attaining adequate yearly growth. The primary objective of this research was to fill the need of augmenting Alabama?s formative educator evaluation system, EDUCATEAlabama, with a precise and stable teacher effectiveness index based on student growth. The methodology of computing such an index consisted of three phases. Phase I entailed calculating four teacher effectiveness metrics. Subject-specific and overall teacher index values were calculated in Phase II utilizing the Phase I metrics and principal component analysis. The principal components served as the inputs to Phase III, Cluster Analysis, with Ward?s clustering method employed as a general prescription to illuminate teachers with similar characteristics (principal components) in the data. A medium- sized, suburban district in Alabama and a dataset from the National Center for Education Statistics consisting of 17 urban districts from across the United States provided the requisite student and teacher data to fully implement the process, which concluded with successfully placing teachers into effectiveness categories by grade, subject(s), and year. iii Acknowledgments Marlin Lavon Adams passed away on October 13, 2009, at the age of 76. I had completed a mere semester of doctoral studies at Auburn University on that day. My father wanted me to earn a Ph.D. before I even considered pursuing one. He had vision. I want to thank him and tell him that I miss him. Children of military service members seldom have a choice in where they live. My children are no exception. David has lived through eight assignments, seven moves, and eight schools spanning six states. Leslie has lived through seven assignments, six moves, and six schools spanning five states. They are ?military? children in the true sense of the word by being wonderfully supportive, adaptive, and selfless. I want to thank them for being who they are and tell them that I am proud of them. My bride has proudly stood next to me through every promotion, graduation, and award ceremony for the past 17 years. More importantly, Jennifer has stood next to me through every disappointment, failure, and setback over those same 17 years. 
I want to thank her for the love and support she has given me while I pursued this degree. Dr. Jeffrey Smith took a chance when he agreed to support this educational research. I thank him for serving as my committee chairman and allowing me to tackle an area of research outside of the department. I would also like to thank Dr. David Shannon for serving as a committee member. His efforts to acquire data were significant, and I would not have obtained the requisite data from an Alabama district without his help. Lastly, I would like to thank my iv remaining committee members, Dr. Saeed Maghsoodloo and Dr. Joni Lakin. Their insight and assistance have been immeasurable. Bonnie ?G.B.? Adams moved to Auburn in 2009 to support our family while I completed my doctoral studies and Jennifer completed her degree at Auburn. G.B. packed up her house in Lancaster, California, and moved east to establish a home nearby. Never one to shy away from work, she completed any and every ?duty? to help our family to be successful. I want to thank my mother for her love and support during our time in Auburn. v Table of Contents Abstract ........................................................................................................................................... ii Acknowledgments.......................................................................................................................... iii List of Tables ................................................................................................................................. ix List of Figures ................................................................................................................................ xi List of Abbreviations ................................................................................................................... xiii Chapter 1 : Introduction ................................................................................................................. 1 1.1 Teacher Quality ..................................................................................................................... 1 1.2 No Child Left Behind............................................................................................................ 1 1.3 Alabama?s Race to the Top ................................................................................................... 3 1.4 Educator Evaluation System ................................................................................................. 5 1.5 Research Objectives .............................................................................................................. 7 1.6 Organization of Research ...................................................................................................... 8 Chapter 2 : Literature Review ...................................................................................................... 11 2.1 Introduction ......................................................................................................................... 11 2.2 Linear Mixed Models (LMMs) ........................................................................................... 12 2.2.1 Linear Mixed Model General Specification ................................................................ 14 2.2.2 Linear Mixed Model Hierarchical Specification ......................................................... 17 2.2.3 Linear Mixed Model Implementation in Texas ........................................................... 
19 2.3 Quantile Regression (QR) ................................................................................................... 20 2.3.1 Computing QR Coefficients ........................................................................................ 22 2.3.2 Quantile Regression Extended to Cubic Splines ......................................................... 23 vi 2.3.3 Quantile Regression and B-Splines ............................................................................. 25 2.3.4 Quantile Regression Implementation in Colorado ....................................................... 26 2.4 Discussion ........................................................................................................................... 29 2.4.1 Measures of Effective Teaching (MET) Project .......................................................... 31 2.4.2 A Risk-Mitigated Approach ......................................................................................... 33 2.5 Principal Component Analysis ........................................................................................... 35 2.6 Cluster Analysis .................................................................................................................. 37 2.7 Summary ............................................................................................................................. 40 Chapter 3 : Risk-Mitigated Teacher Effectiveness Index ............................................................ 42 3.1 Introduction ......................................................................................................................... 42 3.2 Phase I: Teacher Effectiveness Metrics ............................................................................. 43 3.2.1 Linear Mixed Model Teacher Effect ............................................................................ 45 3.2.2 Linear Mixed Model Value-Added Measure ............................................................... 46 3.2.3 Median Student Growth Percentile .............................................................................. 48 3.2.4 Quantile Regression Value-Added Measure ................................................................ 49 3.3 Phase II: Principal Component Analysis (PCA) ................................................................ 50 3.4 Phase III: Cluster Analysis................................................................................................. 57 3.4.1 Comparison of Clustering Results with Phase I Metrics ............................................. 62 3.5 Summary ............................................................................................................................. 63 Chapter 4 : Data Analysis ............................................................................................................ 66 4.1 Introduction ......................................................................................................................... 66 4.2 Alabama Data Analysis....................................................................................................... 67 4.2.1 Alabama Reading and Mathematics Test .................................................................... 69 4.2.2 Testing Histories .......................................................................................................... 70 4.2.3 Student, Teacher, and School Level Predictors for Alabama Data.............................. 
72 vii 4.3 National Center for Education Statistics Data Analysis ..................................................... 74 4.3.1 NCES Student Achievement Data Recorded as Z-Scores ........................................... 76 4.3.2 Student, Teacher, and School Level Predictors for NCES Data .................................. 77 4.4 Checking Model Assumptions for the Final Linear Mixed Models ................................... 80 4.5 Diagnostics for the Final Quantile Regression Models ...................................................... 82 4.6 Principal Component Analysis (PCA) ................................................................................ 84 4.7 Cluster Analysis .................................................................................................................. 87 4.7.1 Comparison of Clustering Results with Phase I Metrics ............................................. 90 4.8 Subject-Specific and Overall RMTEI Values ..................................................................... 92 4.9 Precision of the Risk-Mitigated Teacher Effectiveness Index ............................................ 96 4.10 Stability of the Risk-Mitigated Teacher Effectiveness Index ........................................... 99 4.10.1 Comparison of Stability Results with Phase I Metrics ............................................ 105 4.11 Summary ......................................................................................................................... 108 Chapter 5 : Assessment of Teacher Evaluation in Alabama ...................................................... 110 5.1 Introduction ....................................................................................................................... 110 5.2 Professional Education Personnel Evaluation (PEPE) ..................................................... 111 5.3 EDUCATEAlabama ......................................................................................................... 111 5.4 NCES Teacher Effectiveness Scoring Analysis ............................................................... 113 5.5 Summary ........................................................................................................................... 116 Chapter 6 : Research Summary .................................................................................................. 118 6.1 Conclusion ........................................................................................................................ 118 6.2 Limitations of the Risk Mitigated Teacher Effectiveness Index ...................................... 121 6.2.1 Limitations in Reporting ............................................................................................ 121 6.2.2 RMTEI Values and Small Populations of Teachers .................................................. 123 6.3 Future Study ...................................................................................................................... 124 viii References ................................................................................................................................... 127 Appendix 1: Establishing Longitudinal Student Achievement Data Linked with Teacher Information ................................................................................................................................. 134 Appendix 2: SAS Code .............................................................................................................. 
142 Appendix 3: Alabama District Results ...................................................................................... 165 Appendix 4: NCES District Results ........................................................................................... 168 ix List of Tables Table 2.1: Linear and Quantile Regression Comparison ............................................................. 21 Table 3.1: Excerpt of Example Dataset for Phase I Metrics ........................................................ 44 Table 3.2: Partial Output of Solution for Random Effects .......................................................... 46 Table 3.3: Partial Output of LMM Value-Added Measure .......................................................... 47 Table 3.4: Partial Output of Median SGP Metric ........................................................................ 48 Table 3.5: Partial Output of Overall QR Teacher Value-Added Measure ................................... 49 Table 3.6: Output of Phase I Metrics ........................................................................................... 51 Table 3.7: Correlation Matrix of Phase I Metrics ........................................................................ 52 Table 3.8: Eigenvectors of the Correlation Matrix ...................................................................... 53 Table 3.9: Output of Principal Component Scores ...................................................................... 54 Table 3.10: Eigenvalues of the Correlation Matrix ..................................................................... 55 Table 3.11: Correlation of Phase I Metrics with Principal Component 1 .................................... 56 Table 3.12: Sample of Teachers with Extraordinary Gains in Student Growth .......................... 61 Table 3.13: Sample of Teachers with Poor Gains in Student Growth ......................................... 61 Table 4.1: Excerpt from 6th grade, 2009 Dataset ......................................................................... 69 Table 4.2: Components of Alabama Reading and Mathematics Test .......................................... 69 Table 4.3: Summary of Model Predictors for Alabama Data ...................................................... 74 Table 4.4: NCES Testing Histories .............................................................................................. 75 Table 4.5: Summary of Model Predictors for NCES Data .......................................................... 80 Table 4.6: Proportion of the Total Variance for First Principal Component in the Alabama District........................................................................................................................................... 85 x Table 4.7: Proportion of the Total Variance for First Principal Component in a NCES District 86 Table 4.8: Results from PCA for 4th Grade, 2011, in the Alabama District ................................ 87 Table 4.9: Excerpt of Final Results Sorted by the 2009 Mathematics RMTEI value ................. 93 Table 4.10: Excerpt of Treatment Status Analysis ...................................................................... 96 Table 4.11: Correlation of Yearly RMTEI Math Values for the Alabama District ................... 101 Table 4.12: Correlation of Yearly Reading, Overall RMTEI Values for the Alabama District 102 Table 4.13: Correlation of Yearly Mathematics RMTEI Values for NCES Data ..................... 
103 Table 4.14: Correlation of Yearly Reading and Overall RMTEI Values for NCES Data ......... 104 Table 5.1: Correlation of RMTEI Reading Values and Observational Evaluations .................. 115 Table 6.1: Alabama Student Assessment Program Overview ................................................... 122 Table 6.2: Excerpt of Schedule Dataset ..................................................................................... 136 Table 6.3: Excerpt of Course Counts Dataset ............................................................................ 137 Table 6.4: Excerpt of Teacher Information File ........................................................................ 138 Table 6.5: Excerpt from 6th grade, 2009 Dataset ....................................................................... 140 xi List of Figures Figure 2.1: Linear Parameterization of Quantile Regression ....................................................... 21 Figure 2.2: Point/Line Duality (Edgeworth, 1888) ...................................................................... 22 Figure 2.3: Cubic B-Spline Parameterization of Student Growth Percentiles ............................. 27 Figure 3.1: Scree Plot ................................................................................................................... 55 Figure 3.2: Subject-specific RMTEI Value along Dominant Principal Component Axis ........... 57 Figure 3.3: Ward?s Clustering Method Statistics for Determining Number of Clusters ............. 59 Figure 3.4: Ward?s Clustering of Teachers by Effectiveness ...................................................... 60 Figure 3.5: Comparison of Clustering Results with Phase I Metrics ........................................... 62 Figure 4.1: Distribution Analysis of Mathematics LMM Teacher Effects .................................. 80 Figure 4.2: Distribution Analysis of Studentized Residuals from Mathematics LMM ............... 81 Figure 4.3: Paired Comparison of MathGain and Conditional Predicted Value ......................... 82 Figure 4.4: Variance Analysis of Standardized Residuals from 0.5 QR Mathematics Model .... 83 Figure 4.5: Agreement Plot of MathGain versus 0.5 QR Prediction of MathGain ..................... 84 Figure 4.6: Distribution Analysis of Standardized Residuals from 0.5 QR Mathematics Model 84 Figure 4.7: Linear Relationship of the Number of Clusters and Teachers for the Alabama District........................................................................................................................................... 88 Figure 4.8: 4th Grade, 2011, Effectiveness Categories for the Alabama District ........................ 89 Figure 4.9: Linear Relationship of Number of Clusters and Teachers for a NCES District ....... 90 Figure 4.10: Mathematics Clustering Results for 4th Grade, 2011, in the Alabama District ....... 91 Figure 4.11: Mathematics Clustering Results for 3rd Grade, 2006, in a NCES District .............. 91 Figure 4.12: Final 4th Grade Mathematics RMTEI values ........................................................... 94 xii Figure 4.13: Control Chart Analysis for Alabama 4th Grade Mathematics Teachers .................. 98 Figure 4.14: Control Chart Analysis for 3rd Grade Mathematics Teachers in a NCES District .. 98 Figure 4.15: Excerpt of Final Stability Analysis for the Alabama District ............................... 100 Figure 4.16: Excerpt of Final Stability Analysis for NCES Data .............................................. 
103 Figure 4.17: Mathematics Correlation Results in the Alabama District for 2009 ..................... 105 Figure 4.18: Mathematics Correlation Results in the AL District between 2010 and 2011 ...... 106 Figure 4.19: Mathematics Correlation Results for 2006 in a NCES District............................. 107 Figure 4.20: Mathematics Correlation Results between 2007 and 2008 for a NCES District... 107 Figure 5.1: Scatter Plots of RMTEI Reading Values versus Observational Evaluations .......... 115 Figure 6.1: Establishing Longitudinal Student Achievement Data Linked with Teacher Information ................................................................................................................................. 141 xiii List of Abbreviations AL Alabama ALSDE Alabama State Department of Education ARMT Alabama Reading and Mathematics Test ANOVA Analysis of Variance AQTS Alabama Quality Teaching Standards AYP Adequate Yearly Progress CCC Cubic Clustering Criterion COMPETES Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science CSAP Colorado Student Assessment Program CSV Comma Separated Values EBLUP Empirical Best Linear Unbiased Predictor ERIC Education Reform and Innovation Council ESEA Elementary and Secondary Education Act ETS Educational Testing Service EVAAS Education Value Added Assessment System HLM Hierarchical Linear Model IRT Item Response Theory LEA Local Education Agency LMM Linear Mixed Model xiv MET Measures of Effective Teaching NCES National Center for Education Statistics NCLB No Child Left Behind Act NTC New Teacher Center PCA Principal Component Analysis PEPE Professional Education Personnel Evaluation RMTEI Risk Mitigated Teacher Effectiveness Index SGP Student Growth Percentile SQL Structured Query Language TPM Texas Projection Measure TVAAS Tennessee Value-Added Assessment System QR Quantile Regression VAM Value Added Model 1 Chapter 1 : Introduction 1.1 Teacher Quality Teacher quality is one of the most important school related variables associated with student achievement. Many education accountability systems built around student test results ignore the initial conditions of the students and merely measure a status at a specific point in time. Teacher evaluations based solely on a snapshot of the number of students that have attained a particular testing proficiency level without consideration of the initial conditions of those students are flawed. Certainly, status model (snapshot) evaluations provide useful information to administrators, but they do not account for student growth and are ?confounded with other non-school factors? (Lissitz & Doran, 2009, p. 39). A snapshot of student scores could portray a teacher to be ineffective despite tremendous student growth occurring within the classroom, or effective despite little student growth occurring within the classroom. A favorable alternative to status evaluations is to determine which teachers contribute appropriately to students attaining adequate yearly growth. Therefore, a statistically supportable measure of teacher effectiveness based on student growth is desired. 1.2 No Child Left Behind The No Child Left Behind Act (NCLB) of 2001, a reauthorization of the Elementary and Secondary Education Act (ESEA) of 1965, is a performance-based accountability system built around student test results. It requires Adequate Yearly Progress (AYP) determinations for schools to be based on a snapshot of the number of students that have attained proficiency. 
2 Therefore, several states gained approval from the U.S. Department of Education to use ?growth models? to augment AYP calculations. These states could use longitudinal testing data to extrapolate the expected growth for individual students over time thereby accounting for ?differences in the initial conditions of the students? (Pearson & Stecher, 2004, p. 99). This expected growth would then become benchmarks for students to meet in order to be counted for making AYP. ?Fifteen states now have approved growth models: North Carolina, Tennessee, Delaware, Arkansas, Florida, Iowa, Ohio, Alaska, Arizona, Michigan, Missouri, Colorado, Minnesota, Pennsylvania and Texas? (?Secretary Spellings,? 2009, para. 5). The subsequent reauthorization of the ESEA titled ?A Blueprint for Reform?, released by the U.S. Department of Education in March 2010, now requires all state accountability systems to ?recognize progress and growth? (?A Blueprint for Reform,? 2010, p. 9). It goes further by requiring states to ?identify effective and highly effective teachers and principals on the basis of student growth? (?A Blueprint for Reform,? 2010, p. 4). Mathematical techniques presently employed in growth models that augment AYP calculations can be applied to this new state requirement of measuring teacher effectiveness. A specific application of ?growth models? is a Value Added Model (VAM). VAMs can measure the influence of educational entities on student growth using student longitudinal testing data. This measure of influence is the additional value that a teacher brings to the classroom above that of his/her peers. Several states currently calculate a quantitative indicator of value that identifies teachers who employ pedagogical strategies or exhibit certain behaviors that positively impact student learning. Alabama (AL) is absent from this group of states that have such a measure, yet it desires to improve its existing educator evaluation system and meet the requirement of A Blueprint for Reform (Bice, 2010). 3 1.3 Alabama?s Race to the Top The U.S. Department of Education created the Race to the Top grant program to allow states to compete for federal funding that supports needed education reform. The structure of the program allows $4 billion to be funded for state reforms in the following four areas: 1. ?Adopting standards and assessments that prepare students to succeed in college and the workplace; 2. Building data systems that measure student growth and success, and inform teachers and principals how to improve instruction; 3. Recruiting, developing, rewarding, and retaining effective teachers and principals, especially where they are needed most; and 4. Turning around their lowest-performing schools? (?Delaware and Tennessee,? 2011, para. 6). Alabama submitted a Phase I application to the U.S. Department of Education in January 2010 along with 39 other states and the District of Columbia. Tennessee and Delaware won grants for Phase I and were awarded $500 million and $100 million respectively (?Delaware and Tennessee,? 2011, para. 3). ?Delaware and Tennessee [had] aggressive plans to improve teacher and principal evaluation, use data to inform instructional decisions, and turn around their lowest- performing schools? (?Delaware and Tennessee,? 2011, para. 8). Phase II of the program commenced after the announcement of the Phase I winners in March 2010 and had $3.4 billion still available for reform grants. 
Based on recommendations from the reviewers of Alabama?s Phase I application, the Alabama State Department of Education (ALSDE) attempted to bolster its Phase II application by including reforms in teacher and principal evaluation in which teacher and principal effectiveness ratings are tied to student growth. In order to include such reforms in the Phase II application, stakeholder support had to come from the ALSDE Board. In a contentious 5-4 vote 4 on May 27, 2010, with Governor Bob Riley providing the swing vote, the ALSDE Board passed the Educator Effectiveness Resolution tying teacher and principal effectiveness to student performance (Morton, 2010, p. 3). This special session of the ALSDE board on the eve of the Phase II application deadline of June 1, 2010, was critical to being able to submit the comprehensive document of reforms that the ALSDE believed it needed. Alabama submitted its Phase II Race to the Top grant application, ?Advancing Education as the 21st Century Civil Right?, to the U.S. Department of Education on June 1, 2010. The Phase II application contained language throughout that supports research in measuring teacher and principal effectiveness based on student growth. A sample of the reforms follows: 1. ?Create data systems?that are readily available to colleges and universities for research? (?Alabama?s Race,? 2010, p. 4) 2. ?Alabama will redesign the current accountability system ?as measured by student growth? (?Alabama?s Race,? 2010, p. 9). 3. ?Alabama will apply a growth model to existing data?to develop predictive trajectories for its students? (?Alabama?s Race,? 2010, p. 9). 4. ?The Educator Effectiveness Resolution allow[s] the use of multiple and objective measures of student growth outcomes as the predominant factor for determining teacher and principal effectiveness? (?Alabama?s Race,? 2010, p. 86). Alabama came in last place out of the 36 states that submitted Phase II applications. Dr. Joseph Morton, State Superintendent of Education, sent an open letter to Arne Duncan, U.S. Secretary of Education, dated September 1, 2010, following the announcement of the Phase II winners. In the letter Dr. Morton critically addressed the grading of the applications with his assertion that reviewers placed unnecessarily high importance on the ability of states to have charter schools, teacher union support of measuring teacher effectiveness based on student achievement, and adoption of the Common Core Standards. At the time of the application, Alabama had none of those elements. Dr. Morton stated that despite knowing Alabama?s 5 application would not be competitive in Phase II, it would serve as a foundation for needed reforms. The letter demonstrates the conviction of Dr. Morton to provide a document of reforms for the State in spite an anticipated poor outcome and the difficulty to even obtain the authority to submit it (Morton, 2010). 1.4 Educator Evaluation System Alabama desires an objective effectiveness index in its educator evaluation system to augment its presently used formative assessment, EDUCATEAlabama. The combination of the two components would produce a yearly effectiveness score for each teacher in order to place teachers into ?at least? four categories: 1. Extraordinary gains in student growth. 2. Meets student growth. 3. Did not meet student growth but produce evidence that progress is being made. 4. Consistently failed to produce student growth for multiple or consecutive years (?Alabama?s Race,? 2010, p. 90). 
The most important aspect of obtaining such categories is being able to implement policies and practices aimed at preparing every student ?to graduate from high school ready for college and a career? (?A Blueprint for Reform,? 2010, p. 3). For example, effectiveness categories could be used to ensure the equitable distribution of effective teachers, target incentives at effective teachers to teach in high-need schools, and target professional development for those teachers not able to demonstrate effectiveness. Following the development of the new evaluation system, ?alignment? of the two components is desired through refinement of the formative component (?Alabama?s Race,? 2010, p. 90). 6 Alabama?s Race to the Top application contains an aggressive but systematic plan to implement an educator evaluation system that has teacher effectiveness ratings tied to student growth. The application as well as the Educator Effectiveness Resolution passed by the ALSDE Board on May 27, 2011, stipulate that the Education Reform and Innovation Council (ERIC) will be formed to identify an approach to measure teacher and principal effectiveness based on student growth. According to the application, the council was to convene in the summer of 2010 and develop approaches to measuring student growth and teacher and principal effectiveness by December 2010 (?Alabama?s Race,? 2010, p. 86,89). The Educator Effectiveness Resolution stated that the ALSDE Board would receive the recommendations in early 2011 to review, discuss, and take possible action in order for ?full implementation to occur with the first day of school in 2011? (?Educator Effectiveness Resolution,? 2010). The Race to the Top application was less optimistic. It required the development and implementation of an evaluation system that includes an objective measure based on student growth by the 2012-2013 school year (?Alabama?s Race,? 2010, p. 86). Regardless of the implementation date for the new evaluation system, the ERIC never convened to provide the required recommendations. The ALSDE acknowledged that it ?did not know where to start? (Bice, 2010). More study was necessary before the council could come together to determine how to include teacher effectiveness, as measured by student growth, in an educator evaluation system (Bice, 2010). In addition to determining how to best measure teacher effectiveness with student growth, a practical matter exists for Alabama to link student achievement data with teachers. According to the Alabama?s Race to the Top application, Alabama currently meets 10 of the 12 data elements of the America Creating Opportunities to Meaningfully Promote Excellence in 7 Technology, Education, and Science (COMPETES) Act (?Alabama?s Race,? 2010, pp. 59?61). On January 4, 2011 President Obama signed into law the America COMPETES reauthorization Act which further authorizes the investment in ?research and development, education, innovation, and competitiveness? (Holdren, 2011, para. 2). Alabama stated clearly that it met data element No. 8 of America COMPETES which requires a ?teacher identifier with the ability to link teachers to students? (?Alabama?s Race,? 2010, p. 60). Certainly a difference exists between having the ?ability? to link student data to teachers and analyzing student data incorporating teacher identifiers. The State does not provide the linked testing data to the districts nor does it provide analysis as a result of the linkage (Crouse, 2011; DiChiara, 2011). 
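A hypothetical sketch of such a student-teacher linkage follows. It simply joins a teacher scheduling table to yearly student achievement records so that each test score carries a teacher identifier; the table names (ACHIEVEMENT, SCHEDULE) and columns are illustrative only and do not reflect Alabama's actual database schema.

```sas
/* Hypothetical sketch of linking student achievement data with teachers.
   ACHIEVEMENT (one row per student, year, subject, and scale score) and
   SCHEDULE (one row per student course enrollment with the assigned teacher)
   are illustrative names, not the ALSDE's actual schema. */
proc sql;
   create table linked_scores as
   select a.student_id,
          a.test_year,
          a.subject,
          a.scale_score,
          s.teacher_id,
          s.school_id
   from achievement as a
        inner join schedule as s
        on  a.student_id = s.student_id
        and a.test_year  = s.school_year
        and a.subject    = s.course_subject;
quit;
```

Stacking several years of such linked records by student would then yield the teacher-linked, longitudinal data that districts do not currently receive.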
Presently, the ALSDE considers it difficult to link student data with teachers due to the requirement to merge a teacher scheduling database with a student achievement database (Larson, 2010). Districts only receive yearly student achievement data and are not provided longitudinal student data to allow longitudinal analysis to occur (Crouse, 2011). Therefore, the lack of teacher-linked, longitudinal data endures as an administrative issue, not a technical one.

1.5 Research Objectives

The primary objective of this research is to augment Alabama's formative educator evaluation system, EDUCATEAlabama, with a precise and stable teacher effectiveness index based on student growth. This index will be determined with an analytical process involving principal component and cluster analysis that uses multiple and objective measures of teacher effectiveness to minimize the risk inherent in isolating the value that a teacher brings to the classroom above that of his/her peers. In order to accomplish the primary objective, this research intends to accomplish the following sub-objectives:

1) Develop techniques with Alabama's database infrastructure to streamline the process of establishing longitudinal data from existing yearly data in order to make this information readily accessible to school districts.
2) Develop techniques with Alabama's database infrastructure to streamline the process of linking student achievement data with teachers in order to make this information readily accessible to school districts.
3) Confirm or modify Alabama's effectiveness rating categories based on being able to detect statistically different groups of teachers by effectiveness.
4) Report all teachers in accordance with the rating categories for an Alabama school district using the newly developed objective effectiveness index.
5) Provide an assessment of Alabama's educator evaluation system, compare and contrast the results of an objective effectiveness index with observational teacher evaluation data, and propose an observational, summative assessment for Alabama that is a predictor of student achievement gains and correlated with an objective effectiveness index.

1.6 Organization of Research

The structure needed to accomplish the research objectives of Chapter 1 consists of five additional chapters. Chapter 2 contains a literature review to provide the technical context for the development of a teacher effectiveness index. Through the use of an example, Chapter 3 describes the methodology of computing the Risk-Mitigated Teacher Effectiveness Index (RMTEI) within a three-phase process. Phase I entails calculating four teacher effectiveness metrics:

1) Linear Mixed Model Teacher Effect - a statistical prediction of the relative value of a particular teacher, measured as the teacher's deviation from the district mean; greater is better.
2) Overall Linear Mixed Model Value-Added Measure - a teacher's average of the difference between students' actual achievement and their predicted achievement had they been taught by the average teacher in the district; greater is better.
3) Median Student Growth Percentile - an indicator of student growth associated with each teacher, calculated as the median of the growth percentiles of a teacher's students. A student's growth percentile is obtained by determining what percentage of other students had less growth in testing achievement; greater is better.
4) Overall Quantile Regression Value-Added Measure - a teacher's average of the difference between students'
actual achievement and predicted achievement of a typical student within the district; greater is better.

Subject-specific and overall teacher index values are calculated in Phase II with an analytical process involving principal component analysis. Since teachers' Phase I metrics for a given subject, grade, and year have varying units of measurement, the principal components are obtained from the standardized version of those metrics by calculating the eigenvectors of the metrics' correlation matrix. The variance of the principal component scores from the dominant principal component is the largest eigenvalue, which is associated with the corresponding eigenvector. This eigenvalue absorbs the preponderance of the variation in the system, and the principal component scores from the dominant principal component become teachers' subject-specific RMTEI values. If teachers instruct both mathematics and reading, then an overall teacher effectiveness index is obtained by taking the mean of their two subject indexes. Otherwise, the single subject index is the teacher's overall value.

The principal components serve as the inputs to Phase III, Cluster Analysis. Since the RMTEI process produces a dominant principal component that continually leads to elliptically shaped clusters in two dimensions, Ward's clustering method, which expects elliptically shaped clusters, is employed as a general prescription of the RMTEI process. Ward's clustering method illuminates teachers with similar characteristics (principal components) in the data, which provides better assignments to teacher effectiveness categories compared to clustering with a single Phase I metric.

The methodology is placed into practice in Chapter 4 by examining five years of student and teacher data from an Alabama school district and four years of student and teacher data from 418 elementary schools in 17 urban districts from across the United States. Further analysis is undertaken in Chapter 4 to confirm the desired outcome of a precise and stable teacher effectiveness index along with a discussion of the results. Chapter 5 provides an assessment of Alabama's educator evaluation system and then proposes an observational, summative assessment for Alabama that is a predictor of student achievement gains and correlated with the Risk-Mitigated Teacher Effectiveness Index. Lastly, Chapter 6 concludes the dissertation and provides recommendations for future study.

Chapter 2: Literature Review

2.1 Introduction

Most prominent methods of analytically measuring teacher effectiveness are rooted in the application of Linear Mixed Models. Another analytical method, currently applied to determining school effectiveness in the state of Colorado and extendable to teacher effectiveness, employs Quantile Regression (Betebenner, 2007). Methods rooted in the application of Linear Mixed Models based on longitudinal student testing data either derive individual teacher effects in order to make teacher comparisons or make student test score projections for the successive year. For the latter case, students' projections can be compared against their actual performance at the end of the year. If students' actual scores are above the projections, then their teachers have provided instruction that contributed to obtaining appropriate growth. Conversely, if students' actual scores are below projections, then their teachers have not provided instruction that contributed to obtaining appropriate growth.
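To make the projection-versus-actual comparison concrete, the sketch below averages each teacher's student-level differences between actual and projected scores into a simple value-added style indicator. The dataset PROJECTIONS and its variables are hypothetical placeholders, and the sketch illustrates the general idea rather than any state's operational calculation.

```sas
/* Hypothetical sketch: PROJECTIONS holds one row per student with
   TEACHER_ID, ACTUAL_SCORE, and PROJECTED_SCORE (the model-based projection). */
data growth;
   set projections;
   residual_gain = actual_score - projected_score;  /* positive: student exceeded the projection */
run;

/* Average the student-level differences within each teacher to form a
   simple value-added style indicator for the year. */
proc means data=growth noprint nway;
   class teacher_id;
   var residual_gain;
   output out=teacher_vam mean=value_added n=n_students;
run;
```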
Quantile Regression has been applied to student testing scores within a school district in order to calculate a growth percentile for each student, which can be viewed similarly to a child's growth chart assessment following a visit to the doctor. The median of a school's aggregated student growth percentiles can be calculated and compared to other schools' medians in the district. These comparisons contribute to schools obtaining recognition for performance ("Colorado," 2008, p. 10). The calculations needed to make these comparisons, however, deserve investigation as a technique to be applied to measuring teacher effectiveness.

2.2 Linear Mixed Models (LMMs)

The most prominent LMM approach to measure teacher effectiveness belongs to Dr. William Sanders of the SAS Institute, who implemented the Tennessee Value-Added Assessment System (TVAAS) in 1992 (Braun, Chudowsky, & Koenig, 2010, p. 3; McCaffrey, Lockwood, Koretz, & Hamilton, 2004, p. 2). This methodology is now packaged as the SAS Education Value Added Assessment System (EVAAS) for K-12 and is commercially available for implementation by states and school districts. Tennessee, Pennsylvania, Ohio, and North Carolina presently use SAS EVAAS for K-12 as a means to measure the effects of teachers on the academic growth of their students ("SAS EVAAS," 2010).

LMMs are statistical models that "quantify the relationship between a continuous dependent variable and various predictor variables" (West, Welch, & Galecki, 2007, p. 9). Datasets that can be analyzed with LMMs include nested data (e.g., students in classrooms) and longitudinal or repeated measures studies (subjects are measured regularly over time or under different conditions) (West et al., 2007, p. 1). LMMs get their name from the fact that the model is a linear function of the predictor coefficients (parameters), and the predictors themselves can be a mix of fixed and random effects (West et al., 2007, p. 1). Fixed effects can be either continuous or categorical and describe the relationship between the predictors and the response for the entire population (West et al., 2007, p. 9). If an effect has factor levels that can expand during the course of a study, then the effect can be considered a sample of the population and thus random (e.g., teachers will change during a study as new teachers emerge each year) (Lissitz & Doran, 2009, p. 24). Random effects model the random variation in the response variable for different levels in the data (West et al., 2007, p. 9).

Whether to estimate teacher effectiveness as a fixed or random effect has consequences for LMMs. If one considers teachers a random effect, then the variance of the estimates is reduced at the expense of introducing more bias (Braun et al., 2010, p. 52). Conversely, modeling teacher effectiveness as a fixed effect reduces bias but tends to produce "quite volatile" estimates, particularly for teachers with small numbers of students (Braun et al., 2010, p. 52). By believing something is known about the distribution of teacher effects, "a large positive or negative estimate of the teacher effect is unlikely and is probably the result of random errors" (Braun et al., 2010, p. 52). Most well-known models estimating teacher effects specify the effects to be random. The random effects are calculated as Empirical Best Linear Unbiased Predictors (EBLUPs) and shrunk toward the mean with reduced variance but, perhaps, with the introduction of some bias (Braun et al., 2010, p. 52).
One can then obtain variability estimates of the random teacher effects in order to make inferences about the random effects within the population (West et al., 2007, p. 2). Students are often the subjects of analysis nested within teachers nested within schools. Longitudinal data exists when multiple evaluations are made on the same subject (student) over time. Evaluations of the same subject over time are most likely correlated, and LMMs capture this correlation by estimating the covariance parameters. With improvements in software, LMMs can now fit different covariance structures to the data while capturing this correlation. Depending on the covariance structure specified in the formulation, efficiency can be obtained by not having to estimate the full covariance structure of the multivariate normal model. For example, one could specify that the covariance between random effects is zero, thus the structure of the $D$ matrix (see Section 2.2.1) for two random effects can be reduced:

$$D = \operatorname{Var}(u_i) = \begin{bmatrix} \operatorname{Var}(u_{1i}) & 0 \\ 0 & \operatorname{Var}(u_{2i}) \end{bmatrix}$$

In addition to the benefit of being able to specify the covariance structure in the model formulation, one can also fit a LMM to a dataset with missing observations (West et al., 2007, pp. 2-3).

2.2.1 Linear Mixed Model General Specification

LMMs can be specified in a general or hierarchical manner. Statistical software packages such as SAS, SPSS, and R follow the general formulation, whereas Hierarchical Linear Model (HLM) software follows the hierarchical formulation (West et al., 2007, p. 1). This section follows the general formulation by West et al., in which a Linear Mixed Model is initially presented for just a single student $i$ with scores from $1, \ldots, n_i$, followed by a description of the matrices for the collection of students (2007, pp. 16-22):

$$Y_i = \underbrace{X_i \beta}_{\text{fixed}} + \underbrace{Z_i u_i}_{\text{random}} + \varepsilon_i, \qquad u_i \sim N(0, D), \qquad \varepsilon_i \sim N(0, R_i)$$

$Y_i$ is an $n_i \times 1$ observation vector of test scores for the $i$th student.

$$Y_i = \begin{bmatrix} Y_{1i} \\ Y_{2i} \\ \vdots \\ Y_{n_i i} \end{bmatrix}$$

$X_i$ is a known $n_i \times p$ matrix, which represents the values of the $p$ predictors, such as previous test scores in a particular subject. In a model including an intercept term, the first column would be equal to 1 for all observations.

$$X_i = \begin{bmatrix} X_{1i}^{(1)} & X_{1i}^{(2)} & \cdots & X_{1i}^{(p)} \\ X_{2i}^{(1)} & X_{2i}^{(2)} & \cdots & X_{2i}^{(p)} \\ \vdots & \vdots & & \vdots \\ X_{n_i i}^{(1)} & X_{n_i i}^{(2)} & \cdots & X_{n_i i}^{(p)} \end{bmatrix}$$

$\beta$ is an unknown $p \times 1$ vector of fixed effects to be estimated from the data.

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$

$Z_i$ is a known $n_i \times q$ matrix of observed values for the $q$ predictor variables for the $i$th student that vary randomly across students. In a model in which only the intercept is assumed to be random from student to student, the $Z_i$ matrix would be a column of 1's.

$$Z_i = \begin{bmatrix} Z_{1i}^{(1)} & Z_{1i}^{(2)} & \cdots & Z_{1i}^{(q)} \\ Z_{2i}^{(1)} & Z_{2i}^{(2)} & \cdots & Z_{2i}^{(q)} \\ \vdots & \vdots & & \vdots \\ Z_{n_i i}^{(1)} & Z_{n_i i}^{(2)} & \cdots & Z_{n_i i}^{(q)} \end{bmatrix}$$

$u_i$ is an unknown $q \times 1$ vector of random effects to be estimated from the data:

$$u_i = \begin{bmatrix} u_{1i} \\ u_{2i} \\ \vdots \\ u_{qi} \end{bmatrix} \sim N(0, D)$$

$D$ is a $q \times q$ variance-covariance matrix that reflects the correlation among the random effects. Elements along the main diagonal represent the variances of each random effect in $u_i$, and off-diagonal elements represent the covariance between two corresponding random effects. $D$ is symmetric and positive definite (an $n \times n$ real symmetric matrix $M$ is positive definite if $z^{T} M z > 0$ for all non-zero vectors $z$ with real entries).

$$D = \operatorname{Var}(u_i) = \begin{bmatrix} \operatorname{Var}(u_{1i}) & \operatorname{cov}(u_{1i}, u_{2i}) & \cdots & \operatorname{cov}(u_{1i}, u_{qi}) \\ \operatorname{cov}(u_{1i}, u_{2i}) & \operatorname{Var}(u_{2i}) & \cdots & \operatorname{cov}(u_{2i}, u_{qi}) \\ \vdots & \vdots & & \vdots \\ \operatorname{cov}(u_{1i}, u_{qi}) & \operatorname{cov}(u_{2i}, u_{qi}) & \cdots & \operatorname{Var}(u_{qi}) \end{bmatrix}$$

$\varepsilon_i$ is a non-observable $n_i \times 1$ random vector variable representing unaccountable random variation.

$$\varepsilon_i = \begin{bmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \vdots \\ \varepsilon_{n_i i} \end{bmatrix} \sim N(0, R_i)$$

$R_i$ is a positive definite symmetric covariance matrix of the form:

$$R_i = \operatorname{Var}(\varepsilon_i) = \begin{bmatrix} \operatorname{Var}(\varepsilon_{1i}) & \operatorname{cov}(\varepsilon_{1i}, \varepsilon_{2i}) & \cdots & \operatorname{cov}(\varepsilon_{1i}, \varepsilon_{n_i i}) \\ \operatorname{cov}(\varepsilon_{1i}, \varepsilon_{2i}) & \operatorname{Var}(\varepsilon_{2i}) & \cdots & \operatorname{cov}(\varepsilon_{2i}, \varepsilon_{n_i i}) \\ \vdots & \vdots & & \vdots \\ \operatorname{cov}(\varepsilon_{1i}, \varepsilon_{n_i i}) & \operatorname{cov}(\varepsilon_{2i}, \varepsilon_{n_i i}) & \cdots & \operatorname{Var}(\varepsilon_{n_i i}) \end{bmatrix}$$

One assumes the residuals of different subjects are independent of each other and the vector of residuals, $\varepsilon_1, \ldots, \varepsilon_m$, and random effects, $u_1, \ldots, u_m$, are independent of each other. One can also specify the LMM for all students as:

$$Y = \underbrace{X \beta}_{\text{fixed}} + \underbrace{Z u}_{\text{random}} + \varepsilon, \qquad u \sim N(0, G), \qquad \varepsilon \sim N(0, R)$$

$Y$ is an $n \times 1$ vector where $n = \sum_i n_i$. This is a result of placing all $Y_i$ as defined above on top of each other. The $X$ matrix is $n \times p$, obtained by placing all $X_i$ on top of each other. $Z$ is a block-diagonal matrix, with the $Z_i$'s stated above on the diagonal. The $u$ vector places all $u_i$ on top of each other. The $\varepsilon$ vector places all $\varepsilon_i$ on top of each other. The $G$ matrix is a block-diagonal matrix representing the variance-covariance matrix for all random effects, with blocks of $D$ as stated above for each subject along the diagonal. The $R$ matrix is an $n \times n$ block-diagonal matrix representing the variance-covariance matrix for all residuals, with blocks of $R_i$ along the diagonal.

2.2.2 Linear Mixed Model Hierarchical Specification

The Hierarchical Model Specification belongs to the work of Raudenbush and Bryk (2002) and will involve three levels of data and models. It begins with the student-level predictor variables and one student-level outcome variable: test score gain. This constitutes the Level 1 Model (Student):

$$TestGain_{ijk} = b_{0jk} + \beta_1 X_{ijk}^{(1)} + \beta_2 X_{ijk}^{(2)} + \varepsilon_{ijk}$$

where $\varepsilon_{ijk} \sim N(0, \sigma^2)$. The outcome variable for student $i$ with teacher $j$ nested in school $k$ depends on an unobserved intercept specific to teacher $j$ nested in school $k$, the fixed effects $\beta_1$ and $\beta_2$ for the student predictors $X^{(1)}$ and $X^{(2)}$, and student residuals.

Level 2 Model (Teacher):

$$b_{0jk} = b_{0k} + \beta_3 T_{jk}^{(1)} + u_{jk}$$

where $u_{jk} \sim N(0, \sigma^2_{teacher})$. The Level 1 intercept, $b_{0jk}$, for teacher $j$ nested in school $k$ depends on an unobserved intercept specific to school $k$, $b_{0k}$, a teacher-specific effect $\beta_3$ for teacher predictor $T^{(1)}$, and a random effect, $u_{jk}$, associated with teacher $j$ within school $k$.

Level 3 Model (School):

$$b_{0k} = \beta_0 + \beta_4 S_k^{(1)} + u_k$$

where $u_k \sim N(0, \sigma^2_{school})$. The Level 2 school-specific intercept, $b_{0k}$, depends on the overall fixed intercept $\beta_0$, a school-specific effect $\beta_4$ for school predictor $S^{(1)}$, and the random effect $u_k$ associated with the intercept for school $k$.

The nesting of students within teachers within schools is problematic with longitudinal data consisting of more than one test score for a student due to students not receiving instruction from the same teacher every year. Even in a two-level model consisting of students and schools, the HLM structure is difficult to apply with longitudinal data due to the transient nature of some students as they move to different schools after a school year. Removing transient students in a two-level model as a remedy to the problem would likely not be appropriate for a researcher trying to develop the best model incorporating every type of student.
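A minimal sketch of how the general specification above might be fit in practice with PROC MIXED follows, with random intercepts for schools and for teachers nested within schools. The dataset STUDENTS and the predictor names are hypothetical, and the model shown is purely illustrative rather than any of the final models developed later in this research.

```sas
/* Illustrative three-level fit: students within teachers within schools.
   STUDENTS, MATHGAIN, PRIOR_SCORE, SES, TEACHER_ID, and SCHOOL_ID are
   hypothetical names used only for this sketch. */
ods output SolutionR=random_effects;   /* capture the EBLUPs of the random effects */

proc mixed data=students covtest;
   class teacher_id school_id;
   /* Fixed effects (the X*beta portion) */
   model mathgain = prior_score ses / solution outp=conditional_pred;
   /* Random intercepts (the Z*u portion) */
   random intercept / subject=school_id solution;
   random intercept / subject=teacher_id(school_id) solution;
run;
```

The teacher rows of the captured random-effects solution are the EBLUPs discussed earlier, shrunk toward the district mean.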
With multiple yearly test scores for students as predictors, the HLM structure cannot be sustained. With a single test score as a predictor, the elegant structure of the HLM is easily satisfied (Lissitz & Doran, 2009, p. 25). In order to implement an HLM with longitudinal data, the process would have to be a yearly one. A researcher would need to separate the longitudinal data into single-year data and apply an HLM every year based on a single test score as a predictor. Although practical and efficient, this approach would yield less accurate predictions of student achievement since all prior known information would not be used.

In either the general or hierarchical specification of the LMM, a final model can be developed to predict MathGain scores for all students using data from the previous cohort of students. This is done to offset the inherent lack of randomization of students being assigned to teachers and the subsequent confounding nature of student test scores with other characteristics of students (Lissitz & Doran, 2009, p. 20). "This predicted score can be thought of as each student's counterfactual level of achievement - that is, their predicted achievement had they been taught by a different teacher (say, the average teacher in the district)" (Corcoran, 2010, p. 10). The difference between the prediction and the student's actual performance becomes the teacher's value-added measure for that student. Averaging the value-added measures for a teacher's students becomes his/her final value-added measure for the year (Corcoran, 2010, p. 10).

2.2.3 Linear Mixed Model Implementation in Texas

In order to meet several of its own legislative acts as well as Federal requirements to include growth in its AYP calculations, Texas developed the Texas Projection Measure (TPM) (Texas Education Agency, 2009, p. 6). The TPM uses Linear Mixed Models to augment AYP calculations as discussed in Section 1.2. The Texas Education Agency's Growth Model Pilot Application to the U.S. Department of Education in January 2009 describes its Linear Mixed Model implementation methods, whereby Texas receives AYP credit both for students who met proficiency and for students who are projected to meet proficiency (Texas Education Agency, 2009). The TPM is used to make projections of student achievement test scores in reading and mathematics at selected evaluation grades in the future (Texas Education Agency, 2009, p. 1). The models use current-year scale scores in reading and mathematics and school-level mean scores in the projection subject as predictors. Development of the models, however, is accomplished the year prior with data from the previous cohort of students. For example, 3rd and 4th grade data from cohort 2019 can be used to develop a model to make a 4th grade prediction using Linear Mixed Models. The model is applied to the successive cohort (the class that just finished 3rd grade, cohort 2020). The development of the models using the data of the previous year allows school administrators to know projected scores of their current students prior to the beginning of the school year.

In terms of AYP, projections contribute to the calculations by adding the number of students who are projected to meet the proficiency target at a specific evaluation grade in the future to the number of students who already meet the proficiency target at the current grade (Texas Education Agency, 2009, p. 12). This sum divided by the number of students in the particular grade determines the AYP percentage.
This percentage is compared with the state objectives for that grade. Whether aggregation occurs at the state, district, or school level, AYP above stated objectives defines success for the year.

2.3 Quantile Regression (QR)

A linear regression model identifies the conditional-mean function. It assumes constant variance and normality of its residuals. Difficulties arise when trying to overcome model inadequacy should assumption violations be present. Outliers also negatively affect a linear regression model by having an undue influence on the results. As a result, some linear regression models may not account for the full distributional properties of the response and can be invalid. An alternative modeling approach is desired to remedy these inadequacies (Hao & Naiman, 2007, pp. 22, 24-25).

To address these issues with conditional-mean estimates, Quantile Regression was developed to estimate the effect of predictors on the various quantiles that make up the distribution of the data. Similar to linear regression, quantile regression models can deal with continuous response variables. Unlike linear regression, quantile regression can account for the full distributional properties of the response (Hao & Naiman, 2007, p. 29). In terms of educational data, estimation of a quantile is accomplished with students' scores at time $t$ using the students' prior scores at times $1, 2, \ldots, t-1$ as the conditioning variables (Betebenner, 2007, p. 3). As a means to showcase Quantile Regression, a comparison with Linear Regression is developed in Table 2.1 using the QR notation of Hao and Naiman (2007).

Table 2.1: Linear and Quantile Regression Comparison

Model | Formulation | Equation
Linear Regression | $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$ | $E[y_i \mid x_i] = \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Quantile Regression | $y_i = \beta_0^{(p)} + \beta_1^{(p)} x_i + \varepsilon_i^{(p)}$ | $Q^{(p)}[y_i \mid x_i] = \beta_0^{(p)} + \beta_1^{(p)} x_i$, thus $Q^{(p)}(\varepsilon_i^{(p)}) = 0$

The estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ of least-squares estimation minimize the sum of squared distances between the data points $(x_i, y_i)$ and the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. Extending this concept to QR, one seeks to find estimators that minimize the sum of weighted vertical distances between data points and a fitted line, where points below the fitted line are weighted $1-p$ and points above the fitted line are weighted $p$. Each $p$ (.05, .25, .50, for example) leads to a different fitted line, called a conditional-quantile function, that has a proportion $p$ of the data points below the fitted line and a proportion $1-p$ above it (Hao & Naiman, 2007, pp. 33-34). An example of the linear parameterization of QR is shown in Figure 2.1.

Figure 2.1: Linear Parameterization of Quantile Regression

2.3.1 Computing QR Coefficients

Each line in the $(x, y)$ plane of the form $y = \beta_0 + \beta_1 x$ has a corresponding point $(\beta_0, \beta_1)$ in the $(\beta_0, \beta_1)$ plane. Conversely, lines in the $(\beta_0, \beta_1)$ plane of the form $\beta_1 = (y/x) - (1/x)\beta_0$ correspond to points in the $(x, y)$ plane. The goal is to find a line in the $(x, y)$ plane that minimizes the sum of weighted vertical distances between the data points and the line. This line corresponds to a point in the $(\beta_0, \beta_1)$ plane. For example, a sample of lines in the $(x, y)$ plane has a corresponding set of points in the $(\beta_0, \beta_1)$ plane that form polygonal regions. Figure 2.2 depicts lines in the $(x, y)$ plane corresponding to points in the $(\beta_0, \beta_1)$ plane of the same color, with lines in the $(\beta_0, \beta_1)$ plane corresponding to points in the $(x, y)$ plane of the same color.
Figure 2.2: Point/Line Duality (Edgeworth, 1888)

The vertices of these polygonal regions in the $(\beta_0, \beta_1)$ plane are extreme points. Each polygonal region in the $(\beta_0, \beta_1)$ plane corresponds to a family of lines in the $(x, y)$ plane that maintain the same number of points above or below the line. Based on exterior point algorithms for solving linear-programming problems, one can start at one of the vertices in the $(\beta_0, \beta_1)$ plane and iteratively move from vertex to vertex along the edges of the polygonal regions, choosing at each step the vertex with the smallest value of

$$\sum_{i:\, y_i \ge \hat{\beta}_0 + \hat{\beta}_1 x_i} p\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \;+\; \sum_{i:\, y_i < \hat{\beta}_0 + \hat{\beta}_1 x_i} (1-p)\,(\hat{\beta}_0 + \hat{\beta}_1 x_i - y_i) \qquad (2\text{-}1)$$

In the $(x, y)$ plane, one is iteratively moving from line to line defined by pairs of data points, at each step deciding which new data point to swap with one of the two current ones by picking the one that leads to the smallest value of Equation (2-1) (Hao & Naiman, 2007, pp. 34-38). Practically, the Quantile Regression Procedure in SAS "offers simplex, interior point, and smoothing algorithms for estimation" ("SAS/STAT 9.2," 2009, p. 5354). Figure 2.2 depicts the point in the $(\beta_0, \beta_1)$ plane that corresponds to the line in the $(x, y)$ plane minimizing Equation (2-1) for $p = .25$, which yields the .25 quantile.

2.3.2 Quantile Regression Extended to Cubic Splines

"A spline of degree 3 is a piecewise cubic curve whose values, slopes, and curvature coincide at the knots. Visually, a cubic spline is a smooth curve, and it is the most commonly used spline when a smooth fit is desired" ("SAS/STAT 9.2," 2009, p. 387). Given data points $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$, one can approximate the data by fitting a cubic polynomial between each pair of consecutive data points. Extending the formulation of quadratic splines by Kaw and Keteltas (2009, p. 6), the cubic splines are

$$f(x) = a_1 x^3 + b_1 x^2 + c_1 x + d_1, \quad x_0 \le x \le x_1 \qquad (2\text{-}2)$$
$$f(x) = a_2 x^3 + b_2 x^2 + c_2 x + d_2, \quad x_1 \le x \le x_2 \qquad (2\text{-}3)$$
$$\vdots$$
$$f(x) = a_n x^3 + b_n x^2 + c_n x + d_n, \quad x_{n-1} \le x \le x_n$$

These cubic splines involve $4n$ coefficients, which can be determined by simultaneously solving $4n$ equations. Because each spline passes through the two data points that bound its interval, $2n$ equations are created:

$$a_1 x_0^3 + b_1 x_0^2 + c_1 x_0 + d_1 = f(x_0)$$
$$a_1 x_1^3 + b_1 x_1^2 + c_1 x_1 + d_1 = f(x_1)$$
$$\vdots$$
$$a_n x_{n-1}^3 + b_n x_{n-1}^2 + c_n x_{n-1} + d_n = f(x_{n-1})$$
$$a_n x_n^3 + b_n x_n^2 + c_n x_n + d_n = f(x_n)$$

The cubic splines must also be smooth at the interior points. At a particular interior point (where two splines meet), the first derivatives must be equal (the same slope). For example, the first derivatives of Equations (2-2) and (2-3) are equal at $x_1$:

$$3 a_1 x_1^2 + 2 b_1 x_1 + c_1 - 3 a_2 x_1^2 - 2 b_2 x_1 - c_2 = 0,$$

and, in general, $3 a_i x_i^2 + 2 b_i x_i + c_i - 3 a_{i+1} x_i^2 - 2 b_{i+1} x_i - c_{i+1} = 0$ at each interior point $x_i$. These conditions produce $n-1$ equations. Lastly, the second derivative of each spline must be equal at each interior point (the same curvature) (House, 2010, p. 6):

$$6 a_i x_i + 2 b_i - 6 a_{i+1} x_i - 2 b_{i+1} = 0, \quad i = 1, 2, \ldots, n-1,$$

which also produces $n-1$ equations. The total number of equations generated so far is $2n + (n-1) + (n-1) = 4n - 2$. In order to produce the two additional equations, assume the second derivatives are zero at the endpoints, which produces a natural spline (Mathews, 2004, para. 2). A natural spline has endpoints that are inflection points:

$$f''(x_0) = 6 a_1 x_0 + 2 b_1 = 0$$
$$f''(x_n) = 6 a_n x_n + 2 b_n = 0$$
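Rather than assembling and solving the resulting linear system by hand, a natural cubic spline can be obtained from a standard numerical library. The sketch below uses SciPy with hypothetical data points; it illustrates the natural-spline endpoint conditions above and is not the formulation used in this research.

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical knots/data points (x must be strictly increasing).
x = np.array([0.0, 1.0, 2.5, 4.0, 5.0])
y = np.array([1.2, 2.0, 0.5, 1.8, 2.2])

# bc_type='natural' imposes f''(x0) = f''(xn) = 0, the two endpoint
# conditions that complete the system of equations described above.
spline = CubicSpline(x, y, bc_type="natural")

# The second derivative at the endpoints is (numerically) zero.
print(spline(np.array([0.0, 5.0]), 2))   # approximately [0., 0.]
print(spline(3.1))                       # interpolated value between knots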
With $4n$ equations and $4n$ unknowns, one can solve for the coefficients of each cubic spline. An extension of the process above is to express the piecewise cubic spline $f(x)$ as a linear combination of cubic spline basis functions, creating a single interpolating function that is more stable during numerical calculations (Kincaid & Cheney, 2002, p. 366). The interpolating function evaluates the spline curves in basis form.

2.3.3 Quantile Regression and B-Splines

The coefficients, or weights, of the cubic splines obtained in the previous section are determined explicitly from a select number of points through which the splines must pass. Given a large dataset, however, one must determine the best coefficients for an interpolating function that describes the data, knowing only the boundaries and the interior locations (knots) that give the spline its shape. This is achieved through quantile regression, in which the coefficients are chosen to minimize the sum of weighted residuals for a specific quantile. Following the text of Hao and Naiman (2007, p. 37), the $p$th quantile-regression coefficients are the values that minimize the weighted sum of distances between $\hat{y}_i$ and $y_i$, where a weight of $1-p$ is used if the fitted value $\hat{y}_i$ overpredicts the observed value $y_i$ and a weight of $p$ is used if it underpredicts the observed value. Specifically, minimization of a weighted sum of the residuals $y_i - \hat{y}_i$ is desired, where positive residuals receive a weight of $p$ and negative residuals receive a weight of $1-p$:

$$\sum_{i:\, y_i \ge \hat{y}_i} p\,\lvert y_i - \hat{y}_i \rvert \;+\; \sum_{i:\, y_i < \hat{y}_i} (1-p)\,\lvert y_i - \hat{y}_i \rvert$$

The Colorado Department of Education currently implements a model that "parameterize[s] the conditional quantile functions as linear combinations of B-spline basis functions" (Betebenner, 2007, p. 5). As Betebenner (2007) points out, models built on B-spline basis functions do a good job of interpolating data that is skewed or does not have constant variance: "Using B-splines is attractive both theoretically and computationally in that they provide excellent data fit, seldom lead to estimation problems, and are simple to implement in available software. As will be seen when examining goodness-of-fit, use of B-splines instead of linear percentile curves leads to appreciable improvement in goodness-of-fit over the more common linear parameterization of the conditional percentile functions" (2007, p. 5). The implication for education data is that B-splines can account for "slightly greater variability for higher ... scale scores than for lower scores" (Betebenner, 2007, p. 5).

2.3.4 Quantile Regression Implementation in Colorado

Quantile Regression is used by the state of Colorado to calculate Student Growth Percentiles (SGP) and determine whether a student has made a year's worth of growth over the period of a year. The resulting model calculates an SGP for each student based on a normative comparison with all other students with the same testing history. The minimum testing history is two successive Colorado Student Assessment Program (CSAP) tests in at least one academic subject (Betebenner, 2007, p. 2). SGPs can also be compared to the "50th percentile representing typical growth or one year's growth in one year's time" and evaluated to determine whether growth is sufficient "to reach proficient and advanced levels of achievement within one, two, and three years" ("Colorado," 2008, p. 10).
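A hedged sketch of this approach, conditional quantile curves parameterized on a cubic B-spline basis of the prior score, is shown below using statsmodels; the data, knot choice, and variable names are hypothetical and do not reproduce Colorado's SGP software.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical data: prior-year score and the gain on the next test,
# with variability that grows as the prior score grows.
prior = rng.uniform(80, 160, size=500)
gain = 30 - 0.15 * prior + rng.normal(0, 4 + 0.03 * prior, size=500)
data = pd.DataFrame({"prior": prior, "gain": gain})

# Each conditional quantile curve is a linear combination of cubic B-spline
# basis functions of the prior score (patsy's bs() supplies the basis).
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
fits = {q: smf.quantreg("gain ~ bs(prior, df=7, degree=3)", data).fit(q=q)
        for q in quantiles}

# Predicted median gain for a hypothetical student with a prior score of 120.
print(fits[0.50].predict(pd.DataFrame({"prior": [120.0]})))

Evaluating all of the fitted percentile curves at a student's prior score gives the reference distribution from which that student's growth percentile can be read.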
For example, models are developed for five different percentiles with Quantile Regression using a dataset of scaled 3rd Grade Mathematics test scores and the MathGain obtained following the administration of the 4th Grade Mathematics Test. Three equally spaced 27 knots positioned between the minimum and maximum values of 3rd Grade Mathematics Score are chosen. Individual Student Growth Percentiles (SGP) are determined by plotting the individual?s 3rd Grade Mathematics Score versus his/her MathGain to provide a reference for individuals with respect to the population. For example, a student with a score of 50 on the 3rd Grade Mathematics test obtains a 77 on the 4th Grade Mathematics Test. The MathGain of 27 is used to complete the plot in Figure 2.3 and obtain a SGP of 75%. Therefore, the student?s growth was greater or equal to 75% of students with the same testing history (?Colorado,? 2008, p. 9). Figure 2.3: Cubic B-Spline Parameterization of Student Growth Percentiles Colorado currently calculates the medians of individual SGPs to ?quantify the level of student growth attained at specific schools and districts relative to other schools and districts within the state? (?Colorado,? 2008, p. 6). 28 ?The median SGP computed for each school serves as an indicator of student growth associated with each school,? describes a characteristic of a school?s students as a group and can be used to evaluate school outcomes,? [and] measures the relative growth that has occurred for the group attending a specific school? (?Colorado,? 2008, p. 10). An extension of this process to be discussed in Chapter 3 can be to calculate the median of individual SGPs for a teacher and compare it to the median of other teachers within the same population. As stated previously, Quantile Regression chooses the best coefficients for each quantile interpolating function given boundaries of the dataset and interior locations, knots, which help give the spline its shape. One can choose any location for the desired knots when determining an interpolating function for each quantile. If knots that are not equidistant from each other are selected, then one has a Non-Uniform Rational B-Spline (NURB). Significant literature exists for different knot placement techniques that may lead to better fitting models. The Colorado Department of Education author, Betebenner, followed the work of Wei and He (2006) in which they preselected a set of knots for growth charts ?using [their] general understanding of growth patterns? (Wei & He, 2006, p. 2073). Wei and He (2006) placed more knots during rapid changes in the data, such as during infancy and puberty, compared to other times. They stated, ?In this paper, we do not go into the issue of automated knot selection? (Wei & He, 2006, p. 2073). Betebenner (2007), however, never states how he chose knots other than to say that the number and placement of knots would change the fit of the percentile curves (2007, p. 10). Colorado had initially used Linear Mixed Models in 2004, but after a few years switched to student growth percentiles to address the main objective of a Colorado law (HB 07-1048) to determine whether students attain adequate yearly growth. The Colorado Department of Education determined that it could not adequately address this with the Linear Mixed Model approach, thus it changed due to the following shortcomings of the Model: 29 1. ?The model fit a linear growth trajectory to each student and used that trajectory to predict future achievement. 
Longitudinal achievement of students across vertical CSAP scale is not linear but displays a negative concavity. The use of a linear trend resulted in higher predicted achievement than was likely for low achieving students? (?Colorado,? 2008, p. 7). 2. ?The percentage of students projected to be proficient was strongly correlated with current status measures and likely confounded growth of students at a given school with their initial status? (?Colorado,? 2008, p. 7). 2.4 Discussion The ability to definitively state that a student?s yearly growth in testing achievement is attributable to a particular teacher serves as one of the main goals of value-added modeling. The method to ideally determine teachers? effectiveness would be to randomly assign students to teachers, measure the achievement of those students, and then make inferences regarding the differences in effectiveness between the teachers based on the scores. Although the randomization of students assigned to teachers is ideal for measuring teacher effectiveness, it does not exist practically. As a result, efforts to offset this lack of randomization are accomplished by constructing a model to predict a student?s yearly gain in testing with data from a previous cohort of students. The model provides predictions for successive students with the theory that those students were taught by the average teacher in the population being studied. Model construction takes on added importance when determining which variables to include in addition to prior achievement that predict the yearly gain in testing achievement. Lissitz and Doran (2009) propose that variable selection should be based on the intended purpose of the study. For example, if the desire is to determine what student characteristics are associated with successful students, then the study should include background variables such as socioeconomic status, gender, or race. If the desire is to look for academic attributes associated 30 with the learning environment that can be changed for the betterment of students, then student background characteristics should not be included in a model (Lissitz & Doran, 2009, p. 27). The No Child Left Behind Act of 2001 and the next reauthorization of the Elementary and Secondary Education Act does not/will not permit the use of student background characteristics in predictive growth models for determining whether a student can be counted for making Adequate Yearly Progress. Their exclusion prevents the implication of ?different expectations for students of different sociodemographic classes? (Braun et al., 2010, p. 43). Sanders, Saxton, and Horn (1997) assert that the use of student longitudinal test data precludes including student background characteristics in their LMM (1997, p. 138). They state, ?Each child can be thought of as a blocking factor that enables the estimation of school system, school, and teacher effects free of the socio-economic confounding that historically have rendered unfair any attempt to compare districts and schools based on the inappropriate comparison of group means?(Sanders et al., 1997, p. 138). Including socio-economic factors into analysis that is not longitudinal in nature does allow the comparison of districts, schools, and teachers without bias; however, the effort to gather accurate data that is often incomplete creates other problems that cannot be overcome (Sanders et al., 1997, p. 138). 
In addition to the difficulty of obtaining accurate and complete information regarding demographic attributes of students, Ballou, Sanders, and Horn (2004) determined that including such variables in the analysis had a negligible impact on the estimates of teacher effects. Small sample sizes also present challenges for measuring teacher effectiveness. Precision of the results becomes greater for middle school teachers who may teach a greater number of students compared to an elementary school teacher who may teach a single class multiple subjects. Research performed by McCaffrey et al. (2004) consistently found large standard 31 errors such that about two-thirds of teacher effects from teacher effectiveness models are not statistically different from the mean (Braun et al., 2010, p. 45). The precision of the results may also lead to a lack of stability from year to year. McCaffrey, Sass, and Lockwood (2009) compared the teacher effectiveness results from two successive cohorts of students in four counties of Florida elementary and middle schools and found low correlations of teacher effectiveness between the two years (Braun et al., 2010, pp. 45?46). 2.4.1 Measures of Effective Teaching (MET) Project The Bill and Melinda Gates Foundation initiated the MET project in the Fall of 2009 to ?develop and test multiple measures of teacher Effectiveness? (?Working with Teachers,? 2010, p. 1). Similar to ?A Blueprint for Reform?, the MET project cites research declaring that teachers have the greatest impact on student learning compared to other factors controlled by school systems. Therefore, the project aims at increasing the quality of teacher effectiveness information that is presently provided to education leaders to improve teacher feedback, direct professional development, and make informed decisions regarding teacher placement and retention (?Working with Teachers,? 2010, p. 3). The MET project is led by several prominent academic institutions, nonprofit organizations, and for-profit education consultants. The project clearly states that teacher evaluation has two components: one based on student growth in standardized testing and another based on classroom-observed ?aspects of teaching? that are valid predictors of student learning (?Working with Teachers,? 2010, pp. 4?5). The two components of teacher evaluation are constructed with five measures related to teacher effectiveness: 1) Student achievement gains on assessments 2) Classroom observations and teacher reflections 32 3) Teachers? pedagogical content knowledge 4) Student perceptions of the classroom instructional environment 5) Teachers? perceptions of working conditions and instructional support at their schools (?Working with Teachers,? 2010, pp. 6?8) Stage 1 of the project consists of measuring the unique influence of individual teachers on student growth for 2009-10 with a single VAM to establish baseline values. The VAM will use three years of student testing data and control for ?student demographics and teacher characteristics (such as degrees, certification, licensing scores, tenure, district performance review ratings, years of experience, and [National Board for Professional Teaching Standards] (NBPTS) status)? (?Working with Teachers,? 2010, p. 8). Stage 2 consists of combining measures two through five ?to form a composite indicator of effective teaching? by assigning a weight to each measure based on how much each measure contributes to predicting student achievement gains (?Working with Teachers,? 2010, p. 8). 
Lastly, Stage 3 will attempt to show that the composite score of effective teaching is a stable predictor of teachers' student achievement gains. Included in Stage 3 is a process to show whether the baseline value-added measures accurately predict student achievement gains for 2010-11, a year in which students were randomly assigned to teachers, unlike in 2009-10 ("Working with Teachers," 2010, p. 9). The MET project is ambitious and comprehensive in scope. A preliminary report of the project was published in December 2010 with four general findings:

1) Teachers' value-added estimates are one of the strongest predictors of teachers' future student achievement gains ("Learning about Teaching," 2010, p. 4).
2) "Teachers with the highest value-added scores on state tests also tend to help students understand math concepts or demonstrate reading comprehension through writing" ("Learning about Teaching," 2010, p. 4).
3) "The average student knows effective teaching when he or she experiences it" ("Learning about Teaching," 2010, p. 5).
4) "Valid feedback need not be limited to test scores alone" ("Learning about Teaching," 2010, p. 5).

The ultimate goal of the MET project, as with this research, is to improve the quality of teachers by providing quality information to education leaders for decisions regarding teachers' professional development, placement, and retention.

2.4.2 A Risk-Mitigated Approach

Despite some of the concerns addressed above, quantifying teacher effectiveness using student growth has significant merit. Each approach, however, comes with some risk and should not be the sole basis of measuring teacher effectiveness. Risk takes the form of identifying teachers as effective when they are in fact ineffective, and of identifying teachers as ineffective when they are in fact effective. The amount of risk allowed in either scenario depends upon one's perspective. If one views the situation from a statistical and student perspective, where the null hypothesis is framed negatively, then the risk of identifying teachers as effective when they are ineffective should be extremely small, and the risk of identifying teachers as ineffective when they are effective would be correspondingly higher. For example, consider rejecting the null hypothesis in

$$H_0: \text{Teacher is ineffective} \quad \text{versus} \quad H_1: \text{Teacher is effective}.$$

If the teacher is, in fact, ineffective, then this is a Type I error. Alternatively, consider failing to reject the same null hypothesis. If the teacher is, in fact, effective, then this is a Type II error. Conversely, viewing the situation from the perspective of the teacher requires assuming a smaller risk of identifying teachers as ineffective when they are effective. For example, consider rejecting the null hypothesis in

$$H_0: \text{Teacher is effective} \quad \text{versus} \quad H_1: \text{Teacher is ineffective}.$$

If the teacher is, in fact, effective, then this is a Type I error. Alternatively, consider failing to reject the same null hypothesis. If the teacher is, in fact, ineffective, then this is a Type II error.
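The case for combining indicators can be illustrated with a small simulation (entirely hypothetical numbers, not this research's data): four noisy measurements of the same underlying effectiveness are thresholded one at a time and then as an average, and the combined index misclassifies fewer teachers than any single indicator.

import numpy as np

rng = np.random.default_rng(1)
n_teachers = 10_000

# Hypothetical latent effectiveness; "truly effective" means above zero.
latent = rng.normal(0, 1, n_teachers)
truly_effective = latent > 0

# Four noisy indicators of the same latent quantity.
indicators = latent[:, None] + rng.normal(0, 1.0, (n_teachers, 4))

def error_rate(scores):
    # Overall misclassification (Type I plus Type II) when thresholding at zero.
    return np.mean((scores > 0) != truly_effective)

single = [error_rate(indicators[:, j]) for j in range(4)]
combined = error_rate(indicators.mean(axis=1))
print("single-indicator error rates:", np.round(single, 3))
print("combined-index error rate:   ", round(combined, 3))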
With either a student or a teacher perspective regarding risk, a process is needed to combine different indicators of teacher effectiveness into an index that carries less overall risk (the sum of the Type I and Type II errors) than any single measurement. Ultimately, the goal is to produce a teacher effectiveness index with the smallest possible error. Although the desire is an index that provides a more nuanced view of effectiveness (i.e., "at least" four effectiveness categories compared to the binary approach discussed above), placement of teachers across the boundary between "did not meet growth" and "meets student growth" must be made with redundancy to ensure any errors are minimized ("Alabama's Race," 2010, p. 89). In the end, classroom observations by administrators serve to augment any objective result. Several techniques to harness the desired redundancy exist in the literature. Principal Component Analysis can reduce the dimensionality of highly correlated variables, allowing Cluster Analysis to assign a set of items into groups of comparable quality. As a result, Principal Component Analysis and Cluster Analysis are reviewed in the following sections to show their applicability in producing a risk-mitigated teacher effectiveness index.

2.5 Principal Component Analysis

"Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are 1) data reduction, 2) interpretation. Although p components [variables] are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of the principal components. The k principal components can then replace the initial p variables, and the original dataset, consisting of n measurements on p variables, is reduced to a dataset consisting of n measurements on k principal components. An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result" (Johnson & Wichern, 2007, p. 430).

Following the development of Johnson and Wichern (2007, pp. 431-437), let the random vector $X' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\Sigma$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, where an eigenvalue equals zero if and only if the variables are not linearly independent. Consider the linear combinations

$$Y_1 = a_1' X = a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p$$
$$Y_2 = a_2' X = a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p$$
$$\vdots$$
$$Y_p = a_p' X = a_{p1} X_1 + a_{p2} X_2 + \cdots + a_{pp} X_p$$

so that

$$\mathrm{Var}(Y_i) = a_i' \Sigma a_i, \quad i = 1, 2, \ldots, p, \qquad \mathrm{Cov}(Y_i, Y_k) = a_i' \Sigma a_k, \quad i, k = 1, 2, \ldots, p.$$

The principal components are those linear combinations $Y_1, Y_2, \ldots, Y_p$ whose variances are as large as possible, whose coefficient vectors are of unit length, and whose covariances with one another are zero. Let the random vector $X' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\Sigma$ with eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The $i$th principal component is given by

$$Y_i = e_i' X = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{ip} X_p, \quad i = 1, 2, \ldots, p,$$

with

$$\mathrm{Var}(Y_i) = e_i' \Sigma e_i = \lambda_i, \quad i = 1, 2, \ldots, p, \qquad \mathrm{Cov}(Y_i, Y_k) = e_i' \Sigma e_k = 0, \quad i \ne k.$$

The proportion of total population variance due to the $k$th principal component is

$$\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \quad k = 1, 2, \ldots, p.$$
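The mechanics of these definitions can be sketched with NumPy on a small, hypothetical covariance matrix (an illustration only, not this research's computation).

import numpy as np

# Hypothetical covariance matrix for p = 3 variables.
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 1.5],
                  [1.0, 1.5, 2.0]])

# Eigen-decomposition; eigh returns eigenvalues in ascending order for a
# symmetric matrix, so reverse to get lambda_1 >= lambda_2 >= ... >= lambda_p.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# The coefficients of the i-th principal component are the entries of e_i,
# and Var(Y_i) = lambda_i.
proportion = eigenvalues / eigenvalues.sum()
print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion of total variance:", np.round(proportion, 3))
print("first principal component coefficients:", np.round(eigenvectors[:, 0], 3))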
?If most (for instance, 80 to 90%) of the total population variance, for large p , can be attributed to the first one, two, or three components, then these components can ?replace? the original p variables without much loss of information? (Johnson & Wichern, 2007, p. 433). Principal components can be calculated similarly if the variables are standardized. Standardized variables need to be calculated if ranges of the variables are significantly different or units of measurement are dissimilar (Johnson & Wichern, 2007, p. 439). The ith principal component of the standardized variables 12[ , ,..., ]pZ Z Z Z?? with ()Cov Z ?? is: 1, 2 , ...,iiY e Z i p??? with 11 , ( ) ( ) , 1 , 2 , .. .,ik pp ii ii Y Z ik i V a r Y V a r Z p a n d e i k p?? ?? ?? ?? ?? 1 1 2 2( , ), ( , ), ..., ( , )ppe e e? ? ? are the eigenvalue-eigenvector pairs for ? , where 12 ... 0p? ? ?? ? ? ?. The proportion of standardized population variance due to kth principal component = , 1, 2,..., .k kp p? ? 37 Principal components with variables iX that have large positive or negative coefficients typically have large correlations between that variable iX and iY . Thus, both measures (coefficients and correlations) provide similar results of what they contribute to the component. Johnson and Wichern (2007) recommend, however, that both coefficients and correlations be examined to determine variable contribution to a component (2007, p. 434). The development of principal components is normally an intermediate step leading to some final analysis. Principal component analysis can be an input to cluster analysis. To provide context for teacher effectiveness, principal components can be calculated to represent the varying metrics of teacher effectiveness for each teacher. Cluster analysis can then identify teachers with similar characteristics (principal components) in the data. 2.6 Cluster Analysis Cluster analysis can be considered as an assignment of a set of items into groups so that items of the same group are of comparable quality. It is conducted without any assumptions regarding the number or structure of groups present in the data (Johnson & Wichern, 2007, p. 671). The groups are formed based on ?distances? where ?closeness? equates to being similar. Many distance measures exist in the literature to determine similarity (Euclidean, Minkowski, Canberra, etc.) (Johnson & Wichern, 2007, pp. 673?674). Due to the computationally expensive nature of examining all grouping possibilities, clustering algorithms have been developed to find good clusters without having to check all possible clustering configurations. Hierarchical Clustering Methods form groups by either ?agglomerative? or ?divisive? techniques. With agglomerative techniques the most similar objects are first grouped together followed by combining those groups that are most similar. Divisive techniques place all objects 38 in a single group with subsequent subgroups partitioned away that are farther from the other objects in another subgroup (Johnson & Wichern, 2007, pp. 680?681). An agglomerative hierarchical clustering method of J.H. Ward is based on minimizing the increase in an error sum of squares criterion (sum of squared deviations of every item in the cluster to the centroid). Each cluster begins as a single object with that object being the centroid. An iteration of the method considers every combination of merging two clusters. The merger of two clusters that produces the smallest increase in error sum of squares is completed. 
Iterations are performed until all objects are contained in a single cluster (Johnson & Wichern, 2007, pp. 692?693). Several statistics exist to aid in determining the number of clusters that naturally exist in the data. For compact or slightly elongated clusters with a preference for roughly multivariate normal clusters, the three best statistics for hierarchical clustering methods are the pseudo F statistic, pseudo 2t statistic, and the Cubic Clustering Criterion (CCC) (?SAS/STAT 9.2,? 2009, p. 245). The clustering methods produce the statistics at each step of the algorithm to evaluate the cluster solution. The pseudo F statistic measures ?the separation among all the clusters at the current level? with local maximum values indicating a good number of clusters (?SAS/STAT 9.2,? 2009, p. 1267). It equals the ratio, ( ) / ( 1) / ( )GGT P KP n K??? , with T equal to the total sum of squares, GP equal to the within group sum of squares, and K equal to the number of clusters (?SAS/STAT 9.2,? 2009, p. 1258). Essentially it is ?a ratio of the mean sum of squares between groups to the mean sum of squares within group? (Lim, Acito, & Rusetski, 2006, p. 508). The pseudo 2t statistic measures ?the separation between the two clusters most recently joined? (?SAS/STAT 9.2,? 2009, p. 1267). Good candidates for the number of clusters are ?the 39 number of clusters one greater than the level at which [a] large pseudo 2t value is displayed? (?SAS/STAT 9.2,? 2009, p. 1283). The pseudo 2t statistic is a variant of Hoetelling?s 2T in which large values provide evidence that the ?two clusters being considered should not be combined since the mean vectors of these two clusters can be regarded as different? (Schmidhammer, 2010, p. 20) . Therefore, clusters can be combined for small values of the pseudo 2t . ?The CCC is based on the assumption that a uniform distribution on a hyperrectangle will be divided into clusters shaped roughly like hypercubes? (?SAS/STAT 9.2,? 2009, p. 245). Local maximum values of the CCC suggest a good number of clusters by rejecting the null hypothesis that the data has been sampled from a uniform distribution on a hyperrectangle. The alternative is then accepted that ?the data has been sampled from a mixture of spherical multivariate normal distributions with equal variances and sampling probabilities?[and] the obtained 2R value is greater than would be expected if the sampling was from a uniform distribution? (Schmidhammer, 2010, p. 17). Nonhierarchical clustering methods are less computationally expensive than hierarchical methods as a matrix of distances between clusters do not have to be calculated. Larger datasets can be examined with nonhierarchical methods as a result. The method begins with user defined initial clusters or a set of seed points that form the centroid of the clusters. A popular nonhierarchical clustering method is the K-Means Method (MacQueen, 1967). The algorithm consists of three steps: 1. Place all items into user defined K clusters and calculate the initial centroid. Alternatively one can just specify K initial centroids. 2. Review each item and determine which centroid is closest. Assign item to the cluster of the nearest centroid. Recalculate the centroid for the gaining and losing cluster after assignment of each item. 3. Repeat Step 2 until no item switches clusters. 40 Johnson and Wichern contend that the final assignment of items is often dependent upon the initial partition or seed point of the cluster. 
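That sensitivity to the starting partition is easy to demonstrate. The sketch below runs scikit-learn's K-means implementation from two different seeds on hypothetical two-dimensional items, using a single random initialization each time; it illustrates the three-step algorithm above and is not part of this research's analysis.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Hypothetical two-dimensional items drawn around three loose centers.
items = np.vstack([rng.normal(c, 0.9, (50, 2)) for c in ([0, 0], [3, 1], [1, 4])])

for seed in (0, 1):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(items)
    # With a single random initialization, the labels (and occasionally the
    # partition itself) can differ from seed to seed.
    print(f"seed {seed}: inertia = {km.inertia_:.2f}")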
Also, clustering methods that fix the number of clusters prior to analysis can be ineffective if the data includes outliers, if the data does not support the specified K clusters, and/or if random seed points place centroids near each other during step one above (Johnson & Wichern, 2007, pp. 696,701?702). 2.7 Summary A review of the literature to evaluate the effectiveness of teachers has highlighted two general techniques. Firstly, LMMs produce estimates of the random teacher effect in the form of EBLUPs, and, secondly, LMMs produce predictions of student achievement scores in order to calculate teachers? value-added measures by comparing the students? actual achievement scores with the predicted scores. Colorado currently calculates SGPs using Quantile Regression to determine school effectiveness by comparing the medians of schools? student growth percentiles. The two general techniques to measure teacher effectiveness produce risk of not properly identifying the quality of teachers. As a result, they should not be the sole basis of measuring teacher effectiveness. Multiple objective measures of teacher effectiveness must be employed to isolate the value that a teacher brings to the classroom. A teacher?s ?specific causal impact on learning cannot be discerned only from [a] single descriptive measure? (?Colorado,? 2008, p. 10). To expand the measures of teacher effectiveness, this research proposes to calculate two additional metrics. Specifically, Quantile Regression will be used to calculate a third objective measure of teacher effectiveness by analyzing student test scores within a school district and calculating a growth percentile for each student. The median of a teacher?s aggregated student growth percentiles will be calculated and compared to other teachers? medians within the 41 population. Lastly, a fourth method to measure teacher effectiveness will be developed by combining Quantile Regression with the LMM practice of making achievement score predictions to derive teachers? value-added measures. The 0.5 quantile model generated by Quantile Regression will produce predictions of student achievement scores in order to calculate teachers? value-added measures by comparing the students? actual achievement scores with the predicted scores. In order to exploit these multiple measures of teacher effectiveness, an analytical process involving principal component and cluster analysis will be developed. The principal components of the four objective measures of teacher effectiveness become composite indicators of teacher effectiveness, unlike Stage 2 of the MET Project that augments a single measure based on student achievement scores with a composite indicator based on measures not tied to student achievement scores. The principal components then serve as the inputs to Cluster Analysis. The process will isolate the value that a teacher brings to the classroom above that of his/her peers while minimizing the risk of not properly identifying the quality of teachers. The desired outcome is a precise and stable teacher effectiveness index to augment Alabama?s formative educator evaluation system, EDUCATEAlabama. 42 Chapter 3 : Risk-Mitigated Teacher Effectiveness Index 3.1 Introduction Various methods exist to measure teacher effectiveness using student growth. Each method generates some risk of not properly identifying the quality of teachers. As a result, one method should not be the sole basis of measuring teacher effectiveness. 
An analytical process that uses multiple and objective measures of teacher effectiveness to isolate the value that a teacher brings to the classroom above that of his/her peers warrants development for the state of Alabama. The development of the Risk-Mitigated Teacher Effectiveness Index (RMTEI) consists of three phases. Phase I begins with the calculation of four teacher effectiveness metrics: Linear Mixed Model Teacher Effect, Overall Linear Mixed Model Value-Added Measure, Median Student Growth Percentile, and Overall Quantile Regression Value-Added Measure. Phase II consists of determining the principal components of the four metrics determined in Phase I and providing each teacher with a quantitative indicator of effectiveness. The principal components from Phase II serve as the inputs to Phase III, Cluster Analysis. Phase III will then illuminate teachers with similar characteristics (principal components) in the data in order to place them into effectiveness categories. Through the use of an example, Chapter 3 describes the methodology of computing the RMTEI within the three phase process. The data for this example was obtained from the Early Childhood Longitudinal Study [United States] of the Kindergarten class of 1998-1999 (?Fifth Grade Data Codebook,? 2006). Researchers of this study recorded a vast amount of data of a 43 singular class of students from kindergarten to fifth grade. The data involved every aspect of childhood education with investigation of schools, teachers, parents, and students. The nested, hierarchical nature of the data supports study at multiple levels. In order to provide an instrument to develop the RMTEI process, the data received some structuring to align with a typical school district of six elementary schools with five teachers per grade. Although teachers were appropriately linked with students in the Early Childhood Longitudinal Study, the number of students per teacher rarely resembled a classroom. However, the number of students per school presented desirable numbers for classroom analysis. Therefore, schools were regarded as the ?teachers?, and the actual schools? region were then analyzed at the ?school? level. This research finds the data restructuring appropriate for the purpose of demonstrating the RMTEI process. The analysis of Chapter 3 is truly at the school and region level while being called ?teacher? and ?school?. The literature supports the study of any two-level hierarchical data structure. Therefore, lowering the examination of the data by one level in name only does not alter the results or render them inadequate. 3.2 Phase I: Teacher Effectiveness Metrics Every year a student earns a score in a particular subject on a standardized test that is vertically scaled to allow for comparisons from year to year. The process to compute scaled scores is called equating with one such method being Item Response Theory (IRT). The scores of the Chapter 3 example are IRT scores. ?IRT uses the pattern of right, wrong, and omitted responses to the items actually administered in an assessment and the difficulty, discriminating ability, and ?guess-ability? of each item to place each child on a continuous ability scale. IRT scoring makes possible longitudinal measurement of gain in achievement over time, even though the assessments that are administered are not identical at each point. The common items 44 present?allow the scores to be placed on the same scale? (?Fifth Grade Data Codebook,? 2006, pp. 3?5, 3?6). 
Every teacher within a district has a unique teacher identification code. The student's teacher in a particular subject and year has his/her teacher identification code recorded against the student's standardized test score. The result is a row vector of data containing the student, teacher, school, and a sequence of yearly scores. Given data of this nature for all students in a district, the desire is to extract what teachers contribute to their students obtaining a year's growth in a year's time, as measured by the difference between the current year's score and the previous year's score. This difference is captured in a new variable, MathGain, to allow further analysis to focus on the current teacher. An excerpt of the example dataset illustrating this structure is shown in Table 3.1.

Table 3.1: Excerpt of Example Dataset for Phase I Metrics

This example dataset consists of 718 students nested in 33 teachers nested in four schools and will be used to calculate all of the teacher effectiveness metrics. The application of LMMs to longitudinal student testing data will provide two objective measures of teacher effectiveness: individual teacher effects (see Section 2.2.1) and teacher value-added measures obtained by comparing students' actual achievement scores with their predicted scores (see Section 2.2.2). The application of QR to longitudinal student testing data will provide two additional objective measures of teacher effectiveness. After a growth percentile is calculated for each student (see Section 2.3.4), the median of a teacher's aggregated student growth percentiles will be calculated and compared to other teachers' medians within the population. Lastly, the 0.5 quantile model generated by QR will produce predictions of student achievement scores in order to calculate teachers' value-added measures by comparing the students' actual achievement scores with the predicted scores. For all four Phase I metrics, larger values represent greater value that teachers bring to the classroom above that of their peers.

3.2.1 Linear Mixed Model Teacher Effect

The general LMM specification for an individual response follows:

$$MathGain_{ijk} = \underbrace{\beta_0 + \beta_1 X^{(1)}_{ijk} + \beta_2 X^{(2)}_{ijk} + \beta_3 T^{(1)}_{jk} + \beta_4 S^{(1)}_{k}}_{\text{fixed}} \;+\; \underbrace{u_{jk} + \varepsilon_{ijk}}_{\text{random}}$$

$MathGain_{ijk}$ represents the value of the dependent variable for student $i$ nested in teacher $j$ nested in school $k$. $\beta_0$ through $\beta_4$ represent the fixed intercept and the fixed effects of the student-level ($X^{(1)}, X^{(2)}$), teacher-level ($T^{(1)}$), and school-level ($S^{(1)}$) predictors. $u_{jk}$ is the random effect associated with the intercept for teacher $j$ nested in school $k$, and $\varepsilon_{ijk}$ represents the residual. The assumed distribution of the random effects associated with teachers nested in schools is $u_{jk} \sim N(0, \sigma^2_{teacher})$, and the assumed distribution of the residuals of student scores is $\varepsilon_{ijk} \sim N(0, \sigma^2)$. The $u_{jk}$ and $\varepsilon_{ijk}$ are assumed to be mutually independent. For this example, only student-level predictors are included in the model:

$$MathGain_{ijk} = \underbrace{\beta_0 + \beta_1 X^{(1)}_{ijk} + \beta_2 X^{(2)}_{ijk}}_{\text{fixed}} \;+\; \underbrace{u_{jk} + \varepsilon_{ijk}}_{\text{random}}$$

Sections 4.2.3 and 4.3.2 describe the process of selecting model predictors at the student, school, and teacher levels. The PROC MIXED procedure in SAS is used to calculate the solutions for the fixed and random effects of the model, with 3rd Grade Reading and Mathematics test scores predicting the MathGain obtained after completing 4th grade.
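An open-source analogue of this step can be sketched with statsmodels. This is not the PROC MIXED code used in the research; the file name and the teacher/school column names are hypothetical, while read3 and math3 mirror the predictor names that appear in the SAS output below.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data with columns:
# student, school, teacher, read3, math3, MathGain.
df = pd.read_csv("students.csv")

# Random intercept for each teacher-within-school combination; fixed effects
# for the two student-level predictors, mirroring the reduced model above.
df["teacher_in_school"] = df["school"].astype(str) + ":" + df["teacher"].astype(str)
model = smf.mixedlm("MathGain ~ read3 + math3", data=df, groups=df["teacher_in_school"])
result = model.fit()

print(result.fe_params)        # fixed-effect estimates (intercept, read3, math3)
print(result.random_effects)   # EBLUP-style predictions of each teacher's intercept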
The partial output of the solution for random effects is shown in Table 3.2.

Table 3.2: Partial Output of Solution for Random Effects

A result of modeling teacher effects as random is that the estimates of the teacher intercepts are Empirical Best Linear Unbiased Predictors (EBLUPs). EBLUPs are linear, unbiased, and have minimum variance among all linear estimators. They are also known as "shrinkage estimators" because the random teacher effects are smaller in value than they would be if teacher effects were modeled as fixed (West et al., 2007, p. 45). The estimated teacher effect is a statistical prediction of the relative value of a particular teacher and a "direct measure of the teacher deviation from the [district] mean" of the corresponding TestGain for a particular grade (Sanders et al., 1997, p. 156).

3.2.2 Linear Mixed Model Value-Added Measure

In either the general or hierarchical specification of the LMM, a final model can be developed to predict MathGain scores for all students using data from the previous cohort of students. The difference between the prediction of the student's score and the student's actual performance becomes the teacher's value-added measure for that student, and averaging the value-added measures of a teacher's students becomes his/her final value-added measure. Analyzing data from the previous cohort of students following their completion of 4th grade yields the following solution and model:

Solution for Fixed Effects
  Effect       Estimate    Standard Error    DF     t Value    Pr > |t|
  Intercept    30.2771     2.2158            40     13.66      <.0001
  read3        0.07181     0.02170           676    3.31       0.0010
  math3        -0.1922     0.02508           676    -7.66      <.0001

$$\widehat{MathGain}_{ijk} = 30.2771 + 0.07181\,(\text{3rdGrRead}_{ijk}) - 0.1922\,(\text{3rdGrMath}_{ijk})$$

Applying this model to the following cohort of students leads to predictions of MathGain, which can be compared to the students' actual performance to calculate teachers' value-added measures. Partial output of the LMM Value-Added Measure is contained in Table 3.3; each row in Table 3.3 represents a child with an LMM MathGain prediction.

Table 3.3: Partial Output of LMM Value-Added Measure

The difference between the prediction and the student's actual performance becomes the teacher's value-added measure for that student. As discussed in Section 2.2.2, averaging the value-added measures of a teacher's students becomes the teacher's final value-added measure. In Table 3.3, Teacher 11 obtained an overall LMM Value-Added Measure of 2.82. Therefore, Teacher 11 contributed growth to his/her students that was, on average, 2.82 test score units more than predicted without teacher effects.

3.2.3 Median Student Growth Percentile

Based on the discussion in Section 2.3.4, Quantile Regression cubic B-spline models are developed for percentiles 1-99 using the dataset of scaled 3rd Grade Mathematics and Reading test scores. The 3rd Grade Mathematics and Reading test scores predict the MathGain quantiles for all students. Individual Student Growth Percentiles (SGP) are then calculated by determining, for each student, the quantile prediction closest to the student's actual MathGain; the closest quantile prediction provides a reference for the student with respect to the population. For example, a student has scores of 140.86, 116.66, and 132.7 on the 3rd Grade Reading test, the 3rd Grade Mathematics test, and the 4th Grade Mathematics test, respectively. Therefore, MathGain is 16.04.
The quantile closest to the student?s actual MathGain score is the 0.31 quantile. Therefore, the student obtains a SGP of 31. Upon determining the growth percentiles for all students, the median of a teacher?s aggregated SGPs is calculated and compared to other teachers within the same population. Partial output of the Median SGP metric is contained in Table 3.4. Table 3.4: Partial Output of Median SGP Metric In Table 3.4 Teacher 11 from School 1 obtained a Median SGP of 60. Teacher 1212 from School 4 obtained a Median SGP of 36. Therefore, Teacher 11 contributed to the growth of 49 his/her students that was 10 percentile points greater than typical growth placing him/her near the top of the effectiveness ratings. Conversely, Teacher 1212 contributed to the growth of his/her students that was 14 percentile points less than typical growth placing him/her near the bottom of the effectiveness ratings. 3.2.4 Quantile Regression Value-Added Measure Similar to calculating the LMM Value-Added Measure, Quantile Regression can use a single cohort of students to develop a model and make a prediction of MathGain for all students. For example, 3rd and 4th grade data from cohort 2019 can be used to develop a model to make a 4th grade prediction using QR. The model is applied to the successive cohort (class that just finished 3rd grade, cohort 2020). The goal is to predict each student?s MathGain along the 50th growth percentile using the student?s 3rd Grade Reading and 3rd Grade Mathematics scores. When cohort 2020 finishes the 4th grade, the students? predicted MathGain can be compared against their actual mathematics gain. If a student?s actual gain is greater than the predicted gain, then the teacher?s value added measure (actual-predicted) is positive. If a student?s actual gain is less than the predicted gain, then the teacher?s value added measure (actual-predicted) is negative. Averaging the value-added measures for a teacher?s students becomes his/her final value added measure for the year. Partial output of the Overall QR Teacher Value-Added Measure is shown in Table 3.5. Table 3.5: Partial Output of Overall QR Teacher Value-Added Measure 50 Each row in Table 3.5 represents a teacher with an Overall QR Value Added Measure. For example Teacher 1212 from School 4 contributed to the growth of his/her students that was, on average, 2.79 test score units less than expected. 3.3 Phase II: Principal Component Analysis (PCA) Phase II of the process to determine the Risk-Mitigated Teacher Effectiveness Index consists of determining the principal components of the four metrics determined in Phase I. The Phase I metrics are shown for all 33 teachers in Table 3.6. 51 Table 3.6: Output of Phase I Metrics PCA needs correlated responses to be effective and will transform input variables into a smaller set of uncorrelated variables without losing much information. The literature suggests that if the transformed space accounts for 80-90% of the variance, then PCA achieved its objective of dimension reduction. Based on the construction of the four Phase I metrics, which all measure teacher effectiveness with TestGain scores as the dependent variable, the expectation is that the metrics will be highly correlated. Not only is the expectation that the metrics will be highly 52 correlated, but also that the metrics will be transformed into a single variable with almost as much information as the original four. 
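The Phase II computation itself is compact. A hedged sketch using scikit-learn rather than PROC PRINCOMP is shown below; the file name and column names are hypothetical stand-ins for the layout of Table 3.6.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical table shaped like Table 3.6: one row per teacher, one column
# per Phase I metric (the file and column names are illustrative only).
metrics = pd.read_csv("phase1_metrics.csv", index_col="teacher")
print(metrics.corr().round(3))                # analogue of Table 3.7

# Standardizing first means the components come from the correlation matrix,
# which is appropriate because the four metrics use different units.
standardized = StandardScaler().fit_transform(metrics)
pca = PCA().fit(standardized)

print(pca.explained_variance_ratio_.round(4))  # share of variance per component
pc1_scores = pd.Series(pca.transform(standardized)[:, 0],
                       index=metrics.index, name="prin1")   # one score per teacher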
In fact, this research hypothesizes that this outcome, a single dominant component that preserves nearly all of the information in the four metrics, is a general conclusion applicable to all academic datasets. In the case of the Chapter 3 example, the four Phase I metrics in Table 3.6 are highly correlated and will be projected onto a lower-dimensional surface in order to discover relationships between the effectiveness metrics. The correlation matrix of the Phase I metrics is shown in Table 3.7.

Table 3.7: Correlation Matrix of Phase I Metrics (Pearson correlation coefficient, with p-value in parentheses)

  Metric                            Overall LMM Value-Added Measure    Median SGP     Overall QR Value-Added Measure
  LMM Teacher Effect                .994 (.000)                        .954 (.000)    .946 (.000)
  Overall LMM Value-Added Measure                                      .939 (.000)    .939 (.000)
  Median SGP                                                                          .937 (.000)

As stated in Section 2.5, the different units of measurement across the four metrics listed in Table 3.6 trigger calculating the eigenvalues and eigenvectors from the correlation matrix instead of the covariance matrix. The PROC PRINCOMP procedure in SAS accounts for this and produces the principal components of the Phase I metrics (LMM Teacher Effect, Overall LMM Value-Added Measure, Median SGP, and Overall QR Value-Added Measure). The eigenvectors of the correlation matrix provide the coefficients of the newly constructed, uncorrelated components. Let the random vector $X' = [X_1, X_2, \ldots, X_p]$ have eigenvectors $e_1, e_2, \ldots, e_p$; the $i$th principal component is then

$$Y_i = e_i' X = e_{i1} X_1 + e_{i2} X_2 + \cdots + e_{ip} X_p, \quad i = 1, 2, \ldots, p.$$

The eigenvectors of the Phase I metrics are shown in Table 3.8.

Table 3.8: Eigenvectors of the Correlation Matrix

The first principal component then becomes

$$\text{Prin1} = .5051\,(\text{LMM Effect}) + .5023\,(\text{LMM Value}) + .4967\,(\text{Med SGP}) + .4958\,(\text{QR Value}),$$

and Principal Components 2, 3, and 4 are formed analogously from the second, third, and fourth eigenvectors in Table 3.8, whose coefficients have magnitudes (.4186, .5589, .3953, .5968), (.0045, .1293, .7650, .6309), and (.7547, .6470, .1086, .0046), respectively. Principal component scores for all 33 teachers are shown in Table 3.9.

Table 3.9: Output of Principal Component Scores

The objective of PCA, however, is to provide a lower-dimensional surface as the input for Phase III, Cluster Analysis. Recall from Section 2.5 that the eigenvalues of the correlation matrix provide the variances of the corresponding principal components, with Principal Component 1 having the largest eigenvalue, Principal Component 2 the second largest, and so on. As shown in Table 3.10, Principal Component 1 accounts for 96.63% of the variance within the system.

Table 3.10: Eigenvalues of the Correlation Matrix

The structure of the correlation matrix and the ability of PCA to reduce dimensionality efficiently confirm that the teacher effectiveness metrics are highly correlated. Furthermore, the scree plot in Figure 3.1 shows a distinct bend at $i = 2$.

Figure 3.1: Scree Plot

There is clearly one dominant principal component, as the remaining eigenvalues are relatively small and of about the same size. Judging by their coefficients, LMM Teacher Effect and Overall LMM Value-Added Measure are the largest contributors to the first principal component, with the two QR metrics, Median SGP and Overall QR Value-Added Measure, following closely behind.
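For standardized variables, the correlation between metric $Z_i$ and the first component has the closed form $r_{Y_1, Z_i} = e_{1i}\sqrt{\lambda_1}$, so it can also be verified empirically from the component scores. The brief sketch below continues the hypothetical objects from the previous snippet.

import numpy as np

# Empirical correlation of each standardized Phase I metric with the first
# principal component scores; this mirrors the layout of Table 3.11.
for j, name in enumerate(metrics.columns):
    r = np.corrcoef(standardized[:, j], pc1_scores)[0, 1]
    print(f"{name}: correlation with Prin1 = {r:.3f}")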
LMM Teacher Effect and Overall LMM Value-Added Measure also share the largest correlations with the first principal component, as displayed in Table 3.11.

Table 3.11: Correlation of Phase I Metrics with Principal Component 1

The correlations of the metrics with the first principal component "measure only the univariate contribution of an individual [metric] to [the] component" (Johnson & Wichern, 2007, p. 433); they do not capture the importance of a metric to the principal component in the presence of the other metrics. Johnson and Wichern (2007, p. 434) contend, however, that using coefficients, which are a multivariate evaluation of importance to the principal component, or correlations tends to yield similar results when evaluating the contributions of the metrics to the principal components. As a result, the first principal component is a weighted sum of all Phase I metrics. This dominant principal component represents the axis of greatest variability in a transformed space. For example, the highly correlated data provided by the four Phase I metrics are projected onto the two-dimensional surface in Figure 3.2.

Figure 3.2: Subject-specific RMTEI Value along Dominant Principal Component Axis (scatterplot of teachers by their Prin1 and Prin2 scores)

Principal Component 1 is dominant and proceeds along the long axis of the data cloud in the transformed space. Effectively, one can represent all of the Phase I metrics, and the majority of the variance within the system, with a lone principal component. Therefore, teachers' principal component scores along this dominant principal component become their subject-specific RMTEI values. If teachers instruct both mathematics and reading, then an overall teacher effectiveness index is obtained by taking the mean of their two subject indexes; otherwise, the single subject index becomes the teacher's overall value.

3.4 Phase III: Cluster Analysis

The purpose of Phase III, Cluster Analysis, is to illuminate teachers with similar characteristics (principal components) in the data in order to place them into effectiveness categories. Figure 3.2, a simple scatterplot of teachers in two dimensions, provides the starting point for extracting clusters of teachers with similar attributes. The scatterplot suggests that the clusters could be compact and elliptical, with the long axis of an ellipse in the direction of principal component 1 due to its greater range of values compared to principal component 2. No assumption is made, however, regarding the number of clusters to form. Therefore, hierarchical clustering techniques will be analyzed in an effort to derive the appropriate number of effectiveness categories to retain. Ward's clustering method "is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped" (Johnson & Wichern, 2007, p. 693). It is a hierarchical, agglomerative clustering method, as described in Section 2.6, based on minimizing the increase in an error sum of squares criterion (the sum of squared deviations of every item in a cluster from its centroid). The scatterplot in Figure 3.2 suggests elliptical clusters. Also, a feature of the RMTEI process, as described in Section 3.3, is its ability to produce a dominant principal component that will continually lead to elliptically shaped clusters in two dimensions.
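Ward's agglomerative procedure on the two component scores is available in standard libraries. The sketch below uses SciPy on a hypothetical array of teacher scores; note that SciPy does not produce the pseudo F, pseudo t-squared, or CCC statistics that the SAS implementation reports.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (n_teachers x 2) array of principal component 1 and 2 scores,
# one row per teacher.
pc_scores = np.loadtxt("pc_scores.txt")

# Agglomerative clustering with Ward's minimum-increase-in-ESS criterion.
Z = linkage(pc_scores, method="ward")

# Cut the tree into K clusters and attach a label to each teacher
# (K = 5 is the value supported by the statistics discussed next).
labels = fcluster(Z, t=5, criterion="maxclust")
print(labels[:10])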
Given the expectation of elliptically shaped clusters, Ward's Method should be employed as a general prescription of the RMTEI process. After execution, Ward's Method generated the pseudo $F$ statistic, the pseudo $t^2$ statistic, and the Cubic Clustering Criterion (CCC) in accordance with Section 2.6. The evidence directs fixing the number of clusters at $K = 5$: in Figure 3.3, local peaks of the CCC and pseudo $F$ statistic are desired, as well as the point (number of clusters) just prior to a large jump in value when viewing the pseudo $t^2$ plot from right to left.

Figure 3.3: Ward's Clustering Method Statistics for Determining Number of Clusters

Ward's Method produced the Effectiveness Clusters in Figure 3.4 for $K = 5$ with principal components 1 and 2 as the variables.

Figure 3.4: Ward's Clustering of Teachers by Effectiveness (teachers plotted by Prin1 and Prin2 and labeled by their five cluster assignments)

Having illuminated teachers with similar principal components in the data, the development of the Risk-Mitigated Teacher Effectiveness Index concludes with determining whether Ward's Method produced clusters with statistically dissimilar characteristics. Multivariate Analysis of Variance (MANOVA) confirms that there is a statistical difference between the clusters at the $\alpha = 0.05$ level. The null hypothesis of equal mean vectors for all clusters,

$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5,$$

is rejected, where the mean vector for cluster $l$ consists of the mean principal component 1 and 2 scores $(\bar{X}_{l1}, \bar{X}_{l2})$ for that cluster, $l = 1, 2, \ldots, 5$. The alternative is accepted, which states that at least one cluster has a statistically different mean vector. Conducting Analysis of Variance (ANOVA) on the variables Principal Component 1 and Principal Component 2 demonstrates that only Principal Component 1 is statistically different between the clusters at the $\alpha = 0.05$ level. Computing 95% Bonferroni simultaneous confidence intervals validates that each cluster is statistically different from every other cluster on Principal Component 1. For example, Cluster 1 is statistically different from Clusters 2, 3, 4, and 5; Cluster 2 is statistically different from Clusters 3, 4, and 5; Cluster 3 is statistically different from Clusters 4 and 5; and the pattern concludes with Cluster 4 being statistically different from Cluster 5. For the example dataset, Ward's clustering method produced five statistically different clusters of teachers when considering their principal component 1 scores. In Table 3.12, for example, teachers 2092 and 966 had comparable values for the Phase I metrics and demonstrated extraordinary gains in student growth. Likewise, in Table 3.13, teachers 430 and 7393 had comparable values and demonstrated poor gains in student growth.

Table 3.12: Sample of Teachers with Extraordinary Gains in Student Growth

Table 3.13: Sample of Teachers with Poor Gains in Student Growth

3.4.1 Comparison of Clustering Results with Phase I Metrics

Figure 3.4 captures the Effectiveness Clusters produced by Ward's Method for $K = 5$ with principal components 1 and 2 as the variables. A natural question emerges: do these clusters provide better information than clusters formed using only a single Phase I metric? The answer is yes. Ward's clustering method produced the results contained in Figure 3.5, in which clustering was performed separately for each of the Phase I metrics and for the inputs from PCA (principal components 1 and 2).
[Figure 3.5: Comparison of Clustering Results with Phase I Metrics. A panel is shown for each teacher (panel variable: Teacher); within each panel the cluster assignment (1 through 5) is plotted for each clustering variable: LMM Effect, Overall LMM VAM, Median SGP, Overall QR VAM, and Prin 1 and Prin 2.]

The data highlight two thoughts. Firstly, the clusters are similar across the variables. This is expected due to the correlational structure presented in Table 3.7. Secondly, the clustering results using principal components 1 and 2 successfully represent the collection of the metrics' results, with no one metric being identical to the clustering results using the input from PCA. Therefore, the final clustering results are not based on a single measure, and the risk associated with crafting accurate effectiveness categories has been mitigated.

3.5 Summary

Principal component scores for all 33 teachers are shown in Table 3.9. As shown in Table 3.10, PCA 1 accounts for 96.36% of the variance within the system. Teachers' PCA 1 scores are their Mathematics RMTEI values. If teachers also taught reading, then an overall teacher effectiveness index would be obtained by taking the mean of their two subject indexes. Otherwise, the Mathematics RMTEI values are the teachers' overall values. Ward's clustering method produced five statistically different clusters of teachers while considering teachers' principal component 1 scores. Confidence in crafting clusters of teachers with similar attributes is paramount. Therefore, future work must consider the results of Ward's Method in determining the assignment of teachers to teacher effectiveness categories. The Alabama State Department of Education desires to place teachers into "at least" four categories following the development of an objective measure of teacher effectiveness based on student growth that augments its presently used formative assessment, EDUCATEAlabama ("Alabama's Race," 2010, p. 89). Based on Ward's clustering method statistics with the example dataset, the evidence suggests fixing the number of clusters at five. A modification of the initially proposed categories in Section 1.4 complements the clusters found in Figure 3.4:
1. Extraordinary gains in student growth.
2. Significant gains in student growth.
3. Meets student growth.
4. Did not meet student growth.
5. Poor gains in student growth.

The Phase I metrics presented in this chapter are normative. The LMM Teacher Effect is a measure of the teacher deviation from the district mean. The LMM Value-Added Measure is based on the difference between actual scores and predicted scores had students been taught by the average teacher in the district.
The QR Value-Added Measure is based on the difference between actual scores and typical gains within a district. The Median SGP is based on determining, for each student in a district, the percentage of other students that had less gain in testing achievement. Since everything is normative, how can one be sure, for example, that teachers at the bottom compared to their peers have poor gains such that the pursuit of college and workplace readiness has been negatively impacted? Regrettably, no assurances regarding adequate progress toward college and workplace readiness can be established following the RMTEI process. The only statement that can be made is that teacher gains are relative to their peers within the evaluated population. As stated in Section 3.1, the results of Chapter 3 are truly measuring school effectiveness while being called "teacher effectiveness." The literature supports the study of any two-level hierarchical data structure. Therefore, the practicality of labeling the data in this manner provided an instrument to showcase the methodology of the RMTEI. In order to place the methodology of Chapter 3 into practice, longitudinal student and teacher data will be examined in Chapter 4. Success depends on the following:
1) Develop techniques with Alabama's database infrastructure to streamline the process of establishing longitudinal data from existing yearly data
2) Develop techniques with Alabama's database infrastructure to streamline the process of linking student achievement data with teachers

Further analysis will be undertaken in Chapter 4 to confirm the desired outcome of a precise and stable teacher effectiveness index with a discussion of the results. Chapter 5 provides an assessment of Alabama's educator evaluation system, and then proposes an observational, summative assessment for Alabama that is a predictor of student achievement gains and correlated with the Risk-Mitigated Teacher Effectiveness Index. Lastly, Chapter 6 concludes the dissertation and provides recommendations for future study.

Chapter 4: Data Analysis

4.1 Introduction

Implementation of the methodology discussed in Chapter 3 began with the acquisition of student achievement and teacher data. A significant amount of energy was put forth to obtain quality student and teacher data from the Alabama State Department of Education (ALSDE) and Alabama Local Education Agencies (LEA). This consisted of extensive written and verbal communication with individuals from these organizations who have the authority and ability to provide the data. Despite receiving approval from the ALSDE and several LEAs at the beginning of the process, action to provide the data rarely took place despite repeated requests at all levels of the organizational structure. In the end, only a single district provided the requisite data to fully implement the RMTEI process. The district is classified as medium-sized with a student population of approximately 4,000 students ("System Profile," 2009). It is situated in a suburban area with approximately 50% of its students eligible for free/reduced lunch ("System Profile," 2009). Examining only this single Alabama school district would limit the applicability of this research due to the lack of a large, urban school district. A large, urban school district would provide more breadth in the analysis by presenting an additional learning environment to the RMTEI process.
In order to remedy this limitation, data was obtained from the National Center for Education Statistics (NCES), which "is the primary federal entity for collecting and analyzing data related to education" ("National Center for Education Statistics," n.d.). The NCES falls under the Institute of Education Sciences, which "is the research arm of the U.S. Department of Education" ("Institute of Education Sciences," n.d., para. 1). The NCES data consisted of student achievement and teacher data from 418 elementary schools (Kindergarten-5th Grade) in 17 urban districts from across the United States. "Although the districts selected for the study did not form a statistically representative sample of the nation, they were drawn from 13 states with a variety of regulatory, administrative, and demographic contexts" (Glazerman et al., 2010, p. 3). The data began with testing in 2005 and concluded in 2008. The data's original use consisted of evaluating the benefits of comprehensive teacher induction training for beginning teachers compared to the existing, less intensive induction services provided by the district (Glazerman et al., 2010, p. vi). The benefits of comprehensive teacher induction could be positive impacts on classroom practices, a positive impact on teacher retention, and a statistically significant impact on student achievement (Glazerman et al., 2010, p. viii). Observational evaluations were conducted in 2006 by trained observers to score the effective teaching practices of both teacher groups (Glazerman et al., 2010, p. xiv). These observational evaluations will be discussed and compared to the teacher effectiveness outcomes obtained with the RMTEI in Chapter 5.

4.2 Alabama Data Analysis

Upon acquiring five years of student and teacher data from an Alabama district, methods were developed to accomplish two of this research's sub-objectives:
1) Develop techniques with Alabama's database infrastructure to streamline the process of establishing longitudinal data from existing yearly data
2) Develop techniques with Alabama's database infrastructure to streamline the process of linking student achievement data with teachers

As stated in Section 1.4, Alabama claims to have the ability to link student data with teachers, yet it does not provide the linked testing data to the districts or provide analysis as a result of the linkage. Districts only receive yearly student achievement data via distributed compact disks from the Alabama State Department of Education. The lack of effective data extraction methods limits the ability of districts to analyze testing data. This research developed procedures to create longitudinal student data linked with current year teachers. For example, analysis for 2009 required merging student scaled scores from the 2006-07, 2007-08, and 2008-09 Alabama Reading and Mathematics Test (ARMT) with the students' respective mathematics and reading teachers for the 2008-09 school year. Much of this pre-analysis work consisted of extracting students and their teachers from a schedule dataset using the desired grade and course (mathematics or reading). Care was taken, especially in older grades, to find the appropriate courses/teachers that would ultimately be linked with the reading and mathematics student achievement data of the current year. See Appendix 1 for the SAS code and its description that accomplishes these two research sub-objectives. Similar to the example in Chapter 3, every teacher and student within the district had a unique identification code.
Teachers were already linked to students in the schedule dataset. Student achievement files consisted of a matrix with column vectors of Student Identification Number, grade, school, and a single year of ARMT Reading and Mathematics scores. The student identification number from the student achievement files provides the linkage to teachers/students in the schedule dataset. Upon merging the files, the desire is to extract what teachers contribute to their students obtaining a year's growth in a year's time, as measured by the difference between the current year score and the previous year's score. This difference is captured in the new variables, ReadGain and MathGain, to allow further analysis to focus on the current teacher. An excerpt from the 6th grade, 2009 dataset that illustrates the structure described above is contained in Table 4.1.

Table 4.1: Excerpt from 6th Grade, 2009 Dataset

4.2.1 Alabama Reading and Mathematics Test

The ARMT is a criterion-referenced test administered to students in grades 3-8. It is built using portions of the Stanford Achievement Test (Stanford 10) and additionally developed subtests. This combination ensures that all state-level content standards are appropriately tested. A student must take the items in Table 4.2 to receive ARMT reading and mathematics scores.

Table 4.2: Components of Alabama Reading and Mathematics Test
ARMT Reading: Stanford 10 Word Study Skills (Grade 3); Stanford 10 Reading Vocabulary (Grades 3-8); Stanford 10 Reading Comprehension (Grades 3-8); ARMT Part 2 Reading Subtest
ARMT Mathematics: Stanford 10 Mathematics Procedures (Grades 3-8); Stanford 10 Mathematics Problem Solving (Grades 3-8); ARMT Part 2 Mathematics Subtest

In addition to providing the points possible, points earned, percent correct, and achievement levels, the ARMT provides scaled scores that can be used to determine the amount of progress that is earned between years. These scaled scores are a focus of this research and contribute to studying change in performance over time, which is one of the primary purposes of the ARMT. Additionally, ARMT results for grades 3-8 are used to meet No Child Left Behind legislation requirements with Adequate Yearly Progress determinations ("Alabama Reading and Mathematics Test: Interpreting the Student Report," 2009, p. I-1; Pugh, 2008).

4.2.2 Testing Histories

To prepare for calculating the Phase I metrics, mutually exclusive groups of students, each with a different testing history, were created in order to compare similar students. The students of a typical school district have different testing histories based on their grade and multiple, personal circumstances. Students in grades 3-5 have fewer testing opportunities; thus, they have immature testing histories. Although students in grades 6-8 should have robust testing histories, there are many reasons why that is not always the case. For example, new students from outside of the State and absences during testing contribute to a lack of data for students. Even new students from outside of the district, but within the State, contribute to the issue, as there is not a systemic process at the district level to electronically capture the testing data from the previous district. Testing histories consist of consecutive, yearly mathematics and reading scores on the ARMT, with the minimum testing history being two consecutive years.
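The linking and gain-score construction described at the start of Section 4.2 was implemented with the SAS code in Appendix 1. Purely as an illustration of that step, a minimal pandas sketch follows; the file layouts and column names (student_id, armt_math, armt_read, and the teacher columns in the schedule dataset) are assumptions rather than the district's actual field names.

import pandas as pd

def build_linked_gains(schedule: pd.DataFrame,
                       scores_by_year: dict[int, pd.DataFrame],
                       year: int) -> pd.DataFrame:
    """Link current-year teachers to longitudinal ARMT scores and compute gains.

    schedule: one row per student for `year`, with columns such as
              student_id, grade, math_teacher_id, read_teacher_id.
    scores_by_year: {yyyy: DataFrame with student_id, armt_math, armt_read}.
    """
    out = schedule.copy()
    for y in (year - 1, year):
        s = scores_by_year[y].rename(
            columns={"armt_math": f"math_{y}", "armt_read": f"read_{y}"})
        out = out.merge(s, on="student_id", how="inner")   # keep students tested in both years
    # A year's growth in a year's time, attributed to the current-year teacher
    out["MathGain"] = out[f"math_{year}"] - out[f"math_{year - 1}"]
    out["ReadGain"] = out[f"read_{year}"] - out[f"read_{year - 1}"]
    return out

Extending the loop to earlier years would produce the longer testing histories discussed next.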
For example, analysis in 2009 creates two testing histories for grades 5-8 (4th grade has a single testing history of Mathematics and Reading scores in 2008 and 2009):
1) Mathematics and Reading scores in 2007, 2008, and 2009
2) Mathematics and Reading scores in 2008 and 2009

Linear Mixed Model Phase I metrics were calculated within the testing history groups and then placed together in order to calculate an overall value for each metric. In the case of the Linear Mixed Model Teacher Effect, teachers received an effect for the testing histories in which they were present. A weighted sum for each teacher was then calculated based on the number of students in each testing history group. Secondly, Linear Mixed Model Value-Added Measures were calculated for each student within a testing history group. The testing history groups were placed back together with their new value-added measures. The mean was then calculated for each teacher to determine the final Value-Added Measure. Student Growth Percentiles, on the other hand, were calculated for each student within a single testing history group of two years. This ensured one large testing history group to create 99 quantiles within the data. The median was calculated for each teacher to determine the Median Student Growth Percentile metric. When multiple testing history groups were considered, the medium-sized district would produce small testing history groups that did not accommodate the number of parameters to be estimated. For example, analysis of 6th grade in 2011 resulted in a student testing history group of 17 students that had mathematics and reading scores in 2009, 2010, and 2011. This testing history group encountered errors with all previous year scores in the predicted subject and the prior year score in the non-predicted subject as model predictors (i.e., MathGain would have Mathematics scores in 2009 and 2010 and Reading scores in 2010 as predictors); see Section 4.2.3 for discussion of model predictors. Quantile Regression attempted to calculate seven parameters (a_0, a_1, a_2, a_3 and b_1, b_2, b_3) for each of the three independent variables in order to calculate the independent variable's component of the QR prediction:

\hat{y}_{ij} = \sum_{k=0}^{3} a_k x_{ij}^{k} + \sum_{k=1}^{3} b_k \left(x_{ij} - t_k\right)_{+}^{3}, \qquad \left(x_{ij} - t_k\right)_{+} = \begin{cases} x_{ij} - t_k, & x_{ij} > t_k \\ 0, & x_{ij} \le t_k \end{cases}

where 3 is the spline degree, 3 is the total number of knots, and t_k is the knot location.

In this example the QR prediction for each student would be the sum of the three independent variable components plus an intercept. The number of parameters was greater than the number of observations; thus, the model lacked the requisite degrees of freedom and did not provide a QR prediction for each student. As a result, individual SGPs could not be calculated to allow for the calculation of teachers' Median SGP. Lastly, Quantile Regression Value-Added Measures were also calculated for each student within the minimum, two-year testing history. The mean was calculated for each teacher to determine the final Quantile Regression Value-Added Measure.

4.2.3 Student, Teacher, and School Level Predictors for Alabama Data

Any future effort to implement the RMTEI process at the district level should undergo an evaluative procedure to assess the adequacy of model predictors. This research examined predictors at all levels (student, teacher, and school) in an attempt to improve the predictive capability of the models used to generate the four Phase I metrics.
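Before examining candidate predictors, the Median SGP calculation of Section 4.2.2 can be illustrated with a short sketch. The sketch below uses Python's statsmodels rather than the software used in this research, and for brevity it uses plain linear terms in the prior-year scores instead of the spline basis described above; the column names (MathGain, math_prev, read_prev, math_teacher_id) are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

def median_sgp(df: pd.DataFrame) -> pd.Series:
    """Median Student Growth Percentile per teacher (illustrative only).

    A quantile regression is fit at each percentile 1..99; a student's SGP is
    taken as the count of percentiles whose predicted gain lies below the
    observed gain (so quantile crossing is ignored in this sketch).
    """
    preds = {}
    for q in range(1, 100):
        res = smf.quantreg("MathGain ~ math_prev + read_prev", df).fit(q=q / 100)
        preds[q] = res.predict(df)
    pred_mat = pd.DataFrame(preds)                                    # n_students x 99
    sgp = pred_mat.lt(df["MathGain"], axis=0).sum(axis=1).clip(lower=1)
    return df.assign(sgp=sgp).groupby("math_teacher_id")["sgp"].median()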
The process began with using a student's entire testing history to predict MathGain and ReadGain. For example, a testing history in 2009 consists of Mathematics and Reading scores in 2007, 2008, and 2009. Therefore, ARMT Reading scores from 2007 and 2008 and ARMT Mathematics scores from 2007 and 2008 predict ReadGain and MathGain in separate models. This method of analysis produced similar results for all grades and years, in which the scores of all previous years in the predicted subject were significant at α = .05, and only the prior year score in the non-predicted subject was statistically significant. For example, only the prior year reading score was significant for MathGain in addition to all previous math scores. Adding predictors at the teacher level led to varying results. Teacher data provided by the districts included teaching experience expressed in months teaching in district, state (not including the district), public (outside the district or state), and private systems. It also included the teacher's highest degree obtained, expressed as a bachelor's, master's, greater than 6 years (working toward a doctorate), or doctorate degree. In many instances, teaching experience and highest degree obtained were not statistically significant. Adding predictors at the school level also led to varying results. This research applied the methods of the Texas Projection Measure, discussed in Section 2.2.3, by considering the campus mean of the predicted subject as a predictor of student achievement. In many instances, MathMean and ReadMean for a given campus were not significant predictors of MathGain and ReadGain, respectively. Secondly, this research obtained the Reduced/Free lunch percentages of the different campuses within the two districts and considered them as predictors of student achievement. Obtaining results similar to using the campus mean of the predicted subject as a predictor, schools' Reduced/Free lunch percentages were not significant predictors of student achievement. Output follows for the solution of fixed effects from 4th grade, 2011, where the predictor myrs_employ is a math teacher's experience expressed in years. Included are student, teacher, and school level predictors as described above. All of the teacher and school level predictors for the minimum testing history group failed to reject the null hypotheses H0: β3 = 0, H0: β4 = 0, H0: β5 = 0, and H0: β6 = 0 at the α = .05 significance level:

Solution for Fixed Effects
Effect                   Estimate    Standard Error   DF     t Value   Pr > |t|
Intercept                784.97      422.11           9      1.86      0.0959
ARMT_Math_2010           -0.4637     0.04811          256    -9.64     <.0001
ARMT_Read_2010           0.2700      0.05393          256    5.01      <.0001
Percent_Free_Reduced     8.7894      28.4892          256    0.31      0.7579
MathMean                 -1.0404     0.6834           256    -1.52     0.1291
myrs_employ              0.2164      0.2985           256    0.73      0.4691
mhighest_degree 6        -0.6414     11.5877          256    -0.06     0.9559
mhighest_degree B        0.2089      5.5794           256    0.04      0.9702
mhighest_degree M        0           .                .      .         .

The result of examining model predictors for the Alabama district led to an approach similar to the Chapter 3 example, in which only student-level predictors were included in the models. The only difference, however, is that multiple years of test scores in the predicted subject were included for LMMs that utilized testing histories, as discussed in Section 4.2.2. Table 4.3 summarizes the outcome, with each check mark in the original table representing the inclusion of the independent variable(s) in the corresponding model for each Phase I metric.
Table 4.3: Summary of Model Predictors for Alabama Data
LMM Teacher Effect and Overall LMM Value-Added Measure (MathGain models): Math Score (all previous years); Reading Score (previous year only)
LMM Teacher Effect and Overall LMM Value-Added Measure (ReadGain models): Reading Score (all previous years); Math Score (previous year only)
Median SGP and Overall QR Value-Added Measure (MathGain models): Math Score (previous year only); Reading Score (previous year only)
Median SGP and Overall QR Value-Added Measure (ReadGain models): Reading Score (previous year only); Math Score (previous year only)
Teacher-level and school-level predictors: not included in any model

4.3 National Center for Education Statistics Data Analysis

The data from the National Center for Education Statistics is significant in that it captures student achievement data linked to 1009 beginning teachers from 17 urban districts over 4 years. Several items of interest with regard to the data are the following:
1) the data's original use consisted of evaluating the benefits of comprehensive teacher training for beginning teachers compared to the existing, less intensive training services provided by the district;
2) each district chose one of two providers for the comprehensive training services;
3) comprehensive training services lasted one or two years;
4) schools within districts were randomly assigned to receive comprehensive training services; and
5) student achievement data are recorded as z-scores to account for the difficulty of comparing 1009 teachers across 43 different tests that use scaled scores, normal curve equivalents, percent correct, or percentile rankings (Glazerman et al., 2010, p. 33).

The student achievement data is comprised of three files (2006, 2007, 2008), each representing a cohort of students who are taught by the same beginning teachers. Each file has a student pre-test and post-test score for mathematics and reading. The pre-test score from 2006 is obtained from testing at the conclusion of the 2004-2005 school year. Students' post-test scores from 2006 are the same scores as the pre-test scores from 2007. This pattern continues through 2008; therefore, the data contain four years of student achievement scores. Student identification codes are not recorded in the student files. To account for students, the files contain teacher identification codes, which are repeated with each observation representing a single student. Teachers are linked with their students' achievement scores, unlike the methods presently employed by the Alabama State Department of Education. The nature of the data, however, precludes the ability to track students over time and create additional testing histories as discussed in Section 4.2.2. Therefore, all four Phase I metrics were calculated within a single two-year testing history for each year, as shown in Table 4.4.

Table 4.4: NCES Testing Histories
2006 analysis: Mathematics and Reading test scores from 2005 and 2006
2007 analysis: Mathematics and Reading test scores from 2006 and 2007
2008 analysis: Mathematics and Reading test scores from 2007 and 2008

"Each district was assigned to one of the two providers of treatment services, either Educational Testing Service (ETS) or New Teacher Center (NTC), based primarily on district preferences" (Glazerman et al., 2010, p. 8). Districts then received either one or two years of treatment services determined, principally, by the availability of mentors for the second year (Glazerman et al., 2010, p. 3). This research's approach was to analyze districts individually by grade. As a result, one does not encounter issues related to comparing teachers who may be receiving similar induction services but from different providers and for different lengths of time.
Furthermore, "the preference-based method of assigning districts to providers does not allow for and should not be used to make direct comparisons of one provider to the other. Observed differences in impacts between ETS and NTC districts may be due to the programs or to the set of districts each provider worked with; those effects cannot be separated" (Glazerman et al., 2010, p. 8). Analysis within a district did not concern itself with treatment status. All district teachers (treatment and control) by grade were compared with each other. However, the knowledge of teachers' treatment status precipitated answering the following research question related to the RMTEI results that will be addressed in Section 4.8: Are the RMTEI results related to whether teachers received induction services (treatment vs. control) while considering length of induction? Glazerman et al. (2010) did observe a statistically significant difference in student achievement with teachers who received two years of induction services (p. 92).

4.3.1 NCES Student Achievement Data Recorded as Z-Scores

Student achievement data from the NCES are recorded as z-scores: the mean subtracted from the test score, divided by the standard deviation. The population mean and standard deviation of a particular grade, subject, and test were approximated by a state or national norm sample depending on the administered test (Glazerman et al., 2010, p. A-13). A single z-score represents the percentage of the standard deviation away from the mean. The natural question then becomes, can a student's growth be measured with z-scores? In fact, a student's growth can be measured with z-scores. For example, a teacher can move a student to a better position by either obtaining a smaller percentage of the standard deviation away from the mean if the student's test score is less than the mean or obtaining a larger percentage of the standard deviation away from the mean if the test score is greater than the mean. The difference of the pre-test and post-test z-scores was captured in the new variables, ReadGain and MathGain, to allow analysis to focus on the current teacher who provided instruction leading to the administration of the post-test. This difference represents the movement of a student "relative to their local reference group" (Glazerman et al., 2010, p. A-14). A larger, positive movement compared to other students would be considered more growth. In order to ensure the adequacy of using z-scores with the NCES data, the teacher values obtained with the RMTEI process using scaled scores from the Alabama district were compared to the teacher values obtained with the RMTEI process after converting the scaled scores to z-scores. The z-scores were computed using the mean and standard deviation of the scaled scores by grade and year. The RMTEI results using both types of scores were nearly identical, perfectly correlated, and not significantly different. The results confirm the ability to safely complete the RMTEI process with z-scores for the NCES data.

4.3.2 Student, Teacher, and School Level Predictors for NCES Data

As with the Alabama district data, this research examined predictors with the NCES data in an attempt to improve the predictive capability of the models used to generate the four Phase I metrics. The process of selecting predictors began with using a student's pre-test in mathematics and reading to predict MathGain and ReadGain.
For example, the 2006 models started with the Pre-Read and Pre-Math scores from the spring of 2005 as predictors for ReadGain and MathGain in separate models. This method of analysis produced similar results for all grades, districts, and years, in which the pre-test scores were significant at α = .05 for both ReadGain and MathGain models. Therefore, Pre-Read and Pre-Math scores were included as predictors in all models for 2006, 2007, and 2008. Adding predictors at the teacher level led to inconsistent results. NCES teacher data included many variables that received evaluation as predictors for student reading and mathematics growth:
1) Route into teaching
2) Highest degree
3) Holds a degree in an education-related field
4) Hired after the school year began
5) Attended a competitive college
6) Not a beginning teacher
7) Held a nonteaching job for five or more years
8) Type of teaching certificate/licensure/credential currently held
9) How teacher entered the teaching profession

[Footnote 1, to item 5: Glazerman et al. (2010) provide the definition: "A 'highly selective' college or university is one that is rated as 'most competitive,' 'highly competitive,' or 'very competitive' by the 2003 edition of the Barron's Profile of American Colleges" (p. 13).]

In most instances teacher level predictors were not statistically significant. Therefore, teacher level predictors were not included in models for 2006, 2007, and 2008. Adding predictors at the school level also led to unreliable results. This research applied the methods of the Texas Projection Measure, discussed in Chapter 2, by considering the campus mean of the predicted subject as a predictor of student achievement. In many instances, MathMean and ReadMean for a given campus were not significant predictors of MathGain and ReadGain, respectively. Secondly, the NCES data contained the Reduced/Free lunch percentages of the different campuses within the districts, and this research considered them as predictors of student achievement. Obtaining results similar to using the campus mean of the predicted subject as a predictor, schools' Reduced/Free lunch percentages were not significant predictors of student achievement. Therefore, school level predictors were not included in models for 2006, 2007, and 2008. An example follows for the solution of fixed effects from a district, 4th grade, 2006, where the predictors COMPCOLLEGE and EDFIELD are categorical variables for whether the math teacher attended a competitive college (see Footnote 1) and holds a degree in an education-related field, respectively. Also included are student, teacher, and school level predictors as described above. All of the teacher and school level predictors for the minimum testing history group failed to reject the null hypotheses H0: β3 = 0, H0: β4 = 0, H0: β5 = 0, H0: β6 = 0, and H0: β7 = 0 at the α = .05 significance level:

Solution for Fixed Effects
Effect                              Estimate    Standard Error   DF     t Value   Pr > |t|
Intercept                           0.3388      0.3005           5      1.13      0.3107
PRE_MATH                            -0.5598     0.05764          157    -9.71     <.0001
PRE_READ                            0.3270      0.07742          157    4.22      <.0001
MathMean                            -0.01909    0.2165           157    -0.09     0.9298
percent_free_reduced                -0.00534    0.004399         157    -1.21     0.2265
COMPCOLLEGE 0 = No                  0.2000      0.1257           157    1.59      0.1136
COMPCOLLEGE 1 = Yes                 0           .                .      .         .
EDFIELD 0 = No                      -0.04655    0.09863          157    -0.47     0.6376
EDFIELD 1 = Yes                     0           .                .      .         .
highest_degree B                    0           .                .      .         .
The result of examining model predictors for the NCES data also led to an approach similar to the Chapter 3 example, in which only student-level predictors were included in the models. Table 4.5 summarizes the outcome, with each check mark in the original table representing the inclusion of the independent variable in the corresponding model for each Phase I metric.

Table 4.5: Summary of Model Predictors for NCES Data
Pre-Math (previous year score) and Pre-Read (previous year score): included in the MathGain and ReadGain models for all four metrics (LMM Teacher Effect, Overall LMM Value-Added Measure, Median SGP, and Overall QR Value-Added Measure)
Teacher-level and school-level predictors: not included in any model

4.4 Checking Model Assumptions for the Final Linear Mixed Models

Analysis of the final models began by examining the EBLUPs for the random teacher effects. The values yielded consistent results of failing to reject normality with no outliers. An example is shown in Figure 4.1 for the Alabama mathematics teacher effects for 4th grade, 2009.

[Figure 4.1: Distribution Analysis of Mathematics LMM Teacher Effects. Histogram and summary of the 14 teacher effects: mean = 0.0000, standard deviation = 8.8971, median = -1.7310, minimum = -12.7435, maximum = 15.9662; Anderson-Darling normality test A-squared = 0.31, p-value = 0.519.]

Secondly, analysis of the conditional studentized residuals validated the model assumptions of normality and constant variance. "The conditional studentized residual for an observation is the difference between the observed value and the predicted value, based on both the fixed and random effects in the model, divided by its estimated standard error" (West et al., 2007, p. 104). An example is shown in Figure 4.2 from the MathGain model using Alabama data from the cohort preceding 4th grade, 2009.

[Figure 4.2: Distribution Analysis of Studentized Residuals from Mathematics LMM. Distribution of the conditional studentized residuals plotted by math teacher.]

Lastly, a paired comparison of MathGain and its conditional predicted value yielded inferences about their difference in means. The 95% confidence interval for the difference in means contains zero. Therefore, one would fail to reject the null hypothesis H0: μi - μj = 0. The difference in means of the two variables is not significant.
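These checks can be mimicked outside of the software used in this research. The following is a minimal sketch, assuming the EBLUPs of the teacher effects and the conditional predicted values have already been extracted from the fitted model; the variable names are placeholders, and the studentized-residual diagnostics are omitted for brevity. The corresponding paired comparison for the Alabama data is shown in Figure 4.3 below.

import numpy as np
from scipy.stats import anderson, ttest_rel

def check_lmm_assumptions(teacher_eblups, observed_gain, conditional_pred):
    """Illustrative versions of the Section 4.4 checks.

    teacher_eblups:   EBLUPs of the random teacher effects from the fitted LMM.
    observed_gain:    MathGain (or ReadGain) for each student.
    conditional_pred: conditional predicted values (fixed + random effects).
    """
    ad = anderson(np.asarray(teacher_eblups), dist="norm")   # normality of teacher effects
    paired = ttest_rel(observed_gain, conditional_pred)      # paired comparison of means
    return {"anderson_statistic": ad.statistic,
            "anderson_critical_5pct": ad.critical_values[2], # 5% critical value
            "paired_t_pvalue": paired.pvalue}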
An example is shown in Figure 4.3 from the MathGain model using Alabama data from the cohort preceding 4th grade, 2009.

Figure 4.3: Paired Comparison of MathGain and Conditional Predicted Value

The paired comparison of MathGain and its conditional predicted value also yielded inferences about their difference in means for each teacher. The 95% confidence intervals for the differences in means for each teacher contain zero; therefore, the differences in means are not significant.

4.5 Diagnostics for the Final Quantile Regression Models

Analysis of the final model for the QR Value-Added Measure metric began by examining some diagnostic plots from the predicted 0.5 quantile of MathGain and ReadGain, "which expresses the conditional median of a response variable given predictor variables" (Hao & Naiman, 2007, p. 56). Recall that the median can be a more suitable measure of central location than the conditional mean if the conditional mean's model suffers from inadequacy due to assumption violations. Although not unexpected, the plot of predicted values versus standardized residuals in Figure 4.4 produces a double bow effect, indicating the variance of the errors is not constant (West et al., 2007, p. 131). This is also seen in Figure 4.4 when the distribution of standardized residuals is not homogeneous across mathematics teachers (West et al., 2007, p. 106). Both graphs in Figure 4.4 originate from analysis of the medium-sized Alabama district for 4th grade, 2009.

[Figure 4.4: Variance Analysis of Standardized Residuals from the 0.5 QR Mathematics Model. Predicted values versus standardized residuals, and the distribution of standardized residuals plotted by math teacher.]

One would expect the points to be symmetrically distributed around a diagonal line in a plot of observed versus predicted values for a Linear Mixed Model. Since the median is often used to indicate the center of skewed distributions, it is not unexpected for the symmetry to be absent in Figure 4.5 for the plot of MathGain versus the 0.5 quantile prediction of MathGain for 4th grade, 2009, in the Alabama district (Hao & Naiman, 2007, pp. 12-13).

Figure 4.5: Agreement Plot of MathGain versus 0.5 QR Prediction of MathGain

The plot in Figure 4.6, however, illustrates the normal nature of the standardized residuals from the 0.5 quantile prediction of MathGain for 4th grade, 2009, in the Alabama district.

Figure 4.6: Distribution Analysis of Standardized Residuals from 0.5 QR Mathematics Model

4.6 Principal Component Analysis (PCA)

After applying the methods of the previous sections, Phase I concluded with calculating the four teacher effectiveness metrics: Linear Mixed Model Teacher Effect, Overall Linear Mixed Model Value-Added Measure, Median Student Growth Percentile, and Overall Quantile Regression Value-Added Measure; see Appendix 3 for the Alabama district's results and Appendix 4 for a district from the NCES dataset. Phase II consists of determining the principal components of the four metrics determined in Phase I. Recall from Chapter 2 that the correlation matrix of the four metrics is used to extract the eigenvalue-eigenvector pairs, (λ1, e1), (λ2, e2), ..., (λp, ep), due to the dissimilar units of measurement for the Phase I metrics.
The high correlation between metrics and the ability of PCA to reduce dimensionality efficiently often led to one dominant principal component. For the Alabama district in a given grade and year, the Phase I metrics with MathGain as the dependent variable produced a single, large eigenvalue and principal component that absorbed the preponderance of the variance in the system. ReadGain, however, would clearly have a dominant eigenvalue but without the near total absorption of the variance. Table 4.6 summarizes the results for the Alabama district.

Table 4.6: Proportion of the Total Variance for First Principal Component in the Alabama District

Similarly, a district in the NCES dataset produced the results in Table 4.7.

Table 4.7: Proportion of the Total Variance for First Principal Component in a NCES District

Principal component analysis allowed this research to project the four metrics into a single dimension to obtain a teacher effectiveness index by subject. If a teacher taught mathematics and reading, then an overall teacher effectiveness index was obtained by taking the mean of the two subject indexes. Otherwise, the single subject index became the teacher's overall teacher effectiveness index. In addition to reducing the dimensionality of the Phase I metrics, PCA discovered the relationship between the effectiveness metrics. In every instance the first principal component was a weighted sum of all Phase I metrics. Based on the grade, year, and subject, the weights (coefficients) for the metrics were different; however, the prevailing theme was that each metric was highly correlated with the first principal component and contributed nearly the same proportion (compared to the other metrics) of its value to the principal component score. Lastly, PCA provided the inputs for Phase III, Cluster Analysis, in which clustering took place based on principal component scores in a single subject. Clustering was also performed based on overall teacher effectiveness index values. An example of the input for Cluster Analysis is contained in Table 4.8.

Table 4.8: Results from PCA for 4th Grade, 2011, in the Alabama District

4.7 Cluster Analysis

The purpose of Phase III, Cluster Analysis, is to accomplish two of this research's sub-objectives:
1) Confirm or modify Alabama's effectiveness rating categories based on being able to detect statistically different groups of teachers by effectiveness.
2) Report all teachers in accordance with the ratings categories for an Alabama school district using the newly developed objective effectiveness index.

As described in Section 3.4, the agglomerative hierarchical clustering method of J.H. Ward produced clusters based on the first two principal component scores (simultaneously) in a single subject. The results produced statistically different clusters when considering the cluster means of the first principal component score in a single subject. Clustering was also performed based on overall teacher effectiveness values, and the results produced statistically different clusters when considering the cluster means of the overall teacher effectiveness index value. The analysis of the medium-sized Alabama district could not replicate the result of Chapter 3, however, by producing five statistically different clusters within a specific grade, subject, and year. This can be attributed to the size of the district, with the number of teachers per grade being less than the Chapter 3 example.
In many instances grades 5-8 would only have three teachers for a single subject. In these situations, statistically different clusters did not provide useful information. Clustering based on the overall teacher effectiveness index values always produced more statistically different clusters since the pool of teachers was greater (both mathematics and reading teachers are included). The number never exceeded four statistically different clusters, however. The graphs of Figure 4.7 capture the discussion above for the Alabama district for all three years.

[Figure 4.7: Linear Relationship of the Number of Clusters and Teachers for the Alabama District. Two panels plot the number of statistically different clusters against the number of teachers, first by grade (4-8) and then by RMTEI value (Math, Read, Overall).]

An example follows of the Bonferroni confidence intervals for the difference of cluster means (Overall Index Value) for 4th Grade, 2011, in the Alabama school district:

Bonferroni (Dunn) t Tests for Overall Index Value
Alpha 0.05, Error Degrees of Freedom 11, Error Mean Square 0.11881, Critical Value of t 3.20812
Comparisons significant at the 0.05 level are indicated by ***.
Cluster Comparison   Difference Between Means   Simultaneous 95% Confidence Limits
4 - 3                3.4750                     2.1982,  4.7519 ***
4 - 1                4.6573                     3.4751,  5.8394 ***
4 - 2                6.7909                     5.5546,  8.0273 ***
3 - 4                -3.4750                    -4.7519, -2.1982 ***
3 - 1                1.1822                     0.4192,  1.9453 ***
3 - 2                3.3159                     2.4713,  4.1605 ***
1 - 4                -4.6573                    -5.8394, -3.4751 ***
1 - 3                -1.1822                    -1.9453, -0.4192 ***
1 - 2                2.1337                     1.4406,  2.8268 ***
2 - 4                -6.7909                    -8.0273, -5.5546 ***
2 - 3                -3.3159                    -4.1605, -2.4713 ***
2 - 1                -2.1337                    -2.8268, -1.4406 ***

Computing 95% Bonferroni simultaneous confidence intervals validates that each cluster is statistically different than every other cluster for the Overall Index Value. For example, Cluster 1 is statistically different than Clusters 2, 3, and 4. Cluster 2 is statistically different than Clusters 3 and 4. Lastly, Cluster 3 is statistically different than Cluster 4. The result of placing teachers in effectiveness categories using their overall effectiveness index values is shown in Figure 4.8 for 4th Grade, 2011, in the Alabama school district.

[Figure 4.8: 4th Grade, 2011, Effectiveness Categories for the Alabama District. Overall Teacher Effectiveness Index Value plotted by teacher and category (Clusters 1-4). Footnote 3: Cluster values have been modified to ensure Cluster 1 represents extraordinary gains in student growth, ..., and Cluster 4 represents poor gains.]

Due to the nature of the NCES study, districts were analyzed separately with only beginning teachers included in the study. In several instances districts had few or no beginning teachers of a specific grade, subject, and year.
Similar to the medium-sized Alabama district, the analysis of the NCES data could not replicate the result of Chapter 3 by producing five statistically different clusters within a specific grade, subject, and year. The graphs of Figure 4.9 for a district in the NCES dataset also capture the linear relationship of the number of clusters and teachers for all three years.

[Figure 4.9: Linear Relationship of Number of Clusters and Teachers for a NCES District. Two panels plot the number of statistically different clusters against the number of teachers, first by grade (3-5) and then by RMTEI value (Math, Read, Overall).]

4.7.1 Comparison of Clustering Results with Phase I Metrics

Ward's clustering method produced the results in Figure 4.10 and Figure 4.11, in which clustering was performed for each of the Phase I metrics and the input from PCA (principal components 1 and 2).

[Figure 4.10: Mathematics Clustering Results for 4th Grade, 2011, in the Alabama District. A panel is shown for each teacher; within each panel the cluster assignment (1 through 4) is plotted for each clustering variable: LMM Effect, Overall LMM VAM, Median SGP, Overall QR VAM, and Prin 1 and Prin 2.]

[Figure 4.11: Mathematics Clustering Results for 3rd Grade, 2006, in a NCES District. The panel layout matches Figure 4.10 for the district's beginning teachers.]

(Footnotes 4 and 5: Values have been modified to ensure clusters across the variables are on the same scale; i.e., Cluster 1 represents extraordinary gains in student growth, ..., and Cluster 4 represents poor gains.)

Once again a question must be asked: do the clusters provided by principal components 1 and 2 provide better information than the clusters formed using only a single Phase I metric?
The clusters remain similar across the variables, and the clustering results using principal components 1 and 2 successfully represent the collection of the metrics' results, with no one metric being identical to the clustering results using the input from PCA. Therefore, the final clustering results provide better information, are not based on a single measure, and have mitigated the risk associated with crafting accurate effectiveness categories.

4.8 Subject-Specific and Overall RMTEI Values

Five years of student and teacher data from a medium-sized Alabama district produced subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values for grades 4-8 for the 2008-09, 2009-10, and 2010-11 school years; see Appendix 3 for complete results. Four years of student and teacher data from the NCES yielded subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values for grades 3-5 for the 2005-06, 2006-07, and 2007-08 school years; see Appendix 4 for complete results from a district in the NCES dataset. An excerpt of the final results from the medium-sized Alabama district is shown in Table 4.9, with each 4th grade teacher receiving a sparkline, "small, high-resolution graphics embedded in a context of words, numbers, images," to capture the index values over time (Tufte, 2006, p. 7).

Table 4.9: Excerpt of Final Results Sorted by the 2009 Mathematics RMTEI Value

The ability to detect consistent teachers, positively or negatively, relative to their peers is readily achieved with the sparklines. A positive index value places a teacher in the top half of the population of teachers in terms of attaining student growth. The graph in Figure 4.12 allows the comparison of 4th grade mathematics teachers over the same three-year span.

Figure 4.12: Final 4th Grade Mathematics RMTEI Values

Based on Figure 4.12, the preliminary indication is that the index does not suffer from stability concerns from year to year, due to the majority of teachers having modest movement of their mathematics-specific index value over time. Nine of 14 teachers would fall into this category, which does not include teacher 5629 with a single value in 2011. This notion, as well as the precision of the index, will be explored fully in Sections 4.9 and 4.10. Lastly, NCES analysis was undertaken to determine whether there is a statistical difference between induction groups (control vs. treatment) while considering their RMTEI values and length of induction. The analysis required merging district RMTEI results by years of induction services (one or two years) for the districts' treatment group. Recall, Glazerman et al. (2010) did observe a statistically significant difference in student achievement with teachers who received two years of induction services (p. 92). The grouping of ten (10) one-year districts led only to a significance of Treatment Status (one year of comprehensive induction versus the existing, less intensive training) for mathematics teachers in the third year (p-value = .0245). Reading and Overall RMTEI values failed to reject H0: μ1 = μ2 at the α = 0.05 level for 2006, 2007, and 2008. The grouping of seven (7) two-year districts led only to a significance of Treatment Status (two years of comprehensive induction versus the existing, less intensive training) for mathematics teachers in the third year (p-value = .0084). Reading and Overall RMTEI values failed to reject H0: μ1 = μ2 at the α = 0.05 level for 2006, 2007, and 2008.
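Each of these comparisons is, at its core, a test of mean RMTEI values between treatment and control teachers within a year. Whatever exact model produced the Treatment p-values excerpted in Table 4.10 below, the following is only a rough sketch of such a test; the column names (year, treatment, math_rmtei) are assumptions, and a simple pooled two-sample t-test per year stands in for the analysis actually performed.

import pandas as pd
from scipy.stats import ttest_ind

def treatment_comparison(rmtei: pd.DataFrame, value_col: str = "math_rmtei") -> pd.Series:
    """Per-year p-values comparing treatment vs. control RMTEI values (illustrative)."""
    pvals = {}
    for year, grp in rmtei.groupby("year"):
        treated = grp.loc[grp["treatment"] == 1, value_col]
        control = grp.loc[grp["treatment"] == 0, value_col]
        pvals[year] = ttest_ind(treated, control, equal_var=True).pvalue
    return pd.Series(pvals, name=f"p-value ({value_col})")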
Considering the analysis above, this research concludes in the third year that there is a statistical difference between treatment and control teachers' performance while considering their RMTEI mathematics values for both one-year and two-year districts. As a result, this research asserts that comprehensive induction services for beginning teachers have a greater impact on student achievement in mathematics compared to reading. An excerpt of the results is contained in Table 4.10.

Table 4.10: Excerpt of Treatment Status Analysis (p-values for the Treatment source)
                        Mathematics RMTEI            Reading RMTEI                Overall RMTEI
Districts               2006     2007     2008       2006     2007     2008       2006     2007     2008
One-year districts      0.3643   0.2989   0.0245     0.8439   0.1091   0.8343     0.5240   0.1446   0.1315
Two-year districts      0.1399   0.9192   0.0084     0.8471   0.6842   0.8470     0.4822   0.7763   0.0631

4.9 Precision of the Risk-Mitigated Teacher Effectiveness Index

As stated in Section 2.4, small sample sizes present challenges for measuring teacher effectiveness. Precision of the results remains an issue, however, even for middle school teachers, who may teach a greater number of students compared to an elementary school teacher who may teach a single class multiple subjects. Lockwood, Louis, and McCaffrey (2002) investigated the precision of teacher effects from value-added models and found that variation in most scenarios caused "estimated rankings [to] be sufficiently imprecise to preclude distinguishing among all but the most extreme teachers" (McCaffrey et al., 2004, p. 107). Therefore, one of the primary objectives of this research is to overcome this lack of precision when measuring teacher effectiveness. Since teachers' Phase I metrics for a given subject, grade, and year have varying units of measurement, the Teacher Effectiveness Index values were obtained from the standardized version of those metrics by calculating the eigenvectors of the metrics' correlation matrix. The correlation matrix captures the variation of the Phase I metrics. The eigenvectors of the correlation matrix form the four distinct principal components. The variance of the principal component scores from the dominant principal component is the largest eigenvalue associated with the corresponding eigenvector. This eigenvalue absorbed the preponderance of the variation of the system, and the principal component scores from the dominant principal component became teachers' subject-specific RMTEI values. Section 4.7 demonstrated that the results of cluster analysis consistently produced statistically different clusters when considering the cluster means of the first principal component score in a single subject. Clustering also produced statistically different clusters based on the overall teacher effectiveness index values, thereby allowing teacher effectiveness to be measured in an overall and subject-specific manner. After compiling the final results for all subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values, analysis was undertaken to ascertain the precision of the RMTEI. Considering the results following a specific year in which teachers taught both mathematics and reading, teachers earned two subject-specific index values and an overall effectiveness index value for that year. An excerpt of the final results is contained in Table 4.9.
To form an analogy, the values were treated as automobile part measurements in a manufacturing process, where one could determine whether the process was in a state of statistical control through the use of a control chart. With assessments over three years, an individual teacher could consist of a subgroup of three Mathematics, Reading, or Overall RMTEI values. All teachers, acting as subgroups, then contributed to being able to determine whether the RMTEI process was in a state of statistical control. "The purpose of any control chart is to identify occurrences of special causes of variation that come from outside of the usual process," since a stable process only contains variation from common causes (Johnson & Wichern, 2007, p. 239). The analysis produced the control charts in Figure 4.13 and Figure 4.14 for 4th grade and 3rd grade mathematics teachers, respectively.

[Figure 4.13: Control Chart Analysis for Alabama 4th Grade Mathematics Teachers. Xbar-R chart of the Mathematics Effectiveness Index with teachers as subgroups: sample mean chart with center line -0.000, UCL = 1.954, LCL = -1.954; sample range chart with center line 1.910, UCL = 4.917, LCL = 0. Tests performed with unequal sample sizes.]

[Figure 4.14: Control Chart Analysis for 3rd Grade Mathematics Teachers in a NCES District. Xbar-R chart of the Mathematics Effectiveness Index with teachers as subgroups: sample mean chart with center line -0.000, UCL = 3.228, LCL = -3.228; sample range chart with center line 1.822, UCL = 4.689, LCL = 0. Tests performed with unequal sample sizes.]

The charts in Figure 4.13 and Figure 4.14 of the process mean and process range over the span of the mathematics teachers as subgroups reveal two noteworthy items. Firstly, the R chart, which measures within-subgroup variability, remains in control. Secondly, the x-bar chart, which measures between-subgroup variability, determines that several teachers fall outside of the control limits. The following paragraph from Trietsch (1999) articulates why the control charts demonstrate the adequacy of the RMTEI process for these grades, subject, and districts: "As discussed in AT&T (1958), control charts may be used to verify whether a given measurement instrument is adequate for a particular job. The main idea there is that multiple measurements of the same item are used as subgroups, and if the measurement instrument is adequate, many points in the x-bar chart should fall outside the control limits. In other words, we expect the variation between parts to be large relative to the range of multiple measurements of the same part" (p. 38).

4.10 Stability of the Risk-Mitigated Teacher Effectiveness Index

As stated in Section 2.4, McCaffrey, Sass, and Lockwood (2009) compared the teacher effectiveness results from two successive cohorts of students in four counties of Florida elementary and middle schools and found low correlations of teacher effectiveness between the two years (Braun et al., 2010, pp. 45-46).
4.10 Stability of the Risk-Mitigated Teacher Effectiveness Index

As stated in Section 2.4, McCaffrey, Sass, and Lockwood (2009) compared the teacher effectiveness results from two successive cohorts of students in four counties of Florida elementary and middle schools and found low correlations of teacher effectiveness between the two years (Braun et al., 2010, pp. 45-46). The literature therefore suggests that teacher effectiveness measurement instruments lack stability from year to year. Consequently, one of the primary objectives of this research is to overcome this lack of stability and demonstrate high correlation between the subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values for the Alabama and NCES datasets in 2009-2011 and 2006-2008, respectively.

The first three years of data, 2007-2009, were used to generate the 2009 subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values for the medium-sized Alabama school district. Years 2010 and 2011 were used to demonstrate the stability and precision of the index values. Effectiveness index values for grades 4-8 were determined for 2010 and 2011 based on 2007-2010 and 2007-2011 data, respectively. These index values were compared to the index values obtained from 2009 for the same teachers. For example, Mr. Jones taught 7th grade mathematics to cohort 2014 in 2009 and received an effectiveness index value. This value can be compared to the effectiveness index value obtained following the 2010 school year, in which he taught 7th grade mathematics to cohort 2015. Collectively, effectiveness index values were generated for all teachers in grades 4-8 for 2009, 2010, and 2011. Stability analysis is shown in Figure 4.15 and Table 4.11 for mathematics teachers in the Alabama district.

Figure 4.15: Excerpt of Final Stability Analysis for the Alabama District

Table 4.11: Correlation of Yearly Mathematics RMTEI Values for the Alabama District
(Pearson correlation coefficient, p-value in parentheses)

Mathematics RMTEI    2009      2010             2011
2009                 1.0000    .4858 (.0138)    .6317 (.0009)
2010                           1.0000           .6448 (.0005)
2011                                            1.0000

As the analysis in Table 4.11 suggests, mathematics index values remained stable, with statistically significant correlation between 2009 and 2010, 2009 and 2011, and 2010 and 2011. Reading index values remained stable with statistically significant correlation between 2009 and 2010 as well as between 2010 and 2011. Reading index values, however, did not demonstrate statistically significant correlation between 2009 and 2011. The reason for this remains elusive, although the future study discussed in Chapter 6 may determine whether the reading portion of the ARMT suffers from faulty test-forms equating. As a result, overall RMTEI values remained stable with statistically significant correlation between 2009 and 2010 as well as between 2010 and 2011, while the correlation of overall RMTEI values between 2009 and 2011 was nearly statistically significant at α = 0.05 with a p-value of .0507. Table 4.12 summarizes the results for yearly reading and overall RMTEI values.

Table 4.12: Correlation of Yearly Reading and Overall RMTEI Values for the Alabama District
(Pearson correlation coefficient, p-value in parentheses)

Reading RMTEI    2009      2010             2011
2009             1.0000    .3853 (.0268)    .0645 (.7491)
2010                       1.0000           .3769 (.0481)
2011                                        1.0000

Overall RMTEI    2009      2010             2011
2009             1.0000    .4344 (.0036)    .3236 (.0507)
2010                       1.0000           .4127 (.0081)
2011                                        1.0000
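The yearly correlations reported in Tables 4.11 through 4.14 could be computed with PROC CORR; a minimal sketch follows, assuming a hypothetical dataset RMTEI_WIDE with one row per teacher and the yearly mathematics index values stored as separate variables (names are placeholders).

/* Pearson correlations, with p-values for H0: rho = 0, between the     */
/* yearly mathematics index values; teachers missing a year are         */
/* excluded pairwise by default.                                        */
proc corr data=rmtei_wide pearson;
   var math_rmtei_2009 math_rmtei_2010 math_rmtei_2011;
run;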
In addition to testing whether the correlation values differed from zero with a null hypothesis of $H_0: \rho = 0$ (depicted in Table 4.11 and Table 4.12), this research was interested in using Fisher's z-transformation to test differences between the correlations. For example, is there a statistical difference between the 2009-2010 correlation and the 2009-2011 correlation for Mathematics RMTEI values? A desirable feature of the RMTEI process would be no statistically significant difference between correlation values. Conducting Fisher's z-transformation of the correlation values with the corresponding hypotheses, $H_0: \rho_{2009,2010} - \rho_{2009,2011} = 0$, $H_0: \rho_{2009,2010} - \rho_{2010,2011} = 0$, and $H_0: \rho_{2009,2011} - \rho_{2010,2011} = 0$, demonstrated no statistical difference between the correlation values in each set of Mathematics, Reading, and Overall RMTEI values for the Alabama district.

The first two years of the NCES data, 2005-2006, were used to generate the 2006 subject-specific and overall Risk-Mitigated Teacher Effectiveness Index values for 17 school districts. Years 2007 and 2008 were used to demonstrate the stability and precision of the index values. Effectiveness index values for grades 3-5 were determined for 2007 and 2008 based on 2006-2007 and 2007-2008 data, respectively. These index values were compared to the index values obtained from 2006 for the same teachers. Collectively, effectiveness index values were generated for all teachers in grades 3-5 for 2006, 2007, and 2008. Stability analysis is shown in Figure 4.16 and Table 4.13 for mathematics teachers from the NCES districts.

Figure 4.16: Excerpt of Final Stability Analysis for NCES Data

Table 4.13: Correlation of Yearly Mathematics RMTEI Values for NCES Data
(Pearson correlation coefficient, p-value in parentheses)

Mathematics RMTEI    2006      2007              2008
2006                 1.0000    .3564 (<.0001)    .1906 (.0825)
2007                           1.0000            .2838 (.0093)
2008                                             1.0000

As the analysis in Table 4.13 suggests, mathematics index values remained stable with statistically significant correlation between 2006 and 2007 as well as between 2007 and 2008. Reading index values likewise remained stable with statistically significant correlation between 2006 and 2007 as well as between 2007 and 2008. Neither the Reading nor the Mathematics index values, however, demonstrated statistically significant correlation between 2006 and 2008. A likely reason is the two years of comprehensive induction services received by approximately half of the treatment teachers, who generated a statistically significant difference in student achievement in the third year compared to the control group of teachers (Glazerman et al., 2010, p. 92). Overall RMTEI values, however, remained stable with statistically significant correlation between all three years. Table 4.14 summarizes the results for yearly reading and overall RMTEI values.

Table 4.14: Correlation of Yearly Reading and Overall RMTEI Values for NCES Data
(Pearson correlation coefficient, p-value in parentheses)

Reading RMTEI    2006      2007              2008
2006             1.0000    .2823 (.0016)     .1597 (.1418)
2007                       1.0000            .2503 (.0217)
2008                                         1.0000

Overall RMTEI    2006      2007              2008
2006             1.0000    .3819 (<.0001)    .2432 (.0202)
2007                       1.0000            .3144 (.0030)
2008                                         1.0000

In addition to testing whether the NCES correlation values differed from zero with a null hypothesis of $H_0: \rho = 0$ (depicted in Table 4.13 and Table 4.14), this research used Fisher's z-transformation to test differences between the correlations. Conducting Fisher's z-transformation of the correlation values with the corresponding hypotheses, $H_0: \rho_{2006,2007} - \rho_{2006,2008} = 0$, $H_0: \rho_{2006,2007} - \rho_{2007,2008} = 0$, and $H_0: \rho_{2006,2008} - \rho_{2007,2008} = 0$, demonstrated no statistical difference between the correlation values in each set of Mathematics, Reading, and Overall RMTEI values for the NCES dataset.
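As a rough illustration of the Fisher z comparison, the following SAS data step sketches the calculation under the simplifying assumption that the two correlations come from independent samples; the correlations compared above share teachers across years, so the exact procedure may differ. The sample sizes shown are placeholders, and the two correlations are taken from Table 4.11 only as example inputs.

data fisher_z_test;
   /* Example inputs: two sample correlations and assumed sample sizes. */
   r1 = 0.4858; n1 = 25;   /* e.g., 2009 vs. 2010 Mathematics RMTEI     */
   r2 = 0.6317; n2 = 25;   /* e.g., 2009 vs. 2011 Mathematics RMTEI     */

   /* Fisher's z-transformation of each correlation.                    */
   z1 = 0.5 * log((1 + r1) / (1 - r1));
   z2 = 0.5 * log((1 + r2) / (1 - r2));

   /* Approximate test of H0: rho1 - rho2 = 0; each transformed value   */
   /* has variance roughly 1/(n - 3) under the independence assumption. */
   z_stat  = (z1 - z2) / sqrt(1/(n1 - 3) + 1/(n2 - 3));
   p_value = 2 * (1 - probnorm(abs(z_stat)));
run;

proc print data=fisher_z_test;
   var z_stat p_value;
run;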
4.10.1 Comparison of Stability Results with Phase I Metrics

Correlation analysis produced the results of Figure 4.17 and Figure 4.18, in which correlation coefficients were calculated for each of the mathematics Phase I metrics and the Mathematics RMTEI values. Figure 4.17 indicates the correlation of the mathematics values between 2009 and 2010 as well as between 2009 and 2011.

Figure 4.17: Mathematics Correlation Results in the Alabama District for 2009

Figure 4.18 indicates the correlation between the mathematics values of 2010 and 2011.

Figure 4.18: Mathematics Correlation Results in the Alabama District between 2010 and 2011

The purpose of the preceding analysis is to compare the stability results of the RMTEI values with those of the Phase I metrics; are the RMTEI values more stable than a single Phase I metric? The results are mixed. The Mathematics RMTEI values attained greater correlation than the LMM metrics but fell below both QR metrics between 2009 and 2010 as well as between 2009 and 2011. The Mathematics RMTEI values attained greater correlation than all of the mathematics Phase I metrics between 2010 and 2011. Reading values also produced mixed results. The Reading RMTEI values attained greater correlation than all of the reading Phase I metrics between 2009 and 2010 as well as between 2009 and 2011. The Reading RMTEI values, however, only attained greater correlation than the LMM Teacher Effect and Median SGP between 2010 and 2011.

Figure 4.19 indicates the correlation of the mathematics values between 2006 and 2007 as well as between 2006 and 2008 for a district in the NCES dataset.

Figure 4.19: Mathematics Correlation Results for 2006 in a NCES District

Figure 4.20 indicates the correlation between the mathematics values of 2007 and 2008 for the same district in the NCES dataset.

Figure 4.20: Mathematics Correlation Results between 2007 and 2008 for a NCES District

Once again, the results are mixed. The Mathematics RMTEI values attained greater correlation than all of the Phase I metrics between 2006 and 2007, but fell below only Median SGP between 2006 and 2008. The Mathematics RMTEI values, however, only attained greater correlation than the LMM metrics between 2007 and 2008. The Reading RMTEI values attained greater correlation than all of the reading Phase I metrics between 2006 and 2007; however, the Reading RMTEI values fell below all of the Phase I metrics between 2006 and 2008 and between 2007 and 2008.

The stability results from the Alabama and NCES districts demonstrate that the RMTEI process improved the stability of teacher effectiveness measurement while not fully solving the problem of poor stability from year to year. The RMTEI process did not fully overcome the unreliable nature of the Phase I metrics when applied individually. Testing revealed, however, that the correlations between years were statistically different from zero while not being statistically different from one another.

4.11 Summary

The methodology of Chapter 3 was placed into practice in Chapter 4 by examining five years of student and teacher data from a medium-sized Alabama school district. Overall and subject-specific Risk-Mitigated Teacher Effectiveness Index values were calculated for grades 4-8 for the 2008-09, 2009-10, and 2010-11 school years.
Analysis of the NCES data yielded overall and subject-specific Risk-Mitigated Teacher Effectiveness Index values for the 2005-06, 2006-07, and 2007-08 school years. Analysis was also undertaken in Chapter 4 to confirm the desired outcome of a precise and stable teacher effectiveness index.

The NCES RMTEI values for the 2005-06 school year were calculated with a two-year process instead of the three-year process (in which two years of data generate a model from the previous cohort and the model is applied to the current, third year). Linear Mixed Model and Quantile Regression Value-Added Measures were calculated based on a model generated from the current-year cohort rather than the previous cohort. Linear Mixed Model Teacher Effect and Median Student Growth Percentile were calculated as previously discussed in Chapter 3. All four metrics utilized a single testing history of two consecutive years (2005 and 2006). The NCES RMTEI values for the 2006-07 and 2007-08 school years were calculated in the same fashion as for the medium-sized Alabama district while utilizing only a single testing history group of two years. Analysis of the NCES data, which consist of 17 urban districts from across the United States, remedied the limitation of only exploring a medium-sized district in Alabama. In addition, the NCES data overcame the Alabama data limitation of only being from a suburban area.

Recall that the Alabama State Department of Education desires to place teachers into "at least" four categories following the development of an objective measure of teacher effectiveness based on student growth that augments its presently used formative assessment, EDUCATEAlabama ("Alabama's Race," 2010, p. 89). Teachers can be assigned to more subject-specific effectiveness categories in grades with a greater number of teachers (large districts, evaluating more than one district at a time, or grades within elementary schools). Also, teachers can be assigned to more effectiveness categories using the overall effectiveness index, since the pool of teachers is greater (it includes both mathematics and reading teachers). In the end, the evidence suggests fixing the number of clusters at a maximum of four for a medium-sized district. A modification of the initially proposed categories in Section 1.4 complements the clusters found in Section 4.7:

1. Extraordinary gains in student growth.
2. Meets student growth.
3. Did not meet student growth.
4. Poor gains in student growth.

Chapter 5: Assessment of Teacher Evaluation in Alabama

5.1 Introduction

The Alabama State Department of Education (ALSDE) implemented the Professional Education Personnel Evaluation (PEPE) system for administrators and teachers in 1993 and 1997, respectively ("AL PEPE for Teachers," 1998, pp. 9-10). The ALSDE then implemented EDUCATEAlabama to replace PEPE as the educator evaluation system for typical classroom teachers prior to the 2010-2011 School Year (SY) ("EDUCATEAlabama Webinar," 2011, p. 8). School counselors, school librarians, Alabama Reading Initiative Reading Coaches, and all special educators (pre-K, psychometrists/school psychologists, speech-language pathologists, and "special educators that teach students who generally take the Alabama Alternate Assessment") will be integrated into the EDUCATEAlabama evaluation system for the 2011-2012 SY ("Alabama Professional," 2011, sec. 2).
School administrators will remain under the PEPE system until the ALSDE finalizes the development of a new administrator evaluation system, LEADAlabama, at a date to be published ("EDUCATEAlabama Webinar," 2011, p. 46). Prior to the implementation of EDUCATEAlabama, the PEPE system led school administrators to provide non-tenured teachers with annual summary scores for eight competencies. These eight competency scores were then summed to provide a composite competency score to be used for summative purposes. Tenured teachers received a composite competency score every year, two years, or three years at the discretion of the local school system ("Professional Education," 2008, p. 15).

5.2 Professional Education Personnel Evaluation (PEPE)

The PEPE system required instructional leaders to evaluate teachers' performance against eight competencies "which effective educators are known to possess" ("Professional Education," 2008, p. 1). The competencies include: Preparation for Instruction, Presentation of Organized Instruction, Assessment of Student Performance, Classroom Management, Positive Learning Climate, Communication, Professional Development and Leadership, and Performance of Professional Responsibilities ("Professional Education," 2008, pp. B-3-B-8). A four-point scale was used to score all eight competencies:

1) "Unsatisfactory - Indicates the educator's performance in this position requirement is not acceptable. Improvement activities must be undertaken immediately.
2) Needs Improvement - Indicates the educator's performance sometimes but not always meets expectations in this position requirement. Improvement activities are required for performance to consistently meet standards.
3) Area of Strength - Indicates the educator consistently meets and sometimes exceeds expectations for performance in this position requirement. Performance can be improved in the area(s) indicated, but current practices are clearly acceptable.
4) Demonstrates Excellence - Indicates the educator does an outstanding job in this position requirement. No area for improvement is readily identifiable" ("Professional Education," 2008, p. 19).

Multiple observations, an interview, and a review supported the final summary score for each competency. The eight competency scores were then summed to provide a composite competency score for each teacher.

5.3 EDUCATEAlabama

"EDUCATEAlabama is a formative system designed to provide information about an educator's current level of practice within the Alabama Continuum for Teacher Development, which is based on the Alabama Quality Teaching Standards (AQTS)" ("EDUCATEAlabama," n.d.). The Alabama Continuum for Teacher Development provides benchmarks of performance for each teaching standard along the teacher continuum: pre-service, beginning, emerging, applying, integrating, and innovating ("Alabama Continuum," 2009). The system is a means to encourage dialogue between the educator and the instructional leader. EDUCATEAlabama begins with an educator self-assessment that is completed at the beginning of the school year based on the AQTS. The educator and instructional leader then complete a Professional Learning Plan focused on a select number of areas that demand the greatest attention. Based on observations and continued dialogue throughout the school year, evidence of educator growth is recorded for the corresponding areas of the Professional Learning Plan. Instructional leaders close educators' evaluations following the conclusion of the school year and open a new evaluation prior to the beginning of the next school year.
The EDUCATEAlabama process does not generate summative evaluation data ("EDUCATEAlabama Webinar," 2011). Based on the focus of U.S. education policy on closing achievement gaps and on research demonstrating that teachers have the greatest impact on student learning compared to other factors controlled by school systems, Alabama has moved away from a measure that offered both teacher accountability and development (PEPE) and focused entirely on teacher development (EDUCATEAlabama). This dissertation research intends to provide evidence of whether NCES teacher observational data from 2006 are a stable predictor of teachers' student achievement gains and are correlated with the Risk-Mitigated Teacher Effectiveness Index. The desired outcome is evidence of quantitative, observational evaluations being a stable predictor of student achievement gains, a complement to the Risk-Mitigated Teacher Effectiveness Index, and an appropriate tool for Alabama to provide teacher accountability and development.

5.4 NCES Teacher Effectiveness Scoring Analysis

Teacher observational data obtained from 17 urban school districts in 2006 contain scoring of effectiveness practices for approximately 700 beginning teachers who received either comprehensive teacher induction or the existing, less intensive induction services provided by the district. Student achievement data from the same 17 school districts will be used to provide evidence of whether the observational effectiveness scoring is a stable predictor of teachers' student achievement gains and is correlated with the Risk-Mitigated Teacher Effectiveness Index.

Data from cohort 1, 2005-2006, were used to generate Risk-Mitigated Teacher Effectiveness Index values for all teachers with pre-test and post-test student achievement scores. The 2006 teachers who then received a classroom observation (all teachers who taught literacy, excluding English as a Second Language teachers, special education teachers, and those with one year or greater of experience) were subsequently analyzed. Teachers were not separated by grade since students' z-scores allow comparison of teachers across grades. Both treatment and control groups were included. Approximately 180 teachers met all requirements. "Observers scored teachers in each of three constructs based on a set of items that are believed to be indicators of good practice: implementation of a lesson, content of a lesson, and classroom culture" (Glazerman et al., 2010, p. 32). The three domains of good teaching practice consisted of multiple indicators that were measured on a five-point scale: (1) no evidence, (2) limited evidence, (3) moderate evidence, (4) consistent evidence, and (5) extensive evidence. Each domain then received an overall composite score consisting of the average of the domain's indicators (Glazerman et al., 2010, p. 32).

Concurrent with the development of the RMTEI values, NCES teacher effectiveness scores in 2006 were analyzed as predictors of student achievement gains in 2006. Specifically, teacher composite scores for Content of a Lesson, Classroom Culture, and Implementation of a Lesson (CSLITIMP) were examined as predictors of student reading gains. The results were clear that the observational evaluations were a strong predictor of student achievement gains in reading, with each composite score being statistically significant at the α = 0.05 level. P-values for Content of a Lesson, Classroom Culture, and Implementation of a Lesson were .0136, < .0001, and .0003, respectively.
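A minimal regression sketch of this predictor analysis follows, assuming a hypothetical dataset OBS_SCORES that links the three observational composite scores to the corresponding student reading gain measure (names are placeholders, and the exact unit of analysis used in the study is not restated here).

/* Reading gains regressed on the three observational composite scores. */
proc reg data=obs_scores;
   model reading_gain = content_of_lesson classroom_culture implementation_of_lesson;
run;
quit;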
In order to suggest a relationship between observational evaluations and an objective measure of teacher effectiveness, a requirement exists to ascertain the correlation of the NCES teacher effectiveness scores with the Risk-Mitigated Teacher Effectiveness Index computed for the 2005-2006 school year. Recall that Alabama's Race to the Top Phase II application expressed the need for correlated components of a teacher evaluation system. As a result, Figure 5.1 and Table 5.1 show a comparison of the RMTEI values for reading against the observational evaluations.

Figure 5.1: Scatter Plots of RMTEI Reading Values versus Observational Evaluations

Table 5.1: Correlation of RMTEI Reading Values and Observational Evaluations
(Pearson correlation coefficient, p-value in parentheses)

RMTEI     Content of Lesson    Classroom Culture    Implementation of Lesson
r2006     .2289 (.0019)        .3266 (<.0001)       .1239 (.0965)

The reading index values showed statistically significant correlation at α = 0.05 with Content of a Lesson and Classroom Culture from the 2006 classroom observations. Reading index values showed nearly statistically significant correlation with Implementation of a Lesson. These correlation results also support the precision of the RMTEI values.

As a final investigation for the group of literacy teachers who received observational evaluations in 2006, analysis was undertaken to determine whether there is a statistical difference between induction groups (control vs. treatment) when considering their RMTEI values. Glazerman et al. did not observe statistically significant differences between treatment and control teachers' performance on observational scoring (2010, p. xxxi). Years of induction services (one or two years) did not merit consideration for the treatment group since only one year had transpired when the observational evaluations took place. This research similarly concludes that there is not a statistical difference between treatment and control teachers' performance when considering their RMTEI reading values. Conducting an ANOVA to test whether teachers' treatment status affects the RMTEI reading values demonstrated that one fails to reject $H_0: \mu_1 = \mu_2$ at the α = 0.05 level.
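The treatment-status comparison could be carried out with a one-way ANOVA; a minimal sketch follows, assuming a hypothetical dataset RMTEI_TEACHERS with one row per teacher containing a TREATMENT_STATUS classification variable and the reading index value READING_RMTEI (names are placeholders).

/* One-way ANOVA testing whether induction group (treatment vs.         */
/* control) is associated with differences in mean reading RMTEI.       */
proc glm data=rmtei_teachers;
   class treatment_status;
   model reading_rmtei = treatment_status;
run;
quit;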
5.5 Summary

The Alabama State Department of Education has expended significant resources developing EDUCATEAlabama as the next evolution of teacher evaluation. This process provides an approach for instructional leaders to communicate with educators, to chart a path for development, and to make a final assessment of the established goals. There is, however, no accountability mechanism in place with this formative assessment that is directly tied to student growth. Alabama is heading in the opposite direction of the strategic framework established by the U.S. Department of Education in "A Blueprint for Reform". According to Alabama's Race to the Top Phase II application, the ultimate desire is correlated components of a teacher evaluation system tied to student growth that are able to place teachers into "at least" four effectiveness categories ("Alabama's Race," 2010, p. 89). The Risk-Mitigated Teacher Effectiveness Index, along with the observational assessment system in the Teacher Induction Study, provides the Alabama State Department of Education with an approach to meet its stated need.

Chapter 6: Research Summary

6.1 Conclusion

Alabama's Race to the Top Phase II application clearly stated a desire to research and develop a process to measure teacher effectiveness based on student growth with Race to the Top funds. Alabama came in last place out of the 36 states that submitted Phase II applications and thus was not granted funds by the U.S. Department of Education. Dr. Morton, the former State Superintendent of Education, stated that despite knowing Alabama's application would not be competitive in Phase II, it would serve as a foundation for needed reforms. As a result, the primary objective of this research was to fill the need of augmenting Alabama's formative educator evaluation system, EDUCATEAlabama, with a precise and stable teacher effectiveness index based on student growth. This was accomplished.

A review of the literature on evaluating the effectiveness of teachers highlighted two general techniques based on linear mixed models. Firstly, LMMs produce estimates of the random teacher effect in the form of EBLUPs. Secondly, LMMs produce predictions of student achievement scores in order to calculate teachers' value-added measures by comparing the students' actual achievement scores with the predicted scores. Colorado calculated SGPs using Quantile Regression to determine school effectiveness by comparing the medians of schools' student growth percentiles. This research used Quantile Regression to determine a third objective measure of teacher effectiveness by calculating a growth percentile for each student; the median of a teacher's aggregated student growth percentiles was calculated and compared to other teachers' medians within the population. Lastly, a fourth method to measure teacher effectiveness was developed by combining Quantile Regression with the LMM practice of calculating achievement score predictions to derive teachers' value-added measures. The 0.5 quantile model generated by Quantile Regression produced predictions of student achievement scores in order to calculate teachers' value-added measures by comparing the students' actual achievement scores with the predicted scores.

The development of the Risk-Mitigated Teacher Effectiveness Index comprised three phases. Phase I entailed calculating the four teacher effectiveness metrics described above:

1) Linear Mixed Model Teacher Effect - a statistical prediction of the relative value of a particular teacher, measuring the teacher's deviation from the district mean; greater is better.
2) Linear Mixed Model Value-Added Measure - a teacher's average of the difference between students' actual achievement and the achievement predicted had they been taught by the average teacher in the district; greater is better (expressed formally after this list).
3) Median Student Growth Percentile - an indicator of student growth associated with each teacher, calculated as the median of a teacher's student growth percentiles. A student's growth percentile is obtained by determining what percentage of other students had less growth in testing achievement; greater is better.
4) Quantile Regression Value-Added Measure - a teacher's average of the difference between students' actual achievement and the predicted achievement of a typical student within the district; greater is better.
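Consistent with the descriptions of metrics 2) and 4) above, both value-added measures can be written in the same form; the notation below is introduced here only for illustration.

\[
\mathrm{VAM}_j \;=\; \frac{1}{n_j}\sum_{i=1}^{n_j}\left(y_{ij} - \hat{y}_{ij}\right),
\]

where $n_j$ is the number of linked students for teacher $j$, $y_{ij}$ is student $i$'s actual achievement score, and $\hat{y}_{ij}$ is the corresponding predicted score, taken from the linear mixed model prediction for the Linear Mixed Model Value-Added Measure and from the 0.5-quantile regression model for the Quantile Regression Value-Added Measure.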
Subject-specific and overall teacher index values were calculated in Phase II with an analytical process involving principal component analysis that used the four teacher effectiveness metrics determined in Phase I. Since teachers' Phase I metrics for a given subject, grade, and year have varying units of measurement, the principal components were obtained from the standardized version of those metrics by calculating the eigenvectors of the metrics' correlation matrix. The variance of the principal component scores from the dominant principal component is the largest eigenvalue, which is associated with the corresponding eigenvector. This eigenvalue absorbed the preponderance of the variation of the system, and the principal component scores from the dominant principal component became teachers' subject-specific RMTEI values. If a teacher instructed both mathematics and reading, then an overall teacher effectiveness index was obtained by taking the mean of the two subject indexes; otherwise, the single subject index served as the teacher's overall value.

The principal components served as the inputs to Phase III, Cluster Analysis. Since the RMTEI process produces a dominant principal component that continually leads to elliptically shaped clusters in two dimensions, Ward's clustering method, which expects elliptically shaped clusters, was employed as a general prescription of the RMTEI process. Ward's clustering method illuminated teachers with similar characteristics (principal components) in the data, which provided better assignments to teacher effectiveness categories compared to clustering with a single Phase I metric. Categories can then be used to identify teachers who employ pedagogical strategies or exhibit certain behaviors that positively impact student learning.

In order to accomplish the primary objective of this research, the following sub-objectives were completed:

1) Develop techniques with Alabama's database infrastructure to streamline the process of establishing longitudinal data from existing yearly data in order to make this information readily accessible to school districts (see Appendix 1).
2) Develop techniques with Alabama's database infrastructure to streamline the process of linking student achievement data with teachers in order to make this information readily accessible to school districts (see Appendix 1).
3) Confirm or modify Alabama's effectiveness rating categories based on being able to detect statistically different groups of teachers by effectiveness.
4) Report all teachers in accordance with the rating categories for an Alabama school district using the newly developed objective effectiveness index.
5) Provide an assessment of Alabama's educator evaluation system, compare and contrast the results of an objective effectiveness index with observational teacher evaluation data, and propose an observational, summative assessment for Alabama that is a predictor of student achievement gains and correlated with an objective effectiveness index.

Specifically addressing sub-objective five (5), Alabama's educator evaluation system, EDUCATEAlabama, does not contain an accountability mechanism directly tied to student growth and is not in compliance with "A Blueprint for Reform". The Risk-Mitigated Teacher Effectiveness Index provides the Alabama State Department of Education (ALSDE) with an objective measure of teacher effectiveness, and it is correlated with the observational assessment system in the Teacher Induction Study. The combination of these two assessment instruments based on student growth provides the ALSDE with a structure to meet its Race to the Top requirement of a new evaluation system with correlated components.
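A minimal sketch of the Phase III clustering step described above follows, assuming a hypothetical dataset PC_SCORES containing each teacher's principal component scores PRIN1 and PRIN2 and a TEACHER identifier (names are placeholders). The cut at four clusters mirrors the category count proposed for a medium-sized district.

/* Ward's hierarchical clustering of teachers on their principal        */
/* component scores.                                                    */
proc cluster data=pc_scores method=ward outtree=tree_ward noprint;
   var prin1 prin2;
   id teacher;
run;

/* Cut the dendrogram at four clusters; TEACHER_CLUSTERS assigns each   */
/* teacher to one of four groups for review against the proposed        */
/* effectiveness categories.                                            */
proc tree data=tree_ward nclusters=4 out=teacher_clusters noprint;
run;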
6.2 Limitations of the Risk-Mitigated Teacher Effectiveness Index

As with any endeavor whose purpose is to improve the quality of an existing process, there are always certain aspects of that endeavor that cannot be fully addressed or improved. The Risk-Mitigated Teacher Effectiveness Index is no exception; however, the Index's purpose by definition is to reduce the negative consequences documented in the literature on measuring teacher effectiveness. Many of those negative consequences have been addressed in the preceding sections, but some limitations of the process remain and deserve discussion.

6.2.1 Limitations in Reporting

The RMTEI process places teachers into categories following the development of an objective measure of teacher effectiveness based on student growth. The range of categories extends from extraordinary to poor gains in student growth. Despite the importance of knowing such information, a limitation exists in its reporting: the process cannot explicitly state why certain teachers have appropriate growth whereas others do not. Although this is characteristic of many teacher effectiveness measuring instruments, education leaders would have to conduct observational evaluations to ascertain the nature of a teacher's performance that leads to the corresponding student growth.

Secondly, measuring teacher effectiveness through the use of student testing data precludes conducting assessments of teachers in certain subjects and grades. While mathematics and reading testing typically occurs yearly for grades 3-8, the lack of yearly testing in other subjects does not allow student growth to be measured or linked to a teacher. Therefore, the application of the RMTEI process can rarely take place beyond the eighth grade or in subjects other than mathematics and reading due to data limitations. Alabama presently administers the tests in Table 6.1 ("Alabama Student Assessment Program Overview," n.d.).

Table 6.1: Alabama Student Assessment Program Overview

Grades         Subject                                            Test
3-8            Reading and Mathematics                            Stanford 10 and ARMT
9-12           Mathematics, Reading, Language, Social Studies,    Alabama High School Graduation Exam (AHSGE)
               Science/Biology
5, 7, and 10   Writing                                            Alabama Direct Assessment of Writing (ADAW)
5 and 7        Science                                            Alabama Science Assessment (ASA)
8              English, Mathematics, Reading, and Science         EXPLORE

Lastly, with the creation of mutually exclusive testing histories for the Alabama district as described in Section 4.2.2 for LMMs (NCES students only formed the minimum, two-year testing history), all students became part of a testing history except those without a recorded test in the previous year and those who did not have a linked teacher. The lack of a linked teacher only applies to the Alabama district, as the NCES data had all teachers linked with their students' achievement scores. Therefore, the excluded students fell under one of three classifications:

1) A lone test in the current year (students new to the school district);
2) No recorded test for the previous year (students who may have been in the district for an extended period but failed to take the previous year's test); and
3) A teacher could not be linked to the student for the current year, which only applies to the Alabama district (for unknown reasons the scheduling database did not contain the student despite the student having recorded test scores).

All students in the testing histories then contributed to the creation of the two Phase I LMM metrics.
Median Student Growth Percentiles and Quantile Regression Value-Added Measures, on the other hand, were calculated from students within a single testing history of two years. Once again, select students were omitted from the analysis; in this case, however, only students without a linked teacher in the current year were omitted (Alabama district only). As a result, the research findings apply only to students likely to be tested and free of errors with regard to recorded student schedules.

6.2.2 RMTEI Values and Small Populations of Teachers

Alabama's effectiveness index values in 2009 and 2010 were analyzed as predictors of student achievement gains in 2010 and 2011, respectively. The analysis paints a clear picture that the effectiveness index values are statistically significant predictors of student achievement gains only when the population of teachers in a grade is larger. Grades 5-8, for example, routinely had only three teachers for a single subject, so significance of the index values was rarely achieved. Grade 4, however, typically consisted of 14 teachers and produced subject-specific and overall effectiveness index values that were statistically significant predictors of student achievement gains for the next school year, as stated above. The lone exception was the 2009 4th grade reading index value, which was not a statistically significant predictor of reading gains for 2010, with a p-value of .0843.

The precision of the RMTEI, as demonstrated with control charts in Section 4.9, was likewise limited to larger populations of teachers. Specifically, when an individual teacher constituted a subgroup of three or fewer effectiveness values and was compared to few other teachers (e.g., two teachers per subject in grades 5-8 for the Alabama district), the x-bar chart, which measures between-subgroup variability, showed that most teachers fell within the control limits and thus did not adequately distinguish teachers. The R chart, which measures within-subgroup variability, consistently remained in control as desired.

6.3 Future Study

As stated in Section 4.1, a significant amount of energy was put forth by this research to obtain quality student and teacher data from the ALSDE and Alabama Local Education Agencies. Much of the energy expenditure was of little benefit to this research, as only a single medium-sized district provided the requisite data to fully implement the RMTEI process. As a result, additional data were obtained from the NCES. Alabama's Phase II Race to the Top application clearly stated a desire to create partnerships with the research institutions of the State and to make data readily available to them. The ALSDE must take action to accomplish these goals. Future study remains whereby additional data must be obtained from Alabama to fully investigate the usefulness of the RMTEI approach for the State. The results of this research clearly allow one to offer its conclusions to the State, but the findings would have broader implications had the preponderance of the data come from Alabama.

This research also proposes future study of the test-forms equating for the reading portion of the ARMT. Based on the analysis of the previous chapters, reading RMTEI values trailed behind mathematics RMTEI values in correlation across years and as predictors of future reading gains. Additional data can illuminate whether this is a persistent issue or particular to the lone medium-sized Alabama district for the specific study years.
If the future study also suffers from these issues, then analysis must be undertaken to determine whether successive reading forms of the ARMT are equally difficult. For example, "if a test form is more difficult than those that precede and follow, it will systematically make the first measure of gain too low and the second too high" (Bock, Wolfe, & Fisher, 1996, p. 13).

As a means to "help campuses and educators identify individual students in need of intervention," desirable research in Alabama includes calculating statistical projections of test scores to the next grade and evaluating those projections against desired proficiency levels (Texas Education Agency, 2009, p. 25). Intervention methods of administrators would be employed when presented with the following two scenarios:

1) Students who currently meet the standard but are projected to not meet the proficiency standard.
2) Students who currently do not meet the standard and are projected to not meet the proficiency standard (Texas Education Agency, 2009, p. 25).

Alabama included similar language on this topic in its reform agenda as part of the Race to the Top application, whereby it wants to "develop predictive trajectories for its students through graduation...[and] create a dashboard-style early warning system for teachers" ("Alabama's Race," 2010, p. 9). Any state's strategic education plan would be wise to consider this method of projection "to accelerate student achievement, close achievement gaps, [and] inspire our children to excel" ("A Blueprint for Reform," 2010, p. 2).

Lastly, the documents supporting EDUCATEAlabama's development provide an excellent structure to return teacher accountability to the evaluation. Noteworthy is the cross-walk established in the Alabama Continuum for Teacher Development that provides benchmarks of performance for the Alabama Quality Teaching Standards along the teacher continuum: pre-service, beginning, emerging, applying, integrating, and innovating ("Alabama Continuum," 2009). Therefore, future study consists of developing a summative assessment composed of the observational effectiveness scoring in the Teacher Induction Study to be incorporated into EDUCATEAlabama to provide teacher accountability and development.

References

A Blueprint for Reform: The Reauthorization of the Elementary and Secondary Education Act -- TOC. (2010, September 2). Laws; Publicity. Retrieved April 12, 2011, from http://www2.ed.gov/policy/elsec/leg/blueprint/publicationtoc.html

Alabama Continuum for Teacher Development. (2009). Alabama State Department of Education. Retrieved from http://alex.state.al.us/leadership/Alabama%20Continuum%20for%20Teacher%20Development.pdf

Alabama Professional Education Personnel Evaluation Program. (2011, May 12). Alabama PEPE Program. Retrieved May 27, 2011, from http://www.alabamapepe.com/

Alabama Professional Education Personnel Evaluation Program for Teachers. (1998, August 31). Alabama State Department of Education. Retrieved from http://pixdoc.com/doc/alabama+board+of+education+pepe+form/

Alabama Reading and Mathematics Test: Interpreting the Student Report. (2009, September 11). Alabama State Department of Education Student Assessment. Retrieved from https://docs.alsde.edu/documents/91/ARMT%20Interpreting%20Student%20Group-%20Reports.pdf

Alabama Student Assessment Program Overview. (n.d.). Retrieved from http://www.hsv.k12.al.us/dept/merts/testing/Testing_overview.php

Alabama's Race to the Top Application. (2010, June 1). Alabama State Department of Education.
Retrieved from http://www2.ed.gov/programs/racetothetop/phase2- applications/alabama.pdf 128 Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for Student Background in Value- Added Assessment of Teachers. Journal of Educational and Behavioral Statistics, 29(1), 37 ?65. doi:10.3102/10769986029001037 Betebenner, D. W. (2007, October 5). Estimation of Student Growth Percentiles for the Colorado Student Assessment Program. Colorado Department of Education. Retrieved from http://www.cde.state.co.us/cdedocs/Research/PDF/technicalsgppaper_betebenner.pdf Bice, T. (2010, September 21). Alabama Deputy State Superintendent of Education. Bock, R. D., Wolfe, R. G., & Fisher, T. H. (1996). A review and analysis of the Tennessee Value- Added Assessment System. Comptroller of the Treasury. Braun, H. I., Chudowsky, N., & Koenig, J. A. (Eds.). (2010). Getting Value Out of Value-Added: Report of a Workshop. Washington: National Academies Press. Colorado?s Academic Growth Model. (2008, February 13). Colorado Department of Education. Retrieved from http://www.cde.state.co.us/cdeassess/documents/res_eval/FinalLongitudinalGrowthTAP Report.pdf Corcoran, S. P. (2010). Can Teachers be Evaluated by their Students? Test Scores? Should They Be? The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice. Annenberg Institue for School Reform at Brown University. Retrieved from http://www.annenberginstitute.org/products/Corcoran.php Crouse, D. (2011, April 6). Director of Federal Programs and Professional Services, Roanoke City Schools, Roanoke, AL. 129 Delaware and Tennessee Win First Race to The Top Grants. (2011, January 27). Press Releases; Retrieved April 13, 2011, from http://www2.ed.gov/news/pressreleases/2010/03/03292010.html DiChiara, L. (2011, April 1). Superintendant, Phenix City Public Schools, Phenix City, AL. Edgeworth, F. Y. (1888). On a new method of reducing observations relating to several quantiles. Philosophical magazine: a journal of theoretical, experimental and applied physics (Vol. 25, pp. 184?191). Taylor & Francis. EDUCATEAlabama. (n.d.).Educator Evaluations. Retrieved May 27, 2011, from http://alex.state.al.us/leadership/evaluations.html EDUCATEAlabama Information Webinar. (2011, April 11). Alabama State Department of Education. Retrieved from http://alex.state.al.us/leadership/evaluations.html Educator Effectiveness Resolution. (2010, May 27). Alabama State Board of Education. Retrieved from http://www.alsde.edu/html/boe_resolutions2.asp?id=1662 Fifth Grade Data Codebook: Early Childhood Longitudinal Study [United States]: Kindergarten Class of 1998-1999, Fifth Grade. (2006, February). United States Department of Education. National Center for Education Statistics. Glazerman, S., Isenberg, E., Dolfin, S., Bleeker, M., Johnson, A., Grider, M., & Jacobus, M. (2010, October). Impacts of Comprehensive Teacher Induction: Final Results from a Randomized Controlled Study (NCEE 2010-4028). National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. Retrieved from http://ies.ed.gov/ncee/pubs/20104027/pdf/20104028.pdf Hao, L., & Naiman, D. Q. (2007). Quantile Regression. Quantitative applications in the social sciences. Thousand Oaks, Calif: Sage Publications. 130 Holdren, J. P. (2011, January 6). America COMPETES Act Keeps America?s Leadership on Target. The White House Blog. Retrieved from http://www.whitehouse.gov/blog/2011/01/06/america-competes-act-keeps-americas- leadership-target House, D. H. 
(2010, March 17). Spline Curves. Retrieved from http://www.cs.clemson.edu/~dhouse/courses/405/notes/splines.pdf Institute of Education Sciences. (n.d.).Director of IES. Retrieved from http://ies.ed.gov/director/biography.asp Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis (6th ed.). Prentice Hall. Kaw, A., & Keteltas, M. (2009, November 20). Spline Method of Interpolation. Retrieved from http://numericalmethods.eng.usf.edu/mws/gen/05inp/mws_gen_inp_txt_spline.pdf Kincaid, D., & Cheney, W. (2002). Numerical Analysis: Mathematics of Scientific Computing (3rd Revised ed.). American Mathematical Society. Larson, J. (2010, November 16). Developer/Project Lead, Information Systems, Alabama State Department of Education. Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. (2010, December). Bill and Melinda Gates Foundation. Retrieved from http://www.metproject.org/downloads/Preliminary_Finding-Policy_Brief.pdf Lim, L. K. S., Acito, F., & Rusetski, A. (2006). Development of archetypes of international marketing strategy. Journal of International Business Studies, 37(4), 499?524. doi:10.1057/palgrave.jibs.8400206 131 Lissitz, B., & Doran, H. (2009, July). Modeling Growth for Accountability and Program Evaluation: An Introduction for Wisconsin Educators. Retrieved from http://dpi.wi.gov/oea/pdf/introgrowth.pdf Lockwood, J. R., Louis, T. A., & McCaffrey, D. F. (2002). Uncertainty in Rank Estimation: Implications for Value-Added Modeling Accountability Systems. Journal of Educational and Behavioral Statistics, 27(3), 255?270. MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281?297). Berkeley, CA: University of California Press. Mathews, J. H. (2004). Cubic Splines. Retrieved from http://math.fullerton.edu/mathews/n2003/CubicSplinesMod.html McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2004). Evaluating Value- Added Models for Teacher Accountability (1st ed.). RAND Corporation. McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009, October). The Intertemporal Variability of Teacher Effect Estimates. Retrieved from http://www.performanceincentives.org/data/files/news/PapersNews/200903_McCaffrey_ etAl_TeacherEffectEstimate1.pdf Morton, J. B. (2010, September 1). An Open Letter. Alabama State Department of Education. National Center for Education Statistics. (n.d.).Welcome to NCES. Retrieved from http://nces.ed.gov/ Pearson, M., & Stecher, B. (2004). Organizational Improvement and Accountability: Lessons for Education from Other Sectors. (B. Stecher & S. N. Kirby, Eds.). RAND. 132 Professional Education Personnel Evaluation Program of Alabama. (2008, May 1). Alabama State Department of Education. Retrieved from http://www.alabamapepe.com/teacher.htm Pugh, J. (2008, February 25). Alabama Reading and Mathematics Test. Alabama State Department of Education Student Assessment. Retrieved from https://docs.alsde.edu/documents/91/Alabama%20Reading%20and%20Mathematics%20 Test.pdf Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods. Advanced quantitative techniques in the social sciences (2nd ed.). Thousand Oaks: Sage Publications. Sanders, W. L., Saxton, A. M., & Horn, S. P. (1997). The Tennessee Value-Added Assessment System: A Quantitative, Outcomes-Based Approach to Educational Assessment. In J. 
Millman (Ed.), Grading Teachers, Grading Schools (pp. 137?162). Thousand Oaks, Calif: Corwin Press. SAS EVAAS for K-12. (2010). Retrieved from http://www.sas.com/resources/product- brief/SAS_EVAAS_for_K-12.pdf SAS/STAT 9.2 User?s Guide Second Edition. (2009). Retrieved from http://support.sas.com/documentation/cdl/en/statug/63033/PDF/default/statug.pdf Schmidhammer, J. L. (2010). Agglomerative Hierarchical Clustering Methods. Retrieved from http://bus.utk.edu/stat/stat579/Hierarchical%20Clustering%20Methods.pdf Secretary Spellings Approves Additional Growth Model Pilots for 2008-2009 School Year. (2009, January 21). Press Releases; Retrieved April 13, 2011, from http://www2.ed.gov/news/pressreleases/2009/01/01082009a.html 133 System Profile Report 2008-2009. (2009). Alabama State Department of Education. Retrieved from http://www.alsde.edu/html/school_info.asp?menu=school_info&footer=general&sort=co unty Texas Education Agency. (2009, January 12). Growth Model Pilot Application for Adequate Yearly Progress Determinations under the No Child Left Behind Act. Retrieved from http://www.tea.state.tx.us/student.assessment/measures/archive/ Trietsch, D. (1999). Statistical Quality Control: A Loss Minimization Approach. World Scientific Pub Co Inc. Tufte, E. R. (2006). Beautiful Evidence. Graphics Pr. Wei, Y., & He, X. (2006). Conditional Growth Charts. The Annals of Statistics, 34(5), 2069? 2097. West, B., Welch, K. B., & Galecki, A. T. (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Boca Raton: Chapman & Hall/CRC. Working with Teachers to Develop Fair and Reliable Measures of Effective Teaching. (2010, June). Bill and Melinda Gates Foundation. Retrieved from http://www.gatesfoundation.org/highschools/Documents/met-framing-paper.pdf 134 Appendix 1: Establishing Longitudinal Student Achievement Data Linked with Teacher Information The process to create longitudinal student data linked with current year teachers began with actions at the school district level followed by the actions of this research. Appendix 1 will describe each set of actions in order to offer these methods to other districts within Alabama. I. District Actions The medium-sized Alabama District provided four types of files for this research: Schedule, Course Counts, Teacher Information, and Student Achievement. A. Schedule The schedule dataset was obtained by a Structured Query Language (SQL) program that queried the STI Education Data Management Solutions database. The SQL program requested the specific fields of student identification number, period, course number, teacher identification number, and school identification number to export to a comma delimited/Comma Separated Values (CSV) file. In order to account for the manner in which students were recorded in ?terms?, the district had to perform a query for elementary schools and an additional query for secondary schools. Elementary school students are recorded in a single term for a year whereas secondary school students are recorded in two terms for a single year. This accounts for secondary students possibly having different schedules in the second half of the school year. B. Course Counts The ?course counts? dataset was obtained by a SQL program that queried the STI Education Data Management Solutions database. The SQL program requested the specific fields of course number, short name, long name, school identification number, and the number of students to export to a CSV file. 
The course counts information was queried separately from the 135 schedule information due to the applicable data being in a different STI table. The district technology coordinator acknowledged that the schedule and course counts information could possibly be obtained together in a single query. C. Teacher Information The teacher information dataset was obtained from the yearly Local Education Agency Personnel System (LEAPS) report. This report is generated from a query of the McAleer Accounting System database. The district provided this research with a portion of the CSV LEAPS report, specifically the fields of school year, teacher identification number, gender, ethnicity, highest degree obtained, school, and teaching experience. D. Student Achievement School districts receive yearly student achievement files via distributed compact disks from the Alabama State Department of Education. The student achievement files are CSV in nature and consist of a matrix with over 200 column vectors reporting the nature of students and their respective ARMT scores. The district provided this research with a portion of the CSV student achievement files, specifically Student Identification Number, grade, school, and ARMT Reading and Mathematics Scores. II. Researcher Actions The process to create longitudinal student data linked with current year teachers began with extracting students and their teachers from the schedule dataset using the desired grade and course (mathematics or reading). The schedule datasets were yearly in nature and consisted of elementary and secondary school files. For example, two Rectangular format files (columns represent variables and rows represent observations) with comma separated values contained the schedules of all students in 2009: 136 1) 2008-2009 Elementary School (kindergarten ? 4th grade) 2) 2008-2009 Secondary School (5th grade ? 8th grade) The files contained the column headings of student identification number, period, course number, teacher identification number, and school identification number. For each student, his/her student identification code was repeated to capture the different courses and corresponding periods. The teacher identification code also repeated if that teacher taught all subjects for that student. The school identification number also repeated for the student. An excerpt of the schedule dataset is presented in Table 6.2. Table 6.2: Excerpt of Schedule Dataset The ?course counts? dataset for each year was used to discern the course name for each course number. Column headings consisted of course number, short name, long name, school identification number, and the number of students in that particular course number. The course numbers were devised to represent not only a particular course in a particular grade, but also the particular section. The course number along with the school identification number depicted an actual classroom of students. An excerpt of the ?course counts? dataset is presented in Table 6.3. 137 Table 6.3: Excerpt of Course Counts Dataset The reading and mathematics course numbers for each grade and year were then recorded to create the requisite extraction from the schedule dataset. After importing the schedule dataset into SAS, sorting by teacher, and specifying the year and grade for analysis, the appropriate template found the appropriate courses/teachers in the schedule dataset that would be linked with the reading and mathematics student achievement data of the current year. 
For example, the user specifies the year and grade to be 2009 and 6th grade respectively. As a result, the required schedule dataset is retrieved (secondary school dataset) followed by the extraction of reading and mathematics courses/teachers for that particular grade and year using a range of course numbers: %let year=2009; %let SY='2008-2009'; %let grade=6; %macro teachers (year=); data mathtchrs; if &grade = 4 then set schedule.elemtchrs&year; else if &grade=5 and &year=2007 then set schedule.elemtchrs&year; else if &grade=5 and &year=2008 then set schedule.elemtchrs&year; else if &grade=5 and &year>2008 then set schedule.sectchrs&year; else if &grade>=6 then set schedule.sectchrs&year; if &grade= 6 and &year=2009 then if school ne 40 or cnum < 6300.01 then delete; else if cnum > 6350.02 then delete; MTeacher=teacher; run; 138 data readtchrs; if &grade = 4 then set schedule.elemtchrs&year; else if &grade=5 and &year=2007 then set schedule.elemtchrs&year; else if &grade=5 and &year=2008 then set schedule.elemtchrs&year; else if &grade=5 and &year>2008 then set schedule.sectchrs&year; else if &grade>=6 then set schedule.sectchrs&year; if &grade=6 and &year=2009 then if school ne 40 or cnum < 6100.01 then delete; else if cnum > 6550.03 then delete; else if 6130.03=7 %then %do; data Mtcheffects&grade; merge mtcheffects&grade&j-mtcheffects&grade&k; by mteacher; Est_total=sum(of estimate&grade&j-estimate&grade&k); DFtotal= sum(of DF&grade&j-DF&grade&k); MLMM_Teacher_Effect=Est_total/DFtotal; If DFtotal = " " then MLMM_Teacher_Effect=0; run; data rtcheffects&grade; merge rtcheffects&grade&j-rtcheffects&grade&k; by rteacher; Est_total=sum(of estimate&grade&j-estimate&grade&k); DFtotal= sum(of DF&grade&j-DF&grade&k); RLMM_Teacher_Effect=Est_total/DFtotal; If DFtotal = " " then RLMM_Teacher_Effect=0; run; %goto exit; 147 %end; %exit: %mend check; %check /*-------------------------------------------------------------------------*/ /*Calculate Median SGP for Math and Reading*/ %macro QR_SGP; proc quantreg data=SGP&grade ci=resampling ; effect sp=spline (ARMT_Math_&k ARMT_Read_&k /details ); model mathgain = sp/quantile=.01 .02 .03 .04 .05 .06 .07 .08 .09 .1 .11 .12 .13 .14 .15 .16 .17 .18 .19 .2 .21 .22 .23 .24 .25 .26 .27 .28 .29 .3 .31 .32 .33 .34 .35 .36 .37 .38 .39 .4 .41 .42 .43 .44 .45 .46 .47 .48 .49 .5 .51 .52 .53 .54 .55 .56 .57 .58 .59 .6 .61 .62 .63 .64 .65 .66 .67 .68 .69 .7 .71 .72 .73 .74 .75 .76 .77 .78 .79 .8 .81 .82 .83 .84 .85 .86 .87 .88 .89 .9 .91 .92 .93 .94 .95 .96 .97 .98 .99 ; output out=chp3_2_3m&grade pred=p quantile=q ; run; data msgpbase; set chp3_2_3m&grade; keep mteacher school mathgain p1-p99; run; data mconvert; set msgpbase; array test{*}_numeric_; DO i = 4 to dim(test); test(i)=abs(test(i)-test(3)); end; drop i; mindev=min(of p1-p99);/* p1-p99 become the abs deviations due to array test*/ /* columns 4 102 */ run; data msgp&grade; set mconvert; sgp=0; array test{*} _numeric_; do i=4 to 102; if test{i}=mindev then sgp=i-3;/*will give the higest sgp if multiple values are mindev*/ end; if mindev= ' ' then sgp=' ' ; drop i; run; /* Read SGP */ proc quantreg data=SGP&grade ci=resampling ; effect sp=spline (ARMT_Math_&k ARMT_Read_&k /details ); model readgain = sp/quantile=.01 .02 .03 .04 .05 .06 .07 .08 .09 .1 .11 .12 .13 .14 .15 .16 .17 .18 .19 .2 148 .21 .22 .23 .24 .25 .26 .27 .28 .29 .3 .31 .32 .33 .34 .35 .36 .37 .38 .39 .4 .41 .42 .43 .44 .45 .46 .47 .48 .49 .5 .51 .52 .53 .54 .55 .56 .57 .58 .59 .6 .61 .62 .63 .64 .65 .66 .67 .68 .69 .7 .71 .72 .73 .74 .75 .76 
.77 .78 .79 .8 .81 .82 .83 .84 .85 .86 .87 .88 .89 .9 .91 .92 .93 .94 .95 .96 .97 .98 .99 ; output out=chp3_2_3r&grade pred=p quantile=q ; run; data rsgpbase; set chp3_2_3r&grade; keep rteacher school readgain p1-p99; run; data rconvert; set rsgpbase; array test{*}_numeric_; DO i = 4 to dim(test); test(i)=abs(test(i)-test(3)); end; drop i; mindev=min(of p1-p99);/* p1-p99 become the abs deviations due to array test*/ /* 4 102 */ run; data rsgp&grade; set rconvert; sgp=0; array test{*} _numeric_; do i=4 to 102; if test{i}=mindev then sgp=i-3;/*will give the higest sgp if multiple values are mindev*/ end; if mindev= ' ' then sgp=' ' ; drop i; run; %mend QR_SGP; %QR_SGP data msgp; set msgp&grade; run; proc sort data=msgp; by mteacher; run; data rsgp; set rsgp&grade; run; proc sort data=rsgp; by rteacher; run; 149 proc means data=msgp; by mteacher; var sgp; output out=msgrowthper&grade median=MTeacher_Median_SGP n=Mnumber_students; run; proc means data=rsgp ; by rteacher; var sgp; output out=rsgrowthper&grade median=RTeacher_Median_SGP n=Rnumber_students; run; /*-------------------------------------------------------------------------*/ /*LMM Value added measure */ %macro LMM_model_build; /*creating the model from previous cohort of students*/ %do i = &j %to &k ; %let m= %eval(&i-1); %let n=%eval(&k-1); ods output solutionf=mfixed&grade&i ; ods graphics on; proc mixed data=model&grade&i boxplot covtest; class mteacher school ; model mathgain = ARMT_Math_&m-ARMT_Math_&n ARMT_Read_&n-ARMT_Read_&n /solution ; random int/subject= mteacher(school) solution; run; ods graphics off; ods output solutionf=rfixed&grade&i; proc mixed data=model&grade&i covtest; class rteacher school ; model readgain = ARMT_Math_&n-ARMT_Math_&n ARMT_Read_&m-ARMT_Read_&n /solution; random int/subject= rteacher(school) solution; run; %end; %mend LMM_model_build; %LMM_model_build /* apply created models */ /* use data (indep vars) from cohort requiring predictions*/ %macro LMM_apply_model; %if &grade= 4 %then %do; %do i = &k %to &k ; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&i":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&k":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; 150 use mfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close mfixed&grade&i; pred=x*y; create MLMMpredicted&grade&i var{pred}; append var{pred}; close MLMMpredicted&grade&i; Reset log; /* Reading LMM Value Added Measure */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&k":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&i":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; use rfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close rfixed&grade&i; pred=x*y; create RLMMpredicted&grade&i var{pred}; append var{pred}; close RLMMpredicted&grade&i; Reset log; Quit; data chp3_2_2_1&grade&i; merge LMM&grade&i MLMMpredicted&grade&i; vam=mathgain-pred; run; data chp3_2_2_2&grade&i; merge LMM&grade&i RLMMpredicted&grade&i; vam=readgain-pred; run; %end; data chp3_2_2_1; set chp3_2_2_1&grade&k-chp3_2_2_1&grade&k; run; data chp3_2_2_2; set chp3_2_2_2&grade&k-chp3_2_2_2&grade&k; run; %end; %if &grade= 5 %then %do; %do i = &k_1 %to &k ; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use LMM&grade&i; read all var{int} into x1; read all 
var("ARMT_Math_&i":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&k":"ARMT_Read_&k") into x3; close LMM&grade&i; 151 x=x1||x2||x3; use mfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close mfixed&grade&i; pred=x*y; create MLMMpredicted&grade&i var{pred}; append var{pred}; close MLMMpredicted&grade&i; Reset log; /* Reading LMM Value Added Measure */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&k":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&i":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; use rfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close rfixed&grade&i; pred=x*y; create RLMMpredicted&grade&i var{pred}; append var{pred}; close RLMMpredicted&grade&i; Reset log; Quit; data chp3_2_2_1&grade&i; merge LMM&grade&i MLMMpredicted&grade&i; vam=mathgain-pred; run; data chp3_2_2_2&grade&i; merge LMM&grade&i RLMMpredicted&grade&i; vam=readgain-pred; run; %end; data chp3_2_2_1; set chp3_2_2_1&grade&k_1-chp3_2_2_1&grade&k; run; data chp3_2_2_2; set chp3_2_2_2&grade&k_1-chp3_2_2_2&grade&k; run; %end; %if &grade=6 %then %do; %do i = &k_2 %to &k ; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&i":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&k":"ARMT_Read_&k") into x3; 152 close LMM&grade&i; x=x1||x2||x3; use mfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close mfixed&grade&i; pred=x*y; create MLMMpredicted&grade&i var{pred}; append var{pred}; close MLMMpredicted&grade&i; Reset log; /* Reading LMM Value Added Measure */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&k":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&i":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; use rfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close rfixed&grade&i; pred=x*y; create RLMMpredicted&grade&i var{pred}; append var{pred}; close RLMMpredicted&grade&i; Reset log; Quit; data chp3_2_2_1&grade&i; merge LMM&grade&i MLMMpredicted&grade&i; vam=mathgain-pred; run; data chp3_2_2_2&grade&i; merge LMM&grade&i RLMMpredicted&grade&i; vam=readgain-pred; run; %end; data chp3_2_2_1; set chp3_2_2_1&grade&k_2-chp3_2_2_1&grade&k; run; data chp3_2_2_2; set chp3_2_2_2&grade&k_2-chp3_2_2_2&grade&k; run; %end; %if &grade>=7 %then %do; %do i = &j %to &k ; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&i":"ARMT_Math_&k") into x2; 153 read all var("ARMT_Read_&k":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; use mfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close mfixed&grade&i; pred=x*y; create MLMMpredicted&grade&i var{pred}; append var{pred}; close MLMMpredicted&grade&i; Reset log; /* Reading LMM Value Added Measure */ use LMM&grade&i; read all var{int} into x1; read all var("ARMT_Math_&k":"ARMT_Math_&k") into x2; read all var("ARMT_Read_&i":"ARMT_Read_&k") into x3; close LMM&grade&i; x=x1||x2||x3; use rfixed&grade&i var{estimate}; /* coefficients calculated from model data */ read all var{estimate} into y; close rfixed&grade&i; pred=x*y; create RLMMpredicted&grade&i var{pred}; append var{pred}; close 
RLMMpredicted&grade&i; Reset log; Quit; data chp3_2_2_1&grade&i; merge LMM&grade&i MLMMpredicted&grade&i; vam=mathgain-pred; run; data chp3_2_2_2&grade&i; merge LMM&grade&i RLMMpredicted&grade&i; vam=readgain-pred; run; %end; data chp3_2_2_1; set chp3_2_2_1&grade&j-chp3_2_2_1&grade&k; run; data chp3_2_2_2; set chp3_2_2_2&grade&j-chp3_2_2_2&grade&k; run; %end; %mend LMM_apply_model; %LMM_apply_model proc sort data=chp3_2_2_1; by mteacher; run; 154 proc sort data=chp3_2_2_2; by rteacher; run; proc means data = chp3_2_2_1; by mteacher; var vam; output out=MLMMvam&grade mean=Overall_MLMM_Teacher_VAM ; run; proc means data = chp3_2_2_2; by rteacher; var vam; output out=RLMMvam&grade mean=Overall_RLMM_Teacher_VAM; run; /*-------------------------------------------------------------------------*/ /*QR value added measure */ %macro QR_VAM; ods graphics on; proc quantreg data=SGP&grade ci=resampling plots=all ; effect sp=spline (ARMT_Math_&k ARMT_Read_&k /details); model mathgain = sp/quantile=.5 diagnostics; output out=chp3_2_4m&grade pred=p50 sresidual=sresid; run; ods graphics off; proc quantreg data=SGP&grade ci=resampling ; effect sp=spline (ARMT_Math_&k ARMT_Read_&k /details ); model readgain = sp/quantile=.5; output out=chp3_2_4r&grade pred=p50; run; /*QR estimates */ data mqrestimate; set chp3_2_4m&grade; mqrvamstud=mathgain-p50; run; data rqrestimate; set chp3_2_4r&grade; rqrvamstud=readgain-p50; run; %mend QR_VAM; %QR_VAM proc sort data=mqrestimate; by mteacher; run; proc sort data=rqrestimate; by rteacher; run; proc means data = mqrestimate; by mteacher; 155 var mqrvamstud; output out=mqrteacher&grade mean=Overall_MQR_Teacher_VAM; run; proc means data = rqrestimate; by rteacher; var rqrvamstud; output out=rqrteacher&grade mean=Overall_RQR_Teacher_VAM; run; /*-------------------------------------------------------------------------*/ /* Chapter 3_2table creation */ data chp3_2mtable; merge mtcheffects&grade MLMMvam&grade mqrteacher&grade msgrowthper&grade; by mteacher; keep mteacher school MLMM_Teacher_Effect Overall_MLMM_Teacher_VAM MTeacher_Median_SGP Overall_MQR_Teacher_VAM Mnumber_students; if Mnumber_students < 5 then delete; run; data chp3_2rtable; merge rtcheffects&grade RLMMvam&grade rqrteacher&grade rsgrowthper&grade; by rteacher; keep rteacher school RLMM_Teacher_Effect Overall_RLMM_Teacher_VAM RTeacher_Median_SGP Overall_RQR_Teacher_VAM Rnumber_students; if Rnumber_students < 5 then delete; run; /*-------------------------------------------------------------------------*/ /* Phase II and III for Math teachers */ /* Principal Component Analysis */ ods graphics on; proc princomp data=chp3_2mtable out=mteacher_prn1 PLOTS=(SCORE(NCOMP=2 ellipse)patternprofile ) ; var MLMM_Teacher_Effect Overall_MLMM_Teacher_VAM MTeacher_Median_SGP Overall_MQR_Teacher_VAM; run; ods graphics off; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use mteacher_prn1 var{MLMM_Teacher_Effect Overall_MLMM_Teacher_VAM MTeacher_Median_SGP 156 Overall_MQR_Teacher_VAM}; read all var{MLMM_Teacher_Effect Overall_MLMM_Teacher_VAM MTeacher_Median_SGP Overall_MQR_Teacher_VAM} into x; close mteacher_prn1; /* Number of Observations and Variables */ n=nrow(x); p=ncol(x); /* Compute sample mean, covariance, and inverse */ one=J(n,1,1); xbar=(X`*one)/n; print "Sample Means", xbar; xstar=X-one*xbar`; s=xstar`*xstar/(n-1); print "Sample Covariance Matrix", s; rho=corr(x); print "Sample Correlation Matrix", rho; detrho=det(rho); print detrho; call eigen(lambda,Evecs,rho); 
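/* Note on the statements above and below: CALL EIGEN decomposes the sample */
/* correlation matrix rho, returning the eigenvalues in lambda (the         */
/* principal component variances) and the eigenvectors in Evecs (the        */
/* component loadings). Scaling each eigenvector by the square root of its  */
/* eigenvalue, as done next with d and corr, gives the correlation between  */
/* each original effectiveness metric and each principal component.         */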
print lambda, Evecs; d=sqrt(diag(lambda)); print d; /* compute correlation of metric with prin comp */ corr=evecs*d; print corr; Reset log; Quit; proc sort data=mteacher_prn1; by prin1; run; Proc GPlot Data=mteacher_prn1; Plot Prin1*Prin2=1 / HRef=0 VRef=0 VAxis=Axis1 HAxis=Axis2; Axis1 Label=(A=90 "Principal Component 1"); Axis2 Label=("Principal Component 2"); Symbol1 C=Black V=Dot H=0.7 I=None PointLabel=(C=Black "#mteacher"); title "Mathematics Teachers"; Run; symbol1;Quit; title; /* Cluster Analysis */ /* Ward's Method */ ods graphics on; proc cluster data=mteacher_prn1 outtree=mtree method=ward plots=all; id mteacher; var prin1 prin2; run; ods graphics off; %macro mcluster_check; %do y = 4 %to 2 %by -1; proc tree data=mtree noprint out=mout&y n=&y; copy prin1 prin2; 157 run; ods output CLDiffs=mbonci&y&grade; proc glm data=mout&y; class cluster; model prin1 = cluster ; means cluster/bon cldiff; run; quit; proc means data=mbonci&y&grade; var significance; output out=msigmean&y&grade mean=sigmean; run; data mcheck&y&grade; set msigmean&y&grade; if sigmean = 1 then CALL SYMPUT('mcheck',1) ; run; %if &mcheck =1 and &y =4 %then %do; data appen3.mbonci&grade&year; set mbonci&y&grade; run; %global m_numclust; %let m_numclust=4; %let mcheck=0; %goto exit; %end; %if &mcheck=1 and &y =3 %then %do; data appen3.mbonci&grade&year; set mbonci&y&grade; run; %global m_numclust; %let m_numclust=3; %let mcheck=0; %goto exit; %end; %if &mcheck=1 and &y=2 %then %do; data appen3.mbonci&grade&year; set mbonci&y&grade; run; %global m_numclust; %let m_numclust=2; %let mcheck=0; %goto exit; %end; %if &mcheck=0 and &y =2 %then %do; data appen3.mbonci&grade&year; set mbonci&y&grade; run; %global m_numclust; %let m_numclust=1; 158 %goto exit; %end; %end; %exit: %mend mcluster_check; %mcluster_check proc tree data=mtree noprint out=mout n=&m_numclust; copy prin1 prin2; run; proc sgplot data=mout; scatter y=prin1 x=prin2 /group=Cluster ; title "Ward Clustering of Mathematics Teachers "; run; title; /* MANOVA, ANOVA, and Bon CI's to determine if clusters are statistically different */ proc sort data=mout; by _name_; run; data mward; set mout; ident=_n_; keep _name_ cluster ident; run; proc sort data=mteacher_prn1; by mteacher; run; data mteacher_prn3; set mteacher_prn1; ident=_n_; run; data mcombine; merge mteacher_prn3 mward; by ident; mprin1=prin1; mprin2=prin2; mcluster=cluster; run; proc glm data=mcombine; class cluster; model prin1 prin2 = cluster; means cluster/bon cldiff; lsmeans cluster / out=Mmeansout; manova h=cluster ; run; quit; 159 /*-------------------------------------------------------------------------*/ /* Phase II and III for Reading teachers */ /* Principal Component Analysis */ ods graphics on; proc princomp data=chp3_2rtable out=rteacher_prn1 plots=scree ; var RLMM_Teacher_Effect Overall_RLMM_Teacher_VAM RTeacher_Median_SGP Overall_RQR_Teacher_VAM; run; ods graphics off; Proc IML; Reset NoLog; /* send output to the listing file */ /* Read the data into the matrix X */ use rteacher_prn1 var{RLMM_Teacher_Effect Overall_RLMM_Teacher_VAM RTeacher_Median_SGP Overall_RQR_Teacher_VAM}; read all var{RLMM_Teacher_Effect Overall_RLMM_Teacher_VAM RTeacher_Median_SGP Overall_RQR_Teacher_VAM} into x; close rteacher_prn1; /* Number of Observations and Variables */ n=nrow(x); p=ncol(x); /* Compute sample mean, covariance, and inverse */ one=J(n,1,1); xbar=(X`*one)/n; print "Sample Means", xbar; xstar=X-one*xbar`; s=xstar`*xstar/(n-1); print "Sample Covariance Matrix", s; rho=corr(x); print "Sample 
Correlation Matrix", rho; detrho=det(rho); print detrho; call eigen(lambda,Evecs,rho); print lambda, Evecs; d=sqrt(diag(lambda)); print d; /* compute correlation of metrics with prin comps */ corr=evecs*d; print corr; Reset log; Quit; proc sort data=rteacher_prn1; by prin1; run; Proc GPlot Data=rteacher_prn1; Plot Prin1*Prin2=1 / HRef=0 VRef=0 VAxis=Axis1 HAxis=Axis2; Axis1 Label=(A=90 "Principal Component 1"); Axis2 Label=("Principal Component 2"); 160 Symbol1 C=Black V=Dot H=0.7 I=None PointLabel=(C=Black "#rteacher"); title "Reading Teachers"; Run; Symbol1; title; Quit; /* Cluster Analysis */ /* Ward's Method */ ods graphics on; proc cluster data=rteacher_prn1 outtree=rtree method=ward plots=all; id rteacher; var prin1 prin2; run; ods graphics off; %macro rcluster_check; %do y = 4 %to 2 %by -1; proc tree data=rtree noprint out=rout&y n=&y; copy prin1 prin2; run; ods output CLDiffs=rbonci&y&grade; proc glm data=rout&y; class cluster; model prin1 = cluster ; means cluster/bon cldiff; run; quit; proc means data=rbonci&y&grade; var significance; output out=rsigmean&y&grade mean=sigmean; run; data rcheck&y&grade; set rsigmean&y&grade; if sigmean = 1 then CALL SYMPUT('rcheck',1) ; run; %if &rcheck =1 and &y =4 %then %do; data appen3.rbonci&grade&year; set rbonci&y&grade; run; %global r_numclust; %let r_numclust=4; %let rcheck=0; %goto exit; %end; %if &rcheck=1 and &y =3 %then %do; data appen3.rbonci&grade&year; set rbonci&y&grade; run; %global r_numclust; %let r_numclust=3; %let rcheck=0; 161 %goto exit; %end; %if &rcheck=1 and &y=2 %then %do; data appen3.rbonci&grade&year; set rbonci&y&grade; run; %global r_numclust; %let r_numclust=2; %let rcheck=0; %goto exit; %end; %if &rcheck=0 and &y =2 %then %do; data appen3.rbonci&grade&year; set rbonci&y&grade; run; %global r_numclust; %let r_numclust=1; %goto exit; %end; %end; %exit: %mend rcluster_check; %rcluster_check proc tree data=rtree noprint out=rout n=&r_numclust; copy prin1 prin2; run; /* MANOVA, ANOVA, and Bon CI's to determine if clusters are statistically different */ proc sort data=rout; by _name_; run; data rward; set rout; ident=_n_; keep _name_ cluster ident; run; proc sort data=rteacher_prn1; by rteacher; run; data rteacher_prn3; set rteacher_prn1; ident=_n_; run; data rcombine; merge rteacher_prn3 rward; by ident; rprin1=prin1; 162 rprin2=prin2; rcluster=cluster; run; proc glm data=rcombine; class cluster; model prin1 prin2 = cluster ; means cluster/bon cldiff; lsmeans cluster / out=Rmeansout; manova h=cluster ; run; quit; /*-------------------------------------------------------------------------*/ /* combining prin1 from both math and reading processes to calculate an overall value*/ data RMTEI; merge mcombine rcombine; by _name_; if mprin1 = " " then IndexValue = rprin1; else if rprin1= " " then IndexValue = mprin1; else IndexValue=(mprin1+rprin1)/2; keep _name_ school mprin1 mprin2 rprin1 rprin2 indexvalue MLMM_Teacher_Effect Overall_MLMM_Teacher_VAM MTeacher_Median_SGP Overall_MQR_Teacher_VAM RLMM_Teacher_Effect Overall_RLMM_Teacher_VAM RTeacher_Median_SGP Overall_RQR_Teacher_VAM mteacher rteacher mcluster rcluster Mnumber_students Rnumber_students; run; /*Cluster Analysis of overall value */ /* Ward's Method */ ods graphics on; proc cluster data=RMTEI outtree=tree method=ward plots=all; id _name_; var IndexValue; run; ods graphics off; %macro ivcluster_check; %do y = 4 %to 2 %by -1; proc tree data=tree noprint out=ivout&y n=&y; copy indexvalue; run; ods output CLDiffs=ivbonci&y&grade; proc glm data=ivout&y; class cluster; 
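/* Note on the %do loop in progress here: for y = 4, 3, and then 2, the      */
/* candidate y-cluster solution is cut from the Ward tree and the GLM below  */
/* compares cluster means of the overall index value using Bonferroni        */
/* confidence intervals. The macro retains the largest y for which every     */
/* pairwise cluster difference is significant (sigmean = 1); if even the     */
/* two-cluster solution does not separate, a single cluster is used.         */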
  model indexvalue = cluster;
  means cluster/bon cldiff;
run; quit;

proc means data=ivbonci&y&grade;
  var significance;
  output out=ivsigmean&y&grade mean=sigmean;
run;

data ivcheck&y&grade;
  set ivsigmean&y&grade;
  if sigmean = 1 then call symput('ivcheck',1);
run;

%if &ivcheck=1 and &y=4 %then %do;
  data appen3.ivbonci&grade&year; set ivbonci&y&grade; run;
  %global iv_numclust;
  %let iv_numclust=4;
  %let ivcheck=0;
  %goto exit;
%end;
%if &ivcheck=1 and &y=3 %then %do;
  data appen3.ivbonci&grade&year; set ivbonci&y&grade; run;
  %global iv_numclust;
  %let iv_numclust=3;
  %let ivcheck=0;
  %goto exit;
%end;
%if &ivcheck=1 and &y=2 %then %do;
  data appen3.ivbonci&grade&year; set ivbonci&y&grade; run;
  %global iv_numclust;
  %let iv_numclust=2;
  %let ivcheck=0;
  %goto exit;
%end;
%if &ivcheck=0 and &y=2 %then %do;
  data appen3.ivbonci&grade&year; set ivbonci&y&grade; run;
  %global iv_numclust;
  %let iv_numclust=1;
  %goto exit;
%end;
%end;
%exit:
%mend ivcluster_check;
%ivcluster_check

proc tree data=tree noprint out=out n=&iv_numclust;
  copy IndexValue;
run;

proc sgplot data=out;
  scatter y=indexvalue x=_name_ / group=cluster;
  xaxis label="Teacher";
  title "Ward's Cluster Analysis for Teachers by Overall Index Value";
run;

Proc GPlot Data=out;
  Plot indexvalue*_name_=cluster / HRef=0 VRef=0 VAxis=Axis1 HAxis=Axis2;
  Axis1 Label=(A=90 "Index Value"); /* A=90 rotates the title 90 degrees */
  Axis2 Label=("Teacher");
  Symbol1 H=1 PointLabel=(C=Black "#_name_");
Run;
Symbol1; title; Quit;

/* MANOVA, ANOVA, and Bon CI's to determine if clusters are statistically different */
proc glm data=out;
  class cluster;
  model Indexvalue = cluster;
  means cluster/bon cldiff;
  lsmeans cluster / out=meansout;
  manova h=cluster;
run; quit;

/*-------------------------------------------------------------------------*/
/* create final file */
proc sort data=out; by _name_; run;

data final;
  merge RMTEI out;
  by _name_;
  drop clusname;
  teacher=input(_name_,best12.);
run;

data appen3.appen3&grade&year;
  set final;
run;
%mend RMTEI;

%macro reports;
%do p = 4 %to 8;
  %RMTEI(grade=&p);
%end;
%mend reports;
%reports

Appendix 3: Alabama District Results

2009:
2010:
2011:

Appendix 4: NCES District Results

2006:
2007:
2008: