Exhibit A.1 The Three Aspects and Major Categories of the Mathematics Frameworks
Exhibit A.2 Distribution of Mathematics Items by Content Reporting Category and Performance Category
Exhibit A.3 Coverage of TIMSS 1999 Target Population
Exhibit A.4 School Sample Sizes
Exhibit A.5 Student Sample Sizes
Exhibit A.6 Overall Participation Rates
Exhibit A.7 TIMSS 1999 Within-Country Free-Response Scoring Reliability Data for Mathematics Items
Exhibit A.8 Cronbach's Alpha Reliability Coefficient, TIMSS 1999 Mathematics Test
Exhibit A.9 Country-Specific Variations in Mathematics Topics in the Curriculum Questionnaire
TIMSS 1999 represents the continuation of a long series of studies conducted by the International Association for the Evaluation of Educational Achievement (IEA). Since its inception in 1959, the IEA has conducted more than 15 studies of cross-national achievement in the curricular areas of mathematics, science, language, civics, and reading. The Third International Mathematics and Science Study (TIMSS), conducted in 1994-1995, was the largest and most complex IEA study, and included both mathematics and science at third and fourth grades, seventh and eighth grades, and the final year of secondary school. In 1999, TIMSS again assessed eighth-grade students in both mathematics and science to measure trends in student achievement since 1995. TIMSS 1999 was also known as TIMSS-Repeat, or TIMSS-R.(1)
To provide U.S. states and school districts with an opportunity to benchmark the performance of their students against that of students in the high-performing TIMSS countries, the International Study Center at Boston College, with the support of the National Center for Education Statistics and the National Science Foundation, established the TIMSS 1999 Benchmarking Study. Through this project, the TIMSS mathematics and science achievement tests and questionnaires were administered to representative samples of students in participating states and school districts in the spring of 1999, at the same time the tests and questionnaires were administered in the TIMSS countries. Participation in TIMSS Benchmarking was intended to help states and districts understand their comparative educational standing, assess the rigor and effectiveness of their own mathematics and science programs in an international context, and improve the teaching and learning of mathematics and science.
Thirteen states availed themselves of the opportunity to participate in the Benchmarking Study. Eight public school districts and six consortia also participated, for a total of fourteen districts and consortia. They are listed in Exhibit 1 of the Introduction, together with the 38 countries that took part in TIMSS 1999.
The TIMSS curriculum framework underlying the mathematics tests was developed for TIMSS in 1995 by groups of mathematics educators with input from the TIMSS National Research Coordinators (NRCs). As shown in Exhibit A.1, the mathematics curriculum framework contains three dimensions or aspects. The content aspect represents the subject matter content of school mathematics. The performance expectations aspect describes, in a non-hierarchical way, the many kinds of performances or behaviors that might be expected of students in school mathematics. The perspectives aspect focuses on the development of students' attitudes, interest, and motivation in mathematics. Because the frameworks were developed to include content, performance expectations, and perspectives for the entire span of curricula from the beginning of schooling through the completion of secondary school, some aspects may not be reflected in the eighth-grade TIMSS assessment.(2) Working within the framework, mathematics test specifications for TIMSS in 1995 were developed that included items representing a wide range of mathematics topics and eliciting a range of skills from students. The 1995 tests were developed through an international consensus involving input from mathematics experts and measurement specialists, ensuring that they reflected current thinking and priorities in mathematics.
About one-third of the items in the 1995 assessment were kept secure to measure trends over time; the remaining items were released for public use. An essential part of the development of the 1999 assessment, therefore, was to replace the released items with items of similar content, format, and difficulty. With the assistance of the Science and Mathematics Item Replacement Committee, a group of internationally prominent mathematics and science educators nominated by participating countries to advise on subject-matter issues in the assessment, over 300 mathematics and science items were developed as potential replacements. After an extensive process of review and field testing, 114 items were selected for use as replacements in the 1999 mathematics assessment.
Exhibit A.2 presents the five content areas included in the 1999 mathematics test and the numbers of items and score points in each area. Distributions are also included for the five performance categories derived from the performance expectations aspect of the curriculum framework. About one-fourth of the items were in the free-response format, requiring students to generate and write their own answers. Designed to take about one-third of students' test time, some free-response questions asked for short answers while others required extended responses, with students showing their work or providing explanations for their answers. The remaining questions used a multiple-choice format. In scoring the tests, correct answers to most questions were worth one point. Consistent with the approach of allotting students longer response time for constructed-response questions than for multiple-choice questions, however, responses to some of these questions (particularly those requiring extended responses) were evaluated for partial credit, with a fully correct answer being awarded two points (see the later section on scoring). The total number of score points available for analysis thus somewhat exceeds the number of items.
Every effort was made to help ensure that the tests represented the curricula of the participating countries and that the items exhibited no bias towards or against particular countries. The final forms of the tests were endorsed by the NRCs of the participating countries.(3)
Not all of the students in the TIMSS assessment responded to all of the mathematics items. To ensure broad subject-matter coverage without overburdening individual students, TIMSS used a rotated design that included both the mathematics and science items. Thus, the same students participated in both the mathematics and science testing. As in 1995, the 1999 assessment consisted of eight booklets, each requiring 90 minutes of response time. Each participating student was assigned one booklet only. In accordance with the design, the mathematics and science items were assembled into 26 clusters (labeled A through Z). The secure trend items were in clusters A through H, and items replacing the released 1995 items in clusters I through Z. Eight of the clusters were designed to take 12 minutes to complete; 10 of the clusters, 22 minutes; and 8 clusters, 10 minutes. In all, the design provided 396 testing minutes, 198 for mathematics and 198 for science. Cluster A was a core cluster assigned to all booklets. The remaining clusters were assigned to the booklets in accordance with the rotated design so that representative samples of students responded to each cluster.(4)
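As a quick check of the timing arithmetic described above, the short sketch below simply tallies the cluster counts and durations reported in this paragraph; the variable names are illustrative only.

```python
# Illustrative tally of the cluster plan described above: 26 clusters in
# three duration groups, yielding 396 testing minutes split evenly between
# mathematics and science.
cluster_plan = [
    (8, 12),   # 8 clusters of 12 minutes each
    (10, 22),  # 10 clusters of 22 minutes each
    (8, 10),   # 8 clusters of 10 minutes each
]

total_clusters = sum(count for count, _ in cluster_plan)
total_minutes = sum(count * minutes for count, minutes in cluster_plan)

print(total_clusters)      # 26 clusters (A through Z)
print(total_minutes)       # 396 testing minutes
print(total_minutes // 2)  # 198 minutes each for mathematics and science
```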
TIMSS in 1999 administered a broad array of questionnaires to collect data on the educational context for student achievement and to measure trends since 1995. National Research Coordinators, with the assistance of their curriculum experts, provided detailed information on the organization, emphases, and content coverage of the mathematics and science curriculum. The students who were tested answered questions pertaining to their attitudes towards mathematics and science, their academic self-concept, classroom activities, home background, and out-of-school activities. The mathematics and science teachers of sampled students responded to questions about teaching emphasis on the topics in the curriculum frameworks, instructional practices, professional training and education, and their views on mathematics and science. The heads of schools responded to questions about school staffing and resources, mathematics and science course offerings, and teacher support.
The TIMSS instruments were prepared in English and translated into 33 languages, with 10 of the 38 countries collecting data in two languages. In addition, it sometimes was necessary to modify the international versions for cultural reasons, even in the nine countries that tested in English. This process represented an enormous effort for the national centers, with many checks along the way. The translation effort included (1) developing explicit guidelines for translation and cultural adaptation; (2) translation of the instruments by the national centers in accordance with the guidelines, using two or more independent translations; (3) consultation with subject-matter experts on cultural adaptations to ensure that the meaning and difficulty of items did not change; (4) verification of translation quality by professional translators from an independent translation company; (5) corrections by the national centers in accordance with the suggestions made; (6) verification by the International Study Center that corrections were made; and (7) a series of statistical checks after the testing to detect items that did not perform comparably across countries.(5)
TIMSS in 1995 had as its target population students enrolled in the two adjacent grades that contained the largest proportion of 13-year-old students at the time of testing, which were seventh- and eighth-grade students in most countries. TIMSS in 1999 used the same definition to identify the target grades, but assessed students in the upper of the two grades only, which was the eighth grade in most countries, including the United States.(6) The eighth grade was the target population for all of the Benchmarking participants.
The selection of valid and efficient samples was essential to the success of TIMSS and of the Benchmarking Study. For TIMSS internationally, NRCs, including Westat, the sampling and data collection coordinator for TIMSS in the United States, received training in how to select the school and student samples and in the use of the sampling software, and worked in close consultation with Statistics Canada, the TIMSS sampling consultants, on all phases of sampling. As well as conducting the sampling and data collection for the U.S. national TIMSS sample, Westat was also responsible for sampling and data collection in each of the Benchmarking states, districts, and consortia.
To document the quality of the school and student samples in each of the TIMSS countries, staff from Statistics Canada and the International Study Center worked with the TIMSS sampling referee (Keith Rust, Westat) to review sampling plans, sampling frames, and sampling implementation. Particular attention was paid to coverage of the target population and to participation by the sampled schools and students. The data from the few countries that did not fully meet all of the sampling guidelines are annotated in the TIMSS international reports, and are also annotated in this report. The TIMSS samples for the Benchmarking participants were also carefully reviewed in light of the TIMSS sampling guidelines, and the results annotated where appropriate. Since Westat was the sampling contractor for the Benchmarking project, the role of sampling referee for the Benchmarking review was filled by Pierre Foy, of Statistics Canada.
Although all countries and Benchmarking participants were expected to draw samples representative of the entire internationally desired population (all students in the upper of the two adjacent grades with the greatest proportion of 13-year-olds), the few countries where this was not possible were permitted to define a national desired population that excluded part of the internationally desired population. Exhibit A.3 shows any differences in coverage between the international and national desired populations. Almost all TIMSS countries achieved 100 percent coverage (36 out of 38), with Lithuania and Latvia the exceptions. Consequently, the results for Lithuania are annotated, and because coverage fell below 65 percent for Latvia, the Latvian results are labeled Latvia (LSS), for Latvian-Speaking Schools. Additionally, because of scheduling difficulties, Lithuania was unable to test its eighth-grade students in May 1999 as planned. Instead, the students were tested in September 1999, when they had moved into the ninth grade. The results for Lithuania are annotated to reflect this as well. Exhibit A.3 also shows that the sampling plans for the Benchmarking participants all incorporated 100 percent coverage of the desired population. Four of the 13 states (Idaho, Indiana, Michigan, and Pennsylvania) as well as the Southwest Pennsylvania Math and Science Collaborative included private schools as well as public schools.
In operationalizing their desired eighth-grade population, countries and Benchmarking participants could define a population to be sampled that excluded a small percentage (less than 10 percent) of certain kinds of schools or students that would be very difficult or resource-intensive to test (e.g., schools for students with special needs or schools that were very small or located in extremely rural areas). Exhibit A.3 also shows that the degree of such exclusions was small. Among countries, only Israel reached the 10 percent limit, and among Benchmarking participants, only Guilford County and Montgomery County did so. All three are annotated as such in the achievement chapters of this report.
Within countries, TIMSS used a two-stage sample design, in which the first stage involved selecting about 150 public and private schools in each country. Within each school, countries were to use random procedures to select one mathematics class at the eighth grade. All of the students in that class were to participate in the TIMSS testing. This approach was designed to yield a representative sample of about 3,750 students per country. Typically, between 450 and 3,750 students responded to each achievement item in each country, depending on the booklets in which the items appeared.
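The two-stage logic can be illustrated with a minimal sketch. The function and data names below are hypothetical, and simple random selection stands in for the operational school-sampling procedure, which involved stratification and weighting considerations not described here.

```python
import random

def two_stage_sample(schools, n_schools=150, seed=1999):
    """Minimal sketch: sample schools, then one intact eighth-grade
    mathematics class per sampled school; every student in that class
    is included. `schools` maps a school id to its list of class rosters."""
    rng = random.Random(seed)
    sampled_schools = rng.sample(list(schools), k=min(n_schools, len(schools)))
    students = []
    for school_id in sampled_schools:
        chosen_class = rng.choice(schools[school_id])  # one mathematics class
        students.extend(chosen_class)                  # all students in the class
    return sampled_schools, students

# Fabricated example: 200 schools, each with three classes of 25 students.
schools = {f"school_{i}": [[f"s{i}_{c}_{k}" for k in range(25)] for c in range(3)]
           for i in range(200)}
_, students = two_stage_sample(schools)
print(len(students))  # roughly 150 schools x 25 students = 3,750 students
```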
States participating in the Benchmarking study were required to sample at least 50 schools and approximately 2,000 eighth-grade students. School districts and consortia were required to sample at least 25 schools and at least 1,000 students. Where there were fewer than 25 schools in a district or consortium, all schools were to be included, and the within-school sample increased to yield the total of 1,000 students.
Exhibits A.4 and A.5 present achieved sample sizes for schools and students, respectively, for the TIMSS countries and for the Benchmarking participants. Where a district or consortium was part of a state that also participated, the state sample was augmented by the district or consortium sample, properly weighted in accordance with its size. Schools in a state that were sampled as part of the U.S. national TIMSS sample were also used to augment the state sample. For example, the Illinois sample consists of 90 schools, 41 from the state Benchmarking sample (including five schools from the national TIMSS sample), 27 from the Chicago Public Schools, 17 from the First in the World Consortium, and five from the Naperville School District.
Exhibit A.6 shows the participation rates for schools, students, and overall, both with and without the use of replacement schools, for TIMSS countries and Benchmarking participants. All of the countries met the guideline for sampling participation: 85 percent of both the schools and the students, or a combined rate (the product of school and student participation) of 75 percent. Belgium (Flemish), England, Hong Kong, and the Netherlands did so only after including replacement schools, however, and are annotated accordingly in the achievement chapters.
With the exception of Pennsylvania and Texas, all the Benchmarking participants met the sampling guidelines, although Indiana did so only after including replacement schools. Indiana is annotated to reflect this in the achievement chapters, and Pennsylvania and Texas are italicized in all exhibits in this report.
Each participating country was responsible for carrying out all aspects of the data collection, using standardized procedures developed for the study. Training manuals were created for school coordinators and test administrators that explained procedures for receipt and distribution of materials as well as for the activities related to the testing sessions. These manuals covered procedures for test security, standardized scripts to regulate directions and timing, rules for answering students' questions, and steps to ensure that identification on the test booklets and questionnaires corresponded to the information on the forms used to track students. As the data collection contractor for the U.S. national TIMSS, Westat was fully acquainted with the TIMSS procedures, and applied them in each of the Benchmarking jurisdictions in the same way as in the national data collection.
Each country was responsible for conducting quality control procedures and describing this effort in the NRC's report documenting procedures used in the study. In addition, the International Study Center considered it essential to monitor compliance with standardized procedures through an international program of quality control site visits. NRCs were asked to nominate one or more persons unconnected with their national center, such as retired school teachers, to serve as quality control monitors for their countries. The International Study Center developed manuals for the monitors and briefed them in two-day training sessions about TIMSS, the responsibilities of the national centers in conducting the study, and their own roles and responsibilities. In all, 71 international quality control monitors participated in this training.
The international quality control monitors interviewed the NRCs about data collection plans and procedures. They also visited a sample of 15 schools in each country, where they observed testing sessions and interviewed school coordinators.(7) Quality control monitors interviewed school coordinators in all 38 countries, and observed a total of 550 testing sessions. The results of the interviews conducted by the international quality control monitors indicated that, in general, NRCs had prepared well for data collection and, despite the heavy demands of the schedule and shortages of resources, were able to conduct the data collection efficiently and professionally. Similarly, the TIMSS tests appeared to have been administered in compliance with international procedures, including the activities before the testing session, those during testing, and the school-level activities related to receiving, distributing, and returning material from the national centers.
As a parallel quality control effort for the Benchmarking project, the International Study Center recruited and trained a team of 18 quality control observers, and sent them to observe the data collection activities of the Westat test administrators in a sample of about 10 percent of the schools in the study (98 schools in all).(8) In line with the experience internationally, the observers reported that the data collection was conducted successfully according to the prescribed procedures, and that no serious problems were encountered.
Because about one-third of the written test time was devoted to free-response items, TIMSS needed to develop procedures for reliably evaluating student responses within and across countries. Scoring used two-digit codes with rubrics specific to each item. The first digit designates the correctness level of the response. The second digit, combined with the first, represents a diagnostic code identifying specific types of approaches, strategies, or common errors and misconceptions. Although not used in this report, analyses of responses based on the second digit should provide insight into ways to help students better understand mathematics concepts and problem-solving approaches.
To ensure reliable scoring procedures based on the TIMSS rubrics, the International Study Center prepared detailed guides containing the rubrics and explanations of how to implement them, together with example student responses for the various rubric categories. These guides, along with training packets containing extensive examples of student responses for practice in applying the rubrics, were used as a basis for intensive training in scoring the free-response items. The training sessions were designed to help representatives of national centers who would then be responsible for training personnel in their countries to apply the two-digit codes reliably. In the United States, the scoring was conducted by National Computer Systems (NCS) under contract to Westat. To ensure that student responses from the Benchmarking participants were scored in the same way as those from the U.S. national sample, NCS had both sets of data scored at the same time and by the same scoring staff.
To gather and document empirical information about the within-country agreement among scorers, TIMSS arranged to have systematic subsamples of at least 100 students' responses to each item coded independently by two readers. Exhibit A.7 shows the average and range of the within-country percent of exact agreement between scorers on the free-response items in the mathematics test for 37 of the 38 countries. A high percentage of exact agreement was observed, with an overall average of 99 percent across the 37 countries. The TIMSS data from the reliability studies indicate that scoring procedures were robust for the mathematics items, especially for the correctness score used for the analyses in this report. In the United States, the average percent exact agreement was 99 percent for the correctness score and 96 percent for the diagnostic score. Since the Benchmarking data were combined with the U.S. national TIMSS sample for scoring purposes, this high level of scoring reliability applies to the Benchmarking data also.
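The agreement statistic reported in Exhibit A.7 is simply the percentage of double-scored responses on which the two readers assigned the same code. A minimal sketch, with fabricated codes and a hypothetical helper name, is shown below.

```python
def percent_exact_agreement(codes_reader1, codes_reader2):
    """Percent of responses on which two independent scorers agree exactly."""
    pairs = list(zip(codes_reader1, codes_reader2))
    agreements = sum(1 for a, b in pairs if a == b)
    return 100.0 * agreements / len(pairs)

# Fabricated two-digit codes: first digit = correctness level,
# second digit = diagnostic code.
reader1 = [20, 10, 70, 21, 10]
reader2 = [20, 10, 71, 21, 10]

print(percent_exact_agreement(reader1, reader2))  # agreement on the full two-digit code
print(percent_exact_agreement([a // 10 for a in reader1],
                              [b // 10 for b in reader2]))  # correctness digit only
```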
Exhibit A.8 displays the mathematics test reliability coefficient for each country and Benchmarking participant. This coefficient is the median KR-20 reliability across the eight test booklets. Among countries, median reliabilities ranged from 0.76 in the Philippines to 0.94 in Chinese Taipei. The international median, 0.89, is the median of the reliability coefficients for all countries. Reliability coefficients among Benchmarking participants were generally close to the international median, ranging from 0.88 to 0.91 across states, and from 0.84 to 0.91 across districts and consortia.
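For reference, a minimal sketch of this reliability summary is given below: Cronbach's alpha (which reduces to KR-20 when all items are scored 0 or 1) is computed for each booklet's item-by-student score matrix, and the median across booklets is reported. The data layout and function names are assumptions for illustration.

```python
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, all over the same students."""
    k = len(item_scores)
    n_students = len(item_scores[0])
    item_variances = [statistics.pvariance(item) for item in item_scores]
    totals = [sum(item[i] for item in item_scores) for i in range(n_students)]
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def median_booklet_reliability(booklets):
    """booklets: one item-by-student score matrix per test booklet."""
    return statistics.median(cronbach_alpha(matrix) for matrix in booklets)
```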
To ensure the availability of comparable, high-quality data for analysis, TIMSS took rigorous quality control steps to create the international database.(9) TIMSS prepared manuals and software for countries to use in entering their data, so that the information would be in a standardized international format before being forwarded to the IEA Data Processing Center in Hamburg for creation of the international database. Upon arrival at the Data Processing Center, the data underwent an exhaustive cleaning process. This involved several iterative steps and procedures designed to identify, document, and correct deviations from the international instruments, file structures, and coding schemes. The process also emphasized consistency of information within national data sets and appropriate linking among the many student, teacher, and school data files. In the United States, the creation of the data files for both the Benchmarking participants and the U.S. national TIMSS effort was the responsibility of Westat, working closely with NCS. After the data files were checked carefully by Westat, they were sent to the IEA Data Processing Center, where they underwent further validity checks before being forwarded to the International Study Center.
The general approach to reporting the TIMSS achievement data was based primarily on item response theory (IRT) scaling methods.(10) The mathematics results were summarized using a family of 2-parameter and 3-parameter IRT models for dichotomously scored items (right or wrong), and generalized partial credit models for items with 0, 1, or 2 available score points. The IRT scaling method produces a score by averaging the responses of each student to the items that he or she took in a way that takes into account the difficulty and discriminating power of each item. The methodology used in TIMSS includes refinements that enable reliable scores to be produced even though individual students responded to relatively small subsets of the total mathematics item pool. Achievement scales were produced for each of the five mathematics content areas (fractions and number sense; measurement; data representation, analysis, and probability; geometry; and algebra), as well as for mathematics overall.
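For the dichotomous items, the item response function underlying these models can be sketched as follows. This is a generic three-parameter logistic form, with the two-parameter model as the special case c = 0; the parameter values and the 1.7 scaling constant are common illustrative conventions, not the operational TIMSS item parameters.

```python
import math

def p_correct_3pl(theta, a, b, c=0.0, scale=1.7):
    """Probability that a student with proficiency theta answers correctly,
    given item discrimination a, difficulty b, and lower asymptote
    (guessing) c. With c = 0 this is the two-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

# Illustrative values: an average student (theta = 0) on an item of average
# difficulty (b = 0) with a guessing parameter of 0.20.
print(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.20))  # 0.6
```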
The IRT methodology was preferred for developing comparable estimates of performance for all students, since students answered different test items depending upon which of the eight test booklets they received. The IRT analysis provides a common scale on which performance can be compared across countries. In addition to providing a basis for estimating mean achievement, scale scores permit estimates of how students within countries vary and provide information on percentiles of performance. To provide a reliable measure of student achievement in both 1999 and 1995, the overall mathematics scale was calibrated using students from the countries that participated in both years. When all countries participating in 1995 at the eighth grade are treated equally, the TIMSS scale average over those countries is 500 and the standard deviation is 100. Since the countries varied in size, each country was weighted to contribute equally to the mean and standard deviation of the scale. The average and standard deviation of the scale scores are arbitrary and do not affect scale interpretation. When the metric of the scale had been established, students from the countries that tested in 1999 but not 1995 were assigned scores on the basis of the new scale. IRT scales were also created for each of the five mathematics content areas for the 1999 data. Students from the Benchmarking samples were assigned scores on the overall mathematics scale as well as in each of the five mathematics content areas using the same item parameters and estimation procedures as for TIMSS internationally.
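The step of setting the reporting metric can be sketched as below: provisional scores from the 1995 calibration sample are linearly transformed so that, with every country weighted equally, the mean is 500 and the standard deviation is 100. Data structures and names are hypothetical.

```python
import math

def reporting_metric(country_scores):
    """country_scores: mapping of country -> list of provisional scale scores.
    Returns the slope and intercept of the linear transformation that sets
    the equally weighted country mean to 500 and standard deviation to 100."""
    scores, weights = [], []
    for provisional in country_scores.values():
        w = 1.0 / len(provisional)   # each country carries equal total weight
        scores.extend(provisional)
        weights.extend([w] * len(provisional))

    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, scores)) / total_weight
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, scores)) / total_weight
    slope = 100.0 / math.sqrt(variance)
    intercept = 500.0 - slope * mean
    return slope, intercept          # reported score = slope * provisional + intercept
```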
To allow more accurate estimation of summary statistics for student subpopulations, the TIMSS scaling made use of plausible-value technology, whereby five separate estimates of each student's score were generated on each scale, based on the student's responses to the items in his or her booklet and on the student's background characteristics. The five score estimates are known as plausible values, and the variability between them encapsulates the uncertainty inherent in the score estimation process.
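A minimal sketch of how a population statistic is estimated from plausible values follows: the statistic is computed once with each of the five plausible values and the results are averaged, while the variance among the five estimates measures the uncertainty due to score estimation. Names and data shapes are illustrative.

```python
import statistics

def pv_estimate(plausible_values):
    """plausible_values: one list of five plausible values per student.
    Returns the point estimate (mean over plausible values) and the
    between-plausible-value (imputation) variance."""
    n_pv = len(plausible_values[0])
    per_pv_means = [statistics.fmean(student[m] for student in plausible_values)
                    for m in range(n_pv)]
    point_estimate = statistics.fmean(per_pv_means)
    imputation_variance = statistics.variance(per_pv_means)
    return point_estimate, imputation_variance
```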
Because the statistics presented in this report are estimates of performance based on samples of students, rather than the values that could be calculated if every student in every country or Benchmarking jurisdiction had answered every question, it is important to have measures of the degree of uncertainty of the estimates. The jackknife procedure was used to estimate the standard error associated with each statistic presented in this report.(11) The jackknife standard errors also include an error component due to variation between the five plausible values generated for each student. The use of confidence intervals, based on the standard errors, provides a way to make inferences about the population means and proportions in a manner that reflects the uncertainty associated with the sample estimates. An estimated sample statistic plus or minus two standard errors represents a 95 percent confidence interval for the corresponding population result.
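The error calculation can be sketched as follows, assuming the jackknife replicate estimates and the five per-plausible-value estimates have already been computed; the replicate-weighting details of the jackknife itself are omitted here.

```python
import math
import statistics

def standard_error(full_estimate, replicate_estimates, pv_estimates):
    """Combine the jackknife sampling variance with the variance between
    the plausible-value estimates, as described above."""
    sampling_variance = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    m = len(pv_estimates)
    imputation_variance = (1 + 1 / m) * statistics.variance(pv_estimates)
    return math.sqrt(sampling_variance + imputation_variance)

def confidence_interval_95(estimate, se):
    """Estimate plus or minus two standard errors, per the text above."""
    return estimate - 2 * se, estimate + 2 * se
```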
This report makes extensive use of statistical hypothesis-testing to provide a basis for evaluating the significance of differences in percentages and in average achievement scores. Each separate test follows the usual convention of holding to 0.05 the probability that reported differences could be due to sampling variability alone. However, in exhibits where statistical significance tests are reported, the results of many tests are reported simultaneously, usually at least one for each country and Benchmarking participant in the exhibit. The significance tests in these exhibits are based on a Bonferroni procedure for multiple comparisons that holds to 0.05 the probability of erroneously declaring a statistic (mean or percentage) for one entity to be different from that for another entity. In the multiple comparison charts (Exhibit 1.2 and those in Appendix B), the Bonferroni procedure adjusts for the number of entities in the chart, minus one. In exhibits where a country or Benchmarking participant statistic is compared to the international average, the adjustment is for the number of entities.(12)
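The adjustment can be illustrated with a short sketch: the 0.05 significance level is divided by the number of comparisons in the exhibit before the critical value for each test is found. The function names are hypothetical, and the normal critical value is used purely for illustration.

```python
from statistics import NormalDist

def bonferroni_critical_value(n_entities, versus_international_average=False,
                              alpha=0.05):
    """Critical value after dividing alpha by the number of comparisons:
    the number of entities minus one for the multiple comparison charts,
    or the number of entities when comparing against the international
    average, as described above."""
    n_comparisons = n_entities if versus_international_average else n_entities - 1
    adjusted_alpha = alpha / n_comparisons
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # two-sided test

def significantly_different(difference, se_difference, critical_value):
    return abs(difference / se_difference) > critical_value

print(round(bonferroni_critical_value(38), 2))  # well above the unadjusted 1.96
```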
International benchmarks of student achievement were computed at each grade level for both mathematics and science. The benchmarks are points in the weighted international distribution of achievement scores that mark off the top 10 percent of students, the top 25 percent, the top 50 percent, and the bottom 25 percent. The percentage of students in each country and Benchmarking jurisdiction meeting or exceeding the international benchmarks is reported. The benchmarks correspond to the 90th, 75th, 50th, and 25th percentiles of the international distribution of achievement. When computing these percentiles, each country contributed as many students to the distribution as there were students in the target population in the country. That is, each country's contribution to setting the international benchmarks was proportional to the estimated population enrolled at the eighth grade.
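A minimal sketch of locating the benchmarks is given below: the pooled scores are sorted and the weighted 90th, 75th, 50th, and 25th percentiles are found, with student weights scaled so that each country's total weight is proportional to its estimated eighth-grade enrolment. The names and the simple interpolation-free percentile rule are illustrative assumptions.

```python
def weighted_percentile(scores, weights, pct):
    """Smallest score at which the cumulative weight reaches pct percent."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    total = sum(weights)
    cumulative = 0.0
    for i in order:
        cumulative += weights[i]
        if cumulative / total >= pct / 100.0:
            return scores[i]
    return scores[order[-1]]

def international_benchmarks(scores, weights):
    """Benchmark points at the 90th, 75th, 50th, and 25th weighted percentiles."""
    return {pct: weighted_percentile(scores, weights, pct) for pct in (90, 75, 50, 25)}
```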
In order to interpret the TIMSS scale scores and analyze achievement at the international benchmarks, TIMSS conducted a scale anchoring analysis to describe achievement of students at those four points on the scale. Scale anchoring is a way of describing students' performance at different points on a scale in terms of what they know and can do. It involves a statistical component, in which items that discriminate between successive points on the scale are identified, and a judgmental component, in which subject-matter experts examine the items and generalize to students' knowledge and understandings.(13)
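The statistical component of the anchoring can be sketched roughly as follows. An item is treated as anchoring at a benchmark if students scoring at that benchmark usually answer it correctly while students at the next lower benchmark usually do not; the 65 and 50 percent thresholds used here are placeholders for illustration, not the criteria actually applied in the TIMSS analysis.

```python
def anchors_at(pct_correct_at_benchmark, pct_correct_at_lower_benchmark,
               upper_threshold=65.0, lower_threshold=50.0):
    """Rough sketch: an item discriminates between two successive benchmark
    points if it is answered correctly by most students at the higher point
    but not by most students at the lower point. Thresholds are illustrative."""
    return (pct_correct_at_benchmark >= upper_threshold
            and pct_correct_at_lower_benchmark < lower_threshold)

# Fabricated example: 72 percent correct at the upper benchmark,
# 41 percent correct at the next lower benchmark.
print(anchors_at(72.0, 41.0))  # True
```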
In an effort to collect information about the content of the intended curriculum in mathematics, TIMSS asked National Research Coordinators and Coordinators from the Benchmarking jurisdictions to complete a questionnaire about the structure, organization, and content coverage of their curricula. Coordinators reviewed 56 mathematics topics and reported the percentage of their eighth-grade students for whom each topic was intended in their curriculum. Although most topic descriptions were used without modification, there were occasions when Coordinators found it necessary to expand on or qualify the topic description to describe their situation accurately. The country-specific adaptations to the mathematics curriculum questionnaire are presented in Exhibit A.9. No adaptations to the list of topics were necessary for the U.S. national version, nor were any adaptations made by any Benchmarking participants.