Direct Observational Assessment During Test Sessions and Child Clinical Interviews
...t session (e.g., length of session; child's medication status during testing). Immediately after completing the test session, examiners rate the child on the 125 TOF items. The TOF manual (McConaughy & Achenbach, 2004) provides rules for scoring the items 0, 1, 2, or 3 along with examples of behaviors tapped by different items. Test examiners are instructed to choose the one item that best describes each discrete observed behavior. McConaughy and Achenbach (2004) conducted a series of exploratory and confirmatory factor analyses to derive the five TOF syndrome scales listed in Table 1: Withdrawn/Depressed, Language/Thought Problems, Anxious, Oppositional, and Attention Problems. Second-order factor analyses produced an Internalizing scale (Withdrawn/Depressed plus Language/Thought Problems) and an Externalizing scale (Oppositional plus Attention Problems). The Anxious syndrome was not strongly associated with either Internalizing or Externalizing, but instead seemed to represent anxiety and related problems that were more specific to test situations. The TOF also has a DSM-oriented Attention Deficit/Hyperactivity Problems (ADHP) scale composed of 22 items describing problems consistent with DSM-IV (American Psychiatric Association, 2000) criteria for ADHD, plus Inattention and Hyperactivity-Impulsivity subscales, each of which has 11 items. The TOF Profile provides T scores and percentiles for boys and girls for three age groups: 2-5, 6-11, and 12-18. The TOF normative sample includes 3,943 nonreferred children who were tested with the Stanford-Binet Intelligence Scales-Fifth Edition (SB5; Roid, 2003). Of these, 2,442 children were part of the standardization sample for the SB5, which was stratified on the basis of the 2000 U.S. Census according to age, race/ethnicity, gender, and SES. The mean SB5 FSIQ for the TOF normative sample ranged from 100.2 to 104.6 (SD = 11.3 to 12.7).
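The item-to-scale arithmetic described above can be sketched in a few lines. This is only an illustration, and it rests on several assumptions that are not taken from the TOF manual: that a raw scale score is the simple sum of its 0-3 item ratings, that unrated items count as 0, and that the item numbers shown belong to the two ADHP subscales (they are hypothetical placeholders). The published scoring rules, T scores, and percentiles come from the manual and its norms, which this sketch does not reproduce.

```python
from typing import Mapping, Sequence

VALID_RATINGS = {0, 1, 2, 3}  # the 0-3 rating scale described in the text


def raw_scale_score(ratings: Mapping[int, int], scale_items: Sequence[int]) -> int:
    """Sum the 0-3 ratings for the items assigned to one scale.

    Assumption: an item the examiner did not rate contributes 0.
    """
    total = 0
    for item in scale_items:
        rating = ratings.get(item, 0)
        if rating not in VALID_RATINGS:
            raise ValueError(f"item {item}: rating {rating!r} is not 0, 1, 2, or 3")
        total += rating
    return total


# Hypothetical item numbers; the real item-to-scale assignments are in the manual.
INATTENTION_ITEMS = list(range(1, 12))                 # 11 items
HYPERACTIVITY_IMPULSIVITY_ITEMS = list(range(12, 23))  # 11 items

# Made-up ratings for one child, keyed by item number.
ratings = {1: 2, 4: 1, 11: 3, 12: 1, 20: 2}

inattention = raw_scale_score(ratings, INATTENTION_ITEMS)
hyperactivity = raw_scale_score(ratings, HYPERACTIVITY_IMPULSIVITY_ITEMS)
adhp_total = inattention + hyperactivity  # 22 items in all, so at most 22 * 3 = 66

print(inattention, hyperactivity, adhp_total)
```

Under these assumptions, the ADHP total is the sum of the two 11-item subscales and can range from 0 to 66.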
Rating the TOF items takes approximately 5 to 10 minutes, once users become familiar with the scoring rules. Hand-scoring or computer-scoring the TOF Profile takes an additional 10 to 15 minutes. In addition to printing scores and profiles for all TOF scales, the computer-scoring program produces a narrative report that summarizes the TOF scores, plus demographic information for the child.
Psychometric Properties of the TOF
McConaughy and Achenbach (2004) reported good test-retest reliabilities for the TOF, with rs from .53 to .87 for 10 scales plus Total Problems and a mean r of .80, as summarized in Table 1. Withdrawn/Depressed, Attention Problems, ADHP Total, and Total Problems showed the highest test-retest reliabilities (rs = .85). Interrater reliabilities between trained lay observers and test examiners ranged from .42 to .78 for 9 scales plus Total Problems, with a mean r of .62. (The interrater r for the Anxious syndrome was not significant.) Internal consistencies ranged from .74 to .94 for the total sample, with a mean alpha of .84. Because the TOF scales were derived from samples who were administered several different intelligence tests, McConaughy and Achenbach (2004) maintain that the TOF is not restricted to use with the SB5. Support for this wider use is provided by analyses of TOF scale scores for a sample of 55 children who were administered the SB5 and WISC-III. Multivariate analyses of variance (MANOVAs) and subsequent post hoc tests showed no significant differences between TOF mean raw scores based on administration of the WISC-III versus SB5 for any of the 11 scales. Correlations between TOF scores for the WISC-III and SB5 for the same sample ranged from .52 to .97, with a mean r of .79. The lowest correlation was for the TOF Inattention subscale. McConaughy and Achenbach (2004) tested criterion-related validity of the TOF for demographically matched samples of clinically referred versus nonreferred children.
Six- to 11-year-old referred children scored significantly (p < .05) higher than nonreferred children on all 11 TOF scales (n = 906). Twelve- to 18-year-old referred children scored significantly higher than nonreferred children on all TOF scales, except the ADHP Inattention and Hyperactivity-Impulsivity subscales (n = 650). McConaughy, Volpe, and Eiraldi (2005) tested the discriminant validity of the TOF for differentiating children with DSM-IV ADHD diagnoses from clinically referred children without ADHD and normal controls, as well as for differentiating the ADHD-Combined subtype from the ADHD-Inattentive subtype. Children with ADHD scored significantly higher (p < .05) than clinically referred children without ADHD on 8 of 11 TOF scales and significantly higher than control children on 10 TOF scales. As predicted, the ADHD-Combined type scored significantly higher than the ADHD-Inattentive type on the TOF ADHP Total Score and the Hyperactivity-Impulsivity subscale, plus the TOF Oppositional, Externalizing, and Total Problems scales. Construct validity of the TOF was supported by correlations of .60 to .76 between scores on comparable TOF and GATSB scales. There were no significant correlations between the TOF Withdrawn/Depressed syndrome and any GATSB scale, most likely because the TOF Withdrawn/Depressed syndrome measures a construct not assessed by the GATSB.
Associations Between Test Session Observations and Standardized Test Scores
Glutting and colleagues coined the term intrasession validity (a type of criterion-related validity) to describe the strength of associations between test session observations and formal test scores (Glutting & McDermott, 1988; Glutting, Oakland, & McDermott, 1989; Glutting & Oakland, 1993).
To synthesize previous findings on intrasession validity, Glutting, Youngstrom, Oakland, and Watkins (1996) conducted a meta-analysis of six studies (33 correlations), which produced an average correlation of -.34 between test session observations and children's IQ scores during the same test session. The top half of Table 2 summarizes findings on intrasession validity for the GATSB and WISC-III FSIQ. Glutting et al. (1996) reported correlations of -.21 to -.39 between GATSB scale scores and WISC-III FSIQ and an average r of -.27 across all GATSB scales and WISC-III IQ and composite scores. Similarly, Daleiden, Drabman, and Benton (2002) reported correlations of -.28 to -.47 between GATSB scores and WISC-III FSIQ, plus similar correlations between GATSB scores and the Broad Cognitive Index of the Woodcock-Johnson-Revised (WJ-R; Woodcock & Johnson, 1989) and WJ-R broad achievement scores. The bottom half of Table 2 summarizes intrasession validity for the TOF and WISC-III and SB5 FSIQ. McConaughy et al. (2005) reported correlations of -.19 to -.31 between four TOF scales and WISC-III FSIQ, plus similar correlations for WISC-III VIQ and PIQ. TOF Language/Thought Problems showed medium correlations with WISC-III FSIQ and VIQ (r = -.31 to -.32). McConaughy and Achenbach (2004) reported correlations of -.17 to -.40 between 10 TOF scales and SB5 FSIQ, plus similar correlations for SB5 VIQ and NVIQ. TOF Language/Thought Problems, Internalizing, and the Inattention subscale showed medium correlations with SB5 scores (r = -.37 to -.40). (Fewer significant correlations between TOF scores and WISC-III IQs than for the TOF and SB5 were probably due to differences in sample size.) Intrasession validity coefficients for the GATSB and TOF were consistent with the average correlation of -.34 from Glutting et al.'s (1996) meta-analysis. In general, these findings indicate that lower IQ scores are associated with more test session behavior problems and vice versa.
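As a rough illustration of what correlations of this size imply, the short sketch below squares three of the coefficients reported above to express them as a proportion of shared variance. This is standard arithmetic for a linear association, not a computation reported in the cited studies, and the function name and labels are illustrative only.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share under a linear association (r squared)."""
    return r * r


# Correlations taken from the text above; the labels are shorthand for this sketch.
reported = [
    ("Meta-analytic average (Glutting et al., 1996)", -0.34),
    ("GATSB scales with WISC-III scores, average", -0.27),
    ("TOF with SB5 FSIQ, largest reported", -0.40),
]

for label, r in reported:
    print(f"{label}: r = {r:+.2f}, shared variance = {shared_variance(r):.1%}")
```

Even the largest of these coefficients corresponds to about 16% shared variance, so most of the variability in IQ scores is unrelated to rated test session behavior.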
However, test examiners should be cautious about making the causal inference that more behavior problems produce low test performance. The opposite is equally plausible: that difficulty in test performance produces more behavior problems. Moreover, the low to moderate correlations leave much room for individual variability. When an individual child does score unusually high (i.e., well above average) on one or more GATSB or TOF scales, examiners must then consider whether the obtained test scores still reflect the child's optimal test performance versus typical performance, or underestimate performance (Oakland & Glutting, 1998). Examiners can also use GATSB or TOF results to develop hypotheses about situational variables that might influence a child's behavior and test performance, such as one-on-one versus group attention, easy versus challenging tasks, or verbal versus performance tasks. Their hypotheses can then be tested in subsequent test sessions and other situations, such as classrooms, as part of functional behavioral assessment and intervention planning. The GATSB (Glutting & Oakland, 1993) and TOF (McConaughy & Achenbach, 2004) manuals each provide case examples to illustrate interpretations of different patterns of test session observations.
Associations Between Test Session Observations and Informant Reports
Glutting and colleagues coined the term exosession validity (also a form of criterion-related validity) to describe the strength of associations between test session observations and other informants' reports of children's behavior, such as reports by parents and teachers (Glutting & Oakland, 1993; Glutting et al., 1996). One way to test exosession validity is to obtain correlations between examiners' ratings of test session observations and parent and teacher ratings of behavior for the same children. In a synthesis of previous research, Glutting et al. (1996) found four studies (26 correlations) that addressed exosession validity.
Their meta-analysis of these studies produced an average correlation of .18 between test session observations and reports of behavior in other settings by parents or teachers. In their research with the GATSB, Glutting et al. (1996) reported an average correlation of .16 between GATSB scores and teachers' ratings of children's behavior on the Adjustment Scales for Children and Adolescents (ASCA; McDermott, 1994). The largest correlation was .39 between the GATSB Avoidance and ASCA Under-Reactivity scales. The GATSB Inattentiveness scale correlated .24 with the ASCA Attention-Deficit Hyperactivity scale and .33 with the ASCA Solitary Aggressive Impulsive scale. Table 3 summarizes findings on exosession validity for comparable scales of the GATSB and 1991 version of the CBCL (Achenbach, 1991) and between comparable scales of the TOF and the 2001 versions of the CBCL and TRF (Achenbach & Rescorla, 2001). Daleiden et al. (2002) reported significant correlations of .25 to .33 for comparable scales of the GATSB and CBCL. McConaughy and Achenbach (2004) reported significant positive correlations of .27 to .43 for comparable scales of the TOF and CBCL and .26 to .38 for comparable scales of the TOF and TRF. A negative correlation of -.26 was found between TOF Language/Thought Problems and CBCL Affective Problems. There were no significant correlations between the TOF Withdrawn/Depressed or Internalizing scales and any CBCL/TRF scales. Exosession validity is important to determine whether observations during test sessions generalize to other settings. Comparing test session observations with reports from other informants is also important for making placement decisions, such as special education disability determinations, formulating diagnoses, and planning intervention strategies. The correlations between the GATSB and CBCL were generally consistent with those between the TOF and CBCL/TRF (see Table 3).
The correlations for both instruments were also consistent with low to moderate correlations found in meta-analyses of cross-informant agreement (Achenbach, McConaughy, & Howell, 1987), as discussed in a later section. These findings suggest that there is considerable room for situational variability in children's behavior and thus caution against overgeneralizing from test session observations to other settings is warranted. Findings for the TOF and CBCL/TRF also suggest that standardized ratings of test session observations concur more with parent and teacher ratings of externalizing problems and attention problems than with internalizing problems. Situational specificity of children's behavior is addressed again in a later section.
Semistructured Clinical Interview for Children and Adolescents
Clinical interviews are critical for learning children's unique perspectives on their problems and life circumstances as well as establishing rapport for assessment and intervention. In addition to obtaining children's self-reports, clinical interviews provide opportunities to observe many different aspects of children's behavior, such as motor coordination, activity level, attention span and distractibility, nervous mannerisms, range of affect and mood states, interaction style, receptive and expressive language, reasoning, and problem-solving (Hughes & Baker, 1990; McConaughy, 2003, 2005; Sattler, 1998). Recording observations of these behavioral characteristics is as important as recording children's responses to interview questions. The Semistructured Clinical Interview for Children and Adolescents-Second Edition (SCICA; McConaughy & Achenbach, 2001) provides structured forms for rating and scoring behavioral observations and children's self-reported problems during clinical interviews. The SCICA is part of the ASEBA and was designed for children ages 6 to 18.
Interviewers use the SCICA Protocol to query children about their activities and interests, school, peer relations, family relations, self-awareness and feelings, and adolescent issues. The protocol form provides space for interviewers to record notes of their observations and children's responses to interview questions. After completing the SCICA, interviewers rate the child on the SCICA Observation and Self-Report Forms. The SCICA Observation Form contains 121 items for rating observations of children's behavior during the interview. Examples are: acts too young for age; cries; doesn't concentrate or pay attention long on tasks, questions, or topics; has difficulty expressing self verbally; shows off, clowns or acts silly; unhappy, sad, or depressed; and withdrawn, doesn't get involved with interviewer. The SCICA Self-Report Form contains 127 items for rating problems that children report during the interview. One hundred fifteen of the SCICA Observation items are comparable to TOF items. Sixty-two SCICA observation items and 87 self-report items are comparable to items on the CBCL, TRF, and/or YSR. Immediately after completing the SCICA, interviewers rate the child on each observation and self-report item, using a 4-point scale similar to the scale for scoring the TOF: 0 = no occurrence to 3 = definite occurrence with severe intensity or 3 or more minutes duration. The SCICA manual (McConaughy & Achenbach, 2001) provides rules for scoring the items 0, 1, 2, or 3 along with examples for choosing among different items. Interviewers are instructed to choose the one item that best describes each discrete observed behavior or self-reported problem. McConaughy and Achenbach (2004) conducted a series of exploratory and confirmatory factor analyses to derive the eight SCICA syndrome scales listed in Table 1.
Five syndromes were derived from interviewers' observations: AnxiousOB, Withdrawn/DepressedOB, Language/Motor-ProblemsOB, Attention ProblemsOB, and Self-Control ProblemsOB. Three syndromes were derived from interviewers' ratings of children's self-reported problems: Anxious/DepressedSR, Aggressive/Rule-BreakingSR, and Somatic ComplaintsSR (scored only for ages 12-18). (The superscripts OB and SR indicate scales derived from observations versus children's self-reports.) Second-order factor analyses produced an Internalizing scale (AnxiousOB and Anxious/DepressedSR) and an Externalizing scale (Aggressive/Rule-BreakingSR, Attention ProblemsOB, and Self-Control ProblemsOB). Six additional SCICA scales are composed of both observation and self-report items consistent with DSM-IV diagnoses. The SCICA Profile also provides separate scores for Total Observations and Total Self-Reports. For completeness, Table 1 lists all SCICA scales, although the observation scales are of most interest for this article. The SCICA Profile provides clinical T scores and percentiles for all scales based on a sample of 686 clinically referred children. Separate scores are provided for ages 6-11 and 12-18. Scoring the two SCICA rating forms takes approximately 10 minutes, once users become familiar with the scoring rules. Hand- or computer-scoring the SCICA profile takes an additional 10 to 15 minutes. The computer-scoring program prints profiles for all SCICA scales and produces a narrative report that summarizes the SCICA scores. Practitioners can also practice SCICA scoring procedures using a 90-minute training video. The video (available on tape or CD-ROM) depicts segments of the SCICA with six different children portrayed by child actors. The SCICA training computer-scoring program computes and prints comparisons between a trainee's scores on the SCICA Profile versus scores by experienced interviewers. 
Intraclass correlations indicate whether agreement is "poor," "fair," or "good" between the trainee's item and scale scores and those of the experienced interviewers for each segment (McConaughy, Arnold, Jacobowitz, & Achenbach, 2001).
Psychometric Properties of the SCICA
McConaughy and Achenbach (2001) reported good test-retest reliability for the SCICA over an average interval of 12 days, with rs ranging from .57 to .86 for 20 scales and two total scores, plus a mean r of .78 across all scales (see Table 1). Test-retest rs ranged from .61 to .78 for the observation scales. Interrater reliabilities between trained lay observers and interviewers ranged from .47 to .85, with a mean r of .74 across all SCICA scales. Interrater rs ranged from .60 to .75 for the observation scales. McConaughy and Achenbach (2001) tested criterion-related validity of the SCICA for demographically matched samples of clinically referred versus nonreferred children. Six- to 11-year-old clinically referred children scored significantly (p < .05) higher than nonreferred children on all SCICA scales, except AnxiousOB and Anxiety Problems (n = 198). Twelve- to 18-year-old children scored significantly higher than nonreferred children on all SCICA scales (n = 80). SCICA Total Observations showed large effects for both age groups (ES = 23 to 28% of variance). McConaughy and Achenbach (1996) tested the discriminant validity of the 1994 version of the SCICA for differentiating children with emotional and behavioral disorders (EBD) from matched samples of normal controls and children with learning disabilities (LD). Children with EBD scored significantly higher than control children on four observation scales and one self-report scale, as well as Externalizing, Total Observations, and Total Self-Reports. Children with EBD scored higher than children with LD on two observation scales, Externalizing, and Total Observations, but no self-report scales.
Associations Between Interview Observations, Informant Reports, and Self-Reports
McConaughy and Achenbach (2001) reported significant positive correlations of .16 to .58 between interviewers' ratings on the SCICA and parent ratings on the CBCL, .20 to .41 between the SCICA and teacher ratings on the TRF, and .23 to .60 between the SCICA and youths' self-ratings on the YSR. Table 4 summarizes correlations between comparable SCICA observation scales and CBCL/TRF/YSR scales. The table also includes correlations with comparable CBCL/TRF/YSR scales for the SCICA Internalizing and Externalizing scales (which include observations and children's self-reports) and SCICA Total Self-Reports. The magnitude of correlations between the SCICA observation scales and CBCL/TRF/YSR scales was consistent with similar correlations for the GATSB and TOF and CBCL/TRF (see Table 3). The findings in Table 4 generally showed more agreement between interviewers' observations and parent reports than between interviewers' observations and teacher reports or youth self-reports. However, there was good agreement between interviewers' ratings and all other informants' ratings of externalizing problems, as well as interviewers' ratings and youths' self-ratings of internalizing problems and total self-reported problems.
Situational Specificity of Children's Behavior
As a working assumption for behavioral assessment, Shapiro and Kratochwill (2000) stated: "Within a behavioral assessment framework, all behavior [is] assumed to be situationally specific and only after empirical validation would one determine that a behavior was cross-situational" (p. 7). To illustrate, they gave the example of a child with disruptive behavior being removed from the classroom to a small room for an evaluation. There, the evaluator observes that the child is compliant, pleasant, and respectful.
Shapiro and Kratochwill cautioned that the evaluator should not assume that the behavior observed in the assessment situation will occur in other settings, such as the classroom. Instead, the evaluator should assume that the child's behavior in the assessment situation is indicative of context or events that surround that behavior, such as having been removed from aversive events in the classroom, receiving one-on-one attention from an adult, working in a controlled environment, and a host of other "confounding" variables. On the opposite side of the same picture, evaluators should not assume that the same child's disruptive behavior in the classroom will generalize to other settings, such as test sessions, clinical interviews, or the home environment. Classroom behavior has its own antecedents, consequences, and surrounding circumstances that create situational specificity in the same way as behavior in other settings. The key to good behavioral assessment is to identify patterns of behaviors that are more specific to certain situations versus patterns of behaviors that occur across several situations. Situational specificity in children's behavior was supported by Achenbach et al.'s (1987) meta-analytic study that showed significant, but modest, correlations between reports about children's behavior from different informants under different conditions. They reported an average correlation of .28 between ratings of children's behavior by parents versus teachers or parents/teachers versus mental health professionals. They also found an average correlation of .22 between children's self-ratings and ratings of other informants. These low cross-informant correlations contrasted with an average correlation of .60 between informants from similar situations or similar relationships with the child (e.g., pairs of parents or pairs of teachers). The highest average correlation was .64 for pairs of teachers, who were often a teacher and teacher aide in the same classroom.
The average correlation between two observers in the same setting was .57. Thus, even the higher correlations for pairs of observers and pairs of similar informants leave room for situational variability in children's behavior. The low to moderate correlations between the GATSB/TOF and CBCL/TRF (Table 3) and between the SCICA and CBCL/TRF/YSR (Table 4) are consistent with Achenbach et al.'s (1987) meta-analytic findings on cross-informant agreement. Such limits on cross-informant agreement do not suggest that one informant is right and the other is wrong, or that one situation represents a "truer" picture of a child's behavior than does another situation. Even direct observations of children's behavior are based on perceptions of the observer in that particular situation. Differences in people's perceptions of the child can be as informative as similarities in perceptions. Examining differences in perceptions can help practitioners identify important clues to factors affecting the child's behavior in specific situations and specific relationships. Examining similarities in perspectives across situations can help identify factors influencing behavior that are consistent across situations and relationships. The challenge then is to put all the pieces of the assessment puzzle together to form a meaningful picture of the child's functioning under given circumstances. This, in turn, can lead to intervention strategies that are best suited to each special circumstance and relationship, as well as interventions that are appropriate across multiple settings.
Advantages of Test Session and Interview Observations
Keeping situational specificity in mind, tes...