Principal-component solution. Of the 10 components summarized in Table 1, only the first 9 are readily interpretable. The components are quite well defined, with few items loading on more than one component (for details, see Gottfredson). The first component might properly be labeled a list of "don'ts" - practices to avoid if we want our peers to be favorably impressed with our work.

The second and third components seem to suggest a differentiation of two types of "do's" - those dealing primarily with scientific or substantive matters (Component II) and those dealing with stylistic, compositional, or expository matters (Component III). Component IV suggests the importance of originality and heurism, and the fifth component might be labeled "trivia."

While the number of items defining each of the remaining components is small, as is the proportion of the total variance for which each accounts, they nonetheless remain readily interpretable.

Component VI seems primarily to reflect scientific advancement, while Component VII seems to merit the label "data grinders" or "brute empiricism," with emphasis on description rather than explanation. Given the vast differences in subject matter and methodological approaches represented by these nine journals, one might suspect that journal-group differences with respect to the dimensional structure would be evident.

This hypothesis was tested on the heterogeneous sample of respondents. Each respondent in the full sample was scored on each of nine scales constructed from items loading on the nine interpretable components described above. Scale scores were entered as discriminating variables in a stepwise multiple-discriminant-function analysis, with journal affiliation as the variable to be distinguished, to determine whether members of the respective groups differ in their treatment of these scales.

Analyses were performed without regard to prior knowledge of group size. The hypothesis that group members differ with respect to treatment of the nine scales was not confirmed. The original value of Wilks's lambda, which assesses potential discriminability based on the scale scores, is.

After removing the effect of the first discriminant function, lambda increased to. While discriminations based on these functions are statistically significant (marginally so for the second discriminant function), the discrimination itself is of little practical significance.
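As a rough illustration of the statistic used here, Wilks's lambda can be computed as the ratio of the determinant of the pooled within-group scatter matrix to that of the total scatter matrix; values near 1 indicate little discriminability among groups. The sketch below is a minimal pure-Python version under that textbook definition. The function names and the toy data are hypothetical, and it does not reproduce the stepwise procedure or the values reported in the study.

```python
def mean_vector(rows):
    """Column means of a list of equal-length score vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, center):
    """Sum of outer products of deviations from `center`."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - center[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda k: abs(m[k][i]))
        if abs(m[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            det = -det
        det *= m[i][i]
        for k in range(i + 1, n):
            factor = m[k][i] / m[i][i]
            for j in range(i, n):
                m[k][j] -= factor * m[i][j]
    return det

def wilks_lambda(groups):
    """groups: list of groups, each a list of score vectors (one per respondent)."""
    all_rows = [r for g in groups for r in g]
    total = scatter_matrix(all_rows, mean_vector(all_rows))
    p = len(total)
    within = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = scatter_matrix(g, mean_vector(g))
        for a in range(p):
            for b in range(p):
                within[a][b] += s[a][b]
    return determinant(within) / determinant(total)

# Hypothetical toy data: two groups scored on two scales.
identical_groups = [[(1, 2), (2, 1), (3, 3)], [(1, 2), (2, 1), (3, 3)]]
separated_groups = [[(0, 0), (1, 0), (0, 1)], [(100, 100), (101, 100), (100, 101)]]
lambda_identical = wilks_lambda(identical_groups)
lambda_separated = wilks_lambda(separated_groups)
```

With identical group centroids the within-group and total scatter coincide, so lambda is 1; widely separated centroids drive it toward 0.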

These analyses have suggested that (a) psychologists heavily involved in manuscript review for nine major psychological journals agree remarkably on the desirability of specific characteristics of journal articles, (b) a well-defined dimensional structure can be obtained which accounts for half of the remaining "individual differences" variance, and (c) these dimensions are employed in similar ways by persons across subdisciplines in psychology.

These results suggest that prescriptive norms for scientific evaluation exist and transcend subdisciplinary bounds. There are things we should do as researchers and authors, and there are things we shouldn't do; and many of these behaviors are allegedly prominent in the peer evaluation process. Our next task, then, is to determine whether these criteria can in fact be used to achieve increased reliability of peer evaluations of psychological work.

Two samples were needed for this phase of the investigation: a sample of psychological works to evaluate and a sample of judges competent to evaluate these specific contributions.

Since a later portion of the study focuses on the relation between citation counts and quality judgments, the target year selected for study was. Science Citation Index coverage is adequate for that and succeeding years but not for previous years, and the elapsed time span is sufficient for citation analysis (Garfield; Price). Thus, articles published during the calendar year in the nine journals listed earlier provided the sample of psychological works to evaluate.

A sample of judges competent to evaluate these specific works was difficult to obtain for a number of reasons. My solution was simply to survey the authors of these articles and to ask them for the names of three persons whom they considered competent to evaluate the significance of their article in the current framework of psychological knowledge.

Persons nominated comprised the "expert" sample pool. This procedure was chosen primarily because of its practicality, simplicity, and proven productivity (Gottfredson et al.).

Although this approach may appear to introduce potential bias in the evaluations, (a) the nature of the biasing effect (if any) may actually be conservative if it results in a restriction of range in the ratings, (b) several journals e.

Target articles and authors. A total of 1, substantive articles appeared in the nine target journals during There were, however, only 1, single- or first-listed authors i.

For each of these authors, one article was selected at random, resulting in a final article pool of 1, Thirteen of these authors were dead, and no address could be found for others, resulting in a survey base of authors and articles.

Of the responding authors, Although 1, nominations were made, only individual experts were named. In order to maximize the number of articles to be judged, the following assignment procedure was used: Any article for which only one expert had been nominated was assigned that expert, provided it was not in competition with another article for which that single reviewer had also been suggested.

Where such competition obtained, the expert was randomly assigned to one of these articles, and the other(s) was dropped from further consideration. This procedure was then repeated with articles for which multiple experts had been nominated. Once assigned, an expert was excluded from further consideration. Only 12 articles could not be assigned at least one expert. This survey was conducted during November March Sample members were assumed to have ready access to the published article they were asked to evaluate, although copies were mailed upon request.
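The assignment procedure just described can be sketched in code. This is one possible reading, not the procedure as actually run: it assumes that in each pass an expert named by several competing articles is given at random to one of them, that single-nomination articles are handled first, and that multiple-nomination articles then draw only on experts not yet assigned. The function name, the data layout, and the `seed` parameter are all hypothetical.

```python
import random

def assign_experts(nominations, seed=0):
    """nominations: dict mapping article id -> list of nominated expert names.

    Returns (assigned, unassigned): a dict of article -> list of experts,
    and a sorted list of articles that received no expert.
    """
    rng = random.Random(seed)
    assigned = {}
    used = set()  # an expert, once assigned, is excluded from further consideration

    def run_pass(articles):
        # Group the articles in this pass by each still-available expert.
        by_expert = {}
        for article in articles:
            for expert in sorted(set(nominations[article])):
                if expert not in used:
                    by_expert.setdefault(expert, []).append(article)
        # Where several articles compete for one expert, pick one at random.
        for expert in sorted(by_expert):
            article = rng.choice(by_expert[expert])
            assigned.setdefault(article, []).append(expert)
            used.add(expert)

    single = sorted(a for a, e in nominations.items() if len(set(e)) == 1)
    multiple = sorted(a for a, e in nominations.items() if len(set(e)) > 1)
    run_pass(single)    # first pass: articles with exactly one nominee
    run_pass(multiple)  # second pass: articles with several nominees
    unassigned = sorted(a for a in nominations if a not in assigned)
    return assigned, unassigned

# Hypothetical example: A1 and A2 compete for expert "x".
example = {"A1": ["x"], "A2": ["x"], "A3": ["y", "z"], "A4": ["x", "y"]}
assigned, unassigned = assign_experts(example, seed=1)
```

In this example exactly one of A1 and A2 receives expert "x" and the other is left without a judge, which mirrors the way competing single-nomination articles were dropped.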

Given the agreement shown with respect to scale placements of the 83 items discussed above, it would have been possible to build a single "evaluation scale" by simply selecting items falling along a wide range of the original dimension. It is also clear, however, that a multidimensional approach adds information, and a multidimensional approach was therefore followed throughout.

Three criteria were employed to select 36 scale items from the initial 83. First, an item was to load heavily on one component and essentially zero on all others. Second, its variance was to be as small as possible. These two constraints served to ensure (a) that items were empirically good exemplars of their respective principal components, (b) that the resulting nine scales would be as orthogonal as possible, and (c) that there would be good agreement with respect to the "value" of the items.

The third criterion, given that the first two had been met, was that the item be a subjectively good exemplar of the component with which it had been identified. In addition to rating the article on the 36 items, experts were asked to make three global assessments of the quality and two of the impact of articles in the sample.
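The first two selection criteria are mechanical and can be illustrated with a short sketch; the third (subjective fit) cannot. The cutoffs `min_primary` and `max_secondary`, the `per_component` limit, and the toy loadings below are hypothetical, since the source gives no numeric thresholds.

```python
def select_items(loadings, variances, min_primary=0.5, max_secondary=0.2,
                 per_component=4):
    """Pick items that load on one component only, preferring low variance.

    loadings:  dict item -> list of loadings, one per component
    variances: dict item -> variance of that item's ratings
    Returns a dict mapping component index -> list of selected items.
    """
    candidates = {}
    for item, row in loadings.items():
        magnitudes = [abs(v) for v in row]
        primary = max(range(len(row)), key=lambda j: magnitudes[j])
        others = [m for j, m in enumerate(magnitudes) if j != primary]
        # Criterion 1: heavy loading on one component, near zero elsewhere.
        if magnitudes[primary] >= min_primary and all(m <= max_secondary for m in others):
            candidates.setdefault(primary, []).append(item)
    # Criterion 2: among qualifying items, keep those with the smallest variance.
    return {
        component: sorted(items, key=lambda i: variances[i])[:per_component]
        for component, items in candidates.items()
    }

# Hypothetical toy data: four items, two components.
toy_loadings = {"a": [0.8, 0.1], "b": [0.7, 0.05], "c": [0.6, 0.5], "d": [0.1, 0.9]}
toy_variances = {"a": 2.0, "b": 1.0, "c": 0.5, "d": 1.5}
selected = select_items(toy_loadings, toy_variances, per_component=1)
```

Item "c" is rejected despite its low variance because it loads on both components, which is the point of applying the loading criterion first.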

They were first requested to compare the article to others published at about the same time and dealing with similar topics or problems, as well as to others on the same topic regardless of publication date. Since issues of quality in science are relative and timebound (Kuhn; Polanyi), these items were intended to clarify the domains of comparison.

Again, a 7-point response scale was used, bounded by the categories "exceptionally low quality; few, if any, articles worse" and "exceptionally high quality; few, if any, articles better." Results of an earlier study Gottfredson et al.

Experts were asked to give their general impression of the impact the article had had upon (a) its specific subject-matter area and (b) psychological knowledge in general. Experts indicated both judgments on a 7-point scale bounded by the categories "no impact" and "great impact."

Overall quality and impact scales. Table 2 gives the matrix of intercorrelations for these five items. The three quality rating items are highly correlated, as are the two impact ratings, while the correlations across quality and impact items are moderate.

Accordingly, the three quality items were summed, as were the two impact items, resulting in a "quality scale" and an "impact scale."

[Table 2: intercorrelations among the quality and impact items (Evaluation relative to other; Overall quality; Impact on subject). Cell Ns are in parentheses.]

Two important types of reliability must be considered - interjudge reliability (agreement across judges with respect to assessments) and intrajudge reliability.

The latter can be thought of both as a measure of a given judge's consistency with respect to his or her judgments and as a measure of the reliability (internal consistency) of the measuring instrument itself (in this case, the set of scales).

Table 3 presents interrater and homogeneity coefficients for the overall quality and impact scales described above. The internal consistency coefficients are based on all articles for which there was at least one judgment, and interjudge coefficients are based on all articles for which at least two experts were available (where more than two experts were available, extras were randomly excluded).
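To make the two kinds of coefficient concrete, the sketch below computes Cronbach's alpha as an internal-consistency measure and a Pearson correlation between two judges' scale scores as an interjudge measure. Both choices are assumptions made for illustration: the text does not state which homogeneity or interrater coefficient was actually used, and the function names and data are hypothetical.

```python
import math
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings: three respondents answering three quality items identically.
consistent_items = [[1, 1, 1], [4, 4, 4], [7, 7, 7]]
alpha = cronbach_alpha(consistent_items)

# Hypothetical scale scores from two judges rating the same four articles.
judge_one = [12, 15, 18, 9]
judge_two = [10, 16, 17, 11]
interjudge_r = pearson_r(judge_one, judge_two)
```

A scale can show high alpha while the interjudge correlation stays modest, which is the pattern the text goes on to report.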

Both scales have high internal consistencies, especially considering the small number of items composing each. Interjudge agreement, however, is relatively modest.

[Table 3: interrater and homogeneity coefficients for the overall quality and impact scales and the nine evaluative scales, including Don'ts, Substantive do's, Trivia, Where do we go from here?, Data grinders, and Ho-hum research.]

Reliability of evaluative scales. Each expert was asked to indicate whether each of the 36 items derived from the study of evaluative criteria was characteristic or descriptive of the article he or she had read.

Responses were made on a 6-point scale bounded by the categories "strongly disagree" and "strongly agree" that the statement is characteristic or descriptive of the article. Scores for each respondent for each of the nine scales were examined relative to the quality and impact scales, and in terms of both reliability measures. Table 3 also summarizes these results. All nine scales are correlated in the expected direction with the quality and impact measures, although sizable differences in the magnitude of these coefficients are evident.

Scale 8 ("Ho-hum research") is essentially uncorrelated with either quality or impact - a reflection of the unreliability of the scale. In all cases but one, reliable scales evidence significantly lower correlations with the impact than with the quality measure. While this could reflect the degree of independence of these two measures the quality and impact measures correlate. Internal consistency coefficients for all but the last two scales are quite acceptable. It should be noted that these final two scales (a) accounted for very little variance in the dimensional structure obtained from the study of evaluative criteria and (b) fell toward the middle of the response continuum for that survey - indicating their relative irrelevance.

Again, inspection shows that this is due to lack of variance in the subsample. The remaining five scales all demonstrate acceptable reliability relative to results reported previously e.

Reliability of combined scales. Missing data i. The internal and interjudge reliabilities just discussed are based on all cases for which (for a given scale) responses were complete.

When combining scales, however, far too many cases are lost. To counteract this problem, the mean judgment over items on a given scale was computed for each respondent who had completed the majority of items on that scale; it was then used in the remaining analyses discussed in this article. To assess agreement across judges over all scales, scales that were negatively correlated with the quality and impact judgments were reflected and all scale scores summed.

As noted earlier, little agreement is evidenced on four scales, due in two cases to a lack of variance in the subsample and in two others to the unreliability of the scales themselves. As expected, agreement across experts is better for the subset of seven reliable scales than for the full set of nine, as is the internal consistency measure (see Table 3).

For neither set of scales does reliability increase substantially over the reliabilities of specific individual scales. In general, these analyses have documented greater reliability of peer judgments of article quality than has been presented in past reports i.

Although agreement across judges is only moderate, the internal consistency evident suggests that the relative lack of agreement is not due simply to unreliability in the scales themselves. As noted earlier, the use of experts nominated by the article authors might have two effects. First, we might expect agreement from this set of judges simply because all experts were so named by the authors of the articles they judged.

The data suggest, however, that this is not the case. Experts were asked a series of questions designed to assess the extent of their familiarity with (a) the field represented by the article they were to judge and (b) the author(s) of the articles. Table 4 gives the correlations between responses to several of these items and the experts' judgments of the target articles.

Although all of these coefficients are statistically significant, they are of little practical importance. It may be the case, however, that articles in this sample are in fact "good" articles.

On this point, it is interesting to note that this is a relatively homogeneous set of articles, and we might thus expect relatively modest reliability. The problem faced by editors receiving manuscripts for publication, however, is somewhat different. The pool of incoming manuscripts is likely to be much more heterogeneous with respect to quality than a pool of published manuscripts. Hence, we would expect reliability to be better. In other words, the present research has achieved better agreement on a less heterogeneous sample.

Much effort has been expended in a search for measures of quality in science, and attention has recently focused on citation counts. The problem of identifying a significant contribution of a scientist to science, however, has received less attention - despite the obvious assumption that citations of papers reflect a measure of their quality.

This section makes use of the two studies previously reported to examine this issue. The Science Citation Index was searched for all citations made of the articles in an 8-year period following the date of their publication. Specific notation was made of self-citations (defined as citations of the referent article by its single or first-listed author in subsequent publications) and citation in review articles. Intercoder reliability coefficients (r) for the various citation measures were all well above.

Distributions of citations, whether of articles, journals, or people, are highly skewed see Table 5. Results of correlational analyses based on distributions as highly skewed as these can be misleading, since the least-squares model gives disproportionate weight to deviant scores. Table 6 contains the product-moment correlation coefficients obtained between experts' judgments of both the quality and impact of the target articles for which at least one response was received and the citations made of those articles during the 8-year period following their publication.

While largely statistically significant, these relations are very weak. The highest observed between experts' judgments of impact and the log of the total citations made of the articles was. None of the individual evaluative scales approaches even this degree of relation with any citation measure, although the combined set of evaluative scales correlates with the citation measures to essentially the same extent as does the overall judgment of article quality.
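The effect of skew on a product-moment correlation, and the reason for working with logged citation counts, can be illustrated with a small sketch. The data are invented, and the use of log10(1 + count) to keep uncited articles defined is an assumption; the text says only that the log of total citations was used.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def log_citations(counts):
    """log10(1 + count); the +1 offset is an assumption to handle zero counts."""
    return [math.log10(1 + c) for c in counts]

# Hypothetical data: five articles, one of them cited far more than the rest.
ratings = [1, 2, 3, 4, 5]
citations = [0, 1, 3, 7, 1000]

r_raw = pearson_r(ratings, citations)
r_log = pearson_r(ratings, log_citations(citations))
```

In the raw counts the single heavily cited article dominates the least-squares fit; after the log transform the remaining articles contribute more evenly, so the two coefficients can differ noticeably.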

Finally, it is apparent that controlling for self-citation is not necessary cf. As demonstrated in Table 7, no change is evident when these combined scores are correlated with the citation measures.

Issues of heteroscedasticity. Hagstrom and others Gottfredson et al. The joint distributions of the various peer-judgment and citation measures suggest that the relations are markedly better for higher values of the citation measure than for lower values.


