Peer review versus the h-index for evaluation of individual researchers in the biological sciences


 Past performance is a key consideration when rationalising the allocation of grants and other opportunities to individual researchers. The National Research Foundation of South Africa (NRF) has long used a highly structured system of ‘rating’ the past performance of individual researchers. This system relies heavily on peer review, and has seldom been benchmarked against bibliometric measures of research performance such as Hirsch’s h-index. Here I use data for about 600 rated researchers in the biological sciences to evaluate the extent to which outcomes of peer review correspond to bibliometric measures of research performance. The analysis revealed that values of the h-index based on the Scopus database are typically 5–20 for researchers placed in the NRF’s C rating category (‘established’), 20–40 for those in the B rating category (‘considerable international recognition’) and >40 for those in the A rating category (‘leading international scholars’). Despite concerns that citation patterns differ among disciplines, the mean h-index per rating category was remarkably consistent across five different disciplines in the biological sciences, namely animal sciences, plant sciences, ecology, microbiology and biochemistry/genetics. This observation suggests that the NRF rating system is equitable in the sense that the outcomes of peer review are generally consistent with bibliometric measures of research performance across different disciplines in the biological sciences. However, the study did reveal some notable discrepancies which could reflect either bias in the peer-review process or shortcomings in the bibliometric measures, or both.



Introduction
Peer review is one of the pillars of the global research enterprise. 1 At a political level, national governments often steer overall research activities in particular directions through funding allocations to programmes, but they also almost always devolve the final allocation of resources to a system of peer review. The logic is that peers are considered to be in the best position to evaluate the quality of past and proposed research, even though it is also acknowledged that reviewers can exhibit bias, which can be either explicit, such as overt competition among researchers, or implicit, such as underlying prejudice according to race, gender and the perceived status of the institution to which a researcher is affiliated. 1 The peer-review system is under enormous strain worldwide and grant administrators increasingly struggle to obtain quality reviews of funding proposals. 2 Peer review, whether of grants or submitted manuscripts, is usually performed without added remuneration and there are thus limits to the time that researchers are willing to allocate to this process, particularly if it involves an assessment of the entire track record of each individual researcher.
One possible way of reducing the burden on peer reviewers (and the institutions that manage the process) is to separate periodic evaluation of the track record of the applicant from the evaluation of proposed research, such that a single measure of the applicant's track record can be used for multiple decision-making processes. This approach has been adopted in South Africa for several decades and takes the form of peer evaluation of the research performance of researchers in the higher education and research sector. This evaluation is usually based on the opinions of about six peers (some suggested by the candidate and some suggested by a panel of discipline-specific experts) and is codified as a particular 'rating' which lasts for a period of 6 years. This information can then be used by administrative panels when making grant allocations or for other purposes such as informing a university selection committee about the past performance and general standing of an applicant in a research field.
The availability of massive computerised databases of publications that include citation information has led to the development of bibliometric measures of research performance. By far the most widely adopted of these is Hirsch's h-index, which is the largest number h such that h of a researcher's papers have each been cited at least h times. 3 The h-index is intended to strike a balance between the total number of citations, which may be unduly influenced by a few very well cited papers, and the total number of publications, which may not reflect the actual impact of the research in terms of citations. 4 The h-index has been shown to be closely associated with measures of academic standing in a field. 5,6 Potential drawbacks of the h-index include its insensitivity to number of authors, author positions, and discipline-specific citation patterns. 4,7-11 Like the metric of total citations, the h-index also shows a ratchet effect whereby it will continue to increase even after a researcher has become inactive. 4
There have been very few attempts worldwide to determine the extent of agreement between peer review and bibliometric measures. Most of these involve studies of whether the h-index can predict the outcomes of applications for fellowships 12,13 or future career trajectories. 14 Hirsch 9 has characterised the h-index as 'an indicator of the impact of a researcher on the development of his or her scientific field'. This is uncannily similar to the objectives of the National Research Foundation (NRF) rating system in South Africa. Studies of correlations between peer assessment of research standing, such as NRF ratings, and the h-index are particularly valuable, but remain rare. 5,8,15,16 Lovegrove and Johnson 17 analysed data for a small sample (163) of botanists and zoologists in South Africa and found that the outcome of peer review of research standing in the form of NRF rating categories was fairly well correlated with the ISI-based h-index. They did not attempt to analyse trends across other disciplines in biology, however, and the ranges and means they obtained for the h-index in each rating category need to be updated, given the large increase in publications worldwide over the past 12 years.
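Hirsch's definition of the h-index given in the Introduction can be illustrated with a minimal Python sketch. The function name and the citation counts are hypothetical and are not drawn from the study's data; the example is only meant to show why a few very highly cited papers do not, on their own, produce a large h-index.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts (illustrative only, not data from the study).
few_big_hits = [250, 180, 3, 2, 1]          # 436 citations in total
steady_output = [12, 11, 10, 9, 8, 7, 6, 5]  # 68 citations in total

print(h_index(few_big_hits))
print(h_index(steady_output))
```

In this made-up example the researcher with far fewer total citations has the higher h-index, which is the balancing behaviour described above.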
The aim of the present study was to establish, across different disciplines of biology, the relations between estimates of the standing of researchers in their field based on peer review and those based on bibliometric analysis of their h-index.

Methods
I used the public database of NRF ratings assigned to 4176 South African researchers, available at www.nrf.ac.za, which was last updated on 30 June 2020. I filtered this database down to 644 researchers who mentioned 'Biological Sciences' as one of their primary disciplines (Supplementary table 1). I was able to find the Scopus-based h-index values for 614 of these researchers (searches took place 18-19 July 2020). Values of the Scopus-based h-index are generally very similar to those based on the Web of Science. 18 I did not use Google Scholar because Google Scholar profiles that are not frequently curated often include papers that are not authored by the researcher, thus inflating their actual h-index.
I was able to assign 569 researchers to a sub-discipline, usually based on their own statement of a sub-discipline in their NRF profile, or, more rarely, by examining the content of papers in their Scopus profile or consulting with other researchers in the sub-discipline. I used the following sub-discipline categories (number of researchers): 'Ecology' (137), 'Plant Sciences' (75), 'Animal Sciences' (127), 'Microbiology' (115) and 'Biochemistry, Molecular Biology and Genetics' (115). Palaeontologists were allocated to the Animal Sciences category only if their main focus is on animal structures and evolution rather than geology. These disciplines are based loosely on those used by Scopus and do not correspond exactly to the specialist committees used by NRF to oversee the results of peer review. For example, I have grouped all ecologists into one category, regardless of whether they would fall under the plant sciences or animal sciences committees.
I analysed values of the h-index of researchers to determine the degree to which these could be predicted by peer-review outcomes (the rating categories in which these researchers had been placed) as well as according to the specific discipline of the researchers. The three rating categories used by the NRF that were analysed (and the number of researchers in these) were A ('leading international scholars', n=27), B ('scholars with considerable international recognition', n=129) and C ('established scholars', n=414). I did not analyse rating sub-categories, e.g. C1, C2 and C3, as this information is not made public by the NRF. Even if the rating sub-category data could be obtained on the condition of confidentiality, as was the case in the analysis by Lovegrove and Johnson 17 , analysing them here without making the data available as an appendix would violate the principle of data transparency. I did not analyse researchers who applied for a rating and were unsuccessful as this information is not publicly available; nor did I analyse the Y and P rating categories as these specifically apply to early-career researchers who are evaluated largely according to future potential and not on their past track record in terms of impact within their fields.
Values of the h-index were skewed with a longer tail to the right and were thus analysed using a generalised linear model which incorporated a gamma distribution and log link function (implemented in SPSS 25, IBM Corp.). Rating category, sub-discipline and their interaction were treated as fixed predictors, and time (0-5 years) since the rating was awarded was treated as a covariate. Model significance was assessed using likelihood ratios and post-hoc comparisons among means were based on the Dunn-Šidák method. In statistical terms, this model tests the null hypothesis that different rating category allocations are drawn from the same distribution of the h-index. An equally valid approach not used here would be to use a logistic model to assess whether the h-index is a good predictor of the probability of different rating categorisations. 8
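For readers who prefer the model in symbols, one way of writing the specification described above is sketched below. The notation is assumed here for illustration and does not come from the original SPSS analysis: h_i is the h-index of researcher i, r(i) and d(i) are that researcher's rating category and sub-discipline, t_i is the time in years since the rating was awarded, ν is the gamma shape parameter, and m is the number of post-hoc pairwise comparisons. The last line is the standard Šidák adjustment of a family-wise significance level.

```latex
\begin{align*}
h_i &\sim \mathrm{Gamma}(\mu_i, \nu), \\
\log \mu_i &= \beta_0 + \alpha_{r(i)} + \delta_{d(i)}
              + (\alpha\delta)_{r(i)\,d(i)} + \gamma\, t_i, \\
\alpha_{\text{per comparison}} &= 1 - \left(1 - \alpha_{\text{family}}\right)^{1/m}.
\end{align*}
```

Because of the log link, each coefficient acts multiplicatively on the expected h-index, so exp(γ) is the proportional change in the expected h-index per additional year since rating.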

Results
The distributions of raw (non-adjusted) h-indices according to rating category revealed clear clustering, but with some overlap of values between categories (Figure 1). The h-index values for researchers differed significantly according to rating category (χ² = 399.9, p<0.0001), sub-discipline (χ² = 14.2, p=0.006) and time since rating (χ² = 399.9, p<0.0001), but not by the interaction between rating category and discipline (χ² = 10.4, p=0.234).

Discussion
These results indicate that there is substantial agreement between the evaluation of research performance based on peer review and values of the h-index based on bibliometrics across a wide range of subdisciplines in biology (Table 1). The consistency across disciplines is particularly evident for the B and C rating categories (Table 1). The greatest variation in the mean h-index across disciplines, notably a twofold difference between animal sciences and molecular biology, is found in the A category.
Finding that the h-index is largely congruent with NRF ratings does not automatically imply that measures of research performance based on peer review can simply be supplanted by bibliometrics. The reason for this caution is that both approaches have disadvantages, and a combination of both may offer the best safeguard against unfair evaluation of an individual. Given that the advantages and disadvantages of peer review, partially summarised in Table 2, are well known 1,15, I focus here on the advantages and risks of relying on the h-index as a measure of research performance. The purpose of measuring research performance is usually to make decisions about allocations of public funds in the form of grants. It is therefore important that the measure reflects overall research competence and ability. In this sense, the h-index has some serious drawbacks, such as favouring authors who publish as part of large teams, including consortia. 7,19 For example, an author who is a minor (middle) author of a large number of multi-authored papers may quickly develop a healthy h-index and yet may not have the requisite experience in managing research or writing papers. 9,20 This issue would probably be flagged by peer reviewers. There is an increasing trend for papers to have large numbers of authors. The charitable view is that this reflects a genuine increase in collaboration among researchers, but a more cynical view is that it reflects a form of collaborative gaming of the system by authors who wish to collectively increase their numbers of publications and citations. 21 A curious footnote to this issue is that in South Africa, government incentive funding for each publication is divided among authors, thus providing a perverse incentive (counter to that applied by the h-index) for researchers to minimise the number of co-authors. Another form of gaming by authors is to focus on writing review articles, rather than to conduct original laboratory or field-based research, simply because review articles are well known to garner more citations. 21 The need to more fairly reward authors for original research has recently been recognised in the 'San Francisco Declaration on Research Assessment' (https://sfdora.org/) which includes an advisory for authors to cite original research rather than reviews wherever it is possible to do so.
The h-index ratchets upwards throughout a person's career and even continues to increase after they have become research-inactive. This is potentially a serious problem given that the purpose of performance evaluation is usually to allocate resources to individuals who are currently active. Reviewers for the NRF rating system in South Africa, for example, are expected to focus on the quality and impact of the research performed in the 8 years preceding the evaluation, and the overall h-index can be misleading in this regard. Only by focusing on the year-by-year trends in citations and publication quantity and quality over that period can a panel gain insight into the recent impact of a person in their field. Of course, the situation is completely different if the purpose of using the h-index is to award a prize for career impact to someone who has retired or who is approaching retirement, as opposed to using the information for allocation of resources. It also seems obvious that the h-index has limited utility for evaluating early-career researchers, although there have been some studies that have shown that early-career values of the h-index (medians around 2-3) do correlate to some extent with the peer-review outcomes of fellowship applications. 13 Personally I would be sceptical of relying too heavily on the h-index to evaluate early-career researchers for purposes such as awarding postdoctoral grants as it may favour applicants in a manner that is directly proportional to the number of years since completing (or commencing) their PhD, even if those intervening years were not particularly productive.
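The ratchet effect described above can be made concrete with a small Python sketch. The yearly citation counts are invented for illustration (a hypothetical researcher whose last paper appears in year 2 and who publishes nothing afterwards), and the function names are not from the study. The h-index is recomputed each year from cumulative citations.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def h_index_by_year(yearly_citations_per_paper):
    """Return the h-index at the end of each year.

    `yearly_citations_per_paper` holds one list per paper, giving the
    citations that paper received in each successive year.
    """
    n_years = len(yearly_citations_per_paper[0])
    cumulative = [0] * len(yearly_citations_per_paper)
    trajectory = []
    for year in range(n_years):
        for i, paper in enumerate(yearly_citations_per_paper):
            cumulative[i] += paper[year]
        trajectory.append(h_index(cumulative))
    return trajectory


# Hypothetical citations per paper per year (years 0 to 5); no paper is
# published after year 2, yet citations keep accumulating.
papers = [
    [3, 3, 3, 2, 2, 2],
    [0, 2, 3, 2, 2, 2],
    [0, 1, 2, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

trajectory = h_index_by_year(papers)
print(trajectory)  # → [1, 2, 3, 3, 3, 4]
```

In this toy case the h-index still rises from 3 to 4 in year 5, three years after the last publication, which is why the overall h-index says little about activity within a recent assessment window.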
Given these drawbacks of the h-index, should it be used in performance evaluation at all? The answer, I believe, is an emphatic 'yes'. If the evaluation of a researcher by a peer-review process leads to an outcome that is incongruent with their h-index, then a panel needs to consider the basis for the incongruence. For example, a person with an h-index of 10 whose peers write reports that place them in the B category ('considerable international recognition') may have benefitted from reviewers who had been primed to provide favourable reports, and the case should be reconsidered. It would be very difficult to detect this misfeasance without the use of a benchmark provided by the h-index. However, it can also be the case that peer review is sometimes the more reliable measure. For example, a person whose peers write reports suggestive of a B rating may have an h-index of 60. Further investigation by a panel may reveal that such a person is no longer fully research-active, and the final decision may be more in accordance with the peer review than the h-index. It is even possible that the peer reviewers themselves make use of the h-index when asked to evaluate a candidate for rating, especially one with whom they are not very familiar. 21 Thus it could be the case that measures of research standing in a field based on peer review, such as NRF ratings, are, in fact, already being informed by the h-index. This lack of clear separation between independent and dependent variables means that statistical tests of association between the h-index and ratings of research standing, such as those in the present study, should be viewed as no more than crude heuristic approximations. This also means that any characterisation of peer review as 'subjective' and the h-index as 'objective' would be overly simplistic. 22 It could even be argued, quite reasonably, that the h-index itself is already a form of peer review as most citations are essentially a form of validation by peers, although it is of course possible for papers to be cited as examples of faulty research.
An unexpected result of this analysis was the huge inflation (roughly doubling) of h-index values within each rating category in the biological sciences since the previous analyses by Lovegrove and Johnson 17 and Fedderke 8 . For example, the lower h-index threshold for a B rating in the 2008 study was about 10 and is now about 20, and for an A rating this lower threshold has shifted from about 20 to 40 ( Figure 1). It is not clear why this is the case, but one possibility is that the yearly increase in publications is outstripping the number of new researchers in the global system, thus leading to increasing values of the h-index. Another (related) possibility is that the number of authors per paper is increasing, leading to more papers for the same researcher effort. The most cynical and depressing explanation is that there is limited turnover of new researchers, particularly in the A and B categories, and so the NRF system is simply repeatedly re-evaluating much the same cohort of people whose h-index is increasing with time. Whatever the reason, it is clear that the h-index norms for each rating category are moving targets and, for administrative purposes, may need to be adjusted upwards every few years.
As fitted by eye, the h-index norms in the data for researchers in the biological sciences appear to be about 5-20 for the C category, 20-40 for the B category and >40 for the A category (Figure 1). These norms encompass 75.8% of researchers in the C category, 69.7% of researchers in the B category and 88.8% of researchers in the A category. In this analysis, there were 142 researchers (c. 25% of the total) whose ratings lie outside these norms and some of these represent marked outliers (Figure 1). These outliers could reflect failures of the peer-review system, such as review reports that are negatively biased or primed to be favourable when they should not be, or they could represent special cases fairly deliberated on by a panel, such as cases of researchers who have a high h-index, but whose research standing has waned or whose contributions to well-cited papers were relatively minor.
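The norms above, and the count of researchers falling outside them, can be checked with a short Python sketch. Several things here are assumptions made for illustration rather than details given in the text: the treatment of the band edges (both ends inclusive for C and B, strictly greater than 40 for A), the function name, and the reconstruction of per-category counts by rounding the reported percentages against the category sizes given in the Methods (A = 27, B = 129, C = 414).

```python
# Approximate h-index bands per rating category, as fitted by eye in the text.
# Edge handling is an assumption: the text gives only "5-20", "20-40", ">40".
NORMS = {"C": (5, 20), "B": (20, 40), "A": (40, None)}


def within_norm(rating, h):
    """Return True if h-index `h` falls inside the band for `rating`."""
    low, high = NORMS[rating]
    if high is None:          # open-ended top band (A category)
        return h > low
    return low <= h <= high


# Category sizes from the Methods and coverage percentages from the text.
n_researchers = {"C": 414, "B": 129, "A": 27}
share_within = {"C": 0.758, "B": 0.697, "A": 0.888}

# Reconstructed (rounded) counts; these are inferred, not reported directly.
inside = {k: round(n_researchers[k] * share_within[k]) for k in n_researchers}
outside = {k: n_researchers[k] - inside[k] for k in n_researchers}
outside_total = sum(outside.values())
total = sum(n_researchers.values())

print(outside, outside_total, round(100 * outside_total / total, 1))

# The two incongruent cases discussed in the text: a B-rated researcher
# with an h-index of 10, and one with an h-index of 60.
print(within_norm("B", 10), within_norm("B", 60))
```

Under these assumptions the reconstructed counts add up to the 142 researchers reported as lying outside the norms. A panel could use a check of this kind only to flag cases for closer inspection, not to decide them.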
It should be noted here that the average gap between rating and recording of the h-index in this data set is c. 2.5 years, meaning that the actual h-index at the time of rating would have typically been slightly lower for each researcher. I controlled for the time gap between rating and scoring of each h-index in the statistical model by including it as a covariate. Although the marginal means for each category are adjusted according to this covariate, the mean h-index at the time that rating took place will still be slightly overestimated for some researchers. This problem of overestimation of the h-index at the actual time of rating also applies to previous analyses 8, 17 and cannot account for the above-mentioned inflation of the h-index per rating category over the past decade and more.
It is uncertain whether the NRF will continue with the rating of researchers, even though it has not publicly indicated any intention to phase out the system. It has already been pointed out that the rating system seems to be quite disconnected from allocations of large state resources to the South African Research Chairs initiative 8 and rating itself is no longer connected with any significant 'incentive' funding allocations to researchers. 23 On the basis of cost and effort relative to application, it seems unlikely that the rating system will persist, but the general need for an evaluation of the standing of a researcher in a discipline will remain relevant even if the process is eventually bundled into the evaluation of grant applications.
It would be straightforward to extend this study to other disciplines besides the biological sciences, as was done previously by Fedderke 8 who used data derived from Google Scholar and found strong discipline-specific associations between the h-index and probabilities of ratings. Such an exercise would further clarify whether h-index norms are more or less universal across the natural and social sciences, or whether each broad discipline needs a different set of norms with which to benchmark the assignment of researchers to different rating categories. There is no doubt that the h-index can serve as a valuable benchmark when peer review is biased or blatantly unfair. Some authors have even expressed the view that a rating system based on bibliometrics would be viewed as being more progressive and transparent. 8,24 In cases where data such as those in Figure 1 are available, I would even venture to suggest that under conditions of severe resource or time constraints, organisations such as the NRF could consider using the h-index in a discipline-adjusted manner as the sole measure of the standing of a researcher in their field. However, peer review will usually provide more nuanced interpretations of career impact than will a single number based on bibliometrics, and should be used whenever it is feasible. For early-career (<8 years) researchers, whose h-index will almost always be modest, my advice would be to simply ignore this whole debate and focus on publishing captivating high-quality papers, because that remains the only way to make an impact on one's research field, no matter how it may ultimately be measured.
• Does not take number of co-authors into account
• Does not take author positions into account
• Not suitable for evaluating early-career researchers
• Can be 'gamed' by self-citation of papers just below h-index threshold
• Relatively insensitive to massively cited 'big hit' papers (can also be interpreted as an advantage)
• May vary among disciplines