Author Correction: Ensembles of data-efficient vision transformers as a new paradigm for automated classification in ecology




Correction to: _Scientific Reports_ https://doi.org/10.1038/s41598-022-21910-0, published online 03 November 2022.

The original version of this Article contained errors, where the Ensembles of Data-efficient image Transformers (EDeiTs) had a lower performance than stated. After publication, the authors noticed a bug in the evaluation of the EDeiT models, where different models in the same ensemble were seeing different images (always from the same class), which artificially increased the test performance. This bug was present only in the script for EDeiTs; all other models (single models and ensembles) were unaffected. The main conclusion of the paper, that one can use EDeiTs to obtain well-performing models while avoiding parameter tuning, remains valid, since the performances are always similar to those of the previous state of the art (SOTA).
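To illustrate the class of bug involved, here is a minimal, hypothetical sketch (in PyTorch; the names are illustrative and this is not the authors' published script) of an ensemble evaluation loop that avoids the pitfall: each batch is drawn once from a single dataloader and fed to every member, so all models are scored on the same images before their outputs are combined.

```python
import torch

@torch.no_grad()
def evaluate_ensemble(models, test_loader, rule="arithmetic"):
    """Score an ensemble so that every member sees the same test images.

    The bug described above can arise when each member iterates over its own
    independently shuffled dataloader: per-image predictions then no longer
    refer to the same image, even if the class labels happen to match.
    """
    for m in models:
        m.eval()
    correct = total = 0
    for images, labels in test_loader:  # one shared loader for all members
        probs = torch.stack([m(images).softmax(dim=1) for m in models])
        if rule == "geometric":         # product rule: geometric average
            combined = probs.log().mean(dim=0).exp()
        else:                           # sum rule: arithmetic average
            combined = probs.mean(dim=0)
        correct += (combined.argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```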


Errors have been corrected in the Abstract, the Results, the Discussion, Figure 1 and its caption, Figure 2, Table 1, and the Supplementary Information file. The original Figures 1 and 2 and the original Supplementary Information file are provided below.

In the Results, the subheading ‘Arithmetic versus geometric averaging for ensembling’ and Table 2 have been removed. Consequently, the following references have been removed from the reference list, and subsequent references have been renumbered in the text:

34. Alkoot, F. & Kittler, J. Experimental evaluation of expert fusion strategies. _Pattern Recogn. Lett._ 20, 1361–1369. https://doi.org/10.1016/S0167-8655(99)00107-5 (1999).

35. Kittler, J., Hatef, M., Duin, R. & Matas, J. On combining classifiers. _IEEE Trans. Pattern Anal. Mach. Intell._ 20, 226–239. https://doi.org/10.1109/34.667881 (1998).

37. Mi, A. & Huo, Z. Experimental comparison of six fixed classifier fusion rules. _Proc. Eng._ 23, 429–433. https://doi.org/10.1016/j.proeng.2011.11.2525 (2011).


As a result of the errors, the Abstract, “We overcome this limitation through ensembles of Data-efficient image Transformers (DeiTs), which not only are easy to train and implement, but also significantly outperform the previous state of the art (SOTA). We validate our results on ten ecological imaging datasets of diverse origin, ranging from plankton to birds. On all the datasets, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 29.35% to 100.00%, and often achieving performances very close to perfect classification. Ensembles of DeiTs perform better not because of superior single-model performances but rather due to smaller overlaps in the predictions by independent models and lower top-1 probabilities. This increases the benefit of ensembling, especially when using geometric averages to combine individual learners. While we only test our approach on biodiversity image datasets, our approach is generic and can be applied to any kind of images.” now reads: “We overcome this limitation through ensembles of Data-efficient image Transformers (DeiTs), which we show can reach state-of-the-art (SOTA) performances without hyperparameter tuning, if one follows a simple fixed training schedule. We validate our results on ten ecological imaging datasets of diverse origin, ranging from plankton to birds. The performances of our EDeiTs are always comparable with the previous SOTA, even beating it in four out of ten cases. We argue that these Ensembles of DeiTs perform better not because of superior single-model performances but rather due to smaller overlaps in the predictions by independent models and lower top-1 probabilities, which increases the benefit of ensembling.”


In the Introduction, “We show that while the single-model performance of DeiTs matches that of alternative approaches, ensembles of DeiTs (EDeiTs) significantly outperform the previous SOTA, both in terms of higher accuracy and of better classification of minority classes (_i.e._ rare species). We see that this mainly happens because of a higher disagreement in the predictions, with respect to other model classes, between independent DeiT models. Finally, we find that while CNN and ViT ensembles perform best when individual learners are combined through a sum rule, EDeiTs perform best when using a product rule.” now reads: “We show that while the single-model performance of DeiTs matches that of alternative approaches, ensembles of DeiTs (EDeiTs) achieve very good performances without requiring any hyperparameter tuning. We see that this mainly happens because of a higher disagreement in the predictions, with respect to other model classes, between independent DeiT models.”
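The sum rule and product rule mentioned in the original passage are the two standard ways of fusing the members' softmax outputs. As a toy illustration (the numbers below are invented, not taken from the paper), the sketch shows how a geometric average lets one confident member effectively veto a class, while an arithmetic average does not:

```python
import numpy as np

# Toy confidence vectors from three hypothetical ensemble members over 3 classes.
# Members 1 and 2 lean toward class 0; member 3 all but vetoes it.
p = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.01, 0.54, 0.45],
])

arithmetic = p.mean(axis=0)                 # sum rule
geometric = np.exp(np.log(p).mean(axis=0))  # product rule
geometric /= geometric.sum()                # renormalize to a probability vector

print(arithmetic.argmax())  # 0: class 0 still wins on average
print(geometric.argmax())   # 1: member 3's near-zero entry flips the decision
```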


In the Results, under the subheading ‘A new state of the art’, “As shown in Fig. 1, the error rate of EDeiTs is drastically smaller than the previous SOTA, across all datasets.” now reads: “As shown in Fig. 1, the error rates of EDeiTs are sometimes close to or even smaller than those of the previous SOTA.” Furthermore, under the subheading ‘Ensemble comparison’ of the same section, “While all ensembled CNNs perform similarly to each other (\(F1\text{-}score \le 0.924\)), ensembled DeiTs reach an almost perfect classification accuracy (with the \(F1\text{-}score\) reaching 0.983). With arithmetic average ensembling, we have 15 misclassifications out of 2691 test images, 14 of which on classes not associated with a specific taxon (see Supplemental Information, SI (See Footnote 1)). With geometric averaging, the performance is even higher, with only 10 out of 2691 misclassified test images.” now reads: “The CNN family reaches a maximum \(F1\text{-}score \le 0.920\) for the ensemble of Efficient-B7 networks across initial conditions. When the best CNNs are picked and ensembled, the ensemble performance (Best_6_avg) reaches \(F1\text{-}score \le 0.924\). In the case of DeiT models, the ensemble was carried out without picking the best model across different DeiTs, but it still reaches similar classification accuracy (with the \(F1\text{-}score\) reaching 0.924) with no hyperparameter tuning.”


Additionally, under the subheading ‘Why DeiT models ensemble better’, “The CNN ensemble has more RRR cases (2523) than the EDeiT (2430), but when the three models have some disagreement, the EDeiTs catch up and outperform the CNN ensembles. In particular:

* The correct RWW cases are \(2.3\times\) to \(2.6\times\) more common in the geometric average and arithmetic average EDeiT, respectively (Geometric CNN: 10, Geometric EDeiT: 23; Arithmetic CNN: 8, Arithmetic EDeiT: 21). In the SI (See Footnote 1) we show that the probability that a RWW ensembling results in a correct prediction depends on the ratio between the second and third components of the ensembled confidence vector, and that the better performance of DeiT ensembles in this situation is justified by the shape of the confidence vector.

* The correct RRW cases are \(2.4\times\) more common in the geometric average and arithmetic average EDeiT (Geometric CNN: 93, Geometric EDeiT: 226; Arithmetic CNN: 94, Arithmetic EDeiT: 225), and they represent the bulk of the improvement of DeiT versus CNN ensembles. Since the single-model performances are similar, this suggests that a higher degree of disagreement among individual models is allowing for a better ensembling.”

now reads:

“The CNN ensemble has more RRR cases (2523) than the EDeiT (2515), but when the three models have some disagreement, the EDeiTs catch up with the CNN ensembles. In particular:

* The correct RWW cases are \(2.0\times\) more common in the geometric average and arithmetic average EDeiT (Geometric CNN: 8, Geometric EDeiT: 15; Arithmetic CNN: 8, Arithmetic EDeiT: 16). In the SI (See Footnote 1) we show that the probability that a RWW ensembling results in a correct prediction depends on the ratio between the second and third components of the ensembled confidence vector, and that the better performance of DeiT ensembles in this situation is justified by the shape of the confidence vector.”
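The RRR/RRW/RWW labels count, for each test image, how many of the three ensemble members are right (R) or wrong (W). A minimal sketch of how such counts could be tallied (hypothetical code, assuming each model's predicted labels are available as arrays):

```python
import numpy as np
from collections import Counter

def agreement_profile(preds, labels):
    """Tally RRR/RRW/RWW/WWW cases for a 3-member ensemble.

    preds: array of shape (3, n_images) with each model's predicted class;
    labels: array of shape (n_images,) with the true classes.
    """
    n_right = (np.asarray(preds) == np.asarray(labels)).sum(axis=0)
    names = {3: "RRR", 2: "RRW", 1: "RWW", 0: "WWW"}
    return Counter(names[k] for k in n_right)

# Toy example with 4 images and 3 models:
preds = [[0, 1, 2, 3],
         [0, 1, 2, 0],
         [0, 2, 1, 1]]
labels = [0, 1, 2, 3]
print(agreement_profile(preds, labels))  # Counter({'RRW': 2, 'RRR': 1, 'RWW': 1})
```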


Furthermore, “For DeiTs, we have \(S=0.773\pm 0.004\), while for CNNs the similarity is much higher, \(S=0.945\pm 0.003\).” now reads: “For DeiTs, we have \(S=0.799\pm 0.004\), while for CNNs the similarity is much higher, \(S=0.945\pm 0.003\).”

In addition, “Given a fixed budget of single-model correct answers, this has a double benefit: (i) the best ensembled performance is obtained by maximizing the number of RRW combinations with respect to the RRR combinations; (ii) RWW combinations result more likely in a correct answer when the two wrong answers are different (see SI (See Footnote 1)). The situation is analogous for geometric averaging (Fig. 2c), where we further note that there can be WWW models resulting in a correct prediction, because all the (wrong) top answers of each model can be vetoed by another model.” now reads: “Given a fixed budget of single-model correct answers, RWW combinations result more likely in a correct answer when the two wrong answers are different (see SI (See Footnote 1)). The situation is analogous for geometric averaging (Fig. 2c).”
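The similarity \(S\) quoted above measures how often independently trained models of the same family make identical predictions. One plausible way to estimate such a statistic (an illustrative sketch only; the authors' exact definition is given in the original Article) is the average pairwise agreement over the test set:

```python
import numpy as np
from itertools import combinations

def pairwise_similarity(preds):
    """Average fraction of test images on which two models agree.

    preds: array of shape (n_models, n_images) of predicted classes.
    This is one plausible reading of the similarity S discussed above;
    see the original Article for the authors' exact definition.
    """
    preds = np.asarray(preds)
    scores = [(a == b).mean() for a, b in combinations(preds, 2)]
    return float(np.mean(scores)), float(np.std(scores))

# Toy example: three models, five images.
preds = [[0, 1, 2, 3, 4],
         [0, 1, 2, 3, 0],
         [0, 2, 2, 3, 4]]
mean_S, std_S = pairwise_similarity(preds)
print(f"S = {mean_S:.3f} +/- {std_S:.3f}")
```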


In Table 1, the values for “Arithmetic ensemble (accuracy/F1-score)” and “Geometric ensemble (accuracy/F1-score)” were incorrect for the Model “DeiT-Base”. The correct and incorrect values appear below.

Incorrect:

Model | Arithmetic ensemble (accuracy/F1-score) | Geometric ensemble (accuracy/F1-score)
DeiT-Base | 0.994/0.973 | 0.996/0.984

Correct:

Model | Arithmetic ensemble (accuracy/F1-score) | Geometric ensemble (accuracy/F1-score)
DeiT-Base | 0.973/0.924 | 0.972/0.922


In the Discussion section, “Besides being of simple training and deployment (we performed no specific tuning for any of the datasets), EDeiTs systematically lead to a substantial improvement in classifying biodiversity images across all tested datasets, when compared to the previous state of the art. Furthermore, our results were obtained by averaging over three DeiT models, but increasing the number of individual learners can lead to a further improvement in the performances.” now reads: “Besides being of simple training and deployment (we performed no specific tuning for any of the datasets), EDeiTs achieve results comparable to those of earlier carefully tuned state-of-the-art methods, and even outperform them in classifying biodiversity images in four of the ten datasets.”


Finally, the Supplementary Information file published with this Article contained errors in the Supplementary Text, Tables 1, 2 and 3, as well as Figs. 4, 6 and 7. The original Supplementary Figs. 1 and 2 have been removed. The original Supplementary Information file is provided below.

The original Article and accompanying Supplementary Information file have been corrected.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS


* Eawag, Überlandstrasse 133, 8600 Dübendorf, Switzerland: S. P. Kyathanahally, T. Hardeman, M. Reyes, E. Merz, T. Bulas, F. Pomati & M. Baity-Jesi

* WSL, Zürcherstrasse 111, 8903 Birmensdorf, Switzerland: P. Brun

CORRESPONDING AUTHORS

Correspondence to S. P. Kyathanahally or M. Baity-Jesi.

SUPPLEMENTARY INFORMATION

Supplementary Information.


RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


ABOUT THIS ARTICLE

CITE THIS ARTICLE

Kyathanahally, S.P., Hardeman, T., Reyes, M. _et al._ Author Correction: Ensembles of data-efficient vision transformers as a new paradigm for automated classification in ecology. _Sci Rep_ 13, 6243 (2023). https://doi.org/10.1038/s41598-023-32258-4

Published: 17 April 2023

DOI: https://doi.org/10.1038/s41598-023-32258-4