The impacts of reference database selection, indicator threshold determination and target data preparation on the sequence data analysis of eDNA monitoring: taking fish as the target in the middle Yangtze River

Funding: Jointly supported by the Central Public-interest Scientific Institution Basal Research Fund (YFI202201) and the Agricultural Finance Special Project "Regular Monitoring after the Yangtze River Fishing Ban" (CJJC-2023-01).

    Abstract:

    In eDNA monitoring based on metabarcoding, the analysis and annotation of eDNA sequence data underpin the accuracy of monitoring results and their assessment. Reference database selection, indicator threshold selection, and target data preparation are the three most critical technical steps in that analysis and annotation. To clarify how the handling of these three steps affects the results, and to provide scientific support for the standardization of eDNA monitoring, this study analyzed two sets of COI gene sequence data from eDNA monitoring in the middle reach of the Yangtze River and ran three sets of experiments on fish detection to test: 1) the impact of different reference databases and species annotation algorithms on the annotation results; 2) the impact of different OTU clustering sequence similarity thresholds and species annotation classification confidence thresholds (sequence identity and sequence coverage) on the annotation results; and 3) the impact of the sequence richness of each species in the target data on the annotation results.

    The results showed that: 1) With the Blast algorithm, the species annotated against three versions of the NCBI nt library were largely consistent (72%-78%), as were those annotated against two local sequence reference libraries (91%-96%); across all five reference libraries, 52%-68% of the annotated species were consistent. With the nt libraries, the RDP Classifier algorithm recovered more than 95% of the species annotated by Blast and annotated 151%-443% more species than Blast, but most of the additional species were misannotations. With the local reference libraries, the RDP Classifier algorithm recovered 66%-85% of the species annotated by Blast, and several results were annotated only to family or genus level. 2) Setting the OTU clustering sequence similarity threshold to 0.999 rather than 0.99 yielded 154%-209% more OTUs and 240%-490% more OTUs annotated as fish. Raising the classification confidence threshold (Blast algorithm, sequence identity and sequence coverage) from 0.8 to 0.99 left the annotated species composition (over 94% consistent) and OTU composition (over 83% consistent) largely unchanged, whereas a threshold of 0.7 gave species and OTU compositions that differed substantially from those obtained at 0.8 and above. 3) With an OTU clustering sequence similarity threshold of 0.999 and a classification confidence threshold of 0.9, multiple-sequence data yielded the most fish species and OTUs and the highest species annotation accuracy (81.49%): 7% more species, 215% more OTUs, and 5% higher accuracy than single-sequence data.

    In practice, the accuracy of eDNA sequence data annotation can be improved by building and refining local reference databases, optimizing the OTU clustering sequence similarity and classification confidence (sequence identity and sequence coverage) thresholds, and increasing the richness of the target data. However, owing to the limitations of species annotation algorithms, annotation errors and omissions are likely to persist for a long time, and species annotation accuracy in COI-based eDNA monitoring is typically below 85%.
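The threshold comparison in experiment 2 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, the `annotate` and `jaccard` helpers, and the toy hits with placeholder species names are all assumptions made for illustration. The sketch assumes that the classification confidence threshold is applied to both the sequence identity (consistency) and the sequence coverage of each Blast hit, and that "consistency" between two annotation results is measured as the share of species common to both (Jaccard index); the abstract does not state the exact formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One hypothetical Blast hit for an OTU (fields are illustrative)."""
    otu_id: str
    species: str
    identity: float   # sequence identity as a fraction, 0-1
    coverage: float   # sequence coverage as a fraction, 0-1


def annotate(hits, threshold):
    """Assign each OTU the best hit whose identity AND coverage reach the threshold."""
    best = {}
    for h in hits:
        if h.identity < threshold or h.coverage < threshold:
            continue  # hit fails the classification confidence threshold
        cur = best.get(h.otu_id)
        if cur is None or (h.identity, h.coverage) > (cur.identity, cur.coverage):
            best[h.otu_id] = h
    return {otu: h.species for otu, h in best.items()}


def jaccard(a, b):
    """Share of species common to both sets (assumed consistency measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy data, invented purely to exercise the functions.
hits = [
    Hit("otu1", "species_A", 0.995, 0.99),
    Hit("otu1", "species_B", 0.97, 0.99),
    Hit("otu2", "species_B", 0.92, 0.95),
    Hit("otu3", "species_C", 0.75, 0.90),
]

# Species sets obtained at three candidate thresholds.
species_at = {t: set(annotate(hits, t).values()) for t in (0.7, 0.8, 0.9)}

for low, high in ((0.7, 0.8), (0.8, 0.9)):
    score = jaccard(species_at[low], species_at[high])
    print(f"threshold {low} vs {high}: consistency {score:.2f}")
```

In this toy data the low-identity hit for `otu3` passes only at 0.7, so the species set at 0.7 differs from the sets at 0.8 and 0.9, which are identical to each other. Real data would come from parsed Blast output rather than hand-written records.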

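Cite this article

Xu Lanxin, Yang Haile, Liu Zhigang, Du Hao. The impacts of reference database selection, indicator threshold determination and target data preparation on the sequence data analysis of eDNA monitoring: taking fish as the target in the middle Yangtze River. Journal of Lake Sciences, 2024, 36(6): 1843-1852. DOI:10.18307/2024.0631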

History
  • Received: 2023-08-16
  • Last revised: 2024-04-01
  • Published online: 2024-11-05
  • Publication date: 2024-11-06