Zeeshan Ahmed, Duygu Ucar
Ahmed and Ucar present a platform for processing ATAC-seq samples. The platform starts with a FASTQ file and generates peak calls. The rationale for developing this program is to provide a user-friendly solution for processing ATAC-seq samples; however, it is a general pipeline for finding read-depth peaks and could also be applied to ChIP-seq data, for example.
However, this tool does not really make the analysis 'easy' for the end user, as considerable effort is needed to install all the software dependencies. Also, it seems that command-line tools are required to inspect the results, and there are no tools to assist in interpreting or visualising the results of the peak calling.
1. The authors should do a more thorough comparison with features of other programs for processing ATAC-seq data. The alternatives are only mentioned in passing, and no real comparison of features is made.
2. Consider making a Galaxy workflow incorporating this pipeline. In particular, I am not convinced that a standalone tool is a more straightforward solution than using Galaxy, particularly if the user only has a modest number of samples to process. This is because the user still has to install all the software dependencies. One advantage of Galaxy is that the user does not need to install any software, as it will already be installed on a Galaxy installation.
3. One of the major points of this preprint is that it makes the analysis 'interactive'. However, there is little evidence of interactivity, and I am not sure what interactive means in the context of a pipeline for processing data. The interactivity seems to consist mostly of being able to modify various parameters in a GUI and pressing the 'run' button. To be truly interactive, some kind of visualization of the data must be presented, along with tools to choose specific analysis pipelines on the basis of this visual feedback. At the moment it seems that the user must still inspect the results in the file system.
4. The authors have run the pipeline on GM12878 (and presumably other samples) but provide no results from these runs. The utility of the pipeline would be much clearer if some of the results obtained were presented. This gets back to the point about visualisation: it seems the tool provides no data visualization?
5. Downstream processing of peak calls. Does the tool offer any downstream processing of peak calls? I don't think the authors can claim this is an ATAC-seq pipeline if it doesn't offer some downstream tools, e.g. for identifying changes in histone positioning, maybe some kind of Fourier transform of peaks? Something which would assist in interpreting the output as the result of an ATAC-seq experiment.
1. There are too many flow diagrams, and many of them essentially represent the same information in different ways. I understand that they each have a subtly different point, e.g. one is to illustrate the file structure created by the program. However, it would be better to have a single flowchart illustrating everything, perhaps with extra figure panels to illustrate other relevant points of interest, like the file structure generated. Also, in Figure 2 it's impossible to get any useful information out of the screenshots included, so I don't see the point. Moreover, the point of the tool is to help biologists not familiar with the command line, so why does Figure 2 show commands and outputs on the command line?
Ben N Bimber, Michael J Raboin, John Letaw, Kimberly Nevonen, Jennifer E Spindel, Susan McCouch, Rita Cervera-Juanes, Eliot Spindel, Lucia Carbone, Betsy Ferguson, Amanda Vinson
Bimber et al. present an interesting approach to whole-genome genotyping of variants in an extended primate family using a combination of deep WGS on a few samples and GBS on the remaining samples, combined with imputation. This is a very interesting way to try to cost-effectively genotype extended pedigrees in populations which don't have the benefit of whole-genome genotype arrays. It is important to find approaches for cost-effective genotyping in this setting, and I think this is a valid approach to try. However, I am not sure that the results presented conclusively demonstrate that this is a good way forward. Also, there were a few things which were hard to follow and could be presented more clearly.
1. What was the rationale for using chromosome 19 only to evaluate the best approach? There does seem to be a lot of variation in the genotype accuracy plots, which using more data (i.e. whole-genome) would help to address.
2. Do you see that some individuals consistently have low accuracy and others have higher accuracy across different chromosomes? One way to show this would be to use different symbols for each of the family members. I think this could be informative in terms of seeing which positions in the pedigree were poorly imputed.
3. The genotyping accuracy presented seems a bit unsatisfactory: 92% median accuracy is much lower than what is expected from genotyping arrays, low-coverage population sequencing with imputation, etc., and probably makes downstream analyses error-prone. GIGI estimates a posterior probability distribution over genotypes at each marker, so why not use this to identify which genotype calls are more confident, and to set to missing those calls which are not confident above a certain threshold? This would allow you to trade off the amount of missing data against the accuracy. It would be preferable, for example, to have 10% missing data but 99% genotyping accuracy.
4. The 25% allele frequency in the family seems to be a very high threshold, and would mean losing a lot of variants. I think it's important to investigate genotyping accuracy at different frequency bands, and also to present the number of variants which have family allele frequency in each range.
5. Did you investigate not applying such stringent selection criteria to the framework markers, but instead using all of them?
6. I don't think you have conclusively shown that GBS is a better strategy than alternative approaches. In particular, I am not convinced that a combination of WGS with skim sequencing would not work better. You write that the optimal strategy is 1 30X WGS per 3-5 GBS, so let's say that is roughly 1 WGS + 4 GBS, which translates to $3000 + 4 * $50 = $3200. Based on these costings, a GBS run costs the same as 0.5X of sequencing, so for the same money you could do one 16X and four 4X genomes, or one 28X and four 1X genomes. I think it would be worth comparing the results to what you would get if you employed this strategy (or even just five 6X genomes). I think you have enough WGS data in this experiment to evaluate this strategy by downsampling some of the 30X genomes (using the remaining reads to call the gold-standard genotypes for comparison), and just using the GIGI-pick samples as 30X imputation framework samples. I think this would show more conclusively whether GBS is the most cost-effective approach.
7. Regarding the conclusion that the optimal strategy is 1 30X WGS per 3-5 GBS: why do you think it is applicable to all family structures? Also, the process of selecting informative individuals is not clearly explained and appears to have some circularity. For the given family in this paper, "Individuals with WGS data were added consecutively in the following order: B, H, J, F, M, K, P, C, D". The order was suggested by the GIGI-pick algorithm with WGS data from these individuals. How do you know which individuals to select without WGS? Would GIGI-pick select the same individuals if GBS data were used instead?
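The cost comparison in comment 6 can be checked with a short calculation. The sketch below takes the two prices quoted in that comment ($3000 for a 30X genome, $50 per GBS run) and assumes that sequencing cost scales linearly with depth, which is a simplification. The function and variable names are illustrative only.

```python
# Prices are those quoted in comment 6; linear cost-per-depth is an assumption.
WGS_30X_COST = 3000.0   # dollars for one 30X genome
GBS_COST = 50.0         # dollars for one GBS run

COST_PER_X = WGS_30X_COST / 30.0   # dollars per 1X of coverage


def design_cost(depths):
    """Cost in dollars of sequencing one genome at each depth in `depths`."""
    return sum(depth * COST_PER_X for depth in depths)


# Budget for the proposed design: one 30X WGS plus four GBS samples.
budget = WGS_30X_COST + 4 * GBS_COST

# Depth that the price of one GBS run would buy as skim sequencing.
gbs_equivalent_depth = GBS_COST / COST_PER_X

# Alternative WGS-only designs mentioned in the comment.
designs = {
    "1 x 16X + 4 x 4X": [16, 4, 4, 4, 4],
    "1 x 28X + 4 x 1X": [28, 1, 1, 1, 1],
    "5 x 6X": [6, 6, 6, 6, 6],
}

print(f"budget: ${budget:.0f}; one GBS run buys about {gbs_equivalent_depth}X")
for name, depths in designs.items():
    print(f"{name}: ${design_cost(depths):.0f}")
```

Under these assumptions the first two designs match the $3200 budget and five 6X genomes cost $3000, slightly under it.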
1. Regarding the line: “We additionally removed a set of 578 markers that consistently performed poorly in imputation.”
How was this assessed? If it was assessed by comparison to the gold standard genotype results from WGS, then it could result in a circular logic, whereby you improve the apparent imputation results by removing markers which are poorly imputed.
2. How did you assess accuracy at missing data sites? I presume they were left out of this calculation. What was the missing rate?
3. It's very hard to compare Fig. 2C and 2D. I wonder if a log x scale would help.
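The trade-off between missing data and accuracy raised above for GIGI's posterior genotype probabilities, and the related question of how missing sites enter the accuracy calculation, can be sketched as follows. This is a minimal illustration, not GIGI's own interface: the input layout, the function name and the toy numbers are all assumptions made for the example.

```python
def threshold_calls(posteriors, truth, min_confidence):
    """Call the most probable genotype per site, or leave it missing.

    posteriors: list of dicts mapping genotype (0, 1 or 2 alternate alleles)
                to its posterior probability (hypothetical input layout).
    truth:      gold-standard genotypes at the same sites.
    Returns (accuracy among called sites, fraction of sites set to missing).
    """
    called = 0
    correct = 0
    for probs, true_gt in zip(posteriors, truth):
        best_gt, best_p = max(probs.items(), key=lambda item: item[1])
        if best_p < min_confidence:
            continue  # set to missing; excluded from the accuracy
        called += 1
        if best_gt == true_gt:
            correct += 1
    missing_rate = (len(truth) - called) / len(truth)
    accuracy = correct / called if called else float("nan")
    return accuracy, missing_rate


# Toy data, invented for illustration only.
posteriors = [
    {0: 0.98, 1: 0.02, 2: 0.00},
    {0: 0.05, 1: 0.90, 2: 0.05},
    {0: 0.55, 1: 0.40, 2: 0.05},
    {0: 0.01, 1: 0.04, 2: 0.95},
    {0: 0.45, 1: 0.50, 2: 0.05},
]
truth = [0, 1, 1, 2, 0]

for cutoff in (0.0, 0.9):
    acc, miss = threshold_calls(posteriors, truth, cutoff)
    print(f"cutoff {cutoff}: accuracy {acc:.2f}, missing {miss:.2f}")
```

Sweeping the cutoff over a real data set and reporting both numbers at each value would show directly how much missing data buys a given accuracy.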
Jan Schroeder, Santhosh Girirajan, Anthony T Papenfuss, Paul Medvedev
Schroeder et al. explore the potential benefits of augmenting the reference prior to calling copy number variation. They make the observation that calling insertions is much harder than calling deletions, and propose an elegant and simple solution: make a new 'expanded' reference consisting of the genome plus all insertions from a second reference, and then focus on calling only deletions in this expanded reference. The authors have written tools for wrapping any CNV detection algorithm, essentially running the caller on the expanded reference and then projecting back to the original coordinates. The authors demonstrate that such an approach leads to a higher sensitivity to detect insertions (using a single caller, Delly, in both hg18 and ref+ space). They validate the calls by looking for direct evidence in the sequence data supporting either the inserted or non-inserted sequence (i.e. reads spanning the breakpoint with or without the inserted sequence). It's a nice approach, but I am concerned that its shortcomings have been somewhat overlooked in this paper, and that the benefits have been overstated, as I outline below.
1. One issue which is not well discussed is what happens when there is not enough information for the caller to make a call on ref+ (i.e. not enough read depth or spanning reads). In this case, Delly would make a call of no CNV (as there is not enough information to make a call). This would tend to create false positive 'duplication' calls when mapped back to hg18. Indeed the authors saw an inflated FDR, and perhaps this is the reason? Of course, the corresponding outcome of not making a call in hg18 is a false negative, hence the lower sensitivity of Delly calls in hg18. I would suggest the authors more clearly acknowledge this potential shortcoming of their approach, and discuss potential ways to alleviate this issue (e.g. by excluding regions with low coverage?).
2. The increased accuracy figure is misleading, because it implies a 67% increase in accuracy for all insertions, but in fact it is just for those insertions which they have included in their augmented reference. This is of course likely a fraction of all insertions which are present in the donor genome. The proposed approach as laid out only accommodates insertions from a single extra genome. So the 67% gain is highly artificial.
3. I found this 67% figure a bit confusing (stated on line 38) as it seems to contradict line 103 (which says 31%). Probably one is an average per sample, and the other is not, and yet standard errors are reported for both? Also, if the authors are reporting sensitivity, it would be good to also see specificity (i.e. not just FDR), so that the reader can directly calculate the accuracy from sensitivity and specificity.
4. Delly is not really designed for insertion detection. The abstract of Delly states that it's for finding deletions and tandem duplications. The authors seem to exclude tandem duplications. So this comparison seems bound to favour ref+, as Delly can find deletions but can't find insertions very well. Of course a tandem duplication will result in an insertion, but probably most of the Venter insertions included in this benchmark are not tandem duplications. The authors should state what proportion of the Venter insertions are tandem duplications, and thus potentially typeable by Delly. It seems quite likely that the huge increase in sensitivity is mostly a reflection of how much better Delly is at finding deletions than insertions, and the difference would not be as extreme for other callers.
5. I would think a better comparison would be using a tool which actively attempts to find both insertions and deletions (e.g. Pindel or Dindel, amongst others).
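On the request in comment 3 above for specificity alongside sensitivity: overall accuracy is a prevalence-weighted average of the two, so the fraction of tested sites that truly carry an insertion is needed as well. A minimal sketch follows; the numbers are purely illustrative and none are taken from the manuscript.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Overall accuracy as a prevalence-weighted mean of sensitivity and specificity.

    prevalence is the fraction of evaluated sites that are true positives.
    """
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Illustrative values only (hypothetical, not from the paper).
example = accuracy(sensitivity=0.80, specificity=0.95, prevalence=0.10)
print(f"accuracy: {example:.3f}")
```

When true insertions are rare among the evaluated sites, accuracy is dominated by specificity, which is one reason to report both quantities separately.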
1. As this paper explores copy number variation only, perhaps the title might use the term 'copy number variation' instead of structural variation. I realise that this approach could be used for other types of structural variation, e.g. inversions, but this has not been explored in this paper.
2. The authors use the term 'adjacency' without explaining what this means. I think most readers would not be familiar with the use of this term in this context.
GWATCH is an innovative and well-conceived idea, successfully implemented and presented as a web platform. Such tools are necessary and timely in order to tackle the burgeoning data in genomics. GWATCH can prove to be useful, especially in light of its applications in genomic studies such as genome-wide association studies (GWAS). New GWAS based on next-generation sequencing (NGS) data, currently taking place across different parts of the world, will require such online tools in order to quickly compare, analyse and replicate genomic association results. To this end, the innovative visualization tools developed through GWATCH, and the detailed results for every SNP for different studies, can prove to be very useful for researchers. The manuscript is well written, and the online tutorial and the content of the website are simple but useful and easy to follow.
However, in order to publish this work in accordance with the intended content and purpose of GWATCH, we would like to advise the following corrections and modifications.
1. The manuscript mentions that GWATCH can do association and visualization of indels and copy number variants (CNVs), but there are no examples of these. Visualization of copy-number classes and breakpoints along with the SNP data might be useful for researchers, hence an example of how to do this could be added to the GWATCH documentation.
2. It might be useful if the GWATCH results and web pages showed the genomic build of the datasets being displayed. Further, allowing users to convert from one build to another could prove useful. Build coordinates have changed in the past and are likely to change again in future, hence this is an important issue to address. This feature might also be added to the search fields for genomic coordinates.
3. The Active-Datasets tab on the GWATCH web-portal has long names for the various datasets stored, but no identifiers (ids) specific to GWATCH database. Adding an identifier might be more useful for users for searching and comparing results in future.
4. The authors state that GWATCH can be applicable to sequencing based GWAS as well, but there is no documentation in the manuscript or online, regarding how to upload sequencing results or in which format they should be used. In case the authors want the users to convert sequencing results into the GWATCH specific format, then some guidelines and instructions might be provided. If the authors wish to mention sequencing capability of GWATCH in the manuscript, they must show some examples or details of how to use sequencing data for GWATCH.
5. The GWATCH database should also offer some general data-mining tools which require less prior knowledge from users. For example, at present it seems that one can search by gene name only if the user first selects the correct chromosome. There is no genome-wide tool for searching by gene names or SNP ids. This is somewhat inconvenient and could be easily rectified.
6. On the website, there seems to be no way of registering new users. There is only login for existing users. The authors might want to add further documentation and details regarding how to do this.
7. The highway-browser is a novel and useful tool for visualizing association results. However, some improvements may be added, such as allowing users to choose colours (especially for colour-blind people). Further, an option to filter by p-value threshold may help in visually plotting interesting results on the highway-browser and in removing background noise.
8. The TRAX report does not mention the genotyping-platform and also the genomic-build for the reported SNP. These details can be added.
9. The authors mention the calculation of principal components analysis and meta-analysis for their example AIDS datasets. It is not clear from the manuscript whether this functionality is intended to be part of the GWATCH platform or is left for the user to do. The authors could also include details on whether and how covariates can be added to the association model in GWATCH.
10. GWAS Central is a similar online tool available at present. The authors could highlight the advantages and differences of GWATCH relative to GWAS Central, and give instances where it can prove more useful for a researcher. The authors may also want to list the current limitations of GWATCH and their planned future work, mainly how they intend to develop the platform and tackle upcoming challenges.
11. The authors might want to clarify more regarding the open source license for this project.
We hope the authors find these comments useful,
Tisham De and Lachlan Coin