Significant Digits in Computational Genomics

Very often a computational genomics project in the lab produces output tables, and I have noticed that in many cases some numeric columns stretch VERY LOOOOOOONG. To display a chromosome location it is totally legitimate to use a long number, e.g. chr1 42392716, because it is genuinely different from chr1 42392715 and chr1 42392717. However, for other quantities, such as false discovery rate, transcription factor motif matching score, or differential expression fold-change, it does not make sense to report 10 significant digits. For FDR, most people just want to know whether the gene / peak is around 1%, 5%, 20% or 80% FDR, so there is no need to show the FDR as 1.238346226253182% (our FDR estimate simply does not have that level of accuracy). Similarly, if cross-validation tells us that the area under the curve of our prediction is around 0.813 (pretty good), then it makes sense to report classification scores with only 2-3 significant digits, even though the regression model might spit out 100 digits after the decimal point. Using fewer significant digits is easier to visualize, creates smaller files, and simply makes better sense.
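
As a minimal sketch of how this can be done in practice (assuming the table lives in a pandas DataFrame; the column names "chrom", "start", "fdr", and "log2_fc" and the helper round_sig are hypothetical, just for illustration):

import numpy as np
import pandas as pd

def round_sig(values, sig=3):
    """Round floats to `sig` significant digits, leaving zeros and NaN/inf alone."""
    x = np.asarray(values, dtype=float)
    out = x.copy()
    mask = np.isfinite(x) & (x != 0)
    magnitude = np.floor(np.log10(np.abs(x[mask])))
    factor = 10.0 ** (sig - 1 - magnitude)
    out[mask] = np.round(x[mask] * factor) / factor
    return out

# Example table: genomic coordinates stay as exact integers,
# while statistics get trimmed to a few significant digits.
df = pd.DataFrame({
    "chrom":   ["chr1", "chr1"],
    "start":   [42392716, 42392717],            # keep every digit
    "fdr":     [0.01238346226253182, 0.20417],  # extra digits are meaningless
    "log2_fc": [1.8734629102, -0.4521907],
})

df["fdr"] = round_sig(df["fdr"], sig=2)
df["log2_fc"] = round_sig(df["log2_fc"], sig=3)

# Alternatively, leave the DataFrame untouched and trim only at write time:
# float_format applies to float columns only, so integer coordinates are unaffected.
df.to_csv("results.tsv", sep="\t", index=False, float_format="%.3g")

The float_format route is the least invasive option, since the full-precision values remain in memory for any downstream computation and only the written file is trimmed.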
