BLAST Analysis of our Initial Sequence


There are two sources of Mitragyna speciosa data that we are aware of.

A search of NCBI returns 40 sequences. These are well-annotated sequences totaling 344,569 bp, including two chloroplast sequences, ITS sequences and many genes of interest. They come from multiple researchers around the world using a variety of sequencing technologies. Anyone can download them as a FASTA file posted here.

Below is a BLASTN output table of these 40 sequences aligned to our assembly. All of them have hits. Some SNPs are expected, since the Red Vein Thai sample we sequenced is not the same strain as those sequenced previously.

The ironclad evidence that this sequence is in fact Mitragyna speciosa is the 100% identity BLAST hit to the published ITS sequence AB249645.1.
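For readers who want to repeat this check on their own BLASTN run, the following is a minimal sketch of how a tabular result file could be summarized. It assumes the search was run with the standard 12-column tabular output (-outfmt 6 in BLAST+); the function names best_hits and perfect_hits and the demo rows are hypothetical and are not taken from our actual output.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query_id: row} keeping the highest-bitscore hit per query."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if row["qseqid"].startswith("#"):
            continue  # skip comment lines such as those written by -outfmt 7
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


def perfect_hits(best):
    """Return the sorted query IDs whose best hit is 100% identical."""
    return sorted(q for q, hit in best.items() if hit["pident"] == 100.0)


if __name__ == "__main__":
    # Made-up rows for illustration only; these are not real results.
    demo = (
        "queryA\tcontig_1\t100.000\t600\t0\t0\t1\t600\t50\t649\t0.0\t1109\n"
        "queryB\tcontig_3\t99.500\t200\t1\t0\t1\t200\t1\t200\t1e-90\t360\n"
    )
    hits = best_hits(io.StringIO(demo))
    print(len(hits), "queries with hits; perfect:", perfect_hits(hits))
```

A 100% identity value only describes the aligned region, so it is worth comparing the alignment length column against the full length of the query before calling a hit perfect.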




There is also an SRA archive of 1.3M reads from the FDA. You will need to download specialized software and be knowledgeable with command line interfaces to download this data.

Run    Spots    Bases       Size      GC content    Published     Access Type
       1.3M     658.7Mbp    420.7M    35.4%         2017-05-25    public

This run has 2 reads per spot:
L=248, 100%
L=248, 100%

We attempted to assemble this data with little success. On closer inspection, the FASTQ file appears to have the forward and reverse reads concatenated into a single 499 bp read.


This can be seen in a plot of AT content against read position, which shows a spike at the strand-flipping base at 250 bp. This is consistent with the SRA table above. To make use of this data, one needs to decouple the forward and reverse reads so that assemblers do not treat each record as one contiguous 499 bp read, and to trim the strand-flipping base along with the first and last bases, which appear to be adapter-derived.
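The decoupling described above can be sketched in a few lines of Python. This is only an illustration under an assumed layout: one leading adapter base, a 248-base forward read, the strand-flipping base at position 250, a 248-base reverse read, and one trailing adapter base (1 + 248 + 1 + 248 + 1 = 499). The function names and the /1 and /2 read-name suffixes are our own choices here, and the reverse read is written out as stored, without reverse-complementing.

```python
def split_concatenated_read(seq, qual, read_len=248):
    """Split one concatenated record into (forward, reverse) parts.

    Assumed 0-based layout: [0] adapter base, [1:249] forward read,
    [249] strand-flipping base, [250:498] reverse read, [498] adapter base.
    Returns ((fwd_seq, fwd_qual), (rev_seq, rev_qual)).
    """
    expected = 2 * read_len + 3
    if len(seq) != expected or len(qual) != expected:
        raise ValueError(f"expected {expected} characters, got {len(seq)}")
    fwd = slice(1, 1 + read_len)
    rev = slice(2 + read_len, 2 + 2 * read_len)
    return (seq[fwd], qual[fwd]), (seq[rev], qual[rev])


def split_fastq(in_path, fwd_path, rev_path, read_len=248):
    """Write forward and reverse reads to two FASTQ files; return record count.

    Assumes plain four-line FASTQ records (no wrapped sequence lines).
    """
    count = 0
    with open(in_path) as src, open(fwd_path, "w") as fwd, open(rev_path, "w") as rev:
        while True:
            header = src.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = src.readline().rstrip("\n")
            src.readline()  # '+' separator line, not needed
            qual = src.readline().rstrip("\n")
            (fs, fq), (rs, rq) = split_concatenated_read(seq, qual, read_len)
            name = header.split()[0]
            fwd.write(f"{name}/1\n{fs}\n+\n{fq}\n")
            rev.write(f"{name}/2\n{rs}\n+\n{rq}\n")
            count += 1
    return count


# Example call (file names are placeholders):
# split_fastq("in.fastq", "forward.fastq", "reverse.fastq")
```

Records that do not have the expected length raise an error instead of being split silently, which makes it easy to spot files that do not follow the assumed layout.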


The Unix command cut can trim the reads down to their first 200 bases.

cut -c 1-200 in.fastq >out.fastq

These reads will map to the assembly with elevated error rates on the 3 prime end of the reads.
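Note that cut -c truncates every line of the file, including the read-name lines, which is harmless only while those lines are shorter than the cutoff. As an alternative, here is a small Python sketch that shortens only the sequence and quality lines. It assumes plain four-line FASTQ records, and the function name trim_fastq_lines is ours.

```python
def trim_fastq_lines(lines, keep=200):
    """Yield FASTQ lines with sequence and quality strings cut to `keep` characters.

    Assumes four-line records: header, sequence, '+' separator, quality.
    Header and separator lines are passed through unchanged.
    """
    for i, line in enumerate(lines):
        text = line.rstrip("\n")
        if i % 4 in (1, 3):  # line 2 (sequence) and line 4 (quality) of each record
            text = text[:keep]
        yield text + "\n"


# Example use (file names are placeholders):
# with open("in.fastq") as src, open("out.fastq", "w") as dst:
#     dst.writelines(trim_fastq_lines(src, keep=200))
```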



When only the first 100 bases of these reads, the highest-quality portion, are mapped to the assembly, 62% of the reads map. This implies that we are still on the most productive part of the shotgun sequencing curve. One more run should substantially improve the quality of the assembly.

Assembling the unmapped reads and BLASTing them against NCBI gives small hits to various plant mitochondrial and chloroplast DNA. These are likely repetitive or NUMT DNA that does not assemble at lower coverage.

To separate these reads more cleanly, use fastq-dump from the NCBI SRA Toolkit:

fastq-dump -I --split-files SRR5602600

Chemotype certificate for the Red Vein Thai kratom sample we sequenced.