Analysis of Covid-19 Genome Sequences based on Geo-Locations

  • Aqsa Umar Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
  • Naeem Ahemd Mahoto Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
  • Sania Bhatti Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
  • Sapna Rathi Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
Keywords: COVID-19, Sequential pattern mining, Genome sequences, Closed sequential patterns, Nucleotide bases, Amino acid codons.

Abstract

The COVID-19 pandemic has become one of the most serious worldwide health risks of the 21st century. Examining the genomic sequences of COVID-19 strains is necessary to understand the virus’s behavior, its origin, and how rapidly it mutates. This paper analyzes the COVID-19 genome sequences (CGS) of China, Pakistan, and India. We apply sequential pattern mining (SPM), specifically a closed sequential pattern technique, to discover valuable information from these three strains. First, the genome sequence data files are transformed into a computer-readable corpus of CGS, and the SPM technique is applied to discover frequent patterns of nucleotides. Second, frequent amino acid codons are extracted from the three strains. Third, we evaluate the performance of the proposed approach in terms of execution time, number of frequent patterns, and memory consumption. The results suggest that ACA, a codon of the amino acid threonine, with support 1576 in the Pakistan strain, is the most frequent pattern compared with the other two strains of CGS. Furthermore, the performance evaluation of the CloFAST algorithm (closed sequential pattern mining using sparse and vertical id-lists) shows that when the user-specified minimum support threshold is low, the resulting high number of frequent patterns consumes more time and memory.
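
The idea of a frequent codon pattern with a support value can be illustrated with a minimal Python sketch. It is not the CloFAST algorithm and not the authors' implementation: it simply counts overlapping three-base windows in one sequence and keeps those that meet a minimum support. Treating support as a raw occurrence count is an assumption made for illustration, and the function names, the toy sequence, and the partial codon table are all hypothetical.

```python
from collections import Counter

# Partial excerpt of the standard genetic code (DNA codons), for illustration only.
CODON_TO_AMINO_ACID = {
    "ACA": "Threonine", "ACC": "Threonine", "ACG": "Threonine", "ACT": "Threonine",
    "ATG": "Methionine", "TTT": "Phenylalanine", "GGT": "Glycine",
}


def count_trinucleotides(sequence: str) -> Counter:
    """Count every overlapping 3-base window in a nucleotide sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def frequent_patterns(sequence: str, min_support: int) -> dict:
    """Keep only the 3-base patterns whose count reaches min_support.

    Assumption: support is the number of overlapping occurrences in the
    sequence; the paper's exact definition may differ.
    """
    counts = count_trinucleotides(sequence)
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Toy sequence (hypothetical, not real genome data).
toy_sequence = "ACAACAGGTACATTT"
for pattern, support in frequent_patterns(toy_sequence, min_support=2).items():
    amino_acid = CODON_TO_AMINO_ACID.get(pattern, "not in excerpt")
    print(pattern, support, amino_acid)
```

Lowering `min_support` in this sketch returns more patterns, which is the same trade-off the abstract reports: a low threshold yields more frequent patterns and therefore more work for the miner.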

Published
2021-12-22
How to Cite
[1]
A. Umar, N. Mahoto, S. Bhatti, and S. Rathi, “Analysis of Covid-19 Genome Sequences based on Geo-Locations”, PakJET, vol. 4, no. 4, pp. 41-45, Dec. 2021.