OptiM—a referential genome compression algorithm based on suffix array using an optimal strategy for seeking common substrings

Authors

  • Tingting Yi
  • Jianhua Chen

DOI:

https://doi.org/10.56028/aetr.14.1.90.2025

Keywords:

gene compression; reference compression; suffix array; shortest path; optimal matching.

Abstract

In the field of referential genome compression research, how to select appropriate matching substrings, which are used to encode the target sequence, from the common substrings between the reference and the target sequences to improve the compression performance is a key issue. This study proposed the OptiM algorithm, which uses an optimal matching strategy that considers multiple factors to improve the compression performance. The candidate common substrings are reprsented as nodes and a directed graph is constructed. The storage overhead of method is measured in the form of node cost and edge cost. The nodes on the lowest cost path are selected as subsequent matching substrings through the shortest path algorithm. In the experiment, we used 8 sets of human genome as the data set for compression testing. The results show that compared with the existing algorithms, the OptiM algorithm improved the compression performance.

Downloads

Published

2025-05-29