Top-k on Sequences: A New Approach to Enhanced Similarity Search
Abstract
Performing efficient top-k similarity search on massive data sets is challenging due to the increasing complexity with data scale, and many existing methods are computationally expensive. In this context, top-k sequence similarity search, which aims to identify the k most relevant sequences to a query sequence using a scoring function, becomes a critical task.
We argue that the similarity between two sequences should not be based only on their longest common subsequence, but should consider all common subsequences. This approach allows for a more comprehensive similarity score, recognising as more similar sequences that share many subsequences and thus providing
a more meaningful ranking.
Our study addresses these challenges by proposing: (1) a novel scoring function that assigns higher scores to candidates that share more subsequences with a query sequence, and (2) an efficient method that factorises the computation while computing all sequence scores simultaneously. Our contributions include a detailed analysis of existing techniques and the proposal of an efficient method for top-k similarity search. We validate our approach through extensive experimental evaluations and demonstrate its efficiency compared to a baseline method. Our results provide a promising solution for sequence similarity search that significantly improves the speed of top-k sequence retrieval in large mobility datasets.