Static and adaptive indexing framework for big data using predictor logic / Aisha Siddiqa

Bibliographic Details
Main Author: Aisha Siddiqa
Format: Thesis
Published: 2017
Subjects:
Description
Summary: Big data grows exponentially, comes in various forms, and requires efficient data processing systems for fast retrieval. The disruptive features associated with big data have drawn attention from research and industry; research efforts aim to explore viable solutions that can improve data retrieval performance for better insight. Indexing has undoubtedly contributed to increased search performance for big data sets, and researchers have used many indexing structures for big data, such as clustered and non-clustered indexes. However, because of the continuous increase in data size, contemporary big data indexing mechanisms are inadequate to achieve efficiency in query responses. Clustered indexing approaches are constrained by the number of replicas and so cannot offer indexing on a sufficient number of attributes, whereas non-clustered indexing implementations incur high indexing overhead. Therefore, existing big data indexing structures are unable to achieve the maximum index hit ratio. The aim of this study is to expedite the data retrieval process with minimum indexing overhead and maximum index hit ratio against search queries for big data by using a non-clustered indexing approach. Static indexes are created from a user-provided list of index attributes before query execution starts, and are then updated adaptively as the query workload changes to obtain an increased index hit ratio. We investigate a contemporary big data indexing implementation and analyze its inefficiency in index creation time and index size. Furthermore, we observe that because of the limited number of indexes available with clustered indexing approaches, most queries are executed without using indexes. Thus, we propose a novel indexing framework for big data, named SmallClient, with minimized indexing overhead, improved search performance, and improved index hit ratio. SmallClient leverages the B-Tree indexing structure and uses novel predictor logic for indexing.
We collected data for indexing overhead (in terms of both indexing time and index size) as well as search performance and index hit ratio for static and adaptive indexing, respectively, to validate the performance of the framework. We used benchmarking and mathematical modeling to verify the SmallClient results. The indexing time results show that SmallClient has decreased indexing time overhead by up to 32% from 47%, taken by the Lucene indexing library. Similarly, index size overhead is 41% for large data sets where Lucene fails to create indexes. The results also show that the search performance of SmallClient is more than 92% without intervening data uploading cost and that this framework achieves an improved index hit ratio by adaptively updating indexes.