New problems in exploring distributed data

Publication Type dissertation
School or College College of Engineering
Department Computing
Author Tang, Mingwang
Title New problems in exploring distributed data
Date 2015-05
Description In the era of big data, many applications generate continuous online data from distributed locations and scattered devices. Examples include data from social media, financial services, and sensor networks. Meanwhile, large volumes of data can be archived or stored offline in distributed locations for further analysis. Challenges from data uncertainty, large-scale data size, and distributed data sources motivate us to revisit several classic problems for both online and offline data exploration. The problem of continuous threshold monitoring for distributed data is commonly encountered in many real-world applications. We study this problem for distributed probabilistic data. We show how to prune expensive threshold queries using various tail bounds and how to combine tail-bound techniques with adaptive algorithms for monitoring distributed deterministic data. We also show how to approximate threshold queries based on sampling techniques. Threshold monitoring can tell only whether a monitoring function is above or below a threshold constraint, not how far the function is from it. This motivates us to study the problem of continuously tracking functions over distributed data. We first investigate the tracking problem on a chain topology. Then we show how to solve tracking problems in a distributed setting using solutions for the chain model. We study online tracking of the max function on "broom" tree and general tree topologies in this work. Finally, we examine building scalable histograms for distributed probabilistic data. We show how to build approximate histograms based on a partition-and-merge principle on a centralized machine. Then we show how to extend our solutions to distributed and parallel settings to further mitigate scalability bottlenecks and deal with distributed data.
Type Text
Publisher University of Utah
Subject data synopsis; distributed; histogram; monitoring; tracking; uncertainty
Dissertation Institution University of Utah
Dissertation Name Doctor of Philosophy
Language eng
Rights Management Copyright © Mingwang Tang 2015
Format Medium application/pdf
Format Extent 27,170 bytes
Identifier etd3/id/3829
ARK ark:/87278/s6768pm4
Setname ir_etd
ID 197380
Reference URL https://collections.lib.utah.edu/ark:/87278/s6768pm4