A Performance Study of a Large-scale Data Collection Problem
Date
2002-08-01
Author
Chou, Cheng-Fu
Wan, Yung-Chun (Justin)
Cheng, William C.
Golubchik, Leana
Khuller, Samir
Abstract
In this paper, we consider the problem of moving a large amount of
data from several source hosts to a destination host over a wide-area
network, i.e., a large-scale data collection problem. This problem is
important since improvements in data collection times in many
applications such as wide-area upload applications, high-performance
computing applications and data mining applications are crucial to
the performance of those applications.
Existing approaches to large-scale data transfer either send data
directly, i.e., direct methods, or use ``best''-path type of
application-level re-routing techniques, which we refer to as
non-coordinated methods. However, we believe that in the case of
large-scale data collection applications, it is important to
*coordinate* data transfers from multiple sources. More
specifically, our coordinated method would take into consideration the
transfer demands of all source hosts and then schedule all data
transfers in parallel by using all possible existing paths between the
source hosts and the destination host.
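
To make the contrast between direct and coordinated transfers concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a fluid model in which each link carries at most bandwidth x T units of data in T seconds, and it finds the smallest feasible T by binary search over a max-flow feasibility check. The host names, demands, bandwidths, and the helper names `max_flow`, `direct_time`, and `coordinated_time` are all hypothetical.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow; cap is a dict-of-dicts of edge capacities."""
    res = {s: {}, t: {}}
    for u in cap:
        for v, c in cap[u].items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0.0) + c
            res[v].setdefault(u, 0.0)  # reverse (residual) edge
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-9 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total
        # Find the bottleneck capacity along the path, then augment.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        total += bottleneck

def direct_time(demand, bw, dest):
    """Direct method: every source uses only its own link to dest."""
    return max(d / bw[src][dest] for src, d in demand.items())

def coordinated_time(demand, bw, dest, tol=1e-6):
    """Smallest T (fluid model, an assumption) at which all demands fit."""
    total = sum(demand.values())
    super_source = "_super_source"  # hypothetical helper node

    def feasible(T):
        cap = {super_source: dict(demand)}
        for u in bw:
            cap[u] = {v: rate * T for v, rate in bw[u].items()}
        return max_flow(cap, super_source, dest) >= total - 1e-6

    lo, hi = 0.0, direct_time(demand, bw, dest)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical toy network: sources A and B, destination D.
demand = {"A": 100.0, "B": 10.0}                     # data to send (MB)
bw = {"A": {"D": 1.0, "B": 10.0}, "B": {"D": 10.0}}  # link rates (MB/s)

print("direct:", direct_time(demand, bw, "D"))
print("coordinated:", coordinated_time(demand, bw, "D"))
```

In this assumed network, host A has a slow direct link to D but a fast link to B, so relaying most of A's data through B cuts the collection time from 100 s to roughly 10 s. The fluid model ignores store-and-forward delays and changing network conditions, so it only illustrates why considering all sources' demands together can help.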
We present a performance and robustness study of different data
collection methods. Our results show that coordinated methods can
perform significantly better than non-coordinated and direct methods
under various degrees and types of network congestion. Moreover, we
also show that coordinated methods are more robust than
non-coordinated methods under inaccuracies in network condition
information. Therefore, we believe that coordinated methods are a
promising approach to large-scale data collection problems.
Also UMIACS-TR-2002-62