A Performance Study of a Large-scale Data Collection Problem
Wan, Yung-Chun (Justin)
Cheng, William C.
In this paper, we consider the problem of moving a large amount of data from several source hosts to a destination host over a wide-area network, i.e., a large-scale data collection problem. This problem is important because data collection time is crucial to the performance of many applications, such as wide-area upload applications, high-performance computing applications, and data mining applications. Existing approaches to this problem transfer data either directly, i.e., direct methods, or through "best"-path application-level re-routing techniques, which we refer to as non-coordinated methods. However, we believe that in large-scale data collection applications it is important to *coordinate* data transfers from multiple sources. More specifically, our coordinated method takes into consideration the transfer demands of all source hosts and then schedules all data transfers in parallel, using all possible existing paths between the source hosts and the destination host. We present a performance and robustness study of different data collection methods. Our results show that coordinated methods can perform significantly better than non-coordinated and direct methods under various degrees and types of network congestion. Moreover, we show that coordinated methods are more robust than non-coordinated methods under inaccuracies in network condition information. Therefore, we believe that coordinated methods are a promising approach to large-scale data collection problems.
Also UMIACS-TR-2002-62