Three papers accepted at ACM WWW 2023 (short papers)
Title: RealGraph+: A High-Performance Single-Machine-Based Graph Engine that Utilizes IO Bandwidth Effectively
Authors: Myung-Hwan Jang, Jeong-Min Park, Ikhyeon Jo, Duck-Ho Bae, and Sang-Wook Kim
Abstract
This paper proposes RealGraph+, an improved version of RealGraph that processes large-scale real-world graphs efficiently on a single machine. Via a preliminary analysis, we observe that the original RealGraph does not fully utilize the IO bandwidth provided by NVMe SSDs, a state-of-the-art storage device. To increase the IO bandwidth, we equip RealGraph+ with three optimization strategies that issue more frequent IO requests: (1) Userspace IO, (2) Asynchronous IO, and (3) SIMD processing. Via extensive experiments with four graph algorithms and six real-world datasets, we show that (1) each of our strategies is effective in increasing the IO bandwidth, thereby reducing the execution time; (2) RealGraph+ with all of our strategies improves on the original RealGraph significantly; (3) RealGraph+ dramatically outperforms state-of-the-art single-machine-based graph engines; and (4) it shows performance comparable to or even better than that of distributed-system-based graph engines.
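The core of the asynchronous-IO strategy is to keep many read requests in flight at once so the NVMe device queue never drains. A minimal Python sketch of that idea follows; it uses a thread pool over `os.pread` purely for illustration, and is not RealGraph+'s actual userspace/asynchronous IO implementation (names such as `concurrent_read` and the 4 KiB chunk size are our own choices):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4096  # illustrative read granularity (one 4 KiB block per request)

def read_chunk(fd, offset):
    # os.pread takes an explicit offset, so threads share no file
    # position and many reads can be outstanding at the same time,
    # keeping the device's request queue full.
    return offset, os.pread(fd, CHUNK, offset)

def concurrent_read(path, workers=8):
    """Read a file by issuing many chunk requests concurrently."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = range(0, size, CHUNK)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            parts = dict(pool.map(lambda off: read_chunk(fd, off), offsets))
    finally:
        os.close(fd)
    # Reassemble chunks in offset order.
    return b"".join(parts[off] for off in sorted(parts))

if __name__ == "__main__":
    data = os.urandom(1 << 20)  # 1 MiB of test data
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        path = f.name
    assert concurrent_read(path) == data
    os.unlink(path)
```

In a real engine the same effect is achieved below the page cache (e.g., with kernel-bypass or native asynchronous IO interfaces); the sketch only shows why overlapping requests raises achieved bandwidth on devices with deep queues.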
Title: Is the Impression Log Beneficial to Effective Model Training in News Recommender Systems? No, It’s NOT
Authors: Jeewon Ahn, Hong-Kyun Bae, and Sang-Wook Kim
Abstract
The MIND dataset, one of the most popular real-world news datasets, has been used in many news recommendation studies. These studies all employ the impression log as training data (i.e., impression-based training). In this paper, we claim that the impression log suffers from a preference bias and thus should not be used for model training in news recommendation. We validate our claim via extensive experiments; we also demonstrate that simply ignoring the impression log improves the recommendation accuracy of five existing state-of-the-art models by up to 82.1%. We believe this surprising result provides a new insight toward better model training for both researchers and practitioners in the news recommendation area.
Title: C-Affinity: A Novel Similarity Measure for Effective Data Clustering
Authors: Jiwon Hong and Sang-Wook Kim
Abstract
Clustering is widely employed in various applications as it is one of the most useful data mining techniques. In performing clustering, a similarity measure, which defines how similar a pair of data objects is, plays an important role, and it is chosen by considering the characteristics of the target dataset. Existing similarity measures (or distances), however, do not reflect the distribution of data objects in a dataset at all; from a clustering point of view, this may limit the clustering accuracy. In this paper, we propose c-affinity, a new notion of a similarity measure that reflects the distribution of objects in the given dataset from a clustering point of view. We design c-affinity between any two objects to have a higher value as they are more likely to belong to the same cluster, by learning the data distribution. We use random walk with restart (RWR) on the k-nearest-neighbor graph of the given dataset to measure (1) how similar a pair of objects is and (2) how densely other objects are distributed between them. Via extensive experiments on sixteen synthetic and real-world datasets, we verify that replacing the existing similarity measure with our c-affinity improves the clustering accuracy significantly.
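The RWR-on-k-NN-graph idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' exact formulation of c-affinity; the function names, the restart probability 0.15, and k = 3 are all our own assumptions:

```python
import numpy as np

def knn_graph(X, k):
    # Symmetric k-nearest-neighbor adjacency from pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # no self-edges
    A = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]    # indices of each row's k nearest points
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)            # symmetrize

def rwr_scores(A, restart=0.15, iters=100):
    # Row-normalize adjacency into a transition matrix P, then iterate
    # R <- (1 - c) * P^T R + c * I, so column i holds the stationary
    # RWR distribution of a walker restarting at node i.
    P = A / A.sum(axis=1, keepdims=True)
    n = len(A)
    R = np.eye(n)
    for _ in range(iters):
        R = (1 - restart) * P.T @ R + restart * np.eye(n)
    return R

# Two well-separated 2-D Gaussian blobs of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)),
               rng.normal(8.0, 1.0, (10, 2))])
S = rwr_scores(knn_graph(X, k=3))
# S[i, j]: probability mass the walk restarting at j places on i; pairs
# inside one blob score far higher than pairs across blobs.
```

Because the walk rarely crosses between the sparsely connected blobs, within-cluster RWR scores dominate cross-cluster ones, which is exactly the property a clustering-aware similarity should have.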