Local News And Event Detection In Twitter
Twitter, one of the most popular micro-blogging services, allows users to publish
short messages, called tweets, on a wide variety of subjects such as news, events,
stories, ideas, and opinions. The popularity of Twitter arises, to some extent, from its
ability to let users promptly and conveniently contribute tweets that convey diverse information.
Specifically, with people discussing what is happening outside in the real world by
posting tweets, Twitter captures invaluable information about real-world news and events,
spanning a wide scale from large national or international stories like a presidential election
to small local stories such as a local farmers market. Detecting and extracting small
news and events for a local place is a challenging problem and is the focus of this thesis.
In particular, we explore several directions to extract and detect local news and events
from tweets: a) how to identify locally influential people on Twitter as potential
news seeders; b) how to recognize unusualness in tweet volume as signals of potential
local events; c) how to overcome the data sparsity of local tweets to detect more and
smaller ongoing local news and events. Additionally, we try to uncover implicit
correlations between location, time, and text in tweets by learning embeddings for them
using a universal representation under the same semantic space.
In the first part, we investigate how to measure the spatial influence of Twitter users
by their interactions and thereby identify the locally influential users, who we found
usually make good news and event seeders in practice. To do this, we built a large-scale
directed interaction graph of Twitter users. Such a graph allows us to exploit PageRank-based
ranking procedures to select the top locally influential people after incorporating
geographical distance into the transition matrix used for the random walk.
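
As an illustration of how geographical distance might enter the transition matrix, the following is a minimal sketch and not the ranking procedure developed in the thesis. It assumes that each interaction edge is weighted by its interaction count multiplied by an exponential decay of the distance between the two users. The function names, the decay form, and the `scale_km` parameter are all hypothetical.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def local_pagerank(edges, coords, scale_km=50.0, damping=0.85, iters=100):
    """edges: {(src, dst): interaction_count}; coords: {user: (lat, lon)}."""
    users = sorted(coords)
    # Assumed weighting: interaction count decayed by the distance between users,
    # so the random walk prefers to move between users who are close together.
    out = {u: {} for u in users}
    for (src, dst), count in edges.items():
        decay = math.exp(-haversine_km(coords[src], coords[dst]) / scale_km)
        out[src][dst] = count * decay
    rank = {u: 1.0 / len(users) for u in users}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / len(users) for u in users}
        for u in users:
            total = sum(out[u].values())
            if total == 0.0:
                # Dangling user: spread its rank uniformly over all users.
                for v in users:
                    nxt[v] += damping * rank[u] / len(users)
            else:
                # Row-normalize the distance-weighted edges into probabilities.
                for v, w in out[u].items():
                    nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return rank
```

Under this assumed weighting, a user who receives interactions mostly from nearby users accumulates more rank than one whose interactions come from far away, which is one plausible reading of "locally influential".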
In the second part, we study how to recognize the unusualness in tweet volume at
a local place as signals of potential ongoing local events. The intuition is that if there
is suddenly an abnormal change in the number of tweets at a location (e.g., a significant
increase), it may imply a potential local event. We, therefore, present DeLLe, a methodology
for automatically Detecting Latest Local Events from geotagged tweet streams (i.e.,
tweets that contain GPS points). With the help of novel spatiotemporal tweet count prediction
models, DeLLe first finds unusual locations which have aggregated an unexpected
number of tweets in the latest time period and then calculates, for each such unusual location,
a ranking score to identify the ones most likely to have ongoing local events by
addressing the temporal burstiness, spatial burstiness, and topical coherence.
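
The idea of flagging locations with an unexpected tweet count can be sketched with a much simpler stand-in for DeLLe's prediction models. The example below assumes the expected count for a grid cell is just the mean of its past counts, and flags cells whose latest count exceeds that by a z-score threshold. The cell identifiers, the threshold, and the floor on the standard deviation are illustrative assumptions, not part of DeLLe.

```python
from statistics import mean, stdev

def unusual_cells(history, latest, z_threshold=3.0):
    """history: {cell: [counts in past windows]}; latest: {cell: count now}."""
    flagged = []
    for cell, past in history.items():
        if len(past) < 2:
            continue  # not enough data to estimate the spread
        expected = mean(past)  # naive baseline prediction (assumption)
        spread = max(stdev(past), 1.0)  # floor keeps quiet cells from over-triggering
        z = (latest.get(cell, 0) - expected) / spread
        if z >= z_threshold:
            flagged.append((cell, z))
    # Most surprising cells first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A sketch like this covers only the volume signal; ranking the flagged locations by burstiness and topical coherence would be a separate step.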
In the third part, we explore how to overcome the data sparsity of local tweets when
trying to discover more and smaller local news or events. Local tweets are those whose
locations fall inside a local place. They are very sparse in Twitter, which hinders the detection
of small local news or events that have only a handful of tweets. We propose a system, called
Firefly, that enhances the local live tweet stream by tracking the tweets of a
large body of local people, and then performs locality-aware, keyword-based clustering
for event detection. The intuition is that local tweets are published by local people,
and tracking their tweets naturally yields a source of local tweets. However, in practice,
only 20% of Twitter users provide information about where they come from. Thus, a social
network-based geotagging procedure is subsequently proposed to estimate locations for
Twitter users whose locations are missing.
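
To illustrate what a social network-based geotagging step could look like, here is a minimal sketch that assumes a simple majority vote over a user's neighbors with known locations. This is a common baseline and is not necessarily the procedure used in Firefly. The `min_votes` parameter and the data layout are hypothetical.

```python
from collections import Counter

def infer_locations(friends, known, min_votes=2):
    """friends: {user: [neighbor, ...]}; known: {user: place} for self-reported users."""
    inferred = {}
    for user, neighbors in friends.items():
        if user in known:
            continue  # keep self-reported locations as they are
        # Count the places reported by neighbors whose location is known.
        votes = Counter(known[n] for n in neighbors if n in known)
        if not votes:
            continue
        place, count = votes.most_common(1)[0]
        if count >= min_votes:  # require some agreement before assigning
            inferred[user] = place
    return inferred
```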
Finally, in order to discover correlations between location, time, and text in geotagged
tweets, e.g., “find which locations are most related to the given topics” and
“find which locations are similar to a given location”, we present LeGo, a methodology
for Learning embeddings of Geotagged tweets with respect to entities such as locations,
time units (hour-of-day and day-of-week), and textual words in tweets. The resulting compact
vector representations of these entities make it easy to measure the relatedness
between locations, time, and words in tweets. LeGo comprises two working modes, cross-modal
search (LeGo-CM) and location-similarity search (LeGo-LS), which answer these two
types of queries respectively. In LeGo-CM, we first build a graph of entities extracted
from tweets in which each edge carries the weight of co-occurrences between two entities.
The embeddings of graph nodes are then learned in the same latent space, guided by
approximating the stationary residing probabilities between nodes, which are
computed using personalized random walk procedures. In LeGo-LS, by comparison, we add
edges between locations that capture their underlying spatial proximity and
topical similarity, in order to support location-similarity search queries.
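
The entity graph described for LeGo-CM can be sketched as follows, assuming each geotagged tweet has already been reduced to a location identifier, an hour-of-day, a day-of-week, and a list of words. The tuple layout and the entity type tags are assumptions made for illustration. The embedding step itself is not shown.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(tweets):
    """tweets: iterable of (location_id, hour_of_day, day_of_week, words)."""
    weights = Counter()
    for loc, hour, day, words in tweets:
        # Tag each entity with its type so that, e.g., hour 9 and word "9" stay distinct.
        entities = {("loc", loc), ("hour", hour), ("day", day)}
        entities.update(("word", w) for w in words)
        # Every pair of entities seen in the same tweet co-occurs once.
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return weights
```

Each key is an undirected edge stored in sorted order, and its value is the number of tweets in which the two entities appear together.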