Local News And Event Detection In Twitter


Twitter, one of the most popular micro-blogging services, allows users to publish

short messages, called tweets, on a wide variety of subjects such as news, events, stories,

ideas, and opinions. The popularity of Twitter, to some extent, arises from its capability

of letting users promptly and conveniently contribute tweets to convey diverse information.

Specifically, as people post tweets about what is happening in the real world,

Twitter captures invaluable information about real-world news and events,

spanning a wide scale from large national or international stories like a presidential election

to small local stories such as a local farmers market. Detecting and extracting small

news and events for a local place is a challenging problem and is the focus of this thesis.

In particular, we explore several directions to extract and detect local news and events

using tweets: a) how to identify locally influential people on Twitter as potential

news seeders; b) how to recognize unusualness in tweet volume as a signal of potential

local events; and c) how to overcome the data sparsity of local tweets to detect more and

smaller ongoing local news and events. Additionally, we try to uncover implicit

correlations between location, time, and text in tweets by learning embeddings for them

using a universal representation under the same semantic space.

In the first part, we investigate how to measure the spatial influence of Twitter users

by their interactions and thereby identify the locally influential users, which we found are

usually good news and event seeders in practice. In order to do this, we built a large-scale

directed interaction graph of Twitter users. Such a graph allows us to exploit PageRank-based

ranking procedures to select the top locally influential people after incorporating

geographical distance into the transition matrix used for the random walk.
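
As a rough illustration of this idea, and not the formulation used in the thesis, the sketch below runs PageRank on a small directed interaction graph in which each edge weight is damped by the geographical distance between the two users before the transition probabilities are normalized. The decay function `exp(-d / scale_km)`, the `scale_km` parameter, and the names `haversine_km` and `local_pagerank` are all assumptions made for this example.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def local_pagerank(edges, coords, scale_km=50.0, damping=0.85, iters=100):
    """edges: list of (src, dst, interaction_count); coords: user -> (lat, lon)."""
    users = sorted(coords)
    # Edge weight = interaction count decayed by distance (assumed decay form).
    weights = {u: {} for u in users}
    for src, dst, count in edges:
        d = haversine_km(coords[src], coords[dst])
        weights[src][dst] = weights[src].get(dst, 0.0) + count * math.exp(-d / scale_km)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in users}
        for u in users:
            total = sum(weights[u].values())
            if total == 0.0:
                # Dangling user with no outgoing interactions: spread rank uniformly.
                for v in users:
                    nxt[v] += damping * rank[u] / n
            else:
                # Row-normalize the distance-damped weights into transition probabilities.
                for v, w in weights[u].items():
                    nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return rank
```

With this weighting, a user who receives interactions mainly from nearby accounts ends up ranked above a distant user who receives the same number of interactions from the same accounts.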

In the second part, we study how to recognize the unusualness in tweet volume at

a local place as signals of potential ongoing local events. The intuition is that if there

is suddenly an abnormal change in the number of tweets at a location (e.g., a significant

increase), it may imply a potential local event. We, therefore, present DeLLe, a methodology

for automatically Detecting Latest Local Events from geotagged tweet streams (i.e.,

tweets that contain GPS points). With the help of novel spatiotemporal tweet count prediction

models, DeLLe first finds unusual locations which have aggregated an unexpected

number of tweets in the latest time period and then calculates, for each such unusual location,

a ranking score to identify the ones most likely to have ongoing local events by

addressing temporal burstiness, spatial burstiness, and topical coherence.
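
The first step, finding locations with an unexpected number of tweets, can be sketched as follows. This is a simplified stand-in: a plain z-score against past counts replaces the spatiotemporal prediction models of DeLLe, and the ranking by burstiness and topical coherence is omitted. The function name, the grid-cell keys, and the `z_threshold` and `min_std` parameters are assumptions for illustration.

```python
from statistics import mean, pstdev

def unusual_locations(history, latest, z_threshold=3.0, min_std=1.0):
    """history: cell -> non-empty list of past tweet counts per time window.
    latest: cell -> tweet count in the latest window.
    Returns (cell, z) pairs with z >= z_threshold, highest z first."""
    flagged = []
    for cell, past in history.items():
        expected = mean(past)                  # naive "predicted" count
        spread = max(pstdev(past), min_std)    # floor avoids dividing by ~0
        z = (latest.get(cell, 0) - expected) / spread
        if z >= z_threshold:
            flagged.append((cell, z))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A cell whose history is around 10 tweets per window and which suddenly receives 40 would be flagged, while a cell moving from about 51 to 53 would not.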

In the third part, we explore how to overcome the data sparsity of local tweets when

trying to discover more and smaller local news or events. Local tweets are those whose

locations fall inside a local place. They are very sparse in Twitter, which hinders the detection

of small local news or events that have only a handful of tweets. A system, called

Firefly, is proposed to enhance the local live tweet stream by tracking the tweets of a

large body of local people and to further perform locality-aware, keyword-based clustering

for event detection. The intuition is that local tweets are published by local people,

and tracking their tweets naturally yields a source of local tweets. However, in practice,

only 20% of Twitter users provide information about where they come from. Thus, a social

network-based geotagging procedure is subsequently proposed to estimate locations for

Twitter users whose locations are missing.
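
To make the idea of network-based geotagging concrete, the sketch below assigns each user with a missing location the most common location among their connections, repeating for a few rounds so that labels can spread. This majority-vote propagation is a common baseline and is only assumed here; it is not necessarily the procedure proposed in the thesis. The function name and the `rounds` parameter are hypothetical.

```python
from collections import Counter

def propagate_locations(friends, known, rounds=3):
    """friends: user -> iterable of connected users; known: user -> location label.
    Returns a dict with the known labels plus the estimated ones."""
    labels = dict(known)
    for _ in range(rounds):
        updates = {}
        for user, nbrs in friends.items():
            if user in labels:
                continue
            # Vote using only neighbors whose location is already known or estimated.
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[user] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels
```

Users with no located connections after all rounds are left without an estimate rather than being given a guess.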

Finally, in order to discover correlations between location, time and text in geotagged

tweets, e.g., “find which locations are most related to the given topics” and

“find which locations are similar to a given location”, we present LeGo, a methodology

for Learning embeddings of Geotagged tweets with respect to entities such as locations,

time units (hour-of-day and day-of-week) and textual words in tweets. The resulting compact

vector representations of these entities hence make it easy to measure the relatedness

between locations, time and words in tweets. LeGo comprises two working modes, cross-modal

search (LeGo-CM) and location-similarity search (LeGo-LS), to answer these two

types of queries, respectively. In LeGo-CM, we first build a graph of entities extracted

from tweets in which each edge carries the weight of co-occurrences between two entities.

The embeddings of graph nodes are then learned in the same latent space so as

to approximate the stationary residing probabilities between nodes, which are

computed using personalized random walk procedures. In LeGo-LS, by comparison, we add

edges between locations to capture their underlying spatial proximity and

topical similarity, in support of location-similarity search queries.
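
The stationary probabilities that the embeddings are trained to approximate can be illustrated with a random walk with restart over a small co-occurrence graph. The sketch below covers only that target computation, not the embedding learning itself. It assumes a symmetric weighted graph in which every node appears as a key; the `restart` value, the node naming scheme (`loc:`, `word:`, `time:`), and the function name are hypothetical.

```python
def personalized_walk(cooccur, source, restart=0.15, iters=100):
    """cooccur: node -> {neighbor: co-occurrence weight}, symmetric,
    with every node present as a key.
    Returns the visiting probabilities of a walk that restarts at `source`."""
    nodes = list(cooccur)
    prob = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[source] = restart  # restart mass always returns to the source
        for u in nodes:
            total = sum(cooccur[u].values())
            if total == 0:
                # Isolated node: send its mass back to the source.
                nxt[source] += (1.0 - restart) * prob[u]
                continue
            for v, w in cooccur[u].items():
                nxt[v] += (1.0 - restart) * prob[u] * w / total
        prob = nxt
    return prob
```

Starting the walk from a word node and reading off the probabilities of the location nodes gives a simple form of cross-modal relatedness: locations that co-occur often with the word, directly or through shared time units, receive more probability mass.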