I am a researcher at the Institute for Systems Analysis and Computer Science (in Italian: Istituto di Analisi dei Sistemi ed Informatica, IASI) of the Italian National Research Council (Consiglio Nazionale delle Ricerche, CNR) in Rome.
Previously, I was a postdoctoral researcher at ISTI-CNR (Pisa, Italy). Before joining the CNR, I was a postdoctoral researcher at the Università della Svizzera italiana (Lugano, Switzerland) and at the Max Planck Institute for Informatics (Saarbrücken, Germany).
I received my Ph.D. in Engineering in Computer Science from Sapienza University of Rome, under the supervision of Prof. Stefano Leonardi.
My Ph.D. Thesis focused on Web usage mining for optimizing information retrieval and recommendation systems. Part of my research was conducted while I was an intern at the Yahoo Research Lab (Barcelona, Spain) and at the Max Planck Institute for Informatics (Saarbrücken, Germany).
My research interests are: Web mining, information retrieval, conversational systems, recommendation algorithms, and topic modeling.
A. De Nicola, A. Formica, I. Mele, M. Missikoff, and F. Taglino A Comparative Study of LLMs and NLP Approaches for Supporting Business Process Analysis. Enterprise Information Systems (Under Review)
The evaluation of the semantic similarity of concepts organized according to taxonomies is a long-standing problem in computer science and has attracted great attention from researchers over the decades. In this regard, the notion of information content plays a key role, and semantic similarity measures based on it are still on the rise. In this review, we address methods for evaluating the semantic similarity between concepts, or sets of concepts, belonging to a taxonomy; in the literature, these methods often adopt different notations and formalisms. The results of this systematic literature review give researchers and academics insight, through the use of a uniform notation, into what the discussed methods have in common, as well as into their differences, overlaps, and dependencies, and, in particular, into the role of the notion of information content in the evaluation of semantic similarity. Furthermore, the review provides a comparative analysis of the methods for evaluating the semantic similarity between sets of concepts.
In this paper, we address the problem of answering complex questions formulated by users in natural language. Since traditional Information Retrieval systems are not suitable for complex questions, these questions are usually run over knowledge bases, such as Wikidata or DBpedia. We propose a semi-automatic approach for transforming a natural language question into a SPARQL query that can be easily processed over a knowledge base. The approach applies classification techniques to associate a natural language question with a proper query template from a set of predefined templates. The nature of our approach is semi-automatic as the query templates are manually written by human assessors, who are the experts of the knowledge bases, whereas the classification and query processing steps are completely automatic. Our experiments on the large-scale CSQA dataset for question-answering corroborate the effectiveness of our approach.
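As a rough illustration of the template-classification idea, here is a minimal Python sketch; the templates, predicates, training questions, and classifier below are hypothetical stand-ins, not the ones used in the paper, and entity linking is assumed to be done elsewhere:

    # Hypothetical sketch: classify a question into a predefined SPARQL template,
    # then fill the entity slot. Real templates are written by knowledge-base experts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    TEMPLATES = {
        "capital_of": "SELECT ?c WHERE { <$ENTITY> <hasCapital> ?c }",
        "birthplace": "SELECT ?p WHERE { <$ENTITY> <bornIn> ?p }",
    }
    questions = ["What is the capital of France?", "Where was Mozart born?"]
    labels = ["capital_of", "birthplace"]

    vectorizer = TfidfVectorizer()
    classifier = LogisticRegression().fit(vectorizer.fit_transform(questions), labels)

    def to_sparql(question, entity):
        # Pick the template predicted for the question and bind the linked entity.
        template_id = classifier.predict(vectorizer.transform([question]))[0]
        return TEMPLATES[template_id].replace("$ENTITY", entity)

    print(to_sparql("What is the capital of Italy?", "Italy"))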
Rapid response, namely low latency, is fundamental in search applications; it is particularly so in interactive search sessions,
such as those encountered in conversational settings. A key observation with the potential to reduce latency is that
conversational queries exhibit temporal locality in the lists of documents retrieved. Motivated by this observation,
we propose and evaluate a client-side document embedding cache, improving the responsiveness of conversational search systems.
By leveraging state-of-the-art dense retrieval models to abstract document and query semantics,
we cache the embeddings of documents retrieved for a topic introduced in the conversation, as they are likely relevant to
successive queries. Our document embedding cache implements an efficient metric index, answering nearest-neighbor similarity
queries by approximating the result sets that the back-end would return.
We demonstrate the efficiency achieved using our cache via reproducible experiments based on TREC CAsT datasets,
achieving a hit rate of up to 75% without degrading answer quality. The high cache hit rates we achieve significantly
improve the responsiveness of conversational systems while likewise reducing the number of queries processed by the
search back-end.
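A minimal sketch of the idea, assuming cosine similarity over normalized embeddings and a fixed similarity threshold (both illustrative choices, not the paper's exact metric index):

    import numpy as np

    class EmbeddingCache:
        """Client-side cache of document embeddings for the current conversation."""

        def __init__(self, threshold=0.8):
            self.docs, self.embs = [], []
            self.threshold = threshold  # below it, we fall back to the back-end

        def add(self, doc, emb):
            self.docs.append(doc)
            self.embs.append(emb / np.linalg.norm(emb))

        def lookup(self, query_emb, k=10):
            # Nearest-neighbor search over cached embeddings; an empty result
            # signals a cache miss, i.e., a likely topic shift.
            if not self.docs:
                return []
            q = query_emb / np.linalg.norm(query_emb)
            sims = np.stack(self.embs) @ q
            top = np.argsort(-sims)[:k]
            return [(self.docs[i], float(sims[i])) for i in top if sims[i] >= self.threshold]

On a miss, the query goes to the back-end and the embeddings of the retrieved documents are added to the cache, so that follow-up utterances on the same topic can be served locally.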
The massive shock of the COVID-19 pandemic has already shown its negative effects on economies around the world, unprecedented in recent history. COVID-19 infections and containment measures caused a general slowdown in research and new knowledge production. Because of the link between R&D output and economic growth, it is to be expected that a slowdown in research activities will in turn slow the global recovery from the pandemic. Many recent studies also claim an uneven impact on scientific production across gender. In this paper, we investigate the phenomenon across countries, analysing preprint depositions in the main repositories. Unlike other works, which compare the number of preprint depositions before and after the pandemic outbreak, we analyse deposition trends across geographical areas and contrast post-outbreak depositions with expected ones. Contrary to common belief and initial evidence, the decrease in research output is not more severe for women than for men.
In a conversational context, a user converses with a system through a sequence of natural-language questions, i.e., utterances. Starting from a given subject, the conversation evolves through sequences of user utterances and system replies. The retrieval of documents relevant to an utterance is difficult due to the informal use of natural language in speech and the complexity of understanding the semantic context coming from previous utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing, in order: (i) automatic utterance understanding and rewriting, (ii) first-stage retrieval of candidate passages for the rewritten utterances, and (iii) neural re-ranking of candidate passages. By understanding the conversational context, we propose adaptive utterance rewriting strategies based on the current utterance and the evolution of the user's dialogue with the system. A classifier identifies the utterances lacking context information as well as their dependencies on previous utterances. Experimentally, we evaluate the proposed architecture in terms of traditional information retrieval metrics at small cutoffs. Results demonstrate the effectiveness of our techniques, with improvements of up to 0.6512 (+201%) for P@1 and 0.4484 (+214%) for nDCG@3 w.r.t. the CAsT baseline.
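The three-step architecture can be summarized by the following self-contained sketch, in which the rewriting, retrieval, and re-ranking functions are toy stand-ins for the actual modules (a context classifier, a first-stage ranker such as BM25, and a neural re-ranker):

    def rewrite(utterance, history):
        # Toy context enrichment: prepend the first utterance of the conversation,
        # which typically introduces the topic the follow-ups depend on.
        return (history[0] + " " + utterance) if history else utterance

    def retrieve(query, corpus, k=100):
        # Toy first-stage retrieval by term overlap (BM25 in the real system).
        terms = set(query.lower().split())
        scored = sorted(corpus, key=lambda p: len(terms & set(p.lower().split())), reverse=True)
        return scored[:k]

    def rerank(query, candidates):
        # Placeholder for the neural re-ranking step.
        return candidates

    def answer(utterance, history, corpus):
        query = rewrite(utterance, history)    # (i) utterance understanding/rewriting
        candidates = retrieve(query, corpus)   # (ii) first-stage retrieval
        results = rerank(query, candidates)    # (iii) neural re-ranking
        history.append(utterance)
        return results[:3]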
Caching search results is employed in information retrieval systems to expedite query processing and reduce back-end server workload. Motivated by the observation that queries belonging to different topics have different temporal-locality patterns, we investigate a novel caching model called STD (Static-Topic-Dynamic cache), a refinement of the traditional SDC (Static-Dynamic Cache) that stores in a static cache the results of popular queries and manages the dynamic cache with a replacement policy for intercepting the temporal variations in the query stream.
Our proposed caching scheme includes an additional layer for topic-based caching, where cache entries are allocated to different topics (e.g., weather, education). The results of queries characterized by a topic are kept in the fraction of the cache dedicated to it. This makes it possible to adapt cache-space utilization to the temporal locality of the various topics and reduces the misses caused by queries that are neither sufficiently popular to be in the static portion nor requested within short enough time intervals to be in the dynamic portion.
We simulate different configurations of STD using two real-world query streams. Experiments demonstrate that our approach outperforms SDC, increasing hit rates by up to 3% and closing up to 36% of the gap between SDC and the theoretically optimal caching algorithm.
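A minimal sketch of the lookup and insertion logic, where the topic classifier, the cache sizes, and the use of an LRU policy within each layer are illustrative assumptions rather than the paper's exact configuration:

    from collections import OrderedDict

    class STDCache:
        def __init__(self, static_entries, topic_size, dynamic_size):
            self.static = dict(static_entries)    # popular queries, filled offline
            self.topic = {}                       # topic -> LRU of query results
            self.topic_size = topic_size
            self.dynamic = OrderedDict()          # global LRU for the rest
            self.dynamic_size = dynamic_size

        def get(self, query, topic=None):
            if query in self.static:
                return self.static[query]
            if topic is not None and query in self.topic.get(topic, {}):
                self.topic[topic].move_to_end(query)
                return self.topic[topic][query]
            if query in self.dynamic:
                self.dynamic.move_to_end(query)
                return self.dynamic[query]
            return None                           # cache miss

        def put(self, query, results, topic=None):
            if topic is not None:                 # query characterized by a topic
                lru = self.topic.setdefault(topic, OrderedDict())
                lru[query] = results
                if len(lru) > self.topic_size:
                    lru.popitem(last=False)       # evict least recently used
            else:                                 # query without a topic
                self.dynamic[query] = results
                if len(self.dynamic) > self.dynamic_size:
                    self.dynamic.popitem(last=False)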
News documents published online represent an important source of information that can be used for event detection and tracking as well as for analyzing the temporal publishing relationships among different news streams.
In this paper, we describe our research on detecting, tracking, and predicting events from multiple news streams. We also analyze the temporal publishing patterns of newswires on different platforms and their timeliness in reporting the events. First, we present an approach based on discrete dynamic topic modeling and a Hidden Markov Model for event detection and tracking. Then, we predict the events that will persist in the next time slice, which can be important for forecasting facts that will be popular in the future. We leverage the detected events for clustering news documents according to the events they describe. This allows us to determine which newswires published news about an event and to analyze their temporal ordering in reporting events. Finally, we propose two scoring functions for ranking the newswires based on their timeliness.
We tested our methodologies on different collections of news articles and tweets. Moreover, we built a collection of heterogeneous news documents with event-document labels that were manually assessed using crowdsourcing. Experimental results showed that, compared to the traditional dynamic topic model, our approach is able to detect emerging topics (events) in a timely manner. Overall, we registered an event coverage of about 90% w.r.t. the pool of labeled events. The evolution of events is captured by event chains that are highly coherent (0.76) and informative (0.60), allowing us to effectively reconstruct the stories. Furthermore, the event-based clustering of news documents achieves a good trade-off between precision and recall (F-score = 0.83), and the topic keywords provide a semantic description of the events represented by the clusters. Concerning our analysis of the temporal publishing relationships among news streams, we observed interesting patterns in the usage of the different platforms; for example, some newswires still favor their own official websites, while others tend to publish more timely on Twitter.
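As a worked example of a timeliness score (an illustrative function, not necessarily one of the paper's two), a newswire can be scored by the reciprocal of its average delay behind the first report of each event:

    def timeliness(newswire_times, first_report_times):
        """Both dicts map event_id -> datetime of the (first) report."""
        scores = []
        for event, t_first in first_report_times.items():
            if event in newswire_times:
                delay_h = (newswire_times[event] - t_first).total_seconds() / 3600.0
                scores.append(1.0 / (1.0 + delay_h))  # 1.0 when this wire was first
        return sum(scores) / len(scores) if scores else 0.0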
We design algorithms that, given a collection of documents and a distribution over user queries,
return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset.
This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection.
We study our algorithms through the lens of stochastic analysis and prove that, even though they use only a small fraction of the entire collection,
they can provide answers to most user queries, achieving performance close to the optimum. To complement our theoretical findings,
we experimentally show the versatility of our approach by considering two important cases in the context of Web search.
In the first case, we favor the retrieval of documents that are relevant to the query,
whereas in the second case we aim for document diversification.
Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios.
Business Process Analysis (BPA) is a strategic activity, necessary for enterprises to model their business operations.
It is a central activity in information system development, but also for business process design and reengineering.
Despite several decades of research, the effectiveness of available BPA methods is still questionable.
The majority of methodologies adopted by enterprises are rather qualitative and lack a formal basis,
often yielding inadequate specifications. On the other hand, there are methodologies with a solid theoretical background,
but they appear too cumbersome for the majority of enterprises.
This paper proposes a knowledge framework, referred to as BPA Canvas, conceived to be easily mastered by business people and, at the same time,
based on a sound formal theory. The methodology starts with the construction of natural language knowledge artifacts and,
then, progressively guides the user toward more rigorous structures. The formal approach of the methodology allows us to prove
the correctness of the resulting knowledge base while maintaining the centrality of business people in the whole knowledge
construction process.
The Mixed-Initiative ConveRsatiOnal Systems workshop (MICROS) aims at bringing together novel ideas and investigating new solutions for conversational assistant systems. The increasing popularity of personal assistant systems, as well as of smartphones, has changed the way users access online information, posing new challenges for information seeking and filtering. MICROS has a particular focus on mixed-initiative conversational systems, namely, systems that can provide answers in a proactive way (e.g., asking for clarification or proposing possible interpretations of ambiguous and vague requests). We invite people working on conversational systems or interested in the workshop topics to send us their position and research manuscripts.
The massive shock of the COVID-19 pandemic is already showing its negative effects on economies around the world, unprecedented in recent history. COVID-19 infections and containment measures have caused a general slowdown in research and new knowledge production. Because of the link between R&D spending and economic growth, it is to be expected that a slowdown in research activities will in turn slow the global recovery from the pandemic. Many recent studies also claim an uneven impact on scientific production across gender. In this paper, we investigate the phenomenon across countries, analysing preprint depositions. Unlike other works, which compare the number of preprint depositions before and after the pandemic outbreak, we analyse deposition trends across geographical areas and contrast post-outbreak depositions with expected ones. Contrary to common belief and initial evidence, in a few countries female scientists increased their scientific output while that of their male colleagues plunged.
To help research on Conversational Information Seeking, TREC has organized a competition on conversational assistant systems, called the Conversational
Assistant Track (CAsT). It provides test collections for open-domain conversational search systems. For our participation in CAsT 2021, we implemented a
three-step architecture consisting of: (i) automatic utterance rewriting, (ii) first-stage retrieval of candidate passages, and (iii) neural re-ranking of candidate
passages.
Each run is based on a different utterance rewriting technique for enriching
the raw utterance with context extracted from the previous utterances and/or
replies in the conversation. Two of our approaches use only raw utterances, and the
other two use utterances plus the canonical responses for the automatically rewritten utterances provided by CAsT 2021. Our approaches also rely on utterances
manually classified by human assessors using a taxonomy defined ad hoc for this
task.
The 1st edition of the workshop on Mixed-Initiative ConveRsatiOnal Systems (MICROS@ECIR2021) aims at investigating and collecting novel ideas and contributions in the field of conversational systems. Nowadays, users often fulfill their information needs using smartphones and home assistants. This has revolutionized the way users access online information, thus posing new challenges compared to traditional search and recommendation. The first edition of MICROS will have a particular focus on mixed-initiative conversational systems. Indeed, conversational systems need to be proactive, proposing not only answers but also possible interpretations of ambiguous or vague requests.
The TREC Conversational Assistant Track (CAsT) provides test collections
for open-domain conversational search systems with the purpose of pursuing research on Conversational Information Seeking (CIS). For our participation in
CAsT 2020, we implemented a modular architecture consisting of three steps:
(i) automatic utterance rewriting, (ii) first-stage retrieval of candidate passages,
and (iii) neural re-ranking of candidate passages. Each run is based on a different
utterance rewriting technique for enriching the raw utterance with context extracted from the previous utterances in the conversation. Two of our approaches
are completely unsupervised, while the other two rely on utterances manually
classified by human assessors. These approaches also employ the canonical responses for the automatically rewritten utterances provided by CAsT 2020.
I. Mele, C.I. Muntean, F.M. Nardini, R. Perego, N. Tonellotto, and O. Frieder Topic Propagation in Conversational Search In SIGIR'20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
[Short paper], [Abstract],
[PDF],
[arXiv]
In a conversational context, a user expresses her multi-faceted information need as a sequence of natural-language questions, i.e., utterances. Starting from a given topic, the conversation evolves through user utterances and system replies. The retrieval of documents relevant to a given utterance in a conversation is challenging due to the ambiguity of natural language and to the difficulty of detecting possible topic shifts and semantic relationships among utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing: (i) topic-aware utterance rewriting, (ii) retrieval of candidate passages for the rewritten utterances, and (iii) neural-based re-ranking of candidate passages. We present a comprehensive experimental evaluation of the architecture in terms of traditional IR metrics at small cutoffs. Experimental results show the effectiveness of our techniques, which achieve improvements of up to 0.28 (+93%) for P@1 and 0.19 (+89.9%) for nDCG@3 w.r.t. the CAsT baseline.
In this paper, we present a collection of news documents labeled at the level of crisp events. Compared to other publicly-available collections,
our dataset is made of heterogeneous documents published by popular news channels on different platforms in the same temporal window and, therefore, dealing with roughly the same events and topics.
The collection spans 4 months and comprises 147K news documents from 27 news streams, i.e., 9 different channels and 3 platforms:
Twitter, RSS portals, and news websites. We also provide relevance labels of news documents for some selected events.
These relevance judgments were collected using crowdsourcing. The collection can be useful to researchers investigating challenging news-mining tasks,
such as event detection and tracking, multi-stream analysis, and temporal analysis of news publishing patterns.
Proactive search technologies aim at modeling users' information-seeking behaviors for just-in-time information retrieval (JITIR) and at addressing the information needs of users even before they ask. Modern virtual personal assistants, such as Microsoft Cortana and Google Now, are moving towards utilizing various signals from users' search history to model the users and to identify their short-term as well as long-term future searches. As a result, they are able to recommend relevant pieces of information to the users at just the right time, even before they explicitly ask (e.g., before submitting a query). In this paper, we propose a novel neural model for JITIR which tracks the users' search behavior over time in order to anticipate future search topics. Such technology can be employed as part of a personal assistant enabling the proactive retrieval of information. Our experimental results on real-world data from a commercial search engine indicate that our model outperforms several important baselines in terms of predictive power, i.e., in identifying the topics that will be of interest in the near future. Moreover, our proposed model is capable of not only predicting the near-future topics of interest but also predicting the approximate time of day when a user will be interested in a given search topic.
Nowadays, social media platforms are one of the main sources people use to access and read news. Different types of news trigger different emotional reactions in users, who may feel happy or sad after reading a news article. In this paper, we focus on the problem of predicting the emotional reactions triggered in users after they read a news post. In particular, we try to predict the number of emotional reactions that users express regarding a news post published on social media. We propose features extracted from users' comments published about a news post shortly after its publication to predict the triggered emotional reactions. We explore two different sets of features extracted from users' comments: the first group represents the activity of users in publishing comments, whereas the second refers to the comments' content. In addition, we combine the features extracted from the comments with textual features extracted from the news post. Our results show that features extracted from users' comments are very important for predicting the emotional reactions to news posts and that combining textual and commenting features can effectively address the emotional-reaction prediction problem.
Nowadays, online news agents post news articles on social media platforms with the aim of spreading information as well as attracting more users and understanding their reactions and opinions. Predicting the emotional influence of news on users is very important not only for news agents but also for users, who can filter news articles based on the reactions they trigger. In this paper, we focus on the problem of predicting the emotional influence of a news post on users before publication. For the prediction, we explore a range of textual and semantic features derived from the content of the posts. Our results show that the terms feature is the most important one and that features extracted from news posts' content make it possible to effectively predict the amount of emotional reactions triggered by a news post.
In the last few decades, topic models have been extensively used to discover the latent
topical structure of large text corpora; however, very little has been done to model the
continuation of such topics in the near future. In this paper, we present a novel approach for
tracking topical changes over time and predicting the topics that will continue in the
near future. For our experiments, we used a publicly available corpus of conference papers,
since scholarly papers lead the technological advancements and represent an important
source of information that can be used to make decisions regarding the funding strategies in
the scientific community. The experimental results show that our model outperforms two
major baselines for dynamic topic modeling in terms of predictive power.
Linking multiple news streams based on the reported events and analyzing the streams' temporal publishing patterns are two very important tasks for information analysis, for discovering newsworthy stories, for studying event evolution, and for detecting untrustworthy sources of information. In this paper, we propose techniques for cross-linking news streams based on the reported events, with the purpose of analyzing the temporal dependencies among streams.
Our research tackles two main issues: (1) how news streams are connected in reporting an event or the evolution of the same event and (2) how timely the newswires report related events using different publishing platforms. Our approach is based on dynamic topic modeling for detecting and tracking events over the timeline and on clustering news according to the events. We leverage the event-based clustering to link news across different streams and present two scoring functions for ranking the streams based on their timeliness in publishing news about a specific event.
Cascade measures like α-nDCG, ERR-IA, and NRBP
take into account novelty and diversity of query results and
are computed using judgments provided by humans,
which are costly to collect.
These measures assume that all documents in the result list of a query are
judged and cannot make use of information beyond
the assigned labels. Existing work has demonstrated
that condensing the query results by removing unjudged documents
can address this problem to some extent.
However, how highly incomplete judgments can affect cascade measures
and how to cope with such incompleteness have not been addressed yet.
In this paper, we propose an approach that mitigates incomplete judgments
by leveraging language models built from the content of documents relevant to the query's subtopics.
These language models are estimated at each rank, taking into account the document at that rank and the higher-ranked ones.
Then, our method determines gain values based on the Kullback-Leibler divergence between the language models. Experiments on the diversity tasks of the TREC Web Track 2009-2012 show that, with only 15% of the judgments, our method accurately reconstructs the original rankings determined by the established cascade measures.
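A minimal sketch of the gain computation under simple assumptions (unigram language models with add-one smoothing, and an illustrative mapping from divergence to gain; the paper's exact estimation may differ):

    import math
    from collections import Counter

    def language_model(texts, vocab):
        # Unigram model with add-one smoothing, restricted to a shared vocabulary.
        counts = Counter(w for t in texts for w in t.lower().split() if w in vocab)
        total = sum(counts.values()) + len(vocab)
        return {w: (counts[w] + 1) / total for w in vocab}

    def kl_divergence(p, q):
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    def gain(ranked_docs, i, subtopic_docs, vocab):
        # Gain of the document at rank i for a subtopic: how close the language
        # model of ranks 1..i is to the subtopic's relevant-document model.
        p = language_model(subtopic_docs, vocab)         # subtopic model
        q = language_model(ranked_docs[: i + 1], vocab)  # model of the top-i docs
        return 1.0 / (1.0 + kl_divergence(p, q))         # small KL -> high gain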
Suggesting personalized venues helps users find interesting places on location-based social networks (LBSNs). Although there are many LBSNs online, none of them is known to have thorough information about all venues. The Contextual Suggestion track at TREC aimed at providing a collection consisting of places as well as user context to enable researchers to examine and compare different approaches under the same evaluation setting. However, the officially released collection of the track did not meet many participants' needs related to venue content, online reviews, and user context. That is why almost all successful systems chose to crawl information from different LBSNs. For example, one of the best-performing systems in the TREC 2016 Contextual Suggestion track crawled data from multiple LBSNs and enriched it with venue-context appropriateness ratings collected using a crowdsourcing platform. Such a collection enabled the system to better predict a venue's appropriateness to a given user's context. In this paper, we release both collections that were used by the system above. We believe that these datasets give other researchers the opportunity to compare their approaches with the top systems in the track. They also provide the opportunity to explore different methods for predicting contextually appropriate venues.
The advent of social media has given users the opportunity to publicly express and share their opinion about any topic. Public opinion is very important for interested entities, which can leverage such information in the process of making decisions. In addition, identifying sentiment changes and the likely causes that have triggered them allows interested parties to adjust their strategies and attract more positive sentiment. With the aim of facilitating research on this problem, we describe a collection of tweets that can be used for detecting and ranking the likely triggers of sentiment spikes towards different entities. To build the collection, we first group tweets by topic and then manually annotate them according to sentiment polarity and strength. We believe that this collection can be useful for further research on detecting sentiment-change triggers, sentiment analysis, and sentiment prediction.
In this paper, we tackle the problem of detecting events from multiple and heterogeneous streams of news. In particular, we focus on news that is heterogeneous in length and writing style, since it is published on different platforms (i.e., Twitter, RSS portals, and news websites). This heterogeneity makes the event detection task more challenging; hence, we propose an approach able to cope with heterogeneous streams of news. Our technique combines topic modeling, named-entity recognition, and temporal analysis to effectively detect events from news streams. The experimental results confirmed that our approach is able to detect events better than other state-of-the-art techniques and to divide the news into high-precision clusters based on the events they describe.
Topic modeling is an important area that aims at indexing and exploring massive data streams. In this paper, we introduce a discrete Dynamic Topic Modeling (dDTM) algorithm, which is able to model a dynamic topic that is not necessarily present over all time slices in a stream of documents. Our proposed model has applications in modeling dynamic topics of rapidly changing and less structured data, such as online microblogs and news streams.
Our results show that the topical chains (i.e., evolutions of topics) computed by our algorithm are more representative of the contents of documents than those of the original Dynamic Topic Modeling (DTM) in terms of likelihood on held-out data. Furthermore, we show that our method is effective in identifying emerging trends in streaming data.
Making personalized and context-aware suggestions of venues to users is crucial in venue recommendation. These suggestions are often based on matching venues' features with users' preferences, which can be collected from previously visited locations. In this paper, we present a novel user-modeling approach that relies on a set of scoring functions for making personalized suggestions of venues based on venue content and reviews as well as on the user's context. Our experiments, conducted on the dataset of the TREC Contextual Suggestion track, show that our methodology outperforms state-of-the-art approaches by a significant margin.
One of the core tasks of Online Reputation Monitoring is to determine whether a text mentioning the entity of interest has positive or negative implications for its reputation. A challenging aspect of the task is that many texts are polar facts, i.e., they do not convey sentiment but they do have reputational implications (e.g., "A Samsung smartphone exploded during flight" has negative implications for the reputation of Samsung). In this paper, we explore the hypothesis that, in order to determine the reputation polarity of factual information, we can propagate sentiment from sentiment-bearing texts to factual texts that discuss the same issue. We test two approaches that implement this hypothesis: the first is to directly propagate sentiment to similar texts, and the second is to augment the polarity lexicon. Our results (i) confirm our propagation hypothesis, with improvements of up to 43% in weakly supervised settings and up to 59% with fully supervised methods; and (ii) indicate that building domain-specific polarity lexicons is a cost-effective strategy.
This technical report presents the work of the Università della Svizzera italiana (USI) at the TREC 2016 Contextual Suggestion track. The goal of the Contextual Suggestion track is to develop systems that can suggest venues that a user will potentially like. Our proposed method attempts to model each user's behavior and opinions by training an SVM classifier per user. It then enriches the basic model using additional data sources, such as venue categories and taste keywords, to model the user's interests. We cast the prediction of the contextual appropriateness of a venue to a user's context as a binary classification problem. Furthermore, we built two datasets, using crowdsourcing, that are used to train an SVM classifier to predict the contextual appropriateness of venues. Finally, we show how to incorporate the multimodal scores in our model to produce the final ranking. The experimental results illustrate that our proposed method performed very well in terms of all the evaluation metrics used in TREC.
An important task in recommender systems is suggesting relevant venues in a city to a user. These suggestions are usually created by exploiting the user's history of preferences, which is collected, for example, in previously visited cities. In this paper, we first introduce a user model based on venues' categories and their descriptive keywords extracted from Foursquare tips. Then, we propose an enriched user model which leverages the users' reviews from Yelp. Our participation in the TREC 2015 Contextual Suggestion track confirmed that our model outperforms other approaches by a significant margin.
A. Giachanou, I. Mele, and F. Crestani Explaining Sentiment Spikes in Twitter In CIKM'16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
[Extended short paper], [Abstract],
[PDF],
[BibTeX]
Tracking public opinion in social media provides important information to enterprises and governments during a decision-making process. In addition, identifying and extracting the causes of sentiment spikes allows interested parties to redesign and adjust strategies with the aim of attracting more positive sentiment. In this paper, we focus on the problems of tracking sentiment towards different entities, detecting sentiment spikes, and extracting and ranking the causes of a sentiment spike. Our approach combines the LDA topic model with relative entropy: the former is used for extracting the topics discussed in the time window before the sentiment spike, while the latter allows us to rank the detected topics based on their contribution to the sentiment spike.
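To make the ranking step concrete, here is a small sketch in which topics are ranked by their pointwise contribution to the relative entropy between the pre-spike topic distribution and a background one; the distributions and topic names are made up for illustration:

    import math

    def rank_spike_topics(p_spike, p_background):
        """Both arguments: topic_id -> probability, over the same topics."""
        contribution = {
            t: p_spike[t] * math.log(p_spike[t] / p_background[t])
            for t in p_spike
        }
        # Topics that grew most in the pre-spike window come first.
        return sorted(contribution, key=contribution.get, reverse=True)

    # Hypothetical example: "battery recall" dominates the pre-spike window.
    print(rank_spike_topics(
        {"battery recall": 0.6, "new model": 0.3, "ads": 0.1},
        {"battery recall": 0.2, "new model": 0.5, "ads": 0.3}))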
Privacy of Internet users is at stake because they expose personal information in posts created in online communities, in search queries, and in other activities. An adversary that monitors a community may identify the users with the most sensitive properties and utilize this knowledge against them (e.g., by adjusting the pricing of goods or targeting ads of a sensitive nature). Existing privacy models for structured data are inadequate to capture the privacy risks arising from user posts.
This paper presents a ranking-based approach to the assessment of privacy risks emerging from textual contents in online communities, focusing on sensitive topics, such as being depressed. We propose ranking as a means of modeling a rational adversary who targets the most afflicted users. To capture the adversary's background knowledge regarding vocabulary and correlations, we use latent topic models. We cast these considerations into the new model of R-Susceptibility, which can inform and alert users about their potential for being targeted, and devise measures for quantitative risk assessment. Experiments with real-world data show the feasibility of our approach.
N. A. Alawad, A. Anagnostopoulos, S. Leonardi, I. Mele, and F. Silvestri Network-Aware Recommendations of Novel Tweets In SIGIR'16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
[Short paper], [Abstract],
[PDF],
[BibTeX]
With the rapid proliferation of microblogging services such as Twitter, a large number of tweets is published every day, often making users feel overwhelmed with information. Helping these users discover potentially interesting tweets is an important task for such services. In this paper, we present a novel tweet-recommendation approach which exploits network, content, and retweet analyses for making recommendations. The idea is to recommend tweets that are not visible to the user (i.e., they do not appear in the user's timeline) because nobody in her social circles published or retweeted them. To do so, we create the user's ego network up to depth two and apply the transitivity property of the friends-of-friends relationship to determine interesting recommendations, which are then ranked to best match the user's interests. Experimental results demonstrate that our approach improves upon the state-of-the-art technique.
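A minimal sketch of the candidate-generation step, assuming a follower graph given as adjacency sets; the interest-based ranking that follows is omitted:

    def candidate_tweets(user, follows, tweets_by_user):
        """follows: user -> set of followees; tweets_by_user: user -> list of tweets."""
        friends = follows.get(user, set())
        fof = set()
        for f in friends:                  # expand the ego network to depth two
            fof |= follows.get(f, set())
        fof -= friends | {user}            # keep only users not already followed
        candidates = []
        for u in fof:
            candidates.extend(tweets_by_user.get(u, []))
        return candidates                  # tweets not visible in the timeline

These candidates are then scored against the user's interest profile and by retweet analysis to produce the final recommendations.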
Crowdsourcing is a computational paradigm whose distinctive feature is the involvement of human workers in key steps of the computation. It is used successfully to address problems that would be hard or impossible for machines to solve. As we highlight in this work, the exclusive use of nonexpert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demands some degree of specialization to avoid excessive uncertainty and inconsistency in the answers. We address this limitation by proposing an approach that combines the wisdom of the crowd with the educated opinion of experts. We present a computational model for crowdsourcing that envisions two classes of workers with different expertise levels. One of its distinctive features is the adoption of the threshold error model, whose roots are in psychometrics and which we extend from previous theoretical work. Our computational model allows us to evaluate the performance of crowdsourcing algorithms with respect to accuracy and cost. We use our model to develop and analyze an algorithm for approximating the best, in a broad sense, of a set of elements. The algorithm uses naïve and expert workers to find an element that is a constant-factor approximation to the best. We prove upper and lower bounds on the number of comparisons needed to solve this problem, showing that our algorithm uses expert and naïve workers optimally up to a constant factor. Finally, we evaluate our algorithm on real and synthetic datasets using the CrowdFlower crowdsourcing platform, showing that our approach is also effective in practice.
Phrase queries are a key functionality of modern search
engines. Beyond that, they increasingly serve as an important
building block for applications such as entity-oriented search, text
analytics, and plagiarism detection. Processing phrase queries is
costly, though, since positional information has to be kept in the
index and all words, including stopwords, need to be considered.
We consider an augmented inverted index that indexes selected
variable-length multi-word sequences in addition to single words. We
study how arbitrary phrase queries can be processed efficiently on
such an augmented inverted index. We show that the underlying
optimization problem is NP-hard in the general case and
describe an exact exponential algorithm and an approximation
algorithm for its solution. Experiments on ClueWeb09 and The New York
Times with different real-world query workloads examine the
practical performance of our methods.
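For intuition, here is a greedy sketch of query segmentation over such an augmented index: at each position, take the longest indexed multi-word sequence, falling back to single terms. The paper's exact and approximation algorithms optimize this choice, so the greedy rule and the toy index below are only illustrative:

    def segment(phrase, indexed_sequences, max_len=4):
        words = phrase.lower().split()
        segments, i = [], 0
        while i < len(words):
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + n])
                if n == 1 or candidate in indexed_sequences:
                    segments.append(candidate)  # longest indexed match wins
                    i += n
                    break
        return segments

    index = {"new york times", "york times"}
    print(segment("the new york times building", index))
    # -> ['the', 'new york times', 'building']

The positional postings of the chosen segments are then intersected, with the appropriate offsets, to verify the phrase.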
In this paper we present a novel graph-based data abstraction
for modeling the browsing behavior of web users. The
objective is to identify users who discover interesting pages
before others. We call these users early adopters. By tracking
the browsing activity of early adopters we can identify
new interesting pages early, and recommend these pages to
similar users. We focus on news and blog pages, which are
more dynamic in nature and more appropriate for recommendation.
Our proposed model is called early-adopter graph. In this
graph, nodes represent users and a directed arc between
users u and v expresses the fact that u and v visit similar
pages and, in particular, that user u tends to visit those
pages before user v. The weight of the edge is the degree to
which the temporal rule “u visits a page before v” holds.
Based on the early-adopter graph, we build a recommendation
system for news and blog pages, which outperforms
other off-the-shelf recommendation systems based on collaborative
filtering.
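A minimal sketch of the graph construction from browsing logs, where the weight definition (the fraction of shared pages that u visited first) is the straightforward reading of the rule above, and the minimum-overlap cutoff is an illustrative choice:

    from itertools import permutations

    def early_adopter_graph(visits, min_shared=2):
        """visits: user -> {page: first_visit_time}. Returns (u, v) -> weight."""
        graph = {}
        for u, v in permutations(visits, 2):
            shared = set(visits[u]) & set(visits[v])
            if len(shared) < min_shared:
                continue  # too little overlap to claim similar interests
            earlier = sum(1 for p in shared if visits[u][p] < visits[v][p])
            graph[(u, v)] = earlier / len(shared)  # degree to which u precedes v
        return graph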
A. Anagnostopoulos, L. Becchetti, S. Leonardi, I. Mele, and P. Sankowski Stochastic Query Covering In WSDM '11: Proceedings of the 4th ACM International Conference on Web Search and Data Mining (Best Poster Award)
[Full paper], [Abstract],
[PDF],
[BibTeX]
In this paper we introduce the problem of query covering as a
means to efficiently cache query results. The general idea is to populate the cache
with documents that contribute to the result pages of a large number of queries,
as opposed to caching the top documents for each query.
It turns out that the problem is hard and solving it requires knowledge
of the structure of the queries and the results space, as well as
knowledge of the input query distribution. We formulate the problem
under the framework of stochastic optimization;
theoretically it can be seen as a stochastic universal version of set
multicover. While the problem is NP-hard to solve exactly,
we show that for any distribution it can be approximated using a simple
greedy approach. Our theoretical findings are complemented by
experimental activity on real datasets, showing the feasibility
and potential interest of query-covering approaches in practice.
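A minimal sketch of the greedy approach, assuming queries sampled from the estimated query distribution and a simplified "one cached result per query" coverage criterion (the actual formulation is a multicover):

    def greedy_cover(sampled_results, budget):
        """sampled_results: query -> set of documents in its result page."""
        uncovered = set(sampled_results)
        selected = set()
        while uncovered and len(selected) < budget:
            candidates = {d for q in uncovered for d in sampled_results[q]}
            if not candidates:
                break
            # Pick the document appearing in the most not-yet-covered queries.
            best = max(
                candidates,
                key=lambda d: sum(1 for q in uncovered if d in sampled_results[q]),
            )
            selected.add(best)
            uncovered = {q for q in uncovered if best not in sampled_results[q]}
        return selected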
Workshop Papers
A. De Nicola, A. Formica, I. Mele, M. Missikoff, F. Taglino Towards a formal approach to a Knowledge Base supporting Business Process Analysis In SEBD'23: 31st Symposium on Advanced Database Systems. 2023.
[Abstract]
Business Process Analysis (BPA) is a strategic activity, necessary for enterprises to model their business operations. It is a central activity in information system development, but also for business process design and reengineering. Despite several decades of research, the effectiveness of available methods is still questionable. The majority of methodologies adopted by enterprises are rather qualitative and lack a formal basis, often yielding inadequate specifications. On the other hand, there are methodologies with a solid theoretical background, but they appear too cumbersome for the majority of enterprises.
This paper proposes a knowledge framework, referred to as BPA Canvas, conceived to be easily mastered by business people and, at the same time, based on a sound formal theory. The methodology starts with the construction of natural language knowledge artifacts and, then, progressively guides the user toward more rigorous structures. The formal approach of the methodology allows us to prove the correctness of the resulting knowledge base while maintaining the centrality of business people in the whole knowledge construction process.
A. De Nicola, A. Formica, I. Mele, M. Missikoff, F. Taglino Formal Approach to a Knowledge Base for Business Process Analysis In ital-AI'23: Terzo Convegno Nazionale CINI sull'Intelligenza Artificiale. 2023.
[Abstract]
Business Process Analysis (BPA) is a strategic activity, necessary for enterprises to model their business operations. It is a central activity in information system development, but also for business process design and reengineering. Despite several decades of research, the effectiveness of available methods is still questionable. The majority of methodologies adopted by enterprises are rather qualitative and lack a formal basis, often yielding inadequate specifications. On the other hand, there are methodologies with a solid theoretical background, but they appear too cumbersome for the majority of enterprises.
This paper proposes a knowledge framework, referred to as BPA Canvas, conceived to be easily mastered by business people and, at the same time, based on a sound formal theory. The methodology starts with the construction of natural language knowledge artifacts and, then, progressively guides the user toward more rigorous structures. The formal approach of the methodology allows us to prove the correctness of the resulting knowledge base while maintaining the centrality of business people in the whole knowledge construction process.
Nowadays, online news agents post news articles on social media platforms with the aim of attracting more users. Different types of news trigger different emotions in users, who may feel surprised or sad after reading a piece of news. In this paper, we are interested in predicting the number of users who will react with a specific emotional reaction after reading a news post. To address the problem, we propose a model that is trained on features extracted from users' early commenting activity. Our results show that users' early-activity features are very important and that combining those features with terms can effectively predict the amount of emotional reactions triggered in users by a news post.
A. Giachanou, I. Mele, and F. Crestani USI Participation at SMERP 2017 Text Retrieval Task In SMERP'17: Proceedings of the 1st International Workshop on Exploitation of
Social Media for Emergency Relief and Preparedness (workshop co-located with the 39th European Conference on Information Retrieval)
[Abstract],
[PDF],
[BibTeX]
This report describes the participation of the Università della Svizzera italiana (USI) in the SMERP Workshop Data Challenge Track, text retrieval task, for both Level 1 and Level 2. For this task, we propose a methodology based on query expansion and Boolean expressions. For Level 1, we submitted two different methods based on query expansion, where queries were expanded using terms mined from an earthquake-related collection of tweets. In this way, we managed to extract useful expansion terms for each query. In addition to the query expansion, we tried to improve the quality of the retrieved results by incorporating part-of-speech tags. For Level 2, we additionally used information from the partial ground truth that the organizers provided in relation to our submitted Level 1 runs. The results showed that our query expansion method had the highest performance in terms of MAP and precision on both levels. In addition, we achieved the second-best performance on Level 1 among the submitted semi-supervised approaches in terms of the bpref metric.
A. Giachanou, I. Mele, and F. Crestani USI Participation at SMERP 2017 Text Summarization Task In SMERP'17: Proceedings of the 1st International Workshop on Exploitation of
Social Media for Emergency Relief and Preparedness (workshop co-located with the 39th European Conference on Information Retrieval)
[Abstract],
[PDF],
[BibTeX]
This short report describes the participation of the Università della Svizzera italiana (USI) in the SMERP Workshop Data Challenge Track, Level 1 text summarization task. Our participation is based on a linear interpolation combining relevance and novelty scores of the retrieved tweets; our method is fully automatic. For the relevance score, we used the results from our runs in the text retrieval task, whereas for novelty we used a method based on Word2Vec. In total, we submitted four different runs using two different weight parameters. The results showed that when relevance and novelty contribute equally to selecting the tweets for the summary, the performance is better than when favoring novelty alone. Additionally, information from POS tags improves the performance of the summarization task.
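The selection criterion can be written as a one-line interpolation. The sketch below ranks tweets statically for simplicity, whereas in our runs the Word2Vec-based novelty is computed against the summary built so far:

    def score(relevance, novelty, lam=0.5):
        # lam = 0.5 gives relevance and novelty the equal contribution
        # that performed best in our experiments.
        return lam * relevance + (1.0 - lam) * novelty

    def summarize(tweets, k=5, lam=0.5):
        """tweets: list of (text, relevance, novelty) triples."""
        ranked = sorted(tweets, key=lambda t: score(t[1], t[2], lam), reverse=True)
        return [text for text, _, _ in ranked[:k]]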
J. Biega, I. Mele, and G. Weikum Probabilistic Prediction of Privacy Risks in User Search Histories In PSBD '14: Proceedings of the 1st International Workshop on Privacy and Security of Big Data (workshop co-located with the 23rd ACM International Conference on Information and Knowledge Management)
[Abstract],
[PDF],
[BibTeX]
This paper proposes a new model of user-centric, global,
probabilistic privacy, geared for today's challenges of helping
users to manage their privacy-sensitive information across a
wide variety of social networks, online communities, QA forums,
and search histories. Our approach anticipates an
adversary that harnesses global background knowledge and
rich statistics in order to make educated guesses, that is,
probabilistic inferences about sensitive data. We aim for a
tool that simulates such a powerful adversary, predicts privacy
risks, and guides the user. In this paper, our framework
is specialized for the case of Internet search histories.
We present preliminary experiments that demonstrate
how estimators of global correlations among sensitive and
non-sensitive key-value items can be fed into a probabilistic
graphical model in order to compute meaningful measures
of privacy risk.
Web usage mining is the application of data mining techniques to the data generated
by the interactions of users with web servers.
This kind of data, stored in server logs, represents a valuable source of information,
which can be exploited to optimize the document-retrieval task,
or to better understand, and thus satisfy, user needs.
Our research focuses on two important issues: improving search-engine performance
through static caching of search results, and helping users to find interesting web pages
by recommending news articles and blog posts.
Concerning the static caching of search results, we present the query covering approach.
The general idea is to populate the cache with those documents that contribute to the
result pages of a large number of queries,
as opposed to caching the top documents of most frequent queries.
For the recommendation of web pages, we present a graph-based approach, which leverages the
user-browsing logs to identify early adopters.
These users discover interesting content before others, and by monitoring their activity we can find web pages to recommend.
Yesterday is history, tomorrow is a mystery, but today is a gift. That is why it is called the present - Oogway, Kung Fu Panda
I've learned that people will forget what you said, people will forget what you did, but people will never forget how you made them feel - Maya Angelou
We know what we are, but know not what we may be - William Shakespeare
I dream a dream that dreams back at me - Toni Morrison
There are only 10 types of people in the world: those who understand binary, and those who don't - Nerdish joke