Report from SIGIR 2011 (Beijing, China)
By Jochen L. Leidner
The 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011) was held in Beijing on July 24-28, 2011.
Trends
A few sessions represent the recurring core of any SIGIR (Content Analysis, Effectiveness, Efficiency, Indexing, Test collections, Web IR, Retrieval models, Retrieval models II). People from other research communities occasionally ask me why SIGIR "still has sessions on retrieval models - hasn't everything already been said on that?" What they don't realize is that in the IR community, new paradigms and models are thriving, even in SIGIR's 34th year, as evidenced by two sessions on IR models. Machine learning methods for IR, now mainstream, continue to be on the rise, as indicated by four sessions dedicated to the topic: Classification, Clustering, Learning to Rank and Latent Semantic Analysis (LSA). LSA in particular is an emerging trend at both SIGIR and EMNLP this year (seven years after Blei's paper). This year, the most up-and-coming areas (as measured by number of sessions) were the analysis of queries (sessions Query Analysis, Query Analysis II, Query suggestions and Web Queries), recommendation (sessions Collaborative filtering, Collaborative filtering II, Recommender systems) and personalization/looking at the user (sessions Personalization, Users and Users II). As somebody interested in the intersection of NLP and IR, I was pleased to see a whopping three sessions related to NLP: Vertical & Entity Search, Summarization and Linguistic Analysis. Not surprisingly, with Social Media and Communities, two sessions were dedicated to the timely topic of searching user-contributed content. Searching media other than text and search in languages other than English were also present (sessions Image Search, Multimedia IR and Multilingual IR). Finally, the Industrial track presented commercial applications.
Lei Li and colleagues (Florida International U. and DailyMe Inc.) presented the personalized SCENE news recommender. They point out the shortcomings of purely content-based methods (e.g. NewsJunkie) and of purely collaborative-filtering-based methods (e.g. Google News according to a 2007 account) and present a hybrid model: news articles are clustered using Locality Sensitive Hashing (LSH), then language models are applied to each cluster to create summaries based on a two-level representation, with topics represented in the leaf nodes and more general topics in the inner nodes. Personal user profiles are constructed based on accessed news content and preferred named entities (according to the paper, GATE is used to extract these; the website also mentions our OpenCalais service). User profiles are applied by comparing the per-cluster topic distributions with the news content in the user profile, using a greedy optimization (formalized as a budgeted maximum coverage problem).
Their pipeline comprises news selection -> representation -> scalability -> user profiling -> personalization. LSH is used in the initial clustering to reduce the O(n^2) theoretical complexity of the pairwise comparisons as follows: after tokenizing, stop-word removal and stemming, articles are split into k-shingles, from which a hash signature is computed following Indyk's 1999 SIAM algorithm. LSH is used to split these signatures into various bands. Similarity is defined as hashing into the same bucket (among a total of 5) at least once. The authors' "submodularity model" for recommendation aims to ensure diversity among stories covering the same topic. The evaluation, based on 50 volunteers (ratings on a 5-point Likert scale), shows positive evidence for the presented approach. In the future they want to scale up SCENE by porting the offline clustering to MapReduce.
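To make the banding idea concrete, here is a minimal Python sketch of shingling, signature computation and banded bucketing. It is my own illustration, not the authors' code: it assumes MinHash signatures built from salted MD5 hashes (the paper's exact hashing scheme may differ), word-level shingles, and reads the "total of 5" as five bands. All function names and parameter defaults are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word-level k-shingles of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=20):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def candidate_pairs(signatures, bands=5):
    """Return document pairs whose signatures agree on at least one band.

    signatures maps a document id to its MinHash signature (a list of ints).
    """
    num_hashes = len(next(iter(signatures.values())))
    rows = num_hashes // bands
    if rows == 0:
        raise ValueError("need at least one signature row per band")
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` need an exact similarity comparison, which is what avoids comparing every article with every other one.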
Bennett et al. (2011)'s paper "Inferring and using location metadata to personalize web search" is about geographic personalization, of course a topic dear to my heart. They discuss the notion of geographical relevance (i.e. the question "Is x (geographically) relevant, given a user location?") and, instead of deriving it from the Web page's content, they use the location distribution of visitors to the page. Proprietary, anonymized logs (Q4/2010) from a browser plug-in are used for experimentation. A location-interest model (a mixture of Gaussians) is then estimated from these logs.
As features they use the original rank (in Microsoft's Bing search engine), the entropy of the URL model (is the URL viewed predominantly from specific areas?), contextual features such as P(URL|loc) (estimated as P(loc|URL)P(URL) / P(loc)) and others. Figure 4 of Bennett et al. (2011) maps an example distribution of geographical relevance for the query "rta bus schedule".
Bennett et al. ranked queries by log likelihood ratio (P(URL|loc) / P(background|loc)) and found that classified ads rank highest (which means they are the most place-dependent category). In their evaluation of geo-personalization, they found that 16.8% of queries were affected, and that their method could effect an improvement for >10% of queries.
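The quantities mentioned above are easy to illustrate. The following Python sketch is mine, not taken from the paper, and it simplifies heavily: it treats locations as a handful of discrete regions rather than a mixture of Gaussians, and all function names are hypothetical.

```python
import math


def p_url_given_loc(p_loc_given_url, p_url, p_loc):
    """Bayes' rule: P(URL|loc) = P(loc|URL) * P(URL) / P(loc)."""
    return p_loc_given_url * p_url / p_loc


def location_entropy(region_probs):
    """Entropy (in bits) of a URL's visitor distribution over regions.

    Low entropy means the URL is viewed predominantly from a few areas.
    """
    return -sum(p * math.log2(p) for p in region_probs.values() if p > 0)


def log_likelihood_ratio(p_url_loc, p_background_loc):
    """log(P(URL|loc) / P(background|loc)); positive means place-dependent."""
    return math.log(p_url_loc / p_background_loc)
```

A URL visited equally from two regions has an entropy of one bit, whereas a URL visited from a single region has an entropy of zero, which is the signal the entropy feature is meant to capture.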
Kazai et al. (2011)'s talk assessed the use of crowdsourcing for IR evaluation and evaluated the impact of alternative ways of presenting crowdsourced tasks to workers. As a setup, they chose crowdsourcing on Amazon Mechanical Turk (AMT) for book IR evaluation. After introducing the general crowdsourcing pipeline along the lines of data preparation -> HIT ("Human Intelligence Task") design -> AMT -> output treatment (e.g. majority votes) -> IR evaluation (qrel set), they described two alternative designs for HIT questionnaires, which were compared. Their "full design" (FD) was more complex in the sense that it required more reading, scrolling and form interaction, but was also rewarded more; a "simple design" (SD) was easier to do but also paid proportionally less. Kendall's tau correlations were measured between crowd-worker-based rankings and INEX rankings.
The experiment led to three main findings: (1) somewhat surprisingly, the full design (FD) selected harder working and better workers; (2) in terms of correlation strength between turkers and traditional IR scores, MAP (strongest) > Bpref > nDCG@10 > P@10 (weakest); and (3) consensus through overlap/redundancies helped the simple design more than the full design.
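For readers unfamiliar with the two building blocks of this evaluation, here is a small Python sketch of majority voting over redundant worker labels and of Kendall's tau between two system rankings. It is an illustration under my own assumptions (tau-a without tie correction, hypothetical function names), not the authors' evaluation code.

```python
from collections import Counter
from itertools import combinations


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems."""
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs
```

A tau of 1 means the crowd-based qrels rank the systems exactly as the reference qrels do; a tau of -1 means the order is fully reversed.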
Zhang, Wang and Si (2011) induce adaptive hashing functions from multiple information sources to allow for faster similarity comparisons of large datasets (text, images). Several data sets are used to demonstrate the viability of the approach (by using each document in a training set as a query to retrieve documents known to be similar): Cora, Thomson Reuters Reuters-21578, Thomson Reuters RCV1, WebKB and a healthcare dataset. The authors found their method outperformed all alternatives, and consider it an alternative to LSH.
Legal documents such as draft laws are overwhelming for citizens. ManyBills (http://manybills.us) is an IBM system that performs a section analysis of proposed bills (U.S. congressional legislation); bills are long, so citizens/users need some kind of assistant that points to "interesting" sections. The reason is that bills may contain off-topic outliers, which are an artifact of party political negotiations and concessions.
Aktolga et al.'s tool, which can benefit journalists and experts, aims to alert users to such "sneaked in" material, for example healthcare-related sections in finance bills. Interaction between lawmakers and the general public is certainly desirable, and is also reflected in prior art such as the (2004) article and system by my former colleague Gloria Lau et al.
The IBM demonstrator, which is based on language modeling, is compared against a baseline that uses a Mallet-based MaxEnt classifier (trained with 83 topics on 60K bills from 9 years, and evaluated using cross-validation): P_outlier(s|D_cats,s_class) = 1 - max Popularity(s_class,d_c_t).
They use unigram LMs induced with MLE (bag-of-words based models for documents, categories, and sections, respectively). They also tried add-one smoothing.
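As a reminder of what these two estimators look like, here is a minimal Python sketch. It is my own illustration rather than the ManyBills code, and the function names are hypothetical; the smoothed variant assumes that the vocabulary contains every token of the text.

```python
from collections import Counter


def unigram_mle(tokens):
    """Maximum-likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def unigram_add_one(tokens, vocabulary):
    """Add-one (Laplace) smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}
```

With MLE, a word that never occurs in a section gets probability zero, which makes divergence-based comparisons between section and category models undefined; add-one smoothing is the simplest way around that.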
A two-step approach is used: because the section model comparison is very slow (quadratic number of comparisons, large number of sections), Okapi BM25 ranking over the bill title keywords is applied first, so that only the top-k results need the crosswise comparisons.
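The first step of such a pipeline can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the authors' implementation: I assume that the title keywords of one bill serve as the query against the titles of the other bills, I use the common k1=1.2, b=0.75 defaults, and I use the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)). All names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each document (id -> token list) against the query with BM25."""
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each term
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def top_k(scores, k):
    """Ids of the k highest-scoring documents, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the documents returned by `top_k` would then be passed on to the expensive model-to-model comparison.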
The authors propose KLC (KL Divergence Contribution), a KLD variant, as their measure of dissimilarity. KLC is too aggressive for mild outliers, for which plain KLD works best; however, KLC turns out to work best for detecting strong outliers.
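For reference, plain KL divergence between two unigram models can be computed as in the Python sketch below. This shows standard KLD only; I do not attempt to reproduce the authors' KLC variant, whose definition is in their paper. The function name is hypothetical, and the sketch assumes the second model has been smoothed so that it is nonzero wherever the first one is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two word -> probability dicts.

    Assumes q[w] > 0 for every word w with p[w] > 0 (e.g. after smoothing).
    """
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)
```

A section whose model diverges strongly from the model of the bill's category would be a candidate outlier; identical models give a divergence of zero.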
The system was trained on 13 bills and tested on 11 bills. Annotation was carried out by 3 annotators (kappa=0.6) using 3 categories: no outlier, mild outlier, and strong outlier. The two-step approach yielded the best results and returned a small number of outlier sections (which is the desired behavior).
I have played with the demo and like it. Conceptually, it is also a good idea to develop tools that lead to transparency of the political process (whatever one's leanings); what I would like to see is that work like this eventually gets generalized to broader application settings. The tool has so far found cases like a section permitting carrying guns in national parks in a bill about credit cards.
Li et al. (2011) present a method for unsupervised query segmentation. Their core idea is to combine a language model with click-through data. They want to analyze queries like
bank of america online banking
into
bank of america | online banking
which is useful for noun phrase discovery, query reformulation and suggestion, phrasal models for IR and user intent analysis. In the past, people have used MI, supervised learning (Keys, Jones), or MDL (Tan) to split queries, but those approaches didn't take relevance into account in the model. For example, in
president | of the | United States
the middle part here is not very useful to most applications. They used a generative model to predict boundaries, given a query (at runtime) and a set of <query; click> pairs (click=document) at training time.
The use of clicked documents makes sense, as a news fragment like "the white house and President Barack Obama, the 44th president of the United States" typically contains the right segmentation(s).
Using a Microsoft Bing query log and Bergsma's EMNLP-CoNLL 2007 log datasets for evaluation, they found that their method outperforms both MI and Tan's approach, and that an LM with query segmentation can improve search ranking. What I liked about this paper is that example output is shown in the paper, and the box "how to score" slide shows constituent numbers, which should facilitate replication. One question I would encourage the authors to pursue is how to inform the model of segmentation requirements specific to a task (intuitively, query intent applications require somewhat different slicing from NP analysis for phrasal IR).
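To show what the MI baseline mentioned above amounts to, here is a minimal Python sketch that places a segment boundary wherever the pointwise mutual information between two adjacent query words falls below a threshold. This is the baseline idea only, not Li et al.'s click-through model; the function names, the threshold, and the counts in the usage example are made up for illustration.

```python
import math


def pmi(w1, w2, unigram, bigram, total):
    """Pointwise mutual information (bits) of an adjacent word pair."""
    p_xy = bigram.get((w1, w2), 0) / total
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / ((unigram[w1] / total) * (unigram[w2] / total)))


def segment(query, unigram, bigram, total, threshold=0.0):
    """Split a query at every adjacent pair whose PMI is below threshold."""
    words = query.split()
    if not words:
        return []
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if pmi(prev, cur, unigram, bigram, total) >= threshold:
            segments[-1].append(cur)  # strongly associated: same segment
        else:
            segments.append([cur])    # weakly associated: new segment
    return [" ".join(s) for s in segments]


# Hypothetical toy counts, chosen by hand for this example.
unigram = {"bank": 50, "of": 200, "america": 40, "online": 60, "banking": 30}
bigram = {("bank", "of"): 30, ("of", "america"): 35,
          ("america", "online"): 1, ("online", "banking"): 25}
result = segment("bank of america online banking", unigram, bigram, total=1000)
```

With these toy counts the only weak link is between "america" and "online", so the query is split there. Note that nothing in this baseline knows whether a segment is useful for retrieval, which is the gap the click-through data is meant to fill.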
Finally, the best paper award went to Ageev et al. (2011) for their work on evaluating search intent in a crowdsourcing setting akin to a game.
Overall, this year's SIGIR was as strong a conference as usual. Despite the low acceptance rate of 19.8%, the proceedings now comprise two volumes, and I'm very glad the proceedings are still available in paper form every year, as many people (yours truly included) had issues opening the electronic proceedings from the memory sticks provided this time. The next SIGIR will take place in 2012 in Portland, Oregon, USA.



