Content on the Multilingual Web: Report from the 2nd W3C Workshop on Multilinguality and Web Standards (Pisa)

By Jochen L. Leidner

The Second Workshop on Content for the Multilingual Web was funded by the European Union through an award to the Multilingual Web project. With some delay, here's a trip report. You can find the program here.

The workshop opened with an introductory talk by CNR, Italy's National Research Council, the Pisa site of which also manages Italy's Internationalized Domain Names. The registry is currently working on permitting TLDs and sub-domains in Italian, French and German, an effort that will launch in July (a test phase has been successfully completed).

According to an introductory talk by Oreste Signore, Tim Berners-Lee, the inventor of the Web, has the official mission "To lead the World Wide Web to its fullest potential". Despite widespread talk about globalization, about five billion people are NOT using the Web (yet). The World Wide Web Foundation strives to remove the obstacles that keep people from using the Web, including geographic, political and other challenges (accessibility, for example, is addressed by the WCAG 2.0 guidelines, a set of high-level principles to permit access for elderly users, users with disabilities and others). The speaker also pointed out culture-specific dangers of multi-cultural Web sites (such as colors: "don't publish a Website with a green hat in China: means your wife is cheating on you in Chinese culture").

Kimmo Rossi, project officer at the European Commission, spoke about the expectations of the Multilingual Web project from a funder's perspective. He wants to learn about linguistic fragmentation of the Web and produce industry recommendations. He pointed out that by 2012, there will be about 50 multilingual technology projects funded by the EU. In the most recent call, 90 proposals asked for 240 million euros, but only 50 million euros in funding were available. The EU also carried out a study (to appear shortly) in which 13,500 subjects, sampled from all EU states, were interviewed about their online linguistic behaviour. It found that 44% worry they are "missing out important information" because they do not understand the language, or do not understand it well enough.

Ralf Steinberger from the EU's Joint Research Centre in Ispra, Italy, presented his group's work on a set of media analysis tools, and made the point that complementarity exists across countries and languages, i.e. to get the full picture of a news event, it is insufficient to only look at English-language reports.

  • EMM NewsBrief is a real-time news analyzer that processes thousands of newspaper sites every day. It can display country preferences/biases of media topics and can track quotations.
  • MediSys is a real-time news analyzer in up to 50 languages focused on the medical vertical. It sorts news into 10,000 categories and clusters them every 10 minutes to find trending topics (topic detection and tracking). It can also send email notifications for particular categories (anybody can publicly register for categories). Country-category pairs in particular help to link articles across languages (multilingual cross-linkage). One valuable feature of this system is that graphs and alerts may show events not yet reported in the user's own language.
  • NewsExplorer is a multilingual daily news overview system that batch-processes, at midnight each day, the news collected in 20 languages (including Swahili). Its messages are geo-located, indexed spatially as well as temporally, and cross-linked with Wikipedia (for photos and background).
  • Finally, NEXUS is a multilingual event extraction system (conflicts, crimes, disasters) for global crisis monitoring. As Ralf pointed out, only 20% of events are mentioned in more than one language, i.e. most events are local as far as reporting is concerned. At the moment, commercial-grade machine translation for Chinese and Arabic is being incorporated. The system can handle 40-50 event types, and there is ongoing work in sentiment analysis. The group's users (European politicians and the analysts working for them) are interested in sentiment shifts (per country, per medium). There is a public version on the Web, and an improved version for internal access only. Work on social media monitoring (blogs, Twitter, Facebook) is ongoing.
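
The cross-language linking via country-category pairs mentioned for MediSys can be illustrated with a small sketch. This is not the JRC implementation: the article records, the `pairs` field, the Jaccard overlap measure and the threshold are all assumptions made for illustration. The idea is only that (country, category) pairs are language-independent, so two articles in different languages can be compared on them directly.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_language_links(articles, threshold=0.5):
    """Link articles written in different languages whose
    (country, category) pairs overlap strongly enough.

    `articles` is a list of dicts with hypothetical keys
    'id', 'lang' and 'pairs' (a set of (country, category) tuples).
    """
    links = []
    for left, right in combinations(articles, 2):
        if left["lang"] == right["lang"]:
            continue  # only cross-language links are of interest here
        score = jaccard(left["pairs"], right["pairs"])
        if score >= threshold:
            links.append((left["id"], right["id"], score))
    return links


articles = [
    {"id": "a1", "lang": "en", "pairs": {("IT", "earthquake"), ("IT", "disaster")}},
    {"id": "a2", "lang": "it", "pairs": {("IT", "earthquake"), ("IT", "disaster")}},
    {"id": "a3", "lang": "de", "pairs": {("DE", "election")}},
    {"id": "a4", "lang": "en", "pairs": {("IT", "earthquake")}},
]
links = cross_language_links(articles)
```

A production system would need something more scalable than comparing all pairs of articles; the sketch only shows why language-independent labels make the comparison possible at all.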

Steven Pemberton's practical talk focused on Web development for internationalization and localization. Here are some takeaways for I18N/L10N developers:

  • Sites like Yoyo tell you what your browser is sending; when hunting internationalization/localization-related errors, it's good advice to paste the response string into any bug report;
  • never redirect HTTP error code 404 because it breaks link checkers;
  • don't forget to return HTTP code "406 - not acceptable: negotiation failed" (Google ignores negotiation and uses a cookie instead, as a Google representative in the audience confessed);
  • language buttons ("[FR] [EN] [ES]"...) are bad because they require one additional click per site visited; recognizing the user's language of preference (and switching automatically) is much better;
  • W3C XForms is a little-known recommendation that makes forms on the Web easier (it offers MVC, is super-simple, reads/writes data from/to URLs, and supports consistency checks). It can also be used to localize the button texts of forms, make calculations (Web spreadsheet), check constraints and separate language-specific data from the "rest". There are multiple implementations (native: Mozilla; server-based: Orbeon).
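
The advice on language negotiation and the 406 status code can be made concrete with a minimal sketch. It is not taken from the talk: the function names, the fallback to the primary language subtag and the choice to serve the first available language when no header is sent are assumptions. It parses an `Accept-Language` header, picks the best language the site offers, and reports 406 when nothing acceptable is available.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (tag, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(";")
        tag = parts[0].strip().lower()
        q = 1.0  # quality defaults to 1 when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag, q))
    # The sort is stable, so header order is kept among equal weights.
    prefs.sort(key=lambda pref: pref[1], reverse=True)
    return prefs


def negotiate(header, available):
    """Return (status, language): 200 with a language, or 406 with None."""
    prefs = parse_accept_language(header)
    if not prefs:
        return 200, available[0]  # no preference stated: serve the default
    by_lower = {lang.lower(): lang for lang in available}
    for tag, q in prefs:
        if q <= 0:
            continue  # q=0 means the language is explicitly not acceptable
        if tag == "*":
            return 200, available[0]
        if tag in by_lower:
            return 200, by_lower[tag]
        primary = tag.split("-")[0]  # e.g. "fr-ch" falls back to "fr"
        if primary in by_lower:
            return 200, by_lower[primary]
    return 406, None


status, lang = negotiate("fr-CH, fr;q=0.9, en;q=0.8", ["en", "fr", "es"])
```

Many sites prefer to fall back to a default language rather than answer 406; whether to do so is a design decision that the sketch leaves to the caller.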

Jochen Leidner from Thomson Reuters focused on an assessment, from an online information services standpoint, of the state of internationalization and multilinguality. Overall, a lot of progress has been made in the last 10 years. In the past, special libraries and character data types had to be used, and fonts were lacking to render mixed-language documents. This is for the most part a solved problem: standards like Unicode (currently 6.0), its UTF-8 encoding and XML have been widely embraced, including by Thomson Reuters. On the negative side, developers' I18N and L10N knowledge is lacking, mostly because it's not part of the computer science curriculum. This assessment was widely shared by the audience.
As far as a standards "wish list" is concerned, he posited it would be good to have a standardized tag to mark up the original (source) document among a set of documents that includes it and its translations, and perhaps even a tag to mark pages as machine-generated (as opposed to human-written) with respect to the text or particular annotations they contain. He warned of the negative consequences for the open Web of "walled garden" effects caused by application stores and proprietary social media (acknowledgments to Misha Wolf, also of Thomson Reuters, are due here for discussions).

Paula Shannon, CSO & SVP at Lionbridge (L10NBRIDGE), showed a video that demonstrated the impact of social media based on some recent statistics; one of its bottom lines: "The ROI of social media is: you will still be in business in 5 years." (ROI as Risk of Ignoring). The center of the social media ecosystem, according to Paula, is search. She painted a picture of consolidation of the social media space internationally (V. Kontakte in Russia; QQ in China; Orkut in Brazil; Hi5 in Mexico, Peru, Portugal, Romania, Thailand and Mongolia; Lide in the Czech Republic; Maktoob in Oman, Saudi Arabia, Yemen and Libya). Her company is going to publish a study on social media adoption and multilinguality (Forrester, Burson-Marsteller, Lionbridge et al.: How users are using social media multilingually). An interesting observation was that corporate use is rising faster in Europe (Xing, LinkedIn, ...) than in the US. On Twitter, 60% of tweets are NOT in English. So we are talking big impact: Quakebook, created by Facebook, aims to help Japanese tsunami victims. Overall there are 110 million tweets/day, 40% of them from mobiles, and a new metric, "TPS" (tweets per second), is emerging to measure crowd engagement (it could be used as a proxy for event impact). Paula also reported that people don't like to sift through tweet streams in multiple languages.
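
To put the "TPS" metric in perspective, the arithmetic can be sketched as follows. The helper names are hypothetical, and measuring the peak as the busiest whole second is only one possible reading of the metric, not a definition given in the talk. The reported 110 million tweets/day works out to roughly 1,273 tweets per second on average, so an event would have to stand out against that baseline.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60


def average_tps(tweets_per_day):
    """Average tweets per second for a given daily volume."""
    return tweets_per_day / SECONDS_PER_DAY


def peak_tps(timestamps):
    """Busiest whole second in a list of tweet timestamps (in seconds)."""
    per_second = Counter(int(t) for t in timestamps)
    return max(per_second.values(), default=0)


baseline = average_tps(110_000_000)  # about 1273 tweets per second
burst = peak_tps([0.1, 0.5, 1.2, 1.3, 1.9, 3.0])  # second 1 has three tweets
```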

Maarten de Rijke, professor at the University of Amsterdam, demonstrated the output of various emotion monitoring systems his group built, which use mood analysis of posts from the online site LiveJournal. He observed that we are increasingly living "online lives", and that the study of these is opening up many opportunities:

  • mining for business reasons
  • mining to study human behaviour (sociology, religious studies, ...)

He outlined four projects at his home institution:

  • in Project "Political mashup" (led by his colleague Maarten Marx), they are aggregating parliamentary data (link to background information, tracking topic ownership in campaigns): see politicalmashup.nl
  • in Project CoSyne (PI Christof Monz, a Catalyst Lab collaborator in the VOXPOP proposal), the automatic translation of wiki pages is attempted, using Wikipedia as training material
  • in Project MoodViews (PI de Rijke, 2005-2009), the goal was to follow, predict, explain and discover associations in mood data from LiveJournal
  • from this project a number of Web applications resulted: MoodViews, Moodgrapher, Moodteller, Moodsignals, Moodspotter, Moodfeeds; some of these can be found on the Web, others were discontinued after the research projects concluded
  • in Project Fietstas, a scalable text analysis service (NL, EN) is being developed, which provides a plug-in based architecture that makes the kind of research described above easier (it includes Web scrapers, indexers etc.)

Chiara Pacella, Language Manager at Facebook Dublin, described the process of localizing Facebook's site into different languages. Developers create a set of strings that will be combined together ("{name} is now friends with {friends}.", "{friend} like this.", "{name} is listening to jazz."). User-generated content elements are mixed with fixed, controlled UI elements. Facebook developed a special translation application that permits users to translate FB's UI (crowdsourcing). FB was translated into French completely within 24 hours and released in less than 3 weeks. It worked because people love Facebook. The most prolific translators were shown on a leaderboard as a sign of accomplishment. The process worked in four phases:

  1. translation of a glossary in order to ensure consistency of the language;
  2. translation of UI elements/sentences (inline or bulk modes);
  3. selection of the translation to be used; and
  4. verification by a professional linguist.
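
The placeholder strings quoted above can be illustrated with a minimal sketch of template-based localization. It is not Facebook's implementation: the `TEMPLATES` table, the `render` helper, the key names and the German translations are assumptions made for illustration. Named placeholders let a translator move the slots around, which matters because word order differs between languages.

```python
# Hypothetical string table: one template per UI message and language.
TEMPLATES = {
    "en": {
        "friends_with": "{name} is now friends with {friends}.",
        "listening": "{name} is listening to jazz.",
    },
    "de": {
        # The {friends} slot moves before the final word in German.
        "friends_with": "{name} ist jetzt mit {friends} befreundet.",
    },
}


def render(lang, key, fallback="en", **values):
    """Fill a localized template, falling back to the default language
    when the language or the individual string is not translated yet."""
    table = TEMPLATES.get(lang, TEMPLATES[fallback])
    template = table.get(key) or TEMPLATES[fallback][key]
    return template.format(**values)


german = render("de", "friends_with", name="Anna", friends="Ben")
untranslated = render("de", "listening", name="Anna")
```

Falling back string by string means a partially translated UI still renders, which fits a crowdsourced process in which translations arrive incrementally.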

 
